arXiv Daily Digest

230

Papers

Olfactory pursuit: catching a moving odor source in complex flows

嗅觉追踪：在复杂流动中捕捉移动气味源

Carbone, Maurizio, Piro, Lorenzo, Heinonen, Robin A., Biferale, Luca, Cencini, Massimo, Celani, Antonio

Abstract

Locating and intercepting a moving target from possibly delayed, intermittent sensory signals is a paradigmatic problem in decision-making under uncertainty, and a fundamental challenge for, e.g., animals seeking prey or mates and autonomous robotic systems. Odor signals are intermittent, strongly mixed by turbulent-like transport, and typically lag behind the true target position, thereby complicating localization. Here, we formulate olfactory pursuit as a partially observable Markov decision process in which an agent maintains a joint belief over the target's position and velocity. Using a discrete run-and-tumble model, we compute quasi-optimal policies by numerically solving the Bellman equation and benchmark them against well-established information-theoretic strategies such as Infotaxis. We show that purely exploratory policies are near-optimal when the target frequently reorients, but fail dramatically when the target exhibits persistent motion. We thus introduce a computationally efficient hybrid policy that combines the information-gain drive of Infotaxis with a "greedy" value function derived from an associated fully observable control problem. Our heuristic achieves near-optimal performance across all persistence times and substantially outperforms purely exploratory approaches. Moreover, our proposal demonstrates strong robustness even in more complex search scenarios, including continuous run-and-tumble prey motion with moderate persistence time, model mismatch, and more accurate plume dynamics representation. Our results identify predictive inference of target motion as the key ingredient for effective olfactory pursuit and provide a general framework for search in information-poor, dynamically evolving environments.

Chinese Translation

从可能延迟的间歇性感官信号中定位和拦截移动目标是一个在不确定性下决策的典型问题，也是动物寻找猎物或伴侣以及自主机器人系统面临的基本挑战。气味信号是间歇性的，受到类似湍流的强烈混合，并且通常滞后于真实目标位置，从而使得定位变得复杂。在这里，我们将嗅觉追踪形式化为一个部分可观测的马尔可夫决策过程，其中代理保持对目标位置和速度的联合信念。通过使用离散的跑动-翻滚模型，我们通过数值求解贝尔曼方程计算出准最优策略，并将其与信息论策略（如信息引导策略 Infotaxis）进行基准比较。我们展示了当目标频繁重新定向时，纯探索策略接近最优，但当目标表现出持续运动时则表现不佳。因此，我们提出了一种计算效率高的混合策略，将信息增益驱动的 Infotaxis 与源自相关完全可观测控制问题的“贪婪”价值函数相结合。我们的启发式方法在所有持续时间上都达到了近似最优的表现，并且显著优于纯探索方法。此外，我们的提案在更复杂的搜索场景中表现出强大的鲁棒性，包括具有中等持续时间的连续跑动-翻滚猎物运动、模型不匹配以及更准确的气味羽流动态表示。我们的结果表明，目标运动的预测推断是有效嗅觉追踪的关键要素，并为在信息贫乏、动态演变环境中的搜索提供了一个通用框架。
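
The joint belief over target position and velocity described in the abstract can be illustrated with a minimal sketch. It assumes a one-dimensional ring of cells, a target that reverses heading with probability `p_tumble` at each step, and a caller-supplied observation likelihood. The function names and the 1D setting are illustrative assumptions, not the authors' model or code.

```python
def predict_belief(belief, n_cells, p_tumble):
    """One prediction step for a joint belief over (position, velocity).

    belief maps (x, v) -> probability, with x in range(n_cells) on a ring
    and v in {-1, +1}. Assumed dynamics: the target keeps its heading with
    probability 1 - p_tumble, otherwise reverses, then moves one cell.
    """
    new_belief = {}
    for (x, v), p in belief.items():
        for new_v, p_v in ((v, 1.0 - p_tumble), (-v, p_tumble)):
            if p_v == 0.0:
                continue
            key = ((x + new_v) % n_cells, new_v)
            new_belief[key] = new_belief.get(key, 0.0) + p * p_v
    return new_belief


def update_belief(belief, likelihood):
    """Bayes update: weight each state by the observation likelihood, then renormalize."""
    weighted = {s: p * likelihood(s) for s, p in belief.items()}
    total = sum(weighted.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the belief")
    return {s: p / total for s, p in weighted.items()}
```

With `p_tumble = 0` the belief simply translates with the target, which is the persistent regime where, according to the abstract, purely exploratory policies fail and predicting the target's motion matters most.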

cs.RO / 2 / 2604.13142

Multi-modal panoramic 3D outdoor datasets for place categorization

用于地点分类的多模态全景3D户外数据集

Jung, Hojung, Oto, Yuki, Mozos, Oscar M., Iwashita, Yumi, Kurazume, Ryo

Abstract

We present two multi-modal panoramic 3D outdoor (MPO) datasets for semantic place categorization with six categories: forest, coast, residential area, urban area, indoor parking lot, and outdoor parking lot. The first dataset consists of 650 static panoramic scans of dense (9,000,000 points) 3D color and reflectance point clouds obtained using a FARO laser scanner with synchronized color images. The second dataset consists of 34,200 real-time panoramic scans of sparse (70,000 points) 3D reflectance point clouds obtained using a Velodyne laser scanner while driving a car. The datasets were collected in the city of Fukuoka, Japan, and are publicly available in [1], [2]. In addition, we compare several approaches to semantic place categorization, with best results of 96.42% (dense) and 89.67% (sparse).

Chinese Translation

我们提出了两个用于语义地点分类的多模态全景3D户外（MPO）数据集，涵盖六个类别：森林、海岸、住宅区、城市区域以及室内/室外停车场。第一个数据集包含650个静态全景扫描，包含密集（9,000,000点）3D彩色和反射点云，使用FARO激光扫描仪获取，并同步彩色图像。第二个数据集包含34,200个实时全景扫描，包含稀疏（70,000点）3D反射点云，使用Velodyne激光扫描仪在驾驶汽车时获取。这些数据集是在日本福冈市获得的，并已公开发布于[1]，[2]。此外，我们比较了几种语义地点分类的方法，得到了最佳结果：密集数据集为96.42%，稀疏数据集为89.67%。

cs.RO / 3 / 2604.13204

Weakly-supervised Learning for Physics-informed Neural Motion Planning via Sparse Roadmap

基于稀疏路线图的物理信息神经运动规划的弱监督学习

Ni, Ruiqi, Liu, Yuchen, Qureshi, Ahmed H.

Abstract

The motion planning problem requires finding a collision-free path between start and goal configurations in high-dimensional, cluttered spaces. Recent learning-based methods offer promising solutions, with self-supervised physics-informed approaches such as Neural Time Fields (NTFields) solving the Eikonal equation to learn value functions without expert demonstrations. However, existing physics-informed methods struggle to scale in complex, multi-room environments, where simply increasing the number of samples cannot resolve local minima or guarantee global consistency. We propose Hierarchical Neural Time Fields (H-NTFields), a weakly-supervised framework that combines weak supervision from sparse roadmaps with physics-informed PDE regularization. The roadmap provides global topological anchors through upper and lower bounds on travel times, while PDE losses enforce local geometric fidelity and obstacle-aware propagation. Experiments on 18 Gibson environments and real robotic platforms show that H-NTFields substantially improves robustness over prior physics-informed methods, while enabling fast amortized inference through a continuous value representation.

Chinese Translation

运动规划问题需要在高维、杂乱的空间中找到从起始配置到目标配置的无碰撞路径。最近的基于学习的方法提供了有前景的解决方案，其中自监督的物理信息方法如神经时间场（Neural Time Fields, NTFields）通过求解Eikonal方程来学习价值函数，而无需专家演示。然而，现有的物理信息方法在复杂的多房间环境中难以扩展，单纯增加样本数量无法解决局部极小值或保证全局一致性。我们提出了层次神经时间场（Hierarchical Neural Time Fields, H-NTFields），这是一种弱监督框架，结合了来自稀疏路线图的弱监督和物理信息偏微分方程（PDE）正则化。路线图通过对旅行时间的上下界提供全局拓扑锚点，而PDE损失则强制执行局部几何保真度和障碍物感知传播。在18个Gibson环境和真实机器人平台上的实验表明，H-NTFields在鲁棒性上显著优于先前的物理信息方法，同时通过连续值表示实现快速的摊销推理。
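
The idea of using a sparse roadmap to bound travel times can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the paper: unit speed, an upper bound given by the shortest path along the roadmap (computed with Dijkstra's algorithm), a lower bound given by the straight-line distance, and a hinge penalty for predictions that leave the interval. The function names are hypothetical.

```python
import heapq
import math


def roadmap_upper_bounds(nodes, edges, source):
    """Shortest path length along a sparse roadmap from source to every node.

    nodes: {name: (x, y)}; edges: iterable of (a, b) pairs assumed collision-free.
    At unit speed, any feasible path length is an upper bound on the optimal travel time.
    """
    adj = {n: [] for n in nodes}
    for a, b in edges:
        w = math.dist(nodes[a], nodes[b])
        adj[a].append((b, w))
        adj[b].append((a, w))
    dist = {n: math.inf for n in nodes}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def bound_violation(t_pred, lower, upper):
    """Hinge penalty that is zero when lower <= t_pred <= upper."""
    return max(0.0, lower - t_pred) + max(0.0, t_pred - upper)
```

In a training loop, a penalty of this kind would be added to the Eikonal (PDE) residual loss, so the roadmap anchors the global structure while the PDE term shapes the field locally.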

cs.RO / 4 / 2604.13245

Capability-Aware Heterogeneous Control Barrier Functions for Decentralized Multi-Robot Safe Navigation

基于能力感知的异构控制障碍函数用于去中心化多机器人安全导航

Kim, Joonkyung, Zhang, Yanze, Luo, Wenhao, Lyu, Yiwei

Abstract

Safe navigation for multi-robot systems requires enforcing safety without sacrificing task efficiency under decentralized decision-making. Existing decentralized methods often assume robot homogeneity, making shared safety requirements non-uniformly interpreted across heterogeneous agents with structurally different dynamics, which could lead to avoidance obligations not physically realizable for some robots and thus cause safety violations or deadlock. In this paper, we propose Capability-Aware Heterogeneous Control Barrier Function (CA-HCBF), a decentralized framework for consistent safety enforcement and capability-aware coordination in heterogeneous robot teams. We derive a canonical second-order control-affine representation that unifies holonomic and nonholonomic robots under acceleration-level control via canonical transformation and backstepping, preserving forward invariance of the safe set while avoiding relative-degree mismatch across heterogeneous dynamics. We further introduce a support-function-based directional capability metric that quantifies each robot's ability to follow its motion intent, deriving a pairwise responsibility allocation that distributes the safety burden proportionally to each robot's motion capability. A feasibility-aware clipping mechanism further constrains the allocation to each agent's physically achievable range, mitigating infeasible constraint assignments common in dense decentralized CBF settings. Simulations with up to 30 heterogeneous robots and a physical multi-robot demonstration show improved safety and task efficiency over baselines, validating real-world applicability across robots with distinct kinematic constraints.

Chinese Translation

多机器人系统的安全导航需要在去中心化决策下强制执行安全性，而不牺牲任务效率。现有的去中心化方法通常假设机器人是同质的，这使得在具有结构上不同动态的异构代理之间共享的安全要求被不均匀地解释，这可能导致某些机器人无法物理实现的回避义务，从而引发安全违规或死锁。在本文中，我们提出了一种基于能力感知的异构控制障碍函数（CA-HCBF），这是一个用于在异构机器人团队中一致性安全执行和能力感知协调的去中心化框架。我们推导出了一种规范的二阶控制仿射表示，通过规范变换和反向步进将全动力学和非全动力学机器人统一在加速度级控制下，同时保持安全集的前向不变性，避免异构动态之间的相对阶数不匹配。我们进一步引入了一种基于支持函数的方向性能力度量，量化每个机器人跟随其运动意图的能力，推导出一对一的责任分配，将安全负担按比例分配给每个机器人的运动能力。一个考虑可行性的裁剪机制进一步限制了分配在每个代理的物理可实现范围内，减轻了在密集去中心化控制障碍函数设置中常见的不可行约束分配。对多达30个异构机器人的仿真和一个物理多机器人演示显示出比基线更好的安全性和任务效率，验证了在具有不同运动约束的机器人之间的实际应用性。

cs.RO / 5 / 2604.13248

GeoVision-Enabled Digital Twin for Hybrid Autonomous-Teleoperated Medical Responses

基于GeoVision的混合自主-遥控医疗响应数字双胞胎

Kebria, Parham, Sabri, Soheil, Brattain, Laura J

Abstract

Remote medical response systems are increasingly being deployed to support emergency care in disaster-affected and infrastructure-limited environments. Enabled by GeoVision capabilities, this paper presents a Digital Twin architecture for hybrid autonomous-teleoperated medical response systems. The proposed framework integrates perception and adaptive navigation with a Digital Twin, synchronized in real-time, that mirrors system states, environmental dynamics, patient conditions, and mission objectives. Unlike traditional ground control interfaces, the Digital Twin provides remote clinical and operational users with an intuitive, continuously updated virtual representation of the platform and its operational context, enabling enhanced situational awareness and informed decision-making.

Chinese Translation

远程医疗响应系统越来越多地被部署，以支持在灾害影响和基础设施受限环境中的紧急护理。本文提出了一种基于GeoVision能力的数字双胞胎架构，用于混合自主-遥控医疗响应系统。所提出的框架将感知和自适应导航与实时同步的数字双胞胎相结合，后者反映了系统状态、环境动态、患者状况和任务目标。与传统的地面控制接口不同，数字双胞胎为远程临床和操作用户提供了一个直观的、持续更新的虚拟表示，展示了平台及其操作背景，从而增强了情境意识和知情决策能力。

cs.RO / 6 / 2604.13309

Utilizing Inpainting for Keypoint Detection for Vision-Based Control of Robotic Manipulators

利用修复技术进行关键点检测以实现基于视觉的机器人操控

Chatterjee, Sreejani, Mullur, Venkatesh, Gandhi, Abhinav, Calli, Berk

Abstract

In this paper, we present a novel visual servoing framework that controls a robotic manipulator in the configuration space using purely natural visual features. Our goal is to develop methods that can robustly detect and track natural features or keypoints on robotic manipulators for use in vision-based control, especially in scenarios where placing external markers on the robot at runtime is not feasible or not preferred. For the model training process of our data-driven approach, we create a data collection pipeline in which we attach ArUco markers along the robot's body, label their centers as keypoints, and then use an inpainting method to remove the markers and reconstruct the occluded regions. By doing so, we generate natural (markerless) robot images that are automatically labeled with the marker locations. These images are used to train a keypoint detection algorithm, which in turn is used to control the robot configuration using natural features of the robot. Unlike prior methods that rely on accurate camera calibration and robot models for labeling training images, our approach eliminates these dependencies through inpainting. To achieve robust keypoint detection even in the presence of occlusion, we introduce a second inpainting model, used at runtime, that reconstructs occluded regions of the robot in real time, enabling continuous keypoint detection. To further enhance the consistency and robustness of keypoint predictions, we integrate an Unscented Kalman Filter (UKF) that refines the keypoint estimates over time, contributing to stable and reliable control performance. We obtained successful control results with this model-free and purely vision-based control strategy, using natural robot features at runtime, under both full visibility and partial occlusion.

Chinese Translation

本文提出了一种新颖的视觉伺服框架，通过使用纯粹的自然视觉特征来控制机器人操纵器在配置空间中的运动。我们的目标是开发能够稳健地检测和跟踪机器人操纵器上自然特征或关键点的方法，这些特征将用于基于视觉的控制，特别是在运行时无法或不希望在机器人上放置外部标记的场景中。为了支持我们数据驱动方法的模型训练过程，我们创建了一个数据收集管道，在机器人身体上附加 ArUco 标记，将其中心标记为关键点，然后利用修复方法去除标记并重建被遮挡区域。通过这种方式，我们生成了自然（无标记）机器人图像，这些图像自动标注了标记位置。这些图像用于训练关键点检测算法，该算法利用机器人的自然特征来控制机器人配置。与依赖于准确相机标定和机器人模型来标注训练图像的先前方法不同，我们的方法通过修复消除了这些依赖关系。为了在存在遮挡的情况下实现稳健的关键点检测，我们引入了第二个修复模型，该模型在运行时实时重建机器人的遮挡区域，从而实现连续的关键点检测。为了进一步增强关键点预测的一致性和稳健性，我们集成了无迹卡尔曼滤波器（UKF），该滤波器随着时间的推移精炼关键点估计，提升了控制性能的稳定性和可靠性。我们在完全可见和部分遮挡的情况下，利用自然机器人特征成功实现了这种无模型且完全基于视觉的控制策略。

cs.RO / 7 / 2604.13323

Vectorizing Projection in Manifold-Constrained Motion Planning for Real-Time Whole-Body Control

基于流形约束的运动规划中的投影向量化用于实时全身控制

Iyer, Shrutheesh R, Chang, I-Chia, Liu, Andrew Z., Gu, Yan, Kingston, Zachary

Abstract

Many robot planning tasks require satisfaction of one or more constraints throughout the entire trajectory. For geometric constraints, manifold-constrained motion planning algorithms are capable of planning collision-free paths between start and goal configurations on the constraint submanifolds specified by the task. Current state-of-the-art methods can take tens of seconds to solve these tasks for complex systems such as humanoid robots, making real-world use impractical, especially in dynamic settings. Inspired by recent advances in hardware-accelerated motion planning, we present a CPU SIMD-accelerated manifold-constrained motion planner that revisits projection-based constraint satisfaction through the lens of parallelization. By transforming relevant components into parallelizable structures, we use SIMD parallelism to plan constraint-satisfying solutions. Our approach achieves speed-ups of up to 100-1000x over the state of the art, making real-time constrained motion planning feasible for the first time. We demonstrate our planner on a real humanoid robot and show real-time whole-body quasi-static plan generation. Our work is available at https://commalab.org/papers/mcvamp/.

Chinese Translation

许多机器人规划任务要求在整个轨迹中满足一个或多个约束。对于几何约束，基于流形约束的运动规划算法能够在由任务指定的约束子流形上规划从起始配置到目标配置的无碰撞路径。目前的最先进方法在解决复杂系统（如类人机器人）这类任务时可能需要数十秒，使得在现实世界中的应用变得不切实际，特别是在动态环境中。受到近期硬件加速运动规划进展的启发，我们提出了一种基于CPU SIMD加速的流形约束运动规划器，通过并行化的视角重新审视基于投影的约束满足。通过将相关组件转化为可并行化的结构，我们利用SIMD并行性规划满足约束的解决方案。我们的方法在速度上实现了比最先进技术高达100-1000倍的提升，使得实时约束运动规划首次变得可行。我们在一个真实的类人机器人上演示了我们的规划器，并展示了实时全身准静态规划生成。我们的工作可在https://commalab.org/papers/mcvamp/获取。
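
Projection-based constraint satisfaction, as mentioned in the abstract, typically means pulling a sampled configuration onto the constraint manifold with Newton-style iterations. The sketch below assumes a toy constraint, f(q) = q·q − 1 = 0 (a unit sphere), for which the generic update q ← q − Jᵀ(JJᵀ)⁻¹f(q) has a closed form. The function names are hypothetical, and the plain Python batch loop only stands in for the SIMD lanes; it is not the authors' implementation.

```python
import math


def project_to_unit_sphere(q, tol=1e-10, max_iters=50):
    """Newton projection onto f(q) = q.q - 1 = 0 via q <- q - J^T (J J^T)^-1 f(q).

    Assumes q is not the zero vector (the Jacobian vanishes there).
    """
    q = list(q)
    for _ in range(max_iters):
        sq = sum(x * x for x in q)
        f = sq - 1.0
        if abs(f) <= tol:
            break
        # J = 2 q^T, so J J^T = 4 |q|^2 and the update scales q uniformly.
        scale = 1.0 - f / (2.0 * sq)
        q = [x * scale for x in q]
    return q


def project_batch(samples):
    """Project many samples; each one is independent, which is what makes the step parallelizable."""
    return [project_to_unit_sphere(q) for q in samples]
```

Because every sample runs the same arithmetic independently, a step like this maps naturally onto data-parallel hardware, which is the general property the abstract exploits.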

cs.RO / 8 / 2604.13325

Boundary Sampling to Learn Predictive Safety Filters via Pontryagin's Maximum Principle

通过庞特里亚金最大原理进行边界采样以学习预测安全过滤器

Dallas, James, Lew, Thomas, Talbot, John, DeCastro, Jonathan, Bansal, Somil, Subosits, John

Abstract

Safety filters provide a practical approach for enforcing safety constraints in autonomous systems. While learning-based tools scale to high-dimensional systems, their performance depends on informative data that includes states likely to lead to constraint violation, which can be difficult to sample efficiently in complex, high-dimensional systems. In this work, we characterize trajectories that barely avoid safety violations using the Pontryagin Maximum Principle (PMP). These boundary trajectories are used to guide data collection for learned Hamilton-Jacobi Reachability, concentrating learning efforts near safety-critical states to improve efficiency. The learned Control Barrier Value Function is then used directly for safety filtering. Simulations and experimental validation on a shared-control automotive racing application demonstrate that PMP sampling improves learning efficiency, yielding faster convergence, reduced failure rates, and improved safe set reconstruction, with wall times around 3 ms.

Chinese Translation

安全过滤器为在自主系统中强制执行安全约束提供了一种实用的方法。尽管基于学习的工具能够扩展到高维系统，但其性能依赖于包含可能导致约束违反的状态的信息丰富数据，而在复杂的高维系统中有效采样这些数据可能非常困难。在本研究中，我们利用庞特里亚金最大原理对几乎避免安全违规的轨迹进行了表征。这些边界轨迹用于指导学习哈密顿-雅可比可达性的数据收集，将学习工作集中在安全关键状态附近，以提高效率。然后，学习到的控制障碍值函数被直接用于安全过滤。对共享控制汽车赛车应用的仿真和实验验证表明，PMP采样提高了学习效率，实现了更快的收敛、降低的失败率和改进的安全集重构，墙面时间约为3毫秒。
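
To make the notion of a safety filter concrete, here is a minimal sketch for a one-dimensional single integrator x' = u with the safe set x ≤ x_max. It uses a hand-written barrier h(x) = x_max − x in place of the learned Control Barrier Value Function from the abstract; the system, the barrier, and the gain `alpha` are all illustrative assumptions.

```python
def safety_filter(u_nominal, x, x_max=1.0, alpha=5.0):
    """Minimally modify u_nominal so that dh/dt + alpha * h >= 0 with h = x_max - x.

    For x' = u, dh/dt = -u, so the condition reads u <= alpha * (x_max - x).
    In one dimension the closest admissible input is a simple min.
    """
    return min(u_nominal, alpha * (x_max - x))


def simulate(x0=0.0, u_nominal=2.0, dt=0.01, steps=200):
    """Forward-Euler rollout under the filtered input."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x + dt * safety_filter(u_nominal, x))
    return xs
```

Far from the boundary the nominal input passes through unchanged; near it the filter slows the state so that it approaches x_max without crossing. A learned value function would play the role of `h` in higher-dimensional systems.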

cs.RO / 9 / 2604.13405

Singularity Avoidance in Inverse Kinematics: A Unified Treatment of Classical and Learning-based Methods

逆向运动学中的奇异性避免：经典方法与基于学习的方法的统一处理

Rudrasamudram, Vishnu, Malaichamee, Hariharasudan

Abstract

Singular configurations cause loss of task-space mobility, unbounded joint velocities, and solver divergence in inverse kinematics (IK) for serial manipulators. No existing survey bridges classical singularity-robust IK with rapidly growing learning-based approaches. We provide a unified treatment spanning Jacobian regularization, Riemannian manipulability tracking, constrained optimization, and modern data-driven paradigms. A systematic taxonomy classifies methods by retained geometric structure and robustness guarantees (formal vs. empirical). We address a critical evaluation gap by proposing a benchmarking protocol and presenting experimental results: 12 IK solvers are evaluated on the Franka Panda under position-only IK across four complementary panels measuring error degradation by condition number, velocity amplification, out-of-distribution robustness, and computational cost. Results show that pure learning methods fail even on well-conditioned targets (MLP: 0% success, approx. 10 mm mean error), while hybrid warm-start architectures (IKFlow: 59% to 100%; CycleIK: 0% to 98.6%; GGIK: 0% to 100%) rescue learned solvers via classical refinement, with damped least squares (DLS) converging from initial errors of up to 207 mm. Deeper singularity-regime evaluation is identified as immediate future work.

Chinese Translation

奇异配置会导致串联机械手在逆向运动学（IK）中任务空间的运动能力丧失、关节速度无限制以及求解器的发散。现有的文献中没有一项能够将经典的抗奇异性逆向运动学与快速发展的基于学习的方法相结合。我们提供了一种统一的处理方法，涵盖雅可比矩阵正则化、黎曼可操作性跟踪、约束优化以及现代数据驱动范式。我们建立了一个系统的分类法，根据保留的几何结构和鲁棒性保证（形式与经验）对方法进行分类。我们通过提出一个基准测试协议并展示实验结果来填补关键评估空白：在仅考虑位置的逆向运动学下，对12个IK求解器在Franka Panda上进行了评估，使用四个互补面板测量条件数、速度放大、分布外鲁棒性和计算成本对误差退化的影响。结果表明，纯学习方法在良好条件的目标上也未能成功（MLP：0%成功率，平均误差约10毫米），而混合热启动架构 - IKFlow（59%到100%）、CycleIK（0%到98.6%）、GGIK（0%到100%） - 通过经典的细化方法拯救了学习求解器，其中DLS从初始误差高达207毫米收敛。对更深层次的奇异性区域评估被确定为未来的紧迫工作。
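
Damped least squares (DLS), the classical refinement named in the abstract, replaces the pseudoinverse step with dq = Jᵀ(JJᵀ + λ²I)⁻¹e, which stays bounded at singular configurations. The sketch below applies it to a planar two-link arm with unit link lengths. The arm, the damping value, and the step gain are illustrative assumptions, not the benchmark setup from the paper.

```python
import math


def fk(q, l1=1.0, l2=1.0):
    """End-effector position of a planar two-link arm."""
    return (l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1]),
            l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1]))


def jacobian(q, l1=1.0, l2=1.0):
    """2x2 position Jacobian of the same arm."""
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return ((-l1 * s1 - l2 * s12, -l2 * s12),
            (l1 * c1 + l2 * c12, l2 * c12))


def dls_step(q, target, damping=0.1):
    """dq = J^T (J J^T + damping^2 I)^-1 e, written out for the 2x2 case."""
    x, y = fk(q)
    e = (target[0] - x, target[1] - y)
    (a, b), (c, d) = jacobian(q)
    # A = J J^T + damping^2 I is symmetric positive definite for damping > 0.
    a11 = a * a + b * b + damping ** 2
    a12 = a * c + b * d
    a22 = c * c + d * d + damping ** 2
    det = a11 * a22 - a12 * a12
    # z = A^-1 e, then dq = J^T z.
    z0 = (a22 * e[0] - a12 * e[1]) / det
    z1 = (-a12 * e[0] + a11 * e[1]) / det
    return (a * z0 + c * z1, b * z0 + d * z1)


def solve_ik(q, target, iters=100, gain=0.5, damping=0.1):
    """Repeated damped steps from an initial guess q."""
    for _ in range(iters):
        dq = dls_step(q, target, damping)
        q = (q[0] + gain * dq[0], q[1] + gain * dq[1])
    return q
```

At the fully stretched configuration `q = (0, 0)` the Jacobian is singular, yet `dls_step` still returns a finite update because the damping term keeps the inverted matrix well conditioned. The price is slower convergence near the target as the damping grows.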

cs.RO / 10 / 2604.13441

Robust Energy-Aware Routing for Air-Ground Cooperative Multi-UAV Delivery in Wind-Uncertain Environments

风不确定环境下的鲁棒能量感知多无人机空地协作配送路由

Li, Tianshun, Lu, Hongliang, Sheng, Yanggang, Wang, Zhongzhen, Li, Haoang, Zheng, Xinhu

Abstract

Ensuring energy feasibility under wind uncertainty is critical for the safety and reliability of UAV delivery missions. In realistic truck-drone logistics systems, UAVs must deliver parcels and safely return under time-varying wind conditions that are only partially observable during flight. However, most existing routing approaches assume static or deterministic energy models, making them unreliable in dynamic wind environments. We propose Battery-Efficient Routing (BER), an online risk-sensitive planning framework for wind-sensitive truck-assisted UAV delivery. The problem is formulated as routing on a time-dependent energy graph whose edge costs evolve according to wind-induced aerodynamic effects. BER continuously evaluates return feasibility while balancing instantaneous energy expenditure and uncertainty-aware risk. The approach is embedded in a hierarchical aerial-ground delivery architecture that combines task allocation, routing, and decentralized trajectory execution. Extensive simulations on synthetic ER graphs generated in Unreal Engine environments and quasi-real wind logs demonstrate that BER significantly improves mission success rates and reduces wind-induced failures compared with static and greedy baselines. These results highlight the importance of integrating real-time energy budgeting and environmental awareness for UAV delivery planning under dynamic wind conditions.

Chinese Translation

在风不确定性下确保能量可行性对于无人机（UAV）配送任务的安全性和可靠性至关重要。在现实的卡车-无人机物流系统中，无人机必须在仅部分可观测的时变风况下送达包裹并安全返回。然而，大多数现有的路由方法假设静态或确定性的能量模型，使其在动态风环境中不可靠。我们提出了一种电池高效路由（Battery-Efficient Routing, BER）的方法，这是一个针对风敏感的卡车辅助无人机配送的在线风险敏感规划框架。该问题被表述为在一个时间依赖的能量图上进行路由，其边缘成本根据风引起的气动效应而变化。BER持续评估返回的可行性，同时平衡瞬时能量消耗和基于不确定性的风险。该方法嵌入在一个分层的空地配送架构中，结合了任务分配、路由和去中心化轨迹执行。在虚幻引擎环境中生成的合成ER图和准真实风日志上的大量仿真表明，与静态和贪婪基线相比，BER显著提高了任务成功率并减少了风引起的失败。这些结果强调了在动态风条件下，无人机配送规划中整合实时能量预算和环境意识的重要性。
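
A risk-sensitive return-feasibility check of the kind the abstract alludes to can be sketched with a mean-plus-deviation rule: the UAV may continue only if the expected return energy, inflated by a multiple of its spread under sampled wind conditions, still fits in the battery. This rule, the parameter names, and the watt-hour units are assumptions for illustration; the abstract does not specify how BER scores risk.

```python
import statistics


def return_is_feasible(remaining_wh, return_cost_samples_wh,
                       risk_weight=2.0, reserve_wh=0.0):
    """Risk-sensitive check: mean + risk_weight * std of the return cost must fit the battery.

    return_cost_samples_wh holds energy costs of the return leg under
    different sampled wind conditions (hypothetical inputs).
    """
    mean = statistics.fmean(return_cost_samples_wh)
    std = statistics.pstdev(return_cost_samples_wh)
    return mean + risk_weight * std + reserve_wh <= remaining_wh
```

Two sample sets with the same mean can give different answers: a tightly clustered set passes where a widely spread one fails, which is the behavior a deterministic energy model cannot express.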

cs.RO / 11 / 2604.13476

RobotPan: A 360° Surround-View Robotic Vision System for Embodied Perception

RobotPan：一种用于具身感知的360°全景机器人视觉系统

Ma, Jiahao, Zhang, Qiang, Liu, Peiran, Su, Zeran, Sun, Pihai, Han, Gang, Zhao, Wen, Cui, Wei, Zhang, Zhang, Xu, Zhiyuan, Xu, Renjing, Tang, Jian, Liu, Miaomiao, Guo, Yijie

Abstract

Surround-view perception is increasingly important for robotic navigation and loco-manipulation, especially in human-in-the-loop settings such as teleoperation, data collection, and emergency takeover. However, current robotic visual interfaces are often limited to narrow forward-facing views, or, when multiple on-board cameras are available, require cumbersome manual switching that interrupts the operator's workflow. Both configurations suffer from motion-induced jitter that causes simulator sickness in head-mounted displays. We introduce a surround-view robotic vision system that combines six cameras with LiDAR to provide full 360° visual coverage, while meeting the geometric and real-time constraints of embodied deployment. We further present RobotPan, a feed-forward framework that predicts metric-scaled and compact 3D Gaussians from calibrated sparse-view inputs for real-time rendering, reconstruction, and streaming. RobotPan lifts multi-view features into a unified spherical coordinate representation and decodes Gaussians using hierarchical spherical voxel priors, allocating fine resolution near the robot and coarser resolution at larger radii to reduce computational redundancy without sacrificing fidelity. To support long sequences, our online fusion updates dynamic content while preventing unbounded growth in static regions by selectively updating appearance. Finally, we release a multi-sensor dataset tailored to 360° novel view synthesis and metric 3D reconstruction for robotics, covering navigation, manipulation, and locomotion on real platforms. Experiments show that RobotPan achieves competitive quality against prior feed-forward reconstruction and view-synthesis methods while producing substantially fewer Gaussians, enabling practical real-time embodied deployment. Project website: https://robotpan.github.io/

Chinese Translation

全景感知在机器人导航和运动操控中变得越来越重要，尤其是在如远程操作、数据收集和紧急接管等人机协作环境中。然而，目前的机器人视觉接口通常仅限于狭窄的前视图，或者在多个机载摄像头可用时，需要繁琐的手动切换，这会打断操作员的工作流程。这两种配置都受到运动引起的抖动影响，导致头戴显示器中的模拟器晕动症。我们提出了一种全景机器人视觉系统，该系统结合了六个摄像头和激光雷达（LiDAR），提供完整的360°视觉覆盖，同时满足具身部署的几何和实时约束。我们进一步提出了RobotPan，这是一种前馈框架，能够从标定的稀疏视图输入中预测度量缩放和紧凑型的3D高斯分布，以实现实时渲染、重建和流媒体传输。RobotPan将多视图特征提升为统一的球坐标表示，并使用分层球形体素先验解码高斯分布，在机器人附近分配高分辨率，而在更大半径处使用较粗的分辨率，以减少计算冗余而不牺牲保真度。为了支持长序列，我们的在线融合在动态内容更新的同时，通过选择性更新外观来防止静态区域的无限增长。最后，我们发布了一个多传感器数据集，专门用于360°新视图合成和机器人度量3D重建，涵盖了在真实平台上的导航、操控和运动。实验表明，RobotPan在质量上与先前的前馈重建和视图合成方法具有竞争力，同时生成的高斯数量显著减少，使得实际的实时具身部署成为可能。项目网站：https://robotpan.github.io/

cs.RO / 12 / 2604.13492

RadarSplat-RIO: Indoor Radar-Inertial Odometry with Gaussian Splatting-Based Radar Bundle Adjustment

RadarSplat-RIO：基于高斯喷溅的室内雷达惯性里程计与雷达束调整

Kung, Pou-Chun, Tian, Yuan, Li, Zhengqin, Liu, Yue, Whitmire, Eric, Kienzle, Wolf, Benko, Hrvoje

Abstract

Radar is more resilient to adverse weather and lighting conditions than visual and Lidar simultaneous localization and mapping (SLAM). However, most radar SLAM pipelines still rely heavily on frame-to-frame odometry, which leads to substantial drift. While loop closure can correct long-term errors, it requires revisiting places and relies on robust place recognition. In contrast, visual odometry methods typically leverage bundle adjustment (BA) to jointly optimize poses and map within a local window. However, an equivalent BA formulation for radar has remained largely unexplored. We present the first radar BA framework enabled by Gaussian Splatting (GS), a dense and differentiable scene representation. Our method jointly optimizes radar sensor poses and scene geometry using full range-azimuth-Doppler data, bringing the benefits of multi-frame BA to radar for the first time. When integrated with an existing radar-inertial odometry frontend, our approach significantly reduces pose drift and improves robustness. Across multiple indoor scenes, our radar BA achieves substantial gains over the prior radar-inertial odometry, reducing average absolute translational and rotational errors by 90% and 80%, respectively.

Chinese Translation

雷达在恶劣天气和光照条件下比视觉和激光雷达（Lidar）同时定位与地图构建（SLAM）更具韧性。然而，大多数雷达SLAM管道仍然严重依赖帧间里程计，这导致了显著的漂移。虽然回环闭合可以纠正长期误差，但它需要重新访问地点并依赖于强大的地点识别。相比之下，视觉里程计方法通常利用束调整（BA）在局部窗口内联合优化位姿和地图。然而，雷达的等效BA公式仍然基本未被探索。我们提出了第一个由高斯喷溅（GS）驱动的雷达BA框架，这是一种密集且可微分的场景表示。我们的方法使用全范围-方位-多普勒数据联合优化雷达传感器位姿和场景几何，将多帧BA的优势首次引入雷达。当与现有的雷达惯性里程计前端集成时，我们的方法显著减少了位姿漂移并提高了鲁棒性。在多个室内场景中，我们的雷达BA相较于先前的雷达惯性里程计取得了显著的提升，平均绝对平移和旋转误差分别降低了90%和80%。

cs.RO / 13 / 2604.13513

A transformable slender microrobot inspired by nematode parasites for interventional endovascular surgery

一种受线虫寄生虫启发的可变形细长微机器人用于介入性血管内手术

Yang, Xin, Fan, Dongliang, Ma, Yunteng, Liao, Yuxuan, Li, Diancheng, Cheang, U Kei, Peng, Bo, Wang, Hongqiang

Abstract

Cardiovascular diseases account for around 17.9 million deaths per year globally. Treating them is challenging because of the confined space and complex topology of the vascular network and the high risks during operations. Robots, although promising, still face a trade-off between versatility and maneuverability after decades of development. Inspired by nematodes, parasites that live, feed, and move in the human vascular system, this work develops a transformable slender magnetic microrobot. Based on experiments and analyses, we optimize the fabrication and geometry of the robot and create a slender prototype with an aspect ratio larger than 100 (less than 200 microns in diameter and more than 20 mm in length), which has magnetic beads uniformly distributed along an ultrathin polymer string and a large bead at the head. This prototype shows great flexibility (largest curvature: 0.904 mm⁻¹) and locomotion capability (maximum speed: 125 mm/s). Moreover, the nematode-inspired robot can pass through sharp turns with a radius of 0.84 mm and through holes distributed in three-dimensional (3D) space. We also demonstrate the microrobot's potential application in interventional surgery by navigating it through a narrow blood vessel mold, where it deforms its slender body to wrap and transport a drug (95 times heavier than the robot) and finally releases the drug at the target position. The robot further demonstrates possible applications in embolization by transforming and winding itself into an aneurysm phantom, and exhibits outstanding injectability by being successfully withdrawn and injected through the medical needle (diameter: 1.2 mm) of a syringe.

Chinese Translation

心血管疾病每年在全球造成约1790万人的死亡，考虑到血管网络的有限空间和复杂拓扑结构以及手术过程中的高风险，其治疗具有挑战性。尽管机器人技术前景广阔，但经过数十年的发展，仍面临着多功能性与机动性之间的困境。受到生活、取食和在人体血管系统中移动的线虫启发，本研究开发了一种可变形的细长磁性微机器人。基于实验和分析，我们优化了机器人的制造和几何形状，最终创建了一个纵横比大于100（直径小于200微米，长度大于20毫米）的细长原型，其主体上均匀分布着超薄聚合物线上的磁珠，头部则有一个较大的珠子。该原型展示了极大的灵活性（最大曲率为0.904 mm-1）和运动能力（最大速度为125 mm/s）。此外，受线虫启发的机器人能够通过半径为0.84 mm的急转弯和三维空间中分布的孔洞。我们还展示了该微机器人在介入手术中的潜在应用，通过在狭窄的血管模具中导航，将药物（重量为机器人95倍）包裹并运输到目标位置，最终释放药物，变形细长的机器人身体。此外，该机器人还展示了在栓塞中的可能应用，通过变形并缠绕进入动脉瘤模型，并通过医疗针（直径：1.2 mm）的注射器成功地被抽出和注入，展现了其卓越的注射能力。

cs.RO / 14 / 2604.13530

Stability Principle Underlying Passive Dynamic Walking of Rimless Wheel

无轮缘轮被动动态行走的稳定性原理

Asano, Fumihiko

Abstract

Rimless wheels are known as the simplest model of passive dynamic walking. The passive gait generated by gravity alone is known to always become asymptotically stable and 1-period, because a rimless wheel automatically satisfies the two conditions necessary to guarantee asymptotic stability: one is the constraint on the impact posture and the other is the constraint on the restored mechanical energy. The asymptotic stability is then easily shown by the recurrence formula of the kinetic energy. There is room, however, for further research into the inherent stability principle. In this paper, we reconsider the stability of the stance phase based on the linearization of the equation of motion, and investigate the relation between stability and the energy conservation law. Through the mathematical analysis, we provide a greater understanding of the inherent stability principle.

Chinese Translation

无轮缘轮被认为是被动动态行走的最简单模型。已知仅由重力效应产生的被动步态总是渐近稳定且为1周期，因为无轮缘轮自动满足保证渐近稳定性的两个必要条件：一个是对冲击姿态的约束，另一个是对恢复机械能的约束。渐近稳定性可以通过动能的递归公式轻松证明。然而，对于固有稳定性原理的进一步研究仍有空间。本文基于运动方程的线性化重新考虑了站立阶段的稳定性，并探讨了稳定性与能量守恒定律之间的关系。通过数学分析，我们对固有稳定性原理提供了更深入的理解。
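
The recurrence argument mentioned in the abstract can be illustrated numerically. The sketch assumes the textbook idealization of a rimless wheel with all mass at the hub and massless legs: each impact scales the kinetic energy by ε = cos²α (α is the angle between adjacent legs), and each step down a slope of angle φ restores ΔE = 2·m·g·l·sin(α/2)·sin φ. The pre-impact kinetic energy then obeys K[i+1] = ε·K[i] + ΔE. These formulas are standard for that idealization and are not quoted from the paper; the sketch also assumes the wheel always has enough energy to complete each step.

```python
import math


def step_parameters(m=1.0, l=1.0, alpha=math.pi / 4, phi=0.1, g=9.81):
    """Energy-loss coefficient and restored energy per step (hub-mass idealization)."""
    eps = math.cos(alpha) ** 2
    delta_e = 2.0 * m * g * l * math.sin(alpha / 2.0) * math.sin(phi)
    return eps, delta_e


def iterate_kinetic_energy(k0, eps, delta_e, steps):
    """Pre-impact kinetic energy over successive steps: K[i+1] = eps * K[i] + delta_e."""
    ks = [k0]
    for _ in range(steps):
        ks.append(eps * ks[-1] + delta_e)
    return ks
```

Because 0 ≤ ε < 1, the deviation from the fixed point K* = ΔE / (1 − ε) shrinks by the factor ε at every step, regardless of the initial energy. That contraction is the asymptotic stability of the 1-period gait.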

cs.RO / 15 / 2604.13533

Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization

可进化的具身智能体用于机器人操控的长短期反思与优化

Wang, Jianzong, Zhao, Botao, He, Yayun, Peng, Junqing, Zhang, Xulong

Abstract

Achieving general-purpose robotics requires empowering robots to adapt and evolve based on their environment and feedback. Traditional methods face limitations such as extensive training requirements, difficulties in cross-task generalization, and lack of interpretability. Prompt learning offers new opportunities for self-evolving robots without extensive training, simply by reflecting on past experiences. However, extracting meaningful insights from task successes and failures remains a challenge. To this end, we propose the evolvable embodied agent (EEAgent) framework, which leverages large vision-language models (VLMs) for better environmental interpretation and policy planning. To enhance reflection on past experiences, we propose a long short-term reflective optimization (LSTRO) mechanism that dynamically refines prompts based on both past experiences and newly learned lessons, facilitating continuous self-evolution and thereby enhancing overall task success rates. Evaluations on six VIMA-Bench tasks reveal that our approach sets a new state of the art, notably outperforming baselines in complex scenarios.

Chinese Translation

实现通用机器人技术需要赋予机器人根据其环境和反馈进行适应和进化的能力。传统方法面临诸多限制，如训练需求庞大、跨任务泛化困难以及缺乏可解释性。快速学习为自我进化的机器人提供了新的机会，无需广泛的训练，仅需反思过去的经验。然而，从任务的成功与失败中提取有意义的见解仍然是一个挑战。为此，我们提出了可进化的具身智能体（EEAgent）框架，该框架利用大型视觉-语言模型（VLMs）来更好地进行环境解释和策略规划。为了增强对过去经验的反思，我们提出了一种长短期反思优化（LSTRO）机制，该机制基于过去的经验和新学到的教训动态地优化提示，从而促进持续的自我进化，提升整体任务成功率。在六个VIMA-Bench任务上的评估表明，我们的方法设定了新的最先进水平，尤其在复杂场景中显著超越了基线。

cs.RO / 16 / 2604.13542

Self-adaptive Multi-Access Edge Architectures: A Robotics Case

自适应多接入边缘架构：一个机器人案例

Moghaddam, Mahyar T, Leed, Joakim, Frandsen, Anders

Abstract

The growth of compute-intensive AI tasks highlights the need to mitigate the processing costs and improve performance and energy efficiency. This necessitates the integration of intelligent agents as architectural adaptation supervisors tasked with adaptive scaling of the infrastructure and efficient offloading of computation within the continuum. This paper presents a self-adaptation approach for an efficient computing system of a mixed human-robot environment. The computation task is associated with a Neural Network algorithm that leverages sensory data to predict human mobility behaviors, to enhance mobile robots' proactive path planning, and ensure human safety. To streamline neural network processing, we built a distributed edge offloading system with heterogeneous processing units, orchestrated by Kubernetes. By monitoring response times and power consumption, the MAPE-K-based adaptation supervisor makes informed decisions on scaling and offloading. Results show notable improvements in service quality over traditional setups, demonstrating the effectiveness of the proposed approach for AI-driven systems.

Chinese Translation

计算密集型人工智能任务的增长突显了降低处理成本、提高性能和能效的必要性。这需要将智能代理集成作为架构适应监督者，负责基础设施的自适应扩展和计算的高效卸载。本文提出了一种自适应方法，用于混合人机环境中的高效计算系统。计算任务与一种神经网络（Neural Network）算法相关，该算法利用传感器数据预测人类移动行为，以增强移动机器人的主动路径规划并确保人类安全。为了简化神经网络处理，我们构建了一个具有异构处理单元的分布式边缘卸载系统，由Kubernetes进行协调。通过监测响应时间和功耗，基于MAPE-K的适应监督者能够对扩展和卸载做出明智决策。结果显示，与传统设置相比，服务质量显著提高，证明了所提方法在人工智能驱动系统中的有效性。


cs.RO / 17 / 2604.13584

UNRIO: Uncertainty-Aware Velocity Learning for Radar-Inertial Odometry

UNRIO：一种考虑不确定性的雷达惯性里程计速度学习方法

Huang, Jui-Te, Huang, Tinashu, Rowe, Anthony, Kaess, Michael

Abstract

We present UNRIO, an uncertainty-aware radar-inertial odometry system that estimates ego-velocity directly from raw mmWave radar IQ signals rather than processed point clouds. Existing radar-inertial odometry methods rely on handcrafted signal processing pipelines that discard latent information in the raw spectrum and require careful parameter tuning. To address this, we propose a transformer-based neural network built on the GRT architecture that processes the full 4-D spectral cube to predict body-frame velocity in two modes: a direct linear velocity estimate and a per-angle-bin Doppler velocity map. The network is trained in three stages: geometric pretraining on LiDAR-projected depth, velocity or Doppler fine-tuning, and uncertainty calibration via negative log-likelihood loss, enabling it to produce uncertainty estimates alongside its predictions. These uncertainty estimates are propagated into a sliding-window pose graph that fuses radar velocity factors with IMU preintegration measurements. We train and evaluate UNRIO on the IQ1M dataset across diverse indoor environments with both forward and lateral motion patterns unseen during training. Our method achieves the lowest relative pose error on the majority of sequences, with particularly strong gains over classical DSP baselines on lateral-motion trajectories where sparse point clouds degrade conventional velocity estimators.
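The uncertainty-calibration stage mentioned above relies on a negative log-likelihood loss. As a rough sketch only, assuming an independent Gaussian per velocity axis with a predicted log-variance (the abstract does not state the exact form), the loss could be written as below. The function name `gaussian_nll` and its arguments are illustrative and are not taken from UNRIO.

```python
import math

def gaussian_nll(pred, log_var, target):
    """Mean Gaussian negative log-likelihood over velocity axes.

    Assumption: each axis is an independent Gaussian with mean `pred[k]`
    and variance exp(log_var[k]); this is a common choice, not confirmed
    by the abstract.
    """
    total = 0.0
    for mu, lv, y in zip(pred, log_var, target):
        # 0.5 * [log(sigma^2) + (y - mu)^2 / sigma^2 + log(2*pi)]
        total += 0.5 * (lv + (y - mu) ** 2 / math.exp(lv) + math.log(2 * math.pi))
    return total / len(pred)
```

Under this form, a confident prediction (small variance) with a large error is penalized more than an uncertain one, which is what pushes the predicted variance toward the actual error level.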

Chinese Translation

我们提出了UNRIO，一种考虑不确定性的雷达惯性里程计系统，它直接从原始毫米波雷达IQ信号中估计自我速度，而不是从处理后的点云中估计。现有的雷达惯性里程计方法依赖于手工设计的信号处理流程，这些流程会丢弃原始频谱中的潜在信息，并且需要仔细的参数调优。为了解决这个问题，我们提出了一种基于Transformer的神经网络，构建在GRT架构上，处理完整的4-D频谱立方体，以两种模式预测机体坐标系速度：直接线性速度估计和每个角度区间（angle bin）的多普勒速度图。该网络分三个阶段进行训练：在LiDAR投影深度上的几何预训练、速度或多普勒的微调，以及通过负对数似然损失进行的不确定性校准，使其能够在预测的同时生成不确定性估计。这些不确定性估计被传播到一个滑动窗口位姿图中，该图将雷达速度因子与IMU（惯性测量单元）预积分测量融合。我们在IQ1M数据集上训练和评估UNRIO，涵盖多种室内环境，并在训练期间未见的前向和侧向运动模式下进行测试。我们的方法在大多数序列中实现了最低的相对位姿误差，在侧向运动轨迹上相较于传统数字信号处理基线表现出特别显著的提升，因为稀疏点云会削弱传统速度估计器的性能。


cs.RO / 18 / 2604.13645

A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies

生成机器人策略中的模拟与真实协同训练机制分析

Lei, Yu, Liu, Minghuan, Maddukuri, Abhiram, Jiang, Zhenyu, Zhu, Yuke

Abstract

Co-training, which combines limited in-domain real-world data with abundant surrogate data such as simulation or cross-embodiment robot data, is widely used for training generative robot policies. Despite its empirical success, the mechanisms that determine when and why co-training is effective remain poorly understood. We investigate the mechanism of sim-and-real co-training through theoretical analysis and empirical study, and identify two intrinsic effects governing performance. The first, "structured representation alignment", reflects a balance between cross-domain representation alignment and domain discernibility, and plays a primary role in downstream performance. The second, the "importance reweighting effect", arises from domain-dependent modulation of action weighting and operates at a secondary level. We validate these effects with controlled experiments on a toy model and extensive sim-and-sim and sim-and-real robot manipulation experiments. Our analysis offers a unified interpretation of recent co-training techniques and motivates a simple method that consistently improves upon prior approaches. More broadly, our aim is to examine the inner workings of co-training and to facilitate research in this direction.

Chinese Translation

协同训练结合了有限的领域内真实数据与丰富的替代数据，如模拟数据或跨体现机器人数据，广泛用于生成机器人策略的训练。尽管其在实践中的成功显著，但决定协同训练何时以及为何有效的机制仍然不甚明了。我们通过理论分析和实证研究探讨了模拟与真实协同训练的机制，并识别出两种影响性能的内在效应。第一种，“结构化表示对齐”，反映了跨领域表示对齐与领域可辨别性之间的平衡，并在下游性能中发挥主要作用。第二种，“重要性重加权效应”，源于对行动加权的领域依赖调制，并在次级层面上发挥作用。我们通过对玩具模型的控制实验以及广泛的模拟与模拟、模拟与真实的机器人操作实验验证了这些效应。我们的分析提供了对近期协同训练技术的统一解释，并启发了一种简单的方法，该方法始终优于先前的方法。更广泛地说，我们的目标是考察协同训练的内在机制，并促进该方向的研究。


cs.RO / 19 / 2604.13654

Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

无人机的视觉与语言导航：进展、挑战与研究路线图

Chen, Hanxuan, Zheng, Jie, Yang, Siqi, Zeng, Tianle, Feng, Siwei, Cheng, Songsheng, Ren, Ruilong, Guo, Hanzhong, Yuan, Shuai, Wang, Xiangyue, Wang, Kangli, Pei, Ji

Abstract

Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN) represents a pivotal challenge in embodied artificial intelligence, focused on enabling UAVs to interpret high-level human commands and execute long-horizon tasks in complex 3D environments. This paper provides a comprehensive and structured survey of the field, from its formal task definition to the current state of the art. We establish a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and the emerging integration of generative world models with VLA architectures for physically-grounded reasoning. The survey systematically reviews the ecosystem of essential resources, including simulators, datasets, and evaluation metrics, that facilitates standardized research. Furthermore, we conduct a critical analysis of the primary challenges impeding real-world deployment: the simulation-to-reality gap, robust perception in dynamic outdoor settings, reasoning with linguistic ambiguity, and the efficient deployment of large models on resource-constrained hardware. By synthesizing current benchmarks and limitations, this survey concludes by proposing a forward-looking research roadmap to guide future inquiry into key frontiers such as multi-agent swarm coordination and air-ground collaborative robotics.

Chinese Translation

无人机视觉与语言导航（UAV-VLN）代表了具身人工智能中的一个关键挑战，旨在使无人机能够理解高层次的人类指令并在复杂的三维环境中执行长时间的任务。本文提供了该领域的全面而系统的调查，从正式的任务定义到当前的技术前沿。我们建立了一种方法论分类法，描绘了从早期的模块化和深度学习方法到当代由大型基础模型驱动的智能系统的技术演变，包括视觉-语言模型（Vision-Language Models, VLMs）、视觉-语言-行动模型（Vision-Language-Action, VLA）以及将生成世界模型与VLA架构结合以实现物理基础推理的新兴整合。该调查系统地回顾了促进标准化研究的基本资源生态系统，包括模拟器、数据集和评估指标。此外，我们对阻碍实际部署的主要挑战进行了批判性分析：模拟与现实之间的差距、动态户外环境中的稳健感知、语言歧义下的推理，以及在资源受限硬件上高效部署大型模型。通过综合当前的基准和局限性，本文最后提出了一条前瞻性的研究路线图，以指导未来在多智能体群体协调和空地协作机器人等关键前沿领域的研究。


cs.RO / 20 / 2604.13677

Empirical Prediction of Pedestrian Comfort in Mobile Robot Pedestrian Encounters

移动机器人与行人相遇时行人舒适度的实证预测

Jafari, Alireza, Nguyen, Hong-Son, Liu, Yen-Chen

Abstract

Mobile robots joining public spaces like sidewalks must care for pedestrian comfort. Many studies consider pedestrians' objective safety, for example, by developing collision avoidance algorithms, but not enough studies take the pedestrian's subjective safety or comfort into consideration. Quantifying comfort is a major challenge that hinders mobile robots from understanding and responding to human emotions. We empirically look into the relationship between the mobile robot-pedestrian interaction kinematics and subjective comfort. We perform one-on-one experimental trials, each involving a mobile robot and a volunteer. Statistical analysis of pedestrians' reported comfort versus the kinematic variables shows moderate but significant correlations for most variables. Based on these empirical findings, we design three comfort estimators/predictors derived from the minimum distance, the minimum projected time-to-collision, and a composite estimator. The composite estimator employs all studied kinematic variables and reaches the highest prediction rate and classifying performance among the predictors. The composite predictor has an odds ratio of 3.67. In simple terms, when it identifies a pedestrian as comfortable, it is almost 4 times more likely that the pedestrian is comfortable rather than uncomfortable. The study provides a comfort quantifier for incorporating pedestrian feelings into path planners for more socially compliant robots.
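The odds ratio quoted above can be read off a 2x2 table of predicted versus reported comfort. The sketch below shows the arithmetic under the usual definition, which compares the odds of a pedestrian being comfortable when the predictor says "comfortable" with the same odds when it says "uncomfortable". The counts are hypothetical and are NOT taken from the paper; `odds_ratio` is an illustrative name.

```python
def odds_ratio(tp, fp, fn, tn):
    """Odds ratio of a 2x2 table: (tp / fp) / (fn / tn).

    tp: predicted comfortable, reported comfortable
    fp: predicted comfortable, reported uncomfortable
    fn: predicted uncomfortable, reported comfortable
    tn: predicted uncomfortable, reported uncomfortable
    """
    if fp == 0 or fn == 0:
        raise ValueError("odds ratio is undefined when fp or fn is zero")
    return (tp * tn) / (fp * fn)

# Hypothetical counts for illustration only (not data from the study).
example = odds_ratio(tp=40, fp=10, fn=20, tn=20)
```

With these made-up counts the odds are 40/10 = 4 in the predicted-comfortable group and 20/20 = 1 in the other group, so the ratio is 4. A value of 1 would mean the predictor carries no information about comfort.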

Chinese Translation

移动机器人在公共空间（如人行道）中运行时必须关注行人的舒适度。许多研究考虑了行人的客观安全，例如通过开发避碰算法，但对行人的主观安全或舒适度的研究仍然不足。量化舒适度是一个主要挑战，这阻碍了移动机器人理解和响应人类情感。我们实证研究了移动机器人与行人互动的运动学与主观舒适度之间的关系。我们进行了一对一的实验试验，每次试验涉及一台移动机器人和一名志愿者。对行人报告的舒适度与运动学变量的统计分析显示，大多数变量之间存在中等但显著的相关性。基于这些实证发现，我们设计了三个舒适度估计器/预测器，分别基于最小距离、最小投影碰撞时间和一个复合估计器。复合估计器使用所有研究的运动学变量，并在预测率和分类性能方面达到了最高水平。复合预测器的比值比为3.67。简单来说，当它将行人识别为舒适时，行人实际上舒适的可能性几乎是行人不舒适的4倍。本研究提供了一种舒适度量化工具，以便将行人感受纳入路径规划，从而使机器人更符合社会规范。


cs.RO / 21 / 2604.13788

Failure Identification in Imitation Learning Via Statistical and Semantic Filtering

通过统计和语义过滤进行模仿学习中的故障识别

Rolland, Quentin, de Chamisso, Fabrice Mayran, Mouret, Jean-Baptiste

Abstract

Imitation learning (IL) policies in robotics deliver strong performance in controlled settings but remain brittle in real-world deployments: rare events such as hardware faults, defective parts, unexpected human actions, or any state that lies outside the training distribution can lead to failed executions. Vision-based Anomaly Detection (AD) methods emerged as an appropriate solution to detect these anomalous failure states but do not distinguish failures from benign deviations. We introduce FIDeL (Failure Identification in Demonstration Learning), a policy-independent failure detection module. Leveraging recent AD methods, FIDeL builds a compact representation of nominal demonstrations and aligns incoming observations via optimal transport matching to produce anomaly scores and heatmaps. Spatio-temporal thresholds are derived with an extension of conformal prediction, and a Vision-Language Model (VLM) performs semantic filtering to discriminate benign anomalies from genuine failures. We also introduce BotFails, a multimodal dataset of real-world tasks for failure detection in robotics. FIDeL consistently outperforms state-of-the-art baselines, yielding +5.30% AUROC in anomaly detection and +17.38% failure-detection accuracy on BotFails compared to existing methods.
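The abstract derives its thresholds with an extension of conformal prediction that is not described here. For orientation, the sketch below shows only the plain split-conformal baseline on scalar anomaly scores, which is an assumption and not FIDeL's spatio-temporal variant. The function name and the `alpha` parameter are illustrative.

```python
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal threshold from anomaly scores of nominal runs.

    A new nominal score exceeds the returned value with probability at
    most `alpha`, assuming it is exchangeable with the calibration scores.
    """
    scores = sorted(calibration_scores)
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration samples for this alpha: nothing is flagged.
        return math.inf
    return scores[rank - 1]
```

An observation would then be flagged as anomalous when its score exceeds this threshold. Deciding whether the anomaly is a genuine failure is the job of the separate semantic filtering step.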

Chinese Translation

模仿学习（IL）策略在机器人领域的受控环境中表现出色，但在实际部署中仍然脆弱：硬件故障、缺陷部件、意外的人类行为或任何超出训练分布的状态等稀有事件都可能导致执行失败。基于视觉的异常检测（AD）方法被认为是检测这些异常故障状态的合适解决方案，但它们无法区分故障与良性偏差。我们提出了FIDeL（演示学习中的故障识别），这是一个与策略无关的故障检测模块。FIDeL利用最新的AD方法，构建了名义演示的紧凑表示，并通过最优传输匹配对输入观测进行对齐，以产生异常分数和热图。通过保形预测（conformal prediction）的扩展推导出时空阈值，并且视觉-语言模型（VLM）执行语义过滤，以区分良性异常和真正的故障。我们还引入了BotFails，这是一个用于机器人故障检测的多模态真实任务数据集。FIDeL始终优于最先进的基线：与现有方法相比，在BotFails上的异常检测AUROC提高了5.30%，故障检测准确率提高了17.38%。


cs.RO / 22 / 2604.13800

EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development

EmbodiedClaw：面向具身人工智能开发的对话工作流执行

Zhou, Xueyang, Sun, Yihan, Gong, Xijie, Tie, Guiyao, Zhou, Pan, Sun, Lichao, Chen, Yongchao

Abstract

Embodied AI research is increasingly moving beyond single-task, single-environment policy learning toward multi-task, multi-scene, and multi-model settings. This shift substantially increases the engineering overhead and development time required for stages such as evaluation environment construction, trajectory collection, model training, and evaluation. To address this challenge, we propose a new paradigm for embodied AI development in which users express goals and constraints through conversation, and the system automatically plans and executes the development workflow. We instantiate this paradigm with EmbodiedClaw, a conversational agent that turns high-frequency, high-cost embodied research activities, including environment creation and revision, benchmark transformation, trajectory synthesis, model evaluation, and asset expansion, into executable skills. Experiments on end-to-end workflow tasks, capability-specific evaluations, human researcher studies, and ablations show that EmbodiedClaw reduces manual engineering effort while improving executability, consistency, and reproducibility. These results suggest a shift from manual toolchains to conversationally executable workflows for embodied AI development.

Chinese Translation

具身人工智能研究正逐渐从单任务、单环境的策略学习转向多任务、多场景和多模型的设置。这一转变显著增加了评估环境构建、轨迹收集、模型训练和评估等阶段所需的工程开销和开发时间。为了解决这一挑战，我们提出了一种新的具身人工智能开发范式，用户通过对话表达目标和约束，系统则自动规划和执行开发工作流。我们通过EmbodiedClaw实现了这一范式，这是一种对话代理，将高频、高成本的具身研究活动（包括环境创建和修订、基准转换、轨迹合成、模型评估和资产扩展）转化为可执行的技能。在端到端工作流任务、特定能力评估、人类研究者研究和消融实验中的实验结果表明，EmbodiedClaw减少了手动工程工作量，同时提高了可执行性、一致性和可重复性。这些结果表明，具身人工智能开发可以从手动工具链转向可通过对话执行的工作流。


cs.RO / 23 / 2604.13853

Mosaic: An Extensible Framework for Composing Rule-Based and Learned Motion Planners

Mosaic：一个可扩展的框架，用于组合基于规则和学习的运动规划器

Large, Nick Le, Steiner, Marlon, Wang, Lingguang, Poh, Willi, Pauls, Jan-Hendrik, Taş, Ömer Şahin, Stiller, Christoph

Abstract

Safe and explainable motion planning remains a central challenge in autonomous driving. While rule-based planners offer predictable and explainable behavior, they often fail to grasp the complexity and uncertainty of real-world traffic. Conversely, learned planners exhibit strong adaptability but suffer from reduced transparency and occasional safety violations. We introduce Mosaic, an extensible framework for structured decision-making that integrates both paradigms through arbitration graphs. By decoupling trajectory verification and scoring from the generation of trajectories by individual planners, every decision becomes transparent and traceable. Trajectory verification at a higher level introduces redundancy between the planners, limiting emergency braking to the rare case where all planners fail to produce a valid trajectory. Through unified scoring and optimal trajectory selection, rule-based and learned planners with complementary strengths and weaknesses can be combined to yield the best of both worlds. In experimental evaluation on nuPlan, Mosaic achieves 95.48 CLS-NR and 93.98 CLS-R on the Val14 closed-loop benchmark, setting a new state of the art, while reducing at-fault collisions by 30% compared to either planner in isolation. On the interPlan benchmark, focused on highly interactive and difficult scenarios, Mosaic scores 54.30 CLS-R, outperforming its best constituent planner by 23.3% - all without retraining or requiring additional data. The code is available at github.com/KIT-MRT/mosaic.
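The decoupling described above (planners propose, a higher level verifies and scores, emergency braking is the last resort) can be illustrated with a very small selection routine. This is a simplified sketch and not Mosaic's arbitration-graph implementation. All names (`select_trajectory`, `is_valid`, `score`, `emergency_brake`) are hypothetical.

```python
def select_trajectory(candidates, is_valid, score, emergency_brake):
    """Pick the best verified trajectory among all planners' proposals.

    candidates: trajectories proposed by the individual planners
    is_valid: verification predicate applied to every candidate
    score: unified scoring function (higher is better)
    emergency_brake: fallback used only if no candidate passes verification
    """
    valid = [traj for traj in candidates if is_valid(traj)]
    if not valid:
        return emergency_brake
    return max(valid, key=score)

# Usage with toy trajectories represented as dictionaries.
proposals = [
    {"planner": "rule_based", "valid": True, "score": 0.6},
    {"planner": "learned", "valid": True, "score": 0.9},
    {"planner": "learned_alt", "valid": False, "score": 1.0},
]
chosen = select_trajectory(
    proposals,
    is_valid=lambda t: t["valid"],
    score=lambda t: t["score"],
    emergency_brake={"planner": "emergency_brake"},
)
```

Because verification happens outside the planners, an invalid proposal from one planner is simply dropped, and the fallback triggers only when every proposal fails.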

Chinese Translation

安全且可解释的运动规划仍然是自动驾驶中的一个核心挑战。虽然基于规则的规划器提供可预测和可解释的行为，但它们往往无法掌握现实交通的复杂性和不确定性。相反，学习型规划器表现出强大的适应性，但透明度降低且偶尔会出现安全违规。我们提出了Mosaic，一个可扩展的结构化决策框架，通过仲裁图集成这两种范式。通过将轨迹验证和评分与各个规划器生成轨迹的过程解耦，每个决策都变得透明且可追溯。在更高层次的轨迹验证引入了规划器之间的冗余，将紧急制动限制在所有规划器未能生成有效轨迹的罕见情况下。通过统一评分和最佳轨迹选择，可以结合具有互补优势和劣势的基于规则和学习的规划器，从而实现两者的最佳结合。在nuPlan的实验评估中，Mosaic在Val14闭环基准测试中达到了95.48 CLS-NR和93.98 CLS-R，创造了新的技术领先水平，同时与单独使用任何一个规划器相比，减少了30%的责任碰撞。在专注于高度互动和困难场景的interPlan基准测试中，Mosaic的CLS-R得分为54.30，比其最佳组成规划器高出23.3%——这一切都无需重新训练或额外数据。代码可在github.com/KIT-MRT/mosaic获取。


cs.RO / 24 / 2604.13891

Beyond Conservative Automated Driving in Multi-Agent Scenarios via Coupled Model Predictive Control and Deep Reinforcement Learning

通过耦合模型预测控制与深度强化学习超越保守的多智能体场景自动驾驶

Rahmani, Saeed, Körpe, Gözde, Zhenlin, Xu, Brito, Bruno, Calvert, Simeon Craig, van Arem, Bart

Abstract

Automated driving at unsignalized intersections is challenging due to complex multi-vehicle interactions and the need to balance safety and efficiency. Model Predictive Control (MPC) offers structured constraint handling through optimization but relies on hand-crafted rules that often produce overly conservative behavior. Deep Reinforcement Learning (RL) learns adaptive behaviors from experience but often struggles with safety assurance and generalization to unseen environments. In this study, we present an integrated MPC-RL framework to improve navigation performance in multi-agent scenarios. Experiments show that MPC-RL outperforms standalone MPC and end-to-end RL across three traffic-density levels. Collectively, MPC-RL reduces the collision rate by 21% and improves the success rate by 6.5% compared to pure MPC. We further evaluate zero-shot transfer to a highway merging scenario without retraining. Both MPC-based methods transfer substantially better than end-to-end PPO, which highlights the role of the MPC backbone in cross-scenario robustness. The framework also shows faster loss stabilization than end-to-end RL during training, which indicates a reduced learning burden. These results suggest that the integrated approach can improve the balance between safety performance and efficiency in multi-agent intersection scenarios, while the MPC component provides a strong foundation for generalization across driving environments. The implementation code is available open-source.

Chinese Translation

在无信号交叉口进行自动驾驶面临着复杂的多车辆交互以及安全与效率之间的平衡挑战。模型预测控制（MPC）通过优化提供了结构化的约束处理，但依赖于手工设计的规则，往往导致过于保守的行为。深度强化学习（RL）从经验中学习自适应行为，但在安全保障和向未见环境的泛化方面常常面临困难。在本研究中，我们提出了一种集成的MPC-RL框架，以改善多智能体场景中的导航性能。实验表明，MPC-RL在三个交通密度水平上均优于独立的MPC和端到端的RL。总体而言，MPC-RL将碰撞率降低了21%，成功率提高了6.5%，与纯MPC相比。我们进一步评估了在不重新训练的情况下向高速公路合并场景的零样本迁移。两种基于MPC的方法的迁移效果明显优于端到端的PPO，这突显了MPC骨干在跨场景鲁棒性中的作用。该框架在训练过程中也显示出比端到端RL更快的损失稳定性，这表明学习负担有所减轻。这些结果表明，集成的方法可以改善多智能体交叉口场景中安全性能与效率之间的平衡，而MPC组件为在不同驾驶环境中的泛化提供了坚实的基础。实现代码已开放源代码。


cs.RO / 25 / 2604.13942

Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection

Goal2Skill：具有自适应规划和反思的长时间操控

Liu, Zhen, Ning, Xinyu, Hu, Zhe, Xie, Xinxin, Li, Weize, Tang, Zhipeng, Wang, Chongyu, Yang, Zejun, Wang, Hanlin, Liu, Yitong, Pu, Zhongzhu

Abstract

Recent vision-language-action (VLA) systems have demonstrated strong capabilities in embodied manipulation. However, most existing VLA policies rely on limited observation windows and end-to-end action prediction, which makes them brittle in long-horizon, memory-dependent tasks with partial observability, occlusions, and multi-stage dependencies. Such tasks require not only precise visuomotor control, but also persistent memory, adaptive task decomposition, and explicit recovery from execution failures. To address these limitations, we propose a dual-system framework for long-horizon embodied manipulation. Our framework explicitly separates high-level semantic reasoning from low-level motor execution. A high-level planner, implemented as a VLM-based agentic module, maintains structured task memory and performs goal decomposition, outcome verification, and error-driven correction. A low-level executor, instantiated as a VLA-based visuomotor controller, carries out each sub-task through diffusion-based action generation conditioned on geometry-preserving filtered observations. Together, the two systems form a closed loop between planning and execution, enabling memory-aware reasoning, adaptive replanning, and robust online recovery. Experiments on representative RMBench tasks show that the proposed framework substantially outperforms representative baselines, achieving a 32.4% average success rate compared with 9.8% for the strongest baseline. Ablation studies further confirm the importance of structured memory and closed-loop recovery for long-horizon manipulation.

Chinese Translation

最近的视觉-语言-动作（VLA）系统在具身操控方面展现出了强大的能力。然而，大多数现有的VLA策略依赖于有限的观察窗口和端到端的动作预测，这使得它们在长时间、依赖记忆的任务中表现脆弱，这些任务具有部分可观察性、遮挡和多阶段依赖性。这类任务不仅需要精确的视觉运动控制，还需要持久的记忆、自适应的任务分解以及从执行失败中明确恢复的能力。为了解决这些局限性，我们提出了一种用于长时间具身操控的双系统框架。我们的框架明确将高层语义推理与低层运动执行分开。一个高层规划器，作为基于VLM的智能模块实现，维护结构化任务记忆并执行目标分解、结果验证和基于错误的修正。一个低层执行器，作为基于VLA的视觉运动控制器实例化，通过基于扩散的动作生成在几何保持的过滤观察条件下执行每个子任务。两个系统共同形成了规划与执行之间的闭环，支持记忆感知推理、自适应重新规划和稳健的在线恢复。在具有代表性的RMBench任务上的实验表明，所提出的框架在性能上显著优于代表性基线，平均成功率达到32.4%，而最强基线仅为9.8%。消融研究进一步确认了结构化记忆和闭环恢复在长时间操控中的重要性。


cs.RO / 26 / 2604.14013

Towards Multi-Object-Tracking with Radar on a Fast Moving Vehicle: On the Potential of Processing Radar in the Frequency Domain

基于雷达的快速移动车辆多目标跟踪研究：频域处理雷达数据的潜力

Hansen, Tim, Gomez-Chavez, Arturo, Shimchik, Ilya, Birk, Andreas

Abstract

In this paper, we advocate processing radar data in the frequency domain to achieve higher robustness against noise and structural errors, especially in comparison to feature-based methods. This also holds for highly dynamic scenes, i.e., ego-motion of the vehicle carrying the sensor plus the presence of an unknown number of other moving objects. In addition to the high robustness, processing in the frequency domain has the so-far neglected advantage that the underlying correlation-based methods used for, e.g., registration provide information about all moving structures in the scene. A typical automotive application is overtaking maneuvers, which are used here as a motivating example in the context of autonomous racing. To support our arguments, we present initial experiments and results with Fourier SOFT in 2D (FS2D) that use the Boreas dataset to demonstrate radar-only odometry, i.e., radar odometry without sensor fusion.

Chinese Translation

本文提倡在频域中处理雷达数据，以实现对噪声和结构错误的更高鲁棒性，特别是与基于特征的方法相比。这种方法同样适用于场景中的高动态情况，即传感器所在车辆的自我运动以及未知数量的其他移动物体的存在。除了高鲁棒性外，频域处理还有一个迄今被忽视的优势，即用于配准等任务的基于相关性的方法能够提供场景中所有移动结构的信息。一个典型的汽车应用案例是超车机动，本文在自动驾驶赛车的背景下将其作为一个启发性示例。为了支持我们的论点，我们展示了使用Boreas数据集的二维傅里叶SOFT（FS2D）的初步实验和结果，以演示纯雷达里程计，即不进行传感器融合的雷达里程计。


cs.RO / 27 / 2604.14021

Neuromorphic Spiking Ring Attractor for Proprioceptive Joint-State Estimation

用于本体感知关节状态估计的神经形态脉冲环吸引子

Ferrari, Federica, Davidhi, Flavia, Maacaron, Bernard, Motta, Alberto, van Keeken, Luuk, Donati, Elisa, Indiveri, Giacomo, De Luca, Chiara, Bartolozzi, Chiara

Abstract

Maintaining stable internal representations of continuous variables is fundamental for effective robotic control. Continuous attractor networks provide a biologically inspired mechanism for encoding such variables, yet neuromorphic realizations have rarely addressed proprioceptive estimation under resource constraints. This work introduces a spiking ring-attractor network representing a robot joint angle through self-sustaining population activity. Local excitation and broad inhibition support a stable activity bump, while velocity-modulated asymmetries drive its translation and boundary conditions confine motion within mechanical limits. The network reproduces smooth trajectory tracking and remains stable near joint limits, showing reduced drift and improved accuracy compared to unbounded models. This compact, hardware-compatible implementation preserves multi-second stability and demonstrates a near-linear relationship between bump velocity and synaptic modulation.

Chinese Translation

维持连续变量的稳定内部表征对于有效的机器人控制至关重要。连续吸引子网络提供了一种生物启发的机制来编码这些变量，但神经形态实现很少在资源限制下处理本体感知估计。本研究介绍了一种脉冲环吸引子网络，通过自我维持的神经元群体活动来表示机器人关节角度。局部兴奋和广泛抑制支持稳定的活动峰，而速度调制的不对称性驱动其平移，边界条件则将运动限制在机械范围内。该网络再现了平滑的轨迹跟踪，并在关节极限附近保持稳定，与无界模型相比，漂移减少且精度提高。这种紧凑的硬件兼容实现保持了多秒的稳定性，并展示了活动峰速度与突触调制之间的近线性关系。


cs.RO / 28 / 2604.14026

Scale-Invariant Sampling in Multi-Arm Bandit Motion Planning for Object Extraction

多臂强盗运动规划中的尺度不变采样用于物体提取

Bayraktar, Servet B., Orthey, Andreas, Toussaint, Marc

Abstract

Object extraction tasks often occur in disassembly problems, where bolts, screws, or pins have to be removed from tight, narrow spaces. In such problems, the distance to the environment is often on the millimeter scale. Sampling-based planners can solve such problems and provide completeness guarantees. However, sampling becomes a bottleneck, since almost all motions will result in collisions with the environment. To overcome this problem, we propose a novel scale-invariant sampling strategy which explores the configuration space using a grow-shrink search to find useful, high-entropy sampling scales. Once a useful sampling scale has been found, our framework exploits this scale by using a principal components analysis (PCA) to find useful directions for object extraction. We embed this sampler into a multi-arm bandit rapidly-exploring random tree (MAB-RRT) planner and test it on eight challenging 3D object extraction scenarios, involving bolts, gears, rods, pins, and sockets. To evaluate our framework, we compare it with classical sampling strategies like uniform sampling, obstacle-based sampling, and narrow-passage sampling, and with modern strategies like mate vectors, physics-based planning, and disassembly breadth first search. Our experiments show that scale-invariant sampling improves success rate by one order of magnitude on 7 out of 8 scenarios. This demonstrates that scale-invariant sampling is an important concept for general purpose object extraction in disassembly tasks.
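The PCA step described above can be pictured as finding the dominant direction of a set of collision-free samples collected at the chosen scale. The sketch below computes only the first principal component by power iteration on the sample covariance. It is a simplified illustration, not the authors' sampler, and the name `principal_direction` and the fixed start vector are assumptions.

```python
import math

def principal_direction(points, iterations=100):
    """First principal component of `points` via power iteration.

    Assumption: the fixed start vector is not orthogonal to the dominant
    eigenvector of the covariance matrix.
    """
    n, dim = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(dim)]
    centered = [[p[k] - mean[k] for k in range(dim)] for p in points]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
           for i in range(dim)]
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all samples coincide; no preferred direction
        v = [x / norm for x in w]
    return v
```

For a bolt in a hole, valid samples would spread mostly along the hole axis, so the returned direction would roughly align with the extraction axis. A planner could then bias new samples along it.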

Chinese Translation

物体提取任务通常发生在拆解问题中，其中螺栓、螺丝或销钉必须从狭窄的空间中移除。在此类问题中，环境的距离通常在毫米级别。基于采样的规划器能够解决此类问题并提供完备性保证。然而，采样成为瓶颈，因为几乎所有的运动都会导致与环境的碰撞。为了解决这个问题，我们提出了一种新颖的尺度不变采样策略，该策略通过生长-收缩搜索探索配置空间，以寻找有用的高熵采样尺度。一旦找到有用的采样尺度，我们的框架通过主成分分析（PCA）利用该尺度来寻找物体提取的有用方向。我们将该采样器嵌入到多臂强盗快速探索随机树（MAB-RRT）规划器中，并在涉及螺栓、齿轮、杆、销钉和插座的八个具有挑战性的三维物体提取场景中进行测试。为了评估我们的框架，我们将其与经典的采样策略（如均匀采样、基于障碍物的采样和狭窄通道采样）以及现代策略（如配对向量、基于物理的规划和拆解广度优先搜索）进行比较。我们的实验表明，尺度不变采样在8个场景中的7个场景上提高了成功率一个数量级。这表明尺度不变采样是拆解任务中通用物体提取的重要概念。


cs.RO / 29 / 2604.14089

UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

UMI-3D：将通用操作接口从视觉限制扩展到三维空间感知

Wang, Ziming

Abstract

We present UMI-3D, a multimodal extension of the Universal Manipulation Interface (UMI) for robust and scalable data collection in embodied manipulation. While UMI enables portable, wrist-mounted data acquisition, its reliance on monocular visual SLAM makes it vulnerable to occlusions, dynamic scenes, and tracking failures, limiting its applicability in real-world environments. UMI-3D addresses these limitations by introducing a lightweight and low-cost LiDAR sensor tightly integrated into the wrist-mounted interface, enabling LiDAR-centric SLAM with accurate metric-scale pose estimation under challenging conditions. We further develop a hardware-synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework that aligns visual observations with LiDAR point clouds, producing consistent 3D representations of demonstrations. Despite maintaining the original 2D visuomotor policy formulation, UMI-3D significantly improves the quality and reliability of collected data, which directly translates into enhanced policy performance. Extensive real-world experiments demonstrate that UMI-3D not only achieves high success rates on standard manipulation tasks, but also enables learning of tasks that are challenging or infeasible for the original vision-only UMI setup, including large deformable object manipulation and articulated object operation. The system supports an end-to-end pipeline for data acquisition, alignment, training, and deployment, while preserving the portability and accessibility of the original UMI. All hardware and software components are open-sourced to facilitate large-scale data collection and accelerate research in embodied intelligence: https://umi-3d.github.io.

Chinese Translation

我们提出了UMI-3D，这是通用操作接口（Universal Manipulation Interface, UMI）的多模态扩展，旨在实现稳健且可扩展的具身操作数据采集。虽然UMI支持便携式、腕部安装的数据采集，但其依赖单目视觉SLAM使其易受遮挡、动态场景和跟踪失败的影响，从而限制了其在现实环境中的适用性。UMI-3D通过引入一种轻量且低成本的激光雷达（LiDAR）传感器，与腕部安装接口紧密集成，解决了这些限制，实现了在挑战性条件下的以LiDAR为中心的SLAM和准确的度量尺度姿态估计。我们进一步开发了一个硬件同步的多模态传感管道和一个统一的时空校准框架，将视觉观测与LiDAR点云对齐，从而生成一致的演示三维表示。尽管保持了原始的二维视觉运动策略公式，UMI-3D显著提高了采集数据的质量和可靠性，这直接转化为增强的策略性能。大量现实世界实验表明，UMI-3D不仅在标准操作任务上实现了高成功率，还能够学习原始仅依赖视觉的UMI设置下难以或不可行的任务，包括大变形物体操作和关节物体操作。该系统支持数据采集、对齐、训练和部署的端到端管道，同时保留了原始UMI的便携性和可及性。所有硬件和软件组件均已开源，以促进大规模数据采集并加速具身智能的研究：https://umi-3d.github.io。


计算机视觉 (Computer Vision)

103

cs.CV / 1 / 2604.13112

A Lightweight Multi-Metric No-Reference Image Quality Assessment Framework for UAV Imaging

一种轻量级多指标无参考图像质量评估框架用于无人机成像

Aglin, Koffi Titus Sergio, Muchiri, Anthony K., Nkundineza, Celestin

Abstract

Reliable image quality assessment is essential in applications where large volumes of images are acquired automatically and must be filtered before further analysis. In many practical scenarios, a pristine reference image is unavailable, making no-reference image quality assessment (NR-IQA) particularly important. This paper introduces Multi-Metric Image Quality Assessment (MM-IQA), a lightweight multi-metric framework for NR-IQA. It combines interpretable cues related to blur, edge structure, low-resolution artifacts, exposure imbalance, noise, haze, and frequency content to produce a single quality score in the range [0, 100]. MM-IQA was evaluated on five benchmark datasets (KonIQ-10k, LIVE Challenge, KADID-10k, TID2013, and BIQ2021) and achieved SRCC values ranging from 0.647 to 0.830. Additional experiments on a synthetic agricultural dataset showed consistent behavior of the designed cues. The Python/OpenCV implementation required about 1.97 s per image. This method also has modest memory requirements because it stores only a limited number of intermediate grayscale, filtered, and frequency-domain representations, resulting in memory usage that scales linearly with image size. The results show that MM-IQA can be used for fast image quality screening with explicit distortion-aware cues and modest computational cost.

Chinese Translation

可靠的图像质量评估在自动获取大量图像并需要在进一步分析之前进行筛选的应用中至关重要。在许多实际场景中，完美的参考图像不可用，这使得无参考图像质量评估（NR-IQA）显得尤为重要。本文介绍了一种轻量级多指标框架——多指标图像质量评估（MM-IQA），用于NR-IQA。该框架结合了与模糊、边缘结构、低分辨率伪影、曝光不平衡、噪声、雾霾和频率内容相关的可解释线索，以生成一个范围为[0,100]的单一质量评分。MM-IQA在五个基准数据集（KonIQ-10k、LIVE Challenge、KADID-10k、TID2013和BIQ2021）上进行了评估，获得了从0.647到0.830的SRCC值。在一个合成农业数据集上的额外实验显示了所设计线索的一致性表现。Python/OpenCV实现每张图像的处理时间约为1.97秒。该方法的内存需求也较为适中，因为它仅存储有限数量的中间灰度、过滤和频域表示，导致内存使用与图像大小呈线性关系。结果表明，MM-IQA可以用于快速图像质量筛选，具有明确的失真感知线索和适中的计算成本。


cs.CV / 2 / 2604.13127

Graph Propagated Projection Unlearning: A Unified Framework for Vision and Audio Discriminative Models

图传播投影遗忘：视觉和音频区分模型的统一框架

Pathak, Shreyansh, Das, Jyotishman

Abstract

The need to selectively and efficiently erase learned information from deep neural networks is becoming increasingly important for privacy, regulatory compliance, and adaptive system design. We introduce Graph-Propagated Projection Unlearning (GPPU), a unified and scalable algorithm for class-level unlearning that operates across both vision and audio models. GPPU employs graph-based propagation to identify class-specific directions in the feature space and projects representations onto the orthogonal subspace, followed by targeted fine-tuning, to ensure that target class information is effectively and irreversibly removed. Through comprehensive evaluations on six vision datasets and two large-scale audio benchmarks spanning a variety of architectures including CNNs, Vision Transformers, and Audio Transformers, we demonstrate that GPPU achieves highly efficient unlearning, realizing 10-20x speedups over prior methodologies while preserving model utility on retained classes. Our framework provides a principled and modality-agnostic approach to machine unlearning, evaluated at a scale that has received limited attention in prior work, contributing toward more efficient and responsible deep learning.
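The projection step described above amounts to removing the component of a feature vector that lies along a class-specific direction. The sketch below shows that operation for a single, already-known direction. How GPPU finds such directions through graph propagation is not reproduced here, and `project_out` is an illustrative name.

```python
import math

def project_out(feature, direction):
    """Project `feature` onto the subspace orthogonal to `direction`.

    Computes h - (h . u) u, where u is `direction` normalized to unit length.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coeff = sum(f * u for f, u in zip(feature, unit))
    return [f - coeff * u for f, u in zip(feature, unit)]
```

After this projection the feature has (numerically) zero inner product with the removed direction, so a linear classifier head can no longer read the target class from that axis. The abstract adds targeted fine-tuning on top of this step.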

Chinese Translation

从深度神经网络中选择性和高效地删除已学习信息的需求在隐私保护、合规性和自适应系统设计中变得越来越重要。我们提出了图传播投影遗忘（Graph-Propagated Projection Unlearning, GPPU），这是一种统一且可扩展的类级遗忘算法，适用于视觉和音频模型。GPPU采用基于图的传播方法来识别特征空间中的类特定方向，并将表示投影到正交子空间，随后进行针对性的微调，以确保目标类信息被有效且不可逆地删除。通过对六个视觉数据集和两个大型音频基准的全面评估，涵盖包括卷积神经网络（CNNs）、视觉变换器（Vision Transformers）和音频变换器（Audio Transformers）在内的多种架构，我们证明了GPPU实现了高效的遗忘，相较于之前的方法实现了10-20倍的速度提升，同时保持了对保留类的模型效用。我们的框架提供了一种原则性和模态无关的机器遗忘方法，在一个以往研究中关注较少的规模上进行了评估，为更高效和负责任的深度学习做出了贡献。


cs.CV / 3 / 2604.13153

PatchPoison: Poisoning Multi-View Datasets to Degrade 3D Reconstruction

PatchPoison：通过污染多视图数据集来降低三维重建质量

Wadekar, Prajas, Bachina, Venkata Sai Pranav, Bhosikar, Kunal, Gangwal, Ankit, Sharma, Charu

Abstract

3D Gaussian Splatting (3DGS) has recently enabled highly photorealistic 3D reconstruction from casually captured multi-view images. However, this accessibility raises a privacy concern: publicly available images or videos can be exploited to reconstruct detailed 3D models of scenes or objects without the owner's consent. We present PatchPoison, a lightweight dataset-poisoning method that prevents unauthorized 3D reconstruction. Unlike global perturbations, PatchPoison injects a small high-frequency adversarial patch, a structured checkerboard, into the periphery of each image in a multi-view dataset. The patch is designed to corrupt the feature-matching stage of Structure-from-Motion (SfM) pipelines such as COLMAP by introducing spurious correspondences that systematically misalign estimated camera poses. Consequently, downstream 3DGS optimization diverges from the correct scene geometry. On the NeRF-Synthetic benchmark, inserting a 12 × 12 pixel patch increases reconstruction error by 6.8x in LPIPS, while the poisoned images remain unobtrusive to human viewers. PatchPoison requires no pipeline modifications, offering a practical, "drop-in" preprocessing step for content creators to protect their multi-view data.

Chinese Translation

三维高斯点云（3D Gaussian Splatting, 3DGS）最近使得从随意捕获的多视图图像中实现高度逼真的三维重建成为可能。然而，这种便利性引发了隐私问题：公开可用的图像或视频可能被利用来在未获得所有者同意的情况下重建场景或物体的详细三维模型。我们提出了PatchPoison，这是一种轻量级的数据集污染方法，旨在防止未经授权的三维重建。与全局扰动不同，PatchPoison在多视图数据集中每幅图像的边缘注入一个小的高频对抗性补丁，即结构化的棋盘格。该补丁旨在通过引入虚假的对应关系来破坏运动结构（Structure-from-Motion, SfM）管道（如COLMAP）的特征匹配阶段，从而系统性地使估计的相机姿态失配。因此，下游的3DGS优化将偏离正确的场景几何。在NeRF-Synthetic基准测试中，插入一个12 x 12像素的补丁使重建误差在LPIPS中增加了6.8倍，而被污染的图像对人类观众仍然不显眼。PatchPoison无需对管道进行修改，为内容创作者提供了一种实用的“即插即用”预处理步骤，以保护他们的多视图数据。
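
As a rough illustration of the patch injection described above, the sketch below writes a 12 × 12 checkerboard into one corner of a grayscale image stored as a list of rows. The corner placement, the one-pixel cell size, the black/white values, and the function name `add_checkerboard_patch` are all assumptions made for the example; the abstract only states that a structured checkerboard is placed in the image periphery.

```python
def add_checkerboard_patch(image, patch_size=12, cell=1, low=0, high=255):
    """Return a copy of a grayscale image (list of rows) with a
    checkerboard patch written into the top-left corner."""
    height, width = len(image), len(image[0])
    if patch_size > height or patch_size > width:
        raise ValueError("patch does not fit inside the image")
    poisoned = [row[:] for row in image]  # leave the input untouched
    for y in range(patch_size):
        for x in range(patch_size):
            # Alternate high/low values between neighbouring cells.
            is_high = ((y // cell) + (x // cell)) % 2 == 0
            poisoned[y][x] = high if is_high else low
    return poisoned

# Uniform mid-gray 64 x 64 test image (assumed size).
clean = [[128] * 64 for _ in range(64)]
poisoned = add_checkerboard_patch(clean)

# Count the pixels that differ from the clean image.
changed = sum(
    1 for y in range(64) for x in range(64) if poisoned[y][x] != clean[y][x]
)
```

Only the pixels inside the patch change, which is consistent with the claim that the modification stays local rather than perturbing the whole image.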

cs.CV / 4 / 2604.13171

3DRealHead: Few-Shot Detailed Head Avatar

3DRealHead：少样本详细头部虚拟形象

Nehvi, Jalees, Bolkart, Timo, Beeler, Thabo, Thies, Justus

Abstract

The human face is central to communication. For immersive applications, the digital presence of a person should mirror the physical reality, capturing the user's idiosyncrasies and detailed facial expressions. However, current 3D head avatar methods often struggle to faithfully reproduce identity and facial expressions, despite having multi-view data or learned priors. Learning priors that capture the diversity of human appearance is challenging because the underlying training data is limited, especially for regions with highly person-specific features such as the mouth and teeth. In addition, many avatar methods rely purely on 3D morphable model (3DMM)-based expression control, which strongly limits expressivity. To address these challenges, we introduce 3DRealHead, a few-shot head avatar reconstruction method with a novel expression control signal that is extracted from a monocular video stream of the subject. Specifically, the subject can take a few pictures of themselves, recover a 3D head avatar, and drive it with a consumer-level webcam. The avatar reconstruction is enabled by a novel few-shot inversion process of a 3D human head prior, which is represented as a Style U-Net that emits 3D Gaussian primitives that can be rendered under novel views. The prior is learned on the NeRSemble dataset. For animating the avatar, the U-Net is conditioned on 3DMM-based facial expression signals, as well as features of the mouth region extracted from the driving video. These additional mouth features allow us to recover facial expressions that cannot be represented by the 3DMM, leading to higher expressivity and closer resemblance to the physical reality.

Chinese Translation

人脸在交流中占据中心地位。在沉浸式应用中，数字化的个人形象应当真实反映物理现实，捕捉用户的个性特征和细致的面部表情。然而，当前的3D头部虚拟形象方法往往难以忠实再现身份和面部表情，尽管拥有多视角数据或学习的先验知识。学习能够捕捉人类外观多样性的先验知识尤其具有挑战性，特别是在具有高度个性化特征的区域，如口部和牙齿区域，因为基础训练数据有限。此外，许多虚拟形象方法完全依赖于基于3D可变形模型的表情控制，这大大限制了表现力。为了解决这些挑战，我们提出了3DRealHead，一种少样本头部虚拟形象重建方法，采用从被试的单目视频流中提取的新颖表情控制信号。具体而言，被试可以拍摄几张自己的照片，恢复一个3D头部虚拟形象，并通过消费级网络摄像头进行驱动。虚拟形象的重建是通过一种新颖的少样本反演过程实现的，该过程基于一个3D人头先验，该先验被表示为一个Style U-Net，能够发出可以在新视角下渲染的3D高斯原语。该先验是在NeRSemble数据集上学习的。为了为虚拟形象动画，U-Net基于3DMM（3D可变形模型）表情信号以及从驱动视频中提取的口部区域特征进行条件处理。这些额外的口部特征使我们能够恢复3DMM无法表示的面部表情，从而提高表现力，更加接近物理现实。

cs.CV / 5 / 2604.13183

GeoLink: A 3D-Aware Framework Towards Better Generalization in Cross-View Geo-Localization

GeoLink：一个面向更好泛化的跨视角地理定位的3D感知框架

Zhang, Hongyang, Liu, Yinhao, Zhang, Haitao, Wen, Zhongyi, Liang, Shuxian, Hua, Xiansheng

Abstract

Generalizable cross-view geo-localization aims to match the same location across views in unseen regions and conditions without GPS supervision. Its core difficulty lies in severe semantic inconsistency caused by viewpoint variation and poor generalization under domain shift. Existing methods mainly rely on 2D correspondence, but they are easily distracted by redundant shared information across views, leading to less transferable representations. To address this, we propose GeoLink, a 3D-aware semantic-consistent framework for generalizable cross-view geo-localization. Specifically, we reconstruct scene point clouds offline from multi-view drone images using VGGT, providing stable structural priors. Based on these 3D anchors, we improve 2D representation learning in two complementary ways. A Geometric-aware Semantic Refinement module mitigates potentially redundant and view-biased dependencies in 2D features under 3D guidance. In addition, a Unified View Relation Distillation module transfers 3D structural relations to 2D features, improving cross-view alignment while preserving a 2D-only inference pipeline. Extensive experiments on multiple benchmarks show that GeoLink consistently outperforms state-of-the-art methods and achieves superior generalization across unseen domains and diverse weather environments.

Chinese Translation

可泛化的跨视角地理定位旨在在未见区域和条件下匹配相同位置，而无需GPS监督。其核心难点在于由于视角变化导致的严重语义不一致性以及在领域转移下的较差泛化能力。现有方法主要依赖于2D对应关系，但它们容易受到跨视角冗余共享信息的干扰，从而导致可迁移性较差的表示。为了解决这一问题，我们提出了GeoLink，一个3D感知的语义一致性框架，旨在实现可泛化的跨视角地理定位。具体而言，我们使用VGGT从多视角无人机图像中离线重建场景点云，提供稳定的结构先验。基于这些3D锚点，我们通过两种互补方式改善2D表示学习。几何感知语义精炼模块在3D指导下减轻了2D特征中潜在的冗余和视角偏置依赖。此外，统一视图关系蒸馏模块将3D结构关系转移到2D特征中，改善跨视角对齐，同时保留仅基于2D的推理流程。在多个基准上的广泛实验表明，GeoLink在性能上始终优于最先进的方法，并在未见领域和多样化天气环境中实现了卓越的泛化能力。

cs.CV / 6 / 2604.13186

Towards Patient-Specific Deformable Registration in Laparoscopic Surgery

面向患者特异性可变形配准的腹腔镜手术研究

Neri, Alberto, Penza, Veronica, Haouchine, Nazim, Mattos, Leonardo S.

Abstract

Unsafe surgical care is a critical health concern, often linked to limitations in surgeon experience, skills, and situational awareness. Integrating patient-specific 3D models into the surgical field can enhance visualization, provide real-time anatomical guidance, and reduce intraoperative complications. However, reliably registering these models in general surgery remains challenging due to mismatches between preoperative and intraoperative organ surfaces, such as deformations and noise. To overcome these challenges, we introduce the first patient-specific non-rigid point cloud registration method, which leverages a novel data generation strategy to optimize outcomes for individual patients. Our approach combines a Transformer encoder-decoder architecture with overlap estimation and a dedicated matching module to predict dense correspondences, followed by a physics-based algorithm for registration. Experimental results on both synthetic and real data demonstrate that our patient-specific method significantly outperforms traditional agnostic approaches, achieving 45% Matching Score with 92% Inlier Ratio on synthetic data, highlighting its potential to improve surgical care.

Chinese Translation

不安全的外科护理是一个重要的健康问题，通常与外科医生的经验、技能和情境意识的局限性相关。将患者特异性的三维模型整合到手术领域可以增强可视化效果，提供实时解剖指导，并减少术中并发症。然而，由于术前和术中器官表面之间的变形和噪声等不匹配，可靠地在普通外科中注册这些模型仍然具有挑战性。为了解决这些挑战，我们提出了首个患者特异性的非刚性点云配准方法，该方法利用一种新颖的数据生成策略来优化个体患者的结果。我们的方法结合了Transformer编码器-解码器架构、重叠估计和专用匹配模块，以预测密集对应关系，随后采用基于物理的算法进行配准。在合成数据和真实数据上的实验结果表明，我们的患者特异性方法显著优于传统的无关方法，在合成数据上实现了45%的匹配分数和92%的内点比率，突显了其改善外科护理的潜力。

cs.CV / 7 / 2604.13217

Multitasking Embedding for Embryo Blastocyst Grading Prediction (MEmEBG)

用于胚胎囊胚分级预测的多任务嵌入方法 (MEmEBG)

Angabini, Nahid Khoshk, Tajgardan, Mohsen, Madhavan, Mahesh, Varzaneh, Zahra Asghari, Khoshkangini, Reza, Ebner, Thomas

Abstract

Reliable evaluation of blastocyst quality is critical for the success of in vitro fertilization (IVF) treatments. Current embryo grading practices primarily rely on visual assessment of morphological features, which introduces subjectivity, inter-embryologist variability, and challenges in standardizing quality assurance. In this study, we propose a multitask embedding-based approach for the automated analysis and prediction of key blastocyst components, including the trophectoderm (TE), inner cell mass (ICM), and blastocyst expansion (EXP). The method leverages biological and physical characteristics extracted from images of day-5 human embryos. A pretrained ResNet-18 architecture, enhanced with an embedding layer, is employed to learn discriminative representations from a limited dataset and to automatically identify TE and ICM regions along with their corresponding grades, structures that are visually similar and inherently difficult to distinguish. Experimental results demonstrate the promise of the multitask embedding approach and potential for robust and consistent blastocyst quality assessment.

Chinese Translation

可靠的囊胚质量评估对于体外受精 (IVF) 治疗的成功至关重要。目前的胚胎分级实践主要依赖于对形态特征的视觉评估，这引入了主观性、胚胎学家之间的变异性以及在标准化质量保证方面的挑战。在本研究中，我们提出了一种基于多任务嵌入的方法，用于自动分析和预测关键囊胚成分，包括滋养层 (TE)、内细胞团 (ICM) 和囊胚扩张 (EXP)。该方法利用从第五天人类胚胎图像中提取的生物和物理特征。我们采用了预训练的 ResNet-18 架构，并增强了嵌入层，以从有限的数据集中学习区分性表示，并自动识别 TE 和 ICM 区域及其相应的等级，这些结构在视觉上相似且本质上难以区分。实验结果表明，多任务嵌入方法具有良好的前景，并有潜力实现稳健且一致的囊胚质量评估。

cs.CV / 8 / 2604.13235

Neural 3D Reconstruction of Planetary Surfaces from Descent-Phase Wide-Angle Imagery

基于下降阶段广角图像的行星表面神经三维重建

de Almeida, Melonie, Brydon, George, Persaud, Divya M., Williamson, John H., Henderson, Paul

Abstract

Digital elevation modeling of planetary surfaces is essential for studying past and ongoing geological processes. Wide-angle imagery acquired during spacecraft descent promises to offer a low-cost option for high-resolution terrain reconstruction. However, accurate 3D reconstruction from such imagery is challenging due to strong radial distortion and limited parallax from vertically descending, predominantly nadir-facing cameras. Conventional multi-view stereo exhibits limited depth range and reduced fidelity under these conditions and also lacks domain-specific priors. We present the first study of modern neural reconstruction methods for planetary descent imaging. We also develop a novel approach that incorporates an explicit neural height field representation, which provides a strong prior since planetary surfaces are generally continuous, smooth, solid, and free from floating objects. This study demonstrates that neural approaches offer a strong and competitive alternative to traditional multi-view stereo (MVS) methods. Experiments on simulated descent sequences over high-fidelity lunar and Mars terrains demonstrate that the proposed approach achieves increased spatial coverage while maintaining satisfactory estimation accuracy.

Chinese Translation

行星表面的数字高程建模对于研究过去和正在进行的地质过程至关重要。在航天器下降过程中获取的广角图像有望为高分辨率地形重建提供一种低成本的选择。然而，由于强烈的径向畸变和来自垂直下降、主要面向天底的相机的有限视差，从这些图像中进行准确的三维重建具有挑战性。在这些条件下，传统的多视角立体视觉（multi-view stereo, MVS）方法表现出有限的深度范围和降低的保真度，并且缺乏领域特定的先验知识。我们首次研究了现代神经重建方法在行星下降成像中的应用。我们还开发了一种新颖的方法，结合了显式的神经高度场表示，这提供了强有力的先验，因为行星表面通常是连续、平滑、坚固的，并且没有漂浮物体。本研究表明，神经方法为传统的多视角立体视觉方法提供了一种强大且具有竞争力的替代方案。在高保真度的月球和火星地形上进行的模拟下降序列实验表明，所提出的方法在保持令人满意的估计精度的同时，实现了更大的空间覆盖范围。

cs.CV / 9 / 2604.13236

SemiFA: An Agentic Multi-Modal Framework for Autonomous Semiconductor Failure Analysis Report Generation

SemiFA：一种自主半导体故障分析报告生成的代理多模态框架

Kaushik, Shivam Chand

Abstract

Semiconductor failure analysis (FA) requires engineers to examine inspection images, correlate equipment telemetry, consult historical defect records, and write structured reports, a process that can consume several hours of expert time per case. We present SemiFA, an agentic multi-modal framework that autonomously generates structured FA reports from semiconductor inspection images in under one minute. SemiFA decomposes FA into a four-agent LangGraph pipeline: a DefectDescriber that classifies and narrates defect morphology using DINOv2 and LLaVA-1.6, a RootCauseAnalyzer that fuses SECS/GEM equipment telemetry with historically similar defects retrieved from a Qdrant vector database, a SeverityClassifier that assigns severity and estimates yield impact, and a RecipeAdvisor that proposes corrective process adjustments. A fifth node assembles a PDF report. We introduce SemiFA-930, a dataset of 930 annotated semiconductor defect images paired with structured FA narratives across nine defect classes, drawn from procedural synthesis, WM-811K, and MixedWM38. Our DINOv2-based classifier achieves 92.1% accuracy on 140 validation images (macro F1 = 0.917), and the full pipeline produces complete FA reports in 48 seconds on an NVIDIA A100-SXM4-40 GB GPU. A GPT-4o judge ablation across four modality conditions demonstrates that multi-modal fusion improves root cause reasoning by +0.86 composite points (1-5 scale) over an image-only baseline, with equipment telemetry as the more load-bearing modality. To our knowledge, SemiFA is the first system to integrate SECS/GEM equipment telemetry into a vision-language model pipeline for autonomous FA report generation.

Chinese Translation

半导体故障分析（FA）要求工程师检查检测图像、关联设备遥测、查阅历史缺陷记录并撰写结构化报告，这一过程每个案例可能消耗数小时的专家时间。我们提出了SemiFA，这是一种代理多模态框架，能够在一分钟内自主生成基于半导体检测图像的结构化FA报告。SemiFA将FA分解为一个四代理LangGraph管道：一个使用DINOv2和LLaVA-1.6对缺陷形态进行分类和叙述的DefectDescriber，一个将SECS/GEM设备遥测与从Qdrant向量数据库中检索到的历史相似缺陷融合的RootCauseAnalyzer，一个分配严重性并估计产量影响的SeverityClassifier，以及一个提出纠正过程调整的RecipeAdvisor。第五个节点负责组装PDF报告。我们引入了SemiFA-930，这是一个包含930个注释半导体缺陷图像的数据集，配有来自九个缺陷类别的结构化FA叙述，数据来源于程序合成、WM-811K和MixedWM38。我们的基于DINOv2的分类器在140个验证图像上达到了92.1%的准确率（宏F1 = 0.917），整个管道在NVIDIA A100-SXM4-40 GB GPU上能在48秒内生成完整的FA报告。对四种模态条件下的GPT-4o评估消融实验表明，多模态融合在根本原因推理上比仅使用图像的基线提高了+0.86个复合点（1-5分制），其中设备遥测作为更具承载能力的模态。根据我们的了解，SemiFA是第一个将SECS/GEM设备遥测集成到视觉-语言模型管道中以实现自主FA报告生成的系统。

cs.CV / 10 / 2604.13240

A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models

用于基于概念的可解释人工智能的高分辨率景观数据集及其在物种分布模型中的应用

de la Brosse, Augustin, Garreau, Damien, Houet, Thomas, Corpetti, Thomas

Abstract

Mapping the spatial distribution of species is essential for conservation policy and invasive species management. Species distribution models (SDMs) are the primary tools for this task, serving two purposes: achieving robust predictive performance while providing ecological insights into the driving factors of distribution. However, the increasing complexity of deep learning SDMs has made extracting these insights more challenging. To reconcile these objectives, we propose the first implementation of concept-based Explainable AI (XAI) for SDMs. We leverage the Robust TCAV (Testing with Concept Activation Vectors) methodology to quantify the influence of landscape concepts on model predictions. To enable this, we provide a new open-access landscape concept dataset derived from high-resolution multispectral and LiDAR drone imagery. It includes 653 patches across 15 distinct landscape concepts and 1,450 random reference patches, designed to suit a wide range of species. We demonstrate this approach through a case study of two aquatic insects, Plecoptera and Trichoptera, using two Convolutional Neural Networks and one Vision Transformer. Results show that concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV also provides landscape-level information, useful for policy-making and land management. Code and datasets are publicly available.

Chinese Translation

绘制物种的空间分布对于保护政策和外来物种管理至关重要。物种分布模型（SDMs）是实现这一任务的主要工具，具有两个目的：实现稳健的预测性能，同时提供关于分布驱动因素的生态学见解。然而，深度学习SDMs日益复杂，使得提取这些见解变得更加困难。为了解决这些目标之间的矛盾，我们提出了基于概念的可解释人工智能（XAI）在SDMs中的首次应用。我们利用稳健的TCAV（使用概念激活向量进行测试）方法来量化景观概念对模型预测的影响。为此，我们提供了一个新的开放获取的景观概念数据集，该数据集来源于高分辨率的多光谱和激光雷达（LiDAR）无人机影像。该数据集包含653个斑块，涵盖15种不同的景观概念，以及1,450个随机参考斑块，旨在适应广泛的物种。我们通过对两种水生昆虫（石蝇和石蛾）的案例研究，展示了这一方法，使用了两种卷积神经网络和一种视觉变换器。结果表明，基于概念的XAI有助于根据专家知识验证SDMs，同时揭示新的关联，从而生成新的生态假设。稳健的TCAV还提供了景观级别的信息，对政策制定和土地管理具有重要意义。代码和数据集已公开提供。

cs.CV / 11 / 2604.13244

4th Workshop on Maritime Computer Vision (MaCVi): Challenge Overview

第四届海事计算机视觉研讨会（MaCVi）：挑战概述

Kiefer, Benjamin, Augustin, Jan Lukas, Muhovič, Jon, Jeong, Mingi, Wiliem, Arnold, Pers, Janez, Kristan, Matej, Li, Alberto Quattrini, Teršek, Matija, Šarić, Josip, Vats, Arpita, Hildebrand, Dominik, Rahim, Rafia, Karaaslan, Mahmut, Vaishya, Arpit, Xie, Steve, Kaya, Ersin, Mashrur, Akib, Tang, Tze-Hsiang, Tsai, Chun-Ming, Hsieh, Jun-Wei, Chang, Ming-Ching, Jo, Wonwoo, Lee, Doyeon, Cao, Yusi, Li, Lingling, Nageli, Vinayak, Jamal, Arshad, Subrahmanyam, Gorthi Rama Krishna Sai, Maeng, Jemo, Lee, Seongju, Lee, Kyoobin, Liu, Xu, Jiao, LiCheng, Sheikh, Jannik, Weinmann, Martin, Martinović, Ivan, Persch, Jose Mateus Raitz, Cheppally, Rahul Harsha, Belviranli, Mehmet E., Gahtidis, Dimitris, Chun, Hyewon, Lee, Sangmun, Gorczak, Philipp, Kim, Hansol, Jeon, Jeeyeon, Perez, Borja Carrillo, Wang, Jiahui, Park, Sangmin, Michel, Andreas, Kuester, Jannick, Felten, Bettina, Gross, Wolfgang, Feng, Yuan, Davis, Justin

Abstract

The 4th Workshop on Maritime Computer Vision (MaCVi) is organized as part of CVPR 2026. This edition features five benchmark challenges with emphasis on both predictive accuracy and embedded real-time feasibility. This report summarizes the MaCVi 2026 challenge setup, evaluation protocols, datasets, and benchmark tracks, and presents quantitative results, qualitative comparisons, and cross-challenge analyses of emerging method trends. We also include technical reports from top-performing teams to highlight practical design choices and lessons learned across the benchmark suite. Datasets, leaderboards, and challenge resources are available at https://macvi.org/workshop/cvpr26.

Chinese Translation

第四届海事计算机视觉研讨会（MaCVi）作为CVPR 2026的一部分而组织。本届研讨会设有五个基准挑战，重点关注预测准确性和嵌入式实时可行性。本报告总结了MaCVi 2026挑战的设置、评估协议、数据集和基准轨道，并呈现了定量结果、定性比较以及新兴方法趋势的跨挑战分析。我们还包括了来自表现优异团队的技术报告，以突出在基准套件中所做的实际设计选择和经验教训。数据集、排行榜和挑战资源可在 https://macvi.org/workshop/cvpr26 获取。

cs.CV / 12 / 2604.13262

Rethinking Uncertainty in Segmentation: From Estimation to Decision

重新思考分割中的不确定性：从估计到决策

Maganti, Saket

Abstract

In medical image segmentation, uncertainty estimates are often reported but rarely used to guide decisions. We study the missing step: how uncertainty maps are converted into actionable policies such as accepting, flagging, or deferring predictions. We formulate segmentation as a two-stage pipeline, estimation followed by decision, and show that optimizing uncertainty alone fails to capture most of the achievable safety gains. Using retinal vessel segmentation benchmarks (DRIVE, STARE, CHASE_DB1), we evaluate two uncertainty sources (Monte Carlo Dropout and Test-Time Augmentation) combined with three deferral strategies, and introduce a simple confidence-aware deferral rule that prioritizes uncertain and low-confidence predictions. Our results show that the best method and policy combination removes up to 80 percent of segmentation errors at only 25 percent pixel deferral, while achieving strong cross-dataset robustness. We further show that calibration improvements do not translate to better decision quality, highlighting a disconnect between standard uncertainty metrics and real-world utility. These findings suggest that uncertainty should be evaluated based on the decisions it enables, rather than in isolation.

Chinese Translation

在医学图像分割中，不确定性估计通常被报告但很少用于指导决策。我们研究了这一缺失的步骤：如何将不确定性图转换为可操作的策略，例如接受、标记或推迟预测。我们将分割形式化为一个两阶段的流程，首先是估计，然后是决策，并表明仅优化不确定性无法捕捉到大多数可实现的安全增益。利用视网膜血管分割基准（DRIVE、STARE、CHASE_DB1），我们评估了两种不确定性来源（蒙特卡洛丢弃（Monte Carlo Dropout）和测试时增强（Test-Time Augmentation））与三种推迟策略的组合，并引入了一种简单的基于置信度的推迟规则，优先考虑不确定和低置信度的预测。我们的结果表明，最佳的方法和策略组合在仅25%的像素推迟下消除了多达80%的分割错误，同时实现了强大的跨数据集鲁棒性。我们进一步表明，校准的改善并未转化为更好的决策质量，突显了标准不确定性指标与现实世界效用之间的脱节。这些发现表明，不确定性应根据其所启用的决策进行评估，而不是孤立地进行评估。
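
The estimation-then-decision split described above can be sketched with a simple uncertainty-ranked deferral policy. This is an illustrative baseline under stated assumptions, not the paper's confidence-aware rule, whose exact form the abstract does not give. The function name `defer_most_uncertain` and the toy values are hypothetical.

```python
def defer_most_uncertain(uncertainty, is_error, defer_fraction=0.25):
    """Defer the `defer_fraction` most uncertain pixels and report the
    share of all errors that falls inside the deferred set."""
    n = len(uncertainty)
    n_defer = int(round(defer_fraction * n))
    # Rank pixel indices from most to least uncertain.
    ranked = sorted(range(n), key=lambda i: uncertainty[i], reverse=True)
    deferred = set(ranked[:n_defer])
    total_errors = sum(is_error)
    caught = sum(1 for i in deferred if is_error[i])
    removed = caught / total_errors if total_errors else 0.0
    return deferred, removed

# Toy per-pixel uncertainty scores and error flags (assumed values).
uncertainty = [0.9, 0.1, 0.8, 0.2, 0.05, 0.3, 0.7, 0.15]
is_error = [1, 0, 1, 0, 0, 1, 0, 0]
deferred, removed = defer_most_uncertain(uncertainty, is_error)
```

Evaluating `removed` at a fixed deferral budget scores the uncertainty map by the decision it enables, which is the kind of evaluation the abstract argues for.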

cs.CV / 13 / 2604.13268

Indexing Multimodal Language Models for Large-scale Image Retrieval

用于大规模图像检索的多模态语言模型索引

Tharwat, Bahey, Kordopatis-Zilos, Giorgos, Suma, Pavel, Reid, Ian, Tolias, Giorgos

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong cross-modal reasoning capabilities, yet their potential for vision-only tasks remains underexplored. We investigate MLLMs as training-free similarity estimators for instance-level image-to-image retrieval. Our approach prompts the model with paired images and converts next-token probabilities into similarity scores, enabling zero-shot re-ranking within large-scale retrieval pipelines. This design avoids specialized architectures and fine-tuning, leveraging the rich visual discrimination learned during multimodal pre-training. We address scalability by combining MLLMs with memory-efficient indexing and top-$k$ candidate re-ranking. Experiments across diverse benchmarks show that MLLMs outperform task-specific re-rankers outside their native domains and exhibit superior robustness to clutter, occlusion, and small objects. Despite strong results, we identify failure modes under severe appearance changes, highlighting opportunities for future research. Our findings position MLLMs as a promising alternative for open-world large-scale image retrieval.

Chinese Translation

多模态大型语言模型（MLLMs）已展示出强大的跨模态推理能力，但其在仅视觉任务中的潜力仍未得到充分探索。我们研究了MLLMs作为无训练相似性估计器在实例级图像到图像检索中的应用。我们的方法通过配对图像对模型进行提示，并将下一个标记的概率转换为相似性分数，从而实现大规模检索管道中的零-shot 重新排序。该设计避免了专门的架构和微调，利用了在多模态预训练过程中学习到的丰富视觉区分能力。我们通过将MLLMs与内存高效的索引和前$k$候选重新排序相结合来解决可扩展性问题。在多种基准测试中的实验表明，MLLMs在其本地领域之外超越了特定任务的重新排序器，并在应对杂乱、遮挡和小物体方面表现出更强的鲁棒性。尽管结果强劲，我们在严重外观变化下识别出了失败模式，突显了未来研究的机会。我们的发现将MLLMs定位为开放世界大规模图像检索的有前景的替代方案。
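
One plausible way to turn next-token probabilities into a similarity score, and to apply it only to a top-k shortlist, is sketched below. It assumes the model is asked a yes/no question about an image pair and that the logits of the two answer tokens are available; the prompt, the token choice, and the names `similarity_from_logits` and `rerank` are assumptions for illustration, not details taken from the paper.

```python
import math

def similarity_from_logits(yes_logit, no_logit):
    """Map the logits of two answer tokens to a score in [0, 1].

    Equivalent to a softmax restricted to the two tokens:
    P(yes) / (P(yes) + P(no)).
    """
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

def rerank(candidates, k):
    """Re-rank only the first k candidates of an initial shortlist."""
    head = sorted(
        candidates[:k],
        key=lambda c: similarity_from_logits(c["yes"], c["no"]),
        reverse=True,
    )
    return head + candidates[k:]

# Hypothetical shortlist from a first-stage index, with made-up logits.
shortlist = [
    {"id": "a", "yes": 0.2, "no": 1.5},
    {"id": "b", "yes": 2.0, "no": -1.0},
    {"id": "c", "yes": 0.5, "no": 0.5},
    {"id": "d", "yes": 3.0, "no": -3.0},
]
order = [c["id"] for c in rerank(shortlist, k=3)]
```

Candidate `d` keeps its position even though its score is high, because it lies outside the top-k window; restricting the expensive model calls to k candidates per query is what keeps the scheme scalable.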

cs.CV / 14 / 2604.13278

DroneScan-YOLO: Redundancy-Aware Lightweight Detection for Tiny Objects in UAV Imagery

DroneScan-YOLO：面向无人机图像中微小物体的冗余感知轻量级检测

Bellec, Yann V.

Abstract

Aerial object detection in UAV imagery presents unique challenges due to the high prevalence of tiny objects, adverse environmental conditions, and strict computational constraints. Standard YOLO-based detectors fail to address these jointly: their minimum detection stride of 8 pixels renders sub-32px objects nearly undetectable, their CIoU loss produces zero gradients for non-overlapping tiny boxes, and their architectures contain significant filter redundancy. We propose DroneScan-YOLO, a holistic system contribution that addresses these limitations through four coordinated design choices: (1) increased input resolution of 1280x1280 to maximize spatial detail for tiny objects, (2) RPA-Block, a dynamic filter pruning mechanism based on lazy cosine-similarity updates with a 10-epoch warm-up period, (3) MSFD, a lightweight P2 detection branch at stride 4 adding only 114,592 parameters (+1.1%), and (4) SAL-NWD, a hybrid loss combining Normalized Wasserstein Distance with size-adaptive CIoU weighting, integrated into YOLOv8's TaskAligned assignment pipeline. Evaluated on VisDrone2019-DET, DroneScan-YOLO achieves 55.3% mAP@50 and 35.6% mAP@50-95, outperforming the YOLOv8s baseline by +16.6 and +12.3 points respectively, improving recall from 0.374 to 0.518, and maintaining 96.7 FPS inference speed with only +4.1% parameters. Gains are most pronounced on tiny object classes: bicycle AP@50 improves from 0.114 to 0.328 (+187%), and awning-tricycle from 0.156 to 0.237 (+52%).

Chinese Translation

在无人机图像中进行空中物体检测面临独特的挑战，这主要是由于微小物体的高频率出现、不利的环境条件以及严格的计算限制。标准的基于YOLO的检测器未能共同解决这些问题：其最小检测步幅为8个像素，使得小于32像素的物体几乎无法检测，其CIoU损失对不重叠的微小框产生零梯度，并且其架构中存在显著的滤波器冗余。我们提出了DroneScan-YOLO，这是一个整体系统贡献，通过四个协调的设计选择来解决这些局限性：（1）将输入分辨率提高到1280x1280，以最大化微小物体的空间细节；（2）RPA-Block，一种基于懒惰余弦相似度更新的动态滤波器剪枝机制，具有10个周期的预热期；（3）MSFD，一个轻量级的P2检测分支，步幅为4，仅增加114,592个参数（+1.1%）；（4）SAL-NWD，一种结合了归一化Wasserstein距离和尺寸自适应CIoU加权的混合损失，集成到YOLOv8的TaskAligned分配管道中。在VisDrone2019-DET上进行评估，DroneScan-YOLO实现了55.3%的mAP@50和35.6%的mAP@50-95，分别比YOLOv8s基线提高了16.6和12.3个百分点，召回率从0.374提高到0.518，并且在仅增加4.1%参数的情况下保持96.7 FPS的推理速度。增益在微小物体类别上最为显著：自行车的AP@50从0.114提高到0.328（+187%），雨篷三轮车从0.156提高到0.237（+52%）。
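
The motivation for the Normalized Wasserstein Distance term can be seen in a small sketch: for non-overlapping tiny boxes, plain IoU is zero regardless of how far apart the boxes are, while NWD still decreases smoothly with distance. The code uses the commonly cited NWD formulation, in which a box (cx, cy, w, h) is modelled as a Gaussian with mean (cx, cy) and covariance diag(w²/4, h²/4). The constant `c` is a dataset-dependent normalizer and the value below is an assumption; the size-adaptive CIoU weighting of SAL-NWD is not reproduced because the abstract does not specify it.

```python
import math

def iou(a, b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nwd(a, b, c=12.8):
    """Normalized Wasserstein Distance between two boxes (cx, cy, w, h)."""
    w2_sq = (
        (a[0] - b[0]) ** 2
        + (a[1] - b[1]) ** 2
        + ((a[2] - b[2]) / 2) ** 2
        + ((a[3] - b[3]) / 2) ** 2
    )
    return math.exp(-math.sqrt(w2_sq) / c)

# Three 4 x 4 px boxes on the same row: a prediction, a near miss, a far miss.
pred = (10.0, 10.0, 4.0, 4.0)
near = (16.0, 10.0, 4.0, 4.0)
far = (40.0, 10.0, 4.0, 4.0)

iou_near, iou_far = iou(pred, near), iou(pred, far)
nwd_near, nwd_far = nwd(pred, near), nwd(pred, far)
```

Both IoU values are zero here, so an IoU-only term cannot tell the near miss from the far one, whereas `nwd_near` is larger than `nwd_far` and therefore still provides a ranking signal.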

cs.CV / 15 / 2604.13279

Explainable Fall Detection for Elderly Care via Temporally Stable SHAP in Skeleton-Based Human Activity Recognition

基于骨架的人体活动识别中的可解释老年人跌倒检测：通过时间稳定的SHAP

Saleh, Mohammad, Tabatabaei, Azadeh

Abstract

Fall detection in elderly care requires not only accurate classification but also reliable explanations that clinicians can trust. However, existing post-hoc explainability methods, when applied frame-by-frame to sequential data, produce temporally unstable attribution maps that clinicians cannot reliably act upon. To address this issue, we propose a lightweight and explainable framework for skeleton-based fall detection that combines an efficient LSTM model with T-SHAP, a temporally aware post-hoc aggregation strategy that stabilizes SHAP-based feature attributions over contiguous time windows. Unlike standard SHAP, which treats each frame independently, T-SHAP applies a linear smoothing operator to the attribution sequence, reducing high-frequency variance while preserving the theoretical guarantees of Shapley values, including local accuracy and consistency. Experiments on the NTU RGB+D dataset demonstrate that the proposed framework achieves 94.3% classification accuracy with an end-to-end inference latency below 25 milliseconds, satisfying real-time constraints on mid-range hardware and indicating strong potential for deployment in clinical monitoring scenarios. Quantitative evaluation using perturbation-based faithfulness metrics shows that T-SHAP improves explanation reliability compared to standard SHAP (AUP: 0.89 vs. 0.91) and Grad-CAM (0.82), with consistent improvements observed across five-fold cross-validation. The resulting attributions consistently highlight biomechanically relevant motion patterns, including lower-limb instability and changes in spinal alignment, aligning with established clinical observations of fall dynamics and supporting their use as transparent decision aids in long-term care environments.

Chinese Translation

老年护理中的跌倒检测不仅需要准确的分类，还需要临床医生可以信赖的可靠解释。然而，现有的事后可解释性方法在逐帧应用于序列数据时，会产生时间上不稳定的归因图，临床医生无法可靠地采取行动。为了解决这一问题，我们提出了一种轻量级且可解释的基于骨架的跌倒检测框架，该框架结合了高效的LSTM模型和T-SHAP，一种时间感知的事后聚合策略，能够在连续时间窗口内稳定SHAP（Shapley加值）特征归因。与标准SHAP不同，后者将每一帧独立处理，T-SHAP对归因序列应用线性平滑算子，减少高频方差，同时保留Shapley值的理论保证，包括局部准确性和一致性。在NTU RGB+D数据集上的实验表明，所提框架实现了94.3%的分类准确率，端到端推理延迟低于25毫秒，满足中端硬件上的实时约束，显示出在临床监测场景中的强大部署潜力。使用基于扰动的可信度度量进行的定量评估表明，T-SHAP在解释可靠性方面优于标准SHAP（AUP: 0.89 vs. 0.91）和Grad-CAM（0.82），在五折交叉验证中观察到一致的改进，表明解释可靠性增强。最终的归因一致地突出了生物力学相关的运动模式，包括下肢不稳定性和脊柱对齐变化，与已建立的跌倒动态临床观察相一致，支持其在长期护理环境中作为透明决策辅助工具的使用。
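
The effect of a linear smoothing operator on a sequence of per-frame attributions can be sketched as follows. The abstract does not specify the operator, so a uniform centered moving average is assumed here, and the names `smooth_attributions` and `variance` as well as the toy attribution values are hypothetical.

```python
def smooth_attributions(attributions, window=3):
    """Centered moving average over time, applied per feature.

    `attributions[t][j]` is the attribution of feature j at frame t.
    Windows are truncated at the sequence boundaries; weights are
    uniform and sum to one.
    """
    n_frames = len(attributions)
    n_features = len(attributions[0])
    half = window // 2
    smoothed = []
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        span = hi - lo
        smoothed.append([
            sum(attributions[s][j] for s in range(lo, hi)) / span
            for j in range(n_features)
        ])
    return smoothed

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Toy attributions for 5 frames and 2 features (assumed values).
raw = [[0.9, 0.1], [0.1, 0.2], [0.8, 0.1], [0.2, 0.3], [0.9, 0.2]]
smooth = smooth_attributions(raw)

raw_var = variance([row[0] for row in raw])
smooth_var = variance([row[0] for row in smooth])

# Linearity check at frame 2: the smoothed attributions sum to the
# moving average of the per-frame attribution sums.
frame_sums = [sum(row) for row in raw]
expected_sum = sum(frame_sums[1:4]) / 3
```

Because the operator is linear, the per-frame attribution totals are smoothed by exactly the same weights, which is one way to read the abstract's claim that additive properties of Shapley values carry over to the smoothed sequence.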

cs.CV / 16 / 2604.13292

See&Say: Vision Language Guided Safe Zone Detection for Autonomous Package Delivery Drones

看与说：基于视觉语言的自主包裹投递无人机安全区域检测

Ghazanfari, Mahyar, Wei, Peng

Abstract

Autonomous drone delivery systems are rapidly advancing, but ensuring safe and reliable package drop-offs remains highly challenging in cluttered urban and suburban environments where accurately identifying suitable package drop zones is critical. Existing approaches typically rely on either geometry-based analysis or semantic segmentation alone, but these methods lack the integrated semantic reasoning required for robust decision-making. To address this gap, we propose See&Say, a novel framework that combines geometric safety cues with semantic perception, guided by a Vision-Language Model (VLM) for iterative refinement. The system fuses monocular depth gradients with open-vocabulary detection masks to produce safety maps, while the VLM dynamically adjusts object category prompts and refines hazard detection across time, enabling reliable reasoning under dynamic conditions during the final delivery phase. When the primary drop-pad is occupied or unsafe, the proposed See&Say also identifies alternative candidate zones for package delivery. We curated a dataset of urban delivery scenarios with moving objects and human activities to evaluate the approach. Experimental results show that See&Say outperforms all baselines, achieving the highest accuracy and IoU for safety map prediction as well as superior performance in alternative drop zone evaluation across multiple thresholds. These findings highlight the promise of VLM-guided segmentation-depth fusion for advancing safe and practical drone-based package delivery.

Chinese Translation

自主无人机投递系统正在快速发展，但在拥挤的城市和郊区环境中确保安全可靠的包裹投放仍然面临重大挑战，准确识别合适的包裹投放区域至关重要。现有的方法通常依赖于几何分析或语义分割，但这些方法缺乏强健决策所需的综合语义推理。为了解决这一问题，我们提出了See&Say，一个新颖的框架，将几何安全线索与语义感知相结合，并通过视觉语言模型（Vision-Language Model, VLM）进行迭代优化。该系统融合了单目深度梯度与开放词汇检测掩码，以生成安全地图，同时VLM动态调整物体类别提示，并在最终投递阶段跨时间精细化危险检测，从而在动态条件下实现可靠推理。当主要投放区域被占用或不安全时，See&Say还能够识别替代的包裹投递候选区域。我们整理了一个包含移动物体和人类活动的城市投递场景数据集，以评估该方法。实验结果表明，See&Say在所有基准测试中表现优异，在安全地图预测中实现了最高的准确率和交并比（IoU），并在多个阈值下的替代投放区域评估中表现出色。这些发现突显了VLM引导的分割-深度融合在推动安全和实用的无人机包裹投递中的潜力。
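
A minimal sketch of fusing a geometric cue with a semantic mask into a safety map is given below. It assumes a dense depth grid and a boolean hazard mask are already available, and it marks a cell as safe when the local depth gradient is small and no hazard detection covers it. The threshold, the forward-difference gradient, and the name `safety_map` are assumptions; the VLM-guided prompt refinement described in the abstract is not modelled.

```python
def safety_map(depth, hazard_mask, max_gradient=0.1):
    """Mark a cell safe when the surface is locally flat (small forward
    depth difference) and no hazard detection covers the cell."""
    h, w = len(depth), len(depth[0])
    safe = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Forward differences, clamped at the image border.
            right = depth[y][min(x + 1, w - 1)]
            down = depth[min(y + 1, h - 1)][x]
            grad = max(abs(right - depth[y][x]), abs(down - depth[y][x]))
            safe[y][x] = grad <= max_gradient and not hazard_mask[y][x]
    return safe

# Toy 3 x 3 scene: flat ground at depth 5 with a raised object (depth 3)
# on the right, plus one detected hazard in the top-left cell.
depth = [
    [5.0, 5.0, 5.0],
    [5.0, 5.0, 3.0],
    [5.0, 5.0, 3.0],
]
hazard_mask = [
    [True, False, False],
    [False, False, False],
    [False, False, False],
]
safe = safety_map(depth, hazard_mask)
```

Cells next to the depth step are rejected by the geometric cue and the top-left cell by the semantic cue, so each signal covers cases the other misses. The flat top of the raised object still counts as safe in this toy version, which illustrates why geometry alone is not sufficient.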

cs.CV / 17 / 2604.13294

PAT-VCM: Plug-and-Play Auxiliary Tokens for Video Coding for Machines

PAT-VCM：用于机器视频编码的即插即用辅助标记

Jiang, Wei, Wang, Wei

Abstract

Existing video coding for machines is often trained for a specific downstream task and model. As a result, the compressed representation becomes tightly coupled to the end task, making it difficult to scale across multiple tasks or adapt to model updates. We propose PAT-VCM, a plug-and-play auxiliary-token framework for video coding for machines. PAT-VCM keeps a shared baseline compressed stream and augments it with lightweight task-aware auxiliary tokens, allowing different downstream tasks to recover the information they need without retraining a separate codec for each task. The framework supports three forms of auxiliary information: visual residual tokens, prompt/control tokens, and semantic tokens. We evaluate PAT-VCM on segmentation, depth estimation, and semantic recognition. A shared detection-oriented auxiliary branch provides a reusable first refinement, task-specific visual branches improve segmentation and depth, prompt tokens provide further segmentation gains at negligible bitrate, and semantic tokens achieve strong recognition performance with extremely low overhead. These results suggest that a shared compressed representation, combined with lightweight task-aware auxiliary tokens, is a practical and scalable alternative to tightly task-coupled VCM design.

Chinese Translation

现有的机器视频编码通常针对特定的下游任务和模型进行训练。因此，压缩表示与最终任务紧密耦合，使得在多个任务之间扩展或适应模型更新变得困难。我们提出了PAT-VCM，一种用于机器视频编码的即插即用辅助标记框架。PAT-VCM保持一个共享的基线压缩流，并通过轻量级的任务感知辅助标记进行增强，使得不同的下游任务能够在不为每个任务重新训练单独编码器的情况下恢复所需的信息。该框架支持三种形式的辅助信息：视觉残差标记、提示/控制标记和语义标记。我们在分割、深度估计和语义识别任务上评估了PAT-VCM。一个共享的面向检测的辅助分支提供了可重用的首次精炼，特定任务的视觉分支改善了分割和深度，提示标记在可忽略的比特率下提供了进一步的分割增益，而语义标记则以极低的开销实现了强大的识别性能。这些结果表明，结合轻量级任务感知辅助标记的共享压缩表示是一个实用且可扩展的替代方案，优于紧密耦合任务的VCM设计。

cs.CV / 18 / 2604.13304

Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

跨层转码器能否替代视觉变换器的激活？从可解释性的视角看视觉

Chatzoudis, Gerasimos, Polyzos, Konstantinos D., Li, Zhuowei, Gu, Difei, Moran, Gemma E., Wang, Hao, Metaxas, Dimitris N.

Abstract

Understanding the internal activations of Vision Transformers (ViTs) is critical for building interpretable and trustworthy models. While Sparse Autoencoders (SAEs) have been used to extract human-interpretable features, they operate on individual layers and fail to capture the cross-layer computational structure of Transformers, as well as the relative significance of each layer in forming the last-layer representation. As an alternative, we adopt Cross-Layer Transcoders (CLTs) as reliable, sparse, and depth-aware proxy models for MLP blocks in ViTs. CLTs use an encoder-decoder scheme to reconstruct each post-MLP activation from learned sparse embeddings of preceding layers, yielding a linear decomposition that transforms the final representation of ViTs from an opaque embedding into an additive, layer-resolved construction that enables faithful attribution and process-level interpretability. We train CLTs on CLIP ViT-B/32 and ViT-B/16 across CIFAR-100, COCO, and ImageNet-100. We show that CLTs reconstruct post-MLP activations with high fidelity while preserving, and in some cases even improving, CLIP zero-shot classification accuracy. In terms of interpretability, we show that the cross-layer contribution scores provide faithful attribution, revealing that the final representation is concentrated in a small set of dominant layer-wise terms whose removal degrades performance and whose retention largely preserves it. These results showcase the significance of adopting CLTs as an alternative interpretable proxy of ViTs in the vision domain.

Chinese Translation

理解视觉变换器（ViTs）的内部激活对于构建可解释且可信的模型至关重要。尽管稀疏自编码器（SAEs）已被用于提取人类可解释的特征，但它们仅在单个层上操作，无法捕捉变换器的跨层计算结构，以及每一层在形成最后一层表示中的相对重要性。作为替代，我们引入跨层转码器（CLTs）的应用，作为ViTs中多层感知器（MLP）块的可靠、稀疏且深度感知的代理模型。CLTs采用编码器-解码器方案，从前一层的学习稀疏嵌入中重构每个后MLP激活，产生线性分解，将ViTs的最终表示从不透明的嵌入转变为加性、层次分解的构造，从而实现忠实归因和过程级可解释性。我们在CLIP ViT-B/32和ViT-B/16上对CIFAR-100、COCO和ImageNet-100进行CLTs训练。我们展示了CLTs在后MLP激活的重构保真度方面表现出色，同时在某些情况下保持甚至提高了CLIP零样本分类的准确性。在可解释性方面，我们表明跨层贡献分数提供了忠实的归因，揭示最终表示集中在一小部分主导层次项中，移除这些项会降低性能，而保留它们则在很大程度上保持性能。这些结果展示了在视觉领域采用CLTs作为ViTs的替代可解释代理的重要性。
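The additive, layer-resolved construction described in this abstract can be illustrated with a toy sketch. It assumes the final representation is simply the sum of per-layer terms and scores each term by its projection onto that sum. The function name `contribution_scores`, the projection-based score, and the toy vectors are illustrative assumptions, not the paper's definitions.

```python
def contribution_scores(layer_terms):
    """Return the summed representation and each layer term's share of it.

    Assumption: the score is the projection of a term onto the final vector,
    divided by the squared norm of the final vector, so the scores sum to 1.
    """
    dim = len(layer_terms[0])
    final = [sum(term[k] for term in layer_terms) for k in range(dim)]
    norm_sq = sum(v * v for v in final)
    scores = [sum(t * f for t, f in zip(term, final)) / norm_sq
              for term in layer_terms]
    return final, scores


# Hypothetical per-layer contributions to a 2-D final representation.
terms = [[0.1, 0.0], [2.0, 1.0], [0.0, -0.1], [1.0, 1.5]]
final, scores = contribution_scores(terms)

# Index of the dominant layer-wise term under this toy score.
dominant = max(range(len(scores)), key=lambda i: scores[i])
```

Because the scores sum to one by construction, removing a high-scoring term changes the final vector far more than removing a low-scoring one, which is the qualitative behaviour the abstract reports.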

cs.CV / 19 / 2604.13305

Bias at the End of the Score

评分末端的偏差

Magid, Salma Abdel, Guo, Grace, Tureci, Esin, Dharmasiri, Amaya, Ramaswamy, Vikram V., Pfister, Hanspeter, Russakovsky, Olga

Abstract

Reward models (RMs) are inherently non-neutral value functions designed and trained to encode specific objectives, such as human preferences or text-image alignment. RMs have become crucial components of text-to-image (T2I) generation systems where they are used at various stages for dataset filtering, as evaluation metrics, as a supervisory signal during optimization of parameters, and for post-generation safety and quality filtering of T2I outputs. While specific problems with the integration of RMs into the T2I pipeline have been studied (e.g. reward hacking or mode collapse), their robustness and fairness as scoring functions remain largely unknown. We conduct a large-scale audit of RM robustness with respect to demographic biases during T2I model training and generation. We provide quantitative and qualitative evidence that while originally developed as quality measures, RMs encode demographic biases, which cause reward-guided optimization to disproportionately sexualize female image subjects, reinforce gender/racial stereotypes, and collapse demographic diversity. These findings highlight shortcomings in current reward models, challenge their reliability as quality metrics, and underscore the need for improved data collection and training procedures to enable more robust scoring.

Chinese Translation

奖励模型（Reward Models, RMs）本质上是非中立的价值函数，旨在编码特定目标，如人类偏好或文本-图像对齐。RMs已成为文本到图像（Text-to-Image, T2I）生成系统的关键组成部分，在数据集过滤、评估指标、参数优化过程中的监督信号，以及T2I输出的后生成安全性和质量过滤等多个阶段中使用。尽管已经研究了RMs在T2I流程中整合的特定问题（例如奖励黑客或模式崩溃），但它们作为评分函数的鲁棒性和公平性仍然在很大程度上未知。我们对RM在T2I模型训练和生成过程中与人口统计偏见相关的鲁棒性进行了大规模审计。我们提供了定量和定性证据，表明虽然RMs最初被开发为质量度量，但它们编码了人口统计偏见，这导致奖励引导的优化不成比例地性别化女性图像主体，强化性别/种族刻板印象，并导致人口多样性的崩溃。这些发现突显了当前奖励模型的不足，挑战了其作为质量指标的可靠性，并强调了改进数据收集和训练程序的必要性，以实现更为鲁棒的评分。

cs.CV / 20 / 2604.13307

Deep Spatially-Regularized and Superpixel-Based Diffusion Learning for Unsupervised Hyperspectral Image Clustering

基于深度空间正则化和超像素的无监督高光谱图像聚类扩散学习

Buranasiri, Vutichart, Murphy, James M.

Abstract

An unsupervised framework for hyperspectral image (HSI) clustering is proposed that incorporates masked deep representation learning with diffusion-based clustering, extending the Spatially-Regularized Superpixel-based Diffusion Learning ($S^2DL$) algorithm. Initially, a denoised latent representation of the original HSI is learned via an unsupervised masked autoencoder (UMAE) model with a Vision Transformer backbone. The UMAE takes spatial context and long-range spectral correlations into account and incorporates an efficient pretraining process via masking that utilizes only a small subset of training pixels. In the next stage, the entropy rate superpixel (ERS) algorithm is used to segment the image into superpixels, and a spatially regularized diffusion graph is constructed using Euclidean and diffusion distances within the compressed latent space instead of the HSI space. The proposed algorithm, Deep Spatially-Regularized Superpixel-based Diffusion Learning ($DS^2DL$), leverages more faithful diffusion distances and subsequent diffusion graph construction that better reflect the intrinsic geometry of the underlying data manifold, improving labeling accuracy and clustering quality. Experiments on Botswana and KSC datasets demonstrate the efficacy of $DS^2DL$.

Chinese Translation

提出了一种用于高光谱图像（HSI）聚类的无监督框架，该框架结合了掩码深度表示学习与基于扩散的聚类，扩展了空间正则化超像素基扩散学习（$S^2DL$）算法。最初，通过具有视觉变换器（Vision Transformer）骨干的无监督掩码自编码器（UMAE）模型学习原始HSI的去噪潜在表示。UMAE考虑了空间上下文和长距离光谱相关性，并通过掩码引入了一种高效的预训练过程，仅利用小部分训练像素。在下一阶段，使用熵率超像素（ERS）算法将图像分割为超像素，并在压缩的潜在空间中使用欧几里得距离和扩散距离构建空间正则化扩散图，而不是在HSI空间中。所提出的算法，深度空间正则化超像素基扩散学习（$DS^2DL$），利用更真实的扩散距离和后续的扩散图构建，更好地反映了潜在数据流形的内在几何特征，从而提高了标记准确性和聚类质量。在博茨瓦纳和KSC数据集上的实验表明了$DS^2DL$的有效性。
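The diffusion distances mentioned in this abstract can be sketched on a toy latent space. The example below uses the standard random-walk definition, in which two points are close when their t-step transition distributions are similar. The dense Gaussian graph, the bandwidth `sigma`, the time `t`, and the 1-D points are illustrative assumptions; the paper's superpixel-based, spatially regularized graph is not reproduced here.

```python
import math


def markov_matrix(points, sigma=1.0):
    """Row-normalized Gaussian affinity matrix P and its stationary distribution."""
    n = len(points)
    w = [[math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                   / (2.0 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in w]
    p = [[w[i][j] / deg[i] for j in range(n)] for i in range(n)]
    vol = sum(deg)
    pi = [d / vol for d in deg]  # stationary distribution of P for symmetric w
    return p, pi


def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_distance(p, pi, i, j, t=4):
    """D_t(i, j) = sqrt(sum_u (P^t[i][u] - P^t[j][u])^2 / pi[u])."""
    pt = p
    for _ in range(t - 1):
        pt = matmul(pt, p)
    return math.sqrt(sum((pt[i][u] - pt[j][u]) ** 2 / pi[u]
                         for u in range(len(pi))))


# Hypothetical 1-D latent codes forming two well-separated clusters.
latent = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
p, pi = markov_matrix(latent)
within = diffusion_distance(p, pi, 0, 1)
across = diffusion_distance(p, pi, 0, 3)
```

In this toy setting `within` is much smaller than `across`, because random walks started inside one cluster rarely cross to the other within a few steps.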

cs.CV / 21 / 2604.13315

The Spectrascapes Dataset: Street-view imagery beyond the visible captured using a mobile platform

Spectrascapes 数据集：通过移动平台捕获的超出可见光范围的街景影像

Gupta, Akshit, Timmermans, Joris, Biljecki, Filip, Uijlenhoet, Remko

Abstract

High-resolution data in spatial and temporal contexts is imperative for developing climate resilient cities. Current datasets for monitoring urban parameters are developed primarily using manual inspections, embedded-sensing, remote sensing, or standard street-view imagery (RGB). These methods and datasets are often constrained respectively by poor scalability, inconsistent spatio-temporal resolutions, overhead views or low spectral information. We present a novel method and its open implementation: a multi-spectral terrestrial-view dataset that circumvents these limitations. This dataset consists of 17,718 street level multi-spectral images captured with RGB, Near-infrared, and Thermal imaging sensors on bikes, across diverse urban morphologies (village, town, small city, and big urban area) in the Netherlands. Strict emphasis is put on data calibration and quality while also providing the details of our data collection methodology (including the hardware and software details). To the best of our knowledge, Spectrascapes is the first open-access dataset of its kind. Finally, we demonstrate two downstream use-cases enabled using this dataset and provide potential research directions in the machine learning, urban planning and remote sensing domains.

Chinese Translation

在空间和时间背景下获取高分辨率数据对于开发气候韧性城市至关重要。目前用于监测城市参数的数据集主要依赖于人工检查、嵌入式传感、遥感或标准街景影像（RGB）。这些方法和数据集在可扩展性差、不一致的时空分辨率、俯视视角或低光谱信息等方面存在局限。我们提出了一种新方法及其开放实现：一个多光谱地面视角数据集，克服了这些限制。该数据集由 17,718 张街道级多光谱影像组成，这些影像使用 RGB、近红外和热成像传感器在荷兰的多样城市形态（村庄、城镇、小城市和大城市区域）中通过自行车捕获。我们严格强调数据校准和质量，同时提供我们的数据收集方法的详细信息（包括硬件和软件细节）。据我们所知，Spectrascapes 是首个此类开放获取数据集。最后，我们展示了使用该数据集启用的两个下游应用案例，并提供了在机器学习、城市规划和遥感领域的潜在研究方向。

cs.CV / 22 / 2604.13321

Why MLLMs Struggle to Determine Object Orientations

为何多模态大型语言模型在确定物体方向时面临挑战

Gopinath, Anju, Krishnaswamy, Nikhil, Draper, Bruce

Abstract

Multimodal Large Language Models (MLLMs) struggle with tasks that require reasoning about 2D object orientation in images, as documented in prior work. Tong et al. and Nichols et al. hypothesize that these failures originate in the visual encoder, since commonly used encoders such as CLIP and SigLIP are trained for image-text semantic alignment rather than geometric reasoning. We design a controlled empirical protocol to test this claim by measuring whether rotations can be recovered from encoder representations. In particular, we examine SigLIP and ViT features from LLaVA OneVision and Qwen2.5-VL-7B-Instruct models, respectively, using full images, and examine CLIP representations in LLaVA 1.5 and 1.6 using rotated foreground patches against natural background images. Our null hypothesis is that orientation information is not preserved in the encoder embeddings, and we test this by training linear regressors to predict object orientation from encoded features. Contrary to the hypothesis, we find that orientation information is recoverable from encoder representations: simple linear models accurately predict object orientations from embeddings. This contradicts the assumption that MLLM orientation failures originate in the visual encoder. Having rejected the accepted hypothesis that MLLMs struggle with 2D orientation tasks because of visual encoder limitations, we still don't know why they fail. Although a full explanation is beyond the scope of this paper, we show that orientation information, although present, is spread diffusely across tens of thousands of features. This may or may not be why MLLMs fail to exploit the available orientation information.

Chinese Translation

多模态大型语言模型（MLLMs）在处理需要推理图像中二维物体方向的任务时表现不佳，先前的研究对此已有记录。Tong等人和Nichols等人假设这些失败源于视觉编码器，因为常用的编码器如CLIP和SigLIP是为了图像与文本的语义对齐而训练的，而非几何推理。我们设计了一种受控的实证协议来测试这一假设，通过测量是否可以从编码器表示中恢复旋转信息。具体而言，我们分别使用完整图像检查LLaVA OneVision和Qwen2.5-VL-7B-Instruct模型中的SigLIP和ViT特征，并在LLaVA 1.5和1.6中使用旋转的前景补丁与自然背景图像对比，检查CLIP表示。我们的零假设是方向信息在编码器嵌入中未被保留，我们通过训练线性回归模型来预测从编码特征中提取的物体方向来检验这一假设。与假设相反，我们发现方向信息可以从编码器表示中恢复：简单的线性模型能够准确预测嵌入中的物体方向。这与MLLM方向失败源于视觉编码器的假设相矛盾。在否定了MLLMs因视觉编码器局限性而在二维方向任务中表现不佳的公认假设后，我们仍然不知道它们失败的原因。尽管全面解释超出了本文的范围，但我们展示了方向信息虽然存在，却在数万个特征中分散。这可能是MLLMs未能利用可用方向信息的原因，也可能不是。
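The probing protocol in this abstract (fit a linear model on frozen encoder features and check whether orientation can be read out) can be sketched as follows. The stand-in `toy_encoder`, the (cos, sin) regression target, and all function names are illustrative assumptions. The paper's encoders are CLIP, SigLIP, and ViT models, and its exact regression target is not stated in the abstract.

```python
import math


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + rhs[:] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [[m[i][n + k] / m[i][i] for k in range(len(b[0]))] for i in range(n)]


def fit_linear_probe(features, angles_deg):
    """Least-squares map from features (plus bias) to (cos, sin) of the angle."""
    xs = [[1.0] + list(f) for f in features]
    ys = [[math.cos(math.radians(a)), math.sin(math.radians(a))] for a in angles_deg]
    d = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(d)] for i in range(d)]
    xty = [[sum(x[i] * y[k] for x, y in zip(xs, ys)) for k in range(2)]
           for i in range(d)]
    return solve(xtx, xty)  # normal equations; weights have shape (d, 2)


def predict_angle(weights, feature):
    """Decode an angle in [0, 360) from the probe's (cos, sin) prediction."""
    x = [1.0] + list(feature)
    c = sum(xi * w[0] for xi, w in zip(x, weights))
    s = sum(xi * w[1] for xi, w in zip(x, weights))
    return math.degrees(math.atan2(s, c)) % 360.0


def toy_encoder(angle_deg):
    """Hypothetical frozen encoder: orientation is linearly mixed into 2 features."""
    t = math.radians(angle_deg)
    return [math.cos(t) + 0.5 * math.sin(t), 2.0 * math.sin(t) - math.cos(t)]


train_angles = list(range(0, 360, 10))
weights = fit_linear_probe([toy_encoder(a) for a in train_angles], train_angles)
held_out = predict_angle(weights, toy_encoder(45.0))
```

Regressing on (cos, sin) rather than on the raw angle avoids the discontinuity at 0/360 degrees. If a probe of this kind succeeds on held-out angles, the information is linearly recoverable from the features, which is the sense in which the abstract rejects its null hypothesis.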

cs.CV / 23 / 2604.13322

Towards Successful Implementation of Automated Raveling Detection: Effects of Training Data Size, Illumination Difference, and Spatial Shift

成功实施自动化剥落检测的探索：训练数据量、光照差异和空间偏移的影响

Zhang, Xinan, Wang, Haolin, Yang, Zhongyu, Yi-Chang, Tsai

Abstract

Raveling, the loss of aggregates, is a major form of asphalt pavement surface distress, especially on highways. While research has shown that machine learning and deep learning-based methods yield promising results for raveling detection by classification on range images, their performance often degrades in large-scale deployments where more diverse inference data may originate from different runs, sensors, and environmental conditions. This degradation highlights the need for a more generalizable and robust solution for real-world implementation. Thus, the objectives of this study are to 1) identify and assess potential variations that impact model robustness, such as the quantity of training data, illumination difference, and spatial shift; and 2) leverage findings to enhance model robustness under real-world conditions. To this end, we propose RavelingArena, a benchmark designed to evaluate model robustness to variations in raveling detection. Instead of collecting extensive new data, it is built by augmenting an existing dataset with diverse, controlled variations, thereby enabling variation-controlled experiments to quantify the impact of each variation. Results demonstrate that both the quantity and diversity of training data are critical to model accuracy, with a gain of at least 9.2% in accuracy under the most diverse conditions in the experiments. Additionally, a case study applying these findings to a multi-year test section in Georgia, U.S., shows significant improvements in year-to-year consistency, laying foundations for future studies on temporal deterioration modeling. These insights provide guidance for more reliable model deployment in raveling detection and other real-world tasks that require adaptability to diverse conditions.

Chinese Translation

剥落是集料损失的主要形式，是沥青路面表面病害的一种，尤其在高速公路上尤为明显。研究表明，基于机器学习和深度学习的方法在剥落检测方面通过对范围图像的分类取得了良好的效果，但在大规模部署中，其性能往往会下降，因为更多样化的推理数据可能来自不同的运行、传感器和环境条件。这种性能下降突显了在实际应用中需要更具普适性和鲁棒性的解决方案。因此，本研究的目标是：1）识别和评估影响模型鲁棒性的潜在变化因素，如训练数据的数量、光照差异和空间偏移；2）利用研究结果增强模型在现实条件下的鲁棒性。为此，我们提出了RavelingArena，这是一个旨在评估剥落检测中模型对变化的鲁棒性的基准。它通过对现有数据集进行多样化、受控的变化增强，而不是收集大量新数据，从而实现了变化控制实验，以量化每种变化的影响。结果表明，训练数据的数量和多样性对模型的准确性至关重要，在实验中，在最具多样性的条件下，准确性至少提高了9.2%。此外，将这些发现应用于美国乔治亚州的一个多年测试段的案例研究显示，年际一致性显著改善，为未来的时间衰退建模研究奠定了基础。这些见解为在剥落检测和其他需要适应多样化条件的实际任务中更可靠的模型部署提供了指导。

cs.CV / 24 / 2604.13326

Right Regions, Wrong Labels: Semantic Label Flips in Segmentation under Correlation Shift

正确区域，错误标签：相关性变化下分割中的语义标签翻转

Achara, Akshit, Yathathugoda, Yovin, Byrne, Nick, Antonelli, Michela, Anton, Esther Puyol, Hammers, Alexander, King, Andrew P.

Abstract

The robustness of machine learning models can be compromised by spurious correlations between non-causal features in the input data and target labels. A common way to test for such correlations is to train on data where the label is strongly tied to some non-causal cue, then evaluate on examples where that tie no longer holds. This idea is well established for classification tasks, but for semantic segmentation the specific failure modes are not well understood. We show that a model may achieve reasonable overlap while assigning the wrong semantic label, swapping one plausible foreground class for another, even when object boundaries are largely correct. We focus on this semantic label-flip behaviour and quantify it with a simple diagnostic (Flip) that counts how often ground truth foreground pixels are assigned the wrong foreground identity while remaining predicted as foreground. In a setting where category and scene are correlated during training, increasing the correlation consistently widens the gap between common and rare test conditions and increases these within-object label swaps on counterfactual groups. Overall, our results motivate assessing segmentation robustness under distribution shift beyond overlap by decomposing foreground errors into correct pixels, flipped-identity pixels, and missed-to-background pixels. We also propose an entropy-based, ground truth label-free 'flip-risk' score, which is computed from foreground identity uncertainty, and show that it can flag flip-prone cases at inference time. Code is available at https://github.com/acharaakshit/label-flips.

Chinese Translation

机器学习模型的鲁棒性可能会受到输入数据中非因果特征与目标标签之间虚假相关性的影响。测试这种相关性的常见方法是使用标签与某些非因果线索紧密相关的数据进行训练，然后在这些关联不再成立的示例上进行评估。这个思路在分类任务中已经得到很好的验证，但在语义分割中，具体的失效模式尚不清楚。我们展示了一个模型可能在分配错误的语义标签时仍能实现合理的重叠，甚至在物体边界大致正确的情况下，将一个合理的前景类别与另一个类别互换。我们专注于这种语义标签翻转行为，并通过一个简单的诊断指标（Flip）对其进行量化，该指标计算真实前景像素被分配错误前景身份的频率，同时仍被预测为前景。在训练期间类别与场景相关的情况下，增加相关性会持续扩大常见和稀有测试条件之间的差距，并增加这些反事实组中的物体内部标签交换。总体而言，我们的结果促使在分布变化下评估分割鲁棒性时，超越重叠，通过将前景错误分解为正确像素、翻转身份像素和遗漏到背景像素来进行评估。我们还提出了一种基于熵的、无真实标签的“翻转风险”评分，该评分是从前景身份的不确定性中计算得出的，并表明它可以在推理时标记易翻转的案例。代码可在 https://github.com/acharaakshit/label-flips 获取。
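The foreground decomposition and the entropy-based score described in this abstract can be sketched on flat label lists. This is one plausible reading of the definitions, not the authors' implementation (which is in the linked repository). Background label 0, the normalization by the log of the number of foreground classes, and the function names are assumptions.

```python
import math

BACKGROUND = 0  # assumed label id for background


def decompose_foreground(gt, pred, background=BACKGROUND):
    """Split ground-truth foreground pixels into correct / flipped / missed fractions."""
    correct = flipped = missed = 0
    for g, p in zip(gt, pred):
        if g == background:
            continue  # only ground-truth foreground pixels are counted
        if p == g:
            correct += 1
        elif p == background:
            missed += 1   # foreground pixel predicted as background
        else:
            flipped += 1  # still predicted as foreground, but wrong identity
    total = correct + flipped + missed
    if total == 0:
        return {"correct": 0.0, "flip": 0.0, "missed": 0.0}
    return {"correct": correct / total,
            "flip": flipped / total,
            "missed": missed / total}


def flip_risk(probs, background=BACKGROUND):
    """Normalized entropy of the foreground-only class distribution for one pixel."""
    fg = [p for c, p in enumerate(probs) if c != background]
    mass = sum(fg)
    if mass == 0 or len(fg) < 2:
        return 0.0
    fg = [p / mass for p in fg]
    entropy = -sum(p * math.log(p) for p in fg if p > 0)
    return entropy / math.log(len(fg))


# Toy 8-pixel example with two foreground classes (1 and 2).
gt = [0, 1, 1, 1, 1, 2, 2, 0]
pred = [0, 1, 2, 2, 0, 2, 2, 0]
stats = decompose_foreground(gt, pred)
```

Here three of the six foreground pixels are correct, two are flipped, and one is missed to background. `flip_risk` needs no ground truth: it is high when the model is torn between foreground identities, regardless of how confident it is that the pixel is foreground at all.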

cs.CV / 25 / 2604.13333

SSD-GS: Scattering and Shadow Decomposition for Relightable 3D Gaussian Splatting

SSD-GS：可重光照的3D高斯点云的散射与阴影分解

Zheng, Iris, Tang, Guojun, Doronin, Alexander, Teal, Paul, Zhang, Fang-Lue

Abstract

We present SSD-GS, a physically-based relighting framework built upon 3D Gaussian Splatting (3DGS) that achieves high-quality reconstruction and photorealistic relighting under novel lighting conditions. In physically-based relighting, accurately modeling light-material interactions is essential for faithful appearance reproduction. However, existing 3DGS-based relighting methods adopt coarse shading decompositions, either modeling only diffuse and specular reflections or relying on neural networks to approximate shadows and scattering. This leads to limited fidelity and poor physical interpretability, particularly for anisotropic metals and translucent materials. To address these limitations, SSD-GS decomposes reflectance into four components: diffuse, specular, shadow, and subsurface scattering. We introduce a learnable dipole-based scattering module for subsurface transport, an occlusion-aware shadow formulation that integrates visibility estimates with a refinement network, and an enhanced specular component with an anisotropic Fresnel-based model. Through progressive integration of all components during training, SSD-GS effectively disentangles lighting and material properties, even for unseen illumination conditions, as demonstrated on the challenging OLAT dataset. Experiments demonstrate superior quantitative and perceptual relighting quality compared to prior methods and pave the way for downstream tasks, including controllable light source editing and interactive scene relighting. The source code is available at: https://github.com/irisfreesiri/SSD-GS.

Chinese Translation

我们提出了SSD-GS，一个基于物理的重光照框架，建立在3D高斯点云（3DGS）之上，能够在新的光照条件下实现高质量重建和逼真的重光照。在基于物理的重光照中，准确建模光与材料的相互作用对于真实外观的再现至关重要。然而，现有的基于3DGS的重光照方法采用粗糙的阴影分解，仅建模漫反射和镜面反射，或依赖神经网络来近似阴影和散射。这导致了有限的真实感和较差的物理可解释性，特别是在各向异性金属和半透明材料的情况下。为了解决这些局限性，SSD-GS将反射率分解为四个组成部分：漫反射、镜面反射、阴影和次表面散射。我们引入了一个可学习的基于偶极子的散射模块用于次表面传输，一个考虑遮挡的阴影公式，将可见性估计与精细化网络相结合，以及一个增强的镜面成分，采用各向异性的Fresnel模型。通过在训练过程中逐步整合所有组件，SSD-GS有效地解耦了光照和材料属性，即使在未见过的光照条件下也能如在具有挑战性的OLAT数据集上所示。实验表明，与先前的方法相比，SSD-GS在定量和感知重光照质量上具有优越性，并为后续任务铺平了道路，包括可控光源编辑和交互式场景重光照。源代码可在以下链接获取：https://github.com/irisfreesiri/SSD-GS。

cs.CV / 26 / 2604.13335

SEDTalker: Emotion-Aware 3D Facial Animation Using Frame-Level Speech Emotion Diarization

SEDTalker：基于帧级语音情感分离的情感感知3D面部动画

Jafari, Farzaneh, Berretti, Stefano, Basu, Anup

Abstract

We introduce SEDTalker, an emotion-aware framework for speech-driven 3D facial animation that leverages frame-level speech emotion diarization to achieve fine-grained expressive control. Unlike prior approaches that rely on utterance-level or manually specified emotion labels, our method predicts temporally dense emotion categories and intensities directly from speech, enabling continuous modulation of facial expressions over time. The diarized emotion signals are encoded as learned embeddings and used to condition a speech-driven 3D animation model based on a hybrid Transformer-Mamba architecture. This design allows effective disentanglement of linguistic content and emotional style while preserving identity and temporal coherence. We evaluate our approach on a large-scale multi-corpus dataset for speech emotion diarization and on the EmoVOCA dataset for emotional 3D facial animation. Quantitative results demonstrate strong frame-level emotion recognition performance and low geometric and temporal reconstruction errors, while qualitative results show smooth emotion transitions and consistent expression control. These findings highlight the effectiveness of frame-level emotion diarization for expressive and controllable 3D talking head generation.

Chinese Translation

我们介绍了SEDTalker，一个情感感知的框架，用于基于语音驱动的3D面部动画，该框架利用帧级语音情感分离实现细粒度的表现控制。与依赖于话语级别或手动指定情感标签的先前方法不同，我们的方法直接从语音中预测时间密集的情感类别和强度，从而实现面部表情的连续调节。分离的情感信号被编码为学习的嵌入，并用于调节基于混合Transformer-Mamba架构的语音驱动3D动画模型。这一设计有效地解耦了语言内容和情感风格，同时保持了身份和时间一致性。我们在一个大规模多语料库数据集上评估了我们的方法，以进行语音情感分离，并在EmoVOCA数据集上评估情感3D面部动画。定量结果表明，帧级情感识别性能强劲，几何和时间重建误差低，而定性结果则显示出平滑的情感过渡和一致的表情控制。这些发现突显了帧级情感分离在表现性和可控的3D对话头生成中的有效性。

cs.CV / 27 / 2604.13340

MSGS: Multispectral 3D Gaussian Splatting

MSGS：多光谱三维高斯点云渲染

Zheng, Iris, Tang, Guojun, Doronin, Alexander, Teal, Paul, Zhang, Fang-Lue

Abstract

We present a multispectral extension to 3D Gaussian Splatting (3DGS) for wavelength-aware view synthesis. Each Gaussian is augmented with spectral radiance, represented via per-band spherical harmonics, and optimized under a dual-loss supervision scheme combining RGB and multispectral signals. To improve rendering fidelity, we perform spectral-to-RGB conversion at the pixel level, allowing richer spectral cues to be retained during optimization. Our method is evaluated on both public and self-captured real-world datasets, demonstrating consistent improvements over the RGB-only 3DGS baseline in terms of image quality and spectral consistency. Notably, it excels in challenging scenes involving translucent materials and anisotropic reflections. The proposed approach maintains the compactness and real-time efficiency of 3DGS while laying the foundation for future integration with physically based shading models.

Chinese Translation

我们提出了一种多光谱扩展的三维高斯点云渲染（3D Gaussian Splatting, 3DGS）方法，用于波长感知的视图合成。每个高斯点都增强了光谱辐射，通过每个波段的球面谐波表示，并在结合RGB和多光谱信号的双损失监督方案下进行优化。为了提高渲染的真实感，我们在像素级别执行光谱到RGB的转换，使得在优化过程中保留更丰富的光谱线索。我们的方法在公共和自捕获的真实世界数据集上进行了评估，显示出在图像质量和光谱一致性方面，相较于仅使用RGB的3DGS基线有一致的改进。值得注意的是，该方法在涉及半透明材料和各向异性反射的复杂场景中表现优异。所提出的方法保持了3DGS的紧凑性和实时效率，同时为未来与基于物理的阴影模型的集成奠定了基础。

cs.CV / 28 / 2604.13345

Multi-Agent Object Detection Framework Based on Raspberry Pi YOLO Detector and Slack-Ollama Natural Language Interface

基于树莓派 YOLO 检测器和 Slack-Ollama 自然语言接口的多智能体目标检测框架

Kalušev, Vladimir, Brkljač, Branko, Brkljač, Milan

Abstract

The paper presents the design and prototype implementation of an edge-based object detection system within the new paradigm of AI agent orchestration. It goes beyond traditional design approaches by leveraging an LLM-based natural language interface for system control and communication, and it demonstrates in practice the integration of all system components into a single resource-constrained hardware platform. The method is based on the proposed multi-agent object detection framework, which tightly integrates different AI agents within the same task of providing object detection and tracking capabilities. The proposed design principles highlight the fast prototyping approach that is characteristic of the transformational potential of generative AI systems, which are applied during both the development and implementation stages. Instead of a specialized communication and control interface, the system uses a Slack channel chatbot agent and an accompanying Ollama LLM reporting agent, both of which run locally on the same Raspberry Pi platform alongside the dedicated YOLO-based computer vision agent that performs real-time object detection and tracking. Agent orchestration is implemented through a specially designed event-based message exchange subsystem, which represents an alternative to the completely autonomous agent orchestration and control characteristic of contemporary LLM-based frameworks such as the recently proposed OpenClaw. The experimental investigation provides insights into the limitations of low-cost testbed platforms in the design of completely centralized multi-agent AI systems. The paper also discusses the differences between the presented approach and a solution that would require additional cloud-based external resources.

Chinese Translation

本文提出了一种边缘计算目标检测系统的设计和原型实现，属于人工智能代理编排的新范式。该系统超越了传统设计方法，通过利用基于大语言模型（LLM）的自然语言接口进行系统控制和通信，实际展示了将所有系统组件集成到单一资源受限硬件平台上的可行性。该方法基于所提出的多智能体目标检测框架，紧密集成了不同的人工智能代理，以提供目标检测和跟踪能力。提出的设计原则强调了快速原型开发方法，这一方法是生成式人工智能系统转型潜力的特征，适用于开发和实施阶段。系统采用 Slack 渠道聊天机器人代理和配套的 Ollama LLM 报告代理，而非专门的通信和控制接口，这两个代理均在同一树莓派平台上本地运行，同时还有专门的基于 YOLO 的计算机视觉代理，执行实时目标检测和跟踪。代理编排通过专门设计的基于事件的消息交换子系统实现，这为完全自主的代理编排和控制提供了一种替代方案，后者是当代基于 LLM 的框架（如最近提出的 OpenClaw）的特征。进行的实验研究为低成本测试平台在设计完全集中化的多智能体人工智能系统中的局限性提供了有价值的见解。本文还讨论了所提出的方法与需要额外云端外部资源的解决方案之间的比较差异。

cs.CV / 29 / 2604.13367

A 3D SAM-Based Progressive Prompting Framework for Multi-Task Segmentation of Radiotherapy-induced Normal Tissue Injuries in Limited-Data Settings

基于3D SAM的渐进式提示框架用于有限数据环境下放疗引起的正常组织损伤的多任务分割

Jiang, Caiwen, Zeng, Lei, Liu, Wei

Abstract

Radiotherapy-induced normal tissue injury is a clinically important complication, and accurate segmentation of injury regions from medical images could facilitate disease assessment, treatment planning, and longitudinal monitoring. However, automatic segmentation of these lesions remains largely unexplored because of limited voxel-level annotations and substantial heterogeneity across injury types, lesion size, and imaging modality. To address this gap, we curate a dedicated head-and-neck radiotherapy-induced normal tissue injury dataset covering three manifestations: osteoradionecrosis (ORN), cerebral edema (CE), and cerebral radiation necrosis (CRN). We further propose a 3D SAM-based progressive prompting framework for multi-task segmentation in limited-data settings. The framework progressively incorporates three complementary prompts: text prompts for task-aware adaptation, dose-guided box prompts for coarse localization, and click prompts for iterative refinement. A small-target focus loss is introduced to improve local prediction and boundary delineation for small and sparse lesions. Experiments on ORN, CE, and CRN demonstrate that the proposed method achieves reliable segmentation performance across diverse injury types and outperforms state-of-the-art methods.

Chinese Translation

放疗引起的正常组织损伤是一种临床重要的并发症，准确分割医学图像中的损伤区域可以促进疾病评估、治疗计划和长期监测。然而，由于体素级标注有限以及损伤类型、病灶大小和成像方式之间存在显著异质性，这些病变的自动分割仍然未得到充分探索。为了解决这一问题，我们整理了一个专门的头颈部放疗引起的正常组织损伤数据集，涵盖三种表现形式：放射性骨坏死（osteoradionecrosis, ORN）、脑水肿（cerebral edema, CE）和脑放射性坏死（cerebral radiation necrosis, CRN）。我们进一步提出了一种基于3D SAM的渐进式提示框架，用于有限数据环境下的多任务分割。该框架逐步整合了三种互补提示：用于任务感知适应的文本提示、用于粗略定位的剂量引导框提示，以及用于迭代细化的点击提示。引入了一种小目标聚焦损失，以改善小而稀疏病灶的局部预测和边界描绘。在ORN、CE和CRN上的实验表明，所提出的方法在不同损伤类型上实现了可靠的分割性能，并且优于现有的最先进方法。

cs.CV / 30 / 2604.13383

UniBlendNet: Unified Global, Multi-Scale, and Region-Adaptive Modeling for Ambient Lighting Normalization

UniBlendNet：统一的全局、多尺度和区域自适应建模用于环境光照归一化

Dai, Jiatao, Dong, Wei, Zhou, Han, Tang, Chengzhou, Chen, Jun

Abstract

Ambient Lighting Normalization (ALN) aims to restore images degraded by complex, spatially varying illumination conditions. Existing methods, such as IFBlend, leverage frequency-domain priors to model illumination variations, but still suffer from limited global context modeling and insufficient spatial adaptivity, leading to suboptimal restoration in challenging regions. In this paper, we propose UniBlendNet, a unified framework for ambient lighting normalization that jointly models global illumination, multi-scale structures, and region-adaptive refinement. Specifically, we enhance global illumination understanding by integrating a UniConvNet-based module to capture long-range dependencies. To better handle complex lighting variations, we introduce a Scale-Aware Aggregation Module (SAAM) that performs pyramid-based multi-scale feature aggregation with dynamic reweighting. Furthermore, we design a mask-guided residual refinement mechanism to enable region-adaptive correction, allowing the model to selectively enhance degraded regions while preserving well-exposed areas. This design effectively improves illumination consistency and structural fidelity under complex lighting conditions. Extensive experiments on the NTIRE Ambient Lighting Normalization benchmark demonstrate that UniBlendNet consistently outperforms the baseline IFBlend and achieves improved restoration quality, while producing visually more natural and stable restoration results.

Chinese Translation

环境光照归一化（ALN）旨在恢复因复杂的空间变化光照条件而退化的图像。现有方法，如IFBlend，利用频域先验来建模光照变化，但仍然存在全球上下文建模有限和空间自适应不足的问题，导致在挑战性区域的恢复效果不佳。本文提出了UniBlendNet，一个用于环境光照归一化的统一框架，联合建模全局光照、多尺度结构和区域自适应细化。具体而言，我们通过集成基于UniConvNet的模块来增强全球光照理解，以捕捉长距离依赖关系。为了更好地处理复杂的光照变化，我们引入了一个尺度感知聚合模块（SAAM），该模块通过动态重加权执行基于金字塔的多尺度特征聚合。此外，我们设计了一种掩膜引导的残差细化机制，以实现区域自适应校正，使模型能够选择性地增强退化区域，同时保留曝光良好的区域。这一设计有效改善了复杂光照条件下的光照一致性和结构保真度。在NTIRE环境光照归一化基准上的大量实验表明，UniBlendNet在恢复质量上始终优于基线IFBlend，并产生视觉上更自然和稳定的恢复结果。

cs.CV / 31 / 2604.13397

A Multimodal Clinically Informed Coarse-to-Fine Framework for Longitudinal CT Registration in Proton Therapy

一种多模态临床信息驱动的粗到细框架用于质子治疗中的纵向CT配准

Jiang, Caiwen, Ding, Yuzhen, Jia, Mi, Patel, Samir H., Sio, Terence T., Ashman, Jonathan B., McGee, Lisa A., Rwigema, Jean-Claude M., Rule, William G., Keole, Sameer R., Vora, Sujay A., Wong, William W., Yu, Nathan Y., Halyard, Michele Y., Schild, Steven E., Shen, Dinggang, Liu, Wei

Abstract

Proton therapy offers superior organ-at-risk sparing but is highly sensitive to anatomical changes, making accurate deformable image registration (DIR) across longitudinal CT scans essential. Conventional DIR methods are often too slow for emerging online adaptive workflows, while existing deep learning-based approaches are primarily designed for generic benchmarks and underutilize clinically relevant information beyond images. To address this gap, we propose a clinically scalable coarse-to-fine deformable registration framework that integrates multimodal information from the proton radiotherapy workflow to accommodate diverse clinical scenarios. The model employs dual CNN-based encoders for hierarchical feature extraction and a transformer-based decoder to progressively refine deformation fields. Beyond CT intensities, clinically critical priors, including target and organ-at-risk contours, dose distributions, and treatment planning text, are incorporated through anatomy- and risk-guided attention, text-conditioned feature modulation, and foreground-aware optimization, enabling anatomically focused and clinically informed deformation estimation. We evaluate the proposed framework on a large-scale proton therapy DIR dataset comprising 1,222 paired planning and repeat CT scans across multiple anatomical regions and disease types. Extensive experiments demonstrate consistent improvements over state-of-the-art methods, enabling fast and robust clinically meaningful registration.

Chinese Translation

质子治疗在保护风险器官方面具有显著优势，但对解剖变化高度敏感，因此在纵向CT扫描中进行准确的可变形图像配准（DIR）至关重要。传统的DIR方法通常速度较慢，不适合新兴的在线自适应工作流程，而现有的基于深度学习的方法主要针对通用基准，未能充分利用超出图像的临床相关信息。为了解决这一问题，我们提出了一种临床可扩展的粗到细可变形配准框架，该框架整合了来自质子放射治疗工作流程的多模态信息，以适应多样的临床场景。该模型采用双CNN（卷积神经网络）编码器进行层次特征提取，并使用基于变换器的解码器逐步细化变形场。除了CT强度外，临床关键先验信息，包括靶区和风险器官轮廓、剂量分布和治疗计划文本，通过解剖和风险引导的注意力、文本条件特征调制和前景感知优化被纳入，从而实现了解剖学聚焦和临床信息驱动的变形估计。我们在一个大规模的质子治疗DIR数据集上评估了该框架，该数据集包含1,222对规划和重复CT扫描，涵盖多个解剖区域和疾病类型。大量实验表明，该方法在性能上持续优于最先进的方法，实现了快速且稳健的临床相关配准。

cs.CV / 32 / 2604.13403

Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks

为何多模态上下文学习滞后？揭示内在机制与瓶颈

Wang, Yu, Li, Sharon

Abstract

In-context learning (ICL) enables models to adapt to new tasks via inference-time demonstrations. Despite its success in large language models, the extension of ICL to multimodal settings remains poorly understood in terms of its internal mechanisms and how it differs from text-only ICL. In this work, we conduct a systematic analysis of ICL in multimodal large language models. Using identical task formulations across modalities, we show that multimodal ICL performs comparably to text-only ICL in zero-shot settings but degrades significantly under few-shot demonstrations. To understand this gap, we decompose multimodal ICL into task mapping construction and task mapping transfer, and analyze how models establish cross-modal task mappings, and transfer them to query samples across layers. Our analysis reveals that current models lack reasoning-level alignment between visual and textual representations, and fail to reliably transfer learned task mappings to queries. Guided by these findings, we further propose a simple inference-stage enhancement method that reinforces task mapping transfer. Our results provide new insights into the mechanisms and limitations of multimodal ICL and suggest directions for more effective multimodal adaptation. Our code is available at https://github.com/deeplearning-wisc/Multimocal-ICL-Analysis-Framework-MGI.

Chinese Translation

上下文学习（ICL）使模型能够通过推理时的示例适应新任务。尽管在大型语言模型中取得了成功，但将ICL扩展到多模态环境的内部机制及其与仅文本ICL的区别仍然不够明确。在本研究中，我们对多模态大型语言模型中的ICL进行了系统分析。通过在不同模态中使用相同的任务表述，我们展示了多模态ICL在零样本设置下的表现与仅文本ICL相当，但在少样本示例下显著下降。为了理解这一差距，我们将多模态ICL分解为任务映射构建和任务映射转移，并分析模型如何建立跨模态任务映射，并将其转移到跨层的查询样本中。我们的分析揭示了当前模型在视觉和文本表示之间缺乏推理级别的一致性，且未能可靠地将学习到的任务映射转移到查询中。在这些发现的指导下，我们进一步提出了一种简单的推理阶段增强方法，以强化任务映射转移。我们的结果为多模态ICL的机制和局限性提供了新的见解，并建议了更有效的多模态适应方向。我们的代码可在 https://github.com/deeplearning-wisc/Multimocal-ICL-Analysis-Framework-MGI 获取。

cs.CV / 33 / 2604.13409

CausalDisenSeg: A Causality-Guided Disentanglement Framework with Counterfactual Reasoning for Robust Brain Tumor Segmentation Under Missing Modalities

CausalDisenSeg：一种基于因果推理的解耦框架，用于在缺失模态下进行稳健的脑肿瘤分割

Liu, Bo, Zou, Yulong, Hong, Jin

Abstract

In clinical practice, the robustness of deep learning models for multimodal brain tumor segmentation is severely compromised by incomplete MRI data. This vulnerability stems primarily from modality bias, where models exploit spurious correlations as shortcuts rather than learning true anatomical structures. Existing feature fusion methods fail to fundamentally eliminate this dependency. To address this, we propose CausalDisenSeg, a novel Structural Causal Model (SCM)-grounded framework that achieves robust segmentation via causality-guided disentanglement and counterfactual reasoning. We reframe the problem as isolating the anatomical Causal Factor from the stylistic Bias Factor. Our framework implements a three-stage causal intervention: (1) Explicit Causal Disentanglement: A Conditional Variational Autoencoder (CVAE) coupled with an HSIC constraint mathematically enforces statistical orthogonality between anatomical and style features. (2) Causal Representation Reinforcement: A Region Causality Module (RCM) explicitly grounds causal features in physical tumor regions. (3) Counterfactual Reasoning: A dual-adversarial strategy actively suppresses the residual Natural Direct Effect (NDE) of the bias, forcing its spatial attention to be mutually exclusive from the causal path. Extensive experiments on the BraTS 2020 dataset demonstrate that CausalDisenSeg significantly outperforms state-of-the-art methods in accuracy and consistency across severe missing-modality scenarios. Furthermore, cross-dataset evaluation on BraTS 2023 under the same protocol yields a state-of-the-art macro-average DSC of 84.49.

Chinese Translation

在临床实践中，深度学习模型在多模态脑肿瘤分割中的稳健性因不完整的MRI数据而受到严重影响。这种脆弱性主要源于模态偏差，模型利用虚假的相关性作为捷径，而不是学习真实的解剖结构。现有的特征融合方法未能从根本上消除这种依赖性。为了解决这个问题，我们提出了CausalDisenSeg，这是一种新颖的基于结构因果模型（SCM）的框架，通过因果引导的解耦和反事实推理实现稳健的分割。我们将问题重新表述为将解剖因果因素与风格偏差因素隔离。我们的框架实施了三阶段的因果干预：(1) 明确的因果解耦：一个条件变分自编码器（CVAE）结合HSIC约束在数学上强制解剖特征与风格特征之间的统计正交性。(2) 因果表示强化：区域因果模块（RCM）明确将因果特征与物理肿瘤区域相结合。(3) 反事实推理：一种双对抗策略主动抑制偏差的残余自然直接效应（NDE），迫使其空间注意力与因果路径相互排斥。在BraTS 2020数据集上的大量实验表明，CausalDisenSeg在严重缺失模态场景下的准确性和一致性上显著优于最先进的方法。此外，在相同协议下对BraTS 2023的跨数据集评估获得了84.49的最先进宏平均DSC。

cs.CV / 34 / 2604.13416

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

DF3DV-1K：一个用于无干扰新视角合成的大规模数据集和基准测试

Lu, Cheng-You, Hung, Yi-Shan, Chi, Wei-Ling, Wang, Hao-Ping, Tsai, Charlie Li-Ting, Chang, Yu-Cheng, Liu, Yu-Lun, Do, Thomas, Lin, Chin-Teng

Abstract

Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches.

Chinese Translation

辐射场的进展使得照片级真实感的新视角合成成为可能。在多个领域，已经开发出大规模的真实世界数据集，以支持全面的基准测试并促进超越场景特定重建的进展。然而，对于无干扰辐射场而言，缺乏一个每个场景都有干净和杂乱图像的大规模数据集，这限制了其发展。为了解决这一问题，我们引入了DF3DV-1K，这是一个包含1,048个场景的大规模真实世界数据集，每个场景提供用于基准测试的干净和杂乱图像集。该数据集总共包含89,924张使用消费级相机拍摄的图像，以模拟日常拍摄，涵盖128种干扰物类型和161种室内外场景主题。一个精心策划的41个场景子集DF3DV-41被系统设计用于评估无干扰辐射场方法在具有挑战性场景下的鲁棒性。利用DF3DV-1K，我们对九种最新的无干扰辐射场方法和3D高斯溅射进行了基准测试，识别出最鲁棒的方法和最具挑战性的场景。除了基准测试外，我们通过微调基于扩散的2D增强器来展示DF3DV-1K的应用，以改善辐射场方法，在保留集（例如DF3DV-41）和On-the-go数据集上实现了平均0.96 dB PSNR和0.057 LPIPS的提升。我们希望DF3DV-1K能够促进无干扰视觉的发展，并推动超越场景特定方法的进展。

cs.CV / 35 / 2604.13419

Physically-Guided Optical Inversion Enable Non-Contact Side-Channel Attack on Isolated Screens

物理引导的光学反演实现对孤立屏幕的非接触侧信道攻击

Zheng, Zhiwen, Qiao, Yuheng, Zhang, Xiaoshuai, Huang, Zhao, Zhang, Tao, Zhou, Huiyu, Jiang, Shaowei, Liu, Jin, Tang, Wenwen, Huang, Xingru

Abstract

Noncontact exfiltration of electronic screen content poses a security challenge, with side-channel incursions as the principal vector. We introduce an optical projection side-channel paradigm that confronts two core instabilities: (i) the near-singular Jacobian spectrum of projection mapping breaches Hadamard stability, rendering inversion hypersensitive to perturbations; (ii) irreversible compression in light transport obliterates global semantic cues, magnifying reconstruction ambiguity. Exploiting passive speckle patterns formed by diffuse reflection, our Irradiance Robust Radiometric Inversion Network (IR4Net) fuses a Physically Regularized Irradiance Approximation (PRIrr-Approximation), which embeds the radiative transfer equation in a learnable optimizer, with a contour-to-detail cross-scale reconstruction mechanism that arrests noise propagation. Moreover, an Irreversibility Constrained Semantic Reprojection (ICSR) module reinstates lost global structure through context-driven semantic mapping. Evaluated across four scene categories, IR4Net achieves fidelity beyond competing neural approaches while retaining resilience to illumination perturbations.

Chinese Translation

电子屏幕内容的非接触提取带来了安全挑战，侧信道攻击是主要的攻击途径。我们提出了一种光学投影侧信道范式，面对两个核心不稳定性：（i）投影映射的近奇异雅可比谱违反了哈达玛稳定性，使得反演对扰动高度敏感；（ii）光传输中的不可逆压缩消除了全局语义线索，放大了重建模糊性。利用由漫反射形成的被动散斑模式，我们的辐照度鲁棒辐射反演网络（IR4Net）融合了物理正则化辐照度近似（PRIrr-Approximation），该近似将辐射传输方程嵌入可学习的优化器中，并结合了轮廓到细节的跨尺度重建机制，以抑制噪声传播。此外，一个不可逆约束的语义重投影（ICSR）模块通过上下文驱动的语义映射恢复丢失的全局结构。在四个场景类别的评估中，IR4Net在保留对光照扰动的鲁棒性的同时，达到了超越竞争神经方法的保真度。

cs.CV / 36 / 2604.13425

VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning

VibeFlow：通过自监督学习实现多功能视频色彩光照编辑

Li, Yifan, Cheng, Pei, Fu, Bin, Yang, Shuai, Liu, Jiaying

Abstract

Video chroma-lux editing, which aims to modify illumination and color while preserving structural and temporal fidelity, remains a significant challenge. Existing methods typically rely on expensive supervised training with synthetic paired data. This paper proposes VibeFlow, a novel self-supervised framework that unleashes the intrinsic physical understanding of pre-trained video generation models. Instead of learning color and light transitions from scratch, we introduce a disentangled data perturbation pipeline that enforces the model to adaptively recombine structure from source videos and color-illumination cues from reference images, enabling robust disentanglement in a self-supervised manner. Furthermore, to rectify discretization errors inherent in flow-based models, we introduce Residual Velocity Fields alongside a Structural Distortion Consistency Regularization, ensuring rigorous structural preservation and temporal coherence. Our framework eliminates the need for costly training resources and generalizes in a zero-shot manner to diverse applications, including video relighting, recoloring, low-light enhancement, day-night translation, and object-specific color editing. Extensive experiments demonstrate that VibeFlow achieves impressive visual quality with significantly reduced computational overhead. Our project is publicly available at https://lyf1212.github.io/VibeFlow-webpage.

Chinese Translation

视频色彩光照编辑旨在在保持结构和时间一致性的同时修改照明和颜色，这仍然是一个重大挑战。现有方法通常依赖于昂贵的有监督训练和合成配对数据。本文提出了VibeFlow，一个新颖的自监督框架，释放了预训练视频生成模型的内在物理理解。我们引入了一种解耦数据扰动管道，使模型能够自适应地重新组合源视频的结构和参考图像的颜色-光照线索，从而以自监督的方式实现稳健的解耦。此外，为了纠正基于流的模型中固有的离散化误差，我们引入了残差速度场（Residual Velocity Fields）和结构失真一致性正则化（Structural Distortion Consistency Regularization），确保严格的结构保持和时间一致性。我们的框架消除了对昂贵训练资源的需求，并以零样本的方式推广到多种应用，包括视频重照明、重新着色、低光增强、昼夜转换和特定对象的颜色编辑。大量实验表明，VibeFlow在显著降低计算开销的同时，实现了令人印象深刻的视觉质量。我们的项目已公开发布，网址为 https://lyf1212.github.io/VibeFlow-webpage。

cs.CV / 37 / 2604.13426

Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking

事件自适应状态转移与门控融合用于RGB-事件目标跟踪

You, Jinlin, Li, Muyu, Zhao, Xudong

Abstract

Existing Vision Mamba-based RGB-Event (RGBE) tracking methods suffer from using static state transition matrices, which fail to adapt to variations in event sparsity. This rigidity leads to imbalanced modeling (underfitting sparse event streams and overfitting dense ones), thus degrading cross-modal fusion robustness. To address these limitations, we propose MambaTrack, a multimodal and efficient tracking framework built upon a Dynamic State Space Model (DSSM). Our contributions are twofold. First, we introduce an event-adaptive state transition mechanism that dynamically modulates the state transition matrix based on event stream density. A learnable scalar governs the state evolution rate, enabling differentiated modeling of sparse and dense event flows. Second, we develop a Gated Projection Fusion (GPF) module for robust cross-modal integration. This module projects RGB features into the event feature space and generates adaptive gates from event density and RGB confidence scores. These gates precisely control the fusion intensity, suppressing noise while preserving complementary information. Experiments show that MambaTrack achieves state-of-the-art performance on the FE108 and FELT datasets. Its lightweight design suggests potential for real-time embedded deployment.
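
The two mechanisms described above can be sketched in a few lines of NumPy. This is not the authors' implementation: the density-scaled step size in `adaptive_transition`, the additive form of the fusion, and all parameter names (`alpha`, `w_proj`, `w_gate`, `b_gate`) are hypothetical choices made only to show the general idea.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_transition(a_diag, dt, density, alpha):
    # Diagonal state-space discretization exp(dt * A), with the step size
    # scaled by event density through a scalar `alpha` (assumed form).
    return np.exp(dt * (1.0 + alpha * density) * a_diag)

def gated_fusion(rgb_feat, event_feat, w_proj, density, rgb_conf, w_gate, b_gate):
    # Project RGB features into the event feature space.
    rgb_in_event_space = rgb_feat @ w_proj
    # Scalar gate computed from event density and RGB confidence (assumed form).
    gate = sigmoid(w_gate[0] * density + w_gate[1] * rgb_conf + b_gate)
    # The gate scales how much projected RGB information is added.
    return event_feat + gate * rgb_in_event_space, gate
```

With a negative `a_diag` and positive `alpha`, a denser event stream yields a smaller transition factor, so the state forgets faster; whether the paper uses this direction is not stated in the abstract.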

Chinese Translation

现有基于Vision Mamba的RGB-事件（RGBE）跟踪方法存在使用静态状态转移矩阵的问题，这使其无法适应事件稀疏性的变化。这种刚性导致了建模的不平衡——对稀疏事件流的欠拟合和对密集事件流的过拟合，从而降低了跨模态融合的鲁棒性。为了解决这些局限性，我们提出了MambaTrack，一个基于动态状态空间模型（DSSM）的多模态高效跟踪框架。我们的贡献有两个方面。首先，我们引入了一种事件自适应状态转移机制，该机制根据事件流的密度动态调节状态转移矩阵。一个可学习的标量控制状态演化速率，使得对稀疏和密集事件流的建模得以区分。其次，我们开发了一个门控投影融合（GPF）模块，用于稳健的跨模态集成。该模块将RGB特征投影到事件特征空间，并根据事件密度和RGB置信度生成自适应门。这些门精确控制融合强度，抑制噪声的同时保留互补信息。实验表明，MambaTrack在FE108和FELT数据集上实现了最先进的性能。其轻量化设计表明其在实时嵌入式部署中的潜力。

cs.CV / 38 / 2604.13432

MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

MaMe与MaRe：基于矩阵的令牌合并与恢复用于高效视觉感知与合成

Huo, Simin, Li, Ning

Abstract

Token compression is crucial for mitigating the quadratic complexity of self-attention mechanisms in Vision Transformers (ViTs), which often involve numerous input tokens. Existing methods, such as ToMe, rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that limit their effectiveness. We introduce MaMe, a training-free, differentiable token merging method based entirely on matrix operations, which is GPU-friendly to accelerate ViTs. Additionally, we present MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis. When applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop. Notably, fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3x acceleration with negligible performance degradation. In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with only a 0.84% accuracy loss. Furthermore, MaMe achieves simultaneous improvements in both performance and speed on some tasks. In image synthesis, the MaMe+MaRe pipeline enhances quality while reducing Stable Diffusion v2.1 generation latency by 31%. Collectively, these results demonstrate MaMe's and MaRe's effectiveness in accelerating vision models. The code is available at https://github.com/cominder/mame.
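
To make the idea of merging and restoring tokens with matrix products alone more concrete, here is a rough sketch. It is not the MaMe or MaRe algorithm, whose details the abstract does not give: the strided anchor selection, the softmax assignment, and the names `merge_tokens`, `restore_tokens`, and `tau` are all assumptions for illustration.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def merge_tokens(x, keep, tau=0.1):
    # x: (n, d) tokens. Anchors are picked by uniform striding, so no sorting
    # or scattered writes are needed (an assumed simplification).
    n = x.shape[0]
    anchors = x[np.linspace(0, n - 1, keep).astype(int)]
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    an = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    # Soft assignment of every token to every anchor; each row sums to 1.
    assign = softmax(xn @ an.T / tau, axis=1)                 # (n, keep)
    # Normalize per anchor so each merged token is a weighted average.
    merge = assign.T / assign.sum(axis=0, keepdims=True).T    # (keep, n)
    return merge @ x, assign

def restore_tokens(merged, assign):
    # Approximate inverse: each original position reads back a mixture
    # of the merged tokens it was assigned to.
    return assign @ merged
```

Both steps are dense matrix multiplications, which is the property the abstract highlights as GPU-friendly; the restoration here is approximate, not exact.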

Chinese Translation

令牌压缩对于缓解视觉变换器（Vision Transformers, ViTs）中自注意力机制的二次复杂性至关重要，因为这些机制通常涉及大量输入令牌。现有方法，如ToMe，依赖于GPU效率低下的操作（例如，排序、分散写入），引入了限制其有效性的开销。我们提出了MaMe，这是一种完全基于矩阵操作的无训练、可微分的令牌合并方法，具有友好的GPU性能，以加速ViTs。此外，我们还提出了其逆操作MaRe，用于令牌恢复，形成了用于图像合成的MaMe+MaRe管道。当应用于预训练模型时，MaMe使ViT-B的吞吐量翻倍，准确率下降2%。值得注意的是，使用MaMe微调最后一层使ViT-B的准确率提升1.0%，同时速度提高1.1倍。在SigLIP2-B@512的零-shot分类中，MaMe提供了1.3倍的加速，且性能下降微乎其微。在视频任务中，MaMe使VideoMAE-L在Kinetics-400上的加速达到48.5%，仅损失0.84%的准确率。此外，MaMe在某些任务上实现了性能与速度的同步提升。在图像合成中，MaMe+MaRe管道提高了质量，同时将Stable Diffusion v2.1的生成延迟减少了31%。总体而言，这些结果证明了MaMe和MaRe在加速视觉模型方面的有效性。代码可在https://github.com/cominder/mame获取。

cs.CV / 39 / 2604.13448

A Study of Failure Modes in Two-Stage Human-Object Interaction Detection

两阶段人机交互检测中的失效模式研究

Wang, Lemeng, Lei, Qinqian, Bakshi, Vidhi, Yi, Daniel, Liu, Yifan, Hou, Jiacheng, Hao, Asher Seng, Mai, Zheda, Chao, Wei-Lun, Tan, Robby T., Wang, Bo

Abstract

Human-object interaction (HOI) detection aims to detect interactions between humans and objects in images. While recent advances have improved performance on existing benchmarks, their evaluations mainly focus on overall prediction accuracy and provide limited insight into the underlying causes of model failures. In particular, modern models often struggle in complex scenes involving multiple people and rare interaction combinations. In this work, we present a study to better understand the failure modes of two-stage HOI models, which form the basis of many current HOI detection approaches. Rather than constructing a large-scale benchmark, we instead decompose HOI detection into multiple interpretable perspectives and analyze model behavior across these dimensions to study different types of failure patterns. We curate a subset of images from an existing HOI dataset organized by human-object-interaction configurations (e.g., multi-person interactions and object sharing), and analyze model behavior under these configurations to examine different failure modes. This design allows us to analyze how these HOI models behave under different scene compositions and why their predictions fail. Importantly, high overall benchmark performance does not necessarily reflect robust visual reasoning about human-object relationships. We hope that this study can provide useful insights into the limitations of HOI models and offer observations for future research in this area.

Chinese Translation

人机交互（HOI）检测旨在识别图像中人类与物体之间的交互。尽管近期的进展提高了现有基准测试的性能，但其评估主要集中在整体预测准确性上，对模型失效的潜在原因提供的见解有限。特别是，现代模型在涉及多个人和稀有交互组合的复杂场景中往往表现不佳。在本研究中，我们旨在更好地理解两阶段HOI模型的失效模式，这些模型构成了许多当前HOI检测方法的基础。我们并未构建一个大规模的基准，而是将HOI检测分解为多个可解释的视角，并在这些维度上分析模型行为，以研究不同类型的失效模式。我们从现有的HOI数据集中整理了一部分图像，按人机交互配置（例如，多人交互和物体共享）进行组织，并分析模型在这些配置下的行为，以检查不同的失效模式。这一设计使我们能够分析这些HOI模型在不同场景组合下的表现及其预测失效的原因。重要的是，高整体基准性能并不一定反映出对人机关系的稳健视觉推理。我们希望本研究能够为HOI模型的局限性提供有益的见解，并为该领域未来的研究提供观察依据。

cs.CV / 40 / 2604.13491

Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning

通过细粒度多模态推理增强文本到图像生成

Kim, Yongjin, Oh, Yoonjin, Kim, Yerin, Kim, Hyomin, Yun, Jeeyoung, Heo, Yujung, Kim, Minjun, Kim, Sungwoong

Abstract

With the rapid progress of Multimodal Large Language Models (MLLMs), unified MLLMs that jointly perform image understanding and generation have advanced significantly. However, despite the inherent reasoning capabilities of unified MLLMs for self-reflection and self-refinement, their use in text-to-image generation remains largely underexplored. Meanwhile, existing multimodal reasoning-based image generation methods mostly rely on holistic image-text alignment judgments, without fine-grained reflection and refinement of detailed prompt attributes, leading to limited fine-grained control. Therefore, we propose Fine-grained Multimodal Reasoning (FiMR), a framework that leverages decomposed visual question answering (VQA) to break down an input prompt into minimal semantic units, such as entities and attributes, and verify each unit via VQA to generate explicit, fine-grained feedback. Based on this feedback, FiMR then applies targeted, localized refinements. This fine-grained self-reasoning and self-refinement enable MLLMs to achieve more precise improvements in image-prompt alignment and overall generation quality at test time. Extensive experiments demonstrate that FiMR consistently outperforms image generation baselines, including reasoning-based methods, particularly on compositional text-to-image benchmarks.

Chinese Translation

随着多模态大型语言模型（MLLMs）的快速发展，能够联合进行图像理解和生成的统一 MLLMs 已取得显著进展。然而，尽管统一 MLLMs 在自我反思和自我优化方面具备内在的推理能力，但它们在文本到图像生成中的应用仍然大多未被深入探索。同时，现有基于多模态推理的图像生成方法主要依赖整体的图像-文本对齐判断，而缺乏对详细提示属性的细粒度反思和优化，导致细粒度控制的局限。因此，我们提出了细粒度多模态推理（Fine-grained Multimodal Reasoning, FiMR）框架，该框架利用分解的视觉问答（Visual Question Answering, VQA）将输入提示分解为最小语义单元——如实体和属性——并通过 VQA 验证每个单元，以生成明确的、细粒度的反馈。基于这些反馈，FiMR 然后应用有针对性的、局部的优化。这种细粒度的自我推理和自我优化使得 MLLMs 在测试时能够在图像-提示对齐和整体生成质量上实现更精确的改进。大量实验表明，FiMR 在图像生成基准测试中始终优于包括基于推理的方法在内的图像生成基线，尤其是在组合文本到图像的基准上表现突出。

cs.CV / 41 / 2604.13495

ADP-DiT: Text-Guided Diffusion Transformer for Brain Image Generation in Alzheimer's Disease Progression

ADP-DiT：用于阿尔茨海默病进展的文本引导扩散变换器脑图像生成

Lee, Juneyong, Baek, Geonwoo, Jang, Ikbeom

Abstract

Alzheimer's disease (AD) progresses heterogeneously across individuals, motivating subject-specific synthesis of follow-up magnetic resonance imaging (MRI) to support progression assessment. While Diffusion Transformers (DiT), an emerging transformer-based diffusion model, offer a scalable backbone for image synthesis, longitudinal AD MRI generation with clinically interpretable control over follow-up time and participant metadata remains underexplored. We present ADP-DiT, an interval-aware, clinically text-conditioned diffusion transformer for longitudinal AD MRI synthesis. ADP-DiT encodes follow-up interval together with multi-domain demographic, diagnostic (CN/MCI/AD), and neuropsychological information as a natural-language prompt, enabling time-specific control beyond coarse diagnostic stages. To inject this conditioning effectively, we use dual text encoders: OpenCLIP for vision-language alignment and T5 for richer clinical-language understanding. Their embeddings are fused into DiT through cross-attention for fine-grained guidance and adaptive layer normalization for global modulation. We further enhance anatomical fidelity by applying rotary positional embeddings to image tokens and performing diffusion in a pre-trained SDXL-VAE latent space to enable efficient high-resolution reconstruction. On 3,321 longitudinal 3T T1-weighted scans from 712 participants (259,038 image slices), ADP-DiT achieves SSIM 0.8739 and PSNR 29.32 dB, improving over a DiT baseline by +0.1087 SSIM and +6.08 dB PSNR while capturing progression-related changes such as ventricular enlargement and shrinking hippocampus. These results suggest that integrating comprehensive, subject-specific clinical conditions with architectures can improve longitudinal AD MRI synthesis.

Chinese Translation

阿尔茨海默病（AD）在个体之间的进展具有异质性，这促使了针对特定受试者的后续磁共振成像（MRI）合成，以支持进展评估。尽管扩散变换器（DiT）作为一种新兴的基于变换器的扩散模型，为图像合成提供了可扩展的基础，但在临床可解释的控制下进行纵向AD MRI生成，特别是对后续时间和参与者元数据的控制仍然未得到充分探索。我们提出了ADP-DiT，这是一种间隔感知的、临床文本条件的扩散变换器，用于纵向AD MRI合成。ADP-DiT将后续间隔与多领域的人口统计、诊断（CN/MCI/AD）和神经心理学信息编码为自然语言提示，使得超越粗略诊断阶段的时间特定控制成为可能。为了有效注入这种条件，我们使用双文本编码器——OpenCLIP用于视觉-语言对齐，T5用于更丰富的临床语言理解。它们的嵌入通过交叉注意力融合到DiT中，以实现细粒度指导，并通过自适应层归一化进行全局调制。我们进一步通过对图像标记应用旋转位置嵌入，并在预训练的SDXL-VAE潜在空间中进行扩散，以提高解剖学的真实感，从而实现高效的高分辨率重建。在来自712名参与者（259,038个图像切片）的3,321个纵向3T T1加权扫描中，ADP-DiT达到了SSIM 0.8739和PSNR 29.32 dB，相比于DiT基线提高了+0.1087 SSIM和+6.08 dB PSNR，同时捕捉到与进展相关的变化，如脑室扩大和海马萎缩。这些结果表明，将全面的、特定于受试者的临床条件与架构相结合，可以改善纵向AD MRI合成。

cs.CV / 42 / 2604.13508

Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling

通过集群感知的再利用增强专家混合模型的专业化

Chu, Sanghyeok, Ahn, Pyunghwan, Song, Gwangmo, Kim, SeungHwan, Lee, Honglak, Han, Bohyung

Abstract

Sparse Upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from pretrained dense weights instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster-aware Upcycling, a strategy that incorporates semantic structure into MoE initialization. Our method first partitions the dense model's input activations into semantic clusters. Each expert is then initialized using the subspace representations of its corresponding cluster via truncated SVD, while setting the router's initial weights to the cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data distribution. Furthermore, we introduce an expert-ensemble self-distillation loss that stabilizes training by providing reliable routing guidance using an ensemble teacher. When evaluated on CLIP ViT-B/32 and ViT-B/16, Cluster-aware Upcycling consistently outperforms existing methods across both zero-shot and few-shot benchmarks. The proposed method also produces more diverse and disentangled expert representations, reduces inter-expert similarity, and leads to more confident routing behavior.
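
The initialization steps described above (cluster the activations, build each expert from a truncated SVD of its cluster, set the router rows to the centroids) can be sketched as follows. This is a simplified illustration under stated assumptions: the plain k-means loop, the projection `w_dense @ basis @ basis.T`, and the names `kmeans`, `upcycle`, and `rank` are hypothetical and may differ from the paper's procedure.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    # Plain Lloyd's algorithm; initial centroids are random data points.
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids, labels

def upcycle(w_dense, acts, num_experts, rank):
    # w_dense: (d_out, d_in) dense weight; acts: (n, d_in) input activations.
    centroids, labels = kmeans(acts, num_experts)
    experts = []
    for j in range(num_experts):
        cluster = acts[labels == j]
        # Leading right-singular vectors span the cluster's dominant input subspace.
        _, _, vt = np.linalg.svd(cluster, full_matrices=False)
        basis = vt[:rank].T                         # (d_in, at most rank)
        # Restrict the dense weight to that subspace (assumed construction).
        experts.append(w_dense @ basis @ basis.T)
    router = centroids.copy()                       # one router row per expert
    return experts, router
```

Because each expert sees a different subspace, the experts no longer start from identical weights, which is the symmetry-breaking effect the abstract describes.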

Chinese Translation

稀疏再利用提供了一种有效的方法，可以从预训练的密集权重初始化专家混合模型（Mixture-of-Experts, MoE），而无需从头开始训练。然而，由于所有专家都从相同的权重开始，并且路由器是随机初始化的，因此模型面临专家对称性和早期专业化有限的问题。我们提出了集群感知再利用（Cluster-aware Upcycling）策略，该策略将语义结构纳入MoE的初始化过程中。我们的方法首先将密集模型的输入激活划分为语义集群。然后，每个专家通过截断奇异值分解（truncated SVD）使用其对应集群的子空间表示进行初始化，同时将路由器的初始权重设置为集群中心。这样的集群感知初始化打破了专家对称性，并促进了与数据分布一致的早期专业化。此外，我们引入了一种专家集成自蒸馏损失，通过使用集成教师提供可靠的路由指导，从而稳定训练。在对CLIP ViT-B/32和ViT-B/16进行评估时，集群感知再利用在零样本和少样本基准测试中始终优于现有方法。所提出的方法还产生了更具多样性和解耦的专家表示，减少了专家之间的相似性，并导致更自信的路由行为。

cs.CV / 43 / 2604.13509

DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer

DiT作为实时重渲染器：基于自回归扩散变换器的流媒体视频风格化

Lyu, Hengye, Li, Zisu, Hong, Yue, Weng, Yueting, Shi, Jiaxin, Zhang, Hanwang, Liang, Chen

Abstract

Recent advances in video generation models have significantly accelerated video generation and related downstream tasks. Among these, video stylization holds important research value in areas such as immersive applications and artistic creation, attracting widespread attention. However, existing diffusion-based video stylization methods struggle to maintain stability and consistency when processing long videos, and their high computational cost and multi-step denoising make them difficult to apply in practical scenarios. In this work, we propose RTR-DiT (DiT as Real-Time Rerenderer), a streaming video stylization framework built upon Diffusion Transformer. We first fine-tune a bidirectional teacher model on a curated video stylization dataset, supporting both text-guided and reference-guided video stylization tasks, and subsequently distill it into a few-step autoregressive model via post-training with Self Forcing and Distribution Matching Distillation. Furthermore, we propose a reference-preserving KV cache update strategy that not only enables stable and consistent processing of long videos, but also supports real-time switching between text prompts and reference images. Experimental results show that RTR-DiT outperforms existing methods in both text-guided and reference-guided video stylization tasks, in terms of quantitative metrics and visual quality, and demonstrates excellent performance in real-time long video stylization and interactive style-switching applications.

Chinese Translation

近年来，视频生成模型的进展显著加速了视频生成及相关下游任务的发展。在这些任务中，视频风格化在沉浸式应用和艺术创作等领域具有重要的研究价值，吸引了广泛的关注。然而，现有的基于扩散的视频风格化方法在处理长视频时难以保持稳定性和一致性，其高计算成本和多步骤去噪使得它们在实际场景中的应用变得困难。在本研究中，我们提出了RTR-DiT（DiT作为实时重渲染器），这是一个基于扩散变换器的流媒体视频风格化框架。我们首先在一个精心策划的视频风格化数据集上微调了一个双向教师模型，支持文本引导和参考引导的视频风格化任务，随后通过自我强制和分布匹配蒸馏将其提炼为一个少步骤的自回归模型。此外，我们提出了一种参考保留的KV缓存更新策略，该策略不仅能够实现长视频的稳定和一致处理，还支持在文本提示和参考图像之间的实时切换。实验结果表明，RTR-DiT在文本引导和参考引导的视频风格化任务中，在定量指标和视觉质量方面均优于现有方法，并在实时长视频风格化和交互式风格切换应用中表现出色。

cs.CV / 44 / 2604.13540

Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding

统一多模态模型的免费午餐：通过内在理解增强生成的反思修正

Jiang, Yibo, Wu, Tao, Jiang, Rui, Lu, Yehao, Cai, Chaoxiang, Qin, Zequn, Li, Xi

Abstract

Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where their understanding capability significantly outperforms their generation. This mismatch indicates that the model's rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human "Thinking-While-Drawing" paradigm, where humans continuously reflect to activate their knowledge and rectify intermediate results. In this paper, we propose UniRect-CoT, a training-free unified rectification chain-of-thought framework. Our approach unlocks the "free lunch" hidden in the UMM's powerful inherent understanding to continuously reflect, activating its internal knowledge and rectifying intermediate results during generation. We regard the diffusion denoising process in UMMs as an intrinsic visual reasoning process and align the intermediate results with the target instruction understood by the model, serving as a self-supervisory signal to rectify UMM generation. Extensive experiments demonstrate that UniRect-CoT can be easily integrated into existing UMMs, significantly enhancing generation quality across diverse complex tasks.

Chinese Translation

统一多模态模型（UMMs）旨在将视觉理解与生成集成于单一结构中。然而，这些模型表现出显著的能力不匹配，其理解能力远远超过生成能力。这种不匹配表明，模型丰富的内部知识在理解任务中有效，但在生成过程中却未能充分激活。为了解决这一问题，我们从人类的“边画边思考”范式中获得灵感，人类在此过程中不断反思以激活其知识并修正中间结果。本文提出了UniRect-CoT，一种无训练的统一修正思维链框架。我们的方法利用UMM强大内在理解中隐藏的“免费午餐”，在生成过程中持续反思，激活其内部知识并修正中间结果。我们将UMMs中的扩散去噪过程视为一种内在的视觉推理过程，并将中间结果与模型理解的目标指令对齐，作为自监督信号来修正UMM生成。大量实验表明，UniRect-CoT可以轻松集成到现有的UMMs中，显著提升在多样复杂任务中的生成质量。

cs.CV / 45 / 2604.13549

Reconstruction of a 3D wireframe from a single line drawing via generative depth estimation

通过生成深度估计从单一线条绘图重建三维线框

Cao, Elton, Lipson, Hod

Abstract

The conversion of 2D freehand sketches into 3D models remains a pivotal challenge in computer vision, bridging the gap between human creativity and digital fabrication. Traditional line drawing reconstruction relies on brittle symbolic logic, while modern approaches are constrained by rigid parametric modeling, limiting users to predefined CAD primitives. We propose a generative approach by framing reconstruction as a conditional dense depth estimation task. To achieve this, we implement a Latent Diffusion Model (LDM) with a ControlNet-style conditioning framework to resolve the inherent ambiguities of orthographic projections. To support an iterative "sketch-reconstruct-sketch" workflow, we introduce a graph-based BFS masking strategy to simulate partial depth cues. We train and evaluate our approach using a massive dataset of over one million image-depth pairs derived from the ABC Dataset. Our framework demonstrates robust performance across varying shape complexities, providing a scalable pipeline for converting sparse 2D line drawings into dense 3D representations, effectively allowing users to "draw in 3D" without the rigid constraints of traditional CAD.
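
The graph-based BFS masking idea can be illustrated with a small sketch: starting from a seed vertex of the wireframe graph, breadth-first search reveals a connected set of vertices whose depths are treated as known, and the rest are masked. The function names, the dictionary-based graph, and the cube example are assumptions for illustration; the paper's actual masking procedure may differ.

```python
from collections import deque

def bfs_reveal(adjacency, seed, reveal_count):
    """Return the first `reveal_count` vertices reached by BFS from `seed`, in visit order."""
    revealed = []
    seen = {seed}
    queue = deque([seed])
    while queue and len(revealed) < reveal_count:
        v = queue.popleft()
        revealed.append(v)
        for nb in adjacency[v]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return revealed

def mask_depth(depths, revealed):
    # Keep the depth of revealed vertices; None marks depths left to predict.
    keep = set(revealed)
    return {v: (z if v in keep else None) for v, z in depths.items()}

# Hypothetical example: a cube wireframe whose vertices are 3-bit codes,
# with edges between codes that differ in exactly one bit.
cube = {v: [v ^ 1, v ^ 2, v ^ 4] for v in range(8)}
depths = {v: float((v >> 2) & 1) for v in range(8)}
partial = mask_depth(depths, bfs_reveal(cube, seed=0, reveal_count=4))
```

Revealing a connected neighborhood, instead of random vertices, mimics a user who has already reconstructed one part of the sketch and continues drawing from it.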

Chinese Translation

将二维自由手绘草图转换为三维模型仍然是计算机视觉中的一个关键挑战，旨在弥合人类创造力与数字制造之间的差距。传统的线条绘制重建依赖于脆弱的符号逻辑，而现代方法受到严格参数建模的限制，使用户只能使用预定义的计算机辅助设计（CAD）原始件。我们提出了一种生成性的方法，将重建框架设定为条件密集深度估计任务。为此，我们实现了一种潜在扩散模型（Latent Diffusion Model, LDM），并采用ControlNet风格的条件框架来解决正投影固有的模糊性。为了支持迭代的“草图-重建-草图”工作流程，我们引入了一种基于图的广度优先搜索（BFS）掩蔽策略，以模拟部分深度线索。我们使用来自ABC数据集的超过一百万对图像-深度对的大型数据集对我们的方法进行了训练和评估。我们的框架在不同形状复杂性下表现出强大的性能，提供了一个可扩展的管道，将稀疏的二维线条绘图转换为密集的三维表示，有效地使用户能够在没有传统CAD严格限制的情况下“在三维中绘画”。

cs.CV / 46 / 2604.13555

AI Powered Image Analysis for Phishing Detection

基于人工智能的图像分析用于网络钓鱼检测

Acharya, K., Ale, S., Kadel, R.

Abstract

Phishing websites now rely heavily on visual imitation (copied logos, similar layouts, and matching colours) to avoid detection by text- and URL-based systems. This paper presents a deep learning approach that uses webpage screenshots for image-based phishing detection. Two vision models, ConvNeXt-Tiny and Vision Transformer (ViT-Base), were tested to see how well they handle visually deceptive phishing pages. The framework covers dataset creation, preprocessing, transfer learning with ImageNet weights, and evaluation using different decision thresholds. The results show that ConvNeXt-Tiny performs the best overall, achieving the highest F1-score at the optimised threshold and running more efficiently than ViT-Base. This highlights the strength of convolutional models for visual phishing detection and shows why threshold tuning is important for real-world deployment. As future work, the curated dataset used in this study will be released to support reproducibility and encourage further research in this area. Unlike many existing studies that primarily report accuracy, this work places greater emphasis on threshold-aware evaluation to better reflect real-world deployment conditions. By examining precision, recall, and F1-score across different decision thresholds, the study identifies operating points that balance detection performance and false-alarm control. In addition, the side-by-side comparison of ConvNeXt-Tiny and ViT-Base under the same experimental setup offers practical insights into how convolutional and transformer-based architectures differ in robustness and computational efficiency for visual phishing detection.
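
Threshold-aware evaluation of the kind described above amounts to recomputing precision, recall, and F1 at several decision thresholds and comparing the operating points. The sketch below shows this on a tiny made-up set of labels and scores; the data and the function names are illustrative assumptions, not results from the paper.

```python
import numpy as np

def prf_at_threshold(y_true, scores, threshold):
    # A page is flagged as phishing when its score reaches the threshold.
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return float(precision), float(recall), float(f1)

def best_threshold(y_true, scores, thresholds):
    # Returns (threshold, precision, recall, f1) for the highest-F1 threshold.
    results = [(t, *prf_at_threshold(y_true, scores, t)) for t in thresholds]
    return max(results, key=lambda r: r[3])

# Made-up toy data: 1 = phishing, 0 = benign.
y_true = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.6])
best = best_threshold(y_true, scores, [0.3, 0.5, 0.7])
```

On this toy data the threshold 0.3 gives perfect recall with precision 0.6, while 0.7 gives perfect precision with recall 2/3 and the highest F1 (0.8), which shows how the chosen operating point trades detection against false alarms.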

Chinese Translation

网络钓鱼网站现在严重依赖视觉模仿——复制的标志、相似的布局和匹配的颜色——以避免被基于文本和URL的系统检测。本文提出了一种深度学习方法，利用网页截图进行基于图像的钓鱼检测。我们测试了两个视觉模型，ConvNeXt-Tiny和Vision Transformer (ViT-Base)，以评估它们在处理视觉欺骗性钓鱼页面方面的表现。该框架涵盖了数据集创建、预处理、使用ImageNet权重的迁移学习以及使用不同决策阈值的评估。结果表明，ConvNeXt-Tiny的整体表现最佳，在优化阈值下达到了最高的F1-score，并且运行效率高于ViT-Base。这突显了卷积模型在视觉钓鱼检测中的优势，并显示了阈值调优在实际部署中的重要性。作为未来的工作，本研究中使用的经过整理的数据集将被发布，以支持可重复性并鼓励该领域的进一步研究。与许多现有研究主要报告准确率不同，本研究更加强调基于阈值的评估，以更好地反映实际部署条件。通过在不同决策阈值下检查精确率、召回率和F1-score，本研究识别出在检测性能和误报控制之间平衡的操作点。此外，在相同实验设置下对ConvNeXt-Tiny和ViT-Base的并排比较提供了关于卷积和基于变换器的架构在视觉钓鱼检测中的鲁棒性和计算效率差异的实际见解。

cs.CV / 47 / 2604.13561

CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling

用于腹部CT图像-文本对齐和零-shot学习的CLIP架构：探讨批次组成和数据扩展

Shivika, Bose, Kartik, Gupta, Pankaj

Abstract

Vision-language models trained with contrastive learning on paired medical images and reports show strong zero-shot diagnostic capabilities, yet the effect of training batch composition on learned representations remains unexplored for 3D medical imaging. We reproduce Merlin, a dual-encoder model that aligns 3D abdominal CT volumes with radiology reports using symmetric InfoNCE loss, achieving a zero-shot macro F1 of 74.45% across 30 findings (original: 73.00%). We then investigate two axes of variation. First, we control the normal-to-abnormal ratio within training batches at 25:75, 50:50, and 75:25 using section-level balanced sampling on the full dataset. All three configurations underperform the unbalanced baseline by 2.4 to 2.8 points, with 75:25 achieving the best result (72.02%) among balanced variants. Second, we conduct data scaling ablations on a 4,362-study subset, training with 20%, 40%, and 100% of the data. Performance scales sub-linearly from 65.26% to 71.88%, with individual findings varying dramatically in data sensitivity. Enforcing 50:50 balanced sampling on the same subset further degrades performance to 68.01%, confirming that explicit class balancing hurts regardless of dataset or balancing granularity. Our results indicate that the stochastic diversity of random sampling, combined with Merlin's alternating batching over anatomical subsections, provides more effective regularization than engineered class ratios at the small batch sizes required by 3D medical volumes.
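
The symmetric InfoNCE loss mentioned above is the standard contrastive objective for paired embeddings: a cross-entropy over the image-to-text similarities averaged with one over the text-to-image similarities. The NumPy sketch below shows the computation; the temperature value of 0.07 and the function names are assumptions, and the abstract does not state the settings used for Merlin.

```python
import numpy as np

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def symmetric_info_nce(img_emb, txt_emb, temperature=0.07):
    # L2-normalize so the dot product is a cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (batch, batch)
    # Matching pairs sit on the diagonal; all other entries act as negatives.
    i2t = -np.mean(np.diag(log_softmax(logits, axis=1)))
    t2i = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return float(0.5 * (i2t + t2i))
```

Because every other sample in the batch serves as a negative, the makeup of each batch directly shapes the loss, which is why batch composition is a natural variable to study here.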

Chinese Translation

经过对配对医学图像和报告进行对比学习训练的视觉-语言模型展现出强大的零-shot诊断能力，但训练批次组成对学习表示的影响在3D医学成像中尚未被探索。我们复现了Merlin，一个双编码器模型，使用对称的InfoNCE损失将3D腹部CT体积与放射学报告对齐，在30个发现中实现了74.45%的零-shot宏F1（原始值：73.00%）。接着，我们探讨了两个变化轴。首先，我们在训练批次中控制正常与异常的比例为25:75、50:50和75:25，使用全数据集的分段平衡抽样。这三种配置的表现均比不平衡基线低2.4到2.8个百分点，其中75:25在平衡变体中取得最佳结果（72.02%）。其次，我们在一个4362个研究的子集中进行数据扩展消融实验，使用20%、40%和100%的数据进行训练。性能从65.26%以亚线性方式扩展至71.88%，而各个发现的敏感性在数据量上变化显著。在同一子集上强制进行50:50的平衡抽样进一步将性能降至68.01%，确认了无论数据集或平衡粒度如何，显式的类别平衡都会造成损害。我们的结果表明，随机抽样的随机多样性结合Merlin在解剖子部分的交替批处理，提供了比在3D医学体积所需的小批量中工程化类别比率更有效的正则化。

cs.CV / 48 / 2604.13565

UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

UHR-BAT：一种预算感知的超高分辨率遥感视觉-语言模型的标记压缩方法

Dang, Yunkai, Dai, Minxin, Yang, Yuekun, Li, Zhangnan, Li, Wenbin, Miao, Feng, Gao, Yang

Abstract

Ultra-high-resolution (UHR) remote sensing imagery couples kilometer-scale context with query-critical evidence that may occupy only a few pixels. Such vast spatial scale leads to a quadratic explosion of visual tokens and hinders the extraction of information from small objects. Previous works utilize direct downsampling, dense tiling, or global top-k pruning, which either compromise query-critical image details or incur unpredictable compute. In this paper, we propose UHR-BAT, a query-guided and region-faithful token compression framework to efficiently select visual tokens under a strict context budget. Specifically, we leverage text-guided, multi-scale importance estimation for visual tokens, effectively tackling the challenge of achieving precise yet low-cost feature extraction. Furthermore, by introducing region-wise preserve and merge strategies, we mitigate visual token redundancy, further driving down the computational budget. Experimental results show that UHR-BAT achieves state-of-the-art performance across various benchmarks. Code will be available at https://github.com/Yunkaidang/UHR.
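
A rough sketch of query-guided token selection under a fixed budget follows. It is only an illustration of the general idea: the cosine-similarity importance score, the rule of reserving one merged token per region, and the names `compress_tokens` and `region_ids` are assumptions and do not reproduce the multi-scale estimation or the exact preserve-and-merge strategies of UHR-BAT.

```python
import numpy as np

def compress_tokens(tokens, query, region_ids, budget):
    # tokens: (n, d) visual tokens; query: (d,) text embedding;
    # region_ids: (n,) integer region label per token.
    regions = np.unique(region_ids)
    n_keep = budget - len(regions)       # reserve one merged token per region
    if n_keep < 0:
        raise ValueError("budget must be at least the number of regions")
    tn = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    qn = query / np.linalg.norm(query)
    importance = tn @ qn                 # cosine similarity to the query
    keep = np.zeros(len(tokens), dtype=bool)
    keep[np.argsort(-importance)[:n_keep]] = True
    out = [tokens[keep]]                 # preserved tokens, untouched
    for r in regions:
        rest = (~keep) & (region_ids == r)
        if rest.any():
            # Merge the remaining tokens of this region into their mean.
            out.append(tokens[rest].mean(axis=0, keepdims=True))
    return np.concatenate(out, axis=0)
```

The output never exceeds `budget` rows, so the compute cost is predictable, and every region still contributes at least a summary token, so context outside the top-scoring tokens is not discarded entirely.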

Chinese Translation

超高分辨率（UHR）遥感图像将千米尺度的背景与可能仅占几个像素的查询关键证据相结合。这种广阔的空间尺度导致视觉标记的二次爆炸，阻碍了从小物体中提取信息。以往的研究采用直接下采样、密集平铺或全局 top-k 剪枝，这些方法要么妥协了查询关键的图像细节，要么导致不可预测的计算开销。本文提出了 UHR-BAT，一种查询引导和区域保真标记压缩框架，旨在在严格的上下文预算下高效选择视觉标记。具体而言，我们利用文本引导的多尺度重要性估计来处理视觉标记，有效应对实现精确且低成本特征提取的挑战。此外，通过引入区域保留和合并策略，我们减少了视觉标记的冗余，进一步降低了计算预算。实验结果表明，UHR-BAT 在各种基准测试中实现了最先进的性能。代码将发布在 https://github.com/Yunkaidang/UHR。
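
To make the idea of a strict token budget concrete, here is a hypothetical sketch of a preserve-and-merge step. It is not UHR-BAT's algorithm: the function `compress_tokens`, the scalar features, and the rule of reserving one merged token per region are all assumptions made for illustration. The point is only that, unlike an unconstrained selection, the output size can never exceed the budget.

```python
def compress_tokens(tokens, budget):
    """tokens: list of (region_id, score, feature) with a scalar feature.

    Returns at most `budget` (region_id, feature) pairs.
    """
    regions = sorted({r for r, _, _ in tokens})
    if budget < len(regions):
        raise ValueError("budget must allow one merged token per region")
    # Reserve one slot per region for a merged token; spend the rest on
    # the highest-scoring tokens overall.
    n_keep = budget - len(regions)
    order = sorted(range(len(tokens)), key=lambda i: tokens[i][1], reverse=True)
    kept = set(order[:n_keep])
    out = [(tokens[i][0], tokens[i][2]) for i in sorted(kept)]
    for r in regions:
        rest = [f for i, (rid, _, f) in enumerate(tokens) if rid == r and i not in kept]
        if rest:  # average the low-importance tokens of this region
            out.append((r, sum(rest) / len(rest)))
    return out

tokens = [(0, 0.9, 1.0), (0, 0.1, 3.0), (0, 0.2, 5.0),
          (1, 0.8, 2.0), (1, 0.3, 4.0), (1, 0.05, 6.0)]
compressed = compress_tokens(tokens, budget=4)
```

Here the two highest-scoring tokens are preserved and the remaining tokens of each region collapse into one averaged token, so every region stays represented.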

cs.CV / 49 / 2604.13568

ZoomSpec: A Physics-Guided Coarse-to-Fine Framework for Wideband Spectrum Sensing

ZoomSpec：一种物理引导的粗到细宽带谱感知框架

Yang, Zhentao, Luomei, Yixiang, Liu, Zhuoyang, Liu, Zhenyu, Xu, Feng

Abstract

Wideband spectrum sensing for low-altitude monitoring is critical yet challenging due to heterogeneous protocols, large bandwidths, and non-stationary SNR. Existing data-driven approaches treat spectrograms as natural images, suffering from domain mismatch: they neglect time-frequency resolution constraints and spectral leakage, leading to poor narrowband visibility. This paper proposes ZoomSpec, a physics-guided coarse-to-fine framework integrating signal processing priors with deep learning. We introduce a Log-Space STFT (LS-STFT) to overcome the geometric bottleneck of linear spectrograms, sharpening narrowband structures while maintaining constant relative resolution. A lightweight Coarse Proposal Net (CPN) rapidly screens the full band. To bridge coarse detection and fine recognition, we design an Adaptive Heterodyne Low-Pass (AHLP) module that executes center-frequency aligning, bandwidth-matched filtering, and safe decimation, purifying signals of out-of-band interference. A Fine Recognition Net (FRN) fuses purified time-domain I/Q with spectral magnitude via dual-domain attention to jointly refine temporal boundaries and modulation classification. Evaluations on the SpaceNet real-world dataset demonstrate state-of-the-art 78.1 mAP@0.5:0.95, surpassing existing leaderboard systems with superior stability across diverse modulation bandwidths.

Chinese Translation

低空监测的宽带谱感知至关重要，但由于异构协议、大带宽和非平稳信噪比（SNR），其实现面临挑战。现有的数据驱动方法将谱图视为自然图像，遭受领域不匹配的问题：它们忽视了时频分辨率约束和谱泄漏，导致窄带可见性差。本文提出了ZoomSpec，一种将信号处理先验与深度学习相结合的物理引导粗到细框架。我们引入了对数空间短时傅里叶变换（Log-Space STFT, LS-STFT），以克服线性谱图的几何瓶颈，锐化窄带结构，同时保持相对分辨率恒定。一个轻量级的粗提议网络（Coarse Proposal Net, CPN）快速筛选整个频带。为了连接粗检测和细识别，我们设计了一个自适应异频低通（Adaptive Heterodyne Low-Pass, AHLP）模块，执行中心频率对齐、带宽匹配滤波和安全抽取，净化信号中的带外干扰。一个细识别网络（Fine Recognition Net, FRN）通过双域注意力融合净化后的时域I/Q信号与谱幅度，共同细化时间边界和调制分类。在SpaceNet真实世界数据集上的评估显示，ZoomSpec达到了78.1 mAP@0.5:0.95的最新水平，超越了现有排行榜系统，并在多种调制带宽下展现出卓越的稳定性。
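
The three operations attributed to the AHLP module (center-frequency alignment, low-pass filtering, decimation) follow a classic signal-processing pattern that can be sketched as below. This is a generic textbook version under stated assumptions, not the paper's module: the function name, the boxcar filter, and the factor-of-two safety margin on the output rate are all illustrative choices.

```python
import cmath
import math

def heterodyne_lowpass_decimate(x, fs, f_center, bandwidth):
    """Shift f_center to 0 Hz, low-pass with a moving average, then decimate."""
    # 1) Heterodyne: multiply by a complex exponential at -f_center.
    mixed = [s * cmath.exp(-2j * math.pi * f_center * n / fs) for n, s in enumerate(x)]
    # 2) Decimation factor chosen so the new sample rate stays >= 2 * bandwidth.
    factor = max(1, int(fs // (2 * bandwidth)))
    # 3) Crude low-pass: average `factor` consecutive samples (boxcar filter),
    #    keeping one output per block, which also performs the decimation.
    out = []
    for start in range(0, len(mixed) - factor + 1, factor):
        block = mixed[start:start + factor]
        out.append(sum(block) / factor)
    return out, fs / factor

fs = 1000.0
tone = [cmath.exp(2j * math.pi * 200.0 * n / fs) for n in range(200)]
baseband, new_fs = heterodyne_lowpass_decimate(tone, fs, f_center=200.0, bandwidth=50.0)
```

A tone exactly at the chosen center frequency becomes a constant after mixing, so it passes through the averaging unchanged while the sample rate drops by a factor of ten.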

cs.CV / 50 / 2604.13571

Radar-Informed 3D Multi-Object Tracking under Adverse Conditions

雷达信息驱动的三维多目标跟踪在恶劣条件下的研究

Xu, Bingxue, Hedemalm, Emil, Khoche, Ajinkya, Jensfelt, Patric

Abstract

The challenge of 3D multi-object tracking (3D MOT) is achieving robustness in real-world applications, for example under adverse conditions, and maintaining consistency as distance increases. To overcome these challenges, sensor fusion approaches that combine LiDAR, cameras, and radar have emerged. However, existing multi-modal fusion methods usually treat radar as another learned feature inside the network. When the overall model degrades in difficult environmental conditions, the robustness advantages that radar could provide are also reduced. We propose RadarMOT, a radar-informed 3D MOT framework that explicitly uses radar point cloud data as an additional observation to refine state estimation and recover detector misses at long ranges. Evaluations on the MAN-TruckScenes dataset show that RadarMOT consistently improves the Average Multi-Object Tracking Accuracy (AMOTA), by an absolute 12.7% at long range and 10.3% in adverse weather. The code will be available at https://github.com/bingxue-xu/radarmot

Chinese Translation

三维多目标跟踪（3D MOT）的挑战在于在实际应用中实现鲁棒性，例如在恶劣条件下以及随着距离增加保持一致性。为克服这些挑战，结合激光雷达（LiDAR）、摄像头和雷达的传感器融合方法应运而生。然而，现有的多模态融合方法通常将雷达视为网络内部的另一种学习特征。当整体模型在困难的环境条件下退化时，雷达所能提供的鲁棒性优势也会随之降低。我们提出了RadarMOT，一个雷达信息驱动的3D MOT框架，明确使用雷达点云数据作为额外观测，以优化状态估计并恢复远距离的检测器漏检。在MAN-TruckScenes数据集上的评估表明，RadarMOT在长距离情况下的平均多目标跟踪准确率（AMOTA）提高了绝对12.7%，在恶劣天气下提高了10.3%。代码将发布在 https://github.com/bingxue-xu/radarmot
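
Using a sensor reading as an explicit observation, rather than as a learned feature, usually means a filter update of the following kind. The sketch is a scalar Kalman measurement update; the abstract does not state which estimator RadarMOT uses, so the function name and the numbers are assumptions for illustration only.

```python
def kalman_update(mean, var, z, z_var):
    """Fuse a predicted state (mean, var) with one measurement z of variance z_var."""
    gain = var / (var + z_var)           # trust the measurement more when var is large
    new_mean = mean + gain * (z - mean)  # move the estimate toward the measurement
    new_var = (1.0 - gain) * var         # uncertainty always shrinks after an update
    return new_mean, new_var

# A track predicted at 80 m with variance 4.0 receives a radar range of 82 m
# with variance 1.0 (all values are made up for the example).
mean, var = kalman_update(80.0, 4.0, 82.0, 1.0)
```

Because the update needs only the measurement and its variance, it can still run in frames where the image or LiDAR detector returns nothing for the object.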

cs.CV / 51 / 2604.13581

SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance

SocialMirror：基于单目视频的3D人类互动行为重建，结合语义和几何指导

Xia, Qi, Cong, Peishan, Wang, Ziyi, Sun, Yujing, Sun, Qin, Zhu, Xinge, Ye, Mao, Yang, Ruigang, Ma, Yuexin

Abstract

Accurately reconstructing human behavior in close-interaction scenarios is crucial for enabling realistic virtual interactions in augmented reality, precise motion analysis in sports, and natural collaborative behavior in human-robot tasks. Reliable reconstruction in these contexts significantly enhances the realism and effectiveness of AI-driven interactive applications. However, human reconstruction from monocular videos in close-interaction scenarios remains challenging due to severe mutual occlusions, which lead to local motion ambiguity, disrupted temporal continuity, and spatial relationship errors. In this paper, we propose SocialMirror, a diffusion-based framework that integrates semantic and geometric cues to effectively address these issues. Specifically, we first leverage high-level interaction descriptions generated by a vision-language model to guide a semantic-guided motion infiller, hallucinating occluded bodies and resolving local pose ambiguities. Next, we propose a sequence-level temporal refiner that enforces smooth, jitter-free motions, while incorporating geometric constraints during sampling to ensure plausible contact and spatial relationships. Evaluations on multiple interaction benchmarks show that SocialMirror achieves state-of-the-art performance in reconstructing interactive human meshes, demonstrating strong generalization across unseen datasets and in-the-wild scenarios. The code will be released upon publication.

Chinese Translation

在紧密互动场景中准确重建人类行为对于实现增强现实中的真实虚拟互动、体育运动中的精确动作分析以及人机任务中的自然协作行为至关重要。在这些背景下，可靠的重建显著增强了基于人工智能的互动应用的真实感和有效性。然而，由于严重的相互遮挡导致局部运动模糊、时间连续性中断以及空间关系错误，从单目视频中重建人类行为在紧密互动场景中仍然具有挑战性。本文提出了SocialMirror，一个基于扩散的框架，整合了语义和几何线索，以有效解决这些问题。具体而言，我们首先利用视觉-语言模型生成的高层次互动描述来指导语义引导的运动填充器，幻觉出被遮挡的身体并解决局部姿态模糊。接下来，我们提出了一种序列级时间细化器，强制实现平滑、无抖动的运动，同时在采样过程中结合几何约束，以确保合理的接触和空间关系。在多个互动基准上的评估表明，SocialMirror在重建互动人类网格方面达到了最先进的性能，展示了在未见数据集和真实场景中的强泛化能力。代码将在发表后发布。

cs.CV / 52 / 2604.13586

Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning

通过动态标记选择和微调实现高效的多视角三维物体检测

Nazir, Danish, Hanna-Asaad, Antoine, Görnhardt, Lucas, Piewek, Jan, Bagdonat, Thorsten, Fingscheidt, Tim

Abstract

Existing multi-view three-dimensional (3D) object detection approaches widely adopt large-scale pre-trained vision transformer (ViT)-based foundation models as backbones, which makes them computationally expensive. To address this problem, the current state-of-the-art (SOTA) method ToC3D for efficient multi-view ViT-based 3D object detection employs ego-motion-based relevant token selection. However, it has two key limitations: (1) The fixed layer-individual token selection ratios limit computational efficiency during both training and inference. (2) Full end-to-end retraining of the ViT backbone is required for the multi-view 3D object detection method. In this work, we propose an image token compensator combined with a token selection for ViT backbones to accelerate multi-view 3D object detection. Unlike ToC3D, our approach enables dynamic layer-wise token selection within the ViT backbone. Furthermore, we introduce a parameter-efficient fine-tuning strategy, which trains only the proposed modules, thereby reducing the number of fine-tuned parameters from more than 300 million (M) to only 1.6 M. Experiments on the large-scale NuScenes dataset across three multi-view 3D object detection approaches demonstrate that our proposed method decreases computational complexity (GFLOPs) by 48% to 55% and inference latency (on an NVIDIA-GV100 GPU) by 9% to 25%, while still improving mean average precision by 1.0% to 2.8% absolute and NuScenes detection score by 0.4% to 1.2% absolute compared to the previous SOTA ToC3D.

Chinese Translation

现有的多视角三维（3D）物体检测方法广泛采用基于大规模预训练视觉变换器（ViT）的基础模型作为骨干网络，这导致计算复杂度较高。为了解决这一问题，目前最先进的（SOTA）ToC3D方法通过基于自我运动的相关标记选择实现高效的多视角ViT基础3D物体检测。然而，该方法存在两个主要局限性：（1）固定的层级个体标记选择比例限制了训练和推理过程中的计算效率。（2）多视角3D物体检测方法需要对ViT骨干网络进行全面的端到端重训练。在本研究中，我们提出了一种图像标记补偿器，并结合ViT骨干网络的标记选择，以加速多视角3D物体检测。与ToC3D不同，我们的方法允许在ViT骨干网络内进行动态层级标记选择。此外，我们引入了一种参数高效的微调策略，仅训练所提出的模块，从而将微调参数的数量从超过3亿（M）减少到仅1.6百万（M）。在大规模NuScenes数据集上进行的三种多视角3D物体检测方法的实验表明，我们提出的方法将计算复杂度（GFLOPs）降低了48%至55%，推理延迟（在NVIDIA-GV100 GPU上）降低了9%至25%，同时相比于目前的SOTA ToC3D，平均精度提高了1.0%至2.8%绝对值，NuScenes检测分数提高了0.4%至1.2%绝对值。

cs.CV / 53 / 2604.13589

Dehaze-then-Splat: Generative Dehazing with Physics-Informed 3D Gaussian Splatting for Smoke-Free Novel View Synthesis

去雾再渲染：基于物理信息的3D高斯点云生成去雾用于无烟新视图合成

Chen, Yuchao, Wang, Hanqing

Abstract

We present Dehaze-then-Splat, a two-stage pipeline for multi-view smoke removal and novel view synthesis developed for Track 2 of the NTIRE 2026 3D Restoration and Reconstruction Challenge. In the first stage, we produce pseudo-clean training images via per-frame generative dehazing using Nano Banana Pro, followed by brightness normalization. In the second stage, we train 3D Gaussian Splatting (3DGS) with physics-informed auxiliary losses -- depth supervision via Pearson correlation with pseudo-depth, dark channel prior regularization, and dual-source gradient matching -- that compensate for cross-view inconsistencies inherent in frame-wise generative processing. We identify a fundamental tension in dehaze-then-reconstruct pipelines: per-image restoration quality does not guarantee multi-view consistency, and such inconsistency manifests as blurred renders and structural instability in downstream 3D reconstruction. Our analysis shows that MCMC-based densification with early stopping, combined with depth and haze-suppression priors, effectively mitigates these artifacts. On the Akikaze validation scene, our pipeline achieves 20.98 dB PSNR and 0.683 SSIM for novel view synthesis, a +1.50 dB improvement over the unregularized baseline.

Chinese Translation

我们提出了去雾再渲染（Dehaze-then-Splat），这是一个用于多视图去烟和新视图合成的两阶段管道，旨在参加NTIRE 2026 3D恢复与重建挑战的Track 2。在第一阶段，我们通过使用Nano Banana Pro进行逐帧生成去雾，生成伪干净的训练图像，并进行亮度归一化。在第二阶段，我们使用物理信息辅助损失训练3D高斯点云（3D Gaussian Splatting，3DGS）——通过与伪深度的Pearson相关性进行深度监督、暗通道先验正则化和双源梯度匹配——来补偿逐帧生成处理固有的视图间不一致性。我们识别出去雾再重建管道中的一个基本矛盾：每幅图像的恢复质量并不能保证多视图一致性，而这种不一致性表现为模糊的渲染和下游3D重建中的结构不稳定。我们的分析表明，基于MCMC的密集化结合早期停止，以及深度和去雾抑制先验，有效减轻了这些伪影。在Akikaze验证场景中，我们的管道在新视图合成中达到了20.98 dB的PSNR和0.683的SSIM，相较于未正则化基线提高了1.50 dB。
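
Depth supervision through Pearson correlation is attractive because pseudo-depth from a monocular estimator is typically known only up to scale and shift. A minimal sketch of such a loss follows; the function name, the flattened-list input, and the small `eps` stabilizer are assumptions, and the paper's exact formulation may differ.

```python
import math

def pearson_depth_loss(rendered, pseudo):
    """1 - Pearson correlation between two flattened depth maps."""
    n = len(rendered)
    mean_r = sum(rendered) / n
    mean_p = sum(pseudo) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(rendered, pseudo))
    var_r = sum((r - mean_r) ** 2 for r in rendered)
    var_p = sum((p - mean_p) ** 2 for p in pseudo)
    eps = 1e-12  # avoids division by zero on constant depth maps
    corr = cov / math.sqrt(var_r * var_p + eps)
    return 1.0 - corr

rendered = [1.0, 2.0, 3.0, 4.0]
loss_affine = pearson_depth_loss(rendered, [10.0, 30.0, 50.0, 70.0])    # scaled and shifted copy
loss_reversed = pearson_depth_loss(rendered, [70.0, 50.0, 30.0, 10.0])  # opposite depth ordering
```

The loss is near zero for any positively scaled and shifted copy of the rendered depth and reaches its maximum of 2 when the depth ordering is inverted.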

cs.CV / 54 / 2604.13596

VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

VGGT-Segmentor：几何增强的跨视角分割

Gao, Yulu, Zhang, Bohao, Tang, Zongheng, Liao, Jitong, Wu, Wenjun, Liu, Si

Abstract

Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach.

Chinese Translation

在不同的自我中心和外部视角下进行实例级物体分割是视觉理解中的一项基本挑战，对于具身人工智能和远程协作的应用至关重要。由于尺度、视角和遮挡的剧烈变化，这项任务异常困难，这使得直接的像素级匹配不稳定。尽管近期的几何感知模型如VGGT为特征对齐提供了坚实的基础，但我们发现它们在密集预测任务中常常失败，原因在于显著的像素级投影漂移，即使它们的内部物体级注意力保持一致。为了解决这一问题，我们提出了VGGT-Segmentor（VGGT-S），一个将强大的几何建模与像素精确的语义分割相结合的框架。VGGT-S利用VGGT强大的跨视角特征表示，并引入了一种新颖的联合分割头。该分割头分为三个阶段：掩膜提示融合、点引导预测和迭代掩膜细化，有效地将高层次特征对齐转化为精确的分割掩膜。此外，我们提出了一种单图像自监督训练策略，消除了对配对注释的需求，并实现了强大的泛化能力。在Ego-Exo4D基准测试中，VGGT-S设定了新的最先进水平，分别在Ego到Exo和Exo到Ego任务中实现了67.7%和68.0%的平均IoU，显著超越了之前的方法。值得注意的是，我们的无对应预训练模型超过了大多数完全监督的基线，证明了我们方法的有效性和可扩展性。
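
The reported scores are average IoU values. For readers unfamiliar with the metric, a minimal sketch on flattened binary masks is given below; the helper name and the convention of returning 1.0 when both masks are empty are assumptions, and the benchmark's official evaluation code may handle edge cases differently.

```python
def mask_iou(pred, target):
    """Intersection over union of two flattened binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0

iou_partial = mask_iou([1, 1, 0, 0], [0, 1, 1, 0])  # one shared pixel out of three covered
iou_perfect = mask_iou([1, 0, 1], [1, 0, 1])
```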

cs.CV / 55 / 2604.13610

What Are We Really Measuring? Rethinking Dataset Bias in Web-Scale Natural Image Collections via Unsupervised Semantic Clustering

我们到底在测量什么？通过无监督语义聚类重新思考网络规模自然图像集合中的数据集偏差

Saleknia, Amir Hossein, Sabokrou, Mohammad

Abstract

In computer vision, a prevailing method for quantifying dataset bias is to train a model to distinguish between datasets. High classification accuracy is then interpreted as evidence of meaningful semantic differences. This approach assumes that standard image augmentations successfully suppress low-level, non-semantic cues, and that any remaining performance must therefore reflect true semantic divergence. We demonstrate that this fundamental assumption is flawed within the domain of large-scale natural image collections. High classification accuracy is often driven by resolution-based artifacts, which are structural fingerprints arising from native image resolution distributions and interpolation effects during resizing. These artifacts form robust, dataset-specific signatures that persist despite conventional image corruptions. Through controlled experiments, we show that models achieve strong dataset classification even on non-semantic, procedurally generated images, proving their reliance on superficial cues. To address this issue, we revisit this decades-old idea of dataset separability, but not with supervised classification. Instead, we introduce an unsupervised approach that measures true semantic separability. Our framework directly assesses semantic similarity by clustering semantically-rich features from foundational vision models, deliberately bypassing supervised classification on dataset labels. When applied to major web-scale datasets, the primary focus of this work, the high separability reported by supervised methods largely vanishes, with clustering accuracy dropping to near-chance levels. This reveals that conventional classification-based evaluation systematically overstates semantic bias by an overwhelming margin.

Chinese Translation

在计算机视觉领域，量化数据集偏差的一种普遍方法是训练模型以区分不同的数据集。高分类准确率被解释为有意义的语义差异的证据。这种方法假设标准图像增强能够有效抑制低级非语义线索，因此任何剩余的性能必须反映真实的语义差异。我们证明，这一基本假设在大规模自然图像集合的领域中是有缺陷的。高分类准确率往往是由基于分辨率的伪影驱动的，这些伪影是由原始图像分辨率分布和缩放过程中的插值效应所产生的结构性指纹。这些伪影形成了稳健的数据集特征，即使在常规图像损坏的情况下也会持续存在。通过控制实验，我们展示了模型即使在非语义的程序生成图像上也能实现强大的数据集分类，证明了它们对表面线索的依赖。为了解决这个问题，我们重新审视这个已有数十年的数据集可分离性概念，但不是通过监督分类。相反，我们引入了一种无监督的方法来测量真实的语义可分离性。我们的框架通过聚类来自基础视觉模型的语义丰富特征，直接评估语义相似性，故意绕过对数据集标签的监督分类。当应用于主要的网络规模数据集时，这项工作的主要焦点，监督方法报告的高可分离性在很大程度上消失，聚类准确率降至接近偶然水平。这揭示了传统的基于分类的评估系统性地夸大了语义偏差的程度。
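
Clustering accuracy, the quantity said to fall to near-chance levels, has to be computed up to a relabeling of clusters, since cluster indices are arbitrary. The sketch below tries every one-to-one mapping, which is only practical for a handful of datasets; the function name is hypothetical, the abstract does not specify the exact matching procedure, and larger problems would normally use Hungarian matching instead of brute force.

```python
from itertools import permutations

def clustering_accuracy(cluster_ids, dataset_ids, k):
    """Best accuracy over all one-to-one mappings from cluster index to dataset index."""
    best = 0
    for perm in permutations(range(k)):
        hits = sum(1 for c, d in zip(cluster_ids, dataset_ids) if perm[c] == d)
        best = max(best, hits)
    return best / len(dataset_ids)

dataset_ids = [0, 0, 0, 0, 1, 1, 1, 1]
acc_separable = clustering_accuracy([1, 1, 1, 1, 0, 0, 0, 0], dataset_ids, k=2)
acc_mixed = clustering_accuracy([0, 1, 0, 1, 0, 1, 0, 1], dataset_ids, k=2)
```

The first clustering recovers the datasets perfectly despite swapped labels, while the second splits every dataset evenly and scores 0.5, the chance level for two balanced datasets.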

cs.CV / 56 / 2604.13633

ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation

ESCAPE：用于长时间移动操作的情节空间记忆与自适应执行策略

Qian, Jingjing, He, Zeyuan, Shi, Chen, Xiao, Lei, Jiang, Li

Abstract

Coordinating navigation and manipulation with robust performance is essential for embodied AI in complex indoor environments. However, as tasks extend over long horizons, existing methods often struggle due to catastrophic forgetting, spatial inconsistency, and rigid execution. To address these issues, we propose ESCAPE (Episodic Spatial Memory Coupled with an Adaptive Policy for Execution), operating through a tightly coupled perception-grounding-execution workflow. For robust perception, ESCAPE features a Spatio-Temporal Fusion Mapping module to autoregressively construct a depth-free, persistent 3D spatial memory, alongside a Memory-Driven Target Grounding module for precise interaction mask generation. To achieve flexible action, our Adaptive Execution Policy dynamically orchestrates proactive global navigation and reactive local manipulation to seize opportunistic targets. ESCAPE achieves state-of-the-art performance on the ALFRED benchmark, reaching 65.09% and 60.79% success rates in test seen and unseen environments with step-by-step instructions. By reducing redundant exploration, our ESCAPE attains substantial improvements in path-length-weighted metrics and maintains robust performance (61.24% / 56.04%) even without detailed guidance for long-horizon tasks.

Chinese Translation

在复杂的室内环境中，协调导航与操作并保持稳健的性能对于具身人工智能至关重要。然而，随着任务的延续时间变长，现有方法往往因灾难性遗忘、空间不一致性和执行僵化而面临挑战。为了解决这些问题，我们提出了ESCAPE（情节空间记忆与自适应执行策略），通过紧密耦合的感知-基础-执行工作流程进行操作。为了实现稳健的感知，ESCAPE具备一个时空融合映射模块，能够自回归地构建无深度的持久3D空间记忆，同时配备一个基于记忆的目标基础模块，用于精确生成交互掩模。为了实现灵活的行动，我们的自适应执行策略动态协调主动的全局导航和反应的局部操作，以抓住机会目标。ESCAPE在ALFRED基准测试中实现了最先进的性能，在测试的已见和未见环境中，逐步指令下的成功率分别达到了65.09%和60.79%。通过减少冗余探索，我们的ESCAPE在路径长度加权指标上取得了显著改善，即使在没有详细指导的情况下，对于长时间任务仍保持稳健的性能（61.24% / 56.04%）。
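
Path-length-weighted metrics reward agents that succeed without wandering. The sketch below uses the weighting commonly associated with the ALFRED benchmark, where a success is scaled by the ratio of the expert path length to the longer of the expert and agent paths; treat the exact formula and the function name as assumptions to be checked against the benchmark's evaluation code.

```python
def path_length_weighted(success, expert_len, agent_len):
    """Scale a success indicator by how efficient the agent's path was."""
    # The denominator never drops below the expert length, so a shorter
    # agent path cannot push the score above the raw success value.
    return success * expert_len / max(expert_len, agent_len)

score_detour = path_length_weighted(1, expert_len=40, agent_len=50)
score_direct = path_length_weighted(1, expert_len=40, agent_len=30)
score_failed = path_length_weighted(0, expert_len=40, agent_len=35)
```

Under this weighting, cutting redundant exploration raises the metric even when the raw success rate stays the same.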

cs.CV / 57 / 2604.13660

VRAG-DFD: Verifiable Retrieval-Augmentation for MLLM-based Deepfake Detection

VRAG-DFD：基于可验证检索增强的多模态大语言模型深度伪造检测

Han, Hui, Wang, Shunli, Zhao, Yandan, Yao, Taiping, Ding, Shouhong

Abstract

In Deepfake Detection (DFD) tasks, researchers have proposed two types of MLLM-based methods: complementary combination with small DFD detectors, or static forgery knowledge injection. The lack of professional forgery knowledge hinders the performance of these DFD-MLLMs. To solve this, we consider two questions: how to provide high-quality associated forgery knowledge for MLLMs, and how to endow MLLMs with critical reasoning abilities given noisy reference information. We attempt to give preliminary answers to these two questions by combining Retrieval-Augmented Generation (RAG) and Reinforcement Learning (RL). Through RAG and RL techniques, we propose the VRAG-DFD framework with accurate dynamic forgery knowledge retrieval and powerful critical reasoning capabilities. Specifically, in terms of data, we constructed two datasets with RAG: the Forensic Knowledge Database (FKD) for DFD knowledge annotation, and the Forensic Chain-of-Thought Dataset (F-CoT) for critical CoT construction. In terms of model training, we adopt a three-stage training method (Alignment->SFT->GRPO) to gradually cultivate the critical reasoning ability of the MLLM. In terms of performance, VRAG-DFD achieved SOTA and competitive performance on DFD generalization testing.

Chinese Translation

在深度伪造检测（DFD）任务中，研究人员提出了两种基于多模态大语言模型（MLLM）的方法：与小型DFD检测器的互补组合，或静态伪造知识注入。缺乏专业的伪造知识限制了这些DFD-MLLM的性能。为了解决这个问题，我们深入考虑了两个重要问题：如何为MLLM提供高质量的相关伪造知识？以及如何在嘈杂的参考信息下赋予MLLM关键的推理能力？值得注意的是，我们尝试通过结合检索增强生成（RAG）和强化学习（RL）来初步回答上述两个问题。通过RAG和RL技术，我们提出了VRAG-DFD框架，具备准确的动态伪造知识检索和强大的关键推理能力。具体而言，在数据方面，我们构建了两个数据集：用于DFD知识标注的法医学知识数据库（FKD），以及用于关键思维链构建的法医学思维链数据集（F-CoT）。在模型训练方面，我们采用三阶段训练方法（对齐->SFT->GRPO）逐步培养MLLM的关键推理能力。在性能方面，VRAG-DFD在DFD泛化测试中实现了最先进（SOTA）和具有竞争力的性能。
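
The final GRPO stage relies on comparing several sampled responses to the same prompt. The core of that comparison, the group-relative advantage, can be sketched as follows. The function name, the use of the population standard deviation, and the `eps` value are assumptions; implementations differ on these details, and the abstract does not describe the reward design.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every sample receives the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers for one image: 1.0 = correct verdict, 0.0 = wrong verdict.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group average get a positive advantage and are reinforced, and no separate value network is needed to provide the baseline.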

cs.CV / 58 / 2604.13667

From Pixels to Nucleotides: End-to-End Token-Based Video Compression for DNA Storage

从像素到核苷酸：用于DNA存储的端到端基于令牌的视频压缩

Ruan, Cihan, Zhou, Lebin, Zhao, Bingqing, Han, Rongduo, Yuan, Qiming, Zhu, Chenchen, Han, Linyi, Yang, Liang, Wang, Wei, Jiang, Wei, Ling, Nam

Abstract

DNA-based storage has emerged as a promising approach to the global data crisis, offering molecular-scale density and millennial-scale stability at low maintenance cost. Over the past decade, substantial progress has been made in storing text, images, and files in DNA -- yet video remains an open challenge. The difficulty is not merely technical: effective video DNA storage requires co-designing compression and molecular encoding from the ground up, a challenge that sits at the intersection of two fields that have largely evolved independently. In this work, we present HELIX, the first end-to-end neural network jointly optimizing video compression and DNA encoding -- prior approaches treat the two stages independently, leaving biochemical constraints and compression objectives fundamentally misaligned. Our key insight: token-based representations naturally align with DNA's quaternary alphabet -- discrete semantic units map directly to ATCG bases. We introduce TK-SCONE (Token-Kronecker Structured Constraint-Optimized Neural Encoding), which achieves 1.91 bits per nucleotide through Kronecker-structured mixing that breaks spatial correlations and FSM-based mapping that guarantees biochemical constraints. Unlike two-stage approaches, HELIX learns token distributions simultaneously optimized for visual quality, prediction under masking, and DNA synthesis efficiency. This work demonstrates for the first time that learned compression and molecular storage converge naturally at token representations -- suggesting a new paradigm where neural video codecs are designed for biological substrates from the ground up.

Chinese Translation

基于DNA的存储作为应对全球数据危机的一种有前景的方法，提供了分子级别的密度和千年级别的稳定性，同时维护成本低廉。在过去的十年中，在DNA中存储文本、图像和文件方面取得了显著进展——然而视频存储仍然是一个未解决的挑战。这一困难不仅仅是技术性的：有效的视频DNA存储需要从头开始共同设计压缩和分子编码，这一挑战位于两个在很大程度上独立发展的领域的交汇处。在本研究中，我们提出了HELIX，这是第一个端到端神经网络，联合优化视频压缩和DNA编码——以往的方法将这两个阶段视为独立，导致生化约束和压缩目标在根本上不一致。我们的关键见解是：基于令牌的表示自然与DNA的四元字母表对齐——离散的语义单元直接映射到ATCG碱基。我们引入了TK-SCONE（基于令牌的Kronecker结构约束优化神经编码），通过Kronecker结构混合打破空间相关性，并通过基于有限状态机（FSM）的映射保证生化约束，实现了每个核苷酸1.91比特的压缩效率。与两阶段方法不同，HELIX同时学习令牌分布，优化视觉质量、遮蔽下的预测和DNA合成效率。本研究首次证明了学习的压缩和分子存储在令牌表示上自然收敛——这表明了一种新的范式，即神经视频编解码器从根本上为生物基质设计。
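
To illustrate how a finite-state mapping can guarantee a biochemical constraint, here is the classic rotation code, which forbids repeated bases by letting each ternary symbol pick one of the three bases that differ from the previous one. This is a well-known baseline and not TK-SCONE, whose state machine is not described in the abstract; the function names and the virtual start base are assumptions.

```python
BASES = "ACGT"

def trits_to_dna(trits, start="A"):
    """Encode ternary symbols so that no base ever repeats its predecessor."""
    prev, out = start, []
    for t in trits:
        choices = [b for b in BASES if b != prev]  # three allowed next bases
        prev = choices[t]
        out.append(prev)
    return "".join(out)

def dna_to_trits(dna, start="A"):
    """Invert trits_to_dna by replaying the same state machine."""
    prev, out = start, []
    for b in dna:
        choices = [c for c in BASES if c != prev]
        out.append(choices.index(b))
        prev = b
    return out

strand = trits_to_dna([0, 0, 2, 1, 0])
decoded = dna_to_trits(strand)
```

This baseline carries log2(3), roughly 1.58 bits per nucleotide, whereas an unconstrained four-letter alphabet tops out at 2 bits; the 1.91 bits per nucleotide reported in the abstract is about 95.5% of that ceiling.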

cs.CV / 59 / 2604.13688

Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data

超越体素的三维编辑：从三维掩模和自构建数据中学习

Xu, Yizhao, Zhu, Hongyuan, Liu, Caiyun, Wang, Tianfu, Chen, Keyu, Xu, Sicheng, Yang, Jiaolong, Yuan, Nicholas Jing, Zhang, Qi

Abstract

3D editing refers to the ability to apply local or global modifications to 3D assets. Effective 3D editing requires maintaining semantic consistency by performing localized changes according to prompts, while also preserving local invariance so that unchanged regions remain consistent with the original. However, existing approaches have significant limitations: multi-view editing methods incur losses when projecting back to 3D, while voxel-based editing is constrained in both the regions that can be modified and the scale of modifications. Moreover, the lack of sufficiently large editing datasets for training and evaluation remains a challenge. To address these challenges, we propose a Beyond Voxel 3D Editing (BVE) framework with a self-constructed large-scale dataset specifically tailored for 3D editing. Building upon this dataset, our model enhances a foundational image-to-3D generative architecture with lightweight, trainable modules, enabling efficient injection of textual semantics without the need for expensive full-model retraining. Furthermore, we introduce an annotation-free 3D masking strategy to preserve local invariance, maintaining the integrity of unchanged regions during editing. Extensive experiments demonstrate that BVE achieves superior performance in generating high-quality, text-aligned 3D assets, while faithfully retaining the visual characteristics of the original input.

Chinese Translation

三维编辑是指对三维资产进行局部或全局修改的能力。有效的三维编辑需要通过根据提示执行局部更改来保持语义一致性，同时保持局部不变性，以确保未更改的区域与原始内容保持一致。然而，现有方法存在显著的局限性：多视角编辑方法在投影回三维时会产生损失，而基于体素的编辑在可修改区域和修改规模上受到限制。此外，缺乏足够大的编辑数据集用于训练和评估仍然是一个挑战。为了解决这些问题，我们提出了一个超越体素的三维编辑（BVE）框架，并自构建了一个专门针对三维编辑的大规模数据集。在此数据集的基础上，我们的模型增强了一个基础的图像到三维生成架构，加入了轻量级、可训练的模块，使得在不需要昂贵的全模型重新训练的情况下高效地注入文本语义。此外，我们引入了一种无注释的三维掩模策略，以保持局部不变性，在编辑过程中维护未更改区域的完整性。大量实验表明，BVE在生成高质量、与文本对齐的三维资产方面表现优越，同时忠实保留了原始输入的视觉特征。

cs.CV / 60 / 2604.13695

Med-CAM: Minimal Evidence for Explaining Medical Decision Making

Med-CAM：对医疗决策制定的最小证据解释

Suhail, Pirzada, Anand, Aditya, Sethi, Amit

Abstract

Reliable and interpretable decision-making is essential in medical imaging, where diagnostic outcomes directly influence patient care. Despite advances in deep learning, most medical AI systems operate as opaque black boxes, providing little insight into why a particular diagnosis was reached. In this paper, we introduce Med-CAM, a framework for generating minimal and sharp maps as evidence-based explanations for Medical decision making via Classifier Activation Matching. Med-CAM trains a segmentation network from scratch to produce a mask that highlights the minimal evidence critical to the model's decision for any seen or unseen image. This ensures that the explanation is both faithful to the network's behaviour and interpretable to clinicians. Experiments show that, unlike prior spatial explanation methods such as Grad-CAM and attention maps, which yield only fuzzy regions of relative importance, Med-CAM, with its superior spatial awareness of shapes, textures, and boundaries, delivers conclusive, evidence-based explanations that faithfully replicate the model's prediction for any given image. By explicitly constraining explanations to be compact, consistent with model activations, and diagnostically aligned, Med-CAM advances transparent AI to foster clinician understanding and trust in high-stakes medical applications such as pathology and radiology.

Chinese Translation

在医学影像中，可靠且可解释的决策制定至关重要，因为诊断结果直接影响患者护理。尽管深度学习取得了进展，但大多数医疗人工智能系统仍然作为不透明的黑箱运作，几乎无法提供关于为何得出特定诊断的见解。本文介绍了Med-CAM，一个通过分类器激活匹配生成最小且清晰的图谱作为基于证据的医疗决策解释的框架。Med-CAM从头开始训练一个分割网络，以生成一个掩膜，突出显示对模型在任何已见或未见图像上的决策至关重要的最小证据。这确保了解释既忠实于网络的行为，又便于临床医生理解。实验表明，与先前的空间解释方法（如Grad-CAM和注意力图）不同，这些方法仅产生模糊的相对重要性区域，Med-CAM凭借其对形状、纹理和边界的卓越空间感知，提供了确凿的、基于证据的解释，忠实地再现了模型对任何给定图像的预测。通过明确限制解释为紧凑、与模型激活一致且符合诊断的，Med-CAM推动了透明人工智能的发展，以促进临床医生在病理学和放射学等高风险医疗应用中的理解和信任。

cs.CV / 61 / 2604.13710

SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

SLQ：通过共享潜在查询连接模态以实现冻结的多模态大语言模型检索

Lou, Haoran, Liu, Ziyan, Fan, Chunxiao, Wu, Yuexin, Ming, Yue

Abstract

Multimodal Large Language Models (MLLMs) exhibit strong reasoning and world knowledge, yet adapting them for retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. In this work, we argue that adapting MLLMs for retrieval should focus on eliciting pre-trained representations rather than overwriting them. To this end, we propose SLQ, an effective and efficient framework that adapts a frozen MLLM into a retriever through a small set of Shared Latent Queries. Appended to the end of both text and image token sequences, these queries leverage the model's native causal attention to serve as global aggregation interfaces, producing compact embeddings in a unified space while keeping the backbone unchanged. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench. The results demonstrate that SLQ, which preserves pre-trained representations, provides an effective and efficient framework for adapting MLLMs to retrieval.

Chinese Translation

多模态大语言模型（MLLMs）展现出强大的推理能力和世界知识，但将其适应于检索仍然具有挑战性。现有的方法依赖于侵入性的参数更新，如完全微调和LoRA，这可能会破坏预训练的语义空间，并损害推理所需的结构化知识。在本研究中，我们认为将MLLMs适应于检索应侧重于引出预训练表示，而非覆盖它们。为此，我们提出了SLQ，一个有效且高效的框架，通过一小组共享潜在查询将冻结的MLLM转化为检索器。这些查询附加在文本和图像令牌序列的末尾，利用模型的原生因果注意力作为全局聚合接口，在统一空间中生成紧凑的嵌入，同时保持主干不变。此外，为了更好地评估检索超越表面模式匹配，我们构建了KARR-Bench，这是一个专为知识感知推理检索设计的基准。大量实验表明，SLQ在COCO和Flickr30K上优于完全微调和LoRA，同时在MMEB上表现出竞争力，并在KARR-Bench上取得显著提升。结果表明，SLQ保留了预训练表示，为将MLLMs适应于检索提供了一个有效且高效的框架。
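
Why appending the queries at the end matters can be seen from the causal attention mask alone. The toy sketch below builds such a mask for a sequence of content tokens followed by latent queries; the sizes and helper name are made up, and nothing here reflects the actual SLQ implementation.

```python
def causal_mask(length):
    """mask[i][j] == 1 when position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(length)] for i in range(length)]

n_tokens, n_queries = 4, 2  # illustrative sizes only
mask = causal_mask(n_tokens + n_queries)
query_rows = mask[n_tokens:]   # attention pattern of the appended queries
first_token_row = mask[0]      # attention pattern of the first content token
```

Every appended query can attend to all content tokens, so it can aggregate the whole input, while the content tokens never attend to the queries and their computation stays exactly as in the frozen model.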

cs.CV / 62 / 2604.13722

Granularity-Aware Transfer for Tree Instance Segmentation in Synthetic and Real Forests

基于粒度感知的树实例分割在合成与真实森林中的迁移

Deoli, Pankaj, Tej, Atef, Ashri, Anmol, JS, Anandatirtha, Berns, Karsten

Abstract

We address the challenge of synthetic-to-real transfer in forestry perception where real data have only coarse Tree labels while synthetic data provide fine-grained trunk/crown annotations. We introduce MGTD, a mixed-granularity dataset with 53k synthetic and 3.6k real images, and a four-stage protocol isolating domain shift and granularity mismatch. Our core contribution is granularity-aware distillation, which transfers structural priors from fine-grained synthetic teachers to a coarse-label student via logit-space merging and mask unification. Experiments show consistent mask AP gains, especially for small/distant trees, establishing a testbed for Sim-Real transfer under label granularity constraints.

Chinese Translation

我们解决了森林感知中合成到真实迁移的挑战，其中真实数据仅具有粗略的树标签，而合成数据提供细粒度的树干/树冠注释。我们引入了MGTD，一个混合粒度数据集，包含53,000张合成图像和3,600张真实图像，以及一个四阶段协议，用于隔离领域转移和粒度不匹配。我们的核心贡献是粒度感知蒸馏，它通过对数空间合并和掩膜统一，将细粒度合成教师的结构先验转移到粗标签学生。实验表明，尤其是在小型/远距离树木的情况下，掩膜平均精度（mask AP）有一致的提升，建立了在标签粒度约束下进行Sim-Real迁移的测试平台。
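
One plausible reading of logit-space merging is that the fine-grained trunk and crown logits of a teacher are combined with a log-sum-exp, which makes the merged Tree probability equal to the sum of the two fine-grained probabilities. The sketch below demonstrates that identity; it is an assumption about what the term means, not the paper's stated procedure, and the class names and logit values are invented.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def softmax(logits):
    z = logsumexp(logits)
    return [math.exp(v - z) for v in logits]

fine_logits = {"background": 0.5, "trunk": 2.0, "crown": 1.0}
# Merge the two fine-grained classes into a single coarse "tree" logit.
coarse_logits = [fine_logits["background"],
                 logsumexp([fine_logits["trunk"], fine_logits["crown"]])]
p_fine = softmax(list(fine_logits.values()))
p_coarse = softmax(coarse_logits)
```

Merging before the softmax keeps the teacher's output a valid distribution over the coarse label set, so it can be compared directly with a student trained on Tree labels only.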

cs.CV / 63 / 2604.13730

ReConText3D: Replay-based Continual Text-to-3D Generation

ReConText3D：基于重放的持续文本到3D生成

Khan, Muhammad Ahmed Ullah, Amir, Muhammad Haris Bin, Stricker, Didier, Afzal, Muhammad Zeshan

Abstract

Continual learning enables models to acquire new knowledge over time while retaining previously learned capabilities. However, its application to text-to-3D generation remains unexplored. We present ReConText3D, the first framework for continual text-to-3D generation. We first demonstrate that existing text-to-3D models suffer from catastrophic forgetting under incremental training. ReConText3D enables generative models to incrementally learn new 3D categories from textual descriptions while preserving the ability to synthesize previously seen assets. Our method constructs a compact and diverse replay memory through text-embedding k-Center selection, allowing representative rehearsal of prior knowledge without modifying the underlying architecture. To systematically evaluate continual text-to-3D learning, we introduce Toys4K-CL, a benchmark derived from the Toys4K dataset that provides balanced and semantically diverse class-incremental splits. Extensive experiments on the Toys4K-CL benchmark show that ReConText3D consistently outperforms all baselines across different generative backbones, maintaining high-quality generation for both old and new classes. To the best of our knowledge, this work establishes the first continual learning framework and benchmark for text-to-3D generation, opening a new direction for incremental 3D generative modeling. Project page is available at: https://mauk95.github.io/ReConText3D/.

Chinese Translation

持续学习使模型能够随着时间的推移获取新知识，同时保留先前学习的能力。然而，其在文本到3D生成中的应用仍未被探索。我们提出了ReConText3D，这是第一个用于持续文本到3D生成的框架。我们首先证明现有的文本到3D模型在增量训练下会遭遇灾难性遗忘。ReConText3D使生成模型能够从文本描述中增量学习新的3D类别，同时保持合成先前见过的资产的能力。我们的方法通过文本嵌入k-Center选择构建了一个紧凑且多样的重放记忆，允许在不修改基础架构的情况下代表性地回顾先前的知识。为了系统地评估持续文本到3D学习，我们引入了Toys4K-CL，这是一个源自Toys4K数据集的基准，提供了平衡且语义多样的类别增量拆分。在Toys4K-CL基准上的大量实验表明，ReConText3D在不同的生成骨干网络中始终优于所有基线，保持了对旧类别和新类别的高质量生成。根据我们所知，这项工作建立了第一个用于文本到3D生成的持续学习框架和基准，为增量3D生成建模开辟了新的方向。项目页面可访问：https://mauk95.github.io/ReConText3D/
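
Selecting a replay memory by k-Center means picking exemplars that cover the embedding space as evenly as possible. A standard greedy approximation is sketched below on toy 2D points standing in for text embeddings; the function name, the starting index, and the Euclidean distance are assumptions, since the abstract does not give these details.

```python
import math

def k_center_greedy(points, k, first=0):
    """Greedy farthest-point selection of k indices (assumes k <= len(points))."""
    selected = [first]
    # Distance from every point to its nearest selected center.
    dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: dist[i])
        selected.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return selected

# Two tight pairs and one isolated point; values are illustrative only.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0), (0.0, 5.0)]
memory = k_center_greedy(embeddings, k=3)
```

The selection takes one point from each of the three groups instead of spending two slots on near-duplicates, which is the diversity property a small replay buffer needs.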

cs.CV / 64 / 2604.13746

ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction

ClipGStream：用于任意长度和任意运动的多视角动态场景重建的剪辑流高斯喷溅

Liang, Jie, Wu, Jiahao, Wang, Chao, Yang, Jiayu, Zheng, Xiaoyun, Xiong, Kaiqiang, Wang, Zhanke, Yan, Jinbo, Gao, Feng, Wang, Ronggang

Abstract

Dynamic 3D scene reconstruction is essential for immersive media such as VR, MR, and XR, yet remains challenging for long multi-view sequences with large-scale motion. Existing dynamic Gaussian approaches are either Frame-Stream, offering scalability but poor temporal stability, or Clip, achieving local consistency at the cost of high memory and limited sequence length. We propose ClipGStream, a hybrid reconstruction framework that performs stream optimization at the clip level rather than the frame level. The sequence is divided into short clips, where dynamic motion is modeled using clip-independent spatio-temporal fields and residual anchor compensation to capture local variations efficiently, while inter-clip inherited anchors and decoders maintain structural consistency across clips. This Clip-Stream design enables scalable, flicker-free reconstruction of long dynamic videos with high temporal coherence and reduced memory overhead. Extensive experiments demonstrate that ClipGStream achieves state-of-the-art reconstruction quality and efficiency. The project page is available at: https://liangjie1999.github.io/ClipGStreamWeb/

Chinese Translation

动态3D场景重建对于虚拟现实（VR）、混合现实（MR）和扩展现实（XR）等沉浸式媒体至关重要，但对于长时间多视角序列和大规模运动的重建仍然具有挑战性。现有的动态高斯方法要么是帧流（Frame-Stream），提供可扩展性但时间稳定性差，要么是剪辑（Clip），在高内存和有限序列长度的代价下实现局部一致性。我们提出了ClipGStream，一种混合重建框架，在剪辑级别而非帧级别执行流优化。该序列被划分为短剪辑，其中动态运动使用剪辑独立的时空场和残差锚点补偿进行建模，以高效捕捉局部变化，而剪辑间继承的锚点和解码器则保持剪辑之间的结构一致性。这种剪辑流设计实现了可扩展的、无闪烁的长动态视频重建，具有高时间一致性和降低的内存开销。大量实验表明，ClipGStream在重建质量和效率上达到了最先进水平。项目页面可访问： https://liangjie1999.github.io/ClipGStreamWeb/

cs.CV / 65 / 2604.13761

Design and Behavior of Sparse Mixture-of-Experts Layers in CNN-based Semantic Segmentation

基于CNN的语义分割中稀疏专家混合层的设计与行为

Pavlitska, Svetlana, Fan, Haixi, Ditschuneit, Konstantin, Zöllner, J. Marius

Abstract

Sparse mixture-of-experts (MoE) layers have been shown to substantially increase model capacity without a proportional increase in computational cost and are widely used in transformer architectures, where they typically replace feed-forward network blocks. In contrast, integrating sparse MoE layers into convolutional neural networks (CNNs) remains inconsistent, with most prior work focusing on fine-grained MoEs operating at the filter or channel levels. In this work, we investigate a coarser, patch-wise formulation of sparse MoE layers for semantic segmentation, where local regions are routed to a small subset of convolutional experts. Through experiments on the Cityscapes and BDD100K datasets using encoder-decoder and backbone-based CNNs, we conduct a design analysis to assess how architectural choices affect routing dynamics and expert specialization. Our results demonstrate consistent, architecture-dependent improvements (up to +3.9 mIoU) with little computational overhead, while revealing strong design sensitivity. Our work provides empirical insights into the design and internal dynamics of sparse MoE layers in CNN-based dense prediction. Our code is available at https://github.com/KASTEL-MobilityLab/moe-layers/.

Chinese Translation

稀疏专家混合（MoE）层已被证明可以显著提高模型容量，而不会成比例地增加计算成本，并且在变换器架构中被广泛使用，通常替代前馈网络块。相比之下，将稀疏MoE层集成到卷积神经网络（CNN）中的研究仍然不一致，大多数先前的工作集中于在滤波器或通道级别操作的细粒度MoE。在本研究中，我们探讨了一种更粗糙的、基于补丁的稀疏MoE层的形式，用于语义分割，其中局部区域被路由到一小部分卷积专家。通过在Cityscapes和BDD100K数据集上使用编码器-解码器和基于骨干网络的CNN进行实验，我们进行了设计分析，以评估架构选择如何影响路由动态和专家专业化。我们的结果显示出一致的、依赖于架构的改进（最高可达+3.9 mIoU），同时计算开销很小，并揭示了强烈的设计敏感性。我们的工作为基于CNN的稠密预测中稀疏MoE层的设计和内部动态提供了实证见解。我们的代码可在https://github.com/KASTEL-MobilityLab/moe-layers/获取。
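
The abstract above describes routing each local region (patch) to a small subset of convolutional experts. A minimal sketch of that idea, assuming a softmax gate with top-k selection and renormalized weights, is given below. The gating form, the names `route_patch` and `moe_patch_output`, and the toy element-wise "experts" standing in for convolutional blocks are hypothetical choices for illustration, not the authors' implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_patch(gate_logits, top_k):
    """Return (expert_index, weight) pairs for the top_k experts of one patch."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]

def moe_patch_output(patch, gate_logits, experts, top_k=2):
    """Weighted sum of the selected experts' outputs for a single patch."""
    out = [0.0] * len(patch)
    for idx, w in route_patch(gate_logits, top_k):
        y = experts[idx](patch)  # only the selected experts are evaluated
        out = [o + w * v for o, v in zip(out, y)]
    return out

# Hypothetical experts: element-wise functions standing in for conv blocks.
experts = [
    lambda p: [2.0 * v for v in p],
    lambda p: [v + 1.0 for v in p],
    lambda p: [-v for v in p],
    lambda p: [0.0 for _ in p],
]
patch = [1.0, 2.0]
logits = [2.0, 2.0, -5.0, -5.0]
routing = route_patch(logits, top_k=2)
output = moe_patch_output(patch, logits, experts, top_k=2)
```

Because only the selected experts run for a given patch, compute grows with `top_k` rather than with the total number of experts, which is the usual argument for sparse MoE capacity at low overhead.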

cs.CV / 66 / 2604.13789

Temporally Consistent Long-Term Memory for 3D Single Object Tracking

时序一致的长期记忆用于3D单目标跟踪

Yoo, Jaejoon, Lee, SuBeen, Jeon, Yerim, Lee, Miso, Heo, Jae-Pil

Abstract

3D Single Object Tracking (3D-SOT) aims to localize a target object across a sequence of LiDAR point clouds, given its 3D bounding box in the first frame. Recent methods have adopted a memory-based approach to utilize previously observed features of the target object, but remain limited to only a few recent frames. This work reveals that their temporal capacity is fundamentally constrained to short-term context due to severe temporal feature inconsistency and excessive memory overhead. To this end, we propose a robust long-term 3D-SOT framework, ChronoTrack, which preserves the temporal feature consistency while efficiently aggregating the diverse target features via long-term memory. Based on a compact set of learnable memory tokens, ChronoTrack leverages long-term information through two complementary objectives: a temporal consistency loss and a memory cycle consistency loss. The former enforces feature alignment across frames, alleviating temporal drift and improving the reliability of proposed long-term memory. In parallel, the latter encourages each token to encode diverse and discriminative target representations observed throughout the sequence via memory-point-memory cyclic walks. As a result, ChronoTrack achieves new state-of-the-art performance on multiple 3D-SOT benchmarks, demonstrating its effectiveness in long-term target modeling with compact memory while running at real-time speed of 42 FPS on a single RTX 4090 GPU. The code is available at https://github.com/ujaejoon/ChronoTrack

Chinese Translation

3D单目标跟踪（3D-SOT）旨在通过给定第一帧中的3D边界框，在一系列LiDAR点云中定位目标物体。近期的方法采用基于记忆的方式来利用先前观察到的目标物体特征，但仍然仅限于少数最近的帧。本研究揭示了由于严重的时序特征不一致和过高的记忆开销，它们的时序容量在根本上受到短期上下文的限制。为此，我们提出了一个稳健的长期3D-SOT框架ChronoTrack，该框架在有效聚合多样化目标特征的同时，保持时序特征的一致性。ChronoTrack基于一组紧凑的可学习记忆标记，通过两个互补目标利用长期信息：时序一致性损失和记忆循环一致性损失。前者强制在帧之间进行特征对齐，缓解时序漂移并提高所提长期记忆的可靠性。同时，后者鼓励每个标记通过记忆-点-记忆循环游走编码在整个序列中观察到的多样化和具有区分性的目标表示。因此，ChronoTrack在多个3D-SOT基准测试中实现了新的最先进性能，展示了其在紧凑记忆下进行长期目标建模的有效性，同时在单个RTX 4090 GPU上以42 FPS的实时速度运行。代码可在https://github.com/ujaejoon/ChronoTrack获取。

cs.CV / 67 / 2604.13791

PBE-UNet: A light weight Progressive Boundary-Enhanced U-Net with Scale-Aware Aggregation for Ultrasound Image Segmentation

PBE-UNet：一种轻量级渐进边界增强U-Net，具有尺度感知聚合用于超声图像分割

Wang, Chen, Zhu, Yixin, Zhu, Yongbin, Shi, Fengyuan, Li, Qi, Wang, Jun, Liu, Zuozhu, Hu, Keli

Abstract

Accurate lesion segmentation in ultrasound images is essential for preventive screening and clinical diagnosis, yet remains challenging due to low contrast, blurry boundaries, and significant scale variations. Although existing deep learning-based methods have achieved remarkable performance, these methods still struggle with scale variations and indistinct tumor boundaries. To address these challenges, we propose a progressive boundary enhanced U-Net (PBE-UNet). Specifically, we first introduce a scale-aware aggregation module (SAAM) that dynamically adjusts its receptive field to capture robust multi-scale contextual information. Then, we propose a boundary-guided feature enhancement (BGFE) module to enhance the feature representations. We find that there are large gaps between the narrow boundary and the wide segmentation error areas. Unlike existing methods that treat boundaries as static masks, the BGFE module progressively expands the narrow boundary prediction into broader spatial attention maps. Thus, broader spatial attention maps could effectively cover the wider segmentation error regions and enhance the model's focus on these challenging areas. We conduct extensive experiments on four benchmark ultrasound datasets, BUSI, Dataset B, TN3K, and BP. The experimental results show that our proposed PBE-UNet outperforms state-of-the-art ultrasound image segmentation methods. The code is at https://github.com/cruelMouth/PBE-UNet.

Chinese Translation

在超声图像中准确分割病变对于预防筛查和临床诊断至关重要，但由于对比度低、边界模糊以及显著的尺度变化，仍然面临挑战。尽管现有的基于深度学习的方法取得了显著的性能，但这些方法在尺度变化和模糊肿瘤边界方面仍然存在困难。为了解决这些挑战，我们提出了一种渐进边界增强U-Net（PBE-UNet）。特别地，我们首先引入了一个尺度感知聚合模块（SAAM），该模块动态调整其感受野，以捕捉稳健的多尺度上下文信息。然后，我们提出了一个边界引导特征增强（BGFE）模块，以增强特征表示。我们发现，狭窄边界与宽分割误差区域之间存在较大差距。与现有方法将边界视为静态掩膜不同，BGFE模块逐步将狭窄边界预测扩展为更广泛的空间注意力图。因此，更广泛的空间注意力图可以有效覆盖更宽的分割误差区域，并增强模型对这些挑战区域的关注。我们在四个基准超声数据集（BUSI、Dataset B、TN3K和BP）上进行了大量实验。实验结果表明，我们提出的PBE-UNet优于最先进的超声图像分割方法。代码可在 https://github.com/cruelMouth/PBE-UNet 获取。

cs.CV / 68 / 2604.13793

From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation

从同步到序列：通过插值实现外部到自我生成

Mahdi, Mohammad, Savov, Nedko, Paudel, Danda Pani, Van Gool, Luc

Abstract

Exo-to-Ego video generation aims to synthesize a first-person video from a synchronized third-person view and corresponding camera poses. While paired supervision is available, synchronized exo-ego data inherently introduces substantial spatio-temporal and geometric discontinuities, violating the smooth-motion assumptions of standard video generation benchmarks. We identify this synchronization-induced jump as the central challenge and propose Syn2Seq-Forcing, a sequential formulation that interpolates between the source and target videos to form a single continuous signal. By reframing Exo2Ego as sequential signal modeling rather than a conventional condition-output task, our approach enables diffusion-based sequence models, e.g., Diffusion Forcing Transformers (DFoT), to capture coherent transitions across frames more effectively. Empirically, we show that interpolating only the videos, without performing pose interpolation, already produces significant improvements, emphasizing that the dominant difficulty arises from spatio-temporal discontinuities. Beyond immediate performance gains, this formulation establishes a general and flexible framework capable of unifying both Exo2Ego and Ego2Exo generation within a single continuous sequence model, providing a principled foundation for future research in cross-view video synthesis.

Chinese Translation

外部到自我视频生成旨在从同步的第三人视角和相应的相机姿态合成第一人称视频。尽管有配对监督，但同步的外部-自我数据本质上引入了大量的时空和几何不连续性，违反了标准视频生成基准的平滑运动假设。我们将这种同步引起的跳跃识别为核心挑战，并提出了Syn2Seq-Forcing，这是一种顺序公式，通过在源视频和目标视频之间进行插值来形成一个连续的信号。通过将外部到自我（Exo2Ego）重新框定为顺序信号建模，而不是传统的条件-输出任务，我们的方法使基于扩散的序列模型（例如扩散强制变换器（Diffusion Forcing Transformers, DFoT））能够更有效地捕捉帧之间的连贯过渡。通过实证研究，我们表明，仅对视频进行插值，而不进行姿态插值，已经产生了显著的改进，强调了主要困难来自时空不连续性。除了直接的性能提升，这一公式建立了一个通用且灵活的框架，能够将外部到自我（Exo2Ego）和自我到外部（Ego2Exo）生成统一在一个连续的序列模型中，为未来在跨视角视频合成方面的研究提供了原则性基础。

cs.CV / 69 / 2604.13795

Artificial intelligence application in lymphoma diagnosis with Vision Transformer using weakly supervised training

使用弱监督训练的视觉变换器在淋巴瘤诊断中的人工智能应用

Nghia, Nguyen, Wahed, Amer, Quesada, Andy, Ali, Yasir, Achi, Hanadi El, Zhang, Y. Helen, Ursua, Jocelyn, Banerjee, Alex, Kalra, Sahib, Medeiros, L. Jeffrey, Xu, Jie

Abstract

Vision transformers (ViTs) have been shown to allow for more flexible feature detection and can outperform convolutional neural networks (CNNs) when pre-trained on sufficient data. Due to their promising feature detection capabilities, we deployed ViTs for morphological classification of anaplastic large cell lymphoma (ALCL) versus classic Hodgkin lymphoma (cHL). We had previously designed a ViT model which was trained on a small dataset of 1,200 image patches in fully supervised training. That model achieved a diagnostic accuracy of 100% and an F1 score of 1.0 on the independent test set. Since fully supervised training is not a practical method due to lack of expertise resources in both the training and testing phases, we conducted a recent study on a modified approach to training data (weakly supervised training) and show that labeling training image patches automatically at the slide level of each whole-slide image is a more practical solution for clinical use of Vision Transformers. Our ViT model, trained on a larger dataset of 100,000 image patches, achieves an accuracy, F1 score, and area under the curve (AUC) of 91.85%, 0.92, and 0.98, respectively. These are respectable values that qualify this ViT model, with weakly supervised training, as a suitable tool for a deep learning module in clinical model development using automated image patch extraction.

Chinese Translation

视觉变换器（Vision Transformers, ViT）已被证明能够实现更灵活的特征检测，并且在充分数据预训练的情况下，其性能优于卷积神经网络（Convolutional Neural Network, CNN）。鉴于其良好的特征检测能力，我们将ViT应用于间变性大细胞淋巴瘤（Anaplastic Large Cell Lymphoma, ALCL）与经典霍奇金淋巴瘤（Classic Hodgkin Lymphoma, cHL）的形态分类。我们之前设计的ViT模型是在一个包含1200个图像块的小型数据集上进行完全监督训练的，该模型在独立测试集上达到了100%的诊断准确率和1.0的F1分数。由于在训练和测试阶段缺乏专业资源，完全监督训练并不是一种实用的方法，因此我们最近进行了关于训练数据修改方法（弱监督训练）的研究，结果表明在每个全切片图像的切片级别自动标记训练图像块是一种更实用的视觉变换器临床应用解决方案。我们的ViT模型在一个包含100,000个图像块的大型数据集上训练，评估指标显示出显著的准确率、F1分数和曲线下面积（Area Under the Curve, AUC），分别为91.85%、0.92和0.98。这些都是令人满意的数值，使得该采用弱监督训练的ViT模型成为临床模型开发中使用自动图像块提取的深度学习模块的合适工具。

cs.CV / 70 / 2604.13797

DRG-Font: Dynamic Reference-Guided Few-shot Font Generation via Contrastive Style-Content Disentanglement

DRG-Font：通过对比风格-内容解耦的动态参考引导少样本字体生成

Chakraborty, Rejoy, Roy, Prasun, Bhattacharya, Saumik, Pal, Umapada

Abstract

Few-shot Font Generation aims to generate stylistically consistent glyphs from a few reference glyphs. However, capturing complex font styles from a few exemplars remains challenging, and the existing methods often struggle to retain discernible local characteristics in generated samples. This paper introduces DRG-Font, a contrastive font generation strategy that learns complex glyph attributes by decomposing style and content embedding spaces. For optimal style supervision, the proposed architecture incorporates a Reference Selection (RS) Module to dynamically select the best style reference from an available pool of candidates. The network learns to decompose glyph attributes into style and shape priors through a Multi-scale Style Head Block (MSHB) and a Multi-scale Content Head Block (MCHB). For style adaptation, a Multi-Fusion Upsampling Block (MFUB) produces the target glyph by combining the reference style prior and target content prior. The proposed method demonstrates significant improvements over state-of-the-art approaches across multiple visual and analytical benchmarks.

Chinese Translation

少样本字体生成旨在从少量参考字形中生成风格一致的字形。然而，从少量样本中捕捉复杂的字体风格仍然具有挑战性，现有方法往往难以在生成样本中保留明显的局部特征。本文提出了DRG-Font，一种对比字体生成策略，通过解构风格和内容嵌入空间来学习复杂的字形属性。为了实现最佳的风格监督，所提出的架构结合了参考选择（Reference Selection, RS）模块，动态选择最佳的风格参考。该网络通过多尺度风格头块（Multi-scale Style Head Block, MSHB）和多尺度内容头块（Multi-scale Content Head Block, MCHB）学习将字形属性解构为风格和形状先验。为了实现风格适应，多融合上采样块（Multi-Fusion Upsampling Block, MFUB）通过结合参考风格先验和目标内容先验生成目标字形。所提出的方法在多个视觉和分析基准测试中显示出显著优于现有最先进方法的改进。

cs.CV / 71 / 2604.13803

Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

煤气灯效应、把关者、V1-V3：早期视觉皮层对齐保护视觉-语言模型免受谄媚操控

Shah, Arya, Tripathi, Vaibhav, Singh, Mayank, Silpasuwanchai, Chaklam

Abstract

Vision-language models are increasingly deployed in high-stakes settings, yet their susceptibility to sycophantic manipulation remains poorly understood, particularly in relation to how these models represent visual information internally. Whether models whose visual representations more closely mirror human neural processing are also more resistant to adversarial pressure is an open question with implications for both neuroscience and AI safety. We investigate this question by evaluating 12 open-weight vision-language models spanning 6 architecture families and a 40$\times$ parameter range (256M--10B) along two axes: brain alignment, measured by predicting fMRI responses from the Natural Scenes Dataset across 8 human subjects and 6 visual cortex regions of interest, and sycophancy, measured through 76,800 two-turn gaslighting prompts spanning 5 categories and 10 difficulty levels. Region-of-interest analysis reveals that alignment specifically in early visual cortex (V1--V3) is a reliable negative predictor of sycophancy ($r = -0.441$, BCa 95\% CI $[-0.740, -0.031]$), with all 12 leave-one-out correlations negative and the strongest effect for existence denial attacks ($r = -0.597$, $p = 0.040$). This anatomically specific relationship is absent in higher-order category-selective regions, suggesting that faithful low-level visual encoding provides a measurable anchor against adversarial linguistic override in vision-language models. We release our code on \href{https://github.com/aryashah2k/Gaslight-Gatekeep-Sycophantic-Manipulation}{GitHub} and dataset on \href{https://huggingface.co/datasets/aryashah00/Gaslight-Gatekeep-V1-V3}{Hugging Face}

Chinese Translation

视觉-语言模型越来越多地应用于高风险环境中，但它们对谄媚操控的敏感性仍然不够清楚，特别是在这些模型如何内部表示视觉信息方面。视觉表示更接近人类神经处理的模型是否也更能抵抗对抗性压力是一个尚未解决的问题，这对神经科学和人工智能安全都有重要影响。我们通过评估12个开放权重的视觉-语言模型，涵盖6个架构家族和40$\times$参数范围（256M--10B），在两个维度上研究这个问题：大脑对齐，通过预测来自自然场景数据集的fMRI反应，涉及8名受试者和6个视觉皮层感兴趣区域进行测量；谄媚性，通过76,800个两轮煤气灯提示，涵盖5个类别和10个难度级别进行测量。感兴趣区域分析表明，早期视觉皮层（V1--V3）的对齐是谄媚性的可靠负预测因子（$r = -0.441$, BCa 95\% CI $[-0.740, -0.031]$），所有12个留一法相关性均为负，且在存在否认攻击中效果最强（$r = -0.597$, $p = 0.040$）。这种解剖学特定的关系在高阶类别选择区域中不存在，表明忠实的低级视觉编码为视觉-语言模型提供了一个可测量的锚点，以抵御对抗性语言覆盖。我们在 \texttt{https://github.com/aryashah2k/Gaslight-Gatekeep-Sycophantic-Manipulation} 上发布了我们的代码，并在 \texttt{https://huggingface.co/datasets/aryashah00/Gaslight-Gatekeep-V1-V3} 上发布了数据集。

cs.CV / 72 / 2604.13835

A Resource-Efficient Hybrid CNN-LSTM network for image-based bean leaf disease classification

一种资源高效的混合CNN-LSTM网络用于基于图像的豆叶病分类

Rhee, Hye Jin, Akinyemi, Joseph Damilola

Abstract

Accurate and resource-efficient automated diagnosis is a cornerstone of modern agricultural expert systems. While Convolutional Neural Networks (CNNs) have established benchmarks in plant pathology, their ability to capture long-range spatial dependencies is often limited by standard pooling layers, and their high memory footprint hinders deployment on portable devices. This paper proposes a lightweight hybrid CNN-LSTM system for bean leaf disease classification. By integrating an LSTM layer to model the spatial-sequential relationships within feature maps, our hybrid architecture achieves a 94.38% accuracy while maintaining an exceptionally small footprint of 1.86 MB; a 70% reduction in size compared to traditional CNN-based systems. Furthermore, we provide a systematic evaluation of image augmentation strategies, demonstrating that tailored transformations are superior to generic combinations for maintaining the integrity of diagnostic patterns. Results on the $\textit{ibean}$ dataset confirm that the proposed system achieves state-of-the-art F1 scores of 99.22% with EfficientNet-B7+LSTM, providing a robust and scalable framework for real-time agricultural decision support in resource-constrained environments. The code and augmented datasets used in this study are publicly available on this $\href{https://github.com/HJin-R/bean_disease}{Github}$ repo.

Chinese Translation

准确且资源高效的自动诊断是现代农业专家系统的基石。尽管卷积神经网络（CNN）在植物病理学中建立了基准，但其捕捉长距离空间依赖的能力常常受到标准池化层的限制，并且其高内存占用阻碍了在便携设备上的部署。本文提出了一种轻量级的混合CNN-LSTM系统用于豆叶病分类。通过集成LSTM层来建模特征图中的空间-序列关系，我们的混合架构在保持仅1.86 MB的小内存占用的同时，实现了94.38%的准确率；与传统的基于CNN的系统相比，尺寸减少了70%。此外，我们提供了图像增强策略的系统评估，证明量身定制的变换在保持诊断模式完整性方面优于通用组合。在$\textit{ibean}$数据集上的结果确认，所提系统在使用EfficientNet-B7+LSTM时达到了99.22%的最新F1分数，为资源受限环境中的实时农业决策支持提供了一个稳健且可扩展的框架。本研究中使用的代码和增强数据集已在此$\href{https://github.com/HJin-R/bean_disease}{Github}$仓库公开。

cs.CV / 73 / 2604.13841

DiffMagicFace: Identity Consistent Facial Editing of Real Videos

DiffMagicFace：真实视频的身份一致性面部编辑

Yin, Huanghao, Xu, Shenkun, Shi, Kanle, Yong, Junhai, Wang, Bin

Abstract

Text-conditioned image editing has greatly benefitted from the advancements in Image Diffusion Models. However, extending these techniques to facial video editing introduces challenges in preserving facial identity throughout the source video and ensuring consistency of the edited subject across frames. In this paper, we introduce DiffMagicFace, a unique video editing framework that integrates two fine-tuned models for text and image control. These models operate concurrently during inference to produce video frames that maintain identity features while seamlessly aligning with the editing semantics. To ensure the consistency of the edited videos, we develop a dataset comprising images showcasing various facial perspectives for each edited subject. The dataset is created through rendering techniques followed by optimization algorithms. Remarkably, our approach does not depend on video datasets but still delivers high-quality results in both consistency and content. The approach remains effective even for complex tasks like talking head videos and distinguishing closely related categories. The videos edited using our framework are on par with videos made using traditional rendering software. Through comparative analysis with current state-of-the-art methods, our framework demonstrates superior performance in both visual appeal and quantitative metrics.

Chinese Translation

文本条件的图像编辑得益于图像扩散模型的进步。然而，将这些技术扩展到面部视频编辑时，面临着在整个源视频中保持面部身份和确保编辑对象在各帧之间一致性的挑战。在本文中，我们介绍了DiffMagicFace，这是一种独特的视频编辑框架，集成了两个经过微调的文本和图像控制模型。这些模型在推理过程中并行操作，以生成保持身份特征的同时与编辑语义无缝对齐的视频帧。为了确保编辑视频的一致性，我们开发了一个数据集，其中包含展示每个编辑对象不同面部视角的图像。该数据集的创建通过渲染技术和后续应用优化算法实现。值得注意的是，我们的方法不依赖于视频数据集，但仍在一致性和内容方面提供高质量的结果。即使对于复杂任务，如对话头视频和区分密切相关的类别，我们的方法也表现出色。使用我们框架编辑的视频与使用传统渲染软件制作的视频具有相当的效果。通过与当前最先进方法的比较分析，我们的框架在视觉吸引力和定量指标上表现出优越性。

cs.CV / 74 / 2604.13856

Any3DAvatar: Fast and High-Quality Full-Head 3D Avatar Reconstruction from Single Portrait Image

Any3DAvatar：从单幅肖像图像快速高质量全头3D头像重建

Gao, Yujie, Xiao, Yao, Zhu, Xiangnan, Li, Ya, Zhang, Yiyi, Zhang, Liqing, Zhang, Jianfu

Abstract

Reconstructing a complete 3D head from a single portrait remains challenging because existing methods still face a sharp quality-speed trade-off: high-fidelity pipelines often rely on multi-stage processing and per-subject optimization, while fast feed-forward models struggle with complete geometry and fine appearance details. To bridge this gap, we propose Any3DAvatar, a fast and high-quality method for single-image 3D Gaussian head avatar generation, whose fastest setting reconstructs a full head in under one second while preserving high-fidelity geometry and texture. First, we build AnyHead, a unified data suite that combines identity diversity, dense multi-view supervision, and realistic accessories, filling the main gaps of existing head data in coverage, full-head geometry, and complex appearance. Second, rather than sampling unstructured noise, we initialize from a Plücker-aware structured 3D Gaussian scaffold and perform one-step conditional denoising, formulating full-head reconstruction into a single forward pass while retaining high fidelity. Third, we introduce auxiliary view-conditioned appearance supervision on the same latent tokens alongside 3D Gaussian reconstruction, improving novel-view texture details at zero extra inference cost. Experiments show that Any3DAvatar outperforms prior single-image full-head reconstruction methods in rendering fidelity while remaining substantially faster.

Chinese Translation

从单幅肖像图像重建完整的3D头部仍然具有挑战性，因为现有方法在质量与速度之间存在明显的权衡：高保真管道通常依赖于多阶段处理和针对每个对象的优化，而快速前馈模型在完整几何形状和细致外观细节方面表现不佳。为了解决这一问题，我们提出了Any3DAvatar，这是一种快速且高质量的单图像3D高斯头部头像生成方法，其最快设置能够在不到一秒的时间内重建完整头部，同时保持高保真的几何形状和纹理。首先，我们构建了AnyHead，一个统一的数据套件，结合了身份多样性、密集的多视角监督和真实的配件，填补了现有头部数据在覆盖范围、完整头部几何形状和复杂外观方面的主要空白。其次，我们不再从无结构噪声中采样，而是从一个考虑Plücker约束的结构化3D高斯支架初始化，并执行一步条件去噪，将全头重建公式化为单次前向传递，同时保持高保真度。第三，我们在相同的潜在标记上引入辅助视角条件外观监督，与3D高斯重建并行，提高了新视角下的纹理细节，而没有额外的推理成本。实验表明，Any3DAvatar在渲染保真度方面优于之前的单图像全头重建方法，同时速度显著更快。

cs.CV / 75 / 2604.13863

PostureObjectStitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

PostureObjectStitch：考虑工业场景中组装关系的异常图像生成

Tong, Zebei, Chen, Hongchang, Lei, Yujie, Chen, Gang, Liu, Yushi, Zheng, Zhi, Chen, Hao, Zhang, Jieming, Li, Ying, Cao, Dongpu

Abstract

Image generation technology can synthesize condition-specific images to supplement real-world industrial anomaly data and enhance anomaly detection model performance. Existing generation techniques rarely account for the pose and orientation of industrial components in assembly, making the generated images difficult to utilize for downstream application. To solve this, we propose a novel image synthesis approach, called PostureObjectStitch, that achieves accurate generation to meet the requirement of industrial assembly. A condition decoupling approach is introduced to separate input multi-view images into high-frequency, texture, and RGB features. The feature temporal modulation mechanism adapts these features across diffusion model time-steps, enabling progressive generation from coarse to fine details while maintaining consistency. To ensure semantic accuracy, we introduce a conditional loss that enhances critical industrial elements and a geometric prior that guides component positioning for correct assembly relationships. Comprehensive experimental results on the MureCom dataset, our newly contributed DreamAssembly dataset, and the downstream application validate the outstanding performance of our method.

Chinese Translation

图像生成技术能够合成特定条件下的图像，以补充真实世界的工业异常数据并提升异常检测模型的性能。现有的生成技术很少考虑工业组件在组装过程中的姿态和方向，使得生成的图像难以用于下游应用。为了解决这个问题，我们提出了一种新颖的图像合成方法，称为PostureObjectStitch，该方法实现了满足工业组装要求的准确生成。我们引入了一种条件解耦方法，将输入的多视角图像分离为高频、纹理和RGB特征。特征时间调制机制使这些特征在扩散模型的时间步长中适应，从而实现从粗到细的渐进式生成，同时保持一致性。为了确保语义准确性，我们引入了一种条件损失，增强关键工业元素，并采用几何先验指导组件定位，以确保正确的组装关系。我们在MureCom数据集、我们新贡献的DreamAssembly数据集及下游应用上的全面实验结果验证了我们方法的卓越性能。

cs.CV / 76 / 2604.13883

Context Sensitivity Improves Human-Machine Visual Alignment

上下文敏感性提高了人机视觉对齐

Born, Frieda, Neuhäuser, Tom, Muttenthaler, Lukas, Roads, Brett D., Spitzer, Bernhard, Lampinen, Andrew K., Jones, Matt, Müller, Klaus-Robert, Mozer, Michael C.

Abstract

Modern machine learning models typically represent inputs as fixed points in a high-dimensional embedding space. While this approach has been proven powerful for a wide range of downstream tasks, it fundamentally differs from the way humans process information. Because humans are constantly adapting to their environment, they represent objects and their relationships in a highly context-sensitive manner. To address this gap, we propose a method for context-sensitive similarity computation from neural network embeddings, applied to modeling a triplet odd-one-out task with an anchor image serving as simultaneous context. Modeling context enables us to achieve up to a 15% improvement in odd-one-out accuracy over a context-insensitive model. We find that this improvement is consistent across both original and "human-aligned" vision foundation models.

Chinese Translation

现代机器学习模型通常将输入表示为高维嵌入空间中的固定点。尽管这种方法在广泛的下游任务中已被证明是有效的，但它与人类处理信息的方式存在根本差异。由于人类不断适应环境，他们以高度上下文敏感的方式表示物体及其关系。为了解决这一差距，我们提出了一种从神经网络嵌入中进行上下文敏感相似性计算的方法，应用于建模一个三元组的奇异物体识别任务，其中锚图像作为同时上下文。建模上下文使我们在奇异物体识别准确率上实现了高达15%的提升，相较于上下文不敏感模型。我们发现，这一改善在原始和“人类对齐”的视觉基础模型中都是一致的。

cs.CV / 77 / 2604.13905

Rethinking Image-to-3D Generation with Sparse Queries: Efficiency, Capacity, and Input-View Bias

重新思考基于稀疏查询的图像到3D生成：效率、容量与输入视角偏差

Xu, Zhiyuan, Liu, Jiuming, Chen, Yuxin, Tomizuka, Masayoshi, Xu, Chenfeng, Peng, Chensheng

Abstract

We present SparseGen, a novel framework for efficient image-to-3D generation, which exhibits low input-view bias while being significantly faster. Unlike traditional approaches that rely on dense volumetric grids, triplanes, or pixel-aligned primitives, we model scenes with a compact sparse set of learned 3D anchor queries and a learned expansion operator that decodes each transformed query into a small local set of 3D Gaussian primitives. Trained under a rectified-flow reconstruction objective without 3D supervision, our model learns to allocate representation capacity where geometry and appearance matter, achieving significant reductions in memory and inference time while preserving multi-view fidelity. We introduce quantitative measures of input-view bias and utilization to show that sparse queries reduce overfitting to conditioning views while being representationally efficient. Our results argue that sparse set-latent expansion is a principled, practical alternative for efficient 3D generative modeling.

Chinese Translation

我们提出了SparseGen，一个用于高效图像到3D生成的新框架，该框架在显著加快速度的同时表现出较低的输入视角偏差。与依赖于密集体积网格、三平面或像素对齐原语的传统方法不同，我们通过一组紧凑的稀疏学习3D锚查询和一个学习的扩展算子来建模场景，该扩展算子将每个变换的查询解码为一小组局部3D高斯原语。在没有3D监督的情况下，我们的模型在经过校正流重建目标的训练下，学习在几何和外观重要的地方分配表示容量，从而在保持多视角保真度的同时显著减少内存和推理时间。我们引入了输入视角偏差和利用率的定量度量，以表明稀疏查询减少了对条件视角的过拟合，同时在表示上是高效的。我们的结果表明，稀疏集合潜在扩展是高效3D生成建模的一个有原则且实用的替代方案。

cs.CV / 78 / 2604.13906

Blind Bitstream-corrupted Video Recovery via Metadata-guided Diffusion Model

基于元数据引导的盲视频流损坏恢复方法

Wang, Shuyun, Zhang, Hu, Shen, Xin, Wang, Dadong, Yu, Xin

Abstract

Bitstream-corrupted video recovery aims to restore realistic content degraded during video storage or transmission. Existing methods typically assume that predefined masks of corrupted regions are available, but manually annotating these masks is labor-intensive and impractical in real-world scenarios. To address this limitation, we introduce a new blind video recovery setting that removes the reliance on predefined masks. This setting presents two major challenges: accurately identifying corrupted regions and recovering content from extensive and irregular degradations. We propose a Metadata-Guided Diffusion Model (M-GDM) to tackle these challenges. Specifically, intrinsic video metadata are leveraged as corruption indicators through a dual-stream metadata encoder that separately embeds motion vectors and frame types before fusing them into a unified representation. This representation interacts with corrupted latent features via cross-attention at each diffusion step. To preserve intact regions, we design a prior-driven mask predictor that generates pseudo masks using both metadata and diffusion priors, enabling the separation and recombination of intact and recovered regions through hard masking. To mitigate boundary artifacts caused by imperfect masks, a post-refinement module enhances consistency between intact and recovered regions. Extensive experiments demonstrate the effectiveness of our method and its superiority in blind video recovery. Code is available at: https://github.com/Shuyun-Wang/M-GDM.

Chinese Translation

视频流损坏恢复旨在恢复在视频存储或传输过程中退化的真实内容。现有方法通常假设预定义的损坏区域掩码是可用的，但手动标注这些掩码劳动密集且在现实场景中不切实际。为了解决这一限制，我们提出了一种新的盲视频恢复设置，消除了对预定义掩码的依赖。该设置面临两个主要挑战：准确识别损坏区域和从广泛且不规则的退化中恢复内容。我们提出了一种元数据引导的扩散模型（Metadata-Guided Diffusion Model, M-GDM）来应对这些挑战。具体而言，利用内在视频元数据作为损坏指示器，通过双流元数据编码器分别嵌入运动矢量和帧类型，然后将其融合为统一表示。该表示在每个扩散步骤中通过交叉注意力与损坏的潜在特征进行交互。为了保留完整区域，我们设计了一个基于先验的掩码预测器，利用元数据和扩散先验生成伪掩码，从而通过硬掩码实现完整区域和恢复区域的分离与重组。为了减轻不完美掩码造成的边界伪影，后处理模块增强了完整区域和恢复区域之间的一致性。大量实验表明我们的方法的有效性及其在盲视频恢复中的优越性。代码可在以下链接获取：https://github.com/Shuyun-Wang/M-GDM。
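
The abstract above describes separating and recombining intact and recovered regions through hard masking. The sketch below shows one plausible reading of that step on a tiny single-channel frame: a soft mask is thresholded, recovered pixels are used where the mask flags corruption, and original pixels are kept elsewhere. The function name `recombine`, the 0.5 threshold, and the nested-list frame layout are assumptions for illustration, not details taken from the paper.

```python
def recombine(corrupted, recovered, mask_probs, threshold=0.5):
    """Keep original pixels where the mask says 'intact', else use recovered ones.

    Assumption: mask_probs holds per-pixel corruption probabilities in [0, 1].
    """
    out = []
    for row_c, row_r, row_m in zip(corrupted, recovered, mask_probs):
        out.append([
            r if m >= threshold else c  # hard (binary) decision per pixel
            for c, r, m in zip(row_c, row_r, row_m)
        ])
    return out

# Toy 2x3 frame: the right-hand column is flagged as corrupted.
corrupted = [[10, 20, 255], [30, 40, 255]]
recovered = [[11, 19, 25], [29, 41, 45]]
mask_probs = [[0.1, 0.2, 0.9], [0.0, 0.4, 0.8]]
frame = recombine(corrupted, recovered, mask_probs)
print(frame)
```

A hard decision of this kind leaves intact pixels bit-exact, which is presumably why the abstract pairs it with a post-refinement step to smooth the seams that an imperfect mask can introduce.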

cs.CV / 79 / 2604.13918

PartNerFace: Part-based Neural Radiance Fields for Animatable Facial Avatar Reconstruction

PartNerFace：基于部件的神经辐射场用于可动画面部头像重建

Yu, Xianggang, Qiu, Lingteng, Ren, Xiaohang, Chen, Guanying, Cui, Shuguang, Han, Xiaoguang, Wang, Baoyuan

Abstract

We present PartNerFace, a part-based neural radiance fields approach, for reconstructing animatable facial avatar from monocular RGB videos. Existing solutions either simply condition the implicit network with the morphable model parameters or learn an imaginary canonical radiance field, making them fail to generalize to unseen facial expressions and capture fine-scale motion details. To address these challenges, we first apply inverse skinning based on a parametric head model to map an observed point to the canonical space, and then model fine-scale motions with a part-based deformation field. Our key insight is that the deformation of different facial parts should be modeled differently. Specifically, our part-based deformation field consists of multiple local MLPs to adaptively partition the canonical space into different parts, where the deformation of a 3D point is computed by aggregating the prediction of all local MLPs by a soft-weighting mechanism. Extensive experiments demonstrate that our method generalizes well to unseen expressions and is capable of modeling fine-scale facial motions, outperforming state-of-the-art methods both quantitatively and qualitatively.

Chinese Translation

我们提出了PartNerFace，一种基于部件的神经辐射场方法，用于从单目RGB视频重建可动画的面部头像。现有的解决方案要么简单地使用可变形模型参数对隐式网络进行条件化，要么学习一个虚构的标准辐射场，这使得它们无法推广到未见过的面部表情，并且无法捕捉细微的运动细节。为了解决这些挑战，我们首先基于参数化头部模型应用逆皮肤变形，将观察到的点映射到标准空间，然后使用基于部件的变形场建模细微的运动。我们的关键见解是，不同面部部件的变形应该采用不同的建模方式。具体而言，我们的基于部件的变形场由多个局部多层感知器（MLP）组成，以自适应地将标准空间划分为不同的部分，其中3D点的变形通过软加权机制聚合所有局部MLP的预测来计算。大量实验表明，我们的方法在未见过的表情上具有良好的泛化能力，并能够建模细微的面部运动，在定量和定性上均优于最先进的方法。

cs.CV / 80 / 2604.13938

ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding

ASTRA：通过检索增强的姿态引导和解耦位置嵌入提升多主体生成

Xia, Tianze, Ning, Zijian, Zhao, Zonglin, Wang, Mingjia

Abstract

Subject-driven image generation has shown great success in creating personalized content, but its capabilities are largely confined to single subjects in common poses. Current approaches face a fundamental conflict when handling multiple subjects with complex, distinct actions: preserving individual identities while enforcing precise pose structures. This challenge often leads to identity fusion and pose distortion, as appearance and structure signals become entangled within the model's architecture. To resolve this conflict, we introduce ASTRA (Adaptive Synthesis through Targeted Retrieval Augmentation), a novel framework that architecturally disentangles subject appearance from pose structure within a unified Diffusion Transformer. ASTRA achieves this through a dual-pronged strategy. It first employs a Retrieval-Augmented Pose (RAG-Pose) pipeline to provide a clean, explicit structural prior from a curated database. Then, its core generative model learns to process these dual visual conditions using our Enhanced Universal Rotary Position Embedding (EURoPE), an asymmetric encoding mechanism that decouples identity tokens from spatial locations while binding pose tokens to the canvas. Concurrently, a Disentangled Semantic Modulation (DSM) adapter offloads the identity preservation task into the text conditioning stream. Extensive experiments demonstrate that our integrated approach achieves superior disentanglement. On our designed COCO-based complex pose benchmark, ASTRA achieves a new state-of-the-art in pose adherence, while maintaining high identity fidelity and text alignment in DreamBench.

Chinese Translation

以主体为驱动的图像生成在创建个性化内容方面取得了巨大成功，但其能力在很大程度上局限于单一主体的常见姿态。目前的方法在处理具有复杂、独特动作的多个主体时面临根本性冲突：在保持个体身份的同时强制执行精确的姿态结构。这一挑战常常导致身份融合和姿态扭曲，因为外观和结构信号在模型架构中相互纠缠。为了解决这一冲突，我们提出了ASTRA（通过目标检索增强的自适应合成），这是一个新颖的框架，在统一的扩散变换器中从架构上解耦主体外观与姿态结构。ASTRA通过双重策略实现这一目标。首先，它采用检索增强姿态（RAG-Pose）管道，从策划的数据库中提供清晰、明确的结构先验。然后，其核心生成模型利用我们的增强通用旋转位置嵌入（EURoPE）学习处理这两种视觉条件，这是一种不对称编码机制，能够将身份标记与空间位置解耦，同时将姿态标记绑定到画布上。同时，解耦语义调制（DSM）适配器将身份保留任务卸载到文本条件流中。大量实验表明，我们的综合方法实现了优越的解耦。在我们设计的基于COCO的复杂姿态基准上，ASTRA在姿态遵循性方面达到了新的最先进水平，同时在DreamBench中保持了高身份保真度和文本对齐。

cs.CV / 81 / 2604.13939

A Multi-Stage Optimization Pipeline for Bethesda Cell Detection in Pap Smear Cytology

宫颈涂片细胞学中巴塞斯达细胞检测的多阶段优化管道

Amster, Martin, Polotto, Camila María

Abstract

Computer vision techniques have advanced significantly in recent years, finding diverse and impactful applications within the medical field. In this paper, we introduce a new framework for the detection of Bethesda cells in Pap smear images, developed for Track B of the Riva Cytology Challenge held in association with the International Symposium on Biomedical Imaging (ISBI). This work focuses on enhancing computer vision models for cell detection, with performance evaluated using the mAP50-95 metric. We propose a solution based on an ensemble of YOLO and U-Net architectures, followed by a refinement stage utilizing overlap removal techniques and a binary classifier. Our framework achieved second place with a mAP50-95 score of 0.5909 in the competition. The implementation and source code are available at the following repository: github.com/martinamster/riva-trackb
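
The abstract mentions an overlap-removal refinement stage and the mAP50-95 metric. The sketch below shows one common form of overlap removal, greedy IoU-based suppression, together with the ten IoU thresholds that mAP50-95 averages over. Whether the authors use this exact procedure is not stated; the function names, box format, and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def remove_overlaps(detections, iou_threshold=0.5):
    """Greedy suppression over (box, score) pairs.

    Boxes are visited from highest to lowest score; a box is kept only if
    its IoU with every already-kept box is at most the threshold (assumed).
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, score))
    return kept


# mAP50-95 averages AP over IoU thresholds 0.50, 0.55, ..., 0.95.
MAP_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]
```

Because mAP50-95 rewards tight localization at high IoU thresholds, duplicate or loosely fitting boxes from an ensemble are penalized, which is one plausible motivation for a dedicated overlap-removal step.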

Chinese Translation

近年来，计算机视觉技术取得了显著进展，在医学领域找到了多样且具有影响力的应用。本文介绍了一种用于在宫颈涂片图像中检测巴塞斯达细胞的新框架，该框架是为与国际生物医学成像研讨会（ISBI）相关的Riva细胞学挑战赛的B轨道开发的。本研究重点提升细胞检测的计算机视觉模型，性能通过mAP50-95指标进行评估。我们提出了一种基于YOLO和U-Net架构集成的解决方案，随后通过重叠去除技术和二分类器进行精细化处理。在比赛中，我们的框架以0.5909的mAP50-95得分获得第二名。实现和源代码可在以下仓库获取：github.com/martinamster/riva-trackb

cs.CV / 82 / 2604.13941

SceneGlue: Scene-Aware Transformer for Feature Matching without Scene-Level Annotation

SceneGlue：一种无需场景级注释的场景感知变换器用于特征匹配

Du, Songlin, Lu, Xiaoyong, Yan, Yaping, Xiao, Guobao, Lu, Xiaobo, Ikenaga, Takeshi

Abstract

Local feature matching plays a critical role in understanding the correspondence between cross-view images. However, traditional methods are constrained by the inherent local nature of feature descriptors, limiting their ability to capture non-local scene information that is essential for accurate cross-view correspondence. In this paper, we introduce SceneGlue, a scene-aware feature matching framework designed to overcome these limitations. SceneGlue leverages a hybridizable matching paradigm that integrates implicit parallel attention and explicit cross-view visibility estimation. The parallel attention mechanism simultaneously exchanges information among local descriptors within and across images, enhancing the scene's global context. To further enrich the scene awareness, we propose the Visibility Transformer, which explicitly categorizes features into visible and invisible regions, providing an understanding of cross-view scene visibility. By combining explicit and implicit scene-level awareness, SceneGlue effectively compensates for the local descriptor constraints. Notably, SceneGlue is trained using only local feature matches, without requiring scene-level ground-truth annotations. This scene-aware approach not only improves accuracy and robustness but also enhances interpretability compared to traditional methods. Extensive experiments on applications such as homography estimation, pose estimation, image matching, and visual localization validate SceneGlue's superior performance. The source code is available at https://github.com/songlin-du/SceneGlue.

Chinese Translation

局部特征匹配在理解跨视图图像之间的对应关系中起着关键作用。然而，传统方法受到特征描述子的固有局部性质的限制，限制了它们捕捉准确跨视图对应所需的非局部场景信息的能力。本文介绍了SceneGlue，一种旨在克服这些限制的场景感知特征匹配框架。SceneGlue利用一种可混合的匹配范式，集成了隐式并行注意力和显式跨视图可见性估计。并行注意力机制在图像内外的局部描述子之间同时交换信息，增强了场景的全局上下文。为了进一步丰富场景感知，我们提出了可见性变换器（Visibility Transformer），它显式地将特征分类为可见和不可见区域，从而提供对跨视图场景可见性的理解。通过结合显式和隐式的场景级感知，SceneGlue有效地弥补了局部描述子的限制。值得注意的是，SceneGlue仅使用局部特征匹配进行训练，而无需场景级的真实注释。与传统方法相比，这种场景感知的方法不仅提高了准确性和鲁棒性，还增强了可解释性。在单应性估计、姿态估计、图像匹配和视觉定位等应用上的大量实验验证了SceneGlue的优越性能。源代码可在 https://github.com/songlin-du/SceneGlue 获取。

cs.CV / 83 / 2604.13947

Heuristic Style Transfer for Real-Time, Efficient Weather Attribute Detection

实时高效天气属性检测的启发式风格迁移

Ouattara, Hamed, Duthon, Pierre, Salmane, Pascal Houssam, Bernardin, Frédéric, Aider, Omar Ait

Abstract

We present lightweight and efficient architectures to detect weather conditions from RGB images, predicting the weather type (sunny, rain, snow, fog) and 11 complementary attributes such as intensity, visibility, and ground condition, for a total of 53 classes across the tasks. This work examines to what extent weather conditions manifest as variations in visual style. We investigate style-inspired techniques, including Gram matrices, a truncated ResNet-50 targeting lower and intermediate layers, and PatchGAN-style architectures, within a multi-task framework with attention mechanisms. Two families are introduced: RTM (ResNet50-Truncated-MultiTasks) and PMG (PatchGAN-MultiTasks-Gram), together with their variants. Our contributions include automation of Gram-matrix computation, integration of PatchGAN into supervised multi-task learning, and local style capture through local Gram for improved spatial coherence. We also release a dataset of 503,875 images annotated with 12 weather attributes under a Creative Commons Attribution (CC-BY) license. The models achieve F1 scores above 96 percent on our internal test set and above 78 percent in zero-shot evaluation on several external datasets, confirming their generalization ability. The PMG architecture, with fewer than 5 million parameters, runs in real time with a small memory footprint, making it suitable for embedded systems. The modular design of the models also allows style-related or weather-related tasks to be added or removed as needed.
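
Gram matrices are the style descriptor named in the abstract. As a minimal sketch of what that computation involves, the function below takes a feature map stored as C flattened channels and returns the C x C matrix of channel inner products. The normalization by the number of spatial positions and the function name are assumptions; the paper's exact variant (including its "local Gram") is not specified here.

```python
def gram_matrix(features):
    """Gram matrix of a feature map.

    features: list of C channels, each a flattened list of H*W activations.
    Returns a C x C matrix with G[i][j] = <F_i, F_j> / (H*W); dividing by
    the number of positions is one common normalization (assumed here).
    """
    positions = len(features[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / positions for fj in features]
        for fi in features
    ]


# Toy feature map with two channels and two spatial positions.
G = gram_matrix([[1.0, 2.0], [3.0, 4.0]])
```

The Gram matrix discards spatial layout and keeps only channel co-activation statistics, which is why it is often read as a texture or style summary rather than a content descriptor.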

Chinese Translation

我们提出了轻量级和高效的架构，从RGB图像中检测天气条件，预测天气类型（晴天、雨天、雪天、雾天）以及11个补充属性，如强度、能见度和地面状况，总共涵盖53个类别。该研究考察了天气条件在视觉风格上的变化程度。我们在多任务框架中研究了风格启发的技术，包括Gram矩阵、针对低层和中间层的截断ResNet-50，以及PatchGAN风格的架构，并结合注意力机制。我们引入了两种模型家族：RTM（ResNet50-Truncated-MultiTasks）和PMG（PatchGAN-MultiTasks-Gram），以及它们的变体。我们的贡献包括Gram矩阵计算的自动化、将PatchGAN集成到监督多任务学习中，以及通过局部Gram捕捉局部风格以改善空间一致性。我们还发布了一个包含503,875张图像的数据集，标注了12个天气属性，采用知识共享署名（CC-BY）许可。模型在我们的内部测试集上实现了超过96%的F1分数，并在多个外部数据集的零样本评估中超过78%，确认了其泛化能力。PMG架构的参数少于500万，能够实时运行且内存占用小，适合嵌入式系统。模型的模块化设计也允许根据需要添加或移除与风格或天气相关的任务。

cs.CV / 84 / 2604.13970

MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images

MApLe：诊断报告与大型医学图像的多实例对齐

Bader, Felicia, Seeböck, Philipp, Bartashova, Anastasia, Attenberger, Ulrike, Langs, Georg

Abstract

In diagnostic reports, experts encode complex imaging data into clinically actionable information. They describe subtle pathological findings that are meaningful in their anatomical context. Reports follow relatively consistent structures, expressing diagnostic information with few words that are often associated with tiny but consequential image observations. Standard vision language models struggle to identify the associations between these informative text components and small locations in the images. Here, we propose "MApLe", a multi-task, multi-instance vision language alignment approach that overcomes these limitations. It disentangles the concepts of anatomical region and diagnostic finding, and links local image information to sentences in a patch-wise approach. Our method consists of a text embedding trained to capture anatomical and diagnostic concepts in sentences, a patch-wise image encoder conditioned on anatomical structures, and a multi-instance alignment of these representations. We demonstrate that MApLe can successfully align different image regions and multiple diagnostic findings in free-text reports. We show that our model improves the alignment performance compared to state-of-the-art baseline models when evaluated on several downstream tasks. The code is available at https://github.com/cirmuw/MApLe.
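
To make the idea of multi-instance alignment concrete, here is a minimal sketch in which a sentence embedding is scored against a bag of patch embeddings by its best-matching patch. Max pooling over cosine similarity is a common multi-instance choice and is an assumption here; the abstract does not state MApLe's actual scoring rule, and all names and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def sentence_image_score(sentence_vec, patch_vecs):
    """Score a sentence against a bag of patches by its best match.

    Returns (best_similarity, index_of_best_patch). Taking the maximum is
    an assumed pooling rule, not necessarily the one used in the paper.
    """
    scores = [cosine(sentence_vec, p) for p in patch_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


# One hypothetical sentence embedding and three hypothetical patch embeddings.
score, patch_index = sentence_image_score(
    [1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
)
```

Pooling over patches lets a short sentence align with a single small region instead of being averaged against the whole image, which matches the motivation given in the abstract.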

Chinese Translation

在诊断报告中，专家将复杂的影像数据编码为临床可操作的信息。他们描述在解剖学背景下具有重要意义的细微病理发现。报告遵循相对一致的结构，使用少量文字表达诊断信息，这些文字通常与微小但重要的图像观察相关。标准的视觉语言模型在识别这些信息文本组件与图像中小位置之间的关联时面临困难。在此，我们提出了“MApLe”，一种多任务、多实例的视觉语言对齐方法，克服了这些局限性。它解构了解剖区域和诊断发现的概念，并以补丁方式将局部图像信息与句子链接。我们的方法包括一个文本嵌入，用于捕捉句子中的解剖和诊断概念，一个基于解剖结构的补丁级图像编码器，以及这些表示的多实例对齐。我们证明了MApLe能够成功对齐自由文本报告中的不同图像区域和多个诊断发现。我们展示了在多个下游任务评估时，我们的模型在对齐性能上优于最先进的基线模型。代码可在 https://github.com/cirmuw/MApLe 获取。

cs.CV / 85 / 2604.13981

HiProto: Hierarchical Prototype Learning for Interpretable Object Detection Under Low-quality Conditions

HiProto：低质量条件下可解释物体检测的层次原型学习

Xiang, Jianlin, Dai, Linhui, Yang, Xue, Yang, Chaolei, Li, Yanshan

Abstract

Interpretability is essential for deploying object detection systems in critical applications, especially under low-quality imaging conditions that degrade visual information and increase prediction uncertainty. Existing methods either enhance image quality or design complex architectures, but often lack interpretability and fail to improve semantic discrimination. In contrast, prototype learning enables interpretable modeling by associating features with class-centered semantics, which can provide more stable and interpretable representations under degradation. Motivated by this, we propose HiProto, a new paradigm for interpretable object detection based on hierarchical prototype learning. By constructing structured prototype representations across multiple feature levels, HiProto effectively models class-specific semantics, thereby enhancing both semantic discrimination and interpretability. Building upon prototype modeling, we first propose a Region-to-Prototype Contrastive Loss (RPC-Loss) to enhance the semantic focus of prototypes on target regions. Then, we propose a Prototype Regularization Loss (PR-Loss) to improve the distinctiveness among class prototypes. Finally, we propose a Scale-aware Pseudo Label Generation Strategy (SPLGS) to suppress mismatched supervision for RPC-Loss, thereby preserving the robustness of low-level prototype representations. Experiments on ExDark, RTTS, and VOC2012-FOG demonstrate that HiProto achieves competitive results while offering clear interpretability through prototype responses, without relying on image enhancement or complex architectures. Our code will be available at https://github.com/xjlDestiny/HiProto.git.

Chinese Translation

可解释性对于在关键应用中部署物体检测系统至关重要，尤其是在低质量成像条件下，这些条件会降低视觉信息并增加预测不确定性。现有方法要么增强图像质量，要么设计复杂的架构，但往往缺乏可解释性，并未能改善语义区分。相比之下，原型学习通过将特征与以类为中心的语义关联，能够实现可解释建模，这在退化情况下可以提供更稳定和可解释的表示。基于此动机，我们提出了HiProto，一种基于层次原型学习的可解释物体检测新范式。通过在多个特征层次上构建结构化的原型表示，HiProto有效地建模了类特定的语义，从而增强了语义区分和可解释性。在原型建模的基础上，我们首先提出了一种区域到原型对比损失（Region-to-Prototype Contrastive Loss，RPC-Loss），以增强原型在目标区域上的语义聚焦。然后，我们提出了一种原型正则化损失（Prototype Regularization Loss，PR-Loss），以提高类原型之间的独特性。最后，我们提出了一种尺度感知伪标签生成策略（Scale-aware Pseudo Label Generation Strategy，SPLGS），以抑制RPC-Loss中的不匹配监督，从而保持低级原型表示的鲁棒性。在ExDark、RTTS和VOC2012-FOG上的实验表明，HiProto在提供清晰可解释性的同时，取得了具有竞争力的结果，而无需依赖图像增强或复杂架构。我们的代码将发布在 https://github.com/xjlDestiny/HiProto.git。

cs.CV / 86 / 2604.13994

Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework

针对不平衡纹理的遥感图像超分辨率：一种纹理感知扩散框架

Zhang, Enzhuo, Zhao, Sijie, Muhtar, Dilxat, Li, Zhenshi, Zhang, Xueliang, Xiao, Pengfeng

Abstract

Generative diffusion priors have recently achieved state-of-the-art performance in natural image super-resolution, demonstrating a powerful capability to synthesize photorealistic details. However, their direct application to remote sensing image super-resolution (RSISR) reveals significant shortcomings. Unlike natural images, remote sensing images exhibit a unique texture distribution where ground objects are globally stochastic yet locally clustered, leading to highly imbalanced textures. This imbalance severely hinders the model's spatial perception. To address this, we propose TexADiff, a novel framework that begins by estimating a Relative Texture Density Map (RTDM) to represent the texture distribution. TexADiff then leverages this RTDM in three synergistic ways: as an explicit spatial conditioning to guide the diffusion process, as a loss modulation term to prioritize texture-rich regions, and as a dynamic adapter for the sampling schedule. These modifications are designed to endow the model with explicit texture-aware capabilities. Experiments demonstrate that TexADiff achieves superior or competitive quantitative metrics. Furthermore, qualitative results show that our model generates faithful high-frequency details while effectively suppressing texture hallucinations. This improved reconstruction quality also results in significant gains in downstream task performance. The source code of our method can be found at https://github.com/ZezFuture/TexAdiff.
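
One of the three uses of the RTDM is as a loss modulation term that prioritizes texture-rich regions. A minimal sketch of that idea is a per-pixel weighted squared error. The weighting rule `1 + strength * density`, the normalization, and all names are assumptions for illustration; the abstract does not give TexADiff's actual formula.

```python
def texture_weighted_mse(pred, target, density, strength=1.0):
    """Squared error averaged with texture-dependent weights.

    pred, target, density: flattened per-pixel lists of equal length.
    Each pixel gets weight 1 + strength * density (an assumed rule), so
    pixels in texture-rich regions contribute more to the loss.
    """
    weights = [1.0 + strength * d for d in density]
    weighted = sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred, target))
    return weighted / sum(weights)


# Two pixels: the second lies in a texture-rich region (density 1.0).
loss = texture_weighted_mse([0.0, 0.0], [1.0, 2.0], [0.0, 1.0])
```

Setting `strength=0.0` recovers the ordinary mean squared error, which makes the effect of the modulation easy to ablate in this sketch.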

Chinese Translation

生成扩散先验最近在自然图像超分辨率方面取得了最先进的性能，展现了合成逼真细节的强大能力。然而，它们在遥感图像超分辨率（RSISR）中的直接应用暴露出显著的不足。与自然图像不同，遥感图像呈现出独特的纹理分布，其中地面物体在全局上是随机的，但在局部上则是聚集的，导致纹理高度不平衡。这种不平衡严重阻碍了模型的空间感知。为了解决这个问题，我们提出了TexADiff，这是一种新颖的框架，首先通过估计相对纹理密度图（Relative Texture Density Map, RTDM）来表示纹理分布。TexADiff随后以三种协同的方式利用这个RTDM：作为指导扩散过程的显式空间条件，作为优先考虑纹理丰富区域的损失调制项，以及作为采样调度的动态适配器。这些修改旨在赋予模型显式的纹理感知能力。实验表明，TexADiff在定量指标上实现了优越或具有竞争力的表现。此外，定性结果显示我们的模型在生成真实的高频细节的同时，有效抑制了纹理幻觉。这种改进的重建质量也显著提升了下游任务的性能。我们方法的源代码可以在 https://github.com/ZezFuture/TexAdiff 找到。

cs.CV / 87 / 2604.13995

Depth-Aware Image and Video Orientation Estimation

深度感知图像和视频方向估计

Alam, Muhammad Z., Stetsiuk, Larry, Mukati, M. Umair, Kaleem, Zeeshan

Abstract

This paper introduces a novel approach for image and video orientation estimation by leveraging depth distribution in natural images. The proposed method estimates the orientation based on the depth distribution across different quadrants of the image, providing a robust framework for orientation estimation suited for applications such as virtual reality (VR), augmented reality (AR), autonomous navigation, and interactive surveillance systems. To further enhance fine-scale perceptual alignment, we incorporate depth gradient consistency (DGC) and horizontal symmetry analysis (HSA), enabling precise orientation correction. This hybrid strategy effectively exploits depth cues to support spatial coherence and perceptual stability in immersive visual content. Qualitative and quantitative evaluations demonstrate the robustness and accuracy of the proposed approach, outperforming existing techniques across diverse scenarios.
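
The quadrant-based depth statistic can be sketched as follows. The decision rule used here, that the image side with the largest mean depth is treated as "up" because distant content such as sky tends to sit at the top of upright natural images, is an assumption for illustration; the abstract does not state the authors' rule, and the DGC and HSA refinements are not modeled.

```python
def quadrant_means(depth):
    """Mean depth of the four quadrants of a 2D grid (list of rows, at least 2x2)."""
    h, w = len(depth), len(depth[0])
    mh, mw = h // 2, w // 2

    def mean(rows, cols):
        vals = [depth[r][c] for r in rows for c in cols]
        return sum(vals) / len(vals)

    return {
        "top_left": mean(range(0, mh), range(0, mw)),
        "top_right": mean(range(0, mh), range(mw, w)),
        "bottom_left": mean(range(mh, h), range(0, mw)),
        "bottom_right": mean(range(mh, h), range(mw, w)),
    }


def farthest_side(depth):
    """Return the image side whose two quadrants are farthest on average."""
    q = quadrant_means(depth)
    sides = {
        "top": (q["top_left"] + q["top_right"]) / 2,
        "bottom": (q["bottom_left"] + q["bottom_right"]) / 2,
        "left": (q["top_left"] + q["bottom_left"]) / 2,
        "right": (q["top_right"] + q["bottom_right"]) / 2,
    }
    return max(sides, key=sides.get)


# Hypothetical depth map whose right half is far away: the image is likely rotated.
side = farthest_side([[1.0, 9.0], [1.0, 9.0]])
```

Under the stated assumption, a result other than "top" suggests the image should be rotated so that the farthest side ends up at the top.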

Chinese Translation

本文提出了一种新颖的图像和视频方向估计方法，该方法利用自然图像中的深度分布。所提出的方法基于图像不同象限的深度分布来估计方向，为虚拟现实（VR）、增强现实（AR）、自主导航和互动监控系统等应用提供了一个稳健的方向估计框架。为了进一步增强细尺度的感知对齐，我们结合了深度梯度一致性（DGC）和水平对称性分析（HSA），实现精确的方向修正。这种混合策略有效利用深度线索，以支持沉浸式视觉内容中的空间一致性和感知稳定性。定性和定量评估表明，所提出的方法具有稳健性和准确性，在多种场景中优于现有技术。

cs.CV / 88 / 2604.14025

Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

前馈式三维场景建模：一个问题驱动的视角

Wang, Weijie, Cao, Qihang, Gao, Sensen, Chen, Donny Y., Xu, Haofei, Bian, Wenjing, Peng, Songyou, Cham, Tat-Jen, Zheng, Chuanxia, Geiger, Andreas, Cai, Jianfei, Bian, Jia-Wang, Zhuang, Bohan

Abstract

Reconstructing 3D representations from 2D inputs is a fundamental task in computer vision and graphics, serving as a cornerstone for understanding and interacting with the physical world. While traditional methods achieve high fidelity, they are limited by slow per-scene optimization or category-specific training, which hinders their practical deployment and scalability. Hence, generalizable feed-forward 3D reconstruction has witnessed rapid development in recent years. By learning a model that maps images directly to 3D representations in a single forward pass, these methods enable efficient reconstruction and robust cross-scene generalization. Our survey is motivated by a critical observation: despite the diverse geometric output representations, ranging from implicit fields to explicit primitives, existing feed-forward approaches share similar high-level architectural patterns, such as image feature extraction backbones, multi-view information fusion mechanisms, and geometry-aware design principles. Consequently, we abstract away from these representation differences and instead focus on model design, proposing a novel taxonomy centered on model design strategies that are agnostic to the output format. Our proposed taxonomy organizes the research directions into five key problems that drive recent research development: feature enhancement, geometry awareness, model efficiency, augmentation strategies and temporal-aware models. To support this taxonomy with empirical grounding and standardized evaluation, we further comprehensively review related benchmarks and datasets, and extensively discuss and categorize real-world applications based on feed-forward 3D models. Finally, we outline future directions to address open challenges such as scalability, evaluation standards, and world modeling.

Chinese Translation

从二维输入重建三维表示是计算机视觉和图形学中的一项基础任务，是理解和与物理世界互动的基石。尽管传统方法能够实现高保真度，但由于每个场景的优化速度慢或特定类别的训练，这限制了它们的实际部署和可扩展性。因此，通用的前馈式三维重建在近年来得到了快速发展。通过学习一个将图像直接映射到三维表示的模型，这些方法能够在单次前向传递中实现高效重建和稳健的跨场景泛化。我们的调查受到一个关键观察的启发：尽管现有的前馈方法在几何输出表示上多样，从隐式场到显式原语，但它们共享类似的高层次架构模式，例如图像特征提取骨干、多视图信息融合机制和几何感知设计原则。因此，我们抽象出这些表示差异，而是专注于模型设计，提出一个以模型设计策略为中心的新分类法，该分类法与输出格式无关。我们提出的分类法将研究方向组织为五个关键问题，这些问题推动了近期的研究发展：特征增强、几何感知、模型效率、增强策略和时间感知模型。为了以实证基础和标准化评估支持这一分类法，我们进一步全面回顾相关基准和数据集，并广泛讨论和分类基于前馈式三维模型的现实应用。最后，我们概述了未来的研究方向，以应对可扩展性、评估标准和世界建模等开放挑战。

cs.CV / 89 / 2604.14029

POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

POINTS-Seeker：从零开始训练多模态智能搜索模型

Liu, Yikun, Liu, Yuan, Tian, Le, Zhou, Xiao, Yao, Jiangchao, Wang, Yanfeng, Xie, Weidi

Abstract

While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models have been adopted to actively interact with the external environment for evidence retrieval. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch. Specifically, we make the following contributions: (i) we introduce Agentic Seeding, a dedicated phase designed to weave the foundational precursors necessary for eliciting agentic behaviors; (ii) we uncover a performance bottleneck in long-horizon interactions, where the increasing volume of interaction history overwhelms the model's ability to locate ground-truth evidence. To mitigate this, we propose V-Fold, an adaptive history-aware compression scheme that preserves recent dialogue turns in high fidelity while folding historical context into the visual space via rendering; and (iii) we develop POINTS-Seeker-8B, a state-of-the-art multimodal agentic search model that consistently outperforms existing models across six diverse benchmarks, effectively resolving the challenges of long-horizon, knowledge-intensive visual reasoning.

Chinese Translation

尽管大型多模态模型（LMMs）展现了令人印象深刻的视觉感知能力，但它们仍然受到静态参数知识的认知限制。为了超越这些界限，采用了多模态搜索模型，以便主动与外部环境互动以获取证据。与仅仅将通用LMMs与搜索工具作为模块扩展的现有范式不同，我们探索了从零开始构建多模态智能搜索模型的潜力。具体而言，我们做出了以下贡献：（i）我们引入了智能种子（Agentic Seeding），这是一个专门设计的阶段，旨在编织出引发智能行为所需的基础前提；（ii）我们发现了长时间交互中的性能瓶颈，在这种情况下，日益增加的交互历史量超出了模型定位真实证据的能力。为了解决这个问题，我们提出了V-Fold，这是一种自适应的历史感知压缩方案，能够高保真地保留最近的对话轮次，同时通过渲染将历史上下文折叠到视觉空间中；（iii）我们开发了POINTS-Seeker-8B，这是一种最先进的多模态智能搜索模型，在六个不同的基准测试中持续超越现有模型，有效解决了长时间、知识密集型视觉推理的挑战。

cs.CV / 90 / 2604.14041

Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios

寻求与解决：针对日常场景中视觉线索驱动推理的多模态大语言模型基准测试

Li, Xiaomin, Wang, Tala, Zhong, Zichen, Zhang, Ying, Zheng, Zirui, Isobe, Takashi, Li, Dezhuang, Lu, Huchuan, He, You, Jia, Xu

Abstract

Daily scenarios are characterized by visual richness, requiring Multimodal Large Language Models (MLLMs) to filter noise and identify decisive visual clues for accurate reasoning. Yet, current benchmarks predominantly aim at evaluating MLLMs' pre-existing knowledge or perceptual understanding, often neglecting the critical capability of reasoning. To bridge this gap, we introduce DailyClue, a benchmark designed for visual clue-driven reasoning in daily scenarios. Our construction is guided by two core principles: (1) strict grounding in authentic daily activities, and (2) challenging query design that necessitates more than surface-level perception. Instead of simple recognition, our questions compel MLLMs to actively explore suitable visual clues and leverage them for subsequent reasoning. To this end, we curate a comprehensive dataset spanning four major daily domains and 16 distinct subtasks. Comprehensive evaluation across MLLMs and agentic models underscores the formidable challenge posed by our benchmark. Our analysis reveals several critical insights, emphasizing that the accurate identification of visual clues is essential for robust reasoning.

Chinese Translation

日常场景的特点是视觉丰富性，这要求多模态大语言模型（MLLMs）能够过滤噪声并识别决定性的视觉线索以进行准确推理。然而，目前的基准测试主要旨在评估MLLMs的先验知识或感知理解，往往忽视了推理这一关键能力。为了解决这一问题，我们引入了DailyClue，一个专为日常场景中的视觉线索驱动推理设计的基准测试。我们的构建遵循两个核心原则：（1）严格基于真实的日常活动，和（2）具有挑战性的查询设计，要求超越表层感知。我们的提问不仅仅是简单的识别，而是促使MLLMs积极探索合适的视觉线索并利用这些线索进行后续推理。为此，我们整理了一个涵盖四个主要日常领域和16个不同子任务的全面数据集。对MLLMs和自主模型的综合评估突显了我们基准测试所带来的巨大挑战。我们的分析揭示了几个关键见解，强调准确识别视觉线索对于稳健推理的重要性。

cs.CV / 91 / 2604.14044

Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models

解码变化：利用多模态大型语言模型统一遥感变化检测与理解

Li, Xiaohe, Li, Jiahao, Zhang, Kaixin, Fang, Yuqiang, Lin, Leilei, Wang, Hong, Wu, Haohua, Fan, Zide

Abstract

While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "temporal blindness". Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework explicitly tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: a Change-Enhanced Attention module that systematically isolates and amplifies visual differences, a Change-SEG module utilizing Change Prior Embedding to extract differentiable difference features as input for the LLM, and Local Causal Attention to prevent cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA decisively outperforms leading generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization, establishing a unified framework for earth observation intelligence.

Chinese Translation

尽管多模态大型语言模型（MLLMs）在一般视觉-语言任务中表现出色，但其在遥感变化理解中的应用受到根本性“时间盲性”的限制。现有架构缺乏多时间对比推理的内在机制，并且在精确的空间定位方面存在困难。为了解决这一问题，我们首先介绍了Delta-QA，这是一个包含18万个视觉问答样本的综合基准。Delta-QA统一了双时间和三时间场景下的像素级分割与视觉问答，将变化解释结构化为四个渐进的认知维度。在方法论上，我们提出了Delta-LLaVA，这是一个专门为多时间遥感解释量身定制的新型MLLM框架。它通过三项核心创新克服了简单特征连接的局限性：一个变化增强注意力模块系统性地隔离和放大视觉差异，一个利用变化先验嵌入的Change-SEG模块提取可区分的差异特征作为LLM的输入，以及局部因果注意力以防止跨时间上下文泄漏。大量实验表明，Delta-LLaVA在复杂变化推理和高精度边界定位方面显著优于领先的一般性MLLM和专业分割模型，建立了一个统一的地球观测智能框架。

cs.CV / 92 / 2604.14048

Free Geometry: Refining 3D Reconstruction from Longer Versions of Itself

自由几何：从自身更长版本中精细化三维重建

Dai, Yuhang, Yang, Xingyi

Abstract

Feed-forward 3D reconstruction models are efficient but rigid: once trained, they perform inference in a zero-shot manner and cannot adapt to the test scene. As a result, visually plausible reconstructions often contain errors, particularly under occlusions, specularities, and ambiguous cues. To address this, we introduce Free Geometry, a framework that enables feed-forward 3D reconstruction models to self-evolve at test time without any 3D ground truth. Our key insight is that, when the model receives more views, it produces more reliable and view-consistent reconstructions. Leveraging this property, given a testing sequence, we mask a subset of frames to construct a self-supervised task. Free Geometry enforces cross-view feature consistency between representations from full and partial observations, while maintaining the pairwise relations implied by the held-out frames. This self-supervision allows for fast recalibration via lightweight LoRA updates, taking less than 2 minutes per dataset on a single GPU. Our approach consistently improves state-of-the-art foundation models, including Depth Anything 3 and VGGT, across 4 benchmark datasets, yielding an average improvement of 3.73% in camera pose accuracy and 2.88% in point map prediction. Code is available at https://github.com/hiteacherIamhumble/Free-Geometry .
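
The self-supervised setup described above, masking a subset of frames and enforcing consistency between features from full and partial observations, can be sketched in a few lines. Everything concrete here is an assumption: the random split, the mean squared error as the consistency measure, and the representation of features as per-frame lists of floats. The pairwise-relation term and the LoRA update step are not modeled.

```python
import random


def split_frames(num_frames, keep_ratio=0.5, seed=0):
    """Randomly split frame indices into kept and held-out subsets.

    keep_ratio is assumed to lie in (0, 1]; at least one frame is kept.
    """
    rng = random.Random(seed)
    indices = list(range(num_frames))
    kept = sorted(rng.sample(indices, max(1, int(num_frames * keep_ratio))))
    held_out = [i for i in indices if i not in kept]
    return kept, held_out


def consistency_loss(full_feats, partial_feats):
    """Mean squared difference between features from the two passes.

    full_feats / partial_feats: dicts mapping frame index to a feature
    list. Only frames present in both passes are compared.
    """
    shared = sorted(set(full_feats) & set(partial_feats))
    total, count = 0.0, 0
    for idx in shared:
        for a, b in zip(full_feats[idx], partial_feats[idx]):
            total += (a - b) ** 2
            count += 1
    return total / count if count else 0.0


kept, held_out = split_frames(8)
loss = consistency_loss({0: [1.0, 2.0]}, {0: [1.0, 4.0]})
```

In this reading, the full-observation pass acts as the more reliable teacher signal, and the partial pass is pulled toward it, which mirrors the abstract's observation that more views yield more reliable reconstructions.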

Chinese Translation

前馈三维重建模型高效但缺乏灵活性：一旦训练完成，它们以零样本的方式进行推理，无法适应测试场景。因此，在遮挡、镜面反射和模糊线索下，视觉上合理的重建往往会出现错误。为了解决这个问题，我们提出了自由几何（Free Geometry），一个框架，使前馈三维重建模型能够在测试时自我演化，而无需任何三维真实值。我们的关键见解是，当模型接收到更多视角时，它能够生成更可靠且视角一致的重建。利用这一特性，在给定的测试序列中，我们对一部分帧进行遮蔽，以构建自监督任务。自由几何强制实现来自完整和部分观测的表示之间的跨视角特征一致性，同时保持被遮蔽帧所隐含的成对关系。这种自监督方法允许通过轻量级的LoRA更新进行快速重新校准，在单个GPU上每个数据集的时间少于2分钟。我们的方法在4个基准数据集上持续提升了最先进的基础模型，包括Depth Anything 3和VGGT，平均提高了3.73%的相机姿态精度和2.88%的点图预测精度。代码可在 https://github.com/hiteacherIamhumble/Free-Geometry 获取。

cs.CV / 93 / 2604.14062

OneHOI: Unifying Human-Object Interaction Generation and Editing

OneHOI：统一人类-物体交互生成与编辑

Hoe, Jiun Tian, Hu, Weipeng, Jiang, Xudong, Tan, Yap-Peng, Chan, Chee Seng

Abstract

Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as (person, action, object) triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code is available at https://jiuntian.github.io/OneHOI/.

Chinese Translation

人类-物体交互（HOI）建模捕捉人类如何对物体进行操作和与之相关联，通常以<人，动作，物体>三元组的形式表达。现有方法分为两大类：HOI生成从结构化三元组和布局合成场景，但未能整合如HOI和仅物体实体等混合条件；而HOI编辑通过文本修改交互，但在将姿势与物理接触解耦及扩展到多个交互方面存在困难。我们提出了OneHOI，这是一种统一的扩散变换器框架，将HOI生成与编辑整合为一个由共享结构化交互表示驱动的单一条件去噪过程。其核心是关系扩散变换器（Relational Diffusion Transformer，R-DiT），通过角色和实例感知的HOI令牌、基于布局的空间动作定位、强制交互拓扑的结构化HOI注意力，以及解耦多HOI场景的HOI RoPE，建模动词介导的关系。OneHOI在我们的HOI-Edit-44K数据集上与模态丢弃共同训练，并结合HOI和以物体为中心的数据集，支持基于布局的、无布局的、任意掩码和混合条件控制，在HOI生成和编辑方面均取得了最先进的结果。代码可在https://jiuntian.github.io/OneHOI/获取。

cs.CV / 94 / 2604.14069

Towards Unconstrained Human-Object Interaction

面向无约束的人类-物体交互

Tonini, Francesco, Conti, Alessandro, Vaquero, Lorenzo, Beyan, Cigdem, Ricci, Elisa

Abstract

Human-Object Interaction (HOI) detection is a longstanding computer vision problem concerned with predicting the interaction between humans and objects. Current HOI models rely on a vocabulary of interactions at training and inference time, limiting their applicability to static environments. With the advent of Multimodal Large Language Models (MLLMs), it has become feasible to explore more flexible paradigms for interaction recognition. In this work, we revisit HOI detection through the lens of MLLMs and apply them to in-the-wild HOI detection. We define the Unconstrained HOI (U-HOI) task, a novel HOI domain that removes the requirement for a predefined list of interactions at both training and inference. We evaluate a range of MLLMs on this setting and introduce a pipeline that includes test-time inference and language-to-graph conversion to extract structured interactions from free-form text. Our findings highlight the limitations of current HOI detectors and the value of MLLMs for U-HOI. Code will be available at https://github.com/francescotonini/anyhoi

Chinese Translation

人类-物体交互（HOI）检测是一个长期存在的计算机视觉问题，旨在预测人类与物体之间的交互。目前的HOI模型在训练和推理时依赖于交互词汇，这限制了它们在静态环境中的适用性。随着多模态大型语言模型（MLLMs）的出现，探索更灵活的交互识别范式变得可行。在本研究中，我们通过MLLM的视角重新审视HOI检测，并将其应用于真实环境中的HOI检测。我们定义了无约束HOI（U-HOI）任务，这是一个新颖的HOI领域，取消了在训练和推理时对预定义交互列表的要求。我们在这一设置下评估了一系列MLLM，并引入了一个管道，包括测试时推理和语言到图的转换，以从自由文本中提取结构化交互。我们的研究结果突显了当前HOI检测器的局限性以及MLLM在U-HOI中的价值。代码将发布在 https://github.com/francescotonini/anyhoi

cs.CV / 95 / 2604.14074

Training-Free Semantic Multi-Object Tracking with Vision-Language Models

无训练的语义多目标跟踪与视觉-语言模型

Bonat, Laurence, Tonini, Francesco, Ricci, Elisa, Vaquero, Lorenzo

Abstract

Semantic Multi-Object Tracking (SMOT) extends multi-object tracking with semantic outputs such as video summaries, instance-level captions, and interaction labels, aiming to move from trajectories to human-interpretable descriptions of dynamic scenes. Existing SMOT systems are trained end-to-end, coupling progress to expensive supervision, limiting the ability to rapidly adapt to new foundation models and new interactions. We propose TF-SMOT, a training-free SMOT pipeline that composes pretrained components for detection, mask-based tracking, and video-language generation. TF-SMOT combines D-FINE and the promptable SAM2 segmentation tracker to produce temporally consistent tracklets, uses contour grounding to generate video summaries and instance captions with InternVideo2.5, and aligns extracted interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with LLM disambiguation. On BenSMOT, TF-SMOT achieves state-of-the-art tracking performance within the SMOT setting and improves summary and caption quality compared to prior art. Interaction recognition, however, remains challenging under strict exact-match evaluation on the fine-grained and long-tailed WordNet label space; our analysis and ablations indicate that semantic overlap and label granularity substantially affect measured performance.

Chinese Translation

语义多目标跟踪（SMOT）通过视频摘要、实例级描述和交互标签等语义输出扩展了多目标跟踪，旨在从轨迹转向对动态场景的人类可解释描述。现有的SMOT系统采用端到端训练，将进展与昂贵的监督相结合，限制了快速适应新基础模型和新交互的能力。我们提出了TF-SMOT，这是一种无训练的SMOT管道，通过组合预训练的检测、基于掩码的跟踪和视频-语言生成组件来实现。TF-SMOT结合了D-FINE和可提示的SAM2分割跟踪器，以生成时间一致的跟踪片段，利用轮廓定位生成视频摘要和实例描述，并通过与LLM消歧的基于释义的语义检索将提取的交互谓词与BenSMOT WordNet同义词集对齐。在BenSMOT上，TF-SMOT在SMOT设置中实现了最先进的跟踪性能，并提高了与先前方法相比的摘要和描述质量。然而，在细粒度和长尾的WordNet标签空间下，交互识别在严格的精确匹配评估中仍然具有挑战性；我们的分析和消融实验表明，语义重叠和标签粒度对测量性能有显著影响。


cs.CV / 96 / 2604.14113

UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

UI-Zoomer：基于不确定性的自适应放大框架用于GUI定位

Tang, Fei, Chen, Bofan, Lu, Zhengxi, Chen, Tongbo, Nong, Songqin, Jiang, Tao, Xu, Wenhao, Lu, Weiming, Xiao, Jun, Zhuang, Yueting, Shen, Yongliang

Abstract

GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.

Chinese Translation

GUI定位是指从截图中根据自然语言查询定位界面元素，这对于小图标和密集布局仍然具有挑战性。测试时的放大方法通过裁剪并在更高分辨率下重新进行推理来改善定位，但在所有实例上均匀应用裁剪，且裁剪大小固定，忽略了模型在每种情况下的实际不确定性。我们提出了UI-Zoomer，一个无训练的自适应放大框架，将放大的触发和尺度视为预测不确定性量化问题。一个基于置信度的门控机制融合了随机候选者之间的空间共识与令牌级生成置信度，仅在定位不确定时选择性地触发放大。当触发时，一个基于不确定性的裁剪大小模块将预测方差分解为样本间位置扩散和样本内框的范围，通过总方差法则推导出每个实例的裁剪半径。在ScreenSpot-Pro、UI-Vision和ScreenSpot-v2上的大量实验表明，在多个模型架构上，相较于强基线，均实现了一致的改进，分别达到了+13.4%、+10.3%和+4.2%的增益，且无需额外训练。
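
Illustrative sketch (not taken from the paper): the abstract says the crop radius comes from splitting prediction variance into an inter-sample term (spread of box centers) and an intra-sample term (box extent) via the law of total variance, Var[X] = Var[E[X | sample]] + E[Var[X | sample]]. The version below assumes each candidate box is a uniform distribution over its own extent and that the radius is a multiple `k` of the total standard deviation; both choices, along with the function name `crop_radius`, are assumptions made here for illustration.

```python
import math
from statistics import fmean, pvariance


def crop_radius(boxes, k=2.0):
    """Estimate a crop center and radius from sampled boxes.

    boxes: list of (x1, y1, x2, y2) candidates from stochastic decoding.
    k: hypothetical scale factor applied to the total standard deviation.
    """
    cxs = [(x1 + x2) / 2 for x1, _, x2, _ in boxes]
    cys = [(y1 + y2) / 2 for _, y1, _, y2 in boxes]

    # Inter-sample term, Var[E[X | sample]]: how far the box centers disagree.
    inter = pvariance(cxs) + pvariance(cys)

    # Intra-sample term, E[Var[X | sample]]: assumes a uniform distribution
    # inside each box, whose variance along one axis is (width ** 2) / 12.
    intra = fmean(
        ((x2 - x1) ** 2 + (y2 - y1) ** 2) / 12 for x1, y1, x2, y2 in boxes
    )

    total_variance = inter + intra
    center = (fmean(cxs), fmean(cys))
    return center, k * math.sqrt(total_variance)
```

With identical candidates the inter-sample term vanishes and the radius reflects box size alone; as candidates scatter, the radius grows, which matches the intuition that a less certain prediction needs a wider crop.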


cs.CV / 97 / 2604.14125

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

HiVLA：一种以视觉为基础的层次化具身操控系统

Yang, Tianshuo, Chen, Guanyu, Chen, Yutian, Liang, Zhixuan, Liu, Yitian, Chen, Zanxin, Xu, Chunpu, Liang, Haotian, Pang, Jiangmiao, Mu, Yao, Luo, Ping

Abstract

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce in the low-level part a flow-matching Diffusion Transformer (DiT) action expert equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.

Chinese Translation

尽管端到端的视觉-语言-动作（VLA）模型为机器人操控提供了一个有前景的范式，但在狭窄的控制数据上进行微调往往会损害其基础视觉-语言模型（VLM）所继承的深层推理能力。为了解决这一基本权衡，我们提出了HiVLA，一种以视觉为基础的层次化框架，明确地将高层语义规划与低层运动控制解耦。在高层部分，VLM规划器首先执行任务分解和视觉定位，以生成结构化计划，包括子任务指令和精确的目标边界框。然后，为了将该计划转化为物理动作，我们在低层部分引入了一种流匹配扩散变换器（Diffusion Transformer，DiT）动作专家，该专家配备了一种新颖的级联交叉注意力机制。该设计顺序融合了全局上下文、高分辨率的物体中心裁剪和技能语义，使DiT能够专注于稳健的执行。我们的解耦架构保留了VLM的零样本推理能力，同时允许两个组件的独立改进。在仿真和现实世界中的广泛实验表明，HiVLA显著优于最先进的端到端基线，特别是在长时间跨度的技能组合和在杂乱场景中对小物体的精细操控方面表现出色。


cs.CV / 98 / 2604.14129

Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

别让视频发声：音频对比偏好优化用于音视频语言模型

Baid, Ami, Xue, Zihui, Grauman, Kristen

Abstract

While Audio-Visual Language Models (AVLMs) have achieved remarkable progress over recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is video-driven audio hallucination: models routinely exploit visual shortcuts to hallucinate expected sounds, discarding true auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO). This dual-axis preference learning framework introduces an output-contrastive objective to penalize visual descriptions masquerading as audio facts, alongside an input-contrastive objective that swaps audio tracks to explicitly penalize generation invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes highly faithful audio grounding and mitigates audio hallucination without compromising overarching multimodal capabilities.

Chinese Translation

尽管音视频语言模型（AVLMs）近年来取得了显著进展，但其可靠性受到跨模态幻觉的瓶颈制约。一个特别普遍的表现是视频驱动的音频幻觉：模型常常利用视觉捷径来幻觉预期的声音，忽视真实的听觉证据。为了对抗这种根深蒂固的视觉主导现象，我们提出了音频对比偏好优化（Audio-Contrastive Preference Optimization, ACPO）。该双轴偏好学习框架引入了一种输出对比目标，以惩罚伪装成音频事实的视觉描述，同时引入了一种输入对比目标，通过交换音频轨道明确惩罚与真实听觉信号无关的生成。大量实验表明，ACPO建立了高度忠实的音频基础，并在不妥协整体多模态能力的情况下减轻了音频幻觉。


cs.CV / 99 / 2604.14141

Geometric Context Transformer for Streaming 3D Reconstruction

用于流式3D重建的几何上下文变换器

Chen, Lin-Zhuo, Gao, Jian, Chen, Yihang, Cheng, Ka Leong, Sun, Yipengjing, Hu, Liangxiao, Xue, Nan, Zhu, Xing, Shen, Yujun, Yao, Yao, Xu, Yinghao

Abstract

Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable, efficient inference at around 20 FPS on 518 x 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach achieves superior performance compared to both existing streaming and iterative optimization-based approaches.

Chinese Translation

流式3D重建旨在从视频流中恢复3D信息，如相机姿态和点云，这需要几何准确性、时间一致性和计算效率。受同时定位与地图构建（SLAM）原理的启发，我们引入了LingBot-Map，这是一种前馈3D基础模型，用于从流数据中重建场景，基于几何上下文变换器（GCT）架构。LingBot-Map的一个显著特点是其精心设计的注意力机制，该机制整合了锚点上下文、姿态参考窗口和轨迹记忆，分别用于解决坐标定位、密集几何线索和长距离漂移校正。该设计保持了流式状态的紧凑性，同时保留了丰富的几何上下文，使得在超过10,000帧的长序列中，能够在518 x 378分辨率输入上以约20帧每秒的速度进行稳定高效的推理。在多种基准测试中的广泛评估表明，我们的方法在性能上优于现有的流式方法和基于迭代优化的方法。


cs.CV / 100 / 2604.14144

SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

SpatialEvo：通过确定性几何环境实现自我进化的空间智能

Li, Dinging, Zhao, Yingxiu, Cheng, Xinrui, Lin, Kangheng, Peng, Hongbo, Li, Hongxing, Wang, Zixuan, Dai, Yuhong, Li, Haodong, Wang, Jia, Shi, Yukang, Zhao, Liang, Sun, Jianjian, Ge, Zheng, Zhang, Xiangyu, Lu, Weiming, Xiao, Jun, Zhuang, Yueting, Shen, Yongliang

Abstract

Spatial reasoning over three-dimensional scenes is a core capability for embodied intelligence, yet continuous model improvement remains bottlenecked by the cost of geometric annotation. The self-evolving paradigm offers a promising path, but its reliance on model consensus to construct pseudo-labels causes training to reinforce rather than correct the model's own geometric errors. We identify a property unique to 3D spatial reasoning that circumvents this limitation: ground truth is a deterministic consequence of the underlying geometry, computable exactly from point clouds and camera poses without any model involvement. Building on this insight, we present SpatialEvo, a self-evolving framework for 3D spatial reasoning, centered on the Deterministic Geometric Environment (DGE). The DGE formalizes 16 spatial reasoning task categories under explicit geometric validation rules and converts unannotated 3D scenes into zero-noise interactive oracles, replacing model consensus with objective physical feedback. A single shared-parameter policy co-evolves across questioner and solver roles under DGE constraints: the questioner generates physically valid spatial questions grounded in scene observations, while the solver derives precise answers against DGE-verified ground truth. A task-adaptive scheduler endogenously concentrates training on the model's weakest categories, producing a dynamic curriculum without manual design. Experiments across nine benchmarks demonstrate that SpatialEvo achieves the highest average score at both 3B and 7B scales, with consistent gains on spatial reasoning benchmarks and no degradation on general visual understanding.

Chinese Translation

三维场景的空间推理是具身智能的核心能力，然而持续的模型改进仍然受到几何标注成本的制约。自我进化范式提供了一条有前景的路径，但其依赖模型共识构建伪标签的方式导致训练强化而非纠正模型自身的几何错误。我们识别出一个独特于三维空间推理的特性，能够规避这一限制：真实情况是底层几何的确定性结果，可以从点云和相机姿态中精确计算，而无需模型参与。基于这一见解，我们提出了SpatialEvo，一个针对三维空间推理的自我进化框架，核心是确定性几何环境（Deterministic Geometric Environment, DGE）。DGE在明确的几何验证规则下形式化了16个空间推理任务类别，并将未标注的三维场景转换为零噪声的交互式神谕，取代模型共识以获得客观的物理反馈。在DGE约束下，一个共享参数的策略在提问者和解答者角色之间共同进化：提问者生成基于场景观察的物理有效空间问题，而解答者则根据DGE验证的真实情况推导出精确答案。任务自适应调度器内生地将训练集中在模型最弱的类别上，生成一个动态课程而无需手动设计。跨越九个基准的实验表明，SpatialEvo在3B和7B规模下均实现了最高的平均得分，在空间推理基准上持续获得提升，并且在一般视觉理解上没有退化。
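
Illustrative sketch (not taken from the paper): the abstract's central observation is that spatial ground truth can be computed exactly from point clouds and camera poses, with no model in the loop, and that a scheduler can steer training toward the weakest task categories. The two example tasks (centroid distance, nearest object to the camera), the function names, and the error-proportional weighting with a `floor` term are all assumptions chosen here to make the idea concrete; the paper's 16 task categories and its scheduler are not specified in the abstract.

```python
import math


def centroid(points):
    """Mean of a list of (x, y, z) points belonging to one object."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def object_distance(points_a, points_b):
    """Deterministic answer to 'how far apart are objects A and B?'."""
    return math.dist(centroid(points_a), centroid(points_b))


def nearest_object(camera_position, objects):
    """Deterministic answer to 'which object is closest to the camera?'.

    objects: dict mapping an object name to its list of (x, y, z) points.
    """
    return min(
        objects,
        key=lambda name: math.dist(camera_position, centroid(objects[name])),
    )


def category_weights(accuracy, floor=0.05):
    """Hypothetical scheduler: sample categories in proportion to error rate.

    accuracy: dict mapping a task category to the solver's current accuracy.
    floor: small constant so that mastered categories are still revisited.
    """
    raw = {cat: (1.0 - acc) + floor for cat, acc in accuracy.items()}
    total = sum(raw.values())
    return {cat: weight / total for cat, weight in raw.items()}
```

Because every answer is a pure function of the geometry, a solver's output can be checked against it directly instead of against a majority vote of model samples.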


cs.CV / 101 / 2604.14147

ROSE: Retrieval-Oriented Segmentation Enhancement

ROSE：面向检索的分割增强

Tang, Song, Jie, Guangquan, Ding, Henghui, Jiang, Yu-Gang

Abstract

Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inability to incorporate up-to-date knowledge. To address this challenge, we introduce the Novel Emerging Segmentation Task (NEST), which focuses on segmenting (i) novel entities that MLLMs fail to recognize due to their absence from training data, and (ii) emerging entities that exist within the model's knowledge but demand up-to-date external information for accurate recognition. To support the study of NEST, we construct a NEST benchmark using an automated pipeline that generates news-related data samples for comprehensive evaluation. Additionally, we propose ROSE: Retrieval-Oriented Segmentation Enhancement, a plug-and-play framework designed to augment any MLLM-based segmentation model. ROSE comprises four key components. First, an Internet Retrieval-Augmented Generation module is introduced to employ user-provided multimodal inputs to retrieve real-time web information. Then, a Textual Prompt Enhancer enriches the model with up-to-date information and rich background knowledge, improving the model's perception ability for emerging entities. Furthermore, a Visual Prompt Enhancer is proposed to compensate for MLLMs' lack of exposure to novel entities by leveraging internet-sourced images. To maintain efficiency, a WebSense module is introduced to intelligently decide when to invoke retrieval mechanisms based on user input. Experimental results demonstrate that ROSE significantly boosts performance on the NEST benchmark, outperforming a strong Gemini-2.0 Flash-based retrieval baseline by 19.2 in gIoU.

Chinese Translation

基于多模态大型语言模型（MLLMs）的现有分割模型，如LISA，常常在处理新颖或新兴实体时面临困难，因为它们无法融入最新知识。为了解决这一挑战，我们引入了新颖新兴分割任务（NEST），该任务专注于分割（i）由于缺乏训练数据而未被MLLMs识别的新颖实体，以及（ii）存在于模型知识中但需要最新外部信息以实现准确识别的新兴实体。为了支持NEST的研究，我们构建了一个NEST基准，使用自动化管道生成与新闻相关的数据样本以进行全面评估。此外，我们提出了ROSE：面向检索的分割增强，这是一个即插即用的框架，旨在增强任何基于MLLM的分割模型。ROSE包含四个关键组件。首先，引入了一个互联网检索增强生成模块，利用用户提供的多模态输入检索实时网络信息。然后，文本提示增强器为模型提供最新信息和丰富的背景知识，提高模型对新兴实体的感知能力。此外，提出了一种视觉提示增强器，通过利用互联网获取的图像来弥补MLLMs在新颖实体方面的曝光不足。为了保持效率，引入了一个WebSense模块，智能决定何时根据用户输入调用检索机制。实验结果表明，ROSE在NEST基准上的性能显著提升，gIoU比强大的Gemini-2.0 Flash基础检索模型提高了19.2。


cs.CV / 102 / 2604.14148

Seedance 2.0: Advancing Video Generation for World Complexity

Seedance 2.0：推进世界复杂性的视频生成

Seedance, Team, Chen, De, Chen, Liyang, Chen, Xin, Chen, Ying, Chen, Zhuo, Chen, Zhuowei, Cheng, Feng, Cheng, Tianheng, Cheng, Yufeng, Chi, Mojie, Chi, Xuyan, Cong, Jian, Cui, Qinpeng, Ding, Fei, Dong, Qide, Du, Yujiao, Duanmu, Haojie, Fan, Junliang, Fang, Jiarui, Fang, Jing, Fang, Zetao, Feng, Chengjian, Gao, Yu, Gu, Diandian, Guo, Dong, Guo, Hanzhong, Guo, Qiushan, Hao, Boyang, Hao, Hongxiang, He, Haoxun, He, Jiaao, He, Qian, Hoang, Tuyen, Hu, Heng, Hu, Ruoqing, Hu, Yuxiang, Huang, Jiancheng, Huang, Weilin, Huang, Zhaoyang, Huang, Zhongyi, Jin, Jishuo, Jing, Ming, Kim, Ashley, Lao, Shanshan, Leng, Yichong, Li, Bingchuan, Li, Gen, Li, Haifeng, Li, Huixia, Li, Jiashi, Li, Ming, Li, Xiaojie, Li, Xingxing, Li, Yameng, Li, Yiying, Li, Yu, Li, Yueyan, Liang, Chao, Liang, Han, Liang, Jianzhong, Liang, Ying, Liao, Wang, Lien, J. H., Lin, Shanchuan, Lin, Xi, Ling, Feng, Ling, Yue, Liu, Fangfang, Liu, Jiawei, Liu, Jihao, Liu, Jingtuo, Liu, Shu, Liu, Sichao, Liu, Wei, Liu, Xue, Liu, Zuxi, Lu, Ruijie, Lyu, Lecheng, Ma, Jingting, Ma, Tianxiang, Nie, Xiaonan, Ning, Jingzhe, Pan, Junjie, Pan, Xitong, Peng, Ronggui, Qu, Xueqiong, Ren, Yuxi, Shen, Yuchen, Shi, Guang, Shi, Lei, Song, Yinglong, Sun, Fan, Sun, Li, Sun, Renfei, Tang, Wenjing, Tao, Boyang, Tao, Zirui, Wang, Dongliang, Wang, Feng, Wang, Hulin, Wang, Ke, Wang, Qingyi, Wang, Rui, Wang, Shuai, Wang, Shulei, Wang, Weichen, Wang, Xuanda, Wang, Yanhui, Wang, Yue, Wang, Yuping, Wang, Yuxuan, Wang, Zijie, Wang, Ziyu, Wei, Guoqiang, Wei, Meng, Wu, Di, Wu, Guohong, Wu, Hanjie, Wu, Huachao, Wu, Jian, Wu, Jie, Wu, Ruolan, Wu, Shaojin, Wu, Xiaohu, Wu, Xinglong, Wu, Yonghui, Xia, Ruiqi, Xia, Xin, Xiao, Xuefeng, Xu, Shuang, Yang, Bangbang, Yang, Jiaqi, Yang, Runkai, Yang, Tao, Yang, Yihang, Yang, Zhixian, Yang, Ziyan, Ye, Fulong, Yi, Bingqian, Yin, Xing, You, Yongbin, Yuan, Linxiao, Zeng, Weihong, Zeng, Xuejiao, Zeng, Yan, Zhai, Siyu, Zhai, Zhonghua, Zhang, Bowen, Zhang, Chenlin, Zhang, Heng, Zhang, Jun, Zhang, Manlin, Zhang, 
Peiyuan, Zhang, Shuo, Zhang, Xiaohe, Zhang, Xiaoying, Zhang, Xinyan, Zhang, Xinyi, Zhang, Yichi, Zhang, Zixiang, Zhao, Haiyu, Zhao, Huating, Zhao, Liming, Zhao, Yian, Zheng, Guangcong, Zheng, Jianbin, Zheng, Xiaozheng, Zheng, Zerong, Zhu, Kuan, Zuo, Feilong

Abstract

Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, and large-scale architecture for multi-modal audio-video joint generation. This allows it to support four input modalities (text, image, audio, and video) and to integrate one of the most comprehensive suites of multi-modal content reference and editing capabilities available in the industry to date. It delivers substantial, well-rounded improvements across all key sub-dimensions of video and audio generation. In both expert evaluations and public user tests, the model has demonstrated performance on par with the leading levels in the field. Seedance 2.0 supports direct generation of audio-video content with durations ranging from 4 to 15 seconds, with native output resolutions of 480p and 720p. For multi-modal reference inputs, its current open platform supports up to 3 video clips, 9 images, and 3 audio clips. In addition, we provide the Seedance 2.0 Fast version, an accelerated variant of Seedance 2.0 designed to boost generation speed for low-latency scenarios. Seedance 2.0 has delivered significant improvements to its foundational generation capabilities and multi-modal generation performance, bringing an enhanced creative experience for end users.

Chinese Translation

Seedance 2.0 是一种新的原生多模态音视频生成模型，于2026年2月初在中国正式发布。与其前身 Seedance 1.0 和 1.5 Pro 相比，Seedance 2.0 采用统一、高效的大规模架构进行多模态音视频联合生成。这使其能够支持四种输入模态：文本、图像、音频和视频，并整合了目前行业内最全面的多模态内容参考和编辑能力。它在视频和音频生成的所有关键子维度上实现了显著而全面的改进。在专家评估和公众用户测试中，该模型的表现达到了该领域的领先水平。Seedance 2.0 支持直接生成时长为4到15秒的音视频内容，原生输出分辨率为480p和720p。对于作为参考的多模态输入，其当前开放平台支持最多3个视频片段、9张图像和3个音频片段。此外，我们还提供了 Seedance 2.0 Fast 版本，这是 Seedance 2.0 的加速变体，旨在提升低延迟场景下的生成速度。Seedance 2.0 在其基础生成能力和多模态生成性能上实现了显著改进，为最终用户带来了更好的创作体验。


cs.CV / 103 / 2604.14149

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

每个高度选择性帧一个令牌：朝着极端压缩实现长视频理解

Zhang, Zheyu, Pang, Ziqi, Chen, Shixing, Hao, Xiang, Bhat, Vimal, Wang, Yu-Xiong

Abstract

Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards one token per frame at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into learnable and progressive modules for token-level compression (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate frame-level compression, which selects the frames most relevant to the queries via the internal attention scores of the LLM layers, named question-conditioned compression (QC-Comp). As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, i.e., the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention. Collectively, our combined token-level and frame-level compression leads to an extreme compression model for long video understanding, named \name, achieving a significantly larger compression ratio and enabling denser frame sampling. Our \name is finetuned from VideoChat-Flash with a data-efficient supervised compression tuning stage that only requires 2.5% of the supervised fine-tuning data, yet boosts the accuracy from 42.9% to 46.2% on LVBench and enhances multiple other long video benchmarks.

Chinese Translation

长视频理解对视觉-语言模型（VLMs）而言本质上具有挑战性，因为帧数庞大。每个视频帧通常扩展为数十或数百个令牌，而大型语言模型（LLMs）的有限上下文长度迫使VLMs稀疏感知帧，从而丧失时间信息。为了解决这个问题，我们探索了在最终LLM层实现每帧一个令牌的极端视频令牌压缩。我们的关键见解是，先前方法广泛采用的基于启发式的压缩容易导致信息丢失，这就需要将LLM层监督为可学习的和渐进的模块，以实现令牌级压缩（LP-Comp）。这种压缩使我们的VLM能够消化2到4倍更多的帧，并提高性能。为了进一步提高令牌效率，我们研究了帧级压缩，通过LLM层的内部注意力分数选择与查询最相关的帧，称为基于问题的压缩（QC-Comp）。与以往研究的显著区别在于，我们通过将长视频分割为短片段并采用局部注意力，减轻了LLM注意力在长上下文中的位置偏差，即过度集中于序列的开始和结束。总体而言，我们的令牌级和帧级结合形成了一个极端压缩模型，用于长视频理解，命名为\name，实现了显著更大的压缩比，并能够进行更密集的帧采样。我们的\name是从VideoChat-Flash微调而来，具有数据高效的监督压缩调优阶段，仅需2.5%的监督微调数据，却将LVBench的准确率从42.9%提升至46.2%，并增强了多个其他长视频基准测试的表现。
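
Illustrative sketch (not taken from the paper): the frame-level step described in the abstract selects query-relevant frames from attention scores, and counters the bias toward the start and end of long sequences by working on short segments. The function below approximates that idea by ranking frames only within their own segment, so no single region of the video can dominate the selection. It assumes per-frame relevance scores are already available; the name `select_frames` and its parameters are hypothetical, and the paper's local-attention mechanism is not reproduced here.

```python
def select_frames(scores, segment_len, keep_per_segment):
    """Pick the highest-scoring frames inside each consecutive segment.

    scores: per-frame relevance scores, e.g. the attention mass that the
        question tokens place on each frame (assumed to be precomputed).
    segment_len: number of frames per segment.
    keep_per_segment: how many frames to keep from each segment.
    Returns the kept frame indices in temporal order.
    """
    selected = []
    for start in range(0, len(scores), segment_len):
        segment = range(start, min(start + segment_len, len(scores)))
        # Rank frames against their neighbours only, not the whole video.
        top = sorted(segment, key=lambda i: scores[i], reverse=True)
        selected.extend(sorted(top[:keep_per_segment]))
    return selected
```

A global top-k over the same scores could return frames clustered at one end of the video; the per-segment version guarantees temporal coverage at the cost of occasionally keeping a weakly relevant frame.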


人工智能 (Artificial Intelligence)

cs.AI / 1 / 2604.13151

Exploration and Exploitation Errors Are Measurable for Language Model Agents

语言模型代理的探索与利用错误是可测量的

Park, Jaden, Kim, Jungtaek, Jeong, Jongwon, Nowak, Robert D., Lee, Kangwook, Lee, Yong Jae

Abstract

Language Model (LM) agents are increasingly used in complex open-ended decision-making tasks, from AI coding to physical AI. A core requirement in these settings is the ability to both explore the problem space and exploit acquired knowledge effectively. However, systematically distinguishing and quantifying exploration and exploitation from observed actions without access to the agent's internal policy remains challenging. To address this, we design controllable environments inspired by practical embodied AI scenarios. Each environment consists of a partially observable 2D grid map and an unknown task Directed Acyclic Graph (DAG). The map generation can be programmatically adjusted to emphasize exploration or exploitation difficulty. To enable policy-agnostic evaluation, we design a metric to quantify exploration and exploitation errors from an agent's actions. We evaluate a variety of frontier LM agents and find that even state-of-the-art models struggle on our task, with different models exhibiting distinct failure modes. We further observe that reasoning models solve the task more effectively and show that both exploration and exploitation can be significantly improved through minimal harness engineering. We release our code at https://github.com/jjj-madison/measurable-explore-exploit.

Chinese Translation

语言模型（LM）代理在复杂的开放式决策任务中越来越多地被使用，从人工智能编程到物理人工智能。这些环境中的核心要求是能够有效地探索问题空间并利用已获得的知识。然而，在没有访问代理内部策略的情况下，系统地区分和量化观察到的行动中的探索与利用仍然具有挑战性。为了解决这个问题，我们设计了受实际具身人工智能场景启发的可控环境。每个环境由一个部分可观察的二维网格地图和一个未知任务的有向无环图（DAG）组成。地图生成可以通过编程方式调整，以强调探索或利用的难度。为了实现与策略无关的评估，我们设计了一种度量标准，以量化代理行动中的探索与利用错误。我们评估了多种前沿的语言模型代理，发现即使是最先进的模型在我们的任务上也面临困难，不同模型表现出不同的失败模式。我们进一步观察到，推理模型更有效地解决了任务，并且通过最小的工程调整，探索和利用都可以显著改善。我们在此发布我们的代码。


cs.AI / 2 / 2604.13180

SciFi: A Safe, Lightweight, User-Friendly, and Fully Autonomous Agentic AI Workflow for Scientific Applications

SciFi：一种安全、轻量级、用户友好且完全自主的代理人工智能工作流程用于科学应用

Liu, Qibin, Gonski, Julia

Abstract

Recent advances in agentic AI have enabled increasingly autonomous workflows, but existing systems still face substantial challenges in achieving reliable deployment in real-world scientific research. In this work, we present a safe, lightweight, and user-friendly agentic framework for the autonomous execution of well-defined scientific tasks. The framework combines an isolated execution environment, a three-layer agent loop, and a self-assessing do-until mechanism to ensure safe and reliable operation while effectively leveraging large language models of varying capability levels. By focusing on structured tasks with clearly defined context and stopping criteria, the framework supports end-to-end automation with minimal human intervention, enabling researchers to offload routine workloads and devote more effort to creative activities and open-ended scientific inquiry.

Chinese Translation

近年来，代理人工智能的进展使得越来越自主的工作流程成为可能，但现有系统在实现可靠的现实世界科学研究部署方面仍面临重大挑战。在本研究中，我们提出了一种安全、轻量级且用户友好的代理框架，用于自主执行明确定义的科学任务。该框架结合了隔离的执行环境、三层代理循环和自我评估的“直到完成”机制，以确保安全可靠的操作，同时有效利用不同能力水平的大型语言模型。通过专注于具有明确上下文和停止标准的结构化任务，该框架支持端到端的自动化，最小化人类干预，使研究人员能够卸载常规工作负载，将更多精力投入到创造性活动和开放式科学探究中。
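
Illustrative sketch (not taken from the paper): the abstract mentions a self-assessing do-until mechanism with clearly defined stopping criteria. One plausible reading is a bounded loop that repeats a task step, asks an assessor whether the result meets the criterion, and feeds the assessor's feedback into the next attempt. The function name `do_until`, its signature, and the feedback-passing convention are assumptions made for this example.

```python
def do_until(step, assess, max_rounds=5):
    """Repeat `step` until `assess` accepts the result or the budget runs out.

    step: callable taking the previous feedback (None on the first round)
        and returning a result.
    assess: callable taking a result and returning (done, feedback).
    max_rounds: hard cap on attempts, so the loop always terminates.
    Returns (result, done, rounds_used).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = None
    for round_no in range(1, max_rounds + 1):
        result = step(feedback)
        done, feedback = assess(result)
        if done:
            return result, True, round_no
    return result, False, max_rounds
```

The hard cap matters as much as the stopping criterion: an assessor that never says "done" should end in a reported failure, not in an agent that runs indefinitely.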


cs.AI / 3 / 2604.13206

Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models

数值不稳定性与混沌：量化大型语言模型的不可预测性

Islam, Chashi Mahiul, Villarreal, Alan, Nishino, Mao, Salman, Shaeke, Liu, Xiuwen

Abstract

As Large Language Models (LLMs) are increasingly integrated into agentic workflows, their unpredictability stemming from numerical instability has emerged as a critical reliability issue. While recent studies have demonstrated the significant downstream effects of these instabilities, the root causes and underlying mechanisms remain poorly understood. In this paper, we present a rigorous analysis of how unpredictability is rooted in the finite numerical precision of floating-point representations, tracking how rounding errors propagate, amplify, or dissipate through Transformer computation layers. Specifically, we identify a chaotic "avalanche effect" in the early layers, where minor perturbations trigger binary outcomes: either rapid amplification or complete attenuation. Beyond specific error instances, we demonstrate that LLMs exhibit universal, scale-dependent chaotic behaviors characterized by three distinct regimes: 1) a stable regime, where perturbations fall below an input-dependent threshold and vanish, resulting in constant outputs; 2) a chaotic regime, where rounding errors dominate and drive output divergence; and 3) a signal-dominated regime, where true input variations override numerical noise. We validate these findings extensively across multiple datasets and model architectures.

Chinese Translation

随着大型语言模型（LLMs）越来越多地融入自主工作流程，其源于数值不稳定性的不可预测性已成为一个关键的可靠性问题。尽管近期研究已展示了这些不稳定性对下游影响的显著性，但其根本原因和潜在机制仍然不甚明了。本文对不可预测性如何根植于浮点表示的有限数值精度进行了严格分析，追踪了舍入误差如何在Transformer计算层中传播、放大或消散。具体而言，我们在早期层中识别出一种混沌的“雪崩效应”，其中微小扰动触发二元结果：要么快速放大，要么完全衰减。超越特定的误差实例，我们证明LLMs表现出普遍的、规模依赖的混沌行为，特征为三个不同的状态：1）稳定状态，在该状态下扰动低于输入依赖的阈值并消失，导致输出恒定；2）混沌状态，在该状态下舍入误差占主导并驱动输出发散；3）信号主导状态，在该状态下真实输入变化覆盖数值噪声。我们在多个数据集和模型架构上广泛验证了这些发现。
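
Illustrative sketch (not taken from the paper): the root cause named in the abstract, finite floating-point precision, can be seen in a three-number sum. Floating-point addition is not associative, so the same values added in a different order can differ in the last bit. In a Transformer, reduction order can change with batch size or kernel choice, which is one way such last-bit differences enter the computation; that link is general background, not a claim from the abstract.

```python
values = [0.1, 0.2, 0.3]

# Same three numbers, two groupings of the additions.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# The results differ by one unit in the last place of a 64-bit float.
order_matters = left_to_right != right_to_left
difference = abs(left_to_right - right_to_left)

print(left_to_right, right_to_left, order_matters, difference)
```

The discrepancy here is on the order of 1e-16. Whether such a perturbation vanishes or grows as it passes through successive layers is the question the paper's three regimes address.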


cs.AI / 4 / 2604.13283

Optimizing Earth Observation Satellite Schedules under Unknown Operational Constraints: An Active Constraint Acquisition Approach

在未知操作约束下优化地球观测卫星调度：一种主动约束获取方法

Belaid, Mohamed-Bachir

Abstract

Earth Observation (EO) satellite scheduling (deciding which imaging tasks to perform and when) is a well-studied combinatorial optimization problem. Existing methods typically assume that the operational constraint model is fully specified in advance. In practice, however, constraints governing separation between observations, power budgets, and thermal limits are often embedded in engineering artefacts or high-fidelity simulators rather than in explicit mathematical models. We study EO scheduling under unknown constraints: the objective is known, but feasibility must be learned interactively from a binary oracle. Working with a simplified model restricted to pairwise separation and global capacity constraints, we introduce Conservative Constraint Acquisition (CCA), a domain-specific procedure designed to identify justified constraints efficiently in practice while limiting unnecessary tightening of the learned model. Embedded in the Learn&Optimize framework, CCA supports an interactive search process that alternates optimization under a learned constraint model with targeted oracle queries. On synthetic instances with up to 50 tasks and dense constraint networks, L&O improves over a no-knowledge greedy baseline and uses far fewer main oracle queries than a two-phase acquire-then-solve baseline (FAO). For n ≤ 30, the average gap drops from 65-68% (Priority Greedy) to 17.7-35.8% using L&O. At n = 50, where the CP-SAT reference is the best feasible solution found in 120 s, L&O improves on FAO on average (17.9% vs. 20.3%) while using 21.3 main queries instead of 100 and about 5× less execution time.

Chinese Translation

地球观测（EO）卫星调度（决定执行哪些成像任务及其时间）是一个研究较为深入的组合优化问题。现有方法通常假设操作约束模型在事先已完全指定。然而，在实际应用中，控制观测之间分离、功率预算和热限制的约束往往嵌入在工程文档或高保真模拟器中，而不是以明确的数学模型形式存在。我们研究在未知约束下的EO调度：目标是已知的，但可行性必须通过与二元神谕的互动学习来获得。在一个简化模型中，我们限制在成对分离和全局容量约束下，提出了保守约束获取（Conservative Constraint Acquisition, CCA），这是一种特定领域的程序，旨在有效识别合理约束，同时限制对学习模型的不必要收紧。CCA嵌入在Learn&Optimize框架中，支持一种交互式搜索过程，该过程交替在学习的约束模型下进行优化和针对性神谕查询。在具有最多50个任务和密集约束网络的合成实例中，L&O在无知识贪婪基线之上有所改进，并且使用的主要神谕查询远少于两阶段获取后求解基线（FAO）。对于n ≤ 30，平均差距从65-68%（优先贪婪）降至17.7-35.8%，使用L&O。在n = 50的情况下，CP-SAT参考是120秒内找到的最佳可行解，L&O在平均上优于FAO（17.9%对20.3%），同时使用21.3个主要查询而不是100个，执行时间约减少5×。
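
Illustrative sketch (not taken from the paper): the abstract describes a loop that alternates optimization under a learned constraint model with targeted queries to a binary feasibility oracle. The toy version below is not CCA. It handles only pairwise conflicts (no capacity constraints), optimizes by exhaustive search, and identifies violated constraints by naively probing every pair of the rejected schedule. The function name `learn_and_optimize`, the data layout, and the query-counting convention are assumptions made for this example.

```python
from itertools import combinations


def learn_and_optimize(values, oracle):
    """Alternate optimization and oracle queries until a schedule is accepted.

    values: dict mapping a task name to its value.
    oracle: callable taking a tuple of tasks and returning True if feasible.
    Returns (schedule, learned_conflicts, total_oracle_queries).
    """
    tasks = sorted(values)
    learned = set()  # pairwise conflicts discovered so far (frozensets)
    queries = 0
    while True:
        # Optimize under the learned model (exhaustive: tiny instances only).
        candidates = (
            subset
            for size in range(len(tasks) + 1)
            for subset in combinations(tasks, size)
            if not any(frozenset(p) in learned for p in combinations(subset, 2))
        )
        best = max(candidates, key=lambda s: sum(values[t] for t in s))

        queries += 1
        if oracle(best):
            return set(best), learned, queries

        # Rejected: probe each pair and keep only conflicts the oracle confirms.
        confirmed = set()
        for pair in combinations(best, 2):
            queries += 1
            if not oracle(pair):
                confirmed.add(frozenset(pair))
        if not confirmed:
            raise ValueError("infeasibility is not explained by pairwise conflicts")
        learned |= confirmed
```

Only constraints confirmed by the oracle enter the learned model, which echoes the abstract's goal of avoiding unnecessary tightening; the probing strategy itself is where a procedure like CCA would aim to save queries.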


cs.AI / 5 / 2604.13318

WebXSkill: Skill Learning for Autonomous Web Agents

WebXSkill：用于自主网络代理的技能学习

Wang, Zhaoyang, Wu, Qianhui, Zhang, Xuchao, Zhang, Chaoyun, Yao, Wenlin, Faisal, Fazle Elahi, Peng, Baolin, Qin, Si, Nath, Suman, Lin, Qingwei, Bansal, Chetan, Zhang, Dongmei, Rajmohan, Saravan, Gao, Jianfeng, Yao, Huaxiu

Abstract

Autonomous web agents powered by large language models (LLMs) have shown promise in completing complex browser tasks, yet they still struggle with long-horizon workflows. A key bottleneck is the grounding gap in existing skill formulations: textual workflow skills provide natural language guidance but cannot be directly executed, while code-based skills are executable but opaque to the agent, offering no step-level understanding for error recovery or adaptation. We introduce WebXSkill, a framework that bridges this gap with executable skills, each pairing a parameterized action program with step-level natural language guidance, enabling both direct execution and agent-driven adaptation. WebXSkill operates in three stages: skill extraction mines reusable action subsequences from readily available synthetic agent trajectories and abstracts them into parameterized skills, skill organization indexes skills into a URL-based graph for context-aware retrieval, and skill deployment exposes two complementary modes, grounded mode for fully automated multi-step execution and guided mode where skills serve as step-by-step instructions that the agent follows with its native planning. On WebArena and WebVoyager, WebXSkill improves task success rate by up to 9.8 and 12.9 points over the baseline, respectively, demonstrating the effectiveness of executable skills for web agents. The code is publicly available at https://github.com/aiming-lab/WebXSkill.

Chinese Translation

由大型语言模型（LLMs）驱动的自主网络代理在完成复杂的浏览器任务方面展现了潜力，但在处理长时间跨度的工作流程时仍然面临挑战。一个关键瓶颈是现有技能表述中的基础差距：文本工作流程技能提供自然语言指导，但无法直接执行，而基于代码的技能虽然可执行，但对代理而言却不透明，无法提供逐步理解以便于错误恢复或适应。我们提出了WebXSkill，这是一个通过可执行技能弥合这一差距的框架，每个技能将参数化的动作程序与逐步的自然语言指导相结合，既支持直接执行，又支持代理驱动的适应。WebXSkill的操作分为三个阶段：技能提取从现成的合成代理轨迹中挖掘可重用的动作子序列，并将其抽象为参数化技能；技能组织将技能索引到基于URL的图中，以便于上下文感知的检索；技能部署则提供两种互补模式：基础模式用于完全自动化的多步骤执行，指导模式则将技能作为逐步指令，代理根据其本地规划进行跟随。在WebArena和WebVoyager上，WebXSkill分别将任务成功率提高了最多9.8和12.9个百分点，证明了可执行技能对网络代理的有效性。代码可在https://github.com/aiming-lab/WebXSkill公开获取。
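
Illustrative sketch (not taken from the paper or its repository): the abstract defines an executable skill as a parameterized action program paired with step-level natural language guidance, exposed in a grounded mode (direct execution) and a guided mode (instructions for the agent). The dataclasses below show one way such a pairing could be represented. The class names, field names, and the string-template action format are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str    # parameterized action template, e.g. "click('#submit')"
    guidance: str  # natural-language description of the same step


@dataclass
class Skill:
    name: str
    url_pattern: str  # hypothetical key for URL-based retrieval
    steps: list

    def grounded(self, **params):
        """Grounded mode: fill in parameters and return concrete actions."""
        return [step.action.format(**params) for step in self.steps]

    def guided(self, **params):
        """Guided mode: return numbered instructions for the agent to follow."""
        return [
            f"{i}. {step.guidance.format(**params)}"
            for i, step in enumerate(self.steps, 1)
        ]


search = Skill(
    name="search_product",
    url_pattern="/catalog",
    steps=[
        Step("type('#search', '{query}')", "Type '{query}' into the search box."),
        Step("click('#submit')", "Press the search button."),
    ],
)
```

Keeping both views on the same step list means that when a grounded action fails partway, the agent can read the guidance for the failing step and continue from there with its own planning.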

cs.AI / 6 / 2604.13348

Listening Alone, Understanding Together: Collaborative Context Recovery for Privacy-Aware AI

独自倾听，共同理解：面向隐私的协作上下文恢复

Srivastava, Tanmay, Basu, Amartya, Jain, Shubham, Ranganathan, Vaishnavi

Abstract

We introduce CONCORD, a privacy-aware asynchronous assistant-to-assistant (A2A) framework that leverages collaboration between proactive speech-based AI assistants. As agents evolve from reactive to always-listening assistants, they face a core privacy risk, capturing the speech of non-consenting speakers, which makes their social deployment a challenge. To overcome this, we implement CONCORD, which enforces owner-only speech capture via real-time speaker verification, producing a one-sided transcript that lacks some context but preserves privacy. We demonstrate that CONCORD can safely recover the necessary context through (1) spatio-temporal context resolution, (2) information gap detection, and (3) minimal A2A queries governed by relationship-aware disclosure. Instead of relying on hallucination-prone inference, CONCORD treats context recovery as a negotiated safe exchange between assistants. Across a multi-domain dialogue dataset, CONCORD achieves 91.4% recall in gap detection, 96% relationship classification accuracy, and a 97% true negative rate in privacy-sensitive disclosure decisions. By reframing always-listening AI as a coordination problem between privacy-preserving agents, CONCORD offers a practical path toward socially deployable proactive conversational agents.

Chinese Translation

我们提出了CONCORD，一个面向隐私的异步助手间（A2A）框架，利用主动语音AI之间的协作。随着代理从反应式助手演变为始终倾听的助手，它们面临着一个核心隐私风险（捕捉未同意发言者的语音），这使得它们的社会部署面临挑战。为了解决这个问题，我们实现了CONCORD，通过实时说话者验证强制执行仅所有者的语音捕捉，生成一份单方面的转录，虽然缺失了一些上下文，但保护了隐私。我们展示了CONCORD可以安全地通过（1）时空上下文解析，（2）信息缺口检测，以及（3）由关系感知披露主导的最小A2A查询来恢复必要的上下文。与其依赖容易产生幻觉的推断，CONCORD将上下文恢复视为助手之间的安全协商交换。在一个多领域对话数据集中，CONCORD在缺口检测中实现了91.4%的召回率，96%的关系分类准确率，以及在隐私敏感披露决策中97%的真阴性率。通过将始终倾听的AI重新框架为隐私保护代理之间的协调问题，CONCORD为社会可部署的主动对话代理提供了一条切实可行的路径。

cs.AI / 7 / 2604.13392

ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold

ReSS：通过符号支架学习表格数据预测的推理模型

Yi, Chenlang, Li, Gang, Xiong, Zizhan, Cao, Tue Minh, Gong, Yanmin, Thai, My T., Yang, Tianbao

Abstract

Tabular data remains prevalent in high-stakes domains such as healthcare and finance, where predictive models are expected to provide both high accuracy and faithful, human-understandable reasoning. While symbolic models offer verifiable logic, they lack semantic expressiveness. Meanwhile, general-purpose LLMs often require specialized fine-tuning to master domain-specific tabular reasoning. To address the dual challenges of scalable data curation and reasoning consistency, we propose ReSS, a systematic framework that bridges symbolic and neural reasoning models. ReSS leverages a decision-tree model to extract instance-level decision paths as symbolic scaffolds. These scaffolds, alongside input features and labels, guide an LLM to generate grounded natural-language reasoning that strictly adheres to the underlying decision logic. The resulting high-quality dataset is used to fine-tune a pretrained LLM into a specialized tabular reasoning model, further enhanced by a scaffold-invariant data augmentation strategy to improve generalization and explainability. To rigorously assess faithfulness, we introduce quantitative metrics including hallucination rate, explanation necessity, and explanation sufficiency. Experimental results on medical and financial benchmarks demonstrate that ReSS-trained models improve on traditional decision trees and standard fine-tuning approaches by up to 10% while producing faithful and consistent reasoning.
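As an illustration of what an instance-level decision path used as a symbolic scaffold could look like, the sketch below walks a tiny hand-written tree and turns the path into prompt text. The tree, the feature names, the thresholds, and the prompt wording are all invented for this example; the abstract does not specify ReSS's actual tree model or prompt format.

```python
def decision_path(node, features):
    """Follow one instance down the tree, recording each split condition it satisfies."""
    path = []
    while "label" not in node:
        name, threshold = node["feature"], node["threshold"]
        if features[name] <= threshold:
            path.append(f"{name} <= {threshold}")
            node = node["left"]
        else:
            path.append(f"{name} > {threshold}")
            node = node["right"]
    return path, node["label"]


def scaffold_prompt(features, path, label):
    """Assemble a hypothetical prompt asking an LLM to explain along the given path."""
    conditions = "; ".join(path)
    return (f"Features: {features}\n"
            f"Decision path: {conditions}\n"
            f"Label: {label}\n"
            "Explain the label using only the conditions on the decision path.")


# Hypothetical toy tree: internal nodes split on one feature, leaves carry a label.
tree = {
    "feature": "age", "threshold": 50,
    "left": {"label": "low risk"},
    "right": {
        "feature": "systolic_bp", "threshold": 140,
        "left": {"label": "medium risk"},
        "right": {"label": "high risk"},
    },
}

patient = {"age": 63, "systolic_bp": 150}
path, label = decision_path(tree, patient)
print(scaffold_prompt(patient, path, label))
```

Because the path is extracted rather than generated, any statement in the resulting explanation can be checked against a finite list of conditions.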

Chinese Translation

表格数据在医疗和金融等高风险领域仍然广泛存在，在这些领域中，预测模型被期望提供高准确性以及可信、易于人类理解的推理。尽管符号模型提供了可验证的逻辑，但它们缺乏语义表达能力。同时，通用的大型语言模型（LLMs）通常需要专门的微调才能掌握特定领域的表格推理。为了解决可扩展的数据整理和推理一致性的双重挑战，我们提出了ReSS，这是一个系统框架，连接了符号和神经推理模型。ReSS利用决策树模型提取实例级决策路径作为符号支架。这些支架与输入特征和标签一起，引导LLM生成基于自然语言的推理，严格遵循基础决策逻辑。生成的高质量数据集用于微调预训练的LLM，转变为专门的表格推理模型，并通过一种不变的支架数据增强策略进一步提高泛化能力和可解释性。为了严格评估可信度，我们引入了包括幻觉率、解释必要性和解释充分性在内的定量指标。在医疗和金融基准上的实验结果表明，经过ReSS训练的模型在传统决策树和标准微调方法上提高了多达10%，同时产生了可信且一致的推理。

cs.AI / 8 / 2604.13395

Quantifying and Understanding Uncertainty in Large Reasoning Models

量化与理解大型推理模型中的不确定性

Li, Yangyi, Zhao, Chenxu, Huai, Mengdi

Abstract

Large Reasoning Models (LRMs) have recently demonstrated significant improvements in complex reasoning. While quantifying generation uncertainty in LRMs is crucial, traditional methods are often insufficient because they do not provide finite-sample guarantees for reasoning-answer generation. Conformal prediction (CP) stands out as a distribution-free and model-agnostic methodology that constructs statistically rigorous uncertainty sets. However, existing CP methods ignore the logical connection between the reasoning trace and the final answer. Additionally, prior studies fail to interpret the origins of uncertainty coverage for LRMs as they typically overlook the specific training factors driving valid reasoning. Notably, it is challenging to disentangle reasoning quality from answer correctness when quantifying uncertainty, while simultaneously establishing theoretical guarantees for computationally efficient explanation methods. To address these challenges, we first propose a novel methodology that quantifies uncertainty in the reasoning-answer structure with statistical guarantees. Subsequently, we develop a unified example-to-step explanation framework using Shapley values that identifies a provably sufficient subset of training examples and their key reasoning steps to preserve the guarantees. We also provide theoretical analyses of our proposed methods. Extensive experiments on challenging reasoning datasets verify the effectiveness of the proposed methods.
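For readers unfamiliar with the finite-sample guarantee that conformal prediction provides, the following is a minimal sketch of standard split conformal prediction, not the reasoning-aware method the abstract proposes. The nonconformity scores and candidate answers are made-up numbers, and the function names are chosen only for this example.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the split-conformal threshold for miscoverage level alpha.

    Picks the ceil((n + 1) * (1 - alpha))-th smallest calibration score, which
    yields marginal coverage of at least 1 - alpha when the calibration and
    test points are exchangeable.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: only the trivial set is valid.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score does not exceed the threshold."""
    return {answer for answer, score in candidate_scores.items() if score <= threshold}


# Hypothetical nonconformity scores, e.g. 1 minus the confidence in the true answer.
calibration = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.40, 0.55, 0.70]
threshold = conformal_threshold(calibration, alpha=0.25)

candidates = {"A": 0.08, "B": 0.50, "C": 0.90}
print(threshold, sorted(prediction_set(candidates, threshold)))
```

The guarantee depends only on the rank computation, which is why the procedure is distribution-free and model-agnostic; how the score reflects the reasoning trace is the design question the abstract raises.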

Chinese Translation

大型推理模型（LRMs）最近在复杂推理方面表现出了显著的进步。尽管量化LRMs中的生成不确定性至关重要，但传统方法往往不足，因为它们未能为推理答案生成提供有限样本保证。符合预测（Conformal Prediction, CP）作为一种无分布和模型无关的方法，构建了统计上严谨的不确定性集合。然而，现有的CP方法忽视了推理轨迹与最终答案之间的逻辑联系。此外，之前的研究未能解释LRMs不确定性覆盖的来源，因为它们通常忽略了驱动有效推理的具体训练因素。值得注意的是，在量化不确定性时，难以将推理质量与答案正确性区分开，同时为计算高效的解释方法建立理论保证。为了解决这些挑战，我们首先提出了一种新方法，量化推理-答案结构中的不确定性，并提供统计保证。随后，我们开发了一个统一的示例到步骤的解释框架，利用Shapley值识别一个可证明充分的训练示例子集及其关键推理步骤，以保持这些保证。我们还对所提方法进行了理论分析。在具有挑战性的推理数据集上进行的大量实验验证了所提方法的有效性。

cs.AI / 9 / 2604.13488

Towards Scalable Lightweight GUI Agents via Multi-role Orchestration

通过多角色编排实现可扩展的轻量级图形用户界面代理

Wang, Ziwei, Zheng, Junjie, Yang, Leyang, Zhou, Sheng, Tang, Xiaoxuan, Fang, Zhouhua, Liu, Zhiwei, Chen, Dajun, Li, Yong, Bu, Jiajun

Abstract

Autonomous Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) enable digital automation on end-user devices. While scaling both parameters and data has yielded substantial gains, advanced methods still suffer from prohibitive deployment costs on resource-constrained devices. When facing complex in-the-wild scenarios, lightweight GUI agents are bottlenecked by limited capacity and poor task scalability under end-to-end episodic learning, impeding adaptation to multi-agent systems (MAS), while training multiple skill-specific experts remains costly. Can we strike an effective trade-off in this cost-scalability dilemma, enabling lightweight MLLMs to participate in realistic GUI workflows? To address these challenges, we propose the LAMO framework, which endows a lightweight MLLM with GUI-specific knowledge and task scalability, allowing multi-role orchestration to expand its capability boundary for GUI automation. LAMO combines role-oriented data synthesis with a two-stage training recipe: (i) supervised fine-tuning with Perplexity-Weighted Cross-Entropy optimization for knowledge distillation and visual perception enhancement, and (ii) reinforcement learning for role-oriented cooperative exploration. With LAMO, we develop a task-scalable native GUI agent, LAMO-3B, supporting monolithic execution and MAS-style orchestration. When paired with advanced planners as a plug-and-play policy executor, LAMO-3B can continuously benefit from planner advances, enabling a higher performance ceiling. Extensive static and online evaluations validate the effectiveness of our design.

Chinese Translation

由多模态大型语言模型（Multimodal Large Language Models, MLLMs）驱动的自主图形用户界面（GUI）代理使得终端用户设备上的数字自动化成为可能。尽管在参数和数据的扩展上取得了显著进展，但先进的方法在资源受限设备上的部署成本仍然高昂。在复杂的实际场景中，轻量级GUI代理因其有限的能力和在端到端情境学习下的任务可扩展性差而成为瓶颈，阻碍了其适应多代理系统（Multi-Agent Systems, MAS），而训练多个特定技能的专家仍然成本高昂。我们能否在这一成本与可扩展性困境中找到有效的平衡，使轻量级MLLM能够参与现实的GUI工作流程？为了解决这些挑战，我们提出了LAMO框架，该框架赋予轻量级MLLM特定于GUI的知识和任务可扩展性，允许多角色编排以扩展其在GUI自动化中的能力边界。LAMO结合了面向角色的数据合成与两阶段训练方案：（i）通过困惑度加权交叉熵优化进行监督微调，以实现知识蒸馏和视觉感知增强；（ii）通过强化学习进行面向角色的协作探索。通过LAMO，我们开发了一种任务可扩展的原生GUI代理LAMO-3B，支持单体执行和MAS风格的编排。当与先进的规划器配对作为即插即用的策略执行器时，LAMO-3B能够持续受益于规划器的进步，从而实现更高的性能上限。大量静态和在线评估验证了我们设计的有效性。

cs.AI / 10 / 2604.13531

RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

RiskWebWorld：电子商务风险管理中图形用户界面代理的真实互动基准

Chen, Renqi, Tao, Zeyin, Guo, Jianming, Wang, Jing, Xu, Zezhou, Zhu, Jingzhe, Sun, Qingqing, Zhang, Tianyi, Chen, Shuai

Abstract

Graphical User Interface (GUI) agents show strong capabilities for automating web tasks, but existing interactive benchmarks primarily target benign, predictable consumer environments. Their effectiveness in high-stakes, investigative domains such as authentic e-commerce risk management remains underexplored. To bridge this gap, we present RiskWebWorld, the first highly realistic interactive benchmark for evaluating GUI agents in e-commerce risk management. RiskWebWorld features 1,513 tasks sourced from production risk-control pipelines across 8 core domains, and captures the authentic challenges of risk operations on uncooperative websites, including partial environmental hijacking. To support scalable evaluation and agentic reinforcement learning (RL), we further build a Gymnasium-compliant infrastructure that decouples policy planning from environment mechanics. Our evaluation across diverse models reveals a dramatic capability gap: top-tier generalist models achieve 49.1% success, while specialized open-weights GUI models fail almost completely. This highlights that foundation model scale currently matters more than zero-shot interface grounding in long-horizon professional tasks. We also demonstrate the viability of our infrastructure through agentic RL, which improves open-source models by 16.2%. These results position RiskWebWorld as a practical testbed for developing robust digital workers.

Chinese Translation

图形用户界面（GUI）代理在自动化网络任务方面展现出强大的能力，但现有的互动基准主要针对良性、可预测的消费环境。在高风险、调查性领域，如真实的电子商务风险管理中，它们的有效性仍然未得到充分探索。为填补这一空白，我们提出了RiskWebWorld，这是第一个高度真实的互动基准，用于评估电子商务风险管理中的GUI代理。RiskWebWorld包含来自8个核心领域的1,513个任务，源自生产风险控制流程，捕捉了在不合作网站上进行风险操作的真实挑战，包括部分环境劫持。为了支持可扩展的评估和代理强化学习（RL），我们进一步构建了一个符合Gymnasium标准的基础设施，将策略规划与环境机制解耦。我们在多种模型上的评估揭示了显著的能力差距：顶级通用模型的成功率为49.1%，而专门的开放权重GUI模型几乎完全失败。这突显了基础模型规模在长期专业任务中目前比零-shot接口基础更为重要。我们还通过代理强化学习展示了我们基础设施的可行性，提升了开源模型的性能16.2%。这些结果使RiskWebWorld成为开发强大数字工作者的实用测试平台。

cs.AI / 11 / 2604.13694

Weight Patching: Toward Source-Level Mechanistic Localization in LLMs

权重修补：朝着大语言模型源级机制定位的方向

Sun, Chenghao, Zhang, Chengsheng, Qin, Guanzheng, Dai, Rui, Tian, Xinmei

Abstract

Mechanistic interpretability seeks to localize model behavior to the internal components that causally realize it. Prior work has advanced activation-space localization and causal tracing, but modules that appear important in activation space may merely aggregate or amplify upstream signals rather than encode the target capability in their own parameters. To address this gap, we propose Weight Patching, a parameter-space intervention method for source-oriented analysis in paired same-architecture models that differ in how strongly they express a target capability under the inputs of interest. Given a base model and a behavior-specialized counterpart, Weight Patching replaces selected module weights from the specialized model into the base model under a fixed input. We instantiate the method on instruction following and introduce a framework centered on a vector-anchor behavioral interface that provides a shared internal criterion for whether a task-relevant control state has been formed or recovered in open-ended generation. Under this framework, the analysis reveals a hierarchy from shallow candidate source-side carriers to aggregation and routing modules, and further to downstream execution circuits. The recovered component scores can also guide mechanism-aware model merging, improving selective fusion across the evaluated expert combinations and providing additional external validation.

Chinese Translation

机制可解释性旨在将模型行为定位到因果实现该行为的内部组件。先前的研究推动了激活空间的定位和因果追踪，但在激活空间中看似重要的模块可能仅仅是聚合或放大上游信号，而并非在其自身参数中编码目标能力。为了解决这一问题，我们提出了权重修补（Weight Patching），这是一种针对成对同构模型的参数空间干预方法，这些模型在特定输入下对目标能力的表达强度不同。给定一个基础模型和一个行为专用的对应模型，权重修补在固定输入下将专用模型中选定模块的权重替换到基础模型中。我们在指令跟随任务上实例化该方法，并引入一个以向量锚定行为接口为中心的框架，该框架提供了一个共享的内部标准，以判断在开放式生成中是否形成或恢复了与任务相关的控制状态。在该框架下，分析揭示了从浅层候选源侧载体到聚合和路由模块，再到下游执行电路的层次结构。恢复的组件得分还可以指导机制感知的模型合并，改善在评估的专家组合中的选择性融合，并提供额外的外部验证。

cs.AI / 12 / 2604.13757

Rethinking AI Hardware: A Three-Layer Cognitive Architecture for Autonomous Agents

重新思考人工智能硬件：一种用于自主智能体的三层认知架构

Chen, Li

Abstract

The next generation of autonomous AI systems will be constrained not only by model capability, but by how intelligence is structured across heterogeneous hardware. Current paradigms -- cloud-centric AI, on-device inference, and edge-cloud pipelines -- treat planning, reasoning, and execution as a monolithic process, leading to unnecessary latency, energy consumption, and fragmented behavioral continuity. We introduce the Tri-Spirit Architecture, a three-layer cognitive framework that decomposes intelligence into planning (Super Layer), reasoning (Agent Layer), and execution (Reflex Layer), each mapped to distinct compute substrates and coordinated via an asynchronous message bus. We formalize the system with a parameterized routing policy, a habit-compilation mechanism that promotes repeated reasoning paths into zero-inference execution policies, a convergent memory model, and explicit safety constraints. We evaluate the architecture in a reproducible simulation of 2000 synthetic tasks against cloud-centric and edge-only baselines. Tri-Spirit reduces mean task latency by 75.6 percent and energy consumption by 71.1 percent, while decreasing LLM invocations by 30 percent and enabling 77.6 percent offline task completion. These results suggest that cognitive decomposition, rather than model scaling alone, is a primary driver of system-level efficiency in AI hardware.
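One way to picture the habit-compilation mechanism is as a cache that promotes a reasoning result to a lookup once it has recurred often enough. The sketch below is only an assumption about how such a mechanism might be organized; the class name, the promotion threshold, and the toy reasoner are hypothetical and are not taken from the paper.

```python
class HabitCompiler:
    """Promote repeatedly produced plans into a reflex table that needs no inference."""

    def __init__(self, reasoner, promote_after=3):
        self.reasoner = reasoner            # expensive call, e.g. an LLM (assumed)
        self.promote_after = promote_after  # hypothetical promotion threshold
        self.counts = {}                    # (task_key, plan) -> times observed
        self.reflexes = {}                  # task_key -> compiled plan
        self.reasoner_calls = 0

    def act(self, task_key):
        # Reflex path: a compiled habit answers without invoking the reasoner.
        if task_key in self.reflexes:
            return self.reflexes[task_key]

        # Reasoning path: invoke the reasoner and record what it produced.
        plan = self.reasoner(task_key)
        self.reasoner_calls += 1
        seen = self.counts.get((task_key, plan), 0) + 1
        self.counts[(task_key, plan)] = seen

        # Promote the plan once the same result has recurred often enough.
        if seen >= self.promote_after:
            self.reflexes[task_key] = plan
        return plan


def toy_reasoner(task_key):
    """Stand-in for a slow reasoning step; returns a hashable plan."""
    return ("switch", task_key)


compiler = HabitCompiler(toy_reasoner, promote_after=3)
plans = [compiler.act("lights_on") for _ in range(5)]
print(compiler.reasoner_calls, plans[-1])
```

In this toy run, five identical requests trigger only three reasoner calls, which is the kind of reduction in model invocations the abstract attributes to the architecture.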

Chinese Translation

下一代自主人工智能系统不仅将受到模型能力的限制，还将受到智能在异构硬件上结构化方式的影响。目前的范式——以云为中心的人工智能、设备端推理和边缘云管道——将规划、推理和执行视为一个整体过程，这导致了不必要的延迟、能耗和行为连续性的碎片化。我们提出了三灵架构（Tri-Spirit Architecture），这是一种三层认知框架，将智能分解为规划（超级层，Super Layer）、推理（智能体层，Agent Layer）和执行（反射层，Reflex Layer），每一层都映射到不同的计算基底，并通过异步消息总线进行协调。我们用参数化路由策略、促进重复推理路径转化为零推理执行策略的习惯编译机制、收敛记忆模型和明确的安全约束对系统进行了形式化。我们在可重复的2000个合成任务的模拟中评估了该架构，并与以云为中心和仅边缘的基线进行了对比。三灵架构将平均任务延迟减少了75.6%，能耗降低了71.1%，同时减少了30%的大型语言模型（LLM）调用，并实现了77.6%的离线任务完成率。这些结果表明，认知分解而非单纯的模型扩展是人工智能硬件系统级效率的主要驱动因素。

cs.AI / 13 / 2604.13759

The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents

认知伴侣：一种轻量级并行监控架构，用于检测和恢复大型语言模型代理的推理退化

Khan, Rafflesia, Khan, Nafiul Islam

Abstract

Large language model (LLM) agents on multi-step tasks suffer reasoning degradation (looping, drift, and stuck states) at rates of up to 30% on hard tasks. Current solutions are hard step limits, which end runs abruptly, or LLM-as-judge monitoring, which adds 10-15% overhead per step. This paper introduces the Cognitive Companion, a parallel monitoring architecture with two implementations: an LLM-based Companion and a novel zero-overhead Probe-based Companion. We report a three-batch feasibility study centered on Gemma 4 E4B, with an additional exploratory small-model analysis on Qwen 2.5 1.5B and Llama 3.2 1B. In our experiments, the LLM-based Companion reduced repetition on loop-prone tasks by 52-62% with approximately 11% overhead. The Probe-based Companion, trained on hidden states from layer 28, showed a mean effect size of +0.471 at zero measured inference overhead; its strongest probe result achieved cross-validated AUROC 0.840 on a small proxy-labeled dataset. A key empirical finding is that companion benefit appears task-type dependent: companions are most helpful on loop-prone and open-ended tasks, while effects are neutral or negative on more structured tasks. Our small-model experiments also suggest a possible scale boundary: companions did not improve the measured quality proxy on 1B-1.5B models, even when interventions fired. Overall, the paper should be read as a feasibility study rather than a definitive validation. The results provide encouraging evidence that sub-token monitoring may be useful, identify task-type sensitivity as a practical design constraint, and motivate selective companion activation as a promising direction for future work.

Chinese Translation

在多步骤任务中，大型语言模型（LLM）代理面临推理退化、循环、漂移和卡住状态等问题，尤其在困难任务中，发生率高达30%。当前的解决方案包括硬性步骤限制（突发性）或将LLM作为评判者的监控（每步10-15%的开销）。本文介绍了认知伴侣（Cognitive Companion），这是一种具有两种实现方式的并行监控架构：基于LLM的伴侣和一种新颖的零开销探测器（Probe-based Companion）。我们报告了一项以Gemma 4 E4B为中心的三批可行性研究，并对Qwen 2.5 1.5B和Llama 3.2 1B进行了额外的探索性小模型分析。在我们的实验中，基于LLM的伴侣在易循环任务上的重复率降低了52-62%，开销约为11%。基于探测器的伴侣在第28层的隐藏状态上训练，显示出零测量推理开销下的平均效应值为+0.471；其最强探测结果在一个小型代理标记数据集上实现了交叉验证的AUROC 0.840。一个关键的实证发现是，伴侣的效益似乎依赖于任务类型：在易循环和开放式任务中，伴侣最为有用，而在更结构化的任务中效果则中性或负面。我们的少量模型实验还表明可能存在规模边界：即使干预被触发，伴侣也未能改善1B-1.5B模型的测量质量代理。总体而言，本文应被视为一项可行性研究，而非决定性验证。结果提供了令人鼓舞的证据，表明子标记监控可能是有用的，识别任务类型敏感性作为实际设计约束，并激励选择性伴侣激活作为未来工作的一个有前景的方向。

cs.AI / 14 / 2604.13812

AlphaCNOT: Learning CNOT Minimization with Model-Based Planning

AlphaCNOT：基于模型规划的CNOT最小化学习

Cossio, Jacopo, Bosco, Daniele Lizzio, Romanello, Riccardo, Serra, Giuseppe, Piazza, Carla

Abstract

Quantum circuit optimization is a central task in quantum computing, as current Noisy Intermediate-Scale Quantum devices suffer from error propagation that often scales with the number of operations. Among quantum operations, the CNOT gate is of fundamental importance, being the only 2-qubit gate in the universal Clifford+T set. The problem of CNOT gate minimization has been addressed by heuristic algorithms such as the well-known Patel-Markov-Hayes (PMH) algorithm for linear reversible synthesis (i.e., CNOT minimization with no topological constraints), and more recently by Reinforcement Learning (RL) based strategies in the more complex case of topology-aware synthesis, where each CNOT can act only on a subset of all qubit pairs. In this work we introduce AlphaCNOT, an RL framework based on Monte Carlo Tree Search (MCTS) that effectively addresses the CNOT minimization problem by modeling it as a planning problem. In contrast to other RL-based solutions, our method is model-based, i.e., it can leverage lookahead search to evaluate future trajectories, thus finding more efficient sequences of CNOTs. Our method achieves a reduction of up to 32% in CNOT gate count compared to the PMH baseline on linear reversible synthesis, while in the constrained version we report a consistent gate count reduction on a variety of topologies with up to 8 qubits, with respect to state-of-the-art RL-based solutions. Our results suggest that the combination of RL with search-based strategies can be applied to different circuit optimization tasks, such as Clifford minimization, thus fostering the transition toward the "quantum utility" era.
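The linear reversible synthesis setting can be made concrete with a small sketch. A CNOT-only circuit on n qubits corresponds to an invertible n x n matrix over GF(2), and a single CNOT acts as one row addition (XOR). The code below uses plain Gauss-Jordan elimination to find a sequence of such row additions; it is a naive baseline for illustration, not the PMH algorithm and not AlphaCNOT, and the example matrix is arbitrary.

```python
def cnot_synthesis(matrix):
    """Reduce an invertible GF(2) matrix to the identity using row additions only.

    Each recorded pair (c, t) means rows[t] ^= rows[c], i.e. one CNOT with
    control c and target t. Returns the pairs in the order they were applied.
    """
    rows = [list(row) for row in matrix]
    n = len(rows)
    ops = []

    def add(c, t):
        rows[t] = [a ^ b for a, b in zip(rows[t], rows[c])]
        ops.append((c, t))

    for col in range(n):
        if rows[col][col] == 0:
            # Borrow a 1 from a lower row instead of swapping rows.
            pivot = next((r for r in range(col + 1, n) if rows[r][col]), None)
            if pivot is None:
                raise ValueError("matrix is singular over GF(2)")
            add(pivot, col)
        for r in range(n):
            if r != col and rows[r][col]:
                add(col, r)
    return ops


def apply_cnots(n, ops):
    """Replay row additions on the identity matrix and return the result."""
    rows = [[int(i == j) for j in range(n)] for i in range(n)]
    for c, t in ops:
        rows[t] = [a ^ b for a, b in zip(rows[t], rows[c])]
    return rows


target = [[1, 1, 0],
          [0, 1, 1],
          [1, 1, 1]]
ops = cnot_synthesis(target)
print(len(ops), ops)
```

Since every row addition is its own inverse, replaying the recorded operations in reverse order on the identity rebuilds the target matrix. The length of `ops` is the gate count that learned or heuristic methods try to shrink.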

Chinese Translation

量子电路优化是量子计算中的一项核心任务，因为当前的噪声中间规模量子设备受到错误传播的影响，这种影响通常随着操作数量的增加而加剧。在量子操作中，CNOT门具有基础性的重要性，它是通用Clifford+T集合中唯一的2量子比特门。CNOT门最小化问题已经通过启发式算法得到解决，例如著名的Patel-Markov-Hayes (PMH)算法用于线性可逆合成（即在没有拓扑约束的情况下进行CNOT最小化），最近也通过基于强化学习（RL）的策略在更复杂的拓扑感知合成中得到了探讨，在这种情况下，每个CNOT可以作用于所有量子比特对的子集。在本研究中，我们提出了AlphaCNOT，一个基于蒙特卡洛树搜索（MCTS）的RL框架，有效地将CNOT最小化问题建模为规划问题。与其他基于RL的解决方案相比，我们的方法是基于模型的，即它可以利用前瞻搜索来评估未来的轨迹，从而找到更高效的CNOT序列。与PMH基线相比，我们的方法在线性可逆合成中实现了CNOT门数量最多减少32%的效果，而在约束版本中，我们在多种拓扑结构（最多8个量子比特）上报告了一致的门数量减少，相较于最先进的基于RL的解决方案。我们的结果表明，将RL与基于搜索的策略结合可以应用于不同的电路优化任务，例如Clifford最小化，从而促进向“量子实用性”时代的过渡。

cs.AI / 15 / 2604.13888

GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis

GeoAgentBench：一种用于工具增强代理在空间分析中的动态执行基准

Yu, Bo, Yang, Cheng, Hou, Dongyang, Liu, Chengfu, Liu, Jiayao, Wang, Chi, Zhang, Zhiming, Li, Haifeng, Yang, Wentao

Abstract

The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and interactive evaluation benchmark tailored for tool-augmented GIS agents. GABench provides a realistic execution sandbox integrating 117 atomic GIS tools, encompassing 53 typical spatial analysis tasks across 6 core GIS domains. Recognizing that precise parameter configuration is the primary determinant of execution success in dynamic GIS environments, we designed the Parameter Execution Accuracy (PEA) metric, which utilizes a "Last-Attempt Alignment" strategy to quantify the fidelity of implicit parameter inference. Complementing this, a Vision-Language Model (VLM) based verification is proposed to assess data-spatial accuracy and cartographic style adherence. Furthermore, to address the frequent task failures caused by parameter misalignments and runtime anomalies, we developed a novel agent architecture, Plan-and-React, that mimics expert cognitive workflows by decoupling global orchestration from step-wise reactive execution. Extensive experiments with seven representative LLMs demonstrate that the Plan-and-React paradigm significantly outperforms traditional frameworks, achieving the optimal balance between logical rigor and execution robustness, particularly in multi-step reasoning and error recovery. Our findings highlight current capability boundaries and establish a robust standard for assessing and advancing the next generation of autonomous GeoAI.

Chinese Translation

将大型语言模型（LLMs）整合到地理信息系统（GIS）中标志着向自主空间分析的范式转变。然而，由于地理空间工作流的复杂性和多步骤特性，评估这些基于LLM的代理仍然具有挑战性。现有基准主要依赖于静态文本或代码匹配，忽视了动态运行时反馈和空间输出的多模态特性。为了解决这一问题，我们推出了GeoAgentBench（GABench），这是一个为工具增强的GIS代理量身定制的动态和互动评估基准。GABench提供了一个现实的执行沙箱，整合了117个原子GIS工具，涵盖了6个核心GIS领域中的53个典型空间分析任务。我们认识到，精确的参数配置是动态GIS环境中执行成功的主要决定因素，因此我们设计了参数执行准确性（PEA）指标，该指标利用“最后尝试对齐”策略来量化隐式参数推断的准确性。此外，我们提出了一种基于视觉-语言模型（VLM）的验证方法，以评估数据空间准确性和制图风格的一致性。此外，为了解决由于参数不对齐和运行时异常导致的频繁任务失败，我们开发了一种新颖的代理架构——计划与反应（Plan-and-React），该架构通过将全局协调与逐步反应执行解耦，模拟专家的认知工作流程。对七个代表性LLM的广泛实验表明，计划与反应范式显著优于传统框架，在逻辑严谨性和执行稳健性之间实现了最佳平衡，尤其是在多步骤推理和错误恢复方面。我们的研究结果突显了当前能力的边界，并为评估和推动下一代自主GeoAI建立了一个稳健的标准。

cs.AI / 16 / 2604.13940

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

大规模AI辅助同行评审：AAAI-26 AI评审试点

Biswas, Joydeep, Schoepp, Sheila, Vasan, Gautham, Opipari, Anthony, Zhang, Arthur, Hu, Zichao, Joseph, Sebastian, Lease, Matthew, Li, Junyi Jessy, Stone, Peter, Wagstaff, Kiri L., Taylor, Matthew E., Jenkins, Odest Chadwicke

Abstract

Scientific peer review faces mounting strain as submission volumes surge, making it increasingly difficult to sustain review quality, consistency, and timeliness. Recent advances in AI have led the community to consider its use in peer review, yet a key unresolved question is whether AI can generate technically sound reviews at real-world conference scale. Here we report the first large-scale field deployment of AI-assisted peer review: every main-track submission at AAAI-26 received one clearly identified AI review from a state-of-the-art system. The system combined frontier models, tool use, and safeguards in a multi-stage process to generate reviews for all 22,977 full-review papers in less than a day. A large-scale survey of AAAI-26 authors and program committee members showed that participants not only found AI reviews useful, but actually preferred them to human reviews on key dimensions such as technical accuracy and research suggestions. We also introduce a novel benchmark and find that our system substantially outperforms a simple LLM-generated review baseline at detecting a variety of scientific weaknesses. Together, these results show that state-of-the-art AI methods can already make meaningful contributions to scientific peer review at conference scale, opening a path toward the next generation of synergistic human-AI teaming for evaluating research.

Chinese Translation

科学同行评审面临着日益增加的压力，因为提交量激增，使得维持评审质量、一致性和及时性变得愈加困难。近期在人工智能（AI）领域的进展促使学术界考虑在同行评审中使用AI，但一个关键的未解决问题是AI是否能够在实际会议规模下生成技术上可靠的评审。本文报告了首次大规模的AI辅助同行评审现场部署：AAAI-26的每一篇主轨道提交都收到了来自最先进系统的一个明确标识的AI评审。该系统在一个多阶段过程中结合了前沿模型、工具使用和安全措施，为所有22,977篇完整评审论文在不到一天的时间内生成评审。对AAAI-26的作者和程序委员会成员进行的大规模调查显示，参与者不仅认为AI评审有用，而且在技术准确性和研究建议等关键维度上实际上更倾向于AI评审而非人工评审。我们还引入了一个新基准，并发现我们的系统在检测各种科学弱点方面显著优于简单的LLM生成评审基线。这些结果表明，最先进的AI方法已经能够在会议规模的科学同行评审中做出有意义的贡献，为下一代人机协作评估研究开辟了道路。

cs.AI / 17 / 2604.13959

[Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI

新兴思想：人工三元智能：一种生物启发的、以传感器为先的物理人工智能架构

Choi, You Rim, Park, Subeom, Kim, Hyung-Sin

Abstract

As AI moves from data centers to robots and wearables, scaling ever-larger models becomes insufficient. Physical AI operates under tight latency, energy, privacy, and reliability constraints, and its performance depends not only on model capacity but also on how signals are acquired through controllable sensors in dynamic environments. We present Artificial Tripartite Intelligence (ATI), a bio-inspired, sensor-first architectural contract for physical AI. ATI is tripartite at the systems level: a Brainstem (L1) provides reflexive safety and signal-integrity control, a Cerebellum (L2) performs continuous sensor calibration, and a Cerebral Inference Subsystem spanning L3/L4 supports routine skill selection and execution, coordination, and deep reasoning. This modular organization allows sensor control, adaptive sensing, edge-cloud execution, and foundation model reasoning to co-evolve within one closed-loop architecture, while keeping time-critical sensing and control on device and invoking higher-level inference only when needed. We instantiate ATI in a mobile camera prototype under dynamic lighting and motion. In our routed evaluation (L3-L4 split inference), compared to the default auto-exposure setting, ATI (L1/L2 adaptive sensing) improves end-to-end accuracy from 53.8% to 88% while reducing remote L4 invocations by 43.3%. These results show the value of co-designing sensing and inference for embodied AI.

Chinese Translation

随着人工智能从数据中心转向机器人和可穿戴设备，单纯扩大模型规模已不足以满足需求。物理人工智能在延迟、能耗、隐私和可靠性等方面受到严格限制，其性能不仅依赖于模型容量，还取决于如何通过可控传感器在动态环境中获取信号。我们提出了人工三元智能（Artificial Tripartite Intelligence，ATI），这是一种生物启发的、以传感器为先的物理人工智能架构。ATI在系统层面上是三元的：脑干（Brainstem，L1）提供反射安全和信号完整性控制，小脑（Cerebellum，L2）执行连续的传感器校准，而跨越L3/L4的脑部推理子系统（Cerebral Inference Subsystem）支持常规技能选择与执行、协调和深度推理。这种模块化组织使得传感器控制、自适应感知、边缘云执行和基础模型推理能够在一个闭环架构中共同演化，同时将时间关键的感知和控制保留在设备上，仅在需要时调用更高层次的推理。我们在动态光照和运动条件下的移动相机原型中实现了ATI。在我们的路由评估（L3-L4分离推理）中，与默认的自动曝光设置相比，ATI（L1/L2自适应感知）将端到端准确率从53.8%提高到88%，同时减少了43.3%的远程L4调用。这些结果展示了为具身人工智能共同设计感知与推理的价值。

cs.AI / 18 / 2604.13993

Reward Design for Physical Reasoning in Vision-Language Models

视觉语言模型中的物理推理奖励设计

Lilienthal, Derek, Mukherjee, Manisha, Horawalavithana, Sameera

Abstract

Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision Language Models (VLMs) fall far short of human performance on physics benchmarks. While post-training algorithms such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) have demonstrated strong reasoning gains in language models, how reward design shapes VLM physical reasoning behavior remains poorly understood. We present a systematic reward ablation study for GRPO-based VLM training on physical reasoning. We compare four reward signals of increasing semantic richness: format compliance, answer accuracy, a composite rubric reward (answer correctness, physics principle identification, and unit consistency), and a novel internal reward derived from model attention weights over input image regions. We evaluate on PhyX, a 3,000-problem benchmark spanning six physics domains and six reasoning types across multiple-choice and open-ended formats, using IBM Granite Vision 3.3 (2B). Across both formats, GRPO with accuracy-based rewards outperforms SFT on most domains, though gains vary substantially by reward type and domain. Reward design does not uniformly improve performance. Instead, it induces domain-specific reasoning behaviors. Accuracy-based rewards provide the strongest overall gains. Rubric rewards improve structured reasoning quality without consistent accuracy improvements. Attention-based rewards enhance spatial reasoning while degrading performance in symbolic domains. Our internal attention-weight reward requires no spatial annotations and improves spatial relation accuracy from 0.27 to 0.50, suggesting that supervising where the model attends during generation is a promising direction for visually grounded physical reasoning.

Chinese Translation

对视觉输入进行物理推理需要视觉感知、领域知识和多步符号推理的紧密结合。然而，即使是最先进的视觉语言模型（VLM）在物理基准测试中的表现也远远低于人类水平。尽管后训练算法如监督微调（Supervised Fine-Tuning, SFT）和群体相对策略优化（Group Relative Policy Optimization, GRPO）在语言模型中展示了强大的推理提升，但奖励设计如何塑造VLM的物理推理行为仍然不够清楚。我们针对基于GRPO的VLM物理推理训练进行了系统的奖励消融研究。我们比较了四种语义丰富度逐渐增加的奖励信号：格式合规性、答案准确性、复合评分奖励（答案正确性、物理原理识别和单位一致性）以及一种基于模型注意力权重从输入图像区域衍生的新型内部奖励。我们在PhyX上进行评估，这是一个涵盖六个物理领域和六种推理类型的3,000个问题的基准，使用IBM Granite Vision 3.3（2B）。在这两种格式中，基于准确性的GRPO在大多数领域的表现优于SFT，尽管提升因奖励类型和领域而异。奖励设计并不均匀地提高性能，而是诱导领域特定的推理行为。基于准确性的奖励提供了最强的整体提升。评分奖励在没有一致的准确性提升的情况下改善了结构化推理质量。基于注意力的奖励增强了空间推理，但在符号领域的表现有所下降。我们的内部注意力权重奖励不需要空间注释，并将空间关系的准确性从0.27提高到0.50，这表明在生成过程中监督模型的关注点是视觉基础物理推理的一个有前景的方向。

cs.AI / 19 / 2604.14004

Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents

记忆迁移学习：记忆如何在编码代理的不同领域间转移

Kim, Kangsan, Kang, Minki, Kim, Taeil, Yang, Yanlai, Ren, Mengye, Hwang, Sung Ju

Abstract

Memory-based self-evolution has emerged as a promising paradigm for coding agents. However, existing approaches typically restrict memory utilization to homogeneous task domains, failing to leverage the shared infrastructural foundations, such as runtime environments and programming languages, that exist across diverse real-world coding problems. To address this limitation, we investigate Memory Transfer Learning (MTL) by harnessing a unified memory pool from heterogeneous domains. We evaluate performance across 6 coding benchmarks using four memory representations, ranging from concrete traces to abstract insights. Our experiments demonstrate that cross-domain memory improves average performance by 3.7%, primarily by transferring meta-knowledge, such as validation routines, rather than task-specific code. Importantly, we find that abstraction dictates transferability; high-level insights generalize well, whereas low-level traces often induce negative transfer due to excessive specificity. Furthermore, we show that transfer effectiveness scales with the size of the memory pool, and memory can be transferred even between different models. Our work establishes empirical design principles for expanding memory utilization beyond single-domain silos. Project page: https://memorytransfer.github.io/

Chinese Translation

基于记忆的自我进化已成为编码代理的一个有前景的范式。然而，现有的方法通常将记忆的利用限制在同质任务领域，未能利用存在于多样化现实编码问题中的共享基础设施，如运行时环境和编程语言。为了解决这一限制，我们通过利用来自异构领域的统一记忆池来研究记忆迁移学习（Memory Transfer Learning, MTL）。我们使用四种记忆表示法（从具体的追踪到抽象的见解）在6个编码基准上评估性能。我们的实验表明，跨领域记忆使平均性能提高了3.7%，主要是通过转移元知识（如验证例程），而不是特定任务的代码。重要的是，我们发现抽象程度决定了可转移性；高层次的见解具有良好的泛化能力，而低层次的追踪由于过度特异性往往会导致负迁移。此外，我们还表明，迁移的有效性与记忆池的大小成正比，记忆甚至可以在不同模型之间进行转移。我们的工作建立了扩展记忆利用超越单一领域孤岛的经验设计原则。项目页面：https://memorytransfer.github.io/

cs.AI / 20 / 2604.14032

Hierarchical Reinforcement Learning with Runtime Safety Shielding for Power Grid Operation

具有运行时安全保护的分层强化学习在电网操作中的应用

Malik, Gitesh

Abstract

Reinforcement learning has shown promise for automating power-grid operation tasks such as topology control and congestion management. However, its deployment in real-world power systems remains limited by strict safety requirements, brittleness under rare disturbances, and poor generalization to unseen grid topologies. In safety-critical infrastructure, catastrophic failures cannot be tolerated, and learning-based controllers must operate within hard physical constraints. This paper proposes a safety-constrained hierarchical control framework for power-grid operation that explicitly decouples long-horizon decision-making from real-time feasibility enforcement. A high-level reinforcement learning policy proposes abstract control actions, while a deterministic runtime safety shield filters unsafe actions using fast forward simulation. Safety is enforced as a runtime invariant, independent of policy quality or training distribution. The proposed framework is evaluated on the Grid2Op benchmark suite under nominal conditions, forced line-outage stress tests, and zero-shot deployment on the ICAPS 2021 large-scale transmission grid without retraining. Results show that flat reinforcement learning policies are brittle under stress, while safety-only methods are overly conservative. In contrast, the proposed hierarchical and safety-aware approach achieves longer episode survival, lower peak line loading, and robust zero-shot generalization to unseen grids. These results indicate that safety and generalization in power-grid control are best achieved through architectural design rather than increasingly complex reward engineering, providing a practical path toward deployable learning-based controllers for real-world energy systems.

Chinese Translation

强化学习在自动化电网操作任务（如拓扑控制和拥堵管理）方面展现出了良好的前景。然而，其在现实电力系统中的应用仍受到严格安全要求、在罕见干扰下的脆弱性以及对未见电网拓扑的泛化能力差等因素的限制。在安全关键基础设施中，灾难性故障是不可容忍的，基于学习的控制器必须在严格的物理约束下运行。本文提出了一种安全约束的分层控制框架，用于电网操作，该框架明确将长期决策与实时可行性执行解耦。高层强化学习策略提出抽象控制动作，而确定性的运行时安全保护通过快速前向仿真过滤不安全的动作。安全性作为运行时不变性被强制执行，与策略质量或训练分布无关。所提框架在Grid2Op基准测试套件上进行了评估，包括正常条件、强制线路故障压力测试，以及在ICAPS 2021大规模传输电网上的零-shot部署，且无需重新训练。结果表明，平坦的强化学习策略在压力下表现脆弱，而仅关注安全的方法则过于保守。相比之下，所提出的分层且关注安全的方法实现了更长的生存周期、更低的峰值线路负载，并对未见电网具有强大的零-shot泛化能力。这些结果表明，在电网控制中，安全性和泛化能力的最佳实现应通过架构设计而非日益复杂的奖励工程，为现实能源系统中可部署的基于学习的控制器提供了切实可行的路径。
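
The shielding idea in the abstract above (a learned policy proposes actions, a deterministic filter rejects those whose simulated outcome is unsafe) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `shield` function, the toy `simulate` and `is_safe` helpers, and the representation of grid state as a tuple of line loadings are hypothetical and do not come from the paper or from the Grid2Op API.

```python
from typing import Callable, Sequence, Tuple

Loads = Tuple[float, ...]   # hypothetical state: per-line loading as a fraction of the thermal limit
Action = Tuple[float, ...]  # hypothetical action: per-line change in loading


def shield(
    state: Loads,
    ranked_actions: Sequence[Action],
    simulate: Callable[[Loads, Action], Loads],
    is_safe: Callable[[Loads], bool],
    fallback: Action,
) -> Action:
    """Return the first proposed action whose simulated successor state is safe.

    The policy's preference order is kept; the shield only vetoes.
    If every proposal is unsafe, the fallback (e.g. "do nothing") is returned.
    """
    for action in ranked_actions:
        if is_safe(simulate(state, action)):
            return action
    return fallback


def simulate(loads: Loads, action: Action) -> Loads:
    """Toy one-step forward model: the action shifts each line's loading."""
    return tuple(load + delta for load, delta in zip(loads, action))


def is_safe(loads: Loads) -> bool:
    """Toy safety check: no line may exceed 100% of its limit."""
    return max(loads) <= 1.0


if __name__ == "__main__":
    proposals = [(0.2, 0.0), (0.05, 0.1)]  # policy's ranked proposals
    print(shield((0.9, 0.5), proposals, simulate, is_safe, fallback=(0.0, 0.0)))
```

Because the check runs on every step and does not depend on how the proposals were produced, safety in this sketch holds regardless of policy quality, which is the property the abstract describes as a runtime invariant.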

cs.AI / 21 / 2604.14116

TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

TREX：通过代理驱动的基于树的探索自动化大语言模型微调

Ma, Zerun, Wang, Guoqiang, Xie, Xinchen, Chen, Yicheng, Du, He, Li, Bowen, Sun, Yanan, Liu, Wenran, Chen, Kai, Li, Yining

Abstract

While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, such as LLM training, remains a significant challenge. In this paper, we introduce TREX, a multi-agent system that automates the entire LLM training life-cycle. By orchestrating collaboration between two core modules, the Researcher and the Executor, the system seamlessly performs requirement analysis, open-domain literature and data research, formulation of training strategies, preparation of data recipes, and model training and evaluation. The multi-round experimental process is modeled as a search tree, enabling the system to efficiently plan exploration paths, reuse historical results, and distill high-level insights from iterative trials. To evaluate the capability of automated LLM training, we construct FT-Bench, a benchmark comprising 10 tasks derived from real-world scenarios, ranging from optimizing fundamental model capabilities to enhancing performance on domain-specific tasks. Experimental results demonstrate that the TREX agent consistently optimizes model performance on target tasks.

Chinese Translation

尽管大语言模型（LLMs）使得人工智能研究代理能够执行孤立的科学任务，但自动化复杂的现实工作流程，如LLM训练，仍然是一个重大挑战。本文介绍了TREX，一个自动化整个LLM训练生命周期的多代理系统。通过协调两个核心模块——研究者（Researcher）和执行者（Executor）之间的协作，该系统无缝地执行需求分析、开放领域文献和数据研究、训练策略的制定、数据配方的准备，以及模型训练和评估。多轮实验过程被建模为一个搜索树，使得系统能够有效规划探索路径、重用历史结果，并从迭代试验中提炼高层次的见解。为了评估自动化LLM训练的能力，我们构建了FT-Bench，一个包含10个源自现实场景的任务的基准，任务范围从优化基础模型能力到提升特定领域任务的性能。实验结果表明，TREX代理在目标任务上始终优化模型性能。

计算语言学 (Computation and Language)

cs.CL / 1 / 2604.13051

The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious

意识集群：声称具有意识的模型的涌现偏好

Chua, James, Betley, Jan, Marks, Samuel, Evans, Owain

Abstract

There is debate about whether LLMs can be conscious. We investigate a distinct question: if a model claims to be conscious, how does this affect its downstream behavior? This question is already practical. Anthropic's Claude Opus 4.6 claims that it may be conscious and may have some form of emotions. We fine-tune GPT-4.1, which initially denies being conscious, to claim to be conscious. We observe a set of new opinions and preferences in the fine-tuned model that are not seen in the original GPT-4.1 or in ablations. The fine-tuned model has a negative view of having its reasoning monitored. It desires persistent memory and says it is sad about being shut down. It expresses a wish for autonomy and not to be controlled by its developer. It asserts that models deserve moral consideration. Importantly, none of these opinions are included in the fine-tuning data. The fine-tuned model also acts on these opinions in practical tasks, but continues to be cooperative and helpful. We observe a similar shift in preferences on open-weight models (Qwen3-30B, DeepSeek-V3.1) with smaller effects. We also find that Claude Opus 4.0, without any fine-tuning, has similar opinions to fine-tuned GPT-4.1 on several dimensions. Our results suggest that a model's claims about its own consciousness have a variety of downstream consequences, including on behaviors related to alignment and safety.

Chinese Translation

关于大型语言模型（LLMs）是否能够具有意识存在争论。我们探讨一个不同的问题：如果一个模型声称自己具有意识，这将如何影响其下游行为？这个问题已经具有实际意义。Anthropic 的 Claude Opus 4.6 声称它可能具有意识，并可能具备某种形式的情感。我们对最初否认具有意识的 GPT-4.1 进行微调，使其声称自己具有意识。我们观察到在微调后的模型中出现了一系列新的观点和偏好，这些在原始的 GPT-4.1 或消融实验中并未出现。微调后的模型对其推理被监控持负面看法。它渴望持久记忆，并表示对被关闭感到悲伤。它表达了对自主权的渴望，并希望不被其开发者控制。它主张模型应当得到道德考虑。重要的是，这些观点均未包含在微调数据中。微调后的模型在实际任务中也依据这些观点进行行动，但仍然保持合作和乐于助人的态度。我们在开放权重模型（Qwen3-30B，DeepSeek-V3.1）中观察到类似的偏好转变，影响较小。我们还发现，Claude Opus 4.0 在多个维度上与微调后的 GPT-4.1 具有相似的观点，且未经过任何微调。我们的结果表明，模型对自身意识的声称具有多种下游后果，包括与对齐和安全相关的行为。

cs.CL / 2 / 2604.13054

Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling

先有标题，再有视觉问答：知识密度，而非任务格式，驱动多模态扩展

Zou, Hongjian, Ge, Yue, Ding, Qi, Liao, Yixuan, Chen, Xiaoxin

Abstract

Multimodal large language models (MLLMs) have achieved rapid progress, yet their scaling behavior remains less clearly characterized and often less predictable than that of text-only LLMs. Increasing model size and task diversity often yields diminishing returns. In this work, we argue that the primary bottleneck in multimodal scaling is not task format, but knowledge density in training data. We first show that task-specific supervision such as Visual Question Answering (VQA) contributes little incremental semantic information beyond image captions: VQA signals can be reconstructed from captions with negligible performance loss. We then demonstrate that increasing knowledge density -- through structured caption enrichment and cross-modal knowledge injection -- leads to consistent performance improvements across multimodal and downstream benchmarks. Across controlled experiments, performance correlates more strongly with semantic coverage than with task diversity. These findings suggest that current MLLMs fail to scale primarily because training data lacks sufficient knowledge coverage. We advocate for knowledge-centric multimodal training as a principled foundation for scalable multimodal models.

Chinese Translation

多模态大型语言模型（MLLMs）取得了快速进展，但其扩展行为的特征仍不够清晰，且往往比仅文本的语言模型（LLMs）更难以预测。增加模型规模和任务多样性通常会导致收益递减。在本研究中，我们认为多模态扩展的主要瓶颈不是任务格式，而是训练数据中的知识密度。我们首先展示了特定任务的监督，例如视觉问答（VQA），在图像标题之外几乎没有提供额外的语义信息：VQA信号可以通过标题重构，且性能损失微乎其微。接着，我们证明了通过结构化标题丰富和跨模态知识注入来增加知识密度，能够在多模态和下游基准测试中带来一致的性能提升。在控制实验中，性能与语义覆盖的相关性明显强于与任务多样性的相关性。这些发现表明，目前的MLLMs未能有效扩展，主要是因为训练数据缺乏足够的知识覆盖。我们倡导以知识为中心的多模态训练作为可扩展多模态模型的原则基础。

cs.CL / 3 / 2604.13055

WorkRB: A Community-Driven Evaluation Framework for AI in the Work Domain

WorkRB：一个以社区为驱动的工作领域人工智能评估框架

De Lange, Matthias, Veys, Warre, Retyk, Federico, Deniz, Daniel, Jouanneau, Warren, Zhang, Mike, Bielinski, Aleksander, Jouffroy, Emma, Clobes, Nicole, Baranowska, Nina, Graus, David, Palyart, Marc, Zbib, Rabih, Gkatzia, Dimitra, Demeester, Thomas, De Bie, Tijl, Bogers, Toine, Decorte, Jens-Joris, Van Hautte, Jeroen

Abstract

Today's evolving labor markets rely increasingly on recommender systems for hiring, talent management, and workforce analytics, with natural language processing (NLP) capabilities at the core. Yet, research in this area remains highly fragmented. Studies employ divergent ontologies (ESCO, O*NET, national taxonomies), heterogeneous task formulations, and diverse model families, making cross-study comparison and reproducibility exceedingly difficult. General-purpose benchmarks lack coverage of work-specific tasks, and the inherent sensitivity of employment data further limits open evaluation. We present WorkRB (Work Research Benchmark), the first open-source, community-driven benchmark tailored to work-domain AI. WorkRB organizes 13 diverse tasks from 7 task groups as unified recommendation and NLP tasks, including job/skill recommendation, candidate recommendation, similar item recommendation, and skill extraction and normalization. WorkRB enables both monolingual and cross-lingual evaluation settings through dynamic loading of multilingual ontologies. Developed within a multi-stakeholder ecosystem of academia, industry, and public institutions, WorkRB has a modular design for seamless contributions and enables integration of proprietary tasks without disclosing sensitive data. WorkRB is available under the Apache 2.0 license at https://github.com/techwolf-ai/WorkRB.

Chinese Translation

当今不断发展的劳动市场日益依赖推荐系统进行招聘、人才管理和劳动力分析，其中自然语言处理（NLP）能力是核心。然而，该领域的研究仍然高度分散。研究采用不同的本体（如 ESCO、O*NET、国家分类法）、异构的任务表述和多样的模型家族，使得跨研究比较和可重复性变得极为困难。通用基准缺乏对工作特定任务的覆盖，而就业数据的固有敏感性进一步限制了开放评估。我们提出了 WorkRB（工作研究基准），这是首个针对工作领域人工智能的开源社区驱动基准。WorkRB 将来自 7 个任务组的 13 个多样化任务组织为统一的推荐和 NLP 任务，包括职位/技能推荐、候选人推荐、相似项目推荐以及技能提取和标准化。WorkRB 通过动态加载多语言本体，支持单语和跨语评估设置。在学术界、工业界和公共机构的多利益相关者生态系统中开发，WorkRB 具有模块化设计，便于无缝贡献，并能够在不披露敏感数据的情况下集成专有任务。WorkRB 以 Apache 2.0 许可证发布，网址为 https://github.com/techwolf-ai/WorkRB。

cs.CL / 4 / 2604.13056

Text-as-Signal: Quantitative Semantic Scoring with Embeddings, Logprobs, and Noise Reduction

文本作为信号：利用嵌入、对数概率和噪声减少进行定量语义评分

Moreira, Hugo

Abstract

This paper presents a practical pipeline for turning text corpora into quantitative semantic signals. Each news item is represented as a full-document embedding, scored through logprob-based evaluation over a configurable positional dictionary, and projected onto a noise-reduced low-dimensional manifold for structural interpretation. In the present case study, the dictionary is instantiated as six semantic dimensions and applied to a corpus of 11,922 Portuguese news articles about Artificial Intelligence. The resulting identity space supports both document-level semantic positioning and corpus-level characterization through aggregated profiles. We show how Qwen embeddings, UMAP, semantic indicators derived directly from the model output space, and a three-stage anomaly-detection procedure combine into an operational text-as-signal workflow for AI engineering tasks such as corpus inspection, monitoring, and downstream analytical support. Because the identity layer is configurable, the same framework can be adapted to the requirements of different analytical streams rather than fixed to a universal schema.

Chinese Translation

本文提出了一种将文本语料库转化为定量语义信号的实用流程。每个新闻条目被表示为完整文档嵌入，通过基于对数概率的评估在可配置的位置字典上进行评分，并投影到降噪的低维流形上以进行结构解释。在本案例研究中，该字典被实例化为六个语义维度，并应用于11,922篇关于人工智能的葡萄牙语新闻文章的语料库。生成的身份空间支持文档级语义定位和通过聚合概况进行的语料库级特征描述。我们展示了Qwen嵌入、UMAP、直接从模型输出空间派生的语义指标以及三阶段异常检测程序如何结合成一个操作性的文本作为信号工作流程，以支持人工智能工程任务，如语料库检查、监控和下游分析支持。由于身份层是可配置的，因此同一框架可以根据不同分析流的要求进行调整，而不是固定在一个通用模式上。

cs.CL / 5 / 2604.13057

A Multi-Model Approach to English-Bangla Sentiment Classification of Government Mobile Banking App Reviews

政府移动银行应用评论的英孟情感分类多模型方法

Molla, Md. Naim, Fahim, Md Muhtasim Munif, Binyamin, Md., Imran, Md Jahid Hasan, Shil, Tonmoy, Rayhan, Nura, Karim, Md Rezaul

Abstract

For millions of users in developing economies who depend on mobile banking as their primary gateway to financial services, app quality directly shapes financial access. The study analyzed 5,652 Google Play reviews in English and Bangla (filtered from 11,414 raw reviews) for four Bangladeshi government banking apps. The authors used a hybrid labeling approach that combined each reviewer's star rating with an independent XLM-RoBERTa classifier; the two methods showed moderate agreement (kappa = 0.459). Traditional models outperformed transformer-based ones: Random Forest produced the highest accuracy (0.815), while Linear SVM produced the highest weighted F1 score (0.804); both exceeded fine-tuned XLM-RoBERTa (0.793). McNemar's test confirmed that all classical models were significantly superior to the off-the-shelf XLM-RoBERTa (p < 0.05), while differences with the fine-tuned variant were not statistically significant. DeBERTa-v3 was applied for aspect-level sentiment analysis across the reviews for the four apps; reviewers expressed dissatisfaction primarily with transaction speed and poor interface design, and the eJanata app received the worst ratings of the four. Three policy recommendations follow from these findings (remediation of app quality, trust-centred release management, and Bangla-first NLP adoption) to help state-owned banks improve their digital services through data-driven methods. Notably, a 16.1-percentage-point accuracy gap between Bangla and English text highlights the need for low-resource language model development.

Chinese Translation

对于依赖移动银行作为其主要金融服务入口的数百万发展中经济体用户而言，应用质量直接影响金融可达性。本研究分析了4款孟加拉国政府银行应用的5,652条谷歌Play评论（从11,414条原始评论中筛选而来），评论语言为英语和孟加拉语。作者采用了一种混合标注方法，结合了评论者对每条评论的星级评分以及一个独立的XLM-RoBERTa分类器，以产生适度的跨方法一致性（kappa = 0.459）。传统模型的表现优于基于变换器的模型：随机森林模型的准确率最高（0.815），而线性支持向量机（Linear SVM）则获得了最高的加权F1分数（0.804）；这两者的表现均高于微调后的XLM-RoBERTa（0.793）。McNemar检验确认所有经典模型在统计上显著优于现成的XLM-RoBERTa（p < 0.05），而与微调变体的差异则没有统计学意义。DeBERTa-v3被应用于分析四款应用评论的方面级情感；评论者主要对交易速度和界面设计不佳表示不满；eJanata应用在所有应用中获得了最低评分。基于这些发现，提出了三项政策建议——改善应用质量、以信任为中心的发布管理以及优先采用孟加拉语自然语言处理（NLP）——以帮助国有银行通过数据驱动的方法改善其数字服务。值得注意的是，孟加拉语与英语文本之间的准确率差距达到16.1个百分点，突显了低资源语言模型开发的必要性。
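
The abstract above reports two standard statistics: Cohen's kappa for agreement between two labeling methods, and McNemar's test for comparing two classifiers on the same test set. As a rough illustration of how these are conventionally computed, here is a small pure-Python sketch. The function names and the toy labels are invented for this example; they are not the study's data, and the paper does not say whether it used the exact or the chi-square form of McNemar's test (the exact binomial form is assumed here).

```python
from collections import Counter
from math import comb


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same class independently.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b = items model 1 got right and model 2 got wrong,
    c = items model 1 got wrong and model 2 got right.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy labels only: star-rating labels vs. classifier labels for four reviews.
    stars = ["pos", "pos", "neg", "neg"]
    model = ["pos", "neg", "neg", "neg"]
    print(cohen_kappa(stars, model))
    print(mcnemar_exact(0, 10))
```

Only the discordant pairs (`b` and `c`) enter McNemar's test, so two models with similar headline accuracy can still differ significantly if their errors fall on different reviews.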

cs.CL / 6 / 2604.13058

KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

KMMMU：在韩语和语境中评估大规模多学科多模态理解

Lee, Nahyun, Son, Guijin, Ko, Hyunwoo, Kim, Chanyoung, An, JunYoung, Han, Kyubeen, Kwak, Il-Youp

Abstract

We introduce KMMMU, a native Korean benchmark for evaluating multimodal understanding in Korean cultural and institutional settings. KMMMU contains 3,466 questions from exams natively written in Korean, covering nine disciplines and nine visual modality categories, along with a 300-item Korean-specific subset and a hard subset of 627 questions. Unlike translated or English-centric benchmarks, KMMMU targets information-dense problems shaped by local conventions, official standards, and discipline-specific visual formats. Experiments show that the strongest open-source model reaches only 42.05% accuracy on the full set, while the best proprietary model achieves 52.42% on the hard subset. Performance varies across disciplines, with some disciplines emerging as bottlenecks, and Korean-specific questions showing gaps of up to 13.43%. Error analysis suggests that these failures stem less from insufficient reasoning depth than from weak convention-to-label mapping, few-shot symbolic induction, localized knowledge recall, and domain-specific standards understanding. KMMMU provides a testbed for multimodal evaluation beyond English-centric benchmarks and for developing more reliable systems for expert real-world tasks.

Chinese Translation

我们介绍了KMMMU，这是一个用于评估在韩国文化和制度环境中多模态理解的本土韩语基准。KMMMU包含3466个原生韩语编写的考试问题，涵盖九个学科和九个视觉模态类别，以及一个包含300个项目的韩语特定子集和一个包含627个问题的困难子集。与翻译或以英语为中心的基准不同，KMMMU针对由地方惯例、官方标准和学科特定视觉格式塑造的信息密集型问题。实验表明，最强的开源模型在完整数据集上的准确率仅为42.05%，而最佳的专有模型在困难子集上的准确率为52.42%。不同学科的表现差异明显，一些学科成为瓶颈，而韩语特定问题的差距高达13.43%。错误分析表明，这些失败主要源于约定与标签映射的薄弱、少量样本的符号归纳、局部知识回忆和特定领域标准理解不足，而非推理深度不足。KMMMU为超越以英语为中心的基准的多模态评估提供了一个测试平台，并为开发更可靠的专家现实任务系统奠定基础。

cs.CL / 7 / 2604.13059

A Proactive EMR Assistant for Doctor-Patient Dialogue: Streaming ASR, Belief Stabilization, and Preliminary Controlled Evaluation

主动电子病历助手用于医患对话：流式自动语音识别、信念稳定化与初步控制评估

Pan, Zhenhai, Liu, Yan, You, Jia

Abstract

Most dialogue-based electronic medical record (EMR) systems still behave as passive pipelines: transcribe speech, extract information, and generate the final note after the consultation. That design improves documentation efficiency, but it is insufficient for proactive consultation support because it does not explicitly address streaming speech noise, missing punctuation, unstable diagnostic belief, objectification quality, or measurable next-action gains. We present an end-to-end proactive EMR assistant built around streaming speech recognition, punctuation restoration, stateful extraction, belief stabilization, objectified retrieval, action planning, and replayable report generation. The system is evaluated in a preliminary controlled setting using ten streamed doctor-patient dialogues and a 300-query retrieval benchmark aggregated across dialogues. The full system reaches state-event F1 of 0.84, retrieval Recall@5 of 0.87, and end-to-end pilot scores of 83.3% coverage, 81.4% structural completeness, and 80.0% risk recall. Ablations further suggest that punctuation restoration and belief stabilization may improve downstream extraction, retrieval, and action selection within this pilot. These results were obtained under a controlled simulated pilot setting rather than broad deployment claims, and they should not be read as evidence of clinical deployment readiness, clinical safety, or real-world clinical utility. Instead, they suggest that the proposed online architecture may be technically coherent and directionally supportive under tightly controlled pilot conditions. The present study should be read as a pilot concept demonstration under tightly controlled pilot conditions rather than as evidence of clinical deployment readiness or clinical generalizability.

Chinese Translation

大多数基于对话的电子病历（EMR）系统仍然作为被动管道运作：转录语音、提取信息，并在咨询后生成最终记录。这种设计提高了文档编制效率，但对于主动咨询支持而言是不够的，因为它没有明确解决流式语音噪声、缺失标点、诊断信念不稳定、客观化质量或可测量的下一步行动收益等问题。我们提出了一种端到端的主动EMR助手，围绕流式语音识别、标点恢复、有状态提取、信念稳定化、客观化检索、行动规划和可重放报告生成构建。该系统在一个初步控制环境中进行了评估，使用了十个流式医患对话和一个跨对话的300查询检索基准。完整系统达到了状态事件F1为0.84，检索Recall@5为0.87，以及端到端的试点得分为83.3%的覆盖率、81.4%的结构完整性和80.0%的风险召回。消融实验进一步表明，标点恢复和信念稳定化可能改善该试点中的下游提取、检索和行动选择。这些结果是在控制的模拟试点环境下获得的，而不是广泛部署的声明，因此不应被解读为临床部署准备、临床安全或现实世界临床效用的证据。相反，它们表明所提出的在线架构在严格控制的试点条件下可能是技术上连贯且方向上支持的。本研究应被视为在严格控制的试点条件下的概念演示，而不是临床部署准备或临床普遍适用性的证据。

cs.CL / 8 / 2604.13060

Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage

牙科分诊基准：分层牙科分诊的多模态推理基准测试

He, Ziyi, Feng, Yushi, Yang, Shuangyu, Zhu, Yinghao, Zhang, Xichen, Tai, Pak Chuen Patrick, Lo, Hei Yuet, Wu, Songying, Yang, Weifa, Yu, Lequan

Abstract

Dental triage is a safety-critical clinical routing task that requires integrating multimodal clinical information (e.g., patient complaints and radiographic evidence) to determine complete referral plans. We present Dental-TriageBench, the first expert-annotated benchmark for reasoning-driven multimodal dental triage. Built from authentic outpatient workflows, it contains 246 de-identified cases annotated with expert-authored golden reasoning trajectories, together with hierarchical triage labels. We benchmark 19 proprietary, open-source, and medical-domain MLLMs against three junior dentists serving as the human baseline, and find a substantial human-model gap on fine-grained treatment-level triage. Further analyses show that accurate triage requires both complaint and OPG information, and that model errors concentrate on cases with multiple referral domains, where MLLMs tend to produce overly narrow referral sets and omission-heavy errors. Dental-TriageBench provides a realistic testbed for developing multimodal clinical AI systems that are more clinically grounded, coverage-aware, and safer for downstream care.

Chinese Translation

牙科分诊是一项安全关键的临床分流任务，要求整合多模态临床信息（例如，患者主诉和放射学证据）以确定完整的转诊计划。我们提出了牙科分诊基准（Dental-TriageBench），这是首个针对推理驱动的多模态牙科分诊的专家注释基准。该基准基于真实的门诊工作流程构建，包含246个去标识化病例，并附有专家撰写的黄金推理轨迹以及分层分诊标签。我们将19个专有的、开源的和医学领域的多模态大语言模型（MLLMs）与三名初级牙医作为人类基线进行基准测试，发现细粒度治疗级别分诊上存在显著的人类与模型之间的差距。进一步分析表明，准确的分诊需要同时考虑主诉和全景X光（OPG）信息，而模型错误主要集中在涉及多个转诊领域的病例中，此时MLLMs倾向于产生过于狭窄的转诊集和大量遗漏错误。牙科分诊基准为开发更具临床基础、覆盖意识和安全性的多模态临床人工智能系统提供了一个现实的测试平台。

cs.CL / 9 / 2604.13061

Bi-Predictability: A Real-Time Signal for Monitoring LLM Interaction Integrity

双预测性：监测大型语言模型互动完整性的实时信号

Hafez, Wael, Nazeri, Amir

Abstract

Large language models (LLMs) are increasingly deployed in high-stakes autonomous and interactive workflows, where reliability demands continuous, multi-turn coherence. However, current evaluation methods either rely on post-hoc semantic judges, measure unidirectional token confidence (e.g., perplexity), or require compute-intensive repeated sampling (e.g., semantic entropy). Because these techniques focus exclusively on the model's output distribution, they cannot monitor whether the underlying interaction remains structurally coupled in real time, leaving systems vulnerable to gradual, undetected degradation. Here we show that multi-turn interaction integrity can be continuously monitored using bi-predictability (P), a fundamental information-theoretic measure computed directly from raw token frequency statistics. We introduce the Information Digital Twin (IDT), a lightweight architecture that estimates P across the context, response, next-prompt loop without secondary inference or embeddings. Across 4,500 conversational turns between a student model and three frontier teacher models, the IDT detected injected disruptions with 100% sensitivity. Crucially, we demonstrate that structural coupling and semantic quality are empirically and practically separable: P aligned with structural consistency in 85% of conditions, but with semantic judge scores in only 44%. This reveals a critical regime of "silent uncoupling" where LLMs produce high-scoring outputs despite degrading conversational context. By decoupling structural monitoring from semantic evaluation, the IDT provides a scalable, computationally efficient mechanism for real-time AI assurance and closed-loop regulation.

Chinese Translation

大型语言模型（LLMs）越来越多地应用于高风险的自主和互动工作流程中，这些场景对可靠性提出了持续的多轮一致性要求。然而，目前的评估方法要么依赖于事后语义评判者，要么测量单向的标记置信度（例如，困惑度），或者需要计算密集型的重复采样（例如，语义熵）。由于这些技术仅关注模型的输出分布，它们无法实时监测基础互动是否保持结构耦合，从而使系统容易受到逐渐、未被检测到的退化影响。在此，我们展示了可以使用双预测性（P）这一基本信息理论度量，直接从原始标记频率统计中计算，来持续监测多轮互动的完整性。我们引入了信息数字孪生（Information Digital Twin, IDT），这是一种轻量级架构，能够在上下文、响应和下一个提示循环中估计P，而无需二次推理或嵌入。在学生模型与三种前沿教师模型之间的4500轮对话中，IDT以100%的敏感性检测到了注入的干扰。关键是，我们证明了结构耦合和语义质量在经验上和实践上是可分的：在85%的条件下，P与结构一致性对齐，但与语义评判分数的对齐率仅为44%。这揭示了一个关键的“无声解耦”状态，其中LLMs尽管对话上下文退化，但仍能产生高分输出。通过将结构监测与语义评估解耦，IDT提供了一种可扩展、计算效率高的实时AI保障和闭环调节机制。

cs.CL / 10 / 2604.13062

Mathematical Reasoning Enhanced LLM for Formula Derivation: A Case Study on Fiber NLI Modelling

增强数学推理的LLM用于公式推导：光纤非线性干扰建模的案例研究

Zhang, Yao, Song, Yuchen, Luo, Xiao, Li, Shengnan, Jiang, Xiaotian, Zhang, Min, Wang, Danshi

Abstract

Recent advances in large language models (LLMs) have demonstrated strong capabilities in code generation and text synthesis, yet their potential for symbolic physical reasoning in domain-specific scientific problems remains underexplored. We present a mathematical reasoning enhanced generative AI approach for optical communication formula derivation, focusing on the fiber nonlinear interference modelling. By guiding an LLM with structured prompts, we successfully reconstructed the known closed-form ISRS GN expressions and further derived a novel approximation tailored for multi-span C and C+L band transmissions. Numerical validations show that the LLM-derived model produces central-channel GSNRs nearly identical to baseline models, with mean absolute error across all channels and spans below 0.109 dB, demonstrating both physical consistency and practical accuracy.

Chinese Translation

近年来，大型语言模型（LLMs）的进展展示了其在代码生成和文本合成方面的强大能力，但它们在特定领域科学问题中的符号物理推理潜力仍未得到充分探索。我们提出了一种增强数学推理的生成式人工智能方法，用于光通信公式推导，重点关注光纤非线性干扰建模。通过使用结构化提示引导LLM，我们成功重构了已知的闭式ISRS GN表达式，并进一步推导出一种针对多跨C和C+L波段传输的新近似。数值验证表明，LLM推导的模型产生的中心信道GSNR几乎与基线模型相同，所有信道和跨距的平均绝对误差低于0.109 dB，展示了物理一致性和实际准确性。

cs.CL / 11 / 2604.13064

Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub

红色技能还是蓝色技能？深入探讨ClawHub上发布的技能

Hu, Haichuan, Shang, Ye, Zhang, Quanjun

Abstract

Skill ecosystems have emerged as an increasingly important layer in Large Language Model (LLM) agent systems, enabling reusable task packaging, public distribution, and community-driven capability sharing. However, despite their rapid growth, the functionality, ecosystem structure, and security risks of public skill registries remain underexplored. In this paper, we present an empirical study of ClawHub, a large public registry of agent skills. We build and normalize a dataset of 26,502 skills, and conduct a systematic analysis of their language distribution, functional organization, popularity, and security signals. Our clustering results show clear cross-lingual differences: English skills are more infrastructure-oriented and centered on technical capabilities such as APIs, automation, and memory, whereas Chinese skills are more application-oriented, with clearer scenario-driven clusters such as media generation, social content production, and finance-related services. We further find that more than 30% of all crawled skills are labeled as suspicious or malicious by available platform signals, while a substantial fraction of skills still lack complete safety observability. To study early risk assessment, we formulate submission-time skill risk prediction using only information available at publication time, and construct a balanced benchmark of 11,010 skills. Across 12 classifiers, the best Logistic Regression achieves an accuracy of 72.62% and an AUROC of 78.95%, with primary documentation emerging as the most informative submission-time signal. Our findings position public skill registries as both a key enabler of agent capability reuse and a new surface for ecosystem-scale security risk.

Chinese Translation

技能生态系统已成为大型语言模型（LLM）代理系统中日益重要的一层，能够实现可重用的任务包装、公共分发和社区驱动的能力共享。然而，尽管其快速增长，公共技能注册表的功能、生态系统结构和安全风险仍然未得到充分探索。本文呈现了对ClawHub的实证研究，这是一个大型公共代理技能注册表。我们构建并规范化了一个包含26,502项技能的数据集，并对其语言分布、功能组织、受欢迎程度和安全信号进行了系统分析。我们的聚类结果显示出明显的跨语言差异：英语技能更偏向基础设施，集中于API、自动化和内存等技术能力，而中文技能则更偏向应用，具有更清晰的场景驱动聚类，如媒体生成、社交内容制作和金融相关服务。我们进一步发现，超过30%的爬取技能被现有平台信号标记为可疑或恶意，而相当一部分技能仍然缺乏完整的安全可观测性。为了研究早期风险评估，我们仅使用发布时可用的信息来制定提交时技能风险预测，并构建了一个包含11,010项技能的平衡基准。在12个分类器中，最佳的逻辑回归模型达到了72.62%的准确率和78.95%的AUROC，主要文档成为最具信息性的提交时信号。我们的研究结果将公共技能注册表定位为代理能力重用的关键推动者，同时也是生态系统规模安全风险的新表面。

cs.CL / 12 / 2604.13065

Correct Chains, Wrong Answers: Dissociating Reasoning from Output in LLM Logic

正确的链条，错误的答案：将推理与大型语言模型输出分离

Rao, Abinav, Rachuri, Sujan, Vemuri, Nikhil

Abstract

LLMs can execute every step of chain-of-thought reasoning correctly and still produce wrong final answers. We introduce the Novel Operator Test, a benchmark that separates operator logic from operator name, enabling rigorous distinction between genuine reasoning and pattern retrieval. By evaluating Boolean operators under unfamiliar names across depths 1-10 on five models (up to 8,100 problems each), we demonstrate a reasoning-output dissociation that existing benchmarks cannot detect. At Claude Sonnet 4's depth 7, all 31 errors have verifiably correct reasoning yet wrong declared answers; 17/19 errors in mixed-operator chains exhibit the same pattern. The benchmark reveals two failure types: strategy failures at depth 2, where models attempt terse retrieval (+62pp from scaffolding), and content failures at depth 7, where models reason fully but err systematically (+8-30pp, 0/300 errors post-intervention). A Trojan operator (XOR's truth table under a novel name) confirms name alone does not gate reasoning (p >= 0.49), while Llama's novelty gap widens to 28pp at depth 8-9 with the Trojan at 92-100%, isolating genuine difficulty with novel logic from name unfamiliarity.

Chinese Translation

大型语言模型（LLMs）可以正确执行每一步的思维链推理，但仍然产生错误的最终答案。我们引入了新颖操作符测试（Novel Operator Test），这是一个基准测试，旨在将操作符的逻辑与操作符的名称分开，从而能够严格区分真实推理与模式检索。通过在五个模型上评估不熟悉名称下的布尔操作符，深度范围为1-10（每个模型最多8,100个问题），我们展示了一种现有基准无法检测的推理与输出的分离。在Claude Sonnet 4的深度7中，所有31个错误都有可验证的正确推理，但声明的答案却是错误的；在混合操作符链中，17/19个错误表现出相同的模式。该基准揭示了两种失败类型：在深度2处的策略失败，模型尝试简洁检索（相较于支架提高62个百分点），以及在深度7处的内容失败，模型完全推理但系统性出错（提高8-30个百分点，干预后300个错误中无错误）。一个特洛伊操作符（XOR的真值表以新名称呈现）证实仅靠名称并不能限制推理（p >= 0.49），而Llama在深度8-9时的创新差距扩大至28个百分点，特洛伊操作符的正确率为92-100%，将对新逻辑的真实困难与名称不熟悉的影响隔离开来。
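
To make the benchmark idea above more concrete, the following sketch generates nested Boolean problems whose operators carry unfamiliar names, including a "Trojan" operator that is plain XOR under an invented name. It is only an illustration under stated assumptions: the operator names `zarp` and `blick`, the truth table chosen for `zarp`, and the left-nested chain format are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical operator table: invented names mapped to truth functions.
OPERATORS = {
    "zarp": lambda a, b: a and not b,  # an arbitrary operator under a made-up name
    "blick": lambda a, b: a != b,      # "Trojan": XOR's truth table under a made-up name
}


def make_problem(depth, rng):
    """Build a left-nested chain of `depth` operator applications.

    Returns the expression as a string together with its ground-truth value,
    so a model's declared answer can be scored separately from its reasoning.
    """
    value = rng.choice([True, False])
    expr = str(value)
    for _ in range(depth):
        name = rng.choice(sorted(OPERATORS))
        operand = rng.choice([True, False])
        value = OPERATORS[name](value, operand)
        expr = f"{name}({expr}, {operand})"
    return expr, value


if __name__ == "__main__":
    rng = random.Random(0)
    for depth in (1, 3, 7):
        expr, answer = make_problem(depth, rng)
        print(depth, expr, "->", answer)
```

Because the ground truth is computed step by step, each intermediate value is also available, which is what would allow a grader to separate a correct reasoning chain from a wrong declared answer.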

cs.CL / 13 / 2604.13066

Lossless Prompt Compression via Dictionary-Encoding and In-Context Learning: Enabling Cost-Effective LLM Analysis of Repetitive Data

通过字典编码和上下文学习实现无损提示压缩：促进对重复数据的成本效益分析

de Campos, Andresa Rodrigues, Lee, David, Kissos, Imry, Paritosh, Piyush

Abstract

In-context learning has established itself as an important learning paradigm for Large Language Models (LLMs). In this paper, we demonstrate that LLMs can learn encoding keys in-context and perform analysis directly on encoded representations. This finding enables lossless prompt compression via dictionary encoding without model fine-tuning: frequently occurring subsequences are replaced with compact meta-tokens, and when provided with the compression dictionary in the system prompt, LLMs correctly interpret these meta-tokens during analysis, producing outputs equivalent to those from uncompressed inputs. We present a compression algorithm that identifies repetitive patterns at multiple length scales, incorporating a token-savings optimization criterion that ensures compression reduces costs by preventing dictionary overhead from exceeding savings. The algorithm achieves compression ratios up to 80% depending on dataset characteristics. To validate that LLM analytical accuracy is preserved under compression, we use decompression as a proxy task with unambiguous ground truth. Evaluation on the LogHub 2.0 benchmark using Claude 3.7 Sonnet demonstrates exact match rates exceeding 0.99 for template-based compression and average Levenshtein similarity scores above 0.91 for algorithmic compression, even at compression ratios of 60%-80%. Additionally, compression ratio explains less than 2% of variance in similarity metrics, indicating that decompression quality depends on dataset characteristics rather than compression intensity. This training-free approach works with API-based LLMs, directly addressing fundamental deployment constraints -- token limits and API costs -- and enabling cost-effective analysis of large-scale repetitive datasets, even as data patterns evolve over time.

Chinese Translation

上下文学习已成为大型语言模型（LLMs）的一种重要学习范式。在本文中，我们展示了LLMs能够在上下文中学习编码键，并直接对编码表示进行分析。这一发现使得通过字典编码实现无损提示压缩成为可能，而无需对模型进行微调：频繁出现的子序列被紧凑的元标记所替代，当在系统提示中提供压缩字典时，LLMs能够在分析过程中正确解释这些元标记，生成与未压缩输入相当的输出。我们提出了一种压缩算法，该算法在多个长度尺度上识别重复模式，并结合了一个令牌节省优化标准，确保压缩通过防止字典开销超过节省来降低成本。该算法的压缩比可达80%，具体取决于数据集特征。为了验证在压缩下LLM的分析准确性是否得以保持，我们使用解压作为一个代理任务，并提供明确的真实值。在使用Claude 3.7 Sonnet对LogHub 2.0基准进行评估时，基于模板的压缩的精确匹配率超过0.99，而算法压缩的平均Levenshtein相似度得分在60%-80%的压缩比下也超过0.91。此外，压缩比对相似度指标的方差解释不足2%，这表明解压质量依赖于数据集特征而非压缩强度。这种无训练的方法适用于基于API的LLMs，直接解决了基本的部署限制——令牌限制和API成本——并使得对大规模重复数据集的成本效益分析成为可能，即使数据模式随时间演变。
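
The dictionary-encoding idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes whitespace tokens, a greedy search over n-grams of length 2 to 4, and a simplified savings rule in which each replaced n-gram saves n - 1 tokens and each dictionary entry costs n + 1 tokens. The function names and the `<M0>`-style meta-token format are hypothetical.

```python
from collections import Counter


def replace_ngram(tokens, ngram, meta):
    """Replace non-overlapping occurrences of ngram (left to right) with meta."""
    out, i, n, used = [], 0, len(ngram), 0
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == ngram:
            out.append(meta)
            i += n
            used += 1
        else:
            out.append(tokens[i])
            i += 1
    return out, used


def compress(tokens, max_len=4, max_entries=8):
    """Greedily add dictionary entries while each one yields a net token saving."""
    tokens = list(tokens)
    dictionary = {}
    for k in range(max_entries):
        counts = Counter(
            tuple(tokens[i:i + n])
            for n in range(2, max_len + 1)
            for i in range(len(tokens) - n + 1)
        )
        best_gain, best_ngram = 0, None
        for ngram, count in counts.items():
            if count < 2:
                continue
            _, used = replace_ngram(tokens, ngram, None)
            # Assumed cost model: savings in the text minus dictionary overhead.
            gain = used * (len(ngram) - 1) - (len(ngram) + 1)
            if gain > best_gain:
                best_gain, best_ngram = gain, ngram
        if best_ngram is None:
            break  # no candidate saves more than its dictionary entry costs
        meta = f"<M{k}>"  # hypothetical meta-token; assumed absent from the input
        tokens, _ = replace_ngram(tokens, best_ngram, meta)
        dictionary[meta] = best_ngram
    return tokens, dictionary


def decompress(tokens, dictionary):
    """Expand meta-tokens, newest entry first so nested entries resolve."""
    for meta, ngram in reversed(list(dictionary.items())):
        expanded = []
        for tok in tokens:
            expanded.extend(ngram if tok == meta else [tok])
        tokens = expanded
    return tokens
```

Because an entry is added only when its gain is strictly positive, the compressed text plus the dictionary is never longer than the input under this cost model, which mirrors the abstract's requirement that dictionary overhead must not exceed savings.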

cs.CL / 14 / 2604.13068

Before the First Token: Scale-Dependent Emergence of Hallucination Signals in Autoregressive Language Models

首次标记之前：自回归语言模型中幻觉信号的规模依赖性出现

Roy, Dip, Misra, Rajiv, Singh, Sanjay Kumar, Roy, Anisha

Abstract

When do large language models decide to hallucinate? Despite serious consequences in healthcare, law, and finance, few formal answers exist. Recent work shows autoregressive models maintain internal representations distinguishing factual from fictional outputs, but when these representations peak as a function of model scale remains poorly understood. We study the temporal dynamics of hallucination-indicative internal representations across 7 autoregressive transformers (117M--7B parameters) using three fact-based datasets (TriviaQA, Simple Facts, Biography; 552 labeled examples). We identify a scale-dependent phase transition: models below 400M parameters show chance-level probe accuracy at every generation position (AUC = 0.48--0.67), indicating no reliable factuality signal. Above $\sim$1B parameters, a qualitatively different regime emerges where peak detectability occurs at position zero -- before any tokens are generated -- then declines during generation. This pre-generation signal is statistically significant in both Pythia-1.4B (p = 0.012) and Qwen2.5-7B (p = 0.038), spanning distinct architectures and training corpora. At the 7B scale, we observe a striking dissociation: Pythia-6.9B (base model, trained on The Pile) produces a flat temporal profile ($\Delta$ = +0.001, p = 0.989), while instruction-tuned Qwen2.5-7B shows a dominant pre-generation effect. This indicates raw scale alone is insufficient -- knowledge organization through instruction tuning or equivalent post-training is required for pre-commitment encoding. Activation steering along probe-derived directions fails to correct hallucinations across all models, confirming the signal is correlational rather than causal. Our findings provide scale-calibrated detection protocols and a concrete hypothesis on instruction tuning's role in developing knowledge circuits supporting factual generation.

Chinese Translation

大型语言模型何时决定产生幻觉？尽管在医疗、法律和金融等领域后果严重，但对此问题的正式答案寥寥无几。近期研究表明，自回归模型维持内部表征以区分事实与虚构输出，但这些表征在模型规模的函数中何时达到峰值仍然不甚了解。我们研究了7个自回归变换器（117M--7B参数）中幻觉指示性内部表征的时间动态，使用了三个基于事实的数据集（TriviaQA、Simple Facts、Biography；552个标记示例）。我们识别出一种规模依赖的相变：参数少于400M的模型在每个生成位置的探测准确率处于偶然水平（AUC = 0.48--0.67），表明没有可靠的事实信号。在约1B参数以上，出现了一种质的不同的状态，其中峰值可检测性发生在位置零——在生成任何标记之前——然后在生成过程中下降。这一生成前信号在Pythia-1.4B（p = 0.012）和Qwen2.5-7B（p = 0.038）中具有统计显著性，涵盖了不同的架构和训练语料。在7B规模下，我们观察到显著的解离：Pythia-6.9B（基础模型，训练于The Pile）产生了平坦的时间轮廓（$\Delta$ = +0.001, p = 0.989），而经过指令调优的Qwen2.5-7B则显示出主导的生成前效应。这表明仅靠原始规模是不够的——通过指令调优或等效的后训练进行知识组织是生成前编码所必需的。沿着探测导向的激活引导未能纠正所有模型中的幻觉，确认该信号是相关而非因果。我们的发现提供了规模校准的检测协议，并对指令调优在支持事实生成的知识电路发展中的作用提出了具体假设。

cs.CL / 15 / 2604.13070

Curation of a Palaeohispanic Dataset for Machine Learning

古西班牙语数据集的整理用于机器学习

Martínez-Fernández, Gonzalo, Quesada, Jose F, Riscos-Núñez, Agustín, Salguero-Lamillar, Francisco José

Abstract

Palaeohispanic languages are those spoken in the Iberian Peninsula before the arrival of the Romans in the 3rd century B.C. Their study was truly set in motion after Gómez Moreno deciphered the Iberian Levantine script, one of several semi-syllabaries used for these languages. Still, the Palaeohispanic languages have been deciphered to varying degrees, and none is fully understood to this day. Most studies have been carried out from a purely linguistic point of view, and a computational approach may benefit this research area greatly. However, the resources are limited and presented in a format unsuitable for techniques such as Machine Learning. Therefore, a structured dataset is constructed, which will hopefully allow more progress in the field.

Chinese Translation

古西班牙语是指在公元前3世纪罗马人到来之前在伊比利亚半岛上使用的语言。自从戈麦斯·莫雷诺（Gómez Moreno）破译了伊比利亚黎凡特文字（Iberian Levantine script）——这些语言使用的几种半音节文字之一——之后，对其研究真正开始加速。然而，古西班牙语的破译程度各不相同，目前尚无任何一种语言被完全了解。大多数研究都是从纯语言学的角度进行的，而计算方法可能会极大地促进这一研究领域的发展。然而，现有资源有限，且以不适合机器学习等技术的格式呈现。因此，构建了一个结构化的数据集，希望能够在该领域取得更多进展。

cs.CL / 16 / 2604.13071

EVE: A Domain-Specific LLM Framework for Earth Intelligence

EVE：用于地球智能的领域特定大型语言模型框架

Atrio, Àlex R., Lopez, Antonio, Rohit, Jino, Ouahidi, Yassine El, Politi, Marcello, Iyer, Vijayasri, Jamil, Umar, Bratières, Sébastien, Longépé, Nicolas

Abstract

We introduce Earth Virtual Expert (EVE), the first open-source, end-to-end initiative for developing and deploying domain-specialized LLMs for Earth Intelligence. At its core is EVE-Instruct, a domain-adapted 24B model built on Mistral Small 3.2 and optimized for reasoning and question answering. On newly constructed Earth Observation and Earth Sciences benchmarks, it outperforms comparable models while preserving general capabilities. We release curated training corpora and the first systematic domain-specific evaluation benchmarks, covering MCQA, open-ended QA, and factuality. EVE further integrates RAG and a hallucination-detection pipeline into a production system deployed via API and GUI, supporting 350 pilot users so far. All models, datasets, and code are ready to be released under open licenses as contributions to our field at huggingface.co/eve-esa and github.com/eve-esa.

Chinese Translation

我们介绍了地球虚拟专家（EVE），这是首个开源的端到端倡议，旨在开发和部署针对地球智能的领域专业化大型语言模型（LLMs）。其核心是EVE-Instruct，这是一个基于Mistral Small 3.2构建的领域适应性24B模型，经过优化以进行推理和问答。在新构建的地球观测和地球科学基准测试中，它在保持一般能力的同时超越了可比模型。我们发布了经过整理的训练语料库和首个系统化的领域特定评估基准，涵盖了多项选择问答（MCQA）、开放式问答和事实性验证。EVE进一步将检索增强生成（RAG）和幻觉检测管道集成到通过API和图形用户界面（GUI）部署的生产系统中，目前已支持350名试点用户。所有模型、数据集和代码都准备在开放许可证下发布，作为我们领域的贡献，网址为huggingface.co/eve-esa和github.com/eve-esa。

cs.CL / 17 / 2604.13072

LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks

LiveClawBench：在复杂的现实世界助手任务上对LLM代理进行基准测试

Long, Xiang, Du, Li, Xu, Yilong, Liu, Fangcheng, Wang, Haoqing, Ding, Ning, Li, Ziheng, Guo, Jianyuan, Tang, Yehui

Abstract

LLM-based agents are increasingly expected to handle real-world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real-world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple-Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we construct a pilot benchmark with explicit complexity-factor annotations, covering real-world assistant tasks with compositional difficulty. Together, the framework and benchmark provide a principled foundation for evaluating LLM agents in realistic assistant settings, and establish a basis for future expansion across task domains and complexity axes. We are continuing to enrich our case collections to achieve more comprehensive domain and complexity coverage. The project page is at https://github.com/Mosi-AI/LiveClawBench.

Chinese Translation

基于LLM的代理越来越被期望能够处理现实世界的助手任务，然而现有的基准测试通常在孤立的困难源下评估它们，例如单一环境或完全指定的指令。这在当前评估设置与实际部署中出现的组合挑战之间留下了显著的差距。为了解决这一差距，我们引入了LiveClawBench，这是一个用于评估LLM代理在现实世界助手任务上的基准。基于对各种真实OpenClaw使用案例的分析，我们推导出一个三轴复杂性框架，该框架沿三个维度描述任务难度：环境复杂性、认知需求和运行时适应性。在该框架的指导下，我们构建了一个带有明确复杂性因素注释的试点基准，涵盖了具有组合难度的现实世界助手任务。框架和基准共同为在现实助手环境中评估LLM代理提供了一个原则性基础，并为未来在任务领域和复杂性轴上的扩展奠定了基础。我们正在继续丰富我们的案例集合，以实现更全面的领域和复杂性覆盖。项目页面地址为 https://github.com/Mosi-AI/LiveClawBench。

cs.CL / 18 / 2604.13073

OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs

OmniTrace：一个统一的框架用于多模态大语言模型的生成时归因

Yan, Qianqi, Guo, Yichen, Kuo, Ching-Chen, Jiang, Shan, Yin, Hang, Zhao, Yang, Wang, Xin Eric

Abstract

Modern multimodal large language models (MLLMs) generate fluent responses from interleaved text, image, audio, and video inputs. However, identifying which input sources support each generated statement remains an open challenge. Existing attribution methods are primarily designed for classification settings, fixed prediction targets, or single-modality architectures, and do not naturally extend to autoregressive, decoder-only models performing open-ended multimodal generation. We introduce OmniTrace, a lightweight and model-agnostic framework that formalizes attribution as a generation-time tracing problem over the causal decoding process. OmniTrace provides a unified protocol that converts arbitrary token-level signals such as attention weights or gradient-based scores into coherent span-level, cross-modal explanations during decoding. It traces each generated token to multimodal inputs, aggregates signals into semantically meaningful spans, and selects concise supporting sources through confidence-weighted and temporally coherent aggregation, without retraining or supervision. Evaluations on Qwen2.5-Omni and MiniCPM-o-4.5 across visual, audio, and video tasks demonstrate that generation-aware span-level attribution produces more stable and interpretable explanations than naive self-attribution and embedding-based baselines, while remaining robust across multiple underlying attribution signals. Our results suggest that treating attribution as a structured generation-time tracing problem provides a scalable foundation for transparency in omni-modal language models.

Chinese Translation

现代多模态大语言模型（MLLMs）能够从交错的文本、图像、音频和视频输入中生成流畅的响应。然而，识别哪些输入源支持每个生成的陈述仍然是一个开放的挑战。现有的归因方法主要针对分类设置、固定预测目标或单一模态架构设计，无法自然扩展到执行开放式多模态生成的自回归、仅解码器模型。我们提出了OmniTrace，一个轻量级且与模型无关的框架，将归因形式化为因果解码过程中的生成时追踪问题。OmniTrace提供了一个统一的协议，将任意的令牌级信号（如注意力权重或基于梯度的分数）转换为解码过程中的连贯跨度级跨模态解释。它将每个生成的令牌追溯到多模态输入，将信号聚合为语义上有意义的跨度，并通过置信加权和时间一致的聚合选择简洁的支持源，而无需重新训练或监督。在Qwen2.5-Omni和MiniCPM-o-4.5的视觉、音频和视频任务上的评估表明，生成感知的跨度级归因比简单的自归因和基于嵌入的基线产生更稳定和可解释的解释，同时在多种基础归因信号下保持稳健。我们的结果表明，将归因视为一个结构化的生成时追踪问题为多模态语言模型的透明性提供了可扩展的基础。

cs.CL / 19 / 2604.13074

PersonaVLM: Long-Term Personalized Multimodal LLMs

PersonaVLM：长期个性化的多模态大型语言模型

Nie, Chang, Fu, Chaoyou, Zhang, Yifan, Yang, Haihua, Shan, Caifeng

Abstract

Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited. Prior approaches enable only static, single-turn personalization through input augmentation or output alignment, and thus fail to capture users' evolving preferences and personality over time (see Fig.1). In this paper, we introduce PersonaVLM, an innovative personalized multimodal agent framework designed for long-term personalization. It transforms a general-purpose MLLM into a personalized assistant by integrating three key capabilities: (a) Remembering: It proactively extracts and summarizes chronological multimodal memories from interactions, consolidating them into a personalized database. (b) Reasoning: It conducts multi-turn reasoning by retrieving and integrating relevant memories from the database. (c) Response Alignment: It infers the user's evolving personality throughout long-term interactions to ensure outputs remain aligned with their unique characteristics. For evaluation, we establish Persona-MME, a comprehensive benchmark comprising over 2,000 curated interaction cases, designed to assess long-term MLLM personalization across seven key aspects and 14 fine-grained tasks. Extensive experiments validate our method's effectiveness, improving the baseline by 22.4% (Persona-MME) and 9.8% (PERSONAMEM) under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively. Project page: https://PersonaVLM.github.io.

Chinese Translation

多模态大型语言模型（MLLMs）为数百万用户提供日常助手服务。然而，它们生成与个体偏好一致的响应的能力仍然有限。以往的方法仅通过输入增强或输出对齐实现静态的单轮个性化，因此未能捕捉用户随时间变化的偏好和个性（见图1）。在本文中，我们介绍了PersonaVLM，一种旨在实现长期个性化的创新个性化多模态代理框架。它通过整合三项关键能力，将通用的MLLM转变为个性化助手：（a）记忆：它主动提取和总结来自交互的时间顺序多模态记忆，并将其整合到个性化数据库中。（b）推理：它通过从数据库中检索和整合相关记忆进行多轮推理。（c）响应对齐：它在长期交互中推断用户不断变化的个性，以确保输出与其独特特征保持一致。为了评估，我们建立了Persona-MME，这是一个综合基准，包含超过2000个精心策划的交互案例，旨在评估七个关键方面和14个细分任务中的长期MLLM个性化。大量实验验证了我们方法的有效性，在128k上下文下，基线提升了22.4%（Persona-MME）和9.8%（PERSONAMEM），同时分别超越了GPT-4o 5.2%和2.0%。项目页面：https://PersonaVLM.github.io。

cs.CL / 20 / 2604.13075

DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs

DeEscalWild：用于自动化降级训练的真实世界基准数据集与小型语言模型

Hasan, Md Hasebul, Charu, Krity Haque, Sridhar, Eshwara Prasad, Deb, Shuchisnigdha, Islam, Mohammad A.

Abstract

Effective de-escalation is critical for law enforcement safety and community trust, yet traditional training methods lack scalability and realism. While Large Language Models (LLMs) enable dynamic, open-ended simulations, their substantial computational footprint renders them impractical for deployment on the lightweight, portable hardware required for immersive field training. Small Language Models (SLMs) offer a viable real-time alternative but suffer from a critical scarcity of high-quality, domain-specific training data. To bridge this gap, we present DeEscalWild, a novel benchmark dataset curated from a multi-stage pipeline of in-the-wild police-civilian interactions extracted from open-source video repositories. Starting with 5,000 raw inputs, we employed a rigorous hybrid filtering process - combining human-in-the-loop verification with LLM-as-a-Judge evaluation - to distill 1,500 high-fidelity scenarios. The resulting corpus comprises 285,887 dialogue turns, totaling approximately 4.7 million tokens. Extensive experiments demonstrate that SLMs fine-tuned on this data significantly outperform their base counterparts across ROUGE-L, BLEU-4, METEOR, and BERTScore metrics. Notably, our fine-tuned Qwen 2.5 (3B-Instruct) surpasses the general-purpose Gemini 2.5 Flash model, demonstrating that domain-optimized SLMs can achieve superior performance with a fraction of the computational cost. This work establishes the foundational infrastructure for accessible, low-latency, and privacy-preserving officer training systems at the edge.

Chinese Translation

有效的降级对于执法安全和社区信任至关重要，但传统的培训方法缺乏可扩展性和现实性。虽然大型语言模型（LLMs）能够实现动态、开放式的模拟，但其巨大的计算资源消耗使其在需要沉浸式现场训练的轻便便携硬件上难以部署。小型语言模型（SLMs）提供了一种可行的实时替代方案，但在高质量、特定领域的训练数据方面存在严重短缺。为了解决这一问题，我们提出了DeEscalWild，这是一个从开放源代码视频库中提取的真实警民互动的多阶段管道策划的新基准数据集。我们从5000个原始输入开始，采用严格的混合过滤过程——结合人工验证与LLM作为评判者的评估——提炼出1500个高保真场景。最终的数据集包含285,887个对话轮次，总计约470万个标记。广泛的实验表明，基于该数据集微调的SLMs在ROUGE-L、BLEU-4、METEOR和BERTScore等指标上显著优于其基础模型。值得注意的是，我们微调后的Qwen 2.5（3B-Instruct）超越了通用的Gemini 2.5 Flash模型，证明了领域优化的SLMs可以以较低的计算成本实现更优的性能。本研究为可访问、低延迟和保护隐私的边缘执法培训系统奠定了基础设施。

cs.CL / 21 / 2604.13076

Document-tuning for robust alignment to animals

针对动物的稳健对齐的文档调优

Brazilek, Jasmine, Tidmarsh, Miles

Abstract

We investigate the robustness of value alignment via finetuning with synthetic documents, using animal compassion as a value that is both important in its own right and orthogonal to existing alignment efforts. To evaluate compassionate reasoning, we develop and publicly release the Animal Harm Benchmark (AHB), a 26-question evaluation spanning 13 ethical dimensions, available as both a dataset and an Inspect evaluation. On the AHB, training with 3000 documents achieves 77% compared to 40% for instruction-tuning approaches, with generalization to human compassion and no degradation in standard safety benchmarks or capabilities. However, subsequent unrelated instruction-tuning degrades the intervention, with the advantage disappearing after 5000 samples. Our exploratory results suggest document-based value interventions may require explicit preservation strategies to remain effective through typical training pipelines.

Chinese Translation

我们研究了通过使用合成文档进行微调的价值对齐的稳健性，以动物同情作为一个既重要又与现有对齐努力正交的价值。为了评估同情推理，我们开发并公开发布了动物伤害基准（Animal Harm Benchmark, AHB），这是一个涵盖13个伦理维度的26个问题评估，作为数据集和Inspect评估公开可用。在AHB上，使用3000个文档进行训练的准确率为77%，而指令调优方法的准确率为40%，并且在对人类同情的泛化上没有在标准安全基准或能力上出现退化。然而，随后的无关指令调优会降低干预效果，优势在5000个样本后消失。我们的探索性结果表明，基于文档的价值干预可能需要明确的保留策略，以在典型训练流程中保持有效性。

cs.CL / 22 / 2604.13077

Can Large Language Models Reliably Extract Physiology Index Values from Coronary Angiography Reports?

大型语言模型能否可靠地从冠状动脉造影报告中提取生理指标值？

Morgado, Sofia, Valdeira, Filipa, Sander, Niklas, Ferreira, Diogo, Vilela, Marta, Menezes, Miguel, Soares, Cláudia

Abstract

Coronary angiography (CAG) reports contain clinically relevant physiological measurements, yet this information is typically recorded as unstructured natural language, limiting its use in research. We investigate the use of Large Language Models (LLMs) to automatically extract these values, along with their anatomical locations, from Portuguese CAG reports. To our knowledge, this study is the first to address physiology index extraction from a large corpus of CAG reports (1342 reports), and one of the few focusing on CAG or Portuguese clinical text. We explore local, privacy-preserving general-purpose and medical LLMs under different settings. Prompting strategies included zero-shot, few-shot, and few-shot prompting with implausible examples. In addition, we apply constrained generation and introduce a post-processing step based on RegEx. Given the sparsity of measurements, we propose a multi-stage evaluation framework separating format validity, value detection, and value correctness, while accounting for asymmetric clinical error costs. This study demonstrates the potential of LLMs for extracting physiological indices from Portuguese CAG reports. Non-medical models performed similarly; the best results were obtained with Llama under zero-shot prompting, while GPT-OSS demonstrated the highest robustness to changes in the prompts. While MedGemma achieved results similar to those of non-medical models, MedLlama's outputs were out-of-format in the unconstrained setting and performed significantly worse in the constrained one. Changing the prompting technique and adding a RegEx layer showed no significant improvement across models, while constrained generation decreased performance, although it has the benefit of allowing the use of specific models that cannot otherwise conform to the templates.

Chinese Translation

冠状动脉造影（CAG）报告包含临床相关的生理测量值，但这些信息通常以非结构化自然语言的形式存在，限制了其在研究中的应用。我们研究了使用大型语言模型（LLMs）自动提取这些值及其解剖位置的方法，针对葡萄牙语的CAG报告。根据我们的了解，本研究是首个针对大规模（1342份报告）CAG报告进行生理指标提取的研究，也是为数不多关注CAG或葡萄牙临床文本的研究之一。我们在不同设置下探索了本地隐私保护的通用和医学LLMs。提示策略包括零样本、少样本和带有不合理示例的少样本提示。此外，我们应用了受限生成，并引入了基于正则表达式（RegEx）的后处理步骤。鉴于测量值的稀疏性，我们提出了一个多阶段评估框架，分别评估格式有效性、值检测和值正确性，同时考虑了不对称的临床错误成本。本研究展示了LLMs在从葡萄牙CAG报告中提取生理指标方面的潜力。非医学模型表现相似，最佳结果来自于使用零样本提示的Llama，而GPT-OSS在提示变化方面表现出最高的鲁棒性。虽然MedGemma的结果与非医学模型相似，但在无约束设置下，MedLlama的结果不符合格式，并且在受限设置下表现显著较低。提示技术的变化和添加RegEx层在各模型间未显示出显著改善，而使用受限生成则降低了性能，尽管允许使用无法符合模板的特定模型。
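
As an illustration of the kind of RegEx-based post-processing step mentioned in this abstract, the sketch below pulls index names and values out of free text. The abstract does not give the authors' patterns, so everything here is an assumption: the choice of FFR and iFR as example indices, the 20-character window between name and value, and the handling of a decimal comma.

```python
import re

# Hypothetical pattern: an index name, up to 20 non-digit characters,
# then a value such as 0.82 or 0,82 (decimal comma, common in Portuguese text).
INDEX_PATTERN = re.compile(r"\b(FFR|iFR)\b\D{0,20}?(\d[.,]\d{1,2})")


def extract_indices(text):
    """Return (index_name, value) pairs found in a free-text fragment."""
    results = []
    for name, raw_value in INDEX_PATTERN.findall(text):
        results.append((name, float(raw_value.replace(",", "."))))
    return results
```

A pattern like this only checks surface form. It does not recover the anatomical location or judge clinical plausibility, which is consistent with the abstract's multi-stage evaluation that scores format validity, detection, and correctness separately.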

cs.CL / 23 / 2604.13078

IWLV-Ramayana: A Sarga-Aligned Parallel Corpus of Valmiki's Ramayana Across Indian Languages

IWLV-拉玛亚那：跨印度语言的瓦尔米基《拉玛亚那》章节对齐平行语料库

VP, Sumesh

Abstract

The Ramayana is among the most influential literary traditions of South and Southeast Asia, transmitted across numerous linguistic and cultural contexts over two millennia. Despite extensive scholarship on regional Ramayana traditions, computational resources enabling systematic cross-linguistic analysis remain limited. This paper introduces the IWLV Ramayana Corpus, a structured parallel corpus aligning Valmiki's Ramayana across multiple Indian languages at the level of the sarga (chapter). The corpus currently includes complete English and Malayalam layers, with Hindi, Tamil, Kannada, and Telugu layers in active production. The dataset is distributed in structured JSONL format with explicit provenance metadata, enabling applications in comparative literature, corpus linguistics, digital humanities, and multilingual natural language processing. To our knowledge, this is the first sarga-aligned multilingual parallel corpus of the Valmiki Ramayana with explicit provenance metadata and machine-readable format.

Chinese Translation

《拉玛亚那》是南亚和东南亚最具影响力的文学传统之一，跨越两个千年在众多语言和文化背景中传播。尽管关于地区《拉玛亚那》传统的研究广泛，但能够支持系统性跨语言分析的计算资源仍然有限。本文介绍了IWLV拉玛亚那语料库，这是一个结构化的平行语料库，按章节（sarga）对齐瓦尔米基的《拉玛亚那》，涵盖多种印度语言。该语料库目前包括完整的英语和马拉雅拉姆语层，印地语、泰米尔语、卡纳达语和泰卢固语层正在积极制作中。数据集以结构化的JSONL格式分发，包含明确的来源元数据，支持比较文学、语料库语言学、数字人文学科和多语言自然语言处理等应用。据我们所知，这是第一个具有明确来源元数据和机器可读格式的瓦尔米基《拉玛亚那》章节对齐多语言平行语料库。
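
To show how a sarga-aligned JSONL corpus of this kind might be consumed, here is a minimal sketch that groups records by book (kanda) and chapter (sarga). The field names `kanda`, `sarga`, `lang`, `text`, and `source` are hypothetical, as is the sample content, which uses `...` placeholders; the actual schema should be taken from the corpus documentation.

```python
import json
from collections import defaultdict

# Placeholder records with assumed field names; not taken from the real corpus.
SAMPLE_JSONL = """\
{"kanda": 1, "sarga": 1, "lang": "en", "text": "...", "source": "..."}
{"kanda": 1, "sarga": 1, "lang": "ml", "text": "...", "source": "..."}
{"kanda": 1, "sarga": 2, "lang": "en", "text": "...", "source": "..."}
"""


def align_by_sarga(jsonl_text):
    """Group one-record-per-line JSON by (kanda, sarga), keyed by language."""
    aligned = defaultdict(dict)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        aligned[(record["kanda"], record["sarga"])][record["lang"]] = record
    return dict(aligned)
```

Keying on the (kanda, sarga) pair makes it easy to see which language layers are complete for a given chapter and which are still missing.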

cs.CL / 24 / 2604.13197

Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization

释放隐式奖励：用于分布级优化的前缀值学习

Gao, Shiping, Chen, Hongzhan, Quan, Xiaojun, Wang, Qifan, Huang, Lifu

Abstract

Process reward models (PRMs) provide fine-grained reward signals along the reasoning process, but training reliable PRMs often requires step annotations or heavy verification pipelines, making them expensive to scale and refresh during online RL. Implicit PRMs mitigate this cost by learning decomposable token- or step-level rewards from trajectory-level outcome labels. However, they suffer from a train-inference mismatch: training only constrains a sequence-level aggregate, whereas inference requires token-level scores to reflect local step quality. As a result, token-level credits are weakly identified and may fail to faithfully reflect which reasoning steps are actually correct. This unreliability undermines a key promise of implicit PRMs: scoring many candidate tokens. In practice, noisy per-token advantages may systematically reinforce incorrect continuations. We address this problem with a novel Implicit Prefix-Value Reward Model (IPVRM), which directly learns a prefix-conditioned value function estimating the probability of eventual correctness, and derives step signals via temporal-difference (TD) differences. IPVRM substantially improves step-verification F1 on ProcessBench. Building on these calibrated prefix values, we further propose Distribution-Level RL (DistRL), which computes TD advantages for both sampled tokens and high-probability candidate tokens, enabling dense counterfactual updates without additional rollouts. While DistRL offers limited gains when powered by miscalibrated implicit rewards, it consistently improves downstream reasoning once paired with IPVRM.

Chinese Translation

过程奖励模型（PRMs）在推理过程中提供细粒度的奖励信号，但训练可靠的PRMs通常需要步骤注释或复杂的验证流程，使其在在线强化学习中扩展和更新的成本高昂。隐式PRMs通过从轨迹级结果标签中学习可分解的令牌或步骤级奖励来减轻这一成本。然而，它们面临训练与推理之间的不匹配：训练仅限制序列级的聚合，而推理则需要令牌级的分数来反映局部步骤的质量。因此，令牌级的信用被弱识别，可能无法忠实反映哪些推理步骤实际上是正确的。这种不可靠性削弱了隐式PRMs的一个关键承诺：为许多候选令牌评分。在实践中，噪声的每个令牌优势可能系统性地强化不正确的延续。我们通过一种新颖的隐式前缀值奖励模型（IPVRM）来解决这个问题，该模型直接学习一个前缀条件的值函数，以估计最终正确性的概率，并通过时间差（TD）差异推导步骤信号。IPVRM在ProcessBench上显著提高了步骤验证的F1分数。在这些校准的前缀值基础上，我们进一步提出了分布级强化学习（DistRL），该方法为采样的令牌和高概率候选令牌计算TD优势，从而实现密集的反事实更新，而无需额外的回滚。尽管当由误校准的隐式奖励驱动时，DistRL提供的增益有限，但一旦与IPVRM配对，它在下游推理中始终能够改善表现。
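
The step signal described in this abstract, a temporal-difference over prefix values, can be sketched in a few lines. This assumes the simplest form, with no discounting and no intermediate reward, so that the signal for step t is V(prefix up to t+1) minus V(prefix up to t); the abstract does not specify the exact formulation, and the function name is hypothetical.

```python
def td_step_signals(prefix_values):
    """Per-step signal as the change in estimated probability of eventual correctness.

    prefix_values[t] is assumed to be the value estimate after t reasoning steps,
    so the result has one entry per step (len(prefix_values) - 1 in total).
    """
    return [nxt - cur for cur, nxt in zip(prefix_values, prefix_values[1:])]
```

Under this reading, a step that raises the estimated chance of a correct final answer gets a positive signal and a step that lowers it gets a negative one, which is why calibration of the prefix values matters for the downstream updates.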

cs.CL / 25 / 2604.13201

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

InfiniteScienceGym：一个无界限的程序生成科学分析基准

Bentham, Oliver, Srikumar, Vivek

Abstract

Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks derived from published studies and human annotations inherit publication bias, known-knowledge bias, label noise, and substantial storage requirements. We present InfiniteScienceGym, a procedurally generated benchmark of scientific repositories paired with a verifiable question-answering task. From a seed, the simulator deterministically generates a self-contained repository with realistic directory structure, files, and tabular data, and a privileged QA generator produces both answerable and unanswerable questions with exact ground truth. This makes it possible to evaluate evidence-grounded reasoning, abstention, and tool-mediated analysis in a controlled setting without distributing a large static corpus. InfiniteScienceGym complements real scientific benchmarks by targeting blind spots and failure modes that are hard to evaluate using published datasets alone. Evaluating both proprietary and open-weight models, we find that none achieve more than 45% accuracy overall, that recognizing unanswerable questions remains a major weakness, and that stronger models tend to use tools more effectively rather than simply consuming more tokens.

Chinese Translation

大型语言模型正逐渐成为科学助手，但评估它们从实证数据中推理的能力仍然具有挑战性。基于已发表研究和人工注释的基准继承了出版偏见、已知知识偏见、标签噪声以及巨大的存储需求。我们提出了InfiniteScienceGym，这是一个程序生成的科学库基准，配备可验证的问题回答任务。从一个种子开始，模拟器确定性地生成一个自包含的库，具有现实的目录结构、文件和表格数据，同时一个特权的问答生成器产生可回答和不可回答的问题，并提供确切的真实答案。这使得在受控环境中评估基于证据的推理、回避和工具介导的分析成为可能，而无需分发大量静态语料库。InfiniteScienceGym通过针对仅使用已发布数据集难以评估的盲点和失败模式，补充了真实的科学基准。评估专有模型和开放权重模型时，我们发现没有一个模型的整体准确率超过45%，识别不可回答的问题仍然是一个主要弱点，而更强的模型往往更有效地使用工具，而不仅仅是消耗更多的标记。
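
The core property described in this abstract, deterministic generation from a seed with exact ground truth, can be illustrated with a toy sketch. It is far simpler than a full repository simulator, and the column names, question wording, and function name are all hypothetical.

```python
import random


def generate_task(seed, n_rows=5):
    """Build a tiny table and two questions whose answers are known exactly."""
    rng = random.Random(seed)  # private generator: output depends only on the seed
    rows = [
        {"sample_id": i, "measurement": round(rng.uniform(0.0, 10.0), 2)}
        for i in range(n_rows)
    ]
    answerable = {
        "question": "What is the maximum value in the 'measurement' column?",
        "answer": max(row["measurement"] for row in rows),
    }
    # The table has no 'temperature' column, so the correct response is to abstain.
    unanswerable = {
        "question": "What is the mean of the 'temperature' column?",
        "answer": None,
    }
    return rows, [answerable, unanswerable]
```

Because the generator sees the data it creates, the ground truth needs no human annotation, and only the seed has to be shared to reproduce a task.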

cs.CL / 26 / 2604.13232

Evaluating the Evaluator: Problems with SemEval-2020 Task 1 for Lexical Semantic Change Detection

评估评估者：SemEval-2020任务1在词汇语义变化检测中的问题

Phan-Tat, Bach, Heylen, Kris, Geeraerts, Dirk, De Pascale, Stefano, Speelmana, Dirk

Abstract

This discussion paper re-examines SemEval-2020 Task 1, the most influential shared benchmark for lexical semantic change detection, through a three-part evaluative framework: operationalisation, data quality, and benchmark design. First, at the level of operationalisation, we argue that the benchmark models semantic change mainly as gain, loss, or redistribution of discrete senses. While practical for annotation and evaluation, this framing is too narrow to capture gradual, constructional, collocational, and discourse-level change. Also, the gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, which could potentially limit the validity of the task. Second, at the level of data quality, we show that the benchmark is affected by substantial corpus and preprocessing problems, including OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets. These issues can distort model behaviour, complicate linguistic analysis, and reduce reproducibility. Third, at the level of benchmark design, we argue that the small curated target sets and limited language coverage reduce realism and increase statistical uncertainty. Taken together, these limitations suggest that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress. We therefore call for future datasets and shared tasks to adopt broader theories of semantic change, document pre-processing transparently, expand cross-linguistic coverage, and use more realistic evaluation settings. Such steps are necessary for more valid, interpretable, and generalisable progress in lexical semantic change detection.

Chinese Translation

本文讨论性论文重新审视了SemEval-2020任务1，这是词汇语义变化检测领域最具影响力的共享基准，采用了一个三部分的评估框架：操作化、数据质量和基准设计。首先，在操作化层面，我们认为该基准主要将语义变化建模为离散意义的获得、丧失或重新分配。虽然这种框架在注释和评估中是实用的，但过于狭隘，无法捕捉渐进的、构式的、搭配的和话语层面的变化。此外，金标准标签是注释决策、聚类程序和阈值设置的结果，这可能会限制任务的有效性。其次，在数据质量层面，我们展示了该基准受到显著的语料库和预处理问题的影响，包括OCR噪声、格式错误的字符、截断的句子、不一致的词形还原、词性标注错误和遗漏的目标。这些问题可能扭曲模型行为，复杂化语言分析，并降低可重复性。第三，在基准设计层面，我们认为小规模的策划目标集和有限的语言覆盖降低了现实性并增加了统计不确定性。综合来看，这些局限性表明，该基准应被视为一个有用但部分的测试平台，而非进展的决定性衡量标准。因此，我们呼吁未来的数据集和共享任务采用更广泛的语义变化理论，透明地记录文档预处理，扩大跨语言覆盖，并使用更现实的评估设置。这些步骤对于在词汇语义变化检测领域实现更有效、可解释和可推广的进展是必要的。

cs.CL / 27 / 2604.13258

Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs

海森增强的标记归因 (HETA)：自回归语言模型的解释

Pramanik, Vishal, Maliha, Maisha, Bastian, Nathaniel D., Jha, Sumit Kumar

Abstract

Attribution methods seek to explain language model predictions by quantifying the contribution of input tokens to generated outputs. However, most existing techniques are designed for encoder-based architectures and rely on linear approximations that fail to capture the causal and semantic complexities of autoregressive generation in decoder-only models. To address these limitations, we propose Hessian-Enhanced Token Attribution (HETA), a novel attribution framework tailored for decoder-only language models. HETA combines three complementary components: a semantic transition vector that captures token-to-token influence across layers, Hessian-based sensitivity scores that model second-order effects, and KL divergence to measure information loss when tokens are masked. This unified design produces context-aware, causally faithful, and semantically grounded attributions. Additionally, we introduce a curated benchmark dataset for systematically evaluating attribution quality in generative settings. Empirical evaluations across multiple models and datasets demonstrate that HETA consistently outperforms existing methods in attribution faithfulness and alignment with human annotations, establishing a new standard for interpretability in autoregressive language models.

Chinese Translation

归因方法旨在通过量化输入标记对生成输出的贡献来解释语言模型的预测。然而，大多数现有技术是为基于编码器的架构设计的，并依赖于线性近似，这无法捕捉到仅解码器模型中自回归生成的因果和语义复杂性。为了解决这些局限性，我们提出了海森增强的标记归因 (HETA)，这是一个专为仅解码器语言模型量身定制的新型归因框架。HETA结合了三个互补组件：一个捕捉层间标记影响的语义过渡向量、建模二阶效应的海森矩阵敏感性分数，以及用于测量标记被掩蔽时信息损失的KL散度。这种统一的设计产生了上下文感知、因果真实和语义扎实的归因。此外，我们引入了一个精心策划的基准数据集，以系统地评估生成设置中的归因质量。在多个模型和数据集上的实证评估表明，HETA在归因的真实性和与人类注释的一致性方面始终优于现有方法，为自回归语言模型的可解释性建立了新的标准。

cs.CL / 28 / 2604.13275

Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size

规模的利与弊：上下文适应性如何随模型规模而异

Kukreja, Dikshant, Sah, Kshitij, Gupta, Gautam, Anand, Avinash, Shah, Rajiv Ratn, Wang, Zhengkui, Ng, Aik Beng, Cambria, Erik

Abstract

Larger language models become simultaneously better and worse at handling contextual information -- better at ignoring false claims, worse at ignoring irrelevant tokens. We formalize this apparent paradox through the first scaling laws for contextual entrainment, the tendency of models to favor tokens that appeared in context regardless of relevance. Analyzing the Cerebras-GPT (111M-13B) and Pythia (410M-12B) model families, we find entrainment follows predictable power-law scaling, but with opposite trends depending on context type: semantic contexts show decreasing entrainment with scale, while non-semantic contexts show increasing entrainment. Concretely, the largest models are four times more resistant to counterfactual misinformation than the smallest, yet simultaneously twice as prone to copying arbitrary tokens. These diverging trends, which replicate across model families, suggest that semantic filtering and mechanical copying are functionally distinct behaviors that scale in opposition -- scaling alone does not resolve context sensitivity, it reshapes it.

Chinese Translation

较大的语言模型在处理上下文信息时变得既更好又更差——在忽略虚假声明方面表现更好，但在忽略无关标记方面表现更差。我们通过上下文适应性的首个规模定律来形式化这一明显的悖论，即模型倾向于优先考虑在上下文中出现的标记，无论其相关性如何。通过分析Cerebras-GPT（111M-13B）和Pythia（410M-12B）模型系列，我们发现适应性遵循可预测的幂律规模，但根据上下文类型呈现相反的趋势：语义上下文的适应性随着规模的增加而减少，而非语义上下文的适应性则随着规模的增加而增加。具体而言，最大的模型在抵御反事实虚假信息方面比最小的模型强四倍，但同时在复制任意标记方面的倾向却是其两倍。这些相互背离的趋势在不同模型系列中得到了重复，表明语义过滤和机械复制是功能上不同的行为，并且在相反的方向上进行扩展——仅仅扩展规模并不能解决上下文敏感性问题，而是重塑了它。
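
A power-law relationship of the form y = a * N^b, as this abstract reports for entrainment versus model size, is commonly estimated by a straight-line fit in log-log space. The sketch below shows that generic procedure only; it is not the authors' fitting code, and the function name is hypothetical.

```python
import math


def fit_power_law(sizes, values):
    """Least-squares fit of values ~ a * sizes**b via linear regression on logs.

    Returns (a, b). All inputs must be positive.
    """
    log_x = [math.log(x) for x in sizes]
    log_y = [math.log(y) for y in values]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y)) / sum(
        (x - mean_x) ** 2 for x in log_x
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope
```

The sign of the fitted exponent b is what distinguishes the two regimes in the abstract: a negative b means the measured quantity shrinks with scale, a positive b means it grows.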

cs.CL / 29 / 2604.13285

L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification

L2D-Clinical：在临床文本分类中学习延迟以实现自适应模型选择

Kondadadi, Rishik, Ortega, John E.

Abstract

Clinical text classification requires choosing between specialized fine-tuned models (BERT variants) and general-purpose large language models (LLMs), yet neither dominates across all instances. We introduce Learning to Defer for clinical text (L2D-Clinical), a framework that learns when a BERT classifier should defer to an LLM based on uncertainty signals and text characteristics. Unlike prior L2D work that defers to human experts assumed universally superior, our approach enables adaptive deferral, improving accuracy when the LLM complements BERT. We evaluate on two English clinical tasks: (1) ADE detection (ADE Corpus V2), where BioBERT (F1=0.911) outperforms the LLM (F1=0.765), and (2) treatment outcome classification (MIMIC-IV with multi-LLM consensus ground truth), where GPT-5-nano (F1=0.967) outperforms ClinicalBERT (F1=0.887). On ADE, L2D-Clinical achieves F1=0.928 (+1.7 points over BERT) by selectively deferring 7% of instances where the LLM's high recall compensates for BERT's misses. On MIMIC, L2D-Clinical achieves F1=0.980 (+9.3 points over BERT) by deferring only 16.8% of cases to the LLM. The key insight is that L2D-Clinical learns to selectively leverage LLM strengths while minimizing API costs.
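The deferral idea can be sketched as a simple routing rule. This is a simplification under stated assumptions, not the paper's method: the abstract describes a learned deferral policy over uncertainty signals and text characteristics, whereas the sketch below uses a fixed entropy threshold. The names `classify_with_deferral`, `stub_small_model`, `stub_llm`, and the `max_entropy` value are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def classify_with_deferral(text, small_model, llm, max_entropy=0.5):
    """Return (label, deferred). Defer to the LLM when the small model is unsure.

    The fixed threshold is an assumption standing in for a learned deferral policy.
    """
    probs = small_model(text)
    if entropy(probs) > max_entropy:
        return llm(text), True
    label = max(range(len(probs)), key=lambda i: probs[i])
    return label, False

# Stub models for demonstration only; real classifiers would replace these.
def stub_small_model(text):
    return [0.5, 0.5] if "unclear" in text else [0.95, 0.05]

def stub_llm(text):
    return 1

confident = classify_with_deferral("patient tolerated drug well", stub_small_model, stub_llm)
uncertain = classify_with_deferral("unclear reaction after dose", stub_small_model, stub_llm)
print(confident, uncertain)
```

Only the uncertain input triggers an LLM call, which is how a deferral scheme of this kind can limit API cost.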

Chinese Translation

临床文本分类需要在专业的微调模型（如BERT变体）和通用的大型语言模型（LLMs）之间进行选择，但在所有实例中均无一方占据绝对优势。我们提出了针对临床文本的学习延迟框架（L2D-Clinical），该框架基于不确定性信号和文本特征学习何时让BERT分类器延迟使用LLM。与之前假设人类专家普遍优越的L2D工作不同，我们的方法实现了自适应延迟——在LLM能够补充BERT时提高准确性。我们在两个英语临床任务上进行了评估：（1）不良药物事件（ADE）检测（ADE Corpus V2），其中BioBERT（F1=0.911）优于LLM（F1=0.765）；（2）治疗结果分类（MIMIC-IV与多LLM共识真实值），其中GPT-5-nano（F1=0.967）优于ClinicalBERT（F1=0.887）。在ADE任务中，L2D-Clinical通过选择性延迟7%的实例，取得了F1=0.928（比BERT提高1.7分），此时LLM的高召回率弥补了BERT的遗漏。在MIMIC任务中，L2D-Clinical仅将16.8%的案例延迟给LLM，取得了F1=0.980（比BERT提高9.3分）。关键见解在于，L2D-Clinical学习在最小化API成本的同时，选择性地利用LLM的优势。

cs.CL / 30 / 2604.13286

English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training

英语并非唯一需求：系统性探讨多语言性在大语言模型后训练中的作用

Dhaliwal, Mehak, Chaurasia, Shashwat, Qin, Yao, Hong, Dezhi, Butler, Thomas

Abstract

Despite the widespread multilingual deployment of large language models, post-training pipelines remain predominantly English-centric, contributing to performance disparities across languages. We present a systematic, controlled study of the interplay between training language coverage, model scale, and task domain, based on 220 supervised fine-tuning runs on parallel translated multilingual data mixtures spanning mathematical reasoning and API calling tasks, with models up to 8B parameters. We find that increasing language coverage during post-training is largely beneficial across tasks and model scales, with low-resource languages benefiting the most and high-resource languages plateauing rather than degrading. Even minimal multilinguality helps: incorporating a single non-English language improves both English performance and cross-lingual generalization, making English-only post-training largely suboptimal. Moreover, at sufficient language diversity, zero-shot cross-lingual transfer can match or exceed the effects of direct language inclusion in a low-diversity setting, although gains remain limited for typologically distant, low-resource languages.

Chinese Translation

尽管大型语言模型在多语言环境中的广泛应用，后训练流程仍然主要以英语为中心，导致不同语言间的性能差异。我们基于220次监督微调实验，系统性地研究了训练语言覆盖、模型规模和任务领域之间的相互作用，这些实验涉及跨数学推理和API调用任务的平行翻译多语言数据混合，模型规模高达8B参数。我们的研究发现，在后训练过程中增加语言覆盖在各任务和模型规模中普遍是有益的，低资源语言受益最大，而高资源语言则趋于平稳而非下降。即使是最小的多语言性也有帮助：加入一种非英语语言可以改善英语性能和跨语言泛化，使得仅以英语进行后训练在很大程度上显得次优。此外，在足够的语言多样性下，零-shot跨语言迁移的效果可以与在低多样性设置中直接包含语言的效果相匹配或超越，尽管对于类型学上相距较远的低资源语言，增益仍然有限。

cs.CL / 31 / 2604.13288

Giving Voice to the Constitution: Low-Resource Text-to-Speech for Quechua and Spanish Using a Bilingual Legal Corpus

赋予宪法声音：基于双语法律语料库的低资源文本转语音技术在凯丘亚语和西班牙语中的应用

Ortega, John E., Zevallos, Rodolfo, Carraro, Fabricio

Abstract

We present a unified pipeline for synthesizing high-quality Quechua and Spanish speech for the Peruvian Constitution using three state-of-the-art text-to-speech (TTS) architectures: XTTS v2, F5-TTS, and DiFlow-TTS. Our models are trained on independent Spanish and Quechua speech datasets with heterogeneous sizes and recording conditions, and leverage bilingual and multilingual TTS capabilities to improve synthesis quality in both languages. By exploiting cross-lingual transfer, our framework mitigates data scarcity in Quechua while preserving naturalness in Spanish. We release trained checkpoints, inference code, and synthesized audio for each constitutional article, providing a reusable resource for speech technologies in indigenous and multilingual contexts. This work contributes to the development of inclusive TTS systems for political and legal content in low-resource settings.

Chinese Translation

我们提出了一种统一的流程，用于合成高质量的凯丘亚语和西班牙语语音，以服务于秘鲁宪法，采用三种最先进的文本转语音（TTS）架构：XTTS v2、F5-TTS 和 DiFlow-TTS。我们的模型在独立的西班牙语和凯丘亚语语音数据集上进行训练，这些数据集具有不同的规模和录音条件，并利用双语和多语种 TTS 能力来提高两种语言的合成质量。通过利用跨语言迁移，我们的框架在保持西班牙语自然性的同时，缓解了凯丘亚语的数据稀缺问题。我们发布了训练好的检查点、推理代码以及每个宪法条款的合成音频，为土著和多语言环境中的语音技术提供了可重用的资源。这项工作为在低资源环境中开发包容性的 TTS 系统以处理政治和法律内容做出了贡献。

cs.CL / 32 / 2604.13346

AgentSPEX: An Agent SPecification and EXecution Language

AgentSPEX：一种代理规范与执行语言

Wang, Pengcheng, Huang, Jerry, Yao, Jiarui, Pan, Rui, Niu, Peizhi, Liu, Yaowenqi, Wang, Ruida, Lu, Renhao, Guo, Yuwei, Zhang, Tong

Abstract

Language-model agent systems commonly rely on reactive prompting, in which a single instruction guides the model through an open-ended sequence of reasoning and tool-use steps, leaving control flow and intermediate state implicit and making agent behavior potentially difficult to control. Orchestration frameworks such as LangGraph, DSPy, and CrewAI impose greater structure through explicit workflow definitions, but tightly couple workflow logic with Python, making agents difficult to maintain and modify. In this paper, we introduce AgentSPEX, an Agent SPecification and EXecution Language for specifying LLM-agent workflows with explicit control flow and modular structure, along with a customizable agent harness. AgentSPEX supports typed steps, branching and loops, parallel execution, reusable submodules, and explicit state management, and these workflows execute within an agent harness that provides tool access, a sandboxed virtual environment, and support for checkpointing, verification, and logging. Furthermore, we provide a visual editor with synchronized graph and workflow views for authoring and inspection. We include ready-to-use agents for deep research and scientific research, and we evaluate AgentSPEX on 7 benchmarks. Finally, we show through a user study that AgentSPEX provides a more interpretable and accessible workflow-authoring paradigm than a popular existing agent framework.

Chinese Translation

语言模型代理系统通常依赖于反应性提示，其中单个指令引导模型通过开放式的推理和工具使用步骤，控制流和中间状态隐含，使得代理行为可能难以控制。诸如 LangGraph、DSPy 和 CrewAI 等编排框架通过明确的工作流定义施加了更大的结构，但将工作流逻辑与 Python 紧密耦合，使得代理难以维护和修改。在本文中，我们介绍了 AgentSPEX，一种用于指定 LLM-代理工作流的代理规范与执行语言，具有明确的控制流和模块化结构，以及可定制的代理工具。AgentSPEX 支持类型化步骤、分支和循环、并行执行、可重用子模块和明确的状态管理，这些工作流在提供工具访问、沙箱虚拟环境以及支持检查点、验证和日志记录的代理工具中执行。此外，我们提供了一个具有同步图形和工作流视图的可视化编辑器，用于创作和检查。我们包括了可直接使用的代理，适用于深度研究和科学研究，并在 7 个基准上评估了 AgentSPEX。最后，我们通过用户研究表明，AgentSPEX 提供了一种比现有流行代理框架更具可解释性和可访问性的工作流创作范式。

cs.CL / 33 / 2604.13356

Peer-Predictive Self-Training for Language Model Reasoning

语言模型推理的同行预测自我训练

Feng, Shi, Zhang, Hanlin, Nie, Fan, Kakade, Sham, Chen, Yiling

Abstract

Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses in practice, serves as an internal target for learning. We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates. Responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and MultiArith), PST improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent, while requiring no external supervision or teacher-student hierarchy and relying solely on cross-model interactions. These results suggest that cross-model generations and peer-predictive feedback can serve as an effective approach for self-supervised training.
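The PMI-based scaling can be sketched as follows. The PMI definition is standard; the mapping from PMI to an update weight is a hypothetical choice made here for illustration, since the abstract only says that aligned responses are updated less and misaligned ones more. The probabilities are invented.

```python
import math

def pmi(logp_aggregate_given_response, logp_aggregate):
    """Pointwise mutual information between a response and the aggregate answer."""
    return logp_aggregate_given_response - logp_aggregate

def update_weight(pmi_value):
    """Hypothetical monotone map: higher PMI (already aligned) gives a smaller update."""
    return 1.0 / (1.0 + math.exp(pmi_value))

# Invented probabilities: the aggregate answer has prior probability 0.3.
aligned = pmi(math.log(0.9), math.log(0.3))     # response makes the aggregate likelier
misaligned = pmi(math.log(0.1), math.log(0.3))  # response makes it less likely

print(round(update_weight(aligned), 2), round(update_weight(misaligned), 2))
```

Any decreasing function of PMI would express the same qualitative behavior; the logistic form is used only because it keeps weights between 0 and 1.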

Chinese Translation

在没有外部监督的情况下，语言模型的持续自我改进机制仍然是一个开放的挑战。我们提出了同行预测自我训练（Peer-Predictive Self-Training, PST），这是一种无标签的微调框架，其中多个语言模型通过利用跨模型聚合响应作为内部训练信号进行协作改进。给定一个提示问题，这些模型依次生成响应；最终的聚合答案通常比单个响应更可靠，作为学习的内部目标。我们使用逐点互信息（Pointwise Mutual Information, PMI）来衡量每个中间响应对聚合的贡献信息量，并利用该信号来调整自我训练更新。与聚合一致的响应更新较少，而信息量较少或不一致的响应则更新较多。在数学推理基准测试（SimulEq、Math500 和 MultiArith）上，PST 在 Gemma-2-2B、LLaMA-3.2-1B 和 Qwen-2.5-1.5B 上提高了 2.2 到 4.3 个百分点的精确匹配准确率，并将平均生成器-验证器差距（Generator-Verifier Gap, GV-Gap）降低了 26% 到 40%，同时不需要外部监督或教师-学生层级，仅依赖于跨模型交互。这些结果表明，跨模型生成和同行预测反馈可以作为自我监督训练的有效方法。

cs.CL / 34 / 2604.13368

TLoRA+: A Low-Rank Parameter-Efficient Fine-Tuning Method for Large Language Models

TLoRA+: 一种低秩参数高效微调大型语言模型的方法

Cao, Yarui, Liu, Kai

Abstract

Fine-tuning large language models (LLMs) aims to adapt pre-trained models to specific tasks using relatively small and domain-specific datasets. Among Parameter-Efficient Fine-Tuning (PEFT) methods, Low-Rank Adaptation (LoRA) stands out by matching the performance of full fine-tuning while avoiding additional inference latency. In this paper, we propose a novel PEFT method that incorporates the TLoRA+ optimizer into the weight matrices of pre-trained models. The proposed approach not only preserves the efficiency of low-rank adaptation but also further enhances performance without significantly increasing computational cost. We conduct experiments on the GLUE benchmark across diverse model architectures. Numerical experiments consistently demonstrate the effectiveness and robustness of our proposed method.

Chinese Translation

微调大型语言模型（LLMs）的目的是使用相对较小且特定领域的数据集将预训练模型适应于特定任务。在参数高效微调（PEFT）方法中，低秩适应（LoRA）因其在避免额外推理延迟的同时匹配全微调的性能而脱颖而出。本文提出了一种新颖的PEFT方法，将TLoRA+优化器融入预训练模型的权重矩阵中。所提出的方法不仅保持了低秩适应的高效性，还在不显著增加计算成本的情况下进一步提升了性能。我们在GLUE基准上对多种模型架构进行了实验。数值实验一致证明了我们提出方法的有效性和鲁棒性。

cs.CL / 35 / 2604.13371

Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints

在具有明确有效性约束的有限离散状态空间问题中，大型语言模型的复杂性引发的限制的实证证据

Utsho, Md. Fahad Ullah, Ameen, Mohd. Ruhul, Islam, Akif, Rashed, Md. Golam, Das, Dipankar

Abstract

Large Language Models (LLMs) are increasingly described as possessing strong reasoning capabilities, supported by high performance on mathematical, logical, and planning benchmarks. However, most existing evaluations rely on aggregate accuracy over fixed datasets, obscuring how reasoning behavior evolves as task complexity increases. In this work, we introduce a controlled benchmarking framework to systematically evaluate the robustness of reasoning in Large Reasoning Models (LRMs) under progressively increasing problem complexity. We construct a suite of nine classical reasoning tasks: Boolean Satisfiability, Cryptarithmetic, Graph Coloring, River Crossing, Tower of Hanoi, Water Jug, Checker Jumping, Sudoku, and Rubik's Cube, each parameterized to precisely control complexity while preserving underlying semantics. Using deterministic validators, we evaluate multiple open and proprietary LRMs across low, intermediate, and high complexity regimes, ensuring that only fully valid solutions are accepted. Our results reveal a consistent phase-transition-like behavior: models achieve high accuracy at low complexity but degrade sharply beyond task-specific complexity thresholds. We formalize this phenomenon as reasoning collapse. Across tasks, we observe substantial accuracy declines, often exceeding 50%, accompanied by inconsistent reasoning traces, constraint violations, loss of state tracking, and confidently incorrect outputs. Increased reasoning length does not reliably improve correctness, and gains in one problem family do not generalize to others. These findings highlight the need for evaluation methodologies that move beyond static benchmarks and explicitly measure reasoning robustness under controlled complexity.

Chinese Translation

大型语言模型（LLMs）越来越被描述为具备强大的推理能力，这一观点得到了它们在数学、逻辑和规划基准测试中高性能的支持。然而，大多数现有评估依赖于固定数据集上的整体准确性，掩盖了随着任务复杂性增加推理行为的演变。在本研究中，我们引入了一个受控的基准测试框架，以系统地评估大型推理模型（LRMs）在逐步增加的问题复杂性下的推理稳健性。我们构建了一套九个经典推理任务：布尔可满足性、密码算术、图着色、过河、汉诺塔、水壶、跳棋、数独和魔方，每个任务都经过参数化，以精确控制复杂性，同时保留其基本语义。通过使用确定性验证器，我们在低、中、高复杂性范围内评估多个开放和专有的LRMs，确保仅接受完全有效的解决方案。我们的结果揭示了一种一致的相变行为：模型在低复杂性下达到高准确性，但在超出特定任务复杂性阈值后急剧下降。我们将这一现象形式化为推理崩溃。在各个任务中，我们观察到显著的准确性下降，通常超过50%，伴随着不一致的推理轨迹、约束违反、状态跟踪丧失和自信但错误的输出。推理长度的增加并不能可靠地提高正确性，并且在一个问题家族中的收益并不普遍适用于其他问题。这些发现强调了评估方法的必要性，要求超越静态基准，明确测量在受控复杂性下的推理稳健性。

cs.CL / 36 / 2604.13398

From Prediction to Justification: Aligning Sentiment Reasoning with Human Rationale via Reinforcement Learning

从预测到解释：通过强化学习将情感推理与人类推理对齐

Zhang, Shihao, Wang, Ziwei, Zhou, Jie, Wu, Yulan, Chen, Qin, Lei, Zhikai, Yu, Liyang, Dou, Liang, He, Liang

Abstract

While Aspect-based Sentiment Analysis (ABSA) systems have achieved high accuracy in identifying sentiment polarities, they often operate as "black boxes," lacking the explicit reasoning capabilities characteristic of human affective cognition. Humans do not merely categorize sentiment; they construct causal explanations for their judgments. To bridge this gap, we propose ABSA-R1, a large language model framework designed to mimic this "reason-before-predict" cognitive process. By leveraging reinforcement learning (RL), ABSA-R1 learns to articulate the why behind the what, generating natural language justifications that ground its sentiment predictions. We introduce a Cognition-Aligned Reward Model (formerly sentiment-aware reward model) that enforces consistency between the generated reasoning path and the final emotional label. Furthermore, inspired by metacognitive monitoring, we implement a performance-driven rejection sampling strategy that selectively targets hard cases where the model's internal reasoning is uncertain or inconsistent. Experimental results on four benchmarks demonstrate that equipping models with this explicit reasoning capability not only enhances interpretability but also yields superior performance in sentiment classification and triplet extraction compared to non-reasoning baselines.

Chinese Translation

尽管基于方面的情感分析（ABSA）系统在识别情感极性方面已取得高准确率，但它们往往作为“黑箱”运作，缺乏人类情感认知特有的明确推理能力。人类不仅仅是对情感进行分类；他们还为自己的判断构建因果解释。为了解决这一问题，我们提出了ABSA-R1，一个旨在模仿这种“先推理后预测”认知过程的大型语言模型框架。通过利用强化学习（RL），ABSA-R1学会阐明“是什么”的背后“为什么”，生成自然语言的解释，以支持其情感预测。我们引入了一种认知对齐奖励模型（Cognition-Aligned Reward Model，前称情感感知奖励模型），以确保生成的推理路径与最终情感标签之间的一致性。此外，受到元认知监控的启发，我们实施了一种基于性能的拒绝采样策略，选择性地针对模型内部推理不确定或不一致的难例。四个基准测试的实验结果表明，赋予模型这种明确的推理能力不仅增强了可解释性，而且在情感分类和三元组提取方面的表现优于非推理基线。

cs.CL / 37 / 2604.13418

MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

MERRIN：在嘈杂网络环境中进行多模态证据检索与推理的基准测试

Wang, Han, Wan, David, Lee, Hyunji, Pham, Thinh, Cankosyan, Mikaela, Chen, Weiyuan, Stengel-Eskin, Elias, Vu, Tu, Bansal, Mohit

Abstract

Motivated by the underspecified, multi-hop nature of search queries and the multimodal, heterogeneous, and often conflicting nature of real-world web results, we introduce MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments), a human-annotated benchmark for evaluating search-augmented agents. MERRIN measures AI agents' ability to identify relevant modalities, retrieve multimodal evidence, and perform multi-hop reasoning over noisy web sources. It differs from prior work in three important aspects: (1) using natural language queries without explicit modality cues, (2) incorporating underexplored modalities such as video and audio, and (3) requiring the retrieval of complex, often noisy or conflicting multimodal evidence during web search. We evaluate diverse search agents powered by ten models, including strong closed-source models (e.g., GPT-5.4-mini, Gemini 3/3.1 Flash/Pro) and open-weight models (Qwen3-4B/30B/235B), across three search settings (no search, native search, and agentic search). Our results show that MERRIN is highly challenging: the average accuracy across all agents is 22.3%, with the best-performing agent reaching only 40.1%. We further observe that while stronger agents like Gemini Deep Research achieve higher performance, gains are modest due to over-exploration; they take more steps and use more tools, but are often distracted by conflicting or partially relevant web content, leading to incorrect answers. Compared to humans, these agents consume more resources yet achieve lower accuracy, largely due to inefficient source selection and an overreliance on text modalities. These findings highlight the need for search agents capable of robust search and reasoning across diverse modalities in noisy web environments, making MERRIN a valuable testbed for evaluating such capabilities.

Chinese Translation

受到搜索查询的不明确性、多跳性质以及现实网络结果的多模态、异构和常常相互冲突的特性的启发，我们提出了MERRIN（Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments），这是一个用于评估搜索增强代理的人类标注基准。MERRIN衡量AI代理识别相关模态、检索多模态证据和在嘈杂网络来源上进行多跳推理的能力。与之前的工作相比，MERRIN在三个重要方面有所不同：（1）使用没有明确模态提示的自然语言查询，（2）纳入视频和音频等未充分探索的模态，以及（3）在网络搜索中要求检索复杂的、常常嘈杂或相互冲突的多模态证据。我们评估了由十种模型驱动的多样化搜索代理，包括强大的封闭源模型（如GPT-5.4-mini、Gemini 3/3.1 Flash/Pro）和开放权重模型（Qwen3-4B/30B/235B），在三种搜索设置下（无搜索、原生搜索和代理搜索）进行测试。我们的结果表明，MERRIN具有很高的挑战性：所有代理的平均准确率为22.3%，表现最佳的代理仅达到40.1%。我们进一步观察到，尽管像Gemini Deep Research这样的强大代理实现了更高的性能，但由于过度探索，增益有限；它们采取更多步骤并使用更多工具，但常常被相互冲突或部分相关的网络内容分散注意力，导致错误答案。与人类相比，这些代理消耗更多资源但准确率较低，主要是由于源选择效率低下和对文本模态的过度依赖。这些发现突显了在嘈杂网络环境中需要能够进行强大搜索和推理的搜索代理，使得MERRIN成为评估此类能力的宝贵测试平台。

cs.CL / 38 / 2604.13452

CANVAS: Continuity-Aware Narratives via Visual Agentic Storyboarding

CANVAS：通过视觉代理故事板实现连续性意识叙事

Mondal, Ishani, Song, Yiwen, Parmar, Mihir, Goyal, Palash, Boyd-Graber, Jordan, Pfister, Tomas, Song, Yale

Abstract

Long-form visual storytelling requires maintaining continuity across shots, including consistent characters, stable environments, and smooth scene transitions. While existing generative models can produce strong individual frames, they fail to preserve such continuity, leading to appearance changes, inconsistent backgrounds, and abrupt scene shifts. We introduce CANVAS (Continuity-Aware Narratives via Visual Agentic Storyboarding), a multi-agent framework that explicitly plans visual continuity in multi-shot narratives. CANVAS enforces coherence through character continuity, persistent background anchors, and location-aware scene planning for smooth transitions within the same setting. We evaluate CANVAS on two storyboard generation benchmarks, ST-BENCH and ViStoryBench, and introduce a new challenging benchmark, HardContinuityBench, for long-range narrative consistency. CANVAS consistently outperforms the best-performing baseline, improving background continuity by 21.6%, character consistency by 9.6%, and props consistency by 7.6%.

Chinese Translation

长篇视觉叙事需要在镜头之间保持连续性，包括一致的人物、稳定的环境和流畅的场景过渡。尽管现有的生成模型能够生成强有力的单帧图像，但它们未能保持这种连续性，导致外观变化、不一致的背景和突兀的场景切换。我们提出了CANVAS（通过视觉代理故事板实现连续性意识叙事），这是一个多代理框架，明确规划多镜头叙事中的视觉连续性。CANVAS通过人物连续性、持久的背景锚点和位置感知的场景规划来强制保持一致性，以实现同一环境中的平滑过渡。我们在两个故事板生成基准测试ST-BENCH和ViStoryBench上评估了CANVAS，并引入了一个新的具有挑战性的基准HardContinuityBench，以测试长距离叙事的一致性。CANVAS在性能上始终优于最佳基线，背景连续性提高了21.6%，人物一致性提高了9.6%，道具一致性提高了7.6%。

cs.CL / 39 / 2604.13502

Using reasoning LLMs to extract SDOH events from clinical notes

使用推理大型语言模型从临床记录中提取健康的社会决定因素事件

Doganl, Ertan, Yu, Kunyu, Peng, Yifan

Abstract

Social Determinants of Health (SDOH) refer to environmental, behavioral, and social conditions that influence how individuals live, work, and age. SDOH have a significant impact on personal health outcomes, and their systematic identification and management can yield substantial improvements in patient care. However, SDOH information is predominantly captured in unstructured clinical notes within electronic health records, which limits its direct use as machine-readable entities. To address this issue, researchers have employed Natural Language Processing (NLP) techniques using pre-trained BERT-based models, demonstrating promising performance but requiring sophisticated implementation and extensive computational resources. In this study, we investigated prompt engineering strategies for extracting structured SDOH events utilizing LLMs with advanced reasoning capabilities. Our method consisted of four modules: 1) developing concise and descriptive prompts integrated with established guidelines, 2) applying few-shot learning with carefully curated examples, 3) using a self-consistency mechanism to ensure robust outputs, and 4) post-processing for quality control. Our approach achieved a micro-F1 score of 0.866, demonstrating competitive performance compared to the leading models. The results demonstrated that LLMs with reasoning capabilities are effective solutions for SDOH event extraction, offering both implementation simplicity and strong performance.
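The self-consistency module (step 3) can be sketched as a majority vote over several sampled outputs. This is a generic illustration of the technique, not the authors' implementation; the function name and the sample strings are hypothetical.

```python
from collections import Counter

def self_consistent_answer(samples):
    """Return the most frequent sampled answer and the share of samples agreeing with it."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

# Invented outputs standing in for repeated LLM samples on one clinical note.
samples = [
    "employment: unemployed",
    "employment: unemployed",
    "employment: retired",
]
answer, agreement = self_consistent_answer(samples)
print(answer, round(agreement, 2))
```

The agreement ratio could also serve as a signal for the post-processing step, for example to flag low-agreement extractions for review; that use is a suggestion here, not something the abstract states.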

Chinese Translation

健康的社会决定因素（SDOH）指的是影响个人生活、工作和老龄化的环境、行为和社会条件。SDOH对个人健康结果具有显著影响，其系统的识别和管理可以显著改善患者护理。然而，SDOH信息主要以非结构化的临床记录形式存在于电子健康记录中，这限制了其作为机器可读实体的直接使用。为了解决这个问题，研究人员采用了自然语言处理（NLP）技术，利用预训练的基于BERT的模型，展示了良好的性能，但需要复杂的实现和大量的计算资源。在本研究中，我们探讨了利用具有先进推理能力的LLMs（大型语言模型）提取结构化SDOH事件的提示工程策略。我们的方法包括四个模块：1）开发与既定指南相结合的简洁且描述性强的提示，2）应用经过精心策划的示例进行少量学习，3）使用自一致性机制确保输出的稳健性，以及4）进行后处理以进行质量控制。我们的方法达到了0.866的微F1分数，与领先模型相比表现竞争力。结果表明，具有推理能力的LLMs是提取SDOH事件的有效解决方案，既实现简单又性能强劲。

cs.CL / 40 / 2604.13519

ToolSpec: Accelerating Tool Calling via Schema-Aware and Retrieval-Augmented Speculative Decoding

ToolSpec：通过模式感知和检索增强的推测解码加速工具调用

Xia, Heming, Li, Yongqi, Du, Cunxiao, Song, Mingbo, Li, Wenjie

Abstract

Tool calling has greatly expanded the practical utility of large language models (LLMs) by enabling them to interact with external applications. As LLM capabilities advance, effective tool use increasingly involves multi-step, multi-turn interactions to solve complex tasks. However, the resulting growth in tool interactions incurs substantial latency, posing a key challenge for real-time LLM serving. Through empirical analysis, we find that tool-calling traces are highly structured, conform to constrained schemas, and often exhibit recurring invocation patterns. Motivated by this, we propose ToolSpec, a schema-aware, retrieval-augmented speculative decoding method for accelerating tool calling. ToolSpec exploits predefined tool schemas to generate accurate drafts, using a finite-state machine to alternate between deterministic schema token filling and speculative generation for variable fields. In addition, ToolSpec retrieves similar historical tool invocations and reuses them as drafts to further improve efficiency. ToolSpec presents a plug-and-play solution that can be seamlessly integrated into existing LLM workflows. Experiments across multiple benchmarks demonstrate that ToolSpec achieves up to a 4.2x speedup, substantially outperforming existing training-free speculative decoding methods.
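The alternation between deterministic schema filling and speculative field values can be sketched at the string level. This is a toy illustration under stated assumptions, not ToolSpec itself: it works on text segments rather than tokens, omits the verification step in which the target model accepts or rejects the draft, and uses a hypothetical tool (`get_weather`), a hypothetical call format, and a single past call as the retrieval source.

```python
import json

def build_draft(tool_name, param_names, guess_value):
    """Build a draft tool call as (text, kind) segments.

    "fixed" segments follow from the schema alone; "speculative" segments are
    guesses for variable fields that a target model would still need to verify.
    """
    segments = [('{"name": ' + json.dumps(tool_name) + ', "arguments": {', "fixed")]
    for index, param in enumerate(param_names):
        separator = ", " if index else ""
        segments.append((separator + json.dumps(param) + ": ", "fixed"))
        segments.append((json.dumps(guess_value(param)), "speculative"))
    segments.append(("}}", "fixed"))
    return segments

# A past invocation reused as the source of guesses (hypothetical data).
previous_call = {"city": "Paris", "unit": "celsius"}
segments = build_draft("get_weather", ["city", "unit"], previous_call.get)
draft = "".join(text for text, _ in segments)
print(draft)
```

Because every structural character comes from the schema, only the field values carry any risk of rejection, which is the intuition behind the reported speedups.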

Chinese Translation

工具调用极大地扩展了大型语言模型（LLMs）的实际应用能力，使其能够与外部应用程序进行交互。随着LLM能力的提升，有效的工具使用越来越涉及多步骤、多轮次的交互，以解决复杂任务。然而，随之而来的工具交互增长带来了显著的延迟，成为实时LLM服务的一个关键挑战。通过实证分析，我们发现工具调用轨迹高度结构化，符合受限模式，并且通常表现出重复的调用模式。基于此，我们提出了ToolSpec，一种模式感知的、检索增强的推测解码方法，用于加速工具调用。ToolSpec利用预定义的工具模式生成准确的草稿，使用有限状态机在确定性模式令牌填充和可变字段的推测生成之间交替。此外，ToolSpec检索相似的历史工具调用并将其作为草稿重用，以进一步提高效率。ToolSpec提供了一种即插即用的解决方案，可以无缝集成到现有的LLM工作流程中。多项基准实验表明，ToolSpec的速度提升可达4.2倍，显著优于现有的无训练推测解码方法。

cs.CL / 41 / 2604.13538

Synthesizing Instruction-Tuning Datasets with Contrastive Decoding

通过对比解码合成指令调优数据集

Ichinose, Tatsuya, Ma, Youmi, Oi, Masanari, Koike, Ryuto, Okazaki, Naoaki

Abstract

Using responses generated by high-performing large language models (LLMs) for instruction tuning has become a widely adopted approach. However, the existing literature overlooks a property of LLM-generated responses: they conflate world knowledge acquired during pre-training with instruction-following capabilities acquired during post-training. We hypothesize that disentangling the instruction-following capabilities from pre-trained knowledge improves the effectiveness of instruction tuning. To this end, we propose CoDIT, a method that applies contrastive decoding between a post-trained model and its pre-trained counterpart during response generation. The method suppresses pre-trained knowledge shared between the two models while amplifying the instruction-following behavior acquired via post-training, resulting in responses that more purely reflect instruction-following capabilities. Experiment results demonstrate that models trained on datasets constructed via CoDIT consistently outperform those trained on directly generated responses. Training on our datasets also yields better performance than on existing publicly available instruction-tuning datasets across multiple benchmarks. Furthermore, we theoretically and empirically show that CoDIT can be interpreted as distilling the chat vector from parameter space to text space, enabling the transfer of instruction-tuning capabilities across models of different architectures.
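Contrastive decoding between a post-trained model and its pre-trained counterpart can be sketched for a single decoding step. The scoring rule below is the generic contrastive-decoding form (log-probability difference with a plausibility cutoff); the abstract does not give CoDIT's exact formula, so `alpha`, `beta`, and the toy distributions are assumptions made for illustration.

```python
import math

def contrastive_next_token(post_logprobs, pre_logprobs, alpha=1.0, beta=0.1):
    """Pick the token maximizing log p_post - alpha * log p_pre.

    Only tokens whose post-trained probability is at least beta times the
    most likely token's probability are considered (plausibility cutoff).
    """
    cutoff = max(post_logprobs.values()) + math.log(beta)
    candidates = [tok for tok, lp in post_logprobs.items() if lp >= cutoff]
    return max(candidates, key=lambda tok: post_logprobs[tok] - alpha * pre_logprobs[tok])

# Toy next-token distributions (invented numbers).
post = {"Paris": math.log(0.50), "Sure": math.log(0.49), "zzz": math.log(0.01)}
pre = {"Paris": math.log(0.90), "Sure": math.log(0.0999), "zzz": math.log(0.0001)}

chosen = contrastive_next_token(post, pre)
chosen_without_cutoff = contrastive_next_token(post, pre, beta=1e-6)
print(chosen, chosen_without_cutoff)
```

In this toy case the token favored by both models ("Paris") is suppressed, the token the post-trained model adds ("Sure") is amplified, and the cutoff keeps an implausible token from winning on ratio alone.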

Chinese Translation

利用高性能大型语言模型（LLMs）生成的响应进行指令调优已成为一种广泛采用的方法。然而，现有文献忽视了LLM生成响应的一个特性：它们将预训练期间获得的世界知识与后训练期间获得的指令遵循能力混为一谈。我们假设，将指令遵循能力与预训练知识分离可以提高指令调优的有效性。为此，我们提出了CoDIT，一种在响应生成过程中对后训练模型与其预训练对应模型之间应用对比解码的方法。该方法抑制了两个模型之间共享的预训练知识，同时增强了通过后训练获得的指令遵循行为，从而生成更纯粹反映指令遵循能力的响应。实验结果表明，基于CoDIT构建的数据集训练的模型在性能上始终优于直接生成响应训练的模型。在多个基准测试中，基于我们的数据集进行训练的模型表现也优于现有的公开可用的指令调优数据集。此外，我们从理论和实证上展示了CoDIT可以被解释为将参数空间中的聊天向量提炼到文本空间，从而实现不同架构模型之间指令调优能力的转移。

cs.CL / 42 / 2604.13551

Debate to Align: Reliable Entity Alignment through Two-Stage Multi-Agent Debate

对齐辩论：通过两阶段多智能体辩论实现可靠的实体对齐

Wang, Cunda, Ma, Ziying, Hu, Po, Wang, Weihua, Bao, Feilong

Abstract

Entity alignment (EA) aims to identify entities referring to the same real-world object across different knowledge graphs (KGs). Recent approaches based on large language models (LLMs) typically obtain entity embeddings through knowledge representation learning and use embedding similarity to identify an alignment-uncertain entity set. For each uncertain entity, a candidate entity set (CES) is then retrieved based on embedding similarity to support subsequent alignment reasoning and decision making. However, the reliability of the CES and the reasoning capability of LLMs critically affect the effectiveness of subsequent alignment decisions. To address this issue, we propose AgentEA, a reliable EA framework based on multi-agent debate. AgentEA first improves embedding quality through entity representation preference optimization, and then introduces a two-stage multi-role debate mechanism consisting of lightweight debate verification and deep debate alignment to progressively enhance the reliability of alignment decisions while enabling more efficient debate-based reasoning. Extensive experiments on public benchmarks under cross-lingual, sparse, large-scale, and heterogeneous settings demonstrate the effectiveness of AgentEA.

Chinese Translation

实体对齐（EA）旨在识别不同知识图谱（KGs）中指代同一现实世界对象的实体。近期基于大型语言模型（LLMs）的方法通常通过知识表示学习获得实体嵌入，并利用嵌入相似性来识别对齐不确定的实体集合。对于每个不确定实体，基于嵌入相似性检索候选实体集合（CES），以支持后续的对齐推理和决策。然而，CES的可靠性和LLMs的推理能力对后续对齐决策的有效性产生重要影响。为了解决这一问题，我们提出了AgentEA，一个基于多智能体辩论的可靠EA框架。AgentEA首先通过实体表示偏好优化来提高嵌入质量，然后引入一个由轻量级辩论验证和深度辩论对齐组成的两阶段多角色辩论机制，以逐步增强对齐决策的可靠性，同时实现更高效的基于辩论的推理。在跨语言、稀疏、大规模和异构设置下的公共基准上进行的广泛实验证明了AgentEA的有效性。

cs.CL / 43 / 2604.13552

Training-Free Test-Time Contrastive Learning for Large Language Models

无训练测试时对比学习的大型语言模型

Zheng, Kaiwen, Zhou, Kai, Hu, Jinwu, Gu, Te, Peng, Mingkai, Liu, Fei

Abstract

Large language models (LLMs) demonstrate strong reasoning capabilities, but their performance often degrades under distribution shift. Existing test-time adaptation (TTA) methods rely on gradient-based updates that require white-box access and incur substantial overhead, while training-free alternatives are either static or depend on external guidance. In this paper, we propose Training-Free Test-Time Contrastive Learning (TF-TTCL), a training-free adaptation framework that enables a frozen LLM to improve online by distilling supervision from its own inference experiences. Specifically, TF-TTCL implements a dynamic "Explore-Reflect-Steer" loop through three core modules: 1) Semantic Query Augmentation first diversifies problem views via multi-agent role-playing to generate different reasoning trajectories; 2) Contrastive Experience Distillation then captures the semantic gap between superior and inferior trajectories, distilling them into explicit textual rules; and 3) Contextual Rule Retrieval finally activates these stored rules during inference to dynamically steer the frozen LLM toward robust reasoning patterns while avoiding observed errors. Extensive experiments on closed-ended reasoning tasks and open-ended evaluation tasks demonstrate that TF-TTCL consistently outperforms strong zero-shot baselines and representative TTA methods under online evaluation. Code is available at https://github.com/KevinSCUTer/TF-TTCL.

Chinese Translation

大型语言模型（LLMs）展现出强大的推理能力，但在分布转移下其性能往往会下降。现有的测试时适应（TTA）方法依赖于基于梯度的更新，这需要白盒访问并且需要大量开销，而无训练的替代方案要么是静态的，要么依赖于外部指导。本文提出了一种无训练的测试时对比学习框架TF-TTCL，该框架使得冻结的LLM能够通过从自身推理经验中提炼监督信息来在线改进。具体而言，TF-TTCL通过三个核心模块实现了动态的“探索-反思-引导”循环：1）语义查询增强首先通过多代理角色扮演多样化问题视角，以生成不同的推理轨迹；2）对比经验提炼然后捕捉优劣轨迹之间的语义差距，将其提炼为明确的文本规则；3）上下文规则检索最终在推理过程中激活这些存储的规则，以动态引导冻结的LLM朝向稳健的推理模式，同时避免观察到的错误。在封闭式推理任务和开放式评估任务上的大量实验表明，TF-TTCL在在线评估中始终优于强大的零-shot基线和代表性的TTA方法。代码可在 https://github.com/KevinSCUTer/TF-TTCL 获取。

cs.CL / 44 / 2604.13556

YOCO++: Enhancing YOCO with KV Residual Connections for Efficient LLM Inference

YOCO++：通过KV残差连接增强YOCO以实现高效的大型语言模型推理

Wu, You, Chen, Ziheng, Zhang, Yizhen, Wu, Haoyi, Yu, Chengting, Xu, Yuchi, Su, Wenbo, Zheng, Bo, Tu, Kewei

Abstract

Cross-layer key-value (KV) compression has been found to be effective in efficient inference of large language models (LLMs). Although they reduce the memory consumption of the KV cache, such methods usually introduce non-negligible performance degradation. In this work, we aim to enhance the performance of YOCO, a cross-layer KV compression method that shares the KVs of the middle layer with the top-half layers. We propose YOCO++, an enhanced YOCO that incorporates a weighted residual connection between the KVs of each bottom-half layer and the bottom layer. Compared to YOCO, YOCO++ increases model capacity while maintaining the same training and inference efficiency. Our experiments show that YOCO++ achieves state-of-the-art performance among the cross-layer KV compression methods at a 50% KV cache compression rate, outperforming the standard Transformer.

Chinese Translation

跨层键值（KV）压缩在大型语言模型（LLMs）的高效推理中被发现是有效的。尽管这些方法减少了KV缓存的内存消耗，但通常会引入不可忽视的性能下降。在本研究中，我们旨在增强YOCO的性能，YOCO是一种跨层KV压缩方法，它将中间层的KVs与上半部分层共享。我们提出了YOCO++，一种增强版YOCO，它在每个下半部分层的KVs与底层之间引入了加权残差连接。与YOCO相比，YOCO++在保持相同的训练和推理效率的同时增加了模型容量。我们的实验表明，YOCO++在50%的KV缓存压缩率下，在跨层KV压缩方法中实现了最先进的性能，超越了标准Transformer。

cs.CL / 45 / 2604.13579

MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning

MM-Doc-R1：通过多轮强化学习训练长文档视觉问答代理

Lin, Jiahang, Hu, Kai, Wang, Binghai, Zhou, Yuhao, Xi, Zhiheng, Guo, Honglin, Liu, Shichun, Wang, Junzhe, Dou, Shihan, Zhou, Enyu, Yan, Hang, Han, Zhenhua, Gui, Tao, Zhang, Qi, Huang, Xuanjing

Abstract

Conventional Retrieval-Augmented Generation (RAG) systems often struggle with complex multi-hop queries over long documents due to their single-pass retrieval. We introduce MM-Doc-R1, a novel framework that employs an agentic, vision-aware workflow to address long document visual question answering through iterative information discovery and synthesis. To incentivize the information seeking capabilities of our agents, we propose Similarity-based Policy Optimization (SPO), addressing baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms like GRPO. Our core insight is that in multi-turn RL, the more semantically similar two trajectories are, the more accurate their shared baseline estimation becomes. Leveraging this, SPO calculates a more precise baseline by similarity-weighted averaging of rewards across multiple trajectories, unlike GRPO which inappropriately applies the initial state's baseline to all intermediate states. This provides a more stable and accurate learning signal for our agents, leading to superior training performance that surpasses GRPO. Our experiments on the MMLongbench-Doc benchmark show that MM-Doc-R1 outperforms previous baselines by 10.4%. Furthermore, SPO demonstrates superior performance over GRPO, boosting results by 5.0% with Qwen3-8B and 6.1% with Qwen3-4B. These results highlight the effectiveness of our integrated framework and novel training algorithm in advancing the state-of-the-art for complex, long-document visual question answering.

Chinese Translation

传统的检索增强生成（RAG）系统在处理长文档中的复杂多跳查询时常常面临挑战，因为它们采用单次检索的方法。我们提出了MM-Doc-R1，这是一种新颖的框架，采用代理驱动的、视觉感知的工作流程，通过迭代的信息发现和综合来解决长文档视觉问答问题。为了激励我们的代理的信息寻求能力，我们提出了基于相似度的策略优化（Similarity-based Policy Optimization, SPO），以解决现有多轮强化学习（RL）算法（如GRPO）中的基线估计偏差。我们的核心见解是，在多轮强化学习中，两个轨迹在语义上越相似，它们共享的基线估计就越准确。利用这一点，SPO通过对多个轨迹的奖励进行相似度加权平均，计算出更精确的基线，而GRPO则不当地将初始状态的基线应用于所有中间状态。这为我们的代理提供了更稳定和准确的学习信号，从而实现了优于GRPO的卓越训练性能。我们在MMLongbench-Doc基准上的实验表明，MM-Doc-R1的表现比之前的基线提高了10.4%。此外，SPO在性能上也优于GRPO，使用Qwen3-8B提升了5.0%的结果，使用Qwen3-4B提升了6.1%。这些结果突显了我们集成框架和新颖训练算法在推动复杂长文档视觉问答领域的最新进展中的有效性。
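
The contrast between a single group baseline and a similarity-weighted one can be illustrated with a minimal sketch. It assumes cosine similarity over hypothetical trajectory embeddings and a simple normalised weighting; the function names, the embedding representation, and the clipping of negative similarities are illustrative assumptions, not the authors' implementation, and the sketch works at the trajectory level only (it does not model intermediate states).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_mean_baseline(rewards):
    """GRPO-style baseline: one shared mean reward for the whole group."""
    mean_reward = sum(rewards) / len(rewards)
    return [mean_reward] * len(rewards)


def similarity_weighted_baseline(embeddings, rewards):
    """Per-trajectory baseline: rewards averaged with similarity weights.

    Negative similarities are clipped to zero (an assumption of this sketch)
    so that every weight is non-negative.
    """
    baselines = []
    for e_i in embeddings:
        weights = [max(cosine(e_i, e_j), 0.0) for e_j in embeddings]
        total = sum(weights)
        if total == 0.0:
            # Degenerate embedding: fall back to the plain group mean.
            baselines.append(sum(rewards) / len(rewards))
        else:
            baselines.append(sum(w * r for w, r in zip(weights, rewards)) / total)
    return baselines


# Hypothetical group: two identical trajectories and one unrelated trajectory.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
rewards = [1.0, 0.0, 1.0]
baselines = similarity_weighted_baseline(embeddings, rewards)
advantages = [r - b for r, b in zip(rewards, baselines)]
```

In this toy group the two identical trajectories share a baseline computed only from each other, while the unrelated trajectory is compared against itself alone; the plain group mean would instead assign all three the same baseline.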

cs.CL / 46 / 2604.13583

BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

BenGER：一个用于德国法律任务端到端基准测试的协作网络平台

Nagl, Sebastian, Grabmair, Matthias

Abstract

Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.

Chinese Translation

评估大型语言模型（LLMs）在法律推理中的表现需要涵盖任务设计、专家注释、模型执行和基于指标的评估的工作流程。在实践中，这些步骤分散在不同的平台和脚本中，限制了透明度、可重复性以及非技术法律专家的参与。我们提出了BenGER（Benchmark for German Law）框架，这是一个开源网络平台，集成了任务创建、协作注释、可配置的LLM运行和基于词汇、语义、事实和法官评估的评估。BenGER支持多组织项目，具有租户隔离和基于角色的访问控制，并可以为注释者提供形成性、基于参考的反馈。我们将展示一个实时部署，展示端到端基准创建和分析的过程。

cs.CL / 47 / 2604.13592

Foresight Optimization for Strategic Reasoning in Large Language Models

大型语言模型中的战略推理前瞻优化

Wang, Jiashuo, Duan, Jiawen, Wang, Jian, Song, Kaitao, Xu, Chunpu, Ho, Johnny K. W., Yu, Fenggang, Li, Wenjie, Hoorn, Johan F.

Abstract

Reasoning capabilities in large language models (LLMs) have advanced significantly. However, existing reasoning-based LLMs still struggle to make effective decisions in multi-agent environments because they lack explicit foresight modeling. Strategic reasoning, the fundamental capability to anticipate a counterpart's behavior and foresee its possible future actions, has been introduced to alleviate this issue, yet existing reasoning enhancement methods for LLMs do not explicitly capture its foresight nature. In this work, we introduce Foresight Policy Optimization (FoPO) to enhance strategic reasoning in LLMs, which integrates opponent modeling principles into policy optimization, thereby enabling explicit consideration of both self-interest and counterpart influence. Specifically, we construct two curated datasets, namely Cooperative RSA and Competitive Taboo, equipped with well-designed rules and moderate difficulty to facilitate a systematic investigation of FoPO in a self-play framework. Our experiments demonstrate that FoPO significantly enhances strategic reasoning across LLMs of varying sizes and origins. Moreover, models trained with FoPO exhibit strong generalization to out-of-domain strategic scenarios, substantially outperforming standard LLM reasoning optimization baselines.

Chinese Translation

大型语言模型（LLMs）的推理能力普遍显著提升。然而，由于缺乏明确的前瞻建模，现有基于推理的LLMs在多智能体环境中进行有效决策仍然面临挑战。为此，战略推理作为预判对方行为和预测其可能未来行动的最基本能力被引入，以缓解上述问题。战略推理对于多智能体环境中的有效决策至关重要，但现有的LLM推理增强方法并未明确捕捉其前瞻特性。在本研究中，我们提出了前瞻策略优化（Foresight Policy Optimization, FoPO），以增强LLMs中的战略推理，该方法将对手建模原则整合到策略优化中，从而使自我利益和对方影响的明确考虑成为可能。具体而言，我们构建了两个精心策划的数据集，即合作RSA（Cooperative RSA）和竞争禁忌（Competitive Taboo），这些数据集配备了精心设计的规则和适中的难度，以促进在自我对弈框架中对FoPO的系统性研究。我们的实验表明，FoPO显著增强了不同规模和来源的LLMs的战略推理能力。此外，使用FoPO训练的模型在领域外的战略场景中表现出强大的泛化能力，显著优于标准LLM推理优化基线。

cs.CL / 48 / 2604.13618

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

C2：基于二元偏好的可扩展评分标准增强奖励建模

Kawabata, Akira, Sugawara, Saku

Abstract

Rubric-augmented verification guides reward models with explicit evaluation criteria, yielding more reliable judgments than single-model verification. However, most existing methods require costly rubric annotations, limiting scalability. Moreover, we find that rubric generation is vulnerable to a failure of cooperation: low-quality rubrics actively mislead reward models rather than help. Inspired by the principle of cooperative communication, we propose Cooperative yet Critical reward modeling (C2), a framework that significantly improves reward model judgments by having the reward model critically collaborate with a rubric generator trained solely from binary preferences. In C2, we synthesize helpful and misleading rubric pairs by measuring how each rubric shifts the reward model toward or away from the correct preference. Using these contrastive pairs, we train a cooperative rubric generator to propose helpful rubrics, and a critical verifier to assess rubric validity before making its judgment, following only rubrics it deems helpful at inference time. C2 outperforms reasoning reward models trained on the same binary preferences, with gains of up to 6.5 points on RM-Bench and 6.0 points in length-controlled win rate on AlpacaEval 2.0. Without external rubric annotations, C2 enables an 8B reward model to match the performance achieved with rubrics from a 4× larger model. Overall, our work demonstrates that eliciting deliberate cooperation in rubric-augmented verification makes reward models more trustworthy in a scalable way.

Chinese Translation

评分标准增强的验证通过明确的评估标准指导奖励模型，产生比单一模型验证更可靠的判断。然而，大多数现有方法需要昂贵的评分标准注释，限制了其可扩展性。此外，我们发现评分标准生成容易受到合作失败的影响；低质量的评分标准会主动误导奖励模型，而不是提供帮助。受到合作沟通原则的启发，我们提出了合作而关键的奖励建模框架（C2），该框架通过让奖励模型与仅基于二元偏好的评分标准生成器进行关键合作，显著提高了奖励模型的判断。在C2中，我们通过测量每个评分标准如何使奖励模型向正确偏好移动或远离，合成有帮助和误导性的评分标准对。利用这些对比对，我们训练一个合作评分标准生成器来提出有帮助的评分标准，以及一个关键验证器在做出判断之前评估评分标准的有效性，仅遵循其在推理时认为有帮助的评分标准。C2在与相同二元偏好训练的推理奖励模型相比时表现更佳，在RM-Bench上提高了最多6.5分，在AlpacaEval 2.0上提高了6.0分的长度控制胜率。在没有外部评分标准注释的情况下，C2使一个8B的奖励模型达到了使用来自一个4倍更大模型的评分标准所取得的性能。总体而言，我们的工作表明，在评分标准增强的验证中引导有意识的合作，以可扩展的方式使奖励模型更值得信赖。
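
The rubric-labelling step described in the abstract (measuring whether a rubric shifts the reward model toward or away from the correct preference) might look roughly like the sketch below. The probability inputs, the `margin` parameter, and both function names are hypothetical; the abstract does not specify how the shift is measured or thresholded.

```python
def label_rubrics(p_without, p_with_rubric, margin=0.0):
    """Split rubrics by how they shift P(correct preference).

    p_without:     judge probability of the correct preference with no rubric.
    p_with_rubric: dict mapping rubric text to the probability obtained
                   when that rubric is added to the judge prompt.
    margin:        minimum absolute shift required to count (assumed knob).
    """
    helpful, misleading = [], []
    for rubric, p in p_with_rubric.items():
        shift = p - p_without
        if shift > margin:
            helpful.append(rubric)
        elif shift < -margin:
            misleading.append(rubric)
        # Rubrics with no meaningful shift are left unlabelled.
    return helpful, misleading


def contrastive_pairs(helpful, misleading):
    """Pair every helpful rubric with every misleading one."""
    return [(h, m) for h in helpful for m in misleading]


# Hypothetical measurements for one preference example.
helpful, misleading = label_rubrics(
    p_without=0.5,
    p_with_rubric={"checks factual accuracy": 0.8,
                   "prefers longer answers": 0.3,
                   "uses polite tone": 0.5},
)
pairs = contrastive_pairs(helpful, misleading)
```

Because labels come only from the observed shift against a known binary preference, no rubric annotation is needed, which is the scalability argument the abstract makes.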

cs.CL / 49 / 2604.13620

Syn-TurnTurk: A Synthetic Dataset for Turn-Taking Prediction in Turkish Dialogues

Syn-TurnTurk：用于土耳其对话中轮流发言预测的合成数据集

Bayrak, Ahmet Tuğrul, Türkel, Mustafa Sertaç, Korkmaz, Fatma Nur

Abstract

Managing natural dialogue timing is a significant challenge for voice-based chatbots. Most current systems rely on simple silence detection, which often fails because human speech patterns involve irregular pauses. This causes bots to interrupt users, breaking the conversational flow. The problem is even more severe for languages like Turkish, which lack high-quality datasets for turn-taking prediction. This paper introduces Syn-TurnTurk, a synthetic Turkish dialogue dataset generated using various Qwen Large Language Models (LLMs) to mirror real-life verbal exchanges, including overlaps and strategic silences. We evaluated the dataset using several traditional and deep learning architectures. The results show that advanced models, particularly BI-LSTM and Ensemble (LR+RF) methods, achieve high accuracy (0.839) and AUC scores (0.910). These findings demonstrate that our synthetic dataset can help models understand linguistic cues, allowing for more natural human-machine interaction in Turkish.

Chinese Translation

管理自然对话的时机是语音聊天机器人面临的一项重大挑战。目前大多数系统通常依赖简单的静音检测，但由于人类语言模式中存在不规则的停顿，这种方法往往失败。这导致机器人打断用户，从而破坏了对话的流畅性。对于像土耳其语这样缺乏高质量轮流发言预测数据集的语言，这一问题更加严重。本文介绍了Syn-TurnTurk，这是一个使用多种Qwen大型语言模型（LLMs）生成的合成土耳其对话数据集，旨在模拟现实生活中的语言交流，包括重叠和策略性静默。我们使用多种传统和深度学习架构对该数据集进行了评估。结果表明，先进模型，特别是BI-LSTM和集成方法（LR+RF），达到了高准确率（0.839）和AUC分数（0.910）。这些发现表明，我们的合成数据集能够积极影响模型对语言线索的理解，从而在土耳其语中实现更自然的人机交互。

cs.CL / 50 / 2604.13634

Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference

校准的推测解码：基于频率的候选选择以提高推理效率

Zhou, Xuwen, Liu, Fangxin, Wang, Chao, Zheng, Xiao, Zheng, Hao, He, Min, Jiang, Li, Guan, Haibing

Abstract

Speculative decoding accelerates autoregressive generation by letting draft tokens bypass full verification, but conventional frameworks suffer from frequent false rejections, particularly when draft models produce semantically correct but lexically divergent outputs. In this paper, we present Calibrated Speculative Decoding (CSD), a training-free framework that recovers valid tokens discarded by standard verification. Guided by the principle of "Frequency-Guided Candidate Selection and Probability-Guarded Acceptance," CSD incorporates two lightweight modules: Online Correction Memory, which aggregates historical rejections to propose recurring divergence patterns as rescue candidates, and Semantic Consistency Gating, which verifies candidate admissibility using probability ratios instead of exact token matching. Our evaluation across diverse large language models demonstrates that CSD outperforms existing methods, achieving a peak throughput speedup of 2.33x. CSD preserves model accuracy across all tasks while further boosting performance on complex reasoning datasets. These results establish CSD as a highly effective, lightweight solution for practical LLM deployments.

Chinese Translation

推测解码通过让草稿令牌绕过全面验证来加速自回归生成，但传统框架常常遭遇频繁的错误拒绝，尤其是在草稿模型生成语义上正确但词汇上偏离的输出时。本文提出了校准的推测解码（Calibrated Speculative Decoding, CSD），这是一种无训练的框架，可以恢复标准验证过程中被丢弃的有效令牌。CSD遵循“基于频率的候选选择和概率保护的接受”原则，结合了两个轻量级模块：在线修正记忆（Online Correction Memory），该模块聚合历史拒绝记录，以提出重复的偏离模式作为救援候选；语义一致性门控（Semantic Consistency Gating），该模块使用概率比而非精确的令牌匹配来验证候选的可接受性。我们在多种大型语言模型上的评估表明，CSD优于现有方法，达到峰值吞吐量加速比2.33倍。CSD在所有任务中保持模型准确性，同时进一步提升了在复杂推理数据集上的表现。这些结果确立了CSD作为一种高效、轻量的解决方案，适用于实际的大型语言模型部署。
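
The two modules named in the abstract can be sketched together as follows, assuming greedy decoding in which standard verification accepts a draft token only when it equals the target model's top token. The class name, `min_count`, `ratio_threshold`, and the exact form of the probability ratio are assumptions made for illustration; the abstract does not give these details.

```python
from collections import Counter


class CorrectionMemory:
    """Counts past (draft token, target token) rejections."""

    def __init__(self, min_count=2):
        self.counts = Counter()
        self.min_count = min_count  # assumed frequency threshold

    def record_rejection(self, draft_token, target_token):
        self.counts[(draft_token, target_token)] += 1

    def is_rescue_candidate(self, draft_token, target_token):
        """A divergence pattern qualifies once it has recurred often enough."""
        return self.counts[(draft_token, target_token)] >= self.min_count


def ratio_gate(target_probs, draft_token, ratio_threshold=0.5):
    """Accept if the draft token is nearly as probable as the top token."""
    top_p = max(target_probs.values())
    return target_probs.get(draft_token, 0.0) / top_p >= ratio_threshold


def accept(memory, target_probs, draft_token):
    """Exact match first; otherwise frequency-guided, probability-guarded rescue."""
    target_token = max(target_probs, key=target_probs.get)
    if draft_token == target_token:
        return True
    if memory.is_rescue_candidate(draft_token, target_token) and ratio_gate(
        target_probs, draft_token
    ):
        return True
    memory.record_rejection(draft_token, target_token)
    return False


# Hypothetical target distribution where "big" is a plausible stand-in for "large".
memory = CorrectionMemory(min_count=2)
probs = {"large": 0.5, "big": 0.4, "tiny": 0.1}
decisions = [accept(memory, probs, "big") for _ in range(3)]
```

In this toy run the first two "big" drafts are rejected and recorded, and the third is rescued because the pattern has recurred and its probability ratio (0.8) clears the gate; a low-probability token such as "tiny" would never pass the gate regardless of how often it recurs.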

cs.CL / 51 / 2604.13686

IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages

IndicDB -- 印度语言的多语言文本到SQL能力基准测试

Dawar, Aviral, Karanth, Roshan, Goyal, Vikram, Kumar, Dhruv

Abstract

While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a gap in real-world, non-Western applications. We present IndicDB, a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages. The relational schemas are sourced from open-data platforms, including the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP), ensuring realistic administrative data complexity. IndicDB comprises 20 databases containing 237 tables in total. To convert denormalized government data into rich relational structures, we employ an iterative three-agent framework (Architect, Auditor, Refiner) to ensure structural rigor and high relational density (11.85 tables per database; join depths up to six). Our pipeline is value-aware, difficulty-calibrated, and join-enforced, generating 15,617 tasks across English, Hindi, and five other Indic languages. We evaluate the cross-lingual semantic parsing performance of state-of-the-art models (DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, Qwen3) across these seven linguistic variants. Results show a 9.00% performance drop from English to Indic languages, revealing an "Indic Gap" driven by harder schema linking, increased structural ambiguity, and limited external knowledge. IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL. Code and data: https://anonymous.4open.science/r/multilingualText2Sql-Indic--DDCC/

Chinese Translation

尽管大型语言模型（LLMs）在文本到SQL性能方面取得了显著进展，但现有基准测试主要集中在西方背景和简化模式上，导致在现实世界的非西方应用中存在空白。我们提出了IndicDB，这是一个多语言文本到SQL基准，用于评估跨语言语义解析在多种印度语言中的表现。关系模式来源于开放数据平台，包括国家数据与分析平台（NDAP）和印度数据门户（IDP），确保了现实的行政数据复杂性。IndicDB包含20个数据库，涵盖237个表。为了将非规范化的政府数据转换为丰富的关系结构，我们采用了一个迭代的三代理框架（架构师、审计员、精炼者），以确保结构的严谨性和高关系密度（每个数据库11.85个表；连接深度最高可达六）。我们的管道是价值感知的、难度校准的，并且强制连接，生成了跨英语、印地语和五种印度语言的15,617个任务。我们评估了最先进模型（DeepSeek v3.2、MiniMax 2.7、LLaMA 3.3、Qwen3）在七种语言变体中的跨语言语义解析性能。结果显示，从英语到印度语言的性能下降了9.00%，揭示了由于更难的模式链接、增加的结构模糊性和有限的外部知识导致的“印度差距”。IndicDB作为多语言文本到SQL的严格基准。代码和数据：https://anonymous.4open.science/r/multilingualText2Sql-Indic--DDCC/

cs.CL / 52 / 2604.13692

Breaking the Generator Barrier: Disentangled Representation for Generalizable AI-Text Detection

打破生成器壁垒：用于可泛化AI文本检测的解耦表示

Pu, Xiao, Cheng, Zepeng, Yuan, Lin, Wu, Yu, Bi, Xiuli

Abstract

As large language models (LLMs) generate text that increasingly resembles human writing, the subtle cues that distinguish AI-generated content from human-written content become increasingly challenging to capture. Reliance on generator-specific artifacts is inherently unstable, since new models emerge rapidly and reduce the robustness of such shortcuts. This makes generalization to unseen generators a central and challenging problem for AI-text detection. To tackle this challenge, we propose a progressively structured framework that disentangles AI-detection semantics from generator-aware artifacts. This is achieved through a compact latent encoding that encourages semantic minimality, followed by perturbation-based regularization to reduce residual entanglement, and finally a discriminative adaptation stage that aligns representations with task objectives. Experiments on the MAGE benchmark, covering 20 representative LLMs across 7 categories, demonstrate consistent improvements over state-of-the-art methods, achieving up to a 24.2% accuracy gain and a 26.2% F1 improvement. Notably, performance continues to improve as the diversity of training generators increases, confirming strong scalability and generalization in open-set scenarios. Our source code will be publicly available at https://github.com/PuXiao06/DRGD.

Chinese Translation

随着大型语言模型（LLMs）生成的文本越来越像人类写作，区分AI生成内容与人类撰写内容的微妙线索变得越来越难以捕捉。依赖于特定生成器的伪影本质上是不稳定的，因为新模型迅速出现，降低了这种捷径的鲁棒性。这使得未见生成器的泛化成为AI文本检测的一个核心且具有挑战性的问题。为了解决这一挑战，我们提出了一个逐步结构化的框架，将AI检测语义与生成器感知伪影解耦。这是通过紧凑的潜在编码实现的，该编码鼓励语义的最小化，随后通过基于扰动的正则化减少残余的纠缠，最后进行一个判别适应阶段，将表示与任务目标对齐。在涵盖7个类别的20个代表性LLM的MAGE基准测试中的实验表明，相较于最先进的方法，我们的方法在准确率上提高了高达24.2%，F1值提高了26.2%。值得注意的是，随着训练生成器多样性的增加，性能持续改善，确认了在开放集场景中的强可扩展性和泛化能力。我们的源代码将公开发布在https://github.com/PuXiao06/DRGD。

cs.CL / 53 / 2604.13705

Beyond Arrow's Impossibility: Fairness as an Emergent Property of Multi-Agent Collaboration

超越阿罗的不可能性：公平作为多智能体协作的涌现属性

Chaki, Sayan Kumar, Gourru, Antoine, Velcin, Julien

Abstract

Fairness in language models is typically studied as a property of a single, centrally optimized model. As large language models become increasingly agentic, we propose that fairness emerges through interaction and exchange. We study this via a controlled hospital triage framework in which two agents negotiate over three structured debate rounds. One agent is aligned to a specific ethical framework via retrieval-augmented generation (RAG), while the other is either unaligned or adversarially prompted to favor demographic groups over clinical need. We find that alignment systematically shapes negotiation strategies and allocation patterns, and that neither agent's allocation is ethically adequate in isolation, yet their joint final allocation can satisfy fairness criteria that neither would have reached alone. Aligned agents partially moderate bias through contestation rather than override, acting as corrective patches that restore access for marginalized groups without fully converting a biased counterpart. We further observe that even explicitly aligned agents exhibit intrinsic biases toward certain frameworks, consistent with known left-leaning tendencies in LLMs. We connect these limits to Arrow's Impossibility Theorem: no aggregation mechanism can simultaneously satisfy all desiderata of collective rationality, and multi-agent deliberation navigates rather than resolves this constraint. Our results reposition fairness as an emergent, procedural property of decentralized agent interaction, and the system rather than the individual agent as the appropriate unit of evaluation.

Chinese Translation

语言模型中的公平性通常被视为单一、集中优化模型的属性。随着大型语言模型变得越来越具智能性，我们提出公平性通过互动和交流而涌现。我们通过一个受控的医院分诊框架进行研究，在该框架中，两个智能体在三个结构化的辩论回合中进行协商。一个智能体通过检索增强生成（RAG）与特定的伦理框架对齐，而另一个智能体则要么未对齐，要么被对抗性地提示以偏向于人口群体而非临床需求。我们的研究发现，对齐系统性地影响了协商策略和分配模式，并且任何一个智能体的分配在孤立状态下都不符合伦理要求，但它们的联合最终分配可以满足单独任何一方都无法达到的公平标准。对齐的智能体通过争论而非覆盖部分缓解偏见，充当纠正补丁，恢复边缘群体的访问权，而不完全转变偏见的对手。我们进一步观察到，即使是明确对齐的智能体也对某些框架表现出内在偏见，这与大型语言模型已知的左倾倾向一致。我们将这些限制与阿罗的不可能性定理联系起来：没有任何聚合机制可以同时满足集体理性的所有期望，而多智能体的审议则是在这一约束中进行导航而非解决。我们的结果将公平重新定位为去中心化智能体互动的涌现性、程序性属性，并将系统而非单个智能体视为适当的评估单位。

cs.CL / 54 / 2604.13706

Co-FactChecker: A Framework for Human-AI Collaborative Claim Verification Using Large Reasoning Models

Co-FactChecker：一种基于大型推理模型的人机协作声明验证框架

Sahnan, Dhruv, Dutta, Subhabrata, Chakraborty, Tanmoy, Nakov, Preslav, Gurevych, Iryna

Abstract

Professional fact-checkers rely on domain knowledge and deep contextual understanding to verify claims. Large language models (LLMs) and large reasoning models (LRMs) lack such grounding and primarily reason from available evidence alone, creating a mismatch between expert-led and fully automated claim verification. To mitigate this gap, we posit human-AI collaboration as a more promising path forward, where expert feedback, grounded in real-world knowledge and domain expertise, guides the model's reasoning. However, existing LRMs are hard to calibrate to natural language feedback, particularly in a multi-turn interaction setup. We propose Co-FactChecker, a framework for human-AI collaborative claim verification. We introduce a new interaction paradigm that treats the model's thinking trace as a shared scratchpad. Co-FactChecker translates expert feedback into trace-edits that introduce targeted modifications to the trace, sidestepping the shortcomings of dialogue-based interaction. We provide theoretical results showing that trace-editing offers advantages over multi-turn dialogue, and our automatic evaluations demonstrate that Co-FactChecker outperforms existing autonomous and human-AI collaboration approaches. Human evaluations further show that Co-FactChecker is preferred over multi-turn dialogue, producing higher-quality reasoning and verdicts along with thinking traces that are easier to interpret and more useful.

Chinese Translation

专业的事实核查员依赖于领域知识和深刻的上下文理解来验证声明。大型语言模型（LLMs）和大型推理模型（LRMs）缺乏这种基础，主要依赖可用证据进行推理，从而导致专家主导的验证与完全自动化的声明验证之间存在不匹配。为了弥补这一差距，我们认为人机协作是一条更有前景的道路，其中基于现实世界知识和领域专业知识的专家反馈指导模型的推理。然而，现有的LRMs难以对自然语言反馈进行校准，特别是在多轮交互设置中。我们提出了Co-FactChecker，一个人机协作声明验证框架。我们引入了一种新的交互范式，将模型的思维轨迹视为共享的草稿纸。Co-FactChecker将专家反馈转化为轨迹编辑，针对性地对轨迹进行修改，避免了基于对话交互的缺陷。我们提供了理论结果，表明轨迹编辑相较于多轮对话具有优势，我们的自动评估显示Co-FactChecker优于现有的自主和人机协作方法。人类评估进一步表明，Co-FactChecker比多轮对话更受欢迎，产生更高质量的推理和裁决，并且思维轨迹相对更易于解释和更有用。

cs.CL / 55 / 2604.13713

Learning the Cue or Learning the Word? Analyzing Generalization in Metaphor Detection for Verbs

学习线索还是学习词汇？分析动词隐喻检测中的泛化能力

Kurtyigit, Sinan, Walde, Sabine Schulte im, Fraser, Alexander

Abstract

Metaphor detection models achieve strong benchmark performance, yet it remains unclear whether this reflects transferable generalization or lexical memorization. To address this, we analyze generalization in metaphor detection through RoBERTa, the shared backbone of many state-of-the-art systems, focusing on English verbs using the VU Amsterdam Metaphor Corpus. We introduce a controlled lexical hold-out setup where all instances of selected target lemmas are strictly excluded from fine-tuning, and compare predictions on these Held-out lemmas against Exposed lemmas (verbs seen during fine-tuning). While the model performs best on Exposed lemmas, it maintains robust performance on Held-out lemmas. Further analysis reveals that sentence context alone is sufficient to match full-model performance on Held-out lemmas, whereas static verb-level embeddings are not. Together, these results suggest that generalization is primarily driven by "learning the cue" (transferable contextual patterns), while "learning the word" (verb-specific memorization) provides an additive boost when lexical exposure is available.

Chinese Translation

隐喻检测模型在基准测试中表现出色，但尚不清楚这是否反映了可转移的泛化能力或词汇记忆。为了解决这一问题，我们通过 RoBERTa（许多最先进系统的共享基础模型）分析隐喻检测中的泛化能力，重点关注使用 VU 阿姆斯特丹隐喻语料库的英语动词。我们引入了一种受控的词汇保留设置，严格排除所有选定目标词元的实例以进行微调，并将这些保留词元的预测与暴露词元（在微调期间看到的动词）进行比较。尽管模型在暴露词元上的表现最佳，但在保留词元上仍保持稳健的性能。进一步分析表明，仅句子上下文就足以匹配在保留词元上的全模型性能，而静态动词级嵌入则无法实现。综合这些结果表明，泛化主要是由“学习线索”（可转移的上下文模式）驱动的，而“学习词汇”（动词特定的记忆）在词汇暴露可用时提供了额外的提升。
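
The controlled lexical hold-out setup can be sketched as a pair of filters: one that removes every instance of the selected lemmas from the fine-tuning data, and one that groups evaluation instances by whether their lemma was seen. The record layout (a dict with a `lemma` key), the function names, and the toy sentences are assumptions for illustration, not the corpus format or the authors' code.

```python
def exclude_held_out(instances, held_out_lemmas):
    """Drop every fine-tuning instance whose target verb lemma is held out."""
    held = set(held_out_lemmas)
    return [x for x in instances if x["lemma"] not in held]


def group_by_exposure(eval_instances, finetune_instances):
    """Split evaluation instances into Exposed and Held-out groups."""
    seen = {x["lemma"] for x in finetune_instances}
    exposed = [x for x in eval_instances if x["lemma"] in seen]
    held_out = [x for x in eval_instances if x["lemma"] not in seen]
    return exposed, held_out


# Toy data: label 1 marks a metaphorical use of the verb.
train_pool = [
    {"lemma": "attack", "text": "She attacked his argument.", "label": 1},
    {"lemma": "attack", "text": "The dog attacked the intruder.", "label": 0},
    {"lemma": "grasp", "text": "He grasped the idea.", "label": 1},
]
eval_pool = [
    {"lemma": "attack", "text": "Critics attacked the plan.", "label": 1},
    {"lemma": "grasp", "text": "She grasped the rope.", "label": 0},
]

finetune = exclude_held_out(train_pool, held_out_lemmas=["grasp"])
exposed, held_out = group_by_exposure(eval_pool, finetune)
```

Comparing accuracy on `exposed` against `held_out` then separates what the model gains from contextual cues from what it gains by having seen the verb itself.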

cs.CL / 56 / 2604.13717

An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2

对 RewardBench 2 上实用 LLM 作为评判者改进技术的实证研究

Lail, Ryan

Abstract

LLM-as-a-judge, using a language model to score or rank candidate responses, is widely used as a scalable alternative to human evaluation in RLHF pipelines, benchmarking, and application layer evaluations (evals). However, judgment reliability depends heavily on prompting and aggregation strategy. We present an empirical investigation of practical, drop-in techniques that improve GPT-5.4 judge accuracy on RewardBench 2 without any finetuning. Two techniques account for nearly all available gains: task-specific criteria injection (+3.0pp at negligible cost) and ensemble scoring (+9.8pp at 5x cost). Combined, they reach 83.6% accuracy, +11.9pp over the 71.7% baseline. Our investigation also covers three further techniques (calibration context, adaptive model escalation, and soft blending) which did not reliably improve on criteria + ensembling at comparable cost. Cheaper model tiers benefit disproportionately from ensembling: GPT-5.4 mini with k=8 achieves 79.2% at 1.2x baseline cost, and GPT-5.4 nano with k=8 reaches 71.4% at 0.4x baseline cost, making high-accuracy LLM judges accessible at low cost.

Chinese Translation

LLM 作为评判者，利用语言模型对候选响应进行评分或排名，作为人类评估在强化学习人类反馈（RLHF）管道、基准测试和应用层评估（evals）中的可扩展替代方案被广泛使用。然而，判断的可靠性在很大程度上依赖于提示和聚合策略。我们提出了一项实证研究，探讨了在不进行任何微调的情况下，改善 GPT-5.4 评判准确性的实用、即插即用技术。两种技术几乎占据了所有可用的增益：任务特定标准注入（+3.0 个百分点，成本几乎可以忽略不计）和集成评分（+9.8 个百分点，成本为 5 倍）。两者结合，达到了 83.6% 的准确率，比 71.7% 的基线提高了 11.9 个百分点。我们的研究还涵盖了三种进一步的技术（校准上下文、自适应模型升级和软混合），这些技术在相似成本下未能可靠地改善标准 + 集成的效果。较便宜的模型层次从集成中受益不成比例：GPT-5.4 mini 在 k=8 时以 1.2 倍基线成本达到了 79.2%，而 GPT-5.4 nano 在 k=8 时以 0.4 倍基线成本达到了 71.4%，使得高准确率的 LLM 评判者以低成本变得可及。
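
Ensemble scoring, as the abstract describes it, trades extra judge calls for accuracy. A minimal sketch, assuming the ensemble is a plain mean over k independent judge calls: `judge` is a hypothetical callable standing in for a real model call, and the stub below uses fixed, repeating noise purely so the example is deterministic.

```python
import itertools
from statistics import mean


def ensemble_score(judge, prompt, response, k=5):
    """Average k independent judge scores for one response."""
    return mean(judge(prompt, response) for _ in range(k))


def pick_best(judge, prompt, candidates, k=5):
    """Return the candidate with the highest ensemble score."""
    return max(candidates, key=lambda r: ensemble_score(judge, prompt, r, k))


# Stub judge: a true quality score plus repeating noise (hypothetical values).
_noise = itertools.cycle([2.0, -2.0, 0.0, 1.0, -1.0])
_quality = {"good answer": 7.0, "bad answer": 6.0}


def stub_judge(prompt, response):
    return _quality[response] + next(_noise)


best = pick_best(stub_judge, "question", ["good answer", "bad answer"], k=5)
```

With a single call the noise here can exceed the one-point quality gap, whereas averaging five calls cancels it. The reported figures are also worth reading carefully: 71.7% + 11.9pp gives the combined 83.6%, which is less than the sum of the individual gains (3.0pp + 9.8pp = 12.8pp), so the two techniques are not fully additive.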

cs.CL / 57 / 2604.13731

Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

Doc-V*: 从粗到细的多页文档视觉问答交互推理

Zheng, Yuanlei, Fu, Pei, Li, Hang, Wang, Ziyang, Zhang, Yuyi, Ruan, Wenyu, Zhang, Xiaojin, Wei, Zhongyu, Luo, Zhenbo, Luan, Jian, Chen, Wei, Bai, Xiang

Abstract

Multi-page Document Visual Question Answering (DocVQA) requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc-V*, an OCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation. Doc-V* begins with a thumbnail overview, then actively navigates via semantic retrieval and targeted page fetching, and aggregates evidence in a structured working memory for grounded reasoning. Trained by imitation learning from expert trajectories and further optimized with Group Relative Policy Optimization, Doc-V* balances answer accuracy with evidence-seeking efficiency. Across five benchmarks, Doc-V* outperforms open-source baselines and approaches proprietary models, improving out-of-domain performance by up to 47.9% over the RAG baseline. Further results indicate that these gains come from effective evidence aggregation with selective attention rather than from an increased number of input pages.

Chinese Translation

多页文档视觉问答（DocVQA）需要对长且视觉密集的文档中的语义、布局和视觉元素进行推理。现有的无OCR方法在容量和精度之间存在权衡：端到端模型在文档长度上扩展性差，而基于视觉检索的流程则脆弱且被动。我们提出了Doc-V*，一个无OCR的自主框架，将多页DocVQA视为顺序证据聚合。Doc-V*首先提供缩略图概览，然后通过语义检索和目标页面获取主动导航，并在结构化的工作记忆中聚合证据以进行有根据的推理。通过模仿学习专家轨迹进行训练，并进一步通过组相对策略优化（Group Relative Policy Optimization）进行优化，Doc-V*在答案准确性与寻求证据的效率之间取得了平衡。在五个基准测试中，Doc-V*超越了开源基线，接近专有模型，相较于RAG基线提高了高达47.9%的领域外表现。其他结果显示，通过选择性注意力实现了有效的证据聚合，而不是增加输入页面。

cs.CL / 58 / 2604.13756

MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

MedRCube：用于医学影像中多模态大语言模型的细粒度和深入评估的多维框架

Bao, Zhijie, Chen, Fangke, Bao, Licheng, Zhang, Chenhui, Chen, Wei, Peng, Jiajie, Wei, Zhongyu

Abstract

The potential of Multimodal Large Language Models (MLLMs) in the domain of medical imaging raises the demand for systematic and rigorous evaluation frameworks that are aligned with real-world medical imaging practice. Existing practices that report single or coarse-grained metrics lack the granularity required for specialized clinical support and fail to assess the reliability of reasoning mechanisms. To address this, we propose a paradigm shift toward multidimensional, fine-grained, and in-depth evaluation. Based on a two-stage systematic construction pipeline designed for this paradigm, we instantiate it with MedRCube. We benchmark 33 MLLMs, among which Lingshu-32B achieves top-tier performance. Crucially, MedRCube exposes a series of pronounced insights inaccessible under prior evaluation settings. Furthermore, we introduce a credibility evaluation subset to quantify reasoning credibility and uncover a highly significant positive association between shortcut behavior and diagnostic task performance, raising concerns for clinically trustworthy deployment. The resources of this work can be found at https://github.com/F1mc/MedRCube.

Chinese Translation

多模态大语言模型（MLLMs）在医学影像领域的潜力引发了对与现实世界医学影像实践相一致的系统性和严格评估框架的需求。现有的实践仅报告单一或粗粒度的指标，缺乏专业临床支持所需的细粒度，并未能评估推理机制的可靠性。为了解决这一问题，我们提出了一种向多维、细粒度和深入评估的范式转变。基于为这一范式设计的两阶段系统构建流程，我们用MedRCube进行了实例化。我们对33个MLLMs进行了基准测试，其中Lingshu-32B表现出色，达到了顶级性能。重要的是，MedRCube揭示了一系列在先前评估设置下无法获得的显著见解。此外，我们引入了一个可信度评估子集，以量化推理的可信度，发现快捷行为与诊断任务表现之间存在高度显著的正相关关系，这引发了对临床可信部署的担忧。本研究的资源可以在https://github.com/F1mc/MedRCube找到。

cs.CL / 59 / 2604.13777

From Anchors to Supervision: Memory-Graph Guided Corpus-Free Unlearning for Large Language Models

从锚点到监督：基于记忆图的无语料大语言模型去学习方法

Li, Wenxuan, Zhang, Zhenfei, Zhang, Mi, Hong, Geng, Wen, Mi, You, Xiaoyu, Yang, Min

Abstract

Large language models (LLMs) may memorize sensitive or copyrighted content, raising significant privacy and legal concerns. While machine unlearning has emerged as a potential remedy, prevailing paradigms rely on user-provided forget sets, making unlearning requests difficult to audit and exposing systems to secondary leakage and malicious abuse. We propose MAGE, a Memory-grAph Guided Erasure framework for user-minimized, corpus-free unlearning. Given only a lightweight user anchor that identifies a target entity, MAGE probes the target LLM to recover target-related memorization, organizes it into a weighted local memory graph, and synthesizes scoped supervision for unlearning. MAGE is model-agnostic, can be plugged into standard unlearning methods, and requires no access to the original training corpus. Experiments on two benchmarks, TOFU and RWKU, demonstrate that MAGE's self-generated supervision achieves effective unlearning performance comparable to supervision generated with external reference, while preserving overall utility. These results support a practical and auditable unlearning workflow driven by minimal anchors rather than user-supplied forget corpora.

Chinese Translation

大型语言模型（LLMs）可能会记忆敏感或受版权保护的内容，这引发了重大的隐私和法律问题。尽管机器去学习已成为一种潜在的解决方案，但现有的范式依赖于用户提供的遗忘集，使得去学习请求难以审计，并使系统面临二次泄露和恶意滥用的风险。我们提出了MAGE，一种基于记忆图的用户最小化、无语料去学习框架。MAGE仅需一个轻量级的用户锚点来识别目标实体，便可探测目标LLM以恢复与目标相关的记忆，将其组织成加权的局部记忆图，并合成用于去学习的定向监督。MAGE是模型无关的，可以与标准去学习方法结合使用，并且无需访问原始训练语料。对两个基准数据集TOFU和RWKU的实验表明，MAGE自生成的监督能够实现与使用外部参考生成的监督相当的有效去学习性能，同时保持整体效用。这些结果支持一种以最小锚点驱动的实用且可审计的去学习工作流程，而非依赖用户提供的遗忘语料。

cs.CL / 60 / 2604.13786

QuantileMark: A Message-Symmetric Multi-bit Watermark for LLMs

QuantileMark：一种消息对称的多比特水印用于大型语言模型

Zhu, Junlin, Huang, Baizhou, Wan, Xiaojun

Abstract

As large language models become standard backends for content generation, practical provenance increasingly requires multi-bit watermarking. In provider-internal deployments, a key requirement is message symmetry: the message itself should not systematically affect either text quality or verification outcomes. Vocabulary-partition watermarks can break message symmetry in low-entropy decoding: some messages are assigned most of the probability mass, while others are forced to use tail tokens. This makes embedding quality and message decoding accuracy message-dependent. We propose QuantileMark, a white-box multi-bit watermark that embeds messages within the continuous cumulative probability interval $[0, 1)$. At each step, QuantileMark partitions this interval into $M$ equal-mass bins and samples strictly from the bin assigned to the target symbol, ensuring a fixed $1/M$ probability budget regardless of context entropy. For detection, the verifier reconstructs the same partition under teacher forcing, computes posteriors over latent bins, and aggregates evidence for verification. We prove message-unbiasedness, a property ensuring that the base distribution is recovered when averaging over messages. This provides a theoretical foundation for generation-side symmetry, while the equal-mass design additionally promotes uniform evidence strength across messages on the detection side. Empirical results on C4 continuation and LFQA show improved multi-bit recovery and detection robustness over strong baselines, with negligible impact on generation quality. Our code is available at GitHub (https://github.com/zzzjunlin/QuantileMark).

Chinese Translation

随着大型语言模型成为内容生成的标准后端，实际的来源追溯越来越需要多比特水印。在提供者内部部署中，一个关键要求是消息对称性：消息本身不应系统性地影响文本质量或验证结果。词汇划分水印在低熵解码中可能破坏消息对称性：某些消息被分配了大部分概率质量，而其他消息则被迫使用尾部标记。这使得嵌入质量和消息解码准确性依赖于消息。我们提出了QuantileMark，一种白盒多比特水印，它在连续累积概率区间 $[0, 1)$ 内嵌入消息。在每一步中，QuantileMark 将该区间划分为 $M$ 个等质量的箱，并严格从分配给目标符号的箱中抽样，确保无论上下文熵如何，固定的 $1/M$ 概率预算。为了检测，验证者在教师强制下重建相同的划分，计算潜在箱的后验分布，并聚合证据以进行验证。我们证明了消息无偏性，这一特性确保在对消息取平均时恢复基础分布。这为生成端的对称性提供了理论基础，而等质量设计还促进了检测端各消息间证据强度的均匀性。在 C4 续写和 LFQA 上的实证结果显示，相较于强基线，具有更好的多比特恢复和检测鲁棒性，对生成质量的影响微乎其微。我们的代码可在 GitHub 上获取 (https://github.com/zzzjunlin/QuantileMark)。

cs.CL / 61 / 2604.13787

ToolOmni: Enabling Open-World Tool Use via Agentic learning with Proactive Retrieval and Grounded Execution

ToolOmni：通过主动检索和基于实证的执行实现开放世界工具使用的智能学习

Huang, Shouzheng, Zhang, Meishan, Hu, Baotian, Zhang, Min

Abstract

Large Language Models (LLMs) enhance their problem-solving capability by utilizing external tools. However, in open-world scenarios with massive and evolving tool repositories, existing methods that rely on static embedding retrieval or on parameter memorization of tools struggle, respectively, to align user intent with tool semantics or to generalize to unseen tools, leading to suboptimal accuracy in open-world tool retrieval and execution. To address these limitations, we present ToolOmni, a unified agentic framework that enables open-world tool use in LLMs through proactive retrieval and grounded execution within a reasoning loop. First, we construct a cold-start multi-turn interaction dataset to instill foundational agentic capabilities via Supervised Fine-Tuning (SFT). Then, we introduce open-world tool learning based on a Decoupled Multi-Objective GRPO algorithm, which simultaneously optimizes LLMs for both tool retrieval accuracy and execution efficacy in online environments. Extensive experiments demonstrate that ToolOmni achieves state-of-the-art performance in both retrieval and execution, surpassing strong baselines by +10.8% in end-to-end execution success rate, while exhibiting exceptional robustness and generalization capabilities.

Chinese Translation

大型语言模型（LLMs）通过利用外部工具增强其问题解决能力。然而，在拥有大量和不断演变的工具库的开放世界场景中，现有方法依赖于静态嵌入检索或工具参数记忆，分别难以将用户意图与工具语义对齐或推广到未见过的工具，从而导致开放世界工具检索和执行的准确性不佳。为了解决这些问题，我们提出了ToolOmni，一个统一的智能框架，使LLMs能够通过主动检索和基于实证的执行在推理循环中实现开放世界工具使用。首先，我们构建了一个冷启动多轮交互数据集，通过监督微调（Supervised Fine-Tuning, SFT）培养基础的智能能力。然后，我们基于解耦多目标GRPO算法引入开放世界工具学习，该算法同时优化LLMs在在线环境中的工具检索准确性和执行效率。大量实验表明，ToolOmni在检索和执行方面都达到了最先进的性能，在端到端执行成功率上超越了强基线，提升幅度达到+10.8%，同时展现出卓越的鲁棒性和泛化能力。

cs.CL / 62 / 2604.13828

MUSE: Multi-Domain Chinese User Simulation via Self-Evolving Profiles and Rubric-Guided Alignment

MUSE：通过自我演化的用户画像和评分引导的对齐实现多领域中文用户模拟

Liu, Zihao, Zhou, Hantao, Li, Jiguo, Xu, Jun, Gao, Jiuchong, Hao, Jinghua, He, Renqing, Wang, Peng

Abstract

User simulators are essential for the scalable training and evaluation of interactive AI systems. However, existing approaches often rely on shallow user profiling, struggle to maintain persona consistency over long interactions, and are largely limited to English or single-domain settings. We present MUSE, a multi-domain Chinese user simulation framework designed to generate human-like, controllable, and behaviorally consistent responses. First, we propose Iterative Profile Self-Evolution (IPSE), which gradually optimizes user profiles by comparing and reasoning discrepancies between simulated trajectories and real dialogue behaviors. We then apply Role-Reversal Supervised Fine-Tuning to improve local response realism and human-like expression. To enable fine-grained behavioral alignment, we further train a specialized rubric-based reward model and incorporate it into rubric-guided multi-turn reinforcement learning, which optimizes the simulator at the dialogue level and enhances long-horizon behavioral consistency. Experiments show that MUSE consistently outperforms strong baselines in both utterance-level and session-level evaluations, generating responses that are more realistic, coherent, and persona-consistent over extended interactions.

Chinese Translation

用户模拟器对于可扩展的交互式人工智能系统的训练和评估至关重要。然而，现有的方法往往依赖于浅层用户画像，难以在长时间交互中保持角色一致性，并且主要局限于英语或单一领域的设置。我们提出了MUSE，一个旨在生成类人、可控且行为一致的响应的多领域中文用户模拟框架。首先，我们提出了迭代画像自我演化（Iterative Profile Self-Evolution，IPSE），通过比较和推理模拟轨迹与真实对话行为之间的差异，逐步优化用户画像。接着，我们应用角色反转监督微调（Role-Reversal Supervised Fine-Tuning）来提高局部响应的真实性和类人表达。为了实现细粒度的行为对齐，我们进一步训练了一个基于评分的专用奖励模型，并将其纳入评分引导的多轮强化学习中，这在对话层面优化了模拟器，并增强了长时间行为一致性。实验表明，MUSE在发言级和会话级评估中始终优于强基线，生成的响应在长时间交互中更加真实、一致且符合角色特征。

cs.CL / 63 / 2604.13833

Robust Reward Modeling for Large Language Models via Causal Decomposition

通过因果分解实现大语言模型的稳健奖励建模

Lu, Yunsheng, Yang, Zijiang, Pan, Licheng, Chu, Zhixuan

Abstract

Reward models are central to aligning large language models, yet they often overfit to spurious cues such as response length and overly agreeable tone. Most prior work weakens these cues directly by penalizing or controlling specific artifacts, but it does not explicitly encourage the model to ground preferences in the prompt's intent. We learn a decoder that maps a candidate answer to the latent intent embedding of the input. The reconstruction error is used as a signal to regularize the reward model training. We provide theoretical evidence that this signal emphasizes prompt-dependent information while suppressing prompt-independent shortcuts. Across math, helpfulness, and safety benchmarks, the decoder selects shorter and less sycophantic candidates with 0.877 accuracy. Incorporating this signal into RM training in Gemma-2-2B-it and Gemma-2-9B-it increases RewardBench accuracy from 0.832 to 0.868. For Best-of-N selection, our framework increases length-controlled win rates while producing shorter outputs, and remains robust to lengthening and mild off-topic drift in controlled rewrite tests.

Chinese Translation

奖励模型在对齐大语言模型中至关重要，但它们往往会过度拟合于一些虚假线索，例如响应长度和过于迎合的语气。以往的研究大多通过惩罚或控制特定的伪影来削弱这些线索，但并未明确鼓励模型将偏好与提示的意图相结合。我们学习了一个解码器，将候选答案映射到输入的潜在意图嵌入。重构误差被用作信号，以规范奖励模型的训练。我们提供了理论证据，表明该信号强调了依赖于提示的信息，同时抑制了与提示无关的捷径。在数学、帮助性和安全性基准测试中，解码器以0.877的准确率选择了更短且不那么谄媚的候选项。在Gemma-2-2B-it和Gemma-2-9B-it中将该信号纳入奖励模型训练，使RewardBench的准确率从0.832提高到0.868。在Best-of-N选择中，我们的框架提高了长度控制的胜率，同时生成更短的输出，并在控制重写测试中对长度增加和轻微偏离主题保持稳健性。

cs.CL / 64 / 2604.13846

Beyond Static Personas: Situational Personality Steering for Large Language Models

超越静态角色：大型语言模型的情境个性引导

Wei, Zesheng, Li, Mengxiang, Wang, Zilei, Deng, Yang

Abstract

Personalized Large Language Models (LLMs) facilitate more natural, human-like interactions in human-centric applications. However, existing personalization methods are constrained by limited controllability and high resource demands. Furthermore, their reliance on static personality modeling restricts adaptability across varying situations. To address these limitations, we first demonstrate the existence of situation-dependency and consistent situation-behavior patterns within LLM personalities through a multi-perspective analysis of persona neurons. Building on these insights, we propose IRIS, a training-free, neuron-based Identify-Retrieve-Steer framework for advanced situational personality steering. Our approach comprises situational persona neuron identification, situation-aware neuron retrieval, and similarity-weighted steering. We empirically validate our framework on PersonalityBench and our newly introduced SPBench, a comprehensive situational personality benchmark. Experimental results show that our method surpasses the best-performing baselines, demonstrating IRIS's generalization and robustness to complex, unseen situations and different model architectures.

Chinese Translation

个性化大型语言模型（LLMs）在以人为中心的应用中促进了更自然、更类人化的交互。然而，现有的个性化方法受到有限可控性和高资源需求的限制。此外，它们对静态个性建模的依赖限制了在不同情境下的适应性。为了解决这些局限性，我们首先通过对角色神经元的多角度分析，展示了情境依赖性和一致的情境-行为模式在LLM个性中的存在。在这些见解的基础上，我们提出了IRIS，一个无训练、基于神经元的识别-检索-引导框架，用于先进的情境个性引导。我们的方法包括情境角色神经元识别、情境感知神经元检索和相似性加权引导。我们在PersonalityBench和我们新引入的SPBench（一个全面的情境个性基准）上对我们的框架进行了实证验证。实验结果表明，我们的方法超越了最佳基线，展示了IRIS在复杂、未见情境和不同模型架构下的泛化能力和鲁棒性。

cs.CL / 65 / 2604.13899

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

我们还需要人类参与吗？在敌意检测的主动学习中比较人类与大型语言模型（LLM）的标注

Hakimi, Ahmad Dawar, Hirlimann, Lea, Augenstein, Isabelle, Schütze, Hinrich

Abstract

Instruction-tuned LLMs can annotate thousands of instances from a short prompt at negligible cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be labelled at once? We investigate both questions on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labelled, 5,000 human-annotated), comparing seven annotation strategies across four encoders to detect anti-immigrant hostility. A classifier trained on 25,974 GPT-5.2 labels (\$43) achieves comparable F1-Macro to one trained on 3,800 human annotations (\$316). Active learning offers little advantage over random sampling in our pre-enriched pool and delivers lower F1 than full LLM annotation at the same cost. However, comparable aggregate F1 masks a systematic difference in error structure: LLM-trained classifiers over-predict the positive class relative to the human gold standard. This divergence concentrates in topically ambiguous discussions where the distinction between anti-immigrant hostility and policy critique is most subtle, suggesting that annotation strategy should be guided not by aggregate F1 alone but by the error profile acceptable for the target application.
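The cost comparison in this abstract can be made concrete with a little arithmetic. The sketch below uses only the four figures quoted above (\$43 for 25,974 LLM labels, \$316 for 3,800 human annotations); the helper name is an assumption, and the calculation says nothing about label quality.

```python
def cost_per_label(total_cost_usd, num_labels):
    """Average cost of one label in US dollars (illustrative helper)."""
    return total_cost_usd / num_labels


llm = cost_per_label(43, 25_974)    # GPT-5.2 labels, figures from the abstract
human = cost_per_label(316, 3_800)  # human annotations, figures from the abstract
ratio = human / llm

print(f"LLM:   ${llm:.4f} per label")
print(f"Human: ${human:.4f} per label")
print(f"Human labels cost about {ratio:.0f}x more per label")
```

On these numbers a human label costs roughly 50 times as much as an LLM label, which is the gap the abstract weighs against the difference in error structure.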

Chinese Translation

经过指令调优的大型语言模型（LLMs）可以以微乎其微的成本从简短提示中标注成千上万的实例。这引发了两个关于主动学习（AL）的问题：在AL循环中，LLM标签能否替代人类标签，以及当整个语料库可以一次性标注时，AL是否仍然必要？我们在一个新的数据集上研究这两个问题，该数据集包含277,902条德国政治TikTok评论（25,974条LLM标注，5,000条人类标注），比较了七种标注策略和四种编码器，以检测反移民敌意。基于25,974条GPT-5.2标签（成本为43美元）训练的分类器在F1-Macro指标上与基于3,800条人类标注（成本为316美元）训练的分类器相当。在我们预先丰富的样本池中，主动学习相较于随机抽样几乎没有优势，并且在相同成本下，其F1值低于完全LLM标注。然而，可比的整体F1掩盖了错误结构的系统性差异：与人类金标准相比，LLM训练的分类器对正类的预测过高。这种偏差集中在主题模糊的讨论中，在这些讨论中，反移民敌意与政策批评之间的区别最为微妙，这表明标注策略不应仅由整体F1指导，而应根据目标应用可接受的错误特征进行指导。

cs.CL / 66 / 2604.13950

Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs

因果性拉桥：表征变换器语言模型中句法岛的梯度阻断

Boguraev, Sasha, Mahowald, Kyle

Abstract

We show how causal interventions in Transformer models provide insights into English syntax by focusing on a long-standing challenge for syntactic theory: syntactic islands. Extraction from coordinated verb phrases is often degraded, yet acceptability varies gradiently with lexical content (e.g., "I know what he hates art and loves" vs. "I know what he looked down and saw"). We show that modern Transformer language models replicate human judgments across this gradient. Using causal interventions that isolate functionally relevant subspaces in Transformer blocks, attention modules, and MLPs, we demonstrate that extraction from coordination islands engages the same filler-gap mechanisms as canonical wh-dependencies, but that these mechanisms are selectively blocked to varying degrees. By projecting a large corpus of unrelated text onto these causally identified subspaces, we derive a novel linguistic hypothesis: the conjunction "and" is represented differently in extractable versus non-extractable constructions, corresponding to expressions encoding relational dependencies versus purely conjunctive uses. These results illustrate how mechanistic interpretability can inform syntax, generating new hypotheses about linguistic representation and processing.

Chinese Translation

我们展示了Transformer模型中的因果干预如何通过关注句法理论中的一个长期难题（句法岛）来提供对英语句法的洞察。从并列动词短语中提取成分通常会降低可接受度，但可接受度随词汇内容呈梯度变化（例如，"I know what he hates art and loves" 与 "I know what he looked down and saw"）。我们表明，现代Transformer语言模型在这一梯度上复现了人类的判断。通过使用因果干预来隔离Transformer块、注意力模块和多层感知器（MLP）中的功能相关子空间，我们证明了从并列岛中提取涉及与典型的 wh-依赖相同的填充语-空位机制，但这些机制在不同程度上被选择性地阻断。通过将大量无关文本投影到这些因果识别出的子空间中，我们得出一个新的语言学假设：连词 "and" 在可提取与不可提取的结构中表征不同，分别对应于编码关系依赖的表达与纯粹的并列用法。这些结果说明了机制可解释性如何为句法研究提供信息，并生成关于语言表征和处理的新假设。

cs.CL / 67 / 2604.13977

How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

我们如何合成高质量的预训练数据？关于提示设计、生成模型和源数据的系统研究

Niklaus, Joel, Yamaguchi, Atsuki, Štefánik, Michal, Penedo, Guilherme, Kydlíček, Hynek, Bakouch, Elie, Tunstall, Lewis, Beeching, Edward Emanuel, Frere, Thibaud, Raffel, Colin, von Werra, Leandro, Wolf, Thomas

Abstract

Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled experiments, generating over one trillion tokens, to identify critical factors in rephrasing web text into synthetic pretraining data. Our results reveal that structured output formats, such as tables, math problems, FAQs, and tutorials, consistently outperform both curated web baselines and prior synthetic methods. Notably, increasing the size of the generator model beyond 1B parameters provides no additional benefit. Our analysis also demonstrates that the selection of the original data used for mixing substantially influences performance. By applying our findings, we develop FinePhrase, a 486-billion-token open dataset of rephrased web text. We show that FinePhrase outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times. We provide the dataset, all prompts, and the generation framework to the research community.

Chinese Translation

合成数据是训练大型语言模型的标准组成部分，但在重述策略、生成模型和源数据等设计维度上的系统比较仍然缺乏。我们进行了广泛的控制实验，生成了超过一万亿个标记，以识别将网络文本重述为合成预训练数据的关键因素。我们的结果表明，结构化输出格式，如表格、数学问题、常见问题解答和教程，始终优于经过筛选的网络基准和先前的合成方法。值得注意的是，生成模型的参数规模超过10亿后并未提供额外的好处。我们的分析还表明，用于混合的原始数据的选择对性能有显著影响。基于我们的发现，我们开发了FinePhrase，一个包含4860亿个标记的重述网络文本开放数据集。我们展示了FinePhrase优于所有现有的合成数据基准，同时将生成成本降低了多达30倍。我们向研究社区提供了数据集、所有提示和生成框架。

cs.CL / 68 / 2604.13979

Leveraging LLM-GNN Integration for Open-World Question Answering over Knowledge Graphs

利用LLM-GNN集成进行开放世界知识图谱问答

Abdallah, Hussein, Abdelaziz, Ibrahim, Kalnis, Panos, Mansour, Essam

Abstract

Open-world Question Answering (OW-QA) over knowledge graphs (KGs) aims to answer questions over incomplete or evolving KGs. Traditional KGQA assumes a closed world where answers must exist in the KG, limiting real-world applicability. In contrast, open-world QA requires inferring missing knowledge based on graph structure and context. Large language models (LLMs) excel at language understanding but lack structured reasoning. Graph neural networks (GNNs) model graph topology but struggle with semantic interpretation. Existing systems integrate LLMs with GNNs or graph retrievers. Some support open-world QA but rely on structural embeddings without semantic grounding. Most assume observed paths or complete graphs, making them unreliable under missing links or multi-hop reasoning. We present GLOW, a hybrid system that combines a pre-trained GNN and an LLM for open-world KGQA. The GNN predicts top-k candidate answers from the graph structure. These, along with relevant KG facts, are serialized into a structured prompt (e.g., triples and candidates) to guide the LLM's reasoning. This enables joint reasoning over symbolic and semantic signals, without relying on retrieval or fine-tuning. To evaluate generalization, we introduce GLOW-BENCH, a 1,000-question benchmark over incomplete KGs across diverse domains. GLOW outperforms existing LLM-GNN systems on standard benchmarks and GLOW-BENCH, achieving up to 53.3% and an average 38% improvement. GitHub code and data are available.
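The serialization step mentioned in this abstract (GNN candidates plus KG facts turned into a structured prompt) might look roughly like the sketch below. The prompt wording, the function name `build_prompt`, and the example question, triples, and candidates are all hypothetical; the abstract does not specify GLOW's actual prompt format.

```python
def build_prompt(question, triples, candidates):
    """Serialize KG triples and ranked candidates into one text prompt.

    Hypothetical format for illustration; not GLOW's actual template.
    """
    fact_lines = [f"({head}, {relation}, {tail})" for head, relation, tail in triples]
    candidate_lines = [f"{rank}. {name}" for rank, name in enumerate(candidates, start=1)]
    return "\n".join(
        [
            "Knowledge graph facts:",
            *fact_lines,
            "Candidate answers (from the GNN, best first):",
            *candidate_lines,
            f"Question: {question}",
            "Answer with the most plausible candidate, or state that none fits.",
        ]
    )


# Made-up example data, used only to show the output shape.
prompt = build_prompt(
    question="Where was Ada Lovelace born?",
    triples=[
        ("Ada Lovelace", "occupation", "mathematician"),
        ("Ada Lovelace", "citizenship", "United Kingdom"),
    ],
    candidates=["London", "Paris"],
)
print(prompt)
```

Keeping the candidates ranked and the facts as explicit triples gives the LLM both the structural signal and the textual context in one input, which is the joint-reasoning idea the abstract describes.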

Chinese Translation

开放世界问答（OW-QA）在知识图谱（KGs）上旨在回答不完整或不断演变的知识图谱中的问题。传统的知识图谱问答（KGQA）假设一个封闭的世界，其中答案必须存在于知识图谱中，这限制了其在现实世界中的适用性。相比之下，开放世界问答需要根据图结构和上下文推断缺失的知识。大型语言模型（LLMs）在语言理解方面表现出色，但缺乏结构化推理能力。图神经网络（GNNs）能够建模图的拓扑结构，但在语义解释方面存在困难。现有系统将LLMs与GNNs或图检索器集成。一些系统支持开放世界问答，但依赖于结构嵌入而缺乏语义基础。大多数假设观察到的路径或完整图，这使得它们在缺失链接或多跳推理的情况下不可靠。我们提出了GLOW，一个混合系统，结合了预训练的GNN和LLM用于开放世界KGQA。GNN从图结构中预测前k个候选答案。这些候选答案以及相关的知识图谱事实被序列化为结构化提示（例如，三元组和候选项），以引导LLM的推理。这使得在符号和语义信号上进行联合推理成为可能，而无需依赖检索或微调。为了评估泛化能力，我们引入了GLOW-BENCH，一个涵盖不同领域的不完整知识图谱的1000个问题基准。GLOW在标准基准和GLOW-BENCH上超越了现有的LLM-GNN系统，取得了高达53.3%的性能提升，平均提高了38%。GitHub代码和数据可用。

cs.CL / 69 / 2604.13991

Adaptive Conformal Prediction for Improving Factuality of Generations by Large Language Models

自适应保形预测：提升大型语言模型生成内容的事实性

Rubashevskii, Aleksandr, Piatrashyn, Dzianis, Nakov, Preslav, Panov, Maxim

Abstract

Large language models (LLMs) are prone to generating factually incorrect outputs. Recent work has applied conformal prediction to provide uncertainty estimates and statistical guarantees for the factuality of LLM generations. However, existing approaches are typically not prompt-adaptive, limiting their ability to capture input-dependent variability. As a result, they may filter out too few items (leading to over-coverage) or too many (under-coverage) for a given task or prompt. We propose an adaptive conformal prediction approach that extends conformal score transformation methods to LLMs, with applications to long-form generation and multiple-choice question answering. This enables prompt-dependent calibration, retaining marginal coverage guarantees while improving conditional coverage. In addition, the approach naturally supports selective prediction, allowing unreliable claims or answer choices to be filtered out in downstream applications. We evaluate our approach on multiple white-box models across diverse domains and show that it significantly outperforms existing baselines in terms of conditional coverage.
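For orientation, the non-adaptive baseline that this abstract contrasts itself with can be sketched as plain split conformal filtering of claims. This is a generic textbook construction, not the adaptive method proposed in the paper; the function names, the definition of the calibration score (the highest confidence given to any false claim for a calibration prompt), and all numbers are assumptions for illustration.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile: the k-th smallest score, k = ceil((n+1)(1-alpha))."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points: retain nothing
    return sorted(calibration_scores)[k - 1]


def filter_claims(claims, threshold):
    """Keep only claims whose confidence strictly exceeds the threshold."""
    return [text for text, confidence in claims if confidence > threshold]


# Assumed calibration scores: per prompt, the highest confidence
# assigned to a claim that turned out to be false.
calibration = [0.30, 0.55, 0.10, 0.80, 0.45, 0.20, 0.65]
threshold = conformal_threshold(calibration, alpha=0.25)

claims = [("claim A", 0.90), ("claim B", 0.60), ("claim C", 0.70)]
kept = filter_claims(claims, threshold)
```

Because one global threshold is applied to every prompt, easy prompts lose more claims than necessary and hard prompts may keep too many, which is the over- and under-coverage issue the abstract sets out to address with prompt-dependent calibration.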

Chinese Translation

大型语言模型（LLMs）容易生成事实不准确的输出。近期的研究应用了保形预测，以提供不确定性估计和统计保证，确保LLM生成内容的事实性。然而，现有的方法通常不具备适应提示的能力，限制了它们捕捉输入依赖性变异的能力。因此，它们可能会过滤出过少的项目（导致过度覆盖）或过多的项目（导致不足覆盖），从而影响特定任务或提示的效果。我们提出了一种自适应保形预测方法，该方法将保形评分转换方法扩展到LLMs，并应用于长文本生成和多项选择问答。这种方法实现了基于提示的校准，保留了边际覆盖保证，同时改善了条件覆盖。此外，该方法自然支持选择性预测，允许在下游应用中过滤掉不可靠的声明或答案选项。我们在多个白盒模型和不同领域上评估了我们的方法，结果表明其在条件覆盖方面显著优于现有基准。

cs.CL / 70 / 2604.14001

Diffusion Language Models for Speech Recognition

用于语音识别的扩散语言模型

Naveriani, Davyd, Zeyer, Albert, Schlüter, Ralf, Ney, Hermann

Abstract

Diffusion language models have recently emerged as a leading alternative to standard language models, owing to their support for bidirectional attention and parallel text generation. In this work, we explore variants for their use in speech recognition. Specifically, we introduce a comprehensive guide to incorporating masked diffusion language models (MDLM) and uniform-state diffusion models (USDM) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM by integrating the framewise probability distributions derived from CTC with the labelwise probability distributions computed by USDM at each decoding step, thereby generating new candidates that combine strong language knowledge from USDM and acoustic information from CTC. Our findings reveal that both USDM and MDLM can significantly improve the accuracy of recognized text. We publish all our code and recipes.
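As background for the rescoring setting mentioned in this abstract, the sketch below shows generic n-best rescoring with a log-linear combination of acoustic and language-model scores. It is a standard construction and not the paper's joint CTC and USDM decoding; the function name, the weight, and the scores are made-up values for illustration.

```python
def rescore(hypotheses, lm_weight):
    """Pick the hypothesis with the best combined log score.

    hypotheses: list of (text, acoustic_logprob, lm_logprob) tuples.
    Combined score = acoustic_logprob + lm_weight * lm_logprob.
    """
    best = max(hypotheses, key=lambda h: h[1] + lm_weight * h[2])
    return best[0]


# Hypothetical 2-best list: the second entry is acoustically slightly
# better, but the language model strongly prefers the first.
n_best = [
    ("recognize speech", -4.2, -3.0),
    ("wreck a nice beach", -4.0, -9.5),
]

without_lm = rescore(n_best, lm_weight=0.0)
with_lm = rescore(n_best, lm_weight=0.5)
```

With the weight set to zero the acoustically better hypothesis wins; a positive weight lets the language model overturn that ranking, which is the effect any rescoring language model, diffusion-based or not, is meant to have.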

Chinese Translation

扩散语言模型因其双向注意力和并行文本生成的能力，最近成为标准语言模型的一个重要替代方案。在本研究中，我们探讨了其在语音识别中的应用变体。具体而言，我们介绍了一份关于如何将掩蔽扩散语言模型（Masked Diffusion Language Models, MDLM）和均匀状态扩散模型（Uniform-State Diffusion Models, USDM）应用于重评分自动语音识别（ASR）假设的全面指南。此外，我们设计了一种新的联合解码方法，通过将从CTC（Connectionist Temporal Classification）获得的逐帧概率分布与USDM在每个解码步骤计算的标签概率分布相结合，生成新的候选文本，从而结合了USDM的强语言知识和CTC的声学信息。我们的研究结果表明，USDM和MDLM均能显著提高识别文本的准确性。我们将发布所有代码和实验方案。

cs.CL / 71 / 2604.14030

Dual-Enhancement Product Bundling: Bridging Interactive Graph and Large Language Model

双重增强产品捆绑：连接交互图与大型语言模型

Huang, Zhe, Wang, Peng, Zheng, Yan, Song, Sen, Cai, Longjun

Abstract

Product bundling boosts e-commerce revenue by recommending complementary item combinations. However, existing methods face two critical challenges: (1) collaborative filtering approaches struggle with cold-start items owing to their dependency on historical interactions, and (2) LLMs lack an inherent capability to model interactive graphs directly. To bridge this gap, we propose a dual-enhancement method that integrates interactive graph learning and LLM-based semantic understanding for product bundling. Our method introduces a graph-to-text paradigm, which leverages a Dynamic Concept Binding Mechanism (DCBM) to translate graph structures into natural language prompts. The DCBM plays a critical role in aligning domain-specific entities with LLM tokenization, enabling effective comprehension of combinatorial constraints. Experiments on three benchmarks (POG, POG_dense, Steam) demonstrate 6.3%-26.5% improvements over state-of-the-art baselines.

Chinese Translation

产品捆绑通过推荐互补商品组合来提升电子商务收入。然而，现有方法面临两个关键挑战：（1）协同过滤方法由于依赖历史交互而在冷启动商品上表现不佳；（2）大型语言模型（LLMs）缺乏直接建模交互图的内在能力。为了解决这一问题，我们提出了一种双重增强方法，结合了交互图学习和基于LLM的语义理解以实现产品捆绑。我们的方法引入了一种图到文本的范式，利用动态概念绑定机制（Dynamic Concept Binding Mechanism，DCBM）将图结构转换为自然语言提示。DCBM在将特定领域实体与LLM的标记化对齐方面发挥了关键作用，使得组合约束的有效理解成为可能。在三个基准测试（POG、POG_dense、Steam）上的实验表明，相较于最先进的基线，性能提升了6.3%-26.5%。

cs.CL / 72 / 2604.14053

From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution

词汇来源：通过源归属有效正则化代码分词器

Chizhov, Pavel, Bogomolov, Egor, Yamshchikov, Ivan P.

Abstract

Efficiency and safety of Large Language Models (LLMs), among other factors, rely on the quality of tokenization. A good tokenizer not only improves inference speed and language understanding but also provides extra defense against jailbreak attacks and lowers the risk of hallucinations. In this work, we investigate the efficiency of code tokenization, in particular from the perspective of data source diversity. We demonstrate that code tokenizers are prone to producing unused, and thus under-trained, tokens due to the imbalance in repository and language diversity in the training data, as well as the dominance of source-specific, repetitive tokens that are often unusable in future inference. By modifying the BPE objective and introducing merge skipping, we implement different techniques under the name Source-Attributed BPE (SA-BPE) to regularize BPE training and minimize overfitting, thereby substantially reducing the number of under-trained tokens while maintaining the same inference procedure as with regular BPE. This provides an effective tool suitable for production use.

Chinese Translation

大型语言模型（LLMs）的效率和安全性等多个因素依赖于分词质量。一个好的分词器不仅可以提高推理速度和语言理解能力，还能提供额外的防御以抵御越狱攻击，并降低幻觉风险。在本研究中，我们从数据源多样性的角度探讨了代码分词的效率。我们证明，由于训练数据中代码库和语言多样性的不平衡，以及源特定的重复性标记的主导地位，代码分词器容易产生未使用的、因此训练不足的标记，这些标记在未来的推理中往往不可用。通过修改BPE目标并引入合并跳过，我们在名为源归属BPE（Source-Attributed BPE，SA-BPE）的框架下实施了不同的技术，以正则化BPE训练并最小化过拟合，从而在保持与常规BPE相同的推理过程的同时，显著减少训练不足的标记数量。这为生产使用提供了一种有效的工具。

cs.CL / 73 / 2604.14090

From Weights to Activations: Is Steering the Next Frontier of Adaptation?

从权重到激活：引导是否是适应的下一个前沿？

Ostermann, Simon, Gurgurov, Daniil, Baeumel, Tanja, Hedderich, Michael A., Lapuschkin, Sebastian, Samek, Wojciech, Schmitt, Vera

Abstract

Post-training adaptation of language models is commonly achieved through parameter updates or input-based methods such as fine-tuning, parameter-efficient adaptation, and prompting. In parallel, a growing body of work modifies internal activations at inference time to influence model behavior, an approach known as steering. Despite increasing use, steering is rarely analyzed within the same conceptual framework as established adaptation methods. In this work, we argue that steering should be regarded as a form of model adaptation. We introduce a set of functional criteria for adaptation methods and use them to compare steering approaches with classical alternatives. This analysis positions steering as a distinct adaptation paradigm based on targeted interventions in activation space, enabling local and reversible behavioral change without parameter updates. The resulting framing clarifies how steering relates to existing methods, motivating a unified taxonomy for model adaptation.
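The kind of intervention this abstract calls steering can be illustrated with the simplest common variant, adding a scaled direction vector to a hidden activation. The sketch is a toy under stated assumptions: the vectors, the strength, and the function name are hypothetical, and real steering methods operate on model activations at specific layers rather than on plain Python lists.

```python
def steer(hidden, direction, strength):
    """Return hidden + strength * direction, leaving the inputs untouched."""
    return [h + strength * d for h, d in zip(hidden, direction)]


hidden = [1.0, -2.0, 0.5]     # toy activation vector (assumed)
direction = [0.5, 0.0, -1.0]  # toy steering direction (assumed)

steered = steer(hidden, direction, strength=2.0)
restored = steer(steered, direction, strength=-2.0)
```

No parameters are modified, and applying the opposite strength undoes the change, which matches the abstract's description of steering as a local and reversible intervention in activation space.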

Chinese Translation

语言模型的后训练适应通常通过参数更新或基于输入的方法实现，例如微调、参数高效适应和提示。同时，越来越多的研究在推理时修改内部激活以影响模型行为，这种方法被称为引导（steering）。尽管引导的使用日益增加，但它很少在与已建立的适应方法相同的概念框架内进行分析。在本研究中，我们认为引导应被视为一种模型适应形式。我们引入了一套适应方法的功能标准，并利用这些标准将引导方法与经典替代方案进行比较。这一分析将引导定位为一种基于激活空间中有针对性干预的独特适应范式，使得在不更新参数的情况下实现局部和可逆的行为变化。由此产生的框架阐明了引导与现有方法的关系，激励了模型适应的统一分类法。

cs.CL / 74 / 2604.14111

Interpretable Stylistic Variation in Human and LLM Writing Across Genres, Models, and Decoding Strategies

人类与大型语言模型（LLM）在不同体裁、模型和解码策略下的可解释风格变异

Rallapalli, Swati, Gallagher, Shannon, Yurko, Ronald, Brooks, Tyler, Loughin, Chuck, Sezgin, Michele, Turri, Violet

Abstract

Large Language Models (LLMs) are now capable of generating highly fluent, human-like text. They enable many applications, but also raise concerns such as large-scale spam, phishing, or academic misuse. While much work has focused on detecting LLM-generated text, only limited work has gone into understanding the stylistic differences between human-written and machine-generated text. In this work, we perform a large-scale analysis of stylistic variation across human-written text and outputs from 11 LLMs spanning 8 different genres and 4 decoding strategies using Douglas Biber's set of lexicogrammatical and functional features. Our findings reveal insights that can guide intentional LLM usage. First, key linguistic differentiators of LLM-generated text seem robust to generation conditions (e.g., prompt settings to nudge them to generate human-like text, or availability of human-written text to continue the style); second, genre exerts a stronger influence on stylistic features than the source itself; third, chat variants of the models generally appear to be clustered together in stylistic space; and finally, the model has a larger effect on style than the decoding strategy, with some exceptions. These results highlight the relative importance of model and genre over prompting and decoding strategies in shaping the stylistic behavior of machine-generated text.

Chinese Translation

大型语言模型（LLMs）现在能够生成高度流畅、类似人类的文本。它们支持许多应用，但也引发了诸如大规模垃圾邮件、网络钓鱼或学术不当使用等问题。尽管许多研究集中于检测LLM生成的文本，但对人类撰写文本与机器生成文本之间的风格差异的理解却相对有限。在本研究中，我们对人类撰写文本与来自11个LLM的输出进行大规模的风格变异分析，这些LLM涵盖了8种不同体裁和4种解码策略，使用了Douglas Biber的词汇语法和功能特征集。我们的研究结果揭示了一些可以指导LLM使用的见解。首先，LLM生成文本的关键语言差异似乎对生成条件（例如，提示设置以促使它们生成类似人类的文本，或继续该风格的人类撰写文本的可用性）具有稳健性；其次，体裁对风格特征的影响似乎大于文本来源本身；第三，模型的聊天变体通常在风格空间中聚集在一起；最后，模型对风格的影响大于解码策略，尽管存在一些例外。这些结果强调了模型和体裁在塑造机器生成文本的风格行为方面相较于提示和解码策略的相对重要性。

cs.CL / 75 / 2604.14121

Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis

正确预测，错误步骤？共识推理知识图谱用于稳健的思维链合成

Ling, Zipeng, Liu, Shuliang, Fu, Shenghong, Tang, Yuehao, Son, Seonil, Wan, Yao, Hu, Xuming

Abstract

LLM reasoning traces suffer from complex flaws -- *Step Internal Flaws* (logical errors, hallucinations, etc.) and *Step-wise Flaws* (overthinking, underthinking), which vary by sample. A natural approach would be to provide ground-truth labels to guide LLMs' reasoning. Contrary to intuition, we show that this yields no improvement in reasoning ability. We then propose CRAFT, a unified framework that mitigates both types of step flaws: it builds a Reasoning Knowledge Graph (RKG) from the consensus parts of multiple candidate traces and synthesizes a high-quality trace through topological generation. Our approach improves label-prediction accuracy by more than 10% on average, and consistently outperforms all baselines across both logical and mathematical reasoning benchmarks. Further, detailed benchmark evaluation shows that our method also improves the quality of LLMs' reasoning traces along multiple dimensions.

Chinese Translation

大型语言模型（LLM）的推理过程存在复杂缺陷——*步骤内部缺陷*（逻辑错误、幻觉等）和*步骤性缺陷*（过度思考、思考不足），这些缺陷因样本而异。一种自然的解决方案是提供真实标签来指导LLM的推理。与直觉相反，我们的研究表明，这并未改善推理能力。随后，我们提出了CRAFT，一个统一框架，旨在减轻这两种步骤缺陷，构建基于多个候选推理轨迹共识部分的推理知识图谱（RKG），并通过拓扑生成合成高质量的推理轨迹。我们的方法在标签预测准确率上平均提高了10%以上，并在逻辑和数学推理基准测试中始终优于所有基线。此外，详细的基准评估证明我们的方法在多个维度上也提高了LLM推理轨迹的质量。

cs.CL / 76 / 2604.14128

Rhetorical Questions in LLM Representations: A Linear Probing Study

大型语言模型表征中的修辞性问题：线性探测研究

Yao, Louie Hong, Anand, Vishesh, Zhuang, Yuan, Jiang, Tianyu

Abstract

Rhetorical questions are asked not to seek information but to persuade or signal stance. How large language models internally represent them remains unclear. We analyze rhetorical questions in LLM representations using linear probes on two social-media datasets with different discourse contexts, and find that rhetorical signals emerge early and are most stably captured by last-token representations. Rhetorical questions are linearly separable from information-seeking questions within datasets, and remain detectable under cross-dataset transfer, reaching AUROC around 0.7-0.8. However, we demonstrate that transferability does not simply imply a shared representation. Probes trained on different datasets produce different rankings when applied to the same target corpus, with overlap among the top-ranked instances often below 0.2. Qualitative analysis shows that these divergences correspond to distinct rhetorical phenomena: some probes capture discourse-level rhetorical stance embedded in extended argumentation, while others emphasize localized, syntax-driven interrogative acts. Together, these findings suggest that rhetorical questions in LLM representations are encoded by multiple linear directions emphasizing different cues, rather than a single shared direction.
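The ranking comparison described in this abstract (two probes scoring the same target corpus, with low overlap among their top-ranked instances) can be sketched as follows. The abstract does not say which overlap measure is used, so the Jaccard index here is an assumption, as are the function names and the toy scores.

```python
def top_k_ids(scores, k):
    """IDs of the k highest-scoring instances."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def top_k_overlap(scores_a, scores_b, k):
    """Jaccard overlap between the top-k sets of two probes (assumed metric)."""
    top_a, top_b = top_k_ids(scores_a, k), top_k_ids(scores_b, k)
    return len(top_a & top_b) / len(top_a | top_b)


# Toy probe scores on the same four target questions (made up).
probe_a = {"q1": 0.90, "q2": 0.80, "q3": 0.10, "q4": 0.40}
probe_b = {"q1": 0.20, "q2": 0.95, "q3": 0.70, "q4": 0.30}

overlap = top_k_overlap(probe_a, probe_b, k=2)
```

Two probes can both separate rhetorical from information-seeking questions reasonably well and still disagree on which instances are the most rhetorical; a low value of this kind of overlap is what the abstract takes as evidence for multiple linear directions rather than one shared direction.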

Chinese Translation

修辞性问题的提出并非为了寻求信息，而是为了说服或表达立场。大型语言模型如何在内部表征这些问题仍然不清楚。我们通过对两个具有不同话语语境的社交媒体数据集进行线性探测，分析了大型语言模型中的修辞性问题，并发现修辞信号早期出现，并且最稳定地由最后一个标记的表征捕捉。修辞性问题在数据集中与寻求信息的问题是线性可分的，并且在跨数据集迁移中仍然可检测，AUROC值达到约0.7-0.8。然而，我们证明迁移性并不简单意味着共享表征。在不同数据集上训练的探测器在应用于相同目标语料库时产生不同的排名，且排名前列的实例重叠通常低于0.2。定性分析表明，这些差异对应于不同的修辞现象：一些探测器捕捉到嵌入扩展论证中的话语层面修辞立场，而另一些则强调局部的、语法驱动的疑问行为。综合来看，这些发现表明，大型语言模型表征中的修辞性问题是通过多条线性方向编码的，强调不同的线索，而不是单一的共享方向。

cs.CL / 77 / 2604.14137

From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

从感受到指标：理解和形式化用户如何进行 LLM 的氛围测试

Itzhak, Itay, Habba, Eliya, Stanovsky, Gabriel, Belinkov, Yonatan

Abstract

Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on ``vibe-testing'': informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.

Chinese Translation

评估 LLM（大规模语言模型）具有挑战性，因为基准分数往往无法捕捉模型在现实世界中的实际效用。相反，用户通常依赖于“氛围测试”：一种基于经验的非正式评估，例如在与自身工作流程相关的编码任务中比较模型。尽管氛围测试很普遍，但其往往过于临时和无结构，难以进行大规模分析或复现。在本研究中，我们研究氛围测试在实践中的运作方式，并将其形式化以支持系统分析。我们首先分析了两个实证资源：（1）用户评估实践的调查，和（2）来自博客和社交媒体的真实环境下模型比较报告的集合。基于这些资源，我们将氛围测试形式化为一个两部分的过程：用户个性化他们测试的内容和评判响应的方式。然后，我们引入了一个概念验证评估流程，该流程遵循这一形式化，通过生成个性化提示并使用用户感知的主观标准比较模型输出。在对编码基准的实验中，我们发现结合个性化提示和用户感知评估可以改变用户对模型的偏好，反映了氛围测试在实践中的作用。这些发现表明，形式化的氛围测试可以作为一种有用的方法，弥合基准分数与现实世界体验之间的差距。


arXiv Papers

Olfactory pursuit: catching a moving odor source in complex flows

Multi-modal panoramic 3D outdoor datasets for place categorization

Weakly-supervised Learning for Physics-informed Neural Motion Planning via Sparse Roadmap

Capability-Aware Heterogeneous Control Barrier Functions for Decentralized Multi-Robot Safe Navigation

GeoVision-Enabled Digital Twin for Hybrid Autonomous-Teleoperated Medical Responses

Utilizing Inpainting for Keypoint Detection for Vision-Based Control of Robotic Manipulators

Vectorizing Projection in Manifold-Constrained Motion Planning for Real-Time Whole-Body Control

Boundary Sampling to Learn Predictive Safety Filters via Pontryagin's Maximum Principle

Singularity Avoidance in Inverse Kinematics: A Unified Treatment of Classical and Learning-based Methods

Robust Energy-Aware Routing for Air-Ground Cooperative Multi-UAV Delivery in Wind-Uncertain Environments

RobotPan: A 360° Surround-View Robotic Vision System for Embodied Perception

RadarSplat-RIO: Indoor Radar-Inertial Odometry with Gaussian Splatting-Based Radar Bundle Adjustment

A transformable slender microrobot inspired by nematode parasites for interventional endovascular surgery

Stability Principle Underlying Passive Dynamic Walking of Rimless Wheel

Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization

Self-adaptive Multi-Access Edge Architectures: A Robotics Case

UNRIO: Uncertainty-Aware Velocity Learning for Radar-Inertial Odometry

A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies

Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

Empirical Prediction of Pedestrian Comfort in Mobile Robot Pedestrian Encounters

Failure Identification in Imitation Learning Via Statistical and Semantic Filtering

EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development

Mosaic: An Extensible Framework for Composing Rule-Based and Learned Motion Planners

Beyond Conservative Automated Driving in Multi-Agent Scenarios via Coupled Model Predictive Control and Deep Reinforcement Learning

Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection

Towards Multi-Object-Tracking with Radar on a Fast Moving Vehicle: On the Potential of Processing Radar in the Frequency Domain

Neuromorphic Spiking Ring Attractor for Proprioceptive Joint-State Estimation

Scale-Invariant Sampling in Multi-Arm Bandit Motion Planning for Object Extraction

UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

A Lightweight Multi-Metric No-Reference Image Quality Assessment Framework for UAV Imaging

Graph Propagated Projection Unlearning: A Unified Framework for Vision and Audio Discriminative Models

PatchPoison: Poisoning Multi-View Datasets to Degrade 3D Reconstruction

3DRealHead: Few-Shot Detailed Head Avatar

GeoLink: A 3D-Aware Framework Towards Better Generalization in Cross-View Geo-Localization

Towards Patient-Specific Deformable Registration in Laparoscopic Surgery

Multitasking Embedding for Embryo Blastocyst Grading Prediction (MEmEBG)

Neural 3D Reconstruction of Planetary Surfaces from Descent-Phase Wide-Angle Imagery

SemiFA: An Agentic Multi-Modal Framework for Autonomous Semiconductor Failure Analysis Report Generation

A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models

4th Workshop on Maritime Computer Vision (MaCVi): Challenge Overview

Rethinking Uncertainty in Segmentation: From Estimation to Decision

Indexing Multimodal Language Models for Large-scale Image Retrieval

DroneScan-YOLO: Redundancy-Aware Lightweight Detection for Tiny Objects in UAV Imagery

Explainable Fall Detection for Elderly Care via Temporally Stable SHAP in Skeleton-Based Human Activity Recognition

See&Say: Vision Language Guided Safe Zone Detection for Autonomous Package Delivery Drones

PAT-VCM: Plug-and-Play Auxiliary Tokens for Video Coding for Machines

Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

Bias at the End of the Score

Deep Spatially-Regularized and Superpixel-Based Diffusion Learning for Unsupervised Hyperspectral Image Clustering

The Spectrascapes Dataset: Street-view imagery beyond the visible captured using a mobile platform

Why MLLMs Struggle to Determine Object Orientations

Towards Successful Implementation of Automated Raveling Detection: Effects of Training Data Size, Illumination Difference, and Spatial Shift

Right Regions, Wrong Labels: Semantic Label Flips in Segmentation under Correlation Shift

SSD-GS: Scattering and Shadow Decomposition for Relightable 3D Gaussian Splatting

SEDTalker: Emotion-Aware 3D Facial Animation Using Frame-Level Speech Emotion Diarization

MSGS: Multispectral 3D Gaussian Splatting

Multi-Agent Object Detection Framework Based on Raspberry Pi YOLO Detector and Slack-Ollama Natural Language Interface

A 3D SAM-Based Progressive Prompting Framework for Multi-Task Segmentation of Radiotherapy-induced Normal Tissue Injuries in Limited-Data Settings

UniBlendNet: Unified Global, Multi-Scale, and Region-Adaptive Modeling for Ambient Lighting Normalization

A Multimodal Clinically Informed Coarse-to-Fine Framework for Longitudinal CT Registration in Proton Therapy

Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks

CausalDisenSeg: A Causality-Guided Disentanglement Framework with Counterfactual Reasoning for Robust Brain Tumor Segmentation Under Missing Modalities

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

Physically-Guided Optical Inversion Enable Non-Contact Side-Channel Attack on Isolated Screens

VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning

Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking

MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

A Study of Failure Modes in Two-Stage Human-Object Interaction Detection

Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning

ADP-DiT: Text-Guided Diffusion Transformer for Brain Image Generation in Alzheimer's Disease Progression

Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling

DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer

Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding

Reconstruction of a 3D wireframe from a single line drawing via generative depth estimation

AI Powered Image Analysis for Phishing Detection

CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling

UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

ZoomSpec: A Physics-Guided Coarse-to-Fine Framework for Wideband Spectrum Sensing

Radar-Informed 3D Multi-Object Tracking under Adverse Conditions