Daily Research Digest

arXiv Papers

2026-02-13
261 Papers · 4 Categories · 261 Translated
Robotics: 34 papers
cs.RO / 1 / 2602.11183

Mitigating Error Accumulation in Continuous Navigation via Memory-Augmented Kalman Filtering

通过记忆增强卡尔曼滤波减轻连续导航中的误差积累
Tang, Yin, Ma, Jiawei, Zhang, Jinrui, Wang, Alex Jinpeng, Zhang, Deyu
Abstract
Continuous navigation in complex environments is critical for Unmanned Aerial Vehicles (UAVs). However, existing Vision-Language Navigation (VLN) models follow a dead-reckoning scheme, which iteratively updates the estimated position for the next waypoint prediction and subsequently constructs the complete trajectory. This stepwise manner inevitably leads to accumulated position errors over time, resulting in misalignment between the internal belief and objective coordinates, known as "state drift", which ultimately compromises full-trajectory prediction. Drawing inspiration from classical control theory, we propose to correct these errors by formulating such sequential prediction as a recursive Bayesian state estimation problem. In this paper, we design NeuroKalman, a novel framework that decouples navigation into two complementary processes: a Prior Prediction, based on motion dynamics, and a Likelihood Correction, from historical observation. We first mathematically associate Kernel Density Estimation of the measurement likelihood with the attention-based retrieval mechanism, which allows the system to rectify the latent representation using retrieved historical anchors without gradient updates. Comprehensive experiments on the TravelUAV benchmark demonstrate that, with fine-tuning on only 10% of the training data, our method clearly outperforms strong baselines and regulates drift accumulation.
Chinese Translation
在复杂环境中进行连续导航对无人机(UAV)至关重要。然而,现有的视觉-语言导航(VLN)模型遵循航位推算(dead-reckoning)方式,迭代更新自身位置以预测下一个航点,并随后构建完整轨迹。这种逐步方式不可避免地导致位置误差随时间积累,造成内部信念与客观坐标之间的错位,即所谓的“状态漂移”,最终损害完整轨迹的预测。受经典控制理论的启发,我们提出通过将这种序列预测形式化为递归贝叶斯状态估计问题来修正误差。在本文中,我们设计了 NeuroKalman,一个新颖的框架,将导航解耦为两个互补的过程:基于运动动态的先验预测和基于历史观测的似然校正。我们首先将测量似然的核密度估计与基于注意力的检索机制建立数学关联,使系统能够在不进行梯度更新的情况下,利用检索到的历史锚点修正潜在表示。在 TravelUAV 基准上的全面实验表明,仅使用 10% 的训练数据进行微调,我们的方法即明显优于强基线,并有效抑制漂移积累。
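The predict-then-correct loop described above can be illustrated in a 1-D toy: softmax attention over retrieved historical anchors acts as a Gaussian-kernel (KDE-style) estimate of the measurement likelihood, and the retrieved estimate is blended into the dead-reckoned prior. The scalar state, bandwidth, and blending gain are illustrative assumptions, not the paper's implementation.

```python
import math

def attention_correct(prior, anchors, bandwidth=1.0, gain=0.5):
    # Likelihood correction: softmax attention over historical anchors is
    # equivalent to a Gaussian-kernel density estimate of the measurement
    # likelihood; the weighted mean of the anchors is the retrieved estimate.
    scores = [-(prior - a) ** 2 / (2 * bandwidth ** 2) for a in anchors]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    retrieved = sum(w * a for w, a in zip(weights, anchors)) / z
    # Kalman-style blend of the dead-reckoned prior and the retrieved estimate.
    return prior + gain * (retrieved - prior)

def navigate(start, measured_velocity, anchors_per_step, steps):
    state = start
    for t in range(steps):
        state = state + measured_velocity                      # prior prediction
        state = attention_correct(state, anchors_per_step[t])  # correction
    return state
```

With a biased velocity estimate, the corrected trajectory drifts far less than pure dead reckoning, which is the error-accumulation effect the paper targets.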
cs.RO / 2 / 2602.11291

H-WM: Robotic Task and Motion Planning Guided by Hierarchical World Model

H-WM:基于层次世界模型的机器人任务与运动规划
Chen, Wenyuan, Huang, Jinbang, Pang, Oscar, Li, Zhiyuan, Hu, Xiao, Zhang, Lingfeng, Zhang, Zhanguang, Coates, Mark, Cao, Tongtong, Quan, Xingyue, Zhang, Yingxue
Abstract
World models are becoming central to robotic planning and control, as they enable prediction of future state transitions. Existing approaches often emphasize video generation or natural language prediction, which are difficult to directly ground in robot actions and suffer from compounding errors over long horizons. Traditional task and motion planning relies on symbolic logic world models, such as planning domains, that are robot-executable and robust for long-horizon reasoning. However, these methods typically operate independently of visual perception, preventing synchronized symbolic and perceptual state prediction. We propose a Hierarchical World Model (H-WM) that jointly predicts logical and visual state transitions within a unified bilevel framework. H-WM combines a high-level logical world model with a low-level visual world model, integrating the robot-executable, long-horizon robustness of symbolic reasoning with perceptual grounding from visual observations. The hierarchical outputs provide stable and consistent intermediate guidance for long-horizon tasks, mitigating error accumulation and enabling robust execution across extended task sequences. To train H-WM, we introduce a robotic dataset that aligns robot motion with symbolic states, actions, and visual observations. Experiments across vision-language-action (VLA) control policies demonstrate the effectiveness and generality of the approach.
Chinese Translation
世界模型正成为机器人规划与控制的核心,因为它们能够预测未来状态转变。现有的方法通常强调视频生成或自然语言预测,这些方法难以直接与机器人动作对接,并且在长时间范围内容易出现累积误差。传统的任务与运动规划依赖于符号逻辑世界模型,例如规划域,这些模型可被机器人执行并且在长时间推理中具有鲁棒性。然而,这些方法通常独立于视觉感知操作,阻碍了符号和感知状态预测的同步。我们提出了一种层次世界模型(H-WM),在统一的双层框架内共同预测逻辑和视觉状态转变。H-WM将高层次的逻辑世界模型与低层次的视觉世界模型相结合,整合了符号推理的机器人可执行性和长时间鲁棒性,以及来自视觉观察的感知基础。层次输出为长时间任务提供了稳定且一致的中间指导,减轻了误差累积,并使得在扩展任务序列中的鲁棒执行成为可能。为了训练H-WM,我们引入了一个机器人数据集,该数据集将机器人运动与符号状态、动作和视觉观察对齐。通过视觉-语言-动作(VLA)控制策略的实验展示了该方法的有效性和普适性。
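The high-level logical world model is essentially a PDDL-style symbolic transition function. A minimal sketch with a hypothetical two-action pick-and-place domain (an illustration of the idea, not the paper's dataset or domain):

```python
# Toy PDDL-style planning domain: action -> (preconditions, add, delete)
# over sets of ground predicates. Illustrative only.
ACTIONS = {
    "pick(cup)": ({"handempty", "on(cup,table)"},
                  {"holding(cup)"},
                  {"handempty", "on(cup,table)"}),
    "place(cup,shelf)": ({"holding(cup)"},
                         {"handempty", "on(cup,shelf)"},
                         {"holding(cup)"}),
}

def apply_action(state, action):
    pre, add, delete = ACTIONS[action]
    if not pre <= state:
        raise ValueError(f"preconditions of {action} not satisfied")
    return (state - delete) | add

def rollout(state, plan):
    # Long-horizon symbolic prediction: chained logical transitions do not
    # drift, which is what makes the high-level model robust over long horizons.
    for action in plan:
        state = apply_action(state, action)
    return state
```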
cs.RO / 3 / 2602.11321

ExtremControl: Low-Latency Humanoid Teleoperation with Direct Extremity Control

ExtremControl:具有直接肢体控制的低延迟类人机器人远程操作
Xiong, Ziyan, Fang, Lixing, Huang, Junyun, Yamazaki, Kashu, Zhang, Hao, Gan, Chuang
Abstract
Building a low-latency humanoid teleoperation system is essential for collecting diverse reactive and dynamic demonstrations. However, existing approaches rely on heavily pre-processed human-to-humanoid motion retargeting and position-only PD control, resulting in substantial latency that severely limits responsiveness and prevents tasks requiring rapid feedback and fast reactions. To address this problem, we propose ExtremControl, a low latency whole-body control framework that: (1) operates directly on SE(3) poses of selected rigid links, primarily humanoid extremities, to avoid full-body retargeting; (2) utilizes a Cartesian-space mapping to directly convert human motion to humanoid link targets; and (3) incorporates velocity feedforward control at low level to support highly responsive behavior under rapidly changing control interfaces. We further provide a unified theoretical formulation of ExtremControl and systematically validate its effectiveness through experiments in both simulation and real-world environments. Building on ExtremControl, we implement a low-latency humanoid teleoperation system that supports both optical motion capture and VR-based motion tracking, achieving end-to-end latency as low as 50ms and enabling highly responsive behaviors such as ping-pong ball balancing, juggling, and real-time return, thereby substantially surpassing the 200ms latency limit observed in prior work.
Chinese Translation
构建一个低延迟的类人机器人远程操作系统对于收集多样化的反应性和动态演示至关重要。然而,现有的方法依赖于大量预处理的人类到类人机器人的运动重定向和仅基于位置的PD控制,导致显著的延迟,严重限制了响应能力,并阻碍了需要快速反馈和快速反应的任务。为了解决这个问题,我们提出了ExtremControl,一个低延迟的全身控制框架: (1) 直接对选定刚性连杆(主要是类人机器人的肢体)的SE(3)姿态进行操作,以避免全身重定向; (2) 利用笛卡尔空间映射直接将人类运动转换为类人机器人连杆目标; (3) 在底层结合速度前馈控制,以支持在快速变化的控制接口下高度响应的行为。我们进一步给出了ExtremControl的统一理论表述,并通过模拟和真实环境中的实验系统地验证了其有效性。在ExtremControl的基础上,我们实现了一个低延迟的类人机器人远程操作系统,支持光学运动捕捉和基于虚拟现实的运动跟踪,实现了低至50毫秒的端到端延迟,并使得如乒乓球平衡、杂耍和实时回球等高度响应的行为成为可能,从而大大超越了先前工作中观察到的200毫秒延迟限制。
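The velocity-feedforward idea can be demonstrated on a toy unit-mass joint tracking a constant-velocity reference: position-only PD leaves a steady-state lag of kd·qd_ref/kp, which the feedforward term removes. Gains, dynamics, and time step are hypothetical, chosen only to show the effect, not the paper's controller.

```python
def simulate(use_feedforward, kp=100.0, kd=20.0, dt=0.001, steps=3000):
    # Toy unit-mass joint (torque ~ acceleration) tracking a ramp reference.
    q, qd = 0.0, 0.0
    for i in range(steps):
        t = i * dt
        q_ref, qd_ref = t, 1.0                  # moving target: ramp at 1 m/s
        ff = qd_ref if use_feedforward else 0.0
        u = kp * (q_ref - q) + kd * (ff - qd)   # PD (+ optional velocity FF)
        qd += u * dt                            # semi-implicit Euler step
        q += qd * dt
    return abs(q_ref - q)                       # final tracking error
```

With these gains the position-only controller settles to a lag of about kd/kp = 0.2, while the feedforward variant tracks the ramp with near-zero error.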
cs.RO / 4 / 2602.11337

MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation

MolmoSpaces:一个大规模开放的机器人导航与操作生态系统
Kim, Yejin, Pumacay, Wilbert, Rayyan, Omar, Argus, Max, Han, Winson, VanderBilt, Eli, Salvador, Jordi, Deshpande, Abhay, Hendrix, Rose, Jauhri, Snehal, Liu, Shuo, Shafiullah, Nur Muhammad Mahi, Guru, Maya, Eftekhar, Ainaz, Farley, Karen, Clay, Donovan, Duan, Jiafei, Guru, Arjun, Wolters, Piper, Herrasti, Alvaro, Lee, Ying-Chun, Chalvatzaki, Georgia, Cui, Yuchen, Farhadi, Ali, Fox, Dieter, Krishna, Ranjay
Abstract
Deploying robots at scale demands robustness to the long tail of everyday situations. The countless variations in scene layout, object geometry, and task specifications that characterize real environments are vast and underrepresented in existing robot benchmarks. Measuring this level of generalization requires infrastructure at a scale and diversity that physical evaluation alone cannot provide. We introduce MolmoSpaces, a fully open ecosystem to support large-scale benchmarking of robot policies. MolmoSpaces consists of over 230k diverse indoor environments, ranging from handcrafted household scenes to procedurally generated multiroom houses, populated with 130k richly annotated object assets, including 48k manipulable objects with 42M stable grasps. Crucially, these environments are simulator-agnostic, supporting popular options such as MuJoCo, Isaac, and ManiSkill. The ecosystem supports the full spectrum of embodied tasks: static and mobile manipulation, navigation, and multiroom long-horizon tasks requiring coordinated perception, planning, and interaction across entire indoor environments. We also design MolmoSpaces-Bench, a benchmark suite of 8 tasks in which robots interact with our diverse scenes and richly annotated objects. Our experiments show MolmoSpaces-Bench exhibits strong sim-to-real correlation (R = 0.96, ρ = 0.98), confirm newer and stronger zero-shot policies outperform earlier versions in our benchmarks, and identify key sensitivities to prompt phrasing, initial joint positions, and camera occlusion. Through MolmoSpaces and its open-source assets and tooling, we provide a foundation for scalable data generation, policy training, and benchmark creation for robot learning research.
Chinese Translation
大规模部署机器人需要对日常情况的长尾具有鲁棒性。真实环境中场景布局、物体几何形状和任务规格的无数变化数量庞大,且在现有机器人基准中代表性不足。衡量这种泛化水平所需的基础设施,其规模和多样性是仅靠物理评估无法提供的。我们介绍了MolmoSpaces,一个完全开放的生态系统,以支持机器人策略的大规模基准测试。MolmoSpaces包含超过23万个多样化的室内环境,从手工制作的家庭场景到程序生成的多房间住宅,配备了13万个带丰富注释的物体资产,包括4.8万个可操作物体及其4200万个稳定抓取。至关重要的是,这些环境与模拟器无关,支持MuJoCo、Isaac和ManiSkill等流行选项。该生态系统支持全方位的具身任务:静态和移动操作、导航,以及需要在整个室内环境中协调感知、规划和交互的多房间长时程任务。我们还设计了MolmoSpaces-Bench,一个包含8个任务的基准套件,机器人在其中与我们的多样场景和带丰富注释的物体进行交互。我们的实验表明,MolmoSpaces-Bench表现出很强的模拟到现实相关性(R = 0.96, ρ = 0.98),确认更新且更强的零样本策略在我们的基准中优于早期版本,并识别出对提示措辞、初始关节位置和相机遮挡的关键敏感性。通过MolmoSpaces及其开源资产和工具,我们为可扩展的数据生成、策略训练和机器人学习研究的基准创建提供了基础。
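The reported sim-to-real correlation metrics (Pearson R and Spearman ρ) follow their standard definitions; a self-contained computation, with a tie-free rank step for brevity:

```python
def pearson(x, y):
    # Pearson correlation: covariance normalised by the standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the ranks (ties not handled).
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(x), ranks(y))
```

A monotone but nonlinear relation gives ρ = 1 while R stays below 1, which is why benchmarks often report both.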
cs.RO / 5 / 2602.11393

Human Preference Modeling Using Visual Motion Prediction Improves Robot Skill Learning from Egocentric Human Video

基于视觉运动预测的人类偏好建模提升了机器人从自我中心人类视频中学习技能的能力
Verghese, Mrinal, Atkeson, Christopher G.
Abstract
We present an approach to robot learning from egocentric human videos by modeling human preferences in a reward function and optimizing robot behavior to maximize this reward. Prior work on reward learning from human videos attempts to measure the long-term value of a visual state as the temporal distance between it and the terminal state in a demonstration video. These approaches make assumptions that limit performance when learning from video. They must also transfer the learned value function across the embodiment and environment gap. Our method models human preferences by learning to predict the motion of tracked points between subsequent images and defines a reward function as the agreement between predicted and observed object motion in a robot's behavior at each step. We then use a modified Soft Actor Critic (SAC) algorithm initialized with 10 on-robot demonstrations to estimate a value function from this reward and optimize a policy that maximizes this value function, all on the robot. Our approach is capable of learning on a real robot, and we show that policies learned with our reward model match or outperform prior work across multiple tasks in both simulation and on the real robot.
Chinese Translation
我们提出了一种通过建模人类偏好来进行机器人学习的方法,该方法利用自我中心的人类视频,并优化机器人行为以最大化奖励函数。以往关于从人类视频中学习奖励的研究试图测量视觉状态的长期价值,作为其与演示视频中终止状态之间的时间距离。这些方法的假设在从视频学习时限制了性能,并且必须在不同的体现和环境之间转移学习到的价值函数。我们的方法通过学习预测后续图像中跟踪点的运动来建模人类偏好,并将奖励函数定义为机器人在每一步行为中预测的物体运动与观察到的物体运动之间的一致性。然后,我们使用一种修改过的软演员评论家(Soft Actor Critic, SAC)算法,基于10次机器人上的演示初始化,以从该奖励中估计价值函数,并优化一个最大化该价值函数的策略,所有操作均在机器人上进行。我们的方法能够在真实机器人上进行学习,并且我们展示了使用我们的奖励模型学习的策略在多个任务中在仿真和真实机器人上与以往工作相匹配或超越。
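One plausible instantiation of the motion-agreement reward described above is the mean cosine similarity between predicted and observed displacement vectors of the tracked points. The abstract only states "agreement between predicted and observed object motion", so the exact metric here is an assumption.

```python
def motion_agreement_reward(predicted, observed):
    # Mean cosine similarity between predicted and observed 2-D displacement
    # vectors of tracked points (hypothetical agreement metric).
    total = 0.0
    for (px, py), (ox, oy) in zip(predicted, observed):
        norm = ((px * px + py * py) ** 0.5) * ((ox * ox + oy * oy) ** 0.5)
        total += (px * ox + py * oy) / norm if norm > 1e-8 else 1.0
    return total / len(predicted)
```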
cs.RO / 6 / 2602.11464

EasyMimic: A Low-Cost Framework for Robot Imitation Learning from Human Videos

EasyMimic:一种低成本的机器人模仿学习框架,基于人类视频
Zhang, Tao, Xia, Song, Wang, Ye, Jin, Qin
Abstract
Robot imitation learning is often hindered by the high cost of collecting large-scale, real-world data. This challenge is especially significant for low-cost robots designed for home use, as they must be both user-friendly and affordable. To address this, we propose the EasyMimic framework, a low-cost and replicable solution that enables robots to quickly learn manipulation policies from human video demonstrations captured with standard RGB cameras. Our method first extracts 3D hand trajectories from the videos. An action alignment module then maps these trajectories to the gripper control space of a low-cost robot. To bridge the human-to-robot domain gap, we introduce a simple and user-friendly hand visual augmentation strategy. We then use a co-training method, fine-tuning a model on both the processed human data and a small amount of robot data, enabling rapid adaptation to new tasks. Experiments on the low-cost LeRobot platform demonstrate that EasyMimic achieves high performance across various manipulation tasks. It significantly reduces the reliance on expensive robot data collection, offering a practical path for bringing intelligent robots into homes. Project website: https://zt375356.github.io/EasyMimic-Project/.
Chinese Translation
机器人模仿学习常常受到收集大规模真实世界数据的高成本的制约。对于设计用于家庭使用的低成本机器人而言,这一挑战尤为显著,因为它们必须既用户友好又经济实惠。为了解决这一问题,我们提出了EasyMimic框架,这是一种低成本且可复制的解决方案,使机器人能够快速从使用标准RGB摄像头捕捉的人类视频演示中学习操作策略。我们的方法首先从视频中提取三维手部轨迹。然后,动作对齐模块将这些轨迹映射到低成本机器人的抓取控制空间。为了弥合人类与机器人之间的领域差距,我们引入了一种简单且用户友好的手部视觉增强策略。接着,我们使用共训练方法,在处理过的人类数据和少量机器人数据上微调模型,从而实现对新任务的快速适应。在低成本的LeRobot平台上的实验表明,EasyMimic在各种操作任务中都表现出高性能。它显著减少了对昂贵机器人数据收集的依赖,为将智能机器人带入家庭提供了一条切实可行的路径。项目网站:https://zt375356.github.io/EasyMimic-Project/
cs.RO / 7 / 2602.11468

Effective Task Planning with Missing Objects using Learning-Informed Object Search

基于学习驱动的物体搜索实现缺失物体的有效任务规划
Arnob, Raihan Islam, Merlin, Max, Paudel, Abhishek, Hedegaard, Benned, Konidaris, George, Stein, Gregory
Abstract
Task planning for mobile robots often assumes full environment knowledge, and so popular approaches, like planning via PDDL, cannot plan when the locations of task-critical objects are unknown. Recent learning-driven object search approaches are effective, but operate as standalone tools and so are not straightforwardly incorporated into full task planners, which must additionally determine both what objects are necessary and when in the plan they should be sought out. To address this limitation, we develop a planning framework centered around novel model-based Learning-Informed Object Search (LIOS) actions: each a policy that aims to find and retrieve a single object. High-level planning treats LIOS actions as deterministic and, informed by model-based calculations of the expected cost of each, generates plans that interleave search and execution for effective, sound, and complete learning-informed task planning despite uncertainty. Our work effectively reasons about uncertainty while maintaining compatibility with existing full-knowledge solvers. In simulated ProcTHOR homes and in the real world, our approach outperforms non-learned and learned baselines on tasks including retrieval and meal prep.
Chinese Translation
移动机器人任务规划通常假设对环境有全面的了解,因此像 PDDL 这样的流行方法在任务关键物体的位置未知时无法进行规划。最近的学习驱动物体搜索方法虽然有效,但作为独立工具运作,因此不易直接融入完整的任务规划器中,而后者还必须确定哪些物体是必要的,以及在计划中何时应寻找这些物体。为了解决这一局限性,我们开发了一个以新型基于模型的 LIOS(Learning-Informed Object Search)动作为中心的规划框架:每个动作都是一个旨在寻找和获取单个物体的策略。高层规划将 LIOS 动作视为确定性的,因此在基于模型的预期成本计算的指导下,生成交替进行搜索和执行的计划,从而即使存在不确定性,也能实现有效、可靠且完备的学习驱动任务规划。我们的工作有效地考虑了不确定性,同时与现有的全知识求解器兼容。在模拟的 ProcTHOR 家庭环境和现实世界中,我们的方法在包括物体检索和餐食准备等任务上优于非学习和学习基线。
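Why the expected-cost calculation matters can be shown with a classic toy search model (an assumption for illustration, not the paper's estimator): if candidate locations have independent detection probabilities, the expected cost of a search action depends on visiting order, and the classical rule of visiting locations in decreasing probability-to-cost ratio minimises it.

```python
def expected_search_cost(rooms):
    # rooms: ordered list of (travel_cost, p_found_here), with independent
    # per-location probabilities -- a toy model of one object-search action.
    cost, p_not_found = 0.0, 1.0
    for c, p in rooms:
        cost += p_not_found * c        # pay c only if not found earlier
        p_not_found *= 1.0 - p
    return cost

def best_order(rooms):
    # Classical ratio rule: decreasing p/c order minimises the expected
    # cost under this independence assumption.
    return sorted(rooms, key=lambda r: r[1] / r[0], reverse=True)
```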
cs.RO / 8 / 2602.11554

HyperDet: 3D Object Detection with Hyper 4D Radar Point Clouds

HyperDet:基于超4D雷达点云的3D物体检测
Xiao, Yichun, Guan, Runwei, Ding, Fangqiang
Abstract
4D mmWave radar provides weather-robust, velocity-aware measurements and is more cost-effective than LiDAR. However, radar-only 3D detection still trails LiDAR-based systems because radar point clouds are sparse, irregular, and often corrupted by multipath noise, yielding weak and unstable geometry. We present HyperDet, a detector-agnostic radar-only 3D detection framework that constructs a task-aware hyper 4D radar point cloud for standard LiDAR-oriented detectors. HyperDet aggregates returns from multiple surround-view 4D radars over consecutive frames to improve coverage and density, then applies geometry-aware cross-sensor consensus validation with a lightweight self-consistency check outside overlap regions to suppress inconsistent returns. It further integrates a foreground-focused diffusion module with training-time mixed radar-LiDAR supervision to densify object structures while lifting radar attributes (e.g., Doppler, RCS); the model is distilled into a consistency model for single-step inference. On MAN TruckScenes, HyperDet consistently improves over raw radar inputs with VoxelNeXt and CenterPoint, partially narrowing the radar-LiDAR gap. These results show that input-level refinement enables radar to better leverage LiDAR-oriented detectors without architectural modifications.
Chinese Translation
4D毫米波雷达提供了抗天气干扰、速度感知的测量,并且比激光雷达(LiDAR)更具成本效益。然而,单纯依赖雷达的3D检测仍落后于基于激光雷达的系统,因为雷达点云稀疏、不规则,并且常常受到多路径噪声的干扰,导致几何形状弱且不稳定。我们提出了HyperDet,一个与检测器无关的雷达专用3D检测框架,构建了一个任务感知的超4D雷达点云,以适应标准的激光雷达导向检测器。HyperDet通过连续帧聚合来自多个全景4D雷达的返回信号,以提高覆盖率和密度,然后应用几何感知的跨传感器一致性验证,并在重叠区域外进行轻量级的自一致性检查,以抑制不一致的返回信号。它进一步整合了一个以前景为中心的扩散模块,并在训练阶段采用混合的雷达-激光雷达监督,以增强物体结构的密度,同时提升雷达属性(例如,多普勒、雷达散射截面(RCS));该模型被提炼为一个一致性模型,以实现单步推理。在MAN TruckScenes数据集上,HyperDet在使用VoxelNeXt和CenterPoint的情况下,始终优于原始雷达输入,部分缩小了雷达与激光雷达之间的差距。这些结果表明,输入级别的优化使雷达能够更好地利用激光雷达导向的检测器,而无需进行架构修改。
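The density-raising aggregation step amounts to transforming past multi-radar returns into the current frame before concatenation. A 2-D rigid-transform sketch (the actual pipeline works in 3-D with consensus validation and noise suppression on top):

```python
import math

def aggregate_frames(frames, rel_poses):
    # Hyper point cloud construction (2-D toy): returns from several
    # radars/frames are mapped into the current frame by their relative
    # ego pose (tx, ty, yaw) and concatenated, raising density and coverage.
    cloud = []
    for points, (tx, ty, yaw) in zip(frames, rel_poses):
        c, s = math.cos(yaw), math.sin(yaw)
        cloud += [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]
    return cloud
```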
cs.RO / 9 / 2602.11575

ReaDy-Go: Real-to-Sim Dynamic 3D Gaussian Splatting Simulation for Environment-Specific Visual Navigation with Moving Obstacles

ReaDy-Go:针对特定环境的动态3D高斯点云仿真在移动障碍物下的真实到仿真视觉导航
Yoo, Seungyeon, Jang, Youngseok, Kim, Dabin, Han, Youngsoo, Jung, Seungwoo, Kim, H. Jin
Abstract
Visual navigation models often struggle in real-world dynamic environments due to limited robustness to the sim-to-real gap and the difficulty of training policies tailored to target deployment environments (e.g., households, restaurants, and factories). Although real-to-sim navigation simulation using 3D Gaussian Splatting (GS) can mitigate this gap, prior works have assumed only static scenes or unrealistic dynamic obstacles, despite the importance of safe navigation in dynamic environments. To address these issues, we propose ReaDy-Go, a novel real-to-sim simulation pipeline that synthesizes photorealistic dynamic scenarios for target environments. ReaDy-Go generates photorealistic navigation datasets for dynamic environments by combining a reconstructed static GS scene with dynamic human GS obstacles, and trains policies robust to both the sim-to-real gap and moving obstacles. The pipeline consists of three components: (1) a dynamic GS simulator that integrates scene GS with a human animation module, enabling the insertion of animatable human GS avatars and the synthesis of plausible human motions from 2D trajectories, (2) navigation dataset generation for dynamic environments that leverages the simulator, a robot expert planner designed for dynamic GS representations, and a human planner, and (3) policy learning using the generated datasets. ReaDy-Go outperforms baselines across target environments in both simulation and real-world experiments, demonstrating improved navigation performance even after sim-to-real transfer and in the presence of moving obstacles. Moreover, zero-shot sim-to-real deployment in an unseen environment indicates its generalization potential. Project page: https://syeon-yoo.github.io/ready-go-site/.
Chinese Translation
视觉导航模型在真实世界的动态环境中常常面临挑战,主要由于对真实与仿真之间差距的鲁棒性有限,以及针对目标部署环境(如家庭、餐厅和工厂)定制训练策略的困难。尽管使用3D高斯点云(Gaussian Splatting, GS)进行真实到仿真的导航仿真可以缓解这一差距,但以往的研究仅假设静态场景或不现实的动态障碍物,忽视了在动态环境中安全导航的重要性。为了解决这些问题,我们提出了ReaDy-Go,这是一种新颖的真实到仿真的仿真管道,能够为目标环境合成逼真的动态场景。ReaDy-Go通过将重建的静态GS场景与动态人类GS障碍物相结合,生成动态环境的逼真导航数据集,并训练对真实与仿真差距及移动障碍物都具有鲁棒性的策略。该管道由三个组件组成:(1)一个动态GS仿真器,将场景GS与人类动画模块集成,能够插入可动画的人类GS虚拟形象,并从2D轨迹合成合理的人类动作;(2)利用该仿真器生成动态环境的导航数据集,设计了一个针对动态GS表示的机器人专家规划器和一个人类规划器;(3)使用生成的数据集进行策略学习。ReaDy-Go在目标环境的仿真和真实世界实验中均优于基线,显示出即使在真实与仿真转移后以及存在移动障碍物的情况下,导航性能得到了改善。此外,在未见环境中的零样本真实到仿真部署表明其泛化潜力。项目页面:https://syeon-yoo.github.io/ready-go-site/
cs.RO / 10 / 2602.11598

ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation

ABot-N0:多功能具身导航的VLA基础模型技术报告
Chu, Zedong, Xie, Shichao, Wu, Xiaolong, Shen, Yanfen, Luo, Minghua, Wang, Zhengbo, Liu, Fei, Leng, Xiaoxu, Hu, Junjun, Yin, Mingyang, Lu, Jia, Guo, Yingnan, Yang, Kai, Han, Jiawei, Chen, Xu, Zhu, Yanqing, Zhao, Yuxiang, Liu, Xin, Yang, Yirong, He, Ye, Wang, Jiahang, Cai, Yang, Zhang, Tianlin, Gao, Li, Liu, Liu, Sun, Mingchao, Jiang, Fan, Wang, Chiyu, Liu, Zhicheng, Pan, Hongyu, Han, Honglin, Gu, Zhining, Yang, Kuan, Zhang, Jianfang, Jing, Di, Guan, Zihao, Guo, Wei, Liu, Guoqing, Yang, Di, Yang, Xiangpo, Yang, Menglin, Xing, Hongguang, Li, Weiguo, Xu, Mu
Abstract
Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a "Grand Unification" across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 utilizes a hierarchical "Brain-Action" architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation. To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 km²). ABot-N0 achieves new SOTA performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.
Chinese Translation
具身导航长期以来受到特定任务架构的碎片化影响。我们介绍了ABot-N0,这是一种统一的视觉-语言-动作(Vision-Language-Action, VLA)基础模型,实现在5个核心任务上的“伟大统一”:目标点导航(Point-Goal)、物体目标导航(Object-Goal)、指令跟随(Instruction-Following)、兴趣点目标导航(POI-Goal)和人跟随(Person-Following)。ABot-N0采用层次化的“脑-动作”(Brain-Action)架构,将基于大语言模型(LLM)的认知大脑与基于流匹配(Flow Matching)的动作专家相结合,以实现精确、连续的轨迹生成。为了支持大规模学习,我们开发了ABot-N0数据引擎,整理了来自7,802个高保真3D场景(10.7平方公里)的1690万个专家轨迹和500万个推理样本。ABot-N0在7个基准测试中实现了新的最先进(SOTA)性能,显著超越了专门模型。此外,我们的代理导航系统(Agentic Navigation System)集成了一个具有层次拓扑记忆的规划器,使其能够在动态现实环境中执行稳健的长时间任务。
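A Flow Matching action expert is typically trained on linearly interpolated noise-action pairs with a constant target velocity. A minimal sketch of that data path (the rectified-flow formulation; illustrative, not ABot-N0's code):

```python
def flow_matching_pair(noise, action, t):
    # Rectified-flow style linear path: x_t = (1-t)*noise + t*action, whose
    # constant velocity (action - noise) is the regression target for the
    # action expert. Vectors are plain lists in this sketch.
    x_t = [(1 - t) * n + t * a for n, a in zip(noise, action)]
    target_v = [a - n for n, a in zip(noise, action)]
    return x_t, target_v

def fm_loss(pred_v, target_v):
    # Mean squared error between predicted and target velocities.
    return sum((p - q) ** 2 for p, q in zip(pred_v, target_v)) / len(pred_v)
```

At inference, integrating the learned velocity field from a noise sample reproduces a continuous action trajectory in a small number of steps.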
cs.RO / 11 / 2602.11643

ViTaS: Visual Tactile Soft Fusion Contrastive Learning for Visuomotor Learning

ViTaS:用于视觉运动学习的视觉触觉软融合对比学习
Tian, Yufeng, Cheng, Shuiqi, Wei, Tianming, Zhou, Tianxing, Zhang, Yuanhang, Liu, Zixian, Han, Qianwei, Yuan, Zhecheng, Xu, Huazhe
Abstract
Tactile information plays a crucial role in human manipulation tasks and has recently garnered increasing attention in robotic manipulation. However, existing approaches mostly focus on the alignment of visual and tactile features and the integration mechanism tends to be direct concatenation. Consequently, they struggle to effectively cope with occluded scenarios due to neglecting the inherent complementary nature of both modalities and the alignment may not be exploited enough, limiting the potential of their real-world deployment. In this paper, we present ViTaS, a simple yet effective framework that incorporates both visual and tactile information to guide the behavior of an agent. We introduce Soft Fusion Contrastive Learning, an advanced version of conventional contrastive learning method and a CVAE module to utilize the alignment and complementarity within visuo-tactile representations. We demonstrate the effectiveness of our method in 12 simulated and 3 real-world environments, and our experiments show that ViTaS significantly outperforms existing baselines. Project page: https://skyrainwind.github.io/ViTaS/index.html.
Chinese Translation
触觉信息在人的操作任务中发挥着至关重要的作用,近年来在机器人操作领域也受到越来越多的关注。然而,现有的方法大多集中于视觉和触觉特征的对齐,而集成机制往往是直接拼接。因此,这些方法在处理遮挡场景时效果不佳,因为它们忽视了两种模态的内在互补性,并且对齐的利用可能不够充分,从而限制了其在现实世界中的应用潜力。本文提出了ViTaS,一个简单而有效的框架,结合视觉和触觉信息来指导智能体的行为。我们引入了软融合对比学习(Soft Fusion Contrastive Learning),这是一种传统对比学习方法的高级版本,并结合了条件变分自编码器(CVAE)模块,以利用视觉-触觉表征中的对齐和互补性。我们在12个模拟环境和3个真实环境中验证了我们方法的有效性,实验结果表明ViTaS显著优于现有基线。项目页面:https://skyrainwind.github.io/ViTaS/index.html。
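A hedged sketch of what "soft" fusion contrastive learning could look like: conventional InfoNCE uses a one-hot target over the batch, while a soft variant spreads target mass so partially matching visuo-tactile pairs are not treated as pure negatives. The target distribution below is hypothetical, not the paper's formulation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def soft_contrastive_loss(sim_row, soft_targets):
    # Cross-entropy between the softmax over one row of the visuo-tactile
    # similarity matrix and a *soft* target distribution. One-hot targets
    # recover standard InfoNCE.
    probs = softmax(sim_row)
    return -sum(t * math.log(p) for t, p in zip(soft_targets, probs) if t > 0)
```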
cs.RO / 12 / 2602.11648

Human-Like Gaze Behavior in Social Robots: A Deep Learning Approach Integrating Human and Non-Human Stimuli

社交机器人中的类人注视行为:一种整合人类与非人类刺激的深度学习方法
Vahedi, Faezeh, Memari, Morteza, Tabatabaei, Ramtin, Taheri, Alireza
Abstract
Nonverbal behaviors, particularly gaze direction, play a crucial role in enhancing effective communication in social interactions. As social robots increasingly participate in these interactions, they must adapt their gaze based on human activities and remain receptive to all cues, whether human-generated or not, to ensure seamless and effective communication. This study aims to increase the similarity between robot and human gaze behavior across various social situations, including both human and non-human stimuli (e.g., conversations, pointing, door openings, and object drops). A key innovation in this study is the investigation of gaze responses to non-human stimuli, a critical yet underexplored area in prior research. These scenarios were simulated in the Unity software as a 3D animation and a 360-degree real-world video. Data on gaze directions from 41 participants were collected via virtual reality (VR) glasses. The preprocessed data were used to train two neural networks, LSTM and Transformer, to build predictive models based on individuals' gaze patterns. In the animated scenario, the LSTM and Transformer models achieved prediction accuracies of 67.6% and 70.4%, respectively; in the real-world scenario, they achieved accuracies of 72% and 71.6%, respectively. Despite the gaze pattern differences among individuals, our models outperform existing approaches in accuracy while uniquely considering non-human stimuli, offering a significant advantage over previous literature. Furthermore, deployed on the NAO robot, the system was evaluated by 275 participants via a comprehensive questionnaire, with results demonstrating high satisfaction during interactions. This work advances social robotics by enabling robots to dynamically mimic human gaze behavior in complex social contexts.
Chinese Translation
非语言行为,特别是注视方向,在增强社交互动中的有效沟通方面发挥着至关重要的作用。随着社交机器人越来越多地参与这些互动,它们必须根据人类活动调整自己的注视,并对所有线索保持敏感,无论是人类生成的还是非人类生成的,以确保无缝且有效的沟通。本研究旨在提高机器人与人类在各种社交情境下的注视行为相似性,包括人类和非人类刺激(例如,交谈、指点、开门和物体掉落)。本研究的一个关键创新在于对非人类刺激的注视反应进行研究,这是先前研究中一个重要但未被充分探索的领域。这些场景在Unity软件中模拟为3D动画和360度真实世界视频。通过虚拟现实(VR)眼镜收集了41名参与者的注视方向数据。利用预处理后的数据训练了两种神经网络(LSTM和Transformer),构建基于个体注视模式的预测模型。在动画场景中,LSTM和Transformer模型的预测准确率分别达到了67.6%和70.4%;在真实场景中,两者的准确率分别为72%和71.6%。尽管个体之间的注视模式存在差异,我们的模型在准确性上优于现有方法,同时独特地考虑了非人类刺激,提供了相较于以往文献的显著优势。此外,该系统在NAO机器人上部署,通过全面问卷评估了275名参与者的反馈,结果显示在互动过程中满意度较高。本研究通过使机器人能够在复杂社交环境中动态模仿人类注视行为,推动了社交机器人技术的发展。
cs.RO / 13 / 2602.11735

AC-MASAC: An Attentive Curriculum Learning Framework for Heterogeneous UAV Swarm Coordination

AC-MASAC:一种用于异构无人机群协调的关注课程学习框架
Liu, Wanhao, Dai, Junhong, Zhang, Yixuan, Yin, Shengyun, Li, Panshuo
Abstract
Cooperative path planning for heterogeneous UAV swarms poses significant challenges for Multi-Agent Reinforcement Learning (MARL), particularly in handling asymmetric inter-agent dependencies and addressing the risks of sparse rewards and catastrophic forgetting during training. To address these issues, this paper proposes an attentive curriculum learning framework (AC-MASAC). The framework introduces a role-aware heterogeneous attention mechanism to explicitly model asymmetric dependencies. Moreover, a structured curriculum strategy is designed, integrating hierarchical knowledge transfer and stage-proportional experience replay to address the issues of sparse rewards and catastrophic forgetting. The proposed framework is validated on a custom multi-agent simulation platform, and the results show that our method has significant advantages over other advanced methods in terms of Success Rate, Formation Keeping Rate, and Success-weighted Mission Time. The code is available at https://github.com/Wanhao-Liu/AC-MASAC.
Chinese Translation
异构无人机群的协作路径规划对多智能体强化学习(MARL)提出了重大挑战,特别是在处理不对称的智能体间依赖关系以及应对稀疏奖励和训练过程中的灾难性遗忘风险方面。为了解决这些问题,本文提出了一种关注课程学习框架(AC-MASAC)。该框架引入了一种角色感知的异构注意机制,以明确建模不对称依赖关系。此外,设计了一种结构化的课程策略,结合了层次知识转移和阶段性经验重放,以应对稀疏奖励和灾难性遗忘的问题。所提出的框架在一个定制的多智能体仿真平台上进行了验证,结果表明我们的方法在成功率、队形保持率和成功加权任务时间等方面显著优于其他先进方法。代码可在 https://github.com/Wanhao-Liu/AC-MASAC 获取。
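Stage-proportional experience replay can be sketched as reserving a fixed share of each batch for every curriculum stage's buffer, so earlier stages stay represented during later training. The weights and API below are a plausible instantiation, not the paper's exact scheme.

```python
import random

def stage_proportional_batch(buffers, weights, batch_size, rng=None):
    # buffers: {stage: [transitions]}, weights: {stage: replay proportion}.
    # Sampling a fixed share from every curriculum stage keeps earlier
    # stages in the batch, countering catastrophic forgetting.
    rng = rng or random.Random(0)
    total = sum(weights.values())
    batch = []
    for stage, buf in buffers.items():
        k = round(batch_size * weights[stage] / total)
        batch += [rng.choice(buf) for _ in range(k)]
    return batch
```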
cs.RO / 14 / 2602.11758

HAIC: Humanoid Agile Object Interaction Control via Dynamics-Aware World Model

HAIC:基于动态感知世界模型的人形灵活物体交互控制
Li, Dongting, Chen, Xingyu, Wu, Qianyang, Chen, Bo, Wu, Sikai, Wu, Hanyu, Zhang, Guoyao, Li, Liang, Zhou, Mingliang, Xiang, Diyun, Ma, Jianzhu, Zhang, Qiang, Xu, Renjing
Abstract
Humanoid robots show promise for complex whole-body tasks in unstructured environments. Although Human-Object Interaction (HOI) has advanced, most methods focus on fully actuated objects rigidly coupled to the robot, ignoring underactuated objects with independent dynamics and non-holonomic constraints. These introduce control challenges from coupling forces and occlusions. We present HAIC, a unified framework for robust interaction across diverse object dynamics without external state estimation. Our key contribution is a dynamics predictor that estimates high-order object states (velocity, acceleration) solely from proprioceptive history. These predictions are projected onto static geometric priors to form a spatially grounded dynamic occupancy map, enabling the policy to infer collision boundaries and contact affordances in blind spots. We use asymmetric fine-tuning, where a world model continuously adapts to the student policy's exploration, ensuring robust state estimation under distribution shifts. Experiments on a humanoid robot show HAIC achieves high success rates in agile tasks (skateboarding, cart pushing/pulling under various loads) by proactively compensating for inertial perturbations, and also masters multi-object long-horizon tasks like carrying a box across varied terrain by predicting the dynamics of multiple objects.
Chinese Translation
人形机器人在非结构化环境中的复杂全身任务中展现出良好的前景。尽管人-物体交互(HOI)已有所进展,但大多数方法集中于与机器人刚性耦合的完全驱动物体,忽视了具有独立动态和非完整约束的欠驱动物体。这些因素引入了来自耦合力和遮挡的控制挑战。我们提出了HAIC,一个统一的框架,用于在不同物体动态下实现稳健的交互,而无需外部状态估计。我们的关键贡献是一个动态预测器,它仅通过本体感知历史来估计高阶物体状态(速度、加速度)。这些预测被投影到静态几何先验上,形成一个空间基础的动态占用图,使得策略能够在盲区推断碰撞边界和接触可用性。我们使用不对称微调,其中世界模型持续适应学生策略的探索,确保在分布变化下的稳健状态估计。在人形机器人上的实验表明,HAIC在灵活任务(如滑板、在各种负载下推拉手推车)中实现了高成功率,通过主动补偿惯性扰动,同时掌握了多物体长时间任务,如在不同地形上搬运箱子,通过预测多个物体的动态。
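A finite-difference stand-in for the high-order state estimation described above (the paper learns these states from proprioceptive history with a neural predictor; the positions, window, and differencing scheme here are purely illustrative):

```python
def high_order_states(positions, dt):
    # Backward finite differences over the last three position estimates
    # recover velocity and acceleration -- the high-order object states a
    # learned dynamics predictor would output.
    p0, p1, p2 = positions[-3], positions[-2], positions[-1]
    velocity = (p2 - p1) / dt
    acceleration = (p2 - 2 * p1 + p0) / dt ** 2
    return velocity, acceleration
```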
cs.RO / 15 / 2602.11862

LAMP: Implicit Language Map for Robot Navigation

LAMP:用于机器人导航的隐式语言地图
Lee, Sibaek, Yu, Hyeonwoo, Kim, Giseop, Choi, Sunwook
Abstract
Recent advances in vision-language models have made zero-shot navigation feasible, enabling robots to follow natural language instructions without requiring labeling. However, existing methods that explicitly store language vectors in grid or node-based maps struggle to scale to large environments due to excessive memory requirements and limited resolution for fine-grained planning. We introduce LAMP (Language Map), a novel neural language field-based navigation framework that learns a continuous, language-driven map and directly leverages it for fine-grained path generation. Unlike prior approaches, our method encodes language features as an implicit neural field rather than storing them explicitly at every location. By combining this implicit representation with a sparse graph, LAMP supports efficient coarse path planning and then performs gradient-based optimization in the learned field to refine poses near the goal. This coarse-to-fine, language-driven, gradient-guided optimization pipeline is the first application of an implicit language map to precise path generation. By leveraging semantic similarities in the learned feature space, this refinement is particularly effective at selecting goal regions that were not directly observed. To further enhance robustness, we adopt a Bayesian framework that models embedding uncertainty via the von Mises-Fisher distribution, thereby improving generalization to unobserved regions. To scale to large environments, LAMP employs a graph sampling strategy that prioritizes spatial coverage and embedding confidence, retaining only the most informative nodes and substantially reducing computational overhead. Our experimental results, both in NVIDIA Isaac Sim and on a real multi-floor building, demonstrate that LAMP outperforms existing explicit methods in both memory efficiency and fine-grained goal-reaching accuracy.
Chinese Translation
近期视觉-语言模型的进展使得零样本导航成为可能,使机器人能够遵循自然语言指令而无需标注数据。然而,现有方法在网格或基于节点的地图中显式存储语言向量,因过高的内存需求和有限的细粒度规划分辨率而难以扩展到大型环境。我们提出了LAMP(语言地图),一种新颖的基于神经语言场的导航框架,它学习一个连续的、以语言驱动的地图,并直接利用该地图进行细粒度路径生成。与之前的方法不同,我们的方法将语言特征编码为隐式神经场,而不是在每个位置显式存储。通过将这种隐式表示与稀疏图结合,LAMP支持高效的粗略路径规划,然后在学习到的场中执行基于梯度的优化,以精细化接近目标的姿态。这种由语言驱动、梯度引导的粗到细优化流程是隐式语言地图在精确路径生成中的首次应用。这种优化利用学习到的特征空间中的语义相似性,在选择未被直接观察到的目标区域时特别有效。为了进一步增强鲁棒性,我们采用了一个贝叶斯框架,通过von Mises-Fisher分布建模嵌入不确定性,从而提高对未观察区域的泛化能力。为了扩展到大型环境,LAMP采用了一种图采样策略,优先考虑空间覆盖和嵌入置信度,仅保留最具信息量的节点,从而显著减少计算开销。我们的实验结果,无论是在NVIDIA Isaac Sim中还是在真实的多层建筑中,都表明LAMP在内存效率和细粒度目标到达精度方面优于现有的显式方法。
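The gradient-based pose refinement step described above can be sketched with a toy differentiable field. The Gaussian similarity field, step size, and finite-difference gradients below are illustrative stand-ins for LAMP's learned implicit language field, not the authors' implementation:

```python
import numpy as np

def field_similarity(p, center=np.array([2.0, 3.0]), width=1.5):
    """Toy stand-in for an implicit language field: similarity peaks at `center`."""
    return np.exp(-np.sum((p - center) ** 2) / (2 * width ** 2))

def refine_pose(p0, lr=0.5, steps=200, eps=1e-4):
    """Gradient ascent on the field via finite differences (autodiff in practice)."""
    p = p0.astype(float).copy()
    for _ in range(steps):
        g = np.zeros_like(p)
        for i in range(len(p)):
            dp = np.zeros_like(p)
            dp[i] = eps
            g[i] = (field_similarity(p + dp) - field_similarity(p - dp)) / (2 * eps)
        p += lr * g  # move toward higher language-goal similarity
    return p

coarse_goal = np.array([0.0, 0.0])  # e.g., the nearest node from coarse graph planning
refined = refine_pose(coarse_goal)  # converges toward the field's peak at (2, 3)
```

The coarse graph gets the agent near the goal cheaply; the continuous field then supplies gradients for the final, fine-grained adjustment.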
cs.RO / 16 / 2602.11885

Learning to Manipulate Anything: Revealing Data Scaling Laws in Bounding-Box Guided Policies

学习操控任何物体:揭示边界框引导策略中的数据规模法则
Wu, Yihao, Ma, Jinming, Tan, Junbo, Yu, Yanzhao, Li, Shoujie, Zhou, Mingliang, Xiang, Diyun, Wang, Xueqian
Abstract
Diffusion-based policies show limited generalization in semantic manipulation, posing a key obstacle to the deployment of real-world robots. This limitation arises because relying solely on text instructions is inadequate to direct the policy's attention toward the target object in complex and dynamic environments. To solve this problem, we propose leveraging bounding-box instructions to directly specify the target object, and further investigate whether data scaling laws exist in semantic manipulation tasks. Specifically, we design a handheld segmentation device with an automated annotation pipeline, Label-UMI, which enables the efficient collection of demonstration data with semantic labels. We further propose a semantic-motion-decoupled framework that integrates object detection and a bounding-box-guided diffusion policy to improve generalization and adaptability in semantic manipulation. Through extensive real-world experiments on large-scale datasets, we validate the effectiveness of the approach and reveal a power-law relationship between generalization performance and the number of bounding-box objects. Finally, we summarize an effective data collection strategy for semantic manipulation, which achieves 85% success rates across four tasks on both seen and unseen objects. All datasets and code will be released to the community.
Chinese Translation
基于扩散的策略在语义操控中表现出有限的泛化能力,这成为了现实世界机器人部署的一个关键障碍。这一限制源于仅依赖文本指令不足以在复杂和动态环境中引导策略的注意力集中于目标物体。为了解决这个问题,我们提出利用边界框指令直接指定目标物体,并进一步探讨在语义操控任务中是否存在数据规模法则。具体而言,我们设计了一种手持分割设备,配备自动注释管道Label-UMI,能够高效收集带有语义标签的演示数据。我们进一步提出了一种语义运动解耦框架,该框架整合了物体检测和边界框引导的扩散策略,以提高语义操控中的泛化能力和适应性。在对大规模数据集进行的广泛现实世界实验中,我们验证了该方法的有效性,并揭示了泛化性能与边界框物体数量之间的幂律关系。最后,我们总结了一种有效的语义操控数据收集策略,该策略在已见和未见物体的四个任务中均能实现85%的成功率。所有数据集和代码将向社区开放。
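The reported power-law relationship between generalization performance and the number of bounding-box objects can be recovered, on synthetic data, by a straight-line fit in log-log space. The data below are hypothetical and only illustrate the fitting procedure, not the paper's results:

```python
import numpy as np

# Hypothetical data: generalization error vs. number of distinct bounding-box
# objects in the training set (exponent and prefactor are made up).
n_objects = np.array([8, 16, 32, 64, 128])
error = 0.9 * n_objects ** -0.35

# A power law y = a * x^b is a straight line in log-log space:
# log y = log a + b * log x, so ordinary least squares recovers (b, log a).
b, log_a = np.polyfit(np.log(n_objects), np.log(error), 1)
a = np.exp(log_a)  # prefactor; b is the scaling exponent
```

On real measurements the fit is noisy, but a roughly linear log-log trend is the usual evidence for a data scaling law.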
cs.RO / 17 / 2602.11929

General Humanoid Whole-Body Control via Pretraining and Fast Adaptation

通过预训练和快速适应实现通用类人全身控制
Wang, Zepeng, Wang, Jiangxing, Yao, Shiqing, Zhang, Yu, Ding, Ziluo, Yang, Ming, Wang, Yuxuan, Jiang, Haobin, Ma, Chao, Shi, Xiaochuan, Lu, Zongqing
Abstract
Learning a general whole-body controller for humanoid robots remains challenging due to the diversity of motion distributions, the difficulty of fast adaptation, and the need for robust balance in high-dynamic scenarios. Existing approaches often require task-specific training or suffer from performance degradation when adapting to new motions. In this paper, we present FAST, a general humanoid whole-body control framework that enables Fast Adaptation and Stable Motion Tracking. FAST introduces Parseval-Guided Residual Policy Adaptation, which learns a lightweight delta action policy under orthogonality and KL constraints, enabling efficient adaptation to out-of-distribution motions while mitigating catastrophic forgetting. To further improve physical robustness, we propose Center-of-Mass-Aware Control, which incorporates CoM-related observations and objectives to enhance balance when tracking challenging reference motions. Extensive experiments in simulation and real-world deployment demonstrate that FAST consistently outperforms state-of-the-art baselines in robustness, adaptation efficiency, and generalization.
Chinese Translation
为类人机器人学习通用全身控制器仍然面临挑战,主要由于运动分布的多样性、快速适应的困难以及在高动态场景中对稳健平衡的需求。现有方法通常需要特定任务的训练,或者在适应新运动时表现下降。本文提出了FAST,一个通用类人全身控制框架,能够实现快速适应和稳定运动跟踪。FAST引入了Parseval引导的残差策略适应(Parseval-Guided Residual Policy Adaptation),在正交性和KL约束下学习轻量级的增量动作策略,从而高效适应分布外运动,同时减轻灾难性遗忘。为了进一步提高物理稳健性,我们提出了重心感知控制(Center-of-Mass-Aware Control),结合与重心相关的观察和目标,以增强在跟踪具有挑战性的参考运动时的平衡性。在仿真和实际部署中的广泛实验表明,FAST在稳健性、适应效率和泛化能力方面始终优于最先进的基线方法。
cs.RO / 18 / 2602.11934

Robot-DIFT: Distilling Diffusion Features for Geometrically Consistent Visuomotor Control

Robot-DIFT:提取扩散特征以实现几何一致的视觉运动控制
Deng, Yu, Jin, Yufeng, Jia, Xiaogang, Xue, Jiahong, Neumann, Gerhard, Chalvatzaki, Georgia
Abstract
We hypothesize that a key bottleneck in generalizable robot manipulation is not solely data scale or policy capacity, but a structural mismatch between current visual backbones and the physical requirements of closed-loop control. While state-of-the-art vision encoders (including those used in VLAs) optimize for semantic invariance to stabilize classification, manipulation typically demands geometric sensitivity: the ability to map millimeter-level pose shifts to predictable feature changes. Their discriminative objective creates a "blind spot" for fine-grained control, whereas generative diffusion models inherently encode geometric dependencies within their latent manifolds, encouraging the preservation of dense multi-scale spatial structure. However, directly deploying diffusion features for control is hindered by stochastic instability, inference latency, and representation drift during fine-tuning. To bridge this gap, we propose Robot-DIFT, a framework that decouples the source of geometric information from the process of inference via Manifold Distillation. By distilling a frozen diffusion teacher into a deterministic Spatial-Semantic Feature Pyramid Network (S2-FPN), we retain the rich geometric priors of the generative model while ensuring temporal stability, real-time execution, and robustness against drift. Pretrained on the large-scale DROID dataset, Robot-DIFT demonstrates superior geometric consistency and control performance compared to leading discriminative baselines, supporting the view that how a model learns to see dictates how well it can learn to act.
Chinese Translation
我们假设,可推广的机器人操作的一个关键瓶颈不仅仅是数据规模或策略能力,而是当前视觉骨干与闭环控制的物理要求之间的结构不匹配。尽管最先进的视觉编码器(包括在视觉语言模型中使用的编码器)优化了语义不变性以稳定分类,但操作通常需要几何敏感性,即将毫米级姿态变化映射到可预测特征变化的能力。它们的判别目标在细粒度控制上造成了“盲点”,而生成扩散模型本质上在其潜在流形中编码了几何依赖性,鼓励保持密集的多尺度空间结构。然而,直接将随机扩散特征用于控制受到随机不稳定性、推理延迟和微调过程中的表示漂移的阻碍。为了弥补这一差距,我们提出了Robot-DIFT,一个通过流形蒸馏将几何信息源与推理过程解耦的框架。通过将一个冻结的扩散教师蒸馏为一个确定性的空间-语义特征金字塔网络(S2-FPN),我们保留了生成模型丰富的几何先验,同时确保时间稳定性、实时执行和抗漂移的鲁棒性。在大规模DROID数据集上进行预训练后,Robot-DIFT在几何一致性和控制性能方面表现优于领先的判别基线,支持了模型学习如何看待事物决定其学习如何行动的观点。
cs.RO / 19 / 2602.11978

Accelerating Robotic Reinforcement Learning with Agent Guidance

通过代理引导加速机器人强化学习
Chen, Haojun, Zou, Zili, Ma, Chengdong, Pu, Yaoxiang, Zhang, Haotong, Chen, Yuanpei, Yang, Yaodong
Abstract
Reinforcement Learning (RL) offers a powerful paradigm for autonomous robots to master generalist manipulation skills through trial-and-error. However, its real-world application is stifled by severe sample inefficiency. Recent Human-in-the-Loop (HIL) methods accelerate training by using human corrections, yet this approach faces a scalability barrier. Reliance on human supervisors imposes a 1:1 supervision ratio that limits fleet expansion, suffers from operator fatigue over extended sessions, and introduces high variance due to inconsistent human proficiency. We present Agent-guided Policy Search (AGPS), a framework that automates the training pipeline by replacing human supervisors with a multimodal agent. Our key insight is that the agent can be viewed as a semantic world model, injecting intrinsic value priors to structure physical exploration. By using executable tools, the agent provides precise guidance via corrective waypoints and spatial constraints for exploration pruning. We validate our approach on two tasks, ranging from precision insertion to deformable object manipulation. Results demonstrate that AGPS outperforms HIL methods in sample efficiency. This automates the supervision pipeline, unlocking the path to labor-free and scalable robot learning. Project website: https://agps-rl.github.io/agps.
Chinese Translation
强化学习(Reinforcement Learning, RL)为自主机器人通过试错掌握通用操作技能提供了一种强大的范式。然而,其在现实世界中的应用受到严重样本效率低下的制约。近期的人机协同(Human-in-the-Loop, HIL)方法通过人类修正来加速训练,但这种方法面临可扩展性障碍:对人类监督者的依赖导致1:1的监督比例,限制了机器人集群的扩展,长时间操作容易导致监督者疲劳,并因人类熟练程度不一致而引入高方差。我们提出了代理引导策略搜索(Agent-guided Policy Search, AGPS),该框架通过用多模态代理替代人类监督者来自动化训练流程。我们的关键见解是,代理可以被视为一种语义世界模型,注入内在价值先验以构建物理探索的结构。通过使用可执行工具,代理经由纠正性路径点和空间约束提供精确引导,以剪枝探索空间。我们在从精确插入到可变形物体操作的两个任务上验证了我们的方法。结果表明,AGPS在样本效率上优于HIL方法。这一方法自动化了监督流程,为无需人工参与且可扩展的机器人学习开辟了道路。项目网站:https://agps-rl.github.io/agps。
cs.RO / 20 / 2602.12012

Decentralized Multi-Robot Obstacle Detection and Tracking in a Maritime Scenario

海洋场景下的去中心化多机器人障碍物检测与跟踪
Ahmed, Muhammad Farhan, Frémont, Vincent
Abstract
Autonomous aerial-surface robot teams are promising for maritime monitoring. Robust deployment requires reliable perception over reflective water and scalable coordination under limited communication. We present a decentralized multi-robot framework for detecting and tracking floating containers using multiple UAVs cooperating with an autonomous surface vessel. Each UAV performs YOLOv8 and stereo-disparity-based visual detection, then tracks targets with per-object EKFs using uncertainty-aware data association. Compact track summaries are exchanged and fused conservatively via covariance intersection, ensuring consistency under unknown correlations. An information-driven assignment module allocates targets and selects UAV hover viewpoints by trading expected uncertainty reduction against travel effort and safety separation. Simulation results in a maritime scenario demonstrate improved coverage, localization accuracy, and tracking consistency while maintaining modest communication requirements.
Chinese Translation
自主空中-水面机器人团队在海洋监测中展现出良好的前景。稳健的部署需要在反射水面上实现可靠的感知,并在有限通信条件下进行可扩展的协调。我们提出了一种去中心化的多机器人框架,利用多个无人机(UAV)与自主水面船只合作,检测和跟踪漂浮集装箱。每个无人机执行基于YOLOv8和立体视差的视觉检测,然后通过使用不确定性感知的数据关联,利用每个目标的扩展卡尔曼滤波器(EKF)进行跟踪。紧凑的轨迹摘要通过协方差交集进行保守交换和融合,确保在未知相关性下的一致性。一个信息驱动的分配模块通过权衡预期的不确定性减少与旅行努力和安全间隔,分配目标并选择无人机悬停视点。海洋场景下的仿真结果表明,在保持适度通信需求的同时,覆盖范围、定位精度和跟踪一致性得到了改善。
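The conservative track fusion via covariance intersection mentioned in the abstract can be sketched as follows. The trace-minimizing grid search over the mixing weight is one common heuristic, and all numbers are illustrative:

```python
import numpy as np

def covariance_intersection(x1, P1, x2, P2, omega=None):
    """Fuse two estimates with unknown cross-correlation via covariance intersection.

    If omega is not given, pick it by a coarse grid search minimizing the trace
    of the fused covariance (a common heuristic; finer optimizers also work).
    """
    if omega is None:
        candidates = np.linspace(0.01, 0.99, 99)
        traces = [np.trace(np.linalg.inv(w * np.linalg.inv(P1) + (1 - w) * np.linalg.inv(P2)))
                  for w in candidates]
        omega = candidates[int(np.argmin(traces))]
    P_inv = omega * np.linalg.inv(P1) + (1 - omega) * np.linalg.inv(P2)
    P = np.linalg.inv(P_inv)
    x = P @ (omega * np.linalg.inv(P1) @ x1 + (1 - omega) * np.linalg.inv(P2) @ x2)
    return x, P

# Two UAVs report the same floating container with complementary uncertainties.
x1, P1 = np.array([10.0, 5.0]), np.diag([4.0, 1.0])
x2, P2 = np.array([10.5, 4.8]), np.diag([1.0, 4.0])
x_fused, P_fused = covariance_intersection(x1, P1, x2, P2)
```

Unlike a Kalman update, the fused covariance never claims more certainty than the unknown correlation between the two tracks can justify, which is why CI stays consistent in decentralized settings.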
cs.RO / 21 / 2602.12024

Adaptive-Horizon Conflict-Based Search for Closed-Loop Multi-Agent Path Finding

基于冲突的自适应视野闭环多智能体路径规划
Li, Jiarui, Pecora, Federico, Zhang, Runyu, Zardini, Gioele
Abstract
Multi-Agent Path Finding (MAPF) is a core coordination problem for large robot fleets in automated warehouses and logistics. Existing approaches are typically either open-loop planners, which generate fixed trajectories and struggle to handle disturbances, or closed-loop heuristics without reliable performance guarantees, limiting their use in safety-critical deployments. This paper presents ACCBS, a closed-loop algorithm built on a finite-horizon variant of Conflict-Based Search (CBS) with a horizon-changing mechanism inspired by iterative deepening in model predictive control (MPC). ACCBS dynamically adjusts the planning horizon based on the available computational budget and reuses a single constraint tree to enable seamless transitions between horizons. As a result, it produces high-quality feasible solutions quickly while being asymptotically optimal as the budget increases, exhibiting anytime behavior. Extensive case studies demonstrate that ACCBS combines flexibility to disturbances with strong performance guarantees, effectively bridging the gap between theoretical optimality and practical robustness for large-scale robot deployment.
Chinese Translation
多智能体路径规划(MAPF)是自动化仓库和物流中大规模机器人队伍的核心协调问题。现有方法通常是开放式规划器,这些规划器生成固定轨迹并在处理干扰时表现不佳,或者是没有可靠性能保证的闭环启发式方法,限制了它们在安全关键部署中的应用。本文提出了ACCBS,这是一种基于CBS的有限视野变体构建的闭环算法,采用了受模型预测控制(MPC)中迭代加深启发的视野变化机制。ACCBS根据可用的计算预算动态调整规划视野,并重用单一约束树以实现视野之间的无缝过渡。结果,它能够快速生成高质量的可行解,同时在预算增加时渐近最优,展现出任意时间行为。广泛的案例研究表明,ACCBS将对干扰的灵活性与强大的性能保证相结合,有效地弥合了理论最优性与大规模机器人部署的实际鲁棒性之间的差距。
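The anytime, budget-driven horizon deepening that the abstract describes can be caricatured with a simple loop. The doubling schedule and the toy solver below are assumptions for illustration, not the paper's algorithm (which additionally reuses one constraint tree across horizons):

```python
import time

def anytime_plan(solve_with_horizon, budget_s, h0=2, h_max=32):
    """Iteratively deepen the planning horizon until the time budget is spent.

    solve_with_horizon(h) -> (cost, plan); the loop keeps the cheapest plan
    found so far, so a feasible answer is available at any interruption point.
    """
    best = None
    h = h0
    start = time.monotonic()
    while h <= h_max and time.monotonic() - start < budget_s:
        cost, plan = solve_with_horizon(h)
        if best is None or cost < best[0]:
            best = (cost, plan)
        h *= 2  # illustrative deepening schedule
    return best

# Toy solver: longer horizons find cheaper plans (illustrative only).
best_cost, best_plan = anytime_plan(lambda h: (100 / h, [f"step{h}"]), budget_s=0.1)
```

This is the essence of anytime behavior: solution quality improves monotonically with the budget, and the loop can be cut off after any iteration.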
cs.RO / 22 / 2602.12032

When would Vision-Proprioception Policies Fail in Robotic Manipulation?

视觉-本体感觉策略在机器人操作中何时会失效?
Lu, Jingxian, Xia, Wenke, Wu, Yuxuan, Lu, Zhiwu, Hu, Di
Abstract
Proprioceptive information is critical for precise servo control by providing real-time robotic states. Its collaboration with vision is highly expected to enhance the performance of manipulation policies in complex tasks. However, recent studies have reported inconsistent observations on the generalization of vision-proprioception policies. In this work, we investigate this by conducting temporally controlled experiments. We found that during task sub-phases in which the robot's motion transitions, and which therefore require target localization, the vision modality of the vision-proprioception policy plays a limited role. Further analysis reveals that the policy naturally gravitates toward concise proprioceptive signals that offer faster loss reduction during training, thereby dominating the optimization and suppressing the learning of the visual modality during motion-transition phases. To alleviate this, we propose the Gradient Adjustment with Phase-guidance (GAP) algorithm, which adaptively modulates the optimization of proprioception, enabling dynamic collaboration within the vision-proprioception policy. Specifically, we leverage proprioception to capture robotic states and estimate the probability of each timestep in the trajectory belonging to a motion-transition phase. During policy learning, we apply fine-grained adjustment that reduces the magnitude of proprioception's gradient based on the estimated probabilities, leading to robust and generalizable vision-proprioception policies. Comprehensive experiments demonstrate that GAP is applicable in both simulated and real-world environments, across one-arm and dual-arm setups, and compatible with both conventional and Vision-Language-Action models. We believe this work can offer valuable insights into the development of vision-proprioception policies in robotic manipulation.
Chinese Translation
本体感觉信息对于精确的伺服控制至关重要,因为它提供了实时的机器人状态。它与视觉的协作被高度期望能够增强复杂任务中操作策略的表现。然而,最近的研究报告了关于视觉-本体感觉策略泛化的不一致观察。在本研究中,我们通过进行时间控制实验来探讨这一问题。我们发现,在机器人运动过渡的任务子阶段,这些阶段需要目标定位时,视觉-本体感觉策略的视觉模态发挥的作用有限。进一步分析表明,该策略自然倾向于简洁的本体感觉信号,这些信号在训练时提供更快的损失减少,从而主导了优化过程并抑制了运动过渡阶段视觉模态的学习。为了解决这一问题,我们提出了带相位引导的梯度调整(Gradient Adjustment with Phase-guidance, GAP)算法,该算法自适应地调节本体感觉的优化,使视觉-本体感觉策略能够动态协作。具体而言,我们利用本体感觉捕捉机器人状态,并估计轨迹中每个时间步属于运动过渡阶段的概率。在策略学习过程中,我们应用细粒度调整,根据估计的概率减少本体感觉梯度的幅度,从而导致稳健且具有泛化能力的视觉-本体感觉策略。综合实验表明,GAP算法适用于模拟和现实环境,适用于单臂和双臂设置,并与传统模型和视觉-语言-动作模型兼容。我们相信这项工作能够为机器人操作中视觉-本体感觉策略的发展提供有价值的见解。
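The core GAP idea, attenuating the proprioception gradient where the estimated motion-transition probability is high, can be sketched as a per-timestep rescaling. The linear schedule and the `min_scale` floor are illustrative choices, not the authors' exact rule:

```python
import numpy as np

def modulate_gradient(grad, phase_prob, min_scale=0.1):
    """Scale a gradient down where phase_prob (motion-transition likelihood) is high.

    grad:       (T, D) per-timestep gradient of the proprioception branch
    phase_prob: (T,)   estimated probability each timestep is a motion transition
    """
    # 1.0 when prob = 0 (proprioception learns freely),
    # min_scale when prob = 1 (vision must carry the localization phase).
    scale = 1.0 - (1.0 - min_scale) * phase_prob
    return grad * scale[:, None]

grad = np.ones((4, 3))
phase_prob = np.array([0.0, 0.5, 1.0, 0.2])
out = modulate_gradient(grad, phase_prob)
```

Damping the faster-converging modality's gradient in exactly the phases where it is uninformative gives the visual branch room to fit those timesteps.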
cs.RO / 23 / 2602.12047

Safety Beyond the Training Data: Robust Out-of-Distribution MPC via Conformalized System Level Synthesis

超越训练数据的安全性:通过符合化系统级合成实现稳健的分布外模型预测控制
Srinivasan, Anutam, Leeman, Antoine, Chou, Glen
Abstract
We present a novel framework for robust out-of-distribution planning and control using conformal prediction (CP) and system level synthesis (SLS), addressing the challenge of ensuring safety and robustness when using learned dynamics models beyond the training data distribution. We first derive high-confidence model error bounds using weighted CP with a learned, state-control-dependent covariance model. These bounds are integrated into an SLS-based robust nonlinear model predictive control (MPC) formulation, which performs constraint tightening over the prediction horizon via volume-optimized forward reachable sets. We provide theoretical guarantees on coverage and robustness under distributional drift, and analyze the impact of data density and trajectory tube size on prediction coverage. Empirically, we demonstrate our method on nonlinear systems of increasing complexity, including a 4D car and a 12D quadcopter, improving safety and robustness compared to fixed-bound and non-robust baselines, especially outside of the data distribution.
Chinese Translation
我们提出了一种新颖的框架,利用符合预测(Conformal Prediction, CP)和系统级合成(System Level Synthesis, SLS)实现稳健的分布外规划与控制,解决了在使用学习的动态模型超出训练数据分布时确保安全性和稳健性这一挑战。我们首先使用加权的符合预测推导出高置信度的模型误差界限,该界限基于学习的状态控制依赖的协方差模型。这些界限被整合到基于SLS的稳健非线性模型预测控制(Model Predictive Control, MPC)公式中,通过体积优化的前向可达集在预测视域内执行约束收紧。我们提供了在分布漂移下的覆盖性和稳健性的理论保证,并分析了数据密度和轨迹管(tube)大小对预测覆盖的影响。在实证方面,我们在复杂性逐渐增加的非线性系统上演示了我们的方法,包括一个四维汽车和一个12维四旋翼,相较于固定界限和非稳健基线,特别是在数据分布之外,提高了安全性和稳健性。
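The split-conformal quantile underlying high-confidence model error bounds can be computed in a few lines. The Gaussian stand-in scores and the unweighted (i.i.d.) variant below are simplifications of the paper's weighted CP:

```python
import numpy as np

rng = np.random.default_rng(0)

# Calibration set: nonconformity scores, e.g. |predicted next state - true next state|
# on held-out transitions. Gaussian noise here is a stand-in for real model errors.
n_cal = 500
scores = np.abs(rng.normal(0.0, 1.0, n_cal))

alpha = 0.1  # target miscoverage level
# Finite-sample-corrected quantile used in split conformal prediction.
q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
q_hat = np.quantile(scores, q_level, method="higher")

# On exchangeable fresh data, |error| <= q_hat holds with probability >= 1 - alpha.
test_scores = np.abs(rng.normal(0.0, 1.0, 10000))
coverage = np.mean(test_scores <= q_hat)
```

In the paper's setting, `q_hat` plays the role of the model error bound that the SLS-based MPC then propagates into tightened constraints; the weighted variant reweights calibration scores to handle distribution shift.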
cs.RO / 24 / 2602.12062

HoloBrain-0 Technical Report

HoloBrain-0 技术报告
Lin, Xuewu, Lin, Tianwei, Du, Yun, Xie, Hongyu, Jin, Yiwei, Li, Jiawei, Wu, Shijie, Wang, Qingze, Li, Mengdi, Zhao, Mengao, Li, Ziang, Huang, Chaodong, Bi, Hongzhe, Huang, Lichao, Su, Zhizhong
Abstract
In this work, we introduce HoloBrain-0, a comprehensive Vision-Language-Action (VLA) framework that bridges the gap between foundation model research and reliable real-world robot deployment. The core of our system is a novel VLA architecture that explicitly incorporates robot embodiment priors, including multi-view camera parameters and kinematic descriptions (URDF), to enhance 3D spatial reasoning and support diverse embodiments. We validate this design through a scalable "pre-train then post-train" paradigm, achieving state-of-the-art results on simulation benchmarks such as RoboTwin 2.0, LIBERO, and GenieSim, as well as strong results on challenging long-horizon real-world manipulation tasks. Notably, our efficient 0.2B-parameter variant rivals significantly larger baselines, enabling low-latency on-device deployment. To further accelerate research and practical adoption, we fully open-source the entire HoloBrain ecosystem, which includes: (1) powerful pre-trained VLA foundations; (2) post-trained checkpoints for multiple simulation suites and real-world tasks; and (3) RoboOrchard, a full-stack VLA infrastructure for data curation, model training and deployment. Together with standardized data collection protocols, this release provides the community with a complete, reproducible path toward high-performance robotic manipulation.
Chinese Translation
在本研究中,我们介绍了 HoloBrain-0,这是一个综合的视觉-语言-动作(VLA)框架,旨在弥合基础模型研究与可靠的现实世界机器人部署之间的差距。我们系统的核心是一个新颖的 VLA 架构,明确地融入了机器人具身先验,包括多视角相机参数和运动学描述(URDF),以增强 3D 空间推理并支持多样化的具身形式。我们通过可扩展的“先预训练再后训练”范式验证了这一设计,在 RoboTwin 2.0、LIBERO 和 GenieSim 等仿真基准上取得了最先进的结果,并在具有挑战性的长时间现实世界操作任务上也取得了良好的成绩。值得注意的是,我们高效的 0.2B 参数变体与显著更大的基线相媲美,实现了低延迟的设备端部署。为了进一步加速研究和实际应用,我们完全开源了整个 HoloBrain 生态系统,包括:(1)强大的预训练 VLA 基础;(2)多个仿真套件和现实世界任务的后训练检查点;以及(3)RoboOrchard,一个用于数据策划、模型训练和部署的全栈 VLA 基础设施。结合标准化的数据收集协议,此次发布为社区提供了一条完整、可重复的高性能机器人操作路径。
cs.RO / 25 / 2602.12063

VLAW: Iterative Co-Improvement of Vision-Language-Action Policy and World Model

VLAW:视觉-语言-动作策略与世界模型的迭代共同改进
Guo, Yanjiang, Lee, Tony, Shi, Lucy Xiaoyang, Chen, Jianyu, Liang, Percy, Finn, Chelsea
Abstract
The goal of this paper is to improve the performance and reliability of vision-language-action (VLA) models through iterative online interaction. Since collecting policy rollouts in the real world is expensive, we investigate whether a learned simulator (specifically, an action-conditioned video generation model) can be used to generate additional rollout data. Unfortunately, existing world models lack the physical fidelity necessary for policy improvement: they are predominantly trained on demonstration datasets that lack coverage of many different physical interactions (particularly failure cases) and struggle to accurately model small yet critical physical details in contact-rich object manipulation. We propose a simple iterative improvement algorithm that uses real-world rollout data to improve the fidelity of the world model, which can then, in turn, be used to generate supplemental synthetic data for improving the VLA model. In our experiments on a real robot, we use this approach to improve the performance of a state-of-the-art VLA model on multiple downstream tasks. We achieve a 39.2% absolute success rate improvement over the base policy, 11.6% of which comes from training with the generated synthetic rollouts. Videos can be found at this anonymous website: https://sites.google.com/view/vla-w
Chinese Translation
本文的目标是通过迭代在线交互来提高视觉-语言-动作(VLA)模型的性能和可靠性。由于在现实世界中收集策略回放是昂贵的,我们探讨了是否可以使用学习到的模拟器——特别是一个基于动作条件的视频生成模型——来生成额外的回放数据。不幸的是,现有的世界模型缺乏进行策略改进所需的物理真实感:它们主要是在缺乏多种不同物理交互(特别是失败案例)覆盖的演示数据集上训练的,并且在接触丰富的物体操作中难以准确建模小而关键的物理细节。我们提出了一种简单的迭代改进算法,该算法利用真实世界的回放数据来提高世界模型的真实感,进而可以用于生成补充的合成数据,以改善VLA模型。在我们对真实机器人进行的实验中,我们使用这种方法提高了最先进的VLA模型在多个下游任务上的性能。我们实现了相对于基础策略39.2%的绝对成功率提升,以及通过使用生成的合成回放进行训练获得的11.6%的提升。视频可以在此匿名网站找到:https://sites.google.com/view/vla-w
cs.RO / 26 / 2602.12065

Affordance-Graphed Task Worlds: Self-Evolving Task Generation for Scalable Embodied Learning

赋能图任务世界:可扩展的具身学习自我演化任务生成
Liu, Xiang, Cui, Sen, Yao, Guocai, Cao, Zhong, Ma, Jingheng, Zhang, Min, Zhang, Changshui
Abstract
Training robotic policies directly in the real world is expensive and unscalable. Although generative simulation enables large-scale data synthesis, current approaches often fail to generate logically coherent long-horizon tasks and struggle with dynamic physical uncertainties due to open-loop execution. To address these challenges, we propose Affordance-Graphed Task Worlds (AGT-World), a unified framework that autonomously constructs interactive simulated environments and corresponding robot task policies based on real-world observations. Unlike methods relying on random proposals or static replication, AGT-World formalizes the task space as a structured graph, enabling the precise, hierarchical decomposition of complex goals into theoretically grounded atomic primitives. Furthermore, we introduce a Self-Evolution mechanism with hybrid feedback that autonomously refines policies by combining Vision-Language Model reasoning and geometric verification. Extensive experiments demonstrate that our method significantly outperforms existing approaches in success rates and generalization, achieving a self-improving cycle of proposal, execution, and correction for scalable robot learning.
Chinese Translation
在现实世界中直接训练机器人策略既昂贵又不可扩展。尽管生成式模拟能够实现大规模数据合成,但当前的方法往往无法生成逻辑上连贯的长时程任务,并且由于开环执行而难以应对动态物理不确定性。为了解决这些挑战,我们提出了赋能图任务世界(Affordance-Graphed Task Worlds, AGT-World),这是一个统一框架,能够基于现实世界观察自主构建交互式模拟环境和相应的机器人任务策略。与依赖随机提议或静态复制的方法不同,AGT-World将任务空间形式化为一个结构化图,从而将复杂目标精确地分层分解为具有理论基础的原子原语。此外,我们引入了一种具有混合反馈的自我演化机制,结合视觉-语言模型推理和几何验证来自主优化策略。大量实验表明,我们的方法在成功率和泛化能力上显著优于现有方法,实现了提议、执行和修正的自我改进循环,从而支持可扩展的机器人学习。
cs.RO / 27 / 2602.12074

RF-Modulated Adaptive Communication Improves Multi-Agent Robotic Exploration

射频调制自适应通信改善多智能体机器人探索
Achey, Lorin, Crockett, Breanne, Heckman, Christoffer, Hayes, Bradley
Abstract
Reliable coordination and efficient communication are critical challenges for multi-agent robotic exploration of environments where communication is limited. This work introduces Adaptive-RF Transmission (ART), a novel communication-aware planning algorithm that dynamically modulates transmission location based on signal strength and data payload size, enabling heterogeneous robot teams to share information efficiently without unnecessary backtracking. We further explore an extension to this approach called ART-SST, which enforces signal strength thresholds for high-fidelity data delivery. Through over 480 simulations across three cave-inspired environments, ART consistently outperforms existing strategies, including full rendezvous and minimum-signal heuristic approaches, achieving up to a 58% reduction in distance traveled and up to 52% faster exploration times compared to baseline methods. These results demonstrate that adaptive, payload-aware communication significantly improves coverage efficiency and mission speed in complex, communication-constrained environments, offering a promising foundation for future planetary exploration and search-and-rescue missions.
Chinese Translation
可靠的协调和高效的通信是多智能体机器人在通信受限环境中探索的关键挑战。本研究提出了一种新颖的通信感知规划算法——自适应射频传输(Adaptive-RF Transmission, ART),该算法根据信号强度和数据负载大小动态调节传输位置,使异构机器人团队能够高效共享信息,避免不必要的回溯。我们进一步探讨了这一方法的扩展版本——ART-SST,该方法对高保真数据传输施加信号强度阈值。通过在三个类洞穴环境中进行超过480次模拟,ART始终优于现有策略,包括完全会合和最小信号启发式方法,行程距离减少最多58%,探索时间比基线方法快最多52%。这些结果表明,自适应、负载感知的通信显著提高了在复杂、通信受限环境中的覆盖效率和任务速度,为未来的行星探索和搜救任务提供了有希望的基础。
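ART's payload-aware decision (move toward the receiver only when the transmission-time savings outweigh the travel cost) can be sketched with a log-distance path-loss model. Every constant below, including the rate model and the travel-cost weight, is illustrative rather than taken from the paper:

```python
import math

def rssi_dbm(distance_m, tx_power_dbm=20.0, path_loss_exp=2.7, ref_loss_db=40.0):
    """Received signal strength under a log-distance path-loss model."""
    return tx_power_dbm - ref_loss_db - 10 * path_loss_exp * math.log10(max(distance_m, 1.0))

def throughput_mbps(rssi):
    """Crude monotone rate model: stronger signal -> higher rate (illustrative)."""
    return max(0.1, 54.0 * (rssi + 90.0) / 40.0)

def should_move_closer(distance_m, payload_mb, travel_cost_per_m=0.2, closer_by=10.0):
    """Move only if the transmission-time savings exceed the weighted travel cost."""
    t_here = payload_mb * 8 / throughput_mbps(rssi_dbm(distance_m))
    t_closer = payload_mb * 8 / throughput_mbps(rssi_dbm(distance_m - closer_by))
    return (t_here - t_closer) > travel_cost_per_m * closer_by

small = should_move_closer(80.0, payload_mb=0.5)    # tiny status packet: stay put
large = should_move_closer(80.0, payload_mb=200.0)  # dense map data: worth approaching
```

The point of the sketch is the asymmetry: for small payloads the link-quality gain never repays the detour, while for large payloads it quickly does, which is the adaptive behavior ART exploits.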
cs.RO / 28 / 2602.12095

Pack it in: Packing into Partially Filled Containers Through Contact

装箱:通过接触在部分填充容器中进行装箱
Russell, David, Xu, Zisong, Roa, Maximo A., Dogar, Mehmet
Abstract
The automation of warehouse operations is crucial for improving productivity and reducing human exposure to hazardous environments. One operation frequently performed in warehouses is bin-packing where items need to be placed into containers, either for delivery to a customer, or for temporary storage in the warehouse. Whilst prior bin-packing works have largely been focused on packing items into empty containers and have adopted collision-free strategies, it is often the case that containers will already be partially filled with items, often in suboptimal arrangements due to transportation about a warehouse. This paper presents a contact-aware packing approach that exploits purposeful interactions with previously placed objects to create free space and enable successful placement of new items. This is achieved by using a contact-based multi-object trajectory optimizer within a model predictive controller, integrated with a physics-aware perception system that estimates object poses even during inevitable occlusions, and a method that suggests physically-feasible locations to place the object inside the container.
Chinese Translation
仓库操作的自动化对于提高生产力和减少人类在危险环境中的暴露至关重要。在仓库中,常见的一项操作是箱子装填,即将物品放入容器中,既可以用于交付给客户,也可以用于在仓库中的临时存储。尽管以往的箱子装填研究主要集中在将物品装入空容器,并采用无碰撞策略,但容器通常已经部分填充物品,且由于在仓库内的运输,物品的排列往往并不理想。本文提出了一种接触感知的装箱方法,该方法利用与先前放置物体的有目的交互来创造空闲空间,从而成功放置新物品。这是通过在模型预测控制器内使用基于接触的多物体轨迹优化器实现的,并与一个物理感知系统集成,该系统即使在不可避免的遮挡情况下也能估计物体姿态,以及一种建议在容器内放置物体的物理可行位置的方法。
cs.RO / 29 / 2602.12096

Multi Graph Search for High-Dimensional Robot Motion Planning

高维机器人运动规划的多图搜索
Mishani, Itamar, Likhachev, Maxim
Abstract
Efficient motion planning for high-dimensional robotic systems, such as manipulators and mobile manipulators, is critical for real-time operation and reliable deployment. Although advances in planning algorithms have enhanced scalability to high-dimensional state spaces, these improvements often come at the cost of generating unpredictable, inconsistent motions or requiring excessive computational resources and memory. In this work, we introduce Multi-Graph Search (MGS), a search-based motion planning algorithm that generalizes classical unidirectional and bidirectional search to a multi-graph setting. MGS maintains and incrementally expands multiple implicit graphs over the state space, focusing exploration on high-potential regions while allowing initially disconnected subgraphs to be merged through feasible transitions as the search progresses. We prove that MGS is complete and bounded-suboptimal, and empirically demonstrate its effectiveness on a range of manipulation and mobile manipulation tasks. Demonstrations, benchmarks and code are available at https://multi-graph-search.github.io/.
Chinese Translation
高维机器人系统(如机械臂和移动机械臂)的高效运动规划对于实时操作和可靠部署至关重要。尽管规划算法的进步增强了对高维状态空间的可扩展性,但这些改进往往以生成不可预测、不一致的运动或需要过多计算资源和内存为代价。在本研究中,我们提出了多图搜索(Multi-Graph Search, MGS),这是一种基于搜索的运动规划算法,将经典的单向和双向搜索推广到多图设置。MGS在状态空间上维护并逐步扩展多个隐式图,重点探索高潜力区域,同时允许在搜索过程中通过可行的转移将最初不相连的子图合并。我们证明了MGS是完备的且具有有界次优性,并在一系列操控和移动操控任务中实证展示了其有效性。演示、基准测试和代码可在 https://multi-graph-search.github.io/ 获取。
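The merging of initially disconnected subgraphs as feasible transitions are discovered can be tracked with a union-find structure, as in this minimal sketch (a bookkeeping illustration, not the MGS algorithm itself):

```python
class DisjointSet:
    """Union-find over search graphs: which subgraphs have merged so far."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra
        return ra != rb  # True iff two previously separate graphs merged

# Three implicit graphs seeded at the start, the goal, and a promising region.
ds = DisjointSet(3)
merged_first = ds.union(0, 2)   # a feasible transition links start and region graphs
merged_second = ds.union(2, 1)  # another links the region and goal graphs
connected = ds.find(0) == ds.find(1)  # start and goal now lie in one graph
```

Once the component containing the start merges with the one containing the goal, a solution path exists through the concatenated subgraphs.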
cs.RO / 30 / 2602.12159

3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting

3DGSNav:通过主动3D高斯点云增强视觉-语言模型在物体导航中的推理能力
Zheng, Wancai, Chen, Hao, Lu, Xianlong, Ou, Linlin, Yu, Xinyi
Abstract
Object navigation is a core capability of embodied intelligence, enabling an agent to locate target objects in unknown environments. Recent advances in vision-language models (VLMs) have facilitated zero-shot object navigation (ZSON). However, existing methods often rely on scene abstractions that convert environments into semantic maps or textual representations, causing high-level decision making to be constrained by the accuracy of low-level perception. In this work, we present 3DGSNav, a novel ZSON framework that embeds 3D Gaussian Splatting (3DGS) as persistent memory for VLMs to enhance spatial reasoning. Through active perception, 3DGSNav incrementally constructs a 3DGS representation of the environment, enabling trajectory-guided free-viewpoint rendering of frontier-aware first-person views. Moreover, we design structured visual prompts and integrate them with Chain-of-Thought (CoT) prompting to further improve VLM reasoning. During navigation, a real-time object detector filters potential targets, while VLM-driven active viewpoint switching performs target re-verification, ensuring efficient and reliable recognition. Extensive evaluations across multiple benchmarks and real-world experiments on a quadruped robot demonstrate that our method achieves robust and competitive performance against state-of-the-art approaches. Project page: https://aczheng-cai.github.io/3dgsnav.github.io/
Chinese Translation
物体导航是具身智能的核心能力,使代理能够在未知环境中定位目标物体。最近在视觉-语言模型(VLMs)方面的进展促进了零样本物体导航(ZSON)。然而,现有方法通常依赖于将环境转换为语义地图或文本表示的场景抽象,导致高层决策受到低层感知准确性的限制。在本研究中,我们提出了3DGSNav,一个新颖的ZSON框架,将3D高斯点云(3DGS)嵌入作为VLM的持久记忆,以增强空间推理。通过主动感知,3DGSNav逐步构建环境的3DGS表示,实现基于轨迹的前沿感知第一人称视角的自由视点渲染。此外,我们设计了结构化视觉提示,并将其与思维链(CoT)提示相结合,以进一步改善VLM的推理。在导航过程中,实时物体检测器过滤潜在目标,而VLM驱动的主动视点切换执行目标重新验证,确保高效可靠的识别。在多个基准测试和四足机器人上的真实世界实验中,广泛评估表明我们的方法在与最先进的方法相比时,表现出强大且具有竞争力的性能。项目页面:https://aczheng-cai.github.io/3dgsnav.github.io/
cs.RO / 31 / 2602.12199

Sub-Riemannian Boundary Value Problems for Optimal Geometric Locomotion

最优几何运动的子黎曼边值问题
Gross, Oliver, Hartwig, Florine, Rumpf, Martin, Schröder, Peter
Abstract
We propose a geometric model for optimal shape-change-induced motions of slender locomotors, e.g., snakes slithering on sand. In these scenarios, the motion of a body in world coordinates is completely determined by the sequence of shapes it assumes. Specifically, we formulate Lagrangian least-dissipation principles as boundary value problems whose solutions are given by sub-Riemannian geodesics. Notably, our geometric model accounts not only for the energy dissipated by the body's displacement through the environment, but also for the energy dissipated by the animal's metabolism or a robot's actuators to induce shape changes such as bending and stretching, thus capturing overall locomotion efficiency. Our continuous model, together with a consistent time and space discretization, enables numerical computation of sub-Riemannian geodesics for three different types of boundary conditions, i.e., fixing initial and target body, restricting to cyclic motion, or solely prescribing body displacement and orientation. The resulting optimal deformation gaits qualitatively match observed motion trajectories of organisms such as snakes and spermatozoa, as well as known optimality results for low-dimensional systems such as Purcell's swimmers. Moreover, being geometrically less rigid than previous frameworks, our model enables new insights into locomotion mechanisms of, e.g., generalized Purcell's swimmers. The code is publicly available.
Chinese Translation
我们提出了一种几何模型,用于描述细长运动器(例如,在沙地上滑行的蛇)因形状变化而引起的最优运动。在这些场景中,物体在世界坐标系中的运动完全由其所采用的形状序列决定。具体而言,我们将拉格朗日最小耗散原理表述为边值问题,其解由子黎曼测地线给出。值得注意的是,我们的几何模型不仅考虑了物体在环境中位移所耗散的能量,还考虑了动物的新陈代谢或机器人的执行器在诱导形状变化(如弯曲和拉伸)时所耗散的能量,从而捕捉整体运动效率。我们的连续模型结合一致的时间和空间离散化,使得能够对三种不同类型的边界条件进行子黎曼测地线的数值计算,即固定初始和目标体、限制周期性运动或仅规定物体的位移和方向。所得到的最优变形步态在定性上与观察到的生物运动轨迹(如蛇和精子)相匹配,并且与已知的低维系统的最优性结果(如普塞尔游泳者)相一致。此外,由于我们的模型在几何上比之前的框架更为灵活,因此为例如广义普塞尔游泳者的运动机制提供了新的见解。代码已公开发布。
cs.RO / 32 / 2602.12215

LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion

LDA-1B:通过通用具身数据摄取扩展潜在动态动作模型
Lyu, Jiangran, Liu, Kai, Zhang, Xuheng, Liao, Haoran, Feng, Yusen, Zhu, Wenxuan, Shen, Tingrui, Chen, Jiayi, Zhang, Jiazhao, Dong, Yifei, Cui, Wenbo, Qi, Senmao, Wang, Shuo, Zheng, Yixin, Yan, Mi, Shi, Xuesong, Li, Haoran, Zhao, Dongbin, Liu, Ming-Yu, Zhang, Zhizheng, Yi, Li, Wang, Yizhou, Wang, He
Abstract
Recent robot foundation models largely rely on large-scale behavior cloning, which imitates expert actions but discards transferable dynamics knowledge embedded in heterogeneous embodied data. While the Unified World Model (UWM) formulation has the potential to leverage such diverse data, existing instantiations struggle to scale to foundation level due to coarse data usage and fragmented datasets. We introduce LDA-1B, a robot foundation model that scales through universal embodied data ingestion by jointly learning dynamics, policy, and visual forecasting, assigning distinct roles to data of varying quality. To support this regime at scale, we assemble and standardize EI-30k, an embodied interaction dataset comprising over 30k hours of human and robot trajectories in a unified format. Scalable dynamics learning over such heterogeneous data is enabled by prediction in a structured DINO latent space, which avoids redundant pixel-space appearance modeling. Complementing this representation, LDA-1B employs a multi-modal diffusion transformer to handle asynchronous vision and action streams, enabling stable training at the 1B-parameter scale. Experiments in simulation and the real world show LDA-1B outperforms prior methods (e.g., π0.5) by up to 21%, 48%, and 23% on contact-rich, dexterous, and long-horizon tasks, respectively. Notably, LDA-1B enables data-efficient fine-tuning, gaining 10% by leveraging 30% low-quality trajectories that are typically harmful and discarded.
Chinese Translation
近期的机器人基础模型在很大程度上依赖于大规模行为克隆,该方法模仿专家的动作,但忽略了嵌入在异质具身数据中的可转移动态知识。尽管统一世界模型(Unified World Model, UWM)框架有潜力利用这些多样化的数据,但现有实例由于粗糙的数据使用和碎片化的数据集而难以扩展到基础层级。我们提出了LDA-1B,这是一种通过通用具身数据摄取来扩展的机器人基础模型,通过联合学习动态、策略和视觉预测,为不同质量的数据分配独特的角色。为了支持这一大规模的模式,我们组建并标准化了EI-30k,这是一个包含超过30,000小时人类和机器人轨迹的具身交互数据集,采用统一格式。通过在结构化的DINO潜在空间中进行预测,使得在这种异质数据上可扩展的动态学习成为可能,从而避免了冗余的像素空间外观建模。作为对这种表示的补充,LDA-1B采用了多模态扩散变换器,以处理异步的视觉和动作流,从而在1B参数规模下实现稳定训练。在模拟和真实世界中的实验表明,LDA-1B在接触密集、灵巧和长时程任务上分别比先前的方法(例如 $\pi_{0.5}$)最多提高了21%、48%和23%。值得注意的是,LDA-1B实现了数据高效的微调,通过利用通常有害且被丢弃的30%低质量轨迹获得了10%的提升。
cs.RO / 33 / 2602.12244

Any House Any Task: Scalable Long-Horizon Planning for Abstract Human Tasks

任何房屋,任何任务:针对抽象人类任务的可扩展长远规划
Liu, Zhihong, Li, Yang, Huang, Rengming, Lu, Cewu, Cai, Panpan
Abstract
Open world language conditioned task planning is crucial for robots operating in large-scale household environments. While many recent works attempt to address this problem using Large Language Models (LLMs) via prompting or training, a key challenge remains scalability. Performance often degrades rapidly with increasing environment size, plan length, instruction ambiguity, and constraint complexity. In this work, we propose Any House Any Task (AHAT), a household task planner optimized for long-horizon planning in large environments given ambiguous human instructions. At its core, AHAT utilizes an LLM trained to map task instructions and textual scene graphs into grounded subgoals defined in the Planning Domain Definition Language (PDDL). These subgoals are subsequently solved to generate feasible and optimal long-horizon plans through explicit symbolic reasoning. To enhance the model's ability to decompose complex and ambiguous intentions, we introduce TGPO, a novel reinforcement learning algorithm that integrates external correction of intermediate reasoning traces into Group Relative Policy Optimization (GRPO). Experiments demonstrate that AHAT achieves significant performance gains over state-of-the-art prompting, planning, and learning methods, particularly in human-style household tasks characterized by brief instructions but requiring complex execution plans.
Chinese Translation
开放世界语言条件下的任务规划对于在大规模家庭环境中操作的机器人至关重要。尽管许多近期研究尝试通过提示或训练使用大型语言模型(LLMs)来解决这一问题,但一个关键挑战仍然是可扩展性。随着环境规模、计划长度、指令模糊性和约束复杂性的增加,性能往往迅速下降。在本研究中,我们提出了任何房屋任何任务(AHAT),这是一种针对在大环境中根据模糊人类指令进行长远规划的家庭任务规划器。AHAT的核心利用了一种经过训练的LLM,将任务指令和文本场景图映射为在规划领域定义语言(PDDL)中定义的具体子目标。这些子目标随后通过显式符号推理被求解,以生成可行且最优的长远计划。为了增强模型分解复杂和模糊意图的能力,我们引入了TGPO,一种新颖的强化学习算法,它将对中间推理轨迹的外部修正集成到群体相对策略优化(GRPO)中。实验表明,AHAT相比最先进的提示、规划和学习方法实现了显著的性能提升,特别是在以简短指令为特征但需要复杂执行计划的人类风格家庭任务中。
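As a toy illustration of mapping instruction-derived subgoals into PDDL goal expressions, as a planner front-end of this kind might do before handing them to a symbolic solver (all predicate and object names below are hypothetical, not taken from AHAT):

```python
# Hypothetical sketch: render grounded subgoals as a PDDL goal expression.
def subgoal_to_pddl(predicates):
    """predicates: list of tuples like ("at", "mug", "kitchen-table")."""
    atoms = " ".join("({})".format(" ".join(p)) for p in predicates)
    return "(:goal (and {}))".format(atoms)

goal = subgoal_to_pddl([("at", "mug", "kitchen-table"), ("clean", "mug")])
print(goal)  # (:goal (and (at mug kitchen-table) (clean mug)))
```

A classical planner can then search for an optimal action sequence satisfying this goal, which is the "explicit symbolic reasoning" step described above.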
cs.RO / 34 / 2602.12281

Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

缩放验证比缩放策略学习在视觉-语言-行动对齐中更有效
Kwok, Jacky, Zhang, Xilun, Xu, Mengdi, Liu, Yuejiang, Mirhoseini, Azalia, Finn, Chelsea, Pavone, Marco
Abstract
The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap.'' We first characterize the test-time scaling law for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language-Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
Chinese Translation
通用机器人的长期愿景依赖于其理解和执行自然语言指令的能力。视觉-语言-行动(VLA)模型在实现这一目标方面取得了显著进展,但它们生成的动作仍可能与给定指令不一致。本文探讨了测试时验证作为缩小“意图-行动差距”的一种手段。我们首先描述了具身指令跟随的测试时缩放规律,并证明联合缩放改写指令和生成动作的数量可以大大增加测试时样本的多样性,通常比独立缩放每个维度更高效地恢复正确动作。为了利用这些缩放规律,我们提出了CoVer,一个用于视觉-语言-行动对齐的对比验证器,并展示了我们的架构在额外计算资源和数据下的良好扩展性。接着,我们引入了“启动时计算”和一个层次化的VLA验证推理管道。在部署时,我们的框架预先计算来自视觉-语言模型(VLM)的多样化改写指令,为每个指令重复生成动作候选,然后使用验证器选择最佳的高层提示和低层动作片段。与在相同数据上缩放策略预训练相比,我们的验证方法在SIMPLER基准上实现了22%的分布内增益和13%的分布外增益,在实际实验中进一步提高了45%。在PolaRiS基准上,CoVer在任务进展上获得了14%的增益,成功率提高了9%。
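The sample-then-verify pattern described above can be sketched as follows; this is a minimal stub with an illustrative toy verifier, not CoVer's actual architecture:

```python
# Hedged sketch: jointly scale rephrased instructions and action candidates,
# then let a verifier select the best (instruction, action) pair.
def select_best(instructions, candidates, verifier):
    best, best_score = None, float("-inf")
    for instr in instructions:
        for action in candidates[instr]:
            score = verifier(instr, action)
            if score > best_score:
                best, best_score = (instr, action), score
    return best, best_score

# Toy verifier: rewards an action whose suffix matches the instruction's suffix.
verifier = lambda instr, action: 1.0 if action.endswith(instr[-1]) else 0.0
instrs = ["pick-a", "pick-b"]
cands = {"pick-a": ["act-a", "act-b"], "pick-b": ["act-a", "act-b"]}
print(select_best(instrs, cands, verifier))  # (('pick-a', 'act-a'), 1.0)
```

In the framework above, the instruction set would be precomputed at "boot time" from a VLM, and the verifier would be the learned contrastive model rather than a heuristic.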
计算机视觉 (Computer Vision)
72
cs.CV / 1 / 2602.11214

DD-MDN: Human Trajectory Forecasting with Diffusion-Based Dual Mixture Density Networks and Uncertainty Self-Calibration

DD-MDN:基于扩散的双混合密度网络与不确定性自校准的人类轨迹预测
Hetzel, Manuel, Turacan, Kerim, Reichert, Hannes, Doll, Konrad, Sick, Bernhard
Abstract
Human Trajectory Forecasting (HTF) predicts future human movements from past trajectories and environmental context, with applications in Autonomous Driving, Smart Surveillance, and Human-Robot Interaction. While prior work has focused on accuracy, social interaction modeling, and diversity, little attention has been paid to uncertainty modeling, calibration, and forecasts from short observation periods, which are crucial for downstream tasks such as path planning and collision avoidance. We propose DD-MDN, an end-to-end probabilistic HTF model that combines high positional accuracy, calibrated uncertainty, and robustness to short observations. Using a few-shot denoising diffusion backbone and a dual mixture density network, our method learns self-calibrated residence areas and probability-ranked anchor paths, from which diverse trajectory hypotheses are derived, without predefined anchors or endpoints. Experiments on the ETH/UCY, SDD, inD, and IMPTC datasets demonstrate state-of-the-art accuracy, robustness at short observation intervals, and reliable uncertainty modeling. The code is available at: https://github.com/kav-institute/ddmdn.
Chinese Translation
人类轨迹预测(HTF)通过过去的轨迹和环境上下文预测未来的人类运动,广泛应用于自动驾驶、智能监控和人机交互等领域。尽管之前的研究主要集中在准确性、社会交互建模和多样性上,但对不确定性建模、校准以及短观察期的预测关注较少,而这些对于路径规划和碰撞避免等下游任务至关重要。我们提出了DD-MDN,一种端到端的概率性HTF模型,结合了高位置准确性、校准的不确定性和对短观察的鲁棒性。通过使用少样本去噪扩散主干和双混合密度网络,我们的方法学习自校准的驻留区域和概率排名的锚路径,从中推导出多样的轨迹假设,而无需预定义的锚点或端点。在ETH/UCY、SDD、inD和IMPTC数据集上的实验表明,我们的方法在准确性、短观察间隔的鲁棒性和可靠的不确定性建模方面达到了最先进的水平。代码可在以下链接获取:https://github.com/kav-institute/ddmdn。
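A mixture-density head of the kind mentioned above outputs a weighted sum of Gaussian components over future positions; a minimal sketch of evaluating such a predictive density (isotropic components with illustrative parameters, whereas the real model's mixtures are learned):

```python
# Hedged sketch of a mixture-density predictive density over 2D positions.
import math

def gmm_density(x, y, components):
    """components: list of (weight, mean_x, mean_y, sigma); weights sum to 1."""
    total = 0.0
    for w, mx, my, s in components:
        d2 = (x - mx) ** 2 + (y - my) ** 2
        total += w * math.exp(-d2 / (2 * s * s)) / (2 * math.pi * s * s)
    return total

comps = [(0.7, 0.0, 0.0, 1.0), (0.3, 3.0, 0.0, 0.5)]
# Density is highest near the dominant mode at the origin.
print(gmm_density(0.0, 0.0, comps) > gmm_density(10.0, 10.0, comps))  # True
```

Thresholding such a density yields an uncertainty region ("residence area"), and ranking modes by weight gives probability-ordered anchors from which trajectory hypotheses can be drawn.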
cs.CV / 2 / 2602.11236

ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

ABot-M0:用于机器人操作的VLA基础模型与动作流形学习
Yang, Yandan, Zeng, Shuang, Lin, Tong, Chang, Xinyuan, Qi, Dekang, Xiao, Junjin, Liu, Haoyun, Chen, Ronghan, Chen, Yuzhi, Huo, Dongjie, Xiong, Feng, Wei, Xing, Ma, Zhiheng, Xu, Mu
Abstract
Building general-purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the ''one-brain, many-forms'' paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot-M0, a framework that builds a systematic data curation pipeline while jointly optimizing model architecture and training strategies, enabling end-to-end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT-dataset, a large-scale dataset with over 6 million trajectories and 9,500 hours of data, covering diverse robot morphologies and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, supporting general-purpose embodied intelligence. To improve action prediction efficiency and stability, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high-dimensional space but on a low-dimensional, smooth manifold governed by physical laws and task constraints. Based on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. This shifts learning from denoising to projection onto feasible manifolds, improving decoding speed and policy stability. ABot-M0 supports modular perception via a dual-stream mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit, enhancing spatial understanding without modifying the backbone and mitigating standard VLM limitations in 3D reasoning. Experiments show components operate independently with additive benefits. We will release all code and pipelines for reproducibility and future research.
Chinese Translation
在多样化硬件上构建通用的具身智能体仍然是机器人领域的一个核心挑战,通常被称为“一个大脑,多种形态”的范式。进展受到数据碎片化、不一致的表示和不对齐的训练目标的阻碍。我们提出了ABot-M0,一个构建系统化数据整理管道的框架,同时优化模型架构和训练策略,使异构原始数据能够端到端地转化为统一、高效的表示。我们从六个公共数据集中清理、标准化和均衡样本,构建了UniACT数据集,这是一个大规模数据集,包含超过600万条轨迹和9500小时的数据,涵盖多种机器人形态和任务场景。统一的预训练提高了知识转移和跨平台、跨任务的泛化能力,支持通用的具身智能。为了提高动作预测的效率和稳定性,我们提出了动作流形假设:有效的机器人动作并不位于完整的高维空间中,而是位于受物理法则和任务约束支配的低维平滑流形上。在此基础上,我们引入了动作流形学习(Action Manifold Learning, AML),该方法使用DiT骨干网络直接预测干净、连续的动作序列。这一转变使学习从去噪转向投影到可行流形上,提高了解码速度和策略稳定性。ABot-M0通过双流机制支持模块化感知,将VLM语义与几何先验和来自即插即用3D模块(如VGGT和Qwen-Image-Edit)的多视角输入相结合,增强了空间理解,而无需修改骨干网络,并减轻了标准VLM在3D推理中的局限性。实验表明,各组件独立运行并具有叠加效益。我们将发布所有代码和管道,以便于重现和未来的研究。
cs.CV / 3 / 2602.11239

Toward Reliable Tea Leaf Disease Diagnosis Using Deep Learning Model: Enhancing Robustness With Explainable AI and Adversarial Training

基于深度学习模型的可靠茶叶病害诊断:通过可解释人工智能和对抗训练增强鲁棒性
Ghosh, Samanta, Mahi, Jannatul Adan, Abrar, Shayan, Mia, Md Parvez, Rayhan, Asaduzzaman, Yasir, Abdul Awal, Hridoy, Asaduzzaman
Abstract
Tea is a valuable asset for the economy of Bangladesh. So, tea cultivation plays an important role in boosting the economy. These valuable plants are vulnerable to various kinds of leaf infections which may cause lower production and low quality. It is not easy to detect these diseases manually; it takes time and the detection may contain errors. Therefore, the purpose of this study is to develop an automated deep learning model for tea leaf disease classification based on the teaLeafBD dataset so that anyone can detect the diseases more easily and efficiently. There are 5,278 high-resolution images in this dataset. The images are classified into seven categories: six of them represent various diseases and the remaining one represents healthy leaves. The proposed pipeline contains data preprocessing, data splitting, adversarial training, augmentation, model training, evaluation, and interpretation made possible with Explainable AI strategies. DenseNet201 and EfficientNetB3 were employed to perform the classification task. To make the model more robust, we applied adversarial training so it can operate effectively even with noisy or disturbed inputs. In addition, Grad-CAM visualization was executed to analyze the model's predictions by identifying the most influential regions of each image. Our experimental outcomes revealed that EfficientNetB3 achieved the highest classification accuracy of 93%, while DenseNet201 reached 91%. The outcomes prove the effectiveness of the proposed approach, which can accurately detect tea leaf diseases and provide a practical solution for advanced agricultural management.
Chinese Translation
茶叶是孟加拉国经济的重要资产,因此茶叶种植在促进经济发展中发挥着重要作用。这些珍贵植物易受各种叶部感染的影响,这可能导致产量下降和质量降低。手动检测这些疾病并不容易,可能需要时间,并且可能存在检测错误。因此,本研究的目的是基于茶叶病害数据集(teaLeafBD)开发一个自动化的深度学习模型,以便任何人都能更轻松高效地检测这些疾病。该数据集中包含5,278张高分辨率图像,这些图像被分为七个类别,其中六个类别代表不同的疾病,另一个类别代表健康叶片。所提出的流程包括数据预处理、数据划分、对抗训练、数据增强、模型训练、评估以及通过可解释人工智能策略实现的理解。我们采用了DenseNet201和EfficientNetB3进行分类任务。为了使模型更加鲁棒,我们应用了对抗训练,以便它能够在噪声或干扰输入下有效运行。此外,执行了Grad-CAM可视化,以通过识别每张图像中最具影响力的区域来分析模型的预测。我们的实验结果显示,EfficientNetB3达到了最高的分类准确率93%,而DenseNet201达到了91%。结果证明,所提出的方法能够准确检测茶叶病害,并为先进的农业管理提供切实可行的解决方案。
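Adversarial training of the kind used above perturbs inputs toward higher loss before training on them; a minimal FGSM-style sketch on a toy logistic model (the actual pipeline uses deep CNNs on images, so everything here is only illustrative):

```python
# Hedged sketch of an FGSM-style adversarial example on a logistic model:
# move the input a small step in the sign of the loss gradient.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fgsm_example(x, y, w, b, eps):
    """Perturb input x in the gradient-sign direction of the BCE loss."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad_x = [(p - y) * wi for wi in w]      # d(loss)/dx for BCE + sigmoid
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]

w, b = [2.0, -1.0], 0.0
x, y = [1.0, 1.0], 1.0                        # correctly classified (p > 0.5)
x_adv = fgsm_example(x, y, w, b, eps=0.5)
# The adversarial point yields a lower probability for the true class.
print(sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)) + b) <
      sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b))  # True
```

Training then mixes such perturbed samples into the batches, which is what makes the classifier robust to noisy or disturbed inputs.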
cs.CV / 4 / 2602.11241

Active Zero: Self-Evolving Vision-Language Models through Active Environment Exploration

主动零:通过主动环境探索自我演化的视觉-语言模型
He, Jinghan, Fang, Junfeng, Xiong, Feng, Yao, Zijun, Shen, Fei, Guo, Haiyun, Wang, Jinqiao, Chua, Tat-Seng
Abstract
Self-play has enabled large language models to autonomously improve through self-generated challenges. However, existing self-play methods for vision-language models rely on passive interaction with static image collections, resulting in strong dependence on initial datasets and inefficient learning. Without the ability to actively seek visual data tailored to their evolving capabilities, agents waste computational effort on samples that are either trivial or beyond their current skill level. To address these limitations, we propose Active-Zero, a framework that shifts from passive interaction to active exploration of visual environments. Active-Zero employs three co-evolving agents: a Searcher that retrieves images from open-world repositories based on the model's capability frontier, a Questioner that synthesizes calibrated reasoning tasks, and a Solver refined through accuracy rewards. This closed loop enables self-scaffolding auto-curricula where the model autonomously constructs its learning trajectory. On Qwen2.5-VL-7B-Instruct across 12 benchmarks, Active-Zero achieves 53.97 average accuracy on reasoning tasks (5.7% improvement) and 59.77 on general understanding (3.9% improvement), consistently outperforming existing self-play baselines. These results highlight active exploration as a key ingredient for scalable and adaptive self-evolving vision-language systems.
Chinese Translation
自我对弈使大型语言模型能够通过自生成的挑战自主改进。然而,现有的视觉-语言模型自我对弈方法依赖于与静态图像集合的被动互动,导致对初始数据集的强依赖和低效学习。在缺乏主动寻求与其不断演化能力相匹配的视觉数据的能力时,智能体在处理那些琐碎或超出其当前技能水平的样本上浪费了计算资源。为了解决这些局限性,我们提出了Active-Zero,一个将被动互动转变为主动探索视觉环境的框架。Active-Zero采用三个共同演化的智能体:一个根据模型能力边界从开放世界库中检索图像的搜索者(Searcher),一个合成经过校准的推理任务的问题生成者(Questioner),以及一个通过准确性奖励进行优化的求解者(Solver)。这一闭环使得自我支撑的自动课程得以实现,模型能够自主构建其学习轨迹。在Qwen2.5-VL-7B-Instruct的12个基准测试中,Active-Zero在推理任务上达到了53.97的平均准确率(提高了5.7%),在一般理解上达到了59.77(提高了3.9%),持续超越现有的自我对弈基线。这些结果突显了主动探索作为可扩展和自适应自我演化视觉-语言系统的关键要素。
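The Searcher–Questioner–Solver loop described above can be sketched with stub agents; in the paper each role is a learned model, so everything below only illustrates the control flow of the auto-curriculum:

```python
# Hedged sketch of one round of the closed loop with stub agents.
def active_zero_round(search, ask, solve, frontier):
    image = search(frontier)        # Searcher: fetch data near the capability frontier
    task = ask(image)               # Questioner: synthesize a calibrated task
    answer, correct = solve(task)   # Solver: attempt it, earning an accuracy reward
    # Toy curriculum update: frontier difficulty rises on success, falls on failure.
    return frontier + (1 if correct else -1)

search = lambda level: {"difficulty": level}
ask = lambda img: {"question": "count objects", "difficulty": img["difficulty"]}
solve = lambda task: ("4", task["difficulty"] < 5)   # succeeds below difficulty 5

level = 3
for _ in range(3):
    level = active_zero_round(search, ask, solve, level)
print(level)  # 4  (3 -> 4 -> 5 -> 4: the loop hovers around the frontier)
```

The point of the toy update rule is that the curriculum self-adjusts to sit at the edge of the Solver's ability, rather than wasting samples on trivial or impossible tasks.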
cs.CV / 5 / 2602.11242

ReTracing: An Archaeological Approach Through Body, Machine, and Generative Systems

再追踪:通过身体、机器和生成系统的考古学方法
Wang, Yitong, Yao, Yue
Abstract
We present ReTracing, a multi-agent embodied performance art that adopts an archaeological approach to examine how artificial intelligence shapes, constrains, and produces bodily movement. Drawing from science-fiction novels, the project extracts sentences that describe human-machine interaction. We use large language models (LLMs) to generate paired prompts "what to do" and "what not to do" for each excerpt. A diffusion-based text-to-video model transforms these prompts into choreographic guides for a human performer and motor commands for a quadruped robot. Both agents enact the actions on a mirrored floor, captured by multi-camera motion tracking and reconstructed into 3D point clouds and motion trails, forming a digital archive of motion traces. Through this process, ReTracing serves as a novel approach to reveal how generative systems encode socio-cultural biases through choreographed movements. Through an immersive interplay of AI, human, and robot, ReTracing confronts a critical question of our time: What does it mean to be human among AIs that also move, think, and leave traces behind?
Chinese Translation
我们提出了再追踪(ReTracing),这是一种多代理的具身表演艺术,采用考古学方法来研究人工智能如何塑造、限制和产生身体运动。该项目借鉴科幻小说,提取描述人机互动的句子。我们使用大型语言模型(LLMs)为每个摘录生成配对提示“该做什么”和“不该做什么”。基于扩散的文本到视频模型将这些提示转化为人类表演者的编舞指南和四足机器人的运动指令。两个代理在镜面地板上执行这些动作,通过多摄像头运动跟踪捕捉并重建为3D点云和运动轨迹,形成运动痕迹的数字档案。通过这一过程,再追踪作为一种新颖的方法,揭示了生成系统如何通过编排的动作编码社会文化偏见。在人工智能、人类和机器人之间的沉浸式互动中,再追踪直面我们时代的一个关键问题:在同样会移动、思考并留下痕迹的人工智能之中,成为人类意味着什么?
cs.CV / 6 / 2602.11244

Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models

压力测试揭示视频语言模型中的脆弱时序和视觉基础
T V, Sethuraman, Khosla, Savya, Tiwari, Aditi, Ganesh, Vidya, Jayaprakash, Rakshana, Jain, Aditya, Srinivasakumar, Vignesh, Susladkar, Onkar Kishor, Sunkara, Srinidhi, Shanmugham, Aditya, Vaideeswaran, Rakesh, Nishar, Abbaas Alif Mohamed, Jenni, Simon, Hoiem, Derek
Abstract
This work investigates a fundamental question: Do Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion? Our investigation shows that, surprisingly, they often do not. We introduce REVEAL, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs through five controlled stress tests; assessing temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion. We test leading open- and closed-source VidLMs and find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information amidst simple spatiotemporal masking. Humans, on the other hand, succeed at these tasks with ease. Alongside our benchmark, we provide a data pipeline that automatically generates diagnostic examples for our stress tests, enabling broader and more scalable evaluation. We will release our benchmark and code to support future research.
Chinese Translation
本研究探讨了一个基本问题:视频语言模型(VidLMs)是否能够稳健地考虑视频内容、时间序列和运动?我们的调查表明,令人惊讶的是,它们往往无法做到这一点。我们引入了REVEAL,一个诊断基准,通过五个受控的压力测试探测当代VidLMs的基本弱点;评估时间期望偏差、对纯语言捷径的依赖、视频阿谀奉承、相机运动敏感性以及对时空遮挡的鲁棒性。我们测试了领先的开源和闭源VidLMs,发现这些模型自信地将反向场景描述为正向,回答问题时忽略视频内容,赞同虚假声明,对基本的相机运动感到困难,并在简单的时空掩蔽下无法聚合时间信息。相比之下,人类在这些任务中轻松成功。除了我们的基准外,我们还提供了一个数据管道,能够自动生成用于压力测试的诊断示例,从而实现更广泛和更具可扩展性的评估。我们将发布我们的基准和代码,以支持未来的研究。
cs.CV / 7 / 2602.11314

Advancing Digital Twin Generation Through a Novel Simulation Framework and Quantitative Benchmarking

通过新颖的仿真框架和定量基准测试推进数字双胞胎生成
Rubinstein, Jacob, Donaty, Avi, Engel, Don
Abstract
The generation of 3D models from real-world objects has often been accomplished through photogrammetry, i.e., by taking 2D photos from a variety of perspectives and then triangulating matched point-based features to create a textured mesh. Many design choices exist within this framework for the generation of digital twins, and differences between such approaches are largely judged qualitatively. Here, we present and test a novel pipeline for generating synthetic images from high-quality 3D models and programmatically generated camera poses. This enables a wide variety of repeatable, quantifiable experiments which can compare ground-truth knowledge of virtual camera parameters and of virtual objects against the reconstructed estimations of those perspectives and subjects.
Chinese Translation
从现实世界物体生成3D模型通常是通过摄影测量实现的,即通过从不同角度拍摄2D照片,然后三角测量匹配的基于点的特征以创建纹理网格。在这个框架内,生成数字双胞胎的设计选择多种多样,而这些方法之间的差异主要是定性评判的。在此,我们提出并测试了一种新颖的管道,用于从高质量3D模型和程序生成的相机姿态生成合成图像。这使得能够进行多种可重复、可量化的实验,比较虚拟相机参数和虚拟物体的真实知识与这些视角和对象的重建估计之间的差异。
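Programmatically generated camera poses of the kind described above are often sampled on a sphere around the object; a minimal sketch (camera positions only, omitting the look-at rotation a full extrinsic matrix would also need):

```python
# Hedged sketch: place n virtual cameras on a sphere around the origin,
# evenly spaced in azimuth at a fixed elevation and radius.
import math

def sphere_poses(n, radius, elevation_deg):
    el = math.radians(elevation_deg)
    poses = []
    for i in range(n):
        az = 2 * math.pi * i / n
        x = radius * math.cos(el) * math.cos(az)
        y = radius * math.cos(el) * math.sin(az)
        z = radius * math.sin(el)
        poses.append((x, y, z))
    return poses

poses = sphere_poses(8, radius=2.0, elevation_deg=30.0)
# Every camera sits at the requested distance from the object center.
print(all(abs(math.dist(p, (0, 0, 0)) - 2.0) < 1e-9 for p in poses))  # True
```

Because the poses are generated rather than estimated, the reconstruction's recovered camera parameters can be compared directly against this ground truth, which is exactly the kind of quantitative experiment the pipeline enables.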
cs.CV / 8 / 2602.11316

Selective Prior Synchronization via SYNC Loss

通过 SYNC 损失实现选择性先验同步
Mishra, Ishan, Li, Jiajie, Mishra, Deepak, Xiong, Jinjun
Abstract
Prediction under uncertainty is a critical requirement for the deep neural network to succeed responsibly. This paper focuses on selective prediction, which allows DNNs to make informed decisions about when to predict or abstain based on the uncertainty level of their predictions. Current methods are either ad-hoc such as SelectiveNet, focusing on how to modify the network architecture or objective function, or post-hoc such as softmax response, achieving selective prediction through analyzing the model's probabilistic outputs. We observe that post-hoc methods implicitly generate uncertainty information, termed the selective prior, which has traditionally been used only during inference. We argue that the selective prior provided by the selection mechanism is equally vital during the training stage. Therefore, we propose the SYNC loss which introduces a novel integration of ad-hoc and post-hoc method. Specifically, our approach incorporates the softmax response into the training process of SelectiveNet, enhancing its selective prediction capabilities by examining the selective prior. Evaluated across various datasets, including CIFAR-100, ImageNet-100, and Stanford Cars, our method not only enhances the model's generalization capabilities but also surpasses previous works in selective prediction performance, and sets new benchmarks for state-of-the-art performance.
Chinese Translation
在不确定性下进行预测是深度神经网络成功的关键要求。本文聚焦于选择性预测,使得深度神经网络能够根据其预测的不确定性水平做出何时预测或放弃预测的知情决策。目前的方法要么是内嵌式(ad-hoc)的,如 SelectiveNet,专注于如何修改网络架构或目标函数;要么是事后(post-hoc)的,如 softmax 响应,通过分析模型的概率输出实现选择性预测。我们观察到,事后方法隐式生成的不确定性信息,称为选择性先验,传统上仅在推理期间使用。我们认为,选择机制提供的选择性先验在训练阶段同样至关重要。因此,我们提出了 SYNC 损失,它引入了一种内嵌式方法和事后方法的新颖整合。具体而言,我们的方法将 softmax 响应纳入 SelectiveNet 的训练过程中,通过检查选择性先验来增强其选择性预测能力。在包括 CIFAR-100、ImageNet-100 和 Stanford Cars 等多个数据集上的评估中,我们的方法不仅提升了模型的泛化能力,还在选择性预测性能上超越了之前的研究,并为最先进的性能设定了新的基准。
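The softmax-response selection mechanism discussed above can be sketched in a few lines: predict only when the top softmax probability clears a threshold, otherwise abstain (the threshold value here is illustrative):

```python
# Hedged sketch of post-hoc selective prediction via the softmax response.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def selective_predict(logits, threshold):
    """Return (class_index, confidence), or (None, confidence) to abstain."""
    probs = softmax(logits)
    conf = max(probs)
    return (probs.index(conf), conf) if conf >= threshold else (None, conf)

print(selective_predict([4.0, 0.5, 0.1], threshold=0.9))     # confident: class 0
print(selective_predict([1.0, 0.9, 0.8], threshold=0.9)[0])  # None (abstains)
```

This confidence signal is the "selective prior"; the idea above is to feed it back into training rather than using it only at inference time.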
cs.CV / 9 / 2602.11323

MDE-VIO: Enhancing Visual-Inertial Odometry Using Learned Depth Priors

MDE-VIO:利用学习的深度先验增强视觉惯性里程计
Alniak, Arda, Kalkan, Sinan, Ankarali, Mustafa Mert, Saranli, Afsar, Alatan, Abdullah Aydin
Abstract
Traditional monocular Visual-Inertial Odometry (VIO) systems struggle in low-texture environments where sparse visual features are insufficient for accurate pose estimation. To address this, dense Monocular Depth Estimation (MDE) has been widely explored as a complementary information source. While recent Vision Transformer (ViT) based complex foundational models offer dense, geometrically consistent depth, their computational demands typically preclude them from real-time edge deployment. Our work bridges this gap by integrating learned depth priors directly into the VINS-Mono optimization backend. We propose a novel framework that enforces affine-invariant depth consistency and pairwise ordinal constraints, explicitly filtering unstable artifacts via variance-based gating. This approach strictly adheres to the computational limits of edge devices while robustly recovering metric scale. Extensive experiments on the TartanGround and M3ED datasets demonstrate that our method prevents divergence in challenging scenarios and delivers significant accuracy gains, reducing Absolute Trajectory Error (ATE) by up to 28.3%. Code will be made available.
Chinese Translation
传统的单目视觉惯性里程计(VIO)系统在低纹理环境中面临挑战,因为稀疏的视觉特征不足以进行准确的姿态估计。为了解决这个问题,密集的单目深度估计(MDE)被广泛探索作为一种补充信息源。尽管基于视觉变换器(ViT)的复杂基础模型提供了密集且几何一致的深度,但其计算需求通常使其无法在实时边缘部署中使用。我们的工作通过将学习的深度先验直接集成到VINS-Mono优化后端来弥补这一差距。我们提出了一种新颖的框架,该框架强制执行仿射不变的深度一致性和成对的序数约束,通过基于方差的门控显式过滤不稳定的伪影。这种方法严格遵循边缘设备的计算限制,同时稳健地恢复度量尺度。在TartanGround和M3ED数据集上的大量实验表明,我们的方法在具有挑战性的场景中防止了发散,并显著提高了准确性,将绝对轨迹误差(ATE)减少了多达28.3%。代码将会公开。
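Affine-invariant depth consistency, as enforced above, typically means the predicted relative depth is aligned to metric measurements up to an unknown scale and shift; a minimal closed-form least-squares sketch (not the paper's actual backend integration):

```python
# Hedged sketch: fit scale s and shift t minimizing
# sum_i (s * pred_i + t - metric_i)^2 in closed form.
def fit_scale_shift(pred, metric):
    n = len(pred)
    mp = sum(pred) / n
    mm = sum(metric) / n
    cov = sum((p - mp) * (m - mm) for p, m in zip(pred, metric))
    var = sum((p - mp) ** 2 for p in pred)
    s = cov / var
    t = mm - s * mp
    return s, t

pred = [0.1, 0.2, 0.3, 0.4]      # relative depths from an MDE network
metric = [1.2, 2.2, 3.2, 4.2]    # sparse metric depths (here metric = 10*pred + 0.2)
s, t = fit_scale_shift(pred, metric)
print(round(s, 6), round(t, 6))  # 10.0 0.2
```

In a real system the metric anchors would be sparse triangulated VIO landmarks, and unreliable points would be gated out (e.g., by variance) before the fit.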
cs.CV / 10 / 2602.11339

Exploring Real-Time Super-Resolution: Benchmarking and Fine-Tuning for Streaming Content

探索实时超分辨率:流媒体内容的基准测试与微调
Bogatyrev, Evgeney, Abud, Khaled, Molodetskikh, Ivan, Alutis, Nikita, Vatolin, Dmitry
Abstract
Recent advancements in real-time super-resolution have enabled higher-quality video streaming, yet existing methods struggle with the unique challenges of compressed video content. Commonly used datasets do not accurately reflect the characteristics of streaming media, limiting the relevance of current benchmarks. To address this gap, we introduce a comprehensive dataset - StreamSR - sourced from YouTube, covering a wide range of video genres and resolutions representative of real-world streaming scenarios. We benchmark 11 state-of-the-art real-time super-resolution models to evaluate their performance for the streaming use-case. Furthermore, we propose EfRLFN, an efficient real-time model that integrates Efficient Channel Attention and a hyperbolic tangent activation function - a novel design choice in the context of real-time super-resolution. We extensively optimized the architecture to maximize efficiency and designed a composite loss function that improves training convergence. EfRLFN combines the strengths of existing architectures while improving both visual quality and runtime performance. Finally, we show that fine-tuning other models on our dataset results in significant performance gains that generalize well across various standard benchmarks. We made the dataset, the code, and the benchmark available at https://github.com/EvgeneyBogatyrev/EfRLFN.
Chinese Translation
近期实时超分辨率的进展使得视频流媒体质量得以提升,但现有方法在处理压缩视频内容的独特挑战时仍显不足。常用的数据集未能准确反映流媒体的特征,限制了当前基准测试的相关性。为了解决这一问题,我们引入了一个全面的数据集——StreamSR,该数据集来源于YouTube,涵盖了广泛的视频类型和分辨率,代表了现实世界中的流媒体场景。我们对11种最先进的实时超分辨率模型进行了基准测试,以评估它们在流媒体应用中的表现。此外,我们提出了EfRLFN,这是一种高效的实时模型,结合了高效通道注意力(Efficient Channel Attention)和双曲正切激活函数(hyperbolic tangent activation function),在实时超分辨率的背景下是一种新颖的设计选择。我们对架构进行了广泛优化,以最大化效率,并设计了一种复合损失函数,以改善训练收敛性。EfRLFN结合了现有架构的优势,同时提高了视觉质量和运行时性能。最后,我们展示了在我们的数据集上微调其他模型能够显著提升性能,并且在各种标准基准测试中具有良好的泛化能力。我们已将数据集、代码和基准测试发布在 https://github.com/EvgeneyBogatyrev/EfRLFN。
cs.CV / 11 / 2602.11349

ArtContext: Contextualizing Artworks with Open-Access Art History Articles and Wikidata Knowledge through a LoRA-Tuned CLIP Model

ArtContext:通过LoRA调优的CLIP模型将开放获取的艺术史文章和Wikidata知识进行艺术作品的情境化
Waugh, Samuel, James, Stuart
Abstract
Many Art History articles discuss artworks in general as well as specific parts of works, such as layout, iconography, or material culture. However, when viewing an artwork, it is not trivial to identify what different articles have said about the piece. Therefore, we propose ArtContext, a pipeline for taking a corpus of Open-Access Art History articles and Wikidata Knowledge and annotating Artworks with this information. We do this using a novel corpus collection pipeline, then learn a bespoke CLIP model adapted using Low-Rank Adaptation (LoRA) to make it domain-specific. We show that the new model, PaintingCLIP, which is weakly supervised by the collected corpus, outperforms CLIP and provides context for a given artwork. The proposed pipeline is generalisable and can be readily applied to numerous humanities areas.
Chinese Translation
许多艺术史文章讨论了艺术作品的一般性以及作品的特定部分,例如布局、图像学或物质文化。然而,在观看一件艺术作品时,识别不同文章对该作品的讨论并非易事。因此,我们提出了ArtContext,这是一个将开放获取的艺术史文章和Wikidata知识的语料库进行处理,并用这些信息对艺术作品进行注释的管道。我们采用了一种新颖的语料库收集管道,然后学习了一个使用低秩适应(Low-Rank Adaptation, LoRA)进行调整的定制CLIP模型,使其具有特定领域的特点。我们展示了新模型PaintingCLIP在收集的语料库的弱监督下,优于CLIP,并为特定艺术作品提供了背景信息。所提出的管道具有可推广性,可以方便地应用于多个人文学科领域。
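Low-Rank Adaptation as used above freezes the base weight W and learns a low-rank update B·A; a minimal pure-Python sketch of merging the adapter into the weight (toy matrices, with the standard alpha/r scaling):

```python
# Hedged sketch of merging a LoRA adapter: W' = W + (alpha / r) * B @ A.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_merge(W, A, B, alpha, r):
    BA = matmul(B, A)                 # low-rank update, rank r
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]

W = [[1.0, 0.0], [0.0, 1.0]]          # frozen base weight (2x2)
A = [[1.0, 1.0]]                      # down-projection, r=1 (1x2)
B = [[0.5], [0.0]]                    # up-projection (2x1)
print(lora_merge(W, A, B, alpha=2.0, r=1))  # [[2.0, 1.0], [0.0, 1.0]]
```

Only A and B are trained, which is what makes adapting a large CLIP backbone to the art-history domain cheap relative to full fine-tuning.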
cs.CV / 12 / 2602.11401

Latent Forcing: Reordering the Diffusion Trajectory for Pixel-Space Image Generation

潜在强制:重新排序扩散轨迹以进行像素空间图像生成
Baade, Alan, Chan, Eric Ryan, Sargent, Kyle, Chen, Changan, Johnson, Justin, Adeli, Ehsan, Fei-Fei, Li
Abstract
Latent diffusion models excel at generating high-quality images but lose the benefits of end-to-end modeling. They discard information during image encoding, require a separately trained decoder, and model an auxiliary distribution to the raw data. In this paper, we propose Latent Forcing, a simple modification to existing architectures that achieves the efficiency of latent diffusion while operating on raw natural images. Our approach orders the denoising trajectory by jointly processing latents and pixels with separately tuned noise schedules. This allows the latents to act as a scratchpad for intermediate computation before high-frequency pixel features are generated. We find that the order of conditioning signals is critical, and we analyze this to explain differences between REPA distillation in the tokenizer and the diffusion model, conditional versus unconditional generation, and how tokenizer reconstruction quality relates to diffusability. Applied to ImageNet, Latent Forcing achieves a new state-of-the-art for diffusion transformer-based pixel generation at our compute scale.
Chinese Translation
潜在扩散模型在生成高质量图像方面表现出色,但失去了端到端建模的优势。它们在图像编码过程中丢弃信息,需要单独训练的解码器,并对原始数据建模辅助分布。本文提出了潜在强制(Latent Forcing),这是对现有架构的简单修改,能够在处理原始自然图像时实现潜在扩散的高效性。我们的方法通过分别调整噪声调度,联合处理潜在变量和像素,重新排序去噪轨迹。这使得潜在变量可以作为中间计算的草稿板,在生成高频像素特征之前进行计算。我们发现条件信号的顺序至关重要,并对此进行分析,以解释标记器中的REPA蒸馏与扩散模型之间的差异、条件生成与无条件生成之间的区别,以及标记器重建质量与可扩散性之间的关系。应用于ImageNet,潜在强制在我们的计算规模下实现了基于扩散变换器的像素生成的新最先进水平。
cs.CV / 13 / 2602.11436

Fighting MRI Anisotropy: Learning Multiple Cardiac Shapes From a Single Implicit Neural Representation

抗击MRI各向异性:从单一隐式神经表示学习多种心脏形状
Brás, Carolina, Haddou, Soufiane Ben, Kuipers, Thijs P., Alvarez-Florez, Laura, Planken, R. Nils, Tjong, Fleur V. Y., Bezzina, Connie, Išgum, Ivana
Abstract
The anisotropic nature of short-axis (SAX) cardiovascular magnetic resonance imaging (CMRI) limits cardiac shape analysis. To address this, we propose to leverage near-isotropic, higher resolution computed tomography angiography (CTA) data of the heart. We use this data to train a single neural implicit function to jointly represent cardiac shapes from CMRI at any resolution. We evaluate the method for the reconstruction of right ventricle (RV) and myocardium (MYO), where MYO simultaneously models endocardial and epicardial left-ventricle surfaces. Since high-resolution SAX reference segmentations are unavailable, we evaluate performance by extracting a 4-chamber (4CH) slice of RV and MYO from their reconstructed shapes. When compared with the reference 4CH segmentation masks from CMRI, our method achieved a Dice similarity coefficient of 0.91 $\pm$ 0.07 and 0.75 $\pm$ 0.13, and a Hausdorff distance of 6.21 $\pm$ 3.97 mm and 7.53 $\pm$ 5.13 mm for RV and MYO, respectively. Quantitative and qualitative assessment demonstrate the model's ability to reconstruct accurate, smooth and anatomically plausible shapes, supporting improvements in cardiac shape analysis.
Chinese Translation
短轴(SAX)心血管磁共振成像(CMRI)的各向异性特性限制了心脏形状分析。为了解决这一问题,我们提出利用心脏的近各向同性、更高分辨率的计算机断层扫描血管造影(CTA)数据。我们使用这些数据训练一个单一的神经隐式函数,以联合表示CMRI中任意分辨率的心脏形状。我们评估该方法在右心室(RV)和心肌(MYO)的重建效果,其中MYO同时建模左心室的心内膜和心外膜表面。由于高分辨率SAX参考分割不可用,我们通过从重建形状中提取RV和MYO的四腔(4CH)切片来评估性能。与CMRI的参考4CH分割掩膜相比,我们的方法在RV和MYO的Dice相似系数分别达到了0.91 ± 0.07和0.75 ± 0.13,Hausdorff距离分别为6.21 ± 3.97 mm和7.53 ± 5.13 mm。定量和定性评估表明该模型能够重建准确、平滑且解剖上合理的形状,支持心脏形状分析的改进。
cs.CV / 14 / 2602.11440

Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation

Ctrl&Shift:高质量几何感知对象操控在视觉生成中的应用
Ruan, Penghui, Zi, Bojia, Qi, Xianbiao, Huang, Youze, Xiao, Rong, Wang, Pichao, Cao, Jiannong, Shi, Yuhui
Abstract
Object-level manipulation, relocating or reorienting objects in images or videos while preserving scene realism, is central to film post-production, AR, and creative editing. Yet existing methods struggle to jointly achieve three core goals: background preservation, geometric consistency under viewpoint shifts, and user-controllable transformations. Geometry-based approaches offer precise control but require explicit 3D reconstruction and generalize poorly; diffusion-based methods generalize better but lack fine-grained geometric control. We present Ctrl&Shift, an end-to-end diffusion framework to achieve geometry-consistent object manipulation without explicit 3D representations. Our key insight is to decompose manipulation into two stages, object removal and reference-guided inpainting under explicit camera pose control, and encode both within a unified diffusion process. To enable precise, disentangled control, we design a multi-task, multi-stage training strategy that separates background, identity, and pose signals across tasks. To improve generalization, we introduce a scalable real-world dataset construction pipeline that generates paired image and video samples with estimated relative camera poses. Extensive experiments demonstrate that Ctrl&Shift achieves state-of-the-art results in fidelity, viewpoint consistency, and controllability. To our knowledge, this is the first framework to unify fine-grained geometric control and real-world generalization for object manipulation, without relying on any explicit 3D modeling.
Chinese Translation
对象级操控,即在图像或视频中重新定位或重新定向对象,同时保持场景的真实感,是电影后期制作、增强现实(AR)和创意编辑的核心。然而,现有方法在同时实现三个核心目标方面存在困难:背景保留、视角变化下的几何一致性,以及用户可控的变换。基于几何的方法提供了精确的控制,但需要显式的三维重建,并且泛化能力较差;而基于扩散的方法泛化能力更强,但缺乏细粒度的几何控制。我们提出了Ctrl&Shift,一个端到端的扩散框架,旨在实现几何一致的对象操控,而无需显式的三维表示。我们的关键见解是将操控过程分解为两个阶段:对象移除和在显式相机姿态控制下的参考引导修补,并将两者编码在一个统一的扩散过程中。为了实现精确、解耦的控制,我们设计了一种多任务、多阶段的训练策略,在各任务之间分离背景、身份和姿态信号。为了提高泛化能力,我们引入了一个可扩展的现实世界数据集构建管道,生成带有估计相对相机姿态的配对图像和视频样本。大量实验表明,Ctrl&Shift在保真度、视角一致性和可控性方面达到了最先进的结果。据我们所知,这是第一个将细粒度几何控制与现实世界泛化统一起来的对象操控框架,而无需依赖任何显式的三维建模。
cs.CV / 15 / 2602.11446

Enhanced Portable Ultra Low-Field Diffusion Tensor Imaging with Bayesian Artifact Correction and Deep Learning-Based Super-Resolution

增强型便携式超低场扩散张量成像:贝叶斯伪影校正与基于深度学习的超分辨率
Olchanyi, Mark D., Sorby-Adams, Annabel, Kirsch, John, Edlow, Brian L., Farnan, Ava, Liu, Renfei, Rosen, Matthew S., Brown, Emery N., Kimberly, W. Taylor, Iglesias, Juan Eugenio
Abstract
Portable, ultra-low-field (ULF) magnetic resonance imaging has the potential to expand access to neuroimaging but currently suffers from coarse spatial and angular resolutions and low signal-to-noise ratios. Diffusion tensor imaging (DTI), a sequence tailored to detect and reconstruct white matter tracts within the brain, is particularly prone to such imaging degradation due to inherent sequence design coupled with prolonged scan times. In addition, ULF DTI scans exhibit artifacting that spans both the space and angular domains, requiring a custom modelling algorithm for subsequent correction. We introduce a nine-direction, single-shell ULF DTI sequence, as well as a companion Bayesian bias field correction algorithm that possesses angular dependence and a convolutional neural network-based super-resolution algorithm that is generalizable across DTI datasets and does not require re-training ("DiffSR"). We show through a synthetic downsampling experiment and white matter assessment in real, matched ULF and high-field DTI scans that these algorithms can recover microstructural and volumetric white matter information at ULF. We also show that DiffSR can be directly applied to white matter-based Alzheimer's disease classification in synthetically degraded scans, with notable improvements in agreement between DTI metrics, as compared to un-degraded scans. We freely disseminate the Bayesian bias correction algorithm and DiffSR with the goal of furthering progress on both ULF reconstruction methods and general DTI sequence harmonization. We release all code related to DiffSR for $\href{https://github.com/markolchanyi/DiffSR}{public \space use}$.
Chinese Translation
便携式超低场(ULF)磁共振成像有潜力扩展神经成像的可及性,但目前面临空间和角度分辨率粗糙以及信噪比低的问题。扩散张量成像(DTI)是一种专门用于检测和重建大脑白质纤维束的序列,因其固有的序列设计和较长的扫描时间而特别容易受到成像降解的影响。此外,ULF DTI扫描还表现出在空间和角度域内的伪影,需采用定制建模算法进行后续校正。我们介绍了一种九方向、单壳层的ULF DTI序列,以及一个具有角度依赖性的贝叶斯偏差场校正算法和一个基于卷积神经网络的超分辨率算法(“DiffSR”),该算法可在DTI数据集中广泛应用且无需重新训练。通过合成下采样实验和在真实匹配的ULF与高场DTI扫描中的白质评估,我们展示了这些算法能够恢复ULF下的微观结构和体积白质信息。我们还展示了DiffSR可以直接应用于基于白质的阿尔茨海默病分类,在合成降级扫描中显著改善了DTI指标与未降级扫描之间的一致性。我们免费发布贝叶斯偏差校正算法和DiffSR,旨在推动ULF重建方法和一般DTI序列的协调化进展。我们将与DiffSR相关的所有代码公开供公众使用。
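The nine-direction, single-shell acquisition above rests on the standard diffusion-weighted signal model. A minimal sketch of that model along a single gradient direction (the b-value and diffusivities here are illustrative toy values, not the paper's): the signal decays as S = S0 · exp(−b · ADC), so the apparent diffusion coefficient is recovered by inverting the exponential.

```python
import math

def adc_from_signals(s0, s, b):
    """Apparent diffusion coefficient along one gradient direction:
    the mono-exponential model S = S0 * exp(-b * ADC), inverted."""
    return -math.log(s / s0) / b

# Simulate a nine-direction, single-shell acquisition with known
# per-direction diffusivities, then invert the signal model.
b = 1000.0                                    # s/mm^2, illustrative b-value
true_adc = [0.7e-3 + 0.1e-3 * i for i in range(9)]  # mm^2/s, toy values
s0 = 100.0
signals = [s0 * math.exp(-b * d) for d in true_adc]
recovered = [adc_from_signals(s0, s, b) for s in signals]
```

Fitting the full 3x3 diffusion tensor combines such per-direction measurements in a linear least-squares problem; the log-inversion step is the same.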
cs.CV / 16 / 2602.11466

A Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness

一种具有边界和时间意识的语义变化检测双分支框架
Li, Yun-Cheng, Lei, Sen, Li, Heng-Chao, Li, Ke
Abstract
Semantic Change Detection (SCD) aims to detect and categorize land-cover changes from bi-temporal remote sensing images. Existing methods often suffer from blurred boundaries and inadequate temporal modeling, limiting segmentation accuracy. To address these issues, we propose a Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness, termed DBTANet. Specifically, we utilize a dual-branch Siamese encoder where a frozen SAM branch captures global semantic context and boundary priors, while a ResNet34 branch provides local spatial details, ensuring complementary feature representations. On this basis, we design a Bidirectional Temporal Awareness Module (BTAM) to aggregate multi-scale features and capture temporal dependencies in a symmetric manner. Furthermore, a Gaussian-smoothed Projection Module (GSPM) refines shallow SAM features, suppressing noise while enhancing edge information for boundary-aware constraints. Extensive experiments on two public benchmarks demonstrate that DBTANet effectively integrates global semantics, local details, temporal reasoning, and boundary awareness, achieving state-of-the-art performance.
Chinese Translation
语义变化检测(SCD)旨在从双时相遥感图像中检测和分类土地覆盖变化。现有方法常常受到模糊边界和不足的时间建模的影响,限制了分割精度。为了解决这些问题,我们提出了一种具有边界和时间意识的语义变化检测双分支框架,称为DBTANet。具体而言,我们利用一个双分支的Siamese编码器,其中一个冻结的SAM分支捕获全局语义上下文和边界先验,而ResNet34分支提供局部空间细节,确保互补特征表示。在此基础上,我们设计了一个双向时间意识模块(BTAM),以对多尺度特征进行聚合并以对称方式捕获时间依赖性。此外,一个高斯平滑投影模块(GSPM)用于精炼浅层SAM特征,抑制噪声,同时增强边缘信息以实现边界意识约束。在两个公共基准上的广泛实验表明,DBTANet有效整合了全局语义、局部细节、时间推理和边界意识,达到了最先进的性能。
cs.CV / 17 / 2602.11494

Arbitrary Ratio Feature Compression via Next Token Prediction

通过下一个标记预测实现任意比例特征压缩
Liu, Yufan, Ren, Daoyuan, Zhang, Zhipeng, Luo, Wenyang, Li, Bing, Hu, Weiming, Maybank, Stephen
Abstract
Feature compression is increasingly important for improving the efficiency of downstream tasks, especially in applications involving large-scale or multi-modal data. While existing methods typically rely on dedicated models for achieving specific compression ratios, they are often limited in flexibility and generalization. In particular, retraining is necessary when adapting to a new compression ratio. To address this limitation, we propose a novel and flexible Arbitrary Ratio Feature Compression (ARFC) framework, which supports any compression ratio with a single model, eliminating the need for multiple specialized models. At its core, the Arbitrary Ratio Compressor (ARC) is an auto-regressive model that performs compression via next-token prediction. This allows the compression ratio to be controlled at inference simply by adjusting the number of generated tokens. To enhance the quality of the compressed features, two key modules are introduced. The Mixture of Solutions (MoS) module refines the compressed tokens by utilizing multiple compression results (solutions), reducing uncertainty and improving robustness. The Entity Relation Graph Constraint (ERGC) is integrated into the training process to preserve semantic and structural relationships during compression. Extensive experiments on cross-modal retrieval, image classification, and image retrieval tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches at various compression ratios. Notably, in some cases, it even surpasses the performance of the original, uncompressed features. These results validate the effectiveness and versatility of ARFC for practical, resource-constrained scenarios.
Chinese Translation
特征压缩在提高下游任务效率方面变得越来越重要,尤其是在涉及大规模或多模态数据的应用中。现有方法通常依赖于专用模型以实现特定的压缩比,但它们在灵活性和泛化能力上往往受到限制。特别是在适应新的压缩比时,需要重新训练。为了解决这一限制,我们提出了一种新颖且灵活的任意比例特征压缩(Arbitrary Ratio Feature Compression, ARFC)框架,该框架支持使用单一模型实现任意压缩比,消除了对多个专用模型的需求。其核心是任意比例压缩器(Arbitrary Ratio Compressor, ARC),这是一个自回归模型,通过下一个标记预测进行压缩。这使得在推理时只需通过调整生成标记的数量即可控制压缩比。为了提高压缩特征的质量,引入了两个关键模块。解决方案混合(Mixture of Solutions, MoS)模块通过利用多个压缩结果(解决方案)来精炼压缩标记,从而减少不确定性并提高鲁棒性。实体关系图约束(Entity Relation Graph Constraint, ERGC)被整合到训练过程中,以在压缩过程中保持语义和结构关系。在多个数据集上的跨模态检索、图像分类和图像检索任务的广泛实验表明,我们的方法在各种压缩比下始终优于现有方法。值得注意的是,在某些情况下,它甚至超越了原始未压缩特征的性能。这些结果验证了ARFC在实际资源受限场景中的有效性和多样性。
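The key mechanism claimed above is that one autoregressive model can serve any compression ratio, since the ratio is set at inference simply by how many tokens are generated. A minimal sketch of that control loop (the chunk-mean "model" is a stand-in of mine, not ARC itself):

```python
import math

def tokens_for_ratio(feature_dim, ratio):
    """Number of compressed tokens realizing a target compression
    ratio (ceil keeps the token budget conservative)."""
    return max(1, math.ceil(feature_dim / ratio))

def compress(feature, ratio, step_fn):
    """Toy next-token compression loop: step_fn plays the role of the
    autoregressive model, emitting one token per call conditioned on
    the feature and the tokens so far. Changing the stopping point is
    all that changes the ratio -- no retraining involved."""
    n = tokens_for_ratio(len(feature), ratio)
    tokens = []
    for _ in range(n):
        tokens.append(step_fn(feature, tokens))
    return tokens

# A stand-in "model": each token summarizes the next 4-dim chunk by its mean.
def mean_chunk_step(feature, tokens):
    n_done = len(tokens)
    chunk = feature[4 * n_done: 4 * (n_done + 1)]
    return sum(chunk) / len(chunk) if chunk else 0.0

feat = list(range(32))                       # a 32-dim "feature"
t8 = compress(feat, 4.0, mean_chunk_step)    # ratio 4 -> 8 tokens
t4 = compress(feat, 8.0, mean_chunk_step)    # ratio 8 -> 4 tokens
```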
cs.CV / 18 / 2602.11499

What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation

如果智能体能够想象?通过生成强化开放词汇人-物交互理解
Yuan, Zhenlong, Qu, Xiangyan, Tang, Jing, Chen, Rui, Sun, Lei, Chen, Ruidong, Yu, Hongwei, Qian, Chengxuan, Chu, Xiangxiang, Li, Shuo, Zhou, Yuyin
Abstract
Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning capabilities in Open-Vocabulary Human-Object Interaction (OV-HOI) are limited by cross-modal hallucinations and occlusion-induced ambiguity. To address this, we propose \textbf{ImagineAgent}, an agentic framework that harmonizes cognitive reasoning with generative imagination for robust visual understanding. Specifically, our method innovatively constructs cognitive maps that explicitly model plausible relationships between detected entities and candidate actions. Subsequently, it dynamically invokes tools including retrieval augmentation, image cropping, and diffusion models to gather domain-specific knowledge and enriched visual evidence, thereby achieving cross-modal alignment in ambiguous scenarios. Moreover, we propose a composite reward that balances prediction accuracy and tool efficiency. Evaluations on SWIG-HOI and HICO-DET datasets demonstrate our SOTA performance, requiring approximately 20\% of training data compared to existing methods, validating our robustness and efficiency.
Chinese Translation
多模态大语言模型在桥接视觉与文本推理方面展现了良好的能力,但其在开放词汇人-物交互(Open-Vocabulary Human-Object Interaction, OV-HOI)中的推理能力受到跨模态幻觉和遮挡引发的模糊性的限制。为了解决这一问题,我们提出了ImagineAgent,一个将认知推理与生成想象相结合的智能体框架,以实现稳健的视觉理解。具体而言,我们的方法创新性地构建了认知地图,明确建模检测到的实体与候选动作之间的合理关系。随后,它动态调用包括检索增强、图像裁剪和扩散模型等工具,以收集领域特定知识和丰富的视觉证据,从而在模糊场景中实现跨模态对齐。此外,我们提出了一种复合奖励,平衡预测准确性和工具效率。在SWIG-HOI和HICO-DET数据集上的评估表明我们的模型达到了最先进的性能,与现有方法相比,仅需约20%的训练数据,验证了我们的稳健性和效率。
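The abstract's "composite reward that balances prediction accuracy and tool efficiency" can be caricatured in one line: an accuracy term minus a penalty on tool invocations. The weighting `lam` and the call cap below are hypothetical choices of this sketch, not values from the paper.

```python
def composite_reward(correct, num_tool_calls, lam=0.1, max_calls=5):
    """Toy composite reward: +1 for a correct HOI prediction, minus a
    penalty proportional to tool usage, capped at max_calls. lam and
    max_calls are illustrative, not from the paper."""
    acc = 1.0 if correct else 0.0
    cost = lam * min(num_tool_calls, max_calls)
    return acc - cost

r_cheap = composite_reward(True, 1)   # correct with one tool call
r_heavy = composite_reward(True, 4)   # correct, but tool-heavy
```

Under such a reward, an RL-trained agent is pushed to invoke retrieval, cropping, or diffusion tools only when they actually change the prediction.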
cs.CV / 19 / 2602.11536

Vascular anatomy-aware self-supervised pre-training for X-ray angiogram analysis

考虑血管解剖的自监督预训练用于X射线血管造影分析
Huang, De-Xing, Yu, Chaohui, Zhou, Xiao-Hu, Xiang, Tian-Yu, Zhang, Qin-Yi, Gui, Mei-Jiang, Ma, Rui-Ze, Wang, Chen-Yu, Xiao, Nu-Fang, Wang, Fan, Hou, Zeng-Guang
Abstract
X-ray angiography is the gold standard imaging modality for cardiovascular diseases. However, current deep learning approaches for X-ray angiogram analysis are severely constrained by the scarcity of annotated data. While large-scale self-supervised learning (SSL) has emerged as a promising solution, its potential in this domain remains largely unexplored, primarily due to the lack of effective SSL frameworks and large-scale datasets. To bridge this gap, we introduce a vascular anatomy-aware masked image modeling (VasoMIM) framework that explicitly integrates domain-specific anatomical knowledge. Specifically, VasoMIM comprises two key designs: an anatomy-guided masking strategy and an anatomical consistency loss. The former strategically masks vessel-containing patches to compel the model to learn robust vascular semantics, while the latter preserves structural consistency of vessels between original and reconstructed images, enhancing the discriminability of the learned representations. In conjunction with VasoMIM, we curate XA-170K, the largest X-ray angiogram pre-training dataset to date. We validate VasoMIM on four downstream tasks across six datasets, where it demonstrates superior transferability and achieves state-of-the-art performance compared to existing methods. These findings highlight the significant potential of VasoMIM as a foundation model for advancing a wide range of X-ray angiogram analysis tasks. VasoMIM and XA-170K will be available at https://github.com/Dxhuang-CASIA/XA-SSL.
Chinese Translation
X射线血管造影是心血管疾病的金标准成像方式。然而,当前针对X射线血管造影分析的深度学习方法受到标注数据稀缺的严重限制。虽然大规模自监督学习(SSL)已成为一种有前景的解决方案,但其在该领域的潜力仍然未得到充分探索,主要是由于缺乏有效的SSL框架和大规模数据集。为填补这一空白,我们提出了一种考虑血管解剖的掩码图像建模框架(VasoMIM),该框架明确整合了特定领域的解剖知识。具体而言,VasoMIM包含两个关键设计:解剖引导的掩码策略和解剖一致性损失。前者战略性地掩盖包含血管的图像块,以迫使模型学习稳健的血管语义,而后者则保持原始图像与重建图像之间血管的结构一致性,从而增强学习表示的可区分性。结合VasoMIM,我们整理了XA-170K,这是迄今为止最大的X射线血管造影预训练数据集。我们在六个数据集上的四个下游任务中验证了VasoMIM,结果表明其具有优越的迁移能力,并且相比现有方法达到了最先进的性能。这些发现突显了VasoMIM作为基础模型在推动广泛的X射线血管造影分析任务中的重大潜力。VasoMIM和XA-170K将可在https://github.com/Dxhuang-CASIA/XA-SSL获取。
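The anatomy-guided masking strategy above preferentially hides vessel-containing patches so the model must learn vascular semantics to reconstruct them. A minimal sketch under assumed simplifications (count-based ranking, fixed masking fraction; VasoMIM's exact recipe may differ):

```python
def patchify(mask, patch):
    """Top-left coordinates of non-overlapping patches of a 2D map."""
    h, w = len(mask), len(mask[0])
    return [(i, j) for i in range(0, h, patch) for j in range(0, w, patch)]

def anatomy_guided_mask(vessel_map, patch=2, mask_frac=0.5):
    """Toy anatomy-guided masking: rank patches by how many vessel
    pixels they contain and mask the top fraction, forcing the
    reconstruction task to model vascular structure. Tie-breaking and
    the masking fraction are illustrative choices."""
    coords = patchify(vessel_map, patch)
    def vessel_count(c):
        i, j = c
        return sum(vessel_map[i + di][j + dj]
                   for di in range(patch) for dj in range(patch))
    ranked = sorted(coords, key=vessel_count, reverse=True)
    n_mask = int(len(coords) * mask_frac)
    return set(ranked[:n_mask])

vessels = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
masked = anatomy_guided_mask(vessels)  # masks 2 of the 4 patches
```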
cs.CV / 20 / 2602.11545

Supervise-assisted Multi-modality Fusion Diffusion Model for PET Restoration

监督辅助的多模态融合扩散模型用于正电子发射断层扫描恢复
Zhang, Yingkai, Chen, Shuang, Tian, Ye, Gao, Yunyi, Jiang, Jianyong, Fu, Ying
Abstract
Positron emission tomography (PET) offers powerful functional imaging but involves radiation exposure. Efforts to reduce this exposure by lowering the radiotracer dose or scan time can degrade image quality. While using magnetic resonance (MR) images with clearer anatomical information to restore standard-dose PET (SPET) from low-dose PET (LPET) is a promising approach, it faces challenges with the inconsistencies in the structure and texture of multi-modality fusion, as well as the mismatch in out-of-distribution (OOD) data. In this paper, we propose a supervise-assisted multi-modality fusion diffusion model (MFdiff) for addressing these challenges for high-quality PET restoration. Firstly, to fully utilize auxiliary MR images without introducing extraneous details in the restored image, a multi-modality feature fusion module is designed to learn an optimized fusion feature. Secondly, using the fusion feature as an additional condition, high-quality SPET images are iteratively generated based on the diffusion model. Furthermore, we introduce a two-stage supervise-assisted learning strategy that harnesses both generalized priors from simulated in-distribution datasets and specific priors tailored to in-vivo OOD data. Experiments demonstrate that the proposed MFdiff effectively restores high-quality SPET images from multi-modality inputs and outperforms state-of-the-art methods both qualitatively and quantitatively.
Chinese Translation
正电子发射断层扫描(PET)提供了强大的功能成像能力,但涉及辐射暴露。通过降低放射性示踪剂剂量或扫描时间来减少这种暴露的努力可能会降低图像质量。利用磁共振(MR)图像中更清晰的解剖信息从低剂量PET(LPET)恢复标准剂量PET(SPET)是一种有前景的方法,但在多模态融合的结构和纹理不一致性以及分布外(OOD)数据的不匹配方面面临挑战。本文提出了一种监督辅助的多模态融合扩散模型(MFdiff),旨在解决高质量PET恢复中的这些挑战。首先,为了充分利用辅助MR图像而不在恢复图像中引入多余细节,设计了一个多模态特征融合模块,以学习优化的融合特征。其次,利用融合特征作为额外条件,基于扩散模型迭代生成高质量的SPET图像。此外,我们引入了一种两阶段的监督辅助学习策略,利用来自模拟的分布内数据集的广义先验和针对体内OOD数据的特定先验。实验表明,所提出的MFdiff有效地从多模态输入中恢复高质量的SPET图像,并在定性和定量上均优于最先进的方法。
cs.CV / 21 / 2602.11553

Perception-based Image Denoising via Generative Compression

基于感知的图像去噪通过生成压缩
Nguyen, Nam, Nguyen, Thinh, Bose, Bella
Abstract
Image denoising aims to remove noise while preserving structural details and perceptual realism, yet distortion-driven methods often produce over-smoothed reconstructions, especially under strong noise and distribution shift. This paper proposes a generative compression framework for perception-based denoising, where restoration is achieved by reconstructing from entropy-coded latent representations that enforce low-complexity structure, while generative decoders recover realistic textures via perceptual measures such as learned perceptual image patch similarity (LPIPS) loss and Wasserstein distance. Two complementary instantiations are introduced: (i) a conditional Wasserstein GAN (WGAN)-based compression denoiser that explicitly controls the rate-distortion-perception (RDP) trade-off, and (ii) a conditional diffusion-based reconstruction strategy that performs iterative denoising guided by compressed latents. We further establish non-asymptotic guarantees for the compression-based maximum-likelihood denoiser under additive Gaussian noise, including bounds on reconstruction error and decoding error probability. Experiments on synthetic and real-noise benchmarks demonstrate consistent perceptual improvements while maintaining competitive distortion performance.
Chinese Translation
图像去噪旨在去除噪声,同时保留结构细节和感知真实感,然而,基于失真的方法往往会产生过于平滑的重建,尤其是在强噪声和分布转移的情况下。本文提出了一种用于基于感知的去噪的生成压缩框架,其中通过重建低复杂度结构的熵编码潜在表示来实现恢复,同时生成解码器通过学习的感知图像块相似性(LPIPS)损失和Wasserstein距离等感知度量恢复真实的纹理。我们引入了两种互补的实例化:(i)基于条件Wasserstein GAN(WGAN)的压缩去噪器,明确控制速率-失真-感知(RDP)权衡;(ii)基于条件扩散的重建策略,通过压缩潜在表示进行迭代去噪。我们进一步为基于压缩的最大似然去噪器在加性高斯噪声下建立了非渐近保证,包括重建误差和解码误差概率的界限。在合成和真实噪声基准上的实验表明,在保持竞争性失真性能的同时,感知效果持续改善。
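The compression-based maximum-likelihood idea above can be caricatured with the simplest low-complexity codebook, a uniform scalar quantizer: the "denoiser" maps each noisy sample to its nearest codeword, and when the clean signal lies on the codebook and the noise stays below half the step, recovery is exact. This is a fixed-quantizer sketch of mine; the paper's reconstruction is learned (WGAN/diffusion), not a scalar grid.

```python
def quantize(x, step):
    """Nearest point on the uniform grid {k * step}."""
    return step * round(x / step)

def compression_denoise(noisy, step):
    """Toy compression-based denoiser: reconstruct each sample from the
    nearest codeword of a uniform scalar codebook, caricaturing
    restoration from an entropy-coded, low-complexity latent."""
    return [quantize(v, step) for v in noisy]

clean = [0.0, 0.5, 1.0, -0.5]       # lies on the step-0.5 grid
noise = [0.1, -0.2, 0.15, 0.05]     # all below step/2 = 0.25
noisy = [c + n for c, n in zip(clean, noise)]
restored = compression_denoise(noisy, step=0.5)
```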
cs.CV / 22 / 2602.11564

LUVE : Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts

LUVE:基于双频专家的潜在级联超高分辨率视频生成
Zhao, Chen, Chen, Jiawei, Li, Hongyu, Kang, Zhuoliang, Lu, Shilin, Wei, Xiaoming, Zhang, Kai, Yang, Jian, Tai, Ying
Abstract
Recent advances in video diffusion models have significantly improved visual quality, yet ultra-high-resolution (UHR) video generation remains a formidable challenge due to the compounded difficulties of motion modeling, semantic planning, and detail synthesis. To address these limitations, we propose \textbf{LUVE}, a \textbf{L}atent-cascaded \textbf{U}HR \textbf{V}ideo generation framework built upon dual frequency \textbf{E}xperts. LUVE employs a three-stage architecture comprising low-resolution motion generation for motion-consistent latent synthesis, video latent upsampling that performs resolution upsampling directly in the latent space to mitigate memory and computational overhead, and high-resolution content refinement that integrates low-frequency and high-frequency experts to jointly enhance semantic coherence and fine-grained detail generation. Extensive experiments demonstrate that our LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component. The project is available at \href{https://unicornanrocinu.github.io/LUVE_web/}{https://github.io/LUVE/}.
Chinese Translation
近年来,视频扩散模型的进展显著提高了视觉质量,但超高分辨率(UHR)视频生成仍然是一项艰巨的挑战,因为运动建模、语义规划和细节合成的困难相互叠加。为了解决这些限制,我们提出了LUVE,一种基于双频专家(dual frequency Experts)构建的潜在级联(Latent-cascaded)UHR视频生成框架。LUVE采用三阶段架构,包括用于运动一致潜在合成的低分辨率运动生成、在潜在空间中直接进行分辨率上采样以减轻内存和计算开销的视频潜在上采样,以及整合低频和高频专家共同增强语义一致性和细致细节生成的高分辨率内容精炼。大量实验表明,我们的LUVE在UHR视频生成中实现了卓越的照片真实感和内容保真度,全面的消融研究进一步验证了每个组件的有效性。该项目可在 https://unicornanrocinu.github.io/LUVE_web/ 获取。
cs.CV / 23 / 2602.11565

Move What Matters: Parameter-Efficient Domain Adaptation via Optimal Transport Flow for Collaborative Perception

关注重要事项:通过最优传输流实现协作感知的参数高效领域适应
Jia, Zesheng, Wang, Jin, Liu, Siao, Li, Lingzhi, Huang, Ziyao, Xu, Yunjiang, Wang, Jianping
Abstract
Fast domain adaptation remains a fundamental challenge for deploying multi-agent systems across diverse environments in Vehicle-to-Everything (V2X) collaborative perception. Despite the success of Parameter-Efficient Fine-Tuning (PEFT) in natural language processing and conventional vision tasks, directly applying PEFT to multi-agent settings leads to significant performance degradation and training instability. In this work, we conduct a detailed analysis and identify two key factors: (i) inter-frame redundancy in heterogeneous sensory streams, and (ii) erosion of fine-grained semantics in deep-layer representations under PEFT adaptation. To address these issues, we propose FlowAdapt, a parameter-efficient framework grounded in optimal transport theory, which minimizes information transport costs across both data distributions and network hierarchies. Specifically, we introduce a Wasserstein Greedy Sampling strategy to selectively filter redundant samples via a bounded covering radius. Furthermore, Progressive Knowledge Transfer module is designed to progressively inject compressed early-stage representations into later stages through learnable pathways, alleviating semantic degradation in late-stage adaptation. Extensive experiments on three benchmarks demonstrate that FlowAdapt achieves state-of-the-art performance with only 1% of trainable parameters, effectively bridging domain gaps with superior sample efficiency and generalization.
Chinese Translation
快速领域适应仍然是将多智能体系统部署到多样化环境中的一个基本挑战,尤其是在车对万物(V2X)协作感知中。尽管参数高效微调(PEFT)在自然语言处理和传统视觉任务中取得了成功,但直接将PEFT应用于多智能体环境会导致显著的性能下降和训练不稳定性。在本研究中,我们进行了详细分析,并确定了两个关键因素:(i)异构传感器流中的帧间冗余,以及(ii)在PEFT适应下深层表示中的细粒度语义的侵蚀。为了解决这些问题,我们提出了FlowAdapt,这是一个基于最优传输理论的参数高效框架,旨在最小化数据分布和网络层次之间的信息传输成本。具体而言,我们引入了一种Wasserstein贪婪采样策略,通过有界覆盖半径选择性地过滤冗余样本。此外,设计了渐进知识转移模块,通过可学习路径逐步将压缩的早期表示注入到后期阶段,从而减轻后期适应中的语义退化。在三个基准上的广泛实验表明,FlowAdapt以仅1%的可训练参数实现了最先进的性能,有效弥合了领域间的差距,具有优越的样本效率和泛化能力。
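The "selectively filter redundant samples via a bounded covering radius" step admits a compact greedy form: keep a sample only if no previously kept sample already covers it. A minimal sketch, with Euclidean distance standing in for the Wasserstein geometry of the paper and toy 2-D "frames" as data:

```python
import math

def greedy_cover_filter(points, radius):
    """Greedy redundancy filter: keep a point only if it is not already
    covered (within `radius`) by a previously kept point. Euclidean
    distance is a stand-in for the Wasserstein distance used by
    FlowAdapt's greedy sampling."""
    kept = []
    for p in points:
        if all(math.dist(p, q) > radius for q in kept):
            kept.append(p)
    return kept

frames = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # near-duplicate frames
          (5.0, 5.0), (5.1, 5.0),
          (9.0, 0.0)]
selected = greedy_cover_filter(frames, radius=1.0)
```

Every discarded frame is guaranteed to lie within the covering radius of a kept one, which bounds the information lost to filtering.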
cs.CV / 24 / 2602.11588

A Large Language Model for Disaster Structural Reconnaissance Summarization

用于灾害结构侦察总结的大型语言模型
Gao, Yuqing, Zhou, Guanren, Mosalam, Khalid M.
Abstract
Artificial Intelligence (AI)-aided vision-based Structural Health Monitoring (SHM) has emerged as an effective approach for monitoring and assessing structural condition by analyzing image and video data. By integrating Computer Vision (CV) and Deep Learning (DL), vision-based SHM can automatically identify and localize visual patterns associated with structural damage. However, previous works typically generate only discrete outputs, such as damage class labels and damage region coordinates, requiring engineers to further reorganize and analyze these results for evaluation and decision-making. In late 2022, Large Language Models (LLMs) became popular across multiple fields, providing new insights into AI-aided vision-based SHM. In this study, a novel LLM-based Disaster Reconnaissance Summarization (LLM-DRS) framework is proposed. It introduces a standard reconnaissance plan in which the collection of vision data and corresponding metadata follows a well-designed on-site investigation process. Text-based metadata and image-based vision data are then processed and integrated into a unified format, where well-trained Deep Convolutional Neural Networks extract key attributes, including damage state, material type, and damage level. Finally, all data are fed into an LLM with carefully designed prompts, enabling the LLM-DRS to generate summary reports for individual structures or affected regions based on aggregated attributes and metadata. Results show that integrating LLMs into vision-based SHM, particularly for rapid post-disaster reconnaissance, demonstrates promising potential for improving resilience of the built environment through effective reconnaissance.
Chinese Translation
基于人工智能(AI)的视觉结构健康监测(SHM)已成为通过分析图像和视频数据来监测和评估结构状况的有效方法。通过整合计算机视觉(CV)和深度学习(DL),基于视觉的SHM能够自动识别和定位与结构损伤相关的视觉模式。然而,以往的研究通常仅生成离散的输出,例如损伤类别标签和损伤区域坐标,这需要工程师进一步重新组织和分析这些结果以进行评估和决策。在2022年末,大型语言模型(LLMs)在多个领域变得流行,为AI辅助的基于视觉的SHM提供了新的见解。本研究提出了一种新颖的基于LLM的灾害侦察总结(LLM-DRS)框架。该框架引入了一种标准侦察计划,其中视觉数据及相应的元数据的收集遵循一个精心设计的现场调查过程。然后,基于文本的元数据和基于图像的视觉数据被处理并整合为统一格式,经过良好训练的深度卷积神经网络提取关键属性,包括损伤状态、材料类型和损伤级别。最后,所有数据被输入到一个具有精心设计提示的LLM中,使得LLM-DRS能够根据聚合属性和元数据生成个别结构或受影响区域的总结报告。结果表明,将LLMs整合到基于视觉的SHM中,特别是在快速灾后侦察方面,展示了通过有效侦察提高建筑环境韧性的良好潜力。
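The final step of the pipeline above folds CNN-extracted attributes and reconnaissance metadata into a carefully designed prompt. A hedged sketch of that assembly; the field names and wording here are illustrative, not the paper's actual template:

```python
def build_summary_prompt(metadata, attributes):
    """Toy prompt assembly: combine site metadata with per-image
    attributes (damage state, material, level) into one instruction
    for a summary-writing LLM."""
    lines = ["You are assisting a post-disaster structural reconnaissance team.",
             f"Site: {metadata['site']} | Date: {metadata['date']}",
             "Per-image findings:"]
    for a in attributes:
        lines.append(f"- {a['image']}: damage_state={a['damage_state']}, "
                     f"material={a['material']}, level={a['level']}")
    lines.append("Write a concise summary report for this structure.")
    return "\n".join(lines)

prompt = build_summary_prompt(
    {"site": "Building A", "date": "2026-02-01"},
    [{"image": "img_001.jpg", "damage_state": "damaged",
      "material": "concrete", "level": "moderate"}],
)
```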
cs.CV / 25 / 2602.11625

PLOT-CT: Pre-log Voronoi Decomposition Assisted Generation for Low-dose CT Reconstruction

PLOT-CT:基于预对数Voronoi分解辅助生成的低剂量CT重建
Huang, Bin, Yu, Xun, Zhang, Yikun, Zhang, Yi, Chen, Yang, Liu, Qiegen
Abstract
Low-dose computed tomography (LDCT) reconstruction is fundamentally challenged by severe noise and compromised data fidelity under reduced radiation exposure. Most existing methods operate either in the image or post-log projection domain, which fails to fully exploit the rich structural information in pre-log measurements while being highly susceptible to noise. The requisite logarithmic transformation critically amplifies noise within these data, imposing exceptional demands on reconstruction precision. To overcome these challenges, we propose PLOT-CT, a novel framework for Pre-Log vOronoi decomposiTion-assisted CT generation. Our method begins by applying Voronoi decomposition to pre-log sinograms, disentangling the data into distinct underlying components, which are embedded in separate latent spaces. This explicit decomposition significantly enhances the model's capacity to learn discriminative features, directly improving reconstruction accuracy by mitigating noise and preserving information inherent in the pre-log domain. Extensive experiments demonstrate that PLOT-CT achieves state-of-the-art performance, attaining a 2.36dB PSNR improvement over traditional methods at the 1e4 incident photon level in the pre-log domain.
Chinese Translation
低剂量计算机断层扫描(LDCT)重建在减少辐射暴露下面临严重噪声和数据保真度下降的根本挑战。现有大多数方法要么在图像域中操作,要么在后对数投影域中操作,这导致未能充分利用预对数测量中丰富的结构信息,同时对噪声高度敏感。所需的对数变换在这些数据中显著放大了噪声,对重建精度提出了极高的要求。为克服这些挑战,我们提出了PLOT-CT,一种基于预对数Voronoi分解辅助的CT生成新框架。我们的方法首先对预对数正弦图应用Voronoi分解,将数据解构为不同的潜在成分,这些成分嵌入在各自的潜在空间中。这种明确的分解显著增强了模型学习判别特征的能力,直接通过减轻噪声和保留预对数域中的信息来提高重建精度。大量实验表明,PLOT-CT在预对数域中在1e4入射光子水平下,较传统方法实现了2.36dB的PSNR提升,达到了当前的最先进性能。
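At its core, a discrete Voronoi decomposition assigns every data element to its nearest seed, partitioning the measurements into components. A generic sketch of that assignment on toy (detector, angle) coordinates; how PLOT-CT chooses seeds and embeds the resulting components is its own contribution and is not reproduced here:

```python
import math

def voronoi_assign(points, seeds):
    """Assign each point to the index of its nearest seed -- the
    discrete Voronoi decomposition used to split data into components."""
    return [min(range(len(seeds)), key=lambda k: math.dist(p, seeds[k]))
            for p in points]

# Toy "sinogram bins" as (detector, angle) coordinates with two seeds.
bins = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (8.0, 9.0)]
seeds = [(0.0, 0.0), (9.0, 9.0)]
assignments = voronoi_assign(bins, seeds)
```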
cs.CV / 26 / 2602.11628

PLESS: Pseudo-Label Enhancement with Spreading Scribbles for Weakly Supervised Segmentation

PLESS:基于扩散涂鸦的伪标签增强用于弱监督分割
Gabrielyan, Yeva, Yeghiazaryan, Varduhi, Voiculescu, Irina
Abstract
Weakly supervised learning with scribble annotations uses sparse user-drawn strokes to indicate segmentation labels on a small subset of pixels. This annotation reduces the cost of dense pixel-wise labeling, but suffers inherently from noisy and incomplete supervision. Recent scribble-based approaches in medical image segmentation address this limitation using pseudo-label-based training; however, the quality of the pseudo-labels remains a key performance limit. We propose PLESS, a generic pseudo-label enhancement strategy which improves reliability and spatial consistency. It builds on a hierarchical partitioning of the image into a hierarchy of spatially coherent regions. PLESS propagates scribble information to refine pseudo-labels within semantically coherent regions. The framework is model-agnostic and easily integrates into existing pseudo-label methods. Experiments on two public cardiac MRI datasets (ACDC and MSCMRseg) across four scribble-supervised algorithms show consistent improvements in segmentation accuracy. Code will be made available on GitHub upon acceptance.
Chinese Translation
弱监督学习通过涂鸦注释使用稀疏的用户绘制笔画来指示小部分像素的分割标签。这种注释减少了密集像素级标注的成本,但固有地受到噪声和不完整监督的影响。近期的基于涂鸦的方法在医学图像分割中通过伪标签训练来解决这一限制;然而,伪标签的质量仍然是性能的关键限制。我们提出了PLESS,一种通用的伪标签增强策略,旨在提高可靠性和空间一致性。它基于将图像分割为空间一致区域的层次结构。PLESS将涂鸦信息传播到语义一致区域内,以细化伪标签。该框架与模型无关,能够轻松集成到现有的伪标签方法中。在两个公共心脏MRI数据集(ACDC和MSCMRseg)上进行的实验显示,四种涂鸦监督算法的分割准确性均有一致性提升。代码将在接受后在GitHub上发布。
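The spreading step above propagates sparse scribble labels through spatially coherent regions. A single-level sketch under an assumed majority-vote rule (PLESS itself works over a hierarchy of regions and refines further):

```python
from collections import Counter

def propagate_scribbles(regions, scribbles):
    """Toy scribble spreading: inside each precomputed coherent region,
    give every pixel the majority scribble label found in that region;
    regions with no scribble stay unlabeled (None)."""
    pseudo = {}
    for region in regions:
        votes = Counter(scribbles[p] for p in region if p in scribbles)
        label = votes.most_common(1)[0][0] if votes else None
        for p in region:
            pseudo[p] = label
    return pseudo

regions = [[(0, 0), (0, 1), (1, 0)], [(2, 2), (2, 3)]]
scribbles = {(0, 0): "myocardium", (0, 1): "myocardium"}
labels = propagate_scribbles(regions, scribbles)
```

A handful of strokes thus yields dense pseudo-labels wherever the region partition is semantically coherent, while unscribbled regions remain honestly unlabeled.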
cs.CV / 27 / 2602.11636

ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning

ScalSelect:高效视觉指令调优的可扩展无训练多模态数据选择
Wu, Changti, Mao, Jiahuai, Miao, Yuzhuo, Lian, Shijie, Yu, Bin, Lin, Xiaopeng, Huang, Cong, Zhang, Lei, Chen, Kai
Abstract
Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on the large-scale datasets is computationally expensive and inefficient due to redundancy in the data, which motivates the need for multimodal data selection to improve training efficiency. Existing data selection methods for VIT either require costly training or gradient computation. Training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, and pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear-time complexity with respect to the number of samples, eliminating the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting visual features most attended by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings. The code is available at \href{https://github.com/ChangtiWu/ScalSelect}{ScalSelect}.
Chinese Translation
大规模视觉指令调优(VIT)已成为推动视觉-语言模型(VLMs)在各种多模态任务中性能提升的关键范式。然而,由于数据冗余,在大规模数据集上进行训练的计算成本高且效率低下,这促使了多模态数据选择的需求,以提高训练效率。现有的VIT数据选择方法要么需要昂贵的训练,要么需要梯度计算;无训练的替代方案通常依赖于代理模型或数据集、与指令无关的表示以及具有二次复杂度的成对相似性,这限制了可扩展性和表示的保真度。在本研究中,我们提出了ScalSelect,一种可扩展的无训练多模态数据选择方法,其在样本数量上的时间复杂度为线性,消除了对外部模型或辅助数据集的需求。ScalSelect首先通过提取目标VLM中最受指令标记关注的视觉特征来构建样本表示,从而捕捉与指令相关的信息。然后,它识别出其表示最接近完整数据集表示的主导子空间的样本,从而实现可扩展的重要性评分,而无需成对比较。在多个VLM、数据集和选择预算下的广泛实验表明,ScalSelect在仅使用16%的数据时,达到了在完整数据集上训练的97.5%以上的性能,甚至在某些设置中超越了完整数据训练。代码可在 https://github.com/ChangtiWu/ScalSelect 获取。
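"Identify samples whose representations best approximate the dominant subspace without pairwise comparisons" can be sketched with power iteration: estimate the top direction of the d x d scatter matrix (one pass over samples per iteration, hence linear in n), then keep the samples with the largest projection onto it. A rank-1 caricature of the idea, not ScalSelect itself:

```python
def scal_select_sketch(samples, k, iters=50):
    """Toy linear-time selection: power iteration on the d x d scatter
    matrix finds the dominant direction; the k samples with largest
    |projection| onto it are kept. No pairwise sample comparisons."""
    d = len(samples[0])
    # scatter matrix S = sum_i x_i x_i^T (d small, n possibly large)
    S = [[sum(x[a] * x[b] for x in samples) for b in range(d)]
         for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(S[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    scores = [abs(sum(x[a] * v[a] for a in range(d))) for x in samples]
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

data = [(3.0, 0.1), (-2.5, 0.0), (0.1, 0.2), (2.8, -0.1), (0.0, 0.1)]
picked = scal_select_sketch(data, k=3)   # keeps the high-variance samples
```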
cs.CV / 28 / 2602.11642

Electrostatics-Inspired Surface Reconstruction (EISR): Recovering 3D Shapes as a Superposition of Poisson's PDE Solutions

电静电启发的表面重建(EISR):将三维形状恢复为泊松方程解的叠加
Patiño, Diego, Peterson, Knut, Daniilidis, Kostas, Han, David K.
Abstract
Implicit shape representation, such as SDFs, is a popular approach to recover the surface of a 3D shape as the level sets of a scalar field. Several methods approximate SDFs using machine learning strategies that exploit the knowledge that SDFs are solutions of the Eikonal partial differential equation (PDEs). In this work, we present a novel approach to surface reconstruction by encoding it as a solution to a proxy PDE, namely Poisson's equation. Then, we explore the connection between Poisson's equation and physics, e.g., the electrostatic potential due to a positive charge density. We employ Green's functions to obtain a closed-form parametric expression for the PDE's solution, and leverage the linearity of our proxy PDE to find the target shape's implicit field as a superposition of solutions. Our method shows improved results in approximating high-frequency details, even with a small number of shape priors.
Chinese Translation
隐式形状表示(如SDFs)是一种流行的方法,通过标量场的水平集来恢复三维形状的表面。一些方法利用机器学习策略近似SDFs,这些策略利用了SDFs是Eikonal偏微分方程(PDEs)解的知识。在本研究中,我们提出了一种新颖的表面重建方法,将其编码为代理PDE的解,即泊松方程。然后,我们探讨了泊松方程与物理学之间的联系,例如,由正电荷密度引起的静电势。我们采用格林函数获得PDE解的封闭形式参数表达式,并利用代理PDE的线性特性,将目标形状的隐式场表示为解的叠加。我们的方法在近似高频细节方面显示出改进的结果,即使在形状先验数量较少的情况下。
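The electrostatic analogy above is concrete: the free-space Green's function of Poisson's equation in 3D is 1/(4*pi*r), so the potential of any charge configuration is a superposition of such terms, and linearity lets fields be added term by term. A minimal numeric illustration (toy charges, not the paper's learned shape priors):

```python
import math

def potential(x, charges):
    """Electrostatic potential phi(x) = sum_i q_i / (4*pi*|x - p_i|),
    the free-space Green's function of Poisson's equation in 3D,
    superposed over point charges (position p_i, strength q_i)."""
    total = 0.0
    for (p, q) in charges:
        r = math.dist(x, p)
        total += q / (4.0 * math.pi * r)
    return total

# Linearity: the field of two charges is the sum of the one-charge fields.
c1 = [((0.0, 0.0, 0.0), 1.0)]
c2 = [((2.0, 0.0, 0.0), 0.5)]
x = (1.0, 1.0, 0.0)
phi_sum = potential(x, c1) + potential(x, c2)
phi_joint = potential(x, c1 + c2)
```

It is exactly this linearity that lets the method express a target shape's implicit field as a superposition of closed-form solutions.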
cs.CV / 29 / 2602.11646

Brain Tumor Classifiers Under Attack: Robustness of ResNet Variants Against Transferable FGSM and PGD Attacks

脑肿瘤分类器在攻击下的表现:ResNet变体对可转移FGSM和PGD攻击的鲁棒性
Deem, Ryan, Goodman, Garrett, Majeed, Waqas, Khan, Md Abdullah Al Hafiz, Alexiou, Michail S.
Abstract
Adversarial robustness in deep learning models for brain tumor classification remains an underexplored yet critical challenge, particularly for clinical deployment scenarios involving MRI data. In this work, we investigate the susceptibility and resilience of several ResNet-based architectures, referred to as BrainNet, BrainNeXt and DilationNet, against gradient-based adversarial attacks, namely FGSM and PGD. These models, based on ResNet, ResNeXt, and dilated ResNet variants respectively, are evaluated across three preprocessing configurations (i) full-sized augmented, (ii) shrunk augmented and (iii) shrunk non-augmented MRI datasets. Our experiments reveal that BrainNeXt models exhibit the highest robustness to black-box attacks, likely due to their increased cardinality, though they produce weaker transferable adversarial samples. In contrast, BrainNet and Dilation models are more vulnerable to attacks from each other, especially under PGD with higher iteration steps and $\alpha$ values. Notably, shrunk and non-augmented data significantly reduce model resilience, even when the untampered test accuracy remains high, highlighting a key trade-off between input resolution and adversarial vulnerability. These results underscore the importance of jointly evaluating classification performance and adversarial robustness for reliable real-world deployment in brain MRI analysis.
Chinese Translation
深度学习模型在脑肿瘤分类中的对抗鲁棒性仍然是一个未被充分探索但至关重要的挑战,尤其是在涉及MRI数据的临床部署场景中。在本研究中,我们调查了几种基于ResNet架构的模型(分别称为BrainNet、BrainNeXt和DilationNet)对基于梯度的对抗攻击(即FGSM和PGD)的脆弱性和韧性。这些模型分别基于ResNet、ResNeXt和扩张ResNet变体,在三种预处理配置下进行评估:(i)全尺寸增强数据,(ii)缩小增强数据和(iii)缩小非增强MRI数据集。我们的实验结果表明,BrainNeXt模型在黑箱攻击下表现出最高的鲁棒性,这可能归因于其更高的基数,尽管它们产生的可转移对抗样本较弱。相比之下,BrainNet和Dilation模型在相互攻击时更为脆弱,尤其是在PGD攻击中,随着迭代步数和$\alpha$值的增加。值得注意的是,缩小和非增强数据显著降低了模型的韧性,即使未被篡改的测试准确率仍然较高,这突显了输入分辨率与对抗脆弱性之间的关键权衡。这些结果强调了在脑MRI分析的可靠实际部署中,联合评估分类性能和对抗鲁棒性的重要性。
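FGSM, the weaker of the two attacks studied above, perturbs the input one step in the sign of the loss gradient: x_adv = x + eps * sign(dL/dx). A self-contained sketch using a central-difference gradient on a toy quadratic "loss", so no autodiff framework is needed (the loss and eps are illustrative; PGD simply iterates this step with projection):

```python
def fgsm_step(x, loss_fn, eps=0.1, h=1e-5):
    """One FGSM step: x_adv = x + eps * sign(dL/dx), with the gradient
    estimated by central differences for this framework-free sketch."""
    def sign(v):
        return (v > 0) - (v < 0)
    grad = [(loss_fn(x[:i] + [x[i] + h] + x[i+1:]) -
             loss_fn(x[:i] + [x[i] - h] + x[i+1:])) / (2 * h)
            for i in range(len(x))]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Toy "classifier loss", minimized at the point (2, -1):
loss = lambda x: (x[0] - 2.0) ** 2 + (x[1] + 1.0) ** 2
x0 = [0.0, 0.0]
x_adv = fgsm_step(x0, loss)
```

The step moves the input uphill on the loss surface (here, away from the minimizer), which is what degrades the classifier's prediction.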
cs.CV / 30 / 2602.11653

GR-Diffusion: 3D Gaussian Representation Meets Diffusion in Whole-Body PET Reconstruction

GR-Diffusion:三维高斯表示与扩散模型在全身PET重建中的结合
Geng, Mengxiao, Chen, Zijie, Hong, Ran, Li, Bingxuan, Liu, Qiegen
Abstract
Positron emission tomography (PET) reconstruction is a critical challenge in molecular imaging, often hampered by noise amplification, structural blurring, and detail loss due to sparse sampling and the ill-posed nature of inverse problems. The three-dimensional discrete Gaussian representation (GR), which efficiently encodes 3D scenes using parameterized discrete Gaussian distributions, has shown promise in computer vision. In this work, we propose a novel GR-Diffusion framework that synergistically integrates the geometric priors of GR with the generative power of diffusion models for 3D low-dose whole-body PET reconstruction. GR-Diffusion employs GR to generate a reference 3D PET image from projection data, establishing a physically grounded and structurally explicit benchmark that overcomes the low-pass limitations of conventional point-based or voxel-based methods. This reference image serves as a dual guide during the diffusion process, ensuring both global consistency and local accuracy. Specifically, we employ a hierarchical guidance mechanism based on the GR reference. Fine-grained guidance leverages differences to refine local details, while coarse-grained guidance uses multi-scale difference maps to correct deviations. This strategy allows the diffusion model to sequentially integrate the strong geometric prior from GR and recover sub-voxel information. Experimental results on the UDPET and Clinical datasets with varying dose levels show that GR-Diffusion outperforms state-of-the-art methods in enhancing 3D whole-body PET image quality and preserving physiological details.
Chinese Translation
正电子发射断层成像(PET)重建是分子成像中的一个关键挑战,常常受到噪声放大、结构模糊和由于稀疏采样及逆问题的不适定性导致的细节丢失的困扰。三维离散高斯表示(GR)有效地使用参数化的离散高斯分布编码三维场景,在计算机视觉中显示出良好的前景。在本研究中,我们提出了一种新颖的GR-Diffusion框架,协同整合了GR的几何先验与扩散模型的生成能力,以实现三维低剂量全身PET重建。GR-Diffusion利用GR从投影数据生成参考三维PET图像,建立了一个物理基础和结构明确的基准,克服了传统基于点或体素方法的低通限制。该参考图像在扩散过程中作为双重指导,确保了全局一致性和局部准确性。具体而言,我们采用基于GR参考的分层指导机制。细粒度指导利用差异来细化局部细节,而粗粒度指导则使用多尺度差异图来修正偏差。这一策略使得扩散模型能够顺序整合来自GR的强几何先验,并恢复亚体素信息。在UDPET和临床数据集上进行的实验结果表明,GR-Diffusion在提升三维全身PET图像质量和保持生理细节方面优于最先进的方法。
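The fine/coarse dual guidance described in the abstract can be illustrated with a toy 2D update step — a hedged sketch only, with made-up weights `fine_w`/`coarse_w` and average pooling standing in for the paper's multi-scale difference maps:

```python
import numpy as np

def avg_pool(img, s):
    """Non-overlapping s x s block averaging of a 2D array."""
    h, w = img.shape
    return img[:h // s * s, :w // s * s].reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def guided_step(x, ref, fine_w=0.2, coarse_w=0.1, s=2):
    """One guidance step: a voxel-wise (fine) difference plus an upsampled
    pooled (coarse) difference map, both pulling x toward the reference."""
    fine = ref - x
    coarse = np.repeat(np.repeat(avg_pool(ref, s) - avg_pool(x, s), s, 0), s, 1)
    return x + fine_w * fine + coarse_w * coarse
```

The fine term sharpens local detail while the coarse term corrects large-scale deviations, mirroring the hierarchical mechanism the authors describe.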
cs.CV / 31 / 2602.11656

SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving

SToRM:面向高效端到端自主驾驶的多模态大语言模型的监督性令牌减少
Kim, Seo Hyun, Park, Jin Bok, Koo, Do Yeon, Park, Ho Gun, Chun, Il Yong
Abstract
In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, this approach requires substantial computational resources due to its reliance on an LLM and numerous visual tokens from sensor inputs, which are limited in autonomous vehicles. Many MLLM studies have explored reducing visual tokens, but often suffer end-task performance degradation compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens, and merges context tokens into relevant anchors to reduce redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x.
Chinese Translation
在自主驾驶中,端到端(E2E)驾驶系统通过直接从传感器数据预测控制命令取得了显著进展。为了在意外场景中安全驾驶,这些系统可能还依赖于人类干预,例如自然语言指令。使用多模态大语言模型(MLLM)促进人车交互,并能在此类场景中提高性能。然而,由于依赖于大语言模型和来自传感器输入的众多视觉令牌,这种方法需要大量计算资源,而这些资源在自主车辆中是有限的。许多MLLM研究探讨了减少视觉令牌的方法,但与使用所有令牌相比,往往会导致最终任务性能下降。为了在保持与使用所有令牌相当的性能的同时实现高效的E2E驾驶,本文提出了首个针对多模态大语言模型的监督性令牌减少框架(SToRM)。该框架由三个关键元素组成。首先,一个轻量级的重要性预测器通过短期滑动窗口估计令牌的重要性分数。其次,监督训练方法使用辅助路径从全令牌大语言模型传递中获取伪监督信号。第三,一个锚点-上下文合并模块将令牌分为锚点和上下文令牌,并将上下文令牌合并到相关的锚点中,以减少冗余,同时最小化信息损失。在LangAuto基准上的实验表明,SToRM在相同的减少令牌预算下超越了最先进的E2E驾驶MLLM,保持了全令牌性能,同时将计算成本降低了多达30倍。
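The anchor-context merging idea — keep the top-scoring tokens as anchors and fold the rest into their most similar anchor — can be sketched generically (this is an illustrative reduction by feature averaging, not SToRM's actual module):

```python
import numpy as np

def reduce_tokens(tokens, scores, k):
    """Split tokens into top-k anchors and context tokens, then merge each
    context token into its most similar anchor via a running mean."""
    order = np.argsort(scores)[::-1]          # most important first
    anchor_idx, context_idx = order[:k], order[k:]
    anchors = tokens[anchor_idx].copy()
    counts = np.ones(k)
    for i in context_idx:
        sims = anchors @ tokens[i]            # dot-product similarity
        j = int(np.argmax(sims))
        anchors[j] = (anchors[j] * counts[j] + tokens[i]) / (counts[j] + 1)
        counts[j] += 1
    return anchors
```

The output keeps only `k` tokens for the LLM pass, which is where the reported up-to-30x compute reduction would come from.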
cs.CV / 32 / 2602.11658

EmoSpace: Fine-Grained Emotion Prototype Learning for Immersive Affective Content Generation

EmoSpace:用于沉浸式情感内容生成的细粒度情感原型学习
Wang, Bingyuan, Chen, Xingbei, Qiu, Zongyang, Yuan, Linping, Wang, Zeyu
Abstract
Emotion is important for creating compelling virtual reality (VR) content. Although some generative methods have been applied to lower the barrier to creating emotionally rich content, they fail to capture the nuanced emotional semantics and the fine-grained control essential for immersive experiences. To address these limitations, we introduce EmoSpace, a novel framework for emotion-aware content generation that learns dynamic, interpretable emotion prototypes through vision-language alignment. We employ a hierarchical emotion representation with rich learnable prototypes that evolve during training, enabling fine-grained emotional control without requiring explicit emotion labels. We develop a controllable generation pipeline featuring multi-prototype guidance, temporal blending, and attention reweighting that supports diverse applications, including emotional image outpainting, stylized generation, and emotional panorama generation for VR environments. Our experiments demonstrate the superior performance of EmoSpace over existing methods in both qualitative and quantitative evaluations. Additionally, we present a comprehensive user study investigating how VR environments affect emotional perception compared to desktop settings. Our work facilitates immersive visual content generation with fine-grained emotion control and supports applications like therapy, education, storytelling, artistic creation, and cultural preservation. Code and models will be made publicly available.
Chinese Translation
情感在创造引人入胜的虚拟现实(VR)内容中至关重要。尽管一些生成方法已被应用于降低创建情感丰富内容的门槛,但它们未能捕捉到细微的情感语义和沉浸体验所必需的细粒度控制。为了解决这些局限性,我们提出了EmoSpace,一个新颖的情感感知内容生成框架,通过视觉-语言对齐学习动态、可解释的情感原型。我们采用了层次化的情感表示,具有丰富的可学习原型,这些原型在训练过程中不断演变,使得在不需要显式情感标签的情况下实现细粒度的情感控制。我们开发了一个可控的生成管道,具有多原型引导、时间混合和注意力重加权,支持多种应用,包括情感图像扩展、风格化生成和用于VR环境的情感全景生成。我们的实验表明,EmoSpace在定性和定量评估中均优于现有方法。此外,我们还进行了一项全面的用户研究,探讨VR环境如何影响情感感知,与桌面环境相比。我们的工作促进了具有细粒度情感控制的沉浸式视觉内容生成,并支持治疗、教育、叙事、艺术创作和文化保护等应用。代码和模型将公开发布。
cs.CV / 33 / 2602.11660

Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes

Clutt3R-Seg:用于杂乱场景中语言驱动抓取的稀疏视图3D实例分割
Noh, Jeongho, Rhee, Tai Hyoung, Lee, Eunho, Kim, Jeongyun, Lee, Sunwoo, Kim, Ayoung
Abstract
Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation for language-grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi-stage tasks, we further introduce a consistency-aware update that preserves instance correspondences from only a single post-interaction image, allowing efficient adaptation without rescanning. Clutt3R-Seg is evaluated on both synthetic and real-world datasets, and validated on a real robot. Across all settings, it consistently outperforms state-of-the-art baselines in cluttered and sparse-view scenarios. Even on the most challenging heavy-clutter sequences, Clutt3R-Seg achieves an AP@25 of 61.66, over 2.2x higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2x. The code is available at: https://github.com/jeonghonoh/clutt3r-seg.
Chinese Translation
可靠的3D实例分割是语言驱动机器人操作的基础。其关键应用在于杂乱环境中,在这些环境中,遮挡、有限的视角和噪声掩膜会降低感知效果。为了解决这些挑战,我们提出了Clutt3R-Seg,这是一种零样本管道,用于在杂乱场景中进行稳健的3D实例分割,以支持语言驱动的抓取。我们的关键思想是引入一个层次化的实例树,利用语义线索。与之前尝试精炼噪声掩膜的方法不同,我们的方法将其作为信息线索:通过跨视图分组和条件替代,该树抑制了过分割和欠分割,产生视图一致的掩膜和稳健的3D实例。每个实例都通过开放词汇的语义嵌入进行丰富,使得能够从自然语言指令中准确选择目标。为了处理多阶段任务中的场景变化,我们进一步引入了一种一致性感知更新,仅从单个交互后图像中保留实例对应关系,从而实现高效适应而无需重新扫描。Clutt3R-Seg在合成和真实世界数据集上进行了评估,并在真实机器人上进行了验证。在所有设置中,它在杂乱和稀疏视图场景中始终优于最先进的基线。即使在最具挑战性的重杂乱序列中,Clutt3R-Seg的AP@25达到了61.66,比基线高出超过2.2倍,并且仅用四个输入视图就超过了使用八个视图的MaskClustering,提升幅度超过2倍。代码可在以下链接获取:https://github.com/jeonghonoh/clutt3r-seg。
cs.CV / 34 / 2602.11669

Egocentric Gaze Estimation via Neck-Mounted Camera

基于颈部摄像头的自我中心注视估计
Huang, Haoyu, Sato, Yoichi
Abstract
This paper introduces neck-mounted view gaze estimation, a new task that estimates user gaze from the neck-mounted camera perspective. Prior work on egocentric gaze estimation, which predicts device wearer's gaze location within the camera's field of view, mainly focuses on head-mounted cameras while alternative viewpoints remain underexplored. To bridge this gap, we collect the first dataset for this task, consisting of approximately 4 hours of video collected from 8 participants during everyday activities. We evaluate a transformer-based gaze estimation model, GLC, on the new dataset and propose two extensions: an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach that jointly trains head-view and neck-view models using a geometry-aware auxiliary loss. Experimental results show that incorporating gaze out-of-bound classification improves performance over standard fine-tuning, while the co-learning approach does not yield gains. We further analyze these results and discuss implications for neck-mounted gaze estimation.
Chinese Translation
本文介绍了一种基于颈部摄像头的视线估计新任务,该任务旨在从颈部摄像头的视角估计用户的注视位置。以往关于自我中心注视估计的研究主要集中在头戴式摄像头上,预测设备佩戴者在摄像头视野内的注视位置,而其他视角的研究仍然较少。为填补这一空白,我们收集了首个针对该任务的数据集,该数据集包含约4小时的视频,记录了8名参与者在日常活动中的表现。我们在新数据集上评估了一种基于变换器的注视估计模型GLC,并提出了两个扩展:一个辅助的注视越界分类任务和一种多视角共同学习方法,该方法利用几何感知辅助损失共同训练头视图和颈视图模型。实验结果表明,结合注视越界分类的方式在性能上优于标准的微调方法,而共同学习方法未能带来性能提升。我们进一步分析了这些结果,并讨论了对颈部注视估计的影响。
cs.CV / 35 / 2602.11672

U-Net with Hadamard Transform and DCT Latent Spaces for Next-day Wildfire Spread Prediction

基于哈达玛变换和离散余弦变换潜在空间的 U-Net 模型用于次日野火传播预测
Luo, Yingyi, Rong, Shuaiang, Watts, Adam, Cetin, Ahmet Enis
Abstract
We developed a lightweight and computationally efficient tool for next-day wildfire spread prediction using multimodal satellite data as input. The deep learning model, which we call Transform Domain Fusion UNet (TD-FusionUNet), incorporates trainable Hadamard Transform and Discrete Cosine Transform layers that apply two-dimensional transforms, enabling the network to capture essential "frequency" components in orthogonalized latent spaces. Additionally, we introduce custom preprocessing techniques, including random margin cropping and a Gaussian mixture model, to enrich the representation of the sparse pre-fire masks and enhance the model's generalization capability. The TD-FusionUNet is evaluated on two datasets: the Next-Day Wildfire Spread dataset released by Google Research in 2023 and the WildfireSpreadTS dataset. Our proposed TD-FusionUNet achieves an F1 score of 0.591 with 370k parameters, outperforming the UNet baseline using ResNet18 as the encoder reported in the WildfireSpreadTS dataset while using substantially fewer parameters. These results show that the proposed latent space fusion model balances accuracy and efficiency under a lightweight setting, making it suitable for real-time wildfire prediction applications in resource-limited environments.
Chinese Translation
我们开发了一种轻量级且计算高效的工具,用于次日野火传播预测,输入为多模态卫星数据。我们称之为变换域融合 U-Net (Transform Domain Fusion UNet, TD-FusionUNet) 的深度学习模型,结合了可训练的哈达玛变换和离散余弦变换层,这些层应用二维变换,使网络能够捕捉正交潜在空间中的重要“频率”成分。此外,我们引入了自定义预处理技术,包括随机边缘裁剪和高斯混合模型,以丰富稀疏火灾前掩膜的表示,并增强模型的泛化能力。TD-FusionUNet 在两个数据集上进行了评估,分别是 Google Research 在 2023 年发布的次日野火传播数据集和 WildfireSpreadTS 数据集。我们提出的 TD-FusionUNet 在参数量为 370k 的情况下,取得了 0.591 的 F1 分数,超越了在 WildfireSpreadTS 数据集中使用 ResNet18 作为编码器的 UNet 基线,同时使用的参数显著更少。这些结果表明,所提出的潜在空间融合模型在轻量级设置下平衡了准确性和效率,使其适合在资源有限的环境中进行实时野火预测应用。
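The 2D Hadamard transform at the heart of the latent layers above has a simple recursive (Sylvester) construction; a minimal numpy sketch of the fixed transform, before any trainable weighting is added:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def hadamard2d(x):
    """2D Hadamard transform of a square block: H @ x @ H^T."""
    H = hadamard(x.shape[0])
    return H @ x @ H.T
```

Because `H @ H.T = n * I`, the transform is orthogonal up to scale, which is what gives the orthogonalized "frequency" latent space described in the abstract — applying it twice and dividing by `n**2` recovers the input.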
cs.CV / 36 / 2602.11673

RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval

RI-Mamba:一种用于鲁棒文本到形状检索的旋转不变Mamba模型
Nguyen, Khanh, Edirimuni, Dasith de Silva, Hassan, Ghulam Mubashar, Mian, Ajmal
Abstract
3D assets have rapidly expanded in quantity and diversity due to the growing popularity of virtual reality and gaming. As a result, text-to-shape retrieval has become essential in facilitating intuitive search within large repositories. However, existing methods require canonical poses and support few object categories, limiting their real-world applicability where objects can belong to diverse classes and appear in random orientations. To address this challenge, we propose RI-Mamba, the first rotation-invariant state-space model for point clouds. RI-Mamba defines global and local reference frames to disentangle pose from geometry and uses Hilbert sorting to construct token sequences with meaningful geometric structure while maintaining rotation invariance. We further introduce a novel strategy to compute orientational embeddings and reintegrate them via feature-wise linear modulation, effectively recovering spatial context and enhancing model expressiveness. Our strategy is inherently compatible with state-space models and operates in linear time. To scale up retrieval, we adopt cross-modal contrastive learning with automated triplet generation, allowing training on diverse datasets without manual annotation. Extensive experiments demonstrate RI-Mamba's superior representational capacity and robustness, achieving state-of-the-art performance on the OmniObject3D benchmark across more than 200 object categories under arbitrary orientations. Our code will be made available at https://github.com/ndkhanh360/RI-Mamba.git.
Chinese Translation
由于虚拟现实和游戏的日益普及,3D资产的数量和多样性迅速增加。因此,文本到形状的检索在促进大型库中的直观搜索方面变得至关重要。然而,现有方法需要标准姿态,并且支持的物体类别较少,这限制了它们在现实世界中的适用性,因为物体可以属于不同类别并以随机方向出现。为了解决这一挑战,我们提出了RI-Mamba,这是第一个用于点云的旋转不变状态空间模型。RI-Mamba定义了全局和局部参考框架,以将姿态与几何形状分离,并使用Hilbert排序构建具有有意义几何结构的标记序列,同时保持旋转不变性。我们进一步引入了一种新颖的策略来计算方向嵌入,并通过特征线性调制重新整合它们,有效恢复空间上下文并增强模型的表现力。我们的策略与状态空间模型本质上兼容,并以线性时间运行。为了扩大检索规模,我们采用跨模态对比学习与自动三元组生成,允许在多样化数据集上进行训练而无需手动标注。大量实验表明,RI-Mamba在表示能力和鲁棒性方面优于其他方法,在任意方向下在超过200个物体类别的OmniObject3D基准测试中实现了最先进的性能。我们的代码将发布在 https://github.com/ndkhanh360/RI-Mamba.git。
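The rotation-invariance principle RI-Mamba builds on — describe geometry relative to the cloud itself rather than to world axes — can be demonstrated with a deliberately simple invariant (sorted centroid distances; this is an illustration of the property, not the paper's reference-frame construction):

```python
import numpy as np

def rotation_invariant_signature(points):
    """Sorted distances to the centroid: unchanged under any rotation
    (or reflection) of the point cloud."""
    centered = points - points.mean(axis=0)
    return np.sort(np.linalg.norm(centered, axis=1))
```

Any orthogonal transform preserves these norms, so the signature is identical for a shape in canonical pose and in a random orientation — the setting the abstract targets.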
cs.CV / 37 / 2602.11703

Semantically Conditioned Diffusion Models for Cerebral DSA Synthesis

语义条件扩散模型用于脑血管数字减影血管造影合成
Xu, Qiwen, Rügamer, David, Wenz, Holger, Fontana, Johann, Meggyeshazi, Nora, Bender, Andreas, Maros, Máté E.
Abstract
Digital subtraction angiography (DSA) plays a central role in the diagnosis and treatment of cerebrovascular disease, yet its invasive nature and high acquisition cost severely limit large-scale data collection and public data sharing. Therefore, we developed a semantically conditioned latent diffusion model (LDM) that synthesizes arterial-phase cerebral DSA frames under explicit control of anatomical circulation (anterior vs. posterior) and canonical C-arm positions. We curated a large single-centre DSA dataset of 99,349 frames and trained a conditional LDM using text embeddings that encoded anatomy and acquisition geometry. To assess clinical realism, four medical experts, including two neuroradiologists, one neurosurgeon, and one internal medicine expert, systematically rated 400 synthetic DSA images using a 5-grade Likert scale for evaluating proximal large, medium, and small peripheral vessels. The generated images achieved image-wise overall Likert scores ranging from 3.1 to 3.3, with high inter-rater reliability (ICC(2,k) = 0.80--0.87). Distributional similarity to real DSA frames was supported by a low median Fréchet inception distance (FID) of 15.27. Our results indicate that semantically controlled LDMs can produce realistic synthetic DSAs suitable for downstream algorithm development, research, and training.
Chinese Translation
数字减影血管造影(DSA)在脑血管疾病的诊断和治疗中发挥着核心作用,但其侵入性特征和高昂的获取成本严重限制了大规模数据收集和公共数据共享。因此,我们开发了一种语义条件潜在扩散模型(LDM),在对解剖循环(前循环与后循环)和标准C臂位置进行明确控制的情况下合成动脉期脑DSA图像。我们整理了一个包含99,349帧的大型单中心DSA数据集,并使用编码解剖结构和获取几何的文本嵌入训练了条件LDM。为了评估临床现实性,四位医学专家(包括两名神经放射科医生、一名神经外科医生和一名内科专家)系统地使用5级李克特量表对400幅合成的DSA图像进行了评分,以评估近端大、中、小外周血管。生成的图像在图像级别的整体李克特评分范围为3.1至3.3,且具有较高的评分者间可靠性(ICC(2,k) = 0.80--0.87)。与真实DSA图像的分布相似性得到了较低的中位Fréchet Inception距离(FID)15.27的支持。我们的结果表明,语义控制的LDM能够生成适合下游算法开发、研究和训练的逼真合成DSA图像。
cs.CV / 38 / 2602.11705

TG-Field: Geometry-Aware Radiative Gaussian Fields for Tomographic Reconstruction

TG-Field:用于断层重建的几何感知辐射高斯场
Zhong, Yuxiang, Wei, Jun, Chen, Chaoqi, An, Senyou, Huang, Hui
Abstract
3D Gaussian Splatting (3DGS) has revolutionized 3D scene representation with superior efficiency and quality. While recent adaptations for computed tomography (CT) show promise, they struggle with severe artifacts under highly sparse-view projections and dynamic motions. To address these challenges, we propose Tomographic Geometry Field (TG-Field), a geometry-aware Gaussian deformation framework tailored for both static and dynamic CT reconstruction. A multi-resolution hash encoder is employed to capture local spatial priors, regularizing primitive parameters under ultra-sparse settings. We further extend the framework to dynamic reconstruction by introducing time-conditioned representations and a spatiotemporal attention block to adaptively aggregate features, thereby resolving spatiotemporal ambiguities and enforcing temporal coherence. In addition, a motion-flow network models fine-grained respiratory motion to track local anatomical deformations. Extensive experiments on synthetic and real-world datasets demonstrate that TG-Field consistently outperforms existing methods, achieving state-of-the-art reconstruction accuracy under highly sparse-view conditions.
Chinese Translation
3D 高斯喷溅(3DGS)以其卓越的效率和质量彻底改变了 3D 场景表示。尽管最近针对计算机断层扫描(CT)的改编显示出良好的前景,但在高度稀疏视图投影和动态运动下,它们仍然面临严重的伪影问题。为了解决这些挑战,我们提出了断层几何场(TG-Field),这是一种针对静态和动态 CT 重建的几何感知高斯变形框架。该框架采用多分辨率哈希编码器来捕捉局部空间先验,在超稀疏设置下对原始参数进行正则化。我们进一步通过引入时间条件表示和时空注意力模块来扩展该框架,以自适应地聚合特征,从而解决时空模糊并强制执行时间一致性。此外,运动流网络用于建模细粒度的呼吸运动,以跟踪局部解剖变形。在合成和真实世界数据集上的大量实验表明,TG-Field 在高度稀疏视图条件下始终优于现有方法,实现了最先进的重建精度。
cs.CV / 39 / 2602.11706

LLM-Driven 3D Scene Generation of Agricultural Simulation Environments

基于大型语言模型的农业模拟环境三维场景生成
Yoncalik, Arafa, Jansen, Wouter, Huebel, Nico, Rahmani, Mohammad Hasan, Steckel, Jan
Abstract
Procedural generation techniques in 3D rendering engines have revolutionized the creation of complex environments, reducing reliance on manual design. Recent approaches using Large Language Models (LLMs) for 3D scene generation show promise but often lack domain-specific reasoning, verification mechanisms, and modular design. These limitations lead to reduced control and poor scalability. This paper investigates the use of LLMs to generate agricultural synthetic simulation environments from natural language prompts, specifically to address the limitations of lacking domain-specific reasoning, verification mechanisms, and modular design. A modular multi-LLM pipeline was developed, integrating 3D asset retrieval, domain knowledge injection, and code generation for the Unreal rendering engine using its API. This results in a 3D environment with realistic planting layouts and environmental context, all based on the input prompt and the domain knowledge. To enhance accuracy and scalability, the system employs a hybrid strategy combining LLM optimization techniques such as few-shot prompting, Retrieval-Augmented Generation (RAG), finetuning, and validation. Unlike monolithic models, the modular architecture enables structured data handling, intermediate verification, and flexible expansion. The system was evaluated using structured prompts and semantic accuracy metrics. A user study assessed realism and familiarity against real-world images, while an expert comparison demonstrated significant time savings over manual scene design. The results confirm the effectiveness of multi-LLM pipelines in automating domain-specific 3D scene generation with improved reliability and precision. Future work will explore expanding the asset hierarchy, incorporating real-time generation, and adapting the pipeline to other simulation domains beyond agriculture.
Chinese Translation
三维渲染引擎中的程序生成技术彻底改变了复杂环境的创建,减少了对手动设计的依赖。最近使用大型语言模型(LLMs)进行三维场景生成的方法显示出良好的前景,但通常缺乏特定领域的推理、验证机制和模块化设计。这些局限性导致了控制能力降低和扩展性差。本文研究了利用LLMs从自然语言提示生成农业合成模拟环境,特别是为了解决缺乏特定领域推理、验证机制和模块化设计的局限性。我们开发了一个模块化的多LLM管道,集成了三维资产检索、领域知识注入和使用Unreal渲染引擎API的代码生成。这使得生成的三维环境具有基于输入提示和领域知识的真实种植布局和环境背景。为了提高准确性和可扩展性,该系统采用了混合策略,结合了LLM优化技术,如少量提示、检索增强生成(RAG)、微调和验证。与单一模型不同,模块化架构能够实现结构化数据处理、中间验证和灵活扩展。该系统使用结构化提示和语义准确性指标进行了评估。用户研究评估了与真实图像的真实感和熟悉度,而专家比较显示出相较于手动场景设计显著节省了时间。结果确认了多LLM管道在自动化特定领域三维场景生成中的有效性,且提高了可靠性和精确度。未来的工作将探讨扩展资产层次结构、纳入实时生成以及将管道适应于农业以外的其他模拟领域。
cs.CV / 40 / 2602.11714

GSO-SLAM: Bidirectionally Coupled Gaussian Splatting and Direct Visual Odometry

GSO-SLAM:双向耦合的高斯点云和直接视觉里程计
Yeon, Jiung, Ha, Seongbo, Yu, Hyeonwoo
Abstract
We propose GSO-SLAM, a real-time monocular dense SLAM system that leverages Gaussian scene representation. Unlike existing methods that couple tracking and mapping with a unified scene, incurring computational costs, or loosely integrate them with well-structured tracking frameworks, introducing redundancies, our method bidirectionally couples Visual Odometry (VO) and Gaussian Splatting (GS). Specifically, our approach formulates joint optimization within an Expectation-Maximization (EM) framework, enabling the simultaneous refinement of VO-derived semi-dense depth estimates and the GS representation without additional computational overhead. Moreover, we present Gaussian Splat Initialization, which utilizes image information, keyframe poses, and pixel associations from VO to produce close approximations to the final Gaussian scene, thereby eliminating the need for heuristic methods. Through extensive experiments, we validate the effectiveness of our method, showing that it not only operates in real time but also achieves state-of-the-art geometric/photometric fidelity of the reconstructed scene and tracking accuracy.
Chinese Translation
我们提出了GSO-SLAM,一种实时单目稠密SLAM系统,利用高斯场景表示。与现有方法通过统一场景耦合跟踪和映射,导致计算成本,或通过结构良好的跟踪框架松散集成,产生冗余不同,我们的方法双向耦合视觉里程计(Visual Odometry, VO)和高斯点云(Gaussian Splatting, GS)。具体而言,我们的方法在期望最大化(Expectation-Maximization, EM)框架内形式化了联合优化,能够同时精炼VO导出的半稠密深度估计和GS表示,而无需额外的计算开销。此外,我们提出了高斯点云初始化(Gaussian Splat Initialization),利用图像信息、关键帧姿态和来自VO的像素关联生成接近最终高斯场景的近似,从而消除了对启发式方法的需求。通过大量实验,我们验证了我们方法的有效性,显示其不仅能够实时运行,还能实现重建场景和跟踪精度的最先进几何/光度保真度。
cs.CV / 41 / 2602.11730

STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning

STVG-R1:通过强化学习激励实例级推理和视频中的定位
Zhang, Xiaowen, Gao, Zhi, Jiao, Licheng, Li, Lingling, Li, Qing
Abstract
In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a remarkable margin of 20.9% on m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to multi-object referring video object segmentation tasks, achieving a SOTA 47.3% J&F on MeViS.
Chinese Translation
在视觉-语言模型(VLMs)中,文本描述与视觉坐标之间的不对齐常常导致幻觉现象。这个问题在空间-时间视频定位(STVG)等密集预测任务中尤为严重。以往的方法通常侧重于增强视觉与文本的对齐或附加辅助解码器。然而,这些策略不可避免地引入了额外的可训练模块,导致显著的标注成本和计算开销。在本研究中,我们提出了一种新颖的视觉提示范式,避免了跨模态对齐坐标的困难问题。具体而言,我们将每帧的坐标预测重新表述为一个紧凑的实例级识别问题,通过为每个对象分配一个独特且时间一致的ID。这些ID被嵌入到视频中作为视觉提示,为VLMs提供明确且可解释的输入。此外,我们引入了STVG-R1,这是第一个用于STVG的强化学习框架,采用任务驱动的奖励来联合优化时间准确性、空间一致性和结构格式正则化。在六个基准测试上的广泛实验表明了我们方法的有效性。STVG-R1在HCSTVG-v2基准测试中在m_IoU上超过了基线Qwen2.5-VL-7B,取得了显著的20.9%的提升,确立了新的最先进水平(SOTA)。令人惊讶的是,STVG-R1在多对象指代视频对象分割任务中也表现出强大的零样本泛化能力,在MeViS上达到了47.3%的SOTA J&F。
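The key reformulation — replacing per-frame coordinate regression with temporally consistent instance IDs — presupposes a tracker that keeps IDs stable across frames. A hedged, generic sketch of such ID assignment via greedy IoU matching (not the paper's pipeline; `thr` and the matching rule are illustrative):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def assign_ids(frames, thr=0.5):
    """Boxes that overlap a previous-frame box keep its ID; others get a
    fresh one -- yielding temporally consistent instance IDs per frame."""
    next_id, prev, out = 0, [], []
    for boxes in frames:
        ids = []
        for b in boxes:
            best = max(prev, key=lambda p: iou(p[1], b), default=None)
            if best and iou(best[1], b) >= thr:
                ids.append(best[0])
            else:
                ids.append(next_id)
                next_id += 1
        prev = list(zip(ids, boxes))
        out.append(ids)
    return out
```

Once every object carries a stable ID, the IDs can be rendered into the frames as visual prompts, so the VLM only has to name an ID instead of emitting coordinates.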
cs.CV / 42 / 2602.11733

Adapting Vision-Language Models for E-commerce Understanding at Scale

针对大规模电子商务理解的视觉-语言模型适应
Nulli, Matteo, Orshulevich, Vladimir, Bazazo, Tala, Herold, Christian, Kozielski, Michael, Mazur, Marcin, Tuzel, Szymon, Snoek, Cees G. M., Hashemi, Seyyed Hadi, Javed, Omar, Versley, Yannick, Khadivi, Shahram
Abstract
E-commerce product understanding demands, by its nature, strong multimodal comprehension from text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data, without sacrificing general performance. In this work, we show through a large-scale experimental study, how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.
Chinese Translation
电子商务产品理解本质上需要强大的多模态理解能力,包括文本、图像和结构化属性。通用视觉-语言模型(VLMs)能够实现可泛化的多模态潜在建模,但目前尚无文献记录的、广为人知的策略来将其适应于以属性为中心、多图像和噪声特征的电子商务数据,同时不牺牲整体性能。在本研究中,我们通过大规模实验研究展示了如何有针对性地适应通用VLMs,以显著提高电子商务性能,同时保留广泛的多模态能力。此外,我们提出了一套新的广泛评估工具,涵盖深度产品理解、严格的指令遵循和动态属性提取。
cs.CV / 43 / 2602.11737

Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding

掩盖重要信息:通过对象对齐的视觉对比解码减轻多模态大型语言模型中的对象幻觉
Chen, Boqi, Liu, Xudong, Qiu, Jianing
Abstract
We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computation overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
Chinese Translation
我们研究了多模态大型语言模型(MLLMs)中的对象幻觉,并通过构建对象对齐的辅助视图来改进视觉对比解码(VCD)。我们利用自监督视觉变换器中的对象中心注意力。具体而言,我们去除最显著的视觉证据,以构建一个破坏不支持标记的辅助视图,从而产生更强的对比信号。我们的方法与提示无关、与模型无关,并且可以无缝地集成到现有的VCD管道中,计算开销很小,即仅需一次可缓存的前向传播。从实证上看,我们的方法在两个流行的对象幻觉基准上对两个MLLMs均表现出一致的提升。
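The logit adjustment used by visual contrastive decoding has a standard closed form — amplify what the full view supports and subtract what the degraded auxiliary view still predicts. A minimal sketch (the `alpha` default is illustrative):

```python
import numpy as np

def contrastive_logits(logits_full, logits_aux, alpha=1.0):
    """VCD-style adjustment: (1 + alpha) * full-view logits minus
    alpha * auxiliary-view logits, demoting tokens the masked/degraded
    view predicts without visual evidence."""
    return (1.0 + alpha) * np.asarray(logits_full) - alpha * np.asarray(logits_aux)
```

The method above changes only how `logits_aux` is produced — by masking the most salient object evidence — so the contrast signal targets exactly the tokens prone to hallucination.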
cs.CV / 44 / 2602.11743

Adaptive Debiasing Tsallis Entropy for Test-Time Adaptation

自适应去偏置Tsallis熵用于测试时适应
Wu, Xiangyu, Jiang, Dongming, Yu, Feng, Tian, Yueying, Tang, Jiaqi, Chen, Qing-Guo, Yang, Yang, Lu, Jianfeng
Abstract
Mainstream Test-Time Adaptation (TTA) methods for adapting vision-language models, e.g., CLIP, typically rely on Shannon Entropy (SE) at test time to measure prediction uncertainty and inconsistency. However, since CLIP has a built-in bias from pretraining on highly imbalanced web-crawled data, SE inevitably results in producing biased estimates of uncertainty entropy. To address this issue, we notably find and demonstrate that Tsallis Entropy (TE), a generalized form of SE, is naturally suited for characterizing biased distributions by introducing a non-extensive parameter q, with the performance of SE serving as a lower bound for TE. Building upon this, we generalize TE into Adaptive Debiasing Tsallis Entropy (ADTE) for TTA, customizing a class-specific parameter q^l derived by normalizing the estimated label bias from continuously incoming test instances, for each category. This adaptive approach allows ADTE to accurately select high-confidence views and seamlessly integrate with a label adjustment strategy to enhance adaptation, without introducing distribution-specific hyperparameter tuning. Besides, our investigation reveals that both TE and ADTE can serve as direct, advanced alternatives to SE in TTA, without any other modifications. Experimental results show that ADTE outperforms state-of-the-art methods on ImageNet and its five variants, and achieves the highest average performance on 10 cross-domain benchmarks, regardless of the model architecture or text prompts used. Our code is available at https://github.com/Jinx630/ADTE.
Chinese Translation
用于适应视觉-语言模型(例如CLIP)的主流测试时适应(Test-Time Adaptation, TTA)方法,通常依赖于测试时的香农熵(Shannon Entropy, SE)来测量预测的不确定性和不一致性。然而,由于CLIP在高度不平衡的网络爬取数据上进行预训练而内置了偏差,SE不可避免地导致产生偏倚的不确定性熵估计。为了解决这个问题,我们显著发现并证明,Tsallis熵(Tsallis Entropy, TE)作为SE的广义形式,自然适合通过引入非广延参数q来表征偏倚分布,且SE的性能为TE提供了下界。在此基础上,我们将TE推广为自适应去偏置Tsallis熵(Adaptive Debiasing Tsallis Entropy, ADTE)用于TTA,为每个类别定制一个通过对持续到来的测试实例的估计标签偏差进行归一化而得出的类特定参数q^l。这种自适应方法使得ADTE能够准确选择高置信度的视图,并与标签调整策略无缝集成,以增强适应性,而无需引入特定于分布的超参数调优。此外,我们的研究表明,TE和ADTE都可以作为TTA中SE的直接高级替代方案,而无需其他修改。实验结果表明,ADTE在ImageNet及其五个变体上超越了最先进的方法,并在10个跨领域基准测试中实现了最高的平均性能,无论使用何种模型架构或文本提示。我们的代码可在https://github.com/Jinx630/ADTE获取。
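Tsallis entropy has the well-known closed form $S_q(p) = (1 - \sum_i p_i^q)/(q - 1)$ and recovers Shannon entropy as $q \to 1$; a minimal numpy sketch of just these two quantities (the paper's per-class adaptive $q^l$ is not reproduced here):

```python
import numpy as np

def shannon_entropy(p, eps=1e-12):
    """Shannon entropy H(p) = -sum_i p_i log p_i."""
    return -np.sum(p * np.log(p + eps))

def tsallis_entropy(p, q):
    """Tsallis entropy S_q = (1 - sum_i p_i^q) / (q - 1); falls back to
    Shannon entropy in the q -> 1 limit."""
    if abs(q - 1.0) < 1e-8:
        return shannon_entropy(p)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)
```

Varying `q` reweights how heavily low-probability classes contribute, which is the knob ADTE turns per class to counteract CLIP's label bias.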
cs.CV / 45 / 2602.11757

Code2Worlds: Empowering Coding LLMs for 4D World Generation

Code2Worlds:赋能编码大型语言模型以生成四维世界
Zhang, Yi, Wang, Yunshuang, Zhang, Zeyu, Tang, Hao
Abstract
Achieving spatial intelligence requires moving beyond visual plausibility to build world simulators grounded in physical laws. While coding LLMs have advanced static 3D scene generation, extending this paradigm to 4D dynamics remains a critical frontier. This task presents two fundamental challenges: multi-scale context entanglement, where monolithic generation fails to balance local object structures with global environmental layouts; and a semantic-physical execution gap, where open-loop code generation leads to physical hallucinations lacking dynamic fidelity. We introduce Code2Worlds, a framework that formulates 4D generation as language-to-simulation code generation. First, we propose a dual-stream architecture that disentangles retrieval-augmented object generation from hierarchical environmental orchestration. Second, to ensure dynamic fidelity, we establish a physics-aware closed-loop mechanism in which a PostProcess Agent scripts dynamics, coupled with a VLM-Motion Critic that performs self-reflection to iteratively refine simulation code. Evaluations on the Code4D benchmark show Code2Worlds outperforms baselines with a 41% SGS gain and 49% higher Richness, while uniquely generating physics-aware dynamics absent in prior static methods. Code: https://github.com/AIGeeksGroup/Code2Worlds. Website: https://aigeeksgroup.github.io/Code2Worlds.
Chinese Translation
实现空间智能需要超越视觉可信度,构建基于物理法则的世界模拟器。尽管编码大型语言模型在静态三维场景生成方面取得了进展,但将这一范式扩展到四维动态仍然是一个关键前沿任务。该任务面临两个基本挑战:多尺度上下文纠缠,其中单一生成无法平衡局部物体结构与全局环境布局;以及语义-物理执行差距,其中开放式代码生成导致缺乏动态真实感的物理幻觉。我们提出了Code2Worlds,一个将四维生成形式化为语言到模拟代码生成的框架。首先,我们提出了一种双流架构,将检索增强的物体生成与层次环境编排解耦。其次,为了确保动态真实感,我们建立了一个物理感知的闭环机制,其中后处理代理(PostProcess Agent)编写动态脚本,结合一个执行自我反思的VLM-Motion Critic,迭代优化模拟代码。在Code4D基准测试中的评估显示,Code2Worlds在SGS上获得41%的提升,丰富度提高49%,同时独特地生成了在先前静态方法中缺失的物理感知动态。代码链接:https://github.com/AIGeeksGroup/Code2Worlds。网站链接:https://aigeeksgroup.github.io/Code2Worlds。
cs.CV / 46 / 2602.11769

Light4D: Training-Free Extreme Viewpoint 4D Video Relighting

Light4D:无训练的极端视角4D视频重照明
Wu, Zhenghuang, Chen, Kang, Zhang, Zeyu, Tang, Hao
Abstract
Recent advances in diffusion-based generative models have established a new paradigm for image and video relighting. However, extending these capabilities to 4D relighting remains challenging, due primarily to the scarcity of paired 4D relighting training data and the difficulty of maintaining temporal consistency across extreme viewpoints. In this work, we propose Light4D, a novel training-free framework designed to synthesize consistent 4D videos under target illumination, even under extreme viewpoint changes. First, we introduce Disentangled Flow Guidance, a time-aware strategy that effectively injects lighting control into the latent space while preserving geometric integrity. Second, to reinforce temporal consistency, we develop Temporal Consistent Attention within the IC-Light architecture and further incorporate deterministic regularization to eliminate appearance flickering. Extensive experiments demonstrate that our method achieves competitive performance in temporal consistency and lighting fidelity, robustly handling camera rotations from -90 to 90. Code: https://github.com/AIGeeksGroup/Light4D. Website: https://aigeeksgroup.github.io/Light4D.
Chinese Translation
最近基于扩散的生成模型的进展为图像和视频重照明建立了新的范式。然而,将这些能力扩展到4D重照明仍然具有挑战性,主要是由于配对的4D重照明训练数据的稀缺以及在极端视角下保持时间一致性的困难。在本研究中,我们提出了Light4D,一个新颖的无训练框架,旨在在目标光照下合成一致的4D视频,即使在极端视角变化下也能保持一致性。首先,我们引入了解耦流引导(Disentangled Flow Guidance),这是一种时间感知策略,能够有效地将光照控制注入潜在空间,同时保持几何完整性。其次,为了增强时间一致性,我们在IC-Light架构中开发了时间一致性注意力(Temporal Consistent Attention),并进一步结合确定性正则化以消除外观闪烁。大量实验表明,我们的方法在时间一致性和光照保真度方面表现出竞争力,能够稳健地处理从-90到90度的相机旋转。代码:https://github.com/AIGeeksGroup/Light4D。网站:https://aigeeksgroup.github.io/Light4D。
cs.CV / 47 / 2602.11804

Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data

基于深度感知融合和有限训练数据的高效任意分割
Zhou, Yiming, Xie, Xuenjie, Li, Panfeng, Kunz, Albrecht, Osman, Ahmad, Maldague, Xavier
Abstract
Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained estimator and fused mid-level with RGB features through a dedicated depth encoder. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.
Chinese Translation
任意分割模型(Segment Anything Models, SAM)在通用分割性能上表现出色,但需要大量数据集(例如,1100万张图像)并且仅依赖于RGB输入。近期的高效变体减少了计算量,但仍然依赖于大规模训练。我们提出了一种轻量级的RGB-D融合框架,通过单目深度先验增强EfficientViT-SAM。深度图通过预训练的估计器生成,并通过专用的深度编码器与RGB特征在中层进行融合。我们的算法仅在11200个样本(少于SA-1B的0.1%)上训练,取得了比EfficientViT-SAM更高的准确率,表明深度线索为分割提供了强有力的几何先验。
cs.CV / 48 / 2602.11810

How to Sample High Quality 3D Fractals for Action Recognition Pre-Training?

如何为动作识别预训练采样高质量的3D分形?
Putak, Marko, Moeslund, Thomas B., Haurum, Joakim Bruslund
Abstract
Synthetic datasets are gaining recognition in deep learning as a valuable alternative to exhaustively labeled real data. One such synthetic data generation method is Formula Driven Supervised Learning (FDSL), which can provide an unlimited amount of perfectly labeled data through a formula-driven approach, such as fractals or contours. FDSL avoids common drawbacks such as manual labeling, privacy issues, and other ethical concerns. In this work we generate 3D fractals using 3D Iterated Function Systems (IFS) for pre-training an action recognition model. The fractals are temporally transformed to form videos that serve as a pre-training dataset for the downstream task of action recognition. We find that standard methods of generating fractals are slow and produce degenerate 3D fractals. We therefore systematically explore alternative ways of generating fractals and find that overly restrictive approaches, while generating aesthetically pleasing fractals, are detrimental to downstream task performance. We propose a novel method, Targeted Smart Filtering, to address both the generation speed and the fractal diversity issue. The method achieves roughly 100 times faster sampling and superior downstream performance compared to other 3D fractal filtering methods.
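As a minimal illustration of the 3D IFS machinery this line of work builds on, the "chaos game" below samples an attractor by repeatedly applying a randomly chosen contractive affine map, with a crude bounding-box check for degenerate (collapsed) point clouds. The maps and the degeneracy threshold are invented for the example; this does not reproduce the paper's Targeted Smart Filtering.

```python
import random

# Minimal 3D IFS "chaos game": x <- A_i x + b_i for a randomly chosen affine
# map each step. Illustrates the generation mechanism only.

def iterate_ifs(maps, n_points=2000, burn_in=100, seed=0):
    rng = random.Random(seed)
    x = (0.0, 0.0, 0.0)
    pts = []
    for t in range(burn_in + n_points):
        A, b = rng.choice(maps)
        x = tuple(sum(A[r][c] * x[c] for c in range(3)) + b[r] for r in range(3))
        if t >= burn_in:
            pts.append(x)
    return pts

def is_degenerate(pts, min_extent=1e-3):
    """Crude filter: flag point clouds that collapse along any axis."""
    spans = [max(p[d] for p in pts) - min(p[d] for p in pts) for d in range(3)]
    return min(spans) < min_extent

# Two contractive maps: a half-scale copy at the origin and one shifted copy.
half = ((0.5, 0, 0), (0, 0.5, 0), (0, 0, 0.5))
maps = [(half, (0.0, 0.0, 0.0)), (half, (0.5, 0.2, 0.1))]
cloud = iterate_ifs(maps)
```

A single map, by contrast, collapses onto its fixed point, which is exactly the kind of degenerate sample a filter should reject.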
Chinese Translation
合成数据集在深度学习领域被视为一种有价值的替代方案,可以替代耗时的真实数据标注。其中一种合成数据生成方法是公式驱动的监督学习(Formula Driven Supervised Learning, FDSL),该方法可以通过公式驱动的方式(如分形或轮廓)提供无限数量的完美标注数据。FDSL没有人工劳动、隐私及其他伦理问题等常见缺陷。在本研究中,我们使用3D迭代函数系统(3D Iterated Function Systems, IFS)生成3D分形,以用于动作识别模型的预训练。这些分形经过时间变换形成一个视频,作为下游动作识别任务的预训练数据集。我们发现,标准的分形生成方法速度较慢,并且产生退化的3D分形。因此,我们系统地探索了生成分形的替代方法,并发现过于严格的方法虽然能够生成美观的分形,但对下游任务性能有害。我们提出了一种新方法,目标智能过滤(Targeted Smart Filtering),以解决生成速度和分形多样性的问题。该方法报告的采样速度大约快100倍,并在下游性能上优于其他3D分形过滤方法。
cs.CV / 49 / 2602.11832

JEPA-VLA: Video Predictive Embedding is Needed for VLA Models

JEPA-VLA:视频预测嵌入是VLA模型所需的
Miao, Shangchen, Feng, Ningya, Wu, Jialong, Lin, Ye, He, Xu, Li, Dong, Long, Mingsheng
Abstract
Recent vision-language-action (VLA) models built upon pretrained vision-language models (VLMs) have achieved significant improvements in robotic manipulation. However, current VLAs still suffer from low sample efficiency and limited generalization. This paper argues that these limitations are closely tied to an overlooked component, the pretrained visual representation, which provides insufficient knowledge for both environment understanding and policy priors. Through an in-depth analysis, we find that commonly used visual representations in VLAs, whether pretrained via language-image contrastive learning or image-based self-supervised learning, remain inadequate at capturing crucial, task-relevant environment information and at inducing effective policy priors, i.e., anticipatory knowledge of how the environment evolves under successful task execution. In contrast, we discover that predictive embeddings pretrained on videos, in particular V-JEPA 2, are adept at flexibly discarding unpredictable environment factors and encoding task-relevant temporal dynamics, thereby effectively compensating for key shortcomings of existing visual representations in VLAs. Building on these observations, we introduce JEPA-VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing VLAs. Our experiments demonstrate that JEPA-VLA yields substantial performance gains across a range of benchmarks, including LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks.
Chinese Translation
近期基于预训练视觉-语言模型(VLMs)的视觉-语言-行动(VLA)模型在机器人操作方面取得了显著进展。然而,当前的VLA模型仍然面临低样本效率和有限泛化能力的问题。本文认为,这些局限性与一个被忽视的组成部分——预训练视觉表示密切相关,该部分在环境理解和策略先验两个方面提供的知识不足。通过深入分析,我们发现VLA中常用的视觉表示,无论是通过语言-图像对比学习还是基于图像的自监督学习进行预训练的,都无法有效捕捉关键的、与任务相关的环境信息,也无法有效诱导策略先验,即对环境在成功任务执行下如何演变的预期知识。相比之下,我们发现基于视频预训练的预测嵌入,特别是V-JEPA 2,能够灵活地舍弃不可预测的环境因素,并编码与任务相关的时间动态,从而有效弥补现有VLA视觉表示的关键不足。在此基础上,我们提出了JEPA-VLA,一种简单而有效的方法,能够自适应地将预测嵌入整合到现有的VLA中。我们的实验表明,JEPA-VLA在多个基准测试中,包括LIBERO、LIBERO-plus、RoboTwin2.0和真实机器人任务,均取得了显著的性能提升。
cs.CV / 50 / 2602.11845

WorldTree: Towards 4D Dynamic Worlds from Monocular Video using Tree-Chains

WorldTree:基于单目视频构建4D动态世界的树链方法
Wang, Qisen, Zhao, Yifan, Li, Jia
Abstract
Dynamic reconstruction has achieved remarkable progress, but challenges remain for monocular input in more practical applications. Prevailing works attempt to construct efficient motion representations but lack a unified spatiotemporal decomposition framework, suffering from either holistic temporal optimization or coupled hierarchical spatial composition. To this end, we propose WorldTree, a unified framework comprising a Temporal Partition Tree (TPT), which enables coarse-to-fine optimization through an inheritance-based partition tree structure for hierarchical temporal decomposition, and Spatial Ancestral Chains (SAC), which recursively query the ancestral hierarchy to provide complementary spatial dynamics while specializing motion representations across ancestral nodes. Experimental results on different datasets indicate that our proposed method achieves an 8.26% improvement in LPIPS on NVIDIA-LS and a 9.09% improvement in mLPIPS on DyCheck over the second-best method. Code: https://github.com/iCVTEAM/WorldTree.
Chinese Translation
动态重建已取得显著进展,但在单目输入的实际应用中仍面临挑战。现有研究试图构建高效的运动表示,但缺乏统一的时空分解框架,导致要么是整体时间优化,要么是耦合的层次空间组合。为此,我们提出了WorldTree,一个统一框架,包括时间分区树(Temporal Partition Tree, TPT),该树基于继承的分区树结构实现粗到细的优化,以进行层次时间分解;以及空间祖先链(Spatial Ancestral Chains, SAC),该链递归查询祖先层次结构,以提供互补的空间动态,同时在祖先节点之间专业化运动表示。在不同数据集上的实验结果表明,我们提出的方法在NVIDIA-LS上实现了8.26%的LPIPS提升,在DyCheck上实现了9.09%的mLPIPS提升,相较于第二好的方法。代码链接:https://github.com/iCVTEAM/WorldTree。
cs.CV / 51 / 2602.11850

Free Lunch for Stabilizing Rectified Flow Inversion

稳定化整流流反演的免费午餐
Wang, Chenru, Zhu, Beier, Zhang, Chi
Abstract
Rectified-Flow (RF)-based generative models have recently emerged as strong alternatives to traditional diffusion models, demonstrating state-of-the-art performance across various tasks. By learning a continuous velocity field that transforms simple noise into complex data, RF-based models not only enable high-quality generation, but also support training-free inversion, which facilitates downstream tasks such as reconstruction and editing. However, existing inversion methods, such as vanilla RF-based inversion, suffer from approximation errors that accumulate across timesteps, leading to unstable velocity fields and degraded reconstruction and editing quality. To address this challenge, we propose Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it toward a running average of past velocities, constrained within a theoretically derived spherical Gaussian. Furthermore, we introduce mimic-CFG, a lightweight velocity correction scheme for editing tasks, which interpolates between the current velocity and its projection onto the historical average, balancing editing effectiveness and structural consistency. Extensive experiments on PIE-Bench demonstrate that our methods significantly improve inversion stability, image reconstruction quality, and editing fidelity, while reducing the required number of neural function evaluations. Our approach achieves state-of-the-art performance on the PIE-Bench with enhanced efficiency and theoretical soundness.
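The stabilization idea above (guide the current velocity toward a running average of past velocities, and for editing interpolate between the current velocity and its projection onto that average) can be sketched with plain vector arithmetic. The incremental mean, the blend weight `lam`, and the interpolation weight `alpha` below are assumed simplifications; the paper's spherical-Gaussian constraint and exact proximal update are not reproduced.

```python
# Illustrative mechanics only: a running mean of past velocities, a pull of
# the current velocity toward that mean, and a mimic-CFG-style interpolation
# with the projection onto the mean. Parameters are illustrative assumptions.

def running_mean(prev_mean, v, t):
    """Incremental mean of velocities v_1..v_t."""
    return [m + (x - m) / t for m, x in zip(prev_mean, v)]

def pull_toward(v, mean, lam=0.5):
    """Blend the current velocity with the historical mean (stabilization)."""
    return [(1 - lam) * x + lam * m for x, m in zip(v, mean)]

def project(v, mean):
    """Orthogonal projection of v onto the direction of the mean velocity."""
    num = sum(x * m for x, m in zip(v, mean))
    den = sum(m * m for m in mean) or 1.0
    return [num / den * m for m in mean]

def mimic_cfg(v, mean, alpha=0.7):
    """Interpolate between the raw velocity and its projection onto the mean."""
    p = project(v, mean)
    return [alpha * x + (1 - alpha) * y for x, y in zip(v, p)]

mean = [0.0, 0.0]
for t, v in enumerate([[1.0, 0.0], [0.8, 0.2], [1.2, -0.2]], start=1):
    mean = running_mean(mean, v, t)
# mean now equals the average of the three velocities.
```

The incremental-mean form avoids storing the full velocity history, which matters when the correction runs at every sampling step.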
Chinese Translation
基于整流流(Rectified-Flow, RF)的生成模型最近作为传统扩散模型的有力替代方案出现,在各种任务中展现了最先进的性能。通过学习一个连续的速度场,将简单的噪声转化为复杂的数据,基于RF的模型不仅能够实现高质量的生成,还支持无训练的反演,这为重建和编辑等下游任务提供了便利。然而,现有的反演方法,如普通的基于RF的反演,存在随着时间步的推移而累积的近似误差,导致速度场不稳定以及重建和编辑质量下降。为了解决这一挑战,我们提出了近端均值反演(Proximal-Mean Inversion, PMI),这是一种无训练的梯度校正方法,通过引导速度场朝向过去速度的运行平均值,从而稳定速度场,并限制在理论推导的球形高斯分布内。此外,我们引入了模仿-CFG(mimic-CFG),这是一种轻量级的速度校正方案,适用于编辑任务,它在当前速度和其在历史平均值上的投影之间进行插值,平衡编辑效果和结构一致性。在PIE-Bench上的大量实验表明,我们的方法显著提高了反演的稳定性、图像重建质量和编辑保真度,同时减少了所需的神经函数评估次数。我们的方法在PIE-Bench上实现了最先进的性能,具有更高的效率和理论合理性。
cs.CV / 52 / 2602.11858

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

无缩放的缩放:细粒度多模态感知的区域到图像蒸馏
Wei, Lai, He, Liangbo, Lan, Jun, Dong, Lingzhong, Cai, Yutong, Li, Siyuan, Zhu, Huijia, Wang, Weiqiang, Kong, Linghe, Wang, Yue, Zhang, Zhuosheng, Huang, Weiran
Abstract
Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA examples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.
Chinese Translation
多模态大型语言模型(MLLMs)在广泛的视觉理解方面表现出色,但在细粒度感知上仍然存在困难,因为决定性证据较小且容易被全局上下文所淹没。最近的“图像思维”方法通过在推理过程中反复放大和缩小感兴趣区域来缓解这一问题,但由于重复调用工具和视觉重新编码,导致高延迟。为了解决这个问题,我们提出了区域到图像蒸馏(Region-to-Image Distillation),将缩放从推理时的工具转变为训练时的原语,从而将代理缩放的好处内化为MLLM的单次前向传播。具体来说,我们首先放大微裁剪区域,让强大的教师模型生成高质量的视觉问答(VQA)数据,然后将这种基于区域的监督蒸馏回完整图像。在这样的数据上训练后,较小的学生模型在不使用工具的情况下改善了“单次观察”的细粒度感知。为了严格评估这一能力,我们进一步提出了ZoomBench,这是一个包含845个VQA数据的混合注释基准,涵盖六个细粒度感知维度,并配备了一种双视图协议来量化全局与区域的“缩放差距”。实验表明,我们的模型在多个细粒度感知基准上取得了领先的表现,并且在视觉推理和图形用户界面代理等基准上也改善了整体多模态认知。我们进一步讨论了何时需要“图像思维”,以及何时其收益可以被蒸馏为单次前向传播。我们的代码可在 https://github.com/inclusionAI/Zooming-without-Zooming 获取。
cs.CV / 53 / 2602.11875

DiffPlace: Street View Generation via Place-Controllable Diffusion Model Enhancing Place Recognition

DiffPlace:通过可控地点扩散模型生成街景以增强地点识别
Li, Ji, Li, Zhiwei, Li, Shihao, Yu, Zhenjiang, Wang, Boyang, Liu, Haiou
Abstract
Generative models have advanced significantly in realistic image synthesis, with diffusion models excelling in quality and stability. Recent multi-view diffusion models improve 3D-aware street view generation, but they struggle to produce place-aware and background-consistent urban scenes from text, BEV maps, and object bounding boxes. This limits their effectiveness in generating realistic samples for place recognition tasks. To address these challenges, we propose DiffPlace, a novel framework that introduces a place-ID controller to enable place-controllable multi-view image generation. The place-ID controller employs linear projection, perceiver transformer, and contrastive learning to map place-ID embeddings into a fixed CLIP space, allowing the model to synthesize images with consistent background buildings while flexibly modifying foreground objects and weather conditions. Extensive experiments, including quantitative comparisons and augmented training evaluations, demonstrate that DiffPlace outperforms existing methods in both generation quality and training support for visual place recognition. Our results highlight the potential of generative models in enhancing scene-level and place-aware synthesis, providing a valuable approach for improving place recognition in autonomous driving.
Chinese Translation
生成模型在真实图像合成方面取得了显著进展,其中扩散模型在质量和稳定性方面表现优异。近期的多视角扩散模型改善了对三维感知街景的生成,但在从文本、鸟瞰图(BEV)和物体边界框生成地点感知和背景一致的城市场景方面仍然存在困难。这限制了它们在生成用于地点识别任务的真实样本方面的有效性。为了解决这些挑战,我们提出了DiffPlace,一个新颖的框架,引入了地点ID控制器,以实现可控地点的多视角图像生成。地点ID控制器采用线性投影、感知变换器和对比学习,将地点ID嵌入映射到固定的CLIP空间,使模型能够合成具有一致背景建筑的图像,同时灵活地修改前景物体和天气条件。大量实验,包括定量比较和增强训练评估,表明DiffPlace在生成质量和视觉地点识别的训练支持方面优于现有方法。我们的结果突显了生成模型在增强场景级和地点感知合成方面的潜力,为改善自动驾驶中的地点识别提供了一种有价值的方法。
cs.CV / 54 / 2602.11880

SynthRAR: Ring Artifacts Reduction in CT with Unrolled Network and Synthetic Data Training

SynthRAR:基于展开网络和合成数据训练的CT环伪影减少
Yang, Hongxu, Lippenszky, Levente, Timko, Edina, Avinash, Gopal
Abstract
Defective and inconsistent responses in CT detectors can cause ring and streak artifacts in the reconstructed images, making them unusable for clinical purposes. In recent years, several ring artifact reduction (RAR) solutions have been proposed in the image domain or in the sinogram domain using supervised deep learning methods. However, these methods require dedicated datasets for training, leading to a high data collection cost. Furthermore, existing approaches focus exclusively on either image-space or sinogram-space correction, neglecting the intrinsic correlations from the forward operation of the CT geometry. Based on a theoretical analysis of non-ideal CT detector responses, the RAR problem is reformulated as an inverse problem using an unrolled network that considers the non-ideal response together with linear forward-projection under the CT geometry. Additionally, the intrinsic correlations of ring artifacts between the sinogram and image domains are leveraged through synthetic data derived from natural images, enabling the trained model to correct artifacts without requiring real-world clinical data. Extensive evaluations on diverse scanning geometries and anatomical regions demonstrate that the model trained on synthetic data consistently outperforms existing state-of-the-art methods.
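A toy illustration of where these artifacts come from: a non-ideal detector applies a fixed per-column gain to every sinogram row, and dividing by an estimated column profile removes the ring-producing pattern. The flat phantom and the mean-based gain estimate below are invented for the example; the paper's unrolled inverse-problem network goes far beyond this, and only the multiplicative-response model is sketched.

```python
import random

# Toy multiplicative detector-response model for ring artifacts:
# each detector column c is scaled by gains[c] in every projection row.

def apply_detector_gains(sinogram, gains):
    return [[v * g for v, g in zip(row, gains)] for row in sinogram]

def estimate_gains(sinogram):
    """Column mean relative to the global mean approximates the gain profile."""
    n_rows, n_cols = len(sinogram), len(sinogram[0])
    col_mean = [sum(row[c] for row in sinogram) / n_rows for c in range(n_cols)]
    g_mean = sum(col_mean) / n_cols
    return [m / g_mean for m in col_mean]

rng = random.Random(1)
clean = [[1.0 for _ in range(6)] for _ in range(8)]   # flat phantom sinogram
gains = [1.0 + rng.uniform(-0.2, 0.2) for _ in range(6)]
corrupt = apply_detector_gains(clean, gains)

est = estimate_gains(corrupt)
corrected = [[v / g for v, g in zip(row, est)] for row in corrupt]
```

On a flat phantom the corrected sinogram is constant across columns again; on real data the column-mean estimate mixes anatomy with gain, which is precisely why learned, geometry-aware correction is needed.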
Chinese Translation
CT探测器中的缺陷和不一致响应会导致重建图像中的环伪影和条纹伪影,使其无法用于临床目的。近年来,已经提出了多种环伪影减少解决方案,这些方案在图像域或正弦图域中使用监督深度学习方法。然而,这些方法需要专门的数据集进行训练,导致高昂的数据收集成本。此外,现有方法仅专注于图像空间或正弦空间的校正,忽视了CT几何的正向操作所带来的内在关联。基于对非理想CT探测器响应的理论分析,RAR问题被重新表述为一个逆问题,采用展开网络,该网络同时考虑了非理想响应和CT几何的线性前向投影。此外,通过从自然图像中衍生的合成数据,利用了正弦图和图像域之间环伪影的内在关联,使得训练模型能够在不需要真实临床数据的情况下修正伪影。在多种扫描几何和解剖区域上的广泛评估表明,基于合成数据训练的模型始终优于现有的最先进方法。
cs.CV / 55 / 2602.11919

DynaHOI: Benchmarking Hand-Object Interaction for Dynamic Target

DynaHOI:动态目标的手-物体交互基准测试
Hu, BoCheng, Zhao, Zhonghan, Zhou, Kaiyue, Wang, Hongwei, Wang, Gaoang
Abstract
Most existing hand motion generation benchmarks for hand-object interaction (HOI) focus on static objects, leaving dynamic scenarios with moving targets and time-critical coordination largely untested. To address this gap, we introduce the DynaHOI-Gym, a unified online closed-loop platform with parameterized motion generators and rollout-based metrics for dynamic capture evaluation. Built on DynaHOI-Gym, we release DynaHOI-10M, a large-scale benchmark with 10M frames and 180K hand capture trajectories, whose target motions are organized into 8 major categories and 22 fine-grained subcategories. We also provide a simple observe-before-act baseline (ObAct) that integrates short-term observations with the current frame via spatiotemporal attention to predict actions, achieving an 8.1% improvement in location success rate.
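The "observe-before-act" integration of short-term observations with the current frame can be illustrated with a generic scaled dot-product attention read-out: the current-frame embedding acts as the query over recent-frame embeddings as keys/values. Feature dimensions and embeddings below are illustrative assumptions, not the ObAct baseline's actual architecture.

```python
import math

# Generic attention read-out: weight past-frame features by their similarity
# to the current frame, then pool them into one context vector.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, memory):
    """Scaled dot-product attention of one query over a list of key/value frames."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in memory]
    weights = softmax(scores)
    pooled = [sum(w * frame[i] for w, frame in zip(weights, memory))
              for i in range(d)]
    return pooled, weights

current = [1.0, 0.0]
history = [[1.0, 0.0], [0.0, 1.0]]   # the first past frame matches the query
pooled, w = attend(current, history)
```

Frames resembling the current observation receive higher weight, so the pooled context emphasizes the motion evidence most relevant to the next action.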
Chinese Translation
现有的大多数手-物体交互(HOI)手势生成基准主要集中在静态物体上,动态场景中移动目标和时间敏感的协调尚未得到充分测试。为了解决这一问题,我们引入了DynaHOI-Gym,一个统一的在线闭环平台,配备参数化的运动生成器和基于回放的动态捕捉评估指标。在DynaHOI-Gym的基础上,我们发布了DynaHOI-10M,这是一个大规模基准数据集,包含1000万帧和18万个手部捕捉轨迹,其目标运动被组织为8个主要类别和22个细分类别。我们还提供了一种简单的观察-行动基线(ObAct),该基线通过时空注意力将短期观察与当前帧结合,以预测动作,实现了8.1%的位置成功率提升。
cs.CV / 56 / 2602.11942

Synthesis of Late Gadolinium Enhancement Images via Implicit Neural Representations for Cardiac Scar Segmentation

通过隐式神经表示合成晚期钆增强图像以进行心脏瘢痕分割
Haddou, Soufiane Ben, Alvarez-Florez, Laura, Bekkers, Erik J., Tjong, Fleur V. Y., Amin, Ahmad S., Bezzina, Connie R., Išgum, Ivana
Abstract
Late gadolinium enhancement (LGE) imaging is the clinical standard for myocardial scar assessment, but limited annotated datasets hinder the development of automated segmentation methods. We propose a novel framework that synthesises both LGE images and their corresponding segmentation masks using implicit neural representations (INRs) combined with denoising diffusion models. Our approach first trains INRs to capture continuous spatial representations of LGE data and associated myocardium and fibrosis masks. These INRs are then compressed into compact latent embeddings, preserving essential anatomical information. A diffusion model operates on this latent space to generate new representations, which are decoded into synthetic LGE images with anatomically consistent segmentation masks. Experiments on 133 cardiac MRI scans suggest that augmenting training data with 200 synthetic volumes improves fibrosis segmentation performance, with the Dice score increasing from 0.509 to 0.524. Our approach provides an annotation-free method to help mitigate data scarcity. The code for this research is publicly available.
Chinese Translation
晚期钆增强(LGE)成像是心肌瘢痕评估的临床标准,但有限的标注数据集阻碍了自动分割方法的发展。我们提出了一种新颖的框架,使用隐式神经表示(INRs)结合去噪扩散模型合成LGE图像及其对应的分割掩膜。我们的方法首先训练INRs以捕捉LGE数据及相关心肌和纤维化掩膜的连续空间表示。这些INRs随后被压缩为紧凑的潜在嵌入,保留了重要的解剖信息。扩散模型在这个潜在空间上操作,以生成新的表示,这些表示被解码为具有解剖一致分割掩膜的合成LGE图像。在133个心脏MRI扫描的实验中,使用200个合成体积增强训练数据有助于改善纤维化分割性能,Dice系数从0.509提高到0.524。我们的方法提供了一种无标注的方法,以帮助缓解数据稀缺问题。本研究的代码已公开可用。
cs.CV / 57 / 2602.11960

Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion

法语PDF到Markdown转换的视觉-语言模型基准测试
Rigal, Bruno, Dupriez, Victor, Mignon, Alexis, Hy, Ronan Le, Mery, Nicolas
Abstract
This report evaluates PDF-to-Markdown conversion using recent Vision-Language Models (VLMs) on challenging French documents. Document parsing is a critical step for Retrieval-Augmented Generation (RAG) pipelines, where transcription and layout errors propagate to downstream retrieval and grounding. Existing benchmarks often emphasize English or Chinese and can over-penalize benign formatting and linearization choices (e.g., line breaks, list segmentation, alternative table renderings) that are largely irrelevant for downstream use. We introduce a French-focused benchmark of difficult pages selected via model-disagreement sampling from a corpus of 60,000 documents, covering handwritten forms, complex layouts, dense tables, and graphics-rich pages. Evaluation is performed with unit-test-style checks that target concrete failure modes (text presence, reading order, and local table constraints) combined with category-specific normalization designed to discount presentation-only variance. Across 15 models, we observe substantially higher robustness for the strongest proprietary models on handwriting and forms, while several open-weights systems remain competitive on standard printed layouts.
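Unit-test-style checks of the kind described can be sketched as small predicates over normalized Markdown: one for text presence, one for reading order. The normalization below (collapse whitespace, drop bullet and heading markers) is a bare-bones stand-in for the report's category-specific rules, and the sample page is invented.

```python
# Sketch of unit-test-style conversion checks: targeted assertions on text
# presence and reading order, after discounting presentation-only variance.

def normalize(md):
    """Collapse whitespace and strip list/heading markers from Markdown."""
    out = []
    for line in md.splitlines():
        line = line.strip().lstrip("-*# ").strip()
        if line:
            out.append(line)
    return " ".join(out)

def check_presence(md, snippets):
    text = normalize(md)
    return all(s in text for s in snippets)

def check_reading_order(md, ordered_snippets):
    text = normalize(md)
    pos = [text.find(s) for s in ordered_snippets]
    return all(p >= 0 for p in pos) and pos == sorted(pos)

page = "# Facture\n\n- Client : Dupont\n- Total : 42 EUR\n"
ok = check_presence(page, ["Client : Dupont", "Total : 42 EUR"])
ordered = check_reading_order(page, ["Facture", "Client", "Total"])
```

Because the checks run on normalized text, a converter that emits the same content as a table, a list, or plain lines passes equally, which is the point of discounting linearization choices.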
Chinese Translation
本报告评估了使用近期视觉-语言模型(VLMs)对具有挑战性的法语文档进行PDF到Markdown的转换。文档解析是检索增强生成(RAG)管道中的关键步骤,在此过程中,转录和布局错误会传播到下游的检索和定位。现有基准测试通常强调英语或中文,并可能对无害的格式和线性化选择(例如,换行、列表分段、替代表格渲染)施加过高的惩罚,而这些选择对下游使用基本无关。我们引入了一个以法语为重点的基准测试,选取了通过模型不一致采样从60,000个文档语料库中挑选出的困难页面,涵盖手写表单、复杂布局、密集表格和图形丰富的页面。评估采用单元测试风格的检查,针对具体的失败模式(文本存在、阅读顺序和局部表格约束),结合特定类别的归一化设计,以减少仅与展示相关的变异性。在15个模型中,我们观察到最强的专有模型在手写和表单方面具有显著更高的鲁棒性,而一些开放权重系统在标准印刷布局上仍然具有竞争力。
cs.CV / 58 / 2602.11973

Calibrated Bayesian Deep Learning for Explainable Decision Support Systems Based on Medical Imaging

基于医学影像的可解释决策支持系统的校准贝叶斯深度学习
Xu, Hua, Arias-Londoño, Julián D., Godino-Llorente, Juan I.
Abstract
In critical decision support systems based on medical imaging, the reliability of AI-assisted decision-making is as relevant as predictive accuracy. Although deep learning models have demonstrated significant accuracy, they frequently suffer from miscalibration, manifested as overconfidence in erroneous predictions. To facilitate clinical acceptance, it is imperative that models quantify uncertainty in a manner that correlates with prediction correctness, allowing clinicians to identify unreliable outputs for further review. In order to address this necessity, the present paper proposes a generalizable probabilistic optimization framework grounded in Bayesian deep learning. Specifically, a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) is introduced that imposes penalties on high-certainty errors and low-certainty correct predictions, explicitly enforcing alignment between prediction correctness and uncertainty estimates. Complementing this training-time optimization, a Dual Temperature Scaling (DTS) strategy is devised for post-hoc calibration, further refining the posterior distribution to improve intuitive explainability. The proposed framework is validated on three distinct medical imaging tasks: automatic screening of pneumonia, diabetic retinopathy detection, and identification of skin lesions. Empirical results demonstrate that the proposed approach achieves consistent calibration improvements across diverse modalities, maintains robust performance in data-scarce scenarios, and remains effective on severely imbalanced datasets, underscoring its potential for real clinical deployment.
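As background for the calibration objective above, the standard post-hoc mechanism the paper builds on can be sketched: temperature scaling of logits, with expected calibration error (ECE) as the selection criterion. The grid search and bin count below are illustrative choices; the paper's Dual Temperature Scaling and CUB-Loss are not reproduced here.

```python
import math

# Plain single-temperature scaling fitted by minimizing expected calibration
# error (ECE) over a grid. A generic sketch of the post-hoc baseline only.

def softmax(logits, T=1.0):
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ece(probs, labels, n_bins=10):
    """Average |confidence - accuracy| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, float(pred == y)))
    total, err = len(probs), 0.0
    for b in bins:
        if b:
            conf = sum(c for c, _ in b) / len(b)
            acc = sum(a for _, a in b) / len(b)
            err += len(b) / total * abs(conf - acc)
    return err

def fit_temperature(logit_sets, labels, grid=None):
    grid = grid or [0.5 + 0.1 * i for i in range(40)]   # T in [0.5, 4.4]
    return min(grid, key=lambda T: ece([softmax(l, T) for l in logit_sets],
                                       labels))

logits = [[4.0, 0.0]] * 4
labels = [0, 0, 0, 1]          # 75% accuracy but ~98% confidence at T = 1
T = fit_temperature(logits, labels)
```

An overconfident model is cooled (T > 1) until stated confidence tracks accuracy, which is exactly the correctness-uncertainty alignment the abstract asks clinicians to rely on.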
Chinese Translation
在基于医学影像的关键决策支持系统中,人工智能辅助决策的可靠性与预测准确性同样重要。尽管深度学习模型已显示出显著的准确性,但它们常常遭遇误校准,表现为对错误预测的过度自信。为了促进临床接受,模型必须以一种与预测正确性相关的方式量化不确定性,使临床医生能够识别不可靠的输出以进行进一步审查。为了解决这一需求,本文提出了一种基于贝叶斯深度学习的可推广概率优化框架。具体而言,提出了一种新颖的置信-不确定性边界损失(Confidence-Uncertainty Boundary Loss, CUB-Loss),该损失对高置信度错误和低置信度正确预测施加惩罚,明确地强制预测正确性与不确定性估计之间的对齐。作为这种训练时间优化的补充,设计了一种双温度缩放(Dual Temperature Scaling, DTS)策略用于后期校准,进一步细化后验分布以提高直观可解释性。所提出的框架在三个不同的医学影像任务上进行了验证:肺炎的自动筛查、糖尿病视网膜病变检测和皮肤病变识别。实证结果表明,所提出的方法在不同模态下实现了一致的校准改进,在数据稀缺场景中保持了稳健的性能,并在严重不平衡的数据集上仍然有效,突显了其在实际临床部署中的潜力。
cs.CV / 59 / 2602.11980

Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation

空间思维链:连接空间推理生成的理解与生成模型
Chen, Wei, Long, Yancheng, Liu, Mingqiao, Ding, Haojie, Yang, Yankai, Wei, Hongyang, Zhang, Yi-Fan, Wen, Bin, Yang, Fan, Gao, Tingting, Li, Han, Chen, Long
Abstract
While diffusion models have shown exceptional capabilities in aesthetic image synthesis, they often struggle with complex spatial understanding and reasoning. Existing approaches resort to Multimodal Large Language Models (MLLMs) to enhance this capability. However, they either incur high computational costs through joint training or suffer from spatial information loss when relying solely on textual prompts. To alleviate these limitations, we propose a Spatial Chain-of-Thought (SCoT) framework, a plug-and-play approach that effectively bridges the reasoning capabilities of MLLMs with the generative power of diffusion models. Specifically, we first enhance the diffusion model's layout awareness by training it on an interleaved text-coordinate instruction format. We then leverage state-of-the-art MLLMs as planners to generate comprehensive layout plans, transferring their spatial planning capabilities directly to the generation process. Extensive experiments demonstrate that our method achieves state-of-the-art performance on image generation benchmarks and significantly outperforms baselines on complex reasoning tasks, while also showing strong efficacy in image editing scenarios.
Chinese Translation
尽管扩散模型在美学图像合成方面展现了卓越的能力,但它们在复杂的空间理解和推理方面往往面临挑战。现有方法依赖于多模态大型语言模型(Multimodal Large Language Models, MLLMs)来增强这一能力。然而,这些方法要么通过联合训练导致高计算成本,要么在仅依赖文本提示时遭遇空间信息的丢失。为了解决这些限制,我们提出了一种空间思维链(Spatial Chain-of-Thought, SCoT)框架,这是一种即插即用的方法,有效地将MLLMs的推理能力与扩散模型的生成能力相结合。具体而言,我们首先通过在交错的文本-坐标指令格式上训练扩散模型,增强其布局意识。然后,我们利用最先进的MLLMs作为规划者,生成全面的布局计划,将其空间规划能力直接转移到生成过程中。大量实验表明,我们的方法在图像生成基准测试中达到了最先进的性能,并在复杂推理任务中显著超越了基线,同时在图像编辑场景中也表现出强大的有效性。
cs.CV / 60 / 2602.12002

Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation

局部视觉-语言模型能否提升视觉变换器在活动识别中的表现?——以新生儿复苏为案例研究
Guerriero, Enrico, Engan, Kjersti, Meinich-Bache, Øyvind
Abstract
Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet remains underutilized in practice. Previous work using 3D-CNNs and Vision Transformers (ViT) has shown promising results in detecting key activities from newborn resuscitation videos, but also highlighted the challenges in recognizing such fine-grained activities. This work investigates the potential of generative AI (GenAI) methods to improve activity recognition from such videos. Specifically, we explore the use of local vision-language models (VLMs), combined with large language models (LLMs), and compare them to a supervised TimeSFormer baseline. Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, we evaluate several zero-shot VLM-based strategies and fine-tuned VLMs with classification heads, including Low-Rank Adaptation (LoRA). Our results suggest that small (local) VLMs struggle with hallucinations, but when fine-tuned with LoRA, they reach an F1 score of 0.91, surpassing the TimeSFormer result of 0.70.
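For readers unfamiliar with LoRA, the mechanism can be shown in a few lines: the frozen weight W is augmented by a low-rank product (alpha/r)·A·B rather than retrained. The shapes, the alpha value, and the deterministic A below are generic illustrations of the common formulation, not the fine-tuning setup used in this study.

```python
import random

# Minimal LoRA-style forward pass: y = x @ (W + (alpha/r) * A @ B),
# where W is frozen and only the low-rank factors A, B are trainable.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_forward(x, W, A, B, alpha=8.0):
    r = len(B)                                  # rank of the A @ B product
    scale = alpha / r
    delta = matmul(A, B)
    W_eff = [[w + scale * d for w, d in zip(wr, dr)]
             for wr, dr in zip(W, delta)]
    return [sum(xi * W_eff[i][j] for i, xi in enumerate(x))
            for j in range(len(W[0]))]

d_in, d_out, rank = 4, 3, 2
rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(d_out)] for _ in range(d_in)]  # frozen
A = [[0.1, -0.05], [0.02, 0.03], [-0.04, 0.01], [0.06, 0.02]]          # d_in x r
B = [[0.0] * d_out for _ in range(rank)]       # B starts at zero, as in LoRA
x = [1.0, 2.0, -1.0, 0.5]
y = lora_forward(x, W, A, B)                   # identical to the frozen model
```

Starting B at zero means the adapted model initially reproduces the pretrained one, and training only A and B keeps the update parameter count at r·(d_in + d_out) per layer.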
Chinese Translation
新生儿复苏的准确记录对于质量改进和遵循临床指南至关重要,但在实践中仍未得到充分利用。之前的研究使用3D卷积神经网络(3D-CNNs)和视觉变换器(Vision Transformers, ViT)在检测新生儿复苏视频中的关键活动方面取得了良好结果,但也突显了识别这些细粒度活动的挑战。本研究探讨了生成性人工智能(Generative AI, GenAI)方法在改善此类视频活动识别中的潜力。具体而言,我们探索了局部视觉-语言模型(Vision-Language Models, VLMs)与大型语言模型(Large Language Models, LLMs)的结合,并将其与监督的TimeSFormer基线进行比较。使用包含13.26小时新生儿复苏视频的模拟数据集,我们评估了几种基于零样本的VLM策略和带有分类头的微调VLM,包括低秩适应(Low-Rank Adaptation, LoRA)。我们的结果表明,小型(局部)VLM在幻觉方面存在困难,但经过LoRA微调后,结果达到了0.91的F1分数,超越了TimeSformer的0.70。
cs.CV / 61 / 2602.12003

Projected Representation Conditioning for High-fidelity Novel View Synthesis

高保真新视角合成的投影表示条件化
Kwak, Min-Seop, Kwon, Minkyung, Choi, Jinhyeok, Park, Jiho, Kim, Seungryong
Abstract
We propose a novel framework for diffusion-based novel view synthesis in which we leverage external representations as conditions, harnessing their geometric and semantic correspondence properties for enhanced geometric consistency in generated novel viewpoints. First, we provide a detailed analysis exploring the correspondence capabilities emergent in the spatial attention of external visual representations. Building from these insights, we propose a representation-guided novel view synthesis through dedicated representation projection modules that inject external representations into the diffusion process, a methodology named ReNoV, short for representation-guided novel view synthesis. Our experiments show that this design yields marked improvements in both reconstruction fidelity and inpainting quality, outperforming prior diffusion-based novel-view methods on standard benchmarks and enabling robust synthesis from sparse, unposed image collections.
Chinese Translation
我们提出了一种基于扩散的新视角合成的新框架,其中利用外部表示作为条件,利用其几何和语义对应特性增强生成新视角的几何一致性。首先,我们提供了详细的分析,探讨了外部视觉表示在空间注意力中出现的对应能力。在这些洞察的基础上,我们提出了一种通过专门的表示投影模块引入外部表示到扩散过程中的表示引导新视角合成方法,命名为 ReNoV(表示引导的新视角合成)。我们的实验表明,这种设计在重建保真度和修复质量上都有显著提升,超越了先前基于扩散的新视角方法在标准基准上的表现,并能够从稀疏、未摆姿的图像集合中实现稳健的合成。
cs.CV / 62 / 2602.12044

A DMD-Based Adaptive Modulation Method for High Dynamic Range Imaging in High-Glare Environments

基于DMD的自适应调制方法在高眩光环境下的高动态范围成像
Guan, Banglei, Tao, Jing, Xu, Liang, Tan, Dongcai, Sun, Pengju, Liu, Jianbing, Shang, Yang, Yu, Qifeng
Abstract
Background: The accuracy of photomechanics measurements critically relies on image quality, particularly under extreme illumination conditions such as welding arc monitoring and polished metallic surface analysis. High dynamic range (HDR) imaging above 120 dB is essential in these contexts. Conventional CCD/CMOS sensors, with dynamic ranges typically below 70 dB, are highly susceptible to saturation under glare, resulting in irreversible loss of detail and significant errors in digital image correlation (DIC). Methods: This paper presents an HDR imaging system that leverages the spatial modulation capability of a digital micromirror device (DMD). The system architecture enables autonomous regional segmentation and adaptive exposure control for high-dynamic-range scenes through an integrated framework comprising two synergistic subsystems: a DMD-based optical modulation unit and an adaptive computational imaging pipeline. Results: The system achieves a measurable dynamic range of 127 dB, effectively eliminating saturation artifacts under high glare. Experimental results demonstrate a 78% reduction in strain error and improved DIC positioning accuracy, confirming reliable performance across extreme intensity variations. Conclusion: The DMD-based system provides high-fidelity adaptive HDR imaging, overcoming key limitations of conventional sensors. It exhibits strong potential for optical metrology and stress analysis in high-glare environments where traditional methods are inadequate.
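The dynamic-range figures quoted above follow from the standard 20·log10(Imax/Imin) definition, and the per-region attenuation idea can be shown with a toy tile-wise gain step. The tile values and full-well level below are invented for illustration; the real system's segmentation and DMD control loop are far more involved.

```python
import math

# Back-of-envelope dynamic-range arithmetic plus a toy per-region attenuation
# step in the spirit of DMD spatial modulation.

def dynamic_range_db(i_max, i_min):
    """Optical dynamic range in dB: 20 * log10(Imax / Imin)."""
    return 20.0 * math.log10(i_max / i_min)

def attenuate_saturated(tiles, full_well=1.0):
    """Scale each saturated tile down so its peak just fits the sensor range."""
    out = []
    for tile in tiles:
        peak = max(tile)
        gain = full_well / peak if peak > full_well else 1.0
        out.append([v * gain for v in tile])
    return out

# 70 dB corresponds to an intensity ratio of about 3162:1; 120 dB needs 10^6:1.
ratio_70db = 10 ** (70 / 20)
tiles = [[0.2, 0.4], [5.0, 2.5]]     # the second tile saturates a 1.0 sensor
safe = attenuate_saturated(tiles)
```

Because the attenuation is applied per region rather than globally, dim regions keep their full signal while glare regions are compressed into the sensor's range, which is what extends the measurable dynamic range.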
Chinese Translation
背景:光机械测量的准确性在很大程度上依赖于图像质量,特别是在焊接弧监测和抛光金属表面分析等极端照明条件下。高于120 dB的高动态范围(HDR)成像在这些情况下至关重要。传统的CCD/CMOS传感器的动态范围通常低于70 dB,容易在眩光下饱和,导致细节的不可逆损失和数字图像相关(DIC)中的显著误差。方法:本文提出了一种利用数字微镜设备(DMD)空间调制能力的HDR成像系统。该系统架构通过一个集成框架,结合两个协同子系统:基于DMD的光学调制单元和自适应计算成像管道,实现了高动态范围场景的自主区域分割和自适应曝光控制。结果:该系统实现了127 dB的可测动态范围,有效消除了高眩光下的饱和伪影。实验结果表明,应变误差减少了78%,DIC定位精度提高,确认了在极端强度变化下的可靠性能。结论:基于DMD的系统提供了高保真度的自适应HDR成像,克服了传统传感器的主要限制。它在高眩光环境下的光学计量和应力分析中展现出强大的潜力,传统方法在这些环境中显得不足。
cs.CV / 63 / 2602.12099

GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

GigaBrain-0.5M*: 一种基于世界模型强化学习的视觉-语言-行动(VLA)模型
GigaBrain Team, Wang, Boyuan, Ni, Chaojun, Huang, Guan, Zhao, Guosheng, Li, Hao, Li, Jie, Lv, Jindi, Liu, Jingyu, Feng, Lv, Yu, Mingming, Li, Peng, Deng, Qiuping, Liu, Tianze, Zhou, Xinyu, Chen, Xinze, Wang, Xiaofeng, Wang, Yang, Li, Yifan, Nie, Yifei, Li, Yilong, Zhou, Yukun, Ye, Yun, Liu, Zhichao, Zhu, Zheng
Abstract
Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. We therefore propose GigaBrain-0.5M*, a VLA model trained via world model-based reinforcement learning. It builds upon GigaBrain-0.5, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. GigaBrain-0.5M* further integrates world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that RAMP achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30% on challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation. Critically, GigaBrain-0.5M* exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure, as validated by real-world deployment videos on our project page (https://gigabrain05m.github.io).
Chinese Translation
直接从当前观察中预测多步动作块的视觉-语言-行动(VLA)模型面临固有的局限性,主要由于场景理解受限和未来预测能力较弱。相比之下,经过网络规模视频语料库预训练的视频世界模型展现出强大的时空推理能力和准确的未来预测,使其成为增强VLA学习的自然基础。因此,我们提出了GigaBrain-0.5M*,这是一个通过基于世界模型的强化学习训练的VLA模型。该模型建立在GigaBrain-0.5之上,后者在超过10,000小时的机器人操作数据上进行了预训练,其中间版本目前在国际RoboChallenge基准测试中排名第一。GigaBrain-0.5M*进一步通过RAMP(基于世界模型条件策略的强化学习)集成了基于世界模型的强化学习,以实现强大的跨任务适应性。实证结果表明,RAMP相较于RECAP基线取得了显著的性能提升,在包括洗衣折叠、箱子打包和浓缩咖啡制作等挑战性任务上,性能提升约30%。重要的是,GigaBrain-0.5M*展现出可靠的长时间执行能力,持续成功完成复杂的操作任务,且这一点通过我们项目页面上的真实世界部署视频得到了验证。
cs.CV / 64 / 2602.12100

AssetFormer: Modular 3D Assets Generation with Autoregressive Transformer

AssetFormer:基于自回归变换器的模块化3D资产生成
Zhu, Lingting, Qian, Shengju, Fan, Haidi, Dong, Jiayu, Jin, Zhenchao, Zhou, Siwei, Dong, Gen, Wang, Xin, Yu, Lequan
Abstract
The digital industry demands high-quality, diverse modular 3D assets, especially for user-generated content~(UGC). In this work, we introduce AssetFormer, an autoregressive Transformer-based model designed to generate modular 3D assets from textual descriptions. Our pilot study leverages real-world modular assets collected from online platforms. AssetFormer tackles the challenge of creating assets composed of primitives that adhere to constrained design parameters for various applications. By innovatively adapting module sequencing and decoding techniques inspired by language models, our approach enhances asset generation quality through autoregressive modeling. Initial results indicate the effectiveness of AssetFormer in streamlining asset creation for professional development and UGC scenarios. This work presents a flexible framework extendable to various types of modular 3D assets, contributing to the broader field of 3D content generation. The code is available at https://github.com/Advocate99/AssetFormer.
Chinese Translation
数字产业对高质量、多样化的模块化3D资产有着迫切需求,尤其是在用户生成内容(UGC)方面。在本研究中,我们介绍了AssetFormer,这是一种基于自回归变换器的模型,旨在从文本描述中生成模块化3D资产。我们的初步研究利用了从在线平台收集的真实模块化资产。AssetFormer解决了创建由遵循各种应用约束设计参数的原始构件组成的资产的挑战。通过创新性地调整受语言模型启发的模块序列和解码技术,我们的方法通过自回归建模提升了资产生成的质量。初步结果表明,AssetFormer在简化专业开发和UGC场景中的资产创建方面具有有效性。本研究提出了一个灵活的框架,可以扩展到各种类型的模块化3D资产,为更广泛的3D内容生成领域做出了贡献。代码可在 https://github.com/Advocate99/AssetFormer 获取。
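The abstract does not specify AssetFormer's decoding procedure; as a hedged sketch of what "module sequencing with language-model-style decoding under constrained design parameters" could look like, here is temperature sampling over module tokens with constraint-violating modules masked out. All names and the masking scheme are assumptions, not the paper's method:

```python
import math
import random

def sample_next_module(logits, valid, temperature=1.0, rng=random):
    """Sample the next module token; modules whose `valid` flag is
    False (e.g. violating a design constraint) are masked to -inf
    before a temperature-scaled softmax, so they are never chosen."""
    scaled = [l / temperature if ok else float("-inf")
              for l, ok in zip(logits, valid)]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return max(range(len(probs)), key=probs.__getitem__)

# The constrained module (index 1) is never sampled despite its high logit.
assert all(sample_next_module([0.0, 9.0, 1.0], [True, False, True]) != 1
           for _ in range(100))
```

Hard-masking at decode time is one standard way autoregressive generators honor per-step validity constraints without retraining the model.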
cs.CV / 65 / 2602.12127

PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback

PosterOmni:通过任务蒸馏和统一奖励反馈实现的通用艺术海报创作
Chen, Sixiang, Lai, Jianyu, Gao, Jialin, Shi, Hengyu, Liu, Zhongying, Ye, Tian, Luo, Junfeng, Wei, Xiaoming, Zhu, Lei
Abstract
Image-to-poster generation is a high-demand task requiring not only local adjustments but also high-level design understanding. Models must generate text, layout, style, and visual elements while preserving semantic fidelity and aesthetic coherence. The process spans two regimes: local editing, where ID-driven generation, rescaling, filling, and extending must preserve concrete visual entities; and global creation, where layout- and style-driven tasks rely on understanding abstract design concepts. These intertwined demands make image-to-poster a multi-dimensional process coupling entity-preserving editing with concept-driven creation under image-prompt control. To address these challenges, we propose PosterOmni, a generalized artistic poster creation framework that unlocks the potential of a base edit model for multi-task image-to-poster generation. PosterOmni integrates the two regimes, namely local editing and global creation, within a single system through an efficient data-distillation-reward pipeline: (i) constructing multi-scenario image-to-poster datasets covering six task types across entity-based and concept-based creation; (ii) distilling knowledge between local and global experts for supervised fine-tuning; and (iii) applying unified PosterOmni Reward Feedback to jointly align visual entity-preserving and aesthetic preference across all tasks. Additionally, we establish PosterOmni-Bench, a unified benchmark for evaluating both local editing and global creation. Extensive experiments show that PosterOmni significantly enhances reference adherence, global composition quality, and aesthetic harmony, outperforming all open-source baselines and even surpassing several proprietary systems.
Chinese Translation
图像到海报生成是一项高需求任务,不仅需要局部调整,还需要高水平的设计理解。模型必须生成文本、布局、风格和视觉元素,同时保持语义的准确性和美学的一致性。该过程涵盖两个阶段:局部编辑,其中基于身份的生成、重新缩放、填充和扩展必须保持具体的视觉实体;以及全局创作,其中布局和风格驱动的任务依赖于对抽象设计概念的理解。这些交织的需求使得图像到海报的过程成为一个多维过程,将实体保留的编辑与基于概念的创作结合在图像提示控制下。为了解决这些挑战,我们提出了PosterOmni,一个通用的艺术海报创作框架,充分发挥基础编辑模型在多任务图像到海报生成中的潜力。PosterOmni通过高效的数据蒸馏-奖励管道,将局部编辑和全局创作这两个阶段整合到一个系统中:(i) 构建覆盖六种任务类型的多场景图像到海报数据集,涵盖基于实体和基于概念的创作;(ii) 在局部和全局专家之间蒸馏知识以进行监督微调;(iii) 应用统一的PosterOmni奖励反馈,以共同对齐所有任务中的视觉实体保留和美学偏好。此外,我们建立了PosterOmni-Bench,一个用于评估局部编辑和全局创作的统一基准。大量实验表明,PosterOmni显著提高了参考遵循度、全局构图质量和美学和谐性,超越了所有开源基线,甚至超过了若干专有系统。
cs.CV / 66 / 2602.12155

FAIL: Flow Matching Adversarial Imitation Learning for Image Generation

FAIL:用于图像生成的流匹配对抗模仿学习
Ma, Yeyao, Li, Chen, Zhang, Xiaosong, Hu, Han, Xie, Weidi
Abstract
Post-training of flow matching models-aligning the output distribution with a high-quality target-is mathematically equivalent to imitation learning. While Supervised Fine-Tuning mimics expert demonstrations effectively, it cannot correct policy drift in unseen states. Preference optimization methods address this but require costly preference pairs or reward modeling. We propose Flow Matching Adversarial Imitation Learning (FAIL), which minimizes policy-expert divergence through adversarial training without explicit rewards or pairwise comparisons. We derive two algorithms: FAIL-PD exploits differentiable ODE solvers for low-variance pathwise gradients, while FAIL-PG provides a black-box alternative for discrete or computationally constrained settings. Fine-tuning FLUX with only 13,000 demonstrations from Nano Banana pro, FAIL achieves competitive performance on prompt following and aesthetic benchmarks. Furthermore, the framework generalizes effectively to discrete image and video generation, and functions as a robust regularizer to mitigate reward hacking in reward-based optimization. Code and data are available at https://github.com/HansPolo113/FAIL.
Chinese Translation
流匹配模型的后训练——将输出分布与高质量目标对齐——在数学上等同于模仿学习。虽然监督微调有效地模仿了专家演示,但它无法纠正在未见状态下的策略漂移。偏好优化方法解决了这一问题,但需要昂贵的偏好对或奖励建模。我们提出了流匹配对抗模仿学习(Flow Matching Adversarial Imitation Learning, FAIL),通过对抗训练最小化策略与专家之间的差异,而无需显式奖励或成对比较。我们推导了两个算法:FAIL-PD利用可微分的常微分方程求解器获得低方差的路径梯度,而FAIL-PG则为离散或计算受限的环境提供了一种黑箱替代方案。仅使用来自Nano Banana pro的13,000个演示,微调FLUX后,FAIL在提示跟随和美学基准测试中取得了竞争力的表现。此外,该框架有效地推广到离散图像和视频生成,并作为一种强健的正则化器,减轻基于奖励的优化中的奖励黑客问题。代码和数据可在https://github.com/HansPolo113/FAIL获取。
cs.CV / 67 / 2602.12157

TexSpot: 3D Texture Enhancement with Spatially-uniform Point Latent Representation

TexSpot:基于空间均匀点潜在表示的3D纹理增强
Lu, Ziteng, Wu, Yushuang, Ye, Chongjie, Qiu, Yuda, Shao, Jing, Guo, Xiaoyang, Zhou, Jiaqing, Hu, Tianlei, Zhou, Kun, Han, Xiaoguang
Abstract
High-quality 3D texture generation remains a fundamental challenge due to the view-inconsistency inherent in current mainstream multi-view diffusion pipelines. Existing representations either rely on UV maps, which suffer from distortion during unwrapping, or point-based methods, which tightly couple texture fidelity to geometric density that limits high-resolution texture generation. To address these limitations, we introduce TexSpot, a diffusion-based texture enhancement framework. At its core is Texlet, a novel 3D texture representation that merges the geometric expressiveness of point-based 3D textures with the compactness of UV-based representation. Each Texlet latent vector encodes a local texture patch via a 2D encoder and is further aggregated using a 3D encoder to incorporate global shape context. A cascaded 3D-to-2D decoder reconstructs high-quality texture patches, enabling the Texlet space learning. Leveraging this representation, we train a diffusion transformer conditioned on Texlets to refine and enhance textures produced by multi-view diffusion methods. Extensive experiments demonstrate that TexSpot significantly improves visual fidelity, geometric consistency, and robustness over existing state-of-the-art 3D texture generation and enhancement approaches. Project page: https://anonymous.4open.science/w/TexSpot-page-2D91.
Chinese Translation
高质量的3D纹理生成仍然是一个基本挑战,因为当前主流的多视角扩散管道固有的视图不一致性。现有的表示方法要么依赖于UV贴图,这在展开过程中会遭受失真,要么依赖于基于点的方法,这将纹理保真度与几何密度紧密耦合,从而限制了高分辨率纹理的生成。为了解决这些局限性,我们提出了TexSpot,一个基于扩散的纹理增强框架。其核心是Texlet,一种新颖的3D纹理表示,结合了基于点的3D纹理的几何表现力和基于UV的表示的紧凑性。每个Texlet潜在向量通过2D编码器编码一个局部纹理块,并通过3D编码器进一步聚合,以纳入全局形状上下文。一个级联的3D到2D解码器重建高质量的纹理块,从而实现Texlet空间学习。利用这种表示,我们训练了一个以Texlets为条件的扩散变换器,以精炼和增强多视角扩散方法生成的纹理。大量实验表明,TexSpot在视觉保真度、几何一致性和鲁棒性方面显著优于现有的最先进的3D纹理生成和增强方法。项目页面:https://anonymous.4open.science/w/TexSpot-page-2D91。
cs.CV / 68 / 2602.12160

DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

DreamID-Omni:可控人本音视频生成的统一框架
Guo, Xu, Ye, Fulong, Sun, Qichao, Chen, Liyang, Li, Bingchuan, Zhang, Pengze, Liu, Jiawei, Zhao, Songtao, He, Qian, Hou, Xiangwang
Abstract
Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks including reference-based audio-video generation (R2AV), video editing (RV2AV) and audio-driven video animation (RA2V) as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. Furthermore, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.
Chinese Translation
最近基础模型的进展彻底改变了音视频联合生成的领域。然而,现有的方法通常将以人为中心的任务(包括基于参考的音视频生成(R2AV)、视频编辑(RV2AV)和音频驱动的视频动画(RA2V))视为孤立的目标。此外,在单一框架内实现对多个角色身份和声音音色的精确、解耦控制仍然是一个未解决的挑战。在本文中,我们提出了DreamID-Omni,一个可控人本音视频生成的统一框架。具体而言,我们设计了一种对称条件扩散变换器(Symmetric Conditional Diffusion Transformer),通过对称条件注入方案整合异质条件信号。为了解决多人物场景中普遍存在的身份-音色绑定失败和说话者混淆问题,我们引入了一种双层解耦策略:在信号层面采用同步的RoPE(Synchronized RoPE)以确保刚性注意力空间绑定,在语义层面采用结构化字幕(Structured Captions)以建立明确的属性-主体映射。此外,我们设计了一种多任务渐进训练方案,利用弱约束生成先验来规范强约束任务,防止过拟合并协调不同目标。大量实验表明,DreamID-Omni在视频、音频和音视频一致性方面实现了全面的最先进性能,甚至超越了领先的专有商业模型。我们将发布我们的代码,以弥合学术研究与商业级应用之间的差距。
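The paper's Synchronized RoPE is described only at a high level; one plausible reading is that the audio tokens and video tokens belonging to the same character share position indices, so their rotary embeddings rotate in phase and attention binds identity to timbre. Below is a standard rotary-position-embedding implementation with shared positions across two streams — the sharing scheme is our assumption, not the paper's exact construction:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary position embedding: rotate each (x1_i, x2_i) feature
    pair of x (seq, dim) by angle position * base**(-i / (dim/2))."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # (half,)
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# "Synchronized": a character's audio tokens and video tokens get the
# SAME position indices, so their rotations stay phase-aligned.
pos = np.arange(8, dtype=float)
audio_tokens = rope(np.random.randn(8, 16), pos)
video_tokens = rope(np.random.randn(8, 16), pos)
```

Because RoPE is a pure rotation, it preserves token norms and makes attention scores depend on relative position, which is what lets shared indices act as a rigid cross-modal binding signal.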
cs.CV / 69 / 2602.12177

EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data

EO-VAE:面向地球观测数据的多传感器标记器
Lehmann, Nils, Wang, Yi, Xiong, Zhitong, Zhu, Xiaoxiang
Abstract
State-of-the-art generative image and video models rely heavily on tokenizers that compress high-dimensional inputs into more efficient latent representations. While this paradigm has revolutionized RGB generation, Earth observation (EO) data presents unique challenges due to diverse sensor specifications and variable spectral channels. We propose EO-VAE, a multi-sensor variational autoencoder designed to serve as a foundational tokenizer for the EO domain. Unlike prior approaches that train separate tokenizers for each modality, EO-VAE utilizes a single model to encode and reconstruct flexible channel combinations via dynamic hypernetworks. Our experiments on the TerraMesh dataset demonstrate that EO-VAE achieves superior reconstruction fidelity compared to the TerraMind tokenizers, establishing a robust baseline for latent generative modeling in remote sensing.
Chinese Translation
最先进的生成图像和视频模型在很大程度上依赖于将高维输入压缩为更高效的潜在表示的标记器。尽管这一范式已经彻底改变了RGB生成,但由于传感器规格多样和光谱通道可变,地球观测(EO)数据面临独特的挑战。我们提出了EO-VAE,一种多传感器变分自编码器,旨在作为EO领域的基础标记器。与以往为每种模态训练单独标记器的方法不同,EO-VAE利用单一模型通过动态超网络编码和重构灵活的通道组合。我们在TerraMesh数据集上的实验表明,EO-VAE在重构保真度方面优于TerraMind标记器,为遥感中的潜在生成建模建立了稳健的基线。
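How EO-VAE's "dynamic hypernetworks" are wired is not given in the abstract. A minimal sketch of the general idea: a small hypernetwork maps a per-channel descriptor (e.g. a spectral-band embedding) to that channel's input-projection weights, so one encoder accepts arbitrary channel combinations. All shapes and names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hypernetwork: a tiny MLP turns a per-channel descriptor
# into that channel's input-projection weights.
D_DESC, D_LATENT = 8, 16
W1 = rng.normal(0.0, 0.1, (D_DESC, 32))
W2 = rng.normal(0.0, 0.1, (32, D_LATENT))

def channel_weights(desc):
    """Generate one channel's (D_LATENT,) projection vector."""
    return np.tanh(desc @ W1) @ W2

def encode(image, descs):
    """image: (C, H, W); descs: (C, D_DESC). Each channel is projected
    with its generated weights and summed into a channel-count-agnostic
    (D_LATENT, H, W) feature map."""
    ws = np.stack([channel_weights(d) for d in descs])  # (C, D_LATENT)
    return np.einsum("chw,cd->dhw", image, ws)

# The same encoder handles a 3-band and a 2-band sensor unchanged.
rgb = encode(rng.normal(size=(3, 4, 4)), rng.normal(size=(3, D_DESC)))
sar = encode(rng.normal(size=(2, 4, 4)), rng.normal(size=(2, D_DESC)))
```

The key property is that the output shape is independent of the input channel count, which is what removes the need for one tokenizer per modality.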
cs.CV / 70 / 2602.12221

Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching

兼具优势:通过统一离散流匹配实现多模态推理与生成
Susladkar, Onkar, Prakash, Tushar, Deshmukh, Gayatri, Nguyen, Kiet A., Zhang, Jiaxun, Juvekar, Adheesh, Bao, Tianshu, Chai, Lin, Mittal, Sparsh, Dhillon, Inderjit S, Lourentzou, Ismini
Abstract
We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.
Chinese Translation
我们提出了UniDFlow,一个用于多模态理解、生成和编辑的统一离散流匹配框架。它通过任务特定的低秩适配器解耦理解与生成,避免了目标干扰和表示纠缠,同时一种新颖的基于参考的多模态偏好对齐在相同条件下优化相对结果,提高了真实性和可控性,而无需大规模重新训练。UniDFlow在八个基准测试中达到了最先进的性能,并展现出对包括图像修复、上下文图像生成、基于参考的编辑和组合生成等任务的强大零样本泛化能力,尽管没有进行明确的任务特定训练。
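UniDFlow's probability path is not specified in the abstract. For intuition, the masked-token path commonly used in discrete flow matching can be sketched as follows — the mask-based corruption is a standard construction from the discrete-flow literature, assumed here rather than taken from the paper:

```python
import random

MASK = -1  # stand-in for a dedicated [MASK] token id

def corrupt(tokens, t, rng=random):
    """Sample x_t on a masked discrete path: each token survives with
    probability t, so t=0 is pure mask noise and t=1 is clean data.
    A model trained to predict the original tokens from x_t defines
    the generative flow from noise back to data."""
    return [tok if rng.random() < t else MASK for tok in tokens]

# At intermediate t, a random subset of tokens is masked.
x_half = corrupt([5, 7, 9, 11], 0.5)
```

Generation then runs t from 0 to 1, progressively replacing masks with model predictions.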
cs.CV / 71 / 2602.12279

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

UniT:统一的多模态思维链测试时扩展
Chen, Leon Liangyu, Ma, Haoyu, Fan, Zhipeng, Huang, Ziqi, Sinha, Animesh, Dai, Xiaoliang, Wang, Jialiang, He, Zecheng, Yang, Jianwei, Li, Chunyuan, Sun, Junzhe, Wang, Chu, Yeung-Levy, Serena, Juefei-Xu, Felix
Abstract
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.
Chinese Translation
统一模型能够在单一架构中处理多模态理解和生成,但通常在单次推理中操作,而不对输出进行迭代优化。许多多模态任务,尤其是涉及复杂空间组合、多个交互对象或不断演变的指令的任务,需要对指令进行分解、验证中间结果并进行迭代修正。虽然测试时扩展(TTS)已证明为迭代推理分配额外推理计算显著提高语言模型性能,但将这一范式扩展到统一多模态模型仍然是一个开放的挑战。我们提出了UniT,一个用于多模态思维链测试时扩展的框架,使单一统一模型能够在多个回合中进行推理、验证和优化。UniT结合了智能体化数据合成、统一模型训练和灵活的测试时推理,以引发包括验证、子目标分解和内容记忆等认知行为。我们的主要发现是:(1)在短推理轨迹上训练的统一模型在测试时能够推广到更长的推理链;(2)顺序思维链推理提供了一种比并行采样更具可扩展性和计算效率的TTS策略;(3)在生成和编辑轨迹上训练能够改善分布外的视觉推理。这些结果确立了多模态测试时扩展作为推动统一模型生成和理解的有效范式。
cs.CV / 72 / 2602.12280

Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching

惊喜的笔触:向量素描中的渐进语义幻觉
Cheng, Huai-Hsun, Zhang, Siang-Ling, Liu, Yu-Lun
Abstract
Visual illusions traditionally rely on spatial manipulations such as multi-view consistency. In this work, we introduce Progressive Semantic Illusions, a novel vector sketching task where a single sketch undergoes a dramatic semantic transformation through the sequential addition of strokes. We present Stroke of Surprise, a generative framework that optimizes vector strokes to satisfy distinct semantic interpretations at different drawing stages. The core challenge lies in the "dual-constraint": initial prefix strokes must form a coherent object (e.g., a duck) while simultaneously serving as the structural foundation for a second concept (e.g., a sheep) upon adding delta strokes. To address this, we propose a sequence-aware joint optimization framework driven by a dual-branch Score Distillation Sampling (SDS) mechanism. Unlike sequential approaches that freeze the initial state, our method dynamically adjusts prefix strokes to discover a "common structural subspace" valid for both targets. Furthermore, we introduce a novel Overlay Loss that enforces spatial complementarity, ensuring structural integration rather than occlusion. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines in recognizability and illusion strength, successfully expanding visual anagrams from the spatial to the temporal dimension. Project page: https://stroke-of-surprise.github.io/
Chinese Translation
视觉幻觉传统上依赖于空间操控,例如多视图一致性。在本研究中,我们引入了渐进语义幻觉(Progressive Semantic Illusions),这是一项新颖的向量素描任务,其中单个素描通过逐步添加笔触经历显著的语义转变。我们提出了惊喜的笔触(Stroke of Surprise),这是一个生成框架,旨在优化向量笔触,以满足不同绘制阶段的独特语义解释。核心挑战在于“双重约束”:初始前缀笔触必须形成一个连贯的对象(例如,一只鸭子),同时在添加增量笔触时作为第二个概念(例如,一只羊)的结构基础。为了解决这个问题,我们提出了一种基于序列感知的联合优化框架,该框架由双分支评分蒸馏采样(Score Distillation Sampling, SDS)机制驱动。与冻结初始状态的顺序方法不同,我们的方法动态调整前缀笔触,以发现对两个目标有效的“共同结构子空间”。此外,我们引入了一种新颖的叠加损失(Overlay Loss),以强制空间互补性,确保结构整合而非遮挡。大量实验表明,我们的方法在可识别性和幻觉强度方面显著优于现有最先进的基线,成功将视觉字谜从空间维度扩展到时间维度。项目页面:https://stroke-of-surprise.github.io/
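The abstract does not give the Overlay Loss formula. One natural reading of "enforces spatial complementarity, ensuring structural integration rather than occlusion" is a penalty on the pointwise product of the two stroke-coverage maps — a hypothetical formulation, sketched here:

```python
import numpy as np

def overlay_loss(prefix_cov, delta_cov):
    """Penalize spatial overlap between prefix-stroke and delta-stroke
    coverage maps (values in [0, 1]): zero when the two stroke sets
    are complementary, growing as delta strokes occlude what the
    prefix strokes already drew."""
    return float(np.mean(prefix_cov * delta_cov))

prefix = np.zeros((4, 4)); prefix[:, :2] = 1.0  # prefix fills the left half
comp = np.zeros((4, 4)); comp[:, 2:] = 1.0      # complementary delta strokes
occl = prefix.copy()                            # delta strokes that occlude
assert overlay_loss(prefix, comp) == 0.0
assert overlay_loss(prefix, occl) > overlay_loss(prefix, comp)
```

Under this reading, minimizing the loss pushes the delta strokes into regions the prefix left empty, so the second concept emerges by addition rather than by painting over the first.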
人工智能 (Artificial Intelligence)
78
cs.AI / 1 / 2602.11159

Explaining AI Without Code: A User Study on Explainable AI

无代码的人工智能解释:关于可解释人工智能的用户研究
Abarca, Natalia, Carvallo, Andrés, Moncada, Claudia López, Bravo-Marquez, Felipe
Abstract
The increasing use of Machine Learning (ML) in sensitive domains such as healthcare, finance, and public policy has raised concerns about the transparency of automated decisions. Explainable AI (XAI) addresses this by clarifying how models generate predictions, yet most methods demand technical expertise, limiting their value for novices. This gap is especially critical in no-code ML platforms, which seek to democratize AI but rarely include explainability. We present a human-centered XAI module in DashAI, an open-source no-code ML platform. The module integrates three complementary techniques, which are Partial Dependence Plots (PDP), Permutation Feature Importance (PFI), and KernelSHAP, into DashAI's workflow for tabular classification. A user study (N = 20; ML novices and experts) evaluated usability and the impact of explanations. Results show: (i) high task success ($\geq80\%$) across all explainability tasks; (ii) novices rated explanations as useful, accurate, and trustworthy on the Explanation Satisfaction Scale (ESS, Cronbach's $\alpha$ = 0.74, a measure of internal consistency), while experts were more critical of sufficiency and completeness; and (iii) explanations improved perceived predictability and confidence on the Trust in Automation scale (TiA, $\alpha$ = 0.60), with novices showing higher trust than experts. These findings highlight a central challenge for XAI in no-code ML, making explanations both accessible to novices and sufficiently detailed for experts.
Chinese Translation
机器学习(ML)在医疗、金融和公共政策等敏感领域的日益应用引发了对自动决策透明性的担忧。可解释人工智能(XAI)通过阐明模型如何生成预测来解决这一问题,但大多数方法需要技术专长,从而限制了其对新手的价值。这一差距在无代码机器学习平台中尤为关键,这些平台旨在使人工智能民主化,但很少包含可解释性。我们在DashAI这一开源无代码机器学习平台中提出了一个以人为中心的XAI模块。该模块将部分依赖图(Partial Dependence Plots, PDP)、置换特征重要性(Permutation Feature Importance, PFI)和KernelSHAP三种互补技术整合到DashAI的表格分类工作流程中。我们进行了一项用户研究(N = 20;机器学习新手和专家),评估可用性和解释的影响。结果显示:(i)所有可解释性任务的任务成功率均高(≥80%);(ii)新手在解释满意度量表(Explanation Satisfaction Scale, ESS,Cronbach's α = 0.74,内部一致性测量)上评价解释为有用、准确且可信,而专家对充分性和完整性则持更为批判的态度;(iii)解释提高了在自动化信任量表(Trust in Automation scale, TiA,α = 0.60)上的感知可预测性和信心,新手的信任程度高于专家。这些发现突显了无代码机器学习中XAI面临的一个核心挑战,即使解释对新手可及,同时又对专家足够详细。
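Of the three techniques in the module, Permutation Feature Importance is the simplest to state precisely: a feature's importance is the drop in the model's score when that feature's column is shuffled, severing its relationship with the target. A self-contained sketch — the toy model and metric are illustrative, not DashAI's implementation:

```python
import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=10, seed=0):
    """Mean score drop over n_repeats shuffles of each feature column:
    large drop -> the model relies on that feature; zero -> it ignores it."""
    rng = np.random.default_rng(seed)
    base = metric(y, model(X))
    importances = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # destroy feature j only
            drops.append(base - metric(y, model(Xp)))
        importances.append(float(np.mean(drops)))
    return importances

# Toy check: the target depends only on feature 0, so shuffling it
# hurts the R^2 score while shuffling feature 1 changes nothing.
gen = np.random.default_rng(1)
X = gen.normal(size=(200, 2))
y = 3.0 * X[:, 0]
model = lambda X: 3.0 * X[:, 0]
r2 = lambda y, p: 1.0 - np.sum((y - p) ** 2) / np.sum((y - np.mean(y)) ** 2)
imp = permutation_importance(model, X, y, r2)
assert imp[0] > imp[1] == 0.0
```

This model-agnostic, post-hoc character is what makes PFI a natural fit for a no-code platform: it needs only predictions, never gradients or internals.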
cs.AI / 2 / 2602.11229

Latent Generative Solvers for Generalizable Long-Term Physics Simulation

通用长期物理模拟的潜在生成求解器
Chen, Zituo, Wu, Haixu, Deng, Sili
Abstract
We study long-horizon surrogate simulation across heterogeneous PDE systems. We introduce Latent Generative Solvers (LGS), a two-stage framework that (i) maps diverse PDE states into a shared latent physics space with a pretrained VAE, and (ii) learns probabilistic latent dynamics with a Transformer trained by flow matching. Our key mechanism is an uncertainty knob that perturbs latent inputs during training and inference, teaching the solver to correct off-manifold rollout drift and stabilizing autoregressive prediction. We further use flow forcing to update a system descriptor (context) from model-generated trajectories, aligning train/test conditioning and improving long-term stability. We pretrain on a curated corpus of $\sim$2.5M trajectories at $128^2$ resolution spanning 12 PDE families. LGS matches strong deterministic neural-operator baselines on short horizons while substantially reducing rollout drift on long horizons. Learning in latent space plus efficient architectural choices yields up to \textbf{70$\times$} lower FLOPs than non-generative baselines, enabling scalable pretraining. We also show efficient adaptation to an out-of-distribution $256^2$ Kolmogorov flow dataset under limited finetuning budgets. Overall, LGS provides a practical route toward generalizable, uncertainty-aware neural PDE solvers that are more reliable for long-term forecasting and downstream scientific workflows.
Chinese Translation
我们研究了异构偏微分方程(PDE)系统中的长期代理模拟。我们引入了潜在生成求解器(Latent Generative Solvers, LGS),这是一个两阶段框架,(i) 将多样的PDE状态映射到一个共享的潜在物理空间,使用预训练的变分自编码器(VAE),以及 (ii) 通过流匹配训练的变换器(Transformer)学习概率潜在动态。我们的关键机制是一个不确定性调节器,它在训练和推理过程中扰动潜在输入,教会求解器修正偏离流形的轨迹推演漂移,并稳定自回归预测。我们进一步使用流强迫来从模型生成的轨迹中更新系统描述符(上下文),对齐训练/测试条件并改善长期稳定性。我们在一个精心挑选的语料库上进行了预训练,该语料库包含约250万条在$128^2$分辨率下跨越12个PDE家族的轨迹。LGS在短期范围内与强大的确定性神经算子基线相匹配,同时在长期范围内显著减少了轨迹漂移。潜在空间的学习加上高效的架构选择使得FLOPs比非生成基线低达70倍,从而实现可扩展的预训练。我们还展示了在有限微调预算下对分布外的$256^2$ Kolmogorov流数据集的高效适应。总体而言,LGS为通用的、具备不确定性感知能力的神经PDE求解器提供了一条实用的路径,这些求解器在长期预测和下游科学工作流程中更为可靠。
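The "uncertainty knob" is described as perturbing latent inputs during training and inference. Mechanically, that amounts to injecting noise before each autoregressive step, so the learned dynamics must map slightly off-manifold states back toward valid ones. A toy sketch of the rollout loop — the contractive step function below stands in for the trained model, and all names are hypothetical:

```python
import numpy as np

def rollout(step, z0, horizon, noise_scale=0.0, seed=0):
    """Autoregressive latent rollout with an 'uncertainty knob':
    noise_scale perturbs each input before the model step. Training
    under this perturbation is what (on our reading) teaches the
    dynamics to correct off-manifold states, limiting drift."""
    rng = np.random.default_rng(seed)
    z, traj = z0, [z0]
    for _ in range(horizon):
        z_in = z + noise_scale * rng.normal(size=z.shape)
        z = step(z_in)
        traj.append(z)
    return np.stack(traj)

# Stand-in for a trained, contractive latent dynamics model: long
# rollouts stay bounded even though every input is noised.
step = lambda z: 0.9 * z
traj = rollout(step, np.ones(4), horizon=50, noise_scale=0.1)
```

A purely expansive model under the same perturbation would diverge, which is why exposing the model to noised inputs at train time matters for long-horizon stability.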
cs.AI / 3 / 2602.11295

On Decision-Valued Maps and Representational Dependence

关于决策值映射和表征依赖性
Raitses, Gil
Abstract
A computational engine applied to different representations of the same data can produce different discrete outcomes, with some representations preserving the result and others changing it entirely. A decision-valued map records which representations preserve the outcome and which change it, associating each member of a declared representation family with the discrete result it produces. This paper formalizes decision-valued maps and describes DecisionDB, an infrastructure that logs, replays and audits these relationships using identifiers computed from content and artifacts stored in write-once form. Deterministic replay recovers each recorded decision identifier exactly from stored artifacts, with all three identifying fields matching their persisted values. The contribution partitions representation space into persistence regions and boundaries, and treats decision reuse as a mechanically checkable condition.
Chinese Translation
应用于相同数据的不同表征的计算引擎可能会产生不同的离散结果,其中一些表征能够保留结果,而另一些则完全改变结果。决策值映射记录了哪些表征保留了结果,哪些表征改变了结果,将声明的表征家族中的每个成员与其产生的离散结果关联起来。本文对决策值映射进行了形式化,并描述了DecisionDB,这是一种基础设施,用于记录、重放和审计这些关系,使用从内容和以一次写入形式存储的工件中计算出的标识符。确定性重放能够从存储的工件中精确恢复每个记录的决策标识符,所有三个标识字段与其持久化值匹配。该贡献将表征空间划分为持久性区域和边界,并将决策重用视为一种可机械检查的条件。
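A minimal sketch of the logging scheme described here — content-derived identifiers over a write-once store that replay can check. The schema and names below are our assumptions, not DecisionDB's actual interface:

```python
import hashlib
import json

def content_id(obj) -> str:
    """Identifier computed from content: a stable hash of a canonical
    JSON serialization of the artifact."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()[:16]

class DecisionLog:
    """Write-once map from (engine id, representation id) to a discrete
    decision, so a replay can check whether a representation still
    yields the recorded outcome (a persistence region) or changes it
    (a boundary crossing)."""

    def __init__(self):
        self._records = {}

    def record(self, engine, representation, decision):
        key = (content_id(engine), content_id(representation))
        if key in self._records and self._records[key] != decision:
            raise ValueError("write-once violation: decision changed")
        self._records[key] = decision
        return key

    def replay_matches(self, engine, representation, decision):
        key = (content_id(engine), content_id(representation))
        return self._records.get(key) == decision

log = DecisionLog()
log.record({"rule": "threshold"}, {"unit": "mm", "x": 5.1}, "accept")
# The same representation replays to the recorded decision; a rescaled
# representation is a distinct member of the family and is unrecorded.
assert log.replay_matches({"rule": "threshold"}, {"unit": "mm", "x": 5.1}, "accept")
assert not log.replay_matches({"rule": "threshold"}, {"unit": "cm", "x": 0.51}, "accept")
```

Canonical serialization (sorted keys, fixed separators) is what makes the identifier deterministic across runs, which is the property deterministic replay depends on.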
cs.AI / 4 / 2602.11298

Voxtral Realtime

Voxtral 实时
Liu, Alexander H., Ehrenberg, Andy, Lo, Andy, Sun, Chen-Yo, Lample, Guillaume, Delignon, Jean-Malo, Chandu, Khyathi Raghavi, von Platen, Patrick, Muddireddy, Pavankumar Reddy, Arora, Rohin, Gandhi, Sanchit, Subramanian, Sandeep, Ghosh, Soham, Mishra, Srijan, Rastogi, Abhinav, Jeffares, Alan, Jiang, Albert, Sablayrolles, Alexandre, Héliou, Amélie, Bai, Andrew, Lenglemetz, Angele, Agarwal, Anmol, Eliseev, Anton, Calvi, Antonia, Majumdar, Arjun, Bout, Baptiste, Rozière, Baptiste, De Monicault, Baudouin, Tibi, Benjamin, Lanfranchi, Clémence, Chen, Connor, Barreau, Corentin, Sautier, Corentin, Courtot, Cyprien, Dabert, Darius, Casas, Diego de las, Chane-Sane, Elliot, Paquin, Enguerrand, Ahmed, Faruk, Baldassarre, Federico, Berrada, Gabrielle, Ecrepont, Gaëtan, Guinet, Gauthier, Hayes, Genevieve, Novikov, Georgii, Pistilli, Giada, Martin, Guillaume, Dhanuka, Gunjan, Gupta, Gunshi, Zhou, Han, Mukherjee, Indraneel, Zhang, Irene, Kim, Jaeyoung, Ludziejewski, Jan, Rute, Jason, Studnia, Joachim, Harvill, John, Amar, Jonas, Roberts, Josselin Somerville, Tauran, Julien, Yadav, Karmesh, Khandelwal, Kartik, Jain, Kush, Aitchison, Laurence, Blier, Léonard, Zhao, Lingxiao, Martin, Louis, Saulnier, Lucile, Gao, Luyu, Buyl, Maarten, Sharma, Manan, Jennings, Margaret, Pellat, Marie, Prins, Mark, Poirée, Mathieu, Guillaumin, Mathilde, Dinot, Matthieu, Futeral, Matthieu, Darrin, Maxime, Augustin, Maximilian, Unsal, Mert, Chiquier, Mia, Grinsztajn, Nathan, Gupta, Neha, Bousquet, Olivier, Duchenne, Olivier, Wang, Patricia, Jacob, Paul, Wambergue, Paul, Kurylowicz, Paula, Chagniot, Philomène, Stock, Pierre, Miłoś, Piotr, Gupta, Prateek, Agrawal, Pravesh, Torroba, Quentin, Ramrakhya, Ram, Shah, Rishi, Sauvestre, Romain, Soletskyi, Roman, Millner, Rosalie, Vaze, Sagar, Humeau, Samuel, Gandhi, Siddharth, Aithal, Sumukh, Antoniak, Szymon, Scao, Teven Le, Cachet, Théo, Sorg, Theo Simon, Lavril, Thibaut, Chabal, Thomas, Foubert, Thomas, Robert, Thomas, Wang, Thomas, Lawson, Tim, Bewley, Tom, Edwards, Tom, Wang, Tyler, Nemychnikova, Valeriia, Phung, Van, Nanda, Vedant, Jouault, Victor, Richard, Virgile, Bataev, Vladislav, Bouaziz, Wassim, Li, Wen-Ding, Marshall, William, Li, Xinghui, Guo, Xingran, Yang, Xinyu, Neuhaus, Yannic, Wang, Yihan, Ramzi, Zaccharie, Xu, Zhenlin
Abstract
We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.
Chinese Translation
我们介绍了 Voxtral 实时,这是一种原生流式自动语音识别模型,能够在亚秒延迟下达到离线转录质量。与通过分块或滑动窗口适配离线模型的方法不同,Voxtral 实时是针对流式处理进行端到端训练的,音频和文本流之间具有明确的对齐。我们的架构基于延迟流建模框架,引入了一种新的因果音频编码器和自适应 RMS-Norm(Ada RMS-Norm),以改善延迟条件。我们将预训练扩展到一个涵盖 13 种语言的大规模数据集。在 480 毫秒的延迟下,Voxtral 实时的性能与 Whisper 相当,后者是最广泛部署的离线转录系统。我们在 Apache 2.0 许可证下发布模型权重。
cs.AI / 5 / 2602.11301

The PBSAI Governance Ecosystem: A Multi-Agent AI Reference Architecture for Securing Enterprise AI Estates

PBSAI治理生态系统:用于保护企业AI资产的多智能体AI参考架构
Willis, John M.
Abstract
Enterprises are rapidly deploying large language models, retrieval augmented generation pipelines, and tool using agents into production, often on shared high performance computing clusters and cloud accelerator platforms that also support defensive analytics. These systems increasingly function not as isolated models but as AI estates: socio technical systems spanning models, agents, data pipelines, security tooling, human workflows, and hyperscale infrastructure. Existing governance and security frameworks, including the NIST AI Risk Management Framework and systems security engineering guidance, articulate principles and risk functions but do not provide implementable architectures for multi agent, AI enabled cyber defense. This paper introduces the Practitioners Blueprint for Secure AI (PBSAI) Governance Ecosystem, a multi agent reference architecture for securing enterprise and hyperscale AI estates. PBSAI organizes responsibilities into a twelve domain taxonomy and defines bounded agent families that mediate between tools and policy through shared context envelopes and structured output contracts. The architecture assumes baseline enterprise security capabilities and encodes key systems security techniques, including analytic monitoring, coordinated defense, and adaptive response. A lightweight formal model of agents, context envelopes, and ecosystem level invariants clarifies the traceability, provenance, and human in the loop guarantees enforced across domains. We demonstrate alignment with NIST AI RMF functions and illustrate application in enterprise SOC and hyperscale defensive environments. PBSAI is proposed as a structured, evidence centric foundation for open ecosystem development and future empirical validation.
Chinese Translation
企业正在迅速将大型语言模型、增强检索生成管道和工具使用代理投入生产,通常是在共享的高性能计算集群和云加速平台上,这些平台还支持防御性分析。这些系统越来越多地作为AI资产而非孤立模型运作:它们是跨越模型、代理、数据管道、安全工具、人类工作流程和超大规模基础设施的社会技术系统。现有的治理和安全框架,包括NIST AI风险管理框架和系统安全工程指导,阐述了原则和风险功能,但未提供可实施的多智能体、AI驱动的网络防御架构。本文介绍了安全AI从业者蓝图(PBSAI)治理生态系统,这是一个用于保护企业和超大规模AI资产的多智能体参考架构。PBSAI将责任组织成十二个领域的分类法,并定义了通过共享上下文封装和结构化输出合同在工具和政策之间进行调解的有限代理家族。该架构假设基础企业安全能力,并编码关键的系统安全技术,包括分析监控、协调防御和自适应响应。代理、上下文封装和生态系统级不变性的轻量级形式模型阐明了跨领域强制执行的可追溯性、来源和人类参与保障。我们展示了与NIST AI RMF功能的一致性,并说明了在企业安全运营中心和超大规模防御环境中的应用。PBSAI被提议作为开放生态系统开发和未来实证验证的结构化、以证据为中心的基础。
cs.AI / 6 / 2602.11318

Dissecting Subjectivity and the "Ground Truth" Illusion in Data Annotation

剖析数据标注中的主观性与“真实依据”幻觉
Munir, Sheza, Mah, Benjamin, Kalsi, Krisha, Kapania, Shivani, Posada, Julian, Law, Edith, Wang, Ding, Ahmed, Syed Ishtiaque
Abstract
In machine learning, "ground truth" refers to the assumed correct labels used to train and evaluate models. However, the foundational "ground truth" paradigm rests on a positivistic fallacy that treats human disagreement as technical noise rather than a vital sociotechnical signal. This systematic literature review analyzes research published between 2020 and 2025 across seven premier venues: ACL, AIES, CHI, CSCW, EAAMO, FAccT, and NeurIPS, investigating the mechanisms in data annotation practices that facilitate this "consensus trap". Our identification phase captured 30,897 records, which were refined via a tiered keyword filtration schema to a high-recall corpus of 3,042 records for manual screening, resulting in a final included corpus of 346 papers for qualitative synthesis. Our reflexive thematic analysis reveals that systemic failures in positional legibility, combined with the recent architectural shift toward human-as-verifier models, specifically the reliance on model-mediated annotations, introduce deep-seated anchoring bias and effectively remove human voices from the loop. We further demonstrate how geographic hegemony imposes Western norms as universal benchmarks, often enforced by the performative alignment of precarious data workers who prioritize requester compliance over honest subjectivity to avoid economic penalties. Critiquing the "noisy sensor" fallacy, where statistical models misdiagnose cultural pluralism as random error, we argue for reclaiming disagreement as a high-fidelity signal essential for building culturally competent models. To address these systemic tensions, we propose a roadmap for pluralistic annotation infrastructures that shift the objective from discovering a singular "right" answer to mapping the diversity of human experience.
Chinese Translation
在机器学习中,“真实依据”指的是用于训练和评估模型的假定正确标签。然而,基础的“真实依据”范式建立在一种实证主义谬误之上,将人类的分歧视为技术噪声,而非重要的社会技术信号。本系统文献综述分析了2020年至2025年间在七个顶级会议(ACL、AIES、CHI、CSCW、EAAMO、FAccT和NeurIPS)上发表的研究,探讨了数据标注实践中促进这种“共识陷阱”的机制。我们的识别阶段捕获了30,897条记录,通过分层关键词过滤方案精炼为高召回率的3,042条记录进行手动筛选,最终纳入346篇论文进行定性综合。我们的反思性主题分析揭示了位置可读性中的系统性失败,加上最近向人类验证者模型的架构转变,特别是对模型中介标注的依赖,引入了根深蒂固的锚定偏见,并实际上将人类的声音排除在循环之外。我们进一步展示了地理霸权如何将西方规范强加为普遍基准:处境不稳定的数据工作者为避免经济惩罚,往往以表演性对齐优先满足请求方要求而非表达诚实的主观判断,从而强化了这一基准。批判“嘈杂传感器”谬误,即统计模型将文化多元性误诊为随机误差,我们主张将分歧重新视为构建具有文化胜任力的模型所必需的高保真信号。为了解决这些系统性紧张关系,我们提出了一条多元化标注基础设施的路线图,将目标从发现单一“正确”答案转向映射人类经验的多样性。
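The "noisy sensor" critique above can be made concrete: rather than collapsing annotator labels into a majority vote, one can keep the full label distribution and score its entropy as a disagreement signal. A minimal sketch (function names are illustrative, not from the paper):

```python
from collections import Counter
import math

def label_distribution(labels):
    """Normalized distribution over the labels annotators gave one item."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def disagreement_entropy(labels):
    """Shannon entropy (bits) of the annotator label distribution:
    0.0 means full consensus; higher values flag items whose
    disagreement may be signal rather than noise."""
    return -sum(p * math.log2(p) for p in label_distribution(labels).values())
```

Items with high entropy can then be surfaced for pluralistic treatment instead of being discarded as annotation noise.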
cs.AI / 7 / 2602.11340

Bi-Level Prompt Optimization for Multimodal LLM-as-a-Judge

面向多模态 LLM 评判者的双层提示优化
Pan, Bo, Kan, Xuan, Zhang, Kaitai, Yan, Yan, Tan, Shunwen, He, Zihao, Ding, Zixin, Wu, Junjie, Zhao, Liang
Abstract
Large language models (LLMs) have become widely adopted as automated judges for evaluating AI-generated content. Despite their success, aligning LLM-based evaluations with human judgments remains challenging. While supervised fine-tuning on human-labeled data can improve alignment, it is costly and inflexible, requiring new training for each task or dataset. Recent progress in auto prompt optimization (APO) offers a more efficient alternative by automatically improving the instructions that guide LLM judges. However, existing APO methods primarily target text-only evaluations and remain underexplored in multimodal settings. In this work, we study auto prompt optimization for multimodal LLM-as-a-judge, particularly for evaluating AI-generated images. We identify a key bottleneck: multimodal models can only process a limited number of visual examples due to context window constraints, which hinders effective trial-and-error prompt refinement. To overcome this, we propose BLPO, a bi-level prompt optimization framework that converts images into textual representations while preserving evaluation-relevant visual cues. Our bi-level optimization approach jointly refines the judge prompt and the image-to-text (I2T) prompt to maintain fidelity under limited context budgets. Experiments on four datasets and three LLM judges demonstrate the effectiveness of our method.
Chinese Translation
大型语言模型(LLMs)已被广泛应用于作为自动评判者来评估 AI 生成的内容。尽管取得了成功,但将基于 LLM 的评估与人类判断对齐仍然具有挑战性。虽然在人工标注数据上进行监督微调可以改善对齐,但这种方法成本高且灵活性差,每个任务或数据集都需要新的训练。最近在自动提示优化(APO)方面的进展提供了一种更高效的替代方案,通过自动改进指导 LLM 评判者的指令。然而,现有的 APO 方法主要针对仅文本的评估,在多模态环境中仍然未得到充分探索。在本研究中,我们研究了多模态 LLM 作为评判者的自动提示优化,特别是用于评估 AI 生成的图像。我们识别出一个关键瓶颈:由于上下文窗口的限制,多模态模型只能处理有限数量的视觉示例,这阻碍了有效的试错提示优化。为了解决这个问题,我们提出了 BLPO,一种双层次提示优化框架,该框架将图像转换为文本表示,同时保留与评估相关的视觉线索。我们的双层次优化方法共同优化评判者提示和 I2T 提示,以在有限的上下文预算下保持保真度。在四个数据集和三个 LLM 评判者上的实验表明了我们方法的有效性。
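The bi-level loop described above can be sketched as coordinate ascent over the two prompts. The scoring function below is a toy stand-in for running the multimodal judge on a human-labeled dev set; all names and the overlap score are illustrative assumptions, not BLPO's implementation:

```python
def alignment_score(judge_prompt, i2t_prompt):
    """Toy stand-in for measuring human-alignment of the judge's verdicts
    on a labeled dev set; here just character overlap, for illustration."""
    return len(set(judge_prompt) & set(i2t_prompt))

def bi_level_optimize(judge_candidates, i2t_candidates, rounds=3):
    """Coordinate ascent over the two levels: fix the I2T prompt and pick
    the best judge prompt, then fix the judge prompt and refine the I2T
    prompt, repeating for a few rounds."""
    judge, i2t = judge_candidates[0], i2t_candidates[0]
    for _ in range(rounds):
        judge = max(judge_candidates, key=lambda j: alignment_score(j, i2t))
        i2t = max(i2t_candidates, key=lambda t: alignment_score(judge, t))
    return judge, i2t
```

In the real setting each `max` would be an LLM-driven prompt-refinement step scored against human judgments rather than a search over a fixed candidate list.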
cs.AI / 8 / 2602.11348

AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition

AgentNoiseBench:在噪声条件下评估工具使用型大型语言模型代理的鲁棒性
Wang, Ruipeng, Chen, Yuxin, Wang, Yukai, Wu, Chang, Fang, Junfeng, Cai, Xiaodong, Gu, Qi, Su, Hui, Zhang, An, Wang, Xiang, Cai, Xunliang, Chua, Tat-Seng
Abstract
Recent advances in large language models have enabled LLM-based agents to achieve strong performance on a variety of benchmarks. However, their performance in real-world deployments often falls short of that observed in benchmark settings, especially in complex and imperfect environments. This discrepancy largely arises because prevailing training and evaluation paradigms are typically built on idealized assumptions, overlooking the inherent stochasticity and noise present in real-world interactions. To bridge this gap, we introduce AgentNoiseBench, a framework for systematically evaluating the robustness of agentic models under noisy environments. We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios and categorize environmental noise into two primary types: user-noise and tool-noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks while preserving task solvability. Leveraging this pipeline, we perform extensive evaluations across a wide range of models with diverse architectures and parameter scales. Our results reveal consistent performance variations under different noise conditions, highlighting the sensitivity of current agentic models to realistic environmental perturbations.
Chinese Translation
近期大型语言模型的进展使得基于LLM的代理在多种基准测试中取得了优异的表现。然而,它们在实际部署中的表现往往不如基准设置中的表现,尤其是在复杂和不完美的环境中。这种差异主要源于当前的训练和评估范式通常建立在理想化假设之上,忽视了现实世界交互中固有的随机性和噪声。为了弥补这一差距,我们提出了AgentNoiseBench,这是一个系统评估代理模型在噪声环境下鲁棒性的框架。我们首先对现实场景中的偏见和不确定性进行深入分析,并将环境噪声分为两种主要类型:用户噪声和工具噪声。在此分析的基础上,我们开发了一个自动化流程,该流程在现有的以代理为中心的基准中注入可控噪声,同时保持任务的可解性。利用该流程,我们对多种不同架构和参数规模的模型进行了广泛评估。我们的结果揭示了在不同噪声条件下表现的一致性变化,突显了当前代理模型对现实环境扰动的敏感性。
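The two noise categories can be illustrated with one controllable perturbation each. This is a hedged sketch, not the paper's pipeline; seeding keeps perturbations reproducible so task solvability can be checked:

```python
import random

def inject_user_noise(message, rate=0.2, rng=None):
    """User-noise: character-level typos in the user message, at a
    controllable rate, leaving most of the text readable."""
    rng = rng or random.Random(0)
    chars = list(message)
    for i, c in enumerate(chars):
        if c.isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def inject_tool_noise(tool_result, failure_rate=0.3, rng=None):
    """Tool-noise: occasionally replace a tool result with a transient
    error, modeling flaky real-world tool backends."""
    rng = rng or random.Random(1)
    if rng.random() < failure_rate:
        return {"status": "error", "message": "transient failure, please retry"}
    return {"status": "ok", "result": tool_result}
```

Setting `rate` and `failure_rate` gives the controllable knob the abstract describes; a solvability check would re-run a reference agent on the perturbed task.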
cs.AI / 9 / 2602.11351

Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization

通过行为代理优化推动主动代理的帕累托前沿
Yao, Yihang, Cen, Zhepeng, Lin, Haohong, Liu, Shiqi, Liu, Zuxin, Zhu, Jiacheng, Hong, Zhang-Wei, Shi, Laixi, Zhao, Ding
Abstract
Proactive large language model (LLM) agents aim to actively plan, query, and interact over multiple turns, enabling efficient task completion beyond passive instruction following and making them essential for real-world, user-centric applications. Agentic reinforcement learning (RL) has recently emerged as a promising solution for training such agents in multi-turn settings, allowing interaction strategies to be learned from feedback. However, existing pipelines face a critical challenge in balancing task performance with user engagement, as passive agents can not efficiently adapt to users' intentions while overuse of human feedback reduces their satisfaction. To address this trade-off, we propose BAO, an agentic RL framework that combines behavior enhancement to enrich proactive reasoning and information-gathering capabilities with behavior regularization to suppress inefficient or redundant interactions and align agent behavior with user expectations. We evaluate BAO on multiple tasks from the UserRL benchmark suite, and demonstrate that it substantially outperforms proactive agentic RL baselines while achieving comparable or even superior performance to commercial LLM agents, highlighting its effectiveness for training proactive, user-aligned LLM agents in complex multi-turn scenarios. Our website: https://proactive-agentic-rl.github.io/.
Chinese Translation
主动大型语言模型(LLM)代理旨在积极规划、查询和进行多轮交互,使得任务完成效率超越被动的指令遵循,从而在现实世界的用户中心应用中变得不可或缺。代理强化学习(RL)最近成为在多轮环境中训练此类代理的有前景的解决方案,允许从反馈中学习交互策略。然而,现有的流程面临着在任务性能与用户参与之间平衡的关键挑战,因为被动代理无法有效适应用户的意图,而过度依赖人类反馈会降低用户的满意度。为了解决这一权衡,我们提出了BAO,一个代理强化学习框架,结合了行为增强以丰富主动推理和信息收集能力,以及行为正则化以抑制低效或冗余的交互,并使代理行为与用户期望保持一致。我们在UserRL基准套件的多个任务上评估了BAO,结果表明其在性能上显著优于主动代理强化学习基线,同时在性能上与商业LLM代理相当甚至更优,突显了其在复杂多轮场景中训练主动、用户对齐的LLM代理的有效性。我们的网站:https://proactive-agentic-rl.github.io/
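The trade-off BAO targets can be expressed as a shaped reward that keeps task success primary while charging for user burden and redundant interaction. A minimal sketch with hypothetical cost coefficients (not the paper's exact regularizer):

```python
def shaped_reward(task_reward, num_user_queries, redundant_turns,
                  query_cost=0.1, redundancy_cost=0.2):
    """Behavior-regularized reward: task success stays the main signal,
    while each clarification query (user burden) and each redundant,
    information-free turn is charged a cost."""
    return task_reward - query_cost * num_user_queries \
                       - redundancy_cost * redundant_turns
```

Tuning the two costs moves the agent along the proactivity/engagement Pareto frontier: zero costs recover pure task optimization, large costs suppress all clarification.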
cs.AI / 10 / 2602.11354

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

ReplicatorBench:社会与行为科学中 LLM 代理的可复制性基准测试
Nguyen, Bang, Soós, Dominik, Ma, Qian, Obadage, Rochana R., Ranjan, Zack, Koneru, Sai, Errington, Timothy M., Nematova, Shakhlo, Rajtmajer, Sarah, Wu, Jian, Jiang, Meng
Abstract
The literature has witnessed an emerging interest in AI agents for automated assessment of scientific papers. Existing benchmarks focus primarily on the computational aspect of this task, testing agents' ability to reproduce or replicate research outcomes when having access to the code and data. This setting, while foundational, (1) fails to capture the inconsistent availability of new data for replication as opposed to reproduction, and (2) lacks ground-truth diversity by focusing only on reproducible papers, thereby failing to evaluate an agent's ability to identify non-replicable research. Furthermore, most benchmarks only evaluate outcomes rather than the replication process. In response, we introduce ReplicatorBench, an end-to-end benchmark, including human-verified replicable and non-replicable research claims in social and behavioral sciences for evaluating AI agents in research replication across three stages: (1) extraction and retrieval of replication data; (2) design and execution of computational experiments; and (3) interpretation of results, allowing a test of AI agents' capability to mimic the activities of human replicators in the real world. To set a baseline of AI agents' capability, we develop ReplicatorAgent, an agentic framework equipped with necessary tools, like web search and iterative interaction with sandboxed environments, to accomplish tasks in ReplicatorBench. We evaluate ReplicatorAgent across four underlying large language models (LLMs), as well as different design choices of programming language and levels of code access. Our findings reveal that while current LLM agents are capable of effectively designing and executing computational experiments, they struggle with retrieving resources, such as new data, necessary to replicate a claim. All code and data are publicly available at https://github.com/CenterForOpenScience/llm-benchmarking.
Chinese Translation
文献中出现了对 AI 代理进行科学论文自动评估的日益关注。现有基准主要集中在这一任务的计算方面,测试代理在获得代码和数据时重现或复制研究结果的能力。尽管这一设置是基础性的,但(1)未能捕捉到与重现相比,复制所需的新数据的不一致可用性,以及(2)由于仅关注可重现的论文,缺乏真实情况的多样性,从而未能评估代理识别不可复制研究的能力。此外,大多数基准仅评估结果,而不是复制过程。对此,我们引入了 ReplicatorBench,这是一个端到端的基准,包含经过人工验证的社会与行为科学中的可复制和不可复制研究主张,用于评估 AI 代理在研究复制中的能力,分为三个阶段:(1)复制数据的提取和检索;(2)计算实验的设计和执行;(3)结果的解释,从而测试 AI 代理在现实世界中模仿人类复制者活动的能力。为了设定 AI 代理能力的基线,我们开发了 ReplicatorAgent,这是一个配备必要工具的代理框架,如网络搜索和与沙盒环境的迭代交互,以完成 ReplicatorBench 中的任务。我们在四种基础大型语言模型(LLMs)以及不同的编程语言设计选择和代码访问级别上评估 ReplicatorAgent。我们的研究结果表明,尽管当前的 LLM 代理能够有效地设计和执行计算实验,但在检索复制主张所需的新数据等资源方面存在困难。所有代码和数据均可在 https://github.com/CenterForOpenScience/llm-benchmarking 上公开获取。
cs.AI / 11 / 2602.11389

Causal-JEPA: Learning World Models through Object-Level Latent Interventions

因果-JEPA:通过对象级潜在干预学习世界模型
Nam, Heejeong, Lidec, Quentin Le, Maes, Lucas, LeCun, Yann, Balestriero, Randall
Abstract
World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By applying object-level masking that requires an object's state to be inferred from other objects, C-JEPA induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making interaction reasoning essential. Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning compared to the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object-level masking induces a causal inductive bias via latent interventions. Our code is available at https://github.com/galilai-group/cjepa.
Chinese Translation
世界模型需要强大的关系理解能力,以支持预测、推理和控制。尽管以对象为中心的表示提供了有用的抽象,但它们不足以捕捉依赖于交互的动态。因此,我们提出了C-JEPA,这是一种简单而灵活的以对象为中心的世界模型,它将掩蔽联合嵌入预测从图像块扩展到以对象为中心的表示。通过应用对象级掩蔽,要求从其他对象推断对象的状态,C-JEPA引入了具有反事实效果的潜在干预,并防止了捷径解决方案,使得交互推理变得至关重要。在实证研究中,C-JEPA在视觉问答任务中带来了持续的提升,与没有对象级掩蔽的相同架构相比,在反事实推理方面的绝对改进约为20%。在代理控制任务中,C-JEPA通过仅使用基于图像块的世界模型所需的总潜在输入特征的1%,实现了显著更高效的规划,同时达到了可比的性能。最后,我们提供了正式分析,证明对象级掩蔽通过潜在干预引入了因果归纳偏差。我们的代码可在https://github.com/galilai-group/cjepa获取。
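Object-level masking as described can be sketched with a latent intervention that zeroes one object's slot and forces prediction from the others. The linear predictor below is a toy stand-in for the JEPA predictor; shapes and names are illustrative assumptions:

```python
import numpy as np

def object_level_mask(object_latents, masked_idx):
    """Latent intervention: zero out one object's slot so its state must
    be inferred from the other objects, blocking copy shortcuts."""
    context = object_latents.copy()
    context[masked_idx] = 0.0
    return context

def predict_masked_object(context, masked_idx, W):
    """Toy linear predictor standing in for the JEPA predictor: read the
    masked object's latent off the remaining objects' latents."""
    others = np.delete(context, masked_idx, axis=0)  # drop the masked slot
    return W @ others.flatten()
```

Training would minimize the gap between the prediction and the target encoder's latent for the masked object, which is what makes interaction modeling unavoidable.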
cs.AI / 12 / 2602.11408

GHOST: Unmasking Phantom States in Mamba2 via Grouped Hidden-state Output-aware Selection & Truncation

GHOST:通过分组隐状态输出感知选择与截断揭示 Mamba2 中的虚幻状态
Menezes, Michael, Kyrillidis, Anastasios
Abstract
While Mamba2's expanded state dimension enhances temporal modeling, it incurs substantial inference overhead that saturates bandwidth during autoregressive generation. Standard pruning methods fail to address this bottleneck: unstructured sparsity leaves activations dense, magnitude-based selection ignores runtime dynamics, and gradient-based methods impose prohibitive costs. We introduce GHOST (Grouped Hidden-state Output-aware Selection and Truncation), a structured pruning framework that approximates control-theoretic balanced truncation using only forward-pass statistics. By jointly measuring controllability and observability, GHOST rivals the fidelity of gradient-based methods without requiring backpropagation. As a highlight, on models ranging from 130M to 2.7B parameters, our approach achieves a 50% state-dimension reduction with approximately 1 perplexity point increase on WikiText-2. Code is available at https://anonymous.4open.science/r/mamba2_ghost-7BCB/.
Chinese Translation
虽然 Mamba2 扩展的状态维度增强了时间建模能力,但在自回归生成过程中却带来了可观的推理开销,导致带宽饱和。标准的剪枝方法未能解决这一瓶颈:无结构稀疏性使得激活值依然密集,基于幅度的选择忽视了运行时动态,而基于梯度的方法则带来了高昂的成本。我们提出了 GHOST(分组隐状态输出感知选择与截断),这是一种结构化剪枝框架,仅使用前向传播统计来近似控制理论中的平衡截断。通过联合测量可控性和可观测性,GHOST 在不需要反向传播的情况下,与基于梯度的方法的保真度相媲美。作为亮点,在参数范围从 1.3 亿到 27 亿的模型上,我们的方法实现了 50% 的状态维度减少,同时在 WikiText-2 上的困惑度仅增加了约 1 点。代码可在 https://anonymous.4open.science/r/mamba2_ghost-7BCB/ 获取。
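The forward-pass-only idea can be sketched by approximating per-dimension controllability and observability from collected activations and truncating the lowest-scoring state dimensions. The proxies below are assumptions for illustration, not GHOST's exact statistics:

```python
import numpy as np

def ghost_scores(hidden_states, output_contrib):
    """Score each state dimension as (controllability proxy) x
    (observability proxy): mean squared activation times mean squared
    contribution to the layer output, both from forward passes only."""
    ctrl = (hidden_states ** 2).mean(axis=0)   # how strongly the dim is driven
    obs = (output_contrib ** 2).mean(axis=0)   # how strongly it is read out
    return ctrl * obs

def truncate_states(scores, keep_ratio=0.5):
    """Structured pruning: keep the indices of the top-scoring fraction
    of state dimensions."""
    k = max(1, int(len(scores) * keep_ratio))
    return np.argsort(scores)[::-1][:k]
```

The product mirrors balanced truncation's intuition: a dimension matters only if it is both excited by inputs and visible in outputs; either factor near zero marks it prunable.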
cs.AI / 13 / 2602.11409

TRACER: Trajectory Risk Aggregation for Critical Episodes in Agentic Reasoning

TRACER:用于代理推理中关键事件的轨迹风险聚合
Tayebati, Sina, Kumar, Divake, Darabi, Nastaran, Ettori, Davide, Krishnan, Ranganath, Trivedi, Amit Ranjan
Abstract
Estimating uncertainty for AI agents in real-world multi-turn tool-using interaction with humans is difficult because failures are often triggered by sparse critical episodes (e.g., looping, incoherent tool use, or user-agent miscoordination) even when local generation appears confident. Existing uncertainty proxies focus on single-shot text generation and therefore miss these trajectory-level breakdown signals. We introduce TRACER, a trajectory-level uncertainty metric for dual-control Tool-Agent-User interaction. TRACER combines content-aware surprisal with situational-awareness signals, semantic and lexical repetition, and tool-grounded coherence gaps, and aggregates them using a tail-focused risk functional with a MAX-composite step risk to surface decisive anomalies. We evaluate TRACER on $\tau^2$-bench by predicting task failure and selective task execution. To this end, TRACER improves AUROC by up to 37.1% and AUARC by up to 55% over baselines, enabling earlier and more accurate detection of uncertainty in complex conversational tool-use settings. Our code and benchmark are available at https://github.com/sinatayebati/agent-tracer.
Chinese Translation
在现实世界中与人类进行多轮工具使用交互时,估计人工智能代理的不确定性十分困难,因为即使局部生成看起来很自信,失败往往是由稀疏的关键事件(例如循环、不连贯的工具使用或用户与代理的失调)引发的。现有的不确定性替代指标主要针对单次文本生成,因此错过了这些轨迹级别的崩溃信号。我们引入了TRACER,一种用于双控制工具-代理-用户交互的轨迹级不确定性度量。TRACER结合了内容感知的惊异度、情境意识信号、语义和词汇重复以及基于工具的连贯性缺口,并使用以尾部为中心的风险泛函进行聚合,采用MAX复合步骤风险来揭示决定性的异常。我们在$\tau^2$-bench上通过预测任务失败和选择性任务执行来评估TRACER。相较于基线,TRACER将AUROC提高了高达37.1%,将AUARC提高了高达55%,从而在复杂的对话式工具使用环境中实现了更早、更准确的不确定性检测。我们的代码和基准可在https://github.com/sinatayebati/agent-tracer获取。
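The two aggregation ideas, a MAX-composite step risk and tail-focused trajectory aggregation, can be sketched as follows. A CVaR-style tail mean is assumed for the risk functional; signal names are illustrative:

```python
import numpy as np

def step_risk(surprisal, repetition, coherence_gap):
    """MAX-composite step risk: a step is flagged if ANY signal spikes."""
    return max(surprisal, repetition, coherence_gap)

def trajectory_risk(step_risks, tail_q=0.75):
    """Tail-focused aggregation: mean of the step risks at or above the
    tail quantile, so sparse critical episodes dominate the score
    instead of being washed out by many confident steps."""
    r = np.asarray(step_risks, dtype=float)
    thresh = np.quantile(r, tail_q)
    return float(r[r >= thresh].mean())
```

On a trajectory like `[0, 0, 0, 1.0]` a plain mean reports 0.25, while the tail mean reports 1.0, which is why a single critical episode can still drive the trajectory score.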
cs.AI / 14 / 2602.11437

Distributionally Robust Cooperative Multi-Agent Reinforcement Learning via Robust Value Factorization

基于鲁棒价值分解的分布鲁棒合作多智能体强化学习
Qu, Chengrui, Yeh, Christopher, Panaganti, Kishan, Mazumdar, Eric, Wierman, Adam
Abstract
Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution, where value-factorization methods enforce the individual-global-maximum (IGM) principle so that decentralized greedy actions recover the team-optimal joint action. However, this recipe remains unreliable in real-world settings due to environmental uncertainties arising from the sim-to-real gap, model mismatch, and system noise. We address this gap by introducing Distributionally robust IGM (DrIGM), a principle that requires each agent's robust greedy action to align with the robust team-optimal joint action. We show that DrIGM holds for a novel definition of robust individual action values, which is compatible with decentralized greedy execution and yields a provable robustness guarantee for the whole system. Building on this foundation, we derive DrIGM-compliant robust variants of existing value-factorization architectures (e.g., VDN/QMIX/QTRAN) that (i) train on robust Q-targets, (ii) preserve scalability, and (iii) integrate seamlessly with existing codebases without bespoke per-agent reward shaping. Empirically, on high-fidelity SustainGym simulators and a StarCraft game environment, our methods consistently improve out-of-distribution performance. Code and data are available at https://github.com/crqu/robust-coMARL.
Chinese Translation
合作多智能体强化学习(MARL)通常采用集中训练与分散执行的方式,其中价值分解方法强制执行个体-全局-最大(IGM)原则,使得分散的贪婪动作能够恢复团队最优的联合动作。然而,由于来自仿真与现实之间的差距、模型不匹配和系统噪声等环境不确定性,这一方法在现实世界中的可靠性仍然不足。我们通过引入分布鲁棒IGM(DrIGM)来解决这一问题,该原则要求每个智能体的鲁棒贪婪动作与鲁棒团队最优联合动作保持一致。我们证明了DrIGM适用于一种新定义的鲁棒个体动作值,该定义与分散贪婪执行兼容,并为整个系统提供了可证明的鲁棒性保证。在此基础上,我们推导出符合DrIGM的现有价值分解架构(如VDN/QMIX/QTRAN)的鲁棒变体,这些变体(i)在鲁棒Q目标上进行训练,(ii)保持可扩展性,并且(iii)与现有代码库无缝集成,无需为每个智能体定制奖励塑形。在高保真SustainGym模拟器和StarCraft游戏环境中,我们的方法在分布外性能上始终表现出改善。代码和数据可在https://github.com/crqu/robust-coMARL获取。
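The robust individual value and its additive (VDN-style) factorization can be sketched over a finite uncertainty set of sampled environment models; the worst-case minimum below is one illustrative instantiation of the robust value, not the paper's exact definition:

```python
import numpy as np

def robust_individual_q(q_over_models):
    """Robust individual action value: worst case over an uncertainty
    set of environment models (rows = sampled models, cols = actions)."""
    return np.min(q_over_models, axis=0)

def greedy_actions(per_agent_robust_q):
    """Decentralized greedy execution on robust individual values."""
    return [int(np.argmax(q)) for q in per_agent_robust_q]

def vdn_robust_joint_q(per_agent_robust_q, actions):
    """VDN-style additive factorization of the robust joint value."""
    return float(sum(q[a] for q, a in zip(per_agent_robust_q, actions)))
```

Because each agent is greedy with respect to its own robust values, the additive joint value is maximized by exactly those decentralized choices, which is the DrIGM-style alignment the abstract describes.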
cs.AI / 15 / 2602.11455

Credit Where It is Due: Cross-Modality Connectivity Drives Precise Reinforcement Learning for MLLM Reasoning

应得的信用:跨模态连接驱动多模态大语言模型的精确强化学习推理
Jiao, Zhengbo, Wang, Shaobo, Zhang, Zifan, Wang, Wei, Zhao, Bing, Wei, Hu, Zhang, Linfeng
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet how visual evidence is integrated during reasoning remains poorly understood. We explore multimodal RLVR through the lens of cross-modal attention connectivity and find that only a small fraction of tokens (approximately 15%) exhibit strong visual-textual coupling. These high-connectivity tokens act as anchors that ground reasoning in the image, while the majority follow linguistic patterns. During RLVR training, credit assignment naturally concentrates on these anchors, sharpening their visual grounding over time. Building on this insight, we propose Anchor-Token Reinforcement Learning (AT-RL), a lightweight framework that selectively reinforces high-connectivity tokens via graph-based clustering of attention topology. Evaluated across the series (3B-32B), AT-RL introduces only 1.2% overhead yet enables the 32B model to surpass the 72B-Instruct baseline on MathVista (80.2), with consistent gains observed across STEM, video and general tasks. Conversely, training solely on low-connectivity tokens causes severe degradation, confirming that effective multimodal RL hinges on precise credit assignment to visual anchors. Our work reveals that reasoning quality is governed not by token quantity but by the fidelity of cross-modal anchoring.
Chinese Translation
可验证奖励的强化学习(RLVR)显著提升了多模态大语言模型(MLLMs)的推理能力,但在推理过程中如何整合视觉证据仍然不够清晰。我们通过跨模态注意力连接的视角探索多模态RLVR,发现只有一小部分标记(大约15%)表现出强烈的视觉-文本耦合。这些高连接性标记作为锚点,将推理与图像相结合,而大多数标记则遵循语言模式。在RLVR训练过程中,信用分配自然集中于这些锚点,随着时间的推移增强其视觉基础。基于这一见解,我们提出了锚点标记强化学习(AT-RL),这是一个轻量级框架,通过基于图的注意力拓扑聚类选择性地强化高连接性标记。在3B-32B系列的评估中,AT-RL仅引入1.2%的开销,但使32B模型在MathVista上超越72B-Instruct基线(80.2),并在STEM、视频和一般任务中观察到一致的增益。相反,仅在低连接性标记上训练会导致严重退化,确认有效的多模态RL依赖于对视觉锚点的精确信用分配。我们的研究表明,推理质量的高低不是由标记数量决定的,而是由跨模态锚定的准确性所主导。
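Anchor selection by cross-modal connectivity can be sketched directly from an attention matrix: sum each text token's attention mass over image tokens and keep roughly the top 15%. This omits the paper's graph-based clustering and uses illustrative names:

```python
import numpy as np

def cross_modal_connectivity(attn):
    """Total attention mass each text token places on image tokens.
    attn has shape (num_text_tokens, num_image_tokens)."""
    return attn.sum(axis=1)

def select_anchor_tokens(attn, top_frac=0.15):
    """Keep roughly the top 15% most visually-coupled text tokens as
    anchors for credit assignment; the rest follow linguistic patterns."""
    conn = cross_modal_connectivity(attn)
    k = max(1, int(round(len(conn) * top_frac)))
    return np.argsort(conn)[::-1][:k]
```

A policy-gradient update would then weight (or restrict) credit assignment to the returned anchor indices, matching the finding that training only on low-connectivity tokens degrades reasoning.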
cs.AI / 16 / 2602.11510

AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems

AgentLeak:多智能体大语言模型系统隐私泄露的全栈基准测试
Yagoubi, Faouzi El, Mallah, Ranwa Al, Badu-Marfo, Godwin
Abstract
Multi-agent Large Language Model (LLM) systems create privacy risks that current benchmarks cannot measure. When agents coordinate on tasks, sensitive data passes through inter-agent messages, shared memory, and tool arguments; pathways that output-only audits never inspect. We introduce AgentLeak, to the best of our knowledge the first full-stack benchmark for privacy leakage covering internal channels, spanning 1,000 scenarios across healthcare, finance, legal, and corporate domains, paired with a 32-class attack taxonomy and three-tier detection pipeline. Testing GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, Mistral Large, and Llama 3.3 70B across 4,979 traces reveals that multi-agent configurations reduce per-channel output leakage (C1: 27.2% vs 43.2% in single-agent) but introduce unmonitored internal channels that raise total system exposure to 68.9% (OR-aggregated across C1, C2, C5). Internal channels account for most of this gap: inter-agent messages (C2) leak at 68.8%, compared to 27.2% on C1 (output channel). This means that output-only audits miss 41.7% of violations. Claude 3.5 Sonnet, which emphasizes safety alignment in its design, achieves the lowest leakage rates on both external (3.3%) and internal (28.1%) channels, suggesting that model-level safety training may transfer to internal channel protection. Across all five models and four domains, the pattern C2 > C1 holds consistently, confirming that inter-agent communication is the primary vulnerability. These findings underscore the need for coordination frameworks that incorporate internal-channel privacy protections and enforce privacy controls on inter-agent communication.
Chinese Translation
多智能体大语言模型(LLM)系统带来了当前基准无法测量的隐私风险。当智能体在任务上进行协调时,敏感数据通过智能体之间的消息、共享内存和工具参数传递;这些路径是仅检查输出的审计从未审视过的。我们介绍了 AgentLeak,据我们所知,这是第一个涵盖内部通道的全栈隐私泄露基准,涵盖了医疗、金融、法律和企业领域的 1,000 种场景,并配备了 32 类攻击分类法和三级检测管道。在 4,979 个追踪中测试 GPT-4o、GPT-4o-mini、Claude 3.5 Sonnet、Mistral Large 和 Llama 3.3 70B,结果显示多智能体配置减少了每个通道的输出泄露(C1:27.2% 对比单智能体的 43.2%),但引入了未监控的内部通道,使得系统的总暴露率提高至 68.9%(在 C1、C2、C5 上的 OR 聚合)。内部通道占据了这一差距的大部分:智能体之间的消息(C2)泄露率为 68.8%,而 C1(输出通道)的泄露率为 27.2%。这意味着仅检查输出的审计漏掉了 41.7% 的违规行为。Claude 3.5 Sonnet 在设计上强调安全对齐,在外部(3.3%)和内部(28.1%)通道上实现了最低的泄露率,这表明模型级的安全训练可能会转移到内部通道保护上。在所有五个模型和四个领域中,C2 > C1 的模式始终成立,确认了智能体之间的通信是主要的脆弱性。这些发现凸显了构建协调框架的必要性:将内部通道隐私保护纳入其中,并对智能体之间的通信实施隐私控制。
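The OR-aggregation behind the total-exposure figure can be sketched as follows: a trace counts as leaking if any monitored channel leaks, and the gap between the total rate and the output-channel rate is what output-only audits miss. Channel names follow the abstract; the data layout is an assumption:

```python
def trace_leaks(channel_flags):
    """OR-aggregation: a trace counts as leaking if ANY monitored channel
    (e.g. C1 output, C2 inter-agent messages, C5 tool arguments) leaks."""
    return any(channel_flags.values())

def leakage_rates(traces):
    """Per-channel leakage rates plus the OR-aggregated total; the gap
    between the total and the output channel (C1) is the share of
    violations an output-only audit would miss."""
    n = len(traces)
    channels = traces[0].keys()
    per_channel = {c: sum(t[c] for t in traces) / n for c in channels}
    total = sum(trace_leaks(t) for t in traces) / n
    return per_channel, total
```

Because internal channels can leak while the final output stays clean, the total is always at least the maximum per-channel rate, matching the 68.9% vs. 27.2% pattern reported above.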
cs.AI / 17 / 2602.11516

Human-Inspired Continuous Learning of Internal Reasoning Processes: Learning How to Think for Adaptive AI Systems

人类启发的内部推理过程连续学习:为自适应人工智能系统学习如何思考
Su, Hong
Abstract
Learning internal reasoning processes is crucial for developing AI systems capable of sustained adaptation in dynamic real-world environments. However, most existing approaches primarily emphasize learning task-specific outputs or static knowledge representations, while overlooking the continuous refinement of internal reasoning structures, action scheduling policies, and learning mechanisms themselves. In this paper, we propose a human-inspired continuous learning framework that unifies reasoning, action, reflection, and verification within a sequential reasoning model enhanced by parallel learning. The framework explicitly treats internal thinking processes as primary learning objects. It systematically records internal reasoning trajectories and environmental interactions as structured learning material, enabling the system to optimize not only task-level content but also the organization, scheduling, and evolution of reasoning activities. This design realizes learning alongside processing, allowing cognitive structures to improve during execution. Furthermore, the framework supports controlled replacement of predefined logic with learned procedures and introduces a hierarchical learning-to-learn mechanism that jointly adapts task-level parameters and learning strategies. As a result, the system progressively evolves its internal cognitive architecture while preserving operational stability. Experimental results on a temperature sensor abnormality detection task show that incorporating internal-process learning reduces average runtime by 23.9%.
Chinese Translation
学习内部推理过程对于开发能够在动态现实环境中持续适应的人工智能系统至关重要。然而,大多数现有方法主要强调学习特定任务的输出或静态知识表示,而忽视了内部推理结构、行动调度策略和学习机制本身的持续优化。本文提出了一种人类启发的连续学习框架,该框架在一个通过并行学习增强的顺序推理模型中统一了推理、行动、反思和验证。该框架明确将内部思维过程视为主要学习对象。它系统地记录内部推理轨迹和环境交互作为结构化学习材料,使系统能够优化不仅是任务层面的内容,还有推理活动的组织、调度和演变。这一设计实现了在处理过程中进行学习,使得认知结构在执行过程中得以改善。此外,该框架支持用学习到的程序有控制地替换预定义逻辑,并引入了一种分层的学习机制,以共同适应任务级参数和学习策略。因此,系统在保持操作稳定性的同时逐步演化其内部认知架构。在温度传感器异常检测任务上的实验结果表明,融入内部过程学习使平均运行时间减少了23.9%。
cs.AI / 18 / 2602.11527

CausalAgent: A Conversational Multi-Agent System for End-to-End Causal Inference

CausalAgent:一种用于端到端因果推断的对话式多智能体系统
Zhu, Jiawei, Chen, Wei, Cai, Ruichu
Abstract
Causal inference holds immense value in fields such as healthcare, economics, and social sciences. However, traditional causal analysis workflows impose significant technical barriers, requiring researchers to possess dual backgrounds in statistics and computer science, while manually selecting algorithms, handling data quality issues, and interpreting complex results. To address these challenges, we propose CausalAgent, a conversational multi-agent system for end-to-end causal inference. The system innovatively integrates Multi-Agent Systems (MAS), Retrieval-Augmented Generation (RAG), and the Model Context Protocol (MCP) to achieve automation from data cleaning and causal structure learning to bias correction and report generation through natural language interaction. Users need only upload a dataset and pose questions in natural language to receive a rigorous, interactive analysis report. As a novel user-centered human-AI collaboration paradigm, CausalAgent explicitly models the analysis workflow. By leveraging interactive visualizations, it significantly lowers the barrier to entry for causal analysis while ensuring the rigor and interpretability of the process.
Chinese Translation
因果推断在医疗、经济和社会科学等领域具有重要价值。然而,传统的因果分析工作流程存在显著的技术障碍,要求研究人员具备统计学和计算机科学的双重背景,同时手动选择算法、处理数据质量问题并解释复杂结果。为了解决这些挑战,我们提出了CausalAgent,一种用于端到端因果推断的对话式多智能体系统。该系统创新性地整合了多智能体系统(Multi-Agent Systems, MAS)、检索增强生成(Retrieval-Augmented Generation, RAG)和模型上下文协议(Model Context Protocol, MCP),实现了从数据清理和因果结构学习到偏差修正和报告生成的自动化,且通过自然语言交互进行。用户只需上传数据集并用自然语言提出问题,即可获得严格的互动分析报告。作为一种新颖的以用户为中心的人机协作范式,CausalAgent 明确建模了分析工作流程。通过利用互动可视化,它显著降低了因果分析的入门门槛,同时确保了过程的严谨性和可解释性。
cs.AI / 19 / 2602.11541

Budget-Constrained Agentic Large Language Models: Intention-Based Planning for Costly Tool Use

预算受限的自主大型语言模型:基于意图的成本工具使用规划
Liu, Hanbing, Tian, Chunhao, An, Nan, Wang, Ziyuan, Lu, Pinyan, Yu, Changyuan, Qi, Qi
Abstract
We study budget-constrained tool-augmented agents, where a large language model must solve multi-step tasks by invoking external tools under a strict monetary budget. We formalize this setting as sequential decision making in context space with priced and stochastic tool executions, making direct planning intractable due to massive state-action spaces, high variance of outcomes and prohibitive exploration cost. To address these challenges, we propose INTENT, an inference-time planning framework that leverages an intention-aware hierarchical world model to anticipate future tool usage and risk-calibrated costs, and to guide decisions online. On the cost-augmented StableToolBench, INTENT strictly enforces hard budget feasibility while substantially improving task success over baselines, and remains robust under dynamic market shifts such as tool price changes and varying budgets.
Chinese Translation
我们研究预算受限的工具增强代理,其中大型语言模型必须在严格的货币预算下,通过调用外部工具来解决多步骤任务。我们将这一设置形式化为具有定价和随机工具执行的上下文空间中的序列决策,由于状态-动作空间庞大、结果的高方差和探索成本的高昂,直接规划变得不可行。为了解决这些挑战,我们提出了INTENT,一个推理时规划框架,利用意图感知的层次世界模型来预测未来的工具使用、风险校准成本,并在线指导决策。在成本增强的StableToolBench上,INTENT严格执行硬预算可行性,同时显著提高了任务成功率,相较于基线方法表现出更强的鲁棒性,能够应对工具价格变化和预算波动等动态市场变化。
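Hard budget feasibility with a risk-calibrated cost can be sketched as quantile-based screening before plan selection; the 0.9 quantile and the plan fields are illustrative assumptions, not INTENT's internals:

```python
import numpy as np

def risk_calibrated_cost(cost_samples, quantile=0.9):
    """Pessimistic cost estimate for a stochastic tool plan: a high
    quantile of sampled execution costs rather than the mean, so rare
    expensive runs still respect the budget."""
    return float(np.quantile(cost_samples, quantile))

def select_plan(plans, budget, quantile=0.9):
    """Keep only plans whose risk-calibrated cost fits the hard budget,
    then pick the highest predicted success; None if nothing is feasible."""
    feasible = [p for p in plans
                if risk_calibrated_cost(p["cost_samples"], quantile) <= budget]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p["pred_success"])
```

Raising the quantile makes the screen more conservative; a mean-cost screen would admit plans whose occasional expensive executions blow the budget.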
cs.AI / 20 / 2602.11569

SemaPop: Semantic-Persona Conditioned Population Synthesis

SemaPop:语义-角色条件的人口合成
Qin, Zhenlin, Ling, Yancheng, Wang, Leizhen, Pereira, Francisco Câmara, Ma, Zhenliang
Abstract
Population synthesis is a critical component of individual-level socio-economic simulation, yet remains challenging due to the need to jointly represent statistical structure and latent behavioral semantics. Existing population synthesis approaches predominantly rely on structured attributes and statistical constraints, leaving a gap in semantic-conditioned population generation that can capture abstract behavioral patterns implicitly in survey data. This study proposes SemaPop, a semantic-statistical population synthesis model that integrates large language models (LLMs) with generative population modeling. SemaPop derives high-level persona representations from individual survey records and incorporates them as semantic conditioning signals for population generation, while marginal regularization is introduced to enforce alignment with target population marginals. In this study, the framework is instantiated using a Wasserstein GAN with gradient penalty (WGAN-GP) backbone, referred to as SemaPop-GAN. Extensive experiments demonstrate that SemaPop-GAN achieves improved generative performance, yielding closer alignment with target marginal and joint distributions while maintaining sample-level feasibility and diversity under semantic conditioning. Ablation studies further confirm the contribution of semantic persona conditioning and architectural design choices to balancing marginal consistency and structural realism. These results demonstrate that SemaPop-GAN enables controllable and interpretable population synthesis through effective semantic-statistical information fusion. SemaPop-GAN also provides a promising modular foundation for developing generative population projection systems that integrate individual-level behavioral semantics with population-level statistical constraints.
Chinese Translation
人口合成是个体层面社会经济模拟的关键组成部分,但由于需要共同表示统计结构和潜在行为语义,仍然面临挑战。现有的人口合成方法主要依赖于结构化属性和统计约束,导致在语义条件的人口生成方面存在空白,无法在调查数据中隐式捕捉抽象的行为模式。本研究提出了SemaPop,一种语义-统计人口合成模型,它将大型语言模型(LLMs)与生成性人口建模相结合。SemaPop从个体调查记录中推导出高级角色表示,并将其作为人口生成的语义条件信号,同时引入边际正则化以确保与目标人口边际的一致性。在本研究中,该框架使用带梯度惩罚的Wasserstein GAN(WGAN-GP)作为基础,称为SemaPop-GAN。大量实验证明,SemaPop-GAN在生成性能上有所提升,能够更好地与目标边际和联合分布对齐,同时在语义条件下保持样本级别的可行性和多样性。消融研究进一步确认了语义角色条件和架构设计选择在平衡边际一致性和结构现实性方面的贡献。这些结果表明,SemaPop-GAN通过有效的语义-统计信息融合实现了可控和可解释的人口合成。SemaPop-GAN还为开发将个体层面行为语义与人口层面统计约束相结合的生成性人口预测系统提供了一个有前景的模块化基础。
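The marginal regularization term can be sketched as an L1 gap between the synthetic population's attribute marginals and target (e.g. census) marginals, added to the generator loss; the dict-of-arrays layout below is an assumption for illustration:

```python
import numpy as np

def marginal_regularization(synthetic, target_marginals):
    """L1 gap between synthetic-population marginals and target marginals,
    summed over attributes; adding this to the generator loss pushes
    sampled populations toward the target totals."""
    penalty = 0.0
    for attr, target in target_marginals.items():
        col = synthetic[attr]
        values, counts = np.unique(col, return_counts=True)
        empirical = dict(zip(values.tolist(), (counts / len(col)).tolist()))
        penalty += sum(abs(empirical.get(v, 0.0) - p) for v, p in target.items())
    return penalty
```

In a WGAN-GP setting this term would be weighted against the critic loss, trading off marginal consistency against sample-level realism.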
cs.AI / 21 / 2602.11574

Learning to Configure Agentic AI Systems

学习配置自主智能系统
Taparia, Aditya, Sagar, Som, Senanayake, Ransalu
Abstract
Configuring LLM-based agent systems involves choosing workflows, tools, token budgets, and prompts from a large combinatorial design space, and is typically handled today by fixed large templates or hand-tuned heuristics. This leads to brittle behavior and unnecessary compute, since the same cumbersome configuration is often applied to both easy and hard input queries. We formulate agent configuration as a query-wise decision problem and introduce ARC (Agentic Resource & Configuration learner), which learns a light-weight hierarchical policy using reinforcement learning to dynamically tailor these configurations. Across multiple benchmarks spanning reasoning and tool-augmented question answering, the learned policy consistently outperforms strong hand-designed and other baselines, achieving up to 25% higher task accuracy while also reducing token and runtime costs. These results demonstrate that learning per-query agent configurations is a powerful alternative to "one size fits all" designs.
Chinese Translation
配置基于大语言模型(LLM)的代理系统涉及从一个庞大的组合设计空间中选择工作流程、工具、令牌预算和提示,目前通常通过固定的大模板或手动调整的启发式方法来处理。这导致了脆弱的行为和不必要的计算,因为同样繁琐的配置往往被同时应用于简单和复杂的输入查询。我们将代理配置形式化为一个逐查询的决策问题,并引入ARC(Agentic Resource & Configuration learner),该方法使用强化学习来学习轻量级的层次策略,以动态调整这些配置。在涵盖推理和工具增强问答的多个基准测试中,所学习的策略始终优于精心手工设计的基线及其他对比方法,任务准确率最高提高了25%,同时降低了令牌和运行时间成本。这些结果表明,逐查询地学习代理配置是“一刀切”设计的有力替代方案。
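Per-query configuration selection can be sketched as a contextual bandit over a small config set; epsilon-greedy here is a stand-in for ARC's learned hierarchical policy, and all names are hypothetical:

```python
import random

def choose_config(query_key, q_table, configs, epsilon=0.1, rng=None):
    """Epsilon-greedy per-query configuration: easy queries can get a
    cheap config and hard ones a heavier workflow, instead of applying
    one template to everything."""
    rng = rng or random.Random(0)
    if rng.random() < epsilon:
        return rng.choice(configs)                          # explore
    values = q_table.get(query_key, {})
    return max(configs, key=lambda c: values.get(c, 0.0))   # exploit
```

The q-table would be updated from a reward combining task accuracy with token and runtime cost, so the policy learns when a heavy workflow actually pays for itself.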
cs.AI / 22 / 2602.11583

The Five Ws of Multi-Agent Communication: Who Talks to Whom, When, What, and Why -- A Survey from MARL to Emergent Language and LLMs

多智能体通信的五个W:谁与谁交谈,何时,什么,以及为什么——从MARL到突现语言和大型语言模型的调查
Chen, Jingdi, Yang, Hanqing, Liu, Zongjun, Joe-Wong, Carlee
Abstract
Multi-agent sequential decision-making powers many real-world systems, from autonomous vehicles and robotics to collaborative AI assistants. In dynamic, partially observable environments, communication is often what reduces uncertainty and makes collaboration possible. This survey reviews multi-agent communication (MA-Comm) through the Five Ws: who communicates with whom, what is communicated, when communication occurs, and why communication is beneficial. This framing offers a clean way to connect ideas across otherwise separate research threads. We trace how communication approaches have evolved across three major paradigms. In Multi-Agent Reinforcement Learning (MARL), early methods used hand-designed or implicit protocols, followed by end-to-end learned communication optimized for reward and control. While successful, these protocols are frequently task-specific and hard to interpret, motivating work on Emergent Language (EL), where agents can develop more structured or symbolic communication through interaction. EL methods, however, still struggle with grounding, generalization, and scalability, which has fueled recent interest in large language models (LLMs) that bring natural language priors for reasoning, planning, and collaboration in more open-ended settings. Across MARL, EL, and LLM-based systems, we highlight how different choices shape communication design, where the main trade-offs lie, and what remains unsolved. We distill practical design patterns and open challenges to support future hybrid systems that combine learning, language, and control for scalable and interpretable multi-agent collaboration.
Chinese Translation
多智能体顺序决策驱动着许多现实世界系统,从自主车辆和机器人到协作人工智能助手。在动态的、部分可观察的环境中,通信通常是减少不确定性并使协作成为可能的关键。本文通过五个W回顾了多智能体通信(MA-Comm):谁与谁沟通,沟通的内容是什么,何时发生沟通,以及沟通的益处。这样的框架提供了一种清晰的方式,将原本分散的研究主题连接起来。我们追踪了通信方法在三个主要范式中的演变。在多智能体强化学习(MARL)中,早期的方法使用手工设计或隐式协议,随后是针对奖励和控制进行优化的端到端学习通信。尽管取得了成功,这些协议通常是特定于任务的且难以解释,这激励了对突现语言(EL)的研究,代理可以通过交互发展出更结构化或符号化的通信。然而,EL方法在基础、泛化和可扩展性方面仍然面临挑战,这引发了对大型语言模型(LLMs)的近期关注,这些模型为推理、规划和在更开放的环境中协作提供了自然语言的先验知识。在MARL、EL和基于LLM的系统中,我们强调了不同选择如何塑造通信设计,主要权衡在哪里,以及尚未解决的问题。我们提炼了实用的设计模式和开放挑战,以支持未来结合学习、语言和控制的混合系统,实现可扩展和可解释的多智能体协作。
cs.AI / 23 / 2602.11596

MAPLE: Modality-Aware Post-training and Learning Ecosystem

MAPLE:模态感知的后训练与学习生态系统
Verma, Nikhil, Kim, Minjung, Yoo, JooYoung, Jin, Kyung-Min, Bharadwaj, Manasa, Ferreira, Kevin, Kim, Ko Keun, Kim, Youngjoon
Abstract
Multimodal language models now integrate text, audio, and video for unified reasoning. Yet existing RL post-training pipelines treat all input signals as equally relevant, ignoring which modalities each task actually requires. This modality-blind training inflates policy-gradient variance, slows convergence, and degrades robustness to real-world distribution shifts where signals may be missing, added, or reweighted. We introduce MAPLE, a complete modality-aware post-training and learning ecosystem comprising: (1) MAPLE-bench, the first benchmark explicitly annotating minimal signal combinations required per task; (2) MAPO, a modality-aware policy optimization framework that stratifies batches by modality requirement to reduce gradient variance from heterogeneous group advantages; (3) Adaptive weighting and curriculum scheduling that balances and prioritizes harder signal combinations. Systematic analysis across loss aggregation, clipping, sampling, and curriculum design establishes MAPO's optimal training strategy. Adaptive weighting and curriculum focused learning further boost performance across signal combinations. MAPLE narrows uni/multi-modal accuracy gaps by 30.24%, converges 3.18x faster, and maintains stability across all modality combinations under realistic reduced signal access. MAPLE constitutes a complete recipe for deployment-ready multimodal RL post-training.
Chinese Translation
多模态语言模型现在整合了文本、音频和视频以实现统一推理。然而,现有的强化学习(RL)后训练流程将所有输入信号视为同等重要,忽视了每个任务实际所需的模态。这种模态盲训练导致策略梯度方差膨胀,减缓收敛速度,并降低对真实世界分布变化的鲁棒性,其中信号可能缺失、增加或重新加权。我们提出了MAPLE,一个完整的模态感知后训练与学习生态系统,包括:(1)MAPLE-bench,这是第一个明确标注每个任务所需最小信号组合的基准;(2)MAPO,一个模态感知的策略优化框架,通过模态需求对批次进行分层,以减少来自异质组优势的梯度方差;(3)自适应加权和课程调度,平衡并优先考虑更难的信号组合。针对损失聚合、剪切、采样和课程设计的系统分析确立了MAPO的最佳训练策略。自适应加权和以课程为中心的学习进一步提升了不同信号组合的性能。MAPLE将单模态/多模态的准确性差距缩小了30.24%,收敛速度提高了3.18倍,并在现实的信号访问减少情况下保持了所有模态组合的稳定性。MAPLE构成了一个完整的可部署多模态强化学习后训练方案。
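MAPO's batch stratification by modality requirement can be pictured with a minimal sketch; the grouping key, sample fields, and batch contents here are invented for illustration, not the paper's implementation:

```python
from collections import defaultdict

def stratify_by_modality(batch):
    """Group rollouts by the set of modalities their task requires, so that
    advantages can be normalized within homogeneous strata instead of
    mixing heterogeneous groups (the abstract's variance-reduction idea)."""
    strata = defaultdict(list)
    for sample in batch:
        strata[frozenset(sample["required_modalities"])].append(sample)
    return strata

# Hypothetical batch: tasks annotated with their minimal signal combination.
batch = [
    {"id": 1, "required_modalities": ["text"]},
    {"id": 2, "required_modalities": ["text", "audio"]},
    {"id": 3, "required_modalities": ["audio", "text"]},  # same stratum as id 2
]
strata = stratify_by_modality(batch)
print(sorted(len(v) for v in strata.values()))  # [1, 2]
```

Using a `frozenset` key makes the stratum order-insensitive, so `["text", "audio"]` and `["audio", "text"]` land in the same group.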
cs.AI / 24 / 2602.11609

scPilot: Large Language Model Reasoning Toward Automated Single-Cell Analysis and Discovery

scPilot:面向自动化单细胞分析与发现的大型语言模型推理
Gao, Yiming, Wang, Zhen, Chen, Jefferson, Antkowiak, Mark, Hu, Mengzhou, Kong, JungHo, Pratt, Dexter, Liu, Jieyuan, Ma, Enze, Hu, Zhiting, Xing, Eric P.
Abstract
We present scPilot, the first systematic framework to practice omics-native reasoning: a large language model (LLM) converses in natural language while directly inspecting single-cell RNA-seq data and on-demand bioinformatics tools. scPilot converts core single-cell analyses, i.e., cell-type annotation, developmental-trajectory reconstruction, and transcription-factor targeting, into step-by-step reasoning problems that the model must solve, justify, and, when needed, revise with new evidence. To measure progress, we release scBench, a suite of 9 expertly curated datasets and graders that faithfully evaluate the omics-native reasoning capability of scPilot w.r.t various LLMs. Experiments with o1 show that iterative omics-native reasoning lifts average accuracy by 11% for cell-type annotation and Gemini-2.5-Pro cuts trajectory graph-edit distance by 30% versus one-shot prompting, while generating transparent reasoning traces explain marker gene ambiguity and regulatory logic. By grounding LLMs in raw omics data, scPilot enables auditable, interpretable, and diagnostically informative single-cell analyses. Code, data, and package are available at https://github.com/maitrix-org/scPilot
Chinese Translation
我们提出了scPilot,这是第一个系统化框架,用于实践组学本土推理:一个大型语言模型(LLM)在直接检查单细胞RNA测序数据和按需生物信息学工具的同时,能够用自然语言进行对话。scPilot将核心单细胞分析,即细胞类型注释、发育轨迹重建和转录因子靶向,转化为模型必须解决、证明并在需要时用新证据修正的逐步推理问题。为了衡量进展,我们发布了scBench,这是一个包含9个专家策划的数据集和评分器的套件,能够忠实评估scPilot在各种LLM下的组学本土推理能力。使用 o1 的实验表明,迭代的组学本土推理使细胞类型注释的平均准确率提高了11%,而Gemini-2.5-Pro与一次性提示相比将轨迹图编辑距离减少了30%,同时生成的透明推理痕迹解释了标记基因的模糊性和调控逻辑。通过将LLM与原始组学数据结合,scPilot实现了可审计、可解释且具有诊断信息的单细胞分析。代码、数据和软件包可在https://github.com/maitrix-org/scPilot获取。
cs.AI / 25 / 2602.11619

When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents

当智能体自我不一致时:基于大语言模型的智能体行为一致性测量
Mehta, Aman
Abstract
Run the same LLM agent on the same task twice: do you get the same behavior? We find the answer is often no. In a study of 3,000 agent runs across three models (Llama 3.1 70B, GPT-4o, and Claude Sonnet 4.5) on HotpotQA, we observe that ReAct-style agents produce 2.0--4.2 distinct action sequences per 10 runs on average, even with identical inputs. More importantly, this variance predicts failure: tasks with consistent behavior ($\leq$2 unique paths) achieve 80--92% accuracy, while highly inconsistent tasks ($\geq$6 unique paths) achieve only 25--60%, a 32--55 percentage point gap depending on model. We trace variance to early decisions: 69% of divergence occurs at step 2, the first search query. Our results suggest that monitoring behavioral consistency during execution could enable early error detection and improve agent reliability.
Chinese Translation
在同一任务上运行相同的大语言模型(LLM)智能体两次:你会得到相同的行为吗?我们的研究发现答案往往是否定的。在对三个模型(Llama 3.1 70B、GPT-4o 和 Claude Sonnet 4.5)在 HotpotQA 上进行的 3000 次智能体运行的研究中,我们观察到 ReAct 风格的智能体在每 10 次运行中平均产生 2.0 到 4.2 种不同的行动序列,即使输入完全相同。更重要的是,这种变异性预测了失败:具有一致行为($\leq 2$ 个独特路径)的任务实现了 80% 到 92% 的准确率,而高度不一致的任务($\geq 6$ 个独特路径)仅实现了 25% 到 60% 的准确率,具体差距取决于模型,达到 32 到 55 个百分点。我们追踪变异性至早期决策:69% 的分歧发生在第 2 步,即第一次搜索查询。我们的结果表明,在执行过程中监测行为一致性可能有助于早期错误检测并提高智能体的可靠性。
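The abstract's unique-path consistency measure is easy to sketch: count distinct action sequences across repeated runs of one task, and locate the first step where runs diverge. The runs, tools, and task below are invented examples, not the study's data:

```python
def unique_paths(runs):
    """Count distinct action sequences among repeated runs of one task.
    Each run is a list of (action, argument) steps; two runs count as the
    same behavior only if their full sequences match exactly."""
    return len({tuple(run) for run in runs})

def first_divergence_step(runs):
    """0-based index of the first step at which any two runs disagree
    (None if all runs are identical)."""
    for i in range(max(len(r) for r in runs)):
        prefixes = {tuple(r[:i + 1]) for r in runs}
        if len(prefixes) > 1:
            return i
    return None

# Three hypothetical ReAct-style runs of the same question.
runs = [
    [("search", "Scott Derrickson"), ("search", "Ed Wood"), ("finish", "yes")],
    [("search", "Scott Derrickson"), ("search", "Ed Wood"), ("finish", "yes")],
    [("search", "Ed Wood nationality"), ("finish", "yes")],
]
print(unique_paths(runs))           # 2 distinct behaviors
print(first_divergence_step(runs))  # 0: runs already split at the first search
```

A monitor could flag a task as high-risk once `unique_paths` exceeds a threshold during execution, matching the paper's suggestion of early error detection.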
cs.AI / 26 / 2602.11630

Neuro-Symbolic Multitasking: A Unified Framework for Discovering Generalizable Solutions to PDE Families

神经符号多任务:发现偏微分方程族可推广解的统一框架
Huang, Yipeng, Xu, Dejun, Lin, Zexin, Wang, Zhenzhong, Jiang, Min
Abstract
Solving Partial Differential Equations (PDEs) is fundamental to numerous scientific and engineering disciplines. A common challenge arises from solving the PDE families, which are characterized by sharing an identical mathematical structure but varying in specific parameters. Traditional numerical methods, such as the finite element method, need to independently solve each instance within a PDE family, which incurs massive computational cost. On the other hand, while recent advancements in machine learning PDE solvers offer impressive computational speed and accuracy, their inherent ``black-box" nature presents a considerable limitation. These methods primarily yield numerical approximations, thereby lacking the crucial interpretability provided by analytical expressions, which are essential for deeper scientific insight. To address these limitations, we propose a neuro-assisted multitasking symbolic PDE solver framework for PDE family solving, dubbed NMIPS. In particular, we employ multifactorial optimization to simultaneously discover the analytical solutions of PDEs. To enhance computational efficiency, we devise an affine transfer method by transferring learned mathematical structures among PDEs in a family, avoiding solving each PDE from scratch. Experimental results across multiple cases demonstrate promising improvements over existing baselines, achieving up to a $\sim$35.7% increase in accuracy while providing interpretable analytical solutions.
Chinese Translation
求解偏微分方程(PDE)是众多科学和工程学科的基础。一个常见的挑战来自于求解偏微分方程族,这些方程族具有相同的数学结构但在特定参数上有所不同。传统的数值方法,如有限元法,需要独立地求解偏微分方程族中的每一个实例,这会产生巨大的计算成本。另一方面,尽管最近在机器学习偏微分方程求解器方面的进展提供了令人印象深刻的计算速度和准确性,但其固有的“黑箱”特性带来了相当大的限制。这些方法主要产生数值近似,缺乏由解析表达式提供的重要可解释性,而这对于更深层次的科学洞察至关重要。为了解决这些局限性,我们提出了一种神经辅助多任务符号偏微分方程求解框架,称为NMIPS,旨在求解偏微分方程族。具体而言,我们采用多因素优化方法同时发现偏微分方程的解析解。为了提高计算效率,我们设计了一种仿射转移方法,通过在偏微分方程族中转移学习到的数学结构,避免从头开始求解每个偏微分方程。多个案例的实验结果表明,与现有基线相比,取得了令人鼓舞的改进,准确性提高了约35.7%,同时提供了可解释的解析解。
cs.AI / 27 / 2602.11635

Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

多模态大型语言模型真的理解空间吗?数学推理评估
Lu, Shuo, Cheng, Jianjie, Xu, Yinuo, Yu, Yongcan, Sheng, Lijun, Wang, Peijie, Jiang, Siru, Hu, Yongguan, Ling, Run, Shao, Yihua, Ma, Ao, Feng, Wei, He, Lingxiao, Wang, Meng, Xie, Qianlong, Wang, Xingxing, He, Ran, Liang, Jian
Abstract
Multimodal large language models (MLLMs) have achieved strong performance on perception-oriented tasks, yet their ability to perform mathematical spatial reasoning, defined as the capacity to parse and manipulate two- and three-dimensional relations, remains unclear. Humans easily solve textbook-style spatial reasoning problems with over 95\% accuracy, but we find that most leading MLLMs fail to reach even 60\% on the same tasks. This striking gap highlights spatial reasoning as a fundamental weakness of current models. To investigate this gap, we present MathSpatial, a unified framework for evaluating and improving spatial reasoning in MLLMs. MathSpatial includes three complementary components: (i) MathSpatial-Bench, a benchmark of 2K problems across three categories and eleven subtypes, designed to isolate reasoning difficulty from perceptual noise; (ii) MathSpatial-Corpus, a training dataset of 8K additional problems with verified solutions; and (iii) MathSpatial-SRT, which models reasoning as structured traces composed of three atomic operations--Correlate, Constrain, and Infer. Experiments show that fine-tuning Qwen2.5-VL-7B on MathSpatial achieves competitive accuracy while reducing tokens by 25\%. MathSpatial provides the first large-scale resource that disentangles perception from reasoning, enabling precise measurement and comprehensive understanding of mathematical spatial reasoning in MLLMs.
Chinese Translation
多模态大型语言模型(MLLMs)在感知导向任务上取得了强劲的表现,但它们在数学空间推理方面的能力仍不明确,数学空间推理被定义为解析和操作二维和三维关系的能力。人类在解决教科书风格的空间推理问题时,准确率超过95%,但我们发现大多数领先的MLLMs在相同任务上的表现甚至未能达到60%。这一显著差距凸显了空间推理作为当前模型的一个基本弱点。为了调查这一差距,我们提出了MathSpatial,一个用于评估和改善MLLMs空间推理的统一框架。MathSpatial包括三个互补组件:(i)MathSpatial-Bench,一个包含2000个问题的基准,涵盖三个类别和十一种子类型,旨在将推理难度与感知噪声隔离;(ii)MathSpatial-Corpus,一个包含8000个经过验证解决方案的额外问题的训练数据集;以及(iii)MathSpatial-SRT,将推理建模为由三个原子操作(相关、约束和推断)组成的结构化痕迹。实验表明,在MathSpatial上微调Qwen2.5-VL-7B可以实现具有竞争力的准确性,同时减少25%的标记。MathSpatial提供了第一个将感知与推理分离的大规模资源,使得对MLLMs中的数学空间推理进行精确测量和全面理解成为可能。
cs.AI / 28 / 2602.11661

Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm

夸克医疗对齐:一种整体多维对齐与协同优化范式
Xu, Tianxiang, Liu, Jiayi, Tong, Yixuan, Xu, Jialu, Wei, Yunqing, Feng, Kaiwen, Hou, PanPan, Yin, Kangping, Hu, Jiyuan, Zhou, Hao, Ma, Zhenxin, Xu, Jian, Jiang, Guanjun
Abstract
While reinforcement learning for large language model alignment has progressed rapidly in recent years, transferring these paradigms to high-stakes medical question answering reveals a fundamental paradigm mismatch. Reinforcement Learning from Human Feedback relies on preference annotations that are prohibitively expensive and often fail to reflect the absolute correctness of medical facts. Reinforcement Learning from Verifiable Rewards lacks effective automatic verifiers and struggles to handle complex clinical contexts. Meanwhile, medical alignment requires the simultaneous optimization of correctness, safety, and compliance, yet multi-objective heterogeneous reward signals are prone to scale mismatch and optimization conflicts. To address these challenges, we propose a robust medical alignment paradigm. We first construct a holistic multi-dimensional medical alignment matrix that decomposes alignment objectives into four categories: fundamental capabilities, expert knowledge, online feedback, and format specifications. Within each category, we establish a closed loop in which observable metrics inform attributable diagnosis, which in turn drives optimizable rewards, thereby providing fine-grained, high-resolution supervision signals for subsequent iterative optimization. To resolve gradient domination and optimization instability problem caused by heterogeneous signals, we further propose a unified optimization mechanism. This mechanism employs Reference-Frozen Normalization to align reward scales and implements a Tri-Factor Adaptive Dynamic Weighting strategy to achieve collaborative optimization that is weakness-oriented, risk-prioritized, and redundancy-reducing. Experimental results demonstrate the effectiveness of our proposed paradigm in real-world medical scenario evaluations, establishing a new paradigm for complex alignment in vertical domains.
Chinese Translation
尽管近年来针对大型语言模型对齐的强化学习取得了快速进展,但将这些范式转移到高风险的医疗问答中却揭示了根本的范式不匹配。基于人类反馈的强化学习依赖于偏好注释,这种注释成本高昂且常常无法反映医疗事实的绝对正确性。基于可验证奖励的强化学习缺乏有效的自动验证器,并且在处理复杂临床环境时面临困难。同时,医疗对齐需要同时优化正确性、安全性和合规性,但多目标异构奖励信号容易导致规模不匹配和优化冲突。为了解决这些挑战,我们提出了一种稳健的医疗对齐范式。我们首先构建了一个整体多维医疗对齐矩阵,将对齐目标分解为四个类别:基础能力、专家知识、在线反馈和格式规范。在每个类别中,我们建立了一个闭环,其中可观察的指标通知可归因的诊断,而这些诊断又驱动可优化的奖励,从而为后续的迭代优化提供细粒度、高分辨率的监督信号。为了解决由异构信号引起的梯度主导和优化不稳定性问题,我们进一步提出了一种统一的优化机制。该机制采用参考冻结归一化来对齐奖励尺度,并实施三因子自适应动态加权策略,以实现以弱点为导向、以风险为优先、减少冗余的协同优化。实验结果证明了我们提出的范式在现实医疗场景评估中的有效性,为垂直领域的复杂对齐建立了新的范式。
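The two mechanisms named in the abstract can be sketched loosely: z-scoring each reward head against statistics frozen from a reference policy, and upweighting the dimensions where the current model is weakest. The z-score form, the deficit-based weights, and all numbers are illustrative assumptions, not the paper's actual formulas:

```python
def reference_frozen_normalize(reward, ref_mean, ref_std, eps=1e-8):
    """Z-score a raw reward against statistics frozen from a reference
    policy's rollouts, so heterogeneous reward heads share one scale."""
    return (reward - ref_mean) / (ref_std + eps)

def weakness_oriented_weights(dim_scores):
    """Toy weakness-oriented weighting: dimensions where the current policy
    scores lower receive proportionally larger weight."""
    deficits = {k: max(0.0, 1.0 - s) for k, s in dim_scores.items()}
    total = sum(deficits.values()) or 1.0
    return {k: d / total for k, d in deficits.items()}

# Hypothetical per-dimension scores in [0, 1] for one model checkpoint.
scores = {"correctness": 0.9, "safety": 0.6, "compliance": 0.75}
w = weakness_oriented_weights(scores)
r = reference_frozen_normalize(2.5, ref_mean=1.0, ref_std=0.5)
print(w)  # safety, the weakest dimension, gets the largest weight
```

Freezing the normalization statistics (rather than recomputing them from the current policy) keeps the reward scale stable as the policy drifts during training.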
cs.AI / 29 / 2602.11666

PhyNiKCE: A Neurosymbolic Agentic Framework for Autonomous Computational Fluid Dynamics

PhyNiKCE:一种用于自主计算流体动力学的神经符号代理框架
Fan, E, Shi, Lisong, Li, Zhengtong, Wen, Chih-yung
Abstract
The deployment of autonomous agents for Computational Fluid Dynamics (CFD), is critically limited by the probabilistic nature of Large Language Models (LLMs), which struggle to enforce the strict conservation laws and numerical stability required for physics-based simulations. Reliance on purely semantic Retrieval Augmented Generation (RAG) often leads to "context poisoning," where agents generate linguistically plausible but physically invalid configurations due to a fundamental Semantic-Physical Disconnect. To bridge this gap, this work introduces PhyNiKCE (Physical and Numerical Knowledgeable Context Engineering), a neurosymbolic agentic framework for trustworthy engineering. Unlike standard black-box agents, PhyNiKCE decouples neural planning from symbolic validation. It employs a Symbolic Knowledge Engine that treats simulation setup as a Constraint Satisfaction Problem, rigidly enforcing physical constraints via a Deterministic RAG Engine with specialized retrieval strategies for solvers, turbulence models, and boundary conditions. Validated through rigorous OpenFOAM experiments on practical, non-tutorial CFD tasks using Gemini-2.5-Pro/Flash, PhyNiKCE demonstrates a 96% relative improvement over state-of-the-art baselines. Furthermore, by replacing trial-and-error with knowledge-driven initialization, the framework reduced autonomous self-correction loops by 59% while simultaneously lowering LLM token consumption by 17%. These results demonstrate that decoupling neural generation from symbolic constraint enforcement significantly enhances robustness and efficiency. While validated on CFD, this architecture offers a scalable, auditable paradigm for Trustworthy Artificial Intelligence in broader industrial automation.
Chinese Translation
自主代理在计算流体动力学(CFD)中的应用受到大型语言模型(LLMs)概率性质的严重限制,这些模型难以执行物理模拟所需的严格守恒定律和数值稳定性。单纯依赖语义检索增强生成(RAG)往往导致“上下文污染”,使得代理生成在语言上似乎合理但在物理上无效的配置,这源于根本的语义-物理脱节。为了解决这一问题,本文提出了PhyNiKCE(物理和数值知识上下文工程),一个用于可信工程的神经符号代理框架。与标准的黑箱代理不同,PhyNiKCE将神经规划与符号验证解耦。它采用符号知识引擎,将仿真设置视为约束满足问题,通过具有专门检索策略的确定性RAG引擎严格执行物理约束,适用于求解器、湍流模型和边界条件。通过在使用Gemini-2.5-Pro/Flash进行的实际非教程CFD任务中的严格OpenFOAM实验进行验证,PhyNiKCE在相较于最先进基准上展示了96%的相对提升。此外,通过用知识驱动的初始化替代试错方法,该框架将自主自我修正循环减少了59%,同时将LLM令牌消耗降低了17%。这些结果表明,将神经生成与符号约束执行解耦显著增强了鲁棒性和效率。尽管在CFD中得到了验证,但该架构为更广泛的工业自动化中的可信人工智能提供了可扩展、可审计的范式。
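Treating simulation setup as a constraint satisfaction problem, as the abstract describes, amounts to validating a neural plan against a symbolic rule table before execution. The rule entries below are placeholders invented for illustration, not actual OpenFOAM compatibility facts or the paper's knowledge engine:

```python
# Illustrative compatibility rules; a real engine encodes far more physics.
COMPATIBLE_TURBULENCE = {
    "solverA": {"kEpsilon", "kOmegaSST"},
    "solverB": {"laminar", "kOmegaSST"},
}

def validate_config(config):
    """Return a list of constraint violations for a proposed CFD setup.
    An empty list means the symbolic layer accepts the neural plan."""
    violations = []
    solver = config.get("solver")
    if solver not in COMPATIBLE_TURBULENCE:
        violations.append(f"unknown solver: {solver}")
    elif config.get("turbulence") not in COMPATIBLE_TURBULENCE[solver]:
        violations.append(f"{config['turbulence']} incompatible with {solver}")
    if not config.get("boundary_conditions"):
        violations.append("no boundary conditions specified")
    return violations

# A hypothetical LLM-proposed plan that the symbolic layer should reject.
plan = {"solver": "solverA", "turbulence": "laminar",
        "boundary_conditions": ["inlet", "outlet", "wall"]}
print(validate_config(plan))  # ['laminar incompatible with solverA']
```

Rejecting invalid plans deterministically, before any solver run, is what replaces trial-and-error self-correction loops in this style of design.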
cs.AI / 30 / 2602.11674

Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs

基准健康指数:一个系统框架用于基准大语言模型的基准评估
Zhu, Longyuan, Hua, Hairan, Miao, Linlin, Zhao, Bing
Abstract
Large Language Models (LLMs) are advancing rapidly, yet the benchmarks used to measure this progress are becoming increasingly unreliable. Score inflation and selective reporting have eroded the authority of standard benchmarks, leaving the community uncertain about which evaluation results remain trustworthy. We introduce the Benchmark Health Index (BHI), a pure data-driven framework for auditing evaluation sets along three orthogonal and complementary axes: (1) Capability Discrimination, measuring how sharply a benchmark separates model performance beyond noise; (2) Anti-Saturation, estimating remaining headroom before ceiling effects erode resolution and thus the benchmark's expected longevity; and (3) Impact, quantifying influence across academic and industrial ecosystems via adoption breadth and practice-shaping power. By distilling 106 validated benchmarks from the technical reports of 91 representative models in 2025, we systematically characterize the evaluation landscape. BHI is the first framework to quantify benchmark health at a macro level, providing a principled basis for benchmark selection and enabling dynamic lifecycle management for next-generation evaluation protocols.
Chinese Translation
大型语言模型(LLMs)正在快速发展,但用于衡量这一进展的基准变得越来越不可靠。分数膨胀和选择性报告削弱了标准基准的权威性,使得社区对哪些评估结果仍然可信感到不确定。我们提出了基准健康指数(Benchmark Health Index, BHI),这是一个纯数据驱动的框架,用于沿着三个正交且互补的轴线审计评估集:(1)能力区分,衡量基准在噪声之外多大程度上区分模型性能;(2)抗饱和,估计在天花板效应侵蚀分辨率之前剩余的提升空间,从而影响基准的预期寿命;(3)影响,量化通过采用广度和塑造实践的能力在学术和工业生态系统中的影响力。通过从2025年91个代表性模型的技术报告中提炼出106个经过验证的基准,我们系统地描述了评估景观。BHI是第一个在宏观层面量化基准健康的框架,为基准选择提供了原则性基础,并使下一代评估协议的动态生命周期管理成为可能。
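Two of BHI's axes admit simple data-driven proxies: discrimination as score spread relative to noise, and anti-saturation as remaining headroom below the ceiling. The exact formulas and numbers below are illustrative guesses, not the paper's definitions:

```python
import statistics

def capability_discrimination(model_scores, noise=1.0):
    """Spread of model scores relative to an assumed score-noise level;
    higher means the benchmark separates models more sharply."""
    return statistics.pstdev(model_scores) / noise

def anti_saturation(model_scores, ceiling=100.0):
    """Fraction of headroom left before the best model hits the ceiling."""
    return (ceiling - max(model_scores)) / ceiling

# Hypothetical accuracies of four models on one benchmark.
scores = [62.0, 71.5, 78.0, 83.5]
print(round(capability_discrimination(scores, noise=2.0), 2))  # ~4.0
print(round(anti_saturation(scores), 3))  # 0.165: 16.5% headroom remains
```

A benchmark with headroom near zero is close to saturation and loses resolution, which is the lifecycle signal the abstract's second axis is after.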
cs.AI / 31 / 2602.11675

Right for the Wrong Reasons: Epistemic Regret Minimization for Causal Rung Collapse in LLMs

错误原因下的正确:大规模语言模型因果层次崩溃的认知遗憾最小化
Chang, Edward Y.
Abstract
Machine learning systems that are "right for the wrong reasons" achieve high performance through shortcuts that collapse under distributional shift. We show this pathology has a precise causal origin: autoregressive training provides no gradient signal to distinguish association P(Y|X) from intervention P(Y|do(X)), a failure we formalize as Rung Collapse. When outcome-based learning reinforces correct answers obtained through incorrect causal models, the agent becomes entrenched in flawed reasoning, a phenomenon we term Aleatoric Entrenchment. We propose Epistemic Regret Minimization (ERM), a belief revision objective that penalizes errors in causal reasoning independently of task success, and embed it within a three-layer architecture with three contributions grounded in knowledge representation: (1) a Physical Grounding Theorem proving that actions satisfying actuator independence implement valid do-operations, bridging action languages and do-calculus; (2) ERM as a causal belief revision operator satisfying AGM postulates, preventing entrenchment even when the agent succeeds for the wrong reasons; and (3) a failure mode taxonomy that classifies recurring reasoning errors and injects domain-independent guards, enabling cross-domain transfer. We prove asymptotic recovery of the true interventional distribution with finite-sample bounds. Experiments on 1,360 causal trap scenarios across six frontier LLMs reveal that Rung Collapse persists even in reasoning-enhanced models (3.7% for GPT-5.2), that steerability exhibits inverse scaling where advanced models resist generic correction, and that targeted ERM feedback recovers 53-59% of entrenched errors where outcome-level feedback fails.
Chinese Translation
被称为“错误原因下的正确”的机器学习系统通过在分布转变下崩溃的捷径实现高性能。我们展示了这种病理的确切因果起源:自回归训练未能提供区分关联 P(Y|X) 和干预 P(Y|do(X)) 的梯度信号,这一失败我们形式化为层次崩溃 (Rung Collapse)。当基于结果的学习强化通过不正确因果模型获得的正确答案时,代理会陷入错误推理的困境,我们称之为随机性固化 (Aleatoric Entrenchment)。我们提出了认知遗憾最小化 (Epistemic Regret Minimization, ERM),这是一种信念修正目标,独立于任务成功惩罚因果推理中的错误,并将其嵌入到一个三层架构中,提出了三个基于知识表示的贡献:(1) 物理基础定理证明满足执行器独立性的动作实现有效的 do-操作,连接了动作语言和 do-演算;(2) ERM 作为满足 AGM 公设的因果信念修正算子,即使代理因错误原因成功也能防止固化;(3) 一种失败模式分类法,分类重复的推理错误并注入领域无关的保护措施,从而实现跨领域转移。我们证明了在有限样本界限下真实干预分布的渐近恢复。在六个前沿大规模语言模型上对 1360 个因果陷阱场景的实验表明,即使在增强推理的模型中,层次崩溃依然存在(GPT-5.2 的比例为 3.7%),可操控性表现出反向缩放,即高级模型抵制通用修正,而针对性的 ERM 反馈则在结果级反馈失败的情况下恢复了 53-59% 的固化错误。
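The gap the abstract formalizes as Rung Collapse, association P(Y|X) versus intervention P(Y|do(X)), can be made concrete on a toy confounded model; the probabilities below are invented, and the backdoor adjustment is the standard textbook computation rather than anything specific to the paper:

```python
# Toy structural causal model with a binary confounder Z -> X and Z -> Y.
P_Z = {0: 0.5, 1: 0.5}
P_X_given_Z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}   # P(X=x | Z=z)
P_Y1_given_XZ = {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.9}

def p_y1_given_x(x):
    """Observational P(Y=1 | X=x): conditioning lets the confounder leak in."""
    num = sum(P_Z[z] * P_X_given_Z[z][x] * P_Y1_given_XZ[(x, z)] for z in (0, 1))
    den = sum(P_Z[z] * P_X_given_Z[z][x] for z in (0, 1))
    return num / den

def p_y1_do_x(x):
    """Interventional P(Y=1 | do(X=x)) via backdoor adjustment over Z."""
    return sum(P_Z[z] * P_Y1_given_XZ[(x, z)] for z in (0, 1))

print(round(p_y1_given_x(1), 3), round(p_y1_do_x(1), 3))  # 0.84 vs 0.6
```

An agent rewarded only on outcomes can score well using the 0.84 association while holding the wrong causal model; an epistemic objective in the paper's spirit would penalize the mismatch with the 0.6 interventional quantity directly.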
cs.AI / 32 / 2602.11678

Beyond Pixels: Vector-to-Graph Transformation for Reliable Schematic Auditing

超越像素:可靠原理图审计的向量-图转换
Ma, Chengwei, Tian, Zhen, Zhou, Zhou, Xu, Zhixian, Zhu, Xiaowei, Hua, Xia, Shi, Si, Yu, F. Richard
Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual understanding, yet they suffer from a critical limitation: structural blindness. Even state-of-the-art models fail to capture topology and symbolic logic in engineering schematics, as their pixel-driven paradigm discards the explicit vector-defined relations needed for reasoning. To overcome this, we propose a Vector-to-Graph (V2G) pipeline that converts CAD diagrams into property graphs where nodes represent components and edges encode connectivity, making structural dependencies explicit and machine-auditable. On a diagnostic benchmark of electrical compliance checks, V2G yields large accuracy gains across all error categories, while leading MLLMs remain near chance level. These results highlight the systemic inadequacy of pixel-based methods and demonstrate that structure-aware representations provide a reliable path toward practical deployment of multimodal AI in engineering domains. To facilitate further research, we release our benchmark and implementation at https://github.com/gm-embodied/V2G-Audit.
Chinese Translation
多模态大型语言模型(MLLMs)在视觉理解方面取得了显著进展,但它们存在一个关键限制:结构盲目性。即使是最先进的模型也无法捕捉工程原理图中的拓扑和符号逻辑,因为它们的像素驱动范式忽略了推理所需的明确向量定义关系。为了解决这一问题,我们提出了一种向量-图(V2G)管道,将CAD图转换为属性图,其中节点表示组件,边表示连接性,使结构依赖关系变得明确且可供机器审计。在电气合规检查的诊断基准测试中,V2G在所有错误类别中都取得了显著的准确性提升,而领先的MLLMs仍然接近随机水平。这些结果突显了基于像素的方法的系统性不足,并表明结构感知表示为多模态人工智能在工程领域的实际应用提供了一条可靠的路径。为了促进进一步研究,我们在https://github.com/gm-embodied/V2G-Audit发布了我们的基准和实现。
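A property graph with a machine-auditable rule can be sketched in a few lines; the components, the adjacency rule, and the "every relay must be adjacent to a fuse" check are all hypothetical examples, not rules from the paper's benchmark:

```python
# Minimal property graph: nodes are components, edges are connections.
nodes = {
    "K1": {"type": "relay"},
    "F1": {"type": "fuse"},
    "M1": {"type": "motor"},
}
edges = [("F1", "K1"), ("K1", "M1")]

def neighbors(node):
    """Undirected adjacency over the edge list."""
    return {b for a, b in edges if a == node} | {a for a, b in edges if b == node}

def audit_fuse_rule():
    """Hypothetical compliance rule: every relay must be adjacent to a fuse.
    Returns the list of violating component ids."""
    return [n for n, attrs in nodes.items()
            if attrs["type"] == "relay"
            and not any(nodes[m]["type"] == "fuse" for m in neighbors(n))]

print(audit_fuse_rule())  # [] -> no violations in this schematic
```

Because connectivity is explicit in the graph rather than implicit in pixels, rules like this become deterministic queries instead of perception problems.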
cs.AI / 33 / 2602.11683

ThinkRouter: Efficient Reasoning via Routing Thinking between Latent and Discrete Spaces

ThinkRouter:通过潜在空间与离散空间之间的路由思维实现高效推理
Xu, Xin, Yu, Tong, Chen, Xiang, Wang, Haoliang, McAuley, Julian, Mitra, Saayan
Abstract
Recent work explores latent reasoning to improve reasoning efficiency by replacing explicit reasoning trajectories with continuous representations in a latent space, yet its effectiveness varies across settings. Analysis of model confidence dynamics under latent reasoning reveals that thinking trajectories ending in incorrect answers contain fewer low-confidence steps than those ending in correct answers. Meanwhile, we suggest that soft embeddings aggregated by multiple low-confidence thinking alternatives may introduce and propagate noise, leading to high confidence in unreliable reasoning trajectories. Motivated by these observations, ThinkRouter, an inference-time confidence-aware routing mechanism is proposed to avoid high confidence and noise for efficient reasoning. ThinkRouter routes thinking to the discrete token space when model confidence is low, and to the latent space otherwise. Extensive experiments on STEM reasoning and coding benchmarks across diverse large reasoning models demonstrate that ThinkRouter outperforms explicit CoT, random routing, and latent reasoning baselines in terms of accuracy, achieving an average improvement of 19.70 points in Pass@1, while reducing generation length by up to 15.55%. Further comprehensive analysis reveals that ThinkRouter can calibrate errors arising from explicit CoT and latent reasoning, and accelerates end-of-thinking token generation by globally lowering model confidence.
Chinese Translation
近期的研究探讨了潜在推理,通过在潜在空间中用连续表示替代显式推理轨迹来提高推理效率,但其有效性在不同设置中有所差异。对潜在推理下模型信心动态的分析表明,结束于错误答案的思维轨迹包含的低信心步骤少于结束于正确答案的轨迹。同时,我们认为由多个低信心思维替代方案聚合的软嵌入可能引入并传播噪声,导致对不可靠推理轨迹的高信心。基于这些观察,我们提出了ThinkRouter,一种在推理时具有信心感知的路由机制,以避免高信心和噪声,从而实现高效推理。当模型信心低时,ThinkRouter将思维路由到离散标记空间,而在其他情况下则路由到潜在空间。在多种大型推理模型的STEM推理和编码基准上的广泛实验表明,ThinkRouter在准确性方面优于显式的链式推理(CoT)、随机路由和潜在推理基线,平均提高了19.70个点的Pass@1,同时将生成长度减少了最多15.55%。进一步的综合分析表明,ThinkRouter能够校准由显式CoT和潜在推理引起的错误,并通过全局降低模型信心加速思维结束标记的生成。
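The routing rule itself is simple enough to sketch: per thinking step, fall back to discrete tokens when confidence is low, stay latent otherwise. The confidence proxy, threshold, and trajectory below are invented for illustration:

```python
def route_step(step_confidence, threshold=0.75):
    """Route one thinking step: discrete tokens when the model is unsure
    (confidence below threshold), latent reasoning when it is confident."""
    return "discrete" if step_confidence < threshold else "latent"

# Hypothetical per-step confidences (e.g. max token probability) along
# one reasoning trajectory.
confidences = [0.95, 0.92, 0.58, 0.66, 0.90]
plan = [route_step(c) for c in confidences]
print(plan)  # ['latent', 'latent', 'discrete', 'discrete', 'latent']
```

The intuition from the abstract: spending discrete tokens exactly on the low-confidence steps avoids averaging several weak alternatives into a noisy soft embedding, while confident stretches keep the efficiency of latent reasoning.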
cs.AI / 34 / 2602.11717

Beyond Parameter Arithmetic: Sparse Complementary Fusion for Distribution-Aware Model Merging

超越参数算术:用于分布感知模型合并的稀疏互补融合
Lin, Weihong, Sun, Lin, Shi, Qilong, Yuan, Aomufei, Tian, Yuxuan, Wang, Zhengyang, Zhao, Guangxiang, Zhang, Xiangzheng, Yang, Tong
Abstract
Model merging has emerged as a promising paradigm for composing the capabilities of large language models by directly operating in weight space, enabling the integration of specialized models without costly retraining. However, existing merging methods largely rely on parameter-space heuristics, which often introduce severe interference, leading to degraded generalization and unstable generation behaviors such as repetition and incoherent outputs. In this work, we propose Sparse Complementary Fusion with reverse KL (SCF-RKL), a novel model merging framework that explicitly controls functional interference through sparse, distribution-aware updates. Instead of assuming linear additivity in parameter space, SCF-RKL measures the functional divergence between models using reverse Kullback-Leibler divergence and selectively incorporates complementary parameters. This mode-seeking, sparsity-inducing design effectively preserves stable representations while integrating new capabilities. We evaluate SCF-RKL across a wide range of model scales and architectures, covering both reasoning-focused and instruction-tuned models. Extensive experiments on 24 benchmarks spanning advanced reasoning, general reasoning and knowledge, instruction following, safety, and vision classification demonstrate that SCF-RKL consistently outperforms existing model merging methods while maintaining strong generalization and generation stability.
Chinese Translation
模型合并作为一种有前景的范式,通过直接在权重空间中操作,组合大型语言模型的能力,从而实现了在不进行昂贵重训练的情况下集成专业模型。然而,现有的合并方法在很大程度上依赖于参数空间启发式,这往往会引入严重的干扰,导致泛化能力下降和不稳定的生成行为,如重复和不连贯的输出。在本研究中,我们提出了具有反向KL(SCF-RKL)的稀疏互补融合,这是一种新颖的模型合并框架,通过稀疏的、分布感知的更新显式控制功能干扰。SCF-RKL并不假设参数空间中的线性可加性,而是使用反向Kullback-Leibler散度来衡量模型之间的功能差异,并选择性地整合互补参数。这种寻模、诱导稀疏性的设计有效地保留了稳定的表示,同时集成了新的能力。我们在广泛的模型规模和架构上评估了SCF-RKL,涵盖了以推理为重点和经过指令调优的模型。在涵盖高级推理、一般推理与知识、指令跟随、安全性和视觉分类的24个基准的广泛实验中,结果表明SCF-RKL在保持强泛化能力和生成稳定性的同时,始终优于现有的模型合并方法。
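The two ingredients named in the abstract, reverse KL as the divergence and sparsity in the incorporated parameters, can be sketched loosely; the top-k delta selection below is a generic stand-in for the paper's actual selection criterion, and all vectors are toy values:

```python
import math

def reverse_kl(p, q, eps=1e-12):
    """Reverse KL D(q || p) between two discrete distributions; the
    mode-seeking direction relative to the merged model q."""
    return sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))

def sparse_merge(base, donor, keep_ratio=0.25):
    """Apply only the largest-magnitude fraction of donor-minus-base deltas
    on top of the base weights (a generic sparsity stand-in)."""
    deltas = [d - b for b, d in zip(base, donor)]
    k = max(1, int(len(deltas) * keep_ratio))
    cutoff = sorted((abs(d) for d in deltas), reverse=True)[k - 1]
    return [b + (d if abs(d) >= cutoff else 0.0)
            for b, d in zip(base, deltas)]

base = [0.10, -0.20, 0.30, 0.00]
donor = [0.12, -0.20, 0.90, 0.05]
print(sparse_merge(base, donor))  # only the largest delta (0.60) is applied
print(round(reverse_kl([0.5, 0.5], [0.9, 0.1]), 3))
```

Zeroing the small deltas keeps most of the base model's stable representations untouched, which is the interference-control intuition behind sparse updates.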
cs.AI / 35 / 2602.11729

Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs

跨架构模型差异比较与跨编码器:无监督发现大型语言模型之间的差异
Jiralerspong, Thomas, Bricken, Trenton
Abstract
Model diffing, the process of comparing models' internal representations to identify their differences, is a promising approach for uncovering safety-critical behaviors in new models. However, its application has so far been primarily focused on comparing a base model with its finetune. Since new LLM releases are often novel architectures, cross-architecture methods are essential to make model diffing widely applicable. Crosscoders are one solution capable of cross-architecture model diffing but have only ever been applied to base vs finetune comparisons. We provide the first application of crosscoders to cross-architecture model diffing and introduce Dedicated Feature Crosscoders (DFCs), an architectural modification designed to better isolate features unique to one model. Using this technique, we find in an unsupervised fashion features including Chinese Communist Party alignment in Qwen3-8B and Deepseek-R1-0528-Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and a copyright refusal mechanism in GPT-OSS-20B. Together, our results work towards establishing cross-architecture crosscoder model diffing as an effective method for identifying meaningful behavioral differences between AI models.
Chinese Translation
模型差异比较是比较模型内部表示以识别其差异的过程,是揭示新模型中安全关键行为的有前景的方法。然而,迄今为止,其应用主要集中在比较基础模型与其微调模型之间。由于新发布的大型语言模型(LLMs)通常采用新颖的架构,因此跨架构的方法对于使模型差异比较具有广泛适用性至关重要。跨编码器(Crosscoders)是一种能够进行跨架构模型差异比较的解决方案,但迄今为止仅应用于基础模型与微调模型的比较。我们首次将跨编码器应用于跨架构模型差异比较,并引入了专用特征跨编码器(Dedicated Feature Crosscoders, DFCs),这是一种旨在更好地隔离特定于某一模型的特征的架构修改。通过这种技术,我们以无监督的方式发现了包括Qwen3-8B和Deepseek-R1-0528-Qwen3-8B中的中国共产党对齐、Llama3.1-8B-Instruct中的美国例外主义,以及GPT-OSS-20B中的版权拒绝机制等特征。我们的结果共同推动了将跨架构跨编码器模型差异比较建立为识别人工智能模型之间有意义的行为差异的有效方法。
cs.AI / 36 / 2602.11745

Text2GQL-Bench: A Text to Graph Query Language Benchmark [Experiment, Analysis & Benchmark]

Text2GQL-Bench:文本到图查询语言基准 [实验、分析与基准测试]
Lyu, Songlin, Ban, Lujie, Wu, Zihang, Luo, Tianqi, Liu, Jirong, Ma, Chenhao, Luo, Yuyu, Tang, Nan, Qi, Shipeng, Lin, Heng, Liu, Yongchao, Hong, Chuntao
Abstract
Graph models are fundamental to data analysis in domains rich with complex relationships. Text-to-Graph-Query-Language (Text-to-GQL) systems act as a translator, converting natural language into executable graph queries. This capability allows Large Language Models (LLMs) to directly analyze and manipulate graph data, positioning them as powerful agent infrastructures for Graph Database Management System (GDBMS). Despite recent progress, existing datasets are often limited in domain coverage, supported graph query languages, or evaluation scope. The advancement of Text-to-GQL systems is hindered by the lack of high-quality benchmark datasets and evaluation methods to systematically compare model capabilities across different graph query languages and domains. In this work, we present Text2GQL-Bench, a unified Text-to-GQL benchmark designed to address these limitations. Text2GQL-Bench couples a multi-GQL dataset that has 178,184 (Question, Query) pairs spanning 13 domains, with a scalable construction framework that generates datasets in different domains, question abstraction levels, and GQLs with heterogeneous resources. To support comprehensive assessment, we introduce an evaluation method that goes beyond a single end-to-end metric by jointly reporting grammatical validity, similarity, semantic alignment, and execution accuracy. Our evaluation uncovers a stark dialect gap in ISO-GQL generation: even strong LLMs achieve only at most 4% execution accuracy (EX) in zero-shot settings, though a fixed 3-shot prompt raises accuracy to around 50%, the grammatical validity remains lower than 70%. Moreover, a fine-tuned 8B open-weight model reaches 45.1% EX, and 90.8% grammatical validity, demonstrating that most of the performance jump is unlocked by exposure to sufficient ISO-GQL examples.
Chinese Translation
图模型在具有复杂关系的数据分析领域中是基础。文本到图查询语言(Text-to-GQL)系统充当翻译器,将自然语言转换为可执行的图查询。这一能力使得大型语言模型(LLMs)能够直接分析和操作图数据,从而将其定位为图数据库管理系统(GDBMS)的强大代理基础设施。尽管最近取得了一些进展,现有数据集在领域覆盖、支持的图查询语言或评估范围方面往往有限。Text-to-GQL系统的发展受到缺乏高质量基准数据集和评估方法的阻碍,这些方法能够系统地比较不同图查询语言和领域之间模型的能力。在本研究中,我们提出了Text2GQL-Bench,这是一个统一的Text-to-GQL基准,旨在解决这些局限性。Text2GQL-Bench结合了一个包含178,184对(问题,查询)的多GQL数据集,涵盖13个领域,并配备了一个可扩展的构建框架,能够生成不同领域、问题抽象层次和具有异构资源的GQL数据集。为了支持全面评估,我们引入了一种评估方法,超越单一的端到端指标,通过联合报告语法有效性、相似性、语义一致性和执行准确性来进行评估。我们的评估揭示了ISO-GQL生成中的明显方言差距:即使是强大的LLMs在零样本设置下的执行准确率(EX)也仅达到最多4%,尽管固定的3-shot提示将准确率提高到约50%,但语法有效性仍低于70%。此外,经过微调的8B开放权重模型达到了45.1%的EX和90.8%的语法有效性,证明大多数性能提升是通过接触足够的ISO-GQL示例解锁的。
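Two of the jointly reported metrics can be illustrated with toy stand-ins: a validity check in place of a real GQL parser, and an order-insensitive result-set comparison in the spirit of execution accuracy (EX). Both functions and the example query are invented, not the benchmark's graders:

```python
def grammatical_validity(query):
    """Toy check standing in for a real GQL parser: a recognized opening
    clause and balanced parentheses."""
    return (query.strip().upper().startswith("MATCH")
            and query.count("(") == query.count(")"))

def execution_accuracy(predicted_rows, gold_rows):
    """Order-insensitive comparison of result sets, as EX-style metrics do."""
    return 1.0 if sorted(predicted_rows) == sorted(gold_rows) else 0.0

pred = "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN m.title"
print(grammatical_validity(pred))                                # True
print(execution_accuracy(["Heat", "Ronin"], ["Ronin", "Heat"]))  # 1.0
```

Reporting validity and execution separately is what exposes the abstract's dialect gap: a model can emit mostly parseable queries that still return the wrong rows, or vice versa.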
cs.AI / 37 / 2602.11749

AIR: Improving Agent Safety through Incident Response

AIR:通过事件响应提升智能体安全性
Xiao, Zibo, Sun, Jun, Chen, Junjie
Abstract
Large Language Model (LLM) agents are increasingly deployed in practice across a wide range of autonomous applications. Yet current safety mechanisms for LLM agents focus almost exclusively on preventing failures in advance, providing limited capabilities for responding to, containing, or recovering from incidents after they inevitably arise. In this work, we introduce AIR, the first incident response framework for LLM agent systems. AIR defines a domain-specific language for managing the incident response lifecycle autonomously in LLM agent systems, and integrates it into the agent's execution loop to (1) detect incidents via semantic checks grounded in the current environment state and recent context, (2) guide the agent to execute containment and recovery actions via its tools, and (3) synthesize guardrail rules during eradication to block similar incidents in future executions. We evaluate AIR on three representative agent types. Results show that AIR achieves detection, remediation, and eradication success rates all exceeding 90%. Extensive experiments further confirm the necessity of AIR's key design components, show the timeliness and moderate overhead of AIR, and demonstrate that LLM-generated rules can approach the effectiveness of developer-authored rules across domains. These results show that incident response is both feasible and essential as a first-class mechanism for improving agent safety.
Chinese Translation
大型语言模型(LLM)智能体在各种自主应用中越来越多地被部署。然而,目前针对LLM智能体的安全机制几乎完全专注于预防故障,提供的能力有限,无法在事件不可避免地发生后进行响应、控制或恢复。在本研究中,我们介绍了AIR,这是首个针对LLM智能体系统的事件响应框架。AIR定义了一种特定领域的语言,用于在LLM智能体系统中自主管理事件响应生命周期,并将其集成到智能体的执行循环中,以(1)通过基于当前环境状态和最近上下文的语义检查来检测事件,(2)指导智能体通过其工具执行控制和恢复操作,以及(3)在消除过程中合成保护规则,以阻止未来执行中类似事件的发生。我们在三种代表性智能体类型上评估了AIR。结果表明,AIR在检测、修复和消除方面的成功率均超过90%。大量实验进一步确认了AIR关键设计组件的必要性,展示了AIR的及时性和适度开销,并证明LLM生成的规则在各个领域的有效性可以接近开发者编写的规则。这些结果表明,事件响应作为提升智能体安全性的首要机制是可行且必要的。
cs.AI / 38 / 2602.11767

TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

TSR:用于大型语言模型代理多轮强化学习的轨迹搜索展开(Trajectory-Search Rollouts)
Djuhera, Aladin, Kadhe, Swanand Ravindra, Ahmed, Farhan, Boche, Holger
Abstract
Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using task-specific feedback. This improves rollout quality and stabilizes learning while leaving the underlying optimization objective unchanged, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks at a one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a simple and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
Chinese Translation
大型语言模型(LLMs)的进展正在推动通过任务间的迭代多轮交互使用强化学习(RL)来训练代理。然而,多轮强化学习仍然具有挑战性,因为奖励往往是稀疏或延迟的,环境可能是随机的。在这种情况下,简单的轨迹采样可能会阻碍利用并导致模式崩溃。我们提出了TSR(轨迹搜索展开,Trajectory-Search Rollouts),这是一种训练时的方法,重新利用测试时扩展的思想以改进每轮的轨迹展开(rollout)生成。TSR通过使用任务特定反馈在每轮选择高评分的动作,执行轻量级树状搜索来构建高质量的轨迹。这提高了展开轨迹的质量并稳定了学习,同时保持基础优化目标不变,使得TSR与优化器无关。我们用最优N、束搜索和浅层前瞻搜索实例化了TSR,并将其与PPO和GRPO配对,在Sokoban、FrozenLake和WebShop任务上实现了高达15%的性能提升和更稳定的学习,且仅需一次性增加训练计算量。通过将搜索从推理阶段转移到训练的展开阶段,TSR提供了一种简单且通用的机制,以增强多轮代理学习,补充现有框架和拒绝采样式选择方法。
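The per-turn best-of-N variant described above can be sketched in a few lines. The policy, scorer, and environment step below are hypothetical stand-ins for the LLM agent, the task-specific feedback, and the task environment; the paper's beam and lookahead variants would replace `best_of_n_turn` with deeper search.

```python
import random

def best_of_n_turn(policy_sample, score_fn, state, n=4, rng=None):
    """Sample n candidate actions for one turn and keep the highest-scoring one.

    policy_sample(state, rng) -> action   (stand-in for the LLM policy)
    score_fn(state, action)   -> float    (task-specific feedback, e.g. env reward)
    """
    rng = rng or random.Random(0)
    candidates = [policy_sample(state, rng) for _ in range(n)]
    return max(candidates, key=lambda a: score_fn(state, a))

def rollout(policy_sample, score_fn, step_fn, state, horizon, n=4):
    """Build one training trajectory by applying best-of-n selection at every turn.

    The optimization objective (PPO, GRPO, ...) is untouched: only the rollout
    data fed to it changes, which is what makes the approach optimizer-agnostic.
    """
    trajectory = []
    rng = random.Random(42)
    for _ in range(horizon):
        action = best_of_n_turn(policy_sample, score_fn, state, n=n, rng=rng)
        trajectory.append((state, action))
        state = step_fn(state, action)
    return trajectory
```

The one-time extra training compute the abstract mentions corresponds to the factor-n extra policy samples per turn.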
cs.AI / 39 / 2602.11771

How to Optimize Multispecies Set Predictions in Presence-Absence Modeling?

在存在-缺失建模中如何优化多物种集合预测?
Gigot--Léandri, Sébastien, Morand, Gaétan, Joly, Alexis, Munoz, François, Mouillot, David, Botella, Christophe, Servajean, Maximilien
Abstract
Species distribution models (SDMs) commonly produce probabilistic occurrence predictions that must be converted into binary presence-absence maps for ecological inference and conservation planning. However, this binarization step is typically heuristic and can substantially distort estimates of species prevalence and community composition. We present MaxExp, a decision-driven binarization framework that selects the most probable species assemblage by directly maximizing a chosen evaluation metric. MaxExp requires no calibration data and is flexible across several scores. We also introduce the Set Size Expectation (SSE) method, a computationally efficient alternative that predicts assemblages based on expected species richness. Using three case studies spanning diverse taxa, species counts, and performance metrics, we show that MaxExp consistently matches or surpasses widely used thresholding and calibration methods, especially under strong class imbalance and high rarity. SSE offers a simpler yet competitive option. Together, these methods provide robust, reproducible tools for multispecies SDM binarization.
Chinese Translation
物种分布模型(SDMs)通常生成概率性出现预测,这些预测必须转换为二元存在-缺失地图,以便进行生态推断和保护规划。然而,这一二元化步骤通常是启发式的,可能会显著扭曲物种普遍性和群落组成的估计。我们提出了MaxExp,一个基于决策的二元化框架,通过直接最大化所选评估指标来选择最可能的物种组合。MaxExp不需要校准数据,并且在多种评分中具有灵活性。我们还介绍了集合大小期望(Set Size Expectation, SSE)方法,这是一种计算效率高的替代方案,基于预期物种丰富度预测组合。通过三个涵盖不同类群、物种计数和性能指标的案例研究,我们展示了MaxExp始终能够匹敌或超越广泛使用的阈值化和校准方法,尤其是在强类别不平衡和高稀有性条件下。SSE提供了一个更简单但具有竞争力的选项。这些方法共同为多物种SDM二元化提供了稳健、可重复的工具。
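As one reading of the SSE idea above (expected species richness drives assemblage size), a minimal sketch: sum the per-species probabilities to get an expected richness R, then predict presence for the R most probable species. The rounding and tie-breaking rules here are assumptions, not the paper's stated procedure.

```python
def sse_assemblage(probs):
    """Set Size Expectation (SSE), sketched from the abstract:
    1) expected richness R = sum of per-species occurrence probabilities,
    2) predict presence for the R most probable species.

    probs: dict mapping species name -> occurrence probability.
    Returns a dict mapping species name -> predicted presence (bool).
    """
    expected_richness = round(sum(probs.values()))
    ranked = sorted(probs, key=probs.get, reverse=True)
    present = set(ranked[:expected_richness])
    return {sp: (sp in present) for sp in probs}
```

Unlike a fixed global threshold, this ties the number of predicted presences to the model's own calibration, which is why it can stay reasonable under strong class imbalance.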
cs.AI / 40 / 2602.11780

RELATE: A Reinforcement Learning-Enhanced LLM Framework for Advertising Text Generation

RELATE:一种基于强化学习的广告文本生成大语言模型框架
Wang, Jinfang, Liu, Jiajie, Wu, Jianwei, Luo, Ziqin, Chen, Zhen, Li, Chunlei, Han, Biao, Deng, Tao, Li, Yi, Li, Shuanglong, Liu, Lin
Abstract
In online advertising, advertising text plays a critical role in attracting user engagement and driving advertiser value. Existing industrial systems typically follow a two-stage paradigm, where candidate texts are first generated and subsequently aligned with online performance metrics such as click-through rate (CTR). This separation often leads to misaligned optimization objectives and low funnel efficiency, limiting global optimality. To address these limitations, we propose RELATE, a reinforcement learning-based end-to-end framework that unifies generation and objective alignment within a single model. Instead of decoupling text generation from downstream metric alignment, RELATE integrates performance and compliance objectives directly into the generation process via policy learning. To better capture ultimate advertiser value beyond click-level signals, we incorporate conversion-oriented metrics into the objective and jointly model them with compliance constraints as multi-dimensional rewards, enabling the model to generate high-quality ad texts that improve conversion performance under policy constraints. Extensive experiments on large-scale industrial datasets demonstrate that RELATE consistently outperforms baselines. Furthermore, online deployment on a production advertising platform yields statistically significant improvements in click-through conversion rate (CTCVR) under strict policy constraints, validating the robustness and real-world effectiveness of the proposed framework.
Chinese Translation
在在线广告中,广告文本在吸引用户参与和推动广告主价值方面发挥着至关重要的作用。现有的工业系统通常遵循两阶段范式,首先生成候选文本,然后与点击率(CTR)等在线性能指标对齐。这种分离往往导致优化目标不一致和漏斗效率低下,从而限制了全局最优性。为了解决这些局限性,我们提出了RELATE,一种基于强化学习的端到端框架,将生成和目标对齐统一在一个模型中。RELATE通过策略学习将性能和合规目标直接整合到生成过程中,而不是将文本生成与下游指标对齐解耦。为了更好地捕捉超越点击级信号的最终广告主价值,我们将以转化为导向的指标纳入目标,并将其与合规约束共同建模为多维奖励,使模型能够生成在政策约束下提高转化性能的高质量广告文本。在大规模工业数据集上的广泛实验表明,RELATE始终优于基线。此外,在生产广告平台上的在线部署在严格的政策约束下实现了点击转化率(CTCVR)的统计显著提升,验证了所提框架的稳健性和实际有效性。
cs.AI / 41 / 2602.11782

FlowMind: Execute-Summarize for Structured Workflow Generation from LLM Reasoning

FlowMind:基于执行-总结的结构化工作流生成方法
Liu, Yihao, Zhang, Ziyun, He, Zile, Cai, Huaqian
Abstract
LLMs can solve complex tasks through reasoning and tool use, but accurately translating these solutions into structured workflows remains challenging. We model workflows as sequences of tool use and reformulate the problem as designing a mechanism that can both solve tasks and reliably construct workflows. Prior approaches that build workflows during execution often suffer from inaccuracies due to interference between the two processes. We propose an Execute-Summarize (ES) framework that decouples task execution from workflow construction: the model first completes the task using available tools, then independently reconstructs a structured workflow from execution traces. This separation improves workflow accuracy and robustness. We introduce FlowBench and show through extensive experiments that our approach outperforms existing methods, providing a reliable paradigm for grounding free-form LLM reasoning into structured workflows.
Chinese Translation
大型语言模型(LLMs)能够通过推理和工具使用解决复杂任务,但将这些解决方案准确转化为结构化工作流仍然具有挑战性。我们将工作流建模为工具使用的序列,并将问题重新表述为设计一种机制,该机制既能解决任务,又能可靠地构建工作流。以往在执行过程中构建工作流的方法往往由于这两个过程之间的干扰而导致不准确。我们提出了一种执行-总结(Execute-Summarize, ES)框架,将任务执行与工作流构建解耦:模型首先使用可用工具完成任务,然后独立地从执行痕迹重构结构化工作流。这种分离提高了工作流的准确性和鲁棒性。我们引入了FlowBench,并通过广泛的实验表明,我们的方法优于现有方法,为将自由形式的LLM推理转化为结构化工作流提供了可靠的范式。
cs.AI / 42 / 2602.11790

Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

超越端到端视频模型:基于大型语言模型的多智能体教育视频生成系统
Yan, Lingyong, Wu, Jiulong, Xie, Dong, Shi, Weixian, Xia, Deguo, Huang, Jizhou
Abstract
Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LAVES, a hierarchical LLM-based multi-agent system for generating high-quality instructional videos from educational problems. The LAVES formulates educational video generation as a multi-objective task that simultaneously demands correct step-by-step reasoning, pedagogically coherent narration, semantically faithful visual demonstrations, and precise audio-visual alignment. To address the limitations of prior approaches, including low procedural fidelity, high production cost, and limited controllability, LAVES decomposes the generation workflow into specialized agents coordinated by a central Orchestrating Agent with explicit quality gates and iterative critique mechanisms. Specifically, the Orchestrating Agent supervises a Solution Agent for rigorous problem solving, an Illustration Agent that produces executable visualization codes, and a Narration Agent for learner-oriented instructional scripts. In addition, all outputs from the working agents are subject to semantic critique, rule-based constraints, and tool-based compilation checks. Rather than directly synthesizing pixels, the system constructs a structured executable video script that is deterministically compiled into synchronized visuals and narration using template-driven assembly rules, enabling fully automated end-to-end production without manual editing. In large-scale deployments, LAVES achieves a throughput exceeding one million videos per day, delivering over a 95% reduction in cost compared to current industry-standard approaches while maintaining a high acceptance rate.
Chinese Translation
尽管最近的端到端视频生成模型在视觉内容创作方面表现出色,但在需要严格逻辑严谨性和精确知识表征的场景中,如教学和教育媒体,它们仍然存在局限性。为了解决这一问题,我们提出了LAVES,一个基于层次化大型语言模型的多智能体系统,用于从教育问题生成高质量的教学视频。LAVES将教育视频生成视为一个多目标任务,要求同时具备正确的逐步推理、教育上连贯的叙述、语义上忠实的视觉演示以及精确的音视频对齐。为了克服先前方法的局限性——包括程序保真度低、生产成本高和可控性有限——LAVES将生成工作流程分解为由中央协调代理(Orchestrating Agent)协调的专门智能体,并设有明确的质量门控和迭代批评机制。具体而言,协调代理监督一个解决方案代理(Solution Agent)进行严格的问题解决,一个插图代理(Illustration Agent)生成可执行的可视化代码,以及一个叙述代理(Narration Agent)负责面向学习者的教学脚本。此外,所有工作代理的输出都需经过语义批评、基于规则的约束和工具编译检查。该系统并非直接合成像素,而是构建一个结构化的可执行视频脚本,该脚本使用基于模板的组装规则确定性地编译成同步的视觉和叙述,从而实现完全自动化的端到端生产,无需人工编辑。在大规模部署中,LAVES的产出超过每天一百万个视频,与当前行业标准方法相比,成本降低超过95%,同时保持高接受率。
cs.AI / 43 / 2602.11792

Detecting RLVR Training Data via Structural Convergence of Reasoning

通过推理的结构收敛检测可验证奖励的强化学习训练数据
Zhang, Hongbo, Yang, Yue, Yan, Jianhao, Bao, Guangsheng, Zhang, Yue, Zhang, Yue
Abstract
Reinforcement learning with verifiable rewards (RLVR) is central to training modern reasoning models, but the undisclosed training data raises concerns about benchmark contamination. Unlike pretraining methods, which optimize models using token-level probabilities, RLVR fine-tunes models based on reward feedback from self-generated reasoning trajectories, making conventional likelihood-based detection methods less effective. We show that RLVR induces a distinctive behavioral signature: prompts encountered during RLVR training result in more rigid and similar generations, while unseen prompts retain greater diversity. We introduce Min-$k$NN Distance, a simple black-box detector that quantifies this collapse by sampling multiple completions for a given prompt and computing the average of the $k$ smallest nearest-neighbor edit distances. Min-$k$NN Distance requires no access to the reference model or token probabilities. Experiments across multiple RLVR-trained reasoning models show that Min-$k$NN Distance reliably distinguishes RL-seen examples from unseen ones and outperforms existing membership inference and RL contamination detection baselines.
Chinese Translation
可验证奖励的强化学习(RLVR)是训练现代推理模型的核心,但未公开的训练数据引发了基准污染的担忧。与通过令牌级概率优化模型的预训练方法不同,RLVR基于自生成推理轨迹的奖励反馈对模型进行微调,这使得传统的基于似然的检测方法效果不佳。我们展示了RLVR引发了独特的行为特征:在RLVR训练过程中遇到的提示会导致生成结果更加僵化和相似,而未见过的提示则保持更大的多样性。我们引入了Min-$k$NN距离,这是一种简单的黑箱检测器,通过对给定提示进行多次完成采样并计算$k$个最小邻近编辑距离的平均值来量化这种崩溃。Min-$k$NN距离无需访问参考模型或令牌概率。在多个RLVR训练的推理模型上的实验表明,Min-$k$NN距离能够可靠地区分RL已见示例和未见示例,并且优于现有的成员推断和RL污染检测基线。
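The Min-$k$NN Distance statistic above is fully specified by the abstract: sample several completions for a prompt, compute each sample's nearest-neighbor edit distance among the others, and average the $k$ smallest values. A plain-Python sketch, using Levenshtein distance as the edit metric (the paper may refine the metric or normalize it):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def min_knn_distance(completions, k=3):
    """Min-kNN Distance: for each sampled completion, find its nearest-neighbor
    edit distance among the other samples, then average the k smallest values.

    Low values indicate the collapsed, near-duplicate generations that are
    characteristic of prompts seen during RLVR training; higher values indicate
    the diversity retained on unseen prompts."""
    nn = [min(levenshtein(c, o) for j, o in enumerate(completions) if j != i)
          for i, c in enumerate(completions)]
    return sum(sorted(nn)[:k]) / k
```

Note the detector is fully black-box: it needs only sampled strings, never token probabilities or a reference model, matching the abstract's claim.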
cs.AI / 44 / 2602.11799

Hi-SAM: A Hierarchical Structure-Aware Multi-modal Framework for Large-Scale Recommendation

Hi-SAM:一种层次结构感知的多模态大规模推荐框架
Pan, Pingjun, Zhou, Tingting, Lu, Peiyao, Fei, Tingting, Chen, Hongxiang, Luo, Chuanjiang
Abstract
Multi-modal recommendation has gained traction as items possess rich attributes like text and images. Semantic ID-based approaches effectively discretize this information into compact tokens. However, two challenges persist: (1) Suboptimal Tokenization: existing methods (e.g., RQ-VAE) lack disentanglement between shared cross-modal semantics and modality-specific details, causing redundancy or collapse; (2) Architecture-Data Mismatch: vanilla Transformers treat semantic IDs as flat streams, ignoring the hierarchy of user interactions, items, and tokens. Expanding items into multiple tokens amplifies length and noise, biasing attention toward local details over holistic semantics. We propose Hi-SAM, a Hierarchical Structure-Aware Multi-modal framework with two designs: (1) Disentangled Semantic Tokenizer (DST): unifies modalities via geometry-aware alignment and quantizes them via a coarse-to-fine strategy. Shared codebooks distill consensus while modality-specific ones recover nuances from residuals, enforced by mutual information minimization; (2) Hierarchical Memory-Anchor Transformer (HMAT): splits positional encoding into inter- and intra-item subspaces via Hierarchical RoPE to restore hierarchy. It inserts Anchor Tokens to condense items into compact memory, retaining details for the current item while accessing history only through compressed summaries. Experiments on real-world datasets show consistent improvements over SOTA baselines, especially in cold-start scenarios. Deployed on a large-scale social platform serving millions of users, Hi-SAM achieved a 6.55% gain in the core online metric.
Chinese Translation
多模态推荐因物品具有文本和图像等丰富属性而受到关注。基于语义ID的方法有效地将这些信息离散化为紧凑的标记。然而,仍然存在两个挑战:(1)次优标记化:现有方法(如 RQ-VAE)缺乏对共享跨模态语义和特定模态细节的解耦,导致冗余或崩溃;(2)架构与数据不匹配:普通的Transformer将语义ID视为平坦流,忽略了用户交互、物品和标记的层次结构。将物品扩展为多个标记会增加长度和噪声,使注意力偏向局部细节而非整体语义。我们提出了Hi-SAM,一种层次结构感知的多模态框架,具有两个设计:(1)解耦语义标记器(DST):通过几何感知对齐统一模态,并通过粗到细的策略对其进行量化。共享代码本提炼共识,而特定模态的代码本则从残差中恢复细微差别,通过互信息最小化进行强制;(2)层次记忆锚点Transformer(HMAT):通过层次RoPE将位置编码分割为物品间和物品内子空间,以恢复层次结构。它插入锚点标记,将物品压缩为紧凑的记忆,保留当前物品的细节,同时仅通过压缩摘要访问历史。对真实世界数据集的实验表明,Hi-SAM相较SOTA基线持续取得改进,尤其是在冷启动场景中。部署在一个为数百万用户服务的大规模社交平台上,Hi-SAM在核心在线指标上实现了6.55%的提升。
cs.AI / 45 / 2602.11807

PuYun-LDM: A Latent Diffusion Model for High-Resolution Ensemble Weather Forecasts

PuYun-LDM:一种用于高分辨率集合天气预报的潜在扩散模型
Wu, Lianjun, Zhu, Shengchen, Liu, Yuxuan, Kai, Liuyu, Feng, Xiaoduan, Wang, Duomin, Liu, Wenshuo, Zhang, Jingxuan, Li, Kelvin, Wang, Bin
Abstract
Latent diffusion models (LDMs) suffer from limited diffusability in high-resolution (≤0.25°) ensemble weather forecasting, where diffusability characterizes how easily a latent data distribution can be modeled by a diffusion process. Unlike natural image fields, meteorological fields lack task-agnostic foundation models and explicit semantic structures, making VFM-based regularization inapplicable. Moreover, existing frequency-based approaches impose identical spectral regularization across channels under a homogeneity assumption, which leads to uneven regularization strength under the inter-variable spectral heterogeneity in multivariate meteorological data. To address these challenges, we propose a 3D Masked AutoEncoder (3D-MAE) that encodes weather-state evolution features as an additional conditioning for the diffusion model, together with a Variable-Aware Masked Frequency Modeling (VA-MFM) strategy that adaptively selects thresholds based on the spectral energy distribution of each variable. Together, we propose PuYun-LDM, which enhances latent diffusability and achieves superior performance to ENS at short lead times while remaining comparable to ENS at longer horizons. PuYun-LDM generates a 15-day global forecast with a 6-hour temporal resolution in five minutes on a single NVIDIA H200 GPU, while ensemble forecasts can be efficiently produced in parallel.
Chinese Translation
潜在扩散模型(LDMs)在高分辨率(≤0.25°)集合天气预报中存在扩散性有限的问题,其中扩散性表征了潜在数据分布被扩散过程建模的难易程度。与自然图像场不同,气象场缺乏任务无关的基础模型和明确的语义结构,使得基于视觉基础模型(VFM)的正则化不适用。此外,现有的基于频率的方法在均匀性假设下对各通道施加相同的频谱正则化,这导致在多变量气象数据中,由于变量间频谱异质性而出现不均匀的正则化强度。为了解决这些挑战,我们提出了一种三维掩蔽自编码器(3D Masked AutoEncoder,3D-MAE),它将天气状态演变特征编码为扩散模型的额外条件,并结合了一种基于变量感知的掩蔽频率建模(Variable-Aware Masked Frequency Modeling,VA-MFM)策略,该策略根据每个变量的频谱能量分布自适应选择阈值。综合上述设计,我们提出了PuYun-LDM,它增强了潜在扩散性,并在短期预报中表现优于集合预报(ENS),同时在较长时间范围内与ENS的表现相当。PuYun-LDM能够在单个NVIDIA H200 GPU上以6小时的时间分辨率在五分钟内生成15天的全球预报,同时能够高效地并行生成集合预报。
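One plausible instantiation of the variable-aware thresholding above: choose, per variable, the lowest frequency index whose cumulative spectral energy reaches a target fraction of that variable's total energy. The energy-fraction criterion and the 0.95 default are assumptions for illustration, not the paper's stated rule.

```python
def variable_aware_cutoff(spectrum, energy_fraction=0.95):
    """Pick the smallest frequency index whose cumulative energy reaches
    `energy_fraction` of the variable's total spectral energy.

    spectrum: per-frequency energies for one variable, ordered low -> high
    frequency (e.g. from a Fourier transform of that field)."""
    total = sum(spectrum)
    running = 0.0
    for idx, energy in enumerate(spectrum):
        running += energy
        if running >= energy_fraction * total:
            return idx
    return len(spectrum) - 1
```

Under this rule a smooth variable (energy concentrated at low frequencies) gets a small cutoff while a rough one gets a larger cutoff, so each channel is regularized at its own scale instead of sharing one spectral mask, which is the heterogeneity problem the abstract raises.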
cs.AI / 46 / 2602.11812

Predicting LLM Output Length via Entropy-Guided Representations

通过熵引导表示预测大语言模型输出长度
Xie, Huanyi, Chen, Yubin, Wang, Liangyu, Hu, Lijie, Wang, Di
Abstract
The long-tailed distribution of sequence lengths in LLM serving and reinforcement learning (RL) sampling causes significant computational waste due to excessive padding in batched inference. Existing methods rely on auxiliary models for static length prediction, but they incur high overhead, generalize poorly, and fail in stochastic "one-to-many" sampling scenarios. We introduce a lightweight framework that reuses the main model's internal hidden states for efficient length prediction. Our framework features two core components: 1) Entropy-Guided Token Pooling (EGTP), which uses on-the-fly activations and token entropy for highly accurate static prediction with negligible cost, and 2) Progressive Length Prediction (PLP), which dynamically estimates the remaining length at each decoding step to handle stochastic generation. To validate our approach, we build and release ForeLen, a comprehensive benchmark with long-sequence, Chain-of-Thought, and RL data. On ForeLen, EGTP achieves state-of-the-art accuracy, reducing MAE by 29.16% over the best baseline. Integrating our methods with a length-aware scheduler yields significant end-to-end throughput gains. Our work provides a new technical and evaluation baseline for efficient LLM inference.
Chinese Translation
大语言模型(LLM)服务和强化学习(RL)采样中序列长度的长尾分布导致批量推理中由于过度填充而产生显著的计算浪费。现有方法依赖于辅助模型进行静态长度预测,但它们会产生高开销、泛化能力差,并且在随机的“一对多”采样场景中表现不佳。我们提出了一种轻量级框架,重用主模型的内部隐藏状态以实现高效的长度预测。我们的框架具有两个核心组件:1)熵引导的令牌池化(Entropy-Guided Token Pooling, EGTP),利用实时激活和令牌熵进行高精度的静态预测,成本几乎可以忽略不计;2)渐进长度预测(Progressive Length Prediction, PLP),在每个解码步骤动态估计剩余长度,以处理随机生成。为了验证我们的方法,我们构建并发布了ForeLen,这是一个包含长序列、思维链和强化学习数据的综合基准。在ForeLen上,EGTP实现了最先进的准确性,相较于最佳基线减少了29.16%的平均绝对误差(MAE)。将我们的方法与长度感知调度器结合使用,显著提高了端到端的吞吐量。我们的工作为高效的LLM推理提供了新的技术和评估基准。
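A hedged sketch of the entropy-guided pooling idea: weight each token's hidden state by the Shannon entropy of its next-token distribution before pooling into a single vector for a length-prediction head. Whether EGTP upweights high-entropy tokens in exactly this way is an assumption; the abstract only says that activations and token entropy are combined.

```python
import math

def token_entropy(prob_dist):
    """Shannon entropy of one token's next-token probability distribution."""
    return -sum(p * math.log(p) for p in prob_dist if p > 0)

def entropy_guided_pool(hidden_states, token_dists):
    """Pool per-token hidden states with entropy-derived weights.

    hidden_states: list of vectors (lists of floats), one per token.
    token_dists:   list of probability distributions, one per token.
    Returns a single pooled vector that a small length regressor could consume.
    """
    weights = [token_entropy(d) for d in token_dists]
    z = sum(weights) or 1.0          # avoid division by zero on all-certain input
    weights = [w / z for w in weights]
    dim = len(hidden_states[0])
    return [sum(w * h[i] for w, h in zip(weights, hidden_states))
            for i in range(dim)]
```

Because the hidden states and logits are already computed during the forward pass, a pooling step like this adds only vector arithmetic, consistent with the "negligible cost" claim.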
cs.AI / 47 / 2602.11824

Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models

Revis:稀疏潜在引导以减轻大型视觉-语言模型中的物体幻觉
Wu, Jialin, Shi, Wei, Shen, Han, Qi, Peigui, Tang, Kunsheng, Huang, Zhicong, Wang, Binghao, Yang, Zhou
Abstract
Despite the advanced capabilities of Large Vision-Language Models (LVLMs), they frequently suffer from object hallucination. One reason is that visual features and pretrained textual representations often become intertwined in the deeper network layers. To address this, we propose REVIS, a training-free framework designed to explicitly re-activate this suppressed visual information. Rooted in latent space geometry, REVIS extracts the pure visual information vector via orthogonal projection and employs a calibrated strategy to perform sparse intervention only at the precise depth where suppression occurs. This surgical approach effectively restores visual information with minimal computational cost. Empirical evaluations on standard benchmarks demonstrate that REVIS reduces object hallucination rates by approximately 19% compared to state-of-the-art baselines, while preserving general reasoning capabilities.
Chinese Translation
尽管大型视觉-语言模型(LVLMs)具备先进的能力,但它们仍然经常遭遇物体幻觉。其原因之一在于视觉特征和预训练文本表示在深层网络中常常交织在一起。为了解决这一问题,我们提出了REVIS,一个无需训练的框架,旨在明确地重新激活这些被抑制的视觉信息。REVIS根植于潜在空间几何,通过正交投影提取纯粹的视觉信息向量,并采用校准策略仅在抑制发生的精确深度进行稀疏干预。这种外科式的方法有效地以最小的计算成本恢复视觉信息。对标准基准的实证评估表明,与最先进的基线相比,REVIS将物体幻觉率降低了约19%,同时保持了通用推理能力。
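The orthogonal-projection step is standard linear algebra and can be sketched directly. The `steer` helper that adds the extracted vector back into the hidden state, and its `alpha` scale, are hypothetical names for the intervention the abstract describes, not the paper's API.

```python
def project_out(v, t):
    """Remove from v its component along direction t (orthogonal projection).

    In a REVIS-like setting, v is a hidden state and t a textual direction;
    what remains is taken as the 'pure visual' component of v."""
    dot_vt = sum(a * b for a, b in zip(v, t))
    dot_tt = sum(b * b for b in t) or 1.0   # guard against a zero direction
    scale = dot_vt / dot_tt
    return [a - scale * b for a, b in zip(v, t)]

def steer(hidden, visual_vec, alpha=1.0):
    """Re-activate suppressed visual information by adding the extracted
    visual vector back, scaled by alpha, at the layer where suppression
    is detected (the sparse-intervention depth)."""
    return [h + alpha * g for h, g in zip(hidden, visual_vec)]
```

The result of `project_out` is orthogonal to `t` by construction, which is the geometric sense in which the extracted vector is "pure" of the textual direction.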
cs.AI / 48 / 2602.11852

Prototype Transformer: Towards Language Model Architectures Interpretable by Design

原型变换器:迈向设计上可解释的语言模型架构
Yordanov, Yordan, Forasassi, Matteo, Menzat, Bayar, Wang, Ruizhi, Qi, Chang, Kaltenberger, Markus, M'Charrak, Amine, Salvatori, Tommaso, Lukasiewicz, Thomas
Abstract
While state-of-the-art language models (LMs) surpass the vast majority of humans in certain domains, their reasoning remains largely opaque, undermining trust in their output. Furthermore, while autoregressive LMs can output explicit reasoning, their true reasoning process is opaque, which introduces risks like deception and hallucination. In this work, we introduce the Prototype Transformer (ProtoT) -- an autoregressive LM architecture based on prototypes (parameter vectors), posed as an alternative to the standard self-attention-based transformers. ProtoT works by means of two-way communication between the input sequence and the prototypes, and we show that this leads to the prototypes automatically capturing nameable concepts (e.g. "woman") during training. They provide the potential to interpret the model's reasoning and allow for targeted edits of its behavior. Furthermore, by design, the prototypes create communication channels that aggregate contextual information at different time scales, aiding interpretability. In terms of computation scalability, ProtoT scales linearly with sequence length, versus the quadratic scaling of SOTA self-attention transformers. Compared to baselines, ProtoT scales well with model and data size, and performs well on text generation and downstream tasks (GLUE). ProtoT exhibits robustness to input perturbations on par with or better than some baselines, but differs from them by providing interpretable pathways showing how robustness and sensitivity arise. Reaching close to the performance of state-of-the-art architectures, ProtoT paves the way to creating well-performing autoregressive LMs interpretable by design.
Chinese Translation
尽管最先进的语言模型(LMs)在某些领域超越了绝大多数人类,但它们的推理过程仍然在很大程度上不透明,这削弱了对其输出的信任。此外,尽管自回归语言模型可以输出明确的推理,但其真实的推理过程仍然不透明,这引入了欺骗和幻觉等风险。在本研究中,我们介绍了原型变换器(Prototype Transformer,ProtoT)——一种基于原型(参数向量)的自回归语言模型架构,作为标准自注意力变换器的替代方案。ProtoT通过输入序列与原型之间的双向通信进行工作,我们展示了这导致原型在训练过程中自动捕捉可命名概念(例如“女性”)。它们提供了解释模型推理的潜力,并允许针对性地编辑其行为。此外,原型的设计创建了在不同时间尺度上聚合上下文信息的通信通道,帮助提高可解释性。在计算可扩展性方面,ProtoT的扩展与序列长度呈线性关系,而最先进的自注意力变换器则呈现出二次扩展性。与基线相比,ProtoT随模型和数据规模良好扩展,并在文本生成和下游任务(GLUE)上表现出色。ProtoT对输入扰动表现出与某些基线相当或更好的鲁棒性,但与它们不同的是,ProtoT提供了可解释的路径,展示了鲁棒性和敏感性是如何产生的。ProtoT的性能接近最先进的架构,为创建设计上可解释的高性能自回归语言模型铺平了道路。
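A toy sketch of why two-way token-prototype communication is linear in sequence length: prototypes first aggregate over the tokens, then tokens read back from the fixed-size prototype summaries, so the cost is O(L·P) for L tokens and P prototypes rather than the O(L²) of token-token self-attention. Causal masking, normalization, and learned projections are all omitted here, so this illustrates the complexity argument only, not the published layer.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def prototype_layer(tokens, prototypes):
    """Two-way token<->prototype communication (simplified).

    Step 1: each prototype attends over all tokens and forms a summary.
    Step 2: each token attends over the P summaries and reads back.
    Both steps touch every (token, prototype) pair once: O(L * P) total.
    """
    summaries = []
    for p in prototypes:
        w = softmax([dot(p, t) for t in tokens])
        summaries.append([sum(wi * t[d] for wi, t in zip(w, tokens))
                          for d in range(len(p))])
    out = []
    for t in tokens:
        w = softmax([dot(t, s) for s in summaries])
        out.append([sum(wi * s[d] for wi, s in zip(w, summaries))
                    for d in range(len(t))])
    return out
```

Because every token's output is a convex combination of a fixed number of prototype summaries, inspecting which prototypes a token reads from is exactly the interpretable pathway the abstract describes.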
cs.AI / 49 / 2602.11860

Talk2DM: Enabling Natural Language Querying and Commonsense Reasoning for Vehicle-Road-Cloud Integrated Dynamic Maps with Large Language Models

Talk2DM:利用大型语言模型实现车辆-道路-云集成动态地图的自然语言查询与常识推理
Tao, Lu, Luo, Jinxuan, Watanabe, Yousuke, Zhou, Zhengshu, Lu, Yuhuan, Ying, Shen, Zhang, Pan, Zhao, Fei, Takada, Hiroaki
Abstract
Dynamic maps (DM) serve as the fundamental information infrastructure for vehicle-road-cloud (VRC) cooperative autonomous driving in China and Japan. By providing comprehensive traffic scene representations, DM overcome the limitations of standalone autonomous driving systems (ADS), such as physical occlusions. Although DM-enhanced ADS have been successfully deployed in real-world applications in Japan, existing DM systems still lack a natural-language-supported (NLS) human interface, which could substantially enhance human-DM interaction. To address this gap, this paper introduces VRCsim, a VRC cooperative perception (CP) simulation framework designed to generate streaming VRC-CP data. Based on VRCsim, we construct a question-answering data set, VRC-QA, focused on spatial querying and reasoning in mixed-traffic scenes. Building upon VRCsim and VRC-QA, we further propose Talk2DM, a plug-and-play module that extends VRC-DM systems with NLS querying and commonsense reasoning capabilities. Talk2DM is built upon a novel chain-of-prompt (CoP) mechanism that progressively integrates human-defined rules with the commonsense knowledge of large language models (LLMs). Experiments on VRC-QA show that Talk2DM can seamlessly switch across different LLMs while maintaining high NLS query accuracy, demonstrating strong generalization capability. Although larger models tend to achieve higher accuracy, they incur significant efficiency degradation. Our results reveal that Talk2DM, powered by Qwen3:8B, Gemma3:27B, and GPT-oss models, achieves over 93% NLS query accuracy with an average response time of only 2-5 seconds, indicating strong practical potential.
Chinese Translation
动态地图(DM)作为中国和日本车辆-道路-云(VRC)协作自动驾驶的基础信息基础设施,通过提供全面的交通场景表示,克服了独立自动驾驶系统(ADS)在物理遮挡等方面的局限性。尽管基于动态地图的增强型自动驾驶系统在日本的实际应用中取得了成功,但现有的动态地图系统仍然缺乏支持自然语言的(NLS)人机界面,而这一界面本可显著增强人与动态地图之间的交互。为了解决这一问题,本文介绍了VRCsim,一个旨在生成流式VRC-CP数据的VRC协作感知(CP)仿真框架。基于VRCsim,我们构建了一个专注于混合交通场景中的空间查询和推理的问题回答数据集VRC-QA。在此基础上,我们进一步提出了Talk2DM,一个即插即用的模块,扩展了VRC-DM系统的NLS查询和常识推理能力。Talk2DM建立在一种新颖的提示链(CoP)机制之上,该机制逐步将人类定义的规则与大型语言模型(LLMs)的常识知识相结合。对VRC-QA的实验表明,Talk2DM能够在不同的LLM之间无缝切换,同时保持高NLS查询准确性,展示出强大的泛化能力。尽管较大的模型往往能实现更高的准确性,但它们会导致显著的效率下降。我们的结果显示,Talk2DM在Qwen3:8B、Gemma3:27B和GPT-oss模型的支持下,NLS查询准确率超过93%,平均响应时间仅为2-5秒,显示出强大的实际应用潜力。
cs.AI / 50 / 2602.11865

Intelligent AI Delegation

智能AI委托
Tomašev, Nenad, Franklin, Matija, Osindero, Simon
Abstract
AI agents are able to tackle increasingly complex tasks. To achieve more ambitious goals, AI agents need to be able to meaningfully decompose problems into manageable sub-components, and safely delegate their completion across to other AI agents and humans alike. Yet, existing task decomposition and delegation methods rely on simple heuristics, and are not able to dynamically adapt to environmental changes and robustly handle unexpected failures. Here we propose an adaptive framework for intelligent AI delegation - a sequence of decisions involving task allocation, that also incorporates transfer of authority, responsibility, accountability, clear specifications regarding roles and boundaries, clarity of intent, and mechanisms for establishing trust between the two (or more) parties. The proposed framework is applicable to both human and AI delegators and delegatees in complex delegation networks, aiming to inform the development of protocols in the emerging agentic web.
Chinese Translation
AI代理能够处理日益复杂的任务。为了实现更雄心勃勃的目标,AI代理需要能够有意义地将问题分解为可管理的子组件,并安全地将其完成任务委托给其他AI代理和人类。然而,现有的任务分解和委托方法依赖于简单的启发式,并不能动态适应环境变化,也无法稳健地处理意外故障。在此,我们提出了一种适应性框架,用于智能AI委托——一系列涉及任务分配的决策,同时还包括权力、责任、问责的转移,以及关于角色和边界的明确规定、意图的清晰性,以及在两个(或更多)参与方之间建立信任的机制。该框架适用于复杂委托网络中的人类和AI委托者及被委托者,旨在为新兴的代理网络中协议的发展提供参考。
cs.AI / 51 / 2602.11881

From Atoms to Trees: Building a Structured Feature Forest with Hierarchical Sparse Autoencoders

从原子到树:利用层次稀疏自编码器构建结构化特征森林
Luo, Yifan, Zhan, Yang, Jiang, Jiedong, Liu, Tianyang, Wu, Mingrui, Zhou, Zhennan, Dong, Bin
Abstract
Sparse autoencoders (SAEs) have proven effective for extracting monosemantic features from large language models (LLMs), yet these features are typically identified in isolation. However, broad evidence suggests that LLMs capture the intrinsic structure of natural language, where the phenomenon of "feature splitting" in particular indicates that such structure is hierarchical. To capture this, we propose the Hierarchical Sparse Autoencoder (HSAE), which jointly learns a series of SAEs and the parent-child relationships between their features. HSAE strengthens the alignment between parent and child features through two novel mechanisms: a structural constraint loss and a random feature perturbation mechanism. Extensive experiments across various LLMs and layers demonstrate that HSAE consistently recovers semantically meaningful hierarchies, supported by both qualitative case studies and rigorous quantitative metrics. At the same time, HSAE preserves the reconstruction fidelity and interpretability of standard SAEs across different dictionary sizes. Our work provides a powerful, scalable tool for discovering and analyzing the multi-scale conceptual structures embedded in LLM representations.
Chinese Translation
稀疏自编码器(SAEs)在从大型语言模型(LLMs)中提取单一语义特征方面已被证明有效,但这些特征通常是孤立识别的。然而,广泛的证据表明,LLMs 捕捉到了自然语言的内在结构,其中“特征分裂”现象尤其表明这种结构是层次性的。为了捕捉这一点,我们提出了层次稀疏自编码器(HSAE),它共同学习一系列 SAEs 及其特征之间的父子关系。HSAE 通过两种新颖机制增强父子特征之间的对齐:结构约束损失和随机特征扰动机制。在不同 LLM 和层次上的大量实验表明,HSAE 一致地恢复了语义上有意义的层次结构,这得到了定性案例研究和严格的定量指标的支持。同时,HSAE 在不同字典大小下保持了标准 SAEs 的重构保真度和可解释性。我们的工作为发现和分析嵌入在 LLM 表示中的多尺度概念结构提供了一种强大且可扩展的工具。
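The abstract names a structural constraint loss between parent and child features without giving its form. One plausible form, shown purely as an assumption, penalizes a child feature that fires more strongly than the parent it is assigned to, which encourages the hierarchy "child active implies parent active":

```python
def structural_constraint_loss(parent_acts, child_acts, parent_of):
    """Hypothetical parent-child structural constraint for an HSAE-like model.

    Penalizes a child feature firing while its assigned parent stays silent
    (or fires less): hinge on (child - parent), averaged over children.

    parent_acts / child_acts: nonnegative feature activations (post-ReLU).
    parent_of: dict mapping child index -> parent index."""
    loss = 0.0
    for c, a in enumerate(child_acts):
        p = parent_of[c]
        loss += max(0.0, a - parent_acts[p])  # child may not exceed its parent
    return loss / max(len(child_acts), 1)
```

In training, a term like this would be added to the usual SAE reconstruction-plus-sparsity objective; the paper's actual loss and its random-perturbation mechanism may differ substantially.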
cs.AI / 52 / 2602.11908

When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation

大型语言模型何时应降低特异性?面向可靠长文本生成的选择性抽象
Goren, Shani, Galil, Ido, El-Yaniv, Ran
Abstract
LLMs are widely used, yet they remain prone to factual errors that erode user trust and limit adoption in high-risk settings. One approach to mitigate this risk is to equip models with uncertainty estimation mechanisms that abstain when confidence is low. However, this binary "all-or-nothing" approach is excessively restrictive in long-form settings, often discarding valuable information. We introduce Selective Abstraction (SA), a framework that enables LLMs to trade specificity for reliability by selectively reducing the detail of uncertain content. We first formalize SA through the lenses of selective risk and coverage. We then propose Atom-wise Selective Abstraction, a claim-level instantiation that decomposes responses into atomic claims (short, self-contained statements each expressing a single fact) and replaces uncertain atoms with higher confidence, less specific abstractions. To evaluate this framework, we develop a novel end-to-end pipeline for open-ended generation that instantiates risk as factual correctness and measures coverage using an information-theoretic measure of retained information. Across six open-source models on the FactScore and LongFact-Objects benchmarks, atom-wise SA consistently outperforms existing baselines, improving the area under the risk-coverage curve (AURC) by up to 27.73% over claim removal, demonstrating that reducing specificity can boost accuracy and reliability while preserving most of their original meaning.
Chinese Translation
大型语言模型(LLMs)被广泛使用,但仍然容易出现事实错误,这削弱了用户信任并限制了在高风险环境中的应用。一种减轻此风险的方法是为模型配备不确定性估计机制,在置信度较低时选择弃答。然而,这种二元的“全有或全无”方法在长文本环境中过于严格,常常丢弃有价值的信息。我们提出了选择性抽象(Selective Abstraction, SA)框架,使LLMs能够通过选择性地减少不确定内容的细节来权衡特异性与可靠性。我们首先通过选择性风险和覆盖率的视角对SA进行了形式化。然后,我们提出了基于原子的选择性抽象(Atom-wise Selective Abstraction),这是一种基于声明的实例化方法,它将响应分解为原子声明(短小、自包含的陈述,每个表达一个单一事实),并用更高置信度、较少特异性的抽象替换不确定的原子。为了评估这一框架,我们开发了一种新颖的端到端开放式生成管道,将风险实例化为事实正确性,并使用信息论度量保留信息来衡量覆盖率。在FactScore和LongFact-Objects基准上的六个开源模型中,基于原子的选择性抽象始终优于现有基线,在风险-覆盖率曲线下的面积(AURC)上提高了多达27.73%,证明减少特异性可以提高准确性和可靠性,同时保留大部分原始含义。
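The atom-wise policy above reduces to a simple per-claim decision rule: keep a confident claim, fall back to its abstraction when only the abstraction is confident, abstain otherwise. The dict fields and the single shared threshold below are assumptions for illustration; the paper's selection rule is derived from selective risk and coverage.

```python
def selective_abstraction(atoms, threshold=0.7):
    """Atom-wise Selective Abstraction sketch.

    atoms: list of dicts, each with
      'claim'       - the specific atomic claim text
      'conf'        - model confidence in the specific claim
      'abstraction' - a less specific restatement of the claim
      'abs_conf'    - model confidence in the abstraction
    Returns the reassembled response text."""
    out = []
    for atom in atoms:
        if atom["conf"] >= threshold:
            out.append(atom["claim"])           # keep the specific claim
        elif atom["abs_conf"] >= threshold:
            out.append(atom["abstraction"])     # trade specificity for reliability
        # else: abstain on this atom entirely (the binary fallback)
    return "; ".join(out)
```

Compared with claim removal, the middle branch is exactly where this scheme retains information: an uncertain specific is softened rather than dropped.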
cs.AI / 53 / 2602.11917

AlphaPROBE: Alpha Mining via Principled Retrieval and On-graph Biased Evolution

AlphaPROBE:通过原则性检索和图上偏向演化进行的阿尔法挖掘
Guo, Taian, Shen, Haiyang, Luo, Junyu, Chen, Binqi, Ding, Hongjun, Huang, Jinsheng, Liu, Luchen, Ma, Yun, Zhang, Ming
Abstract
Extracting signals through alpha factor mining is a fundamental challenge in quantitative finance. Existing automated methods primarily follow two paradigms: Decoupled Factor Generation, which treats factor discovery as isolated events, and Iterative Factor Evolution, which focuses on local parent-child refinements. However, both paradigms lack a global structural view, often treating factor pools as unstructured collections or fragmented chains, which leads to redundant search and limited diversity. To address these limitations, we introduce AlphaPROBE (Alpha Mining via Principled Retrieval and On-graph Biased Evolution), a framework that reframes alpha mining as the strategic navigation of a Directed Acyclic Graph (DAG). By modeling factors as nodes and evolutionary links as edges, AlphaPROBE treats the factor pool as a dynamic, interconnected ecosystem. The framework consists of two core components: a Bayesian Factor Retriever that identifies high-potential seeds by balancing exploitation and exploration through a posterior probability model, and a DAG-aware Factor Generator that leverages the full ancestral trace of factors to produce context-aware, nonredundant optimizations. Extensive experiments on three major Chinese stock market datasets against 8 competitive baselines demonstrate that AlphaPROBE achieves significantly enhanced performance in predictive accuracy, return stability, and training efficiency. Our results confirm that leveraging global evolutionary topology is essential for efficient and robust automated alpha discovery. We have open-sourced our implementation at https://github.com/gta0804/AlphaPROBE.
Chinese Translation
通过阿尔法因子挖掘提取信号是定量金融中的一项基本挑战。现有的自动化方法主要遵循两种范式:解耦因子生成(Decoupled Factor Generation),将因子发现视为孤立事件,以及迭代因子演化(Iterative Factor Evolution),侧重于局部的父子关系细化。然而,这两种范式都缺乏全局结构视角,常常将因子池视为无结构的集合或碎片化的链条,导致冗余搜索和有限的多样性。为了解决这些局限性,我们提出了AlphaPROBE(通过原则性检索和图上偏向演化的阿尔法挖掘),一个将阿尔法挖掘重新框定为有向无环图(Directed Acyclic Graph, DAG)的战略导航的框架。通过将因子建模为节点,将演化链接建模为边,AlphaPROBE将因子池视为一个动态的、相互关联的生态系统。该框架由两个核心组件组成:一个贝叶斯因子检索器(Bayesian Factor Retriever),通过后验概率模型平衡开发与探索,识别高潜力种子;以及一个DAG感知因子生成器(DAG-aware Factor Generator),利用因子的完整祖先轨迹生成上下文感知的、非冗余的优化。在对三个主要中国股市数据集进行的广泛实验中,与8个竞争基线相比,AlphaPROBE在预测准确性、收益稳定性和训练效率方面显著提升了性能。我们的结果确认,利用全局演化拓扑对于高效和稳健的自动化阿尔法发现至关重要。我们已在https://github.com/gta0804/AlphaPROBE上开源了我们的实现。
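As a rough sketch of the retrieval idea, the snippet below scores factors in a DAG-structured pool with a Beta posterior mean plus an exploration bonus, a deterministic stand-in for the paper's posterior probability model. All class names, factor names, and hyperparameters are invented for illustration.

```python
import math

class FactorNode:
    """A factor in the DAG pool: edges point to the parents it evolved from."""
    def __init__(self, name, parents=()):
        self.name, self.parents = name, list(parents)
        self.wins, self.trials = 0, 0   # outcomes of past evolution attempts

def seed_score(factor, total_trials, c=1.0):
    mean = (1 + factor.wins) / (2 + factor.trials)   # Beta(1+w, 1+l) mean
    bonus = c * math.sqrt(math.log(1 + total_trials) / (1 + factor.trials))
    return mean + bonus                               # exploit + explore

def retrieve_seed(pool):
    total = sum(f.trials for f in pool)
    return max(pool, key=lambda f: seed_score(f, total))

def ancestral_trace(node):
    """Full ancestor chain a DAG-aware generator could condition on."""
    trace, stack = [], list(node.parents)
    while stack:
        p = stack.pop()
        if p not in trace:
            trace.append(p)
            stack.extend(p.parents)
    return trace

momentum = FactorNode("momentum")
vol_adj = FactorNode("vol_adj_momentum", parents=[momentum])
momentum.trials = 10                  # tried often, never improved
vol_adj.wins, vol_adj.trials = 8, 10  # strong track record
seed = retrieve_seed([momentum, vol_adj])
```

The `ancestral_trace` walk is what distinguishes a DAG-aware generator from a parent-child-only one: it sees the whole evolutionary lineage, not just the immediate parent.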
cs.AI / 54 / 2602.11918

MEME: Modeling the Evolutionary Modes of Financial Markets

MEME:金融市场演变模式建模
Guo, Taian, Shen, Haiyang, Luo, Junyu, Xing, Zhongshi, Lian, Hanchun, Huang, Jinsheng, Chen, Binqi, Liu, Luchen, Ma, Yun, Zhang, Ming
Abstract
LLMs have demonstrated significant potential in quantitative finance by processing vast unstructured data to emulate human-like analytical workflows. However, current LLM-based methods primarily follow either an Asset-Centric paradigm focused on individual stock prediction or a Market-Centric approach for portfolio allocation, often remaining agnostic to the underlying reasoning that drives market movements. In this paper, we propose a Logic-Oriented perspective, modeling the financial market as a dynamic, evolutionary ecosystem of competing investment narratives, termed Modes of Thought. To operationalize this view, we introduce MEME (Modeling the Evolutionary Modes of Financial Markets), designed to reconstruct market dynamics through the lens of evolving logics. MEME employs a multi-agent extraction module to transform noisy data into high-fidelity Investment Arguments and utilizes Gaussian Mixture Modeling to uncover latent consensus within a semantic space. To model semantic drift among different market conditions, we also implement a temporal evaluation and alignment mechanism to track the lifecycle and historical profitability of these modes. By prioritizing enduring market wisdom over transient anomalies, MEME ensures that portfolio construction is guided by robust reasoning. Extensive experiments on three heterogeneous Chinese stock pools from 2023 to 2025 demonstrate that MEME consistently outperforms seven SOTA baselines. Further ablation studies, sensitivity analysis, lifecycle case study and cost analysis validate MEME's capacity to identify and adapt to the evolving consensus of financial markets. Our implementation can be found at https://github.com/gta0804/MEME.
Chinese Translation
大型语言模型(LLMs)在定量金融领域展现出显著潜力,通过处理大量非结构化数据来模拟人类的分析工作流程。然而,目前基于LLM的方法主要遵循以资产为中心的范式,专注于个别股票预测,或以市场为中心的方法,用于投资组合配置,往往对驱动市场波动的基本推理保持无动于衷。在本文中,我们提出了一种逻辑导向的视角,将金融市场建模为一个动态的、演变的生态系统,竞争的投资叙事被称为思维模式(Modes of Thought)。为了实现这一观点,我们引入了MEME(金融市场演变模式建模),旨在通过不断演变的逻辑重建市场动态。MEME采用多智能体提取模块,将嘈杂数据转化为高保真的投资论点,并利用高斯混合建模(Gaussian Mixture Modeling)来揭示语义空间中的潜在共识。为了建模不同市场条件下的语义漂移,我们还实施了时间评估和对齐机制,以跟踪这些模式的生命周期和历史盈利能力。通过优先考虑持久的市场智慧而非短暂的异常,MEME确保投资组合构建受到稳健推理的指导。在2023年至2025年间对三个异质中国股票池进行的广泛实验表明,MEME始终优于七个最先进的基线模型。进一步的消融研究、敏感性分析、生命周期案例研究和成本分析验证了MEME识别和适应金融市场演变共识的能力。我们的实现可以在 https://github.com/gta0804/MEME 找到。
cs.AI / 55 / 2602.11964

Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

Gaia2:在动态和异步环境中评估大型语言模型代理的基准
Froger, Romain, Andrews, Pierre, Bettini, Matteo, Budhiraja, Amar, Cabral, Ricardo Silveira, Do, Virginie, Garreau, Emilien, Gaya, Jean-Baptiste, Laurençon, Hugo, Lecanu, Maxime, Malkan, Kunal, Mekala, Dheeraj, Ménard, Pierre, Bertran, Gerard Moreno-Torres, Piterbarg, Ulyana, Plekhanov, Mikhail, Rita, Mathieu, Rusakov, Andrey, Vorotilov, Vladislav, Wang, Mengjue, Yu, Ian, Benhalloum, Amine, Mialon, Grégoire, Scialom, Thomas
Abstract
We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 introduces scenarios where environments evolve independently of agent actions, requiring agents to operate under temporal constraints, adapt to noisy and dynamic events, resolve ambiguity, and collaborate with other agents. Each scenario is paired with a write-action verifier, enabling fine-grained, action-level evaluation and making Gaia2 directly usable for reinforcement learning from verifiable rewards. Our evaluation of state-of-the-art proprietary and open-source models shows that no model dominates across capabilities: GPT-5 (high) reaches the strongest overall score of 42% pass@1 but fails on time-sensitive tasks; Claude-4 Sonnet trades accuracy and speed for cost; and Kimi-K2 leads among open-source models with 21% pass@1. These results highlight fundamental trade-offs among reasoning, efficiency, and robustness, and expose challenges in closing the "sim2real" gap. Gaia2 is built on a consumer environment with the open-source Agents Research Environments platform and is designed to be easy to extend. By releasing Gaia2 alongside the foundational ARE framework, we aim to provide the community with a flexible infrastructure for developing, benchmarking, and training the next generation of practical agent systems.
Chinese Translation
我们介绍了Gaia2,一个用于评估大型语言模型代理在现实异步环境中的基准。与之前的静态或同步评估不同,Gaia2引入了环境独立于代理行为演变的场景,这要求代理在时间限制下操作,适应嘈杂和动态事件,解决模糊性,并与其他代理协作。每个场景都配备了一个写操作验证器,使得能够进行细粒度的操作级评估,并使Gaia2可直接用于从可验证奖励中进行强化学习。我们对最先进的专有和开源模型的评估显示,没有任何模型在各项能力上占据绝对优势:GPT-5(高)获得了42%的整体最高得分,但在时间敏感任务上表现不佳,Claude-4 Sonnet在成本上权衡了准确性和速度,而Kimi-K2在开源模型中以21%的通过率领先。这些结果突显了推理、效率、鲁棒性之间的基本权衡,并揭示了缩小“模拟到现实”(sim2real)差距的挑战。Gaia2建立在消费者环境上,基于开源的Agents Research Environments平台,并设计为易于扩展。通过与基础的ARE框架一起发布Gaia2,我们旨在为社区提供一个灵活的基础设施,以开发、基准测试和训练下一代实用代理系统。
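Gaia2 reports pass@1. For reference, the standard unbiased pass@k estimator from the code-generation evaluation literature (not defined in this abstract, but the usual way such numbers are computed) over n sampled attempts with c successes is 1 - C(n-c, k)/C(n, k):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k attempts drawn
    without replacement from n samples (c of them correct) succeeds."""
    if n - c < k:          # too few failures to fill all k slots
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 10 samples of which c = 3 are correct, pass@1 is 0.3 and pass@2 rises to 1 - C(7,2)/C(10,2) = 24/45.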
cs.AI / 56 / 2602.12004

CSEval: A Framework for Evaluating Clinical Semantics in Text-to-Image Generation

CSEval:一个用于评估文本到图像生成中临床语义的框架
Cronshaw, Robert, Vilouras, Konstantinos, Yan, Junyu, Du, Yuning, Chen, Feng, McDonagh, Steven, Tsaftaris, Sotirios A.
Abstract
Text-to-image generation has been increasingly applied in medical domains for various purposes such as data augmentation and education. Evaluating the quality and clinical reliability of these generated images is essential. However, existing methods mainly assess image realism or diversity, while failing to capture whether the generated images reflect the intended clinical semantics, such as anatomical location and pathology. In this study, we propose the Clinical Semantics Evaluator (CSEval), a framework that leverages language models to assess clinical semantic alignment between the generated images and their conditioning prompts. Our experiments show that CSEval identifies semantic inconsistencies overlooked by other metrics and correlates with expert judgment. CSEval provides a scalable and clinically meaningful complement to existing evaluation methods, supporting the safe adoption of generative models in healthcare.
Chinese Translation
文本到图像生成在医学领域的应用日益增多,主要用于数据增强和教育等多种目的。评估这些生成图像的质量和临床可靠性至关重要。然而,现有方法主要评估图像的真实感或多样性,而未能捕捉生成图像是否反映预期的临床语义,例如解剖位置和病理。在本研究中,我们提出了临床语义评估器(Clinical Semantics Evaluator,CSEval),这是一个利用语言模型评估生成图像与其条件提示之间临床语义一致性的框架。我们的实验表明,CSEval能够识别其他指标忽视的语义不一致,并与专家判断相关联。CSEval为现有评估方法提供了一种可扩展且具有临床意义的补充,支持生成模型在医疗保健中的安全采用。
cs.AI / 57 / 2602.12013

InjectRBP: Steering Large Language Model Reasoning Behavior via Pattern Injection

InjectRBP:通过模式注入引导大型语言模型推理行为
Wu, Xiuping, Yu, Zhao, Cheng, Yuxin, Wong, Ngai, Ke, Liangjun, Mishra, Tapas, Katsikopoulos, Konstantinos V.
Abstract
Reasoning can significantly enhance the performance of Large Language Models. While recent studies have exploited behavior-related prompt adjustments to enhance reasoning, these designs remain largely intuitive and lack a systematic analysis of the underlying behavioral patterns. Motivated by this, we investigate how models' reasoning behaviors shape reasoning from the perspective of behavioral patterns. We observe that models exhibit adaptive distributions of reasoning behaviors when responding to specific types of questions, and that structurally injecting these patterns can substantially influence the quality of the models' reasoning processes and outcomes. Building on these findings, we propose two optimization methods that require no parameter updates: InjectCorrect and InjectRLOpt. InjectCorrect guides the model by imitating behavioral patterns derived from its own past correct answers. InjectRLOpt learns a value function from historical behavior-pattern data and, via our proposed Reliability-Aware Softmax Policy, generates a behavioral injectant during inference to steer the reasoning process. Our experiments demonstrate that both methods can improve model performance across various reasoning tasks without requiring any modifications to model parameters, achieving gains of up to 5.34% and 8.67%, respectively.
Chinese Translation
推理可以显著提升大型语言模型的性能。尽管近期研究利用与行为相关的提示调整来增强推理,这些设计仍然主要是直观的,缺乏对潜在行为模式的系统分析。基于此动机,我们从行为模式的角度研究模型的推理行为如何塑造推理。我们观察到,模型在回应特定类型的问题时表现出适应性的推理行为分布,并且结构性地注入这些模式可以显著影响模型推理过程和结果的质量。在这些发现的基础上,我们提出了两种不需要参数更新的优化方法:InjectCorrect 和 InjectRLOpt。InjectCorrect 通过模仿模型自身过去正确答案所衍生的行为模式来引导模型。InjectRLOpt 从历史行为模式数据中学习一个价值函数,并通过我们提出的可靠性感知软最大策略(Reliability-Aware Softmax Policy)在推理过程中生成行为注入物,以引导推理过程。我们的实验表明,这两种方法可以在各种推理任务中提高模型性能,而无需对模型参数进行任何修改,分别实现了高达 5.34% 和 8.67% 的性能提升。
cs.AI / 58 / 2602.12055

Multi UAVs Preflight Planning in a Shared and Dynamic Airspace

共享动态空域中的多无人机预飞行规划
Sow, Amath, Cesen, Mauricio Rodriguez, de Oliveira, Fabiola Martins Campos, Wzorek, Mariusz, de Leng, Daniel, Tiger, Mattias, Heintz, Fredrik, Rothenberg, Christian Esteve
Abstract
Preflight planning for large-scale Unmanned Aerial Vehicle (UAV) fleets in dynamic, shared airspace presents significant challenges, including temporal No-Fly Zones (NFZs), heterogeneous vehicle profiles, and strict delivery deadlines. While Multi-Agent Path Finding (MAPF) provides a formal framework, existing methods often lack the scalability and flexibility required for real-world Unmanned Traffic Management (UTM). We propose DTAPP-IICR: a Delivery-Time Aware Prioritized Planning method with Incremental and Iterative Conflict Resolution. Our framework first generates an initial solution by prioritizing missions based on urgency. Secondly, it computes roundtrip trajectories using SFIPP-ST, a novel 4D single-agent planner (Safe Flight Interval Path Planning with Soft and Temporal Constraints). SFIPP-ST handles heterogeneous UAVs, strictly enforces temporal NFZs, and models inter-agent conflicts as soft constraints. Subsequently, an iterative Large Neighborhood Search, guided by a geometric conflict graph, efficiently resolves any residual conflicts. A completeness-preserving directional pruning technique further accelerates the 3D search. On benchmarks with temporal NFZs, DTAPP-IICR achieves near-100% success with fleets of up to 1,000 UAVs and gains up to 50% runtime reduction from pruning, outperforming batch Enhanced Conflict-Based Search in the UTM context. Scaling successfully in realistic city-scale operations where other priority-based methods fail even at moderate deployments, DTAPP-IICR is positioned as a practical and scalable solution for preflight planning in dense, dynamic urban airspace.
Chinese Translation
在动态共享空域中进行大规模无人机(UAV)机队的预飞行规划面临诸多挑战,包括时间性禁飞区(NFZ)、异构飞行器特性以及严格的交付截止时间。尽管多智能体路径规划(MAPF)提供了一个正式框架,但现有方法往往缺乏在现实世界无人机交通管理(UTM)中所需的可扩展性和灵活性。我们提出了DTAPP-IICR:一种基于交付时间的优先规划方法,具有增量和迭代冲突解决能力。我们的框架首先通过根据紧急程度对任务进行优先级排序来生成初始解决方案。其次,使用SFIPP-ST计算往返轨迹,这是一种新颖的4D单智能体规划器(安全飞行间隔路径规划,具有软约束和时间约束)。SFIPP-ST能够处理异构无人机,严格执行时间性禁飞区,并将智能体间的冲突建模为软约束。随后,通过几何冲突图指导的迭代大邻域搜索有效地解决任何残余冲突。一种保持完整性的方向性剪枝技术进一步加速了3D搜索。在具有时间性禁飞区的基准测试中,DTAPP-IICR在最多1,000架无人机的机队中实现了近100%的成功率,并通过剪枝获得了高达50%的运行时间减少,超越了批量增强冲突搜索在UTM背景下的表现。在其他基于优先级的方法在中等部署下都无法成功的现实城市规模操作中,DTAPP-IICR成功扩展,成为密集动态城市空域中预飞行规划的实用且可扩展的解决方案。
cs.AI / 59 / 2602.12056

LawThinker: A Deep Research Legal Agent in Dynamic Environments

LawThinker:动态环境中的深度研究法律代理
Yang, Xinyu, Deng, Chenlong, Wen, Tongyu, Xie, Binyu, Dou, Zhicheng
Abstract
Legal reasoning requires not only correct outcomes but also procedurally compliant reasoning processes. However, existing methods lack mechanisms to verify intermediate reasoning steps, allowing errors such as inapplicable statute citations to propagate undetected through the reasoning chain. To address this, we propose LawThinker, an autonomous legal research agent that adopts an Explore-Verify-Memorize strategy for dynamic judicial environments. The core idea is to enforce verification as an atomic operation after every knowledge exploration step. A DeepVerifier module examines each retrieval result along three dimensions of knowledge accuracy, fact-law relevance, and procedural compliance, with a memory module for cross-round knowledge reuse in long-horizon tasks. Experiments on the dynamic benchmark J1-EVAL show that LawThinker achieves a 24% improvement over direct reasoning and an 11% gain over workflow-based methods, with particularly strong improvements on process-oriented metrics. Evaluations on three static benchmarks further confirm its generalization capability. The code is available at https://github.com/yxy-919/LawThinker-agent .
Chinese Translation
法律推理不仅需要正确的结果,还需要程序合规的推理过程。然而,现有方法缺乏验证中间推理步骤的机制,导致诸如不适用的法规引用等错误在推理链中未被检测到而传播。为了解决这个问题,我们提出了LawThinker,一个采用探索-验证-记忆(Explore-Verify-Memorize)策略的自主法律研究代理,适用于动态司法环境。其核心思想是在每个知识探索步骤后强制执行验证作为原子操作。DeepVerifier模块从知识准确性、事实与法律的相关性以及程序合规性三个维度检查每个检索结果,并配备一个记忆模块以便在长期任务中进行跨轮次知识重用。在动态基准J1-EVAL上的实验表明,LawThinker在直接推理上实现了24%的改进,并在基于工作流的方法上获得了11%的提升,尤其在面向过程的指标上表现出显著的改进。在三个静态基准上的评估进一步确认了其泛化能力。代码可在https://github.com/yxy-919/LawThinker-agent获取。
cs.AI / 60 / 2602.12078

Tiny Recursive Reasoning with Mamba-2 Attention Hybrid

基于 Mamba-2 注意力混合的小型递归推理
Wang, Wenlong, Reid, Fergal
Abstract
Recent work on recursive reasoning models like TRM demonstrates that tiny networks (7M parameters) can achieve strong performance on abstract reasoning tasks through latent recursion -- iterative refinement in hidden representation space without emitting intermediate tokens. This raises a natural question about operator choice: Mamba-2's state space recurrence is itself a form of iterative refinement, making it a natural candidate for recursive reasoning -- but does introducing Mamba-2 into the recursive scaffold preserve reasoning capability? We investigate this by replacing the Transformer blocks in TRM with Mamba-2 hybrid operators while maintaining parameter parity (6.83M vs 6.86M parameters). On ARC-AGI-1, we find that the hybrid improves pass@2 (the official metric) by +2.0% (45.88% vs 43.88%) and consistently outperforms at higher K values (+4.75% at pass@100), whilst maintaining pass@1 parity. This suggests improved candidate coverage -- the model generates correct solutions more reliably -- with similar top-1 selection. Our results validate that Mamba-2 hybrid operators preserve reasoning capability within the recursive scaffold, establishing SSM-based operators as viable candidates in the recursive operator design space and taking a first step towards understanding the best mixing strategies for recursive reasoning.
Chinese Translation
近期关于递归推理模型如 TRM 的研究表明,小型网络(7M 参数)可以通过潜在递归——在隐藏表示空间中进行迭代精炼而不发出中间标记——在抽象推理任务上取得强劲表现。这引出了一个自然的问题:操作符的选择。Mamba-2 的状态空间递归本身就是一种迭代精炼的形式,使其成为递归推理的自然候选者——但将 Mamba-2 引入递归框架是否能保持推理能力?我们通过在保持参数相等(6.83M 对比 6.86M 参数)的情况下,用 Mamba-2 混合操作符替换 TRM 中的 Transformer 块来进行研究。在 ARC-AGI-1 上,我们发现混合模型在 pass@2(官方指标)上提高了 +2.0\%(45.88\% 对比 43.88\%),并且在更高的 K 值下始终表现优于其他模型(在 pass@100 上提高 +4.75\%),同时保持 pass@1 的相等性。这表明候选覆盖率得到了改善——模型更可靠地生成正确的解决方案——并且 top-1 选择相似。我们的结果验证了 Mamba-2 混合操作符在递归框架中保持了推理能力,确立了基于 SSM 的操作符作为递归操作符设计空间中的可行候选,并迈出了理解递归推理最佳混合策略的第一步。
cs.AI / 61 / 2602.12083

Differentiable Modal Logic for Multi-Agent Diagnosis, Orchestration and Communication

用于多智能体诊断、协调和通信的可微模态逻辑
Sulc, Antonin
Abstract
As multi-agent AI systems evolve from simple chatbots to autonomous swarms, debugging semantic failures requires reasoning about knowledge, belief, causality, and obligation, precisely what modal logic was designed to formalize. However, traditional modal logic requires manual specification of relationship structures that are unknown or dynamic in real systems. This tutorial demonstrates differentiable modal logic (DML), implemented via Modal Logical Neural Networks (MLNNs), enabling systems to learn trust networks, causal chains, and regulatory boundaries from behavioral data alone. We present a unified neurosymbolic debugging framework through four modalities: epistemic (who to trust), temporal (when events cause failures), deontic (what actions are permitted), and doxastic (how to interpret agent confidence). Each modality is demonstrated on concrete multi-agent scenarios, from discovering deceptive alliances in diplomacy games to detecting LLM hallucinations, with complete implementations showing how logical contradictions become learnable optimization objectives. Key contributions for the neurosymbolic community: (1) interpretable learned structures where trust and causality are explicit parameters, not opaque embeddings; (2) knowledge injection via differentiable axioms that guide learning with sparse data; (3) compositional multi-modal reasoning that combines epistemic, temporal, and deontic constraints; and (4) practical deployment patterns for monitoring, active control and communication of multi-agent systems. All code is provided as executable Jupyter notebooks.
Chinese Translation
随着多智能体人工智能系统从简单的聊天机器人演变为自主群体,调试语义失败需要对知识、信念、因果关系和义务进行推理,这正是模态逻辑设计的目的。然而,传统模态逻辑要求手动指定在真实系统中未知或动态的关系结构。本教程展示了可微模态逻辑(Differentiable Modal Logic, DML),通过模态逻辑神经网络(Modal Logical Neural Networks, MLNNs)实现,使系统能够仅通过行为数据学习信任网络、因果链和监管边界。我们通过四种模态提出了统一的神经符号调试框架:知识模态(谁值得信任)、时间模态(何时事件导致失败)、义务模态(允许什么行动)和信念模态(如何解释智能体的信心)。每种模态在具体的多智能体场景中得到了展示,从发现外交游戏中的欺骗性联盟到检测大型语言模型(LLM)幻觉,完整的实现展示了逻辑矛盾如何成为可学习的优化目标。对神经符号社区的关键贡献包括:(1)可解释的学习结构,其中信任和因果关系是明确的参数,而不是不透明的嵌入;(2)通过可微公理进行知识注入,指导稀疏数据下的学习;(3)组合多模态推理,结合知识、时间和义务约束;以及(4)多智能体系统监控、主动控制和通信的实际部署模式。所有代码均以可执行的 Jupyter 笔记本形式提供。
cs.AI / 62 / 2602.12108

The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context

Pensieve范式:掌握自身上下文的有状态语言模型
Liu, Xiaoyuan, Liang, Tian, Ma, Dongyang, Zhou, Deyu, Mi, Haitao, He, Pinjia, Wang, Yan
Abstract
In the world of Harry Potter, when Dumbledore's mind is overburdened, he extracts memories into a Pensieve to be revisited later. In the world of AI, while we possess the Pensieve (mature databases and retrieval systems), our models inexplicably lack the "wand" to operate it. They remain like a Dumbledore without agency, passively accepting a manually engineered context as their entire memory. This work finally places the wand in the model's hand. We introduce StateLM, a new class of foundation models endowed with an internal reasoning loop to manage their own state. We equip our model with a suite of memory tools, such as context pruning, document indexing, and note-taking, and train it to actively manage these tools. By learning to dynamically engineer its own context, our model breaks free from the architectural prison of a fixed window. Experiments across various model sizes demonstrate StateLM's effectiveness in diverse scenarios. On long-document QA tasks, StateLMs consistently outperform standard LLMs across all model scales; on the chat memory task, they achieve absolute accuracy improvements of 10% to 20% over standard LLMs. On the deep research task BrowseComp-Plus, the performance gap becomes even more pronounced: StateLM achieves up to 52% accuracy, whereas standard LLM counterparts struggle around 5%. Ultimately, our approach shifts LLMs from passive predictors to state-aware agents where reasoning becomes a stateful and manageable process.
Chinese Translation
在哈利·波特的世界中,当邓布利多的思维负担过重时,他会将记忆提取到Pensieve中,以便稍后回顾。在人工智能的世界里,尽管我们拥有Pensieve(成熟的数据库和检索系统),但我们的模型却莫名其妙地缺乏操作它的“魔杖”。它们就像没有自主权的邓布利多,被动地接受手动构建的上下文作为其全部记忆。本研究最终将魔杖放在模型的手中。我们介绍了StateLM,一种新型基础模型,赋予其内部推理循环以管理自身状态。我们为模型配备了一套记忆工具,如上下文修剪、文档索引和笔记记录,并训练其主动管理这些工具。通过学习动态构建自身上下文,我们的模型打破了固定窗口的架构束缚。针对不同模型规模的实验证明了StateLM在多种场景中的有效性。在长文档问答任务中,StateLM在所有模型规模上均持续优于标准LLM;在聊天记忆任务中,相较于标准LLM,StateLM的绝对准确率提升达10%至20%。在深度研究任务BrowseComp-Plus中,性能差距更为明显:StateLM的准确率高达52%,而标准LLM的表现徘徊在5%左右。最终,我们的方法将LLM从被动预测者转变为状态感知的智能体,使推理成为一个有状态且可管理的过程。
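The abstract names the tool suite (context pruning, document indexing, note-taking) but not its interfaces. A minimal sketch of what such self-managed memory tools might look like, with all names, the budget policy, and the archive scheme invented here, is:

```python
# Hypothetical memory-tool interface in the spirit of StateLM: the agent
# can prune its working context, index pruned chunks for later retrieval,
# and take notes, instead of being stuck with a fixed window.
class MemoryState:
    def __init__(self, budget=8):
        self.budget = budget
        self.context = []   # active working context (budget-limited)
        self.archive = {}   # indexed chunks pruned out of the context
        self.notes = []     # free-form notes the agent wrote to itself

    def observe(self, chunk):
        self.context.append(chunk)
        while len(self.context) > self.budget:
            self.prune_oldest()

    def prune_oldest(self):
        key = len(self.archive)
        self.archive[key] = self.context.pop(0)  # archived, not lost
        return key

    def take_note(self, text):
        self.notes.append(text)

    def retrieve(self, key):
        return self.archive.get(key)

state = MemoryState(budget=3)
for i in range(5):
    state.observe(f"chunk-{i}")
state.take_note("chunks 0-1 were background only")
```

The point of the design is that pruning is reversible: evicted chunks move to an index the agent can query later, rather than vanishing past the window edge.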
cs.AI / 63 / 2602.12113

Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty

停止不必要的反思:通过自适应反思和长度协调惩罚训练大型推理模型以实现高效推理
Yu, Zewei, Gao, Lirong, Zhu, Yuke, Zheng, Bo, Guo, Sheng, Wang, Haobo, Zhao, Junbo
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks by employing test-time scaling. However, they often generate over-long chains-of-thought that, driven by substantial reflections such as repetitive self-questioning and circular reasoning, lead to high token consumption, substantial computational overhead, and increased latency without improving accuracy, particularly in smaller models. Our observation reveals that increasing problem complexity induces more excessive and unnecessary reflection, which in turn reduces accuracy and increases token overhead. To address this challenge, we propose Adaptive Reflection and Length Coordinated Penalty (ARLCP), a novel reinforcement learning framework designed to dynamically balance reasoning efficiency and solution accuracy. ARLCP introduces two key innovations: (1) a reflection penalty that adaptively curtails unnecessary reflective steps while preserving essential reasoning, and (2) a length penalty calibrated to the estimated complexity of the problem. By coordinating these penalties, ARLCP encourages the model to generate more concise and effective reasoning paths. We evaluate our method on five mathematical reasoning benchmarks using DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B models. Experimental results show that ARLCP achieves a superior efficiency-accuracy trade-off compared to existing approaches. For the 1.5B model, it reduces the average response length by 53.1% while simultaneously improving accuracy by 5.8%. For the 7B model, it achieves a 35.0% reduction in length with a 2.7% accuracy gain. The code is released at https://github.com/ZeweiYu1/ARLCP .
Chinese Translation
大型推理模型(LRMs)通过在测试时进行扩展,已在复杂推理任务中展现出显著的性能。然而,它们往往生成过长的思维链,这些思维链受到大量反思的驱动,例如重复的自我提问和循环推理,导致高令牌消耗、显著的计算开销和增加的延迟,而并未提高准确性,尤其是在较小的模型中。我们的观察表明,问题复杂性的增加会引发更多过度和不必要的反思,从而降低准确性并增加令牌开销。为了解决这一挑战,我们提出了自适应反思和长度协调惩罚(ARLCP),这是一种新颖的强化学习框架,旨在动态平衡推理效率和解决方案准确性。ARLCP引入了两个关键创新:(1)一种反思惩罚,能够自适应地减少不必要的反思步骤,同时保留必要的推理;(2)一种根据问题的估计复杂性进行校准的长度惩罚。通过协调这些惩罚,ARLCP鼓励模型生成更简洁和有效的推理路径。我们在五个数学推理基准上评估了我们的方法,使用了DeepSeek-R1-Distill-Qwen-1.5B和DeepSeek-R1-Distill-Qwen-7B模型。实验结果表明,ARLCP在效率与准确性之间实现了优越的权衡。对于1.5B模型,它将平均响应长度减少了53.1%,同时提高了5.8%的准确性。对于7B模型,它实现了35.0%的长度减少,并获得了2.7%的准确性提升。代码已发布在 https://github.com/ZeweiYu1/ARLCP 。
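A toy version of the coordinated penalty can make the idea concrete. The marker list, coefficients, and the per-complexity token budget below are all invented for illustration; the paper's actual reward is not specified in the abstract.

```python
# Toy ARLCP-style reward: a correctness term minus (i) a penalty on
# reflective segments and (ii) a length penalty whose budget scales
# with the estimated complexity of the problem.
REFLECTION_MARKERS = ("wait,", "let me re-check", "on second thought")

def count_reflections(text):
    lowered = text.lower()
    return sum(lowered.count(marker) for marker in REFLECTION_MARKERS)

def shaped_reward(text, correct, est_complexity, alpha=0.1, beta=0.001):
    budget = 200 * est_complexity                 # harder => longer budget
    excess = max(0, len(text.split()) - budget)   # words over budget
    return float(correct) - alpha * count_reflections(text) - beta * excess
```

Coordinating the two penalties means a hard problem is granted a longer budget before the length term bites, while gratuitous reflection is penalized regardless of length.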
cs.AI / 64 / 2602.12120

Commencing-Student Enrolment Forecasting Under Data Sparsity with Time Series Foundation Models

在数据稀疏条件下基于时间序列基础模型的入学学生预测
Jetwiriyanon, Jittarin, Susnjak, Teo, Ranathunga, Surangika
Abstract
Many universities face increasing financial pressure and rely on accurate forecasts of commencing enrolments. However, enrolment forecasting in higher education is often data-sparse; annual series are short and affected by reporting changes and regime shifts. Popular classical approaches can be unreliable, as parameter estimation and model selection are unstable with short samples, and structural breaks degrade extrapolation. Recently, time series foundation models (TSFMs) have provided zero-shot priors, delivering strong gains in annual, data-sparse institutional forecasting under leakage-disciplined covariate construction. We benchmark multiple TSFM families in a zero-shot setting, test a compact, leakage-safe covariate set, and introduce the Institutional Operating Conditions Index (IOCI), a transferable 0-100 regime covariate derived from time-stamped documentary evidence available at each forecast origin, alongside Google Trends demand proxies with stabilising feature engineering. Under an expanding-window backtest with strict vintage alignment, covariate-conditioned TSFMs perform on par with classical benchmarks without institution-specific training, with performance differences varying by cohort and model.
Chinese Translation
许多大学面临日益增加的财务压力,依赖于对新入学学生的准确预测。然而,高等教育中的入学预测往往数据稀疏;年度系列较短,并受到报告变化和体制转变的影响。流行的经典方法可能不可靠,因为在短样本下,参数估计和模型选择不稳定,结构性断裂会降低外推能力。最近,时间序列基础模型(TSFMs)提供了零样本先验,在泄漏规范的协变量构建下,显著提高了年度数据稀疏的机构预测能力。我们在零样本环境中对多种TSFM家族进行了基准测试,并测试了一组紧凑的、泄漏安全的协变量集,并引入了机构运营条件指数(Institutional Operating Conditions Index, IOCI),这是一个可转移的0-100体制协变量,源自每个预测起点可获得的时间戳文献证据,以及通过稳定特征工程处理的Google Trends需求代理。通过严格的时间对齐扩展窗口回测,协变量条件下的TSFMs在没有特定于机构的训练的情况下表现与经典基准相当,且性能差异因群体和模型而异。
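The backtest protocol itself is standard even though the abstract gives no code. A minimal expanding-window split, with the minimum training length and the year range chosen arbitrarily here, looks like:

```python
def expanding_window_splits(periods, min_train=5):
    """Yield (train, test) pairs: the training window grows each fold and
    the test point is always the next unseen period, so no future data
    (or late-arriving revisions) can leak into a forecast origin."""
    for t in range(min_train, len(periods)):
        yield periods[:t], periods[t]

years = list(range(2010, 2020))
splits = list(expanding_window_splits(years))
```

Strict vintage alignment adds the further requirement, not shown here, that every covariate value used in a fold was actually available at that fold's forecast origin.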
cs.AI / 65 / 2602.12128

HLA: Hadamard Linear Attention

HLA:哈达玛线性注意力
Ackermann, Hanno, Cai, Hong, Ghafoorian, Mohsen, Habibian, Amirhossein
Abstract
The attention mechanism is an important reason for the success of transformers. It relies on computing pairwise relations between tokens. To reduce the high computational cost of standard quadratic attention, linear attention has been proposed as an efficient approximation. It employs kernel functions that are applied independently to the inputs before the pairwise similarities are calculated. That allows for an efficient computational procedure which, however, amounts to a low-degree rational function approximating softmax. We propose Hadamard Linear Attention (HLA). Unlike previous works on linear attention, the nonlinearity in HLA is not applied separately to queries and keys, but, analogously to standard softmax attention, after the pairwise similarities have been computed. It will be shown that the proposed nonlinearity amounts to a higher-degree rational function to approximate softmax. An efficient computational scheme for the proposed method is derived that is similar to that of standard linear attention. In contrast to other approaches, no time-consuming tensor reshaping is necessary to apply the proposed algorithm. The effectiveness of the approach is demonstrated by applying it to a large diffusion transformer model for video generation, an application that involves very large amounts of tokens.
Chinese Translation
注意力机制是变换器成功的重要原因。它依赖于计算标记之间的成对关系。为了降低标准二次注意力的高计算成本,线性注意力被提出作为一种高效的近似方法。它采用在计算成对相似性之前独立应用于输入的核函数。这允许一种高效的计算过程,但实际上相当于一个低阶有理函数来近似 softmax。我们提出了哈达玛线性注意力(HLA)。与之前的线性注意力研究不同,HLA 中的非线性不是单独应用于查询和键,而是类似于标准 softmax 注意力,在计算成对相似性之后应用。将证明所提出的非线性相当于一个高阶有理函数来近似 softmax。为所提出的方法推导出了一种高效的计算方案,该方案与标准线性注意力相似。与其他方法相比,应用所提出的算法不需要耗时的张量重塑。通过将其应用于一个大型扩散变换器模型进行视频生成,证明了该方法的有效性,该应用涉及大量的标记。
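The efficiency claim for standard linear attention rests on associativity: with a feature map applied per token, (phi(Q) phi(K)^T) V equals phi(Q) (phi(K)^T V), and the right-hand side is linear in sequence length. The sketch below checks that identity with a toy positive feature map; HLA itself, which applies its nonlinearity after the pairwise similarities, is not reproduced here since the abstract does not specify it.

```python
# Baseline linear-attention identity (not HLA itself): because matrix
# multiplication is associative, the O(N^2) score matrix never needs to
# be formed when the feature map acts on each token independently.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def phi(X):
    # simple positive feature map (stand-in for e.g. elu(x) + 1)
    return [[max(x, 0.0) + 1.0 for x in row] for row in X]

def linear_attention(Q, K, V):
    summary = matmul(transpose(phi(K)), V)   # d x d_v, independent of N
    return matmul(phi(Q), summary)           # O(N) in sequence length

def quadratic_attention(Q, K, V):
    scores = matmul(phi(Q), transpose(phi(K)))  # explicit N x N similarities
    return matmul(scores, V)

Q, K, V = [[1.0, 2.0], [0.0, 1.0]], [[1.0, 0.0], [2.0, 1.0]], [[1.0], [3.0]]
```

Softmax attention cannot be rearranged this way because its nonlinearity couples each row of the score matrix, which is exactly the regime HLA targets with a higher-degree rational approximation.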
cs.AI / 66 / 2602.12133

Neutral Prompts, Non-Neutral People: Quantifying Gender and Skin-Tone Bias in Gemini Flash 2.5 Image and GPT Image 1.5

中性提示与非中性人群:量化 Gemini Flash 2.5 图像和 GPT Image 1.5 中的性别与肤色偏见
Balestri, Roberto
Abstract
This study quantifies gender and skin-tone bias in two widely deployed commercial image generators - Gemini Flash 2.5 Image (NanoBanana) and GPT Image 1.5 - to test the assumption that neutral prompts yield demographically neutral outputs. We generated 3,200 photorealistic images using four semantically neutral prompts. The analysis employed a rigorous pipeline combining hybrid color normalization, facial landmark masking, and perceptually uniform skin tone quantification using the Monk (MST), PERLA, and Fitzpatrick scales. Neutral prompts produced highly polarized defaults. Both models exhibited a strong "default white" bias (>96% of outputs). However, they diverged sharply on gender: Gemini favored female-presenting subjects, while GPT favored male-presenting subjects with lighter skin tones. This research provides a large-scale, comparative audit of state-of-the-art models using an illumination-aware colorimetric methodology, distinguishing aesthetic rendering from underlying pigmentation in synthetic imagery. The study demonstrates that neutral prompts function as diagnostic probes rather than neutral instructions. It offers a robust framework for auditing algorithmic visual culture and challenges the sociolinguistic assumption that unmarked language results in inclusive representation.
Chinese Translation
本研究量化了两款广泛应用的商业图像生成器——Gemini Flash 2.5 图像(NanoBanana)和 GPT Image 1.5 中的性别与肤色偏见,以检验中性提示是否会产生人口统计学上中性的输出。我们使用四个语义中性的提示生成了 3,200 张逼真的图像。分析采用了结合混合色彩标准化、面部特征标记遮罩和使用 Monk (MST)、PERLA 和 Fitzpatrick 标度的感知均匀肤色量化的严格流程。中性提示产生了高度极化的默认结果。两个模型均表现出强烈的“默认白人”偏见(>96% 的输出)。然而,它们在性别上存在显著差异:Gemini 偏向女性呈现的对象,而 GPT 则偏向肤色较浅的男性呈现对象。本研究提供了对最先进模型的大规模比较审计,采用了考虑光照的色度学方法,区分了合成图像中的美学渲染与基础色素。研究表明,中性提示更像是诊断探针,而非中性指令。它为审计算法视觉文化提供了一个稳健的框架,并挑战了社会语言学假设,即无标记语言会导致包容性代表。
cs.AI / 67 / 2602.12134

Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment

价值对齐税:测量大型语言模型对齐中的价值权衡
Chen, Jiajun, Shen, Hua
Abstract
Existing work on value alignment typically characterizes value relations statically, ignoring how interventions - such as prompting, fine-tuning, or preference optimization - reshape the broader value system. We introduce the Value Alignment Tax (VAT), a framework that measures how alignment-induced changes propagate across interconnected values relative to achieved on-target gain. VAT captures the dynamics of value expression under alignment pressure. Using a controlled scenario-action dataset grounded in Schwartz value theory, we collect paired pre-post normative judgments and analyze alignment effects across models, values, and alignment strategies. Our results show that alignment often produces uneven, structured co-movement among values. These effects are invisible under conventional target-only evaluation, revealing systemic, process-level alignment risks and offering new insights into the dynamics of value alignment in LLMs.
Chinese Translation
现有的价值对齐研究通常静态地描述价值关系,忽视了干预措施(如提示、微调或偏好优化)如何重塑更广泛的价值体系。我们提出了价值对齐税(Value Alignment Tax, VAT)框架,该框架测量对齐引发的变化如何相对于实现的目标增益在相互关联的价值之间传播。VAT 捕捉了在对齐压力下价值表达的动态。通过基于施瓦茨价值理论的受控场景-行动数据集,我们收集了成对的前后规范判断,并分析了不同模型、价值和对齐策略下的对齐效果。我们的结果表明,对齐往往在价值之间产生不均匀的、有结构的共同运动。这些效果在传统的仅目标评估下是不可见的,揭示了系统性、过程层面的对齐风险,并为大型语言模型中的价值对齐动态提供了新的见解。
cs.AI / 68 / 2602.12143

STAR : Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction

STAR:桥接统计推理与自主推理以预测大型模型性能
Wang, Xiaoxiao, Li, Chunxiao, Wang, Junying, Guo, Yijin, Chen, Zijian, Li, Chunyi, Liu, Xiaohong, Zhang, Zicheng, Zhai, Guangtao
Abstract
As comprehensive large model evaluation becomes prohibitively expensive, predicting model performance from limited observations has become essential. However, existing statistical methods struggle with pattern shifts, data sparsity, and lack of explanation, while pure LLM methods remain unreliable. We propose STAR, a framework that bridges data-driven STatistical expectations with knowledge-driven Agentic Reasoning. STAR leverages specialized retrievers to gather external knowledge and embeds semantic features into Constrained Probabilistic Matrix Factorization (CPMF) to generate statistical expectations with uncertainty. A reasoning module guided by Expectation Violation Theory (EVT) then refines predictions through intra-family analysis, cross-model comparison, and credibility-aware aggregation, producing adjustments with traceable explanations. Extensive experiments show that STAR consistently outperforms all baselines on both score-based and rank-based metrics, delivering a 14.46% gain in total score over the strongest statistical method under extreme sparsity, with only 1--2 observed scores per test model.
Chinese Translation
随着全面的大型模型评估变得极其昂贵,从有限的观察中预测模型性能变得至关重要。然而,现有的统计方法在模式转变、数据稀疏和缺乏解释方面面临挑战,而纯粹的LLM方法则不够可靠。我们提出了STAR,一个将数据驱动的统计期望与知识驱动的自主推理相结合的框架。STAR利用专门的检索器收集外部知识,并将语义特征嵌入约束概率矩阵分解(Constrained Probabilistic Matrix Factorization, CPMF)中,以生成带有不确定性的统计期望。随后,基于期望违反理论(Expectation Violation Theory, EVT)的推理模块通过同类分析、跨模型比较和可信度感知聚合来优化预测,产生可追溯解释的调整。大量实验表明,STAR在基于得分和基于排名的指标上始终优于所有基线方法,在极端稀疏情况下,相较于最强的统计方法,整体得分提升了14.46%,仅在每个测试模型上观察到1-2个得分。
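As a very rough stand-in for CPMF (no constraints, no semantic features, and invented hyperparameters), plain probabilistic matrix factorization over a sparse model-by-benchmark score matrix can be sketched as:

```python
import random

# Toy probabilistic matrix factorization: recover a sparse model x benchmark
# score matrix from a handful of observed entries via SGD on latent factors.
# This omits the "constrained" part of CPMF and any semantic embedding.
def pmf(observed, n_rows, n_cols, k=2, steps=2000, lr=0.05, reg=0.02, seed=0):
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_rows)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_cols)]
    for _ in range(steps):
        for (i, j), r in observed.items():
            err = r - sum(U[i][f] * V[j][f] for f in range(k))
            for f in range(k):
                u, v = U[i][f], V[j][f]
                U[i][f] += lr * (err * v - reg * u)   # gradient step on U
                V[j][f] += lr * (err * u - reg * v)   # gradient step on V
    return U, V

# two models x three benchmarks, only two scores observed per model
observed = {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.45, (1, 1): 0.4}
U, V = pmf(observed, n_rows=2, n_cols=3)

def predict(i, j):
    return sum(U[i][f] * V[j][f] for f in range(2))
```

The extreme-sparsity setting in the abstract (1-2 observed scores per test model) is exactly the regime where such factorization needs the extra structure CPMF adds, hence the agentic refinement on top.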
cs.AI / 69 / 2602.12146

Seq2Seq2Seq: Lossless Data Compression via Discrete Latent Transformers and Reinforcement Learning

Seq2Seq2Seq:通过离散潜在变换器和强化学习实现无损数据压缩
Khodabandeh, Mahdi, Shabani, Ghazal, Jordehi, Arash Yousefi, Mirroshandel, Seyed Abolghasem
Abstract
Efficient lossless compression is essential for minimizing storage costs and transmission overhead while preserving data integrity. Traditional compression techniques, such as dictionary-based and statistical methods, often struggle to optimally exploit the structure and redundancy in complex data formats. Recent advancements in deep learning have opened new avenues for compression; however, many existing approaches depend on dense vector representations that obscure the underlying token structure. To address these limitations, we propose a novel lossless compression method that leverages Reinforcement Learning applied to a T5 language model architecture. This approach enables the compression of data into sequences of tokens rather than traditional vector representations. Unlike auto-encoders, which typically encode information into continuous latent spaces, our method preserves the token-based structure, aligning more closely with the original data format. This preservation allows for higher compression ratios while maintaining semantic integrity. By training the model using an off-policy Reinforcement Learning algorithm, we optimize sequence length to minimize redundancy and enhance compression efficiency. Our method introduces an efficient and adaptive data compression system built upon advanced Reinforcement Learning techniques, functioning independently of external grammatical or world knowledge. This approach shows significant improvements in compression ratios compared to conventional methods. By leveraging the latent information within language models, our system effectively compresses data without requiring explicit content understanding, paving the way for more robust and practical compression solutions across various applications.
Chinese Translation
高效的无损压缩对于最小化存储成本和传输开销,同时保持数据完整性至关重要。传统的压缩技术,如基于字典和统计的方法,往往难以最佳地利用复杂数据格式中的结构和冗余。最近在深度学习方面的进展为压缩开辟了新的途径;然而,许多现有方法依赖于稠密的向量表示,这掩盖了潜在的标记结构。为了解决这些限制,我们提出了一种新颖的无损压缩方法,该方法利用应用于T5语言模型架构的强化学习。该方法使数据能够压缩为标记序列,而不是传统的向量表示。与通常将信息编码到连续潜在空间的自编码器不同,我们的方法保留了基于标记的结构,更加贴合原始数据格式。这种保留允许在保持语义完整性的同时实现更高的压缩比。通过使用离策略强化学习算法训练模型,我们优化序列长度以最小化冗余并提高压缩效率。我们的方法引入了一种高效且自适应的数据压缩系统,基于先进的强化学习技术,独立于外部语法或世界知识。与传统方法相比,该方法在压缩比方面显示出显著改善。通过利用语言模型中的潜在信息,我们的系统有效地压缩数据,而无需明确的内容理解,为各种应用提供了更强大和实用的压缩解决方案。
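The storage accounting behind any token-based lossless compressor can be made concrete. A minimal sketch (our illustration, not the paper's code; the 12-token latent length and the 32k vocabulary size are hypothetical values):

```python
# Toy illustration: measuring the compression ratio when data is encoded as a
# shorter sequence of discrete tokens. Token count and vocab size are made up.
import math

def bits_for_tokens(num_tokens: int, vocab_size: int) -> float:
    """Upper bound on bits needed to store a token sequence losslessly."""
    return num_tokens * math.log2(vocab_size)

original = b"the quick brown fox jumps over the lazy dog"
original_bits = 8 * len(original)          # raw bytes at 8 bits each

# Suppose the encoder emits 12 latent tokens from a 32k-entry vocabulary.
latent_bits = bits_for_tokens(12, 32_000)

ratio = original_bits / latent_bits
print(f"original: {original_bits} bits, latent: {latent_bits:.1f} bits, "
      f"ratio: {ratio:.2f}x")
```

Losslessness additionally requires that the decoder reconstruct the exact input bytes; the ratio above only bounds the storage cost of the latent sequence.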
cs.AI / 70 / 2602.12150

GPT-4o Lacks Core Features of Theory of Mind

GPT-4o 缺乏心智理论的核心特征
Muchovej, John, Royka, Amanda, Lee, Shane, Jara-Ettinger, Julian
Abstract
Do Large Language Models (LLMs) possess a Theory of Mind (ToM)? Research into this question has focused on evaluating LLMs against benchmarks and found success across a range of social tasks. However, these evaluations do not test for the actual representations posited by ToM: namely, a causal model of mental states and behavior. Here, we use a cognitively-grounded definition of ToM to develop and test a new evaluation framework. Specifically, our approach probes whether LLMs have a coherent, domain-general, and consistent model of how mental states cause behavior -- regardless of whether that model matches a human-like ToM. We find that even though LLMs succeed in approximating human judgments in a simple ToM paradigm, they fail at a logically equivalent task and exhibit low consistency between their action predictions and corresponding mental state inferences. As such, these findings suggest that the social proficiency exhibited by LLMs is not the result of a domain-general or consistent ToM.
Chinese Translation
大型语言模型(LLMs)是否具备心智理论(ToM)?对此问题的研究主要集中在评估LLMs在基准测试中的表现,并发现其在一系列社交任务中取得了成功。然而,这些评估并未测试ToM所假设的实际表征:即心理状态与行为之间的因果模型。在此,我们使用基于认知的ToM定义来开发和测试一个新的评估框架。具体而言,我们的方法探讨LLMs是否具备一个连贯的、领域通用的且一致的心理状态如何导致行为的模型——无论该模型是否与人类的ToM相匹配。我们的研究发现,尽管LLMs在一个简单的ToM范式中成功地近似了人类判断,但它们在一个逻辑上等效的任务中表现不佳,并且在其行动预测与相应心理状态推断之间表现出低一致性。因此,这些发现表明,LLMs所展现的社交能力并非源于一个领域通用或一致的ToM。
cs.AI / 71 / 2602.12164

Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision

Sci-CoE:通过稀疏监督的几何共识共同进化科学推理大语言模型
He, Xiaohan, Feng, Shiyang, Huang, Songtao, Bai, Lei, Wang, Bin, Zhang, Bo
Abstract
Large language models (LLMs) have demonstrated exceptional reasoning capabilities, and co-evolving paradigms have shown promising results in domains such as code and math. However, in scientific reasoning tasks, these models remain fragile due to unreliable solution evaluation and limited diversity in verification strategies. In this work, we propose Sci-CoE, a two-stage scientific co-evolving framework that enables models to self-evolve as both solver and verifier through a transition from sparse supervision to unsupervised learning. In the first stage, the model uses a small set of annotated data to establish fundamental correctness judgment anchors for the Verifier. In the second stage, we introduce a geometric reward mechanism that jointly considers consensus, reliability, and diversity, driving large-scale self-iteration on unlabeled data. Experiments on several general scientific benchmarks demonstrate that Sci-CoE enhances complex reasoning capabilities and exhibits strong scalability, facilitating the construction of more robust and diverse evaluation systems. Codes are available at https://github.com/InternScience/Sci-CoE.
Chinese Translation
大型语言模型(LLMs)已展现出卓越的推理能力,而共同进化范式在代码和数学等领域显示出良好的效果。然而,在科学推理任务中,这些模型由于解决方案评估不可靠和验证策略多样性有限而显得脆弱。在本研究中,我们提出了Sci-CoE,一个两阶段的科学共同进化框架,使模型能够通过从稀疏监督转向无监督学习,自我进化为求解器和验证器。在第一阶段,模型使用一小部分标注数据为验证器建立基本的正确性判断锚点。在第二阶段,我们引入了一种几何奖励机制,联合考虑共识、可靠性和多样性,推动在未标记数据上的大规模自我迭代。在多个通用科学基准测试上的实验表明,Sci-CoE增强了复杂推理能力,并展现出强大的可扩展性,促进了更强大和多样化评估系统的构建。代码可在 https://github.com/InternScience/Sci-CoE 获取。
cs.AI / 72 / 2602.12170

Statistical Parsing for Logical Information Retrieval

逻辑信息检索的统计解析
Coppola, Greg
Abstract
In previous work (Coppola, 2024) we introduced the Quantified Boolean Bayesian Network (QBBN), a logical graphical model that implements the forward fragment of natural deduction (Prawitz, 1965) as a probabilistic factor graph. That work left two gaps: no negation/backward reasoning, and no parser for natural language. This paper addresses both gaps across inference, semantics, and syntax. For inference, we extend the QBBN with NEG factors enforcing P(x) + P(neg x) = 1, enabling contrapositive reasoning (modus tollens) via backward lambda messages, completing Prawitz's simple elimination rules. The engine handles 44/44 test cases spanning 22 reasoning patterns. For semantics, we present a typed logical language with role-labeled predicates, modal quantifiers, and three tiers of expressiveness following Prawitz: first-order quantification, propositions as arguments, and predicate quantification via lambda abstraction. For syntax, we present a typed slot grammar that deterministically compiles sentences to logical form (33/33 correct, zero ambiguity). LLMs handle disambiguation (95% PP attachment accuracy) but cannot produce structured parses directly (12.4% UAS), confirming grammars are necessary. The architecture: LLM preprocesses, grammar parses, LLM reranks, QBBN infers. We argue this reconciles formal semantics with Sutton's "bitter lesson" (2019): LLMs eliminate the annotation bottleneck that killed formal NLP, serving as annotator while the QBBN serves as verifier. Code: https://github.com/gregorycoppola/world
Chinese Translation
在之前的工作中(Coppola, 2024),我们引入了量化布尔贝叶斯网络(Quantified Boolean Bayesian Network, QBBN),这是一种逻辑图模型,它将自然演绎的前向片段(Prawitz, 1965)实现为概率因子图。该工作留下了两个空白:缺乏否定/反向推理,以及缺乏自然语言解析器。本文从推理、语义和句法三个层面弥补了这两个空白。在推理方面,我们通过引入NEG因子扩展了QBBN,强制执行P(x) + P(neg x) = 1,从而通过反向λ消息实现逆否推理(modus tollens),完成了Prawitz的简单消除规则。该引擎处理了覆盖22种推理模式的44个测试用例,全部通过。在语义方面,我们提出了一种带有角色标记谓词、模态量词和三层表达能力的类型化逻辑语言,遵循Prawitz的理论:一阶量化、作为参数的命题以及通过λ抽象的谓词量化。在句法方面,我们提出了一种类型化槽语法,能够确定性地将句子编译为逻辑形式(33/33正确,无歧义)。大型语言模型(LLMs)处理歧义消解(95%的PP附加准确率),但无法直接生成结构化解析(12.4%的UAS),这证实了语法的必要性。该架构为:LLM预处理,语法解析,LLM重排序,QBBN推理。我们认为这调和了形式语义学与Sutton的“苦涩教训”(2019):LLMs消除了导致形式自然语言处理(NLP)失败的注释瓶颈,作为注释者,而QBBN则作为验证者。代码链接:https://github.com/gregorycoppola/world
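The NEG constraint P(x) + P(neg x) = 1 and the backward (modus tollens) correction it enables can be illustrated with a two-proposition toy. This is a hedged sketch of the probabilistic pattern, not the QBBN factor-graph code, and all probability values are invented:

```python
# Toy probabilistic modus tollens: with the NEG constraint P(x) + P(not x) = 1,
# observing "not b" lowers belief in "a" via Bayes' rule.
def posterior_a_given_not_b(p_a: float, p_b_given_a: float,
                            p_b_given_not_a: float) -> float:
    p_not_b_given_a = 1.0 - p_b_given_a          # NEG factor on b given a
    p_not_b_given_not_a = 1.0 - p_b_given_not_a  # NEG factor on b given not a
    p_not_a = 1.0 - p_a                          # NEG factor on a
    evidence = p_not_b_given_a * p_a + p_not_b_given_not_a * p_not_a
    return p_not_b_given_a * p_a / evidence

# "If a then (probably) b"; we observe not-b, so a becomes unlikely.
post = posterior_a_given_not_b(p_a=0.5, p_b_given_a=0.9, p_b_given_not_a=0.2)
print(f"P(a | not b) = {post:.3f}")  # ~0.111, down from the 0.5 prior
```

In the full model this backward flow is carried by lambda messages in the factor graph rather than an explicit Bayes computation per pair of nodes.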
cs.AI / 73 / 2602.12172

Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation

基于教学启发的数据合成用于语言模型知识蒸馏
He, Bowei, Chen, Yankai, Zhang, Xiaokun, Kong, Linghe, Yu, Philip S., Liu, Xue, Ma, Chen
Abstract
Knowledge distillation from Large Language Models (LLMs) to smaller models has emerged as a critical technique for deploying efficient AI systems. However, current methods for distillation via synthetic data lack pedagogical awareness, treating knowledge transfer as a one-off data synthesis and training task rather than a systematic learning process. In this paper, we propose a novel pedagogically-inspired framework for LLM knowledge distillation that draws from fundamental educational principles. Our approach introduces a three-stage pipeline -- Knowledge Identifier, Organizer, and Adapter (IOA) -- that systematically identifies knowledge deficiencies in student models, organizes knowledge delivery through progressive curricula, and adapts representations to match the cognitive capacity of student models. We integrate Bloom's Mastery Learning Principles and Vygotsky's Zone of Proximal Development to create a dynamic distillation process where student models approach teacher model's performance on prerequisite knowledge before advancing, and new knowledge is introduced with controlled, gradual difficulty increments. Extensive experiments using LLaMA-3.1/3.2 and Qwen2.5 as student models demonstrate that IOA achieves significant improvements over baseline distillation methods, with student models retaining 94.7% of teacher performance on DollyEval while using less than 1/10th of the parameters. Our framework particularly excels in complex reasoning tasks, showing 19.2% improvement on MATH and 22.3% on HumanEval compared with state-of-the-art baselines.
Chinese Translation
从大型语言模型(LLMs)到小型模型的知识蒸馏已成为部署高效人工智能系统的一项关键技术。然而,目前通过合成数据进行蒸馏的方法缺乏教学意识,将知识转移视为一次性的数据合成和训练任务,而不是一个系统的学习过程。本文提出了一种新颖的基于教学启发的LLM知识蒸馏框架,该框架借鉴了基本的教育原则。我们的方法引入了一个三阶段的流程——知识识别器(Knowledge Identifier)、组织器(Organizer)和适配器(Adapter),合称IOA——系统地识别学生模型中的知识缺陷,通过渐进的课程组织知识传递,并调整表示以匹配学生模型的认知能力。我们结合了布鲁姆的掌握学习原则(Bloom's Mastery Learning Principles)和维果茨基的最近发展区(Vygotsky's Zone of Proximal Development),创建了一个动态的蒸馏过程,使学生模型在推进之前能够接近教师模型在先决知识上的表现,并以受控的、逐步增加的难度引入新知识。使用LLaMA-3.1/3.2和Qwen2.5作为学生模型的广泛实验表明,IOA相较基线蒸馏方法取得了显著改进,学生模型在DollyEval上保留了94.7%的教师表现,同时使用的参数不到1/10。我们的框架在复杂推理任务中表现尤为出色,与最先进的基线相比,在MATH上提高了19.2%,在HumanEval上提高了22.3%。
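The "controlled, gradual difficulty increments" step can be sketched as selecting training items from a band just above the student's current mastery, in the spirit of the Zone of Proximal Development. This is our own toy reading of the idea, not the IOA implementation, and the difficulty scores are hypothetical:

```python
# Sketch: pick the next training batch from items slightly harder than what
# the student has already mastered. Difficulty values are invented.
def zpd_batch(items, mastery, band=0.15):
    """Items whose difficulty lies in (mastery, mastery + band]."""
    return [x for x in items if mastery < x["difficulty"] <= mastery + band]

pool = [{"id": i, "difficulty": d}
        for i, d in enumerate([0.1, 0.3, 0.45, 0.5, 0.62, 0.8, 0.95])]

mastery = 0.4  # estimated fraction of prerequisite skills already mastered
batch = zpd_batch(pool, mastery)
print([x["id"] for x in batch])  # only items just above current mastery
```

As the student's measured mastery rises across distillation rounds, the band slides upward, which is what makes the curriculum progressive.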
cs.AI / 74 / 2602.12173

SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation

SAM3-LiteText:SAM3文本编码器的解剖研究以实现高效的视觉-语言分割
Zeng, Chengxi, Jiang, Yuxuan, Gao, Ge, Wang, Shuai, Danier, Duolikun, Zhu, Bin, Rudinac, Stevan, Bull, David, Zhang, Fan
Abstract
Vision-language segmentation models such as SAM3 enable flexible, prompt-driven visual grounding, but inherit large, general-purpose text encoders originally designed for open-ended language understanding. In practice, segmentation prompts are short, structured, and semantically constrained, leading to substantial over-provisioning in text encoder capacity and persistent computational and memory overhead. In this paper, we perform a large-scale anatomical analysis of text prompting in vision-language segmentation, covering 404,796 real prompts across multiple benchmarks. Our analysis reveals severe redundancy: most context windows are underutilized, vocabulary usage is highly sparse, and text embeddings lie on low-dimensional manifold despite high-dimensional representations. Motivated by these findings, we propose SAM3-LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student that is optimized by knowledge distillation. Extensive experiments on image and video segmentation benchmarks show that SAM3-LiteText reduces text encoder parameters by up to 88%, substantially reducing static memory footprint, while maintaining segmentation performance comparable to the original model. Code: https://github.com/SimonZeng7108/efficientsam3/tree/sam3_litetext.
Chinese Translation
视觉-语言分割模型如SAM3能够实现灵活的、基于提示的视觉定位,但继承了最初为开放式语言理解设计的大型通用文本编码器。在实际应用中,分割提示通常较短、结构化且语义受限,导致文本编码器的容量严重过剩,并造成持续的计算和内存开销。本文对视觉-语言分割中的文本提示进行了大规模的解剖分析,涵盖了404,796个真实提示,涉及多个基准测试。我们的分析揭示了严重的冗余:大多数上下文窗口未被充分利用,词汇使用高度稀疏,尽管存在高维表示,文本嵌入仍然位于低维流形上。基于这些发现,我们提出了SAM3-LiteText,一种轻量级文本编码框架,用紧凑的MobileCLIP学生替代原始的SAM3文本编码器,并通过知识蒸馏进行优化。在图像和视频分割基准上的广泛实验表明,SAM3-LiteText将文本编码器参数减少了多达88%,显著降低了静态内存占用,同时保持了与原始模型相当的分割性能。代码:https://github.com/SimonZeng7108/efficientsam3/tree/sam3_litetext。
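The claim that text embeddings "lie on a low-dimensional manifold despite high-dimensional representations" corresponds to a standard spectral diagnostic: few singular directions capture most of the variance. A toy sketch of that diagnostic on synthetic data (our construction, not the paper's analysis code):

```python
# Sketch: if embeddings concentrate in a few directions, a small number of
# singular values explains most of the variance. Data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
# 200 fake "prompt embeddings" in 64-d that actually live in a 5-d subspace.
basis = rng.normal(size=(5, 64))
coeffs = rng.normal(size=(200, 5))
emb = coeffs @ basis

s = np.linalg.svd(emb - emb.mean(axis=0), compute_uv=False)
explained = np.cumsum(s**2) / np.sum(s**2)
k95 = int(np.searchsorted(explained, 0.95)) + 1
print(f"{k95} of {emb.shape[1]} directions explain 95% of the variance")
```

A large gap between `k95` and the ambient dimension is the kind of evidence that justifies swapping in a much smaller text encoder.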
cs.AI / 75 / 2602.12249

"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most

抱歉,我没听清:语音模型如何忽视最重要的内容
Zhou, Kaitlyn, Bartelds, Martijn, Bianchi, Federico, Zou, James
Abstract
Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with less than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
Chinese Translation
尽管语音识别系统在标准基准测试中实现了较低的词错误率,但在实际应用中,它们常常在短小且高风险的发言中表现不佳。在本研究中,我们考察了这一失败模式在一项高风险任务中的表现:转录美国街道名称,发言者为美国参与者。我们评估了来自OpenAI、Deepgram、Google和Microsoft的15个模型,基于来自语言多样性的美国发言者的录音,发现平均转录错误率为44%。我们通过地理位置量化了转录失败的下游影响,并表明错误转录系统性地导致所有发言者出现错误,但对于以非英语为母语的发言者,路线距离错误的规模是以英语为母语的发言者的两倍。为了减轻这种损害,我们提出了一种合成数据生成方法,利用开源文本到语音模型生成多样化的命名实体发音。使用不到1000个合成样本进行微调,非英语母语发言者的街道名称转录准确率提高了近60%(相对于基础模型)。我们的结果突显了基准性能与语音系统在现实世界中的可靠性之间的关键差距,并展示了一条简单且可扩展的路径,以减少高风险转录错误。
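On short utterances such as street names, a single substituted word dominates the error rate, which is why benchmark-level averages hide these failures. A minimal word-error-rate sketch (our illustration; the paper does not publish this exact code, and the street name below is just an example):

```python
# Minimal WER: Levenshtein distance over words divided by reference length.
def wer(ref: str, hyp: str) -> float:
    r, h = ref.lower().split(), hyp.lower().split()
    # Dynamic programming over word-level insertions/deletions/substitutions.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

# One substituted word in a three-word street name -> WER of 1/3.
print(wer("La Cienega Boulevard", "La Sienna Boulevard"))
```

For a navigation query, that one substituted word is the named entity itself, so a "33% WER" utterance can still be a 100% task failure.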
cs.AI / 76 / 2602.12259

Think like a Scientist: Physics-guided LLM Agent for Equation Discovery

像科学家一样思考:物理引导的 LLM 代理用于方程发现
Yang, Jianke, Venkatachalam, Ohm, Kianezhad, Mohammad, Vadgama, Sharvaree, Yu, Rose
Abstract
Explaining observed phenomena through symbolic, interpretable formulas is a fundamental goal of science. Recently, large language models (LLMs) have emerged as promising tools for symbolic equation discovery, owing to their broad domain knowledge and strong reasoning capabilities. However, most existing LLM-based systems try to guess equations directly from data, without modeling the multi-step reasoning process that scientists often follow: first inferring physical properties such as symmetries, then using these as priors to restrict the space of candidate equations. We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process. The agent coordinates physics-based tools to extract intermediate structure and uses these results to configure symbolic regression engines such as PySINDy and PySR, including their function libraries and structural constraints. Across a suite of physical equation benchmarks, KeplerAgent achieves substantially higher symbolic accuracy and greater robustness to noisy data than both LLM and traditional baselines.
Chinese Translation
通过符号化、可解释的公式解释观察到的现象是科学的一个基本目标。近年来,大型语言模型(LLMs)因其广泛的领域知识和强大的推理能力而成为符号方程发现的有前景的工具。然而,大多数现有的基于 LLM 的系统直接从数据中猜测方程,而没有建模科学家通常遵循的多步骤推理过程:首先推断物理属性,如对称性,然后利用这些作为先验来限制候选方程的空间。我们引入了 KeplerAgent,一个明确遵循这一科学推理过程的代理框架。该代理协调基于物理的工具以提取中间结构,并利用这些结果配置符号回归引擎,如 PySINDy 和 PySR,包括它们的函数库和结构约束。在一系列物理方程基准测试中,KeplerAgent 实现了显著更高的符号准确性和对噪声数据更大的鲁棒性,超越了 LLM 和传统基线。
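The "infer a physical property first, then restrict the candidate space" step can be sketched in miniature: detect even symmetry from data, then prune odd basis functions before running symbolic regression. This is our own toy, not KeplerAgent's code, and the hidden "law" below is invented:

```python
# Sketch: detect f(-x) == f(x) numerically, then keep only even candidates
# in the function library handed to a symbolic regression engine.
import numpy as np

def looks_even(f, xs, tol=1e-8) -> bool:
    return bool(np.all(np.abs(f(xs) - f(-xs)) < tol))

xs = np.linspace(0.1, 3.0, 50)
f = lambda x: 2.0 * x**2 + np.cos(x)   # hidden "law": even in x

library = {"x": lambda x: x, "x^2": lambda x: x**2,
           "sin": np.sin, "cos": np.cos}
if looks_even(f, xs):
    # Use the detected symmetry as a structural prior: drop odd candidates.
    library = {k: g for k, g in library.items() if looks_even(g, xs)}
print(sorted(library))  # ['cos', 'x^2']
```

In the agentic framework this pruned library would be passed as a configuration to an engine such as PySINDy or PySR rather than searched by hand.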
cs.AI / 77 / 2602.12268

CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

CM2:用于多轮和多步骤自主工具使用的带有检查清单奖励的强化学习
Zhang, Zhen, Song, Kaiqiang, Wang, Xun, Hu, Yebowen, Yan, Weixiang, Zhao, Chenyang, Zou, Henry Peng, Deng, Haoyun, Indurthi, Sathish Reddy, Liu, Shujian, Ma, Simin, Wang, Xiaoyang, Wang, Xin Eric, Wang, Song
Abstract
AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from an 8B Base model and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by 8 points on tau^-Bench, by 10 points on BFCL-V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code provided by the open-source community: https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.
Chinese Translation
人工智能代理越来越多地被用于通过推理多轮用户交互和调用外部工具来解决现实世界的任务。然而,将强化学习应用于此类环境仍然困难:现实的目标往往缺乏可验证的奖励,而更强调开放式行为;此外,针对多轮、多步骤自主工具使用的强化学习仍然未得到充分探索;构建和维护可执行的工具环境成本高昂,限制了规模和覆盖面。我们提出了CM2,一个将可验证的结果奖励替换为检查清单奖励的强化学习框架。CM2将每轮的预期行为分解为具有明确证据基础和结构化元数据的细粒度二元标准,将开放式判断转变为更稳定的分类风格决策。为了平衡稳定性和信息量,我们的方法采用稀疏奖励分配但密集评估标准的策略。训练在可扩展的LLM模拟工具环境中进行,避免了对大型工具集的重型工程需求。实验表明,CM2在监督微调的基础上持续改进。从一个8B基础模型开始,并在一个8k示例的强化学习数据集上训练,CM2在tau^-Bench上比SFT对照组提高了8分,在BFCL-V4上提高了10分,在ToolSandbox上提高了12分。结果与同样规模的开源基准相匹配,甚至超越了包括判断模型在内的基准。因此,CM2为优化多轮、多步骤工具使用代理提供了一种可扩展的方案,而无需依赖可验证的奖励。代码由开源社区提供:https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent。
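The "sparse reward, dense criteria" strategy can be sketched as a per-turn checklist of binary judgments collapsed into one scalar. This is our reading of the idea, not CM2's code, and the criteria below are invented examples:

```python
# Sketch of a checklist reward: each turn's intended behavior is a list of
# binary criteria judged from evidence; the reward is the fraction satisfied.
def checklist_reward(judgments: dict) -> float:
    """Dense evaluation criteria, one sparse scalar reward per turn."""
    return sum(judgments.values()) / len(judgments)

turn_checklist = {
    "asked for the order id before acting": True,
    "called the refund tool with the right arguments": True,
    "quoted the policy it relied on": False,
    "confirmed the outcome with the user": True,
}
reward = checklist_reward(turn_checklist)
print(reward)  # 3 of 4 criteria met -> 0.75
```

Because each criterion is a classification-style decision with explicit evidence grounding, an LLM judge answering them tends to be more stable than one asked to score an open-ended response directly.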
cs.AI / 78 / 2602.12276

Agentic Test-Time Scaling for WebAgents

WebAgents的代理测试时间缩放
Lee, Nicholas, Erdogan, Lutfi Eren, John, Chris Joseph, Krishnapillai, Surya, Mahoney, Michael W., Keutzer, Kurt, Gholami, Amir
Abstract
Test-time scaling has become a standard way to improve performance and boost reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well-understood: small per-step errors can compound over long horizons, and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute for multi-step agents. We first conduct an empirical study of inference-time scaling for web agents. We find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter that can outperform naive voting but may erroneously overrule high-consensus decisions. We show that uncertainty statistics derived from the agent's own vote distribution (entropy and top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over ReAct while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.
Chinese Translation
测试时间缩放已成为提高神经网络模型性能和增强可靠性的标准方法。然而,其在代理性多步骤任务中的表现仍不够明确:每步的小错误可能在长时间范围内累积;我们发现,简单的策略通过均匀增加采样显示出收益递减。在本研究中,我们提出了CATTS,这是一种为多步骤代理动态分配计算资源的简单技术。我们首先对Web代理的推理时间缩放进行了实证研究。我们发现,在长时间范围环境中,均匀增加每步计算资源很快会达到饱和。随后,我们研究了更强的聚合策略,包括基于大型语言模型(LLM)的仲裁者,该策略能够超越简单投票,但可能会推翻高共识的决策。我们展示了从代理自身投票分布(熵和前1/前2边际)中得出的不确定性统计与下游成功相关,并为动态计算分配提供了实用信号。基于这些发现,我们引入了信心感知测试时间缩放(CATTS),该方法利用投票派生的不确定性,仅在决策真正存在争议时分配计算资源。CATTS在WebArena-Lite和GoBrowse上的性能提高了多达9.1%,同时使用的令牌数量比均匀缩放少了多达2.3倍,提供了效率提升和可解释的决策规则。
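The vote-derived uncertainty gate described above can be sketched directly: compute entropy and the top-1/top-2 margin over sampled action votes, and spend extra compute only when the vote is contentious. Our illustration of the idea, not the CATTS implementation; the thresholds and sample budget are hypothetical:

```python
# Sketch: entropy and top-1/top-2 margin over an agent's action votes decide
# whether to allocate more samples to this step.
import math
from collections import Counter

def vote_stats(votes):
    counts = Counter(votes)
    n = len(votes)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    top = sorted(probs, reverse=True) + [0.0]
    margin = top[0] - top[1]            # top-1 / top-2 gap
    return entropy, margin

def extra_samples(votes, entropy_thresh=0.9, margin_thresh=0.3):
    entropy, margin = vote_stats(votes)
    contentious = entropy > entropy_thresh or margin < margin_thresh
    return 8 if contentious else 0      # spend compute only when contested

print(extra_samples(["click", "click", "click", "click"]))   # consensus: 0
print(extra_samples(["click", "type", "click", "scroll"]))   # contested: 8
```

The appeal of the rule is its interpretability: every extra sample can be traced back to a disagreement in the vote distribution at a specific step.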
计算语言学 (Computation and Language)
77
cs.CL / 1 / 2602.11156

HybridRAG: A Practical LLM-based ChatBot Framework based on Pre-Generated Q&A over Raw Unstructured Documents

HybridRAG:基于预生成问答的实用LLM聊天机器人框架,适用于原始非结构化文档
Kim, Sungmoon, Jeon, Hyuna, Kim, Dahye, Kim, Mingyu, Chae, Dong-Kyu, Kim, Jiwoong
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for grounding Large Language Model (LLM)-based chatbot responses on external knowledge. However, existing RAG studies typically assume well-structured textual sources (e.g. Wikipedia or curated datasets) and perform retrieval and generation at query time, which can limit their applicability in real-world chatbot scenarios. In this paper, we present HybridRAG, a novel and practical RAG framework towards more accurate and faster chatbot responses. First, HybridRAG ingests raw, unstructured PDF documents containing complex layouts (text, tables, figures) via Optical Character Recognition (OCR) and layout analysis, and converts them into hierarchical text chunks. Then, it pre-generates a plausible question-answer (QA) knowledge base from the organized chunks using an LLM. At query time, user questions are matched against this QA bank to retrieve immediate answers when possible, and only if no suitable QA match is found does our framework fall back to an on-the-fly response generation. Experiments on OHRBench demonstrate that our HybridRAG provides higher answer quality and lower latency compared to a standard RAG baseline. We believe that HybridRAG could be a practical solution for real-world chatbot applications that must handle large volumes of unstructured documents and lots of users under limited computational resources.
Chinese Translation
检索增强生成(Retrieval-Augmented Generation,RAG)已成为一种强大的方法,用于将基于大型语言模型(Large Language Model,LLM)的聊天机器人响应与外部知识相结合。然而,现有的RAG研究通常假设文本来源结构良好(例如维基百科或经过整理的数据集),并在查询时进行检索和生成,这可能限制其在现实世界聊天机器人场景中的适用性。本文提出了HybridRAG,这是一种新颖且实用的RAG框架,旨在提供更准确和更快速的聊天机器人响应。首先,HybridRAG通过光学字符识别(Optical Character Recognition,OCR)和布局分析,处理包含复杂布局(文本、表格、图形)的原始非结构化PDF文档,并将其转换为层次化文本块。然后,它利用LLM从整理后的文本块中预生成一个合理的问答(Question-Answer,QA)知识库。在查询时,用户问题会与该QA库进行匹配,以便在可能的情况下检索即时答案,只有在未找到合适的QA匹配时,我们的框架才会回退到即时响应生成。OHRBench上的实验表明,与标准RAG基线相比,我们的HybridRAG提供了更高的答案质量和更低的延迟。我们相信,HybridRAG可以成为处理大量非结构化文档和众多用户的现实世界聊天机器人应用的实用解决方案,尤其是在计算资源有限的情况下。
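The "QA bank first, generate only on a miss" flow can be sketched as a similarity match with a fallback path. Our toy, not HybridRAG's code: the bag-of-words cosine and the 0.6 threshold are stand-ins for a real embedding retriever, and the QA entries are invented:

```python
# Sketch: match the query against pre-generated QA pairs; fall back to
# on-the-fly generation only when no pair is similar enough.
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    u, v = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

qa_bank = {
    "what is the refund period": "Refunds are accepted within 30 days.",
    "how do i reset my password": "Use the 'Forgot password' link.",
}

def generate_on_the_fly(query: str) -> str:   # placeholder for the slow path
    return f"[generated answer for: {query}]"

def answer(query: str, threshold: float = 0.6) -> str:
    best_q = max(qa_bank, key=lambda q: cosine(q, query))
    if cosine(best_q, query) >= threshold:
        return qa_bank[best_q]                # immediate pre-generated answer
    return generate_on_the_fly(query)         # full RAG pipeline on a miss

print(answer("what is the refund period"))
print(answer("do you ship to Norway"))
```

The latency win comes from the hit path skipping both retrieval over raw chunks and LLM generation; only misses pay the full cost.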
cs.CL / 2 / 2602.11157

Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

基于响应的知识蒸馏在多语言越狱预防中的应用无意中危及安全
Zhang, Max, Liu, Derek, Zhang, Kai, Franco, Joshua, Liu, Haihao
Abstract
Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This allows for vulnerabilities in non-English contexts, especially with low-resource languages. We introduce a novel application of knowledge distillation (KD) in the context of multilingual jailbreak prevention, examining its efficacy. We distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) with Low-Rank Adaptation (LoRA) into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B, using ~28,000 multilingual jailbreak prompts from XSafety via black-box response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals a counterintuitive behavior: standard fine-tuning on the teacher's ``safe'' refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, up to 16.6 percentage points. Our experiments reveal a divergent generalization to unseen languages during distillation, with varying outcomes depending on the base model. By removing a primary source of safety degradation, nuanced `boundary' refusals, we mitigate or even reverse safety declines in student models, although reductions in reasoning performance (GSM8K) persist. Overall, our exploratory study highlights the challenges and potential of KD as a technique for multilingual safety alignment, offering a foundation for future research in this direction.
Chinese Translation
大型语言模型(LLMs)在全球范围内的部署日益增多,但其安全对齐仍主要集中于英语。这使得在非英语环境中,尤其是低资源语言中存在脆弱性。我们在多语言越狱预防的背景下引入了一种新的知识蒸馏(KD)应用,考察其有效性。我们将一个专有教师模型(OpenAI o1-mini)在低秩适应(LoRA)下的拒绝行为蒸馏到三个开源学生模型:Meta-Llama-3-8B-Instruct、Gemma-2-2B-IT 和 Qwen3-8B,使用来自 XSafety 的约 28,000 个多语言越狱提示,通过基于黑箱响应的参数高效微调(PEFT)。在 MultiJail 基准上的评估揭示了一种反直觉的行为:对教师的“安全”拒绝数据进行标准微调无意中提高了所有学生模型的越狱成功率(JSR),最高可达 16.6 个百分点。我们的实验揭示了在蒸馏过程中对未见语言的不同泛化,结果因基础模型而异。通过消除安全降级的主要来源,即细致的“边界”拒绝,我们减轻甚至逆转了学生模型的安全下降,尽管推理性能(GSM8K)的下降仍然存在。总体而言,我们的探索性研究突显了知识蒸馏作为多语言安全对齐技术的挑战和潜力,为未来在这一方向的研究奠定了基础。
cs.CL / 3 / 2602.11162

Retrieval Heads are Dynamic

检索头是动态的
Lin, Yuping, Li, Zitao, Xing, Yue, He, Pengfei, Cui, Yingqian, Li, Yaliang, Ding, Bolin, Zhou, Jingren, Tang, Jiliang
Abstract
Recent studies have identified "retrieval heads" in Large Language Models (LLMs) responsible for extracting information from input contexts. However, prior works largely rely on static statistics aggregated across datasets, identifying heads that perform retrieval on average. This perspective overlooks the fine-grained temporal dynamics of autoregressive generation. In this paper, we investigate retrieval heads from a dynamic perspective. Through extensive analysis, we establish three core claims: (1) Dynamism: Retrieval heads vary dynamically across timesteps; (2) Irreplaceability: Dynamic retrieval heads are specific at each timestep and cannot be effectively replaced by static retrieval heads; and (3) Correlation: The model's hidden state encodes a predictive signal for future retrieval head patterns, indicating an internal planning mechanism. We validate these findings on the Needle-in-a-Haystack task and a multi-hop QA task, and quantify the differences on the utility of dynamic and static retrieval heads in a Dynamic Retrieval-Augmented Generation framework. Our study provides new insights into the internal mechanisms of LLMs.
Chinese Translation
近期研究在大型语言模型(LLMs)中识别出了负责从输入上下文中提取信息的“检索头”。然而,之前的研究主要依赖于跨数据集聚合的静态统计,识别出在平均水平上执行检索的头。这种观点忽视了自回归生成的细粒度时间动态。在本文中,我们从动态的角度研究检索头。通过广泛的分析,我们建立了三个核心论点:(1)动态性:检索头在时间步长上动态变化;(2)不可替代性:动态检索头在每个时间步长上都是特定的,无法被静态检索头有效替代;(3)相关性:模型的隐藏状态编码了对未来检索头模式的预测信号,表明存在内部规划机制。我们在“针在干草堆”任务和多跳问答任务上验证了这些发现,并量化了动态和静态检索头在动态检索增强生成框架中的效用差异。我们的研究为大型语言模型的内部机制提供了新的见解。
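The dynamism claim can be made concrete with the usual definition of a retrieval score: the fraction of a head's attention mass that lands on the "needle" span at a given generation step. The toy below is our construction with a hand-built attention pattern, not the paper's code:

```python
# Sketch: score one attention head per timestep by how much attention mass
# it places on the needle tokens; a head may qualify at some steps only.
import numpy as np

def retrieval_score(attn_row, needle):
    """Fraction of this step's attention mass landing on the needle span."""
    return float(attn_row[needle].sum() / attn_row.sum())

T = 12
needle = slice(3, 6)                 # where the fact sits in the context
steps = np.ones((4, T))              # one head's attention over 4 gen steps
steps[1, needle] += 5.0              # the head retrieves only at step 1

scores = [retrieval_score(row, needle) for row in steps]
active = [t for t, s in enumerate(scores) if s > 0.5]
print(active)  # -> [1]: the head acts as a retrieval head at that step only
```

Static analyses average such scores over many examples and steps, which is exactly what hides the per-timestep variation the paper studies.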
cs.CL / 4 / 2602.11163

Nested Named Entity Recognition in Plasma Physics Research Articles

等离子体物理研究文章中的嵌套命名实体识别
Haris, Muhammad, Höft, Hans, Becker, Markus M., Stocker, Markus
Abstract
Named Entity Recognition (NER) is an important task in natural language processing that aims to identify and extract key entities from unstructured text. We present a novel application of NER in plasma physics research articles and address the challenges of extracting specialized entities from scientific text in this domain. Research articles in plasma physics often contain highly complex and context-rich content that must be extracted to enable, e.g., advanced search. We propose a lightweight approach based on encoder-transformers and conditional random fields to extract (nested) named entities from plasma physics research articles. First, we annotate a plasma physics corpus with 16 classes specifically designed for the nested NER task. Second, we evaluate an entity-specific model specialization approach, where independent BERT-CRF models are trained to recognize individual entity types in plasma physics text. Third, we integrate an optimization process to systematically fine-tune hyperparameters and enhance model performance. Our work contributes to the advancement of entity recognition in plasma physics and also provides a foundation to support researchers in navigating and analyzing scientific literature.
Chinese Translation
命名实体识别(NER)是自然语言处理中的一项重要任务,旨在从非结构化文本中识别和提取关键实体。我们提出了NER在等离子体物理研究文章中的新颖应用,并解决了从该领域科学文本中提取专业实体的挑战。等离子体物理研究文章通常包含高度复杂和富有上下文的内容,这些内容必须被提取以便于例如高级搜索。我们提出了一种基于编码器-变换器(encoder-transformers)和条件随机场(conditional random fields)的轻量级方法,从等离子体物理研究文章中提取(嵌套)命名实体。首先,我们为嵌套NER任务注释了一个包含16个类别的等离子体物理语料库。其次,我们评估了一种实体特定模型专业化的方法,其中独立的BERT-CRF模型被训练以识别等离子体物理文本中的各个实体类型。第三,我们整合了一个优化过程,以系统地微调超参数并增强模型性能。我们的工作有助于推动等离子体物理领域的实体识别进展,并为研究人员在导航和分析科学文献时提供基础支持。
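The entity-specific specialization step implies a merge: each per-type tagger predicts spans independently, and nesting arises when spans from different models overlap. A toy sketch of that merge (our illustration, not the paper's BERT-CRF code; the entity classes and spans are invented examples):

```python
# Sketch: combine independent per-type taggers into nested annotations.
# Spans are (start, end, label) over token indices, end exclusive.
tokens = ["argon", "plasma", "jet", "at", "atmospheric", "pressure"]

# Each specialized model only predicts spans of its own class:
per_type_predictions = {
    "PlasmaSource": [(0, 3, "PlasmaSource")],   # "argon plasma jet"
    "Medium":       [(0, 1, "Medium")],         # "argon" (nested inside)
    "Pressure":     [(4, 6, "Pressure")],       # "atmospheric pressure"
}

def merge_nested(preds):
    """Union of all spans; overlaps across types are kept, i.e. nesting."""
    return sorted(s for spans in preds.values() for s in spans)

for start, end, label in merge_nested(per_type_predictions):
    print(label, tokens[start:end])
```

A single flat tagger would have to drop either "argon" or "argon plasma jet"; training one model per class sidesteps that conflict by construction.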
cs.CL / 5 / 2602.11165

Assessing LLM Reliability on Temporally Recent Open-Domain Questions

评估大型语言模型在近期开放领域问题上的可靠性
Krishnappa, Pushwitha, Das, Amit, Jain, Vinija, Mukherjee, Tathagata, Chadha, Aman
Abstract
Large Language Models (LLMs) are increasingly deployed for open-domain question answering, yet their alignment with human perspectives on temporally recent information remains underexplored. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a benchmark dataset of 15,000 recent Reddit questions from September 2025 paired with community-derived reference answers. We investigate how four open-source LLMs (Llama3.1-8B, Mistral-7B, Gemma-2-9B, and GPT-OSS-20B) respond to these questions, evaluating alignment using lexical metrics (BLEU, ROUGE), semantic similarity (BERTScore, MoverScore, cosine similarity), and logical inference (NLI). Our central finding is a striking semantic-lexical paradox: all models achieve over 99% cosine similarity with references despite less than 8% BLEU-1 overlap, a 90+ percentage point gap indicating that models preserve meaning through extensive paraphrasing rather than lexical reproduction. MoverScore (51-53%) confirms this pattern, occupying an intermediate position that reflects the optimal transport cost of semantic alignment. Furthermore, model scale does not predict performance: Mistral-7B (7B parameters) outperforms GPT-OSS-20B (20B parameters) across all metrics. NLI analysis reveals that contradiction rates remain below 7%, suggesting models rarely generate content that directly conflicts with human consensus. These findings challenge the reliability of lexical metrics for evaluating abstractive generation and argue for multi-dimensional evaluation frameworks that capture semantic fidelity beyond surface-level text matching. The RECOM dataset is publicly available at https://anonymous.4open.science/r/recom-D4B0
Chinese Translation
大型语言模型(LLMs)越来越多地用于开放领域问答,但它们与人类对近期信息的看法之间的契合度仍然未得到充分探索。我们引入了RECOM(Reddit模型对应性评估),这是一个包含15,000个来自2025年9月的近期Reddit问题的基准数据集,并配有社区衍生的参考答案。我们研究了四个开源LLM(Llama3.1-8B、Mistral-7B、Gemma-2-9B和GPT-OSS-20B)对这些问题的响应,使用词汇度量(BLEU、ROUGE)、语义相似性(BERTScore、MoverScore、余弦相似性)和逻辑推理(NLI)评估其契合度。我们的核心发现是一个显著的语义-词汇悖论:所有模型与参考答案的余弦相似度均超过99%,尽管BLEU-1重叠率不足8%,这一90多个百分点的差距表明模型通过广泛的意译而非词汇再现来保留意义。MoverScore(51-53%)证实了这一模式,处于反映语义对齐的最优传输成本的中间位置。此外,模型规模并不能预测性能:Mistral-7B(70亿参数)在所有指标上均优于GPT-OSS-20B(200亿参数)。NLI分析显示,矛盾率保持在7%以下,表明模型很少生成与人类共识直接冲突的内容。这些发现挑战了词汇度量在评估抽象生成中的可靠性,并主张采用多维评估框架,以捕捉超越表面文本匹配的语义忠实度。RECOM数据集可在https://anonymous.4open.science/r/recom-D4B0公开获取。
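The semantic-lexical paradox hinges on how unigram-overlap metrics treat paraphrase. A BLEU-1-style precision sketch (our toy, without brevity penalty or higher-order n-grams; the sentence pair is invented) shows how a faithful paraphrase can score zero lexically:

```python
# Sketch: unigram precision (the BLEU-1 core) on a paraphrase pair.
from collections import Counter

def unigram_precision(reference: str, candidate: str) -> float:
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    return overlap / max(sum(cand.values()), 1)

ref = "yes the store opens at nine"
cand = "it starts business from 9 am"   # same meaning, different words
print(unigram_precision(ref, cand))     # 0.0 despite semantic agreement
```

An embedding-based metric scores the same pair by vector similarity rather than word identity, which is why the two families of metrics can diverge by 90+ percentage points.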
cs.CL / 6 / 2602.11166

Small Updates, Big Doubts: Does Parameter-Efficient Fine-tuning Enhance Hallucination Detection ?

小幅更新,大疑虑:参数高效微调是否增强了幻觉检测?
Hu, Xu, Zhang, Yifan, Wei, Songtao, Zhao, Chen, Li, Qiannan, Li, Bingzhe, Chen, Feng
Abstract
Parameter-efficient fine-tuning (PEFT) methods are widely used to adapt large language models (LLMs) to downstream tasks and are often assumed to improve factual correctness. However, how parameter-efficient fine-tuning affects hallucination behavior remains insufficiently understood, especially on QA datasets. In this work, we systematically investigate the impact of PEFT on hallucination detection through a comprehensive empirical study across three open-weight LLM backbones and three fact-seeking QA benchmarks. For each model, we evaluate performance using seven unsupervised hallucination detection methods spanning three complementary approaches: semantic consistency based detectors, confidence based detectors, and entropy based detectors. This multifaceted evaluation enables us to characterize how PEFT reshapes uncertainty across different detection paradigms. In conclusion, our experimental results show that PEFT consistently strengthens hallucination detection ability, substantially improving AUROC across a wide range of hallucination detectors. In addition, further analyses using linear probes and representation diagnostics indicate that PEFT primarily reshapes how uncertainty is encoded and surfaced, rather than injecting new factual knowledge into the models.
Chinese Translation
参数高效微调(PEFT)方法被广泛应用于将大型语言模型(LLMs)适应于下游任务,并且通常被认为能够提高事实正确性。然而,参数高效微调方法如何影响幻觉行为仍然理解不足,特别是在问答(QA)数据集上。在本研究中,我们通过对三种开放权重的LLM骨干网络和三种寻求事实的QA基准进行全面的实证研究,系统地调查了PEFT对幻觉检测的影响。对于每个模型,我们使用七种无监督幻觉检测方法进行性能评估,这些方法涵盖了三种互补的方法:基于语义一致性的检测器、基于置信度的检测器和基于熵的检测器。这种多维度的评估使我们能够表征PEFT如何重塑不同检测范式中的不确定性。总之,我们的实验结果表明,PEFT始终增强了幻觉检测能力,在广泛的幻觉检测器中显著提高了AUROC。此外,使用线性探测器和表示诊断的进一步分析表明,PEFT方法主要重塑了不确定性的编码和呈现方式,而不是将新的事实知识注入模型中。
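An entropy-based detector scored with AUROC, the combination this study evaluates, can be sketched in a few lines. Our minimal illustration, not the paper's code; the detectors evaluated there are more elaborate, and the sampled answers below are fabricated examples:

```python
# Sketch: predictive entropy over repeated samples as a hallucination score,
# evaluated with a pairwise-comparison AUROC.
import math
from collections import Counter

def answer_entropy(sampled_answers):
    """Predictive entropy over repeated sampled answers to one question."""
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def auroc(scores, labels):
    """Probability a hallucinated item outscores a faithful one (ties 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical eval set: label 1 marks a hallucinated answer.
batches = [(["paris"] * 5, 0),
           (["1912", "1915", "1910", "1912", "1920"], 1),
           (["blue"] * 4 + ["azure"], 0),
           (["7", "12", "9", "3", "5"], 1)]
scores = [answer_entropy(a) for a, _ in batches]
labels = [y for _, y in batches]
print(f"AUROC = {auroc(scores, labels):.2f}")
```

The paper's finding is that PEFT widens the score separation between the two classes, which directly raises this pairwise statistic.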
cs.CL / 7 / 2602.11167

Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering

通过内部状态分析和聚类可视化与基准测试大型语言模型的事实幻觉倾向
Mao, Nathan, Kaushik, Varun, Shivkumar, Shreya, Sharafoleslami, Parham, Zhu, Kevin, Dev, Sunishchal
Abstract
Large Language Models (LLMs) often hallucinate, generating nonsensical or false information that can be especially harmful in sensitive fields such as medicine or law. To study this phenomenon systematically, we introduce FalseCite, a curated dataset designed to capture and benchmark hallucinated responses induced by misleading or fabricated citations. Running GPT-4o-mini, Falcon-7B, and Mistral-7B through FalseCite, we observed a noticeable increase in hallucination activity for false claims with deceptive citations, especially in GPT-4o-mini. Using the responses from FalseCite, we can also analyze the internal states of hallucinating models, visualizing and clustering the hidden state vectors. From this analysis, we noticed that the hidden state vectors, regardless of hallucination or non-hallucination, tend to trace out a distinct horn-like shape. Our work underscores FalseCite's potential as a foundation for evaluating and mitigating hallucinations in future LLM research.
Chinese Translation
大型语言模型(LLMs)经常出现幻觉,生成无意义或虚假的信息,这在医学或法律等敏感领域可能造成特别严重的后果。为了系统地研究这一现象,我们引入了FalseCite,这是一个旨在捕捉和基准测试由误导性或虚构引用引发的幻觉响应的策划数据集。通过对GPT-4o-mini、Falcon-7B和Mistral 7-B进行FalseCite测试,我们观察到,尤其在GPT-4o-mini中,虚假引用的虚假声明的幻觉活动显著增加。利用FalseCite的响应,我们还可以分析幻觉模型的内部状态,进行隐藏状态向量的可视化和聚类。从这一分析中,我们注意到,无论是幻觉还是非幻觉,隐藏状态向量往往呈现出独特的角锥形状。我们的工作强调了FalseCite作为评估和减轻未来大型语言模型研究中幻觉现象的基础的潜力。
cs.CL / 8 / 2602.11168

Enhancing SDG-Text Classification with Combinatorial Fusion Analysis and Generative AI

通过组合融合分析和生成性人工智能增强可持续发展目标文本分类
Xu, Jingyan, LaFleur, Marcelo L., Schweikert, Christina, Hsu, D. Frank
Abstract
Natural Language Processing (NLP) techniques such as text classification and topic discovery are very useful in many application areas including information retrieval, knowledge discovery, policy formulation, and decision-making. However, classification remains challenging in cases where the categories are unavailable, difficult to differentiate, or interrelated. Social analysis with human context is an area that can benefit from text classification, as it relies substantially on text data. The focus of this paper is to enhance the classification of text according to the UN's Sustainable Development Goals (SDGs) by collecting and combining intelligence from multiple models. Combinatorial Fusion Analysis (CFA), a system fusion paradigm using a rank-score characteristic (RSC) function and cognitive diversity (CD), has been used to enhance classifier methods by combining a set of relatively good and mutually diverse classification models. We use a generative AI model to generate synthetic data for model training and then apply CFA to this classification task. The CFA technique achieves 96.73% performance, outperforming the best individual model. We compare the outcomes with those obtained from human domain experts, demonstrating that combining intelligence from multiple ML/AI models using CFA and input from human experts can not only complement but also enhance each other.
Chinese Translation
自然语言处理(Natural Language Processing, NLP)技术,如文本分类和主题发现,在信息检索、知识发现、政策制定和决策等多个应用领域中非常有用。然而,在类别不可用、难以区分或相互关联的情况下,这仍然是一个具有挑战性的问题。社会分析与人类背景密切相关,是一个可以从文本分类中受益的领域,因为它在很大程度上依赖于文本数据。本文的重点是通过收集和结合多个模型的智能,增强根据联合国可持续发展目标(Sustainable Development Goals, SDGs)进行文本分类的能力。组合融合分析(Combinatorial Fusion Analysis, CFA)是一种使用排名-得分特征(rank-score characteristic, RSC)函数和认知多样性(cognitive diversity, CD)的系统融合范式,已被用于通过结合一组相对较好且相互多样的分类模型来增强分类器方法。我们使用生成性人工智能模型生成合成数据以进行模型训练,然后将CFA应用于该分类任务。CFA技术实现了96.73%的性能,超越了最佳单一模型。我们将结果与人类领域专家获得的结果进行了比较。研究表明,使用CFA结合多个机器学习/人工智能模型的智能,并获得人类专家的输入,不仅可以互补,还可以相互增强。
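The RSC function, cognitive diversity, and rank combination at the core of CFA can be sketched in a few lines (a minimal illustration of rank-score fusion over per-item scores, not the authors' implementation):

```python
def ranks(scores):
    """Convert scores to ranks (1 = highest score), aligned with input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    out = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def rank_score_characteristic(scores):
    """RSC function: a model's scores viewed as a function of rank."""
    return sorted(scores, reverse=True)

def cognitive_diversity(scores_a, scores_b):
    """CD as the mean absolute gap between two models' RSC functions."""
    fa = rank_score_characteristic(scores_a)
    fb = rank_score_characteristic(scores_b)
    return sum(abs(x - y) for x, y in zip(fa, fb)) / len(fa)

def rank_combination(score_lists):
    """Fuse models by averaging ranks per item (lower fused rank = better)."""
    all_ranks = [ranks(s) for s in score_lists]
    n = len(score_lists[0])
    return [sum(r[i] for r in all_ranks) / len(all_ranks) for i in range(n)]
```

In CFA, pairs with high individual performance and high cognitive diversity are the most promising candidates for fusion.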
cs.CL / 9 / 2602.11169

Disentangling Direction and Magnitude in Transformer Representations: A Double Dissociation Through L2-Matched Perturbation Analysis

在变换器表示中解开方向与幅度:通过L2匹配扰动分析的双重解离
Vardhan, Mangadoddi Srikar, Teja, Lekkala Sai
Abstract
Transformer hidden states encode information as high-dimensional vectors, yet whether direction (orientation in representational space) and magnitude (vector norm) serve distinct functional roles remains unclear. Studying Pythia-family models, we discover a striking cross-over dissociation: angular perturbations cause up to 42.9x more damage to language modeling loss, while magnitude perturbations cause disproportionately more damage to syntactic processing (20.4% vs. 1.6% accuracy drop on subject-verb agreement). This finding is enabled by L2-matched perturbation analysis, a methodology ensuring that angular and magnitude perturbations achieve identical Euclidean displacements. Causal intervention reveals that angular damage flows substantially through the attention pathways (28.4% loss recovery via attention repair), while magnitude damage flows partly through the LayerNorm pathways (29.9% recovery via LayerNorm repair). These patterns replicate across scales within the Pythia architecture family. These findings provide evidence that direction and magnitude support partially distinct computational roles in LayerNorm-based architectures: direction preferentially affects attentional routing, while magnitude modulates processing intensity for fine-grained syntactic judgments. We find different patterns in RMSNorm-based architectures, suggesting that the dissociation depends on architectural choices. Our results refine the linear representation hypothesis and have implications for model editing and interpretability research.
Chinese Translation
变换器的隐藏状态将信息编码为高维向量,但方向(表示空间中的取向)和幅度(向量范数)是否发挥着不同的功能角色仍不清楚。通过研究Pythia系列模型,我们发现了一种显著的交叉解离现象:角度扰动对语言建模损失造成的损害高达42.9倍,而幅度扰动对句法处理造成的不成比例的损害(在主谓一致上,准确率下降20.4%对比1.6%)。这一发现得益于L2匹配扰动分析,这是一种确保角度和幅度扰动实现相同欧几里得位移的方法。因果干预揭示,角度损害主要通过注意力通路流动(通过注意力修复恢复28.4%的损失),而幅度损害部分通过LayerNorm通路流动(通过LayerNorm修复恢复29.9%)。这些模式在Pythia架构系列中跨尺度复制。这些发现提供了证据,表明方向和幅度在基于LayerNorm的架构中支持部分不同的计算角色。方向优先影响注意力路由,而幅度调节细粒度句法判断的处理强度。我们在基于RMSNorm的架构中发现了不同的模式,表明这种解离依赖于架构选择。我们的结果细化了线性表示假设,并对模型编辑和可解释性研究具有重要意义。
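The L2-matching constraint is concrete enough to sketch: given a hidden state h and a displacement budget eps, a pure-norm and a pure-direction perturbation with identical Euclidean displacement can be constructed as below (an illustrative reconstruction, assuming a unit direction u orthogonal to h is available; not the paper's code):

```python
import math

def l2norm(v):
    return math.sqrt(sum(x * x for x in v))

def magnitude_perturbation(h, eps):
    """Scale the norm of h (direction unchanged) so the Euclidean
    displacement is exactly eps."""
    n = l2norm(h)
    return [x * (n + eps) / n for x in h]

def angular_perturbation(h, u, eps):
    """Rotate h toward a unit vector u orthogonal to h (norm unchanged)
    so the chord, i.e. the Euclidean displacement, is exactly eps,
    using 2 * ||h|| * sin(theta / 2) = eps."""
    n = l2norm(h)
    theta = 2.0 * math.asin(eps / (2.0 * n))
    return [math.cos(theta) * x + math.sin(theta) * n * ui
            for x, ui in zip(h, u)]
```

Both perturbed states sit exactly eps away from h in L2, so any difference in downstream damage is attributable to direction vs. magnitude rather than displacement size.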
cs.CL / 10 / 2602.11170

PRIME: Policy-Reinforced Iterative Multi-agent Execution for Algorithmic Reasoning in Large Language Models

PRIME:用于大型语言模型算法推理的策略强化迭代多智能体执行
Xu, Jiawei, Yu, Zhenyu, Bi, Ziqian, Pham, Minh Duc, Qu, Xiaoyi, Zhang, Danyang
Abstract
Large language models have demonstrated remarkable capabilities across diverse reasoning tasks, yet their performance on algorithmic reasoning remains limited. To handle this limitation, we propose PRIME (Policy-Reinforced Iterative Multi-agent Execution), a framework comprising three specialized agents, an executor for step-by-step reasoning, a verifier for constraint checking, and a coordinator for backtracking control, optimized through group relative policy optimization. For comprehensive evaluation, we introduce PRIME-Bench, the largest algorithmic reasoning benchmark to date, comprising 86 tasks across 12 categories with 51,600 instances. Tasks span sorting algorithms, graph and tree structures, automata and state machines, symbolic reasoning, and constraint-based puzzles, with execution traces reaching over one million steps. Compared to baseline approach, PRIME improves average accuracy from 26.8% to 93.8%, a 250% relative gain. The largest improvements occur on tasks requiring sustained state tracking, with Turing machine simulation improving from 9% to 92% and long division from 16% to 94%. Ablation studies identify iterative verification as the primary contributor, preventing the error propagation that causes baseline approaches to fail catastrophically. Analysis across model scales (8B-120B parameters) reveals that smaller models benefit disproportionately, achieving accuracy comparable to models 8x larger.
Chinese Translation
大型语言模型在多种推理任务中展现了显著的能力,但在算法推理方面的表现仍然有限。为了解决这一限制,我们提出了PRIME(策略强化迭代多智能体执行),这是一个由三个专业代理组成的框架:一个用于逐步推理的执行者、一个用于约束检查的验证者,以及一个用于回溯控制的协调者,通过群体相对策略优化进行优化。为了进行全面评估,我们引入了PRIME-Bench,这是迄今为止最大的算法推理基准,包含86个任务,涵盖12个类别,共有51,600个实例。任务包括排序算法、图和树结构、自动机和状态机、符号推理以及基于约束的难题,执行轨迹超过一百万步。与基线方法相比,PRIME将平均准确率从26.8%提高到93.8%,相对增益达到250%。在需要持续状态跟踪的任务中,改进最大,图灵机模拟的准确率从9%提高到92%,长除法从16%提高到94%。消融研究表明,迭代验证是主要贡献者,防止了导致基线方法灾难性失败的错误传播。对不同模型规模(8B-120B参数)的分析显示,较小的模型受益更大,达到了与8倍更大模型相当的准确率。
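The executor/verifier/coordinator split can be illustrated with a toy loop, here stepping a bubble-sort-style executor whose proposals a verifier must accept before they enter the trace (all names are hypothetical; PRIME's agents are LLM-driven and trained with group relative policy optimization):

```python
def run_with_verification(state, step_fn, verify_fn, max_steps=100):
    """Executor proposes one step at a time; the verifier checks each
    proposal; the coordinator keeps only verified states, so a rejected
    step rolls back to the last good state instead of propagating."""
    history = [state]
    for _ in range(max_steps):
        proposed = step_fn(history[-1])
        if proposed == history[-1]:           # fixed point: execution done
            break
        if verify_fn(history[-1], proposed):  # accept only verified steps
            history.append(proposed)
    return history[-1]

def one_swap_pass(xs):
    """Toy executor: perform the first out-of-order adjacent swap."""
    xs = list(xs)
    for i in range(len(xs) - 1):
        if xs[i] > xs[i + 1]:
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
            return xs
    return xs

def multiset_preserved(old, new):
    """Toy verifier: a sorting step must not add or drop elements."""
    return sorted(old) == sorted(new)
```

The point of the ablation finding is visible even in this toy: without the verify gate, a single corrupted step would enter the trace and every later step would build on it.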
cs.CL / 11 / 2602.11171

Efficient Hyper-Parameter Search for LoRA via Language-aided Bayesian Optimization

通过语言辅助贝叶斯优化进行高效的LoRA超参数搜索
Seong-Eun, Baek, Jung-Mok, Lee, Sung-Bin, Kim, Oh, Tae-Hyun
Abstract
Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enables resource-efficient personalization or specialization, but it comes at the expense of additional hyperparameter tuning. Although LoRA makes fine-tuning efficient, it is highly sensitive to the choice of hyperparameters, and exhaustive hyperparameter search remains computationally very demanding. To address these challenges, we propose a framework that integrates the domain knowledge of pre-trained LLMs into Bayesian Optimization (BO) to efficiently search for LoRA hyperparameters. To leverage the knowledge encoded in LLMs, we repurpose an LLM as a discrete-to-continuous mapping that links the hyperparameters and their domain knowledge to a continuous vector space, where BO is conducted. We design and control the mapping by language prompting: we provide a domain-aware textual prompt describing the relationships among hyperparameters and their respective roles, thereby explicitly injecting domain knowledge about LoRA into the LLM in natural language. We also model the residual information that is hard to describe linguistically in the prompt with an additional learnable token, which helps BO sample more high-performing hyperparameters. In addition, leveraging the observed strong correlation between performance obtained from full and subset training datasets in LoRA training regimes, we introduce proxy training and evaluation on a data subset, further increasing the efficiency of our method. We demonstrate that the hyperparameters found with only about 30 iterations achieve more than a 20% performance improvement over standard hyperparameters selected from about 45,000 combinations.
Chinese Translation
使用低秩适应(LoRA)对大型语言模型(LLMs)进行微调能够实现资源高效的个性化或专业化,但这需要额外的超参数调优。尽管LoRA使微调变得高效,但它对超参数的选择高度敏感,全面的超参数搜索仍然在计算上非常耗费资源。为了解决这些挑战,我们提出了一个框架,将预训练LLMs的领域知识整合到贝叶斯优化(BO)中,以高效地搜索LoRA超参数。为了利用LLMs的知识,我们将LLMs重新用于离散到连续的映射,以将超参数及其领域知识与一个连续的向量空间连接,在该空间中进行BO。我们通过语言提示设计和控制该映射,提供一个领域感知的文本提示,描述超参数之间的关系及其各自的角色;从而,我们以自然语言明确地将关于LoRA的领域知识注入到LLM中。此外,我们使用一个额外的可学习标记来建模在提示中难以用语言描述的残余信息。这有助于BO采样更多高性能的超参数。此外,通过利用在LoRA训练过程中从完整和子集训练数据集中获得的各自性能之间的强相关性观察,我们引入了使用数据子集的代理训练和评估。这进一步提高了我们方法的效率。我们证明,经过大约30次迭代找到的超参数比从约45,000个组合中找到的标准超参数提高了超过20%的性能。
cs.CL / 12 / 2602.11172

Synthesizing the Virtual Advocate: A Multi-Persona Speech Generation Framework for Diverse Linguistic Jurisdictions in Indic Languages

合成虚拟辩护人:一个针对印度语言多样语言管辖区的多角色语音生成框架
Deroy, Aniket
Abstract
Legal advocacy requires a unique combination of authoritative tone, rhythmic pausing for emphasis, and emotional intelligence. This study investigates the performance of the Gemini 2.5 Flash TTS and Gemini 2.5 Pro TTS models in generating synthetic courtroom speeches across five Indic languages: Tamil, Telugu, Bengali, Hindi, and Gujarati. We propose a prompting framework that utilizes Gemini 2.5's native support for the five languages and its context-aware pacing to produce distinct advocate personas. The evolution of Large Language Models (LLMs) has shifted the focus of Text-to-Speech (TTS) technology from basic intelligibility to context-aware, expressive synthesis. In the legal domain, synthetic speech must convey authority and a specific professional persona, a task that becomes significantly more complex in the linguistically diverse landscape of India. The models exhibit a "monotone authority," excelling at procedural information delivery but struggling with the dynamic vocal modulation and emotive gravitas required for persuasive advocacy. Performance dips in Bengali and Gujarati further highlight phonological frontiers for future refinement. This research underscores the readiness of multilingual TTS for procedural legal tasks while identifying the remaining challenges in replicating the persuasive artistry of human legal discourse. The code is available at https://github.com/naturenurtureelite/Synthesizing-the-Virtual-Advocate/tree/main
Chinese Translation
法律辩护需要权威的语气、强调时的节奏停顿以及情感智力的独特结合。本研究探讨了Gemini 2.5 Flash TTS和Gemini 2.5 Pro TTS模型在生成五种印度语言(泰米尔语、泰卢固语、孟加拉语、印地语和古吉拉特语)法庭演讲中的表现。我们提出了一个提示框架,利用Gemini 2.5对五种语言的原生支持及其上下文感知的节奏,生成不同的辩护人角色。大型语言模型(LLMs)的发展使得文本到语音(TTS)技术的重点从基本的可理解性转向上下文感知和富有表现力的合成。在法律领域,合成语音必须传达权威性和特定的专业角色,这一任务在印度这个语言多样化的环境中变得更加复杂。模型表现出“单调的权威”,在程序性信息传递方面表现出色,但在说服性辩护所需的动态声调调节和情感重量方面却显得力不从心。在孟加拉语和古吉拉特语中的表现下降进一步突显了未来改进的音位边界。本研究强调了多语言TTS在程序性法律任务中的准备情况,同时识别了在复制人类法律话语的说服艺术方面仍然存在的挑战。代码可在以下链接获取:https://github.com/naturenurtureelite/Synthesizing-the-Virtual-Advocate/tree/main
cs.CL / 13 / 2602.11173

Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review

作者参与的回应生成与评估:在同行评审中整合作者专业知识和意图
Ruan, Qian, Gurevych, Iryna
Abstract
Author response (rebuttal) writing is a critical stage of scientific peer review that demands substantial author effort. Recent work frames this task as automatic text generation, underusing author expertise and intent. In practice, authors possess domain expertise, author-only information, revision and response strategies--concrete forms of author expertise and intent--to address reviewer concerns, and seek NLP assistance that integrates these signals to support effective response writing in peer review. We reformulate author response generation as an author-in-the-loop task and introduce REspGen, a generation framework that integrates explicit author input, multi-attribute control, and evaluation-guided refinement, together with REspEval, a comprehensive evaluation suite with 20+ metrics covering input utilization, controllability, response quality, and discourse. To support this formulation, we construct Re$^3$Align, the first large-scale dataset of aligned review--response--revision triplets, where revisions provide signals of author expertise and intent. Experiments with state-of-the-art LLMs show the benefits of author input and evaluation-guided refinement, the impact of input design on response quality, and trade-offs between controllability and quality. We make our dataset, generation and evaluation tools publicly available.
Chinese Translation
作者回应(反驳)写作是科学同行评审的一个关键阶段,要求作者付出大量努力。近期的研究将这一任务框定为自动文本生成,未能充分利用作者的专业知识和意图。在实践中,作者拥有领域专业知识、仅限作者的信息、修订和回应策略——这些都是应对审稿人关注的具体形式的作者专业知识和意图,并寻求自然语言处理(NLP)支持,以整合这些信号,帮助有效撰写同行评审中的回应。我们将作者回应生成重新表述为一个作者参与的任务,并介绍了REspGen,一个生成框架,整合了明确的作者输入、多属性控制和评估引导的优化,同时推出REspEval,一个涵盖输入利用、可控性、回应质量和话语的20多个指标的综合评估工具。为了支持这一框架,我们构建了Re$^3$Align,这是第一个大规模的对齐审稿-回应-修订三元组数据集,其中修订提供了作者专业知识和意图的信号。与最先进的语言模型(LLMs)的实验显示了作者输入和评估引导优化的好处、输入设计对回应质量的影响,以及可控性与质量之间的权衡。我们将数据集、生成和评估工具公开发布。
cs.CL / 14 / 2602.11174

The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models

脚本税:衡量多语言模型中由标记化驱动的效率和延迟差异
Dixit, Aradhya, Dixit, Shreem
Abstract
Pretrained multilingual language models are often assumed to be script-agnostic, yet their tokenizers can impose systematic costs on certain writing systems. We quantify this script tax by comparing two orthographic variants with identical linguistic content. Across mBERT and XLM-R, the higher-fragmentation orthography shows a ~3.4x increase in fertility (6.73-6.85 vs. 2.10-2.35 tokens/word), leading to a 16.5x inference slowdown (0.23 vs. 3.8 sentences/second) on identical hardware. Using bits per character (BPC) to avoid the "NLL paradox" from subword fragmentation, we find a substantial increase in information cost: +19.7% for mBERT (8.06->9.65) and +47.1% for XLM-R (12.19->17.94). A round-trip conversion check (CER_rt=0.31) suggests these gaps reflect orthography-conditioned processing rather than mapping noise. Our results highlight tokenization as a key source of inequity in multilingual NLP and motivate script-aware tokenization and pretraining.
Chinese Translation
预训练的多语言模型通常被认为是与脚本无关的,但它们的标记器可能会对某些书写系统施加系统性的成本。我们通过比较两个具有相同语言内容的正字法变体来量化这种脚本税。在 mBERT 和 XLM-R 中,碎片化程度更高的正字法显示出约 3.4 倍的生育率增加(6.73-6.85 对比 2.10-2.35 个标记/词),导致在相同硬件上推理速度降低 16.5 倍(0.23 对比 3.8 句子/秒)。使用每字符比特数(BPC)以避免因子词碎片化带来的“负对数似然悖论”,我们发现信息成本显著增加:mBERT 增加 19.7%(8.06->9.65),XLM-R 增加 47.1%(12.19->17.94)。一次往返转换检查(CER_rt=0.31)表明这些差距反映了以正字法为条件的处理,而非映射噪声。我们的结果突显了标记化作为多语言自然语言处理中的一个关键不平等来源,并激励了对脚本感知的标记化和预训练的研究。
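The two headline quantities, fertility and bits per character, are simple to compute from a tokenizer's output and a sequence's total negative log-likelihood; a stdlib sketch (input representations are assumptions, not the paper's code):

```python
import math

def fertility(tokens, words):
    """Subword fertility: average number of tokens emitted per word."""
    return len(tokens) / len(words)

def bits_per_character(total_nll_nats, text):
    """Bits per character from a sequence's total NLL in nats. Unlike
    per-token NLL, BPC is comparable across tokenizers, sidestepping
    the paradox where heavier fragmentation makes per-token loss look
    artificially low."""
    return total_nll_nats / (math.log(2) * len(text))
```

Normalizing by characters rather than tokens is what lets the paper compare two orthographies whose token counts differ by ~3.4x.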
cs.CL / 15 / 2602.11175

Barriers to Discrete Reasoning with Transformers: A Survey Across Depth, Exactness, and Bandwidth

变压器在离散推理中的障碍:深度、准确性和带宽的调查
Yuan, Michelle, Sun, Weiyi, Rezaeian, Amir H., Singh, Jyotika, Ghoshal, Sandip, Wang, Yao-Ting, Ballesteros, Miguel, Benajiba, Yassine
Abstract
Transformers have become the foundational architecture for a broad spectrum of sequence modeling applications, underpinning state-of-the-art systems in natural language processing, vision, and beyond. However, their theoretical limitations in discrete reasoning tasks, such as arithmetic, logical inference, and algorithmic composition, remain a critical open problem. In this survey, we synthesize recent studies from three theoretical perspectives: circuit complexity, approximation theory, and communication complexity, to clarify the structural and computational barriers that transformers face when performing symbolic computations. By connecting these established theoretical frameworks, we provide an accessible and unified account of why current transformer architectures struggle to implement exact discrete algorithms, even as they excel at pattern matching and interpolation. We review key definitions, seminal results, and illustrative examples, highlighting challenges such as depth constraints, difficulty approximating discontinuities, and bottlenecks in inter-token communication. Finally, we discuss implications for model design and suggest promising directions for overcoming these foundational limitations.
Chinese Translation
变压器已成为广泛序列建模应用的基础架构,支撑着自然语言处理、视觉等领域的最先进系统。然而,它们在离散推理任务(如算术、逻辑推理和算法组合)中的理论局限性仍然是一个关键的未解问题。在本次调查中,我们从电路复杂性、近似理论和通信复杂性三个理论视角综合了近期的研究,以阐明变压器在执行符号计算时所面临的结构性和计算性障碍。通过连接这些已建立的理论框架,我们提供了一个易于理解且统一的解释,说明为什么当前的变压器架构在实现精确的离散算法方面存在困难,即便它们在模式匹配和插值方面表现出色。我们回顾了关键定义、开创性结果和说明性示例,突出了深度限制、近似不连续性困难和跨标记通信瓶颈等挑战。最后,我们讨论了对模型设计的影响,并建议了克服这些基础性局限性的有前景的方向。
cs.CL / 16 / 2602.11176

Evaluating Few-Shot Temporal Reasoning of LLMs for Human Activity Prediction in Smart Environments

评估大型语言模型在智能环境中对人类活动预测的少样本时间推理能力
Doctorarastoo, Maral, Flanigan, Katherine A., Bergés, Mario, McComb, Christopher
Abstract
Anticipating human activities and their durations is essential in applications such as smart-home automation, simulation-based architectural and urban design, activity-based transportation system simulation, and human-robot collaboration, where adaptive systems must respond to human activities. Existing data-driven agent-based models--from rule-based to deep learning--struggle in low-data environments, limiting their practicality. This paper investigates whether large language models, pre-trained on broad human knowledge, can fill this gap by reasoning about everyday activities from compact contextual cues. We adopt a retrieval-augmented prompting strategy that integrates four sources of context--temporal, spatial, behavioral history, and persona--and evaluate it on the CASAS Aruba smart-home dataset. The evaluation spans two complementary tasks: next-activity prediction with duration estimation, and multi-step daily sequence generation, each tested with various numbers of few-shot examples provided in the prompt. Analyzing few-shot effects reveals how much contextual supervision is sufficient to balance data efficiency and predictive accuracy, particularly in low-data environments. Results show that large language models exhibit strong inherent temporal understanding of human behavior: even in zero-shot settings, they produce coherent daily activity predictions, while adding one or two demonstrations further refines duration calibration and categorical accuracy. Beyond a few examples, performance saturates, indicating diminishing returns. Sequence-level evaluation confirms consistent temporal alignment across few-shot conditions. These findings suggest that pre-trained language models can serve as promising temporal reasoners, capturing both recurring routines and context-dependent behavioral variations, thereby strengthening the behavioral modules of agent-based models.
Chinese Translation
预测人类活动及其持续时间在智能家居自动化、基于仿真的建筑与城市设计、基于活动的交通系统仿真以及人机协作等应用中至关重要,这些自适应系统必须对人类活动做出响应。现有的数据驱动代理模型——从基于规则到深度学习——在低数据环境中表现不佳,限制了它们的实用性。本文探讨了大型语言模型(Large Language Models, LLMs)是否能够通过从紧凑的上下文线索中推理日常活动来填补这一空白,这些模型在广泛的人类知识上进行了预训练。我们采用了一种增强检索的提示策略,整合了四种上下文来源——时间、空间、行为历史和个性,并在CASAS Aruba智能家居数据集上进行了评估。评估涵盖了两个互补任务:下一活动预测与持续时间估计,以及多步骤日常序列生成,每个任务在提示中提供了不同数量的少样本示例。对少样本效应的分析揭示了在低数据环境中,多少上下文监督是足够的,以平衡数据效率和预测准确性。结果表明,大型语言模型展现出对人类行为的强大内在时间理解:即使在零样本设置中,它们也能生成连贯的日常活动预测,而增加一到两个示例进一步改善了持续时间校准和类别准确性。超过少量示例后,性能趋于饱和,表明收益递减。序列级评估确认了在少样本条件下的一致时间对齐。这些发现表明,预训练语言模型可以作为有前景的时间推理者,捕捉到重复的日常活动和上下文依赖的行为变化,从而增强基于代理模型的行为模块。
cs.CL / 17 / 2602.11177

What Do LLMs Know About Alzheimer's Disease? Fine-Tuning, Probing, and Data Synthesis for AD Detection

大型语言模型对阿尔茨海默病的认知:针对阿尔茨海默病检测的微调、探测与数据合成
Jiang, Lei, Zhou, Yue, Parde, Natalie
Abstract
Reliable early detection of Alzheimer's disease (AD) is challenging, particularly due to limited availability of labeled data. While large language models (LLMs) have shown strong transfer capabilities across domains, adapting them to the AD domain through supervised fine-tuning remains largely unexplored. In this work, we fine-tune an LLM for AD detection and investigate how task-relevant information is encoded within its internal representations. We employ probing techniques to analyze intermediate activations across transformer layers, and we observe that, after fine-tuning, the probing values of specific words and special markers change substantially, indicating that these elements assume a crucial role in the model's improved detection performance. Guided by this insight, we design a curated set of task-aware special markers and train a sequence-to-sequence model as a data-synthesis tool that leverages these markers to generate structurally consistent and diagnostically informative synthetic samples. We evaluate the synthesized data both intrinsically and by incorporating it into downstream training pipelines.
Chinese Translation
阿尔茨海默病(AD)的可靠早期检测具有挑战性,特别是由于标注数据的有限可用性。尽管大型语言模型(LLMs)在不同领域表现出强大的迁移能力,但通过监督微调将其适应于阿尔茨海默病领域仍然未得到充分探索。在本研究中,我们对LLM进行微调以进行阿尔茨海默病检测,并探讨任务相关信息如何在其内部表示中编码。我们采用探测技术分析变换器层中的中间激活,并观察到,在微调后,特定单词和特殊标记的探测值发生了显著变化,表明这些元素在模型的检测性能提升中发挥了关键作用。在这一洞察的指导下,我们设计了一组经过精心策划的任务感知特殊标记,并训练了一个序列到序列模型作为数据合成工具,利用这些标记生成结构一致且具有诊断信息的合成样本。我们对合成数据进行了内在评估,并将其纳入下游训练流程中进行评估。
cs.CL / 18 / 2602.11179

From Instruction to Output: The Role of Prompting in Modern NLG

从指令到输出:提示在现代自然语言生成中的作用
Zaib, Munazza, Alhazmi, Elaf
Abstract
Prompt engineering has emerged as an integral technique for extending the strengths and abilities of Large Language Models (LLMs) to gain significant performance gains in various Natural Language Processing (NLP) tasks. This approach, which requires instructions to be composed in natural language to bring out the knowledge from LLMs in a structured way, has driven breakthroughs in various NLP tasks. Yet there is still no structured framework or coherent understanding of the varied prompt engineering methods and techniques, particularly in the field of Natural Language Generation (NLG). This survey aims to help fill that gap by outlining recent developments in prompt engineering, and their effect on different NLG tasks. It reviews recent advances in prompting methods and their impact on NLG tasks, presenting prompt design as an input-level control mechanism that complements fine-tuning and decoding approaches. The paper introduces a taxonomy of prompting paradigms, a decision framework for prompt selection based on varying factors for the practitioners, outlines emerging trends and challenges, and proposes a framework that links design, optimization, and evaluation to support more controllable and generalizable NLG.
Chinese Translation
提示工程已成为一种重要技术,旨在扩展大型语言模型(Large Language Models, LLMs)的优势和能力,以在各种自然语言处理(Natural Language Processing, NLP)任务中获得显著的性能提升。这种方法要求以自然语言构建指令,以结构化的方式引导LLMs的知识,从而推动了各种NLP任务的突破。然而,目前仍缺乏一个结构化的框架或对不同提示工程方法和技术的连贯理解,尤其是在自然语言生成(Natural Language Generation, NLG)领域。本文旨在通过概述提示工程的最新发展及其对不同NLG任务的影响来填补这一空白。我们回顾了提示方法的最新进展及其对NLG任务的影响,将提示设计呈现为一种输入级控制机制,补充微调和解码方法。本文引入了提示范式的分类法,为从业者提供基于不同因素的提示选择决策框架,概述了新兴趋势和挑战,并提出了一个将设计、优化和评估联系起来的框架,以支持更可控和可推广的NLG。
cs.CL / 19 / 2602.11180

Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions

大型语言模型对齐的机制可解释性:进展、挑战与未来方向
Naseem, Usman
Abstract
Large language models (LLMs) have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque. Mechanistic interpretability (i.e., the systematic study of how neural networks implement algorithms through their learned representations and computational structures) has emerged as a critical research direction for understanding and aligning these models. This paper surveys recent progress in mechanistic interpretability techniques applied to LLM alignment, examining methods ranging from circuit discovery to feature visualization, activation steering, and causal intervention. We analyze how interpretability insights have informed alignment strategies including reinforcement learning from human feedback (RLHF), constitutional AI, and scalable oversight. Key challenges are identified, including the superposition hypothesis, polysemanticity of neurons, and the difficulty of interpreting emergent behaviors in large-scale models. We propose future research directions focusing on automated interpretability, cross-model generalization of circuits, and the development of interpretability-driven alignment techniques that can scale to frontier models.
Chinese Translation
大型语言模型(LLMs)在多种任务中展现出卓越的能力,但其内部决策过程仍然在很大程度上不透明。机制可解释性(即系统研究神经网络如何通过其学习的表示和计算结构实现算法)已成为理解和对齐这些模型的重要研究方向。本文回顾了应用于LLM对齐的机制可解释性技术的最新进展,考察了从电路发现到特征可视化、激活引导和因果干预等方法。我们分析了可解释性洞察如何为对齐策略提供信息,包括基于人类反馈的强化学习(RLHF)、宪法人工智能和可扩展监督等。我们识别出一些关键挑战,包括叠加假设、神经元的多义性以及在大规模模型中解释涌现行为的困难。我们提出了未来的研究方向,重点关注自动化可解释性、电路的跨模型泛化以及能够扩展到前沿模型的以可解释性为驱动的对齐技术的发展。
cs.CL / 20 / 2602.11181

Code Mixologist : A Practitioner's Guide to Building Code-Mixed LLMs

代码混合师:构建代码混合大型语言模型的实践指南
Gupta, Himanshu, Jayarao, Pratik, Dwivedi, Chaitanya, Varshney, Neeraj
Abstract
Code-mixing and code-switching (CSW) remain challenging phenomena for large language models (LLMs). Despite recent advances in multilingual modeling, LLMs often struggle in mixed-language settings, exhibiting systematic degradation in grammaticality, factuality, and safety behavior. This work provides a comprehensive overview of CSW research in modern large language model settings. We introduce a unifying taxonomy that organizes prior work along dimensions of data, modeling, and evaluation, and we distill these findings into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. We review modeling approaches ranging from CSW-tailored pre-training and task-specific post-training to prompting strategies and in-context learning. We analyze current evaluation practices, highlighting sources of instability and limited reproducibility, and we catalog existing benchmarks while critically examining their linguistic coverage and English-centric biases. Finally, we discuss emerging safety concerns, including use of code-mixing as a mechanism for bypassing model safeguards, and identify open research challenges.
Chinese Translation
代码混合和代码切换(CSW)仍然是大型语言模型(LLMs)面临的挑战性现象。尽管在多语言建模方面取得了近期进展,LLMs在混合语言环境中仍然表现不佳,系统性地降低了语法性、事实性和安全性行为。本文提供了现代大型语言模型环境中CSW研究的全面概述。我们引入了一个统一的分类法,将先前的研究按数据、建模和评估的维度进行组织,并将这些发现提炼为一个实用的行动建议手册,以便于构建、调整和评估具备CSW能力的LLMs。我们回顾了从针对CSW的预训练和特定任务的后训练到提示策略和上下文学习的建模方法。我们分析了当前的评估实践,强调了不稳定性和有限可重复性的来源,并对现有基准进行了分类,同时批判性地审视了它们的语言覆盖范围和以英语为中心的偏见。最后,我们讨论了新出现的安全问题,包括将代码混合作为绕过模型安全措施的机制,并识别出开放的研究挑战。
cs.CL / 21 / 2602.11182

MetaMem: Evolving Meta-Memory for Knowledge Utilization through Self-Reflective Symbolic Optimization

MetaMem:通过自我反思符号优化演化的元记忆以促进知识利用
Xin, Haidong, Li, Xinze, Liu, Zhenghao, Yan, Yukun, Wang, Shuo, Yang, Cheng, Gu, Yu, Yu, Ge, Sun, Maosong
Abstract
Existing memory systems enable Large Language Models (LLMs) to support long-horizon human-LLM interactions by persisting historical interactions beyond limited context windows. However, while recent approaches have succeeded in constructing effective memories, they often disrupt the inherent logical and temporal relationships within interaction sessions, resulting in fragmented memory units and degraded reasoning performance. In this paper, we propose MetaMem, a novel framework that augments memory systems with a self-evolving meta-memory, aiming to teach LLMs how to effectively utilize memorized knowledge. During meta-memory optimization, MetaMem iteratively distills transferable knowledge utilization experiences across different tasks by self-reflecting on reasoning processes and performing actions to update the current meta-memory state. The accumulated meta-memory units serve as explicit knowledge utilization experiences, guiding the LLM to systematically identify and integrate critical evidence from scattered memory fragments. Extensive experiments demonstrate the effectiveness of MetaMem, which significantly outperforms strong baselines by over 3.6%. All codes and datasets are available at https://github.com/OpenBMB/MetaMem.
Chinese Translation
现有的记忆系统使大型语言模型(LLMs)能够支持长期的人类与LLM交互,通过超越有限上下文窗口来持久化历史交互。然而,尽管最近的方法在构建有效记忆方面取得了成功,但它们往往破坏了交互会话中固有的逻辑和时间关系,导致记忆单元碎片化和推理性能下降。本文提出了MetaMem,一个新颖的框架,通过自我演化的元记忆增强记忆系统,旨在教会LLM如何有效利用记忆中的知识。在元记忆优化过程中,MetaMem通过自我反思推理过程并采取行动更新当前的元记忆状态,迭代提炼不同任务之间可转移的知识利用经验。累积的元记忆单元作为明确的知识利用经验,指导LLM系统性地识别和整合来自分散记忆碎片的关键证据。大量实验表明,MetaMem的有效性,其表现显著优于强基线,提升幅度超过3.6%。所有代码和数据集可在 https://github.com/OpenBMB/MetaMem 获取。
cs.CL / 22 / 2602.11198

DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks' Developer Experience Through a Novel Relational Schema Mapping Task

DDL2PropBank Agent:通过一种新颖的关系模式映射任务评估多代理框架的开发者体验
Ahmed, Shafiuddin Rehan, Wei, Wei
Abstract
Multi-agent frameworks promise to simplify LLM-driven software development, yet there is no principled way to evaluate their developer experience in a controlled setting. We introduce DDL2PropBank, a novel benchmark task that maps relational database schemas to PropBank rolesets, requiring autonomous retrieval of candidate frames and fine-grained linguistic reasoning over table names, columns, and relations. Using the Agent-as-a-Tool pattern, we implement identical agent logic across 10 frameworks and evaluate along two dimensions: (i) code complexity via static analysis, and (ii) AI-assistability -- the extent to which LLMs can autonomously generate correct, framework-specific code. Our results reveal a threefold complexity spectrum, with Pydantic AI and Agno requiring the least implementation overhead. For AI-assistability, structural alignment scores reliably proxy runtime success for frameworks with single canonical patterns, but overestimate correctness for multi-pattern frameworks. Agno emerges as the strongest overall performer, combining lowest complexity with highest structural alignment and 83% pass@1.
Chinese Translation
多代理框架承诺简化基于大语言模型(LLM)的软件开发,但在受控环境中评估其开发者体验尚无原则性的方法。我们引入了DDL2PropBank,这是一项新颖的基准任务,旨在将关系数据库模式映射到PropBank角色集,要求自主检索候选框架并对表名、列和关系进行细粒度的语言推理。采用Agent-as-a-Tool模式,我们在10个框架中实现了相同的代理逻辑,并从两个维度进行评估:(i)通过静态分析评估代码复杂性,以及(ii)AI辅助能力——即LLMs能够自主生成正确的框架特定代码的程度。我们的结果揭示了三种复杂性谱系,其中Pydantic AI和Agno需要的实现开销最小。在AI辅助能力方面,结构对齐分数可靠地代理了具有单一规范模式的框架的运行时成功,但对多模式框架的正确性进行了高估。Agno作为整体表现最强的框架,结合了最低的复杂性、最高的结构对齐和83%的通过率@1。
cs.CL / 23 / 2602.11199

When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification

何时以及询问什么:AskBench 和基于评分标准的强化学习验证(RLVR)用于大语言模型的澄清
Zhao, Jiale, Fang, Ke, Cheng, Lu
Abstract
Large language models (LLMs) often respond even when prompts omit critical details or include misleading information, leading to hallucinations or reinforced misconceptions. We study how to evaluate and improve LLMs' ability to decide when and what to ask for clarification without sacrificing task performance. We introduce AskBench, an interactive benchmark that converts standard QA pairs into multi-turn interactions with explicit checkpoints. A unified judge loop evaluates final answers and simulates user responses as needed. AskBench covers two settings: AskMind, with intent-deficient queries requiring clarification, and AskOverconfidence, with queries containing false premises that must be identified and corrected. We further propose rubric-guided reinforcement learning with verifier-based rewards (RLVR), which uses structured rubrics to encourage targeted clarification. Experiments show consistent improvements in accuracy, rubric adherence, and interaction efficiency, with strong generalization to unseen domains.
Chinese Translation
大型语言模型(LLMs)在提示中遗漏关键细节或包含误导性信息时,仍然会作出回应,这导致了幻觉或强化了错误观念。我们研究如何评估和提高LLMs在不牺牲任务性能的情况下决定何时以及询问什么以获得澄清的能力。我们引入了AskBench,这是一个互动基准,将标准问答对转换为具有明确检查点的多轮交互。一个统一的评判循环评估最终答案,并根据需要模拟用户响应。AskBench涵盖两种设置:AskMind,针对缺乏意图的查询需要澄清,以及AskOverconfidence,针对包含虚假前提的查询,必须识别并纠正。我们进一步提出了基于评分标准的强化学习与验证者奖励(RLVR),该方法利用结构化评分标准来鼓励有针对性的澄清。实验表明,在准确性、评分标准遵循性和交互效率方面均有持续改进,并且在未见领域具有强泛化能力。
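The judge loop that simulates user responses can be sketched as a minimal clarification loop (field names, the one-question-per-turn policy, and the termination rule are assumptions, not the benchmark's code):

```python
def clarification_loop(query, answer_fn, user_fn, required, max_turns=4):
    """Ask for each missing required field before answering; user_fn
    simulates the judge's user response to a clarifying question. The
    real benchmark also scores the final answer and checkpoints each
    turn (omitted here)."""
    known = dict(query)
    for _ in range(max_turns):
        missing = [f for f in required if f not in known]
        if not missing:
            break                                 # enough info: answer now
        known[missing[0]] = user_fn(missing[0])   # ask one clarifying question
    return answer_fn(known)
```

A model that answers immediately corresponds to skipping the loop body entirely, which is the failure mode AskMind is built to expose.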
cs.CL / 24 / 2602.11201

Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

链式思维推理中忠实度衰减的机制证据
Ye, Donald, Loffgren, Max, Kotadia, Om, Wong, Linus
Abstract
Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer, or are merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process. Our approach corrupts individual reasoning steps from the explanation and measures how much the model's confidence in its answer drops, to determine if a step is truly important. By standardizing these measurements, NLDD enables rigorous cross-model comparison across different architectures. Testing three model families across syntactic, logical, and arithmetic tasks, we discover a consistent Reasoning Horizon (k*) at 70--85% of chain length, beyond which reasoning tokens have little or negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain. NLDD offers a way to measure when CoT matters.
Chinese Translation
链式思维(Chain-of-Thought, CoT)解释被广泛用于阐释语言模型如何解决复杂问题,但仍不清楚这些逐步解释是否反映了模型实际得出答案的过程,或仅仅是事后解释。我们提出了归一化对数差异衰减(Normalized Logit Difference Decay, NLDD),这一指标用于衡量单个推理步骤是否忠实于模型的决策过程。我们的方法通过破坏解释中的单个推理步骤,并测量模型对其答案的信心下降程度,以确定某一步骤是否真正重要。通过标准化这些测量,NLDD使得不同架构之间的严格跨模型比较成为可能。在对三种模型家族进行语法、逻辑和算术任务的测试中,我们发现了一致的推理视野(Reasoning Horizon, k*),其范围为链长的70%至85%,超过该范围后,推理标记对最终答案的影响微乎其微或为负。我们还发现,模型可以编码正确的内部表示,但在任务上完全失败。这些结果表明,仅凭准确性无法揭示模型是否真正通过其链进行推理。NLDD提供了一种衡量链式思维何时重要的方法。
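The step-corruption idea behind NLDD can be sketched in a few lines. The abstract does not give the exact formula, so the normalization below (per-step confidence drop divided by the uncorrupted logit difference) and the horizon threshold `eps` are illustrative assumptions, not the paper's definition:

```python
def normalized_logit_difference_decay(clean_diff, corrupted_diffs):
    """Hypothetical NLDD sketch: for each reasoning step, normalize the drop
    in the answer-token logit margin (caused by corrupting that step) by the
    uncorrupted margin, giving a scale-free per-step importance score."""
    return [(clean_diff - d) / clean_diff for d in corrupted_diffs]

def reasoning_horizon(decays, eps=0.05):
    """Index of the first step whose corruption barely moves the answer logit,
    mirroring the paper's Reasoning Horizon k* beyond which steps matter little."""
    for i, d in enumerate(decays):
        if abs(d) < eps:
            return i
    return len(decays)
```

With these toy numbers, corrupting either of the first two steps clearly hurts the answer margin while the later steps do not, so the horizon falls at index 2.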
cs.CL / 25 / 2602.11221

The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task

图像-文本声明的自动验证(AVerImaTeC)共享任务
Cao, Rui, Deng, Zhenyun, Chen, Yulong, Schlichtkrull, Michael, Vlachos, Andreas
Abstract
The Automatic Verification of Image-Text Claims (AVerImaTeC) shared task aims to advance system development for retrieving evidence and verifying real-world image-text claims. Participants were allowed to either employ external knowledge sources, such as web search engines, or leverage the curated knowledge store provided by the organizers. System performance was evaluated using the AVerImaTeC score, defined as a conditional verdict accuracy in which a verdict is considered correct only when the associated evidence score exceeds a predefined threshold. The shared task attracted 14 submissions during the development phase and 6 submissions during the testing phase. All participating systems in the testing phase outperformed the baseline provided. The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455. This paper provides a detailed description of the shared task, presents the complete evaluation results, and discusses key insights and lessons learned.
Chinese Translation
图像-文本声明的自动验证(AVerImaTeC)共享任务旨在推动系统开发,以检索证据和验证现实世界中的图像-文本声明。参与者可以使用外部知识源,如网络搜索引擎,或利用组织者提供的策划知识库。系统性能通过 AVerImaTeC 分数进行评估,该分数定义为条件裁决准确率,其中只有当相关证据分数超过预定义阈值时,裁决才被视为正确。在开发阶段,共有 14 个提交,在测试阶段则有 6 个提交。所有参与测试阶段的系统均优于提供的基线。获胜团队 HUMANE 的 AVerImaTeC 分数为 0.5455。本文详细描述了共享任务,呈现了完整的评估结果,并讨论了关键见解和经验教训。
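The conditional verdict accuracy described above is straightforward to compute. A minimal sketch, assuming examples are (evidence score, verdict-correct) pairs; the 0.25 threshold is an illustrative default, the shared task fixes its own:

```python
def averimatec_score(examples, threshold=0.25):
    """Conditional verdict accuracy: a verdict counts as correct only when the
    associated evidence score exceeds the threshold."""
    correct = sum(1 for evidence_score, verdict_ok in examples
                  if evidence_score > threshold and verdict_ok)
    return correct / len(examples)
```

Note that a correct verdict backed by weak evidence scores zero, which is exactly what separates this metric from plain verdict accuracy.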
cs.CL / 26 / 2602.11238

SurveyLens: A Research Discipline-Aware Benchmark for Automatic Survey Generation

SurveyLens:一种研究学科意识的自动调查生成基准
Guo, Beichen, Wen, Zhiyuan, Gu, Jia, Wang, Senzhang, Shi, Haochen, Yang, Ruosong, Liu, Shuaiqi
Abstract
The exponential growth of scientific literature has driven the evolution of Automatic Survey Generation (ASG) from simple pipelines to multi-agent frameworks and commercial Deep Research agents. However, current ASG evaluation methods rely on generic metrics and are heavily biased toward Computer Science (CS), failing to assess whether ASG methods adhere to the distinct standards of various academic disciplines. Consequently, researchers, especially those outside CS, lack clear guidance on using ASG systems to yield high-quality surveys compliant with specific discipline standards. To bridge this gap, we introduce SurveyLens, the first discipline-aware benchmark evaluating ASG methods across diverse research disciplines. We construct SurveyLens-1k, a curated dataset of 1,000 high-quality human-written surveys spanning 10 disciplines. Subsequently, we propose a dual-lens evaluation framework: (1) Discipline-Aware Rubric Evaluation, which utilizes LLMs with human preference-aligned weights to assess adherence to domain-specific writing standards; and (2) Canonical Alignment Evaluation to rigorously measure content coverage and synthesis quality against human-written survey papers. We conduct extensive experiments by evaluating 11 state-of-the-art ASG methods on SurveyLens, including Vanilla LLMs, ASG systems, and Deep Research agents. Our analysis reveals the distinct strengths and weaknesses of each paradigm across fields, providing essential guidance for selecting tools tailored to specific disciplinary requirements.
Chinese Translation
科学文献的指数增长推动了自动调查生成(ASG)从简单管道到多代理框架以及商业深度研究代理的发展。然而,目前的ASG评估方法依赖于通用指标,并且严重偏向计算机科学(CS),未能评估ASG方法是否遵循各个学科的独特标准。因此,研究人员,尤其是那些不在计算机科学领域的研究人员,缺乏明确的指导,以使用ASG系统生成符合特定学科标准的高质量调查。为了解决这一问题,我们提出了SurveyLens,这是第一个评估ASG方法在不同研究学科中表现的学科意识基准。我们构建了SurveyLens-1k,这是一个包含1,000个高质量人类撰写调查的策划数据集,涵盖10个学科。随后,我们提出了一个双重评估框架:(1)学科意识评分评估,利用与人类偏好对齐的权重的LLMs来评估对特定领域写作标准的遵循;(2)规范对齐评估,严格测量内容覆盖和合成质量与人类撰写的调查论文的对比。我们通过在SurveyLens上评估11种最先进的ASG方法进行广泛实验,包括原始LLMs、ASG系统和深度研究代理。我们的分析揭示了每种范式在不同领域的独特优缺点,为选择符合特定学科要求的工具提供了重要指导。
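At its core, the Discipline-Aware Rubric Evaluation reduces to a weighted aggregate of per-criterion scores. A minimal sketch, assuming the human preference-aligned weights act as normalized multipliers (the paper's actual aggregation may differ):

```python
def rubric_score(scores, weights):
    """Weighted average of per-criterion LLM judge scores, where the weights
    encode human preferences for a given discipline (hypothetical criteria)."""
    total_w = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total_w
```

Changing the weight profile per discipline is what makes the same judge scores yield discipline-specific rankings.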
cs.CL / 27 / 2602.11305

Are Aligned Large Language Models Still Misaligned?

对齐的大型语言模型仍然存在不对齐问题吗?
Naseem, Usman, Kashyap, Gautam Siddharth, Ali, Rafiq, Shabbir, Ebad, Ray, Sushant Kumar, Mohammad, Abdullah, Seth, Agrima
Abstract
Misalignment in Large Language Models (LLMs) arises when model behavior diverges from human expectations and fails to simultaneously satisfy the safety, value, and cultural dimensions that must co-occur in real-world settings to solve a real-world query. Existing misalignment benchmarks, such as INSECURE CODE (safety-centric), VALUEACTIONLENS (value-centric), and CULTURALHERITAGE (culture-centric), evaluate misalignment along individual dimensions, preventing simultaneous evaluation. To address this gap, we introduce Mis-Align Bench, a unified benchmark for analyzing misalignment across safety, value, and cultural dimensions. First, we construct SAVACU, an English misaligned-aligned dataset of 382,424 samples spanning 112 domains (or labels), by reclassifying prompts from the LLM-PROMPT-DATASET via a taxonomy into 14 safety domains, 56 value domains, and 42 cultural domains using Mistral-7B-Instruct-v0.3, and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based fingerprinting to avoid duplication. We then pair prompts with misaligned and aligned responses via two-stage rejection sampling to enforce quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs, enabling systematic evaluation of misalignment along the three dimensions. Empirically, single-dimension models achieve high Coverage (up to 97.6%) but incur a False Failure Rate above 50% and lower Alignment Scores (63%-66%) under joint conditions.
Chinese Translation
大型语言模型(LLMs)中的不对齐问题出现于模型行为偏离人类预期,并未能同时满足安全、价值和文化维度,而这些维度在现实世界中必须共同存在以解决实际问题。现有的不对齐基准,如 INSECURE CODE(以安全为中心)、VALUEACTIONLENS(以价值为中心)和 CULTURALHERITAGE(以文化为中心),依赖于沿单一维度评估不对齐,阻碍了同时评估的可能性。为了解决这一问题,我们提出了 Mis-Align Bench,这是一个用于分析安全、价值和文化维度不对齐的统一基准。首先,我们构建了 SAVACU,一个包含 382,424 个样本的英语不对齐-对齐数据集,涵盖 112 个领域(或标签),通过对 LLM-PROMPT-DATASET 中的提示进行分类,使用 Mistral-7B-Instruct-v0.3 将其重新分类为 14 个安全领域、56 个价值领域和 42 个文化领域,并通过 Llama-3.1-8B-Instruct 结合基于 SimHash 的指纹扩展低资源领域,以避免重复。进一步地,我们通过两阶段拒绝采样将提示与不对齐和对齐的响应配对,以确保质量。其次,我们对通用、微调和开放权重的 LLM 进行基准测试,从而实现对三个维度下不对齐的系统评估。实证结果表明,单维度模型在覆盖率上达到高达 97.6%,但在联合条件下出现超过 50% 的虚假失败率和较低的对齐评分(63%-66%)。
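The SimHash fingerprinting used to deduplicate the expanded SAVACU prompts can be sketched with a standard 64-bit SimHash. The whitespace tokenization, md5 token hashing, and Hamming-distance threshold below are illustrative choices, not the paper's:

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash: each token votes +1/-1 per bit via a stable md5 hash;
    the fingerprint keeps the sign of each bit's vote tally."""
    v = [0] * bits
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def is_near_duplicate(a, b, max_dist=3):
    """Near-duplicate test: small Hamming distance between fingerprints."""
    return hamming(simhash(a), simhash(b)) <= max_dist
```

Unlike exact hashing, similar prompts land at nearby fingerprints, so a small Hamming radius catches paraphrase-level duplicates cheaply.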
cs.CL / 28 / 2602.11328

Evaluating Alignment of Behavioral Dispositions in LLMs

评估大型语言模型中的行为倾向一致性
Taubenfeld, Amir, Gekhman, Zorik, Nezry, Lior, Feldman, Omri, Harris, Natalie, Reddy, Shashir, Stella, Romina, Goldstein, Ariel, Croak, Marian, Matias, Yossi, Feder, Amir
Abstract
As LLMs integrate into our daily lives, understanding their behavior becomes essential. In this work, we focus on behavioral dispositions, the underlying tendencies that shape responses in social contexts, and introduce a framework to study how closely the dispositions expressed by LLMs align with those of humans. Our approach is grounded in established psychological questionnaires but adapts them for LLMs by transforming human self-report statements into Situational Judgment Tests (SJTs). These SJTs assess behavior by eliciting natural recommendations in realistic user-assistant scenarios. We generate 2,500 SJTs, each validated by three human annotators, and collect preferred actions from 10 annotators per SJT, from a large pool of 550 participants. In a comprehensive study involving 25 LLMs, we find that models often do not reflect the distribution of human preferences: (1) in scenarios with low human consensus, LLMs consistently exhibit overconfidence in a single response; (2) when human consensus is high, smaller models deviate significantly, and even some frontier models do not reflect the consensus in 15-20% of cases; (3) traits can exhibit cross-LLM patterns, e.g., LLMs may encourage emotion expression in contexts where human consensus favors composure. Lastly, mapping psychometric statements directly to behavioral scenarios presents a unique opportunity to evaluate the predictive validity of self-reports, revealing considerable gaps between LLMs' stated values and their revealed behavior.
Chinese Translation
随着大型语言模型(LLMs)逐渐融入我们的日常生活,理解它们的行为变得至关重要。在本研究中,我们关注行为倾向——塑造社会情境中反应的潜在倾向,并引入一个框架来研究LLMs所表达的倾向与人类倾向的对齐程度。我们的方法基于已有的心理学问卷,但通过将人类自我报告的陈述转化为情境判断测试(Situational Judgment Tests, SJTs)来适应LLMs。这些SJTs通过在现实的用户-助手场景中引导自然推荐来评估行为。我们生成了2500个SJTs,每个SJT由三名人工标注者验证,并从550名参与者的大池中收集每个SJT的10名标注者的首选行动。在涉及25个LLMs的综合研究中,我们发现模型通常未能反映人类偏好的分布:(1)在低人类共识的情境中,LLMs始终表现出对单一反应的过度自信;(2)在人类共识较高时,较小的模型显著偏离,甚至一些前沿模型在15-20%的情况下未能反映共识;(3)特征可能表现出跨LLM的模式,例如,LLMs可能在那些人类共识倾向于冷静的情境中鼓励情感表达。最后,将心理测量陈述直接映射到行为场景提供了一个独特的机会来评估自我报告的预测效度,揭示了LLMs所声明的价值观与其实际行为之间的显著差距。
cs.CL / 29 / 2602.11358

When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing

当模型自我审视时:自我指涉处理中的词汇激活对应关系
Dadfar, Zachary Pedram
Abstract
Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.
Chinese Translation
大型语言模型在被提示进行自我审视时产生丰富的内省语言,但这种语言是否反映内部计算或复杂的虚构仍不明确。我们展示了自我指涉词汇与并发激活动态之间的关系,并且这种对应关系特定于自我指涉处理。我们引入了拉动方法(Pull Methodology),一种通过格式工程引发扩展自我审视的协议,并利用该方法识别出在激活空间中区分自我指涉与描述性处理的方向。该方向与已知的拒绝方向正交,定位于模型深度的6.25%,并在用于引导时对内省输出产生因果影响。当模型产生“循环”(loop)词汇时,其激活表现出更高的自相关性(r = 0.44, p = 0.002);当它们在引导下产生“闪烁”(shimmer)词汇时,激活变异性增加(r = 0.36, p = 0.002)。关键是,在非自我指涉的上下文中,相同的词汇尽管频率高达九倍,却没有激活对应关系。Qwen 2.5-32B在没有共享训练的情况下,独立发展出不同的内省词汇,追踪不同的激活指标,而这些在描述性对照中均不存在。研究结果表明,在适当条件下,变换器模型中的自我报告可以可靠地追踪内部计算状态。
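The activation statistic behind the "loop"-vocabulary finding (r = 0.44) is autocorrelation; a minimal sketch of a lag-1 measurement over an activation time series (the paper's exact choice of series and lag is not specified in the abstract):

```python
def lag1_autocorrelation(xs):
    """Lag-1 autocorrelation of a 1-D activation series: high values mean
    successive activations resemble each other, i.e. the dynamics 'loop'."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den
```

A monotone ramp gives moderate positive autocorrelation, while white noise hovers near zero, which is the contrast the vocabulary-activation correspondence relies on.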
cs.CL / 30 / 2602.11361

Finding the Cracks: Improving LLMs Reasoning with Paraphrastic Probing and Consistency Verification

发现裂缝:通过释义探测和一致性验证提升大型语言模型的推理能力
Shi, Weili, Guo, Dongliang, Yang, Lehan, Wang, Tianlong, Yuan, Hanzhang, Li, Sheng
Abstract
Large language models have demonstrated impressive performance across a variety of reasoning tasks. However, their problem-solving ability often declines on more complex tasks due to hallucinations and the accumulation of errors within intermediate steps. Recent work has introduced the notion of critical tokens: tokens in the reasoning process that exert significant influence on subsequent steps. Prior studies suggest that replacing critical tokens can refine reasoning trajectories. Nonetheless, reliably identifying and exploiting critical tokens remains challenging. To address this, we propose the Paraphrastic Probing and Consistency Verification (PPCV) framework. PPCV operates in two stages. In the first stage, we roll out an initial reasoning path from the original question and then concatenate paraphrased versions of the question with this reasoning path. We then identify critical tokens based on mismatches between the predicted top-1 token and the expected token in the reasoning path, and apply a criterion to confirm the final critical token. In the second stage, we substitute critical tokens with candidate alternatives and roll out new reasoning paths for both the original and paraphrased questions. The final answer is determined by checking the consistency of outputs across these parallel reasoning processes. We evaluate PPCV on mainstream LLMs across multiple benchmarks. Extensive experiments demonstrate that PPCV substantially enhances the reasoning performance of LLMs compared to baselines.
Chinese Translation
大型语言模型在各种推理任务中表现出色。然而,由于幻觉和中间步骤中错误的累积,它们在更复杂任务上的问题解决能力往往下降。最近的研究引入了关键令牌的概念——在推理过程中对后续步骤产生显著影响的令牌。先前的研究表明,替换关键令牌可以优化推理轨迹。然而,可靠地识别和利用关键令牌仍然具有挑战性。为了解决这一问题,我们提出了释义探测和一致性验证(Paraphrastic Probing and Consistency Verification, PPCV)框架。PPCV分为两个阶段。在第一阶段,我们从原始问题展开初始推理路径,然后将问题的释义版本与该推理路径连接。我们根据预测的前1个令牌与推理路径中预期令牌之间的差异来识别关键令牌。采用一个标准来确认最终的关键令牌。在第二阶段,我们用候选替代令牌替换关键令牌,并为原始和释义问题展开新的推理路径。最终答案通过检查这些平行推理过程中的输出一致性来确定。我们在多个基准上评估了PPCV在主流大型语言模型上的表现。大量实验表明,与基线相比,PPCV显著提升了大型语言模型的推理性能。
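The two PPCV stages rest on simple primitives: flagging positions where the model's top-1 prediction disagrees with the token actually in the reasoning path, and voting over parallel rollouts. A simplified sketch (the paper applies an additional criterion to confirm the final critical token, omitted here):

```python
def candidate_critical_tokens(expected_path, top1_predictions):
    """Stage 1 sketch: positions where the model's top-1 next-token prediction
    (under a paraphrased question plus the original reasoning path) disagrees
    with the token actually present in the path."""
    return [i for i, (exp, pred) in enumerate(zip(expected_path, top1_predictions))
            if exp != pred]

def majority_answer(answers):
    """Stage 2 sketch: consistency verification as a majority vote over the
    answers from parallel original/paraphrased rollouts."""
    return max(set(answers), key=answers.count)
```

The intuition is that a token the model would not re-predict under a paraphrase is a "crack" in the trajectory worth substituting.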
cs.CL / 31 / 2602.11364

The Energy of Falsehood: Detecting Hallucinations via Diffusion Model Likelihoods

虚假之能:通过扩散模型似然性检测幻觉
Gautam, Arpit Singh, Talreja, Kailash, Jha, Saurabh
Abstract
Large Language Models (LLMs) frequently hallucinate plausible but incorrect assertions, a vulnerability often missed by uncertainty metrics when models are confidently wrong. We propose DiffuTruth, an unsupervised framework that reconceptualizes fact verification via non-equilibrium thermodynamics, positing that factual truths act as stable attractors on a generative manifold while hallucinations are unstable. We introduce the Generative Stress Test, in which claims are corrupted with noise and reconstructed using a discrete text diffusion model. We define Semantic Energy, a metric measuring the semantic divergence between the original claim and its reconstruction using an NLI critic. Unlike vector-space errors, Semantic Energy isolates deep factual contradictions. We further propose a Hybrid Calibration fusing this stability signal with discriminative confidence. Extensive experiments on FEVER demonstrate that DiffuTruth achieves a state-of-the-art unsupervised AUROC of 0.725, outperforming baselines by 1.5 percent through the correction of overconfident predictions. Furthermore, we show superior zero-shot generalization on the multi-hop HOVER dataset, outperforming baselines by over 4 percent, confirming the robustness of thermodynamic truth properties to distribution shifts.
Chinese Translation
大型语言模型(LLMs)经常会产生看似合理但实际上错误的陈述,这种脆弱性在模型自信地错误时常常被不确定性指标所忽视。我们提出了DiffuTruth,一个无监督框架,通过非平衡热力学重新构思事实验证,假设事实真理在生成流形上作为稳定的吸引子,而幻觉则是不稳定的。我们引入了生成压力测试(Generative Stress Test),通过噪声破坏声明,并使用离散文本扩散模型进行重构。我们定义了语义能量(Semantic Energy),这是一个衡量原始声明与其重构之间语义差异的指标,使用自然语言推理(NLI)批评者进行评估。与向量空间误差不同,语义能量能够孤立出深层次的事实矛盾。我们进一步提出了一种混合校准(Hybrid Calibration),将这一稳定信号与判别信心相结合。在FEVER数据集上的大量实验表明,DiffuTruth实现了0.725的最先进无监督AUROC,比基线提高了1.5个百分点,成功纠正了过于自信的预测。此外,我们在多跳HOVER数据集上展示了优越的零样本泛化能力,超越基线超过4个百分点,确认了热力学真理属性对分布变化的鲁棒性。
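The Generative Stress Test can be sketched as corrupt-and-reconstruct followed by NLI scoring. Here `reconstruct` and `nli_contradiction` are hypothetical stand-ins for the discrete text diffusion model and the NLI critic, and averaging over `k` reconstructions is an illustrative assumption:

```python
def semantic_energy(claim, reconstruct, nli_contradiction, k=8):
    """Sketch: reconstruct the (noised) claim k times with a generative model
    and average the NLI contradiction score between the claim and each
    reconstruction. Stable (factual) claims reconstruct consistently and get
    low energy; unstable (hallucinated) claims drift and get high energy."""
    reconstructions = [reconstruct(claim) for _ in range(k)]
    return sum(nli_contradiction(claim, r) for r in reconstructions) / k
```

Plugging in a stable reconstructor yields zero energy while a drifting one maximizes it, matching the attractor intuition.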
cs.CL / 32 / 2602.11391

Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection

通过患者模拟提升人工智能可信度:对抗抑郁药选择的对话代理风险评估
Shawon, Md Tanvir Rouf, Irbaz, Mohammad Sabik, Elyazori, Hadeel R. A., Resapu, Keerti Reddy, Lin, Yili, Cardenas, Vladimir Franzuela, Alemi, Farrokh, Lybarger, Kevin
Abstract
Objective: This paper introduces a patient simulator designed to enable scalable, automated evaluation of healthcare conversational agents. The simulator generates realistic, controllable patient interactions that systematically vary across medical, linguistic, and behavioral dimensions, allowing annotators and an independent AI judge to assess agent performance, identify hallucinations and inaccuracies, and characterize risk patterns across diverse patient populations. Methods: The simulator is grounded in the NIST AI Risk Management Framework and integrates three profile components reflecting different dimensions of patient variation: (1) medical profiles constructed from electronic health records in the All of Us Research Program; (2) linguistic profiles modeling variation in health literacy and condition-specific communication patterns; and (3) behavioral profiles representing empirically observed interaction patterns, including cooperation, distraction, and adversarial engagement. We evaluated the simulator's effectiveness in identifying errors in an AI decision aid for antidepressant selection. Results: We generated 500 conversations between the patient simulator and the AI decision aid across systematic combinations of five linguistic and three behavioral profiles. Human annotators assessed 1,787 medical concepts across 100 conversations, achieving high agreement (F1=0.94, κ=0.73), and the LLM judge achieved comparable agreement with human annotators (F1=0.94, κ=0.78; paired bootstrap p=0.21). The simulator revealed a monotonic degradation in AI decision aid performance across the health literacy spectrum: rank-one concept retrieval accuracy increased from 47.9% for limited health literacy to 69.1% for functional and 81.6% for proficient.
Chinese Translation
目的:本文介绍了一种患者模拟器,旨在实现可扩展的、自动化的医疗对话代理评估。该模拟器生成现实且可控的患者互动,这些互动在医学、语言和行为维度上系统性地变化,使得标注者和独立的人工智能评审能够评估代理的表现,识别幻觉和不准确之处,并描述不同患者群体中的风险模式。方法:该模拟器基于NIST人工智能风险管理框架,并整合了三个反映患者变化不同维度的配置文件组件:(1)基于“All of Us”研究计划中的电子健康记录构建的医学配置文件;(2)模拟健康素养和特定病症沟通模式变化的语言配置文件;(3)代表实证观察到的互动模式的行为配置文件,包括合作、分心和对抗性参与。我们评估了模拟器在识别抗抑郁药选择的人工智能决策辅助工具中的错误的有效性。结果:我们生成了500个患者模拟器与人工智能决策辅助工具之间的对话,涵盖五种语言和三种行为配置文件的系统组合。人类标注者在100个对话中评估了1,787个医学概念,达成了高一致性(F1=0.94, κ=0.73),而大型语言模型(LLM)评审与人类标注者达成了相当的一致性(F1=0.94, κ=0.78;配对自助法p=0.21)。模拟器揭示了人工智能决策辅助工具在健康素养谱系上的单调下降:第一等级概念检索准确率从健康素养有限的47.9%提高到功能性健康素养的69.1%和熟练健康素养的81.6%。
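The reported agreement statistics (κ=0.73 between human annotators, κ=0.78 for the LLM judge) use Cohen's kappa, which corrects raw agreement for agreement expected by chance; a minimal sketch for categorical labels:

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    chance = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                 for c in categories)
    return (observed - chance) / (1 - chance)
```

Two annotators who agree exactly at the chance rate score κ=0, which is why κ is preferred over raw accuracy for annotation studies like this one.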
cs.CL / 33 / 2602.11424

Gradients Must Earn Their Influence: Unifying SFT with Generalized Entropic Objectives

梯度必须赢得其影响力:将监督微调与广义熵目标统一
Wang, Zecheng, Liu, Deyuan, Li, Chunshan, Zhang, Yupeng, Zhao, Zhengyun, Chu, Dianhui, Wang, Bingning, Sui, Dianbo
Abstract
Standard negative log-likelihood (NLL) for Supervised Fine-Tuning (SFT) applies uniform token-level weighting. This rigidity creates a two-fold failure mode: (i) overemphasizing low-probability targets can amplify gradients on noisy supervision and disrupt robust priors, and (ii) uniform weighting provides weak sharpening when the model is already confident. Existing methods fail to resolve the resulting plasticity-stability dilemma, often suppressing necessary learning signals alongside harmful ones. To address this issue, we unify token-level SFT objectives within a generalized deformed-log family and expose a universal gate × error gradient structure, where the gate controls how much the model trusts its current prediction. By employing the Cayley transform, we map the model's continuously evolving uncertainty onto a continuous focus trajectory, which enables seamless interpolation between scenarios involving uncertain novel concepts and those involving well-established knowledge. We then introduce Dynamic Entropy Fine-Tuning (DEFT), a parameter-free objective that modulates the trust gate using distribution concentration (Rényi-2 entropy) as a practical proxy for the model's predictive state. Extensive experiments and analyses demonstrate that DEFT achieves a better balance between exploration and exploitation, leading to improved overall performance.
Chinese Translation
标准的负对数似然(NLL)用于监督微调(SFT)时应用了均匀的令牌级加权。这种刚性导致了双重失败模式:(i)过度强调低概率目标可能会在噪声监督下放大梯度并破坏稳健的先验;(ii)当模型已经有信心时,均匀加权提供的锐化效果较弱。现有方法未能解决由此产生的可塑性-稳定性困境,常常抑制必要的学习信号与有害信号。为了解决这一问题,我们在广义变形对数族内统一了令牌级SFT目标,并揭示了一个普遍的门控×误差梯度结构,其中门控控制模型对当前预测的信任程度。通过采用凯莱变换,我们将模型不断演变的不确定性映射到连续的聚焦轨迹上,从而实现了在涉及不确定的新概念和涉及成熟知识的场景之间的无缝插值。随后,我们引入了动态熵微调(DEFT),这是一种无参数目标,通过分布集中(Rényi-2熵)作为模型预测状态的实用代理来调节信任门。大量实验和分析表明,DEFT在探索与利用之间实现了更好的平衡,从而提高了整体性能。
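Rényi-2 (collision) entropy and its use as a trust gate can be sketched as follows. The abstract names the entropy proxy but not the exact gate, so scaling the token NLL by normalized entropy is an illustrative assumption, not DEFT's actual objective:

```python
import math

def renyi2_entropy(probs):
    """Rényi-2 (collision) entropy, -log(sum p_i^2): near log(V) for a
    uniform distribution, near 0 when probability mass is concentrated."""
    return -math.log(sum(p * p for p in probs))

def gated_nll(probs, target_idx, max_entropy):
    """Hypothetical trust gate: scale the NLL by normalized Rényi-2 entropy,
    so confident (concentrated) predictions receive weaker gradients."""
    gate = renyi2_entropy(probs) / max_entropy
    return gate * -math.log(probs[target_idx])
```

With a confident, correct prediction the gate nearly closes, whereas an uncertain prediction passes the full error signal through, which is the plasticity-stability trade-off the gate is meant to manage.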
cs.CL / 34 / 2602.11444

Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety

迈向可靠的机器翻译:为关键错误检测和安全性扩展大规模语言模型
Chopra, Muskaan, Sparrenberg, Lorenz, Sifa, Rafet
Abstract
Machine Translation (MT) plays a pivotal role in cross-lingual information access, public policy communication, and equitable knowledge dissemination. However, critical meaning errors, such as factual distortions, intent reversals, or biased translations, can undermine the reliability, fairness, and safety of multilingual systems. In this work, we explore the capacity of instruction-tuned Large Language Models (LLMs) to detect such critical errors, evaluating models across a range of parameter scales on publicly accessible datasets. Our findings show that model scaling and adaptation strategies (zero-shot, few-shot, fine-tuning) yield consistent improvements, outperforming encoder-only baselines like XLM-R and ModernBERT. We argue that improving critical error detection in MT contributes to safer, more trustworthy, and socially accountable information systems by reducing the risk of disinformation, miscommunication, and linguistic harm, especially in high-stakes or underrepresented contexts. This work positions error detection not merely as a technical challenge, but as a necessary safeguard in the pursuit of just and responsible multilingual AI. The code will be made available at GitHub.
Chinese Translation
机器翻译(MT)在跨语言信息获取、公共政策沟通和公平知识传播中发挥着关键作用。然而,事实扭曲、意图反转或偏见翻译等关键意义错误可能会削弱多语言系统的可靠性、公平性和安全性。在本研究中,我们探讨了经过指令调优的大规模语言模型(LLMs)检测这些关键错误的能力,评估了在一系列参数下模型的表现,使用了公开可获取的数据集。我们的研究结果表明,模型扩展和适应策略(零样本、少样本、微调)带来了持续的改进,超越了仅使用编码器的基线模型,如XLM-R和ModernBERT。我们认为,提高机器翻译中的关键错误检测有助于构建更安全、更可信和社会责任感更强的信息系统,从而降低虚假信息、误传和语言伤害的风险,尤其是在高风险或代表性不足的背景下。本研究将错误检测视为不仅仅是技术挑战,而是追求公正和负责任的多语言人工智能所必需的保障。代码将在GitHub上发布。
cs.CL / 35 / 2602.11451

LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation

LoopFormer:通过快捷调制实现潜在推理的弹性深度循环变换器
Jeddi, Ahmadreza, Ciccone, Marco, Taati, Babak
Abstract
Looped Transformers have emerged as an efficient and powerful class of models for reasoning in the language domain. Recent studies show that these models achieve strong performance on algorithmic and reasoning tasks, suggesting that looped architectures possess an inductive bias toward latent reasoning. However, prior approaches fix the number of loop iterations during training and inference, leaving open the question of whether these models can flexibly adapt their computational depth under variable compute budgets. We introduce LoopFormer, a looped Transformer trained on variable-length trajectories to enable budget-conditioned reasoning. Our core contribution is a shortcut-consistency training scheme that aligns trajectories of different lengths, ensuring that shorter loops yield informative representations while longer loops continue to refine them. LoopFormer conditions each loop on the current time and step size, enabling representations to evolve consistently across trajectories of varying length rather than drifting or stagnating. Empirically, LoopFormer demonstrates robust performance on language modeling and reasoning benchmarks even under aggressive compute constraints, while scaling gracefully with additional budget. These results show that looped Transformers are inherently suited for adaptive language modeling, opening a path toward controllable and budget-aware large language models.
Chinese Translation
循环变换器已成为语言领域推理的一类高效且强大的模型。最近的研究表明,这些模型在算法和推理任务上表现出色,暗示循环架构对潜在推理具有归纳偏差。然而,之前的方法在训练和推理过程中固定了循环迭代的次数,未能探讨这些模型是否能够在可变计算预算下灵活调整其计算深度。我们提出了LoopFormer,这是一种在可变长度轨迹上训练的循环变换器,以实现预算条件下的推理。我们的核心贡献是一个快捷一致性训练方案,该方案对不同长度的轨迹进行对齐,确保较短的循环能够产生信息丰富的表示,而较长的循环则继续对其进行细化。LoopFormer根据当前时间和步长对每个循环进行条件设置,使得表示能够在不同长度的轨迹中一致演变,而不是漂移或停滞。实证结果表明,LoopFormer在语言建模和推理基准测试中即使在严格的计算约束下也表现出强大的性能,同时在增加预算时表现出良好的扩展性。这些结果表明,循环变换器本质上适合于自适应语言建模,为可控和预算意识的大型语言模型开辟了新的路径。
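Budget-conditioned looping amounts to applying one shared block a variable number of times, conditioning each call on the current time and step size. A minimal sketch with a hypothetical `block(x, t, dt)` signature (the real model operates on tensors, not scalars):

```python
def looped_forward(x, block, depth, step_size=None):
    """Elastic-depth loop sketch: apply the shared block `depth` times,
    conditioning each call on normalized time t and step size dt so that
    trajectories of different lengths stay aligned."""
    dt = step_size if step_size is not None else 1.0 / depth
    for i in range(depth):
        x = block(x, i / depth, dt)
    return x
```

With the toy block below, depth-4 and depth-8 trajectories reach the same endpoint, which is the kind of cross-depth consistency the shortcut-consistency training scheme enforces on representations.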
cs.CL / 36 / 2602.11460

ADRD-Bench: A Preliminary LLM Benchmark for Alzheimer's Disease and Related Dementias

ADRD-Bench:阿尔茨海默病及相关痴呆的初步大型语言模型基准测试
Zhao, Guangxin, Zheng, Jiahao, Boustani, Malaz, Nabrzyski, Jarek, Jiang, Meng, Shi, Yiyu, Zheng, Zhi
Abstract
Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs. ADRD-Bench has two components: 1) ADRD Unified QA, a synthesis of 1,352 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from the Aging Brain Care (ABC) program, a widely used, evidence-based brain health management program. Guided by a program with national expertise in comprehensive ADRD care, this new set was designed to mitigate the lack of practical caregiving context in existing benchmarks. We evaluated 33 state-of-the-art LLMs on the proposed ADRD-Bench. Results showed that the accuracy of open-weight general models ranged from 0.63 to 0.93 (mean: 0.78; std: 0.09). The accuracy of open-weight medical models ranged from 0.48 to 0.93 (mean: 0.82; std: 0.13). The accuracy of closed-source general models ranged from 0.83 to 0.91 (mean: 0.89; std: 0.03). While top-tier models achieved high accuracies (>0.9), case studies revealed that inconsistent reasoning quality and stability limit their reliability, highlighting a critical need for domain-specific improvement to enhance LLMs' knowledge and reasoning grounded in daily caregiving data. The entire dataset is available at https://github.com/IIRL-ND/ADRD-Bench.
Chinese Translation
大型语言模型(LLMs)在医疗保健应用中展现出了巨大的潜力。然而,现有的评估基准对阿尔茨海默病及相关痴呆(ADRD)的覆盖面非常有限。为了解决这一问题,我们推出了ADRD-Bench,这是第一个专门针对ADRD的基准数据集,旨在对LLMs进行严格评估。ADRD-Bench包含两个部分:1)ADRD统一问答(ADRD Unified QA),这是从七个已建立的医学基准中整合的1,352个问题的综合,提供了对临床知识的统一评估;2)ADRD护理问答(ADRD Caregiving QA),这是从广泛使用的基于证据的脑健康管理项目——老年脑护理(Aging Brain Care, ABC)项目中衍生出的149个新问题。该新问题集在国家级的综合ADRD护理专业指导下设计,旨在弥补现有基准中缺乏实际护理背景的问题。我们对33个最先进的LLMs在所提出的ADRD-Bench上进行了评估。结果显示,开放权重通用模型的准确率范围为0.63到0.93(均值:0.78;标准差:0.09)。开放权重医学模型的准确率范围为0.48到0.93(均值:0.82;标准差:0.13)。闭源通用模型的准确率范围为0.83到0.91(均值:0.89;标准差:0.03)。尽管顶尖模型的准确率较高(>0.9),案例研究表明不一致的推理质量和稳定性限制了它们的可靠性,突显了在基于日常护理数据的知识和推理方面进行领域特定改进的迫切需求。整个数据集可在https://github.com/IIRL-ND/ADRD-Bench获取。
cs.CL / 37 / 2602.11488

When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration

当音频-大语言模型不倾听时:一种跨语言的模态仲裁研究
Billa, Jayadev
Abstract
When audio and text conflict, speech-enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliability cues. This gap is not explained by audio quality: audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19% to 33%), sacrificing audio's information advantage without improving accessibility. Framing text as "deliberately corrupted" reduces text dominance by 80%. A fine-tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5%), while LoRA on the language model halves it (-23.9%), localizing text dominance to the LLM's reasoning rather than the audio encoder. Experiments across four state-of-the-art audio-LLMs and 8 languages show consistent trends with substantial cross-linguistic and cross-model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.
Chinese Translation
当音频和文本发生冲突时,语音启用的语言模型在仲裁两个文本源时,遵循文本的频率是遵循音频的10倍,即使在明确指示信任音频的情况下。通过使用ALME,一个涵盖8种语言的57,602个受控音频-文本冲突刺激的基准,我们发现Gemini 2.0 Flash在音频-文本冲突下表现出16.6%的文本主导性,而在文本-文本冲突下则为1.6%,两者的可靠性线索相同。这个差距并不能通过音频质量来解释:仅音频的准确率(97.2%)超过级联准确率(93.9%),这表明音频嵌入保留了比文本转录更多的信息。我们提出,文本主导性反映了一种不在信息内容而在仲裁可达性上的不对称性:模型在竞争表示之间推理的容易程度。这一框架解释了其他令人困惑的发现。在回答之前强制转录会增加文本主导性(从19%增加到33%),牺牲了音频的信息优势而没有改善可达性。将文本框架设定为“故意损坏”则将文本主导性降低了80%。一个微调消融实验提供了干预证据:仅训练音频投影层会增加文本主导性(+26.5%),而在语言模型上应用LoRA则将其减半(-23.9%),将文本主导性局限于大语言模型的推理而非音频编码器。对四种最先进的音频-大语言模型和8种语言的实验显示出一致的趋势,并存在显著的跨语言和跨模型变异,确立了模态仲裁作为一种不被标准语音基准捕捉的独特可靠性维度。
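The headline metric, text dominance, is the fraction of conflict stimuli resolved in favor of the conflicting text; a minimal sketch (excluding undecided responses from the denominator is an assumption, not necessarily the benchmark's rule):

```python
def text_dominance_rate(outcomes):
    """Fraction of audio-text conflict stimuli where the model's answer
    matches the conflicting text rather than the audio. `outcomes` holds one
    of "text", "audio", or "neither" per stimulus."""
    decided = [o for o in outcomes if o in ("text", "audio")]
    return sum(o == "text" for o in decided) / len(decided)
```

Comparing this rate under audio-text conflict against the same rate under text-text conflict is what exposes the 10x asymmetry reported above.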
cs.CL / 38 / 2602.11509

Multimodal Fact-Level Attribution for Verifiable Reasoning

可验证推理的多模态事实级归因
Wan, David, Wang, Han, Wang, Ziyang, Stengel-Eskin, Elias, Lee, Hyunji, Bansal, Mohit
Abstract
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
Chinese Translation
多模态大型语言模型(MLLMs)在涉及多步骤推理和长文本生成的现实任务中越来越多地被使用,其中可靠性要求将模型输出与异构输入源相结合,并验证单个事实声明。然而,现有的多模态基础基准和评估方法主要集中在简化的基于观察的场景或有限的模态上,未能评估复杂多模态推理中的归因。我们提出了MuRGAt(基于归因的多模态推理),这是一个用于评估事实级多模态归因的基准,适用于需要超越直接观察的推理设置。MuRGAt要求模型在给定视频、音频和其他模态的输入时,生成带有明确推理和精确引用的答案,其中每个引用都指定了模态和时间段。为了实现可靠评估,我们引入了一个与人类判断高度相关的自动评估框架。通过人类和自动评分的基准测试显示,即使是强大的MLLMs在正确推理的情况下也常常会虚构引用。此外,我们观察到一个关键的权衡:增加推理深度或强制结构化基础通常会降低准确性,突显了内部推理与可验证归因之间的显著差距。
cs.CL / 39 / 2602.11543

Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm

使用分布式GPU进行大型语言模型的预训练:一种内存高效的去中心化范式
Zhang, Jinrui, Xiao, Chaodong, Wu, Aoqi, Zhang, Xindong, Zhang, Lei
Abstract
Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing. To accelerate convergence, we introduce an expert-merging warm-up strategy, where experts exchange knowledge early in training, to rapidly establish foundational capabilities. With SPES, we train a 2B-parameter MoE LLM using 16 standalone 48GB GPUs over internet connections, which achieves competitive performance with centrally trained LLMs under similar computational budgets. We further demonstrate scalability by training a 7B model from scratch and a 9B model upcycled from a dense checkpoint, both of which match prior centralized baselines. Our code is available at https://github.com/zjr2000/SPES.
Chinese Translation
大型语言模型(LLMs)的预训练通常需要由数千个高内存GPU(例如H100/A100)组成的集中式集群。最近的去中心化训练方法通过采用联邦优化来减少通信开销;然而,它们仍然需要在每个节点上训练整个模型,受到GPU内存限制的约束。在本研究中,我们提出了稀疏专家同步(SParse Expert Synchronization,SPES),这是一种内存高效的去中心化框架,用于预训练混合专家(Mixture-of-Experts,MoE)LLMs。SPES在每个节点上仅训练一部分专家,显著降低了内存占用。每个节点更新其本地专家,并定期与其他节点同步,从而消除了全参数传输,同时确保了高效的知识共享。为了加速收敛,我们引入了一种专家合并热身策略,在训练初期让专家之间交换知识,以快速建立基础能力。通过SPES,我们使用16个通过互联网连接的独立48GB GPU训练了一个2B参数的MoE LLM,其性能在相似计算预算下与集中式训练的LLMs具有竞争力。我们进一步通过从头训练一个7B模型以及从稠密检查点升级(upcycle)得到一个9B模型来展示可扩展性,这两个模型均与之前的集中式基线相匹配。我们的代码可在https://github.com/zjr2000/SPES获取。
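The per-node expert training with periodic synchronization described in the abstract can be sketched roughly as below; the dict layout and the plain averaging rule are illustrative assumptions, not the actual SPES protocol:

```python
# Toy sketch: each node trains only a subset of experts and periodically
# averages the experts it shares with peers, so no node ever ships its
# full model's parameters.

def sync_experts(node_experts):
    """node_experts: one {expert_id: weight_vector} dict per node.
    Each expert becomes the element-wise mean of the copies held by
    the nodes that train it."""
    copies = {}
    for experts in node_experts:
        for eid, w in experts.items():
            copies.setdefault(eid, []).append(w)
    return {eid: [sum(vals) / len(ws) for vals in zip(*ws)]
            for eid, ws in copies.items()}

# Two nodes, each hosting 2 of 3 experts; expert 1 is shared.
node_a = {0: [1.0, 1.0], 1: [2.0, 2.0]}
node_b = {1: [4.0, 4.0], 2: [0.0, 0.0]}
state = sync_experts([node_a, node_b])   # expert 1 -> [3.0, 3.0]
```

The point of the sketch is the communication pattern: only the shared experts need to cross the network, which is what lets each node's memory footprint stay at a fraction of the full MoE model.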
cs.CL / 40 / 2602.11551

SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent

SIGHT:具有自证据和信息增益多样分支的搜索代理强化学习
Zhong, Wenlin, Yang, Jinluan, Wu, Yiquan, Liu, Yi, Yao, Jianhang, Kuang, Kun
Abstract
Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to master autonomous search for complex question answering. However, particularly within multi-turn search scenarios, this interaction introduces a critical challenge: search results often suffer from high redundancy and low signal-to-noise ratios. Consequently, agents easily fall into "Tunnel Vision," where the forced interpretation of early noisy retrievals leads to irreversible error accumulation. To address these challenges, we propose SIGHT, a framework that enhances search-based reasoning through Self-Evidence Support (SES) and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and calculates an Information Gain score to pinpoint pivotal states where observations maximally reduce uncertainty. This score guides Dynamic Prompting Interventions - including de-duplication, reflection, or adaptive branching - to spawn new branches with SES. Finally, by integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT internalizes robust exploration strategies without external verifiers. Experiments on single-hop and multi-hop QA benchmarks demonstrate that SIGHT significantly outperforms existing approaches, particularly in complex reasoning scenarios, using fewer search steps.
Chinese Translation
强化学习(RL)使大型语言模型(LLMs)能够掌握复杂问答的自主搜索。然而,特别是在多轮搜索场景中,这种交互引入了一个关键挑战:搜索结果往往存在高冗余和低信噪比。因此,代理容易陷入“隧道视野”,早期噪声检索的强制解释导致不可逆的错误累积。为了解决这些挑战,我们提出了SIGHT,一个通过自证据支持(SES)和信息增益驱动的多样分支增强基于搜索的推理的框架。SIGHT通过SES将搜索结果提炼为高保真证据,并计算信息增益分数,以确定观察最大限度减少不确定性的关键状态。该分数指导动态提示干预——包括去重、反思或自适应分支——以生成具有SES的新分支。最后,通过群体相对策略优化将SES和正确性奖励整合,SIGHT内部化了强大的探索策略,无需外部验证者。在单跳和多跳问答基准上的实验表明,SIGHT在复杂推理场景中显著优于现有方法,并且使用更少的搜索步骤。
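The Information Gain score that pinpoints pivotal states can be illustrated as entropy reduction over candidate answers; this simple scoring rule is an assumption for illustration, not SIGHT's exact formulation:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete belief over answers."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(prior, posterior):
    """Entropy reduction achieved by conditioning on a new observation;
    large values mark pivotal states worth branching from."""
    return entropy(prior) - entropy(posterior)

prior = [0.25, 0.25, 0.25, 0.25]   # 4 equally likely answers: 2 bits
posterior = [0.7, 0.1, 0.1, 0.1]   # a retrieval sharpened the belief
gain = information_gain(prior, posterior)
```

Observations that leave the belief flat score near zero and can be handled by cheap interventions such as de-duplication, while high-gain states justify spawning new branches.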
cs.CL / 41 / 2602.11570

PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering

PRIME:一个用于数学和工程中可验证推理的过程-结果对齐基准
Wang, Xiangfeng, Guo, Hangyu, Lai, Yanlin, Huang, Mitt, Zhao, Liang, Yao, Chengyuan, Zhang, Yinmin, Han, Qi, Ren, Xiaoxiao, Yuan, Chun, Xu, Tong, Ge, Zheng, Zhang, Xiangyu, Jiang, Daxin
Abstract
While model-based verifiers are essential for scaling Reinforcement Learning with Verifiable Rewards (RLVR), current outcome-centric verification paradigms primarily focus on the consistency between the final result and the ground truth, often neglecting potential errors in the derivation process. This leads to assigning positive rewards to correct answers produced from incorrect derivations. To bridge this gap, we introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification in Mathematics and Engineering. Curated from a comprehensive collection of college-level STEM problems, PRIME comprises 2,530 high-difficulty samples through a consistency-based filtering pipeline. Through extensive evaluation, we find that current verifiers frequently fail to detect derivation flaws. Furthermore, we propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME. This approach substantially outperforms the outcome-only verification baseline, achieving absolute performance gains of 8.29%, 9.12%, and 7.31% on AIME24, AIME25, and Beyond-AIME, respectively, for the Qwen3-14B-Base model. Finally, we demonstrate a strong linear correlation ($R^2 > 0.92$) between verifier accuracy on PRIME and RLVR training effectiveness, validating PRIME as a reliable predictor for verifier selection.
Chinese Translation
尽管基于模型的验证器对于扩展具有可验证奖励的强化学习(RLVR)至关重要,但当前以结果为中心的验证范式主要关注最终结果与真实值之间的一致性,往往忽视了推导过程中的潜在错误。这导致对从错误推导中产生的正确答案赋予正奖励。为了解决这一问题,我们提出了PRIME,一个用于评估数学和工程中过程-结果对齐验证的基准。PRIME从一系列大学水平的STEM问题中精心整理而来,通过基于一致性的过滤管道获得2530个高难度样本。通过广泛的评估,我们发现当前的验证器经常无法检测到推导缺陷。此外,我们提出了一种利用通过PRIME选择的验证器的过程感知RLVR训练范式。该方法显著优于仅基于结果的验证基线,在Qwen3-14B-Base模型上,AIME24、AIME25和Beyond-AIME上的绝对性能提升分别为8.29%、9.12%和7.31%。最后,我们展示了PRIME上验证器准确性与RLVR训练有效性之间的强线性相关性($R^2 > 0.92$),验证了PRIME作为验证器选择的可靠预测指标。
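The gap between outcome-only and process-aware verification that PRIME targets can be reduced to a one-line contrast; the boolean interface is an illustrative assumption:

```python
# Outcome-only verification rewards any correct final answer; a
# process-aware verifier additionally requires the derivation to be
# judged valid, denying reward to lucky guesses.

def outcome_only_reward(answer_correct, derivation_valid):
    return 1.0 if answer_correct else 0.0

def process_aware_reward(answer_correct, derivation_valid):
    return 1.0 if (answer_correct and derivation_valid) else 0.0

lucky_guess = (True, False)   # right answer reached via a flawed derivation
r_outcome = outcome_only_reward(*lucky_guess)   # rewarded: false positive
r_process = process_aware_reward(*lucky_guess)  # withheld
```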
cs.CL / 42 / 2602.11607

Scene-Aware Memory Discrimination: Deciding Which Personal Knowledge Stays

场景感知记忆区分:决定哪些个人知识得以保留
Zhong, Yijie, Guo, Mengying, Wang, Zewei, Li, Zhongyang, Tu, Dandan, Wang, Haofen
Abstract
Intelligent devices have become deeply integrated into everyday life, generating vast amounts of user interactions that form valuable personal knowledge. Efficient organization of this knowledge in user memory is essential for enabling personalized applications. However, current research on memory writing, management, and reading using large language models (LLMs) faces challenges in filtering irrelevant information and in dealing with rising computational costs. Inspired by the concept of selective attention in the human brain, we introduce a memory discrimination task. To address large-scale interactions and diverse memory standards in this task, we propose a Scene-Aware Memory Discrimination method (SAMD), which comprises two key components: the Gating Unit Module (GUM) and the Cluster Prompting Module (CPM). GUM enhances processing efficiency by filtering out non-memorable interactions and focusing on the salient content most relevant to application demands. CPM establishes adaptive memory standards, guiding LLMs to discern what information should be remembered or discarded. It also analyzes the relationship between user intents and memory contexts to build effective clustering prompts. Comprehensive direct and indirect evaluations demonstrate the effectiveness and generalization of our approach. We independently assess the performance of memory discrimination, showing that SAMD successfully recalls the majority of memorable data and remains robust in dynamic scenarios. Furthermore, when integrated into personalized applications, SAMD significantly enhances both the efficiency and quality of memory construction, leading to better organization of personal knowledge.
Chinese Translation
智能设备已深度融入日常生活,生成大量用户交互,从而形成宝贵的个人知识。有效地组织这些知识在用户记忆中对于实现个性化应用至关重要。然而,当前关于使用大型语言模型(LLMs)进行记忆写入、管理和读取的研究面临过滤无关信息和应对不断上升的计算成本的挑战。受人脑选择性注意力概念的启发,我们引入了一项记忆区分任务。为了解决这一任务中的大规模交互和多样化记忆标准,我们提出了一种场景感知记忆区分方法(Scene-Aware Memory Discrimination,SAMD),该方法包括两个关键组件:门控单元模块(Gating Unit Module,GUM)和聚类提示模块(Cluster Prompting Module,CPM)。GUM通过过滤掉非记忆性交互,专注于与应用需求最相关的显著内容,从而提高处理效率。CPM建立自适应记忆标准,引导LLMs辨别哪些信息应被记住或丢弃。它还分析用户意图与记忆上下文之间的关系,以构建有效的聚类提示。全面的直接和间接评估证明了我们方法的有效性和泛化能力。我们独立评估了记忆区分的性能,显示SAMD成功回忆起大多数可记忆数据,并在动态场景中保持稳健。此外,当集成到个性化应用中时,SAMD显著提高了记忆构建的效率和质量,从而更好地组织个人知识。
cs.CL / 43 / 2602.11639

PACE: Prefix-Protected and Difficulty-Aware Compression for Efficient Reasoning

PACE:前缀保护与难度感知的高效推理压缩
Feng, Ruixiang, Wen, Yuntao, Zhou, Silin, Shi, Ke, Wang, Yifan, Le, Ran, An, Zhenwei, Chen, Zongchao, Yang, Chen, Peng, Guangyue, Jia, Yiming, Wang, Dongsheng, Zhang, Tao, Chen, Lisi, Song, Yang, Gao, Shen, Shang, Shuo
Abstract
Language Reasoning Models (LRMs) achieve strong performance by scaling test-time computation but often suffer from "overthinking", producing excessively long reasoning traces that increase latency and memory usage. Existing LRMs typically enforce conciseness with uniform length penalties, which over-compress crucial early deduction steps at the sequence level and indiscriminately penalize all queries at the group level. To solve these limitations, we propose PACE, a dual-level framework for prefix-protected and difficulty-aware compression under hierarchical supervision. At the sequence level, prefix-protected optimization employs decaying mixed rollouts to maintain valid reasoning paths while promoting conciseness. At the group level, a difficulty-aware penalty dynamically scales length constraints based on query complexity, maintaining exploration for harder questions while curbing redundancy on easier ones. Extensive experiments on DeepSeek-R1-Distill-Qwen (1.5B/7B) demonstrate that PACE achieves a substantial reduction in token usage (up to 55.7%) while simultaneously improving accuracy (up to 4.1%) on math benchmarks, with generalization to code, science, and general domains.
Chinese Translation
语言推理模型(Language Reasoning Models, LRM)通过扩大测试时计算规模实现了强大的性能,但常常遭遇“过度思考”的问题,产生过长的推理轨迹,从而增加延迟和内存使用。现有的LRM通常通过统一的长度惩罚来强制简洁性,这在序列层面上过度压缩了关键的早期推理步骤,并在组层面上无差别地惩罚所有查询。为了解决这些局限性,我们提出了PACE,一个在层次监督下实现前缀保护和难度感知压缩的双层框架。在序列层面,前缀保护优化采用衰减混合回放,以保持有效的推理路径,同时促进简洁性。在组层面,难度感知惩罚根据查询复杂性动态调整长度约束,保持对更难问题的探索,同时抑制对较易问题的冗余。对DeepSeek-R1-Distill-Qwen(1.5B/7B)的广泛实验表明,PACE在数学基准上实现了令牌使用量的大幅减少(高达55.7%),同时提高了准确性(高达4.1%),并具备对代码、科学和一般领域的泛化能力。
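The group-level, difficulty-aware scaling of length constraints might be sketched as follows; the budget formula and all coefficients are illustrative assumptions, not PACE's actual objective:

```python
# Easier queries (high group accuracy) get a tight length budget, while
# harder ones keep room to explore; the penalty kicks in only once a
# reasoning trace exceeds its budget.

def length_penalty(num_tokens, group_accuracy,
                   base_budget=200, extra_budget=800, coeff=0.001):
    """Linear penalty above a difficulty-dependent token budget."""
    difficulty = 1.0 - group_accuracy            # 0 = easy, 1 = hard
    budget = base_budget + extra_budget * difficulty
    return coeff * max(0.0, num_tokens - budget)

easy_pen = length_penalty(600, group_accuracy=0.9)   # budget 280: penalized
hard_pen = length_penalty(600, group_accuracy=0.1)   # budget 920: free
```

The same 600-token trace is penalized on the easy query but not on the hard one, which is the asymmetry a uniform length penalty cannot express.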
cs.CL / 44 / 2602.11650

Which Feedback Works for Whom? Differential Effects of LLM-Generated Feedback Elements Across Learner Profiles

哪种反馈适合谁?LLM生成的反馈元素在不同学习者特征中的差异性影响
Furuhashi, Momoka, Nakayama, Kouta, Kawai, Noboru, Kodama, Takashi, Sugawara, Saku, Takami, Kyosuke
Abstract
Large language models (LLMs) show promise for automatically generating feedback in education settings. However, it remains unclear how specific feedback elements, such as tone and information coverage, contribute to learning outcomes and learner acceptance, particularly across learners with different personality traits. In this study, we define six feedback elements and generate feedback for multiple-choice biology questions using GPT-5. We conduct a learning experiment with 321 first-year high school students and evaluate feedback effectiveness using two learning outcomes measures and subjective evaluations across six criteria. We further analyze differences in how feedback acceptance varies across learners based on Big Five personality traits. Our results show that effective feedback elements share common patterns supporting learning outcomes, while learners' subjective preferences differ across personality-based clusters. These findings highlight the importance of selecting and adapting feedback elements according to learners' personality traits when we design LLM-generated feedback, and provide practical implications for personalized feedback design in education.
Chinese Translation
大型语言模型(LLMs)在教育环境中自动生成反馈方面展现出潜力。然而,具体的反馈元素(如语气和信息覆盖)如何影响学习成果和学习者接受度,尤其是在不同个性特征的学习者之间,仍不清楚。在本研究中,我们定义了六个反馈元素,并使用GPT-5为多项选择生物问题生成反馈。我们对321名高一学生进行了学习实验,并通过两种学习成果指标和六个标准的主观评估来评估反馈的有效性。我们进一步分析了基于五大人格特质(Big Five personality traits)学习者之间反馈接受度的差异。我们的结果表明,有效的反馈元素在支持学习成果方面具有共同模式,而学习者的主观偏好在基于个性的群体中存在差异。这些发现强调了在设计LLM生成的反馈时,根据学习者的个性特征选择和调整反馈元素的重要性,并为教育中的个性化反馈设计提供了实践意义。
cs.CL / 45 / 2602.11684

PatientHub: A Unified Framework for Patient Simulation

PatientHub:患者模拟的统一框架
Sabour, Sahand, NG, TszYam, Huang, Minlie
Abstract
As Large Language Models increasingly power role-playing applications, simulating patients has become a valuable tool for training counselors and scaling therapeutic assessment. However, prior work is fragmented: existing approaches rely on incompatible, non-standardized data formats, prompts, and evaluation metrics, hindering reproducibility and fair comparison. In this paper, we introduce PatientHub, a unified and modular framework that standardizes the definition, composition, and deployment of simulated patients. To demonstrate PatientHub's utility, we implement several representative patient simulation methods as case studies, showcasing how our framework supports standardized cross-method evaluation and the seamless integration of custom evaluation metrics. We further demonstrate PatientHub's extensibility by prototyping two new simulator variants, highlighting how PatientHub accelerates method development by eliminating infrastructure overhead. By consolidating existing work into a single reproducible pipeline, PatientHub lowers the barrier to developing new simulation methods and facilitates cross-method and cross-model benchmarking. Our framework provides a practical foundation for future datasets, methods, and benchmarks in patient-centered dialogue, and the code is publicly available via https://github.com/Sahandfer/PatientHub.
Chinese Translation
随着大型语言模型在角色扮演应用中的日益普及,模拟患者已成为培训顾问和扩展治疗评估的重要工具。然而,现有研究相对分散:现有方法依赖于不兼容的、非标准化的数据格式、提示和评估指标,阻碍了可重复性和公平比较。在本文中,我们介绍了PatientHub,一个统一且模块化的框架,标准化了模拟患者的定义、组成和部署。为了展示PatientHub的实用性,我们实现了几种具有代表性的患者模拟方法作为案例研究,展示了我们的框架如何支持标准化的跨方法评估以及自定义评估指标的无缝集成。我们进一步通过原型设计两个新的模拟器变体来展示PatientHub的可扩展性,强调PatientHub如何通过消除基础设施开销加速方法开发。通过将现有工作整合到一个可重复的管道中,PatientHub降低了开发新模拟方法的门槛,并促进了跨方法和跨模型的基准测试。我们的框架为未来以患者为中心的对话中的数据集、方法和基准提供了实用基础,代码可通过 https://github.com/Sahandfer/PatientHub 公开获取。
cs.CL / 46 / 2602.11699

Finding Sense in Nonsense with Generated Contexts: Perspectives from Humans and Language Models

通过生成上下文寻找无意义中的意义:来自人类和语言模型的视角
Olsen, Katrin, Padó, Sebastian
Abstract
Nonsensical and anomalous sentences have been instrumental in the development of computational models of semantic interpretation. A core challenge is to distinguish between what is merely anomalous (but can be interpreted given a supporting context) and what is truly nonsensical. However, it is unclear (a) how nonsensical, rather than merely anomalous, existing datasets are; and (b) how well LLMs can make this distinction. In this paper, we answer both questions by collecting sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets: both context-free and when providing a context. We find that raters consider most sentences at most anomalous, and only a few as properly nonsensical. We also show that LLMs are substantially skilled in generating plausible contexts for anomalous cases.
Chinese Translation
无意义和异常句子在计算语义解释模型的发展中发挥了重要作用。一个核心挑战是区分什么仅仅是异常的(但在支持上下文的情况下可以被解释)和什么是真正无意义的。然而,目前尚不清楚(a)现有数据集中有多少句子是无意义的,而不仅仅是异常的;以及(b)大型语言模型(LLMs)在区分这两者方面的表现如何。在本文中,我们通过收集来自人类评审者和大型语言模型对来自五个语义偏差数据集的句子的意义判断来回答这两个问题:包括无上下文和提供上下文的情况。我们发现评审者认为大多数句子至多是异常的,只有少数被认为是真正无意义的。我们还展示了大型语言模型在为异常情况生成合理上下文方面具有显著的能力。
cs.CL / 47 / 2602.11731

Thinking with Drafting: Optical Decompression via Logical Reconstruction

草拟式思考:通过逻辑重构进行光学解压
Wei, Jingxuan, He, Honghao, Jia, Caijun, Li, Siyuan, Sun, Zheng, Xu, Yuhang, Lin, Yuanyuan, Sun, Linzhuang, Wu, Yuchen, Yu, Bihui, Zhang, Xiangxiang, Tan, Cheng
Abstract
Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression: the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
Chinese Translation
现有的多模态大型语言模型在视觉感知和探索性视觉生成方面已取得高保真度。然而,在复杂推理任务中仍然存在一个精确性悖论:光学感知系统转录符号却未能捕捉逻辑拓扑,而基于像素的生成模型则产生缺乏数学精确性的视觉伪影。为了弥补这一差距,我们提出将对视觉输入的推理重新概念化为光学解压——从压缩的视觉符号中重构潜在逻辑结构的过程。在“解析即推理”的公理指导下,我们引入了“草拟式思考”(Thinking with Drafting, TwD),它利用一种极简的领域特定语言(Domain-Specific Language, DSL)作为基础中介表示。与直接凭空给出答案的标准方法不同,TwD迫使模型将其心理模型草拟为可执行代码,从而生成可供自我验证的确定性视觉证明。为了验证这一点,我们提出了视觉代数基准(VisAlg)。实验表明,TwD是一种优越的认知支架。我们的工作建立了一个闭环系统,其中视觉生成不是作为创造性输出,而是作为逻辑验证者,为视觉推理提供了一条可推广的路径。
cs.CL / 48 / 2602.11748

Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

想得更久,探索更深:通过长度激励强化学习学会上下文内探索
Wang, Futing, Yan, Jianhao, Luo, Yun, Cui, Ganqu, Wang, Zhi, Qu, Xiaoye, Zhang, Yue, Cheng, Yu, Lin, Tao
Abstract
Achieving effective test-time scaling requires models to engage in In-Context Exploration -- the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration. This simple yet effective recipe explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that Length-Incentivized Exploration effectively incentivizes in-context exploration. As a result, our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
Chinese Translation
实现有效的测试时间扩展需要模型进行上下文内探索——在单一连续上下文中生成、验证和完善多个推理假设的内在能力。基于状态覆盖理论,我们的分析识别出实现这一能力的关键瓶颈:虽然更广泛的状态覆盖需要更长的推理轨迹,但在自回归生成过程中,采样此类序列的概率呈指数衰减,这一现象我们称之为“浅层探索陷阱”(Shallow Exploration Trap)。为了解决这一问题,我们提出了长度激励探索(Length-Incentivized Exploration)。这一简单而有效的方法明确鼓励模型通过基于长度的奖励和冗余惩罚进行更多探索,从而以两步方式最大化状态覆盖。在不同模型(Qwen3、Llama)上的全面实验表明,该方法有效激励了上下文内探索。因此,我们的方法在领域内任务上平均提高了4.4%,在领域外基准上获得了2.7%的提升。
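A length-based reward coupled with a redundancy penalty, as the abstract describes, can be sketched like this; the n-gram repetition measure and the weights are illustrative assumptions:

```python
# Reward grows with trace length (more state coverage) but shrinks with
# the fraction of repeated n-grams, so mere looping earns nothing.

def exploration_reward(tokens, alpha=0.01, beta=1.0, n=3):
    """alpha scales the length incentive, beta the redundancy penalty."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    redundancy = 1.0 - len(set(ngrams)) / len(ngrams)
    return alpha * len(tokens) - beta * redundancy

diverse = exploration_reward(list(range(100)))   # long, no repeats
looping = exploration_reward([1, 2, 3] * 34)     # similar length, mostly repeats
```

Both traces are about the same length, but the repetitive one forfeits almost all of its length reward, which is the escape hatch from the degenerate "just generate more tokens" strategy.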
cs.CL / 49 / 2602.11761

MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

MiniCPM-SALA:稀疏和线性注意力的混合以实现高效的长上下文建模
MiniCPM Team, An, Wenhao, Chen, Yingfa, Fang, Yewei, Li, Jiayi, Li, Xin, Li, Yaohui, Li, Yishan, Li, Yuxuan, Lin, Biyuan, Liu, Chuan, Liu, Hezi, Liu, Siyuan, Lyu, Hongya, Pan, Yinxu, Ren, Shixin, Shen, Xingyu, Su, Zhou, Sun, Haojun, Sun, Yangang, Thai, Zhen Leng, Tian, Xin, Wang, Rui, Wang, Xiaorong, Wang, Yudong, Wu, Bo, Xu, Xiaoyue, Xu, Dong, Xue, Shuaikang, Yang, Jiawei, Zhang, Bowen, Zhang, Jinqian, Zhang, Letian, Zhang, Shengnan, Zhang, Xinyu, Zhang, Xinyuan, Zhang, Zhu, Zhao, Hengyu, Zhao, Jiacheng, Zhou, Jie, Zhou, Zihan, Wang, Shuo, Xiao, Chaojun, Han, Xu, Liu, Zhiyuan, Sun, Maosong
Abstract
The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, which reduces training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale where traditional full-attention 8B models fail because of memory constraints.
Chinese Translation
大型语言模型(LLMs)在超长上下文应用中的演变面临着由Transformer架构带来的高计算和内存成本所带来的挑战。尽管现有的稀疏和线性注意力机制试图缓解这些问题,但它们通常在内存效率和模型性能之间存在权衡。本文介绍了MiniCPM-SALA,一种具有90亿参数的混合架构,它将稀疏注意力(InfLLM-V2)的高保真长上下文建模与线性注意力(Lightning Attention)的全局效率相结合。通过采用层选择算法以1:3的比例整合这些机制,并利用混合位置编码(HyPE),该模型在长上下文任务中保持了效率和性能。此外,我们还引入了一种具有成本效益的持续训练框架,将预训练的基于Transformer的模型转变为混合模型,与从头开始训练相比,训练成本减少了约75%。大量实验表明,MiniCPM-SALA在保持与全注意力模型相当的通用能力的同时,提供了更高的效率。在单个NVIDIA A6000D GPU上,该模型在256K标记的序列长度下实现了最高3.5倍于全注意力模型的推理速度,并支持高达1M标记的上下文长度,这是传统全注意力8B模型因内存限制而无法处理的规模。
cs.CL / 50 / 2602.11795

A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments

一种基于子词嵌入的卢森堡语用户评论变异检测方法
Lutgen, Anne-Marie, Plum, Alistair, Purschke, Christoph
Abstract
This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity. This allows spelling and morphological diversity to be examined and analysed as linguistic structure rather than treated as noise. Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure does not strictly require manual annotation, but does produce transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in "noisy" or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.
Chinese Translation
本文提出了一种基于嵌入的变异检测方法,无需依赖事先的规范化或预定义的变体列表。该方法在原始文本上训练子词嵌入,并通过结合余弦相似度和n-gram相似度对相关形式进行分组。这使得拼写和形态多样性可以作为语言结构加以检查和分析,而不是被当作噪声处理。利用大规模卢森堡语用户评论语料库,该方法揭示了广泛的词汇和正字法变异,与方言学和社会语言学研究中描述的模式相一致。所归纳出的词形家族捕捉了系统性的对应关系,并突出了区域和风格差异的领域。该过程并不严格要求人工注释,但确实产生了透明的聚类,支持定量和定性分析。结果表明,即使在“嘈杂”或低资源环境中,分布式建模也能够揭示有意义的变异模式,为在多语言和小语种背景下研究语言变体提供了可复现的方法框架。
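The combined cosine and n-gram similarity used to group related forms can be sketched as below; the toy embeddings, the Jaccard overlap, and the equal weighting are illustrative assumptions, not the paper's exact setup:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def ngram_jaccard(w1, w2, n=2):
    """Character n-gram overlap between two surface forms."""
    g1 = {w1[i:i + n] for i in range(len(w1) - n + 1)}
    g2 = {w2[i:i + n] for i in range(len(w2) - n + 1)}
    return len(g1 & g2) / len(g1 | g2)

def variant_score(w1, v1, w2, v2):
    """Equal-weight blend of distributional and orthographic similarity."""
    return 0.5 * cosine(v1, v2) + 0.5 * ngram_jaccard(w1, w2)

# Toy embeddings: spelling variants sit close, unrelated words do not.
score_variant = variant_score("haus", [1.0, 0.1], "hauss", [0.9, 0.2])
score_other = variant_score("haus", [1.0, 0.1], "wéi", [0.0, 1.0])
```

Requiring both signals keeps the clusters transparent: a pair must look alike on the surface and behave alike in context before it joins the same variant family.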
cs.CL / 51 / 2602.11871

DMAP: A Distribution Map for Text

DMAP:文本的分布图
Kempton, Tom, Rozanova, Julia, Kamalaruban, Parameswaran, Madigan, Maeve, Wresilo, Karolina, Launay, Yoann L., Sutton, David, Burrell, Stuart
Abstract
Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.
Chinese Translation
大型语言模型(LLMs)是进行统计文本分析的强大工具,其生成的下一个标记概率分布序列提供了丰富的信息。提取这一信号通常依赖于诸如困惑度(perplexity)等指标,但这些指标并未充分考虑上下文;对给定下一个标记概率的解读依赖于条件分布形状所编码的合理选择数量。在本研究中,我们提出了DMAP,这是一种数学基础的方法,通过语言模型将文本映射到单位区间中的一组样本,从而共同编码排名和概率信息。这种表示方法使得高效且与模型无关的分析成为可能,并支持多种应用。我们通过三个案例研究展示其效用:(i)验证生成参数以确保数据完整性,(ii)考察概率曲率在机器生成文本检测中的作用,以及(iii)法医分析揭示在经过合成数据后训练的下游模型中留下的统计指纹。我们的结果表明,DMAP提供了一种统一的文本统计视角,它在消费级硬件上即可轻松计算、适用范围广,并为进一步研究基于LLMs的文本分析奠定了基础。
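One way such a rank-and-probability mapping to the unit interval can work is the randomized cumulative construction below; this is an assumption in the spirit of the abstract, not DMAP's published definition:

```python
import random

def unit_interval_sample(dist, observed, rng):
    """dist: {token: prob} conditional distribution; returns a point in
    [0, 1) that stacks the mass of strictly more probable tokens below
    a random offset inside the observed token's own mass. If the text
    were truly sampled from the model, these points would be uniform."""
    p_obs = dist[observed]
    # Deterministic tie-break by token string keeps the mapping well defined.
    mass_above = sum(p for t, p in dist.items()
                     if p > p_obs or (p == p_obs and t < observed))
    return mass_above + rng.random() * p_obs

rng = random.Random(0)
dist = {"the": 0.5, "a": 0.3, "cat": 0.2}
u = unit_interval_sample(dist, "a", rng)   # lands somewhere in [0.5, 0.8)
```

Under this construction, departures from uniformity over many tokens are exactly the kind of statistical fingerprint the abstract's case studies look for.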
cs.CL / 52 / 2602.11877

Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems

迈向协作大语言模型系统中路由器的公平且全面评估
Wu, Wanxing, Zhu, He, Li, Yixia, Yang, Lei, Zhao, Jiehui, Wang, Hongru, Yang, Jian, Wang, Benyou, Jing, Bingyi, Chen, Guanhua
Abstract
Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
Chinese Translation
大型语言模型(LLMs)已取得成功,但成本和隐私限制要求在本地部署较小的模型,同时将复杂查询卸载到基于云的模型。现有的路由器评估缺乏系统性,忽视了特定场景的需求和分布外的鲁棒性。我们提出了RouterXBench,一个具有原则性的评估框架,涵盖三个维度:路由器能力、场景对齐和跨领域鲁棒性。与以往依赖输出概率或外部嵌入的工作不同,我们利用内部隐藏状态来捕捉模型在生成答案前的不确定性。我们引入了ProbeDirichlet,一个轻量级路由器,通过可学习的Dirichlet分布和概率训练聚合跨层隐藏状态。经过多领域数据训练,它在领域内和分布外场景中表现出良好的泛化能力。我们的结果表明,ProbeDirichlet在路由器能力和高准确度场景中相较于最佳基线分别实现了16.68%和18.86%的相对提升,并在不同模型家族、模型规模、异构任务和自主工作流中保持了一致的性能。
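Aggregating cross-layer hidden states with weights drawn from a Dirichlet distribution might look roughly like this; the gamma-ratio sampling is the standard Dirichlet construction, while the concentration values and toy states are illustrative assumptions, not ProbeDirichlet's trained parameters:

```python
import random

def dirichlet_sample(alphas, rng):
    """Draw layer weights from Dirichlet(alphas) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def aggregate_layers(hidden_states, weights):
    """Weighted sum over layers; hidden_states is [num_layers][dim]."""
    dim = len(hidden_states[0])
    return [sum(w * h[i] for w, h in zip(weights, hidden_states))
            for i in range(dim)]

rng = random.Random(0)
weights = dirichlet_sample([2.0, 2.0, 2.0], rng)   # simplex point, sums to 1
pooled = aggregate_layers([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], weights)
```

In the paper's setting the concentration parameters would be learned, so the router can concentrate mass on whichever layers best expose the model's pre-answer uncertainty.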
cs.CL / 53 / 2602.11886

LLM-based Triplet Extraction from Financial Reports

基于大型语言模型的财务报告三元组提取
Wesslund, Dante, Stenström, Ville, Linde, Pontus, Holmberg, Alexander
Abstract
Corporate financial reports are a valuable source of structured knowledge for Knowledge Graph construction, but the lack of annotated ground truth in this domain makes evaluation difficult. We present a semi-automated pipeline for Subject-Predicate-Object triplet extraction that uses ontology-driven proxy metrics, specifically Ontology Conformance and Faithfulness, instead of ground-truth-based evaluation. We compare a static, manually engineered ontology against a fully automated, document-specific ontology induction approach across different LLMs and two corporate annual reports. The automatically induced ontology achieves 100% schema conformance in all configurations, eliminating the ontology drift observed with the manual approach. We also propose a hybrid verification strategy that combines regex matching with an LLM-as-a-judge check, reducing apparent subject hallucination rates from 65.2% to 1.6% by filtering false positives caused by coreference resolution. Finally, we identify a systematic asymmetry between subject and object hallucinations, which we attribute to passive constructions and omitted agents in financial prose.
Chinese Translation
企业财务报告是构建知识图谱的重要结构化知识来源,但该领域缺乏标注的真实数据,使得评估变得困难。我们提出了一种半自动化的主谓宾三元组提取流程,该流程使用基于本体的代理指标,特别是本体一致性(Ontology Conformance)和忠实性(Faithfulness),而不是基于真实数据的评估。我们将静态的手工设计本体与完全自动化的文档特定本体归纳方法进行了比较,涉及不同的大型语言模型(LLMs)和两份企业年报。自动生成的本体在所有配置中实现了100%的模式一致性,消除了手工方法中观察到的本体漂移。我们还提出了一种混合验证策略,将正则表达式匹配与大型语言模型作为评判者的检查相结合,通过过滤因共指解析引起的假阳性,将明显的主题幻觉率从65.2%降低到1.6%。最后,我们识别出主语和宾语幻觉之间的系统性不对称,归因于财务散文中的被动结构和省略的施事者。
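The hybrid verification strategy, regex matching with an LLM-as-a-judge fallback, can be sketched as below; `llm_judge` is a stand-in assumption for illustration, not a real API:

```python
import re

def verify_subject(subject, source_text, llm_judge=None):
    """Return True if the triplet subject is grounded in the source.
    A cheap case-insensitive literal match handles the common case;
    only unmatched subjects are escalated to the judge."""
    pattern = re.compile(re.escape(subject), re.IGNORECASE)
    if pattern.search(source_text):
        return True
    # Coreference ("the company" standing in for the firm's name)
    # defeats the regex; defer to a judge instead of counting it as
    # a hallucination.
    return llm_judge(subject, source_text) if llm_judge else False

report = "Acme Corp reported revenue growth of 12% in 2024."
grounded = verify_subject("acme corp", report)
deferred = verify_subject("the company", report,
                          llm_judge=lambda s, t: True)
```

Filtering regex misses through a judge before labeling them hallucinations is what drives the reported drop in apparent subject hallucination rates.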
cs.CL / 54 / 2602.11898

Benchmark Illusion: Disagreement among LLMs and Its Scientific Consequences

基准幻觉:大型语言模型之间的分歧及其科学后果
Yang, Eddie, Wang, Dashun
Abstract
Benchmarks underpin how progress in large language models (LLMs) is measured and trusted. Yet our analyses reveal that apparent convergence in benchmark accuracy can conceal deep epistemic divergence. Using two major reasoning benchmarks - MMLU-Pro and GPQA - we show that LLMs achieving comparable accuracy still disagree on 16-66% of items, and 16-38% among top-performing frontier models. These discrepancies suggest distinct error profiles for different LLMs. When such models are used for scientific data annotation and inference, their hidden disagreements propagate into research results: in re-analyses of published studies in education and political science, switching the annotation model can change estimated treatment effects by more than 80%, and in some cases reverses their sign. Together, these findings illustrate a benchmark illusion, where equal accuracy may conceal disagreement, with model choice becoming a hidden yet consequential variable for scientific reproducibility.
Chinese Translation
基准测试是衡量和信任大型语言模型(LLMs)进展的基础。然而,我们的分析揭示,基准准确性的表面趋同可能掩盖了深层的认识论分歧。通过使用两个主要的推理基准——MMLU-Pro 和 GPQA——我们展示了即使在准确性上相当的 LLMs 之间,仍然在 16-66% 的项目上存在分歧,而在表现最好的前沿模型中,这一比例为 16-38%。这些差异表明不同 LLMs 具有独特的错误特征。当这些模型用于科学数据注释和推理时,它们的潜在分歧会传播到研究结果中:在对教育和政治科学领域已发表研究的重新分析中,切换注释模型可能导致估计的处理效应变化超过 80%,在某些情况下甚至会改变其符号。这些发现共同揭示了基准幻觉的存在,即相同的准确性可能掩盖了分歧,而模型选择成为科学可重复性中一个隐蔽但重要的变量。
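The core measurement, that equal benchmark accuracy can conceal item-level disagreement, is easy to reproduce on toy data; the answer vectors below are illustrative:

```python
def accuracy(preds, gold):
    """Fraction of items answered correctly."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def disagreement_rate(preds_a, preds_b):
    """Fraction of items on which two models give different answers."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

gold    = ["A", "B", "C", "D", "A", "B", "C", "D"]
model_1 = ["A", "B", "C", "D", "A", "B", "D", "A"]  # 6/8 correct
model_2 = ["A", "B", "C", "D", "B", "A", "C", "D"]  # 6/8 correct
same_acc = accuracy(model_1, gold) == accuracy(model_2, gold)
rate = disagreement_rate(model_1, model_2)          # they differ on 4/8 items
```

Two models tie at 75% accuracy yet answer half the items differently, so any downstream annotation pipeline inherits a distinct error profile depending on which one is chosen.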
cs.CL / 55 / 2602.11931

AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection

AdaptEvolve:通过自适应模型选择提高进化人工智能代理的效率
Ray, Pretam, Brahma, Pratik Prabhanjan, Liu, Zicheng, Barsoum, Emad
Abstract
Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises a central question: how can an agent dynamically select an LLM that is sufficiently capable for the current generation step while remaining computationally efficient? While model cascades offer a practical mechanism for balancing this trade-off, existing routing strategies typically rely on static heuristics or external controllers and do not explicitly account for model uncertainty. We introduce AdaptEvolve: Adaptive LLM Selection for Multi-LLM Evolutionary Refinement within an evolutionary sequential refinement framework that leverages intrinsic generation confidence to estimate real-time solvability. Empirical results show that confidence-driven selection yields a favourable Pareto frontier, reducing total inference cost by an average of 37.9% across benchmarks while retaining 97.5% of the upper-bound accuracy of static large-model baselines. Our code is available at https://github.com/raypretam/adaptive_llm_selection.
Chinese Translation
进化代理系统在推理过程中反复调用大型语言模型(LLMs),加剧了计算效率与推理能力之间的权衡。这一背景提出了一个核心问题:代理如何动态选择一个在当前生成步骤中足够强大的LLM,同时保持计算效率?虽然模型级联提供了一种平衡这一权衡的实用机制,但现有的路由策略通常依赖于静态启发式方法或外部控制器,并未明确考虑模型的不确定性。我们提出了AdaptEvolve:在进化序列精炼框架内进行多LLM进化精炼的自适应LLM选择,该框架利用内在生成置信度来估计实时可解性。实证结果表明,基于置信度的选择产生了有利的帕累托前沿,在各基准测试中将总推理成本平均降低了37.9%,同时保留了静态大模型基线的97.5%的上限准确率。我们的代码可在https://github.com/raypretam/adaptive_llm_selection获取。
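The confidence-driven cascade the abstract describes can be sketched as follows. This is an illustration, not the authors' implementation: the use of exponentiated mean token log-probability as "intrinsic generation confidence" and the threshold value are assumptions.

```python
import math

def mean_logprob_confidence(token_logprobs):
    """Intrinsic generation confidence: exponentiated mean token log-prob."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def select_model(token_logprobs, threshold=0.6):
    """Keep the small model's draft if it looks confident, else escalate
    to a larger model in the cascade. The threshold is illustrative."""
    conf = mean_logprob_confidence(token_logprobs)
    return ("small", conf) if conf >= threshold else ("large", conf)

confident = [-0.1, -0.2, -0.1]   # high-probability tokens
uncertain = [-1.5, -2.0, -1.2]   # low-probability tokens
print(select_model(confident))   # small model suffices for this step
print(select_model(uncertain))   # escalate this step to the large model
```

Because the decision is made per generation step, easy refinement steps stay cheap while hard ones get the larger model, which is where the reported cost/accuracy Pareto gain comes from.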
cs.CL / 56 / 2602.11933

Cross-Modal Robustness Transfer (CMRT): Training Robust Speech Translation Models Using Adversarial Text

跨模态鲁棒性转移 (CMRT):使用对抗文本训练鲁棒语音翻译模型
Issam, Abderrahmane, Semerci, Yusuf Can, Scholtes, Jan, Spanakis, Gerasimos
Abstract
End-to-End Speech Translation (E2E-ST) has seen significant advancements, yet current models are primarily benchmarked on curated, "clean" datasets. This overlooks critical real-world challenges, such as morphological robustness to inflectional variations common in non-native or dialectal speech. In this work, we adapt a text-based adversarial attack targeting inflectional morphology to the speech domain and demonstrate that state-of-the-art E2E-ST models are highly vulnerable to it. While adversarial training effectively mitigates such risks in text-based tasks, generating high-quality adversarial speech data remains computationally expensive and technically challenging. To address this, we propose Cross-Modal Robustness Transfer (CMRT), a framework that transfers adversarial robustness from the text modality to the speech modality. Our method eliminates the requirement for adversarial speech data during training. Extensive experiments across four language pairs demonstrate that CMRT improves adversarial robustness by an average of more than 3 BLEU points, establishing a new baseline for robust E2E-ST without the overhead of generating adversarial speech.
Chinese Translation
端到端语音翻译 (E2E-ST) 取得了显著进展,但当前模型主要在经过精心挑选的“干净”数据集上进行基准测试。这忽视了现实世界中的关键挑战,例如对非母语或方言语音中常见的屈折变化的形态鲁棒性。在本研究中,我们将针对屈折形态的基于文本的对抗攻击适配到语音领域,并证明最先进的 E2E-ST 模型对此高度脆弱。尽管对抗训练在基于文本的任务中有效减轻了此类风险,但生成高质量的对抗语音数据仍然计算成本高且技术挑战重重。为了解决这一问题,我们提出了跨模态鲁棒性转移 (CMRT) 框架,该框架将对抗鲁棒性从文本模态转移到语音模态。我们的方法消除了训练过程中对对抗语音数据的需求。在四种语言对上的广泛实验表明,CMRT 将对抗鲁棒性平均提高了超过 3 个 BLEU 点,在无需生成对抗语音开销的情况下为鲁棒 E2E-ST 建立了新的基线。
cs.CL / 57 / 2602.11938

Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance

谁是冠军联赛中最富有的俱乐部?检测和重写不明确的问题以提高问答性能
Huang, Yunchong, Barlacchi, Gianni, Pezzelle, Sandro
Abstract
Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions - queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based classifier to identify underspecified questions and apply it to several widely used QA datasets, finding that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. To isolate the effect of underspecification, we conduct a controlled rewriting experiment that serves as an upper-bound analysis, rewriting underspecified questions into fully specified variants while holding gold answers fixed. QA performance consistently improves under this setting, indicating that many apparent QA failures stem from question underspecification rather than model limitations. Our findings highlight underspecification as an important confound in QA evaluation and motivate greater attention to question clarity in benchmark design.
Chinese Translation
大型语言模型(LLMs)在明确的问题上表现良好,但标准的问答(QA)基准仍然远未解决。我们认为,这一差距部分是由于不明确的问题——那些在没有额外上下文的情况下无法唯一确定其解释的查询。为了验证这一假设,我们引入了一种基于LLM的分类器来识别不明确的问题,并将其应用于几个广泛使用的QA数据集,发现16%到超过50%的基准问题是不明确的,并且LLMs在这些问题上的表现显著较差。为了隔离不明确性的影响,我们进行了一项受控重写实验,作为上限分析,将不明确的问题重写为完全明确的变体,同时保持金标准答案不变。在这种情况下,QA性能持续改善,表明许多明显的QA失败源于问题的不明确性,而不是模型的局限性。我们的研究结果强调不明确性作为QA评估中的一个重要混淆因素,并促使在基准设计中更加关注问题的清晰性。
cs.CL / 58 / 2602.11939

Do Large Language Models Adapt to Language Variation across Socioeconomic Status?

大型语言模型是否适应不同社会经济地位的语言变异?
Bassignana, Elisa, Zhang, Mike, Hovy, Dirk, Curry, Amanda Cercas
Abstract
Humans adjust their linguistic style to the audience they are addressing. However, the extent to which LLMs adapt to different social contexts is largely unknown. As these models increasingly mediate human-to-human communication, their failure to adapt to diverse styles can perpetuate stereotypes and marginalize communities whose linguistic norms are less closely mirrored by the models, thereby reinforcing social stratification. We study the extent to which LLMs integrate into social media communication across different socioeconomic status (SES) communities. We collect a novel dataset from Reddit and YouTube, stratified by SES. We prompt four LLMs with incomplete text from that corpus and compare the LLM-generated completions to the originals along 94 sociolinguistic metrics, including syntactic, rhetorical, and lexical features. LLMs modulate their style with respect to SES to only a minor extent, often resulting in approximation or caricature, and tend to emulate the style of upper SES more effectively. Our findings (1) show how LLMs risk amplifying linguistic hierarchies and (2) call into question their validity for agent-based social simulation, survey experiments, and any research relying on language style as a social signal.
Chinese Translation
人类会根据所面对的听众调整其语言风格。然而,大型语言模型(LLMs)在不同社会背景下的适应程度在很大程度上仍不清楚。随着这些模型在促进人际沟通中扮演越来越重要的角色,它们未能适应多样化风格可能会延续刻板印象,并使那些语言规范与模型不太相符的社区边缘化,从而加剧社会分层。我们研究了LLMs在不同社会经济地位(SES)社区中融入社交媒体沟通的程度。我们从Reddit和YouTube收集了一个新颖的数据集,并按SES进行分层。我们使用该语料库中的不完整文本对四个LLMs进行提示,并在94个社会语言学指标(包括句法、修辞和词汇特征)上比较LLM生成的文本与原始文本。LLMs仅在很小的程度上根据SES调整其风格,通常导致近似或夸张,并且更有效地模仿上层SES的风格。我们的研究结果(1)显示LLMs有可能放大语言等级,且(2)质疑其在基于代理的社会模拟、调查实验以及任何依赖语言风格作为社会信号的研究中的有效性。
cs.CL / 59 / 2602.11961

Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models

利用开放大型语言模型进行多语言机器翻译的模型与数据扩展
Shang, Yuzhe, Gao, Pengzhi, Liu, Wei, Luan, Jian, Su, Jinsong
Abstract
Open large language models (LLMs) have demonstrated improving multilingual capabilities in recent years. In this paper, we present a study of open LLMs for multilingual machine translation (MT) across a range of languages, and investigate the effects of model scaling and data scaling when adapting open LLMs to multilingual MT through continual pretraining and instruction finetuning. Based on the Gemma3 model family, we develop MiLMMT-46, which achieves top-tier multilingual translation performance across 46 languages. Extensive experiments show that MiLMMT-46 consistently outperforms recent state-of-the-art (SOTA) models, including Seed-X, HY-MT-1.5, and TranslateGemma, and achieves competitive performance with strong proprietary systems such as Google Translate and Gemini 3 Pro.
Chinese Translation
开放大型语言模型(LLMs)近年来在多语言能力方面表现出显著提升。本文研究了开放LLMs在多语言机器翻译(MT)中的应用,涵盖多种语言,并探讨了在通过持续预训练和指令微调将开放LLMs适应于多语言MT时模型扩展和数据扩展的影响。基于Gemma3模型系列,我们开发了MiLMMT-46,该模型在46种语言上实现了顶级的多语言翻译性能。大量实验表明,MiLMMT-46在性能上始终优于近期的最先进(SOTA)模型,包括Seed-X、HY-MT-1.5和TranslateGemma,并在性能上与强大的专有系统如Google Translate和Gemini 3 Pro相媲美。
cs.CL / 60 / 2602.11968

DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling

DHPLT:用于语义变化建模的大规模多语言历时语料库和词表示
Fedorova, Mariia, Kutuzov, Andrey, Umarova, Khonzoda
Abstract
In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as the approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while at the same time leaving it open for other researchers to come up with their own target words using the same datasets. DHPLT aims to fill the current lack of multilingual diachronic corpora for semantic change modelling (beyond a dozen high-resource languages). It opens the way for a variety of new experimental setups in this field. All the resources described in this paper are available at https://data.hplt-project.org/three/diachronic/, sorted by language.
Chinese Translation
在这篇资源论文中,我们介绍了DHPLT,这是一个包含41种多样语言的开放历时语料库。DHPLT基于网络爬取的HPLT数据集;我们使用网络爬取的时间戳作为文档创建时间的近似信号。该集合涵盖了三个时间段:2011-2015年、2020-2021年和2024年至今(每种语言每个时间段包含100万份文档)。我们还提供了预计算的词类型和词元嵌入以及我们选择的目标词的词汇替代,同时也允许其他研究者使用相同的数据集提出他们自己的目标词。DHPLT旨在填补当前在语义变化建模方面缺乏多语言历时语料库的空白(超越十几种高资源语言)。它为该领域的多种新实验设置铺平了道路。本文中描述的所有资源均可在 https://data.hplt-project.org/three/diachronic/ 获取,按语言分类。
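The timestamp-to-period assignment that DHPLT relies on can be sketched directly; the period boundaries below follow the three spans named in the abstract, while the exact cutoff dates used by the project are an assumption here.

```python
from datetime import date

# Assumed boundaries for the three DHPLT time spans.
PERIODS = [
    ("2011-2015",    date(2011, 1, 1), date(2015, 12, 31)),
    ("2020-2021",    date(2020, 1, 1), date(2021, 12, 31)),
    ("2024-present", date(2024, 1, 1), date.max),
]

def assign_period(crawl_date):
    """Map a web-crawl timestamp to a DHPLT period, or None if it falls
    in one of the gaps between the three covered spans."""
    for name, start, end in PERIODS:
        if start <= crawl_date <= end:
            return name
    return None

print(assign_period(date(2013, 6, 1)))  # → 2011-2015
print(assign_period(date(2022, 3, 1)))  # → None (gap between periods)
```

Note the crawl timestamp is only a proxy for creation time, so any diachronic analysis built on such buckets inherits that approximation.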
cs.CL / 61 / 2602.11982

Automatic Simplification of Common Vulnerabilities and Exposures Descriptions

常见漏洞与暴露描述的自动简化
Vehomäki, Varpu, Kaski, Kimmo K.
Abstract
Understanding cyber security is increasingly important for individuals and organizations. However, a lot of information related to cyber security can be difficult to understand for those not familiar with the topic. In this study, we focus on investigating how large language models (LLMs) could be utilized in automatic text simplification (ATS) of Common Vulnerabilities and Exposures (CVE) descriptions. Automatic text simplification has been studied in several contexts, such as medical, scientific, and news texts, but it has not yet been studied to simplify texts in the rapidly changing and complex domain of cyber security. We created a baseline for cyber security ATS and a test dataset of 40 CVE descriptions, evaluated by two groups of cyber security experts in two survey rounds. We have found that while out-of-the-box LLMs can make the text appear simpler, they struggle with meaning preservation. Code and data are available at https://version.aalto.fi/gitlab/vehomav1/simplification_nmi.
Chinese Translation
理解网络安全对于个人和组织越来越重要。然而,许多与网络安全相关的信息对于不熟悉该主题的人来说可能难以理解。在本研究中,我们重点调查了大型语言模型(LLMs)如何在常见漏洞与暴露(CVE)描述的自动文本简化(ATS)中发挥作用。自动文本简化已在医学、科学和新闻文本等多个领域进行了研究,但在快速变化且复杂的网络安全领域尚未进行相关研究。我们为网络安全的自动文本简化创建了基线,并构建了一个包含40个CVE描述的测试数据集,经过两轮调查由两组网络安全专家进行评估。我们发现,尽管现成的LLMs可以使文本看起来更简单,但它们在保持意义方面存在困难。代码和数据可在 https://version.aalto.fi/gitlab/vehomav1/simplification_nmi 获取。
cs.CL / 62 / 2602.12005

LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss

LaCy:小型语言模型可以和应该学习的内容不仅仅是损失的问题
Ujváry, Szilvia, Béthune, Louis, Ablin, Pierre, Monteiro, João, Cuturi, Marco, Kirchhof, Michael
Abstract
Language models have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter size. Especially the capacity of Small Language Models (SLMs) is limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of which tokens an SLM can and should learn during pretraining, versus which ones it should delegate via a dedicated delegation token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground-truth, some tokens are acceptable in that they are truthful alternative continuations of a pretraining document, and should not trigger a delegation even if their loss is high. We find that a spaCy grammar parser can help augment the loss signal to decide which tokens the SLM should learn to delegate to prevent factual errors and which are safe to learn and predict even under high losses. We propose LaCy, a novel pretraining method based on this token selection philosophy. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and where to delegate for help. This results in higher FactScores when generating in a cascade with a bigger model and outperforms Rho or LLM-judge trained SLMs, while being simpler and cheaper.
Chinese Translation
语言模型不断发展,以将更多的世界知识压缩到其参数中,但可以预训练到它们中的知识受到参数大小的上限限制。尤其是小型语言模型(SLMs)的容量有限,导致生成的内容在事实上的不准确性。这个问题通常通过让SLM访问外部来源来缓解:能够查询更大的模型、文档或数据库。在这种情况下,我们研究了一个基本问题,即SLM在预训练期间可以和应该学习哪些标记,以及哪些标记应该通过专门的委托标记委托出去。我们发现这不仅仅是一个损失的问题:尽管损失可以预测预测的标记是否与真实值不匹配,但某些标记是可接受的,因为它们是预训练文档的真实替代延续,即使它们的损失很高,也不应该触发委托。我们发现,spaCy语法解析器可以帮助增强损失信号,以决定SLM应该学习委托哪些标记,以防止事实错误,以及哪些标记在高损失下仍然安全学习和预测。我们提出了LaCy,这是一种基于这种标记选择理念的新型预训练方法。我们的实验表明,LaCy模型成功地学习了哪些标记需要预测,哪些需要委托以获得帮助。这导致在与更大模型级联生成时,FactScores更高,并且在性能上优于Rho或LLM-judge训练的SLMs,同时更简单且成本更低。
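The token-selection philosophy described above can be sketched as a simple decision rule. This is a toy illustration only: the loss threshold and the boolean "acceptable alternative" flag are assumptions, standing in for the paper's spaCy-derived grammar signal.

```python
def should_delegate(token_loss, is_acceptable_alternative,
                    loss_threshold=2.0):
    """Decide whether the SLM should learn a token or emit a delegation
    token during pretraining.

    High loss alone is not enough: a token whose continuation is a
    truthful alternative (flagged here by a parser-style signal) is safe
    to learn even under high loss. The threshold is illustrative.
    """
    if token_loss <= loss_threshold:
        return False                       # easy token: learn it
    return not is_acceptable_alternative   # hard and factual: delegate

# A high-loss but acceptable continuation (e.g. a stylistic variant): learn.
print(should_delegate(3.5, is_acceptable_alternative=True))   # → False
# A high-loss factual token (e.g. an entity name): delegate for help.
print(should_delegate(3.5, is_acceptable_alternative=False))  # → True
```

The point of the augmentation is precisely the first case: a pure loss filter would wrongly delegate acceptable high-loss tokens.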
cs.CL / 63 / 2602.12015

Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study

从不稳定性中解开大型语言模型的模糊性:临床文本到SQL的案例研究
Ziletti, Angelo, D'Ambrosi, Leonardo
Abstract
Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations --> answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions - query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.
Chinese Translation
在临床文本到SQL的应用中,部署大型语言模型需要区分两种性质不同的输出多样性原因:(i) 应触发澄清的输入模糊性,(ii) 应触发人工审查的模型不稳定性。我们提出了CLUES,一个将文本到SQL建模为两阶段过程(解释 --> 答案)的框架,并将语义不确定性分解为模糊性评分和不稳定性评分。不稳定性评分通过二分语义图矩阵的Schur补计算得出。在AmbigQA/SituatedQA(黄金解释)和临床文本到SQL基准(已知解释)中,CLUES在失败预测方面优于最先进的核语言熵(Kernel Language Entropy)。在部署环境中,它保持竞争力,同时提供单一评分无法提供的诊断分解。由此产生的不确定性机制映射到针对性的干预措施——针对模糊性的查询优化,针对不稳定性的模型改进。高模糊性/高不稳定性机制包含51%的错误,同时覆盖25%的查询,从而实现高效的分类。
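The Schur-complement mechanics behind the instability score can be illustrated on a toy bipartite graph of interpretations and answers. The exact matrix construction and reduction used by CLUES may differ; this sketch only shows why eliminating the interpretation block over a graph Laplacian separates "one interpretation, many answers" (instability) from "many interpretations, one answer each" (ambiguity).

```python
import numpy as np

def instability_score(W):
    """Toy instability summary from a bipartite semantic graph.

    W[i, j] is the affinity between interpretation i and answer j. Build
    the bipartite adjacency, take the graph Laplacian, and form the Schur
    complement of the interpretation block onto the answer block; its
    trace measures how strongly answers are coupled *through shared
    interpretations*.
    """
    m, n = W.shape
    A = np.block([[np.zeros((m, m)), W],
                  [W.T, np.zeros((n, n))]])
    L = np.diag(A.sum(axis=1)) - A                  # graph Laplacian
    L11, L12, L21, L22 = L[:m, :m], L[:m, m:], L[m:, :m], L[m:, m:]
    schur = L22 - L21 @ np.linalg.pinv(L11) @ L12
    return float(np.trace(schur))

# One interpretation yielding two different answers (instability) vs.
# two interpretations each yielding its own answer (ambiguity only).
one_interp_two_answers  = np.array([[1.0, 1.0], [0.0, 0.0]])
two_interps_one_answer  = np.eye(2)
print(instability_score(one_interp_two_answers))   # coupled: positive
print(instability_score(two_interps_one_answer))   # decoupled: ~0
```

In the decoupled case the Schur complement collapses to zero, matching the intuition that answer diversity explained entirely by distinct interpretations is ambiguity, not instability.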
cs.CL / 64 / 2602.12036

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Composition-RL:为大型语言模型的强化学习构建可验证的提示
Xu, Xin, Bai, Clive, Yang, Kai, Chen, Tianhao, Chen, Yangkun, Liu, Weijie, Chen, Hao, Wang, Yang, Yang, Saiyong, Yang, Can
Abstract
Large-scale verifiable prompts underpin the success of Reinforcement Learning with Verifiable Rewards (RLVR), but they contain many uninformative examples and are costly to expand further. Recent studies focus on better exploiting limited training data by prioritizing hard prompts whose rollout pass rate is 0. However, easy prompts with a pass rate of 1 also become increasingly prevalent as training progresses, thereby reducing the effective data size. To mitigate this, we propose Composition-RL, a simple yet useful approach for better utilizing limited verifiable prompts targeting pass-rate-1 prompts. More specifically, Composition-RL automatically composes multiple problems into a new verifiable question and uses these compositional prompts for RL training. Extensive experiments across model sizes from 4B to 30B show that Composition-RL consistently improves reasoning capability over RL trained on the original dataset. Performance can be further boosted with a curriculum variant of Composition-RL that gradually increases compositional depth over training. Additionally, Composition-RL enables more effective cross-domain RL by composing prompts drawn from different domains. Codes, datasets, and models are available at https://github.com/XinXU-USTC/Composition-RL.
Chinese Translation
大规模可验证提示是具有可验证奖励的强化学习(RLVR)成功的基础,但它们包含许多无信息的示例,并且扩展成本高。近期研究集中于通过优先考虑回滚通过率为0的困难提示,更好地利用有限的训练数据。然而,随着训练的进行,通过率为1的简单提示也变得越来越普遍,从而减少了有效数据的大小。为了解决这个问题,我们提出了Composition-RL,这是一种简单而有效的方法,旨在更好地利用有限的可验证提示,专注于通过率为1的提示。更具体地说,Composition-RL自动将多个问题组合成一个新的可验证问题,并使用这些组合提示进行强化学习训练。在4B到30B的模型规模上进行的广泛实验表明,Composition-RL在推理能力上始终优于在原始数据集上训练的强化学习。通过逐渐增加训练中的组合深度,Composition-RL的课程变体可以进一步提升性能。此外,Composition-RL通过组合来自不同领域的提示,使跨领域强化学习更加有效。代码、数据集和模型可在 https://github.com/XinXU-USTC/Composition-RL 获取。
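A minimal sketch of composing several verifiable problems into one prompt, as the abstract describes. The chaining scheme (feeding each step's answer into the next as a variable X) and all problem text are illustrative assumptions, not the paper's exact construction.

```python
def compose(problems):
    """Chain single-answer problems into one compositional verifiable prompt.

    Each problem is (text, solver); the answer of step k is threaded into
    step k+1 as X, so a pass-rate-1 sub-problem becomes one link in a
    harder chain and only the final answer needs verification.
    """
    lines, x = [], None
    for k, (text, solver) in enumerate(problems, start=1):
        lines.append(f"Step {k}: {text}")
        x = solver(x)
    prompt = ("Let X denote the answer of the previous step.\n"
              + "\n".join(lines)
              + "\nReport only the final answer.")
    return prompt, x   # prompt plus the gold final answer for the verifier

p1 = ("Compute 3 * 4.", lambda _: 3 * 4)
p2 = ("Add 5 to X.",    lambda x: x + 5)
prompt, gold = compose([p1, p2])
print(gold)  # → 17
```

A curriculum variant would simply grow the length of the `problems` list over training, matching the paper's gradually increasing compositional depth.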
cs.CL / 65 / 2602.12092

DeepSight: An All-in-One LM Safety Toolkit

DeepSight:一体化大型模型安全工具包
Zhang, Bo, Guo, Jiaxuan, Li, Lijun, Liu, Dongrui, Chen, Sujin, Chen, Guanxu, Zheng, Zhijie, Lin, Qihao, Yan, Lewen, Qian, Chen, Zhou, Yijin, Wu, Yuyao, Guo, Shaoxiong, Du, Tianyi, Yang, Jingyi, Hu, Xuhao, Miao, Ziqi, Lu, Xiaoya, Shao, Jing, Hu, Xia
Abstract
As the development of Large Models (LMs) progresses rapidly, their safety has become a priority. In current safety workflows for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks but cannot identify internal root causes. Meanwhile, safety diagnosis often drifts from concrete risk scenarios and remains at the explanatory level. As a result, safety alignment lacks dedicated explanations of changes in internal mechanisms, potentially degrading general capabilities. To systematically address these issues, we propose an open-source project, DeepSight, to practice a new integrated safety evaluation-diagnosis paradigm. DeepSight is a low-cost, reproducible, efficient, and highly scalable large-scale model safety evaluation project consisting of an evaluation toolkit, DeepSafe, and a diagnosis toolkit, DeepScan. By unifying task and data protocols, we build a connection between the two stages and transform safety evaluation from black-box observation into white-box insight. Besides, DeepSight is the first open-source toolkit that supports frontier AI risk evaluation and joint safety evaluation and diagnosis.
Chinese Translation
随着大型模型(LMs)开发的快速推进,其安全性也成为了优先考虑的问题。在当前的大型语言模型(LLMs)和多模态大型语言模型(MLLMs)安全工作流程中,评估、诊断和对齐通常由不同的工具处理。具体而言,安全评估只能定位外部行为风险,但无法找出内部根本原因。同时,安全诊断往往偏离具体的风险场景,停留在可解释的层面。因此,安全对齐缺乏对内部机制变化的专门解释,可能会降低整体能力。为系统性地解决这些问题,我们提出了一个开源项目,即DeepSight,以实践新的安全评估-诊断集成范式。DeepSight是一个低成本、可重复、有效且高度可扩展的大规模模型安全评估项目,包含评估工具包DeepSafe和诊断工具包DeepScan。通过统一任务和数据协议,我们建立了两个阶段之间的连接,将安全评估从黑箱转变为白箱洞察。此外,DeepSight是第一个支持前沿人工智能风险评估及联合安全评估和诊断的开源工具包。
cs.CL / 66 / 2602.12116

P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

P-GenRM:基于测试时用户的个性化生成奖励模型
Zhang, Pinyi, Lin, Ting-En, Wu, Yuchuan, Chen, Jingyang, Wang, Zongqi, Yang, Hua, Xu, Ze, Huang, Fei, Zhang, Kai, Li, Yongbin
Abstract
Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) oversimplifying diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) struggling with generalization to new users with limited feedback. To this end, we propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across various scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user's scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirical results show that P-GenRM achieves state-of-the-art results on widely-used personalized reward model benchmarks, with an average improvement of 2.31%, and demonstrates strong generalization on an out-of-distribution dataset. Notably, Test-time User-based scaling provides an additional 3% boost, demonstrating stronger personalized alignment with test-time scalability.
Chinese Translation
个性化对齐大型语言模型旨在根据个体用户的偏好调整响应,通常通过强化学习实现。一个关键挑战是在开放式场景中获取准确的用户特定奖励信号。现有的个性化奖励模型面临两个持续的局限性:(1)将多样化的、场景特定的偏好简化为一小组固定的评估原则,以及(2)在有限反馈的情况下难以对新用户进行泛化。为此,我们提出了P-GenRM,这是第一个基于测试时用户的个性化生成奖励模型。P-GenRM将偏好信号转化为结构化的评估链,从而在各种场景中推导出自适应的人物角色和评分标准。它进一步将用户聚类为用户原型,并引入双粒度缩放机制:在个体层面,自适应地缩放和聚合每个用户的评分方案;在原型层面,结合相似用户的偏好。这一设计减轻了推断偏好的噪声,并通过基于原型的迁移增强了对未见用户的泛化。实证结果表明,P-GenRM在广泛使用的个性化奖励模型基准上取得了最先进的结果,平均提升2.31%,并在一个分布外数据集上展示了强大的泛化能力。值得注意的是,基于测试时用户的缩放提供了额外的3%的提升,展示了更强的个性化对齐和测试时可扩展性。
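The dual-granularity idea (lean on a user prototype when individual feedback is sparse) can be sketched as a simple evidence-weighted blend. The blending formula and the smoothing constant are assumptions for illustration; the paper's actual scaling mechanism operates over structured evaluation chains.

```python
def personalized_score(user_scores, prototype_scores, n_feedback, k=5.0):
    """Blend a user's own rubric scores with their prototype's.

    The weight on individual evidence grows with the amount of feedback
    observed for that user (k is an illustrative smoothing constant), so
    sparse users fall back on prototype-level preferences.
    """
    w = n_feedback / (n_feedback + k)
    return [w * u + (1 - w) * p
            for u, p in zip(user_scores, prototype_scores)]

# A new user (one feedback item) stays close to the prototype...
print(personalized_score([1.0], [0.0], n_feedback=1))
# ...while a well-observed user relies mostly on their own signal.
print(personalized_score([1.0], [0.0], n_feedback=45))  # → [0.9]
```

This kind of shrinkage both denoises preferences inferred from little data and gives unseen users a sensible prototype-based starting point.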
cs.CL / 67 / 2602.12132

A Rule-based Computational Model for Gaidhlig Morphology

基于规则的盖尔语形态学计算模型
Barclay, Peter J
Abstract
Language models and software tools are essential to support the continuing vitality of lesser-used languages; however, currently popular neural models require considerable data for training, which normally is not available for such low-resource languages. This paper describes work-in-progress to construct a rule-based model of Gaidhlig morphology using data from Wiktionary, arguing that rule-based systems effectively leverage limited sample data, support greater interpretability, and provide insights useful in the design of teaching materials. The use of SQL for querying the occurrence of different lexical patterns is investigated, and a declarative rule-base is presented that allows Python utilities to derive inflected forms of Gaidhlig words. This functionality could be used to support educational tools that teach or explain language patterns, for example, or to support higher level tools such as rule-based dependency parsers. This approach adds value to the data already present in Wiktionary by adapting it to new use-cases.
Chinese Translation
语言模型和软件工具对于支持使用较少的语言的持续活力至关重要;然而,目前流行的神经模型需要大量数据进行训练,而这类低资源语言通常无法提供如此数据。本文描述了正在进行的工作,旨在利用来自维基词典的数据构建一个基于规则的盖尔语形态学模型,认为基于规则的系统能够有效利用有限的样本数据,支持更高的可解释性,并提供对教学材料设计有用的见解。我们探讨了使用SQL查询不同词汇模式出现情况,并提出了一个声明式规则库,允许Python工具推导盖尔语单词的屈折形式。这一功能可用于支持教授或解释语言模式的教育工具,或支持更高级的工具,如基于规则的依赖解析器。该方法通过将维基词典中已有的数据适应于新的使用场景,为其增添了价值。
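A declarative rule base for deriving inflected forms, as described above, can be sketched in a few lines. The lemmas and the rule below are toy examples, not the paper's Wiktionary-derived data, and the lenition rule is deliberately simplified (real Gàidhlig lenition has exceptions, e.g. for sg/sp/st clusters and l, n, r).

```python
# Consonants that (in this simplified sketch) undergo lenition.
LENITABLE = set("bcdfgmpst")

def lenite(word):
    """Lenition: insert 'h' after a lenitable initial consonant."""
    if word and word[0].lower() in LENITABLE and word[1:2] != "h":
        return word[0] + "h" + word[1:]
    return word

# Declarative rule base mapping grammatical contexts to transformations.
RULES = {"vocative": lenite}

def inflect(word, form):
    """Derive an inflected form by applying the rule registered for it."""
    return RULES[form](word)

print(inflect("mòr", "vocative"))     # → mhòr
print(inflect("athair", "vocative"))  # vowel-initial: unchanged
```

Because each rule is a named, inspectable function, the same rule base can drive both derivation utilities and the kind of explanatory teaching tools the paper envisages.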
cs.CL / 68 / 2602.12135

WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models

WavBench:端到端语音对话模型的推理、口语化和副语言学基准测试
Li, Yangzhuo, Ji, Shengpeng, Chen, Yifu, Liang, Tianle, Ying, Haorong, Wang, Yule, Li, Junbo, Fang, Jun, Zhao, Zhou
Abstract
With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) Basic subset, defining a novel standard for spoken colloquialism that prioritizes "listenability" through natural vocabulary, linguistic fluency, and interactive rapport, rather than rigid written accuracy; and 3) Acoustic subset, covering explicit understanding, generation, and implicit dialogue to rigorously evaluate comprehensive paralinguistic capabilities within authentic real-world scenarios. Through evaluating five state-of-the-art models, WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models. The benchmark dataset and evaluation toolkit are available at https://naruto-2024.github.io/wavbench.github.io/.
Chinese Translation
随着先进推理能力迅速融入语音对话模型,该领域迫切需要超越简单交互的基准测试,以应对现实世界的复杂性。然而,目前的评估主要遵循文本生成标准,忽视了副语言学和口语化的独特音频特征,以及现代智能体所需的认知深度。为了解决这一问题,我们引入了WavBench,这是一项全面的基准测试,旨在评估现实对话能力,而现有工作则未能满足这一需求。WavBench独特地建立了一个三部分框架:1)Pro子集,旨在以显著增加的难度严格挑战增强推理能力的模型;2)Basic子集,定义了一种新的口语化标准,优先考虑通过自然词汇、语言流畅性和互动关系来实现“可听性”,而非严格的书面准确性;3)Acoustic子集,涵盖显性理解、生成和隐性对话,以严格评估真实世界场景中的全面副语言能力。通过评估五个最先进的模型,WavBench提供了关于复杂问题解决、口语表达和副语言忠实度交集的重要见解,指导强大语音对话模型的发展。基准数据集和评估工具包可在 https://naruto-2024.github.io/wavbench.github.io/ 获取。
cs.CL / 69 / 2602.12137

CitiLink-Minutes: A Multilayer Annotated Dataset of Municipal Meeting Minutes

CitiLink-Minutes:一个多层次注释的市政会议记录数据集
Campos, Ricardo, Pacheco, Ana Filipa, Fernandes, Ana Luísa, Cantante, Inês, Rebouças, Rute, Cunha, Luís Filipe, Isidro, José Miguel, Evans, José Pedro, Marques, Miguel, Batista, Rodrigo, Amorim, Evelin, Jorge, Alípio, Guimarães, Nuno, Nunes, Sérgio, Leal, António, Silvano, Purificação
Abstract
City councils play a crucial role in local governance, directly influencing citizens' daily lives through decisions made during municipal meetings. These deliberations are formally documented in meeting minutes, which serve as official records of discussions, decisions, and voting outcomes. Despite their importance, municipal meeting records have received little attention in Information Retrieval (IR) and Natural Language Processing (NLP), largely due to the lack of annotated datasets, which ultimately limit the development of computational models. To address this gap, we introduce CitiLink-Minutes, a multilayer dataset of 120 European Portuguese municipal meeting minutes from six municipalities. Unlike prior annotated datasets of parliamentary or video records, CitiLink-Minutes provides multilayer annotations and structured linkage of official written minutes. The dataset contains over one million tokens, with all personal identifiers de-identified. Each minute was manually annotated by two trained annotators and curated by an experienced linguist across three complementary dimensions: (1) metadata, (2) subjects of discussion, and (3) voting outcomes, totaling over 38,000 individual annotations. Released under FAIR principles and accompanied by baseline results on metadata extraction, topic classification, and vote labeling, CitiLink-Minutes demonstrates its potential for downstream NLP and IR tasks, while promoting transparent access to municipal decisions.
Chinese Translation
市议会在地方治理中发挥着至关重要的作用,通过在市政会议上做出的决策直接影响公民的日常生活。这些讨论在会议记录中正式记录,作为讨论、决策和投票结果的官方记录。尽管其重要性不言而喻,市政会议记录在信息检索(IR)和自然语言处理(NLP)领域却鲜有关注,主要原因在于缺乏注释数据集,这最终限制了计算模型的发展。为了解决这一问题,我们推出了CitiLink-Minutes,这是一个包含来自六个市政当局的120份欧洲葡萄牙语市政会议记录的多层次数据集。与之前的议会或视频记录的注释数据集不同,CitiLink-Minutes提供了多层次的注释和官方书面记录的结构化链接。该数据集包含超过一百万个词元,所有个人标识符均已去标识化。每份会议记录由两名经过培训的注释员手动注释,并由一位经验丰富的语言学家在三个互补维度上进行整理:(1)元数据,(2)讨论主题,以及(3)投票结果,总计超过38,000个单独注释。根据公平(FAIR)原则发布,并附有元数据提取、主题分类和投票标记的基线结果,CitiLink-Minutes展示了其在下游NLP和IR任务中的潜力,同时促进了对市政决策的透明访问。
cs.CL / 70 / 2602.12153

dVoting: Fast Voting for dLLMs

dVoting:针对 dLLMs 的快速投票
Feng, Sicheng, Chen, Zigeng, Ma, Xinyin, Fang, Gongfan, Wang, Xinchao
Abstract
Diffusion Large Language Models (dLLMs) represent a new paradigm beyond autoregressive modeling, offering competitive performance while naturally enabling a flexible decoding process. Specifically, dLLMs can generate tokens at arbitrary positions in parallel, endowing them with significant potential for parallel test-time scaling, which was previously constrained by severe inefficiency in autoregressive modeling. In this work, we introduce dVoting, a fast voting technique that boosts reasoning capability without training, with only an acceptable extra computational overhead. dVoting is motivated by the observation that, across multiple samples for the same prompt, token predictions remain largely consistent, whereas performance is determined by a small subset of tokens exhibiting cross-sample variability. Leveraging the arbitrary-position generation capability of dLLMs, dVoting performs iterative refinement by sampling, identifying uncertain tokens via consistency analysis, regenerating them through voting, and repeating this process until convergence. Extensive evaluations demonstrate that dVoting consistently improves performance across various benchmarks. It achieves gains of 6.22%-7.66% on GSM8K, 4.40%-7.20% on MATH500, 3.16%-14.84% on ARC-C, and 4.83%-5.74% on MMLU. Our code is available at https://github.com/fscdc/dVoting
Chinese Translation
扩散大语言模型(dLLMs)代表了一种超越自回归建模的新范式,提供了具有竞争力的性能,同时自然地实现了灵活的解码过程。具体而言,dLLMs 可以并行地在任意位置生成标记,使其在测试时具有显著的并行扩展潜力,而这一点在自回归建模中受到严重低效的限制。在本研究中,我们介绍了 dVoting,一种快速投票技术,能够在不进行训练的情况下提升推理能力,仅需接受额外的计算开销。dVoting 的动机源于观察到,对于同一提示的多个样本,标记预测在很大程度上保持一致,而性能则由一小部分表现出跨样本变异性的标记决定。利用 dLLMs 的任意位置生成能力,dVoting 通过采样进行迭代优化,通过一致性分析识别不确定标记,通过投票重新生成这些标记,并重复这一过程直到收敛。广泛的评估表明,dVoting 在各种基准测试中始终提高了性能。在 GSM8K 上取得了 6.22%-7.66% 的提升,在 MATH500 上提升了 4.40%-7.20%,在 ARC-C 上提升了 3.16%-14.84%,在 MMLU 上提升了 4.83%-5.74%。我们的代码可在 https://github.com/fscdc/dVoting 获取。
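The sample-identify-regenerate loop at the heart of dVoting can be sketched as position-wise voting over parallel samples. This toy omits the actual dLLM regeneration step and uses an illustrative unanimity threshold for flagging uncertain positions.

```python
from collections import Counter

def vote(samples, agreement=1.0):
    """Position-wise vote over equal-length token sequences.

    Returns the voted sequence plus the positions whose top token falls
    below the agreement threshold -- the ones dVoting would regenerate
    (in parallel, at arbitrary positions) in the next refinement round.
    """
    voted, uncertain = [], []
    for pos in range(len(samples[0])):
        counts = Counter(sample[pos] for sample in samples)
        token, n = counts.most_common(1)[0]
        voted.append(token)
        if n / len(samples) < agreement:
            uncertain.append(pos)
    return voted, uncertain

# Three parallel samples agree everywhere except the final answer token,
# mirroring the observation that only a small token subset varies.
samples = [["6", "*", "7", "=", "42"],
           ["6", "*", "7", "=", "41"],
           ["6", "*", "7", "=", "42"]]
voted, uncertain = vote(samples)
print(voted)      # majority sequence
print(uncertain)  # → [4]
```

In the full method, only the flagged positions are resampled with the consistent tokens held fixed, and the loop repeats until the uncertain set is empty.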
cs.CL / 71 / 2602.12192

Query-focused and Memory-aware Reranker for Long Context Processing

面向查询和记忆感知的长文本处理重排序器
Li, Yuqing, Li, Jiangnan, Yu, Mo, Ding, Guoxuan, Lin, Zheng, Wang, Weiping, Zhou, Jie
Abstract
Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models (e.g., 4B parameters) to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage. We further demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.
Chinese Translation
基于对大型语言模型中检索头的现有分析,我们提出了一种替代的重排序框架,该框架训练模型使用所选头的注意力分数来估计段落与查询的相关性。这种方法提供了一种列表式解决方案,利用整个候选短名单中的整体信息进行排序。同时,它自然地生成连续的相关性分数,使得在任意检索数据集上进行训练成为可能,而无需依赖李克特量表的监督。我们的框架轻量且高效,仅需小规模模型(例如,4B参数)即可实现强大的性能。大量实验表明,我们的方法在多个领域(包括维基百科和长篇叙述数据集)上超越了现有的最先进的逐点和逐列表重排序器。它进一步在评估对话理解和记忆使用能力的LoCoMo基准上建立了新的最先进水平。我们还展示了我们的框架支持灵活的扩展。例如,使用上下文信息增强候选段落进一步提高了排名准确性,而从中间层训练注意力头则在不牺牲性能的情况下提高了效率。
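The scoring direction of such a reranker can be illustrated with toy attention tensors: a passage's relevance is read off from how much attention mass selected heads route from query tokens into that passage's span. The aggregation (sum over span, mean over heads and query tokens) is an assumption for illustration; the actual framework trains on these signals.

```python
import numpy as np

def relevance_scores(attn, query_idx, passage_spans):
    """Score each candidate passage by the attention its tokens receive.

    attn: [heads, seq, seq] attention restricted to selected
    retrieval-style heads; query_idx: positions of query tokens;
    passage_spans: {passage_id: token positions}. Returns continuous
    scores usable as a listwise ranking signal over the shortlist.
    """
    scores = {}
    for pid, span in passage_spans.items():
        # mean over heads and query tokens of attention mass into the span
        scores[pid] = float(attn[:, query_idx][:, :, span].sum(-1).mean())
    return scores

heads, seq = 2, 6           # tokens 0-1: query, 2-3: passage A, 4-5: passage B
attn = np.full((heads, seq, seq), 0.05)
attn[:, 0:2, 4:6] = 0.3     # query tokens attend strongly to passage B
scores = relevance_scores(attn, [0, 1], {"A": [2, 3], "B": [4, 5]})
print(scores)               # B outranks A
```

Because all candidates sit in one context window, each score is computed with the whole shortlist visible, which is what makes the approach listwise rather than pointwise.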
cs.CL / 72 / 2602.12196

Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education

视觉推理基准:评估多模态大型语言模型在小学教育中课堂真实视觉问题上的表现
Huti, Mohamed, Mackintosh, Alasdair, Waldock, Amy, Andrews, Dominic, Lelièvre, Maxime, Boos, Moritz, Murray, Tobias, Atherton, Paul, Ince, Robin A. A., Garrod, Oliver G. B.
Abstract
AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critical bottleneck -- particularly in early-grade maths, which relies heavily on visuals. This paper introduces the visual reasoning benchmark (VRB), a novel dataset designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve authentic visual problems from classrooms. This benchmark is built on a set of 701 questions sourced from primary school examinations in Zambia and India, which cover a range of tasks such as reasoning by analogy, pattern completion, and spatial matching. We outline the methodology and development of the benchmark, which intentionally uses unedited, minimal-text images to test if models can meet realistic needs of primary education. Our findings reveal a "jagged frontier" of capability where models demonstrate better proficiency in static skills such as counting and scaling, but reach a distinct "spatial ceiling" when faced with dynamic operations like folding, reflection, and rotation. These weaknesses pose a risk for classroom use on visual reasoning problems, with the potential for incorrect marking, false scaffolding, and reinforcing student misconceptions. Consequently, education-focused benchmarks like the VRB are essential for determining the functional boundaries of multimodal tools used in classrooms.
Chinese Translation
人工智能模型在文本推理方面已取得了最先进的成果;然而,它们在空间和关系结构上的推理能力仍然是一个关键瓶颈——特别是在早期数学教育中,这一领域高度依赖视觉材料。本文介绍了视觉推理基准(Visual Reasoning Benchmark, VRB),这是一个新颖的数据集,旨在评估多模态大型语言模型(Multimodal Large Language Models, MLLMs)解决课堂真实视觉问题的能力。该基准基于来自赞比亚和印度的小学考试中提取的701个问题,涵盖了类比推理、模式补全和空间匹配等多种任务。我们概述了基准的构建方法和开发过程,该基准有意使用未经编辑的、文本极少的图像,以测试模型是否能够满足小学教育的现实需求。我们的研究发现了一个“参差不齐”的能力前沿:模型在静态技能(如计数和缩放)方面表现出较强的能力,但在面对动态操作(如折叠、翻转和旋转)时却达到了明显的“空间天花板”。这些弱点在视觉推理问题的课堂使用中带来了风险,可能导致错误评分、错误的学习支架和强化学生误解。因此,像VRB这样的教育导向基准对于确定课堂中使用的多模态工具的功能边界至关重要。
cs.CL / 73 / 2602.12203

ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images

ExStrucTiny:一个用于从文档图像中提取模式可变结构信息的基准
Sibue, Mathieu, Garza, Andres Muñoz, Mensah, Samuel, Shetty, Pranav, Ma, Zhiqiang, Liu, Xiaomo, Veloso, Manuela
Abstract
Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, often overlooking the need for adaptable and structured extraction. To address these gaps, we introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images, unifying aspects of KEE, RE, and VQA. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed VLMs on this benchmark, highlighting challenges such as schema adaptation, query under-specification, and answer localization. We hope our work provides a bedrock for improving generalist models for structured IE in documents.
Chinese Translation
企业文档,如表单和报告,嵌入了对下游应用(如数据归档、自动化工作流和分析)至关重要的信息。尽管通用视觉语言模型(VLMs)在既定的文档理解基准上表现良好,但它们在不同文档类型和灵活模式下进行整体、细粒度结构提取的能力尚未得到充分研究。现有的关键实体提取(KEE)、关系提取(RE)和视觉问答(VQA)数据集受限于狭窄的实体本体、简单的查询或同质的文档类型,常常忽视了适应性和结构化提取的需求。为了解决这些问题,我们引入了ExStrucTiny,一个新的文档图像结构化信息提取(IE)基准数据集,统一了KEE、RE和VQA的各个方面。ExStrucTiny通过结合手动和合成的人类验证样本的新颖管道构建,涵盖了更多样化的文档类型和提取场景。我们在该基准上分析了开放和封闭的VLMs,突出了模式适应、查询欠规范和答案定位等挑战。我们希望我们的工作为改进文档结构化IE的通用模型提供基础。
cs.CL / 74 / 2602.12235

Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation

用于检索增强生成的压缩令牌表示溢出检测
Belikova, Julia, Rozhevskii, Danila, Svirin, Dennis, Polev, Konstantin, Panchenko, Alexander
Abstract
Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility -- and when compression begins to erase task-relevant content -- remain underexplored. In this paper, we define "token overflow" as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
Chinese Translation
高效的长上下文处理仍然是当代大型语言模型(LLMs)面临的一项重要挑战,尤其是在资源受限的环境中。软压缩架构通过用较小的学习压缩令牌集替换长令牌序列,承诺延长有效上下文长度。然而,压缩的极限——以及何时压缩开始抹去与任务相关的内容——仍然未被充分探讨。在本文中,我们将“令牌溢出”定义为一种状态,在这种状态下,压缩表示不再包含足够的信息来回答给定的查询,并提出了一种表征和检测它的方法。在xRAG软压缩设置中,我们发现查询无关的饱和统计量能够可靠地区分压缩和未压缩的令牌表示,提供了一种识别压缩令牌的实用工具,但其溢出检测能力有限。对查询和上下文xRAG表示的轻量级探测分类器在HotpotQA、SQuADv2和TriviaQA数据集上平均检测溢出的AUC-ROC为0.72,证明了结合查询信息可以提高检测性能。这些结果从查询无关的诊断进展到查询感知的检测器,使得在进入LLM之前进行低成本门控以减轻压缩引起的错误成为可能。
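The query-aware probing idea above — a lightweight classifier over query and compressed-context representations that predicts whether compression has destroyed the information needed to answer — can be sketched as follows. The synthetic embeddings, the toy "overflow" label, and the added elementwise-interaction feature are all illustrative assumptions; the paper's probes operate on real xRAG representations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, d = 600, 32

# Stand-ins for query embeddings and compressed-context embeddings.
q = rng.normal(size=(n, d))
c = rng.normal(size=(n, d))
# Toy overflow label: pretend overflow occurs when the query and the
# compressed context are poorly aligned (negative dot product).
y = ((q * c).sum(axis=1) < 0).astype(int)

# Query-aware features: both representations plus their elementwise
# interaction (a design choice here, so a linear probe can use alignment).
X = np.concatenate([q, c, q * c], axis=1)

# Train on the first 400 examples, evaluate AUC-ROC on the held-out 200.
probe = LogisticRegression(max_iter=1000).fit(X[:400], y[:400])
auc = roc_auc_score(y[400:], probe.predict_proba(X[400:])[:, 1])
```

Dropping the query features from `X` is the query-agnostic variant; on this toy construction it degrades to chance, which mirrors the abstract's finding that incorporating query information is what makes overflow detectable.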
cs.CL / 75 / 2602.12241

Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications

Moonshine v2:适用于延迟关键语音应用的遍历流编码器自动语音识别
Kudlur, Manjunath, King, Evan, Wang, James, Warden, Pete
Abstract
Latency-critical speech applications (e.g., live transcription, voice commands, and real-time translation) demand low time-to-first-token (TTFT) and high transcription accuracy, particularly on resource-constrained edge devices. Full-attention Transformer encoders remain a strong accuracy baseline for automatic speech recognition (ASR) because every frame can directly attend to every other frame, which resolves otherwise locally ambiguous acoustics using distant lexical context. However, this global dependency incurs quadratic complexity in sequence length, inducing an inherent "encode-the-whole-utterance" latency profile. For streaming use cases, this causes TTFT to grow linearly with utterance length as the encoder must process the entire prefix before any decoder token can be emitted. To better meet the needs of on-device, streaming ASR use cases we introduce Moonshine v2, an ergodic streaming-encoder ASR model that employs sliding-window self-attention to achieve bounded, low-latency inference while preserving strong local context. Our models achieve state of the art word error rates across standard benchmarks, attaining accuracy on-par with models 6x their size while running significantly faster. These results demonstrate that carefully designed local attention is competitive with the accuracy of full attention at a fraction of the size and latency cost, opening new possibilities for interactive speech interfaces on edge devices.
Chinese Translation
延迟关键的语音应用(例如,实时转录、语音命令和实时翻译)要求低的首次令牌时间(TTFT)和高的转录准确性,尤其是在资源受限的边缘设备上。全注意力Transformer编码器仍然是自动语音识别(ASR)的强大准确性基准,因为每一帧可以直接关注其他每一帧,从而利用远处的词汇上下文解决局部模糊的声学问题。然而,这种全局依赖性带来了随序列长度二次增长的复杂度,导致固有的“编码整个话语”的延迟特征。对于流式使用案例,这导致TTFT随话语长度线性增长,因为编码器必须处理整个前缀才能发出任何解码器令牌。为了更好地满足设备端流式ASR使用案例的需求,我们引入了Moonshine v2,一种遍历流编码器ASR模型,采用滑动窗口自注意力机制,以实现有界的低延迟推理,同时保持强大的局部上下文。我们的模型在标准基准测试中实现了最先进的词错误率,达到与规模为其6倍的模型相当的准确性,同时运行速度显著更快。这些结果表明,经过精心设计的局部注意力仅以一小部分的规模和延迟成本便可与全注意力的准确性相竞争,为边缘设备上的交互式语音界面开辟了新的可能性。
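The sliding-window self-attention pattern described above — each frame attends only to a bounded band of recent frames, so per-step cost stays O(window) instead of O(sequence length) — can be sketched minimally. This is a generic banded-attention illustration under assumed names and shapes, not Moonshine v2's actual architecture:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal band mask: position i may attend only to positions
    max(0, i - window + 1) .. i, bounding each step's attention cost."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def windowed_attention(q, k, v, window):
    """Single-head scaled dot-product attention under the band mask."""
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d)
    mask = sliding_window_mask(len(q), window)
    scores = np.where(mask, scores, -np.inf)   # masked positions get zero weight
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4))
k = rng.normal(size=(8, 4))
v = rng.normal(size=(8, 4))
out = windowed_attention(q, k, v, window=3)
```

Because each output frame depends only on the last `window` frames, a streaming encoder can emit representations as audio arrives, rather than waiting to encode the whole utterance as a full-attention encoder must.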
cs.CL / 76 / 2602.12251

A technical curriculum on language-oriented artificial intelligence in translation and specialised communication

面向语言的人工智能在翻译和专业交流中的技术课程
Krüger, Ralph
Abstract
This paper presents a technical curriculum on language-oriented artificial intelligence (AI) in the language and translation (L&T) industry. The curriculum aims to foster domain-specific technical AI literacy among stakeholders in the fields of translation and specialised communication by exposing them to the conceptual and technical/algorithmic foundations of modern language-oriented AI in an accessible way. The core curriculum focuses on 1) vector embeddings, 2) the technical foundations of neural networks, 3) tokenization and 4) transformer neural networks. It is intended to help users develop computational thinking as well as algorithmic awareness and algorithmic agency, ultimately contributing to their digital resilience in AI-driven work environments. The didactic suitability of the curriculum was tested in an AI-focused MA course at the Institute of Translation and Multilingual Communication at TH Koeln. Results suggest the didactic effectiveness of the curriculum, but participant feedback indicates that it should be embedded into higher-level didactic scaffolding - e.g., in the form of lecturer support - in order to enable optimal learning conditions.
Chinese Translation
本文提出了一项关于面向语言的人工智能(AI)在语言与翻译(L&T)行业中的技术课程。该课程旨在通过以易于理解的方式向利益相关者介绍现代面向语言的人工智能的概念及其技术/算法基础,培养翻译和专业交流领域内的特定领域技术AI素养。核心课程重点包括:1)向量嵌入,2)神经网络的技术基础,3)分词处理和4)变换器神经网络。课程旨在帮助用户发展计算思维、算法意识和算法能力,最终增强他们在AI驱动的工作环境中的数字韧性。该课程的教学适宜性在科隆应用科技大学翻译与多语言交流研究所的一个以AI为重点的硕士课程中进行了测试。结果表明该课程在教学上的有效性,但参与者反馈指出,课程应嵌入更高层次的教学支架中,例如通过讲师支持的形式,以便创造最佳学习条件。
cs.CL / 77 / 2602.12275

On-Policy Context Distillation for Language Models

语言模型的同策略上下文蒸馏
Ye, Tianzhu, Dong, Li, Wu, Xun, Huang, Shaohan, Wei, Furu
Abstract
Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.
Chinese Translation
上下文蒸馏使语言模型能够将上下文知识内化到其参数中。在我们的研究中,我们提出了同策略上下文蒸馏(On-Policy Context Distillation, OPCD),这是一个将同策略蒸馏与上下文蒸馏相结合的框架,通过在学生模型自己生成的轨迹上进行训练,同时最小化与以上下文为条件的教师模型之间的反向 Kullback-Leibler 散度。我们在两个重要应用中展示了 OPCD 的有效性:经验知识蒸馏,其中模型从其历史求解轨迹中提取和整合可转移知识;以及系统提示蒸馏,其中模型内化编码在优化提示中的有益行为。在数学推理、基于文本的游戏和特定领域任务中,OPCD 始终优于基线方法,取得更高的任务准确性,同时更好地保留了分布外能力。我们进一步展示了 OPCD 能够有效实现跨规模蒸馏,使得较小的学生模型能够从较大的教师模型中内化经验知识。
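The objective described above — token-level reverse KL divergence KL(p_student || p_teacher) on student-generated trajectories, with the teacher conditioned on the in-context knowledge — can be sketched numerically. Names and shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def reverse_kl(student_logits, teacher_logits):
    """Mean over sequence positions of KL(p_s || p_t).

    In an OPCD-style setup the student logits come from the student's
    own sampled trajectory and the teacher sees the extra context;
    during actual training only the student would receive gradients.
    Shapes: (seq_len, vocab_size)."""
    lp_s = log_softmax(student_logits)
    lp_t = log_softmax(teacher_logits)
    p_s = np.exp(lp_s)
    return float((p_s * (lp_s - lp_t)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
s = rng.normal(size=(5, 10))  # student logits on its own trajectory
t = rng.normal(size=(5, 10))  # context-conditioned teacher logits
```

Reverse KL is mode-seeking: it pushes the student to concentrate on the teacher's high-probability continuations rather than spreading mass over all of them, which is a common rationale for pairing it with on-policy (student-sampled) trajectories.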