cs.RO / 1 / 2604.07378
Evaluation as Evolution: Transforming Adversarial Diffusion into Closed-Loop Curricula for Autonomous Vehicles
评估即演化:将对抗性扩散转化为自主车辆的闭环课程
Abstract
Autonomous vehicles in interactive traffic environments are often limited by the scarcity of safety-critical tail events in static datasets, which biases learned policies toward average-case behaviors and reduces robustness. Existing evaluation methods attempt to address this through adversarial stress testing, but are predominantly open-loop and post-hoc, making it difficult to incorporate discovered failures back into the training process. We introduce Evaluation as Evolution ($E^2$), a closed-loop framework that transforms adversarial generation from a static validation step into an adaptive evolutionary curriculum. Specifically, $E^2$ formulates adversarial scenario synthesis as transport-regularized sparse control over a learned reverse-time SDE prior. To make this high-dimensional generation tractable, we utilize topology-driven support selection to identify critical interacting agents, and introduce Topological Anchoring to stabilize the process. This approach enables the targeted discovery of failure cases while strictly constraining deviations from realistic data distributions. Empirically, $E^2$ improves collision failure discovery by 9.01% on the nuScenes dataset and up to 21.43% on the nuPlan dataset over the strongest baselines, while maintaining low invalidity and high realism. It further yields substantial robustness gains when the resulting boundary cases are recycled for closed-loop policy fine-tuning.
Chinese Translation
在互动交通环境中的自主车辆常常受到静态数据集中安全关键尾事件稀缺的限制,这使得学习到的策略偏向于平均情况行为,从而降低了鲁棒性。现有的评估方法试图通过对抗性压力测试来解决这一问题,但主要是开放式和事后分析,难以将发现的失败重新纳入训练过程。我们提出了评估即演化(Evaluation as Evolution,$E^2$),这是一个闭环框架,将对抗性生成从静态验证步骤转化为自适应的演化课程。具体而言,$E^2$ 将对抗场景合成公式化为对学习到的反向时间随机微分方程(SDE)先验的运输正则化稀疏控制。为了使这种高维生成变得可行,我们利用拓扑驱动的支持选择来识别关键交互代理,并引入拓扑锚定(Topological Anchoring)来稳定这一过程。这种方法能够有针对性地发现失败案例,同时严格限制与现实数据分布的偏差。在实证上,$E^2$ 在nuScenes数据集上提高了9.01%的碰撞失败发现率,在nuPlan数据集上提高了高达21.43%,且保持了低无效性和高现实性。此外,当将生成的边界案例回收用于闭环策略微调时,还显著提升了鲁棒性。
cs.RO / 2 / 2604.07395
A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring
一种用于语言引导抓取的物理代理循环与执行状态监测
Abstract
Robotic manipulation systems that follow language instructions often execute grasp primitives in a largely single-shot manner: a model proposes an action, the robot executes it, and failures such as empty grasps, slips, stalls, timeouts, or semantically wrong grasps are not surfaced to the decision layer in a structured way. Inspired by agentic loops in digital tool-using agents, we reformulate language-guided grasping as a bounded embodied agent operating over grounded execution states, where physical actions expose an explicit tool-state stream. We introduce a physical agentic loop that wraps an unmodified learned manipulation primitive (grasp-and-lift) with (i) an event-based interface and (ii) an execution monitoring layer, Watchdog, which converts noisy gripper telemetry into discrete outcome labels using contact-aware fusion and temporal stabilization. These outcome events, optionally combined with post-grasp semantic verification, are consumed by a deterministic bounded policy that finalizes, retries, or escalates to the user for clarification, guaranteeing finite termination. We validate the resulting loop on a mobile manipulator with an eye-in-hand D405 camera, keeping the underlying grasp model unchanged and evaluating representative scenarios involving visual ambiguity, distractors, and induced execution failures. Results show that explicit execution-state monitoring and bounded recovery enable more robust and interpretable behavior than open-loop execution, while adding minimal architectural overhead. For the source code and demo, refer to our project page: https://wenzewwz123.github.io/Agentic-Loop/
Chinese Translation
遵循语言指令的机器人操作系统通常以单次执行的方式执行抓取原语:模型提出一个动作,机器人执行该动作,而诸如空抓取、滑动、停滞、超时或语义错误抓取等失败并未以结构化的方式呈现给决策层。受到数字工具使用代理中代理循环的启发,我们将语言引导抓取重新表述为一个在有界执行状态上操作的具身代理,其中物理动作揭示了显式的工具状态流。我们引入了一个物理代理循环,该循环将未修改的学习抓取原语(抓取与提升)与(i)基于事件的接口和(ii)执行监测层Watchdog相结合,该监测层利用接触感知融合和时间稳定化将嘈杂的抓取器遥测转换为离散的结果标签。这些结果事件可选地与抓取后的语义验证结合,由一个确定性的有界策略消耗,该策略最终确定、重试或向用户请求澄清,从而保证有限终止。我们在一个配备有眼在手D405相机的移动操控器上验证了所得循环,保持基础抓取模型不变,并评估涉及视觉模糊、干扰物和诱发执行失败的代表性场景。结果表明,显式的执行状态监测和有界恢复使得行为比开放循环执行更具鲁棒性和可解释性,同时增加的架构开销最小。有关源代码和演示,请参见我们的项目页面:https://wenzewwz123.github.io/Agentic-Loop/
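The deterministic bounded policy described in this abstract (finalize, retry, or escalate, with guaranteed termination) can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `Outcome`, `Decision`, `decide`, `run_loop`, and the `max_attempts` budget are hypothetical, and the rule that a semantically wrong grasp escalates immediately is an assumption made for the example.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Outcome(Enum):
    """Discrete outcome labels (hypothetical names modeled on the abstract)."""
    SUCCESS = "success"
    EMPTY_GRASP = "empty_grasp"
    SLIP = "slip"
    STALL = "stall"
    TIMEOUT = "timeout"
    WRONG_OBJECT = "wrong_object"


class Decision(Enum):
    FINALIZE = "finalize"
    RETRY = "retry"
    ESCALATE = "escalate"


def decide(outcome: Outcome, attempts_used: int, max_attempts: int) -> Decision:
    """Deterministic mapping from an outcome event to the next step."""
    if outcome is Outcome.SUCCESS:
        return Decision.FINALIZE
    if outcome is Outcome.WRONG_OBJECT:
        # Assumption: repeating the same grasp will not fix a semantic mismatch.
        return Decision.ESCALATE
    if attempts_used < max_attempts:
        return Decision.RETRY
    return Decision.ESCALATE


def run_loop(execute_grasp: Callable[[], Outcome],
             max_attempts: int = 3) -> Tuple[Decision, List[Outcome]]:
    """Run the wrapped primitive at most max_attempts times, then stop."""
    history: List[Outcome] = []
    for attempt in range(1, max_attempts + 1):
        outcome = execute_grasp()
        history.append(outcome)
        decision = decide(outcome, attempt, max_attempts)
        if decision is not Decision.RETRY:
            return decision, history
    # Only reached when max_attempts < 1: nothing was tried, so ask the user.
    return Decision.ESCALATE, history
```

Because the attempt budget is a fixed integer and every branch either returns or consumes one attempt, the loop terminates after at most `max_attempts` executions regardless of what the monitor reports.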
cs.RO / 3 / 2604.07423
OpenPRC: A Unified Open-Source Framework for Physics-to-Task Evaluation in Physical Reservoir Computing
OpenPRC:一个统一的开源框架,用于物理水库计算中的物理到任务评估
Abstract
Physical Reservoir Computing (PRC) leverages the intrinsic nonlinear dynamics of physical substrates (mechanical, optical, spintronic, and beyond) as fixed computational reservoirs, offering a compelling paradigm for energy-efficient and embodied machine learning. However, the practical workflow for developing and evaluating PRC systems remains fragmented: existing tools typically address only isolated parts of the pipeline, such as substrate-specific simulation, digital reservoir benchmarking, or readout training. What is missing is a unified framework that can represent both high-fidelity simulated trajectories and real experimental measurements through the same data interface, enabling reproducible evaluation, analysis, and physics-aware optimization across substrates and data sources. We present OpenPRC, an open-source Python framework that fills this gap through a schema-driven physics-to-task pipeline built around five modules: a GPU-accelerated hybrid RK4-PBD physics engine (demlat), a video-based experimental ingestion layer (openprc.vision), a modular learning layer (reservoir), information-theoretic analysis and benchmarking tools (analysis), and physics-aware optimization (optimize). A universal HDF5 schema enforces reproducibility and interoperability, allowing GPU-simulated and experimentally acquired trajectories to enter the same downstream workflow without modification. Demonstrated capabilities include simulations of origami tessellations, video-based trajectory extraction from a physical reservoir, and a common interface for standardized PRC benchmarking, correlation diagnostics, and capacity analysis. The longer-term vision is to serve as a standardizing layer for the PRC community, compatible with external physics engines including PyBullet, PyElastica, and MERLIN.
Chinese Translation
物理水库计算(PRC)利用物理基材的内在非线性动态,包括机械、光学、自旋电子等,作为固定的计算水库,提供了一种能效高且具身的机器学习范式。然而,开发和评估PRC系统的实际工作流程仍然是支离破碎的:现有工具通常仅解决管道的孤立部分,例如特定基材的仿真、数字水库基准测试或读取训练。缺少的是一个统一的框架,能够通过相同的数据接口表示高保真仿真轨迹和真实实验测量,从而实现跨基材和数据源的可重复评估、分析和物理感知优化。我们提出了OpenPRC,一个开源的Python框架,通过一个基于模式驱动的物理到任务管道填补这一空白,该管道围绕五个模块构建:一个GPU加速的混合RK4-PBD物理引擎(demlat)、一个基于视频的实验摄取层(openprc.vision)、一个模块化学习层(reservoir)、信息论分析和基准测试工具(analysis),以及物理感知优化(optimize)。一个通用的HDF5模式强制执行可重复性和互操作性,允许GPU模拟和实验获取的轨迹在不修改的情况下进入相同的下游工作流程。展示的能力包括折纸镶嵌的仿真、从物理水库中提取基于视频的轨迹,以及用于标准化PRC基准测试、相关性诊断和容量分析的公共接口。更长期的愿景是作为PRC社区的标准化层,与包括PyBullet、PyElastica和MERLIN在内的外部物理引擎兼容。
cs.RO / 4 / 2604.07457
CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection
CMP:通过能力流形投影实现稳健的全身追踪与运动操作
Abstract
While decoupled control schemes for legged mobile manipulators have shown robustness, learning holistic whole-body control policies for tracking global end-effector poses remains fragile against Out-of-Distribution (OOD) inputs induced by sensor noise or infeasible user commands. To improve robustness against these perturbations without sacrificing task performance and continuity, we propose Competence Manifold Projection (CMP). Specifically, we utilize a Frame-Wise Safety Scheme that transforms the infinite-horizon safety constraint into a computationally efficient single-step manifold inclusion. To instantiate this competence manifold, we employ a Lower-Bounded Safety Estimator that distinguishes unmastered intentions from the training distribution. We then introduce an Isomorphic Latent Space (ILS) that aligns manifold geometry with safety probability, enabling efficient O(1) seamless defense against arbitrary OOD intents. Experiments demonstrate that CMP achieves up to a 10-fold survival rate improvement in typical OOD scenarios where baselines suffer catastrophic failure, while incurring under 10% tracking degradation. Notably, the system exhibits emergent "best-effort" generalization behaviors to progressively accomplish OOD goals by adhering to the competence boundaries. Result videos are available at: https://shepherd1226.github.io/CMP.
Chinese Translation
尽管针对足式移动操控器的解耦控制方案表现出良好的稳健性,但学习用于追踪全局末端执行器姿态的整体全身控制策略在面对由传感器噪声或不可行用户命令引起的分布外(OOD)输入时仍然脆弱。为了在不牺牲任务性能和连续性的情况下提高对这些扰动的稳健性,我们提出了能力流形投影(Competence Manifold Projection, CMP)。具体而言,我们利用一种帧级安全方案,将无限期安全约束转化为计算效率高的单步流形包含。为了实例化这一能力流形,我们采用了一种下界安全估计器,该估计器能够区分未掌握的意图与训练分布。随后,我们引入了一种同构潜在空间(Isomorphic Latent Space, ILS),使流形几何与安全概率对齐,从而实现对任意OOD意图的高效O(1)无缝防御。实验表明,CMP在典型的OOD场景中实现了高达10倍的生存率提升,而基线方法则遭遇了灾难性失败,跟踪性能下降不足10%。值得注意的是,该系统表现出涌现的“尽力而为”泛化行为,通过遵循能力边界逐步实现OOD目标。结果视频可在以下链接查看:https://shepherd1226.github.io/CMP。
cs.RO / 5 / 2604.07480
Active Reward Machine Inference From Raw State Trajectories
从原始状态轨迹中主动推断奖励机器
Abstract
Reward machines are automaton-like structures that capture the memory required to accomplish a multi-stage task. When combined with reinforcement learning or optimal control methods, they can be used to synthesize robot policies to achieve such tasks. However, specifying a reward machine by hand, including a labeling function capturing high-level features that the decisions are based on, can be a daunting task. This paper deals with the problem of learning reward machines directly from raw state and policy information. As opposed to existing works, we assume no access to observations of rewards, labels, or machine nodes, and show what trajectory data is sufficient for learning the reward machine in this information-scarce regime. We then extend the result to an active learning setting where we incrementally query trajectory extensions to improve data (and indirectly computational) efficiency. Results are demonstrated with several grid world examples.
Chinese Translation
奖励机器是一种类似自动机的结构,用于捕捉完成多阶段任务所需的记忆。当与强化学习或最优控制方法结合时,它们可以用于合成机器人策略以实现这些任务。然而,手动指定奖励机器,包括捕捉决策所依据的高层次特征的标记函数,可能是一项艰巨的任务。本文解决了直接从原始状态和策略信息中学习奖励机器的问题。与现有研究不同,我们假设无法访问奖励、标签或机器节点的观测,并展示了在这种信息稀缺的情况下,哪些轨迹数据对于学习奖励机器是足够的。随后,我们将结果扩展到主动学习设置,在该设置中,我们逐步查询轨迹扩展以提高数据(间接地提高计算)效率。结果通过多个网格世界示例进行了验证。
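For readers unfamiliar with the structure this abstract refers to, a reward machine can be sketched as a small automaton whose transitions fire on labels computed from raw states. The example below is a minimal illustration, not the paper's inference method: the class `RewardMachine`, the labeling function `label`, the grid cells for "a" and "b", and the task "visit a, then b" are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

State = Tuple[int, int]  # raw grid-world state as (row, col)


class RewardMachine:
    """Automaton over labels; unmatched labels leave the node unchanged."""

    def __init__(self,
                 initial_node: int,
                 transitions: Dict[Tuple[int, str], Tuple[int, float]],
                 labeler: Callable[[State], str]) -> None:
        self.initial_node = initial_node
        self.transitions = transitions  # (node, label) -> (next_node, reward)
        self.labeler = labeler          # raw state -> high-level label

    def step(self, node: int, state: State) -> Tuple[int, float]:
        key = (node, self.labeler(state))
        # Missing entries act as zero-reward self-loops.
        return self.transitions.get(key, (node, 0.0))

    def run(self, trajectory: Iterable[State]) -> Tuple[int, float]:
        node, total = self.initial_node, 0.0
        for state in trajectory:
            node, reward = self.step(node, state)
            total += reward
        return node, total


def label(state: State) -> str:
    """Hypothetical labeling function: two marked cells, everything else blank."""
    return {(0, 0): "a", (2, 2): "b"}.get(state, "")


# Task "visit a, then b": node 0 waits for a, node 1 waits for b, node 2 is done.
rm = RewardMachine(
    initial_node=0,
    transitions={(0, "a"): (1, 0.0), (1, "b"): (2, 1.0)},
    labeler=label,
)
```

The same raw state can yield different rewards depending on the current node, which is the memory that the abstract says must be recovered. In the paper's setting neither `label` nor the node sequence is observed; both are given here only to show what is being inferred.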
cs.RO / 6 / 2604.07517
Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations
如梦般抓取:从生成的人类示范中模仿功能性抓取
Abstract
Building generalist robots capable of performing functional grasping in everyday, open-world environments remains a significant challenge due to the vast diversity of objects and tasks. Existing methods are either constrained to narrow object/task sets or rely on prohibitively large-scale data collection to capture real-world variability. In this work, we present an alternative approach, GraspDreamer, a method that leverages human demonstrations synthesized by visual generative models (VGMs) (e.g., video generation models) to enable zero-shot functional grasping without labor-intensive data collection. The key idea is that VGMs pre-trained on internet-scale human data implicitly encode generalized priors about how humans interact with the physical world, which can be combined with embodiment-specific action optimization to enable functional grasping with minimal effort. Extensive experiments on public benchmarks with different robot hands demonstrate the superior data efficiency and generalization performance of GraspDreamer compared to previous methods. Real-world evaluations further validate the effectiveness on real robots. Additionally, we showcase that GraspDreamer can (1) be naturally extended to downstream manipulation tasks and (2) generate data to support visuomotor policy learning.
Chinese Translation
构建能够在日常开放世界环境中执行功能性抓取的通用机器人仍然是一个重大挑战,因为对象和任务的多样性极为广泛。现有方法要么局限于狭窄的对象/任务集,要么依赖于大规模的数据收集,以捕捉现实世界的变异性。在本研究中,我们提出了一种替代方法GraspDreamer,该方法利用由视觉生成模型(VGMs)(例如视频生成模型)合成的人类示范,实现零-shot功能性抓取,而无需劳动密集型的数据收集。关键思想是,经过互联网规模人类数据预训练的VGMs隐式编码了人类如何与物理世界互动的通用先验知识,这可以与特定体现的动作优化相结合,以最小的努力实现功能性抓取。在不同机器人手的公共基准上进行的大量实验表明,GraspDreamer在数据效率和泛化性能方面优于先前的方法。现实世界的评估进一步验证了其在真实机器人上的有效性。此外,我们展示了GraspDreamer可以(1)自然扩展到下游操作任务,以及(2)生成数据以支持视觉运动策略学习。
cs.RO / 7 / 2604.07575
Robust Multi-Agent Target Tracking in Intermittent Communication Environments via Analytical Belief Merging
在间歇通信环境中通过解析信念融合实现鲁棒的多智能体目标跟踪
Abstract
Autonomous multi-agent target tracking in GPS-denied and communication-restricted environments (e.g., underwater exploration, subterranean search and rescue, and adversarial domains) forces agents to operate independently and only exchange information during brief reconnection windows. Because transmitting complete observation and trajectory histories is bandwidth-exhaustive, exchanging probabilistic belief maps serves as a highly efficient proxy that preserves the topology of agent knowledge. While minimizing divergence metrics to merge these decentralized beliefs is conceptually sound, traditional approaches often rely on numerical solvers that introduce critical quantization errors and artificial noise floors. In this paper, we formulate the decentralized belief merging problem as Forward and Reverse Kullback-Leibler (KL) divergence optimizations and derive their exact closed-form analytical solutions. By deploying these derivations, we mathematically eliminate optimization artifacts, achieving perfect mathematical fidelity while reducing the computational complexity of the belief merge to $\mathcal{O}(N|S|)$ scalar operations. Furthermore, we propose a novel spatially-aware visit-weighted KL merging strategy that dynamically weighs agent beliefs based on their physical visitation history. Validated across tens of thousands of distributed simulations, extensive sensitivity analysis demonstrates that our proposed method significantly suppresses sensor noise and outperforms standard analytical means in environments characterized by highly degraded sensors and prolonged communication intervals.
Chinese Translation
在GPS不可用和通信受限的环境(例如水下探测、地下搜索与救援以及对抗性领域)中,自治多智能体目标跟踪迫使智能体独立操作,并仅在短暂的重连窗口期间交换信息。由于传输完整的观察和轨迹历史消耗大量带宽,交换概率信念图作为一种高效的代理,能够保留智能体知识的拓扑结构。尽管在合并这些去中心化信念时最小化散度度量在概念上是合理的,但传统方法往往依赖于数值求解器,这会引入关键的量化误差和人为的噪声底线。在本文中,我们将去中心化信念融合问题表述为前向和反向Kullback-Leibler (KL) 散度优化,并推导出其精确的闭式解析解。通过部署这些推导,我们在数学上消除了优化伪影,实现了完美的数学保真度,同时将信念融合的计算复杂度降低到 $\mathcal{O}(N|S|)$ 标量操作。此外,我们提出了一种新颖的空间感知访问加权KL融合策略,根据智能体的物理访问历史动态加权信念。通过数万次分布式仿真验证,广泛的敏感性分析表明,我们提出的方法显著抑制了传感器噪声,并在传感器高度退化和通信间隔延长的环境中优于标准解析方法。
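The closed-form merges mentioned in this abstract can be illustrated with the standard results for weighted KL objectives over discrete beliefs: minimizing sum_i w_i KL(p_i || q) over q gives the weighted arithmetic mean, and minimizing sum_i w_i KL(q || p_i) gives the normalized weighted geometric mean. The sketch below is written under assumptions: the function names are hypothetical, which of the two the paper labels "forward" or "reverse" is not stated in the abstract, and the scalar per-agent `weights` are a placeholder for the paper's visit-based weighting, whose exact form is not given here.

```python
import math
from typing import List, Sequence


def merge_arithmetic(beliefs: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Minimizer of sum_i w_i * KL(p_i || q): weighted arithmetic mean."""
    total = sum(weights)
    n_cells = len(beliefs[0])
    return [sum(w * p[s] for w, p in zip(weights, beliefs)) / total
            for s in range(n_cells)]


def merge_geometric(beliefs: Sequence[Sequence[float]],
                    weights: Sequence[float],
                    eps: float = 1e-12) -> List[float]:
    """Minimizer of sum_i w_i * KL(q || p_i): normalized weighted geometric mean."""
    total = sum(weights)
    n_cells = len(beliefs[0])
    # Work in log space; eps guards against log(0) for zero-probability cells.
    log_q = [sum(w * math.log(max(p[s], eps)) for w, p in zip(weights, beliefs)) / total
             for s in range(n_cells)]
    peak = max(log_q)
    unnormalized = [math.exp(v - peak) for v in log_q]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


# Two agents, two cells: one agent is uncertain, the other is fairly confident.
beliefs = [[0.5, 0.5], [0.9, 0.1]]
weights = [1.0, 1.0]
arithmetic = merge_arithmetic(beliefs, weights)
geometric = merge_geometric(beliefs, weights)
```

Both functions touch each of the N beliefs once per cell, which matches the $\mathcal{O}(N|S|)$ operation count quoted in the abstract, and neither requires an iterative solver.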
cs.RO / 8 / 2604.07592
Spatio-Temporal Grounding of Large Language Models from Perception Streams
来自感知流的大型语言模型的时空基础
Abstract
Embodied-AI agents must reason about how objects move and interact in 3-D space over time, yet existing smaller frontier Large Language Models (LLMs) still mishandle fine-grained spatial relations, metric distances, and temporal orderings. We introduce the general framework Formally Explainable Spatio-Temporal Scenes (FESTS) that injects verifiable spatio-temporal supervision into an LLM by compiling natural-language queries into Spatial Regular Expression (SpRE) -- a language combining regular expression syntax with S4u spatial logic and extended here with universal and existential quantification. The pipeline matches each SpRE against any structured video log and exports aligned (query, frames, match, explanation) tuples, enabling unlimited training data without manual labels. Training a 3-billion-parameter model on 27k such tuples boosts frame-level F1 from 48.5% to 87.5%, matching GPT-4.1 on complex spatio-temporal reasoning while remaining two orders of magnitude smaller, and hence enabling spatio-temporal intelligence for Video LLMs.
Chinese Translation
具身人工智能代理必须推理物体如何在三维空间中随时间移动和互动,但现有的小型前沿大型语言模型(LLMs)仍然在处理细粒度空间关系、度量距离和时间顺序方面存在问题。我们提出了一个通用框架——形式可解释的时空场景(Formally Explainable Spatio-Temporal Scenes, FESTS),通过将自然语言查询编译为空间正则表达式(Spatial Regular Expression, SpRE)来向LLM注入可验证的时空监督。SpRE是一种结合了正则表达式语法与S4u空间逻辑的语言,并在此基础上扩展了全称量词和存在量词。该流程将每个SpRE与任何结构化视频日志进行匹配,并导出对齐的(查询,帧,匹配,解释)元组,从而实现无限的训练数据而无需手动标签。在27,000个这样的元组上训练一个参数量为30亿的模型,使得帧级F1从48.5%提升至87.5%,在复杂的时空推理上与GPT-4.1相匹配,同时模型规模小两个数量级,从而为视频大型语言模型(Video LLM)提供时空智能。
cs.RO / 9 / 2604.07599
SANDO: Safe Autonomous Trajectory Planning for Dynamic Unknown Environments
SANDO:动态未知环境下的安全自主轨迹规划
Abstract
SANDO is a safe trajectory planner for 3D dynamic unknown environments, where obstacle locations and motions are unknown a priori and a collision-free plan can become unsafe at any moment, requiring fast replanning. Existing soft-constraint planners are fast but cannot guarantee collision-free paths, while hard-constraint methods ensure safety at the cost of longer computation. SANDO addresses this trade-off through three contributions. First, a heat map-based A* global planner steers paths away from high-risk regions using soft costs, and a spatiotemporal safe flight corridor (STSFC) generator produces time-layered polytopes that inflate obstacles only by their worst-case reachable set at each time layer, rather than by the worst case over the entire horizon. Second, trajectory optimization is formulated as a Mixed-Integer Quadratic Program (MIQP) with hard collision-avoidance constraints, and a variable elimination technique reduces the number of decision variables, enabling fast computation. Third, a formal safety analysis establishes collision-free guarantees under explicit velocity-bound and estimation-error assumptions. Ablation studies show that variable elimination yields up to 7.4x speedup in optimization time, and that STSFCs are critical for feasibility in dense dynamic environments. Benchmark simulations against state-of-the-art methods across standardized static benchmarks, obstacle-rich static forests, and dynamic environments show that SANDO consistently achieves the highest success rate with no constraint violations across all difficulty levels; perception-only experiments without ground truth obstacle information confirm robust performance under realistic sensing. Hardware experiments on a UAV with fully onboard planning, perception, and localization demonstrate six safe flights in static environments and ten safe flights among dynamic obstacles.
Chinese Translation
SANDO 是一种针对三维动态未知环境的安全轨迹规划器,其中障碍物的位置和运动是事先未知的,且无碰撞计划可能在任何时刻变得不安全,因此需要快速重新规划。现有的软约束规划器速度较快,但无法保证路径无碰撞,而硬约束方法则以较长的计算时间确保安全。SANDO 通过三个贡献来解决这一权衡。首先,基于热图的 A* 全局规划器使用软成本将路径引导远离高风险区域,时空安全飞行走廊(STSFC)生成器产生时间分层的多面体,仅在每个时间层中根据其最坏情况下可达集来膨胀障碍物,而不是在整个时间范围内的最坏情况。其次,轨迹优化被表述为带有硬碰撞避免约束的混合整数二次规划(MIQP),并且变量消除技术减少了决策变量的数量,从而实现快速计算。第三,正式的安全分析在明确的速度限制和估计误差假设下建立了无碰撞的保证。消融研究表明,变量消除在优化时间上可实现高达 7.4 倍的加速,而 STSFC 在密集动态环境中的可行性至关重要。与最先进的方法在标准化静态基准、障碍物丰富的静态森林和动态环境中的基准模拟显示,SANDO 在所有难度级别中始终实现最高的成功率且没有约束违规;仅依赖感知的实验在没有真实障碍物信息的情况下确认了在现实感知下的稳健性能。在一架无人机上进行的硬件实验中,完全依赖机载规划、感知和定位,展示了在静态环境中进行六次安全飞行和在动态障碍物中进行十次安全飞行的能力。
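The benefit of inflating obstacles per time layer rather than over the whole horizon can be seen with a simple bound. The sketch below assumes a spherical obstacle, a known speed bound `v_max`, and an additive estimation error `est_err`; these modeling choices and the function names are illustrative assumptions, not the STSFC construction from the paper.

```python
from typing import List, Sequence


def layered_inflation(radius: float, v_max: float, est_err: float,
                      layer_end_times: Sequence[float]) -> List[float]:
    """Inflated radius per time layer: worst-case reach by the end of each layer."""
    return [radius + est_err + v_max * t_end for t_end in layer_end_times]


def horizon_inflation(radius: float, v_max: float, est_err: float,
                      horizon: float) -> float:
    """Single inflated radius using the worst case over the entire horizon."""
    return radius + est_err + v_max * horizon


# Obstacle of radius 0.5 m, speed bound 2 m/s, 0.1 m estimation error,
# three layers ending at 0.5 s, 1.0 s, and 1.5 s.
per_layer = layered_inflation(0.5, 2.0, 0.1, [0.5, 1.0, 1.5])
whole_horizon = horizon_inflation(0.5, 2.0, 0.1, 1.5)
```

Only the last layer is as conservative as the whole-horizon radius; earlier layers keep more free space, which is consistent with the abstract's claim that time layering matters for feasibility in dense dynamic scenes.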
cs.RO / 10 / 2604.07607
EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World
EgoVerse:一个用于机器人学习的以自我为中心的人类数据集,涵盖全球
Punamiya, Ryan, Kareer, Simar, Liu, Zeyi, Citron, Josh, Qiu, Ri-Zhao, Cai, Xiongyi, Gavryushin, Alexey, Chen, Jiaqi, Liconti, Davide, Zhu, Lawrence Y., Aphiwetsa, Patcharapong, Li, Baoyu, Cheluva, Aniketh, Kuppili, Pranav, Liu, Yangcen, Patel, Dhruv, Gao, Aidan, Chung, Hye-Young, Co, Ryan, Zbizika, Renee, Liu, Jeff, Xu, Xiaomeng, Xiong, Haoyu, Chen, Geng, Oliani, Sebastiano, Yang, Chenyu, Wang, Xi, Fort, James, Newcombe, Richard, Gao, Josh, Chong, Jason, Matsuda, Garrett, Doriwala, Aseem, Pollefeys, Marc, Katzschmann, Robert, Wang, Xiaolong, Song, Shuran, Hoffman, Judy, Xu, Danfei
Abstract
Robot learning increasingly depends on large and diverse data, yet robot data collection remains expensive and difficult to scale. Egocentric human data offer a promising alternative by capturing rich manipulation behavior across everyday environments. However, existing human datasets are often limited in scope, difficult to extend, and fragmented across institutions. We introduce EgoVerse, a collaborative platform for human data-driven robot learning that unifies data collection, processing, and access under a shared framework, enabling contributions from individual researchers, academic labs, and industry partners. The current release includes 1,362 hours (80k episodes) of human demonstrations spanning 1,965 tasks, 240 scenes, and 2,087 unique demonstrators, with standardized formats, manipulation-relevant annotations, and tooling for downstream learning. Beyond the dataset, we conduct a large-scale study of human-to-robot transfer with experiments replicated across multiple labs, tasks, and robot embodiments under shared protocols. We find that policy performance generally improves with increased human data, but that effective scaling depends on alignment between human data and robot learning objectives. Together, the dataset, platform, and study establish a foundation for reproducible progress in human data-driven robot learning. Videos and additional information can be found at https://egoverse.ai/
Chinese Translation
机器人学习越来越依赖于大量且多样化的数据,但机器人数据收集仍然昂贵且难以扩展。以自我为中心的人类数据通过捕捉日常环境中的丰富操作行为,提供了一种有前景的替代方案。然而,现有的人类数据集通常在范围上有限,难以扩展,并且在不同机构之间碎片化。我们推出了EgoVerse,这是一个用于人类数据驱动的机器人学习的协作平台,统一了数据收集、处理和访问,建立在一个共享框架下,允许个体研究者、学术实验室和行业合作伙伴的贡献。目前发布的版本包括1,362小时(80,000个事件)的人类演示,涵盖1,965个任务、240个场景和2,087个独特的演示者,具有标准化格式、与操作相关的注释以及用于下游学习的工具。除了数据集,我们还进行了一项大规模的人类到机器人转移研究,在多个实验室、任务和机器人形态下根据共享协议复制实验。我们发现,随着人类数据的增加,策略性能通常会提高,但有效的扩展依赖于人类数据与机器人学习目标之间的对齐。数据集、平台和研究共同为人类数据驱动的机器人学习的可重复进展奠定了基础。视频和更多信息可以在 https://egoverse.ai/ 找到。
cs.RO / 11 / 2604.07644
Safe Large-Scale Robust Nonlinear MPC in Milliseconds via Reachability-Constrained System Level Synthesis on the GPU
通过基于可达性约束的系统级合成在GPU上实现毫秒级安全大规模鲁棒非线性模型预测控制
Abstract
We present GPU-SLS, a GPU-parallelized framework for safe, robust nonlinear model predictive control (MPC) that scales to high-dimensional uncertain robotic systems and long planning horizons. Our method jointly optimizes an inequality-constrained, dynamically-feasible nominal trajectory, a tracking controller, and a closed-loop reachable set under disturbance, all in real-time. To efficiently compute nominal trajectories, we develop a sequential quadratic programming procedure with a novel GPU-accelerated quadratic program (QP) solver that uses parallel associative scans and adaptive caching within an alternating direction method of multipliers (ADMM) framework. The same GPU QP backend is used to optimize robust tracking controllers and closed-loop reachable sets via system level synthesis (SLS), enabling reachability-constrained control in both fixed- and receding-horizon settings. We achieve substantial performance gains, reducing nominal trajectory solve times by 97.7% relative to state-of-the-art CPU solvers and 71.8% compared to GPU solvers, while accelerating SLS-based control and reachability by 237x. Despite large problem scales, our method achieves 100% empirical safety, unlike high-dimensional learning-based reachability baselines. We validate our approach on complex nonlinear systems, including whole-body quadrupeds (61D) and humanoids (75D), synthesizing robust control policies online on the GPU in 20 milliseconds on average and scaling to problems with 2 x 10^5 decision variables and 8 x 10^4 constraints. The implementation of our method is available at https://github.com/Jeff300fang/gpu_sls.
Chinese Translation
我们提出了GPU-SLS,这是一个用于安全、鲁棒非线性模型预测控制(MPC)的GPU并行框架,能够扩展到高维不确定机器人系统和长规划时间。我们的方法实时联合优化不等式约束、动态可行的名义轨迹、跟踪控制器以及在干扰下的闭环可达集。为了高效计算名义轨迹,我们开发了一种序列二次规划程序,并采用了一种新颖的GPU加速二次规划(QP)求解器,该求解器在交替方向乘子法(ADMM)框架内使用并行关联扫描和自适应缓存。相同的GPU QP后端用于通过系统级合成(SLS)优化鲁棒跟踪控制器和闭环可达集,从而在固定和递归视野设置中实现可达性约束控制。我们实现了显著的性能提升,相较于最先进的CPU求解器,名义轨迹求解时间减少了97.7%,相比于GPU求解器减少了71.8%,同时使基于SLS的控制和可达性加速了237倍。尽管问题规模较大,我们的方法实现了100%的经验安全性,这与高维基于学习的可达性基线不同。我们在复杂的非线性系统上验证了我们的方法,包括全身四足机器人(61维)和人形机器人(75维),在GPU上平均在线合成鲁棒控制策略的时间为20毫秒,并扩展到具有2 x 10^5个决策变量和8 x 10^4个约束的问题。我们方法的实现可在https://github.com/Jeff300fang/gpu_sls获取。
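The abstract mentions parallel associative scans inside the QP solver. As background, the sketch below shows why a linear recurrence x_{k+1} = a_k x_k + b_k admits such a scan: affine maps compose associatively, so all prefixes can be computed in about log2(n) rounds of independent combines. This is a scalar illustration run sequentially in plain Python; it is not the paper's solver, and the names `combine`, `inclusive_scan`, and `rollout_scan` are hypothetical.

```python
from typing import List, Sequence, Tuple

Affine = Tuple[float, float]  # (a, b) represents the map x -> a * x + b


def combine(first: Affine, second: Affine) -> Affine:
    """Compose two affine maps: apply `first`, then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def inclusive_scan(elems: Sequence[Affine]) -> List[Affine]:
    """Hillis-Steele scan: each round's combines are independent of one another."""
    out = list(elems)
    n = len(out)
    step = 1
    while step < n:
        prev = list(out)  # snapshot so every combine in this round reads old values
        for i in range(step, n):
            out[i] = combine(prev[i - step], prev[i])
        step *= 2
    return out


def rollout_scan(x0: float, a: Sequence[float], b: Sequence[float]) -> List[float]:
    """States x_1..x_n from prefix-composed maps applied to x0."""
    return [A * x0 + B for A, B in inclusive_scan(list(zip(a, b)))]


def rollout_sequential(x0: float, a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Reference rollout, one step at a time."""
    xs, x = [], x0
    for ak, bk in zip(a, b):
        x = ak * x + bk
        xs.append(x)
    return xs
```

On a GPU the inner `for` loop of each round would run in parallel, giving logarithmic depth in the horizon length; with matrices in place of scalars the same operator applies, with `a2 * a1` becoming a matrix product.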
cs.RO / 12 / 2604.07672
Reset-Free Reinforcement Learning for Real-World Agile Driving: An Empirical Study
无重置强化学习在真实世界灵活驾驶中的应用:一项实证研究
Abstract
This paper presents an empirical study of reset-free reinforcement learning (RL) for real-world agile driving, in which a physical 1/10-scale vehicle learns continuously on a slippery indoor track without manual resets. High-speed driving near the limits of tire friction is particularly challenging for learning-based methods because complex vehicle dynamics, actuation delays, and other unmodeled effects hinder both accurate simulation and direct sim-to-real transfer of learned policies. To enable autonomous training on a physical platform, we employ Model Predictive Path Integral control (MPPI) as both the reset policy and the base policy for residual learning, and systematically compare three representative RL algorithms, i.e., PPO, SAC, and TD-MPC2, with and without residual learning in simulation and real-world experiments. Our results reveal a clear gap between simulation and the real world: SAC with residual learning achieves the highest returns in simulation, yet only TD-MPC2 consistently outperforms the MPPI baseline on the physical platform. Moreover, residual learning, while clearly beneficial in simulation, fails to transfer its advantage to the real world and can even degrade performance. These findings reveal that reset-free RL in the real world poses unique challenges absent from simulation, calling for further algorithmic development tailored to training in the wild.
Chinese Translation
本文呈现了一项关于无重置强化学习(RL)在真实世界灵活驾驶中的实证研究,其中一辆1/10比例的物理车辆在滑溜的室内赛道上持续学习,而无需手动重置。在接近轮胎摩擦极限的高速驾驶对基于学习的方法尤其具有挑战性,因为复杂的车辆动力学、驱动延迟以及其他未建模的影响阻碍了准确的仿真和学习策略的直接从仿真到现实的转移。为了在物理平台上实现自主训练,我们采用模型预测路径积分控制(Model Predictive Path Integral control, MPPI)作为重置策略和残差学习的基础策略,并系统地比较了三种代表性的强化学习算法,即PPO、SAC和TD-MPC2,在仿真和真实实验中有无残差学习的表现。我们的结果揭示了仿真与现实世界之间的明显差距:在仿真中,具有残差学习的SAC获得了最高的回报,但只有TD-MPC2在物理平台上始终优于MPPI基线。此外,尽管残差学习在仿真中显然是有益的,但未能将其优势转移到现实世界,甚至可能导致性能下降。这些发现表明,无重置强化学习在真实世界中面临着仿真中不存在的独特挑战,呼吁进一步针对野外训练的算法开发。
cs.RO / 13 / 2604.07677
Bird-Inspired Spatial Flapping Wing Mechanism via Coupled Linkages with Single Actuator
基于鸟类启发的空间拍打翼机制通过单个驱动器的耦合连杆
Abstract
Spatial single-loop mechanisms such as Bennett linkages offer a unique combination of one-degree-of-freedom actuation and nontrivial spatial trajectories, making them attractive for lightweight bio-inspired robotic design. However, although they appear simple and elegant, the geometric task-based synthesis is rather complicated and often avoided in engineering tasks due to the mathematical complexity involved. This paper presents a bird-inspired flapping-wing mechanism built from two coupled spatial four-bars, driven by a single motor. One linkage is actuated to generate the desired spatial sweeping stroke, while the serially coupled linkage remains unactuated and passively switches between extended and folded wing configurations over the stroke cycle. We introduce a simplified kinematic methodology for constructing Bennett linkages from quadrilaterals that contain a desired surface area and further leverage mechanically induced passive state switching. This architecture realizes a coordinated sweep-and-fold wing motion with a single actuation input, reducing weight and control complexity. A 3D-printed prototype is assembled and tested, demonstrating the intended spatial stroke and passive folding behavior.
Chinese Translation
空间单环机制如本内特连杆提供了一种独特的组合,具有一个自由度的驱动和非平凡的空间轨迹,使其在轻量级生物启发机器人设计中具有吸引力。然而,尽管它们看起来简单而优雅,几何任务驱动的合成却相当复杂,常常因涉及的数学复杂性而在工程任务中被避免。本文提出了一种基于鸟类启发的拍打翼机制,该机制由两个耦合的空间四杆机构构成,且由一个电机驱动。一个连杆被驱动以生成所需的空间扫动行程,而串联耦合的连杆保持不被驱动,并在行程周期内被动地在展开和折叠的翼配置之间切换。我们引入了一种简化的运动学方法,通过包含所需表面积的四边形构建本内特连杆,并进一步利用机械诱导的被动状态切换。这种结构实现了通过单一驱动输入的协调扫动与折叠翼运动,减少了重量和控制复杂性。我们组装并测试了一个3D打印原型,证明了预期的空间行程和被动折叠行为。
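As background for the abstract above, the classical closure conditions of a Bennett linkage (a spatial 4R loop with link lengths $a_i$, twist angles $\alpha_i$, and joint offsets $d_i$) are restated below. These are the standard textbook conditions, given up to sign convention on the twist angles; they are not the simplified quadrilateral-based construction that the paper introduces.

```latex
\begin{align}
  a_1 &= a_3 = a, & a_2 &= a_4 = b, \\
  \alpha_1 &= \alpha_3 = \alpha, & \alpha_2 &= \alpha_4 = \beta, \\
  d_i &= 0 \quad (i = 1, \dots, 4), & \frac{a}{\sin\alpha} &= \frac{b}{\sin\beta}.
\end{align}
```

Opposite links therefore share length and twist, and the single ratio condition ties the two link pairs together, which is what leaves the loop with exactly one degree of freedom.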
cs.RO / 14 / 2604.07705
Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models
空中机器人视觉-语言导航:迈向大语言模型时代
Abstract
Aerial vision-and-language navigation (Aerial VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and autonomously navigate complex three-dimensional environments by grounding language in visual perception. This survey provides a critical and analytical review of the Aerial VLN field, with particular attention to the recent integration of large language models (LLMs) and vision-language models (VLMs). We first formally introduce the Aerial VLN problem and define two interaction paradigms: single-instruction and dialog-based, as foundational axes. We then organize the body of Aerial VLN methods into a taxonomy of five architectural categories: sequence-to-sequence and attention-based methods, end-to-end LLM/VLM methods, hierarchical methods, multi-agent methods, and dialog-based navigation methods. For each category, we systematically analyze design rationales, technical trade-offs, and reported performance. We critically assess the evaluation infrastructure for Aerial VLN, including datasets, simulation platforms, and metrics, and identify their gaps in scale, environmental diversity, real-world grounding, and metric coverage. We consolidate cross-method comparisons on shared benchmarks and analyze key architectural trade-offs, including discrete versus continuous actions, end-to-end versus hierarchical designs, and the simulation-to-reality gap. Finally, we synthesize seven concrete open problems: long-horizon instruction grounding, viewpoint robustness, scalable spatial representation, continuous 6-DoF action execution, onboard deployment, benchmark standardization, and multi-UAV swarm navigation, with specific research directions grounded in the evidence presented throughout the survey.
Chinese Translation
空中视觉与语言导航(Aerial VLN)旨在使无人机(UAV)能够理解自然语言指令,并通过将语言与视觉感知相结合,自动导航复杂的三维环境。本文对空中 VLN 领域进行了批判性和分析性的综述,特别关注近期大语言模型(LLMs)与视觉-语言模型(VLMs)的整合。我们首先正式介绍空中 VLN 问题,并定义两种交互范式:单指令和基于对话的,作为基础轴心。接着,我们将空中 VLN 方法的主体组织成五个架构类别的分类法:序列到序列及基于注意力的方法、端到端 LLM/VLM 方法、分层方法、多智能体方法以及基于对话的导航方法。对于每个类别,我们系统地分析设计原理、技术权衡和报告的性能。我们批判性地评估空中 VLN 的评估基础设施,包括数据集、仿真平台和指标,并识别其在规模、环境多样性、现实世界基础和指标覆盖方面的不足。我们整合了在共享基准上的跨方法比较,并分析关键架构权衡,包括离散与连续动作、端到端与分层设计,以及仿真与现实之间的差距。最后,我们综合提出七个具体的开放问题:长时间指令基础、视角鲁棒性、可扩展的空间表示、连续的六自由度动作执行、机载部署、基准标准化以及多无人机群体导航,并基于整个综述中呈现的证据提出具体的研究方向。
cs.RO / 15 / 2604.07774
RoboAgent: Chaining Basic Capabilities for Embodied Task Planning
RoboAgent:将基本能力链式组合用于具身任务规划
Abstract
This paper focuses on embodied task planning, where an agent acquires visual observations from the environment and executes atomic actions to accomplish a given task. Although recent Vision-Language Models (VLMs) have achieved impressive results in multimodal understanding and reasoning, their performance remains limited when applied to embodied planning that involves multi-turn interaction, long-horizon reasoning, and extended context analysis. To bridge this gap, we propose RoboAgent, a capability-driven planning pipeline in which the model actively invokes different sub-capabilities. Each capability maintains its own context, and produces intermediate reasoning results or interacts with the environment according to the query given by a scheduler. This framework decomposes complex planning into a sequence of basic vision-language problems that VLMs can better address, enabling a more transparent and controllable reasoning process. The scheduler and all capabilities are implemented with a single VLM, without relying on external tools. To train this VLM, we adopt a multi-stage paradigm that consists of: (1) behavior cloning with expert plans, (2) DAgger training using trajectories collected by the model, and (3) reinforcement learning guided by an expert policy. Across these stages, we exploit the internal information of the environment simulator to construct high-quality supervision for each capability, and we further introduce augmented and synthetic data to enhance the model's performance in more diverse scenarios. Extensive experiments on widely used embodied task planning benchmarks validate the effectiveness of the proposed approach. Our code will be available at https://github.com/woyut/RoboAgent_CVPR26.
Chinese Translation
本文聚焦于具身任务规划,其中代理从环境中获取视觉观察,并执行原子动作以完成特定任务。尽管近期的视觉-语言模型(Vision-Language Models, VLMs)在多模态理解和推理方面取得了令人瞩目的成果,但在涉及多轮交互、长时间推理和扩展上下文分析的具身规划中,其表现仍然有限。为了解决这一问题,我们提出了RoboAgent,一种能力驱动的规划管道,其中模型主动调用不同的子能力。每个能力维护其自身的上下文,并根据调度器给出的查询生成中间推理结果或与环境进行交互。该框架将复杂的规划分解为一系列基本的视觉-语言问题,使得VLMs能够更好地处理,从而实现更透明和可控的推理过程。调度器和所有能力均通过单一的VLM实现,无需依赖外部工具。为了训练该VLM,我们采用了一个多阶段的范式,包括:(1)使用专家计划进行行为克隆,(2)利用模型收集的轨迹进行DAgger训练,以及(3)在专家策略指导下进行强化学习。在这些阶段中,我们利用环境模拟器的内部信息为每个能力构建高质量的监督,并进一步引入增强和合成数据,以提升模型在更多样化场景中的表现。在广泛使用的具身任务规划基准上的大量实验验证了所提方法的有效性。我们的代码将发布在 https://github.com/woyut/RoboAgent_CVPR26。
cs.RO / 16 / 2604.07799
Learning Without Losing Identity: Capability Evolution for Embodied Agents
不失去身份的学习:具身智能体的能力演化
Abstract
Embodied agents are expected to operate persistently in dynamic physical environments, continuously acquiring new capabilities over time. Existing approaches to improving agent performance often rely on modifying the agent itself -- through prompt engineering, policy updates, or structural redesign -- leading to instability and loss of identity in long-lived systems. In this work, we propose a capability-centric evolution paradigm for embodied agents. We argue that a robot should maintain a persistent agent as its cognitive identity, while enabling continuous improvement through the evolution of its capabilities. Specifically, we introduce the concept of Embodied Capability Modules (ECMs), which represent modular, versioned units of embodied functionality that can be learned, refined, and composed over time. We present a unified framework in which capability evolution is decoupled from agent identity. Capabilities evolve through a closed-loop process involving task execution, experience collection, model refinement, and module updating, while all executions are governed by a runtime layer that enforces safety and policy constraints. We demonstrate through simulated embodied tasks that capability evolution improves task success rates from 32.4% to 91.3% over 20 iterations, outperforming both agent-modification baselines and established skill-learning methods (SPiRL, SkiMo), while preserving zero policy drift and zero safety violations. Our results suggest that separating agent identity from capability evolution provides a scalable and safe foundation for long-term embodied intelligence.
Chinese Translation
具身智能体预计将在动态物理环境中持续运作,随着时间的推移不断获取新能力。现有的提高智能体性能的方法通常依赖于对智能体本身的修改——通过提示工程、策略更新或结构重设计——这导致长期系统的不稳定性和身份丧失。在本研究中,我们提出了一种以能力为中心的具身智能体演化范式。我们认为,机器人应保持一个持久的智能体作为其认知身份,同时通过能力的演化实现持续改进。具体而言,我们引入了具身能力模块(Embodied Capability Modules, ECMs)的概念,这些模块代表了可学习、可细化和可组合的具身功能的模块化版本单元。我们提出了一个统一框架,其中能力演化与智能体身份解耦。能力通过一个闭环过程演化,该过程涉及任务执行、经验收集、模型细化和模块更新,而所有执行都由一个运行时层管理,该层强制执行安全和策略约束。通过模拟的具身任务,我们展示了能力演化在20次迭代中将任务成功率从32.4%提高到91.3%,超越了智能体修改基线和已建立的技能学习方法(SPiRL, SkiMo),同时保持零策略漂移和零安全违规。我们的结果表明,将智能体身份与能力演化分离,为长期具身智能提供了可扩展和安全的基础。
cs.RO / 17 / 2604.07833
Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution
利用具身代理:政策约束执行的运行时治理
Abstract
Embodied agents are evolving from passive reasoning systems into active executors that interact with tools, robots, and physical environments. Once granted execution authority, the central challenge becomes how to keep actions governable at runtime. Existing approaches embed safety and recovery logic inside the agent loop, making execution control difficult to standardize, audit, and adapt. This paper argues that embodied intelligence requires not only stronger agents, but stronger runtime governance. We propose a framework for policy-constrained execution that separates agent cognition from execution oversight. Governance is externalized into a dedicated runtime layer performing policy checking, capability admission, execution monitoring, rollback handling, and human override. We formalize the control boundary among the embodied agent, Embodied Capability Modules (ECMs), and runtime governance layer, and validate through 1000 randomized simulation trials across three governance dimensions. Results show 96.2% interception of unauthorized actions, reduction of unsafe continuation from 100% to 22.2% under runtime drift, and 91.4% recovery success with full policy compliance, substantially outperforming all baselines (p<0.001). By reframing runtime governance as a first-class systems problem, this paper positions policy-constrained execution as a key design principle for embodied agent systems.
Chinese Translation
具身代理正从被动推理系统演变为与工具、机器人和物理环境互动的主动执行者。一旦获得执行权限,中心挑战便是如何在运行时保持行动的可治理性。现有方法将安全性和恢复逻辑嵌入代理循环中,使得执行控制难以标准化、审计和适应。本文认为,具身智能不仅需要更强大的代理,还需要更强大的运行时治理。我们提出了一种政策约束执行的框架,该框架将代理认知与执行监督分离。治理被外部化为一个专门的运行时层,执行政策检查、能力准入、执行监控、回滚处理和人工覆盖。我们形式化了具身代理、具身能力模块(Embodied Capability Modules, ECMs)和运行时治理层之间的控制边界,并通过在三个治理维度上进行1000次随机化仿真试验进行验证。结果显示,未授权行为的拦截率为96.2%,在运行时漂移下不安全继续的比例从100%降至22.2%,在完全遵循政策的情况下恢复成功率为91.4%,显著优于所有基线(p<0.001)。通过将运行时治理重新构建为一个一流的系统问题,本文将政策约束执行定位为具身代理系统的关键设计原则。
cs.RO / 18 / 2604.07921
The Sustainability Gap in Robotics: A Large-Scale Survey of Sustainability Awareness in 50,000 Research Articles
机器人领域的可持续性差距:对50,000篇研究文章可持续性意识的大规模调查
Abstract
We present a large-scale survey of sustainability communication and motivation in robotics research. Our analysis covers nearly 50,000 open-access papers from arXiv's cs.RO category published between 2015 and early 2026. In this study, we quantify how often papers mention social, ecological, and sustainability impacts, and we analyse their alignment with the UN Sustainable Development Goals (SDGs). The results reveal a persistent gap between the field's potential and its stated intent. While a large fraction of robotics papers can be mapped to SDG-relevant domains, explicit sustainability motivation remains remarkably low. Specifically, mentions of sustainability-related impacts are typically below 2%, explicit SDG references stay below 0.1%, and the proportion of sustainability-motivated papers remains below 5%. These trends suggest that while the field of robotics is advancing rapidly, sustainability is not yet a standard part of research framing. We conclude by proposing concrete actions for researchers, conferences, and institutions to close these awareness and motivation gaps, supporting a shift toward more intentional and responsible innovation.
Chinese Translation
我们呈现了一项关于机器人研究中可持续性传播和动机的大规模调查。我们的分析涵盖了2015年至2026年初期间,来自arXiv的cs.RO类别的近50,000篇开放获取论文。在本研究中,我们量化了论文提及社会、生态和可持续性影响的频率,并分析了它们与联合国可持续发展目标(SDGs)的对齐情况。结果揭示了该领域潜力与其声明意图之间的持续差距。尽管大量机器人论文可以映射到与SDG相关的领域,但明确的可持续性动机仍然显著较低。具体而言,提及可持续性相关影响的比例通常低于2%,明确的SDG引用保持在0.1%以下,而以可持续性为动机的论文比例仍低于5%。这些趋势表明,尽管机器人领域正在快速发展,但可持续性尚未成为研究框架的标准组成部分。我们最后提出了具体的行动建议,以帮助研究人员、会议和机构缩小这些意识和动机的差距,支持向更有意图和负责任的创新转变。
cs.RO / 19 / 2604.07939
RAGE-XY: RADAR-Aided Longitudinal and Lateral Forces Estimation For Autonomous Race Cars
RAGE-XY:基于雷达的自主赛车纵向和横向力估计
Abstract
In this work, we present RAGE-XY, an extended version of RAGE, a real-time estimation framework that simultaneously infers vehicle velocity, tire slip angles, and the forces acting on the vehicle using only standard onboard sensors such as IMUs and RADARs. Compared to the original formulation, the proposed method incorporates an online RADAR calibration module, improving the accuracy of lateral velocity estimation in the presence of sensor misalignment. Furthermore, we extend the underlying vehicle model from a single-track approximation to a tricycle model, enabling the estimation of rear longitudinal tire forces in addition to lateral dynamics. We validate the proposed approach through both high-fidelity simulations and real-world experiments conducted on the EAV-24 autonomous race car, demonstrating improved accuracy and robustness in estimating both lateral and longitudinal vehicle dynamics.
Chinese Translation
在本研究中,我们提出了RAGE-XY,这是RAGE的扩展版本,一个实时估计框架,能够仅使用标准的车载传感器(如惯性测量单元和雷达)同时推断车辆速度、轮胎滑移角和作用于车辆的力。与原始公式相比,所提出的方法引入了在线雷达校准模块,提高了在传感器失调情况下横向速度估计的准确性。此外,我们将基础车辆模型从单轨近似扩展到三轮车模型,使得除了横向动力学外,还能够估计后部纵向轮胎力。我们通过高保真模拟和在EAV-24自主赛车上进行的实际实验验证了所提出的方法,展示了在估计横向和纵向车辆动力学方面的准确性和鲁棒性得到了改善。
cs.RO / 20 / 2604.07944
On-Policy Distillation of Language Models for Autonomous Vehicle Motion Planning
面向自主车辆运动规划的语言模型同策略蒸馏
Abstract
Large language models (LLMs) have recently demonstrated strong potential for autonomous vehicle motion planning by reformulating trajectory prediction as a language generation problem. However, deploying capable LLMs in resource-constrained onboard systems remains a fundamental challenge. In this paper, we study how to effectively transfer motion planning knowledge from a large teacher LLM to a smaller, more deployable student model. We build on the GPT-Driver framework, which represents driving scenes as language prompts and generates waypoint trajectories with chain-of-thought reasoning, and investigate two student training paradigms: (i) on-policy generalized knowledge distillation (GKD), which trains the student on its own self-generated outputs using dense token-level feedback from the teacher, and (ii) a dense-feedback reinforcement learning (RL) baseline that uses the teacher's log-probabilities as per-token reward signals in a policy gradient framework. Experiments on the nuScenes benchmark show that GKD substantially outperforms the RL baseline and closely approaches teacher-level performance despite a 5$\times$ reduction in model size. These results highlight the practical value of on-policy distillation as a principled and effective approach to deploying LLM-based planners in autonomous driving systems.
Chinese Translation
大型语言模型（LLMs）通过将轨迹预测重新表述为语言生成问题，最近在自主车辆运动规划中展现出强大的潜力。然而，在资源受限的车载系统中部署能力强的LLMs仍然是一个基本挑战。本文研究了如何有效地将运动规划知识从大型教师LLM转移到更小、更易于部署的学生模型。我们基于GPT-Driver框架，该框架将驾驶场景表示为语言提示，并通过链式思维推理生成航点轨迹，探讨了两种学生训练范式：（i）同策略广义知识蒸馏（GKD），该方法利用教师的密集标记级反馈，在学生自生成的输出上进行训练；（ii）密集反馈强化学习（RL）基线，该基线在策略梯度框架中使用教师的对数概率作为每个标记的奖励信号。在nuScenes基准上的实验表明，GKD显著优于RL基线，并且在模型大小减少5倍的情况下，性能接近教师水平。这些结果突显了同策略蒸馏作为一种原则性和有效的方法，在自主驾驶系统中部署基于LLM的规划器的实际价值。
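As a rough illustration of the on-policy distillation idea in the entry above, the sketch below scores a sequence that the student itself generated by comparing the student's and the teacher's next-token distributions at every position. Reverse KL is used here as one common choice of divergence; the abstract does not say which divergence the authors use, and the function names and toy distributions are assumptions made for this example.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for a single token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def on_policy_distillation_loss(student_dists, teacher_dists):
    """Mean per-token divergence over a sequence sampled from the student.

    student_dists / teacher_dists: one probability vector per generated token,
    both evaluated on the same student-generated prefix.
    """
    per_token = [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(per_token) / len(per_token)


# Toy two-token sequence over a three-word vocabulary (hypothetical numbers).
student = [[0.5, 0.3, 0.2], [0.5, 0.5, 0.0]]
teacher = [[0.5, 0.3, 0.2], [0.9, 0.1, 0.0]]
loss = on_policy_distillation_loss(student, teacher)
```

The first position contributes nothing because the two distributions agree; the loss comes entirely from the second position, which is the dense, per-token feedback the abstract refers to.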
cs.RO / 21 / 2604.07945
Incremental Residual Reinforcement Learning Toward Real-World Learning for Social Navigation
面向社会导航的增量残差强化学习在现实世界中的应用
Abstract
As the demand for mobile robots continues to increase, social navigation has emerged as a critical task, driving active research into deep reinforcement learning (RL) approaches. However, because pedestrian dynamics and social conventions vary widely across different regions, simulations cannot easily encompass all possible real-world scenarios. Real-world RL, in which agents learn while operating directly in physical environments, presents a promising solution to this issue. Nevertheless, this approach faces significant challenges, particularly regarding constrained computational resources on edge devices and learning efficiency. In this study, we propose incremental residual RL (IRRL). This method integrates incremental learning, which is a lightweight process that operates without a replay buffer or batch updates, with residual RL, which enhances learning efficiency by training only on the residuals relative to a base policy. In simulation experiments, we demonstrated that, despite lacking a replay buffer, IRRL achieved performance comparable to that of conventional replay-buffer-based methods and outperformed existing incremental learning approaches. Furthermore, real-world experiments confirmed that IRRL enables robots to adapt effectively to previously unseen environments through real-world learning.
Chinese Translation
随着移动机器人需求的不断增加,社会导航已成为一项关键任务,推动了对深度强化学习(RL)方法的积极研究。然而,由于行人动态和社会习俗在不同地区差异很大,模拟无法轻易涵盖所有可能的现实场景。现实世界的强化学习,即代理在物理环境中直接操作时学习,提供了一个有前景的解决方案。然而,这种方法面临着重大挑战,特别是在边缘设备上的计算资源受限和学习效率方面。在本研究中,我们提出了增量残差强化学习(IRRL)。该方法将增量学习(这是一种轻量级过程,无需重放缓冲区或批量更新)与残差强化学习相结合,通过仅对相对于基础策略的残差进行训练来提高学习效率。通过模拟实验,我们证明了尽管缺乏重放缓冲区,IRRL的性能与传统基于重放缓冲区的方法相当,并且优于现有的增量学习方法。此外,现实世界的实验确认了IRRL能够使机器人有效适应先前未见的环境,通过现实世界学习实现这一目标。
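The two ingredients named in the entry above can be sketched in a few lines: the executed action is the base policy's output plus a learned residual, and the residual is updated from each transition as it arrives, with no replay buffer or batch. This is a minimal sketch, not the authors' algorithm; the linear residual, the `action_grad` learning signal (standing in for whatever critic or advantage estimate is used), and all names are assumptions for illustration.

```python
class ResidualPolicy:
    """Fixed base policy plus a linear residual that is learned online."""

    def __init__(self, base_policy, n_features, lr=0.1):
        self.base_policy = base_policy
        # Zero residual at start, so initial behaviour equals the base policy.
        self.weights = [0.0] * n_features
        self.lr = lr

    def residual(self, features):
        return sum(w * x for w, x in zip(self.weights, features))

    def act(self, features):
        # Residual RL: only the correction on top of the base action is learned.
        return self.base_policy(features) + self.residual(features)

    def update(self, features, action_grad):
        """One streaming update from a single transition, then discard it.

        action_grad: hypothetical estimate of how the objective changes with
        the action. For a linear residual, d(action)/d(w_i) = features[i].
        """
        self.weights = [
            w + self.lr * action_grad * x for w, x in zip(self.weights, features)
        ]


# Hypothetical one-feature example.
policy = ResidualPolicy(base_policy=lambda f: 0.5 * f[0], n_features=1, lr=0.1)
before = policy.act([2.0])
policy.update([2.0], action_grad=1.0)
after = policy.act([2.0])
```

Because the base policy is never modified, the robot's behaviour at the start of learning is exactly the base behaviour, which is one reason residual formulations are attractive for learning directly on hardware.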
cs.RO / 22 / 2604.07993
HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
HEX:用于跨身体整体操控的人形对齐专家
Abstract
Humans achieve complex manipulation through coordinated whole-body control, whereas most Vision-Language-Action (VLA) models treat robot body parts largely independently, making high-DoF humanoid control challenging and often unstable. We present HEX, a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To efficiently capture temporal visual context, HEX uses lightweight history tokens to summarize past observations, avoiding repeated encoding of historical images during inference. It further employs a residual-gated fusion mechanism with a flow-matching action head to adaptively integrate visual-language cues with proprioceptive dynamics for action generation. Experiments on real-world humanoid manipulation tasks show that HEX achieves state-of-the-art performance in task success rate and generalization, particularly in fast-reaction and long-horizon scenarios.
Chinese Translation
人类通过协调的全身控制实现复杂的操控,而大多数视觉-语言-动作(VLA)模型则主要独立处理机器人身体部件,这使得高自由度(DoF)的人形控制变得具有挑战性且常常不稳定。我们提出了HEX,一个针对全尺寸双足人形机器人协调操控的状态中心框架。HEX引入了一种人形对齐的通用状态表示,以便在异构身体之间进行可扩展学习,并结合了一种混合专家统一本体预测器,以从大规模多身体轨迹数据中建模全身协调和时间运动动态。为了有效捕捉时间视觉上下文,HEX使用轻量级历史标记来总结过去的观察,避免在推理过程中重复编码历史图像。它进一步采用了一种残差门控融合机制,配合流匹配动作头,能够自适应地将视觉-语言线索与本体动态整合以生成动作。在真实世界的人形操控任务中的实验表明,HEX在任务成功率和泛化能力方面达到了最先进的性能,特别是在快速反应和长时间跨度的场景中。
cs.RO / 23 / 2604.08009
AgiPIX: Bridging Simulation and Reality in Indoor Aerial Inspection
AgiPIX:桥接室内无人机检查中的仿真与现实
Abstract
Autonomous indoor flight for critical asset inspection presents fundamental challenges in perception, planning, control, and learning. Despite rapid progress, there is still a lack of a compact, active-sensing, open-source platform that is reproducible across simulation and real-world operation. To address this gap, we present AgiPIX, a co-designed open hardware and software platform for indoor aerial autonomy and critical asset inspection. AgiPIX features a compact, hardware-synchronized active-sensing platform with onboard GPU-accelerated compute that is capable of agile flight; a containerized ROS 2-based modular autonomy stack; and a photorealistic digital twin of the hardware platform together with a reliable UI. These elements enable rapid iteration via zero-shot transfer of containerized autonomy components between simulation and real flights. We demonstrate trajectory tracking and exploration performance using onboard sensing in industrial indoor environments. All hardware designs, simulation assets, and containerized software are released openly together with documentation.
Chinese Translation
用于关键资产检查的自主室内飞行在感知、规划、控制和学习方面面临基本挑战。尽管取得了快速进展，但仍缺乏一个紧凑的、主动感知的、开源的平台，能够在仿真和现实操作中复现。为了解决这一问题，我们提出了AgiPIX，一个为室内无人机自主飞行和关键资产检查共同设计的开放硬件和软件平台。AgiPIX具备一个紧凑的、硬件同步的主动感知平台，配备机载GPU加速计算并能够进行敏捷飞行；一个基于ROS 2的容器化模块化自主堆栈；以及硬件平台的逼真数字孪生和可靠的用户界面。这些元素使得容器化自主组件能够在仿真和真实飞行之间零样本迁移，从而实现快速迭代。我们展示了在工业室内环境中使用机载传感器的轨迹跟踪和探索性能。所有硬件设计、仿真资产和容器化软件都已公开发布，并附有文档。
cs.RO / 24 / 2604.08031
Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles
基于大型语言模型的多规划者调度在自主车辆中的开放式指令实现
Abstract
Most Human-Machine Interaction (HMI) research overlooks the maneuvering needs of passengers in autonomous driving (AD). Natural language offers an intuitive interface, yet translating passengers' open-ended instructions into control signals, without sacrificing interpretability and traceability, remains a challenge. This study proposes an instruction-realization framework that leverages a large language model (LLM) to interpret instructions, generates executable scripts that schedule multiple model predictive control (MPC)-based motion planners based on real-time feedback, and converts planned trajectories into control signals. This scheduling-centric design decouples semantic reasoning from vehicle control at different timescales, establishing a transparent, traceable decision-making chain from high-level instructions to low-level actions. Due to the absence of high-fidelity evaluation tools, this study introduces a benchmark for open-ended instruction realization in a closed-loop setting. Comprehensive experiments reveal that the framework significantly improves task-completion rates over instruction-realization baselines, reduces LLM query costs, achieves safety and compliance on par with specialized AD approaches, and exhibits considerable tolerance to LLM inference latency.
Chinese Translation
大多数人机交互(HMI)研究忽视了乘客在自主驾驶(AD)中的操控需求。自然语言提供了一种直观的接口,但将乘客的开放式指令转化为控制信号,同时不牺牲可解释性和可追溯性,仍然是一项挑战。本研究提出了一种指令实现框架,利用大型语言模型(LLM)来解释指令,生成可执行脚本,基于实时反馈调度多个基于模型预测控制(MPC)的运动规划器,并将规划轨迹转化为控制信号。这种以调度为中心的设计在不同时间尺度上将语义推理与车辆控制解耦,建立了从高层指令到低层动作的透明、可追溯的决策链。由于缺乏高保真评估工具,本研究在闭环环境中引入了开放式指令实现的基准测试。全面的实验表明,该框架显著提高了任务完成率,降低了LLM查询成本,达到了与专业AD方法相当的安全性和合规性,并对LLM推理延迟表现出相当的容忍度。
cs.RO / 25 / 2604.08059
Governed Capability Evolution for Embodied Agents: Safe Upgrade, Compatibility Checking, and Runtime Rollback for Embodied Capability Modules
具身智能体的受控能力演化:具身能力模块的安全升级、兼容性检查和运行时回滚
Abstract
Embodied agents are increasingly expected to improve over time by updating their executable capabilities rather than rewriting the agent itself. Prior work has separately studied modular capability packaging, capability evolution, and runtime governance. However, a key systems problem remains underexplored: once an embodied capability module evolves into a new version, how can the hosting system deploy it safely without breaking policy constraints, execution assumptions, or recovery guarantees? We formulate governed capability evolution as a first-class systems problem for embodied agents. We propose a lifecycle-aware upgrade framework in which every new capability version is treated as a governed deployment candidate rather than an immediately executable replacement. The framework introduces four upgrade compatibility checks -- interface, policy, behavioral, and recovery -- and organizes them into a staged runtime pipeline comprising candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, and rollback. We evaluate over 6 rounds of capability upgrade with 15 random seeds. Naive upgrade achieves 72.9% task success but drives unsafe activation to 60% by the final round; governed upgrade retains comparable success (67.4%) while maintaining zero unsafe activations across all rounds (Wilcoxon p=0.003). Shadow deployment reveals 40% of regressions invisible to sandbox evaluation alone, and rollback succeeds in 79.8% of post-activation drift scenarios.
Chinese Translation
具身智能体越来越被期望通过更新其可执行能力而非重写智能体本身来实现持续改进。先前的研究分别探讨了模块化能力打包、能力演化和运行时治理。然而，一个关键的系统问题仍未得到充分探讨：一旦具身能力模块演化为新版本，托管系统如何在不破坏政策约束、执行假设或恢复保证的情况下安全地部署它？我们将受控能力演化形式化为具身智能体的一个一流系统问题。我们提出了一种生命周期感知的升级框架，其中每个新能力版本被视为一个受控部署候选，而不是立即可执行的替代品。该框架引入了四个升级兼容性检查——接口、政策、行为和恢复——并将其组织成一个分阶段的运行时管道，包括候选验证、沙箱评估、影子部署、门控激活、在线监控和回滚。我们使用15个随机种子，对6轮能力升级进行了评估。简单升级实现了72.9%的任务成功率，但在最后一轮中导致不安全激活率达到60%；而受控升级在保持可比成功率（67.4%）的同时，在所有轮次中保持零不安全激活（Wilcoxon p=0.003）。影子部署揭示了40%的回归在仅进行沙箱评估时是不可见的，而回滚在79.8%的激活后漂移场景中成功。
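The staged pipeline described in the entry above (candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, rollback) can be pictured as a chain of gates in which a candidate only replaces the active version after every earlier stage passes. The sketch below is a toy rendering of that control flow under stated assumptions: the record fields, the 0.6 threshold, the grouping of the four compatibility checks under candidate validation, and the version names are all hypothetical, not taken from the paper.

```python
def governed_upgrade(active, candidate, gates, monitor_ok):
    """Run a candidate capability version through staged gates.

    gates: ordered list of (stage_name, check) pairs; each check takes the
    candidate and returns True when that stage passes.
    monitor_ok: callable run after activation; False triggers rollback.
    Returns (version_left_active, outcome_string).
    """
    for stage_name, check in gates:
        if not check(candidate):
            # A failed gate means the candidate never becomes executable.
            return active, "rejected at " + stage_name
    # Gated activation: only reached when every earlier stage passed.
    if not monitor_ok(candidate):
        # Online monitoring flagged a problem, so keep the previous version.
        return active, "rolled back"
    return candidate, "activated"


# Hypothetical candidate records and checks, for illustration only.
gates = [
    ("candidate validation", lambda c: all(c["compat"].values())),
    ("sandbox evaluation", lambda c: c["sandbox_success"] >= 0.6),
    ("shadow deployment", lambda c: not c["shadow_regression"]),
]
compat = {"interface": True, "policy": True, "behavioral": True, "recovery": True}
good = {"name": "grasp-v2", "compat": compat,
        "sandbox_success": 0.7, "shadow_regression": False}
bad = dict(good, name="grasp-v3", shadow_regression=True)

result_good = governed_upgrade("grasp-v1", good, gates, lambda c: True)
result_bad = governed_upgrade("grasp-v1", bad, gates, lambda c: True)
result_drift = governed_upgrade("grasp-v1", good, gates, lambda c: False)
```

The `bad` candidate passes the sandbox stage and is caught only in shadow deployment, mirroring the abstract's observation that some regressions are invisible to sandbox evaluation alone.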
cs.RO / 26 / 2604.08153
Semantic-Aware UAV Command and Control for Efficient IoT Data Collection
语义感知无人机指挥与控制以实现高效的物联网数据收集
Abstract
Unmanned Aerial Vehicles (UAVs) have emerged as a key enabling technology for data collection from Internet of Things (IoT) devices. However, effective data collection is challenged by resource constraints and the need for real-time decision-making. In this work, we propose a novel framework that integrates semantic communication with UAV command-and-control (C&C) to enable efficient image data collection from IoT devices. Each device uses Deep Joint Source-Channel Coding (DeepJSCC) to generate a compact semantic latent representation of its image to enable image reconstruction even under partial transmission. A base station (BS) controls the UAV's trajectory by transmitting acceleration commands. The objective is to maximize the average quality of reconstructed images by maintaining proximity to each device for a sufficient duration within a fixed time horizon. To address the challenging trade-off and account for delayed C&C signals, we model the problem as a Markov Decision Process and propose a Double Deep Q-Learning (DDQN)-based adaptive flight policy. Simulation results show that our approach outperforms baseline methods, such as greedy and traveling salesman algorithms, in both device coverage and semantic reconstruction quality.
Chinese Translation
无人机(UAV)已成为从物联网(IoT)设备收集数据的关键技术。然而,资源限制和实时决策的需求对有效的数据收集构成了挑战。在本研究中,我们提出了一种新颖的框架,将语义通信与无人机指挥与控制(C&C)相结合,以实现从物联网设备高效的图像数据收集。每个设备使用深度联合源信道编码(Deep Joint Source-Channel Coding, DeepJSCC)生成其图像的紧凑语义潜在表示,以便在部分传输的情况下实现图像重建。基站(BS)通过传输加速度命令来控制无人机的轨迹。我们的目标是在固定的时间范围内,通过保持与每个设备的足够接近时间,最大化重建图像的平均质量。为了应对这一具有挑战性的权衡并考虑延迟的C&C信号,我们将问题建模为马尔可夫决策过程,并提出了一种基于双深度Q学习(Double Deep Q-Learning, DDQN)的自适应飞行策略。仿真结果表明,我们的方法在设备覆盖和语义重建质量方面优于贪婪算法和旅行推销员算法等基线方法。
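The entry above names Double Deep Q-Learning without spelling out the update. As a hedged reminder of the standard technique, the Double DQN target picks the next action with the online network and evaluates that action with the target network. In the sketch below, plain Python lists stand in for network outputs; the function name, discount factor, and sample values are assumptions for illustration and are not the authors' implementation.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Standard Double DQN bootstrap target for one transition.

    next_q_online / next_q_target: per-action Q-value estimates for the next
    state from the online and target networks respectively.
    """
    if done:
        # No bootstrapping past a terminal state.
        return reward
    # Action selection uses the online network...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ...while the selected action's value is read from the target network.
    return reward + gamma * next_q_target[best_action]


# Hypothetical values for three discrete acceleration commands.
y = double_dqn_target(
    reward=1.0,
    next_q_online=[0.2, 0.9, 0.5],
    next_q_target=[0.3, 0.4, 0.8],
    gamma=0.5,
)
```

Here the online network prefers the second action, so the target uses 0.4 rather than the target network's own maximum of 0.8; decoupling selection from evaluation in this way is what reduces the overestimation bias of plain Q-learning.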
cs.RO / 27 / 2604.08168
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa:一种用于机器人强化学习的视频生成价值模型
Abstract
Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics, undermining reliable value estimation in long-horizon tasks. In this paper, we propose ViVa, a video-generative value model that repurposes a pretrained video generator for value estimation. Taking the current observation and robot proprioception as input, ViVa jointly predicts future proprioception and a scalar value for the current state. By leveraging the spatiotemporal priors of a pretrained video generator, our approach grounds value estimation in anticipated embodiment dynamics, moving beyond static snapshots to intrinsically couple value with foresight. Integrated into RECAP, ViVa delivers substantial improvements on real-world box assembly. Qualitative analysis across all three tasks confirms that ViVa produces more reliable value signals, accurately reflecting task progress. By leveraging spatiotemporal priors from video corpora, ViVa also generalizes to novel objects, highlighting the promise of video-generative models for value estimation.
Chinese Translation
视觉-语言-动作(VLA)模型通过大规模预训练推动了机器人操作的发展,但由于部分可观察性和延迟反馈,实际部署仍然面临挑战。强化学习通过价值函数来解决这一问题,价值函数评估任务进展并指导策略改进。然而,现有基于视觉-语言模型(VLMs)的价值模型在捕捉时间动态方面存在困难,削弱了在长时间任务中可靠的价值估计。在本文中,我们提出了ViVa,一种视频生成价值模型,重新利用预训练的视频生成器进行价值估计。ViVa以当前观察和机器人本体感知作为输入,联合预测未来的本体感知和当前状态的标量价值。通过利用预训练视频生成器的时空先验,我们的方法将价值估计与预期的体现动态相结合,超越静态快照,将价值与前瞻性内在联系。集成到RECAP中,ViVa在现实世界的盒子组装任务中显著提升了性能。对所有三个任务的定性分析确认,ViVa产生了更可靠的价值信号,准确反映了任务进展。通过利用视频语料库中的时空先验,ViVa还能够推广到新颖物体,突显了视频生成模型在价值估计中的潜力。
cs.RO / 28 / 2604.08185
State and Trajectory Estimation of Tensegrity Robots via Factor Graphs and Chebyshev Polynomials
基于因子图和切比雪夫多项式的张力结构机器人状态与轨迹估计
Abstract
Tensegrity robots offer compliance and adaptability, but their nonlinear and underconstrained dynamics make state estimation challenging. Reliable continuous-time estimation of all rigid links is crucial for closed-loop control, system identification, and machine learning; however, conventional methods often fall short. This paper proposes a two-stage approach for robust state or trajectory estimation (i.e., filtering or smoothing) of a cable-driven tensegrity robot. For online state estimation, this work introduces a factor-graph-based method, which fuses measurements from an RGB-D camera with on-board cable length sensors. To the best of the authors' knowledge, this is the first application of factor graphs in this domain. Factor graphs are a natural choice, as they exploit the robot's structural properties and provide effective sensor fusion solutions capable of handling nonlinearities in practice. Both the Mahalanobis distance-based clustering algorithm, used to handle noise, and the Chebyshev polynomial method, used to estimate the most probable velocities and intermediate states, are shown to perform well on simulated and real-world data, compared to an ICP-based algorithm. Results show that the approach provides high-fidelity, continuous-time state and trajectory estimates for complex tensegrity robot motions.
Chinese Translation
张力结构机器人具有柔性和适应性,但其非线性和欠约束的动态特性使得状态估计变得具有挑战性。对所有刚性连接的可靠连续时间估计对于闭环控制、系统识别和机器学习至关重要;然而,传统方法往往无法满足这一需求。本文提出了一种两阶段的方法,用于电缆驱动的张力结构机器人进行稳健的状态或轨迹估计(即滤波或平滑)。在在线状态估计中,本文引入了一种基于因子图的方法,该方法将来自RGB-D相机的测量与机载电缆长度传感器的数据融合。据作者所知,这是因子图在该领域的首次应用。因子图是一个自然的选择,因为它利用了机器人的结构特性,并提供了有效的传感器融合解决方案,能够在实践中处理非线性问题。与基于ICP的算法相比,使用马哈拉诺比斯距离的聚类算法处理噪声,以及用于估计最可能速度和中间状态的切比雪夫多项式方法,在模拟和真实数据上表现良好。结果表明,该方法为复杂的张力结构机器人运动提供了高保真度的连续时间状态和轨迹估计。
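To make the role of Chebyshev polynomials in the entry above concrete: once a trajectory coordinate is represented as a Chebyshev series on normalized time, intermediate states come from evaluating the series and velocities from differentiating it. The sketch below shows both operations in plain Python using the standard recurrences. It assumes the coefficients are already known (the fitting step, and how the paper couples it with the factor graph, are not shown), and the coefficient values are made up for illustration.

```python
def cheb_eval(coeffs, x):
    """Evaluate sum_k coeffs[k] * T_k(x) using T_{k+1} = 2x T_k - T_{k-1}."""
    t_prev, t_curr = 1.0, x
    total = coeffs[0]
    for c in coeffs[1:]:
        total += c * t_curr
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return total


def cheb_derivative(coeffs):
    """Chebyshev coefficients of the derivative of the series."""
    n = len(coeffs) - 1
    if n < 1:
        return [0.0]
    d = [0.0] * (n + 2)
    # Backward recurrence: d[k-1] = d[k+1] + 2k * c[k], then halve d[0].
    for k in range(n, 0, -1):
        d[k - 1] = d[k + 1] + 2.0 * k * coeffs[k]
    d[0] /= 2.0
    return d[:n]


# Hypothetical position series on normalized time x in [-1, 1]:
# p(x) = 1.0*T0 + 0.5*T1 + 0.25*T2 = 0.75 + 0.5x + 0.5x^2
position = [1.0, 0.5, 0.25]
velocity = cheb_derivative(position)   # series for dp/dx = 0.5 + x
p_mid = cheb_eval(position, 0.2)       # interpolated intermediate state
v_mid = cheb_eval(velocity, 0.2)       # derivative with respect to x
```

If physical time t in [t0, t1] is mapped to x in [-1, 1], the derivative with respect to t is the value above multiplied by 2 / (t1 - t0).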
cs.RO / 29 / 2604.08258
EvoGymCM: Harnessing Continuous Material Stiffness for Soft Robot Co-Design
EvoGymCM:利用连续材料刚度实现软体机器人协同设计
Abstract
In the automated co-design of soft robots, precisely adapting the material stiffness field to task environments is crucial for unlocking their full physical potential. However, mainstream platforms (e.g., EvoGym) strictly discretize the material dimension, artificially restricting the design space and performance of soft robots. To address this, we propose EvoGymCM (EvoGym with Continuous Materials), a benchmark suite formally establishing continuous material stiffness as a first-class design variable alongside morphology and control. Aligning with real-world material mechanisms, EvoGymCM introduces two settings: (i) EvoGymCM-R (Reactive), motivated by programmable materials with dynamically tunable stiffness; and (ii) EvoGymCM-I (Invariant), motivated by traditional materials with invariant stiffness fields. To tackle the resulting high-dimensional coupling, we formulate two Morphology-Material-Control co-design paradigms: (i) Reactive-Material Co-Design, which learns real-time stiffness tuning policies to guide programmable materials; and (ii) Invariant-Material Co-Design, which jointly optimizes morphology and fixed material fields to guide traditional material fabrication. Systematic experiments across diverse tasks demonstrate that continuous material optimization boosts performance and unlocks synergy across morphology, material, and control.
Chinese Translation
在软体机器人的自动协同设计中,精确地将材料刚度场适配于任务环境对于充分发挥其物理潜能至关重要。然而,主流平台(如 EvoGym)严格离散化材料维度,人工限制了软体机器人的设计空间和性能。为了解决这一问题,我们提出了 EvoGymCM(带连续材料的 EvoGym),这是一个基准套件,正式将连续材料刚度作为与形态和控制同等重要的设计变量。为了契合现实材料机制,EvoGymCM 引入了两种设置:(i)EvoGymCM-R(反应型),灵感来源于具有动态可调刚度的可编程材料;(ii)EvoGymCM-I(不变型),灵感来源于刚度场不变的传统材料。针对由此产生的高维耦合问题,我们提出了两种形态-材料-控制协同设计范式:(i)反应材料协同设计,学习实时刚度调节策略以引导可编程材料;(ii)不变材料协同设计,联合优化形态和固定材料场以指导传统材料制造。跨多样任务的系统性实验表明,连续材料优化提升了性能,并解锁了形态、材料与控制之间的协同效应。
cs.RO / 30 / 2604.08292
EMMa: End-Effector Stability-Oriented Mobile Manipulation for Tracked Rescue Robots
EMMa:面向稳定性的末端执行器移动操作在履带救援机器人中的应用
Abstract
The autonomous operation of tracked mobile manipulators in rescue missions requires not only ensuring the reachability and safety of robot motion but also maintaining stable end-effector manipulation under diverse task demands. However, existing studies have overlooked many end-effector motion properties at both the planning and control levels. This paper presents a motion generation framework for tracked mobile manipulators to achieve stable end-effector operation in complex rescue scenarios. The framework formulates a coordinated path optimization model that couples end-effector and mobile base states and designs compact cost/constraint representations to mitigate nonlinearities and reduce computational complexity. Furthermore, an isolated control scheme with feedforward compensation and feedback regulation is developed to enable coordinated path tracking for the robot. Extensive simulated and real-world experiments on rescue scenarios demonstrate that the proposed framework consistently outperforms SOTA methods across key metrics, including task success rate and end-effector motion stability, validating its effectiveness and robustness in complex mobile manipulation tasks.
Chinese Translation
履带移动机械手在救援任务中的自主操作不仅需要确保机器人运动的可达性和安全性,还需要在多样化的任务需求下维持稳定的末端执行器操作。然而,现有研究在规划和控制层面上忽视了许多末端执行器的运动特性。本文提出了一种运动生成框架,旨在使履带移动机械手在复杂救援场景中实现稳定的末端执行器操作。该框架构建了一个协调路径优化模型,将末端执行器和移动基座状态耦合,并设计了紧凑的成本/约束表示,以减轻非线性影响并降低计算复杂性。此外,开发了一种带前馈补偿和反馈调节的孤立控制方案,以实现机器人协调路径跟踪。在救援场景下进行的大量模拟和实地实验表明,所提框架在任务成功率和末端执行器运动稳定性等关键指标上始终优于现有最先进(SOTA)方法,验证了其在复杂移动操作任务中的有效性和鲁棒性。
cs.RO / 31 / 2604.08341
A Unified Multi-Layer Framework for Skill Acquisition from Imperfect Human Demonstrations
一个统一的多层框架用于从不完美的人类示范中获取技能
Abstract
Current Human-Robot Interaction (HRI) systems for skill teaching are fragmented, and existing approaches in the literature do not offer a cohesive framework that is simultaneously efficient, intuitive, and universally safe. This paper presents a novel, layered control framework that addresses this fundamental gap by enabling robust, compliant Learning from Demonstration (LfD) built upon a foundation of universal robot compliance. The proposed approach is structured in three progressive and interconnected stages. First, we introduce a real-time LfD method that learns both the trajectory and variable impedance from a single demonstration, significantly improving efficiency and reproduction fidelity. To ensure high-quality and intuitive kinesthetic teaching, we then present a null-space optimization strategy that proactively manages singularities and provides a consistent interaction feel during human demonstration. Finally, to ensure generalized safety, we introduce a foundational null-space compliance method that enables the entire robot body to compliantly adapt to post-learning external interactions without compromising main task performance. This final contribution transforms the system into a versatile HRI platform, moving beyond end-effector (EE)-specific applications. We validate the complete framework through comprehensive comparative experiments on a 7-DOF KUKA LWR robot. The results demonstrate a safer, more intuitive, and more efficient unified system for a wide range of human-robot collaborative tasks.
Chinese Translation
当前的人机交互(HRI)系统在技能教学方面存在碎片化的问题,现有文献中的方法未能提供一个既高效、直观又普遍安全的统一框架。本文提出了一种新颖的分层控制框架,通过建立在通用机器人顺应性基础上的稳健、合规的示范学习(LfD),来填补这一基本空白。所提出的方法分为三个逐步递进且相互关联的阶段。首先,我们介绍了一种实时的LfD方法,该方法能够从单一示范中学习轨迹和可变阻抗,显著提高了效率和再现精度。为了确保高质量和直观的运动教学,我们接着提出了一种零空间优化策略,主动管理奇异性,并在人体示范过程中提供一致的交互感受。最后,为了确保广泛的安全性,我们引入了一种基础的零空间顺应性方法,使整个机器人主体能够在学习后顺应性地适应外部交互,而不影响主要任务的执行。这一最终贡献将系统转变为一个多功能的人机交互平台,超越了末端执行器(EE)特定应用。我们通过对7自由度KUKA LWR机器人进行全面的比较实验来验证完整框架。结果表明,该统一系统在广泛的人机协作任务中更安全、更直观且更高效。
cs.RO / 32 / 2604.08418
Exploring Temporal Representation in Neural Processes for Multimodal Action Prediction
探索神经过程中的时间表示以进行多模态动作预测
Abstract
Inspired by the human ability to understand and predict others, we study the applicability of Conditional Neural Processes (CNP) to the task of self-supervised multimodal action prediction in robotics. Following recent results regarding the ontogeny of the Mirror Neuron System (MNS), we focus on the preliminary objective of self-action prediction. We find a good MNS-inspired model in the existing Deep Modality Blending Network (DMBN), which is able to reconstruct the visuo-motor sensory signal during a partially observed action sequence by leveraging the probabilistic generation of CNP. After a qualitative and quantitative evaluation, we highlight its difficulties in generalizing to unseen action sequences and trace the cause to its internal representation of time. Therefore, we propose a revised version, termed DMBN-Positional Time Encoding (DMBN-PTE), that facilitates learning a more robust representation of temporal information, and provide preliminary results of its effectiveness in expanding the applicability of the architecture. DMBN-PTE is a first step toward robotic systems that autonomously learn to forecast actions on longer time scales, refining their predictions with incoming observations.
Chinese Translation
受到人类理解和预测他人能力的启发,我们研究条件神经过程(Conditional Neural Processes, CNP)在机器人自监督多模态动作预测任务中的适用性。基于关于镜像神经元系统(Mirror Neuron System, MNS)发生学的最新研究成果,我们专注于自我动作预测的初步目标。我们发现现有的深度模态融合网络(Deep Modality Blending Network, DMBN)是一个良好的受MNS启发的模型,能够通过利用CNP的概率生成,在部分观察的动作序列中重建视觉-运动感官信号。经过定性和定量评估,我们强调了该模型在对未见动作序列进行泛化时的困难,并识别出其内在时间表示的原因。因此,我们提出了一个修订版本,称为DMBN-位置时间编码(DMBN-Positional Time Encoding, DMBN-PTE),以促进学习更稳健的时间信息表示,并提供了其在扩展架构适用性方面有效性的初步结果。DMBN-PTE作为开发能够自主学习在更长时间尺度上预测动作的机器人系统的第一步,能够通过不断接收的观察来完善其预测。
cs.RO / 33 / 2604.08443
A Soft Robotic Interface for Chick-Robot Affective Interactions
用于雏鸡-机器人情感交互的软体机器人界面
Abstract
The potential of Animal-Robot Interaction (ARI) in welfare applications depends on how much an animal perceives a robotic agent as socially relevant, non-threatening and potentially attractive (acceptance). Here, we present an animal-centered soft robotic affective interface for newly hatched chicks (Gallus gallus). The soft interface provides safe and controllable cues, including warmth, breathing-like rhythmic deformation, and face-like visual stimuli. We evaluated chick acceptance of the interface and chick-robot interactions by measuring spontaneous approach and touch responses during video tracking. Overall, chicks approached and spent increasing time on or near the interface, demonstrating acceptance of the device. Across different layouts, chicks showed strong preference for warm thermal stimulation, which increased over time. Face-like visual cues elicited a swift and stable preference, speeding up the initial approach to the tactile interface. Although the breathing cue did not elicit any preference, neither did it trigger avoidance, paving the way for further exploration. These findings translate affective interface concepts to ARI, demonstrating that appropriate soft, thermal and visual stimuli can sustain early chick-robot interactions. This work establishes a reliable evaluation protocol and a safe baseline for designing multimodal robotic devices for animal welfare and neuroscientific research.
Chinese Translation
动物-机器人交互(Animal-Robot Interaction, ARI)在福利应用中的潜力取决于动物对机器人代理的社会相关性、无威胁性及潜在吸引力(接受度)的感知程度。本文提出了一种以动物为中心的软体机器人情感界面,针对新孵化的雏鸡(Gallus gallus)。该软体界面提供安全且可控的刺激,包括温热感、类似呼吸的节律性变形以及类脸部视觉刺激。我们通过视频追踪测量雏鸡的自发接近和触摸反应,评估其对界面的接受度及雏鸡-机器人交互。总体来看,雏鸡主动接近并在界面上或附近停留的时间逐渐增加,显示出对该装置的接受。在不同布局下,雏鸡表现出对温热刺激的强烈偏好,且该偏好随时间增强。类脸部视觉线索引发了迅速且稳定的偏好,加快了对触觉界面的初始接近。尽管呼吸节律刺激未引发明显偏好,也未导致回避,为进一步探索提供了可能。这些发现将情感界面概念应用于ARI,表明适当的软体、热感及视觉刺激能够维持早期的雏鸡-机器人交互。该研究建立了可靠的评估协议和安全的基线,为设计多模态机器人装置以促进动物福利及神经科学研究奠定基础。
cs.RO / 34 / 2604.08508
Sumo: Dynamic and Generalizable Whole-Body Loco-Manipulation
Sumo:动态且可泛化的全身运动操控
Abstract
This paper presents a sim-to-real approach that enables legged robots to dynamically manipulate large and heavy objects with whole-body dexterity. Our key insight is that by performing test-time steering of a pre-trained whole-body control policy with a sample-based planner, we can enable these robots to solve a variety of dynamic loco-manipulation tasks. Interestingly, we find our method generalizes to a diverse set of objects and tasks with no additional tuning or training, and can be further enhanced by flexibly adjusting the cost function at test time. We demonstrate the capabilities of our approach through a variety of challenging loco-manipulation tasks on a Spot quadruped robot in the real world, including uprighting a tire heavier than the robot's nominal lifting capacity and dragging a crowd-control barrier larger and taller than the robot itself. Additionally, we show that the same approach can be generalized to humanoid loco-manipulation tasks, such as opening a door and pushing a table, in simulation. Project code and videos are available at https://sumo.rai-inst.com/.
Chinese Translation
本文提出了一种从仿真到现实的方法,使得腿式机器人能够以全身灵活性动态操控大型重物。我们的关键见解在于,通过使用基于样本的规划器对预训练的全身控制策略进行测试时引导,我们可以使这些机器人解决各种动态运动操控任务。有趣的是,我们发现该方法能够在无需额外调优或训练的情况下,泛化到多样的物体和任务,并且可以通过在测试时灵活调整成本函数进一步增强。我们通过在现实世界中对Spot四足机器人进行一系列具有挑战性的运动操控任务来展示我们方法的能力,包括将比机器人额定举重能力更重的轮胎竖立起来,以及拖动一个比机器人本身更大更高的人群管控护栏。此外,我们还展示了同样的方法可以泛化到人形机器人运动操控任务,例如在仿真中开门和推桌子。项目代码和视频可在 https://sumo.rai-inst.com/ 获取。
cs.RO / 35 / 2604.08528
A-SLIP: Acoustic Sensing for Continuous In-hand Slip Estimation
A-SLIP:用于连续手中滑移估计的声学传感
Abstract
Reliable in-hand manipulation requires accurate real-time estimation of slip between a gripper and a grasped object. Existing tactile sensing approaches based on vision, capacitance, or force-torque measurements face fundamental trade-offs in form factor, durability, and their ability to jointly estimate slip direction and magnitude. We present A-SLIP, a multi-channel acoustic sensing system integrated into a parallel-jaw gripper for estimating continuous slip in the grasp plane. The A-SLIP sensor consists of piezoelectric microphones positioned behind a textured silicone contact pad to capture structured contact-induced vibrations. The A-SLIP model processes synchronized multi-channel audio as log-mel spectrograms using a lightweight convolutional network, jointly predicting the presence, direction, and magnitude of slip. Across experiments with robot- and externally induced slip conditions, the fine-tuned four-microphone configuration achieves a mean absolute directional error of 14.1 degrees, outperforms baselines by up to 12 percent in detection accuracy, and reduces directional error by 32 percent. Compared with single-microphone configurations, the multi-channel design reduces directional error by 64 percent and magnitude error by 68 percent, underscoring the importance of spatial acoustic sensing in resolving slip direction ambiguity. We further evaluate A-SLIP in closed-loop reactive control and find that it enables reliable, low-cost, real-time estimation of in-hand slip. Project videos and additional details are available at https://a-slip.github.io.
Chinese Translation
可靠的手中操作需要准确的实时估计夹持器与被抓取物体之间的滑移。现有基于视觉、电容或力-扭矩测量的触觉传感方法在形态、耐用性以及共同估计滑移方向和幅度方面面临基本的权衡。我们提出了A-SLIP,一个集成在平行夹爪中的多通道声学传感系统,用于估计抓取平面中的连续滑移。A-SLIP传感器由放置在纹理硅胶接触垫后方的压电麦克风组成,以捕捉结构化的接触引起的振动。A-SLIP模型使用轻量级卷积网络将同步的多通道音频处理为对数梅尔谱图,联合预测滑移的存在、方向和幅度。在机器人和外部诱导滑移条件下的实验中,经过微调的四麦克风配置实现了14.1度的平均绝对方向误差,在检测准确性上比基线提高了多达12%,并将方向误差降低了32%。与单麦克风配置相比,多通道设计将方向误差降低了64%,幅度误差降低了68%,突显了空间声学传感在解决滑移方向模糊性中的重要性。我们进一步在闭环反应控制中评估了A-SLIP,发现其能够实现可靠、低成本的实时手中滑移估计。项目视频和更多细节可在 https://a-slip.github.io 获取。
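The A-SLIP abstract reports a mean absolute directional error in degrees. A minimal sketch of how such a metric is commonly computed is shown below; it assumes slip directions are angles in the grasp plane and that errors wrap around at 360 degrees. The function names are hypothetical and this is not the authors' evaluation code.

```python
def angular_error_deg(pred_deg, true_deg):
    """Smallest absolute difference between two directions, in degrees (0 to 180)."""
    diff = (pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def mean_absolute_directional_error(preds, truths):
    """Average wrap-aware angular error over paired predictions and labels."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, truths)]
    return sum(errors) / len(errors)


# 350 degrees and 10 degrees are only 20 degrees apart once wrap-around is handled.
example = mean_absolute_directional_error([350.0, 90.0], [10.0, 80.0])
```

Without the wrap-around step, a prediction of 350 degrees against a label of 10 degrees would be scored as a 340 degree error, which would inflate the reported mean.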
cs.RO / 36 / 2604.08534
ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration
ActiveGlasses:通过自我中心的人类示范学习主动视觉下的操控
Abstract
Large-scale real-world robot data collection is a prerequisite for bringing robots into everyday deployment. However, existing pipelines often rely on specialized handheld devices to bridge the embodiment gap, which not only increases operator burden and limits scalability, but also makes it difficult to capture the naturally coordinated perception-manipulation behaviors of human daily interaction. This challenge calls for a more natural system that can faithfully capture human manipulation and perception behaviors while enabling zero-shot transfer to robotic platforms. We introduce ActiveGlasses, a system for learning robot manipulation from ego-centric human demonstrations with active vision. A stereo camera mounted on smart glasses serves as the sole perception device for both data collection and policy inference: the operator wears it during bare-hand demonstrations, and the same camera is mounted on a 6-DoF perception arm during deployment to reproduce human active vision. To enable zero-shot transfer, we extract object trajectories from demonstrations and use an object-centric point-cloud policy to jointly predict manipulation and head movement. Across several challenging tasks involving occlusion and precise interaction, ActiveGlasses achieves zero-shot transfer with active vision, consistently outperforms strong baselines under the same hardware setup, and generalizes across two robot platforms.
Chinese Translation
大规模的现实世界机器人数据收集是将机器人引入日常应用的前提。然而,现有的流程通常依赖于专用的手持设备来弥补体现差距,这不仅增加了操作员的负担并限制了可扩展性,还使得捕捉人类日常交互中自然协调的感知-操控行为变得困难。这个挑战呼唤一种更自然的系统,能够真实地捕捉人类的操控和感知行为,同时实现对机器人平台的零样本迁移。我们提出了ActiveGlasses,一个用于从自我中心的人类示范中学习机器人操控的主动视觉系统。安装在智能眼镜上的立体摄像头作为数据收集和策略推断的唯一感知设备:操作员在赤手示范时佩戴它,而在部署时同样的摄像头则安装在一个6自由度的感知臂上,以重现人类的主动视觉。为了实现零样本迁移,我们从示范中提取物体轨迹,并使用以物体为中心的点云策略共同预测操控和头部运动。在涉及遮挡和精确交互的多个挑战性任务中,ActiveGlasses实现了主动视觉下的零样本迁移,在相同硬件设置下始终优于强基线,并在两个机器人平台上实现了泛化。
cs.RO / 37 / 2604.08535
Fail2Drive: Benchmarking Closed-Loop Driving Generalization
Fail2Drive:闭环驾驶泛化的基准测试
Abstract
Generalization under distribution shift remains a central bottleneck for closed-loop autonomous driving. Although simulators like CARLA enable safe and scalable testing, existing benchmarks rarely measure true generalization: they typically reuse training scenarios at test time. Success can therefore reflect memorization rather than robust driving behavior. We introduce Fail2Drive, the first paired-route benchmark for closed-loop generalization in CARLA, with 200 routes and 17 new scenario classes spanning appearance, layout, behavioral, and robustness shifts. Each shifted route is matched with an in-distribution counterpart, isolating the effect of the shift and turning qualitative failures into quantitative diagnostics. Evaluating multiple state-of-the-art models reveals consistent degradation, with an average success-rate drop of 22.8%. Our analysis uncovers unexpected failure modes, such as ignoring objects clearly visible in the LiDAR and failing to learn the fundamental concepts of free and occupied space. To accelerate follow-up work, Fail2Drive includes an open-source toolbox for creating new scenarios and validating solvability via a privileged expert policy. Together, these components establish a reproducible foundation for benchmarking and improving closed-loop driving generalization. We open-source all code, data, and tools at https://github.com/autonomousvision/fail2drive .
Chinese Translation
在分布转移下的泛化仍然是闭环自主驾驶的一个主要瓶颈。尽管像CARLA这样的模拟器能够实现安全和可扩展的测试,但现有基准很少测量真实的泛化能力:它们通常在测试时重用训练场景。因此,成功可能反映的是记忆而非稳健的驾驶行为。我们引入了Fail2Drive,这是第一个针对CARLA中闭环泛化的配对路线基准,包含200条路线和17个新的场景类别,涵盖外观、布局、行为和鲁棒性转移。每条转移路线都与一个分布内的对应路线匹配,从而隔离转移的影响,并将定性失败转化为定量诊断。对多种最先进模型的评估显示出一致的性能下降,平均成功率下降了22.8%。我们的分析揭示了意想不到的失败模式,例如忽视在LiDAR中清晰可见的物体,以及未能学习自由空间和占用空间的基本概念。为了加速后续工作,Fail2Drive包括一个开源工具箱,用于创建新场景并通过特权专家策略验证可解性。这些组件共同建立了一个可重复的基础,用于基准测试和改善闭环驾驶泛化。我们将所有代码、数据和工具开源,地址为 https://github.com/autonomousvision/fail2drive 。
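The paired-route idea in the Fail2Drive abstract can be illustrated with a small sketch. It assumes each route pair yields a boolean success flag for the in-distribution route and for its shifted counterpart, and that the "drop" is the absolute difference between the two success rates; the abstract does not say whether the reported 22.8% is absolute or relative, so that choice, like the function name, is an assumption.

```python
def paired_success_drop(results):
    """Compare success rates over paired routes.

    results: list of (in_distribution_success, shifted_success) booleans,
    one tuple per route pair.
    Returns (base_rate, shifted_rate, absolute_drop).
    """
    n = len(results)
    base_rate = sum(1 for base_ok, _ in results if base_ok) / n
    shifted_rate = sum(1 for _, shifted_ok in results if shifted_ok) / n
    return base_rate, shifted_rate, base_rate - shifted_rate


# Four hypothetical route pairs: three succeed in-distribution, one under shift.
pairs = [(True, True), (True, False), (True, False), (False, False)]
base, shifted, drop = paired_success_drop(pairs)
```

Because every shifted route has a matched in-distribution counterpart, the difference isolates the effect of the shift rather than the difficulty of the route itself.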
cs.RO / 38 / 2604.08544
SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds
SIM1:物理对齐模拟器作为可变形世界中的零样本数据扩展器
Abstract
Robotic manipulation with deformable objects represents a data-intensive regime in embodied learning, where shape, contact, and topology co-evolve in ways that far exceed the variability of rigids. Although simulation promises relief from the cost of real-world data acquisition, prevailing sim-to-real pipelines remain rooted in rigid-body abstractions, producing mismatched geometry, fragile soft dynamics, and motion primitives poorly suited for cloth interaction. We posit that simulation fails not for being synthetic, but for being ungrounded. To address this, we introduce SIM1, a physics-aligned real-to-sim-to-real data engine that grounds simulation in the physical world. Given limited demonstrations, the system digitizes scenes into metric-consistent twins, calibrates deformable dynamics through elastic modeling, and expands behaviors via diffusion-based trajectory generation with quality filtering. This pipeline transforms sparse observations into scaled synthetic supervision with near-demonstration fidelity. Experiments show that policies trained on purely synthetic data achieve parity with real-data baselines at a 1:15 equivalence ratio, while delivering 90% zero-shot success and 50% generalization gains in real-world deployment. These results validate physics-aligned simulation as scalable supervision for deformable manipulation and a practical pathway for data-efficient policy learning.
Chinese Translation
使用可变形物体进行机器人操作代表了一个数据密集型的具身学习领域,在这个领域中,形状、接触和拓扑以远超刚体的变异性共同演化。尽管模拟有望减轻真实世界数据获取的成本,但现有的模拟到现实(sim-to-real)管道仍然根植于刚体抽象,导致几何形状不匹配、脆弱的软动力学以及不适合布料交互的运动原语。我们认为,模拟失败并不是因为它是合成的,而是因为它缺乏现实依据。为了解决这个问题,我们引入了SIM1,一个物理对齐的真实到模拟再到真实(real-to-sim-to-real)数据引擎,它将模拟建立在物理世界的基础之上。在有限的演示下,该系统将场景数字化为度量一致的数字孪生,通过弹性建模校准可变形动力学,并通过基于扩散的轨迹生成与质量过滤扩展行为。该管道将稀疏观察转化为具有近似演示保真度的规模化合成监督。实验表明,在1:15的等效比下,基于纯合成数据训练的策略与真实数据基线持平,同时在真实世界部署中实现了90%的零样本成功率和50%的泛化增益。这些结果验证了物理对齐模拟作为可变形操作的可扩展监督,并为数据高效的策略学习提供了一条实用路径。
cs.CV / 1 / 2604.07413
FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
FORGE:制造场景的细粒度多模态评估
Abstract
The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.
Chinese Translation
制造行业正日益采用多模态大型语言模型(MLLMs)以实现从简单感知到自主执行的转变,但当前的评估未能反映现实制造环境的严格要求。进展受到数据稀缺和现有数据集中缺乏细粒度领域语义的限制。为填补这一空白,我们提出了FORGE。我们首先构建了一个高质量的多模态数据集,该数据集结合了真实世界的2D图像和3D点云,并标注了细粒度的领域语义(例如,精确的型号)。然后,我们在三个制造任务上评估了18个最先进的MLLMs,分别是工件验证、结构表面检查和装配验证,揭示了显著的性能差距。与传统理解相反,瓶颈分析表明,视觉定位并不是主要的限制因素。相反,领域特定知识的不足是关键瓶颈,为未来研究指明了明确方向。除了评估外,我们还展示了我们的结构化注释可以作为可操作的训练资源:在我们的数据上对一个紧凑的3B参数模型进行监督微调,能够在保留的制造场景中实现高达90.8%的相对准确性提升,为朝着领域适应的制造MLLMs的实际路径提供了初步证据。代码和数据集可在 https://ai4manufacturing.github.io/forge-web 获取。
cs.CV / 2 / 2604.07427
Personalizing Text-to-Image Generation to Individual Taste
个性化文本到图像生成以满足个人偏好
Abstract
Modern text-to-image (T2I) models generate high-fidelity visuals but remain indifferent to individual user preferences. While existing reward models optimize for "average" human appeal, they fail to capture the inherent subjectivity of aesthetic judgment. In this work, we introduce a novel dataset and predictive framework, called PAMELA, designed to model personalized image evaluations. Our dataset comprises 70,000 ratings across 5,000 diverse images generated by state-of-the-art models (Flux 2 and Nano Banana). Each image is evaluated by 15 unique users, providing a rich distribution of subjective preferences across domains such as art, design, fashion, and cinematic photography. Leveraging this data, we propose a personalized reward model trained jointly on our high-quality annotations and existing aesthetic assessment subsets. We demonstrate that our model predicts individual liking with higher accuracy than the majority of current state-of-the-art methods predict population-level preferences. Using our personalized predictor, we demonstrate how simple prompt optimization methods can be used to steer generations towards individual user preferences. Our results highlight the importance of data quality and personalization to handle the subjectivity of user preferences. We release our dataset and model to facilitate standardized research in personalized T2I alignment and subjective visual quality assessment.
Chinese Translation
现代文本到图像(T2I)模型能够生成高保真视觉效果,但对个体用户的偏好却无动于衷。现有的奖励模型虽然优化了“平均”人类吸引力,但未能捕捉到美学判断的固有主观性。在本研究中,我们引入了一个新颖的数据集和预测框架,称为 PAMELA,旨在建模个性化的图像评估。我们的数据集包含70,000条评分,涵盖由最先进模型(Flux 2 和 Nano Banana)生成的5,000幅多样化图像。每幅图像由15位独特用户进行评估,提供了艺术、设计、时尚和电影摄影等领域的主观偏好的丰富分布。利用这些数据,我们提出了一种个性化奖励模型,该模型在我们的高质量注释和现有美学评估子集上共同训练。我们证明,模型在预测个体喜好方面的准确性高于大多数当前最先进的方法在预测群体水平偏好时的准确性。通过我们的个性化预测器,我们展示了如何使用简单的提示优化方法来引导生成结果朝向个体用户的偏好。我们的结果强调了数据质量和个性化在处理用户偏好主观性方面的重要性。我们发布了我们的数据集和模型,以促进个性化 T2I 对齐和主观视觉质量评估的标准化研究。
cs.CV / 3 / 2604.07429
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
GameWorld:迈向多模态游戏代理的标准化和可验证评估
Abstract
Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still suffer from challenging latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.
Chinese Translation
为了实现面向真实世界交互的具身通用体,多模态大型语言模型(MLLM)代理仍面临延迟、稀疏反馈和不可逆错误等挑战。视频游戏提供了一个理想的测试平台,具有丰富的视觉观察和闭环交互,要求精细的感知、长远的规划和精确的控制。然而,系统性评估这些能力目前受到异构动作接口和启发式验证的限制。为此,我们引入了GameWorld,一个旨在为MLLM作为通用游戏代理在浏览器环境中提供标准化和可验证评估的基准。我们研究了两种游戏代理接口:(i)直接发出键盘和鼠标控制的计算机使用代理,以及(ii)通过确定性的语义动作解析在语义动作空间中行动的通用多模态代理。GameWorld包含34款多样化的游戏和170个任务,每个任务都配有状态可验证的度量标准以进行基于结果的评估。18个模型-接口对的结果表明,即使是表现最好的代理也远未达到人类在视频游戏中的能力。对完整基准重复实验的广泛测试展示了基准的稳健性,而对实时交互、上下文记忆敏感性和动作有效性的进一步研究则揭示了游戏代理面临的更多挑战。通过提供一个标准化、可验证和可重复的评估框架,GameWorld为推动多模态游戏代理及其他领域的研究奠定了坚实的基础。项目页面地址为 https://gameworld-bench.github.io。
cs.CV / 4 / 2604.07430
HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents
HY-Embodied-0.5:面向现实世界代理的具身基础模型
Tencent Robotics X, HY Vision Team: Xumin Yu, Zuyan Liu, Ziyi Wang, He Zhang, Yongming Rao, Fangfu Liu, Yani Zhang, Ruowen Zhao, Oran Wang, Yves Liang, Haitao Lin, Minghui Wang, Yubo Dong, Kevin Cheng, Bolin Ni, Rui Huang, Han Hu, Zhengyou Zhang, Linus, Shunyu Yao
Abstract
We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelligence: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted for complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture to enable modality-specific computing. By incorporating latent tokens, this design effectively enhances the perceptual representation of the models. To improve reasoning capabilities, we introduce an iterative, self-evolving post-training paradigm. Furthermore, we employ on-policy distillation to transfer the advanced capabilities of the large model to the smaller variant, thereby maximizing the performance potential of the compact model. Extensive evaluations across 22 benchmarks, spanning visual perception, spatial reasoning, and embodied understanding, demonstrate the effectiveness of our approach. Our MoT-2B model outperforms similarly sized state-of-the-art models on 16 benchmarks, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. In downstream robot control experiments, we leverage our robust VLM foundation to train an effective Vision-Language-Action (VLA) model, achieving compelling results in real-world physical evaluations. Code and models are open-sourced at https://github.com/Tencent-Hunyuan/HY-Embodied.
Chinese Translation
我们介绍了HY-Embodied-0.5,一个专门为现实世界具身代理设计的基础模型系列。为了弥合通用视觉-语言模型(VLMs)与具身代理需求之间的差距,我们的模型旨在增强具身智能所需的核心能力:空间和时间视觉感知,以及用于预测、交互和规划的高级具身推理。HY-Embodied-0.5套件包括两个主要变体:一个具有20亿激活参数的高效模型,旨在边缘部署;另一个具有320亿激活参数的强大模型,针对复杂推理。为了支持具身任务所需的细粒度视觉感知,我们采用了混合变换器(Mixture-of-Transformers, MoT)架构,以实现特定模态的计算。通过引入潜在标记,该设计有效增强了模型的感知表示。为了提高推理能力,我们引入了一种迭代自我演化的后训练范式。此外,我们采用了在线蒸馏技术,将大模型的先进能力转移到较小的变体,从而最大化紧凑模型的性能潜力。在22个基准测试中进行的广泛评估,涵盖视觉感知、空间推理和具身理解,证明了我们方法的有效性。我们的MoT-2B模型在16个基准测试中超越了同等规模的最先进模型,而32B变体的性能与前沿模型如Gemini 3.0 Pro相当。在下游机器人控制实验中,我们利用强大的VLM基础训练了一个有效的视觉-语言-行动(Vision-Language-Action, VLA)模型,在现实世界的物理评估中取得了令人信服的结果。代码和模型已在https://github.com/Tencent-Hunyuan/HY-Embodied上开源。
cs.CV / 5 / 2604.07477
SMFD-UNet: Semantic Face Mask Is The Only Thing You Need To Deblur Faces
SMFD-UNet:语义面具是去模糊人脸所需的唯一工具
Abstract
Facial image deblurring, the restoration of high-quality images from blurry inputs, is an essential computer vision task for applications including facial identification, forensic analysis, photographic enhancement, and medical imaging diagnostics. Traditional deblurring techniques, often based on general image priors, struggle to capture the structural and identity-specific features of human faces. We present SMFD-UNet (Semantic Mask Fusion Deblurring UNet), a new lightweight framework that uses semantic face masks to drive the deblurring process, removing the need for high-quality reference photos. Our two-step method first uses a UNet-based semantic mask generator to extract detailed facial component masks (e.g., eyes, nose, mouth) directly from blurry photos. It then produces sharp, high-fidelity facial images by integrating these masks with the blurry input through a multi-stage feature fusion technique within a computationally efficient UNet framework. To promote robustness, we created a randomized blurring pipeline that approximates real-world conditions by simulating around 1.74 trillion degradation scenarios. Evaluated on the CelebA dataset, SMFD-UNet outperforms state-of-the-art models, attaining higher Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) while preserving satisfactory naturalness measures, including NIQE, LPIPS, and FID. The design combines Residual Dense Convolution Blocks (RDC), a multi-stage feature fusion strategy, efficient upsampling, attention mechanisms such as CBAM, and post-processing; its lightweight architecture provides scalability and efficiency, making SMFD-UNet a flexible solution for facial image restoration research and practical applications.
Chinese Translation
在人脸识别、法医分析、摄影改进和医学影像诊断等应用中,人脸图像去模糊是计算机视觉中的一项重要任务,它能够从模糊输入中恢复高质量图像。传统的去模糊技术通常基于一般图像先验,难以捕捉人脸特有的结构和身份特征。我们提出了SMFD-UNet(语义面具融合去模糊UNet),这是一个新的轻量级框架,利用语义面具驱动去模糊过程,从而消除了为解决这些问题而需要高质量参考照片的需求。首先,我们的双步方法使用基于UNet的语义面具生成器,直接从模糊照片中提取详细的人脸组件面具(例如眼睛、鼻子、嘴巴)。随后,通过在计算效率高的UNet框架内使用多阶段特征融合技术,将这些面具与模糊输入结合,生成清晰、高保真的人脸图像。我们创建了一个随机模糊管道,粗略模拟现实世界情况,模拟了约1.74万亿种退化场景,从而确保了鲁棒性。在CelebA数据集上进行的测试表明,SMFD-UNet的性能优于最先进的模型,获得了更高的峰值信噪比(PSNR)和结构相似性指数(SSIM),同时保持了令人满意的自然性指标,包括NIQE、LPIPS和FID。借助残差密集卷积块(RDC)、多阶段特征融合策略、高效有效的上采样技术、注意力技术(如CBAM)、后处理技术以及轻量级设计,SMFD-UNet确保了可扩展性和效率,使其成为人脸图像恢复研究和实用应用的灵活解决方案。
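The SMFD-UNet abstract reports results in terms of PSNR. As a reminder of what that number measures, here is a minimal sketch of the standard definition, PSNR = 10 * log10(MAX^2 / MSE), for grayscale images stored as nested lists. The 8-bit peak value of 255 and the function name are assumptions for illustration; this is not the authors' evaluation code.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized grayscale images."""
    squared_error = 0.0
    count = 0
    for ref_row, res_row in zip(reference, restored):
        for ref_pixel, res_pixel in zip(ref_row, res_row):
            squared_error += (ref_pixel - res_pixel) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


sharp = [[100, 100]]
deblurred = [[110, 90]]       # each pixel is off by 10, so MSE = 100
score = psnr(sharp, deblurred)
```

Higher PSNR means the restored image is closer to the reference pixel by pixel, which is why it is usually reported alongside perceptual measures such as LPIPS.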
cs.CV / 6 / 2604.07522
Training-free Spatially Grounded Geometric Shape Encoding (Technical Report)
无训练的空间基础几何形状编码(技术报告)
Abstract
Positional encoding has become the de facto standard for grounding deep neural networks on discrete point-wise positions, and it has achieved remarkable success in tasks where the input can be represented as a one-dimensional sequence. However, extending this concept to 2D spatial geometric shapes demands carefully designed encoding strategies that account not only for shape geometry and pose, but also for compatibility with neural network learning. In this work, we address these challenges by introducing a training-free, general-purpose encoding strategy, dubbed XShapeEnc, that encodes an arbitrary spatially grounded 2D geometric shape into a compact representation exhibiting five favorable properties, including invertibility, adaptivity, and frequency richness. Specifically, a 2D spatially grounded geometric shape is decomposed into its normalized geometry within the unit disk and its pose vector, where the pose is further transformed into a harmonic pose field that also lies within the unit disk. A set of orthogonal Zernike bases is constructed to encode shape geometry and pose either independently or jointly, followed by a frequency-propagation operation to introduce high-frequency content into the encoding. We demonstrate the theoretical validity, efficiency, discriminability, and applicability of XShapeEnc via extensive analysis and experiments across a wide range of shape-aware tasks and our self-curated XShapeCorpus. We envision XShapeEnc as a foundational tool for research that goes beyond one-dimensional sequential data toward frontier 2D spatial intelligence.
Chinese Translation
位置编码已成为将深度神经网络与离散点位位置相结合的事实标准,并在输入可以表示为一维序列的任务中取得了显著成功。然而,将这一概念扩展到二维空间几何形状需要精心设计的编码策略,不仅要考虑形状的几何特征和姿态,还要与神经网络学习兼容。在本研究中,我们通过引入一种无训练的通用编码策略,称为 XShapeEnc,来解决这些挑战,该策略将任意空间基础的二维几何形状编码为一种紧凑表示,具有可逆性、自适应性和频率丰富性等五个优良特性。具体而言,二维空间基础几何形状被分解为单位圆盘内的标准化几何形状及其姿态向量,其中姿态进一步转化为也位于单位圆盘内的谐波姿态场。构建一组正交的 Zernike 基底,以独立或联合方式编码形状的几何特征和姿态,随后进行频率传播操作,以将高频内容引入编码中。我们通过广泛的分析和实验,展示了 XShapeEnc 的理论有效性、效率、可区分性和适用性,涵盖了广泛的形状感知任务和我们自编制的 XShapeCorpus。我们设想 XShapeEnc 作为一种基础工具,推动研究超越一维序列数据,迈向前沿的二维空间智能。
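The XShapeEnc abstract builds its encoding on orthogonal Zernike bases over the unit disk. The sketch below evaluates the textbook Zernike radial polynomial and the corresponding complex basis function; it illustrates the basis only, and does not reproduce the paper's pose field or frequency-propagation steps. Function names are hypothetical.

```python
import math


def zernike_radial(n, m, rho):
    """Radial polynomial R_n^|m|(rho) for 0 <= rho <= 1.

    Defined as zero when |m| > n or when n - |m| is odd.
    """
    m = abs(m)
    if m > n or (n - m) % 2 != 0:
        return 0.0
    total = 0.0
    for k in range((n - m) // 2 + 1):
        numerator = (-1) ** k * math.factorial(n - k)
        denominator = (
            math.factorial(k)
            * math.factorial((n + m) // 2 - k)
            * math.factorial((n - m) // 2 - k)
        )
        total += numerator / denominator * rho ** (n - 2 * k)
    return total


def zernike(n, m, rho, theta):
    """Complex Zernike function V_n^m(rho, theta) = R_n^|m|(rho) * exp(i * m * theta)."""
    phase = complex(math.cos(m * theta), math.sin(m * theta))
    return zernike_radial(n, m, rho) * phase


# R_2^0(rho) = 2 * rho^2 - 1, so at rho = 0.5 the value is -0.5.
defocus_at_half = zernike_radial(2, 0, 0.5)
```

Projecting a shape's indicator function on the unit disk onto a set of such functions yields coefficients that can be inverted back to the shape, which is consistent with the invertibility property the abstract mentions.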
cs.CV / 7 / 2604.07563
On the Uphill Battle of Image frequency Analysis
图像频率分析的艰难挑战
Abstract
This work is a follow-up to the recently proposed clustering algorithm called the Inverse Square Mean Shift Algorithm. In this paper, a special case of the algorithm for handling non-homogeneous data is formulated, and the three-dimensional Fast Fourier Transform of images is investigated with the aim of finding hidden patterns.
Chinese Translation
本研究是对新提出的聚类算法——逆平方均值漂移算法(The Inverse Square Mean Shift Algorithm)的后续工作。本文针对非均质数据的处理提出了该算法的一种特殊情形,并研究了图像的三维快速傅里叶变换(Fast Fourier Transform),旨在寻找隐藏的模式。
cs.CV / 8 / 2604.07574
Mathematical Analysis of Image Matching Techniques
图像匹配技术的数学分析
Abstract
Image matching is a fundamental problem in Computer Vision with direct applications in robotics, remote sensing, and geospatial data analysis. We present an analytical and experimental evaluation of classical local feature-based image matching algorithms on satellite imagery, focusing on the Scale-Invariant Feature Transform (SIFT) and the Oriented FAST and Rotated BRIEF (ORB). Each method is evaluated through a common pipeline: keypoint detection, descriptor extraction, descriptor matching, and geometric verification via RANSAC with homography estimation. Matching quality is assessed using the Inlier Ratio - the fraction of correspondences consistent with the estimated homography. The study uses a manually constructed dataset of GPS-annotated satellite image tiles with intentional overlaps. We examine the impact of the number of extracted keypoints on the resulting Inlier Ratio.
Chinese Translation
图像匹配是计算机视觉中的一个基本问题,直接应用于机器人技术、遥感和地理空间数据分析。我们对经典的基于局部特征的图像匹配算法在卫星图像上的分析和实验评估进行了介绍,重点关注尺度不变特征变换(Scale-Invariant Feature Transform, SIFT)和方向FAST与旋转BRIEF(Oriented FAST and Rotated BRIEF, ORB)。每种方法通过一个共同的流程进行评估:关键点检测、描述符提取、描述符匹配,以及通过RANSAC和单应性估计进行几何验证。匹配质量通过内点比例(Inlier Ratio)进行评估,即与估计的单应性一致的对应点的比例。本研究使用了一个手动构建的GPS标注卫星图像切片数据集,切片之间有意重叠。我们考察了提取的关键点数量对最终内点比例的影响。
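The abstract above defines the Inlier Ratio as the fraction of correspondences consistent with the estimated homography. A minimal sketch of that final step follows. It assumes the 3x3 homography has already been estimated (for example by RANSAC) and that "consistent" means a reprojection error below a pixel threshold; the 3.0 pixel default and the function names are assumptions, not values from the paper.

```python
import math


def project(H, point):
    """Apply a 3x3 homography (nested lists) to a 2D point."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    px = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    py = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return px, py


def inlier_ratio(H, matches, threshold=3.0):
    """Fraction of (source, target) matches whose reprojection error is within threshold."""
    inliers = 0
    for source, target in matches:
        px, py = project(H, source)
        error = math.hypot(px - target[0], py - target[1])
        if error <= threshold:
            inliers += 1
    return inliers / len(matches)


# Hypothetical example: a pure translation by (10, 5) and four putative matches.
H_shift = [[1, 0, 10], [0, 1, 5], [0, 0, 1]]
putative = [((0, 0), (10, 5)), ((1, 1), (11, 6)), ((2, 2), (12, 8)), ((3, 3), (50, 50))]
ratio = inlier_ratio(H_shift, putative)
```

In a full pipeline the matches would come from SIFT or ORB descriptor matching, and the homography from a robust estimator; this sketch covers only the scoring step.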
cs.CV / 9 / 2604.07577
Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models
基于可解释视觉模型的视频中外科器械交接的事件级检测
Abstract
Reliable monitoring of surgical instrument exchanges is essential for maintaining procedural efficiency and patient safety in the operating room. Automatic detection of instrument handovers in intraoperative video remains challenging due to frequent occlusions, background clutter, and the temporally evolving nature of interaction events. We propose a spatiotemporal vision framework for event-level detection and direction classification of surgical instrument handovers in surgical videos. The model combines a Vision Transformer (ViT) backbone for spatial feature extraction with a unidirectional Long Short-Term Memory (LSTM) network for temporal aggregation. A unified multi-task formulation jointly predicts handover occurrence and interaction direction, enabling consistent modeling of transfer dynamics while avoiding error propagation typical of cascaded pipelines. Predicted confidence scores form a temporal signal over the video, from which discrete handover events are identified via peak detection. Experiments on a dataset of kidney transplant procedures demonstrate strong performance, achieving an F1-score of 0.84 for handover detection and a mean F1-score of 0.72 for direction classification, outperforming both a single-task variant and a VideoMamba-based baseline for direction prediction while maintaining comparable detection performance. To improve interpretability, we employ Layer-CAM attribution to visualize spatial regions driving model decisions, highlighting hand-instrument interaction cues.
Chinese Translation
可靠监测外科器械的交接对于维持手术室的程序效率和患者安全至关重要。在术中视频中自动检测器械交接仍然面临挑战,主要由于频繁的遮挡、背景杂乱以及交互事件的时间演变特性。我们提出了一种时空视觉框架,用于外科视频中外科器械交接的事件级检测和方向分类。该模型结合了用于空间特征提取的视觉变换器(Vision Transformer, ViT)主干网络与用于时间聚合的单向长短期记忆网络(Long Short-Term Memory, LSTM)。统一的多任务公式共同预测交接发生和交互方向,从而实现转移动态的一致建模,同时避免了级联管道中典型的错误传播。预测的置信度分数在视频上形成时间信号,通过峰值检测识别离散的交接事件。在肾脏移植手术数据集上的实验表明,该方法表现出色,交接检测的F1分数达到0.84,方向分类的平均F1分数为0.72,超越了单任务变体和基于VideoMamba的方向预测基线,同时保持了可比的检测性能。为了提高可解释性,我们采用Layer-CAM归因方法可视化驱动模型决策的空间区域,突出了手与器械的交互线索。
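The abstract above turns per-frame confidence scores into discrete handover events via peak detection. A minimal sketch of one such scheme is given below: local maxima above a threshold, with weaker peaks suppressed when they fall too close to a stronger one. The threshold, the minimum gap, and the function name are illustrative assumptions; the abstract does not specify the authors' settings.

```python
def detect_events(scores, threshold=0.5, min_gap=3):
    """Return frame indices of confidence peaks treated as discrete events.

    A frame is a candidate if its score is at least `threshold`, strictly
    higher than the previous frame, and not lower than the next frame.
    Candidates closer than `min_gap` frames to a stronger kept peak are dropped.
    """
    candidates = [
        i
        for i in range(1, len(scores) - 1)
        if scores[i] >= threshold
        and scores[i] > scores[i - 1]
        and scores[i] >= scores[i + 1]
    ]
    # Keep the strongest peaks first, then suppress close neighbours.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        if all(abs(i - j) >= min_gap for j in kept):
            kept.append(i)
    return sorted(kept)


confidence = [0.1, 0.2, 0.9, 0.3, 0.8, 0.2, 0.1, 0.1, 0.7, 0.2]
events = detect_events(confidence)
```

The minimum gap prevents a single handover that spans several frames from being counted as multiple events.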
cs.CV / 10 / 2604.07578
MSGL-Transformer: A Multi-Scale Global-Local Transformer for Rodent Social Behavior Recognition
MSGL-Transformer:一种用于啮齿动物社会行为识别的多尺度全局-局部变换器
Abstract
Recognition of rodent behavior is important for understanding neural and behavioral mechanisms. Traditional manual scoring is time-consuming and prone to human error. We propose MSGL-Transformer, a Multi-Scale Global-Local Transformer for recognizing rodent social behaviors from pose-based temporal sequences. The model employs a lightweight transformer encoder with multi-scale attention to capture motion dynamics across different temporal scales. The architecture integrates parallel short-range, medium-range, and global attention branches to explicitly capture behavior dynamics at multiple temporal scales. We also introduce a Behavior-Aware Modulation (BAM) block, inspired by SE-Networks, which modulates temporal embeddings to emphasize behavior-relevant features prior to attention. We evaluate on two datasets: RatSI (5 behavior classes, 12D pose inputs) and CalMS21 (4 behavior classes, 28D pose inputs). On RatSI, MSGL-Transformer achieves 75.4% mean accuracy and F1-score of 0.745 across nine cross-validation splits, outperforming TCN, LSTM, and Bi-LSTM. On CalMS21, it achieves 87.1% accuracy and F1-score of 0.8745, a +10.7% improvement over HSTWFormer, and outperforms ST-GCN, MS-G3D, CTR-GCN, and STGAT. The same architecture generalizes across both datasets with only input dimensionality and number of classes adjusted.
Chinese Translation
啮齿动物行为的识别对于理解神经和行为机制至关重要。传统的人工评分方法耗时且容易出现人为错误。我们提出了MSGL-Transformer,一种用于从基于姿态的时间序列中识别啮齿动物社会行为的多尺度全局-局部变换器。该模型采用轻量级变换器编码器,结合多尺度注意力机制,以捕捉不同时间尺度下的运动动态。该架构整合了并行的短程、中程和全局注意力分支,以明确捕捉多个时间尺度下的行为动态。我们还引入了一种行为感知调制(Behavior-Aware Modulation, BAM)模块,灵感来源于SE-Networks,该模块在注意力之前调制时间嵌入,以强调与行为相关的特征。我们在两个数据集上进行了评估:RatSI(5个行为类别,12维姿态输入)和CalMS21(4个行为类别,28维姿态输入)。在RatSI上,MSGL-Transformer实现了75.4%的平均准确率和0.745的F1-score,在九个交叉验证分割中表现优于TCN、LSTM和Bi-LSTM。在CalMS21上,它实现了87.1%的准确率和0.8745的F1-score,较HSTWFormer提高了10.7%,并且优于ST-GCN、MS-G3D、CTR-GCN和STGAT。同样的架构在两个数据集上具有良好的泛化能力,仅调整输入维度和类别数量。
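The abstract above describes the BAM block only as "inspired by SE-Networks". For readers unfamiliar with that idea, a generic squeeze-and-excitation gate applied to a sequence of pose embeddings might look like the sketch below. It is not the BAM block itself: the function `se_gate`, the weight matrices `w1` and `w2`, and the toy shapes are assumptions made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Multiply matrix `w` (rows x cols) by vector `x` (cols)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def se_gate(embeddings: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """Squeeze-and-excitation style channel gating over a (T x D) sequence."""
    t = len(embeddings)
    d = len(embeddings[0])
    # Squeeze: average each channel over time.
    squeezed = [sum(frame[c] for frame in embeddings) / t for c in range(d)]
    # Excitation: bottleneck MLP (ReLU, then sigmoid) yields one gate per channel.
    hidden = [max(0.0, h) for h in matvec(w1, squeezed)]
    gates = [1.0 / (1.0 + math.exp(-g)) for g in matvec(w2, hidden)]
    # Scale: reweight the channels of every frame by the gates.
    return [[frame[c] * gates[c] for c in range(d)] for frame in embeddings]


# Toy example: 2 frames, 2 channels, bottleneck width 1 (all values hypothetical).
frames = [[2.0, 4.0], [0.0, 2.0]]
gated = se_gate(frames, w1=[[1.0, 1.0]], w2=[[0.0], [0.0]])
```

With the all-zero `w2` used here every gate equals 0.5, so the output is simply the input halved; trained weights would instead emphasize some channels over others.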
cs.CV / 11 / 2604.07606
Bootstrapping Sign Language Annotations with Sign Language Models
利用手语模型引导手语注释的自助法
Abstract
AI-driven sign language interpretation is limited by a lack of high-quality annotated data. New datasets including ASL STEM Wiki and FLEURS-ASL feature professional interpreters and hundreds of hours of data but remain only partially annotated and thus underutilized, in part due to the prohibitive cost of annotating at this scale. In this work, we develop a pseudo-annotation pipeline that takes signed video and English as input and outputs a ranked set of likely annotations, including time intervals, for glosses, fingerspelled words, and sign classifiers. Our pipeline uses sparse predictions from our fingerspelling recognizer and isolated sign recognizer (ISR), along with a K-Shot LLM approach, to estimate these annotations. In service of this pipeline, we establish simple yet effective baseline fingerspelling and ISR models, achieving state-of-the-art results on FSBoard (6.7% CER) and ASL Citizen (74% top-1 accuracy). To validate the pipeline and provide a gold-standard benchmark, a professional interpreter annotated nearly 500 videos from ASL STEM Wiki with sequence-level gloss labels containing glosses, classifiers, and fingerspelling signs. These human annotations and over 300 hours of pseudo-annotations are being released in supplemental material.
Chinese Translation
基于人工智能的手语翻译受到高质量注释数据缺乏的限制。新数据集如ASL STEM Wiki和FLEURS-ASL包含专业翻译人员和数百小时的数据,但仍然只有部分注释,因此未得到充分利用,部分原因是大规模注释的成本过高。在本研究中,我们开发了一种伪注释管道,该管道以手语视频和英语作为输入,输出一组可能的注释,按排名排列,包括手语词汇、手指拼写单词和手语分类器的时间区间。我们的管道利用了来自手指拼写识别器和孤立手语识别器(ISR)的稀疏预测,以及K-Shot LLM方法来估计这些注释。为了支持这一管道,我们建立了简单而有效的基线手指拼写和ISR模型,在FSBoard数据集上达到了最先进的水平(6.7% CER),在ASL Citizen数据集上达到了74%的顶级准确率。为了验证并提供一个黄金标准基准,一名专业翻译人员对ASL STEM Wiki中的近500个视频进行了序列级别的注释,包含手语词汇、分类器和手指拼写符号。这些人工注释和超过300小时的伪注释将作为补充材料发布。
cs.CV / 12 / 2604.07634
VSAS-BENCH: Real-Time Evaluation of Visual Streaming Assistant Models
VSAS-BENCH:视觉流助手模型的实时评估
Abstract
Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. This is a core mechanism for real-time visual assistants. Existing VLM frameworks predominantly assess models in offline settings. In contrast, the performance of a streaming VLM depends on additional metrics beyond pure video understanding, including proactiveness, which reflects the timeliness of the model's responses, and consistency, which captures the robustness of its responses over time. To address this limitation, we propose VSAS-Bench, a new framework and benchmark for Visual Streaming Assistants. In contrast to prior benchmarks that primarily employ single-turn question answering on video inputs, VSAS-Bench features temporally dense annotations with over 18,000 annotations across diverse input domains and task types. We introduce standardized synchronous and asynchronous evaluation protocols, along with metrics that isolate and measure distinct capabilities of streaming VLMs. Using this framework, we conduct large-scale evaluations of recent video and streaming VLMs, analyzing the accuracy-latency trade-off under key design factors such as memory buffer length, memory access policy, and input resolution, yielding several practical insights. Finally, we show empirically that conventional VLMs can be adapted to streaming settings without additional training, and demonstrate that these adapted models outperform recent streaming VLMs. For example, Qwen3-VL-4B surpasses Dispider, the best streaming VLM on our benchmark, by 3% under the asynchronous protocol. The benchmark and code will be available at https://github.com/apple/ml-vsas-bench.
Chinese Translation
流媒体视觉语言模型(VLMs)根据指令提示和在线输入帧流持续生成响应。这是实时视觉助手的核心机制。现有的VLM框架主要在离线环境中评估模型。相比之下,流媒体VLM的性能依赖于超出纯视频理解的额外指标,包括主动性(proactiveness),反映模型响应的及时性,以及一致性(consistency),捕捉其响应随时间的稳健性。为了解决这一局限性,我们提出了VSAS-Bench,一个新的视觉流助手框架和基准。与以往主要在视频输入上进行单轮问答的基准不同,VSAS-Bench具有时间密集的注释,涵盖超过18,000个注释,涉及多种输入领域和任务类型。我们引入了标准化的同步和异步评估协议,以及能够隔离和测量流媒体VLM不同能力的指标。利用这一框架,我们对最近的视频和流媒体VLM进行了大规模评估,分析了在关键设计因素(如内存缓冲区长度、内存访问策略和输入分辨率)下的准确性-延迟权衡,得出了一些实用的见解。最后,我们实证表明,传统VLM可以在不额外训练的情况下适应流媒体环境,并证明这些适应后的模型在性能上优于最近的流媒体VLM。例如,Qwen3-VL-4B在我们的基准下的异步协议下超越了Dispider,这是最佳的流媒体VLM,提升了3%。基准和代码将可在 https://github.com/apple/ml-vsas-bench 获取。
cs.CV / 13 / 2604.07664
Monocular Depth Estimation From the Perspective of Feature Restoration: A Diffusion Enhanced Depth Restoration Approach
从特征恢复的角度进行单目深度估计:一种扩散增强的深度恢复方法
Abstract
Monocular Depth Estimation (MDE) is a fundamental computer vision task with important applications in 3D vision. The current mainstream MDE methods employ an encoder-decoder architecture with multi-level/scale feature processing. However, the limitations of this architecture and the effects of different-level features on prediction accuracy have not been evaluated. In this paper, we first investigate this problem and show that the current framework still has substantial potential if encoder features can be improved. We therefore formulate depth estimation from a feature restoration perspective, treating pretrained encoder features as degraded versions of an assumed ground truth feature that yields the ground truth depth map. An Invertible Transform-enhanced Indirect Diffusion (InvT-IndDiffusion) module is then developed for feature restoration. Because no direct supervision on features is available, only indirect supervision from the final sparse depth map is used, which causes feature deviations among steps during the iterative diffusion procedure. The proposed InvT-IndDiffusion solves this problem by using an invertible transform-based decoder under the bi-Lipschitz condition. Finally, a plug-and-play Auxiliary Viewpoint-based Low-level Feature Enhancement module (AV-LFE) is developed to enhance local details with an auxiliary viewpoint when available. Experiments demonstrate that the proposed method achieves better performance than state-of-the-art methods on various datasets. Specifically, on the KITTI benchmark, RMSE improves over the baseline by 4.09% and 37.77% under different training settings. Code is available at https://github.com/whitehb1/IID-RDepth.
Chinese Translation
单目深度估计(Monocular Depth Estimation, MDE)是计算机视觉中的一项基础任务,在三维视觉中具有重要应用。目前主流的MDE方法采用编码器-解码器架构,并进行多层次/尺度特征处理。然而,当前架构的局限性以及不同层次特征对预测精度的影响尚未得到评估。本文首先研究了上述问题,并表明如果能够改善编码器特征,当前框架仍然具有相当大的潜力。因此,我们提出从特征恢复的角度来表述深度估计问题,将预训练的编码器特征视为假设真实特征的退化特征,该真实特征生成真实的深度图。接着,开发了一种可逆变换增强的间接扩散(Invertible Transform-enhanced Indirect Diffusion, InvT-IndDiffusion)模块用于特征恢复。由于缺乏对特征的直接监督,仅使用来自最终稀疏深度图的间接监督。在扩散的迭代过程中,这导致了步骤之间特征的偏差。所提出的InvT-IndDiffusion通过在双利普希茨条件下使用基于可逆变换的解码器来解决此问题。最后,开发了一种即插即用的基于辅助视点的低层特征增强模块(Auxiliary Viewpoint-based Low-level Feature Enhancement, AV-LFE),在可用时利用辅助视点增强局部细节。实验表明,所提出的方法在各种数据集上优于最先进的方法。具体而言,在KITTI基准测试中,与基线相比,在不同训练设置下,RMSE性能分别提高了4.09%和37.77%。代码可在 https://github.com/whitehb1/IID-RDepth 获取。
cs.CV / 14 / 2604.07665
Adaptive Depth-converted-Scale Convolution for Self-supervised Monocular Depth Estimation
自适应深度转换尺度卷积用于自监督单目深度估计
Abstract
Self-supervised monocular depth estimation (MDE) has received increasing interest in the last few years. The objects in the scene, including object size and the relationships among different objects, are the main clues for extracting the scene structure. However, previous works do not explicitly handle the change in an object's apparent size as its depth changes. In a monocular video in particular, the size of the same object changes continuously, resulting in size and depth ambiguity. To address this problem, we propose a Depth-converted-Scale Convolution (DcSConv) enhanced monocular depth estimation framework, which incorporates the prior relationship between object depth and object scale to extract features from appropriate scales of the convolution receptive field. The proposed DcSConv focuses on the adaptive scale of the convolution filter instead of the local deformation of its shape. It establishes that the scale of the convolution filter matters no less (or even more in the evaluated task) than its local deformation. Moreover, a Depth-converted-Scale aware Fusion (DcS-F) is developed to adaptively fuse the DcSConv features and the conventional convolution features. Our DcSConv enhanced monocular depth estimation framework can be applied on top of existing CNN-based methods as a plug-and-play module to enhance the conventional convolution block. Extensive experiments with different baselines have been conducted on the KITTI benchmark, and our method achieves the best results, with an improvement of up to 11.6% in terms of SqRel reduction. An ablation study also validates the effectiveness of each proposed module.
Chinese Translation
自监督单目深度估计(MDE)在过去几年中受到了越来越多的关注。场景中的物体,包括物体的大小和不同物体之间的关系,是提取场景结构的主要线索。然而,以往的研究缺乏对物体因深度变化而导致的大小变化的明确处理。特别是在单目视频中,同一物体的大小会不断变化,导致大小和深度的模糊性。为了解决这个问题,我们提出了一种深度转换尺度卷积(Depth-converted-Scale Convolution, DcSConv)增强的单目深度估计框架,通过结合物体深度与物体尺度之间的先验关系,从适当的卷积感受野尺度中提取特征。所提出的DcSConv关注卷积滤波器的自适应尺度,而不是其形状的局部变形。它确立了卷积滤波器的尺度在评估任务中与其局部变形同样重要(甚至更重要)。此外,我们开发了一种深度转换尺度感知融合(Depth-converted-Scale aware Fusion, DcS-F),以自适应地融合DcSConv特征和传统卷积特征。我们的DcSConv增强的单目深度估计框架可以作为即插即用模块应用于现有的基于CNN的方法,以增强传统的卷积块。在KITTI基准上进行了与不同基线的广泛实验,我们的方法在SqRel减少方面取得了最佳结果,提升幅度高达11.6%。消融研究也验证了每个提出模块的有效性。
cs.CV / 15 / 2604.07674
Weight Group-wise Post-Training Quantization for Medical Foundation Model
面向医疗基础模型的权重分组后训练量化方法
Abstract
Foundation models have achieved remarkable results in medical image analysis. However, their large network architectures and high computational complexity significantly reduce inference speed, limiting their application on terminal medical devices. Quantization, a technique that compresses models into low-bit versions, is a solution to this challenge. In this paper, we propose a post-training quantization algorithm, Permutation-COMQ. It eliminates the need for backpropagation by using simple dot products and rounding operations, thereby removing hyperparameter tuning and simplifying the process. Additionally, we introduce a weight-aware strategy that reorders the weights within each layer to address the accuracy degradation induced by channel-wise scaling during quantization, while preserving channel structure. Experiments demonstrate that our method achieves the best results in 2-bit, 4-bit, and 8-bit quantization.
Chinese Translation
基础模型在医学图像分析中取得了显著成果。然而,其庞大的网络架构和高计算复杂度显著影响推理速度,限制了其在终端医疗设备上的应用。量化作为一种将模型压缩为低位宽版本的技术,是解决该挑战的有效手段。本文提出了一种后训练量化算法Permutation-COMQ。该方法通过简单的点积和舍入操作,消除了反向传播的需求,从而避免了超参数调优并简化了流程。此外,我们引入了一种权重感知策略,通过在每层内部重新排序权重,解决了量化过程中通道尺度缩放引起的精度下降问题,同时保持了通道结构。实验结果表明,我们的方法在2位、4位和8位量化中均取得了最佳效果。
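To make the terms in the abstract above concrete, the sketch below shows plain symmetric round-to-nearest quantization with one scale per channel, which is the kind of channel-wise scaling and rounding the abstract refers to. It is deliberately not Permutation-COMQ: the weight reordering and the dot-product-based update are not reproduced, and the names `quantize_channel` and `dequantize` are assumptions made for illustration.

```python
from typing import List, Tuple


def quantize_channel(weights: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one channel; returns (codes, scale)."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit signed codes
    max_abs = max(abs(w) for w in weights)
    # One scale per channel maps the largest magnitude onto the largest code.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Hypothetical channel of four weights, quantized to 4 bits.
codes, scale = quantize_channel([7.0, -3.0, 0.4, 0.0], bits=4)
restored = dequantize(codes, scale)
```

Because the scale is set by the largest weight in the channel, a single outlier coarsens the grid for every other weight in that channel, which is one reason per-channel scaling can cost accuracy at low bit widths.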
cs.CV / 16 / 2604.07675
FireSenseNet: A Dual-Branch CNN with Cross-Attentive Feature Interaction for Next-Day Wildfire Spread Prediction
FireSenseNet:一种具有交叉注意特征交互的双分支卷积神经网络,用于次日野火传播预测
Abstract
Accurate prediction of next-day wildfire spread is critical for disaster response and resource allocation. Existing deep learning approaches typically concatenate heterogeneous geospatial inputs into a single tensor, ignoring the fundamental physical distinction between static fuel/terrain properties and dynamic meteorological conditions. We propose FireSenseNet, a dual-branch convolutional neural network equipped with a novel Cross-Attentive Feature Interaction Module (CAFIM) that explicitly models the spatially varying interaction between fuel and weather modalities through learnable attention gates at multiple encoder scales. Through a systematic comparison of seven architectures -- spanning pure CNNs, Vision Transformers, and hybrid designs -- on the Google Next-Day Wildfire Spread benchmark, we demonstrate that FireSenseNet achieves an F1 of 0.4176 and AUC-PR of 0.3435, outperforming all alternatives including a SegFormer with 3.8× more parameters (F1 = 0.3502). Ablation studies confirm that CAFIM provides a 7.1% relative F1 gain over naive concatenation, and channel-wise feature importance analysis reveals that the previous-day fire mask dominates prediction while wind speed acts as noise at the dataset's coarse temporal resolution. We further incorporate Monte Carlo Dropout for pixel-level uncertainty quantification and present a critical analysis showing that common evaluation shortcuts inflate reported F1 scores by over 44%.
Chinese Translation
准确预测次日野火传播对于灾害响应和资源分配至关重要。现有的深度学习方法通常将异构地理空间输入连接成一个单一的张量,忽略了静态燃料/地形特性与动态气象条件之间的基本物理区别。我们提出了FireSenseNet,这是一种双分支卷积神经网络,配备了新颖的交叉注意特征交互模块(Cross-Attentive Feature Interaction Module, CAFIM),该模块通过可学习的注意力门在多个编码器尺度上明确建模燃料与天气模态之间的空间变化交互。通过对七种架构的系统比较——涵盖纯CNN、视觉变换器(Vision Transformers)和混合设计——在Google次日野火传播基准测试上,我们证明FireSenseNet达到了0.4176的F1值和0.3435的AUC-PR,超越了所有替代方案,包括参数多达3.8倍的SegFormer(F1 = 0.3502)。消融研究确认CAFIM相较于简单连接提供了7.1%的相对F1增益,通道特征重要性分析显示,前一天的火灾掩膜主导了预测,而风速在数据集的粗时间分辨率下则充当噪声。我们进一步结合了蒙特卡洛Dropout进行像素级不确定性量化,并呈现了关键分析,显示常见评估捷径使报告的F1分数膨胀超过44%。
cs.CV / 17 / 2604.07722
Needle in a Haystack -- One-Class Representation Learning for Detecting Rare Malignant Cells in Computational Cytology
大海捞针——用于计算细胞学中罕见恶性细胞检测的一类表示学习
Abstract
In computational cytology, detecting malignancy on whole-slide images is difficult because malignant cells are morphologically diverse yet vanishingly rare amid a vast background of normal cells. Accurate detection of these extremely rare malignant cells remains challenging due to large class imbalance and limited annotations. Conventional weakly supervised approaches, such as multiple instance learning (MIL), often fail to generalize at the instance level, especially when the fraction of malignant cells (witness rate) is exceedingly low. In this study, we explore the use of one-class representation learning techniques for detecting malignant cells in low-witness-rate scenarios. These methods are trained exclusively on slide-negative patches, without requiring any instance-level supervision. Specifically, we evaluate two one-class classification (OCC) approaches, DSVDD and DROC, and compare them with FS-SIL, WS-SIL, and the recent ItS2CLR method. The one-class methods learn compact representations of normality and detect deviations at test time. Experiments on a publicly available bone marrow cytomorphology dataset (TCIA) and an in-house oral cancer cytology dataset show that DSVDD achieves state-of-the-art performance in instance-level abnormality ranking, particularly in ultra-low witness-rate regimes ($\leq 1\%$), and in some cases even outperforms fully supervised learning, which is typically not a practical option in whole-slide cytology because exhaustive instance-level annotation is infeasible. DROC is also competitive under extreme rarity, benefiting from distribution-augmented contrastive learning. These findings highlight one-class representation learning as a robust, interpretable alternative that outperforms MIL for malignant cell detection under extreme rarity.
Chinese Translation
在计算细胞学中,因恶性细胞形态多样且在大量正常细胞中极为稀少,整个切片图像上的恶性细胞检测极具挑战性。由于类别极度不平衡且标注有限,准确检测这些极其罕见的恶性细胞仍然困难重重。传统的弱监督方法,如多实例学习(MIL),常常难以在实例层面实现泛化,尤其当恶性细胞比例(witness rate)极低时。本文探讨了一类表示学习技术在低witness rate场景下检测恶性细胞的应用。这些方法仅在无恶性细胞的切片负样本上训练,无需任何实例级别的监督。具体而言,我们评估了两种一类分类(OCC)方法——DSVDD和DROC,并将其与FS-SIL、WS-SIL及最新的ItS2CLR方法进行了比较。一类方法学习正常细胞的紧凑表示,并在测试时检测偏离。基于公开骨髓细胞形态学数据集(TCIA)及内部口腔癌细胞学数据集的实验表明,DSVDD在实例级异常排序中实现了最先进的性能,尤其在超低witness rate(≤1%)条件下表现突出,部分情况下甚至优于全监督学习,而后者在全切片细胞学中因实例级标注的不可行性通常不具实用性。DROC在极端稀有条件下也表现出竞争力,得益于分布增强的对比学习。这些结果凸显了一类表示学习作为恶性细胞检测中面对极端稀有情况时,比MIL更稳健且具解释性的优越选择。
cs.CV / 18 / 2604.07723
Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation
无需优化Logits的直接分割方法用于无训练的开放词汇语义分割
Abstract
Open-vocabulary semantic segmentation (OVSS) aims to segment arbitrary category regions in images using open-vocabulary prompts, necessitating that existing methods possess pixel-level vision-language alignment capability. Typically, this capability involves computing the cosine similarity (i.e., the logits) between visual and linguistic features and minimizing the distribution discrepancy between the logits and the ground truth (GT) to obtain optimal logits, which are then used to construct segmentation maps. However, this depends on time-consuming iterative training or model-specific attention modulation. In this work, we propose a more direct approach that eschews the logits-optimization process by directly deriving an analytic solution for the segmentation map. We posit a key hypothesis: the distribution discrepancy encodes semantic information; specifically, this discrepancy exhibits consistency across patches belonging to the same category but inconsistency across different categories. Based on this hypothesis, we directly utilize the analytic solution of this distribution discrepancy as the semantic maps. In other words, we reformulate the optimization of the distribution discrepancy as deriving its analytic solution, thereby eliminating time-consuming iterative training, freeing us from model-specific attention modulation, and achieving state-of-the-art performance on eight benchmark datasets.
Chinese Translation
开放词汇语义分割(OVSS)旨在利用开放词汇提示对图像中的任意类别区域进行分割,这要求现有方法具备像素级的视觉-语言对齐能力。通常,这种能力涉及计算视觉特征与语言特征之间的余弦相似度,即Logits,并最小化Logits与真实标签(GT)之间的分布差异,以生成最优Logits,随后用于构建分割图。然而,这一过程依赖于耗时的迭代训练或特定模型的注意力调制。在本研究中,我们提出了一种更直接的方法,避免了Logits优化过程,通过直接推导分割图的解析解。我们提出一个关键假设:分布差异编码了语义信息;具体而言,这种差异在属于同一类别的图像块之间表现出一致性,而在不同类别之间则表现出不一致性。基于这一假设,我们直接利用该分布差异的解析解作为语义图。换句话说,我们将分布差异的优化重新表述为推导其解析解,从而消除了耗时的迭代训练,使我们摆脱了特定模型的注意力调制,并在八个基准数据集上实现了最先进的性能。
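The conventional starting point that the abstract above contrasts itself with (cosine-similarity logits between patch features and text features, followed by a per-patch decision) can be sketched as below. This illustrates only that baseline step, not the paper's analytic solution; the function names and the toy two-dimensional features are assumptions made for illustration.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def segment(patch_feats: List[List[float]], text_feats: List[List[float]]) -> List[int]:
    """Label each patch with the index of the most similar text embedding."""
    labels = []
    for p in patch_feats:
        logits = [cosine(p, t) for t in text_feats]  # one logit per category prompt
        labels.append(max(range(len(logits)), key=logits.__getitem__))
    return labels


# Toy features: three image patches and two category prompts (hypothetical values).
patches = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
prompts = [[1.0, 0.0], [0.0, 1.0]]
labels = segment(patches, prompts)
```

In a real pipeline the patch and text features would come from a vision-language encoder, and the flat list of labels would be reshaped to the patch grid and upsampled to image resolution.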
cs.CV / 19 / 2604.07728
GEAR: GEometry-motion Alternating Refinement for Articulated Object Modeling with Gaussian Splatting
GEAR:基于高斯点云的关节物体建模的几何-运动交替优化
Abstract
High-fidelity interactive digital assets are essential for embodied intelligence and robotic interaction, yet articulated objects remain challenging to reconstruct due to their complex structures and coupled geometry-motion relationships. Existing methods suffer from instability in joint geometry-motion optimization, and their generalization remains limited on complex multi-joint or out-of-distribution objects. To address these challenges, we propose GEAR, an EM-style alternating optimization framework that jointly models geometry and motion as interdependent components within a Gaussian Splatting representation. GEAR treats part segmentation as a latent variable and joint motion parameters as explicit variables, alternately refining them for improved convergence and geometry-motion consistency. To enhance part segmentation quality without sacrificing generalization, we leverage a vanilla 2D segmentation model to provide multi-view part priors, and employ a weakly supervised constraint to regularize the latent variable. Experiments on multiple benchmarks and our newly constructed dataset GEAR-Multi demonstrate that GEAR achieves state-of-the-art results in geometric reconstruction and motion parameter estimation, particularly on complex articulated objects with multiple movable parts.
Chinese Translation
高保真互动数字资产对于具身智能和机器人交互至关重要,但由于关节物体的复杂结构和耦合的几何-运动关系,其重建仍然具有挑战性。现有方法在几何-运动联合优化中存在不稳定性,同时在复杂多关节或超出分布范围的物体上泛化能力有限。为了解决这些挑战,我们提出了GEAR,一种基于EM风格的交替优化框架,它将几何和运动作为相互依赖的组件共同建模于高斯点云表示中。GEAR将部件分割视为潜变量,将关节运动参数视为显变量,交替优化它们以提高收敛性和几何-运动一致性。为了在不牺牲泛化能力的情况下提高部件分割质量,我们利用一个基础的二维分割模型提供多视角的部件先验,并采用弱监督约束来规范潜变量。在多个基准测试和我们新构建的数据集GEAR-Multi上的实验表明,GEAR在几何重建和运动参数估计方面达到了最先进的结果,特别是在具有多个可移动部件的复杂关节物体上。
cs.CV / 20 / 2604.07740
Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification
超越行人:基于字幕引导的 CLIP 框架用于高难度视频中的人物重识别
Abstract
In recent years, video-based person Re-Identification (ReID) has gained attention for its ability to leverage spatiotemporal cues to match individuals across non-overlapping cameras. However, current methods struggle with high-difficulty scenarios, such as sports and dance performances, where multiple individuals wear similar clothing while performing dynamic movements. To overcome these challenges, we propose CG-CLIP, a novel caption-guided CLIP framework that leverages explicit textual descriptions and learnable tokens. Our method introduces two key components: Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE). CMR utilizes captions generated by Multi-modal Large Language Models (MLLMs) to refine identity-specific features, capturing fine-grained details. TFE employs a cross-attention mechanism with fixed-length learnable tokens to efficiently aggregate spatiotemporal features, reducing computational overhead. We evaluate our approach on two standard datasets (MARS and iLIDS-VID) and two newly constructed high-difficulty datasets (SportsVReID and DanceVReID). Experimental results demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements across all benchmarks.
Chinese Translation
近年来,基于视频的人物重识别(ReID)因其利用时空线索在非重叠摄像头间匹配个体的能力而受到关注。然而,当前方法在高难度场景中表现不佳,例如体育和舞蹈表演,在这些场景中,多名个体穿着相似的服装并进行动态动作。为了解决这些挑战,我们提出了 CG-CLIP,一种新颖的字幕引导 CLIP 框架,利用显式文本描述和可学习的标记。我们的方法引入了两个关键组件:字幕引导记忆细化(CMR)和基于标记的特征提取(TFE)。CMR 利用多模态大型语言模型(MLLMs)生成的字幕来细化特定身份的特征,捕捉细粒度细节。TFE 采用固定长度的可学习标记的交叉注意机制,以高效聚合时空特征,减少计算开销。我们在两个标准数据集(MARS 和 iLIDS-VID)以及两个新构建的高难度数据集(SportsVReID 和 DanceVReID)上评估了我们的方法。实验结果表明,我们的方法在所有基准测试中均优于当前的最先进方法,取得了显著的改进。
cs.CV / 21 / 2604.07741
MSCT: Differential Cross-Modal Attention for Deepfake Detection
MSCT:用于深度伪造检测的差异跨模态注意力
Abstract
Audio-visual deepfake detection typically employs a complementary multi-modal model to check for forgery traces in the video. These methods primarily extract forgery traces through audio-visual alignment, exploiting the inconsistency between the audio and video modalities. However, traditional multi-modal forgery detection methods suffer from insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi-scale self-attention to integrate the features of adjacent embeddings and a differential cross-modal attention to fuse multi-modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.
Chinese Translation
音视频深度伪造检测通常采用互补的多模态模型来检查视频中的伪造痕迹。这些方法主要通过音视频对齐提取伪造痕迹,而这种对齐是由于音频和视频模态之间的不一致性。然而,传统的多模态伪造检测方法存在特征提取不足和模态对齐偏差的问题。为此,我们提出了一种多尺度跨模态变换编码器(MSCT)用于深度伪造检测。我们的方法包括多尺度自注意力机制,以整合相邻嵌入的特征,以及差异跨模态注意力机制,以融合多模态特征。我们的实验在FakeAVCeleb数据集上展示了竞争力的性能,验证了所提结构的有效性。
cs.CV / 22 / 2604.07753
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
共生-MoE:解锁生成与理解之间的协同作用
Abstract
Empowering Large Multimodal Models (LMMs) with image generation often leads to catastrophic forgetting in understanding tasks due to severe gradient conflicts. While existing paradigms like Mixture-of-Transformers (MoT) mitigate this conflict through structural isolation, they fundamentally sever cross-modal synergy and suffer from capacity fragmentation. In this work, we present Symbiotic-MoE, a unified pre-training framework that resolves task interference within a native multimodal Mixture-of-Experts (MoE) Transformers architecture with zero-parameter overhead. We first identify that standard MoE tuning leads to routing collapse, where generative gradients dominate expert utilization. To address this, we introduce Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups while utilizing shared experts as a multimodal semantic bridge. Crucially, this design allows shared experts to absorb fine-grained visual semantics from generative tasks to enrich textual representations. To optimize this, we propose a Progressive Training Strategy featuring differential learning rates and early-stage gradient shielding. This mechanism not only shields pre-trained knowledge from early volatility but eventually transforms generative signals into constructive feedback for understanding. Extensive experiments demonstrate that Symbiotic-MoE achieves rapid generative convergence while unlocking cross-modal synergy, boosting inherent understanding with remarkable gains on MMLU and OCRBench.
Chinese Translation
赋能大型多模态模型(LMMs)进行图像生成往往会导致理解任务中的灾难性遗忘,因为存在严重的梯度冲突。虽然现有的范式如混合变换器(Mixture-of-Transformers, MoT)通过结构隔离来缓解这种冲突,但它们从根本上切断了跨模态的协同作用,并且遭受容量碎片化。在本研究中,我们提出了共生-MoE(Symbiotic-MoE),这是一个统一的预训练框架,能够在原生多模态专家混合(Mixture-of-Experts, MoE)变换器架构中以零参数开销解决任务干扰。我们首先发现标准的MoE调优会导致路由崩溃,其中生成梯度主导了专家的利用。为了解决这个问题,我们引入了模态感知专家解耦(Modality-Aware Expert Disentanglement),将专家划分为特定任务的组,同时利用共享专家作为多模态语义桥。至关重要的是,这一设计使得共享专家能够从生成任务中吸收细粒度的视觉语义,以丰富文本表示。为了优化这一过程,我们提出了一种渐进训练策略(Progressive Training Strategy),该策略具有差异化学习率和早期梯度屏蔽。这一机制不仅保护了预训练知识免受早期波动的影响,而且最终将生成信号转化为对理解的建设性反馈。大量实验表明,共生-MoE在实现快速生成收敛的同时解锁了跨模态的协同作用,在MMLU和OCRBench上显著提升了固有理解能力。
cs.CV / 23 / 2604.07758
DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics
DailyArt:通过潜在动态从单一静态图像中发现关节
Abstract
Articulated objects are essential for embodied AI and world models, yet inferring their kinematics from a single closed-state image remains challenging because crucial motion cues are often occluded. Existing methods either require multi-state observations or rely on explicit part priors, retrieval, or other auxiliary inputs that partially expose the structure to be inferred. In this work, we present DailyArt, which formulates articulated joint estimation from a single static image as a synthesis-mediated reasoning problem. Instead of directly regressing joints from a heavily occluded observation, DailyArt first synthesizes a maximally articulated opened state under the same camera view to expose articulation cues, and then estimates the full set of joint parameters from the discrepancy between the observed and synthesized states. Using a set-prediction formulation, DailyArt recovers all joints simultaneously without requiring object-specific templates, multi-view inputs, or explicit part annotations at test time. Taking estimated joints as conditions, the framework further supports part-level novel state synthesis as a downstream capability. Extensive experiments show that DailyArt achieves strong performance in articulated joint estimation and supports part-level novel state synthesis conditioned on joints. Project page is available at https://rangooo123.github.io/DaliyArt.github.io/.
Chinese Translation
关节物体对于具身人工智能和世界模型至关重要,但从单一闭合状态图像中推断其运动学仍然具有挑战性,因为关键的运动线索往往被遮挡。现有方法要么需要多状态观测,要么依赖于显式的部件先验、检索或其他辅助输入,这些输入部分暴露了待推断的结构。在本研究中,我们提出了DailyArt,它将从单一静态图像中进行关节估计的过程表述为一个合成介导的推理问题。DailyArt并不是直接从高度遮挡的观测中回归关节,而是首先在相同的摄像机视角下合成一个最大化关节开放状态,以暴露关节线索,然后根据观测状态与合成状态之间的差异估计完整的关节参数。通过集合预测的形式,DailyArt能够同时恢复所有关节,而无需特定物体的模板、多视角输入或测试时的显式部件注释。在估计的关节作为条件的基础上,该框架进一步支持部件级的新状态合成作为下游能力。大量实验表明,DailyArt在关节估计方面表现出色,并支持基于关节的部件级新状态合成。项目页面可访问 https://rangooo123.github.io/DaliyArt.github.io/。
cs.CV / 24 / 2604.07759
WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects
WUTDet:一个包含10万规模船舶检测数据集及密集小物体基准测试
Abstract
Ship detection for navigation is a fundamental perception task in intelligent waterway transportation systems. However, existing public ship detection datasets remain limited in terms of scale, the proportion of small-object instances, and scene diversity, which hinders the systematic evaluation and generalization study of detection algorithms in complex maritime environments. To this end, we construct WUTDet, a large-scale ship detection dataset. WUTDet contains 100,576 images and 381,378 annotated ship instances, covering diverse operational scenarios such as ports, anchorages, navigation, and berthing, as well as various imaging conditions including fog, glare, low light, and rain, thereby exhibiting substantial diversity and challenge. Based on WUTDet, we systematically evaluate 20 baseline models from three mainstream detection architectures, namely CNN, Transformer, and Mamba. Experimental results show that the Transformer architecture achieves superior overall detection accuracy (AP) and small-object detection performance (APs), demonstrating stronger adaptability to complex maritime scenes; the CNN architecture maintains an advantage in inference efficiency, making it more suitable for real-time applications; and the Mamba architecture achieves a favorable balance between detection accuracy and computational efficiency. Furthermore, we construct a unified cross-dataset test set, Ship-GEN, to evaluate model generalization. Results on Ship-GEN show that models trained on WUTDet generalize better under different data distributions. These findings demonstrate that WUTDet provides effective data support for the research, evaluation, and generalization analysis of ship detection algorithms in complex maritime scenarios. The dataset is publicly available at: https://github.com/MAPGroup/WUTDet.
Chinese Translation
船舶检测在导航中是智能水路运输系统中的一项基础感知任务。然而,现有的公共船舶检测数据集在规模、小物体实例的比例和场景多样性方面仍然有限,这阻碍了在复杂海洋环境中对检测算法进行系统评估和泛化研究。为此,我们构建了WUTDet,一个大规模船舶检测数据集。WUTDet包含100,576张图像和381,378个标注的船舶实例,涵盖了港口、锚地、航行和停靠等多种操作场景,以及雾、眩光、低光照和雨等多种成像条件,从而展现出显著的多样性和挑战性。基于WUTDet,我们系统地评估了来自三种主流检测架构的20个基线模型,即CNN、Transformer和Mamba。实验结果表明,Transformer架构在整体检测准确率(AP)和小物体检测性能(APs)上表现优越,显示出对复杂海洋场景的更强适应性;CNN架构在推理效率上保持优势,更适合实时应用;而Mamba架构在检测准确性和计算效率之间实现了良好的平衡。此外,我们构建了一个统一的跨数据集测试集Ship-GEN,以评估模型的泛化能力。在Ship-GEN上的结果显示,基于WUTDet训练的模型在不同数据分布下表现出更强的泛化能力。这些发现表明,WUTDet为复杂海洋场景中船舶检测算法的研究、评估和泛化分析提供了有效的数据支持。该数据集可在以下网址公开获取:https://github.com/MAPGroup/WUTDet。
cs.CV / 25 / 2604.07763
Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities
超越表面伪影:跨模态捕捉共享潜在伪造知识
Abstract
As generative artificial intelligence evolves, deepfake attacks have escalated from single-modality manipulations to complex, multimodal threats. Existing forensic techniques face a severe generalization bottleneck: by relying excessively on superficial, modality-specific artifacts, they neglect the shared latent forgery knowledge hidden beneath variable physical appearances. Consequently, these models suffer catastrophic performance degradation when confronted with unseen "dark modalities." To break this limitation, this paper introduces a paradigm shift that redefines multimodal forensics from conventional "feature fusion" to "modality generalization." We propose the first modality-agnostic forgery (MAF) detection framework. By explicitly decoupling modality-specific styles, MAF precisely extracts the essential, cross-modal latent forgery knowledge. Furthermore, we define two progressive dimensions to quantify model generalization: transferability toward semantically correlated modalities (Weak MAF), and robustness against completely isolated signals of "dark modality" (Strong MAF). To rigorously assess these generalization limits, we introduce the DeepModal-Bench benchmark, which integrates diverse multimodal forgery detection algorithms and adapts state-of-the-art generalized learning methods. This study not only empirically proves the existence of universal forgery traces but also achieves significant performance breakthroughs on unknown modalities via the MAF framework, offering a pioneering technical pathway for universal multimodal defense.
Chinese Translation
随着生成性人工智能的发展,深度伪造攻击已从单一模态操控升级为复杂的多模态威胁。现有的取证技术面临严重的泛化瓶颈:过于依赖表面的、特定模态的伪影,忽视了隐藏在多样物理外观下的共享潜在伪造知识。因此,当面对未见过的“黑暗模态”时,这些模型的性能会出现灾难性的下降。为了解决这一限制,本文提出了一种范式转变,将多模态取证从传统的“特征融合”重新定义为“模态泛化”。我们提出了首个模态无关伪造(Modality-Agnostic Forgery, MAF)检测框架。通过明确解耦模态特定风格,MAF 精确提取了基本的跨模态潜在伪造知识。此外,我们定义了两个渐进维度来量化模型的泛化能力:对语义相关模态的可迁移性(弱 MAF)和对完全孤立的“黑暗模态”信号的鲁棒性(强 MAF)。为了严格评估这些泛化极限,我们引入了 DeepModal-Bench 基准,它整合了多样的多模态伪造检测算法,并适应了最先进的泛化学习方法。本研究不仅实证证明了普遍伪造痕迹的存在,还通过 MAF 框架在未知模态上取得了显著的性能突破,为普遍多模态防御提供了开创性的技术路径。
cs.CV / 26 / 2604.07765
RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs
RemoteAgent:通过基于强化学习的代理多模态大语言模型桥接模糊的人类意图与地球观测
Abstract
Earth Observation (EO) systems are essentially designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instructions. Depending on the specific application scenario, these vague queries can demand vastly different levels of visual precision. Consequently, a practical EO AI system must bridge the gap between ambiguous human queries and the appropriate multi-granularity visual analysis tasks, ranging from holistic image interpretation to fine-grained pixel-wise predictions. While Multi-modal Large Language Models (MLLMs) demonstrate strong semantic understanding, their text-based output format is inherently ill-suited for dense, precision-critical spatial predictions. Existing agentic frameworks address this limitation by delegating tasks to external tools, but indiscriminate tool invocation is computationally inefficient and underutilizes the MLLM's native capabilities. To this end, we propose RemoteAgent, an agentic framework that strategically respects the intrinsic capability boundaries of MLLMs. To empower this framework to understand real user intents, we construct VagueEO, a human-centric instruction dataset pairing EO tasks with simulated vague natural-language queries. By leveraging VagueEO for reinforcement fine-tuning, we align an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks. Consequently, RemoteAgent processes suitable tasks internally while intelligently orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions. Extensive experiments demonstrate that RemoteAgent achieves robust intent recognition capabilities while delivering highly competitive performance across diverse EO tasks.
Chinese Translation
地球观测(EO)系统本质上旨在支持领域专家,这些专家通常通过模糊的自然语言而非精确的机器友好指令来表达他们的需求。根据具体的应用场景,这些模糊查询可能需要截然不同的视觉精度。因此,一个实用的EO人工智能系统必须弥合模糊人类查询与适当的多粒度视觉分析任务之间的差距,这些任务范围从整体图像解释到细粒度的像素级预测。尽管多模态大语言模型(MLLMs)展示了强大的语义理解能力,但其基于文本的输出格式本质上不适合密集且对精度要求高的空间预测。现有的代理框架通过将任务委托给外部工具来解决这一限制,但不加区分的工具调用在计算上效率低下,并且未充分利用MLLM的原生能力。为此,我们提出了RemoteAgent,一个代理框架,战略性地尊重MLLM的内在能力边界。为了使该框架能够理解真实用户意图,我们构建了VagueEO,一个以人为中心的指令数据集,将EO任务与模拟的模糊自然语言查询配对。通过利用VagueEO进行强化微调,我们将MLLM调整为一个强大的认知核心,能够直接解决图像和稀疏区域级任务。因此,RemoteAgent在内部处理适当的任务,同时通过模型上下文协议智能地协调专用工具,仅用于密集预测。大量实验表明,RemoteAgent在实现强大的意图识别能力的同时,在各种EO任务中提供了高度竞争的性能。
cs.CV / 27 / 2604.07772
ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions
ESOM:高效理解开放世界动态定义下的流媒体视频异常
Abstract
Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at https://github.com/Kamino666/ESOM_OpenDef-Bench.
Chinese Translation
开放世界视频异常检测(OWVAD)旨在在不同的异常定义下检测和解释异常事件,这对于智能监控和直播内容审核等应用至关重要。近期基于MLLM的方法展示了良好的开放世界泛化能力,但仍然面临三大主要限制:在实际部署中的低效率、缺乏流媒体处理适应性,以及在建模和评估中对动态异常定义的支持有限。为了解决这些问题,本文提出了ESOM,一种高效的流媒体OWVAD模型,采用无训练方式运行。ESOM包括一个定义规范化模块,用于结构化用户提示以减少幻觉,一个帧间匹配的帧内令牌合并模块,用于压缩冗余视觉令牌,一个混合流媒体记忆模块,用于高效因果推断,以及一个概率评分模块,将区间级文本输出转换为帧级异常评分。此外,本文还引入了OpenDef-Bench,这是一个新的基准,包含干净的监控视频和多样的自然异常定义,用于在不同条件下评估性能。大量实验表明,ESOM在单个GPU上实现了实时效率,并在异常时间定位、分类和描述生成方面达到了最先进的性能。代码和基准将发布在 https://github.com/Kamino666/ESOM_OpenDef-Bench。
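The abstract does not say how the Probabilistic Scoring module maps interval-level outputs to frames. As a minimal sketch only, assuming each textual output has already been parsed into a hypothetical (start_frame, end_frame, probability) triple, one simple way to obtain frame-level scores is to give every frame the highest probability among the intervals that cover it. The function name and the max rule are illustrative assumptions, not ESOM's actual module.

```python
def frame_scores(num_frames, intervals):
    """Spread interval-level anomaly probabilities onto individual frames.

    intervals: list of (start_frame, end_frame_inclusive, probability).
    Assumption for illustration: a frame takes the maximum probability
    of all intervals covering it; uncovered frames score 0.0.
    """
    scores = [0.0] * num_frames
    for start, end, prob in intervals:
        lo = max(start, 0)                # clamp to the valid frame range
        hi = min(end, num_frames - 1)
        for t in range(lo, hi + 1):
            scores[t] = max(scores[t], prob)
    return scores


if __name__ == "__main__":
    # Two overlapping intervals on a 6-frame clip.
    print(frame_scores(6, [(1, 3, 0.8), (3, 4, 0.5)]))
```

Taking the maximum keeps a confident detection from being diluted by a weaker overlapping one; averaging would be an equally plausible choice.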
cs.CV / 28 / 2604.07779
Plug-and-Play Logit Fusion for Heterogeneous Pathology Foundation Models
异构病理基础模型的即插即用逻辑融合
Abstract
Pathology foundation models (FMs) have become central to computational histopathology, offering strong transfer performance across a wide range of diagnostic and prognostic tasks. The rapid proliferation of pathology foundation models creates a model-selection bottleneck: no single model is uniformly best, yet exhaustively adapting and validating many candidates for each downstream endpoint is prohibitively expensive. We address this challenge with a lightweight and novel model fusion strategy, LogitProd, which treats independently trained FM-based predictors as fixed experts and learns sample-adaptive fusion weights over their slide-level outputs. The fusion operates purely on logits, requiring no encoder retraining and no feature-space alignment across heterogeneous backbones. We further provide a theoretical analysis showing that the optimal weighted product fusion is guaranteed to perform at least as well as the best individual expert under the training objective. We systematically evaluate LogitProd on 22 benchmarks spanning WSI-level classification, tile-level classification, gene mutation prediction, and discrete-time survival modeling. LogitProd ranks first on 20/22 tasks and improves the average performance across all tasks by ~3% over the strongest single expert. LogitProd enables practitioners to upgrade heterogeneous FM-based pipelines in a plug-and-play manner, achieving multi-expert gains with ~12× lower training cost than feature-fusion alternatives.
Chinese Translation
病理基础模型(FMs)已成为计算组织病理学的核心,能够在广泛的诊断和预后任务中提供强大的迁移性能。病理基础模型的快速扩展导致了模型选择的瓶颈:没有单一模型在所有方面都表现最佳,但对每个下游任务全面适应和验证多个候选模型的成本是不可承受的。我们通过一种轻量级且新颖的模型融合策略LogitProd来解决这一挑战,该策略将独立训练的基于FM的预测器视为固定专家,并在其切片级输出上学习样本自适应的融合权重。该融合过程仅在logits上进行,无需重新训练编码器,也无需在异构基础模型之间进行特征空间对齐。我们进一步提供了理论分析,表明在训练目标下,最佳加权乘积融合的性能至少与最佳单一专家相当。我们在22个基准测试上系统评估了LogitProd,这些基准涵盖了WSI级分类、切片级分类、基因突变预测和离散时间生存建模。LogitProd在20/22个任务中排名第一,并且在所有任务上的平均性能比最强的单一专家提高了约3%。LogitProd使从业者能够以即插即用的方式升级异构FM基础管道,以约12倍于特征融合替代方案的更低训练成本实现多专家收益。
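To make the "weighted product fusion" idea concrete, here is a minimal sketch of a generic weighted product of experts computed in log space. It assumes each expert's logits are first turned into log-probabilities and that the weights are given; in the abstract the weights are learned per sample, which is not modelled here. The function names are hypothetical and this is not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def product_fusion(expert_logits, weights):
    """Fuse experts as p(y) proportional to prod_k p_k(y) ** w_k.

    expert_logits: one list of class logits per expert.
    weights: one non-negative weight per expert (assumed given here).
    """
    log_probs = [log_softmax(logits) for logits in expert_logits]
    n_classes = len(expert_logits[0])
    # A weighted product of probabilities is a weighted sum of log-probs.
    fused = [
        sum(w * lp[c] for w, lp in zip(weights, log_probs))
        for c in range(n_classes)
    ]
    # Renormalise so the fused scores form a distribution.
    return [math.exp(v) for v in log_softmax(fused)]


if __name__ == "__main__":
    experts = [[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]]
    print(product_fusion(experts, [0.7, 0.3]))
    print(product_fusion(experts, [1.0, 0.0]))  # equals expert 0's softmax
```

With one-hot weights the fusion reproduces a single expert exactly, so every individual expert lies inside the fusion family. That is the intuition behind the claim that the optimal fusion cannot do worse than the best expert on the training objective.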
cs.CV / 29 / 2604.07786
Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
跨模态情感转移用于谈话人脸视频的情感编辑
Abstract
Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals - and even benefit from expressive text-to-speech (TTS) synthesis - but they fail to express the target emotions because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos - even for unseen extended emotions. Code, checkpoint, and demo are available at https://chanhyeok-choi.github.io/C-MET/
Chinese Translation
谈话人脸生成作为生成模型的核心应用,受到了广泛关注。为了增强合成视频的表现力和真实感,谈话人脸视频中的情感编辑起着至关重要的作用。然而,现有的方法往往限制了表现的灵活性,并且在生成扩展情感方面存在困难。基于标签的方法将情感表示为离散类别,无法捕捉广泛的情感。基于音频的方法可以利用情感丰富的语音信号,甚至可以从富有表现力的文本到语音(TTS)合成中受益,但由于情感和语言内容在情感演讲中交织在一起,它们无法有效表达目标情感。另一方面,基于图像的方法依赖于目标参考图像来指导情感转移,但它们需要高质量的正面视图,并在获取扩展情感(例如讽刺)的参考数据时面临挑战。为了解决这些限制,我们提出了跨模态情感转移(Cross-Modal Emotion Transfer, C-MET),这是一种新颖的方法,通过建模语音和视觉特征空间之间的情感语义向量,生成基于语音的面部表情。C-MET利用大规模预训练的音频编码器和解耦的面部表情编码器,学习表示跨模态两种不同情感嵌入之间差异的情感语义向量。在MEAD和CREMA-D数据集上的大量实验表明,我们的方法在情感准确性上比最先进的方法提高了14%,同时生成表现力丰富的谈话人脸视频——即使对于未见过的扩展情感。代码、检查点和演示可在 https://chanhyeok-choi.github.io/C-MET/ 获取。
cs.CV / 30 / 2604.07795
Image-Guided Geometric Stylization of 3D Meshes
基于图像引导的三维网格几何风格化
Abstract
Recent generative models can create visually plausible 3D representations of objects. However, the generation process often allows only implicit control signals, such as contextual descriptions, and rarely supports bold geometric distortions beyond existing data distributions. We propose a geometric stylization framework that deforms a 3D mesh so that it expresses the style of an image. While style is inherently ambiguous, we utilize pre-trained diffusion models to extract an abstract representation of the provided image. Our coarse-to-fine stylization pipeline can drastically deform the input 3D model to express a diverse range of geometric variations while retaining the valid topology of the original mesh and its part-level semantics. We also propose an approximate VAE encoder that provides efficient and reliable gradients from mesh renderings. Extensive experiments demonstrate that our method can create stylized 3D meshes that reflect unique geometric features of the pictured assets, such as expressive poses and silhouettes, thereby supporting the creation of distinctive artistic 3D works. Project page: https://changwoonchoi.github.io/GeoStyle
Chinese Translation
近年来的生成模型能够创建视觉上合理的三维物体表示。然而,生成过程通常仅允许隐式控制信号,如上下文描述,且很少支持超出现有数据分布的大胆几何变形。我们提出了一种几何风格化框架,通过变形三维网格,使其能够表达图像的风格。尽管风格本质上具有模糊性,我们利用预训练的扩散模型(diffusion models)提取所提供图像的抽象表示。我们的粗到细风格化流水线能够大幅变形输入的三维模型,以表达多样的几何变化,同时保持原始网格的有效拓扑结构和部件级语义。我们还提出了一种近似变分自编码器(VAE)编码器,从网格渲染中提供高效且可靠的梯度。大量实验表明,我们的方法能够创建反映图像资产独特几何特征(如富有表现力的姿态和轮廓)的风格化三维网格,从而支持独特艺术三维作品的创作。项目主页:https://changwoonchoi.github.io/GeoStyle
cs.CV / 31 / 2604.07802
Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models
潜在异常知识挖掘:揭示视觉-语言模型中的稀疏敏感神经元
Abstract
Large-scale vision-language models (VLMs) exhibit remarkable zero-shot capabilities, yet the internal mechanisms driving their anomaly detection (AD) performance remain poorly understood. Current methods predominantly treat VLMs as black-box feature extractors, assuming that anomaly-specific knowledge must be acquired through external adapters or memory banks. In this paper, we challenge this assumption by arguing that anomaly knowledge is intrinsically embedded within pre-trained models but remains latent and under-activated. We hypothesize that this knowledge is concentrated within a sparse subset of anomaly-sensitive neurons. To validate this, we propose latent anomaly knowledge excavation (LAKE), a training-free framework that identifies and elicits these critical neuronal signals using only a minimal set of normal samples. By isolating these sensitive neurons, LAKE constructs a highly compact normality representation that integrates visual structural deviations with cross-modal semantic activations. Extensive experiments on industrial AD benchmarks demonstrate that LAKE achieves state-of-the-art performance while providing intrinsic, neuron-level interpretability. Ultimately, our work advocates for a paradigm shift: redefining anomaly detection as the targeted activation of latent pre-trained knowledge rather than the acquisition of a downstream task.
Chinese Translation
大规模视觉-语言模型(VLMs)展现出显著的零-shot 能力,但驱动其异常检测(AD)性能的内部机制仍然不甚清楚。当前的方法主要将 VLMs 视为黑箱特征提取器,假设异常特定知识必须通过外部适配器或记忆库获得。本文挑战这一假设,认为异常知识本质上嵌入在预训练模型中,但处于潜在状态且未被充分激活。我们假设这些知识集中在一小部分异常敏感神经元中。为验证这一假设,我们提出了潜在异常知识挖掘(LAKE),这是一个无训练的框架,仅使用一小部分正常样本来识别和激发这些关键神经信号。通过隔离这些敏感神经元,LAKE 构建了一个高度紧凑的正常性表示,将视觉结构偏差与跨模态语义激活相结合。在工业异常检测基准上的大量实验表明,LAKE 实现了最先进的性能,同时提供了内在的神经元级可解释性。最终,我们的工作倡导一种范式转变:将异常检测重新定义为潜在预训练知识的有针对性激活,而不是下游任务的获取。
cs.CV / 32 / 2604.07812
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
HAWK:多模态模型中的头重要性感知视觉标记剪枝
Abstract
In multimodal large language models (MLLMs), the surge of visual tokens significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. Existing research usually assumes that all attention heads contribute equally to the visual interpretation. However, our study reveals that different heads may capture distinct visual semantics and inherently play distinct roles in visual processing. In light of this observation, we propose HAWK, a head importance-aware visual token pruning method that perceives the varying importance of attention heads in visual tasks to maximize the retention of crucial tokens. By leveraging head importance weights and text-guided attention to assess visual token significance, HAWK effectively retains task-relevant visual tokens while removing redundant ones. The proposed HAWK is entirely training-free and can be seamlessly applied to various MLLMs. Extensive experiments on multiple mainstream vision-language benchmarks demonstrate that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens. Additionally, it reduces end-to-end latency to 74.4% of the original and further decreases GPU memory usage across the tested models. The code is available at https://github.com/peppery77/HAWK.git.
Chinese Translation
在多模态大型语言模型(MLLMs)中,视觉标记的激增显著增加了推理时间和计算开销,使其在实时或资源受限的应用中变得不切实际。视觉标记剪枝是一种有前景的策略,通过移除冗余的视觉标记来降低MLLM推理的成本。现有研究通常假设所有注意力头对视觉解释的贡献是相同的。然而,我们的研究揭示了不同的头可能捕捉到不同的视觉语义,并在视觉处理过程中本质上发挥着不同的作用。基于这一观察,我们提出了HAWK,一种头重要性感知的视觉标记剪枝方法,旨在感知注意力头在视觉任务中的不同重要性,以最大化关键标记的保留。通过利用头重要性权重和文本引导的注意力来评估视觉标记的重要性,HAWK有效地保留与任务相关的视觉标记,同时移除冗余标记。所提出的HAWK完全不需要训练,并且可以无缝应用于各种MLLMs。在多个主流视觉-语言基准上的广泛实验表明,HAWK达到了最先进的准确性。当应用于Qwen2.5-VL时,HAWK在剪枝80.2%的视觉标记后保留了96.0%的原始准确性。此外,它将端到端延迟降低至原始的74.4%,并进一步减少了测试模型的GPU内存使用。代码可在https://github.com/peppery77/HAWK.git获取。
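The scoring idea described above can be sketched as follows. This is an illustrative toy under stated assumptions, not HAWK's released code: it assumes a per-head text-to-visual attention matrix (already averaged over text tokens) and a per-head importance weight are available, and it scores each visual token by the importance-weighted sum of attention it receives. The names `prune_visual_tokens`, `text_to_visual_attn`, and `head_weights` are hypothetical.

```python
def prune_visual_tokens(text_to_visual_attn, head_weights, keep_ratio):
    """Return indices of visual tokens to keep, in their original order.

    text_to_visual_attn[h][v]: attention from text tokens to visual token v
        in head h (assumed pre-averaged over text tokens).
    head_weights[h]: importance of head h for visual processing.
    keep_ratio: fraction of visual tokens to retain.
    """
    num_tokens = len(text_to_visual_attn[0])
    # Importance-weighted attention score for each visual token.
    scores = [
        sum(w * head[v] for w, head in zip(head_weights, text_to_visual_attn))
        for v in range(num_tokens)
    ]
    k = max(1, round(keep_ratio * num_tokens))
    top = sorted(range(num_tokens), key=lambda v: scores[v], reverse=True)[:k]
    return sorted(top)  # preserve the original token order


if __name__ == "__main__":
    attn = [
        [0.1, 0.6, 0.2, 0.1],  # head 0 focuses on tokens 1 and 2
        [0.4, 0.1, 0.1, 0.4],  # head 1 focuses on tokens 0 and 3
    ]
    print(prune_visual_tokens(attn, [0.9, 0.1], 0.5))
    print(prune_visual_tokens(attn, [0.1, 0.9], 0.5))
```

The two calls keep different tokens from the same attention maps, which is the point of weighting heads unequally: treating all heads the same would blur the two patterns together.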
cs.CV / 33 / 2604.07814
AgriChain: Visually Grounded Expert-Verified Reasoning for Interpretable Agricultural Vision-Language Models
农业链:视觉基础的专家验证推理用于可解释的农业视觉语言模型
Abstract
Accurate and interpretable plant disease diagnosis remains a major challenge for vision-language models (VLMs) in real-world agriculture. We introduce AgriChain, a dataset of approximately 11,000 expert-curated leaf images spanning diverse crops and pathologies, each paired with (i) a disease label, (ii) a calibrated confidence score (High/Medium/Low), and (iii) an expert-verified chain-of-thought (CoT) rationale. Draft explanations were first generated by GPT-4o and then verified by a professional agricultural engineer using standardized descriptors (e.g., lesion color, margin, and distribution). We fine-tune Qwen2.5-VL-3B on AgriChain, resulting in a specialized model termed AgriChain-VL3B, to jointly predict diseases and generate visually grounded reasoning. On a 1,000-image test set, our CoT-supervised model achieves 73.1% top-1 accuracy (macro F1 = 0.466; weighted F1 = 0.655), outperforming strong baselines including Gemini 1.5 Flash, Gemini 2.5 Pro, and GPT-4o Mini. The generated explanations align closely with expert reasoning, consistently referencing key visual cues. These findings demonstrate that expert-verified reasoning supervision significantly enhances both accuracy and interpretability, bridging the gap between generic multimodal models and human expertise, and advancing trustworthy, globally deployable AI for sustainable agriculture. The dataset and code are publicly available at: https://github.com/hazzanabeel12-netizen/agrichain
Chinese Translation
准确且可解释的植物病害诊断仍然是现实农业中视觉语言模型(VLMs)面临的重大挑战。我们引入了AgriChain,一个包含约11,000张专家策划的叶片图像的数据集,涵盖了多种作物和病理,每张图像都配有(i)病害标签,(ii)经过校准的置信度评分(高/中/低),以及(iii)专家验证的推理链(CoT)。初步解释由GPT-4o生成,随后由专业农业工程师使用标准化描述符(例如,病变颜色、边缘和分布)进行验证。我们在AgriChain上微调了Qwen2.5-VL-3B,得到了一个名为AgriChain-VL3B的专用模型,以共同预测病害并生成视觉基础的推理。在1,000张图像的测试集中,我们的CoT监督模型达到了73.1%的Top-1准确率(宏观F1 = 0.466;加权F1 = 0.655),超越了包括Gemini 1.5 Flash、Gemini 2.5 Pro和GPT-4o Mini在内的强基线。生成的解释与专家推理高度一致,始终参考关键视觉线索。这些发现表明,专家验证的推理监督显著提高了准确性和可解释性,弥合了通用多模态模型与人类专业知识之间的差距,推动了可持续农业的可信赖、全球可部署的人工智能的发展。数据集和代码可在以下网址公开获取:https://github.com/hazzanabeel12-netizen/agrichain
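The gap between macro F1 (0.466) and weighted F1 (0.655) reported above is what one would expect when classes are imbalanced: macro F1 averages per-class F1 equally, while weighted F1 averages by class frequency. The sketch below computes both on a made-up toy example; the class names and labels are hypothetical and unrelated to the AgriChain data.

```python
from collections import Counter


def f1_summary(y_true, y_pred):
    """Return (macro_f1, weighted_f1) for single-label classification."""
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true samples per class
    per_class = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return macro, weighted


if __name__ == "__main__":
    # Toy data: a frequent class predicted well, a rare class predicted poorly.
    y_true = ["rust"] * 8 + ["blight"] * 2
    y_pred = ["rust"] * 8 + ["rust", "blight"]
    macro, weighted = f1_summary(y_true, y_pred)
    print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

In this toy case the rare class drags macro F1 below weighted F1, the same direction as the reported numbers, which suggests that performance on less frequent diseases is weaker than the headline accuracy implies.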
cs.CV / 34 / 2604.07823
LPM 1.0: Video-based Character Performance Model
LPM 1.0:基于视频的角色表现模型
Zeng, Ailing, Yang, Casper, Ge, Chauncey, Zhang, Eddie, Xu, Garvey, Lin, Gavin, Gu, Gilbert, Pi, Jeremy, Li, Leo, Shi, Mingyi, Bi, Sheng, Tang, Steven, Hang, Thorn, Guo, Tobey, Li, Vincent, Tong, Xin, Li, Yikang, Sun, Yuchen, Yue, Zhao, Lu, Yuhan, Li, Yuwei, Zhang, Zane, Yang, Zeshi, Ye, Zi
Abstract
Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.
Chinese Translation
表现是通过视觉、声音和时间行为外化意图、情感和个性的方式,这使得角色显得生动。通过视频学习这种表现是传统3D管道的一个有前景的替代方案。然而,现有的视频模型在实现高表现力、实时推理和长时间身份稳定性方面存在困难,这种矛盾我们称之为表现三难问题。对话是最全面的表现场景,因为角色在保持身份的同时同时进行说话、倾听、反应和情感表达。为了解决这个问题,我们提出了LPM 1.0(大型表现模型),专注于单人全双工音视频对话表现。具体而言,我们通过严格筛选、说话-倾听音视频配对、表现理解和身份感知多参考提取构建了一个多模态以人为中心的数据集;训练了一个具有170亿参数的扩散变换器(基础LPM),通过多模态条件实现高度可控、身份一致的表现;并将其提炼为一个因果流生成器(在线LPM),以实现低延迟、无限长度的交互。在推理时,给定一个带有身份感知参考的角色图像,LPM 1.0可以从用户音频生成倾听视频,从合成音频生成说话视频,并通过文本提示进行动作控制,所有这些都以实时速度生成身份稳定、无限长度的内容。因此,LPM 1.0作为对话代理、直播角色和游戏NPC的视觉引擎。为了系统地评估这一设置,我们提出了LPM-Bench,这是第一个交互式角色表现的基准。LPM 1.0在所有评估维度上都达到了最先进的结果,同时保持实时推理。
cs.CV / 35 / 2604.07879
FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding
FlowGuard:通过线性潜在解码实现轻量级生成安全检测的探索
Abstract
Diffusion-based image generation models have advanced rapidly but pose a safety risk due to their potential to generate Not-Safe-For-Work (NSFW) content. Existing NSFW detection methods mainly operate either before or after image generation. Pre-generation methods rely on text prompts and struggle with the gap between prompt safety and image safety. Post-generation methods apply classifiers to final outputs, but they are poorly suited to intermediate noisy images. To address this, we introduce FlowGuard, a cross-model in-generation detection framework that inspects intermediate denoising steps. This is particularly challenging in latent diffusion, where early-stage noise obscures visual signals. FlowGuard employs a novel linear approximation for latent decoding and leverages a curriculum learning approach to stabilize training. By detecting unsafe content early, FlowGuard reduces unnecessary diffusion steps to cut computational costs. Our cross-model benchmark spanning nine diffusion-based backbones shows the effectiveness of FlowGuard for in-generation NSFW detection in both in-distribution and out-of-distribution settings: it outperforms existing methods by over 30% in F1 score and, relative to standard VAE decoding, cuts peak GPU memory demand by over 97% and reduces projection time from 8.1 seconds to 0.2 seconds.
Chinese Translation
基于扩散的图像生成模型发展迅速,但由于其潜在生成不适合工作的内容(Not-Safe-For-Work, NSFW)的风险,存在安全隐患。现有的NSFW检测方法主要在图像生成的前后进行。生成前的方法依赖于文本提示,但在提示安全性与图像安全性之间存在差距。生成后的方法则对最终输出应用分类器,但不适合处理中间的噪声图像。为了解决这一问题,我们提出了FlowGuard,一个跨模型的生成内检测框架,能够检查中间去噪步骤。这在潜在扩散中尤其具有挑战性,因为早期的噪声会遮蔽视觉信号。FlowGuard采用了一种新颖的线性近似方法进行潜在解码,并利用课程学习方法来稳定训练。通过早期检测不安全内容,FlowGuard减少了不必要的扩散步骤,从而降低了计算成本。我们的跨模型基准测试涵盖了九个基于扩散的骨干网络,显示了FlowGuard在生成内NSFW检测中的有效性,无论是在分布内还是分布外的设置中,F1分数比现有方法提高了超过30%,同时实现了显著的效率提升,包括将峰值GPU内存需求降低超过97%,以及将标准变分自编码器(VAE)解码的投影时间从8.1秒缩短至0.2秒。
cs.CV / 36 / 2604.07882
ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video
ReconPhys:从单一视频重建外观和物理属性
Abstract
Reconstructing non-rigid objects with physical plausibility remains a significant challenge. Existing approaches leverage differentiable rendering for per-scene optimization, recovering geometry and dynamics but requiring expensive tuning or manual annotation, which limits practicality and generalizability. To address this, we propose ReconPhys, the first feedforward framework that jointly learns physical attribute estimation and 3D Gaussian Splatting reconstruction from a single monocular video. Our method employs a dual-branch architecture trained via a self-supervised strategy, eliminating the need for ground-truth physics labels. Given a video sequence, ReconPhys simultaneously infers geometry, appearance, and physical attributes. Experiments on a large-scale synthetic dataset demonstrate superior performance: our method achieves 21.64 PSNR in future prediction compared to 13.27 by state-of-the-art optimization baselines, while reducing Chamfer Distance from 0.349 to 0.004. Crucially, ReconPhys enables fast inference (<1 second) versus hours required by existing methods, facilitating rapid generation of simulation-ready assets for robotics and graphics.
Chinese Translation
重建具有物理合理性的非刚性物体仍然是一项重大挑战。现有方法利用可微渲染进行逐场景优化,恢复几何形状和动态特性,但需要昂贵的调优或手动标注,这限制了其实用性和通用性。为了解决这个问题,我们提出了ReconPhys,这是第一个前馈框架,能够从单一单目视频中联合学习物理属性估计和3D高斯点云重建。我们的方法采用双分支架构,通过自监督策略进行训练,消除了对真实物理标签的需求。给定一个视频序列,ReconPhys能够同时推断几何形状、外观和物理属性。在大规模合成数据集上的实验表明,我们的方法表现优越:在未来预测中,我们的方法实现了21.64的PSNR,而最先进的优化基线仅为13.27,同时将Chamfer距离从0.349降低到0.004。重要的是,ReconPhys实现了快速推理(<1秒),而现有方法需要数小时,从而促进了机器人和图形学中模拟准备资产的快速生成。
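For readers unfamiliar with the Chamfer Distance figures quoted above, the sketch below shows one common form of the metric: the symmetric sum of mean squared nearest-neighbour distances between two point sets. Conventions differ (squared or unsquared distances, sum or average of the two directions), and the abstract does not say which one is used, so this is an assumption made for illustration.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets.

    a, b: lists of equal-dimension coordinate tuples.
    Assumed convention: mean squared nearest-neighbour distance,
    summed over both directions.
    """
    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    def one_way(src, dst):
        # For each source point, distance to its closest destination point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


if __name__ == "__main__":
    pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
    print(chamfer_distance(pred, gt))
    print(chamfer_distance(gt, gt))
```

This brute-force version is quadratic in the number of points; practical pipelines typically use a spatial index or a batched GPU implementation instead.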
cs.CV / 37 / 2604.07884
Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition
基于强化指导的隐私敏感身份识别合成数据生成
Abstract
High-fidelity generative models are increasingly needed in privacy-sensitive scenarios, where access to data is severely restricted due to regulatory and copyright constraints. This scarcity hampers model development--ironically, in settings where generative models are most needed to compensate for the lack of data. This creates a self-reinforcing challenge: limited data leads to poor generative models, which in turn fail to mitigate data scarcity. To break this cycle, we propose a reinforcement-guided synthetic data generation framework that adapts general-domain generative priors to privacy-sensitive identity recognition tasks. We first perform a cold-start adaptation to align a pretrained generator with the target domain, establishing semantic relevance and initial fidelity. Building on this foundation, we introduce a multi-objective reward that jointly optimizes semantic consistency, coverage diversity, and expression richness, guiding the generator to produce both realistic and task-effective samples. During downstream training, a dynamic sample selection mechanism further prioritizes high-utility synthetic samples, enabling adaptive data scaling and improved domain alignment. Extensive experiments on benchmark datasets demonstrate that our framework significantly improves both generation fidelity and classification accuracy, while also exhibiting strong generalization to novel categories in small-data regimes.
Chinese Translation
在隐私敏感场景中,高保真生成模型的需求日益增加,因为由于监管和版权限制,数据访问受到严重限制。这种稀缺性阻碍了模型的发展——讽刺的是,在最需要生成模型来弥补数据不足的环境中,情况尤为严重。这形成了一个自我强化的挑战:有限的数据导致生成模型性能不佳,而这些模型又无法有效缓解数据稀缺。为了解决这一循环,我们提出了一种基于强化指导的合成数据生成框架,该框架将通用领域生成先验适应于隐私敏感的身份识别任务。我们首先进行冷启动适应,以使预训练生成器与目标领域对齐,建立语义相关性和初始保真度。在此基础上,我们引入了一种多目标奖励机制,联合优化语义一致性、覆盖多样性和表达丰富性,引导生成器生成既真实又有效的任务样本。在下游训练过程中,动态样本选择机制进一步优先考虑高效用的合成样本,实现自适应数据扩展和改善领域对齐。在基准数据集上的大量实验表明,我们的框架显著提高了生成保真度和分类准确性,同时在小数据环境下对新类别也表现出强大的泛化能力。
cs.CV / 38 / 2604.07890
Sampling-Aware 3D Spatial Analysis in Multiplexed Imaging
考虑采样的多重成像中的三维空间分析
Abstract
Highly multiplexed microscopy enables rich spatial characterization of tissues at single-cell resolution, yet most analyses rely on two-dimensional sections despite inherently three-dimensional tissue organization. Acquiring dense volumetric data in spatial proteomics remains costly and technically challenging, leaving practitioners to choose between 2D sections or 3D serial sections under limited imaging budgets. In this work, we study how sampling geometry impacts the stability of commonly used spatial statistics, and we introduce a geometry-aware reconstruction module that enables sparse yet consistent 3D analysis from serial sections. Using controlled simulations, we show that planar sampling reliably recovers global cell-type abundance but exhibits high variance for local statistics such as cell clustering and cell-cell interactions, particularly for rare or spatially localized populations. We observe consistent behavior in real multiplexed datasets, where interaction metrics and neighborhood relationships fluctuate substantially across individual sections. To support sparse 3D analysis in practice, we present a reconstruction approach that links cell projections across adjacent sections using phenotype and proximity constraints and recovers single-cell 3D centroids using cell-type-specific shape priors. We further analyze the trade-off between section spacing, coverage, and redundancy, identifying acquisition regimes that maximize reconstruction utility under fixed imaging budgets. We validate the reconstruction module on a public imaging mass cytometry dataset with dense axial sampling and demonstrate its downstream utility on an in-house CODEX dataset by enabling structure-level 3D analyses that are unreliable in 2D. Together, our results provide diagnostic tools and practical guidance for deciding when 2D sampling suffices and when sparse 3D reconstruction is warranted.
Chinese Translation
高度多重化的显微镜技术能够以单细胞分辨率对组织进行丰富的空间表征,但大多数分析仍依赖于二维切片,尽管组织本质上是三维的。在空间蛋白质组学中获取密集的体积数据仍然成本高昂且技术上具有挑战性,这使得研究人员在有限的成像预算下不得不在二维切片和三维串行切片之间进行选择。在本研究中,我们探讨了采样几何形状如何影响常用空间统计的稳定性,并引入了一种考虑几何形状的重建模块,该模块能够从串行切片中实现稀疏但一致的三维分析。通过控制模拟,我们展示了平面采样能够可靠地恢复全局细胞类型丰度,但在局部统计(如细胞聚类和细胞间相互作用)方面表现出较高的方差,尤其是对于稀有或空间局部化的群体。我们在真实的多重数据集中观察到一致的行为,其中相互作用指标和邻域关系在各个切片之间波动显著。为了在实践中支持稀疏的三维分析,我们提出了一种重建方法,通过表型和邻近约束链接相邻切片中的细胞投影,并使用细胞类型特异的形状先验恢复单细胞三维质心。我们进一步分析了切片间距、覆盖率和冗余之间的权衡,识别出在固定成像预算下最大化重建效用的采集方案。我们在一个具有密集轴向采样的公共成像质谱数据集上验证了重建模块,并通过在内部的CODEX数据集上启用在二维中不可靠的结构级三维分析,展示了其下游效用。总之,我们的结果提供了诊断工具和实践指导,以帮助决定何时二维采样足够,何时需要稀疏的三维重建。
cs.CV / 39 / 2604.07900
AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning
AnomalyAgent:通过工具增强的强化学习实现的主动工业异常合成
Abstract
Industrial anomaly generation is a crucial method for alleviating the data scarcity problem in anomaly detection tasks. Most existing anomaly synthesis methods rely on single-step generation mechanisms and lack complex reasoning and iterative optimization capabilities, which makes it difficult to generate anomaly samples with high semantic realism. We propose AnomalyAgent, an anomaly synthesis agent with self-reflection, knowledge retrieval, and iterative refinement capabilities, aiming to generate realistic and diverse anomalies. Specifically, AnomalyAgent is equipped with five tools: Prompt Generation (PG), Image Generation (IG), Quality Evaluation (QE), Knowledge Retrieval (KR), and Mask Generation (MG), enabling closed-loop optimization. To improve decision-making and self-reflection, we construct structured trajectories from real anomaly images and design a two-stage training framework: supervised fine-tuning followed by reinforcement learning. This process is driven by a three-part reward mechanism: (1) task rewards to supervise the quality and location rationality of generated anomalies; (2) reflection rewards to train the model's ability to improve anomaly synthesis prompts; (3) behavioral rewards to ensure adherence to the trajectory. On the MVTec-AD dataset, AnomalyAgent achieves IS/IC-L of 2.10/0.33 for anomaly generation, 57.0% classification accuracy using ResNet34, and 99.3%/74.2% AP at the image/pixel level using a simple UNet, surpassing all zero-shot SOTA methods. The code and data will be made publicly available.
Chinese Translation
工业异常生成是缓解异常检测任务中数据稀缺问题的重要方法。现有的大多数异常合成方法依赖于单步生成机制,缺乏复杂推理和迭代优化能力,导致难以生成具有高语义真实感的异常样本。我们提出了AnomalyAgent,这是一种具有自我反思、知识检索和迭代精炼能力的异常合成代理,旨在生成真实且多样的异常。具体而言,AnomalyAgent配备了五种工具:提示生成(Prompt Generation, PG)、图像生成(Image Generation, IG)、质量评估(Quality Evaluation, QE)、知识检索(Knowledge Retrieval, KR)和掩码生成(Mask Generation, MG),实现闭环优化。为了改善决策和自我反思,我们从真实异常图像构建结构化轨迹,并设计了一个两阶段的训练框架:首先进行监督微调,然后进行强化学习。该过程由三部分奖励机制驱动:(1)任务奖励用于监督生成异常的质量和位置合理性;(2)反思奖励用于训练模型改善异常合成提示的能力;(3)行为奖励确保遵循轨迹。在MVTec-AD数据集上,AnomalyAgent在异常生成方面实现了2.10/0.33的IS/IC-L,使用ResNet34的分类准确率为57.0%,并且在图像/像素级别上使用简单的UNet达到了99.3%/74.2%的AP,超越了所有零样本最先进方法。代码和数据将公开发布。
cs.CV / 40 / 2604.07901
PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation
PanoSAM2:针对360视频目标分割的轻量级畸变与内存感知SAM2适配方法
Abstract
360 video object segmentation (360VOS) aims to predict temporally consistent masks in 360 videos, offering full-scene coverage that benefits applications such as VR/AR and embodied AI. Learning a 360VOS model is nontrivial due to the lack of high-quality labeled datasets. Recently, Segment Anything Models (SAMs), especially SAM2 with its memory module design, have shown strong, promptable VOS capability. However, directly using SAM2 for 360VOS yields implausible results, as 360 videos suffer from projection distortion, semantic inconsistency between the left and right sides, and sparse object mask information in SAM2's memory. To this end, we propose PanoSAM2, a novel 360VOS framework built on lightweight distortion- and memory-aware adaptation strategies for SAM2, which achieves reliable 360VOS while retaining SAM2's user-friendly prompting design. Concretely, to tackle the projection distortion and semantic inconsistency issues, we propose a Pano-Aware Decoder with seam-consistent receptive fields and iterative distortion refinement to maintain continuity across the 0/360 degree boundary. Meanwhile, a Distortion-Guided Mask Loss is introduced to weight pixels by distortion magnitude, stressing stretched regions and boundaries. To address the object sparsity issue, we propose a Long-Short Memory Module that maintains a compact long-term object pointer to re-instantiate and align short-term memories, thereby enhancing temporal coherence. Extensive experiments show that PanoSAM2 yields substantial gains over SAM2: +5.6 on 360VOTS and +6.7 on PanoVOS, showing the effectiveness of our method.
Chinese Translation
360视频目标分割(360VOS)旨在预测360视频中时间一致的目标掩码,实现全景覆盖,惠及虚拟现实/增强现实(VR/AR)及具身人工智能等应用。由于缺乏高质量标注数据集,360VOS模型的学习具有较大挑战。近期,Segment Anything Models(SAMs),尤其是具备内存模块设计的SAM2,展现了强大的可提示视频目标分割能力。然而,直接将SAM2应用于360VOS会产生不合理结果,原因在于360视频存在投影畸变、左右两侧语义不一致以及SAM2内存中目标掩码信息稀疏等问题。为此,我们提出PanoSAM2,一种基于轻量级畸变与内存感知适配策略的360VOS新框架,旨在实现可靠的360VOS,同时保留SAM2用户友好的提示设计。具体而言,为解决投影畸变和语义不一致问题,我们设计了具备缝隙一致感受野和迭代畸变细化的Pano-Aware解码器,以保持0/360度边界的连续性。与此同时,引入了基于畸变幅度加权像素的畸变引导掩码损失,强调拉伸区域和边界。针对目标稀疏问题,我们提出长短期内存模块(Long-Short Memory Module),维护紧凑的长期目标指针以重新实例化并对齐短期内存,从而增强时间一致性。大量实验表明,PanoSAM2较SAM2在360VOTS和PanoVOS数据集上分别提升了5.6和6.7的性能,验证了方法的有效性。
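The two ideas named in the PanoSAM2 abstract, continuity across the 0/360 degree seam and a loss that stresses stretched regions, can be illustrated with a minimal NumPy sketch. This is not the paper's Pano-Aware Decoder or its Distortion-Guided Mask Loss: the function names, the `1/cos(latitude)` weighting, and the clipping value are all assumptions chosen for illustration, based only on how equirectangular projection stretches rows toward the poles.

```python
import numpy as np

def wrap_pad_width(feat: np.ndarray, pad: int) -> np.ndarray:
    """Circularly pad an (H, W) equirectangular map along the width axis,
    so columns near 0 degrees see their neighbours near 360 degrees."""
    return np.pad(feat, ((0, 0), (pad, pad)), mode="wrap")

def latitude_weights(height: int, max_weight: float = 4.0) -> np.ndarray:
    """Hypothetical per-row weights: 1/cos(latitude), clipped at max_weight,
    so the rows stretched most by the projection count more in the loss."""
    # Map row centres to latitudes strictly inside (-pi/2, pi/2).
    lat = (np.arange(height) + 0.5) / height * np.pi - np.pi / 2
    return np.minimum(1.0 / np.cos(lat), max_weight)

def weighted_bce(prob: np.ndarray, target: np.ndarray) -> float:
    """Binary cross-entropy over an (H, W) mask, weighted per row
    and normalised by the total weight over all pixels."""
    eps = 1e-7
    p = np.clip(prob, eps, 1 - eps)
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p))
    w = latitude_weights(prob.shape[0])[:, None]
    return float((w * bce).sum() / (w.sum() * prob.shape[1]))
```

Circular padding is the simplest way to give a convolution a receptive field that spans the seam; the row weights only capture latitude-dependent stretch, not the boundary emphasis the abstract also mentions.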
cs.CV / 41 / 2604.07912
ParkSense: Where Should a Delivery Driver Park? Leveraging Idle AV Compute and Vision-Language Models
ParkSense:送货司机应该在哪里停车?利用闲置的自动驾驶计算和视觉-语言模型
Abstract
Finding parking consumes a disproportionate share of food delivery time, yet no system addresses precise parking-spot selection relative to merchant entrances. We propose ParkSense, a framework that repurposes idle compute during low-risk AV states -- queuing at red lights, traffic congestion, parking-lot crawl -- to run a Vision-Language Model (VLM) on pre-cached satellite and street view imagery, identifying entrances and legal parking zones. We formalize the Delivery-Aware Precision Parking (DAPP) problem, show that a quantized 7B VLM completes inference in 4-8 seconds on HW4-class hardware, and estimate annual per-driver income gains of 3,000-8,000 USD in the U.S. Five open research directions are identified at this unexplored intersection of autonomous driving, computer vision, and last-mile logistics.
Chinese Translation
寻找停车位占用了外卖配送时间的不成比例的份额,但目前没有系统能够针对商家入口进行精确的停车位选择。我们提出了ParkSense,一个框架,利用在低风险自动驾驶状态下的闲置计算资源——如红灯排队、交通拥堵和停车场缓慢行驶——运行视觉-语言模型(VLM),通过预缓存的卫星和街景图像识别入口和合法停车区域。我们将外卖意识精准停车(DAPP)问题进行了形式化,展示了量化的7B VLM在HW4级硬件上完成推理所需的时间为4-8秒,并估算在美国每位司机的年收入增加为3,000-8,000美元。在这一尚未探索的自动驾驶、计算机视觉和最后一公里物流交叉点上,我们确定了五个开放的研究方向。
cs.CV / 42 / 2604.07914
Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction
减轻大型视觉-语言模型中的纠缠引导以减少幻觉
Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable success across cross-modal tasks but remain hindered by hallucinations, producing textual outputs inconsistent with visual content. Existing methods mitigate hallucinations but often alter generation behavior, resulting in shorter outputs and shifted token distributions, especially in latent space steering approaches. We identify that this issue stems from entangled steering signals, where suppressing hallucinations inadvertently disrupts the model's intrinsic generation behavior. To address this, we propose MESA, an effective plug-and-play framework that performs controlled and selective latent intervention for hallucination mitigation. Specifically, MESA targets hallucination-relevant responses while preserving the model's original token distribution, enabling effective hallucination reduction without compromising generation behavior. Extensive experiments across diverse generative and discriminative benchmarks demonstrate that MESA consistently reduces hallucinations while better preserving generation behavior, outperforming prior methods across multiple LVLM families.
Chinese Translation
大型视觉-语言模型(LVLMs)在跨模态任务中取得了显著成功,但仍然受到幻觉的困扰,产生与视觉内容不一致的文本输出。现有方法虽然能减轻幻觉,但往往会改变生成行为,导致输出更短且令牌分布发生偏移,特别是在潜在空间引导方法中。我们发现,这一问题源于纠缠的引导信号,抑制幻觉无意中干扰了模型的内在生成行为。为了解决这个问题,我们提出了MESA,这是一种有效的即插即用框架,能够对幻觉进行控制和选择性的潜在干预。具体而言,MESA针对与幻觉相关的响应,同时保持模型的原始令牌分布,从而实现有效的幻觉减少而不妨碍生成行为。在多种生成和判别基准上的广泛实验表明,MESA在减少幻觉的同时更好地保留了生成行为,超越了多个LVLM家族中的先前方法。
cs.CV / 43 / 2604.07916
Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation
Tarot-SAM3:一种无训练的 SAM3 用于任意指称表达分割
Abstract
Referring Expression Segmentation (RES) aims to segment image regions described by natural-language expressions, serving as a bridge between vision and language understanding. Existing RES methods, however, rely heavily on large annotated datasets and are limited to either explicit or implicit expressions, hindering their ability to generalize to any referring expression. Recently, the Segment Anything Model 3 (SAM3) has shown impressive robustness in Promptable Concept Segmentation. Nonetheless, applying it to RES remains challenging: (1) SAM3 struggles with longer or implicit expressions; (2) naive coupling of SAM3 with a multimodal large language model (MLLM) makes the final results overly dependent on the MLLM's reasoning capability, without enabling refinement of SAM3's segmentation outputs. To this end, we present Tarot-SAM3, a novel training-free framework that can accurately segment from any referring expression. Specifically, Tarot-SAM3 consists of two key phases. First, the Expression Reasoning Interpreter (ERI) phase introduces reasoning-assisted prompt options to support structured expression parsing and evaluation-aware rephrasing. This transforms arbitrary queries into robust heterogeneous prompts for generating reliable masks with SAM3. Second, the Mask Self-Refining (MSR) phase selects the best mask across prompt types and performs self-refinement by leveraging rich feature relationships from DINOv3 to compare discriminative regions among ERI outputs. It then infers region affiliation to the target, thereby correcting over- and under-segmentation. Extensive experiments demonstrate that Tarot-SAM3 achieves strong performance on both explicit and implicit RES benchmarks, as well as open-world scenarios. Ablation studies further validate the effectiveness of each phase.
Chinese Translation
指称表达分割(RES)旨在分割由自然语言表达描述的图像区域,作为视觉与语言理解之间的桥梁。然而,现有的 RES 方法严重依赖于大型标注数据集,并且仅限于显式或隐式表达,这限制了它们对任意指称表达的泛化能力。最近,Segment Anything Model 3(SAM3)在可提示概念分割方面展现了令人印象深刻的鲁棒性。然而,将其应用于 RES 仍然面临挑战:(1)SAM3 在处理较长或隐式表达时表现不佳;(2)将 SAM3 与多模态大型语言模型(MLLM)简单耦合使得最终结果过于依赖于 MLLM 的推理能力,而未能对 SAM3 的分割输出进行优化。为此,我们提出了 Tarot-SAM3,这是一种新颖的无训练框架,可以从任意指称表达中准确分割。具体而言,Tarot-SAM3 包含两个关键阶段。首先,表达推理解释器(ERI)阶段引入了基于推理的提示选项,以支持结构化表达解析和评估感知的改述。这将任意查询转化为强健的异构提示,以生成可靠的 SAM3 掩膜。其次,掩膜自我精炼(MSR)阶段在提示类型中选择最佳掩膜,并通过利用 DINOv3 的丰富特征关系进行自我精炼,以比较 ERI 输出中的判别区域。然后,它推断区域与目标的归属,从而纠正过度和不足分割。大量实验表明,Tarot-SAM3 在显式和隐式 RES 基准测试以及开放世界场景中均表现出色。消融研究进一步验证了每个阶段的有效性。
cs.CV / 44 / 2604.07923
Stitch4D: Sparse Multi-Location 4D Urban Reconstruction via Spatio-Temporal Interpolation
Stitch4D:通过时空插值进行稀疏多位置的4D城市重建
Abstract
Dynamic urban environments are often captured by cameras placed at spatially separated locations with little or no view overlap. However, most existing 4D reconstruction methods assume densely overlapping views. When applied to such sparse observations, these methods fail to reconstruct intermediate regions and often introduce temporal artifacts. To address this practical yet underexplored sparse multi-location setting, we propose Stitch4D, a unified 4D reconstruction framework that explicitly compensates for missing spatial coverage in sparse observations. Stitch4D (i) synthesizes intermediate bridge views to densify spatial constraints and improve spatial coverage, and (ii) jointly optimizes real and synthesized observations within a unified coordinate frame under explicit inter-location consistency constraints. By restoring intermediate coverage before optimization, Stitch4D prevents geometric collapse and reconstructs coherent geometry and smooth scene dynamics even in sparsely observed environments. To evaluate this setting, we introduce Urban Sparse 4D (U-S4D), a CARLA-based benchmark designed to assess spatiotemporal alignment under sparse multi-location configurations. Experimental results on U-S4D show that Stitch4D surpasses representative 4D reconstruction baselines and achieves superior visual quality. These results indicate that recovering intermediate spatial coverage is essential for stable 4D reconstruction in sparse urban environments.
Chinese Translation
动态城市环境通常由放置在空间分离位置的摄像头捕捉,这些位置之间几乎没有或完全没有视角重叠。然而,大多数现有的4D重建方法假设视角密集重叠。当应用于这种稀疏观测时,这些方法无法重建中间区域,并且常常引入时间伪影。为了解决这一实际但尚未深入探讨的稀疏多位置设置,我们提出了Stitch4D,一个统一的4D重建框架,明确补偿稀疏观测中的缺失空间覆盖。Stitch4D (i) 合成中间桥接视图以增强空间约束并改善空间覆盖,(ii) 在明确的跨位置一致性约束下,在统一坐标框架内联合优化真实和合成观测。通过在优化之前恢复中间覆盖,Stitch4D防止几何崩溃,并在稀疏观测环境中重建一致的几何形状和平滑的场景动态。为了评估这一设置,我们引入了Urban Sparse 4D (U-S4D),一个基于CARLA的基准,用于评估稀疏多位置配置下的时空对齐。在U-S4D上的实验结果表明,Stitch4D超越了代表性的4D重建基线,并实现了更优的视觉质量。这些结果表明,恢复中间空间覆盖对于在稀疏城市环境中实现稳定的4D重建至关重要。
cs.CV / 45 / 2604.07928
Generative 3D Gaussian Splatting for Arbitrary-Resolution Atmospheric Downscaling and Forecasting
用于任意分辨率大气降尺度与预报的生成式3D高斯泼溅
Abstract
While AI-based numerical weather prediction (NWP) enables rapid forecasting, generating high-resolution outputs remains computationally demanding due to limited multi-scale adaptability and inefficient data representations. We propose the 3D Gaussian splatting-based scale-aware vision transformer (GSSA-ViT), a novel framework for arbitrary-resolution forecasting and flexible downscaling of high-dimensional atmospheric fields. Specifically, latitude-longitude grid points are treated as centers of 3D Gaussians. A generative 3D Gaussian prediction scheme is introduced to estimate key parameters, including covariance, attributes, and opacity, for unseen samples, improving generalization and mitigating overfitting. In addition, a scale-aware attention module is designed to capture cross-scale dependencies, enabling the model to effectively integrate information across varying downscaling ratios and support continuous resolution adaptation. To our knowledge, this is the first NWP approach that combines generative 3D Gaussian modeling with scale-aware attention for unified multi-scale prediction. Experiments on ERA5 show that the proposed method accurately forecasts 87 atmospheric variables at arbitrary resolutions, while evaluations on ERA5 and CMIP6 demonstrate its superior performance in downscaling tasks. The proposed framework provides an efficient and scalable solution for high-resolution, multi-scale atmospheric prediction and downscaling. Code is available at: https://github.com/binbin2xs/weather-GS.
Chinese Translation
虽然基于人工智能的数值天气预报(NWP)能够快速进行预测,但由于多尺度适应性有限和数据表示效率低下,生成高分辨率输出的计算开销仍然很大。我们提出了一种基于3D高斯泼溅的尺度感知视觉Transformer(GSSA-ViT),这是一个用于任意分辨率预报和高维大气场灵活降尺度的新框架。具体而言,经纬度网格点被视为3D高斯的中心。我们引入了一种生成式3D高斯预测方案,为未见样本估计协方差、属性和不透明度等关键参数,从而提高泛化能力并减轻过拟合。此外,我们设计了一种尺度感知注意力模块来捕捉跨尺度依赖关系,使模型能够有效整合不同降尺度比率下的信息,并支持连续的分辨率适应。据我们所知,这是首个将生成式3D高斯建模与尺度感知注意力相结合以实现统一多尺度预测的NWP方法。在ERA5上的实验表明,所提方法能够在任意分辨率下准确预报87个大气变量,而在ERA5和CMIP6上的评估则展示了其在降尺度任务中的优越性能。该框架为高分辨率、多尺度大气预测和降尺度提供了一种高效且可扩展的解决方案。代码可在以下链接获取:https://github.com/binbin2xs/weather-GS。
cs.CV / 46 / 2604.07936
Shortcut Learning in Glomerular AI: Adversarial Penalties Hurt, Entropy Helps
肾小球人工智能中的捷径学习:对抗性惩罚有害,熵有益
Abstract
Stain variability is a pervasive source of distribution shift and potential shortcut learning in renal pathology AI. We ask whether lupus nephritis glomerular lesion classifiers exploit stain as a shortcut, and how to mitigate such bias without stain or site labels. We curate a multi-center, multi-stain dataset of 9,674 glomerular patches (224×224) from 365 WSIs across three centers and four stains (PAS, H&E, Jones, Trichrome), labeled as proliferative vs. non-proliferative. We evaluate Bayesian CNN and ViT backbones with Monte Carlo dropout in three settings: (1) stain-only classification; (2) a dual-head model jointly predicting lesion and stain with supervised stain loss; and (3) a dual-head model with label-free stain regularization via entropy maximization on the stain head. In (1), stain identity is trivially learnable, confirming a strong candidate shortcut. In (2), varying the strength and sign of stain supervision strongly modulates stain performance but leaves lesion metrics essentially unchanged, indicating no measurable stain-driven shortcut learning on this multi-stain, multi-center dataset, while overly adversarial stain penalties inflate predictive uncertainty. In (3), entropy-based regularization holds stain predictions near chance without degrading lesion accuracy or calibration. Overall, a carefully curated multi-stain dataset can be inherently robust to stain shortcuts, and a Bayesian dual-head architecture with label-free entropy regularization offers a simple, deployment-friendly safeguard against potential stain-related drift in glomerular AI.
Chinese Translation
染色变异是肾脏病理人工智能中普遍存在的分布偏移和潜在捷径学习的来源。我们探讨狼疮性肾炎肾小球病变分类器是否利用染色作为捷径,以及如何在没有染色或站点标签的情况下减轻这种偏见。我们整理了一个多中心、多染色的数据集,包含来自三个中心和四种染色(PAS、H&E、Jones、Trichrome)的9,674个肾小球切片(224×224),标记为增生性与非增生性。我们在三种设置下评估了贝叶斯卷积神经网络(CNN)和视觉变换器(ViT)骨干网络,采用蒙特卡洛丢弃法:(1) 仅染色分类;(2) 一个双头模型联合预测病变和染色,并使用监督染色损失;(3) 一个通过熵最大化在染色头上进行无标签染色正则化的双头模型。在(1)中,染色身份显然是可学习的,确认了一个强有力的捷径候选。在(2)中,改变染色监督的强度和符号显著调节了染色性能,但病变指标基本保持不变,表明在这个多染色、多中心的数据集中没有可测量的染色驱动的捷径学习,而过度的对抗性染色惩罚则增加了预测不确定性。在(3)中,基于熵的正则化使染色预测接近随机,而不降低病变的准确性或校准性。总体而言,精心整理的多染色数据集可以在本质上对染色捷径具有鲁棒性,而带有无标签熵正则化的贝叶斯双头架构为肾小球人工智能中潜在的染色相关漂移提供了一种简单、易于部署的保护措施。
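Setting (3) in the abstract above, label-free regularization by maximizing entropy on the stain head, can be written down in a few lines. The sketch below is an illustration only: the function names, the weight `lam`, and the way the term is combined with the lesion loss are assumptions, not the authors' training code.

```python
import numpy as np

def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mean_entropy(logits: np.ndarray) -> float:
    """Average Shannon entropy (in nats) of the per-sample predictions."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

def total_loss(lesion_loss: float, stain_logits: np.ndarray, lam: float = 0.1) -> float:
    """Hypothetical combined objective: minimising it maximises stain-head
    entropy, pushing stain predictions toward the uniform (chance) distribution.
    No stain labels are needed, only the stain head's own outputs."""
    return lesion_loss - lam * mean_entropy(stain_logits)
```

With four stains the entropy term is bounded by ln 4, so it cannot dominate the objective for a moderate `lam`, unlike an unbounded adversarial penalty.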
cs.CV / 47 / 2604.07958
ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks
ImVideoEdit:通过2D空间差异注意力模块进行图像学习的视频编辑
Abstract
Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient framework that learns video editing capabilities entirely from image pairs. By freezing the pre-trained 3D attention modules and treating images as single-frame videos, we decouple the 2D spatial learning process to help preserve the original temporal dynamics. The core of our approach is a Predict-Update Spatial Difference Attention module that progressively extracts and injects spatial differences. Rather than relying on rigid external masks, we incorporate a Text-Guided Dynamic Semantic Gating mechanism for adaptive and implicit text-driven modifications. Despite training on only 13K image pairs for 5 epochs with exceptionally low computational overhead, ImVideoEdit achieves editing fidelity and temporal consistency comparable to larger models trained on extensive video datasets.
Chinese Translation
当前的视频编辑模型通常依赖于昂贵的配对视频数据,这限制了它们的实际可扩展性。本质上,大多数视频编辑任务可以被表述为一个解耦的时空过程,其中预训练模型的时间动态得以保留,而空间内容则被选择性和精确地修改。基于这一见解,我们提出了ImVideoEdit,一个高效的框架,完全从图像对中学习视频编辑能力。通过冻结预训练的3D注意力模块,并将图像视为单帧视频,我们解耦了2D空间学习过程,以帮助保留原始的时间动态。我们方法的核心是一个预测-更新空间差异注意力模块,逐步提取和注入空间差异。我们并不依赖于僵化的外部掩码,而是结合了文本引导的动态语义门控机制,以实现自适应和隐式的文本驱动修改。尽管仅在13K图像对上训练了5个周期,并且计算开销极低,ImVideoEdit在编辑保真度和时间一致性方面达到了与在大规模视频数据集上训练的更大模型相当的水平。
cs.CV / 48 / 2604.07960
TOOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning
TOOLCAD:探索在文本到CAD生成中使用工具的大型语言模型与强化学习的结合
Abstract
Computer-Aided Design (CAD) is an expert-level task that relies on long-horizon reasoning and coherent modeling actions. Large Language Models (LLMs) have shown remarkable advancements in enabling language agents to tackle real-world tasks. Notably, there has been no investigation into how tool-using LLMs optimally interact with CAD engines, hindering the emergence of LLM-based agentic text-to-CAD modeling systems. We propose ToolCAD, a novel agentic CAD framework that deploys LLMs as tool-using agents for text-to-CAD generation. Furthermore, we introduce an interactive CAD modeling gym to roll out reasoning and tool-augmented interaction trajectories with the CAD engine, incorporating hybrid feedback and human supervision. Meanwhile, an end-to-end post-training strategy is presented to enable the LLM agent to elicit a refined CAD Modeling Chain of Thought (CAD-CoT) and evolve into a proficient CAD tool-using agent via online curriculum reinforcement learning. Our findings demonstrate that ToolCAD fills the gap in adopting and training open-source LLMs as CAD tool-using agents, enabling them to perform comparably to proprietary models and paving the way for more accessible and robust autonomous text-to-CAD modeling systems.
Chinese Translation
计算机辅助设计(CAD)是一项依赖于长远推理和连贯建模操作的专家级任务。大型语言模型(LLMs)在使语言代理能够处理现实世界任务方面显示出了显著的进展。值得注意的是,目前尚未对工具使用型LLMs如何与CAD引擎进行最佳交互进行研究,这阻碍了基于LLM的代理文本到CAD建模系统的出现。我们提出了ToolCAD,这是一个新颖的代理CAD框架,利用LLMs作为工具使用代理进行文本到CAD的生成。此外,我们引入了一个交互式CAD建模训练环境,以便与CAD引擎展开推理和工具增强的交互轨迹,结合混合反馈和人类监督。同时,我们提出了一种端到端的后训练策略,使LLM代理能够引导精细的CAD建模思维链(CAD-CoT),并通过在线课程强化学习发展成为熟练的CAD工具使用代理。我们的研究结果表明,ToolCAD填补了采用和训练开源LLMs作为CAD工具使用代理的空白,使其能够与专有模型相媲美,为更易获取和更强大的自主文本到CAD建模系统铺平了道路。
cs.CV / 49 / 2604.07965
DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
DSCA:用于终身视觉语言模型编辑的动态子空间概念对齐
Abstract
Model editing aims to update knowledge, adding new concepts and changing relevant information without retraining. Lifelong editing is a challenging task that is prone to disrupting previously learned concepts, especially for Vision Language Models (VLMs), because sequential edits can lead to degraded reasoning and cross-modal misalignment. Existing VLM knowledge editing methods based on gated adapters, activation edits, and parameter merging address the catastrophic forgetting seen in full fine-tuning; however, they still operate in the shared representation space of the VLM, where concepts are entangled, so edits interfere with unrelated concepts. We hypothesize that this instability persists because current methods control edits algorithmically via optimization rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA), which mitigates this limitation by design: it decomposes the representation space into a set of orthogonal semantic subspaces and proposes edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision-language representations. This process structurally isolates concepts, enabling precise, non-interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi-term loss function for maintaining task fidelity, edit locality, and cross-modal alignment. With the base model frozen, our method achieves 98 percent single-edit success, remains over 95 percent after 1000 sequential edits, lowers hallucination by 3 to 5 percent, and achieves the best backward transfer (BWT) scores on continual instruction tuning benchmarks. Extensive experiments demonstrate DSCA's state-of-the-art stability and knowledge retention in continual lifelong editing across various datasets and benchmarks.
Chinese Translation
模型编辑旨在更新知识,以添加新概念并更改相关信息,而无需重新训练。终身编辑是一项具有挑战性的任务,容易破坏先前学习的概念,特别是对于视觉语言模型(VLM),因为连续编辑可能导致推理能力下降和跨模态不对齐。现有的基于门控适配器、激活编辑和参数合并技术的VLM知识编辑方法解决了完全微调中出现的灾难性遗忘问题;然而,它们仍然在VLM的共享表示空间中操作,在该空间中概念相互纠缠,因此编辑会干扰其他无关概念。我们假设这种不稳定性持续存在,因为当前的方法通过优化算法控制编辑,而不是结构性地分离知识。我们提出了动态子空间概念对齐(DSCA),通过将表示空间分解为一组正交语义子空间,并仅在这些变换空间中提出编辑,从而在设计上减轻了这一限制。这些子空间是通过对联合视觉语言表示进行增量聚类和主成分分析(PCA)获得的。该过程结构性地隔离了概念,使得通过将隔离从软训练目标转变为架构属性,实现精确且不干扰的编辑。这些手术式编辑由多项损失函数指导,以维持任务保真度、编辑局部性和跨模态对齐。在基础模型冻结的情况下,我们的方法实现了98%的单次编辑成功率,在1000次连续编辑后仍保持在95%以上,降低了3%到5%的幻觉现象,并在持续指令微调基准测试中获得了最佳的逆向迁移(BWT)得分。大量实验表明,DSCA在各种数据集和基准测试中展现了最先进的稳定性和知识保留能力,适用于持续的终身编辑。
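The structural isolation that the DSCA abstract describes rests on a basic linear-algebra fact: an edit expressed in the coordinates of an orthonormal subspace leaves the orthogonal complement untouched. The sketch below shows that fact for a single PCA subspace. It is an assumption-laden illustration (hypothetical function names, one subspace, no clustering and no loss terms), not the DSCA method itself.

```python
import numpy as np

def pca_basis(feats: np.ndarray, k: int) -> np.ndarray:
    """Top-k principal directions of an (n, d) feature cluster,
    returned as orthonormal columns of a (d, k) matrix."""
    centered = feats - feats.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def edit_in_subspace(h: np.ndarray, basis: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Apply an edit delta of shape (k,) given in subspace coordinates.
    The component of h orthogonal to the basis is left unchanged."""
    return h + basis @ delta
```

If several concept subspaces are mutually orthogonal, the same argument implies that an edit in one of them does not move the coordinates of a representation in any of the others.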
cs.CV / 50 / 2604.07966
Lighting-grounded Video Generation with Renderer-based Agent Reasoning
基于渲染器的代理推理的光照基础视频生成
Abstract
Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation. To achieve this, we introduce a novel framework that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.
Chinese Translation
扩散模型在视频生成方面取得了显著进展,但其可控性仍然是一个主要限制。关键场景因素如布局、光照和相机轨迹往往相互纠缠或仅被弱建模,这限制了它们在电影制作和虚拟制作等需要明确场景控制的领域中的适用性。我们提出了LiVER,一个基于扩散的场景可控视频生成框架。为此,我们引入了一个新颖的框架,将视频合成条件化于明确的3D场景属性,并支持一个具有密集对象布局、光照和相机参数注释的新大规模数据集。我们的方法通过从统一的3D表示中渲染控制信号来解耦这些属性。我们提出了一个轻量级的条件模块和渐进式训练策略,将这些信号集成到基础视频扩散模型中,以确保稳定收敛和高保真度。我们的框架支持广泛的应用,包括图像到视频和视频到视频的合成,其中基础的3D场景是完全可编辑的。为了进一步增强可用性,我们开发了一个场景代理,自动将高级用户指令转换为所需的3D控制信号。实验表明,LiVER在实现最先进的照片真实感和时间一致性的同时,能够对场景因素进行精确、解耦的控制,为可控视频生成设定了新的标准。
cs.CV / 51 / 2604.07980
Object-Centric Stereo Ranging for Autonomous Driving: From Dense Disparity to Census-Based Template Matching
面向自主驾驶的物体中心立体测距:从密集视差到基于Census变换的模板匹配
Abstract
Accurate depth estimation is critical for autonomous driving perception systems, particularly for long-range vehicle detection on highways. Traditional dense stereo matching methods such as Block Matching (BM) and Semi-Global Matching (SGM) produce per-pixel disparity maps but suffer from high computational cost, sensitivity to radiometric differences between stereo cameras, and poor accuracy at long range, where disparity values are small. In this report, we present a comprehensive stereo ranging system that integrates three complementary depth estimation approaches (dense BM/SGM disparity, object-centric Census-based template matching, and monocular geometric priors) within a unified detection-ranging-tracking pipeline. Our key contribution is a novel object-centric Census-based template matching algorithm that performs GPU-accelerated sparse stereo matching directly within detected bounding boxes, employing a far-close divide-and-conquer strategy, forward-backward verification, occlusion-aware sampling, and robust multi-block aggregation. We further describe an online calibration refinement framework that combines auto-rectification offset search, radar-stereo voting-based disparity correction, and object-level radar-stereo association for continuous extrinsic drift compensation. The complete system achieves real-time performance through asynchronous GPU pipeline design and delivers robust ranging across diverse driving conditions including nighttime, rain, and varying illumination.
Chinese Translation
准确的深度估计对于自主驾驶感知系统至关重要,特别是在高速公路上进行远距离车辆检测时。传统的密集立体匹配方法,如块匹配(Block Matching, BM)和半全局匹配(Semi Global Matching, SGM),生成每个像素的视差图,但存在计算成本高、对立体相机之间的辐射差异敏感以及在远距离视差值较小的情况下准确性差等问题。在本报告中,我们提出了一种综合立体测距系统,该系统在统一的检测-测距-跟踪管道中集成了三种互补的深度估计方法:密集的BM/SGM视差、以物体为中心的基于Census变换的模板匹配和单目几何先验。我们的主要贡献是一种新颖的以物体为中心的基于Census变换的模板匹配算法,该算法在检测到的边界框内直接进行GPU加速的稀疏立体匹配,采用远近分治策略、前向-后向验证、考虑遮挡的采样和稳健的多块聚合。我们进一步描述了一个在线校准优化框架,该框架结合了自动校正偏移搜索、基于雷达-立体投票的视差修正和物体级雷达-立体关联,以实现连续的外参漂移补偿。完整系统通过异步GPU管道设计实现实时性能,并在包括夜间、雨天和不同光照条件下的多样驾驶环境中提供稳健的测距能力。
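To make the Census-based matching in the stereo ranging abstract concrete, here is a minimal CPU sketch of the standard building blocks: a Census transform, a Hamming-cost search restricted to one bounding box, and the pinhole relation depth = focal length × baseline / disparity. It assumes rectified images and integer disparities. The function names and the wrap-around border handling are simplifications introduced here, and none of the report's far-close strategy, forward-backward verification, or multi-block aggregation is reproduced.

```python
import numpy as np

def census_transform(img: np.ndarray, radius: int = 1) -> np.ndarray:
    """Per-pixel bit vector: True where a neighbour is darker than the centre.
    Borders wrap around via np.roll, which is a simplification."""
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            # neighbour[y, x] == img[y + dy, x + dx] (modulo the image size)
            neighbour = np.roll(img, shift=(-dy, -dx), axis=(0, 1))
            bits.append(neighbour < img)
    return np.stack(bits, axis=-1)

def match_disparity(left, right, box, max_disp: int) -> int:
    """Find the disparity for one box (y0, y1, x0, x1) in the left image by
    minimising the Hamming distance between Census codes."""
    y0, y1, x0, x1 = box
    cl, cr = census_transform(left), census_transform(right)
    template = cl[y0:y1, x0:x1]
    costs = []
    for d in range(0, min(max_disp, x0) + 1):
        candidate = cr[y0:y1, x0 - d:x1 - d]
        costs.append(np.count_nonzero(template != candidate))
    return int(np.argmin(costs))

def depth_from_disparity(disp_px: float, focal_px: float, baseline_m: float) -> float:
    """Pinhole stereo relation for rectified cameras."""
    return focal_px * baseline_m / disp_px
```

Because the Census code depends only on the ordering of intensities, a gain or offset difference between the two cameras leaves the codes unchanged, which is the radiometric robustness the abstract contrasts with dense BM/SGM. The depth relation also shows why long range is hard: at small disparities, a one-pixel error changes the depth estimate by a large amount.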
cs.CV / 52 / 2604.07986
DP-DeGauss: Dynamic Probabilistic Gaussian Decomposition for Egocentric 4D Scene Reconstruction
DP-DeGauss:用于自我中心4D场景重建的动态概率高斯分解
Abstract
Egocentric video is crucial for next-generation 4D scene reconstruction, with applications in AR/VR and embodied AI. However, reconstructing dynamic first-person scenes is challenging due to complex ego-motion, occlusions, and hand-object interactions. Existing decomposition methods are ill-suited, assuming fixed viewpoints or merging dynamics into a single foreground. To address these limitations, we introduce DP-DeGauss, a dynamic probabilistic Gaussian decomposition framework for egocentric 4D reconstruction. Our method initializes a unified 3D Gaussian set from COLMAP priors, augments each with a learnable category probability, and dynamically routes them into specialized deformation branches for background, hands, or object modeling. We employ category-specific masks for better disentanglement and introduce brightness and motion-flow control to improve static rendering and dynamic reconstruction. Extensive experiments show that DP-DeGauss outperforms baselines by +1.70dB in PSNR on average with SSIM and LPIPS gains. More importantly, our framework achieves the first and state-of-the-art disentanglement of background, hand, and object components, enabling explicit, fine-grained separation, paving the way for more intuitive ego scene understanding and editing.
Chinese Translation
自我中心视频对于下一代4D场景重建至关重要,广泛应用于增强现实(AR)/虚拟现实(VR)和具身人工智能(embodied AI)。然而,由于复杂的自我运动、遮挡和手物体交互,重建动态第一人称场景面临挑战。现有的分解方法并不适用,因为它们假设固定视角或将动态合并为单一前景。为了解决这些局限性,我们提出了DP-DeGauss,一种用于自我中心4D重建的动态概率高斯分解框架。我们的方法从COLMAP先验初始化一个统一的3D高斯集,为每个高斯分配一个可学习的类别概率,并动态地将其路由到专门的变形分支中,以进行背景、手部或物体建模。我们采用类别特定的掩码以实现更好的解耦,并引入亮度和运动流控制,以改善静态渲染和动态重建。大量实验表明,DP-DeGauss在PSNR上平均比基线提高了+1.70dB,并在SSIM和LPIPS上也有显著提升。更重要的是,我们的框架实现了背景、手部和物体组件的首次且最先进的解耦,能够实现明确、细致的分离,为更直观的自我场景理解和编辑铺平了道路。
cs.CV / 53 / 2604.07990
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
SceneScribe-1M:一个具有全面几何和语义注释的大规模视频数据集
Abstract
The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.
Chinese Translation
3D几何感知与视频合成的融合创造了对丰富语义和时空信息的大规模视频数据的前所未有的需求。尽管现有数据集在3D理解或视频生成方面取得了进展,但在提供一个支持这两个领域的大规模统一资源方面仍存在显著差距。为填补这一空白,我们推出了SceneScribe-1M,一个新的大规模多模态视频数据集。它包含一百万个野外视频,每个视频都经过精心注释,提供详细的文本描述、精确的相机参数、密集的深度图和一致的3D点轨迹。我们通过在单目深度估计、场景重建、动态点跟踪等广泛下游任务以及文本到视频合成(无论是否控制相机)等生成任务中建立基准,展示了SceneScribe-1M的多样性和价值。通过开源SceneScribe-1M,我们旨在提供一个全面的基准和研究的催化剂,促进能够感知动态3D世界并生成可控、逼真视频内容的模型的发展。
cs.CV / 54 / 2604.07991
MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models
MotionScape:用于世界模型的大规模真实世界高动态无人机视频数据集
Abstract
Recent advances in world models have demonstrated strong capabilities in simulating physical reality, making them an increasingly important foundation for embodied intelligence. For UAV agents in particular, accurate prediction of complex 3D dynamics is essential for autonomous navigation and robust decision-making in unconstrained environments. However, under the highly dynamic camera trajectories typical of UAV views, existing world models often struggle to maintain spatiotemporal physical consistency. A key reason lies in the distribution bias of current training data: most existing datasets exhibit restricted 2.5D motion patterns, such as ground-constrained autonomous driving scenes or relatively smooth human-centric egocentric videos, and therefore lack realistic high-dynamic 6-DoF UAV motion priors. To address this gap, we present MotionScape, a large-scale real-world UAV-view video dataset with highly dynamic motion for world modeling. MotionScape contains over 30 hours of 4K UAV-view videos, totaling more than 4.5M frames. This novel dataset features semantically and geometrically aligned training samples, where diverse real-world UAV videos are tightly coupled with accurate 6-DoF camera trajectories and fine-grained natural language descriptions. To build the dataset, we develop an automated multi-stage processing pipeline that integrates CLIP-based relevance filtering, temporal segmentation, robust visual SLAM for trajectory recovery, and large-language-model-driven semantic annotation. Extensive experiments show that incorporating such semantically and geometrically aligned annotations effectively improves the ability of existing world models to simulate complex 3D dynamics and handle large viewpoint shifts, thereby benefiting decision-making and planning for UAV agents in complex environments. The dataset is publicly available at https://github.com/Thelegendzz/MotionScape
Chinese Translation
最近在世界模型方面的进展展示了其在模拟物理现实方面的强大能力,使其成为具身智能日益重要的基础。特别是对于无人机(UAV)代理,准确预测复杂的三维动态对于在不受限制的环境中实现自主导航和稳健决策至关重要。然而,在典型的无人机视角下高度动态的摄像机轨迹下,现有的世界模型往往难以保持时空物理一致性。一个关键原因在于当前训练数据的分布偏差:大多数现有数据集表现出受限的2.5D运动模式,例如地面约束的自主驾驶场景或相对平滑的人类中心自我中心视频,因此缺乏现实的高动态六自由度(6-DoF)无人机运动先验。为了解决这一问题,我们提出了MotionScape,一个用于世界建模的大规模真实世界无人机视角视频数据集,具有高度动态的运动特征。MotionScape包含超过30小时的4K无人机视角视频,总计超过450万帧。这个新颖的数据集具有语义和几何对齐的训练样本,其中多样的真实世界无人机视频与准确的6-DoF摄像机轨迹和细致的自然语言描述紧密结合。为了构建该数据集,我们开发了一个自动化的多阶段处理管道,集成了基于CLIP的相关性过滤、时间分割、用于轨迹恢复的稳健视觉SLAM以及大型语言模型驱动的语义注释。大量实验表明,结合这种语义和几何对齐的注释有效提高了现有世界模型模拟复杂三维动态和处理大视角变化的能力,从而有利于无人机代理在复杂环境中的决策和规划。该数据集已公开发布,网址为 https://github.com/Thelegendzz/MotionScape
cs.CV / 55 / 2604.07994
SAT: Selective Aggregation Transformer for Image Super-Resolution
SAT:用于图像超分辨率的选择性聚合变换器
Abstract
Transformer-based approaches have revolutionized image super-resolution by modeling long-range dependencies. However, the quadratic computational complexity of vanilla self-attention mechanisms poses significant challenges, often leading to compromises between efficiency and global context exploitation. Recent window-based attention methods mitigate this by localizing computations, but they often yield restricted receptive fields. To mitigate these limitations, we propose Selective Aggregation Transformer (SAT). This novel transformer efficiently captures long-range dependencies, leading to an enlarged model receptive field by selectively aggregating key-value matrices (reducing the number of tokens by 97%) via our Density-driven Token Aggregation algorithm while maintaining the full resolution of the query matrix. This design significantly reduces computational costs, resulting in lower complexity and enabling scalable global interactions without compromising reconstruction fidelity. SAT identifies and represents each cluster with a single aggregation token, utilizing density and isolation metrics to ensure that critical high-frequency details are preserved. Experimental results demonstrate that SAT outperforms the state-of-the-art method PFT by up to 0.22dB, while the total number of FLOPs can be reduced by up to 27%.
Chinese Translation
基于变换器的方法通过建模长距离依赖关系,彻底改变了图像超分辨率。然而,传统自注意力机制的二次计算复杂度带来了显著挑战,往往导致效率与全局上下文利用之间的妥协。最近的基于窗口的注意力方法通过局部化计算来缓解这一问题,但它们通常会导致接收场的限制。为了解决这些局限性,我们提出了选择性聚合变换器(Selective Aggregation Transformer,SAT)。这一新颖的变换器通过选择性聚合键值矩阵(通过我们的基于密度的令牌聚合算法将令牌数量减少97%),有效捕捉长距离依赖关系,从而扩大模型的接收场,同时保持查询矩阵的全分辨率。该设计显著降低了计算成本,减少了复杂性,并在不妥协重建保真度的情况下,实现了可扩展的全局交互。SAT使用密度和孤立度量来识别和表示每个聚类,确保关键的高频细节得以保留。实验结果表明,SAT在性能上优于最先进的方法PFT,提升幅度可达0.22dB,同时总的FLOPs数量可减少多达27%。
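The cost argument in the SAT abstract, full-resolution queries attending to a much smaller set of aggregated key-value tokens, can be sketched as follows. The cluster labels are taken as given and each cluster is mean-pooled. Both choices are assumptions for illustration and stand in for the paper's Density-driven Token Aggregation, which the abstract does not specify in detail.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_tokens(k: np.ndarray, v: np.ndarray, labels: np.ndarray, n_clusters: int):
    """Represent each cluster by one token: the mean of its keys and values.
    Assumes every cluster id in range(n_clusters) has at least one member."""
    k_agg = np.zeros((n_clusters, k.shape[1]))
    v_agg = np.zeros((n_clusters, v.shape[1]))
    for c in range(n_clusters):
        members = labels == c
        k_agg[c] = k[members].mean(axis=0)
        v_agg[c] = v[members].mean(axis=0)
    return k_agg, v_agg

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention; the score matrix has shape (len(q), len(k))."""
    scores = q @ k.T / np.sqrt(q.shape[1])
    return softmax(scores) @ v
```

With N queries and M aggregated tokens the score matrix is N × M instead of N × N, so going from 1024 keys to 32 cluster tokens (a reduction of about 97%) shrinks it by the same factor while every query position still produces its own output.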
cs.CV / 56 / 2604.07997
Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments
动态室内环境中的少样本增量3D物体检测
Abstract
Incremental 3D object perception is a critical step toward embodied intelligence in dynamic indoor environments. However, existing incremental 3D detection methods rely on extensive annotations of novel classes for satisfactory performance. To address this limitation, we propose FI3Det, a Few-shot Incremental 3D Detection framework that enables efficient 3D perception with only a few novel samples by leveraging vision-language models (VLMs) to learn knowledge of unseen categories. FI3Det introduces a VLM-guided unknown object learning module in the base stage to enhance perception of unseen categories. Specifically, it employs VLMs to mine unknown objects and extract comprehensive representations, including 2D semantic features and class-agnostic 3D bounding boxes. To mitigate noise in these representations, a weighting mechanism is further designed to re-weight the contributions of point- and box-level features based on their spatial locations and feature consistency within each box. Moreover, FI3Det proposes a gated multimodal prototype imprinting module, where category prototypes are constructed from aligned 2D semantic and 3D geometric features to compute classification scores, which are then fused via a multimodal gating mechanism for novel object detection. As the first framework for few-shot incremental 3D object detection, we establish both batch and sequential evaluation settings on two datasets, ScanNet V2 and SUN RGB-D, where FI3Det achieves strong and consistent improvements over baseline methods. Code is available at https://github.com/zyrant/FI3Det.
Chinese Translation
增量3D物体感知是实现动态室内环境中具身智能的关键步骤。然而,现有的增量3D检测方法依赖于对新类别的广泛标注以获得令人满意的性能。为了解决这一限制,我们提出了FI3Det,一个少样本增量3D检测框架,通过利用视觉-语言模型(VLMs)学习未见类别的知识,使得仅通过少量新样本即可实现高效的3D感知。FI3Det在基础阶段引入了一个VLM引导的未知物体学习模块,以增强对未见类别的感知。具体而言,它利用VLMs挖掘未知物体并提取全面的表示,包括2D语义特征和类别无关的3D边界框。为了减轻这些表示中的噪声,进一步设计了一种加权机制,根据点级和框级特征的空间位置及每个框内的特征一致性重新加权其贡献。此外,FI3Det提出了一个门控多模态原型印记模块,其中类别原型由对齐的2D语义和3D几何特征构建,以计算分类分数,然后通过多模态门控机制融合用于新物体检测。作为首个少样本增量3D物体检测框架,我们在两个数据集ScanNet V2和SUN RGB-D上建立了批量和顺序评估设置,FI3Det在基线方法上实现了强大且一致的改进。代码可在 https://github.com/zyrant/FI3Det 获得。
cs.CV / 57 / 2604.08008
SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving
SearchAD:用于自动驾驶的大规模稀有图像检索数据集
Abstract
Retrieving rare and safety-critical driving scenarios from large-scale datasets is essential for building robust autonomous driving (AD) systems. As dataset sizes continue to grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples. We introduce SearchAD, a large-scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. SearchAD provides high-quality manual annotations of more than 513k bounding boxes covering 90 rare categories. It specifically targets the needle-in-a-haystack problem of locating extremely rare classes, with some appearing fewer than 50 times across the entire dataset. Unlike existing benchmarks, which focused on instance-level retrieval, SearchAD emphasizes semantic image retrieval with a well-defined data split, enabling text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal retrieval models. Comprehensive evaluations show that text-based methods outperform image-based ones due to stronger inherent semantic grounding. While models directly aligning spatial visual features with language achieve the best zero-shot results, and our fine-tuning baseline significantly improves performance, absolute retrieval capabilities remain unsatisfactory. With a held-out test set on a public benchmark server, SearchAD establishes the first large-scale dataset for retrieval-driven data curation and long-tail perception research in AD: https://iis-esslingen.github.io/searchad/
Chinese Translation
从大规模数据集中检索稀有且安全关键的驾驶场景对于构建稳健的自动驾驶(AD)系统至关重要。随着数据集规模的不断扩大,关键挑战从收集更多数据转向高效识别最相关的样本。我们介绍了SearchAD,这是一个用于自动驾驶的大规模稀有图像检索数据集,包含来自11个已建立数据集的超过423k帧。SearchAD提供了超过513k个边界框的高质量手动注释,覆盖90个稀有类别。它特别针对在整个数据集中出现次数少于50次的极稀有类别的“干草堆中的针”问题。与现有基准测试(主要集中于实例级检索)不同,SearchAD强调语义图像检索,具有明确的数据划分,支持文本到图像和图像到图像的检索、少样本学习以及多模态检索模型的微调。全面评估表明,基于文本的方法由于更强的内在语义基础,表现优于基于图像的方法。尽管直接将空间视觉特征与语言对齐的模型在零样本结果中表现最佳,而我们的微调基线显著提高了性能,但绝对检索能力仍然不尽如人意。通过在公共基准服务器上保留的测试集,SearchAD建立了第一个用于检索驱动的数据整理和自动驾驶长尾感知研究的大规模数据集: https://iis-esslingen.github.io/searchad/
cs.CV / 58 / 2604.08014
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
跨越时间与空间:视频定位的解耦时空对齐
Abstract
Spatio-Temporal Video Grounding requires jointly localizing target objects across both temporal and spatial dimensions based on natural language queries, posing fundamental challenges for existing Multimodal Large Language Models (MLLMs). We identify two core challenges: \textit{entangled spatio-temporal alignment}, arising from coupling two heterogeneous sub-tasks within the same autoregressive output space, and \textit{dual-domain visual token redundancy}, where target objects exhibit simultaneous temporal and spatial sparsity, rendering the overwhelming majority of visual tokens irrelevant to the grounding query. To address these, we propose \textbf{Bridge-STG}, an end-to-end framework that decouples temporal and spatial localization while maintaining semantic coherence. While decoupling is the natural solution to this entanglement, it risks creating a semantic gap between the temporal MLLM and the spatial decoder. Bridge-STG resolves this through two pivotal designs: the \textbf{Spatio-Temporal Semantic Bridging (STSB)} mechanism with Explicit Temporal Alignment (ETA) distills the MLLM's temporal reasoning context into enriched bridging queries as a robust semantic interface; and the \textbf{Query-Guided Spatial Localization (QGSL)} module leverages these queries to drive a purpose-built spatial decoder with multi-layer interactive queries and positive/negative frame sampling, jointly eliminating dual-domain visual token redundancy. Extensive experiments across multiple benchmarks demonstrate that Bridge-STG achieves state-of-the-art performance among MLLM-based methods. Bridge-STG improves average m\_vIoU from $26.4$ to $34.3$ on VidSTG and demonstrates strong cross-task transfer across various fine-grained video understanding tasks under a unified multi-task training regime.
Chinese Translation
时空视频定位要求基于自然语言查询在时间和空间维度上共同定位目标对象,这对现有的多模态大型语言模型(MLLMs)提出了基本挑战。我们识别出两个核心挑战:\textit{纠缠的时空对齐},源于在同一自回归输出空间内耦合两个异构子任务,以及\textit{双域视觉标记冗余},其中目标对象在时间和空间上同时表现出稀疏性,使得绝大多数视觉标记与定位查询无关。为了解决这些问题,我们提出了\textbf{Bridge-STG},一个端到端框架,解耦时间和空间定位,同时保持语义一致性。尽管解耦是解决这种纠缠的自然方案,但它可能会在时间MLLM和空间解码器之间产生语义差距。Bridge-STG通过两个关键设计来解决这一问题:\textbf{时空语义桥接(STSB)}机制与显式时间对齐(ETA)将MLLM的时间推理上下文提炼为丰富的桥接查询,作为强大的语义接口;而\textbf{查询引导的空间定位(QGSL)}模块利用这些查询驱动一个专门构建的空间解码器,结合多层交互查询和正/负帧采样,共同消除双域视觉标记冗余。在多个基准上的广泛实验表明,Bridge-STG在基于MLLM的方法中实现了最先进的性能。Bridge-STG在VidSTG上将平均m\_vIoU从$26.4$提高到$34.3$,并在统一的多任务训练机制下展示了在各种细粒度视频理解任务中的强大跨任务迁移能力。
cs.CV / 59 / 2604.08015
Component-Adaptive and Lesion-Level Supervision for Improved Small Structure Segmentation in Brain MRI
基于组件自适应和病灶级监督的脑MRI小结构分割改进方法
Abstract
We propose a unified objective function, termed CATMIL, that augments the base segmentation loss with two auxiliary supervision terms operating at different levels. The first term, Component-Adaptive Tversky, reweights voxel contributions based on connected components to balance the influence of lesions of different sizes. The second term, based on Multiple Instance Learning, introduces lesion-level supervision by encouraging the detection of each lesion instance. These terms are combined with the standard nnU-Net loss to jointly optimize voxel-level segmentation accuracy and lesion-level detection. We evaluate the proposed objective on the MSLesSeg dataset using a consistent nnU-Net framework and 5-fold cross-validation. The results show that CATMIL achieves the most balanced performance across segmentation accuracy, lesion detection, and error control. It improves Dice score (0.7834) and reduces boundary error compared to standard losses. More importantly, it substantially increases small lesion recall and reduces false negatives, while maintaining the lowest false positive volume among compared methods. These findings demonstrate that integrating component-level and lesion-level supervision within a unified objective provides an effective and practical approach for improving small lesion segmentation in highly imbalanced settings. All code and pretrained models are available at https://github.com/luumsk/SmallLesionMRI.
Chinese Translation
我们提出了一种统一的目标函数,称为CATMIL,它通过两个在不同层次上操作的辅助监督项增强基础分割损失。第一个项是组件自适应Tversky,它根据连通组件重新加权体素贡献,以平衡不同大小病灶的影响。第二个项基于多实例学习,通过鼓励检测每个病灶实例引入病灶级监督。这些项与标准的nnU-Net损失结合,以共同优化体素级分割精度和病灶级检测。我们在MSLesSeg数据集上使用一致的nnU-Net框架和5折交叉验证评估所提出的目标。结果表明,CATMIL在分割精度、病灶检测和误差控制方面实现了最平衡的性能。与标准损失相比,它提高了Dice系数(0.7834)并减少了边界误差。更重要的是,它显著提高了小病灶的召回率并减少了假阴性,同时在比较方法中保持了最低的假阳性体积。这些发现表明,在统一目标中整合组件级和病灶级监督提供了一种有效且实用的方法,以改善在高度不平衡环境下的小病灶分割。所有代码和预训练模型可在 https://github.com/luumsk/SmallLesionMRI 获取。
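The Tversky term that CATMIL builds on can be illustrated with a minimal sketch. The code below computes the standard Tversky loss, 1 - TP / (TP + alpha*FP + beta*FN), over flattened voxels. The optional `weights` argument is a hypothetical hook for per-voxel reweighting; how the paper derives weights from connected components is not stated in the abstract and is not reproduced here.

```python
def tversky_loss(pred, target, alpha=0.5, beta=0.5, weights=None, eps=1e-7):
    """Standard Tversky loss over flattened voxels.

    pred    : predicted foreground probabilities in [0, 1]
    target  : binary ground-truth labels (0 or 1)
    alpha   : penalty on false positives
    beta    : penalty on false negatives
    weights : optional per-voxel weights (hypothetical reweighting hook)
    """
    if weights is None:
        weights = [1.0] * len(pred)
    # Soft counts of true positives, false positives, and false negatives.
    tp = sum(w * p * t for w, p, t in zip(weights, pred, target))
    fp = sum(w * p * (1 - t) for w, p, t in zip(weights, pred, target))
    fn = sum(w * (1 - p) * t for w, p, t in zip(weights, pred, target))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)
```

With alpha = beta = 0.5 the loss coincides with the Dice loss. Choosing beta > alpha penalizes missed foreground voxels more than spurious ones, which is the usual motivation for Tversky-style losses when small lesions are easily missed.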
cs.CV / 60 / 2604.08034
Rotation Equivariant Convolutions in Deformable Registration of Brain MRI
脑MRI变形配准中的旋转等变卷积
Abstract
Image registration is a fundamental task that aligns anatomical structures between images. While CNNs perform well, they lack rotation equivariance - a rotated input does not produce a correspondingly rotated output. This hinders performance by failing to exploit the rotational symmetries inherent in anatomical structures, particularly in brain MRI. In this work, we integrate rotation-equivariant convolutions into deformable brain MRI registration networks. We evaluate this approach by replacing standard encoders with equivariant ones in three baseline architectures, testing on multiple public brain MRI datasets. Our experiments demonstrate that equivariant encoders have three key advantages: 1) They achieve higher registration accuracy while reducing network parameters, confirming the benefit of this anatomical inductive bias. 2) They outperform baselines on rotated input pairs, demonstrating robustness to orientation variations common in clinical practice. 3) They show improved performance with less training data, indicating greater sample efficiency. Our results demonstrate that incorporating geometric priors is a critical step toward building more robust, accurate, and efficient registration models.
Chinese Translation
图像配准是一项基本任务,旨在对齐图像之间的解剖结构。尽管卷积神经网络(CNN)表现良好,但它们缺乏旋转等变性——旋转输入不会产生相应旋转的输出。这限制了其性能,因为未能利用解剖结构中固有的旋转对称性,尤其是在脑MRI中。在本研究中,我们将旋转等变卷积集成到变形脑MRI配准网络中。我们通过在三种基线架构中用等变编码器替换标准编码器来评估这种方法,并在多个公共脑MRI数据集上进行测试。我们的实验表明,等变编码器具有三个主要优势:1)它们在提高配准精度的同时减少了网络参数,证实了这种解剖归纳偏差的好处。2)它们在旋转输入对上优于基线,展示了对临床实践中常见的方向变化的鲁棒性。3)它们在较少的训练数据下表现更好,表明更高的样本效率。我们的结果表明,融入几何先验是构建更鲁棒、更准确和更高效的配准模型的关键步骤。
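The equivariance property described in this abstract (a rotated input should produce a correspondingly rotated output) can be checked numerically. The sketch below is a 2D toy with 90-degree rotations and a single 3x3 filter; it is only an illustration of the property, not the equivariant layers used in the paper, and all function names are chosen here for illustration.

```python
def rot90(m):
    """Rotate a square 2D list by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*m)][::-1]

def conv2d_same(img, kernel):
    """3x3 cross-correlation with zero padding, output same size as input."""
    n = len(img)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < n and 0 <= x < n:
                        acc += kernel[di + 1][dj + 1] * img[y][x]
            out[i][j] = acc
    return out

def is_rot90_equivariant(kernel, img):
    """True if filtering a rotated image equals rotating the filtered image."""
    return conv2d_same(rot90(img), kernel) == rot90(conv2d_same(img, kernel))

# A kernel that is itself invariant under 90-degree rotation.
symmetric_kernel = [[0, 1, 0], [1, 4, 1], [0, 1, 0]]
# A generic kernel (a diagonal shift) with no rotational symmetry.
generic_kernel = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Here `is_rot90_equivariant(symmetric_kernel, image)` holds, while the generic kernel fails the check. A standard CNN learns generic kernels, so nothing forces this property; equivariant layers build it into the architecture by construction.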
cs.CV / 61 / 2604.08038
Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection
超越Mamba:利用可变形膨胀卷积增强状态空间模型以实现多尺度交通目标检测
Abstract
In a real-world traffic scenario, varying-scale objects are usually distributed in a cluttered background, which poses great challenges to accurate detection. Although current Mamba-based methods can efficiently model long-range dependencies, they still struggle to capture small objects with abundant local details, which hinders joint modeling of local structures and global semantics. Moreover, state-space models exhibit limited hierarchical feature representation and weak cross-scale interaction due to flat sequential modeling and insufficient spatial inductive biases, leading to sub-optimal performance in complex scenes. To address these issues, we propose a Mamba with Deformable Dilated Convolutions Network (MDDCNet) for accurate traffic object detection in this study. In MDDCNet, a well-designed hybrid backbone with successive Multi-Scale Deformable Dilated Convolution (MSDDC) blocks and Mamba blocks enables hierarchical feature representation from local details to global semantics. Meanwhile, a Channel-Enhanced Feed-Forward Network (CE-FFN) is further devised to overcome the limited channel interaction capability of conventional feed-forward networks, whilst a Mamba-based Attention-Aggregating Feature Pyramid Network (A^2FPN) is constructed to achieve enhanced multi-scale feature fusion and interaction. Extensive experimental results on public benchmark and real-world datasets demonstrate the superiority of our method over various advanced detectors. The code is available at https://github.com/Bettermea/MDDCNet.
Chinese Translation
在现实交通场景中,不同尺度的物体通常分布在杂乱的背景中,这给准确检测带来了很大挑战。尽管当前基于Mamba的方法能够有效建模长距离依赖关系,但在捕捉具有丰富局部细节的小物体方面仍然存在困难,这阻碍了局部结构与全局语义的联合建模。此外,由于平坦的序列建模和不足的空间归纳偏置,状态空间模型的层次特征表示能力有限,跨尺度交互较弱,导致在复杂场景中的性能不尽如人意。为了解决这些问题,本研究提出了一种基于可变形膨胀卷积的Mamba网络(MDDCNet),用于准确的交通目标检测。在MDDCNet中,设计良好的混合主干网络结合了连续的多尺度可变形膨胀卷积(MSDDC)块和Mamba块,使得从局部细节到全局语义的层次特征表示成为可能。同时,进一步设计了一种通道增强前馈网络(CE-FFN),以克服传统前馈网络的有限通道交互能力,而基于Mamba的注意力聚合特征金字塔网络(A^2FPN)则被构建以实现增强的多尺度特征融合与交互。在公共基准和现实世界数据集上的大量实验结果表明,我们的方法优于各种先进的检测器。代码可在 https://github.com/Bettermea/MDDCNet 获取。
cs.CV / 62 / 2604.08039
LINE: LLM-based Iterative Neuron Explanations for Vision Models
LINE:基于大语言模型的视觉模型迭代神经元解释
Abstract
Interpreting the concepts encoded by individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. We demonstrate that LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.18 on ImageNet and 0.05 on Places365, while discovering, on average, 29% of new concepts missed by massive predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, which enables polysemanticity evaluation and produces supporting visual explanations that rival gradient-dependent activation maximization methods.
Chinese Translation
解释深度神经网络中单个神经元编码的概念是理解其复杂决策过程和确保人工智能安全的重要步骤。尽管在神经元标注方面取得了近期进展,但现有方法通常将搜索空间限制在预定义的概念词汇表中,或产生过于具体的描述,未能捕捉更高阶的全局概念。我们提出了LINE,一种新颖的、无训练的迭代方法,旨在为视觉模型提供开放词汇的概念标注。LINE在严格的黑箱环境中运行,利用大型语言模型和文本到图像生成器,迭代性地提出和完善概念,形成闭环,受激活历史的指导。我们展示了LINE在多个模型架构上实现了最先进的性能,在ImageNet上获得了高达0.18的AUC提升,在Places365上获得了0.05的提升,同时平均发现了29%的被大量预定义词汇遗漏的新概念。除了识别顶级概念外,LINE还提供完整的生成历史,这使得多义性评估成为可能,并生成与基于梯度的激活最大化方法相媲美的支持视觉解释。
cs.CV / 63 / 2604.08042
3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience
3DrawAgent:通过早期对比经验教会大型语言模型在3D中绘图
Abstract
Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D Bezier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that adapts the recently proposed Group Reward Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, with each pair consisting of a relatively better and a worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model's 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bezier sketches from diverse textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for advancing the field of training-free 3D sketch intelligence.
Chinese Translation
在3D空间中绘图能够对形状、结构和空间关系进行富有表现力的推理,但通过自然语言生成3D草图仍然是一个重大挑战。在本研究中,我们介绍了3DrawAgent,这是一种无训练、基于语言的3D草图生成框架,利用大型语言模型(LLMs)在几何反馈下顺序绘制3D贝塞尔曲线。与之前的2D草图代理不同,我们的方法引入了一种相对经验优化策略,以适应最近提出的群体奖励策略优化(Group Reward Policy Optimization, GRPO)范式。我们不依赖于明确的真实监督,而是构建生成草图之间的成对比较,每对包含一个相对较好的结果和一个较差的结果,这些结果基于基于CLIP的感知奖励和基于LLM的细粒度定性评估。这些经验随后用于迭代地细化3D绘图的先前知识,使模型的3D意识得到黑箱强化。这一设计使我们的模型能够在不更新参数的情况下自我提升其空间理解和绘图质量。实验表明,3DrawAgent能够从多样的文本提示中生成复杂且连贯的3D贝塞尔草图,展现出新兴的几何推理能力,并能够推广到新形状,确立了推动无训练3D草图智能领域的新范式。
cs.CV / 64 / 2604.08045
Adapting Foundation Models for Annotation-Efficient Adnexal Mass Segmentation in Cine Images
适应基础模型以实现影像中附属肿块的注释高效分割
Abstract
Adnexal mass evaluation via ultrasound is a challenging clinical task, often hindered by subjective interpretation and significant inter-observer variability. While automated segmentation is a foundational step for quantitative risk assessment, traditional fully supervised convolutional architectures frequently require large amounts of pixel-level annotations and struggle with domain shifts common in medical imaging. In this work, we propose a label-efficient segmentation framework that leverages the robust semantic priors of a pretrained DINOv3 foundational vision transformer backbone. By integrating this backbone with a Dense Prediction Transformer (DPT)-style decoder, our model hierarchically reassembles multi-scale features to combine global semantic representations with fine-grained spatial details. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, our method achieves state-of-the-art performance compared to established fully supervised baselines, including U-Net, U-Net++, DeepLabV3, and MAnet. Specifically, we obtain a Dice score of 0.945 and improved boundary adherence, reducing the 95th-percentile Hausdorff Distance by 11.4% relative to the strongest convolutional baseline. Furthermore, we conduct an extensive efficiency analysis demonstrating that our DINOv3-based approach retains significantly higher performance under data starvation regimes, maintaining strong results even when trained on only 25% of the data. These results suggest that leveraging large-scale self-supervised foundations provides a promising and data-efficient solution for medical image segmentation in data-constrained clinical environments. Project Repository: https://github.com/FrancescaFati/MESA
Chinese Translation
通过超声波评估附属肿块是一项具有挑战性的临床任务,常常受到主观解读和显著的观察者间变异性的影响。尽管自动分割是定量风险评估的基础步骤,但传统的全监督卷积架构通常需要大量的像素级注释,并且在医学影像中常见的领域转移中表现不佳。在本研究中,我们提出了一种标签高效的分割框架,该框架利用了预训练的 DINOv3 基础视觉变换器主干的强大语义先验。通过将该主干与密集预测变换器(Dense Prediction Transformer, DPT)风格的解码器相结合,我们的模型以层次方式重新组合多尺度特征,将全局语义表示与细粒度空间细节相结合。在对来自112名患者的7,777帧注释图像的临床数据集进行评估时,我们的方法在与已建立的全监督基线(包括 U-Net、U-Net++、DeepLabV3 和 MAnet)相比时,达到了最先进的性能。具体而言,我们获得了0.945的Dice系数,并改善了边界遵循性,相较于最强的卷积基线,95百分位的Hausdorff距离减少了11.4%。此外,我们进行了广泛的效率分析,表明我们的DINOv3基础方法在数据稀缺的情况下保持显著更高的性能,即使在仅使用25%的数据进行训练时也能保持良好的结果。这些结果表明,利用大规模自监督基础提供了一种有前景且数据高效的解决方案,用于在数据受限的临床环境中进行医学图像分割。项目仓库:https://github.com/FrancescaFati/MESA
cs.CV / 65 / 2604.08048
Guiding a Diffusion Model by Swapping Its Tokens
通过交换令牌引导扩散模型
Abstract
Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling towards higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar token latents in either spatial or channel dimensions. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach selectively exchanges and recomposes token latents, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that the proposed Self-Swap Guidance (SSG), when applied to popular diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.
Chinese Translation
无分类器引导(Classifier-Free Guidance, CFG)是一种广泛使用的推理时间技术,用于提升扩散模型的图像质量。然而,它对文本条件的依赖限制了其在无条件生成中的应用。我们提出了一种简单的方法,使得CFG类似的引导能够同时适用于条件生成和无条件生成。其关键思想是通过简单的令牌交换操作生成一个扰动预测,并利用其与干净预测之间的方向引导采样朝向更高保真度的分布。在实践中,我们在空间或通道维度上交换最语义上不相似的令牌潜变量对。与现有的以全局或较少约束的方式施加扰动的方法不同,我们的方法选择性地交换和重组令牌潜变量,从而对扰动及其对生成样本的影响进行更精细的控制。在MS-COCO 2014、MS-COCO 2017和ImageNet数据集上的实验表明,当应用于流行的扩散模型时,所提出的自交换引导(Self-Swap Guidance, SSG)在不同设置下的图像保真度和提示对齐方面优于以前的无条件方法。其细粒度的扰动粒度也提高了鲁棒性,减少了在更广泛的扰动强度范围内的副作用。总体而言,SSG将CFG扩展到更广泛的应用范围,包括条件生成和无条件生成,并且可以作为插件轻松插入任何扩散模型中,以获得即时的改进。
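The two ingredients named in this abstract, a token swap that produces a perturbed prediction and a guidance step along the direction from the perturbed to the clean prediction, can be sketched as follows. This is a rough illustration under stated assumptions: cosine similarity stands in for "semantic dissimilarity", only one pair is swapped, and the CFG-style extrapolation formula in `guide` is assumed by analogy with classifier-free guidance rather than taken from the paper. The denoiser that would consume the swapped tokens is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two token vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def swap_most_dissimilar_pair(tokens):
    """Return a copy of `tokens` with the least-similar pair exchanged."""
    best = None
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            s = cosine(tokens[i], tokens[j])
            if best is None or s < best[0]:
                best = (s, i, j)
    out = [list(t) for t in tokens]
    if best is not None:
        _, i, j = best
        out[i], out[j] = out[j], out[i]
    return out

def guide(clean_pred, perturbed_pred, scale):
    """Extrapolate away from the perturbed prediction (assumed CFG-like form)."""
    return [c + scale * (c - p) for c, p in zip(clean_pred, perturbed_pred)]
```

In a sampling loop, one would run the model once on the original latents and once on the swapped latents, then pass both predictions to `guide`. With `scale = 0` the clean prediction is returned unchanged.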
cs.CV / 66 / 2604.08050
ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning
ABMAMBA:具有对齐层次双向扫描的多模态大语言模型,用于高效视频字幕生成
Abstract
In this study, we focus on video captioning by fully open multimodal large language models (MLLMs). The comprehension of visual sequences is challenging because of their intricate temporal dependencies and substantial sequence length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with the sequence length, making them computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables the scalable processing of video sequences. ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba demonstrates competitive performance compared to typical MLLMs while achieving approximately three times higher throughput.
Chinese Translation
本研究聚焦于通过完全开放的多模态大语言模型(MLLMs)进行视频字幕生成。由于视觉序列复杂的时间依赖性和较长的序列长度,理解视觉序列具有挑战性。现有基于Transformer的方法的核心注意力机制与序列长度呈二次方关系扩展,导致其计算成本高昂。为了解决这些限制,我们提出了对齐层次双向扫描Mamba(ABMamba),这是一种具有线性计算复杂度的完全开放MLLM,能够实现视频序列的可扩展处理。ABMamba以深度状态空间模型作为其语言基础,替代了昂贵的二次方注意力机制,并采用了一种新颖的对齐层次双向扫描模块,以多个时间分辨率处理视频。在标准视频字幕生成基准测试如VATEX和MSR-VTT上,ABMamba表现出与典型MLLMs相媲美的性能,同时实现了大约三倍的吞吐量。
cs.CV / 67 / 2604.08063
EEG2Vision: A Multimodal EEG-Based Framework for 2D Visual Reconstruction in Cognitive Neuroscience
EEG2Vision:一个基于多模态脑电图的认知神经科学二维视觉重建框架
Abstract
Reconstructing visual stimuli from non-invasive electroencephalography (EEG) remains challenging due to its low spatial resolution and high noise, particularly under realistic low-density electrode configurations. To address this, we present EEG2Vision, a modular, end-to-end EEG-to-image framework that systematically evaluates reconstruction performance across different EEG resolutions (128, 64, 32, and 24 channels) and enhances visual quality through a prompt-guided post-reconstruction boosting mechanism. Starting from EEG-conditioned diffusion reconstruction, the boosting stage uses a multimodal large language model to extract semantic descriptions and leverages image-to-image diffusion to refine geometry and perceptual coherence while preserving EEG-grounded structure. Our experiments show that semantic decoding accuracy degrades significantly with channel reduction (e.g., 50-way Top-1 Acc from 89% to 38%), while reconstruction quality decreases only slightly (e.g., FID from 76.77 to 80.51). The proposed boosting consistently improves perceptual metrics across all configurations, achieving up to 9.71% IS gains in low-channel settings. A user study confirms the clear perceptual preference for boosted reconstructions. The proposed approach significantly boosts the feasibility of real-time brain-2-image applications using low-resolution EEG devices, potentially unlocking this type of application outside laboratory settings.
Chinese Translation
从非侵入性脑电图(EEG)重建视觉刺激仍然具有挑战性,主要是由于其低空间分辨率和高噪声,尤其是在现实的低密度电极配置下。为了解决这一问题,我们提出了EEG2Vision,一个模块化的端到端EEG到图像框架,系统地评估不同EEG分辨率(128、64、32和24通道)下的重建性能,并通过提示引导的后重建增强机制提升视觉质量。从EEG条件的扩散重建开始,增强阶段利用多模态大型语言模型提取语义描述,并利用图像到图像的扩散来细化几何形状和感知一致性,同时保持基于EEG的结构。我们的实验表明,随着通道数量的减少,语义解码准确性显著下降(例如,50路Top-1准确率从89%降至38%),而重建质量略有下降(例如,FID从76.77增加到80.51)。所提出的增强方法在所有配置中一致地改善感知指标,在低通道设置中实现了高达9.71%的IS增益。一项用户研究证实了对增强重建的明显感知偏好。所提出的方法显著提高了使用低分辨率EEG设备进行实时脑-图像应用的可行性,可能使这类应用在实验室环境之外得以实现。
cs.CV / 68 / 2604.08068
Brain3D: EEG-to-3D Decoding of Visual Representations via Multimodal Reasoning
Brain3D:通过多模态推理实现脑电图(EEG)到三维(3D)视觉表征的解码
Abstract
Decoding visual information from electroencephalography (EEG) has recently achieved promising results, primarily focusing on reconstructing two-dimensional (2D) images from brain activity. However, the reconstruction of three-dimensional (3D) representations remains largely unexplored. This limits the geometric understanding and reduces the applicability of neural decoding in different contexts. To address this gap, we propose Brain3D, a multimodal architecture for EEG-to-3D reconstruction based on EEG-to-image decoding. It progressively transforms neural representations into the 3D domain using geometry-aware generative reasoning. Our pipeline first produces visually grounded images from EEG signals, then employs a multimodal large language model to extract structured 3D-aware descriptions, which guide a diffusion-based generation stage whose outputs are finally converted into coherent 3D meshes via a single-image-to-3D model. By decomposing the problem into structured stages, the proposed approach avoids direct EEG-to-3D mappings and enables scalable brain-driven 3D generation. We conduct a comprehensive evaluation comparing the reconstructed 3D outputs against the original visual stimuli, assessing both semantic alignment and geometric fidelity. Experimental results demonstrate strong performance of the proposed architecture, achieving up to 85.4% 10-way Top-1 EEG decoding accuracy and 0.648 CLIPScore, supporting the feasibility of multimodal EEG-driven 3D reconstruction.
Chinese Translation
从脑电图(EEG)解码视觉信息最近取得了令人鼓舞的成果,主要集中在从脑活动重建二维(2D)图像。然而,三维(3D)表征的重建仍然基本未被探索。这限制了几何理解,并减少了神经解码在不同背景下的适用性。为了解决这一问题,我们提出了Brain3D,一种基于EEG到图像解码的EEG到3D重建的多模态架构。它通过几何感知生成推理逐步将神经表征转化为3D领域。我们的流程首先从EEG信号生成视觉基础图像,然后利用多模态大型语言模型提取结构化的3D感知描述,这些描述指导基于扩散的生成阶段,最终通过单图像到3D模型将输出转换为一致的3D网格。通过将问题分解为结构化阶段,所提出的方法避免了直接的EEG到3D映射,并实现了可扩展的脑驱动3D生成。我们进行了全面评估,将重建的3D输出与原始视觉刺激进行比较,评估语义一致性和几何保真度。实验结果表明,所提出的架构表现出强大的性能,达到了85.4%的10类Top-1 EEG解码准确率和0.648的CLIPScore,支持了多模态EEG驱动的3D重建的可行性。
cs.CV / 69 / 2604.08070
AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models
AtlasOCR:利用视觉语言模型构建首个开源达里贾OCR模型
Abstract
Darija, the Moroccan Arabic dialect, is rich in visual content yet lacks specialized Optical Character Recognition (OCR) tools. This paper introduces AtlasOCR, the first open-source Darija OCR model built by fine-tuning a 3B parameter Vision Language Model (VLM). We detail our comprehensive approach, from curating a unique Darija-specific dataset leveraging both synthetic generation with our OCRSmith library and carefully sourced real-world data, to implementing efficient fine-tuning strategies. We utilize QLoRA and Unsloth for parameter-efficient training of Qwen2.5-VL 3B and present comprehensive ablation studies optimizing key hyperparameters. Our evaluation on the newly curated AtlasOCRBench and the established KITAB-Bench demonstrates state-of-the-art performance, challenging larger models and highlighting AtlasOCR's robustness and generalization capabilities for both Darija and standard Arabic OCR tasks.
Chinese Translation
达里贾(Darija),摩洛哥阿拉伯方言,视觉内容丰富,但缺乏专门的光学字符识别(OCR)工具。本文介绍了AtlasOCR,这是第一个通过微调一个3B参数的视觉语言模型(VLM)构建的开源达里贾OCR模型。我们详细描述了我们的综合方法,从利用我们的OCRSmith库进行合成生成和精心收集的真实数据来策划独特的达里贾特定数据集,到实施高效的微调策略。我们使用QLoRA和Unsloth进行Qwen2.5-VL 3B的参数高效训练,并呈现全面的消融研究以优化关键超参数。我们在新策划的AtlasOCRBench和已建立的KITAB-Bench上的评估展示了最先进的性能,挑战了更大的模型,并突显了AtlasOCR在达里贾和标准阿拉伯OCR任务中的鲁棒性和泛化能力。
cs.CV / 70 / 2604.08072
Tensor-Augmented Convolutional Neural Networks: Enhancing Expressivity with Generic Tensor Kernels
张量增强卷积神经网络:使用通用张量核增强表达能力
Abstract
Convolutional Neural Networks (CNNs) excel at extracting local features hierarchically, but their performance in capturing complex correlations hinges heavily on deep architectures, which are usually computationally demanding and difficult to interpret. To address these issues, we propose a physically-guided shallow model: tensor-augmented CNN (TACNN), which replaces conventional convolution kernels with generic tensors to enhance representational capacity. This choice is motivated by the fact that an order-$N$ tensor naturally encodes an arbitrary quantum superposition state in the Hilbert space of dimension $d^N$, where $d$ is the local physical dimension, thus offering substantially richer expressivity. Furthermore, in our design the convolution output of each layer becomes a multilinear form capable of capturing high-order feature correlations, thereby equipping a shallow multilayer architecture with an expressive power competitive to that of deep CNNs. On the Fashion-MNIST benchmark, TACNN demonstrates clear advantages over conventional CNNs, achieving remarkable accuracies with only a few layers. In particular, a TACNN with only two convolution layers attains a test accuracy of 93.7$\%$, surpassing or matching considerably deeper models such as VGG-16 (93.5$\%$) and GoogLeNet (93.7$\%$). These findings highlight TACNN as a promising framework that strengthens model expressivity while preserving architectural simplicity, paving the way towards more interpretable and efficient deep learning models.
Chinese Translation
卷积神经网络(CNN)在分层提取局部特征方面表现出色,但其在捕捉复杂关联方面的性能在很大程度上依赖于深度架构,这通常计算开销较大且难以解释。为了解决这些问题,我们提出了一种物理引导的浅层模型:张量增强卷积神经网络(TACNN),该模型用通用张量替代传统卷积核,以增强表示能力。这一选择的动机在于,阶数为 $N$ 的张量自然编码了维度为 $d^N$ 的希尔伯特空间中的任意量子叠加态,其中 $d$ 是局部物理维度,从而提供了显著更丰富的表达能力。此外,在我们的设计中,每一层的卷积输出成为一种多线性形式,能够捕捉高阶特征关联,从而使得浅层多层架构具备与深层 CNN 竞争的表达能力。在 Fashion-MNIST 基准测试中,TACNN 显示出相较于传统 CNN 的明显优势,仅用少数几层便取得了显著的准确率。特别是,仅有两层卷积层的 TACNN 达到了 93.7$\%$ 的测试准确率,超越或匹配了 VGG-16(93.5$\%$)和 GoogLeNet(93.7$\%$)等更深的模型。这些发现突显了 TACNN 作为一种有前景的框架,增强了模型的表达能力,同时保持了架构的简洁性,为更可解释和高效的深度学习模型铺平了道路。
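The idea of replacing a linear kernel with a generic order-$N$ tensor, so that the layer output becomes a multilinear form, can be made concrete with a small sketch. The assumptions are loud ones: each pixel is lifted to a $d = 2$ feature vector $[1, x]$ (a common choice in tensor-network models, not confirmed by the abstract), the tensor is stored as a sparse dictionary, and the function names are hypothetical.

```python
from itertools import product

def local_features(x):
    """Map a scalar pixel to a d=2 feature vector (assumed map: [1, x])."""
    return [1.0, x]

def multilinear_response(tensor, patch):
    """Contract an order-N tensor with the N feature vectors of a patch.

    tensor : dict mapping index tuples (i_1, ..., i_N) to coefficients;
             missing entries count as zero
    patch  : list of N pixel values
    """
    feats = [local_features(x) for x in patch]
    d = len(feats[0])
    total = 0.0
    for idx in product(range(d), repeat=len(patch)):
        coeff = tensor.get(idx, 0.0)
        if coeff == 0.0:
            continue
        term = coeff
        for k, i in enumerate(idx):
            term *= feats[k][i]  # pick component i of the k-th pixel's features
        total += term
    return total
```

Under this feature map, a tensor whose only nonzero entries have exactly one index equal to 1 reproduces an ordinary linear kernel, while entries such as `(1, 1)` add products of pixels, which is the sense in which higher-order correlations are captured. The cost is size: a 3x3 patch has $N = 9$, so a dense tensor with $d = 2$ holds $2^9 = 512$ coefficients against 9 for a conventional kernel.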
cs.CV / 71 / 2604.08074
DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather
DinoRADE:基于全光谱雷达-相机融合的视觉基础模型特征在恶劣天气下的多类目标检测
Abstract
Reliable and weather-robust perception systems are essential for safe autonomous driving and typically employ multi-modal sensor configurations to achieve comprehensive environmental awareness. While recent automotive FMCW Radar-based approaches achieved remarkable performance on detection tasks in adverse weather conditions, they exhibited limitations in resolving fine-grained spatial details particularly critical for detecting smaller and vulnerable road users (VRUs). Furthermore, existing research has not adequately addressed VRU detection in adverse weather datasets such as K-Radar. We present DinoRADE, a Radar-centered detection pipeline that processes dense Radar tensors and aggregates vision features around transformed reference points in the camera perspective via deformable cross-attention. Vision features are provided by a DINOv3 Vision Foundation Model. We present a comprehensive performance evaluation on the K-Radar dataset in all weather conditions and are among the first to report detection performance individually for five object classes. Additionally, we compare our method with existing single-class detection approaches and outperform recent Radar-camera approaches by 12.1%. The code is available under https://github.com/chr-is-tof/RADE-Net.
Chinese Translation
可靠且具有抗天气能力的感知系统对于安全的自动驾驶至关重要,通常采用多模态传感器配置以实现全面的环境感知。尽管最近基于汽车FMCW雷达的方法在恶劣天气条件下的检测任务中取得了显著的性能,但在解决细粒度空间细节方面存在局限性,这对于检测较小和脆弱的道路使用者(VRUs)尤其关键。此外,现有研究尚未充分解决在恶劣天气数据集(如K-Radar)中的VRU检测问题。我们提出了DinoRADE,一种以雷达为中心的检测管道,处理密集的雷达张量,并通过可变形交叉注意力在相机视角下聚合变换参考点周围的视觉特征。视觉特征由DINOv3视觉基础模型提供。我们在K-Radar数据集上进行了全面的性能评估,涵盖所有天气条件,并首次分别报告五个目标类别的检测性能。此外,我们将我们的方法与现有的单类检测方法进行了比较,并在检测性能上超越了最近的雷达-相机方法12.1%。代码可在 https://github.com/chr-is-tof/RADE-Net 获取。
cs.CV / 72 / 2604.08077
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
AdaSpark:高效长视频理解的自适应稀疏性
Abstract
Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes to attend for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark significantly reduces computational load by up to 57% FLOPs while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.
Chinese Translation
使用视频大型语言模型(Video-LLMs)处理长视频在计算上是不可承受的。目前的效率方法往往通过不可逆的信息丢弃妥协细粒度感知,或通过刚性、预定义的稀疏模式抑制长程时间建模。本文提出了AdaSpark,一个旨在解决这些局限性的自适应稀疏框架。AdaSpark首先将视频输入划分为3D时空立方体。然后,它采用两个共同设计的、上下文感知的组件:(1)自适应立方体选择注意力(Adaptive Cube-Selective Attention, AdaS-Attn),该组件自适应地选择与每个查询标记相关的立方体子集进行关注;(2)自适应标记选择前馈网络(Adaptive Token-Selective FFN, AdaS-FFN),该组件仅选择性地处理每个立方体内最显著的标记。基于熵的(Top-p)选择机制根据输入复杂性自适应地分配计算资源。实验表明,AdaSpark在保持与密集模型相当的性能并保留细粒度、长程依赖性的同时,显著减少了多达57%的计算负载(FLOPs),这一点在具有挑战性的小时级视频基准测试中得到了验证。
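The Top-p selection mentioned in the AdaSpark abstract can be illustrated with a minimal sketch. It assumes each candidate cube already has a scalar relevance score for the current query; how AdaSpark computes those scores is not stated in the abstract, and the function name `top_p_select` is hypothetical. The point is only that a peaked (low-entropy) score distribution keeps few cubes, while a flat one keeps many.

```python
import math

def top_p_select(scores, p=0.9):
    """Return indices of the smallest set of items whose softmax mass reaches p."""
    # Softmax over raw relevance scores (shifted by the max for stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Visit items from most to least probable, accumulating probability mass.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(selected)

# Hypothetical scores: one dominant cube vs. four equally relevant cubes.
peaked = top_p_select([8.0, 0.0, 0.0, 0.0], p=0.9)
flat = top_p_select([1.0, 1.0, 1.0, 1.0], p=0.9)
```

With these inputs the peaked case keeps a single cube and the flat case keeps all four, which is the sense in which the compute budget follows input complexity.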
cs.CV / 73 / 2604.08084
DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning
DiffVC:基于扩散模型的视频字幕生成非自回归框架
Abstract
Current video captioning methods usually use an encoder-decoder structure to generate text autoregressively. However, autoregressive methods have inherent limitations such as slow generation speed and large cumulative error. Furthermore, the few non-autoregressive counterparts suffer from deficiencies in generation quality due to the lack of sufficient multimodal interaction modeling. Therefore, we propose a non-autoregressive framework based on Diffusion model for Video Captioning (DiffVC) to address these issues. Its parallel decoding can effectively solve the problems of generation speed and cumulative error. At the same time, our proposed discriminative conditional Diffusion Model can generate higher-quality textual descriptions. Specifically, we first encode the video into a visual representation. During training, Gaussian noise is added to the textual representation of the ground-truth caption. Then, a new textual representation is generated via the discriminative denoiser with the visual representation as a conditional constraint. Finally, we input the new textual representation into a non-autoregressive language model to generate captions. During inference, we directly sample noise from the Gaussian distribution for generation. Experiments on MSVD, MSR-VTT, and VATEX show that our method outperforms previous non-autoregressive methods and achieves performance comparable to autoregressive methods, with a maximum improvement of 9.9 in CIDEr and 2.6 in B@4, while generating captions faster. The source code will be available soon.
Chinese Translation
当前的视频字幕生成方法通常采用编码器-解码器结构以自回归方式生成文本。然而,自回归方法存在固有的局限性,如生成速度慢和累积误差大。此外,少数非自回归方法由于缺乏足够的多模态交互建模,导致生成质量不足。因此,我们提出了一种基于扩散模型的视频字幕生成非自回归框架(DiffVC)以解决这些问题。其并行解码能够有效解决生成速度和累积误差的问题。同时,我们提出的判别条件扩散模型能够生成更高质量的文本描述。具体而言,我们首先将视频编码为视觉表示。在训练过程中,向真实字幕的文本表示中添加高斯噪声。然后,通过以视觉表示作为条件约束的判别去噪器生成新的文本表示。最后,我们将新的文本表示输入到非自回归语言模型中以生成字幕。在推理过程中,我们直接从高斯分布中采样噪声进行生成。在MSVD、MSR-VTT和VATEX上的实验表明,我们的方法能够超越之前的非自回归方法,并在性能上与自回归方法相当,例如,在CIDEr上实现了最高9.9的提升,在B@4上提升了2.6,同时具有更快的生成速度。源代码将很快发布。
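The training step in the DiffVC abstract, where Gaussian noise is added to the caption's textual representation, can be sketched as follows. The sketch assumes a standard DDPM-style forward process with a linear beta schedule; the abstract does not specify the schedule or parameterization, so `linear_alpha_bar`, `add_noise`, and all constants here are illustrative assumptions rather than the authors' implementation.

```python
import math
import random

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar

def add_noise(x0, alpha_bar_t, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    x_t = [scale * x + sigma * e for x, e in zip(x0, eps)]
    return x_t, eps

rng = random.Random(0)
schedule = linear_alpha_bar(1000)
text_embedding = [0.5, -1.0, 2.0]      # stand-in for a caption embedding
noisy, eps = add_noise(text_embedding, schedule[500], rng)
```

At inference time the abstract's description corresponds to starting from pure Gaussian noise instead of a noised ground-truth embedding; the denoiser conditioned on the visual representation is not modeled here.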
cs.CV / 74 / 2604.08088
Coordinate-Based Dual-Constrained Autoregressive Motion Generation
基于坐标的双约束自回归运动生成
Abstract
Text-to-motion generation has attracted increasing attention in the research community recently, with potential applications in animation, virtual reality, robotics, and human-computer interaction. Diffusion and autoregressive models are two popular and parallel research directions for text-to-motion generation. However, diffusion models often suffer from error amplification during noise prediction, while autoregressive models exhibit mode collapse due to motion discretization. To address these limitations, we propose a flexible, high-fidelity, and semantically faithful text-to-motion framework, named Coordinate-based Dual-constrained Autoregressive Motion Generation (CDAMD). With motion coordinates as input, CDAMD follows the autoregressive paradigm and leverages diffusion-inspired multi-layer perceptrons to enhance the fidelity of predicted motions. Furthermore, a Dual-Constrained Causal Mask is introduced to guide autoregressive generation, where motion tokens act as priors and are concatenated with textual encodings. Since there is limited work on coordinate-based motion synthesis, we establish new benchmarks for both text-to-motion generation and motion editing. Experimental results demonstrate that our approach achieves state-of-the-art performance in terms of both fidelity and semantic consistency on these benchmarks.
Chinese Translation
文本到运动生成最近在研究界引起了越来越多的关注,具有动画、虚拟现实、机器人技术和人机交互等潜在应用。扩散模型和自回归模型是文本到运动生成的两个热门且平行的研究方向。然而,扩散模型在噪声预测过程中常常遭遇错误放大,而自回归模型由于运动离散化则表现出模式崩溃。为了解决这些局限性,我们提出了一种灵活、高保真且语义忠实的文本到运动框架,命名为基于坐标的双约束自回归运动生成(Coordinate-based Dual-constrained Autoregressive Motion Generation,CDAMD)。CDAMD以运动坐标作为输入,遵循自回归范式,并利用受扩散启发的多层感知器来增强预测运动的保真度。此外,引入了双约束因果掩码以指导自回归生成,其中运动标记作为先验,与文本编码连接在一起。由于基于坐标的运动合成研究较少,我们为文本到运动生成和运动编辑建立了新的基准。实验结果表明,我们的方法在这些基准上在保真度和语义一致性方面均达到了最先进的性能。
cs.CV / 75 / 2604.08106
EPIR: An Efficient Patch Tokenization, Integration and Representation Framework for Micro-expression Recognition
EPIR:一种用于微表情识别的高效补丁标记、集成与表示框架
Abstract
Micro-expression recognition can reveal an individual's genuine emotion at a given moment. Although deep learning-based methods, especially Transformer-based methods, have achieved impressive results, these methods have high computational complexity due to the large number of tokens in the multi-head self-attention. In addition, the existing micro-expression datasets are small-scale, which makes it difficult for Transformer-based models to learn effective micro-expression representations. Therefore, we propose a novel Efficient Patch tokenization, Integration and Representation framework (EPIR), which can balance high recognition performance and low computational complexity. Specifically, we first propose a dual norm shifted tokenization (DNSPT) module to learn the spatial relationship between neighboring pixels in the face region, which is implemented by a refined spatial transformation and dual norm projection. Then, we propose a token integration module to integrate partial tokens among multiple cascaded Transformer blocks, thereby reducing the number of tokens without information loss. Furthermore, we design a discriminative token extractor, which first improves the attention in the Transformer block to reduce the unnecessary focus of the attention calculation on self-tokens, and uses the dynamic token selection module (DTSM) to select key tokens, thereby capturing more discriminative micro-expression representations. We conduct extensive experiments on four popular public datasets (i.e., CASME II, SAMM, SMIC, and CAS(ME)$^3$). The experimental results show that our method achieves significant performance gains over the state-of-the-art methods, such as a 9.6% improvement on the CAS(ME)$^3$ dataset in terms of UF1 and a 4.58% improvement on the SMIC dataset in terms of UAR.
Chinese Translation
微表情识别能够获取个体在当前时刻的真实情感。尽管基于深度学习的方法,特别是基于Transformer的方法,已经取得了令人瞩目的成果,但由于多头自注意力中标记数量庞大,这些方法的计算复杂度较高。此外,现有的微表情数据集规模较小,这使得基于Transformer的模型难以学习有效的微表情表示。因此,我们提出了一种新颖的高效补丁标记、集成与表示框架(EPIR),旨在平衡高识别性能与低计算复杂度。具体而言,我们首先提出了一个双范数偏移标记化(DNSPT)模块,以学习面部区域相邻像素之间的空间关系,该模块通过精细的空间变换和双范数投影来实现。然后,我们提出了一个标记集成模块,以在多个级联的Transformer块之间集成部分标记,从而在不丢失信息的情况下减少标记数量。此外,我们设计了一个区分性标记提取器,该提取器首先改善Transformer块中的注意力,以减少注意力计算对自标记的不必要关注,并使用动态标记选择模块(DTSM)来选择关键标记,从而捕获更多区分性的微表情表示。我们在四个流行的公共数据集(即CASME II、SAMM、SMIC和CAS(ME)3)上进行了广泛的实验。实验结果表明,我们的方法在性能上显著优于最先进的方法,例如在CAS(ME)$^3$数据集上UF1指标提高了9.6%,在SMIC数据集上UAR指标提高了4.58%。
cs.CV / 76 / 2604.08110
OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation
OV-Stitcher:一种用于无训练开放词汇语义分割的全局上下文感知框架
Abstract
Training-free open-vocabulary semantic segmentation (TF-OVSS) has recently attracted attention for its ability to perform dense prediction by leveraging the pretrained knowledge of large vision and vision-language models, without requiring additional training. However, due to the limited input resolution of these pretrained encoders, existing TF-OVSS methods commonly adopt a sliding-window strategy that processes cropped sub-images independently. While effective for managing high-resolution inputs, this approach prevents global attention over the full image, leading to fragmented feature representations and limited contextual reasoning. We propose OV-Stitcher, a training-free framework that addresses this limitation by stitching fragmented sub-image features directly within the final encoder block. By reconstructing attention representations from fragmented sub-image features, OV-Stitcher enables global attention within the final encoder block, producing coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluations across eight benchmarks demonstrate that OV-Stitcher establishes a scalable and effective solution for open-vocabulary segmentation, achieving a notable improvement in mean Intersection over Union (mIoU) from 48.7 to 50.7 compared with prior training-free baselines.
Chinese Translation
无训练的开放词汇语义分割(Training-free open-vocabulary semantic segmentation,TF-OVSS)因其能够利用大型视觉和视觉-语言模型的预训练知识进行密集预测而引起了关注,无需额外的训练。然而,由于这些预训练编码器的输入分辨率有限,现有的TF-OVSS方法通常采用滑动窗口策略,独立处理裁剪的子图像。尽管这种方法在管理高分辨率输入方面有效,但它阻碍了对整个图像的全局注意力,导致特征表示的碎片化和上下文推理的局限性。我们提出了OV-Stitcher,这是一种无训练的框架,通过在最终编码器块内直接拼接碎片化的子图像特征来解决这一限制。通过从碎片化的子图像特征重建注意力表示,OV-Stitcher在最终编码器块内实现了全局注意力,从而生成连贯的上下文聚合和空间一致、语义对齐的分割图。对八个基准的广泛评估表明,OV-Stitcher为开放词汇分割建立了一个可扩展且有效的解决方案,与之前的无训练基线相比,平均交并比(mean Intersection over Union,mIoU)显著提高,从48.7提升至50.7。
cs.CV / 77 / 2604.08120
Small Vision-Language Models are Smart Compressors for Long Video Understanding
小型视觉语言模型是长视频理解的智能压缩器
Abstract
Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free $O(1)$ dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
Chinese Translation
将多模态大型语言模型(MLLMs)应用于长达一小时的视频时,受限于上下文的限制。密集的视觉流消耗了令牌预算,并加剧了“迷失在中间”现象。现有的启发式方法,如稀疏采样或均匀池化,盲目地牺牲了保真度,通过丢弃关键时刻和在无关背景上浪费带宽。我们提出了Tempo,一个高效的查询感知框架,用于压缩长视频以便于下游理解。Tempo利用小型视觉语言模型(SVLM)作为局部时间压缩器,将令牌减少视为一种早期跨模态蒸馏过程,以在单次前向传递中生成紧凑的、与意图对齐的表示。为了在不打破因果关系的情况下强制执行严格的预算,我们引入了自适应令牌分配(ATA)。ATA利用SVLM的零样本相关性先验和语义前置,作为一种无训练的$O(1)$动态路由器。它为查询关键段分配密集带宽,同时将冗余压缩为最小的时间锚点,以保持全局故事线。大量实验表明,我们的6B架构在激进的动态压缩(0.5-16令牌/帧)下实现了最先进的性能。在极长的LVBench(4101秒)上,Tempo在严格的8K视觉预算下得分52.3,超越了GPT-4o和Gemini 1.5 Pro。扩展到2048帧时得分达到53.7。重要的是,Tempo将一小时长的视频压缩至远低于理论极限,证明真正的长格式视频理解依赖于以意图驱动的效率,而非贪婪填充的上下文窗口。
cs.CV / 78 / 2604.08121
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
Uni-ViGU:基于扩散的视频生成器实现统一的视频生成与理解
Abstract
Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.
Chinese Translation
融合视觉理解与生成的统一多模态模型面临一个根本性挑战:视觉生成尤其是视频生成的计算成本远高于理解。这种不平衡促使我们颠覆传统范式:我们不再从理解为中心的大型多模态语言模型(MLLMs)扩展至生成,而是提出Uni-ViGU框架,通过扩展视频生成器作为基础,实现视频生成与理解的统一。我们引入了一种统一流方法,在单一流程中对视频执行连续流匹配,对文本执行离散流匹配,从而实现连贯的多模态生成。进一步地,我们提出了一种基于专家模型(MoE)的模态驱动框架,通过轻量级层增强Transformer模块以支持文本生成,同时保持生成先验。为了将生成知识转用于理解,我们设计了双向训练机制,包含两个阶段:知识回忆阶段重构输入提示以利用已学的文本-视频对应关系,能力精炼阶段则通过细粒度字幕微调以建立判别性的共享表示。实验结果表明,Uni-ViGU在视频生成与理解任务上均取得了竞争性表现,验证了以生成为核心的架构作为实现统一多模态智能的可扩展路径。项目主页及代码:https://fr0zencrane.github.io/uni-vigu-page/。
cs.CV / 79 / 2604.08125
PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction
PolySLGen:多方互动中的在线多模态说听反应生成
Abstract
Human-like multimodal reaction generation is essential for natural group interactions between humans and embodied AI. However, existing approaches are limited to single-modality or speaking-only responses in dyadic interactions, making them unsuitable for realistic social scenarios. Many also overlook nonverbal cues and complex dynamics of polyadic interactions, both critical for engagement and conversational coherence. In this work, we present PolySLGen, an online framework for Polyadic multimodal Speaking and Listening reaction Generation. Given past conversation and motion from all participants, PolySLGen generates a future speaking or listening reaction for a target participant, including speech, body motion, and speaking state score. To model group interactions effectively, we propose a pose fusion module and a social cue encoder that jointly aggregate motion and social signals from the group. Extensive experiments, along with quantitative and qualitative evaluations, show that PolySLGen produces contextually appropriate and temporally coherent multi-modal reactions, outperforming several adapted and state-of-the-art baselines in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism.
Chinese Translation
类人多模态反应生成对于人类与具身人工智能之间的自然群体互动至关重要。然而,现有的方法仅限于单一模态或双人互动中的说话响应,因而不适用于现实社交场景。许多方法还忽视了非语言线索和多方互动的复杂动态,而这两者对于参与感和对话连贯性都是关键。在本研究中,我们提出了PolySLGen,一个用于多方多模态说听反应生成的在线框架。给定所有参与者的过去对话和动作,PolySLGen为目标参与者生成未来的说话或听取反应,包括语音、身体动作和说话状态评分。为了有效建模群体互动,我们提出了一个姿态融合模块和一个社交线索编码器,联合聚合来自群体的动作和社交信号。大量实验以及定量和定性评估表明,PolySLGen生成的多模态反应在上下文适宜性和时间连贯性上表现优异,超越了几种改编和最先进的基线,在动作质量、动作-语音对齐、说话状态预测和人类感知的真实感方面均表现突出。
cs.CV / 80 / 2604.08138
Bag of Bags: Adaptive Visual Vocabularies for Genizah Join Image Retrieval
袋中之袋:用于 Genizah 连接图像检索的自适应视觉词汇
Abstract
A join is a set of manuscript fragments identified as originally emanating from the same manuscript. We study manuscript join retrieval: given a query image of a fragment, retrieve other fragments originating from the same physical manuscript. We propose Bag of Bags (BoB), an image-level representation that replaces the global-level visual codebook of classical Bag of Words (BoW) with a fragment-specific vocabulary of local visual words. Our pipeline trains a sparse convolutional autoencoder on binarized fragment patches, encodes connected components from each page, clusters the resulting embeddings with per-image $k$-means, and compares images using set-to-set distances between their local vocabularies. Evaluated on fragments from the Cairo Genizah, the best BoB variant (viz. Chamfer) achieves Hit@1 of 0.78 and MRR of 0.84, compared to 0.74 and 0.80, respectively, for the strongest BoW baseline (BoW-RawPatches-$\chi^2$), a 6.1% relative improvement in top-1 accuracy. We furthermore study a mass-weighted BoB-OT variant that incorporates cluster population into prototype matching and present a formal approximation guarantee bounding its deviation from full component-level optimal transport. A two-stage pipeline using a BoW shortlist followed by BoB-OT reranking provides a practical compromise between retrieval strength and computational cost, supporting applicability to larger manuscript collections.
Chinese Translation
连接是指一组手稿碎片,这些碎片被识别为最初来自同一手稿。我们研究手稿连接检索:给定一个碎片的查询图像,检索其他来自同一物理手稿的碎片。我们提出了袋中之袋(Bag of Bags,BoB),这是一种图像级表示,它用特定于碎片的局部视觉词汇替代了经典的词袋(Bag of Words,BoW)中的全局视觉词典。我们的流程在二值化的碎片补丁上训练稀疏卷积自编码器,从每一页编码连通组件,使用每图像的 $k$-均值对结果嵌入进行聚类,并通过其局部词汇之间的集合到集合距离来比较图像。在开罗 Genizah 的碎片上进行评估,最佳的 BoB 变体(即 Chamfer)实现了 0.78 的 Hit@1 和 0.84 的 MRR,相比之下,最强的 BoW 基线(BoW-RawPatches-$\chi^2$)分别为 0.74 和 0.80,top-1 准确率相对提高了 6.1%。此外,我们还研究了一种质量加权的 BoB-OT 变体,该变体将聚类规模纳入原型匹配,并提供了一个正式的近似保证,界定其与完整组件级最优传输的偏差。使用 BoW 候选列表后跟 BoB-OT 重新排序的两阶段流程在检索强度和计算成本之间提供了实用的折衷,支持对更大手稿集合的适用性。
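The set-to-set comparison behind the Chamfer variant of Bag of Bags can be sketched in a few lines. This assumes each fragment is already summarized by a small set of prototype vectors (its per-image $k$-means centroids) and uses a plain symmetric, mean-of-nearest-neighbor Chamfer distance with Euclidean ground distance; the exact variant and normalization used in the paper are not given in the abstract, and the toy vocabularies are made up for illustration.

```python
import math

def chamfer_distance(set_a, set_b):
    """Symmetric Chamfer distance between two non-empty sets of vectors."""
    def nearest(point, others):
        # Distance from one prototype to its closest prototype in the other set.
        return min(math.dist(point, o) for o in others)
    a_to_b = sum(nearest(a, set_b) for a in set_a) / len(set_a)
    b_to_a = sum(nearest(b, set_a) for b in set_b) / len(set_b)
    return a_to_b + b_to_a

# Toy local vocabularies (hypothetical 2-D prototypes for three fragments).
query = [(0.0, 0.0), (1.0, 0.0)]
same_manuscript = [(0.0, 0.0), (1.0, 0.0)]
other_manuscript = [(0.0, 3.0), (1.0, 3.0)]

gallery = {"same": same_manuscript, "other": other_manuscript}
# Retrieval: rank gallery fragments by increasing distance to the query.
ranking = sorted(gallery, key=lambda name: chamfer_distance(query, gallery[name]))
```

Because each image carries its own vocabulary, no shared codebook is needed at comparison time; the cost is one nearest-neighbor search per prototype pair of images.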
cs.CV / 81 / 2604.08159
Face-D$^2$CL: Multi-Domain Synergistic Representation with Dual Continual Learning for Facial DeepFake Detection
Face-D$^2$CL:基于双重持续学习的多域协同表示用于面部深度伪造检测
Abstract
The rapid advancement of facial forgery techniques poses severe threats to public trust and information security, making facial DeepFake detection a critical research priority. Continual learning provides an effective approach to adapt facial DeepFake detection models to evolving forgery patterns. However, existing methods face two key bottlenecks in real-world continual learning scenarios: insufficient feature representation and catastrophic forgetting. To address these issues, we propose Face-D$^2$CL, a framework for facial DeepFake detection. It leverages multi-domain synergistic representation to fuse spatial and frequency-domain features for the comprehensive capture of diverse forgery traces, and employs a dual continual learning mechanism that combines Elastic Weight Consolidation (EWC), which distinguishes parameter importance for real versus fake samples, and Orthogonal Gradient Constraint (OGC), which ensures updates to task-specific adapters do not interfere with previously learned knowledge. This synergy enables the model to achieve a dynamic balance between robust anti-forgetting capabilities and agile adaptability to emerging facial forgery paradigms, all without relying on historical data replay. Extensive experiments demonstrate that our method surpasses current SOTA approaches in both stability and plasticity, achieving a 60.7% relative reduction in average detection error rate. On unseen forgery domains, it further improves the average detection AUC by 7.9% compared to the current SOTA method.
Chinese Translation
面部伪造技术的快速发展对公众信任和信息安全构成了严重威胁,使得面部深度伪造检测成为一个重要的研究优先事项。持续学习提供了一种有效的方法,以适应面部深度伪造检测模型不断变化的伪造模式。然而,现有方法在实际的持续学习场景中面临两个关键瓶颈:特征表示不足和灾难性遗忘。为了解决这些问题,我们提出了Face-D$^2$CL,这是一个面部深度伪造检测框架。该框架利用多域协同表示融合空间和频域特征,以全面捕捉多样的伪造痕迹,并采用双重持续学习机制,结合了弹性权重巩固(Elastic Weight Consolidation, EWC),该机制区分真实样本与伪造样本的参数重要性,以及正交梯度约束(Orthogonal Gradient Constraint, OGC),确保任务特定适配器的更新不会干扰先前学习的知识。这种协同作用使模型能够在强大的抗遗忘能力和对新兴面部伪造范式的灵活适应性之间实现动态平衡,且无需依赖历史数据重放。大量实验表明,我们的方法在稳定性和可塑性方面均优于当前的最先进方法(SOTA),实现了平均检测错误率的相对降低60.7%。在未见伪造领域上,与当前的SOTA方法相比,平均检测AUC进一步提高了7.9%。
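The idea of constraining updates so they do not interfere with earlier knowledge can be illustrated with generic orthogonal gradient projection. This is a sketch of the general technique, not the paper's Orthogonal Gradient Constraint: it assumes that "protected" directions from earlier tasks are stored as plain vectors, and the helper names `orthonormalize` and `project_out` are hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def orthonormalize(directions, eps=1e-12):
    """Gram-Schmidt: turn stored directions into an orthonormal basis."""
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > eps:  # skip directions already spanned by the basis
            basis.append([ri / norm for ri in r])
    return basis

def project_out(grad, basis):
    """Remove from grad every component lying in the span of basis."""
    g = list(grad)
    for b in basis:
        c = dot(g, b)
        g = [gi - c * bi for gi, bi in zip(g, b)]
    return g

# Directions assumed important for previously learned forgery types.
old_dirs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormalize(old_dirs)
# Gradient for the new task, restricted to the orthogonal complement.
new_grad = project_out([2.0, -3.0, 5.0], basis)
```

After projection the update has zero inner product with every stored direction, so a small step along it leaves those directions unchanged to first order.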
cs.CV / 82 / 2604.08167
T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation
T-门适配器:一种轻量级的时序适配器用于视觉-语言医学分割
Abstract
Medical image segmentation traditionally relies on fully supervised 3D architectures that demand a large amount of dense, voxel-level annotations from clinical experts, which is a prohibitively expensive process. Vision Language Models (VLMs) offer a powerful alternative by leveraging broad visual semantic representations learned from billions of images. However, when applied independently to 2D slices of a 3D scan, these models often produce noisy and anatomically implausible segmentations that violate the inherent continuity of anatomical structures. We propose a temporal adapter that addresses this by injecting adjacent-slice context directly into the model's visual token representations. The adapter comprises a temporal transformer attending across a fixed context window at the token level, a spatial context block refining within-slice representations, and an adaptive gate balancing temporal and single-slice features. Trained on 30 labeled volumes from the FLARE22 dataset, our method achieves a mean Dice of 0.704 across 13 abdominal organs, a gain of +0.206 over the baseline VLM trained with no temporal context. Zero-shot evaluation on the BTCV and AMOS22 datasets yields consistent improvements of +0.210 and +0.230, with the average cross-domain performance drop reducing from 38.0% to 24.9%. Furthermore, in a cross-modality evaluation on AMOS22 MRI with neither model receiving any MRI supervision, our method achieves a mean Dice of 0.366, outperforming a fully supervised 3D baseline (DynUNet, 0.224) trained exclusively on CT, suggesting that CLIP's visual semantic representations generalize more gracefully across imaging modalities than convolutional features.
Chinese Translation
医学图像分割传统上依赖于完全监督的三维架构,这需要临床专家提供大量密集的体素级注释,这一过程成本高昂。视觉语言模型(VLMs)提供了一种强大的替代方案,通过利用从数十亿张图像中学习的广泛视觉语义表示。然而,当这些模型独立应用于三维扫描的二维切片时,往往会产生噪声和解剖上不合理的分割,违反了解剖结构的内在连续性。我们提出了一种时序适配器,通过将相邻切片的上下文直接注入模型的视觉标记表示来解决这一问题。该适配器由一个时序变换器组成,能够在标记级别的固定上下文窗口内进行关注,一个空间上下文块用于细化切片内表示,以及一个自适应门来平衡时序和单切片特征。在FLARE22数据集上对30个标记体积进行训练,我们的方法在13个腹部器官上实现了平均Dice系数为0.704,相较于未使用时序上下文的基线VLM提高了+0.206。在BTCV和AMOS22数据集上的零样本评估显示出一致的改善,分别为+0.210和+0.230,跨域性能下降的平均值从38.0%降低至24.9%。此外,在AMOS22 MRI的跨模态评估中,两个模型均未接受任何MRI监督,我们的方法实现了平均Dice系数为0.366,优于仅在CT上训练的完全监督三维基线(DynUNet, 0.224),这表明CLIP的视觉语义表示在不同成像模态之间的泛化能力优于卷积特征。
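The "adaptive gate balancing temporal and single-slice features" in the T-Gated Adapter abstract can be pictured as a per-feature convex blend. The sketch below assumes a sigmoid gate and takes the gate logits as given inputs; in a real adapter they would be produced by a learned layer, and the abstract does not specify the gate's actual form, so `gated_fuse` is an illustrative stand-in.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(single, temporal, gate_logits):
    """Blend per feature: g * temporal + (1 - g) * single, g = sigmoid(logit)."""
    fused = []
    for s, t, z in zip(single, temporal, gate_logits):
        g = sigmoid(z)
        fused.append(g * t + (1.0 - g) * s)
    return fused

single_slice = [1.0, 1.0, 1.0]   # features from the current slice only
with_context = [3.0, 3.0, 3.0]   # features after attending to adjacent slices
# Gate nearly closed, half open, nearly fully open (hypothetical logits).
fused = gated_fuse(single_slice, with_context, [-50.0, 0.0, 50.0])
```

A gate near zero falls back to the single-slice behavior of the base VLM, which is one plausible reason to use a gate rather than always adding temporal context.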
cs.CV / 83 / 2604.08171
OceanMAE: A Foundation Model for Ocean Remote Sensing
OceanMAE:一种用于海洋遥感的基础模型
Abstract
Accurate ocean mapping is essential for applications such as bathymetry estimation, seabed characterization, marine litter detection, and ecosystem monitoring. However, ocean remote sensing (RS) remains constrained by limited labeled data and by the reduced transferability of models pre-trained mainly on land-dominated Earth observation imagery. In this paper, we propose OceanMAE, an ocean-specific masked autoencoder that extends standard MAE pre-training by integrating multispectral Sentinel-2 observations with physically meaningful ocean descriptors during self-supervised learning. By incorporating these auxiliary ocean features, OceanMAE is designed to learn more informative and ocean-aware latent representations from large-scale unlabeled data. To transfer these representations to downstream applications, we further employ a modified UNet-based framework for marine segmentation and bathymetry estimation. Pre-trained on the Hydro dataset, OceanMAE is evaluated on MADOS and MARIDA for marine pollutant and debris segmentation, and on MagicBathyNet for bathymetry regression. The experiments show that OceanMAE yields the strongest gains on marine segmentation, while bathymetry benefits are competitive and task-dependent. In addition, an ablation against a standard MAE on MARIDA indicates that incorporating auxiliary ocean descriptors during pre-training improves downstream segmentation quality. These findings highlight the value of physically informed and domain-aligned self-supervised pre-training for ocean RS. Code and weights are publicly available at https://git.tu-berlin.de/joanna.stamer/SSLORS2.
Chinese Translation
准确的海洋制图对于水深估计、海底特征描述、海洋垃圾检测和生态系统监测等应用至关重要。然而,海洋遥感(RS)仍然受到有限标注数据的限制,以及主要基于陆地主导的地球观测影像预训练模型的可迁移性降低。在本文中,我们提出了OceanMAE,一种特定于海洋的掩蔽自编码器,它通过在自监督学习过程中将多光谱的Sentinel-2观测与物理意义明确的海洋描述符相结合,扩展了标准MAE的预训练。通过结合这些辅助海洋特征,OceanMAE旨在从大规模未标注数据中学习更具信息量和海洋感知的潜在表示。为了将这些表示转移到下游应用中,我们进一步采用了基于UNet的修改框架进行海洋分割和水深估计。在Hydro数据集上进行预训练后,OceanMAE在MADOS和MARIDA上进行了海洋污染物和碎片分割的评估,并在MagicBathyNet上进行了水深回归实验。实验结果表明,OceanMAE在海洋分割任务上取得了最显著的提升,而水深估计的收益则具有竞争力且依赖于具体任务。此外,与MARIDA上的标准MAE进行的消融实验表明,在预训练过程中引入辅助海洋描述符可以提高下游分割质量。这些发现突显了物理信息和领域对齐的自监督预训练在海洋遥感中的价值。代码和权重可在 https://git.tu-berlin.de/joanna.stamer/SSLORS2 上公开获取。
cs.CV / 84 / 2604.08172
On the Global Photometric Alignment for Low-Level Vision
低级视觉的全局光度对齐
Abstract
Supervised low-level vision models rely on pixel-wise losses against paired references, yet paired training sets exhibit per-pair photometric inconsistency: different image pairs demand different global brightness, color, or white-balance mappings. This inconsistency enters through task-intrinsic photometric transfer (e.g., low-light enhancement) or unintended acquisition shifts (e.g., de-raining), and in either case causes an optimization pathology. Standard reconstruction losses allocate disproportionate gradient budget to conflicting per-pair photometric targets, crowding out content restoration. In this paper, we investigate this issue and prove that, under least-squares decomposition, the photometric and structural components of the prediction-target residual are orthogonal, and that the spatially dense photometric component dominates the gradient energy. Motivated by this analysis, we propose Photometric Alignment Loss (PAL). This flexible supervision objective discounts nuisance photometric discrepancy via closed-form affine color alignment while preserving restoration-relevant supervision, requiring only covariance statistics and a tiny matrix inversion with negligible overhead. Across 6 tasks, 16 datasets, and 16 architectures, PAL consistently improves metrics and generalization. The implementation is in the appendix.
Chinese Translation
监督式低级视觉模型依赖于与配对参考的像素级损失,然而配对训练集在每对之间存在光度不一致性,即不同的图像对需要不同的全局亮度、颜色或白平衡映射。这种不一致性通过任务内在的光度转移(例如,低光增强)或意外的采集偏移(例如,去雨)引入,在这两种情况下都会导致优化病态。标准重建损失将不成比例的梯度预算分配给相互冲突的每对光度目标,从而抑制内容恢复。在本文中,我们研究了这个问题,并证明在最小二乘分解下,预测-目标残差的光度和结构成分是正交的,并且空间上密集的光度成分主导了梯度能量。基于这一分析,我们提出了光度对齐损失(Photometric Alignment Loss, PAL)。这一灵活的监督目标通过封闭形式的仿射颜色对齐来抵消干扰光度差异,同时保留与恢复相关的监督,仅需协方差统计和微小的矩阵求逆,开销可忽略不计。在6个任务、16个数据集和16种架构中,PAL始终提高了指标和泛化能力。实现细节见附录。
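The closed-form affine alignment described in the PAL abstract can be illustrated in its simplest, single-channel form. This sketch fits a scalar gain and bias by least squares from mean and covariance statistics, then measures the residual after alignment; the abstract describes a color (multi-channel) map with a small matrix inversion, so the scalar version here is a deliberate simplification, and the choice to align the prediction toward the target is an assumption.

```python
def affine_align(pred, target):
    """Least-squares gain a and bias b so that a * pred + b best matches target."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov_pt = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    a = cov_pt / var_p if var_p > 0 else 1.0  # flat prediction: keep unit gain
    b = mean_t - a * mean_p
    return a, b

def plain_l2(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def aligned_l2(pred, target):
    """Mean squared error after removing the best global gain/bias mismatch."""
    a, b = affine_align(pred, target)
    return sum((a * p + b - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# The target has the same structure as the prediction but is globally
# brightened (gain 2, bias 0.1), a purely photometric discrepancy.
pred = [0.1, 0.2, 0.4, 0.5]
target = [0.3, 0.5, 0.9, 1.1]
a, b = affine_align(pred, target)
```

Here the plain squared error is dominated by the global brightness mismatch, while the aligned error is essentially zero, which is the kind of nuisance discrepancy the abstract says PAL discounts.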
cs.CV / 85 / 2604.08203
MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning
MedVR:通过自主强化学习实现无注释的医学视觉推理
Abstract
Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement. Without any human annotations for intermediate steps, MedVR achieves state-of-the-art performance on diverse public medical VQA benchmarks, significantly outperforming existing models. By learning to reason directly with visual evidence, MedVR promotes the robustness and transparency essential for accelerating the clinical deployment of medical AI.
Chinese Translation
医学视觉语言模型(VLMs)在复杂临床任务中展现出巨大潜力,但其推理能力常常受到仅依赖文本的范式的限制,这种范式无法将推理与视觉证据相结合。这一限制不仅削弱了对需要细致视觉分析的任务的表现,还在安全关键应用中引入了视觉幻觉的风险。因此,我们提出了MedVR,这是一种新颖的强化学习框架,能够实现医学VLMs的无注释视觉推理。其核心创新在于两个协同机制:熵引导的视觉重定位(Entropy-guided Visual Regrounding, EVR)利用模型的不确定性来指导探索,而基于共识的信用分配(Consensus-based Credit Assignment, CCA)则从回滚一致性中提炼伪监督。在没有任何中间步骤的人类注释的情况下,MedVR在多样的公共医学视觉问答基准上达到了最先进的性能,显著超越了现有模型。通过直接学习与视觉证据进行推理,MedVR促进了加速医学人工智能临床部署所必需的稳健性和透明性。
cs.CV / 86 / 2604.08209
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering
OmniJigsaw:通过模态协调重排序增强全模态推理
Abstract
To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
Chinese Translation
为了将强化学习后训练范式扩展到全模态模型,以同时增强视频-音频理解和协同推理,我们提出了OmniJigsaw,这是一种基于时间重排序代理任务的通用自监督框架。该范式以打乱的视听片段的时间重建为中心,战略性地协调视觉和听觉信号,通过三种不同的策略促使跨模态整合:联合模态整合、样本级模态选择和片段级模态屏蔽。我们认识到,此类代理任务的有效性与拼图质量密切相关,因此我们设计了一个两阶段的粗到细数据过滤管道,以便有效地将OmniJigsaw适应于大量未标注的全模态数据。我们的分析揭示了联合模态整合中的“二模态捷径现象”,并表明细粒度的片段级模态屏蔽可以缓解此问题,同时优于样本级模态选择。在15个基准上的广泛评估显示,在视频、音频和协同推理方面取得了显著提升,验证了OmniJigsaw作为自监督全模态学习的可扩展范式的有效性。
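The temporal reordering proxy task at the core of OmniJigsaw can be sketched as puzzle construction plus a check that a predicted ordering restores the original sequence. The sketch treats clips as opaque labels and ignores the three modality strategies and the data filtering pipeline; `make_jigsaw` and `restore` are hypothetical helper names, and how the model's answer is rewarded is not specified in the abstract.

```python
import random

def make_jigsaw(clips, seed=0):
    """Shuffle clips; return (shuffled, order), where order[k] is the position
    in `shuffled` of the clip that was originally k-th."""
    rng = random.Random(seed)
    perm = list(range(len(clips)))
    rng.shuffle(perm)  # perm[j] = original index of shuffled[j]
    shuffled = [clips[i] for i in perm]
    order = [perm.index(k) for k in range(len(clips))]
    return shuffled, order

def restore(shuffled, order):
    """Apply an ordering to recover the chronological sequence."""
    return [shuffled[j] for j in order]

clips = ["clip_0", "clip_1", "clip_2", "clip_3", "clip_4"]
shuffled, order = make_jigsaw(clips, seed=42)
# A model solving the puzzle would have to predict `order` from `shuffled`.
recovered = restore(shuffled, order)
```

Because the ground-truth ordering comes for free from the original timeline, the task needs no human annotation, which is what makes it usable on large unlabeled audio-visual collections.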
cs.CV / 87 / 2604.08211
SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection
SciFigDetect:AI生成科学图形检测的基准
Abstract
Modern multimodal generators can now produce scientific figures at near-publishable quality, creating a new challenge for visual forensics and research integrity. Unlike conventional AI-generated natural images, scientific figures are structured, text-dense, and tightly aligned with scholarly semantics, making them a distinct and difficult detection target. However, existing AI-generated image detection benchmarks and methods are almost entirely developed for open-domain imagery, leaving this setting largely unexplored. We present the first benchmark for AI-generated scientific figure detection. To construct it, we develop an agent-based data pipeline that retrieves licensed source papers, performs multimodal understanding of paper text and figures, builds structured prompts, synthesizes candidate figures, and filters them through a review-driven refinement loop. The resulting benchmark covers multiple figure categories, multiple generation sources, and aligned real-synthetic pairs. We benchmark representative detectors under zero-shot, cross-generator, and degraded-image settings. Results show that current methods fail dramatically in zero-shot transfer, exhibit strong generator-specific overfitting, and remain fragile under common post-processing corruptions. These findings reveal a substantial gap between existing AIGI detection capabilities and the emerging distribution of high-quality scientific figures. We hope this benchmark can serve as a foundation for future research on robust and generalizable scientific-figure forensics. The dataset is available at https://github.com/Joyce-yoyo/SciFigDetect.
Chinese Translation
现代多模态生成器现在能够以接近可发布的质量生成科学图形,这为视觉取证和研究诚信带来了新的挑战。与传统的AI生成自然图像不同,科学图形是结构化的、文本密集的,并且与学术语义紧密对齐,使其成为一个独特且难以检测的目标。然而,现有的AI生成图像检测基准和方法几乎完全是为开放领域图像开发的,因此这一领域尚未得到充分探索。我们提出了第一个AI生成科学图形检测的基准。为了构建这个基准,我们开发了一个基于代理的数据管道,该管道检索许可的源论文,进行论文文本和图形的多模态理解,构建结构化提示,合成候选图形,并通过审查驱动的精炼循环进行过滤。最终的基准涵盖了多个图形类别、多个生成源以及对齐的真实-合成配对。我们在零样本、跨生成器和降级图像设置下对代表性的检测器进行了基准测试。结果表明,当前的方法在零样本迁移中表现不佳,表现出强烈的生成器特定过拟合,并且在常见后处理损坏下仍然脆弱。这些发现揭示了现有的AI生成图像检测能力与新兴的高质量科学图形分布之间存在显著差距。我们希望这个基准能够为未来关于稳健和可推广的科学图形取证的研究奠定基础。数据集可在 https://github.com/Joyce-yoyo/SciFigDetect 获取。
cs.CV / 88 / 2604.08212
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
用于全面自动化路面状况评估的视觉-语言基础模型
Abstract
General-purpose vision-language models demonstrate strong performance in everyday domains but struggle with specialized technical fields requiring precise terminology, structured reasoning, and adherence to engineering standards. This work addresses whether domain-specific instruction tuning can enable comprehensive pavement condition assessment through vision-language models. PaveInstruct, a dataset containing 278,889 image-instruction-response pairs spanning 32 task types, was created by unifying annotations from nine heterogeneous pavement datasets. PaveGPT, a pavement foundation model trained on this dataset, was evaluated against state-of-the-art vision-language models across perception, understanding, and reasoning tasks. Instruction tuning transformed model capabilities, achieving improvements exceeding 20% in spatial grounding, reasoning, and generation tasks while producing ASTM D6433-compliant outputs. These results enable transportation agencies to deploy unified conversational assessment tools that replace multiple specialized systems, simplifying workflows and reducing technical expertise requirements. The approach establishes a pathway for developing instruction-driven AI systems across infrastructure domains including bridge inspection, railway maintenance, and building condition assessment.
Chinese Translation
通用视觉-语言模型在日常领域表现出色,但在需要精确术语、结构化推理和遵循工程标准的专业技术领域却面临挑战。本研究探讨了领域特定的指令调优是否能够通过视觉-语言模型实现全面的路面状况评估。PaveInstruct 数据集包含 278,889 对跨越 32 种任务类型的图像-指令-响应对,旨在通过统一来自九个异构路面数据集的注释而创建。PaveGPT 是一个在该数据集上训练的路面基础模型,针对感知、理解和推理任务与最先进的视觉-语言模型进行了评估。指令调优显著提升了模型的能力,在空间定位、推理和生成任务中实现了超过 20% 的改进,同时生成符合 ASTM D6433 标准的输出。这些结果使交通机构能够部署统一的对话评估工具,取代多个专业系统,从而简化工作流程并降低技术专业要求。该方法为在基础设施领域(包括桥梁检查、铁路维护和建筑状况评估)开发以指令驱动的人工智能系统奠定了基础。
cs.CV / 89 / 2604.08213
EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization
EditCaption:通过监督微调和直接偏好优化实现与人类对齐的图像编辑指令合成
Abstract
High-quality training triplets (source-target image pairs with precise editing instructions) are a critical bottleneck for scaling instruction-guided image editing models. Vision-language models (VLMs) are widely used for automated instruction synthesis, but we identify three systematic failure modes in image-pair settings: orientation inconsistency (e.g., left/right confusion), viewpoint ambiguity, and insufficient fine-grained attribute description. Human evaluation shows that over 47% of instructions from strong baseline VLMs contain critical errors unusable for downstream training. We propose EditCaption, a scalable two-stage post-training pipeline for VLM-based instruction synthesis. Stage 1 builds a 100K supervised fine-tuning (SFT) dataset by combining GLM automatic annotation, EditScore-based filtering, and human refinement for spatial, directional, and attribute-level accuracy. Stage 2 collects 10K human preference pairs targeting the three failure modes and applies direct preference optimization (DPO) for alignment beyond SFT alone. On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL models outperform open-source baselines; the 235B model reaches 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). Human evaluation shows critical errors falling from 47.75% to 23% and correctness rising from 41.75% to 66%. The work offers a practical path to scalable, human-aligned instruction synthesis for image editing data.
Chinese Translation
高质量的训练三元组(具有精确编辑指令的源-目标图像对)是扩展指令引导图像编辑模型的关键瓶颈。视觉-语言模型(VLMs)广泛用于自动化指令合成,但我们在图像对设置中识别出三种系统性失败模式:方向不一致(例如,左右混淆)、视角模糊和细粒度属性描述不足。人类评估显示,超过47%的强基线VLMs生成的指令包含严重错误,无法用于下游训练。我们提出了EditCaption,一个可扩展的两阶段后训练管道,用于基于VLM的指令合成。第一阶段通过结合GLM自动注释、基于EditScore的过滤和人工精炼,构建了一个10万条的监督微调(SFT)数据集,以确保空间、方向和属性层面的准确性。第二阶段收集了1万对针对三种失败模式的人类偏好对,并应用直接偏好优化(DPO)以实现超越SFT的对齐。在Eval-400、ByteMorph-Bench和HQ-Edit上,微调后的Qwen3-VL模型优于开源基线;235B模型在Eval-400上达到4.712(相比于Gemini-3-Pro的4.706,GPT-4.1的4.220,Kimi-K2.5的4.111),在ByteMorph-Bench上达到4.588(相比于Gemini-3-Pro的4.522,GPT-4.1的3.412)。人类评估显示,严重错误从47.75%降至23%,正确率从41.75%上升至66%。该研究为可扩展的、与人类对齐的图像编辑数据指令合成提供了一条实用路径。
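Note: the Stage 2 alignment step above relies on direct preference optimization. The abstract does not give the exact objective or hyperparameters, so the following is only a minimal sketch of the standard DPO loss for a single human preference pair, assuming per-instruction log-probabilities are already available. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not values from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) instruction pair.

    All inputs are sequence log-probabilities; beta is a hypothetical
    temperature, not a value reported in the abstract.
    """
    # Implicit reward margin: how much more strongly the policy prefers the
    # chosen instruction over the rejected one, relative to the frozen
    # reference (SFT) model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)), computed as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the human-preferred instruction.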
cs.CV / 90 / 2604.08230
Generalization Under Scrutiny: Cross-Domain Detection Progresses, Pitfalls, and Persistent Challenges
泛化问题的审视:跨域检测的进展、陷阱与持续挑战
Abstract
Object detection models trained on a source domain often exhibit significant performance degradation when deployed in unseen target domains, owing to variations in sensing conditions, environments, and data distributions. Hence, regardless of recent breakthroughs in deep learning-based detection technology, cross-domain object detection (CDOD) remains a critical research area. Moreover, the existing literature remains fragmented, lacking a unified perspective on the structural challenges underlying domain shift and the effectiveness of adaptation strategies. This survey provides a comprehensive and systematic analysis of CDOD. We start from a problem formulation that highlights the multi-stage nature of object detection under domain shift. Then, we organize the existing methods through a conceptual taxonomy that categorizes approaches based on adaptation paradigms, modeling assumptions, and pipeline components. Furthermore, we analyze how domain shift propagates across detection stages and discuss why adaptation in object detection is inherently more complex than in classification. In addition, we review commonly used datasets, evaluation protocols, and benchmarking practices. Finally, we identify the key challenges and outline promising future research directions. Overall, this survey aims to provide a unified framework for understanding CDOD and to guide the development of more robust detection systems.
Chinese Translation
在源域上训练的目标检测模型在未见过的目标域中部署时,常常会出现显著的性能下降,这主要是由于各种变化因素,如感知条件、环境和数据分布。因此,尽管基于深度学习的检测技术最近取得了突破性进展,跨域目标检测(CDOD)仍然是一个关键的研究领域。此外,现有文献仍然较为零散,缺乏对域迁移背后结构性挑战及适应策略有效性的统一视角。本调查提供了对CDOD的全面系统分析。我们首先提出一个问题表述,强调在域迁移下目标检测的多阶段特性。然后,我们通过一个概念分类法组织现有方法,该分类法根据适应范式、建模假设和流程组件对方法进行分类。此外,我们分析了域迁移如何在检测阶段之间传播,并讨论了为何目标检测中的适应性本质上比分类更为复杂。此外,我们回顾了常用的数据集、评估协议和基准实践。最后,我们识别出关键挑战并概述了有前景的未来研究方向。整体而言,本调查旨在提供一个统一的框架,以理解CDOD,并指导更强大的检测系统的开发。
cs.CV / 91 / 2604.08238
$\oslash$ Source Models Leak What They Shouldn't $\nrightarrow$: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization
源模型泄露不应泄露的信息:通过对抗优化实现领域适应中的零-shot 反学习
Abstract
The increasing adaptation of vision models across domains, such as satellite imagery and medical scans, has raised an emerging privacy risk: models may inadvertently retain and leak sensitive source-domain specific information in the target domain. This creates a compelling use case for machine unlearning to protect the privacy of sensitive source-domain data. Among adaptation techniques, source-free domain adaptation (SFDA) makes the need for machine unlearning (MU) especially urgent: the source data itself is protected, yet the source model exposed during adaptation encodes its influence. Our experiments reveal that existing SFDA methods exhibit strong zero-shot performance on source-exclusive classes in the target domain, indicating they inadvertently leak knowledge of these classes into the target domain, even when they are not represented in the target data. We identify and address this risk by proposing an MU setting called SCADA-UL: Unlearning Source-exclusive ClAsses in Domain Adaptation. Existing MU methods do not address this setting as they are not designed to handle data distribution shifts. We propose a new unlearning method, where an adversarially generated forget class sample is unlearned by the model during the domain adaptation process using a novel rescaled labeling strategy and adversarial optimization. We also extend our study to two variants: a continual version of this problem setting and one where the specific source classes to be forgotten may be unknown. Alongside theoretical interpretations, our comprehensive empirical results show that our method consistently outperforms baselines in the proposed setting while achieving retraining-level unlearning performance on benchmark datasets. Our code is available at https://github.com/D-Arnav/SCADA
Chinese Translation
视觉模型在不同领域(如卫星图像和医学扫描)的日益适应引发了一种新兴的隐私风险:模型可能无意中在目标领域中保留并泄露敏感的源领域特定信息。这为机器反学习提供了一个保护敏感源领域数据隐私的有力应用场景。在适应技术中,源无关领域适应(SFDA)迫切需要机器反学习(MU),在这种情况下,源数据本身受到保护,但在适应过程中暴露的源模型却编码了其影响。我们的实验表明,现有的 SFDA 方法在目标领域的源独占类别上表现出强大的零-shot 性能,表明它们无意中将这些类别的知识泄露到目标领域,即使这些类别在目标数据中并未出现。我们通过提出一个称为 SCADA-UL 的 MU 设置来识别并解决这一风险:在领域适应中反学习源独占类别。现有的 MU 方法未能解决这一设置,因为它们并未设计用于处理数据分布的变化。我们提出了一种新的反学习方法,在领域适应过程中,通过一种新颖的重标定策略和对抗优化,模型对对抗生成的遗忘类别样本进行反学习。我们还将研究扩展到两个变体:这一问题设置的持续版本,以及特定的源类别可能未知的情况。结合理论解释,我们的全面实证结果表明,在所提出的设置中,我们的方法始终优于基线,同时在基准数据集上实现了与再训练水平相当的反学习性能。我们的代码可在 https://github.com/D-Arnav/SCADA 获取。
cs.CV / 92 / 2604.08261
DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection
DBMF:一种用于分布外检测的双分支多模态框架
Abstract
The complex and dynamic real-world clinical environment demands reliable deep learning (DL) systems. Out-of-distribution (OOD) detection plays a critical role in enhancing the reliability and generalizability of DL models when encountering data that deviate from the training distribution, such as unseen disease cases. However, existing OOD detection methods typically rely either on a single visual modality or solely on image-text matching, failing to fully leverage multimodal information. To overcome this challenge, we propose a novel dual-branch multimodal framework by introducing a text-image branch and a vision branch. Our framework fully exploits multimodal representations to identify OOD samples through these two complementary branches. After training, we compute scores from the text-image branch ($S_t$) and vision branch ($S_v$), and integrate them to obtain the final OOD score $S$ that is compared with a threshold for OOD detection. Comprehensive experiments on publicly available endoscopic image datasets demonstrate that our proposed framework is robust across diverse backbones and improves state-of-the-art performance in OOD detection by up to 24.84%.
Chinese Translation
复杂且动态的现实临床环境要求可靠的深度学习(DL)系统。分布外(OOD)检测在增强DL模型在遇到偏离训练分布的数据(如未见过的疾病案例)时的可靠性和泛化能力方面发挥着关键作用。然而,现有的OOD检测方法通常依赖于单一的视觉模态或仅基于图像-文本匹配,未能充分利用多模态信息。为了解决这一挑战,我们提出了一种新颖的双分支多模态框架,通过引入文本-图像分支和视觉分支来实现。我们的框架充分利用多模态表示,通过这两个互补的分支来识别OOD样本。在训练后,我们从文本-图像分支($S_t$)和视觉分支($S_v$)计算得分,并将其整合以获得最终的OOD得分$S$,该得分与阈值进行比较以进行OOD检测。在公开可用的内窥镜图像数据集上进行的综合实验表明,我们提出的框架在不同的基础模型上表现出稳健性,并将OOD检测的最先进性能提高了多达24.84%。
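Note: the abstract above says the two branch scores $S_t$ and $S_v$ are integrated into a final score $S$ and compared with a threshold, but it does not state the integration rule. The sketch below assumes a simple convex combination and assumes that higher scores mean "more likely OOD"; the weight `alpha`, the function names, and the score direction are all illustrative assumptions rather than the paper's method.

```python
def fuse_ood_score(s_t, s_v, alpha=0.5):
    """Combine text-image score s_t and vision score s_v into one score S.

    alpha is a hypothetical mixing weight; the paper's actual integration
    rule is not given in the abstract.
    """
    return alpha * s_t + (1.0 - alpha) * s_v


def is_ood(s_t, s_v, threshold, alpha=0.5):
    """Flag a sample as OOD when the fused score exceeds the threshold
    (assumes higher score = more likely out-of-distribution)."""
    return fuse_ood_score(s_t, s_v, alpha) > threshold
```

In practice the threshold would be chosen on held-out in-distribution data, for example to fix a target true-positive rate.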
cs.CV / 93 / 2604.08266
Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
Orion-Lite:将大型语言模型推理蒸馏为高效的纯视觉驾驶模型
Abstract
Leveraging the general world knowledge of Large Language Models (LLMs) holds significant promise for improving the ability of autonomous driving systems to handle rare and complex scenarios. While integrating LLMs into Vision-Language-Action (VLA) models has yielded state-of-the-art performance, their massive parameter counts pose severe challenges for latency-sensitive and energy-efficient deployment. Distilling LLM knowledge into a compact driving model offers a compelling solution to retain these reasoning capabilities while maintaining a manageable computational footprint. Although previous works have demonstrated the efficacy of distillation, these efforts have primarily focused on relatively simple scenarios and open-loop evaluations. Therefore, in this work, we investigate LLM distillation in more complex, interactive scenarios under closed-loop evaluation. We demonstrate that, through a combination of latent feature distillation and ground-truth trajectory supervision, an efficient vision-only student model, Orion-Lite, can even surpass the performance of its massive VLA teacher, ORION, setting a new state of the art on the rigorous Bench2Drive benchmark with a Driving Score of 80.6. Ultimately, this reveals that vision-only architectures still possess significant, untapped potential for high-performance reactive planning.
Chinese Translation
利用大型语言模型(LLMs)的一般世界知识在提升自主驾驶系统处理稀有和复杂场景的能力方面具有重要潜力。尽管将LLMs集成到视觉-语言-动作(VLA)模型中已取得了最先进的性能,但其庞大的参数量对延迟敏感和节能的部署构成了严峻挑战。将LLM知识提炼为紧凑的驾驶模型提供了一种引人注目的解决方案,可以在保持可管理的计算负担的同时保留这些推理能力。尽管先前的研究已证明了提炼的有效性,但这些工作主要集中在相对简单的场景和开环评估上。因此,在本研究中,我们探讨了在闭环评估下更复杂的交互场景中的LLM提炼。我们证明,通过潜在特征提炼和真实轨迹监督的结合,高效的仅视觉学生模型Orion-Lite甚至可以超越其庞大的VLA教师模型ORION的性能。在严格的Bench2Drive基准测试中设定了新的最先进记录,驾驶得分为80.6。最终,这表明仅视觉架构仍然具有显著的、未被充分利用的高性能反应规划潜力。
cs.CV / 94 / 2604.08272
Preventing Overfitting in Deep Image Prior for Hyperspectral Image Denoising
防止深度图像先验在高光谱图像去噪中的过拟合
Abstract
Deep image prior (DIP) is an unsupervised deep learning framework that has been successfully applied to a variety of inverse imaging problems. However, DIP-based methods are inherently prone to overfitting, which leads to performance degradation and necessitates early stopping. In this paper, we propose a method to mitigate overfitting in DIP-based hyperspectral image (HSI) denoising by jointly combining robust data fidelity and explicit sensitivity regularization. The proposed approach employs a Smooth $\ell_1$ data term together with a divergence-based regularization and input optimization during training. Experimental results on real HSIs corrupted by Gaussian, sparse, and stripe noise demonstrate that the proposed method effectively prevents overfitting and achieves superior denoising performance compared to state-of-the-art DIP-based HSI denoising methods.
Chinese Translation
深度图像先验(Deep Image Prior, DIP)是一种无监督的深度学习框架,已成功应用于多种逆向成像问题。然而,基于DIP的方法本质上容易出现过拟合,这导致性能下降并需要提前停止。在本文中,我们提出了一种通过联合结合稳健的数据保真性和显式敏感性正则化来减轻DIP在高光谱图像(Hyperspectral Image, HSI)去噪中的过拟合的方法。所提出的方法在训练过程中采用平滑的$\ell_1$数据项,以及基于散度的正则化和输入优化。对受到高斯噪声、稀疏噪声和条纹噪声影响的真实HSI的实验结果表明,所提出的方法有效防止了过拟合,并且相比于最先进的基于DIP的HSI去噪方法,取得了更优的去噪性能。
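Note: the abstract above names a Smooth $\ell_1$ data term but gives no formula. The sketch below uses the common Huber-style definition (quadratic below a transition point, linear above it), which may differ in detail from the paper's choice. The transition parameter `beta` and both function names are illustrative assumptions; the divergence-based regularizer and input optimization are not shown.

```python
def smooth_l1(residual, beta=1.0):
    """Smooth l1 penalty: quadratic for |residual| < beta, linear otherwise.

    beta is a hypothetical transition point, not a value from the paper.
    """
    r = abs(residual)
    if r < beta:
        return 0.5 * r * r / beta
    return r - 0.5 * beta


def data_term(output, noisy):
    """Mean Smooth l1 distance between network output and noisy observation,
    both given as flat sequences of equal length."""
    return sum(smooth_l1(o - n) for o, n in zip(output, noisy)) / len(output)
```

Because the penalty grows only linearly for large residuals, outliers such as sparse or stripe noise pull on the fit less than they would under a squared-error data term.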
cs.CV / 95 / 2604.08282
Revisiting Radar Perception With Spectral Point Clouds
重新审视带有光谱点云的雷达感知
Abstract
Radar perception models are trained with different inputs, from range-Doppler spectra to sparse point clouds. Dense spectra are assumed to outperform sparse point clouds, yet they can vary considerably across sensors and configurations, which hinders transfer. In this paper, we provide alternatives for incorporating spectral information into radar point clouds and show that point clouds need not underperform compared to spectra. We introduce the spectral point cloud paradigm, where point clouds are treated as sparse, compressed representations of the radar spectra, and argue that, when enriched with spectral information, they serve as strong candidates for a unified input representation that is more robust against sensor-specific differences. We develop an experimental framework that compares spectral point cloud (PC) models at varying densities against a dense range-Doppler (RD) benchmark, and report the density levels where the PC configurations meet the performance of the RD benchmark. Furthermore, we experiment with two basic spectral enrichment approaches that inject additional target-relevant information into the point clouds. Contrary to the common belief that the dense RD approach is superior, we show that point clouds can do just as well, and can surpass the RD benchmark when enrichment is applied. Spectral point clouds can therefore serve as strong candidates for unified radar perception, paving the way for future radar foundation models.
Chinese Translation
雷达感知模型使用不同的输入进行训练,从距离-多普勒光谱到稀疏点云。尽管密集光谱被认为优于稀疏点云,但它们在不同传感器和配置之间可能存在显著差异,这阻碍了迁移。在本文中,我们提供了将光谱信息纳入雷达点云的替代方案,并展示了点云在性能上并不一定逊色于光谱。我们引入了光谱点云范式,将点云视为雷达光谱的稀疏压缩表示,并论证了在丰富光谱信息后,它们作为统一输入表示的强有力候选者,更加稳健地抵御传感器特定差异。我们开发了一个实验框架,比较不同密度的光谱点云(PC)模型与密集的距离-多普勒(RD)基准,并报告了PC配置在何种密度水平上达到RD基准的性能。此外,我们还实验了两种基本的光谱增强方法,将额外的目标相关信息注入点云。与普遍认为的密集RD方法优越的观点相反,我们展示了点云在应用增强时可以表现得同样出色,甚至超越RD基准。因此,光谱点云可以作为统一雷达感知的强有力候选者,为未来的雷达基础模型铺平道路。
cs.CV / 96 / 2604.08287
CAMotion: A High-Quality Benchmark for Camouflaged Moving Object Detection in the Wild
CAMotion:野外伪装移动物体检测的高质量基准
Abstract
Discovering camouflaged objects is a challenging task in computer vision due to the high similarity between camouflaged objects and their surroundings. While the problem of camouflaged object detection over sequential video frames has received increasing attention, the scale and diversity of existing video camouflaged object detection (VCOD) datasets are greatly limited, which hinders deeper analysis and broader evaluation of recent deep learning-based algorithms with data-hungry training strategies. To break this bottleneck, in this paper, we construct CAMotion, a high-quality benchmark that covers a wide range of species for camouflaged moving object detection in the wild. CAMotion comprises various sequences with multiple challenging attributes such as uncertain edges, occlusion, motion blur, and shape complexity. The sequence annotation details and statistical distribution are presented from various perspectives, allowing CAMotion to provide in-depth analyses of the camouflaged object's motion characteristics in different challenging scenarios. Additionally, we conduct a comprehensive evaluation of existing SOTA models on CAMotion, and discuss the major challenges in the VCOD task. The benchmark is available at https://www.camotion.focuslab.net.cn. We hope that CAMotion can lead to further advancements in the research community.
Chinese Translation
在计算机视觉中,发现伪装物体是一项具有挑战性的任务,因为伪装物体与其周围环境之间存在高度相似性。尽管关于序列视频帧中伪装物体检测的问题受到了越来越多的关注,但现有视频伪装物体检测(VCOD)数据集的规模和多样性仍然非常有限,这阻碍了对基于深度学习的算法进行更深入的分析和更广泛的评估,因为这些算法通常需要大量的数据进行训练。为了解决这一瓶颈,本文构建了CAMotion,一个涵盖广泛物种的高质量基准,用于野外伪装移动物体检测。CAMotion包含多种序列,具有不确定边缘、遮挡、运动模糊和形状复杂性等多种挑战性属性。我们从不同角度展示了序列注释的详细信息和统计分布,使CAMotion能够对伪装物体在不同挑战场景中的运动特征进行深入分析。此外,我们对现有的最先进(SOTA)模型在CAMotion上的表现进行了全面评估,并讨论了VCOD任务中的主要挑战。该基准可在 https://www.camotion.focuslab.net.cn 获取,我们希望我们的CAMotion能够推动研究社区的进一步发展。
cs.CV / 97 / 2604.08294
Can Vision Language Models Judge Action Quality? An Empirical Evaluation
视觉语言模型能评估动作质量吗?一项实证评估
Abstract
Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations go beyond these biases, pointing to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline for failure modes requiring mitigation prior to reliable real-world deployment.
Chinese Translation
动作质量评估(AQA)在物理治疗、体育教练和竞技裁判等领域具有广泛的应用。尽管视觉语言模型(VLMs)在AQA方面展现出相当大的潜力,但它们在这一领域的实际表现仍然未得到充分表征。我们对最先进的VLMs在不同活动领域(例如健身、花样滑冰、跳水)、任务、表示方式和提示策略进行了全面评估。基线结果显示,Gemini 3.1 Pro、Qwen3-VL和InternVL3.5模型的表现仅略高于随机猜测,尽管像骨架信息的引入、指令的基础、推理结构和上下文学习等策略带来了孤立的提升,但没有一种策略能够始终有效。对预测分布的分析揭示了两种系统性偏差:一种是无论视觉证据如何都倾向于预测正确执行,另一种是对表面语言框架的敏感性。通过对任务进行对比性重构以减轻这些偏差,所获得的改进微乎其微,这表明模型的局限性超出了这些偏差,指向了对细粒度动作质量评估的根本困难。我们的研究为未来基于VLM的AQA研究建立了严格的基线,并提供了一个可操作的框架,以应对在可靠的现实世界应用之前需要缓解的失败模式。
cs.CV / 98 / 2604.08301
GroundingAnomaly: Spatially-Grounded Diffusion for Few-Shot Anomaly Synthesis
GroundingAnomaly:基于空间的少样本异常合成扩散
Abstract
The performance of visual anomaly inspection in industrial quality control is often constrained by the scarcity of real anomalous samples. Consequently, anomaly synthesis techniques have been developed to enlarge training sets and enhance downstream inspection. However, existing methods either suffer from poor integration caused by inpainting or fail to provide accurate masks. To address these limitations, we propose GroundingAnomaly, a novel few-shot anomaly image generation framework. Our framework introduces a Spatial Conditioning Module that leverages per-pixel semantic maps to enable precise spatial control over the synthesized anomalies. Furthermore, a Gated Self-Attention Module is designed to inject conditioning tokens into a frozen U-Net via gated attention layers. This carefully preserves pretrained priors while ensuring stable few-shot adaptation. Extensive evaluations on the MVTec AD and VisA datasets demonstrate that GroundingAnomaly generates high-quality anomalies and achieves state-of-the-art performance across multiple downstream tasks, including anomaly detection, segmentation, and instance-level detection.
Chinese Translation
在工业质量控制中,视觉异常检测的性能常常受到真实异常样本稀缺的限制。因此,开发了异常合成技术以扩大训练集并增强后续检测。然而,现有方法要么因修复而导致整合不良,要么未能提供准确的掩膜。为了解决这些局限性,我们提出了GroundingAnomaly,一种新颖的少样本异常图像生成框架。我们的框架引入了一个空间条件模块,利用逐像素语义图实现对合成异常的精确空间控制。此外,设计了一个门控自注意力模块,通过门控注意力层将条件标记注入冻结的U-Net。这种方法仔细保留了预训练的先验知识,同时确保了稳定的少样本适应性。在MVTec AD和VisA数据集上的广泛评估表明,GroundingAnomaly生成高质量的异常,并在多个下游任务中实现了最先进的性能,包括异常检测、分割和实例级检测。
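Note: the abstract above says conditioning tokens are injected into a frozen U-Net through gated attention layers so that pretrained priors are preserved. The exact gate is not described; the sketch below assumes a zero-initialized tanh gate on a residual branch, a design used in earlier grounded-diffusion work, and shows only the gating arithmetic on plain lists. The names `gated_residual` and `gamma` are illustrative assumptions, and the attention computation itself is taken as given.

```python
import math


def gated_residual(visual_tokens, attended_tokens, gamma=0.0):
    """Add attention output to frozen-backbone features through a scalar gate.

    gamma is a hypothetical learnable scalar. With gamma = 0 the gate is
    tanh(0) = 0, so the output equals the frozen features and the pretrained
    behaviour is untouched at the start of few-shot adaptation.
    """
    gate = math.tanh(gamma)
    return [v + gate * a for v, a in zip(visual_tokens, attended_tokens)]
```

As training moves `gamma` away from zero, the conditioning signal is blended in gradually, which is one plausible reading of "stable few-shot adaptation".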
cs.CV / 99 / 2604.08313
Weakly-Supervised Lung Nodule Segmentation via Training-Free Guidance of 3D Rectified Flow
通过无训练指导的3D校正流实现弱监督肺结节分割
Abstract
Dense annotations, such as segmentation masks, are expensive and time-consuming to obtain, especially for 3D medical images where expert voxel-wise labeling is required. Weakly supervised approaches aim to address this limitation, but often rely on attribution-based methods that struggle to accurately capture small structures such as lung nodules. In this paper, we propose a weakly-supervised segmentation method for lung nodules by combining pretrained state-of-the-art rectified flow and predictor models in a plug-and-play manner. Our approach uses training-free guidance of a 3D rectified flow model, requiring only fine-tuning of the predictor using image-level labels and no retraining of the generative model. The proposed method produces improved-quality segmentations for two separate predictors, consistently detecting lung nodules of varying size and shapes. Experiments on LUNA16 demonstrate improvements over baseline methods, highlighting the potential of generative foundation models as tools for weakly supervised 3D medical image segmentation.
Chinese Translation
密集标注(如分割掩膜)的获取成本高昂且耗时,尤其是在需要专家逐体素标注的3D医学图像中。弱监督方法旨在解决这一限制,但通常依赖于基于归因的方法,这些方法难以准确捕捉小结构,如肺结节。在本文中,我们提出了一种针对肺结节的弱监督分割方法,通过以即插即用的方式结合预训练的最先进的校正流和预测模型。我们的方法使用无训练的3D校正流模型指导,仅需使用图像级标签对预测器进行微调,而无需重新训练生成模型。所提方法为两个独立的预测器生成了更高质量的分割,能够一致地检测不同大小和形状的肺结节。在LUNA16数据集上的实验显示出相较于基线方法的改进,突显了生成基础模型作为弱监督3D医学图像分割工具的潜力。
cs.CV / 100 / 2604.08322
Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data
Fundus-R1:基于知识感知推理的公共数据训练眼底图像阅读多模态大语言模型
Abstract
Fundus imaging such as CFP, OCT and UWF is crucial for the early detection of retinal anomalies and diseases. Fundus image understanding, due to its knowledge-intensive nature, poses a challenging vision-language task. An emerging approach to addressing the task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable amount of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also practically limits research to a few players. To overcome the barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, wherein over 94% of the data are annotated with only image-level labels. Our technical contributions are two-fold. First, we propose a RAG-based method for composing image-specific, knowledge-aware reasoning traces. Such auto-generated traces link visual findings identified by a generic MLLM to the image labels in terms of ophthalmic knowledge. Second, we enhance RLVR with a process reward that encourages self-consistency of the generated reasoning trace in each rollout. Extensive experiments on three fundus-reading benchmarks, i.e., FunBench, Omni-Fundus and GMAI-Fundus, show that Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces. This work paves the way for training powerful fundus-reading MLLMs with publicly available data.
Chinese Translation
眼底成像(如 CFP、OCT 和 UWF)对早期检测视网膜异常和疾病至关重要。由于其知识密集的特性,眼底图像理解构成了一项具有挑战性的视觉-语言任务。解决该任务的一种新兴方法是对一个通用的多模态大语言模型(MLLM)进行后训练,方法包括监督微调(SFT)或通过可验证奖励的强化学习(RLVR),并使用大量内部样本与高质量临床报告配对。然而,这些宝贵的样本并不公开可用,这不仅阻碍了可重复性,还在实际中限制了研究参与者的数量。为克服这一障碍,我们首次尝试使用完全公共的数据集训练一个增强推理能力的眼底阅读 MLLM,我们称之为 Fundus-R1,其中超过 94% 的数据仅用图像级标签进行标注。我们的技术贡献主要有两个方面。首先,我们提出了一种基于 RAG 的方法,用于构建图像特定的知识感知推理轨迹。这些自动生成的轨迹将通用 MLLM 识别的视觉发现与眼科知识中的图像标签联系起来。其次,我们通过一个过程奖励增强 RLVR,鼓励每次回合中生成的推理轨迹的自我一致性。在三个眼底阅读基准(即 FunBench、Omni-Fundus 和 GMAI-Fundus)上的广泛实验表明,Fundus-R1 明显优于多个基线,包括其通用对应模型(Qwen2.5-VL)和一个未使用生成轨迹的更强版本的后训练模型。这项工作为使用公开可用数据训练强大的眼底阅读 MLLM 开辟了道路。
cs.CV / 101 / 2604.08333
Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification
迷失在炒作中:揭示和剖析医学多模态大型语言模型在图像分类中的性能下降
Abstract
The rise of multimodal large language models (MLLMs) has sparked an unprecedented wave of applications in the field of medical imaging analysis. However, as one of the earliest and most fundamental tasks integrated into this paradigm, medical image classification reveals a sobering reality: state-of-the-art medical MLLMs consistently underperform compared to traditional deep learning models, despite their overwhelming advantages in pre-training data and model parameters. This paradox prompts a critical rethinking: where exactly does the performance degradation originate? In this paper, we conduct extensive experiments on 14 open-source medical MLLMs across three representative image classification datasets. Moving beyond superficial performance benchmarking, we employ feature probing to track the information flow of visual features module-by-module and layer-by-layer throughout the entire MLLM pipeline, enabling explicit visualization of where and how classification signals are distorted, diluted, or overridden. As the first attempt to dissect classification performance degradation in medical MLLMs, our findings reveal four failure modes: 1) quality limitation in visual representation, 2) fidelity loss in connector projection, 3) comprehension deficit in LLM reasoning, and 4) misalignment of semantic mapping. Meanwhile, we introduce quantitative scores that characterize the healthiness of feature evolution, enabling principled comparisons across diverse MLLMs and datasets. Furthermore, we provide insightful discussions centered on the critical barriers that prevent current medical MLLMs from fulfilling their promised clinical potential. We hope that our work provokes rethinking within the community, highlighting that the road from high expectations to clinically deployable MLLMs remains long and winding.
Chinese Translation
多模态大型语言模型(MLLMs)的兴起在医学影像分析领域引发了一波前所未有的应用热潮。然而,作为这一范式中最早和最基础的任务之一,医学图像分类揭示了一个令人警醒的现实:尽管在预训练数据和模型参数方面具有压倒性的优势,最先进的医学MLLMs在性能上始终不及传统深度学习模型。这个悖论促使我们进行深入反思:性能下降究竟源于何处?在本文中,我们对14个开源医学MLLMs在三个具有代表性的图像分类数据集上进行了广泛的实验。我们超越了表面的性能基准测试,采用特征探测的方法逐模块、逐层追踪视觉特征的信息流,从而明确可视化分类信号在何处以及如何被扭曲、稀释或覆盖。作为首次剖析医学MLLMs分类性能下降的尝试,我们的研究结果揭示了四种失败模式:1)视觉表征的质量限制,2)连接器投影的保真度损失,3)LLM推理中的理解缺陷,以及4)语义映射的错位。同时,我们引入了定量评分来表征特征演化的健康程度,使得不同MLLMs和数据集之间的比较更具原则性。此外,我们提供了深入的讨论,集中在当前医学MLLMs未能实现其临床潜力的关键障碍上。我们希望我们的工作能够引发社区内的重新思考,强调从高期望到可临床部署的MLLMs之路仍然漫长而曲折。
cs.CV / 102 / 2604.08337
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
InstAP:面向实例的视觉-语言预训练框架用于时空理解
Abstract
Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.
Chinese Translation
当前的视觉-语言预训练(VLP)范式在全局场景理解方面表现出色,但由于仅依赖全局监督,实例级推理却面临挑战。我们提出了InstAP,一个面向实例的预训练框架,通过将文本提及与特定的时空区域相结合,联合优化全局视觉-文本对齐和细粒度的实例级对比对齐。为支持这一点,我们展示了InstVL,一个大规模数据集(200万张图像,5万段视频),具有双粒度注释:整体场景描述和密集的、具体的实例描述。在InstVL基准测试中,InstAP在实例级检索上显著超越现有的VLP模型,并且在完全相同的数据集上训练的强基线VLP模型的表现也被超越,从而突显了我们面向实例的目标的优势。此外,面向实例的预训练还改善了全局理解:InstAP在多个视频基准测试中实现了具有竞争力的零-shot性能,包括MSR-VTT和DiDeMo。定性可视化进一步表明,InstAP能够将文本提及定位到正确的实例,而仅依赖全局的模型则表现出更为分散的场景级注意力。
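Note: the abstract above describes instance-level contrastive alignment between textual mentions and grounded spatial-temporal regions, without giving the loss. The sketch below shows a generic one-directional InfoNCE loss over a square similarity matrix, which is one common way to implement such alignment and is not confirmed as the paper's objective. The function name `info_nce`, the matrix layout, and the default `temperature=0.07` are illustrative assumptions.

```python
import math


def info_nce(sim, temperature=0.07):
    """Mean InfoNCE loss over the rows of a square similarity matrix.

    sim[i][j] is the similarity between region i and text mention j;
    matching (region, mention) pairs are assumed to lie on the diagonal.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max before exponentiating, for stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy of row i with the diagonal entry as the target.
        total += log_denom - logits[i]
    return total / len(sim)
```

A symmetric variant would average this with the same loss computed on the transposed matrix (text-to-region direction).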
cs.CV / 103 / 2604.08340
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
PokeGym:一个以视觉驱动的长时间基准测试,用于视觉语言模型
Abstract
While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and unscalable. We introduce PokeGym, a visually-driven long-horizon benchmark instantiated within Pokemon Legends: Z-A, a visually complex 3D open-world Role-Playing Game. PokeGym enforces strict code-level isolation: agents operate solely on raw RGB observations while an independent evaluator verifies success via memory scanning, ensuring pure vision-based decision-making and automated, scalable assessment. The benchmark comprises 30 tasks (30-220 steps) spanning navigation, interaction, and mixed scenarios, with three instruction granularities (Visual-Guided, Step-Guided, Goal-Only) to systematically deconstruct visual grounding, semantic reasoning, and autonomous exploration capabilities. Our evaluation reveals a key limitation of current VLMs: physical deadlock recovery, rather than high-level planning, constitutes the primary bottleneck, with deadlocks showing a strong negative correlation with task success. Furthermore, we uncover a metacognitive divergence: weaker models predominantly suffer from Unaware Deadlocks (oblivious to entrapment), whereas advanced models exhibit Aware Deadlocks (recognizing entrapment yet failing to recover). These findings highlight the need to integrate explicit spatial intuition into VLM architectures. The code and benchmark will be available on GitHub.
Chinese Translation
尽管视觉语言模型(VLMs)在静态视觉理解方面取得了显著进展,但它们在复杂的三维具身环境中的应用仍然受到严重限制。现有基准测试存在四个关键缺陷:(1)被动感知任务规避了交互动态;(2)简化的二维环境未能评估深度感知;(3)特权状态泄漏绕过了真实的视觉处理;(4)人工评估成本高昂且不可扩展。我们引入了PokeGym,一个以视觉驱动的长时间基准测试,基于《宝可梦传说 Z-A》(Pokemon Legends: Z-A),这是一个视觉复杂的三维开放世界角色扮演游戏。PokeGym 强制执行严格的代码级隔离:代理仅在原始RGB观察上操作,而独立评估者通过内存扫描验证成功,确保纯粹的基于视觉的决策和自动化、可扩展的评估。该基准测试包括30个任务(30-220步),涵盖导航、交互和混合场景,具有三种指令粒度(视觉引导、步骤引导、仅目标),以系统性地拆解视觉基础、语义推理和自主探索能力。我们的评估揭示了当前VLMs的一个关键限制:物理死锁恢复,而非高层次规划,构成了主要瓶颈,死锁与任务成功之间存在强负相关。此外,我们发现了一种元认知差异:较弱的模型主要遭受无意识死锁(对困境无知),而高级模型则表现出意识死锁(识别困境但未能恢复)。这些发现突显了将显式空间直觉整合到VLM架构中的必要性。代码和基准测试将会在GitHub上发布。
cs.CV / 104 / 2604.08364
MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping
MegaStyle:通过一致的文本到图像风格映射构建多样化和可扩展的风格数据集
Abstract
In this paper, we introduce MegaStyle, a novel and scalable data curation pipeline that constructs an intra-style consistent, inter-style diverse, and high-quality style dataset. We achieve this by leveraging the consistent text-to-image style mapping capability of current large generative models, which can generate images in the same style from a given style description. Building on this foundation, we curate a diverse and balanced prompt gallery with 170K style prompts and 400K content prompts, and generate a large-scale style dataset, MegaStyle-1.4M, via content-style prompt combinations. With MegaStyle-1.4M, we propose style-supervised contrastive learning to fine-tune a style encoder, MegaStyle-Encoder, for extracting expressive, style-specific representations, and we also train a FLUX-based style transfer model, MegaStyle-FLUX. Extensive experiments demonstrate the importance of maintaining intra-style consistency, inter-style diversity, and high quality in a style dataset, as well as the effectiveness of the proposed MegaStyle-1.4M. Moreover, when trained on MegaStyle-1.4M, MegaStyle-Encoder and MegaStyle-FLUX provide reliable style similarity measurement and generalizable style transfer, making a significant contribution to the style transfer community. More results are available at our project website https://jeoyal.github.io/MegaStyle/.
Chinese Translation
在本文中,我们介绍了MegaStyle,这是一种新颖且可扩展的数据策划管道,构建了一个风格内部一致、风格之间多样且高质量的风格数据集。我们通过利用当前大型生成模型的一致文本到图像风格映射能力来实现这一目标,该能力可以根据给定的风格描述生成相同风格的图像。在此基础上,我们策划了一个包含17万个风格提示和40万个内容提示的多样化和平衡的提示库,并通过内容-风格提示组合生成了一个大规模的风格数据集MegaStyle-1.4M。借助MegaStyle-1.4M,我们提出了风格监督对比学习,以微调风格编码器MegaStyle-Encoder,从而提取富有表现力的、特定于风格的表示,并训练了基于FLUX的风格迁移模型MegaStyle-FLUX。大量实验表明,保持风格内部一致性、风格之间多样性和高质量对于风格数据集的重要性,以及所提出的MegaStyle-1.4M的有效性。此外,当在MegaStyle-1.4M上进行训练时,MegaStyle-Encoder和MegaStyle-FLUX提供了可靠的风格相似性测量和可推广的风格迁移,为风格迁移社区做出了重要贡献。更多结果可在我们的项目网站 https://jeoyal.github.io/MegaStyle/ 上获取。
cs.CV / 105 / 2604.08370
SurfelSplat: Learning Efficient and Generalizable Gaussian Surfel Representations for Sparse-View Surface Reconstruction
SurfelSplat:学习高效且可泛化的高斯表面表示用于稀疏视图表面重建
Abstract
3D Gaussian Splatting (3DGS) has demonstrated impressive performance in 3D scene reconstruction. Beyond novel view synthesis, it shows great potential for multi-view surface reconstruction. Existing methods employ optimization-based reconstruction pipelines that achieve precise and complete surface extractions. However, these approaches typically require dense input views and time-consuming per-scene optimization. To address these limitations, we propose SurfelSplat, a feed-forward framework that generates efficient and generalizable pixel-aligned Gaussian surfel representations from sparse-view images. We observe that conventional feed-forward structures struggle to recover accurate geometric attributes of Gaussian surfels because the spatial frequency of pixel-aligned primitives exceeds Nyquist sampling rates. Therefore, we propose a cross-view feature aggregation module based on the Nyquist sampling theorem. Specifically, we first adapt the geometric forms of Gaussian surfels with spatial sampling rate-guided low-pass filters. We then project the filtered surfels across all input views to obtain cross-view feature correlations. By processing these correlations through a specially designed feature fusion network, we can finally regress Gaussian surfels with precise geometry. Extensive experiments on DTU reconstruction benchmarks demonstrate that our model achieves results comparable to state-of-the-art methods and predicts Gaussian surfels within 1 second, offering a 100x speedup without costly per-scene training.
Chinese Translation
3D 高斯点云(3DGS)在 3D 场景重建中表现出色。除了新视图合成外,它在多视图表面重建中也展现出巨大的潜力。现有方法采用基于优化的重建流程,能够实现精确且完整的表面提取。然而,这些方法通常需要密集的输入视图,并且在每个场景的优化中消耗大量时间。为了解决这些限制,我们提出了 SurfelSplat,这是一种前馈框架,能够从稀疏视图图像中生成高效且可泛化的像素对齐高斯表面表示。我们观察到,传统的前馈结构在恢复高斯表面的准确几何属性时存在困难,因为像素对齐原件的空间频率超过了 Nyquist 采样率。因此,我们提出了一种基于 Nyquist 采样定理的跨视图特征聚合模块。具体而言,我们首先使用空间采样率引导的低通滤波器调整高斯表面的几何形状。然后,我们将过滤后的表面在所有输入视图中投影,以获得跨视图特征相关性。通过一个专门设计的特征融合网络处理这些相关性,我们最终能够回归具有精确几何形状的高斯表面。在 DTU 重建基准上的大量实验表明,我们的模型在结果上与最先进的方法相当,并且在 1 秒内预测高斯表面,提供了 100 倍的速度提升,而无需昂贵的每场景训练。
cs.CV / 106 / 2604.08395
Phantasia: Context-Adaptive Backdoors in Vision Language Models
Phantasia:视觉语言模型中的上下文自适应后门攻击
Abstract
Recent advances in Vision-Language Models (VLMs) have greatly enhanced the integration of visual perception and linguistic reasoning, driving rapid progress in multimodal understanding. Despite these achievements, the security of VLMs, particularly their vulnerability to backdoor attacks, remains significantly underexplored. Existing backdoor attacks on VLMs are still in an early stage of development, with most current methods relying on generating poisoned responses that contain fixed, easily identifiable patterns. In this work, we make two key contributions. First, we demonstrate for the first time that the stealthiness of existing VLM backdoor attacks has been substantially overestimated. By adapting defense techniques originally designed for other domains (e.g., vision-only and text-only models), we show that several state-of-the-art attacks can be detected with surprising ease. Second, to address this gap, we introduce Phantasia, a context-adaptive backdoor attack that dynamically aligns its poisoned outputs with the semantics of each input. Instead of producing static poisoned patterns, Phantasia encourages models to generate contextually coherent yet malicious responses that remain plausible, thereby significantly improving stealth and adaptability. Extensive experiments across diverse VLM architectures reveal that Phantasia achieves state-of-the-art attack success rates while maintaining benign performance under various defensive settings.
Chinese Translation
近期视觉语言模型(VLMs)的进展极大增强了视觉感知与语言推理的整合,推动了多模态理解的快速发展。尽管取得了这些成就,VLMs的安全性,特别是其对后门攻击的脆弱性,仍然未得到充分探讨。现有的VLM后门攻击仍处于早期发展阶段,大多数当前方法依赖于生成包含固定、易于识别模式的污染响应。在本研究中,我们做出了两个关键贡献。首先,我们首次证明了现有VLM后门攻击的隐蔽性被大大高估。通过适应最初为其他领域(例如,仅视觉和仅文本模型)设计的防御技术,我们展示了几种最先进的攻击可以被意外轻松地检测到。其次,为了填补这一空白,我们引入了Phantasia,一种上下文自适应后门攻击,动态地将其污染输出与每个输入的语义对齐。Phantasia并不是生成静态的污染模式,而是鼓励模型生成在上下文上连贯但具有恶意的响应,这些响应仍然保持合理性,从而显著提高了隐蔽性和适应性。在多种VLM架构上的广泛实验表明,Phantasia在保持良性性能的同时,实现了最先进的攻击成功率,适用于各种防御设置。
cs.CV / 107 / 2604.08405
SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
SyncBreaker:基于阶段感知的多模态对抗攻击在音频驱动的说话人头部生成中的应用
Abstract
Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. Both streams are optimized independently and combined at inference time to enable flexible deployment. We evaluate SyncBreaker in a white-box proactive protection setting. Extensive experiments demonstrate that SyncBreaker more effectively degrades lip synchronization and facial dynamics than strong single-modality baselines, while preserving input perceptual quality and remaining robust under purification. Code: https://github.com/kitty384/SyncBreaker.
Chinese Translation
基于扩散的音频驱动说话人头部生成技术能够实现逼真的肖像动画,但也带来了滥用的风险,例如欺诈和虚假信息。现有的保护方法主要局限于单一模态,图像或音频单独的攻击无法有效抑制语音驱动的面部动态。为了解决这一问题,我们提出了SyncBreaker,一个阶段感知的多模态保护框架,该框架在特定模态的感知约束下共同扰动肖像和音频输入。我们的主要贡献有两个方面。首先,对于图像流,我们引入了通过多间隔采样(Multi-Interval Sampling, MIS)进行的消除监督,以在扩散阶段引导生成朝向静态参考肖像,通过聚合来自多个去噪间隔的指导。其次,对于音频流,我们提出了交叉注意力欺骗(Cross-Attention Fooling, CAF),该方法抑制特定间隔的音频条件交叉注意力响应。这两个流在独立优化后在推理时结合,以实现灵活部署。我们在白盒主动保护设置中评估了SyncBreaker。大量实验表明,SyncBreaker在降低唇部同步和面部动态方面比强大的单模态基线更有效,同时保持输入的感知质量,并在净化下保持稳健性。代码:https://github.com/kitty384/SyncBreaker。
cs.CV / 108 / 2604.08410
BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields
BLaDA:在3DGS领域中将语言与功能灵巧动作连接起来
Abstract
In unstructured environments, functional dexterous grasping calls for the tight integration of semantic understanding, precise 3D functional localization, and physically interpretable execution. Modular hierarchical methods are more controllable and interpretable than end-to-end VLA approaches, but existing ones still rely on predefined affordance labels and lack the tight semantic-pose coupling needed for functional dexterous manipulation. To address this, we propose BLaDA (Bridging Language to Dexterous Actions in 3DGS fields), an interpretable zero-shot framework that grounds open-vocabulary instructions as perceptual and control constraints for functional dexterous manipulation. BLaDA establishes an interpretable reasoning chain by first parsing natural language into a structured sextuple of manipulation constraints via a Knowledge-guided Language Parsing (KLP) module. To achieve pose-consistent spatial reasoning, we introduce the Triangular Functional Point Localization (TriLocation) module, which utilizes 3D Gaussian Splatting as a continuous scene representation and identifies functional regions under triangular geometric constraints. Finally, the 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module decodes these semantic-geometric constraints into physically plausible wrist poses and finger-level commands. Extensive experiments on complex benchmarks demonstrate that BLaDA significantly outperforms existing methods in both affordance grounding precision and the success rate of functional manipulation across diverse categories and tasks. Code will be publicly available at https://github.com/PopeyePxx/BLaDA.
Chinese Translation
在非结构化环境中,功能性灵巧抓取需要语义理解、精确的3D功能定位和可物理解释的执行之间的紧密结合。模块化层次方法比端到端的VLA方法更具可控性和可解释性,但现有方法仍依赖于预定义的可供性标签,并缺乏功能性灵巧操作所需的紧密语义-姿态耦合。为了解决这个问题,我们提出了BLaDA(在3DGS领域中将语言与灵巧动作连接起来),这是一个可解释的零样本框架,将开放词汇指令作为功能性灵巧操作的感知和控制约束。BLaDA通过知识引导语言解析(KLP)模块,首先将自然语言解析为结构化的操作约束六元组,从而建立一个可解释的推理链。为了实现姿态一致的空间推理,我们引入了三角功能点定位(TriLocation)模块,该模块利用3D高斯点云作为连续场景表示,并在三角几何约束下识别功能区域。最后,3D关键点抓取矩阵变换执行(KGT3D+)模块将这些语义-几何约束解码为物理上合理的手腕姿态和指级命令。在复杂基准上的大量实验表明,BLaDA在可供性基础精度和多样类别及任务的功能操作成功率方面显著优于现有方法。代码将公开发布于 https://github.com/PopeyePxx/BLaDA。
cs.CV / 109 / 2604.08435
HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment
HST-HGN:基于双向状态空间模型的异构时空超图网络用于全局疲劳评估
Abstract
It remains challenging to assess driver fatigue from untrimmed videos under constrained computational budgets, due to the difficulty of modeling long-range temporal dependencies in subtle facial expressions. Some existing approaches rely on computationally heavy architectures, whereas others employ traditional lightweight pairwise graph networks, despite their limited capacity to model high-order synergies and global temporal context. Therefore, we propose HST-HGN, a novel Heterogeneous Spatial-Temporal Hypergraph Network driven by Bidirectional State Space Models. Spatially, we introduce a hierarchical hypergraph network to fuse pose-disentangled geometric topologies with multi-modal texture patches dynamically. This formulation encapsulates high-order synergistic facial deformations, effectively overcoming the limitations of conventional methods. In temporal terms, a Bi-Mamba module with linear complexity is applied to perform bidirectional sequence modeling. This explicit temporal-evolution filtering enables the network to distinguish highly ambiguous transient actions, such as yawning versus speaking, while encompassing their complete physiological lifecycles. Extensive evaluations across diverse fatigue benchmarks demonstrate that HST-HGN achieves state-of-the-art performance. In particular, our method strikes a balance between discriminative power and computational efficiency, making it well-suited for real-time in-cabin edge deployment.
Chinese Translation
在受限计算预算下,从未裁剪的视频中评估驾驶员疲劳仍然具有挑战性,这主要是由于在微妙面部表情中建模长程时间依赖性的困难。一些现有方法依赖于计算量大的架构,而另一些则采用传统的轻量级成对图网络,尽管它们在建模高阶协同和全球时间上下文方面能力有限。因此,我们提出了HST-HGN,一种由双向状态空间模型驱动的新型异构时空超图网络。在空间上,我们引入了一个分层超图网络,以动态融合姿态解耦的几何拓扑与多模态纹理块。这种表述封装了高阶协同的面部变形,有效克服了传统方法的局限性。在时间上,应用了一个线性复杂度的Bi-Mamba模块进行双向序列建模。这种显式的时间演变过滤使网络能够区分高度模糊的瞬态动作,如打哈欠与说话,同时涵盖其完整的生理生命周期。在多种疲劳基准上的广泛评估表明,HST-HGN达到了最先进的性能。特别是,我们的方法在区分能力和计算效率之间取得了平衡,使其非常适合实时车内边缘部署。
cs.CV / 110 / 2604.08456
Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models
熵梯度基础:无训练的视觉语言模型证据检索
Abstract
Despite rapid progress, pretrained vision-language models still struggle when answers depend on tiny visual details or on combining clues spread across multiple regions, as in documents and compositional queries. We address this by framing grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses uncertainty as supervision. Specifically, we compute the entropy of the model's next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains on detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.
Chinese Translation
尽管取得了快速进展,预训练的视觉语言模型在答案依赖于微小视觉细节或需要结合分散在多个区域的线索(如文档和组合查询)时仍然面临挑战。我们通过将基础框架视为测试时证据检索来解决这一问题:给定一个查询,模型应主动识别接下来该查看的位置以消除歧义。为此,我们提出了一种无训练的、模型内在的基础方法,利用不确定性作为监督。具体而言,我们计算模型下一个标记分布的熵,并将其反向传播到视觉标记嵌入中,以获得熵梯度相关性图,而无需辅助检测器或注意力图启发式方法。然后,我们提取并排名多个连贯区域以支持多证据查询,并引入一种带有空间熵停止规则的迭代缩放和重新基础程序,以避免过度细化。在四种视觉语言模型架构的七个基准测试上的实验表明,与现有方法相比,取得了一致的改进,尤其在对细节要求高和高分辨率设置下获得了最大的提升,同时也生成了更具可解释性的证据定位。
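The entropy signal described in this abstract can be illustrated with a toy sketch. It assumes a plain list of next-token logits (the values below are hypothetical) and computes the entropy of the softmax distribution together with its analytic gradient with respect to the logits, dH/dz_j = -p_j (log p_j + H). In the method as described, the gradient is carried further back to the visual token embeddings by the model's own backpropagation; this sketch stops at the logits and is not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_grad_wrt_logits(logits):
    """Analytic gradient of softmax entropy: dH/dz_j = -p_j * (log p_j + H)."""
    probs = softmax(logits)
    h = entropy(probs)
    return [-p * (math.log(p) + h) if p > 0.0 else 0.0 for p in probs]

# Hypothetical next-token logits for a three-word vocabulary.
logits = [2.0, 1.0, 0.1]
H = entropy(softmax(logits))
grad = entropy_grad_wrt_logits(logits)
```

The gradient components sum to zero because the entropy does not change when every logit is shifted by the same constant, which is a quick sanity check on the formula.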
cs.CV / 111 / 2604.08457
CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning
CrashSight:一个阶段感知的基础设施中心视频基准,用于交通事故场景理解与推理
Abstract
Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.
Chinese Translation
协同自动驾驶需要从车辆和基础设施的角度理解交通场景。尽管视觉-语言模型(VLMs)展现了强大的通用推理能力,但由于现有基准的自我车辆聚焦,其在安全关键交通场景中的表现仍然评估不足。为填补这一空白,我们提出了CrashSight,这是一个基于真实世界路边摄像头数据的大规模视觉-语言基准,用于道路事故理解。该数据集包含250个事故视频,标注有13K个多项选择问答对,按照两级分类法组织。第一级评估场景上下文和相关方的视觉定位,而第二级则探讨更高层次的推理,包括事故机制、因果归属、时间进程和事故后的结果。我们对8个最先进的VLM进行了基准测试,结果表明,尽管当前模型在场景描述能力上表现强劲,但在安全关键场景中的时间和因果推理方面仍然存在困难。我们提供了失败场景的详细分析,并讨论了改进VLM事故理解的方向。该基准为协同自动驾驶中的基础设施辅助感知提供了标准化的评估框架。CrashSight基准,包括完整的数据集和代码,可在https://mcgrche.github.io/crashsight访问。
cs.CV / 112 / 2604.08461
OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance
OVS-DINO:通过结构对齐的SAM-DINO与语言指导实现开放词汇分割
Abstract
Open-Vocabulary Segmentation (OVS) aims to segment image regions beyond predefined category sets by leveraging semantic descriptions. While CLIP-based approaches excel in semantic generalization, they frequently lack the fine-grained spatial awareness required for dense prediction. Recent efforts have incorporated Vision Foundation Models (VFMs) like DINO to alleviate these limitations. However, these methods still struggle with the precise edge perception necessary for high-fidelity segmentation. In this paper, we analyze the internal representations of DINO and discover that its inherent boundary awareness is not absent but rather undergoes progressive attenuation as features transition into deeper transformer blocks. To address this, we propose OVS-DINO, a novel framework that revitalizes the latent edge sensitivity of DINO through structural alignment with the Segment Anything Model (SAM). Specifically, we introduce a Structure-Aware Encoder (SAE) and a Structure-Modulated Decoder (SMD) to effectively activate the boundary features of DINO using SAM's structural priors, complemented by a supervision strategy utilizing SAM-generated pseudo-masks. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple weakly-supervised OVS benchmarks, improving the average score by 2.1% (from 44.8% to 46.9%). Notably, our approach significantly enhances segmentation accuracy in complex, cluttered scenarios, with a gain of 6.3% on Cityscapes (from 36.6% to 42.9%).
Chinese Translation
开放词汇分割(OVS)旨在通过利用语义描述对超出预定义类别集的图像区域进行分割。虽然基于CLIP的方法在语义泛化方面表现出色,但它们往往缺乏进行密集预测所需的细粒度空间意识。最近的研究努力引入了视觉基础模型(VFM),如DINO,以缓解这些局限性。然而,这些方法在高保真分割所需的精确边缘感知方面仍然存在困难。在本文中,我们分析了DINO的内部表示,并发现其固有的边界意识并非缺失,而是在特征过渡到更深的变换器块时逐渐减弱。为了解决这个问题,我们提出了OVS-DINO,一个新颖的框架,通过与Segment Anything Model(SAM)的结构对齐来重振DINO的潜在边缘敏感性。具体而言,我们引入了结构感知编码器(SAE)和结构调制解码器(SMD),以有效激活DINO的边界特征,利用SAM的结构先验,并辅以使用SAM生成的伪掩码的监督策略。大量实验表明,我们的方法在多个弱监督OVS基准测试中实现了最先进的性能,平均得分提高了2.1%(从44.8%提高到46.9%)。值得注意的是,我们的方法在复杂、杂乱的场景中显著提高了分割精度,在Cityscapes数据集上提升了6.3%(从36.6%提高到42.9%)。
cs.CV / 113 / 2604.08475
LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation
LAMP:将图像编辑提升为开放世界操控的一般3D先验
Abstract
Human-like generalization in open-world settings remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image-editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments demonstrate that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: https://zju3dv.github.io/LAMP/.
Chinese Translation
在开放世界中实现类人般的泛化仍然是机器人操控的一项基本挑战。现有的基于学习的方法,包括强化学习、模仿学习和视觉-语言-动作模型(VLAs),在面对新任务和未见环境时常常表现不佳。另一个有前景的方向是探索可泛化的表示,以捕捉开放世界操控中的细粒度空间和几何关系。尽管大型语言模型(LLMs)和视觉-语言模型(VLMs)基于语言或标注的2D表示提供了强大的语义推理,但它们有限的3D感知能力限制了其在细粒度操控中的适用性。为了解决这一问题,我们提出了LAMP,它将图像编辑提升为3D先验,以提取对象间的3D变换作为连续的、具有几何意识的表示。我们的关键见解是,图像编辑本质上编码了丰富的2D空间线索,而将这些隐式线索提升为3D变换为开放世界操控提供了细粒度和准确的指导。大量实验表明,LAMP能够提供精确的3D变换,并在开放世界操控中实现强大的零样本泛化。项目页面:https://zju3dv.github.io/LAMP/
cs.CV / 114 / 2604.08476
Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization
忠实的 GRPO:通过约束策略优化提升多模态语言模型的视觉空间推理
Abstract
Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial, TreeVGR as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by +13%. It also improves final answer accuracy over simple GRPO, demonstrating that faithful reasoning enables better answers.
Chinese Translation
使用可验证奖励的强化学习训练的多模态推理模型(MRMs)在视觉推理基准测试中显示出更高的准确性。然而,我们观察到,准确性提升往往以推理质量为代价:生成的思维链(Chain-of-Thought, CoT)轨迹常常与最终答案不一致,并且在视觉证据上缺乏扎实的基础。我们系统地研究了这一现象在七个具有挑战性的现实世界空间推理基准中的表现,发现它影响了当代的多模态推理模型,如 ViGoRL-Spatial、TreeVGR 以及我们自己使用标准群体相对策略优化(Group Relative Policy Optimization, GRPO)训练的模型。我们从两个互补的维度来表征 CoT 推理质量:“逻辑一致性”(CoT 是否蕴含最终答案?)和“视觉基础”(每个推理步骤是否准确描述了图像中的对象、属性和空间关系?)。为了解决这个问题,我们提出了忠实的 GRPO(Faithful GRPO, FGRPO),这是 GRPO 的一种变体,通过拉格朗日对偶上升将一致性和基础性作为约束进行强制。FGRPO 在群体内的优势计算中纳入了批量级一致性和基础性约束,并在优化过程中自适应调整约束的重要性。我们在 Qwen2.5-VL-7B 和 3B 主干模型上对 FGRPO 进行了评估,涵盖了七个空间数据集。我们的结果表明,FGRPO 显著提高了推理质量,将不一致率从 24.5% 降低到 1.7%,并将视觉基础得分提高了 13%。它还在简单的 GRPO 上提高了最终答案的准确性,证明了忠实推理能够提供更好的答案。
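The constrained update in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' formulation: the function names, the additive way constraint advantages are combined, the learning rate, and the target values are all hypothetical. It shows group-relative advantages, a multiplier-weighted combination with constraint scores, and a projected dual-ascent step that raises a multiplier while its batch-level constraint falls short of the target.

```python
from statistics import mean, pstdev

def group_advantages(values, eps=1e-8):
    """Group-relative advantage: standardize values within one sampled group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def constrained_advantages(task_rewards, constraint_scores, lambdas):
    """Add multiplier-weighted constraint advantages to the task advantage.

    constraint_scores maps a constraint name (e.g. "consistency") to one
    score per sample in the group; lambdas maps the same names to multipliers.
    """
    adv = group_advantages(task_rewards)
    for name, scores in constraint_scores.items():
        c_adv = group_advantages(scores)
        adv = [a + lambdas[name] * c for a, c in zip(adv, c_adv)]
    return adv

def dual_ascent_step(lambdas, batch_means, targets, lr=0.1):
    """Projected dual ascent: grow a multiplier while its constraint is
    below target, shrink it otherwise, and never let it go negative."""
    return {
        name: max(0.0, lambdas[name] + lr * (targets[name] - batch_means[name]))
        for name in lambdas
    }

# Hypothetical group of four sampled responses.
adv = constrained_advantages(
    task_rewards=[1.0, 0.0, 1.0, 0.0],
    constraint_scores={"consistency": [1.0, 1.0, 0.0, 0.0]},
    lambdas={"consistency": 0.5},
)
new_lambdas = dual_ascent_step(
    {"consistency": 0.5}, {"consistency": 0.5}, {"consistency": 0.9}
)
```

In this example the first response is both correct and consistent, so it receives the largest combined advantage, while a correct but inconsistent response is rewarded less. The multiplier rises because the batch consistency (0.5) is below the assumed target (0.9).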
cs.CV / 115 / 2604.08494
What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric
他们所看到的,不仅仅是他们所注视的地方:通过视觉语言模型和自然语言处理度量的语义扫描路径相似性
Abstract
Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.
Chinese Translation
扫描路径相似性度量在眼动研究中至关重要,但现有方法主要评估空间和时间对齐,而忽视了被注视图像区域之间的语义等价性。我们提出了一种语义扫描路径相似性框架,将视觉语言模型(VLMs)整合到眼动追踪分析中。每个注视点在受控视觉上下文下进行编码(基于补丁和基于标记的策略),并转化为简洁的文本描述,这些描述被聚合成扫描路径级别的表示。然后,使用基于嵌入和词汇的自然语言处理度量计算语义相似性,并与包括MultiMatch和动态时间规整(DTW)在内的已建立空间度量进行比较。在自由观看眼动追踪数据上的实验表明,语义相似性捕捉到与几何对齐部分独立的方差,揭示了尽管空间上存在差异但内容高度一致的情况。我们进一步分析了上下文编码对描述保真度和度量稳定性的影响。我们的研究结果表明,多模态基础模型使经典扫描路径分析的可解释性和内容感知扩展成为可能,为ETRA社区的注视研究提供了一个互补的维度。
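As a rough illustration of the lexical side of this pipeline, the sketch below aggregates per-fixation descriptions into one scanpath-level text and compares two scanpaths by Jaccard overlap of their word sets. The fixation descriptions are invented for the example, and Jaccard overlap is only one simple lexical metric chosen here for illustration; the abstract does not specify which lexical or embedding-based metrics the authors use.

```python
def tokenize(text):
    """Lowercase whitespace tokenization with surrounding punctuation stripped."""
    tokens = (w.strip(".,;:!?").lower() for w in text.split())
    return [t for t in tokens if t]

def scanpath_text(descriptions):
    """Aggregate per-fixation descriptions into one scanpath-level string."""
    return " ".join(descriptions)

def jaccard(text_a, text_b):
    """Jaccard overlap between the word sets of two texts."""
    set_a, set_b = set(tokenize(text_a)), set(tokenize(text_b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical fixation descriptions for three viewers of the same image.
viewer_a = ["a red car", "a stop sign"]
viewer_b = ["a stop sign", "a red car"]   # same content, different order
viewer_c = ["a blue sky", "a tree"]       # different content

same_content = jaccard(scanpath_text(viewer_a), scanpath_text(viewer_b))
different_content = jaccard(scanpath_text(viewer_a), scanpath_text(viewer_c))
```

Viewers A and B attend to the same content in a different order, so a set-based lexical score treats them as identical even though a spatial or temporal alignment measure would not. This is the kind of content agreement under spatial divergence that the abstract describes.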
cs.CV / 116 / 2604.08500
Novel View Synthesis as Video Completion
将新视图合成视为视频补全
Abstract
We tackle the problem of sparse novel view synthesis (NVS) using video diffusion models; given $K$ ($\approx 5$) multi-view images of a scene and their camera poses, we predict the view from a target camera pose. Many prior approaches leverage generative image priors encoded via diffusion models. However, models trained on single images lack multi-view knowledge. We instead argue that video models already contain implicit multi-view knowledge and so should be easier to adapt for NVS. Our key insight is to formulate sparse NVS as a low frame-rate video completion task. However, one challenge is that sparse NVS is defined over an unordered set of inputs, often too sparse to admit a meaningful order, so the models should be $\textit{invariant}$ to permutations of that input set. To this end, we present FrameCrafter, which adapts video models (naturally trained with coherent frame orderings) to permutation-invariant NVS through several architectural modifications, including per-frame latent encodings and removal of temporal positional embeddings. Our results suggest that video models can be easily trained to "forget" about time with minimal supervision, producing competitive performance on sparse-view NVS benchmarks. Project page: https://frame-crafter.github.io/
Chinese Translation
我们使用视频扩散模型解决稀疏新视图合成(NVS)的问题;给定$K$(约5)幅场景的多视图图像及其相机位姿,我们预测目标相机位姿下的视图。许多先前的方法利用通过扩散模型编码的生成图像先验。然而,基于单幅图像训练的模型缺乏多视图知识。我们认为,视频模型已经包含隐含的多视图知识,因此更容易适应NVS。我们的关键见解是将稀疏NVS表述为低帧率视频补全任务。然而,一个挑战是稀疏NVS是基于无序输入集定义的,通常稀疏到无法赋予有意义的顺序,因此模型应该对输入集的排列保持不变。为此,我们提出了FrameCrafter,该方法通过多项架构修改(包括每帧潜在编码和去除时间位置嵌入)将视频模型(自然地以一致的帧顺序进行训练)适应于排列不变的NVS。我们的结果表明,视频模型可以在最小监督下轻松训练以“忘记”时间,在稀疏视图NVS基准测试中表现出竞争力。项目页面:https://frame-crafter.github.io/
cs.CV / 117 / 2604.08502
Quantifying Explanation Consistency: The C-Score Metric for CAM-Based Explainability in Medical Image Classification
量化解释一致性:基于CAM的医学图像分类可解释性的C-Score度量
Abstract
Class Activation Mapping (CAM) methods are widely used to generate visual explanations for deep learning classifiers in medical imaging. However, existing evaluation frameworks assess whether explanations are correct, measured by localisation fidelity against radiologist annotations, rather than whether they are consistent, that is, whether the model applies the same spatial reasoning strategy across different patients with the same pathology. We propose the C-Score (Consistency Score), a confidence-weighted, annotation-free metric that quantifies intra-class explanation reproducibility via intensity-emphasised pairwise soft IoU across correctly classified instances. We evaluate six CAM techniques (GradCAM, GradCAM++, LayerCAM, EigenCAM, ScoreCAM, and MS GradCAM++) across three CNN architectures (DenseNet201, InceptionV3, ResNet50V2) over thirty training epochs on the Kermany chest X-ray dataset, covering transfer learning and fine-tuning phases. We identify three distinct mechanisms of AUC-consistency dissociation that are invisible to standard classification metrics: threshold-mediated gold list collapse, technique-specific attribution collapse at peak AUC, and class-level consistency masking in global aggregation. C-Score provides an early warning signal of impending model instability: ScoreCAM deterioration on ResNet50V2 is detectable one full checkpoint before catastrophic AUC collapse. It also yields architecture-specific clinical deployment recommendations grounded in explanation quality rather than predictive ranking alone.
Chinese Translation
类激活映射(Class Activation Mapping, CAM)方法广泛用于生成深度学习分类器在医学成像中的视觉解释。然而,现有的评估框架主要评估解释的正确性,通过与放射科医生注释的定位保真度进行衡量,而不是评估其一致性:即模型是否在不同患者中对相同病理应用相同的空间推理策略。我们提出了C-Score(一致性得分),这是一种基于置信度加权的无注释度量,通过在正确分类实例之间进行强度强调的成对软IoU,量化类内解释的可重复性。我们评估了六种CAM技术:GradCAM、GradCAM++、LayerCAM、EigenCAM、ScoreCAM和MS GradCAM++,在三种卷积神经网络架构(DenseNet201、InceptionV3、ResNet50V2)上进行了三十个训练周期,使用Kermany胸部X光数据集,涵盖迁移学习和微调阶段。我们识别出三种AUC一致性解离的不同机制,这些机制在标准分类度量中是不可见的:阈值介导的金名单崩溃、在峰值AUC时特定技术的归因崩溃,以及在全局聚合中的类别级一致性掩蔽。C-Score提供了即将发生模型不稳定的早期预警信号。ResNet50V2上的ScoreCAM恶化在灾难性AUC崩溃之前可以在一个完整的检查点中被检测到,并基于解释质量而非仅仅预测排名,提供了架构特定的临床部署建议。
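The abstract names the ingredients of the C-Score (confidence weighting, intensity emphasis, pairwise soft IoU) but not its exact formula, so the sketch below is one plausible reading rather than the authors' definition. It assumes attribution maps are flattened lists with values in [0, 1], uses a power `gamma` as the intensity emphasis, and weights each pair by the product of the two prediction confidences; all three choices are assumptions made for illustration.

```python
from itertools import combinations

def soft_iou(map_a, map_b, gamma=2.0):
    """Soft IoU of two flattened maps: sum of element-wise minima over sum
    of element-wise maxima, after raising intensities to the power gamma."""
    num = 0.0
    den = 0.0
    for x, y in zip(map_a, map_b):
        x, y = x ** gamma, y ** gamma
        num += min(x, y)
        den += max(x, y)
    return num / den if den > 0.0 else 0.0

def consistency_score(maps, confidences, gamma=2.0):
    """Confidence-weighted mean of pairwise soft IoU within one class."""
    num = 0.0
    den = 0.0
    for i, j in combinations(range(len(maps)), 2):
        weight = confidences[i] * confidences[j]
        num += weight * soft_iou(maps[i], maps[j], gamma)
        den += weight
    return num / den if den > 0.0 else 0.0

# Hypothetical flattened attribution maps for three correctly classified
# images of the same class, with their prediction confidences.
maps = [
    [1.0, 0.5, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 1.0],
]
confidences = [0.9, 0.8, 0.7]
score = consistency_score(maps, confidences)
```

Here two maps agree perfectly and the third highlights a disjoint region, so only one of the three pairs contributes overlap and the score lands well below 1 even though all three predictions are correct.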
cs.CV / 118 / 2604.08503
Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
Phantom:通过联合建模视觉与潜在物理动态的物理驱动视频生成
Abstract
Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of the physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.
Chinese Translation
近年来,受大规模数据集和强大架构驱动的生成视频建模取得了显著的视觉真实感。然而,新的证据表明,仅仅扩大数据和模型规模并不能使这些系统理解支配现实世界动态的基本物理法则。现有的方法往往无法捕捉或强制执行这种物理一致性,导致不现实的运动和动态。在本研究中,我们探讨了将潜在物理属性的推断直接整合到视频生成过程是否能够使模型具备生成物理上合理视频的能力。为此,我们提出了Phantom,一个物理驱动的视频生成模型,它联合建模视觉内容和潜在物理动态。在观察到的视频帧和推断的物理状态的条件下,Phantom共同预测潜在物理动态并生成未来的视频帧。Phantom利用一种物理感知的视频表示,作为对基础物理的抽象而又信息丰富的嵌入,促进了物理动态与视频内容的联合预测,而无需明确指定复杂的物理动态和属性集。通过将物理感知视频表示的推断直接整合到视频生成过程中,Phantom生成的视频序列在视觉上真实且在物理上一致。在标准视频生成和物理感知基准上的定量和定性结果表明,Phantom不仅在遵循物理动态方面优于现有方法,而且在感知保真度方面也具有竞争力。
cs.CV / 119 / 2604.08509
Visually-grounded Humanoid Agents
视觉基础的人形代理
Abstract
Digital human generation has been studied for decades and supports a wide range of real-world applications. However, most existing systems are passively animated, relying on privileged state or scripted control, which limits scalability to novel environments. We instead ask: how can digital humans actively behave using only visual observations and specified goals in novel scenes? Achieving this would enable populating any 3D environment at scale with digital humans that exhibit spontaneous, natural, goal-directed behaviors. To this end, we introduce Visually-grounded Humanoid Agents, a coupled two-layer (world-agent) paradigm that replicates humans at multiple levels: they look, perceive, reason, and behave like real people in real-world 3D scenes. The World Layer reconstructs semantically rich 3D Gaussian scenes from real-world videos via an occlusion-aware pipeline and accommodates animatable Gaussian-based human avatars. The Agent Layer transforms these avatars into autonomous humanoid agents, equipping them with first-person RGB-D perception and enabling them to perform accurate, embodied planning with spatial awareness and iterative reasoning, which is then executed at the low level as full-body actions to drive their behaviors in the scene. We further introduce a benchmark to evaluate humanoid-scene interaction in diverse reconstructed environments. Experiments show our agents achieve robust autonomous behavior, yielding higher task success rates and fewer collisions than ablations and state-of-the-art planning methods. This work enables active digital human population and advances human-centric embodied AI. Data, code, and models will be open-sourced.
Chinese Translation
数字人类生成的研究已经进行了几十年,并支持广泛的现实应用。然而,大多数现有系统都是被动动画,依赖于特权状态或脚本控制,这限制了其在新环境中的可扩展性。我们反而提出:如何使数字人类仅通过视觉观察和指定目标在新场景中主动行为?实现这一目标将使得能够在任何3D环境中大规模地填充数字人类,这些人类表现出自发、自然、目标导向的行为。为此,我们引入了视觉基础的人形代理(Visually-grounded Humanoid Agents),这是一种耦合的双层(世界-代理)范式,能够在多个层面上复制人类:它们在真实世界的3D场景中看起来、感知、推理和行为都像真实的人类。世界层通过一个考虑遮挡的管道,从现实视频中重建语义丰富的3D高斯场景,并容纳可动画的基于高斯的人类化身。代理层将这些化身转化为自主的人形代理,为其配备第一人称RGB-D感知,使其能够进行准确的、具身的规划,具备空间意识和迭代推理,然后以全身动作的低级执行驱动其在场景中的行为。我们进一步引入了一个基准,以评估在多样重建环境中的人形-场景交互。实验表明,我们的代理实现了稳健的自主行为,任务成功率更高,碰撞次数更少,相较于消融实验和最先进的规划方法。该工作实现了主动数字人类的填充,并推进了以人为中心的具身人工智能。数据、代码和模型将开源。
cs.CV / 120 / 2604.08513
When Fine-Tuning Changes the Evidence: Architecture-Dependent Semantic Drift in Chest X-Ray Explanations
微调如何改变证据:胸部X光解释中的架构依赖语义漂移
Abstract
Transfer learning followed by fine-tuning is widely adopted in medical image classification due to consistent gains in diagnostic performance. However, in multi-class settings with overlapping visual features, improvements in accuracy do not guarantee stability of the visual evidence used to support predictions. We define semantic drift as systematic changes in the attribution structure supporting a model's predictions between transfer learning and full fine-tuning, reflecting potential shifts in underlying visual reasoning despite stable classification performance. Using a five-class chest X-ray task, we evaluate DenseNet201, ResNet50V2, and InceptionV3 under a two-stage training protocol and quantify drift with reference-free metrics capturing spatial localization and structural consistency of attribution maps. Across architectures, coarse anatomical localization remains stable, while overlap IoU reveals pronounced architecture-dependent reorganization of evidential structure. Beyond single-method analysis, stability rankings can reverse across LayerCAM and GradCAM++ under converged predictive performance, establishing explanation stability as an interaction between architecture, optimization phase, and attribution objective.
Chinese Translation
由于在诊断性能上的一致提升,迁移学习后进行微调在医学图像分类中被广泛采用。然而,在具有重叠视觉特征的多类设置中,准确性的提高并不保证用于支持预测的视觉证据的稳定性。我们将语义漂移定义为迁移学习与完全微调之间支持模型预测的归因结构的系统性变化,反映了尽管分类性能稳定,但潜在视觉推理的变化。通过一个五类胸部X光任务,我们在两阶段训练协议下评估了DenseNet201、ResNet50V2和InceptionV3,并使用无参考度量量化漂移,这些度量捕捉了归因图的空间定位和结构一致性。在不同架构中,粗略的解剖定位保持稳定,而重叠的IoU则揭示了证据结构的显著架构依赖重组。超越单一方法分析,在收敛的预测性能下,LayerCAM和GradCAM++的稳定性排名可能会反转,确立了解释稳定性作为架构、优化阶段和归因目标之间的相互作用。
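The overlap IoU used above to quantify drift can be illustrated with a minimal sketch. The abstract does not say how attribution maps are binarized, so the quantile threshold and the names `binarize` and `overlap_iou` are assumptions made for illustration, not the authors' procedure.

```python
def binarize(attr_map, quantile=0.8):
    """Mark the highest-valued pixels as 'evidence' (threshold choice is an assumption)."""
    flat = sorted(v for row in attr_map for v in row)
    cutoff = flat[min(int(quantile * len(flat)), len(flat) - 1)]
    return [[v >= cutoff for v in row] for row in attr_map]


def overlap_iou(map_a, map_b, quantile=0.8):
    """Intersection over union of two binarized attribution maps of equal shape."""
    mask_a = binarize(map_a, quantile)
    mask_b = binarize(map_b, quantile)
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += a and b   # pixel is evidence in both maps
            union += a or b    # pixel is evidence in at least one map
    return inter / union if union else 1.0


# Hypothetical 2x2 maps from the transfer-learning and fine-tuned stages.
stage_one = [[0.9, 0.1], [0.2, 0.0]]
stage_two = [[0.9, 0.8], [0.0, 0.1]]
print(overlap_iou(stage_one, stage_two, quantile=0.5))
```

A low value for the same image across the two training stages would indicate that the supporting evidence has moved even if the predicted class has not.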
cs.CV / 121 / 2604.08516
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
MolmoWeb:开放视觉网络代理和开放数据为开放网络服务
Abstract
Web agents--autonomous systems that navigate and execute tasks on the web on behalf of users--have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. Available in 4B and 8B size, on browser-use benchmarks like WebVoyager, Online-Mind2Web, and DeepShop, MolmoWeb agents achieve state-of-the-art results outperforming similar scale open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.
Chinese Translation
网络代理——代表用户在网络上导航和执行任务的自主系统——有潜力改变人们与数字世界的互动方式。然而,当前最强大的网络代理依赖于专有模型,这些模型的训练数据和算法未公开,限制了科学理解、可重复性和社区驱动的进展。我们认为,开放网络的代理应该在开放环境中构建。为此,我们介绍了(1)MolmoWebMix,一个包含大量多样化浏览器任务演示和网络图形用户界面(GUI)感知数据的混合数据集,以及(2)MolmoWeb,一个完全开放的多模态网络代理系列。具体而言,MolmoWebMix结合了来自多个互补生成管道的超过10万条合成任务轨迹与3万多条人类演示、原子网络技能轨迹和GUI感知数据,包括指称表达的定位和屏幕截图问答。MolmoWeb代理作为指令条件的视觉语言行动策略运行:在给定任务指令和网页截图的情况下,它们预测下一个浏览器动作,无需访问HTML、可访问性树或专用API。在浏览器使用基准测试如WebVoyager、Online-Mind2Web和DeepShop上,MolmoWeb代理以4B和8B的规模实现了最先进的结果,超越了类似规模的仅开放权重模型,如Fara-7B、UI-Tars-1.5-7B和Holo1-7B。MolmoWeb-8B还超越了基于更大封闭前沿模型(如GPT-4o)构建的标记集(SoM)代理。我们进一步通过并行展开和最佳选择的测试时间扩展展示了一致的增益,在WebVoyager和Online-Mind2Web上分别达到了94.7%和60.5%的通过率@4(相比之下,78.2%和35.3%的通过率@1)。我们将发布模型检查点、训练数据、代码和统一评估工具,以促进可重复性并加速开放网络代理的研究。
cs.CV / 122 / 2604.08522
UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding
UniversalVTG:一种通用且轻量级的视频时间定位基础模型
Abstract
Video temporal grounding (VTG) is typically tackled with dataset-specific models that transfer poorly across domains and query styles. Recent efforts to overcome this limitation have adapted large multimodal language models (MLLMs) to VTG, but their high compute cost and limited video context still hinder long-video grounding. We instead scale unified supervision while keeping the model lightweight. We present UniversalVTG, a single VTG model trained with large-scale cross-dataset pretraining. An offline Query Unifier canonicalizes heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and preventing the negative transfer observed under naïve joint training. Combined with an efficient grounding head, UniversalVTG scales to long, untrimmed videos. Across diverse benchmarks (GoalStep-StepGrounding, Ego4D-NLQ, TACoS, Charades-STA, and ActivityNet-Captions), one UniversalVTG checkpoint achieves state-of-the-art performance versus dedicated VTG models. Moreover, despite being $>100\times$ smaller than recent MLLM-based approaches, UniversalVTG matches or exceeds their accuracy on multiple benchmarks, offering a practical alternative to parameter-heavy MLLMs.
Chinese Translation
视频时间定位(VTG)通常采用特定于数据集的模型进行处理,这些模型在不同领域和查询风格之间的迁移效果较差。近期的研究努力克服这一限制,已将大型多模态语言模型(MLLMs)适配于VTG,但其高计算成本和有限的视频上下文仍然阻碍了长视频的定位。我们转而在保持模型轻量的同时扩展统一监督。我们提出了UniversalVTG,这是一种通过大规模跨数据集预训练得到的单一VTG模型。离线查询统一器将异构查询格式规范化为共享的声明性空间,减少了语言不匹配,并防止了在简单联合训练下观察到的负迁移。结合高效的定位头,UniversalVTG能够扩展到长时间、未剪辑的视频。在多样的基准测试(GoalStep-StepGrounding、Ego4D-NLQ、TACoS、Charades-STA和ActivityNet-Captions)中,单个UniversalVTG检查点相较于专用的VTG模型取得了最先进的性能。此外,尽管UniversalVTG的规模比最近基于MLLM的方法小100倍以上,但在多个基准测试中,其准确性与这些方法相当或更高,为参数量庞大的MLLMs提供了一个实用的替代方案。
cs.CV / 123 / 2604.08526
FIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On
FIT:一个大规模的适合度感知虚拟试衣数据集
Abstract
Given a person and a garment image, virtual try-on (VTO) aims to synthesize a realistic image of the person wearing the garment, while preserving their original pose and identity. Although recent VTO methods excel at visualizing garment appearance, they largely overlook a crucial aspect of the try-on experience: the accuracy of garment fit -- for example, depicting how an extra-large shirt looks on an extra-small person. A key obstacle is the absence of datasets that provide precise garment and body size information, particularly for "ill-fit" cases, where garments are significantly too large or too small. Consequently, current VTO methods default to generating well-fitted results regardless of the garment or person size. In this paper, we take the first steps towards solving this open problem. We introduce FIT (Fit-Inclusive Try-on), a large-scale VTO dataset comprising over 1.13M try-on image triplets accompanied by precise body and garment measurements. We overcome the challenges of data collection via a scalable synthetic strategy: (1) We programmatically generate 3D garments using GarmentCode and drape them via physics simulation to capture realistic garment fit. (2) We employ a novel re-texturing framework to transform synthetic renderings into photorealistic images while strictly preserving geometry. (3) We introduce person identity preservation into our re-texturing model to generate paired person images (same person, different garments) for supervised training. Finally, we leverage our FIT dataset to train a baseline fit-aware virtual try-on model. Our data and results set the new state-of-the-art for fit-aware virtual try-on, as well as offer a robust benchmark for future research. We will make all data and code publicly available on our project page: https://johannakarras.github.io/FIT.
Chinese Translation
虚拟试衣(VTO)旨在根据一个人和一件服装图像合成该人穿着该服装的真实图像,同时保持其原始姿势和身份。尽管近期的VTO方法在可视化服装外观方面表现出色,但它们在试穿体验的一个关键方面——服装适合度的准确性上却大多被忽视,例如,描绘一件特大号衬衫在一位特小号人士身上的效果。一个主要障碍是缺乏提供精确服装和身体尺寸信息的数据集,尤其是在“适合度不佳”的情况下,即服装明显过大或过小。因此,当前的VTO方法通常默认生成合身的结果,而不考虑服装或个人的尺寸。在本文中,我们迈出了朝着解决这一开放问题的第一步。我们介绍了FIT(Fit-Inclusive Try-on),这是一个大规模的VTO数据集,包含超过113万对试穿图像三元组,并附有精确的身体和服装测量数据。我们通过可扩展的合成策略克服了数据收集的挑战:(1) 我们使用GarmentCode程序生成3D服装,并通过物理模拟进行悬垂,以捕捉真实的服装适合度。(2) 我们采用了一种新颖的重纹理框架,将合成渲染转换为照片级真实图像,同时严格保持几何形状。(3) 我们在重纹理模型中引入了人物身份保留,以生成成对的人物图像(同一人物,不同服装)用于监督训练。最后,我们利用我们的FIT数据集训练了一个基线的适合度感知虚拟试衣模型。我们的数据和结果设定了适合度感知虚拟试衣的新状态,并为未来的研究提供了一个强有力的基准。我们将把所有数据和代码公开发布在我们的项目页面:https://johannakarras.github.io/FIT。
cs.CV / 124 / 2604.08532
Self-Improving 4D Perception via Self-Distillation
通过自我蒸馏实现自我提升的四维感知
Abstract
Large-scale multi-view reconstruction models have made remarkable progress, but most existing approaches still rely on fully supervised training with ground-truth 3D/4D annotations. Such annotations are expensive and particularly scarce for dynamic scenes, limiting scalability. We propose SelfEvo, a self-improving framework that continually improves pretrained multi-view reconstruction models using unlabeled videos. SelfEvo introduces a self-distillation scheme using spatiotemporal context asymmetry, enabling self-improvement for learning-based 4D perception without external annotations. We systematically study design choices that make self-improvement effective, including loss signals, forms of asymmetry, and other training strategies. Across eight benchmarks spanning diverse datasets and domains, SelfEvo consistently improves pretrained baselines and generalizes across base models (e.g. VGGT and $\pi^3$), with significant gains on dynamic scenes. Overall, SelfEvo achieves up to 36.5% relative improvement in video depth estimation and 20.1% in camera estimation, without using any labeled data. Project Page: https://self-evo.github.io/.
Chinese Translation
大规模多视角重建模型取得了显著进展,但大多数现有方法仍依赖于具有真实标签的3D/4D注释进行完全监督训练。这类注释成本高昂,尤其在动态场景中稀缺,限制了可扩展性。我们提出了SelfEvo,一个自我提升框架,利用未标记的视频不断改进预训练的多视角重建模型。SelfEvo引入了一种利用时空上下文不对称性的自我蒸馏方案,使得基于学习的四维感知能够在没有外部注释的情况下实现自我提升。我们系统地研究了使自我提升有效的设计选择,包括损失信号、不对称形式和其他训练策略。在涵盖多样数据集和领域的八个基准测试中,SelfEvo始终改善预训练基线,并在基础模型(例如VGGT和$\pi^3$)之间实现了良好的泛化,在动态场景中取得了显著的提升。总体而言,SelfEvo在视频深度估计中实现了高达36.5%的相对提升,在相机估计中实现了20.1%的提升,且未使用任何标记数据。项目页面:https://self-evo.github.io/
cs.CV / 125 / 2604.08538
ParseBench: A Document Parsing Benchmark for AI Agents
ParseBench:面向智能体的文档解析基准测试
Abstract
AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of ~2,000 human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score at \agenticoverall\%, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available on HuggingFace (https://huggingface.co/datasets/llamaindex/ParseBench) and GitHub (https://github.com/run-llama/ParseBench).
Chinese Translation
智能体正在改变文档解析的需求,关键在于“语义正确性”:解析结果必须保留用于自主决策所需的结构和含义,包括正确的表格结构、精确的图表数据、语义丰富的格式以及视觉定位。现有基准测试未能充分覆盖企业自动化场景,通常依赖于狭窄的文档分布和文本相似度指标,无法捕捉对智能体至关重要的失败。我们提出了ParseBench,这是一个包含约2000页经过人工验证的企业文档的基准,涵盖保险、金融和政府领域,围绕五个能力维度组织:表格、图表、内容忠实度、语义格式和视觉定位。在涵盖视觉语言模型、专用文档解析器和LlamaParse的14种方法中,基准测试揭示了能力的分散格局:没有任何方法能在所有五个维度上持续表现优异。LlamaParse Agentic以\agenticoverall\%的总体得分位居最高,基准测试同时揭示了当前系统存在的能力缺口。数据集和评测代码已发布于HuggingFace和GitHub。
cs.CV / 126 / 2604.08539
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks
OpenVLThinkerV2:一种用于多领域视觉任务的通用多模态推理模型
Abstract
Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail outliers, and offers symmetric updates for positive and negative rewards. Leveraging the enhanced training stability provided by G$^2$RPO, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
Chinese Translation
群体相对策略优化(Group Relative Policy Optimization, GRPO)已成为推动多模态大型语言模型近期进展的事实上的强化学习(Reinforcement Learning, RL)目标。然而,将这一成功扩展到开源的多模态通用模型仍然受到两个主要挑战的严重限制:不同视觉任务之间奖励拓扑的极端变异,以及平衡细粒度感知与多步推理能力的固有困难。为了解决这些问题,我们提出了高斯GRPO(Gaussian GRPO, G$^2$RPO),这是一种新颖的RL训练目标,它用非线性分布匹配替代了标准线性缩放。通过数学上强制任何给定任务的优势分布严格收敛到标准正态分布$\mathcal{N}(0,1)$,G$^2$RPO理论上确保了任务间梯度的公平性,减轻了对重尾异常值的脆弱性,并为正负奖励提供了对称更新。利用G$^2$RPO提供的增强训练稳定性,我们引入了两种任务级塑形机制,以无缝平衡感知与推理。首先,响应长度塑形动态引导复杂查询的扩展推理链,同时强制直接输出以增强视觉基础。其次,熵塑形严格限制模型的探索区域,有效防止熵崩溃和熵爆炸。结合这些方法,我们提出了OpenVLThinkerV2,这是一种高度稳健的通用多模态模型。在18个不同基准上的广泛评估表明,其性能优于强大的开源模型和领先的专有前沿模型。
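The contrast between linear scaling and non-linear distributional matching can be sketched as follows. The abstract does not specify the mapping G$^2$RPO uses, so this example substitutes a rank-based inverse normal transform as one plausible instance; the function name `gaussianized_advantages` and the tie-handling rule are assumptions, not the paper's formulation.

```python
from statistics import NormalDist


def gaussianized_advantages(rewards):
    """Map a group of rewards to standard-normal quantiles by rank.

    Assumed stand-in for 'distributional matching': only the ordering of
    rewards matters, so a single huge outlier cannot dominate the group.
    """
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied rewards.
        while j + 1 < n and rewards[order[j + 1]] == rewards[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average 1-based rank shared by the ties
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (rank - 0.5) / n always lies strictly inside (0, 1).
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Hypothetical group of rollout rewards with one heavy-tail outlier.
print(gaussianized_advantages([0.0, 0.0, 1.0, 100.0]))
```

Under z-score normalization the reward of 100 would set the scale for the whole group; here it simply receives the top quantile, and the outputs are symmetric around zero when the rewards have no ties.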
cs.CV / 127 / 2604.08540
AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
AVGen-Bench:一种面向任务的多粒度文本到音频视频生成评估基准
Abstract
Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.
Chinese Translation
文本到音频视频(T2AV)生成正迅速成为媒体创作的核心接口,但其评估仍然分散。现有基准主要孤立地评估音频和视频,或依赖粗略的嵌入相似性,未能捕捉到现实提示所需的细粒度联合正确性。我们提出了AVGen-Bench,这是一种面向任务的T2AV生成基准,涵盖11个真实世界类别的高质量提示。为了支持全面评估,我们提出了一种多粒度评估框架,将轻量级专业模型与多模态大型语言模型(MLLMs)相结合,使评估能够从感知质量到细粒度语义可控性。我们的评估揭示了强大的音视频美学与弱语义可靠性之间的明显差距,包括文本渲染、语音连贯性、物理推理的持续失败,以及音乐音高控制的普遍崩溃。代码和基准资源可在http://aka.ms/avgenbench获取。
cs.CV / 128 / 2604.08541
Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts
看见但不思考:多模态专家混合模型中的路由干扰
Abstract
Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.
Chinese Translation
多模态专家混合模型(MoE)在视觉-语言任务中取得了显著的性能。然而,我们发现了一种令人困惑的现象,称为“看见但不思考”:模型能够准确感知图像内容,但在后续推理中失败,而在以纯文本形式呈现的相同问题上却能正确解决。通过系统分析,我们首先验证了MoE架构中存在跨模态语义共享,排除了语义对齐失败作为唯一解释的可能性。接着,我们揭示了视觉专家和领域专家在层级上的分离,图像输入在中间层中引发了与文本输入显著不同的路由偏差,而领域专家则集中在这些层级。基于这些发现,我们提出了路由干扰假设:在处理视觉输入时,路由机制未能充分激活与任务相关的推理专家。为了验证这一假设,我们设计了一种路由引导干预方法,以增强领域专家的激活。在三个多模态MoE模型的六个基准测试上的实验表明,性能一致提升,复杂视觉推理任务的增益高达3.17%。我们的分析进一步揭示,领域专家的识别定位于认知功能而非特定样本解决方案,从而实现了在不同信息结构任务之间的有效迁移。
cs.CV / 129 / 2604.08542
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
Scal3R:大规模3D重建的可扩展测试时训练
Abstract
This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lightweight neural sub-networks that are rapidly adapted during test time via self-supervised objectives, which substantially increases memory capacity without incurring significant computational overhead. The experiments on multiple large-scale benchmarks, including the KITTI Odometry~\cite{Geiger2012CVPR} and Oxford Spires~\cite{tao2025spires} datasets, demonstrate the effectiveness of our approach in handling ultra-large scenes, achieving leading pose accuracy and state-of-the-art 3D reconstruction accuracy while maintaining efficiency. Code is available at https://zju3dv.github.io/scal3r.
Chinese Translation
本文针对从长视频序列中进行大规模3D场景重建的任务进行了研究。近期的前馈重建模型通过直接从RGB图像回归3D几何体,未使用显式的3D先验或几何约束,已显示出良好的效果。然而,由于内存容量有限以及无法有效捕捉全局上下文线索,这些方法在长序列中往往难以保持重建的准确性和一致性。相比之下,人类能够自然地利用对场景的全局理解来指导局部感知。基于此动机,我们提出了一种新颖的神经全局上下文表示,它有效地压缩并保留长距离场景信息,使模型能够利用丰富的上下文线索以提高重建的准确性和一致性。该上下文表示通过一组轻量级神经子网络实现,这些子网络在测试时通过自监督目标快速适应,显著增加了内存容量而不引入显著的计算开销。在多个大规模基准测试上的实验,包括KITTI Odometry~\cite{Geiger2012CVPR}和Oxford Spires~\cite{tao2025spires}数据集,展示了我们的方法在处理超大场景时的有效性,达到了领先的姿态准确性和最先进的3D重建准确性,同时保持了效率。代码可在 https://zju3dv.github.io/scal3r 获取。
cs.CV / 130 / 2604.08543
E-3DPSM: A State Machine for Event-Based Egocentric 3D Human Pose Estimation
E-3DPSM:一种用于基于事件的自我中心3D人体姿态估计的状态机
Abstract
Event cameras offer multiple advantages in monocular egocentric 3D human pose estimation from head-mounted devices, such as millisecond temporal resolution, high dynamic range, and negligible motion blur. Existing methods effectively leverage these properties, but suffer from low 3D estimation accuracy, insufficient in many applications (e.g., immersive VR/AR). This is due to the design not being fully tailored towards event streams (e.g., their asynchronous and continuous nature), leading to high sensitivity to self-occlusions and temporal jitter in the estimates. This paper rethinks the setting and introduces E-3DPSM, an event-driven continuous pose state machine for event-based egocentric 3D human pose estimation. E-3DPSM aligns continuous human motion with fine-grained event dynamics; it evolves latent states and predicts continuous changes in 3D joint positions associated with observed events, which are fused with direct 3D human pose predictions, leading to stable and drift-free final 3D pose reconstructions. E-3DPSM runs in real-time at 80 Hz on a single workstation and sets a new state of the art in experiments on two benchmarks, improving accuracy by up to 19% (MPJPE) and temporal stability by up to 2.7x. See our project page for the source code and trained models.
Chinese Translation
事件相机在从头戴设备进行单目自我中心3D人体姿态估计时提供了多种优势,例如毫秒级的时间分辨率、高动态范围和可忽略的运动模糊。现有方法有效利用了这些特性,但在3D估计精度上存在不足,无法满足许多应用(例如沉浸式虚拟现实/增强现实)的需求。这是因为设计未能完全针对事件流(例如其异步和连续的特性)进行优化,导致对自遮挡和估计中的时间抖动高度敏感。本文重新思考了这一设置,并引入了E-3DPSM,一种用于基于事件的自我中心3D人体姿态估计的事件驱动连续姿态状态机。E-3DPSM将连续的人体运动与细粒度的事件动态对齐;它演变潜在状态,并预测与观察到的事件相关的3D关节位置的连续变化,这些变化与直接的3D人体姿态预测相融合,从而实现稳定且无漂移的最终3D姿态重建。E-3DPSM在单个工作站上以80 Hz的实时速度运行,并在两个基准测试的实验中设定了新的技术水平,准确度提高了多达19%(MPJPE),时间稳定性提高了多达2.7倍。请访问我们的项目页面获取源代码和训练模型。
cs.CV / 131 / 2604.08545
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
明智行动:在自主多模态模型中培养元认知工具使用
Abstract
The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
Chinese Translation
自主多模态模型的出现使系统能够主动与外部环境互动。然而,当前的代理存在严重的元认知缺陷:它们在利用内部知识与查询外部工具之间难以做出权衡。因此,它们常常陷入盲目调用工具的困境,即使在原始视觉上下文中可以解决查询时也会反射性地执行工具。这种病态行为导致严重的延迟瓶颈,并引入额外的噪声,干扰合理推理。现有的强化学习协议试图通过对工具使用进行惩罚的标量化奖励来缓解这一问题。然而,这种耦合的形式化导致了不可调和的优化困境:过高的惩罚抑制了必要的工具使用,而过低的惩罚则完全被优势归一化过程中准确性奖励的方差所淹没,使其在应对工具过度使用时无能为力。为了超越这一瓶颈,我们提出了HDPO,一个将工具效率从竞争的标量目标重新构建为严格条件目标的框架。通过避免奖励标量化,HDPO维持了两个正交的优化通道:一个最大化任务正确性的准确性通道,以及一个通过条件优势估计在准确轨迹内强制执行执行经济性的效率通道。这种解耦的架构自然引导出一种认知课程,促使代理首先掌握任务解决,然后再提升其自我依赖能力。广泛的评估表明,我们的模型Metis在显著减少工具调用的同时,提升了推理准确性。
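The two optimization channels described above can be sketched as follows, under stated assumptions. The abstract gives no formulas, so the group-wise z-score normalization, the use of negative tool-call counts as the efficiency score, and the names `normalize` and `decoupled_advantages` are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def normalize(values):
    """Group-wise z-score; returns zeros when the group has no spread."""
    if not values:
        return []
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


def decoupled_advantages(correct, tool_calls):
    """Return (accuracy_advantages, efficiency_advantages) for one rollout group.

    Accuracy channel: normalized over every rollout in the group.
    Efficiency channel: normalized only among correct rollouts, so fewer
    tool calls is rewarded solely relative to other correct answers.
    """
    acc_adv = normalize([1.0 if c else 0.0 for c in correct])
    correct_idx = [i for i, c in enumerate(correct) if c]
    eff_scores = normalize([-float(tool_calls[i]) for i in correct_idx])
    eff_adv = [0.0] * len(correct)  # incorrect rollouts get no efficiency signal
    for i, score in zip(correct_idx, eff_scores):
        eff_adv[i] = score
    return acc_adv, eff_adv


# Hypothetical group of four rollouts for one query.
acc, eff = decoupled_advantages(
    correct=[True, True, False, True],
    tool_calls=[0, 4, 9, 2],
)
print(acc)
print(eff)
```

Because the efficiency score is never added to the accuracy reward before normalization, a small efficiency signal is not swamped by accuracy variance, and an incorrect rollout cannot gain credit for skipping tools. How the two channels are weighted in the final policy update is not stated in the abstract and is left out here.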
cs.CV / 132 / 2604.08546
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
数字发声:在文本到视频扩散模型中对齐文本数字与视觉实例
Abstract
Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.
Chinese Translation
文本到视频扩散模型已实现开放式视频合成,但在生成提示中指定的正确物体数量时常常面临挑战。我们提出了NUMINA,一种无训练的识别-引导框架,以改善数字对齐。NUMINA通过选择具有区分性的自注意力和交叉注意力头来识别提示布局的不一致性,从而推导出可计数的潜在布局。然后,它保守地细化该布局,并调节交叉注意力以指导再生。在引入的CountBench上,NUMINA在Wan2.1-1.3B模型上提高了计数准确性,提升幅度达到7.4%;在5B和14B模型上分别提高了4.9%和5.5%。此外,CLIP对齐得到了改善,同时保持了时间一致性。这些结果表明,结构性指导补充了种子搜索和提示增强,为实现计数准确的文本到视频扩散提供了一条实用路径。代码可在https://github.com/H-EmbodVis/NUMINA获取。
cs.CV / 133 / 2604.08547
GaussiAnimate: Reconstruct and Rig Animatable Categories with Level of Dynamics
GaussiAnimate:重建和装配可动画类别的动态水平
Abstract
Free-form bones that conform closely to the surface can effectively capture non-rigid deformations, but lack the kinematic structure necessary for intuitive control. Thus, we propose a Scaffold-Skin Rigging System, termed "Skelebones", with three key steps: (1) Bones: compress temporally-consistent deformable Gaussians into free-form bones, approximating non-rigid surface deformations; (2) Skeleton: extract a Mean Curvature Skeleton from canonical Gaussians and refine it temporally, ensuring a category-agnostic, motion-adaptive, and topology-correct kinematic structure; (3) Binding: bind the skeleton and bones via non-parametric partwise motion matching (PartMM), synthesizing novel bone motions by matching, retrieving, and blending existing ones. Collectively, these three steps enable us to compress the Level of Dynamics of 4D shapes into compact skelebones that are both controllable and expressive. We validate our approach on both synthetic and real-world datasets, achieving significant improvements in reanimation performance across unseen poses, with 17.3% PSNR gains over Linear Blend Skinning (LBS) and 21.7% over Bag-of-Bones (BoB), while maintaining excellent reconstruction fidelity, particularly for characters exhibiting complex non-rigid surface dynamics. Our Partwise Motion Matching algorithm demonstrates strong generalization to both Gaussian and mesh representations, especially in the low-data regime (~1000 frames), achieving 48.4% RMSE improvement over robust LBS and outperforming GRU- and MLP-based learning methods by >20%. Code will be made publicly available for research purposes at cookmaker.cn/gaussianimate.
Chinese Translation
自由形状骨骼能够紧密贴合表面,有效捕捉非刚性变形,但缺乏直观控制所需的运动结构。因此,我们提出了一种称为“Skelebones”的支架-皮肤装配系统,包含三个关键步骤:(1)骨骼:将时间一致的可变形高斯压缩为自由形状骨骼,近似非刚性表面变形;(2)骨架:从典型高斯中提取平均曲率骨架并进行时间上的细化,确保类别无关、运动自适应和拓扑正确的运动结构;(3)绑定:通过非参数部分运动匹配(PartMM)将骨架与骨骼绑定,通过匹配、检索和混合现有骨骼运动合成新骨骼运动。这三个步骤共同使我们能够将4D形状的动态水平压缩为既可控又富有表现力的紧凑骨骼。我们在合成和真实世界数据集上验证了我们的方法,在未见姿态的再动画性能上取得了显著提升——相较于线性混合皮肤(LBS)提高了17.3%的PSNR,相较于骨骼包(BoB)提高了21.7%——同时保持了出色的重建保真度,尤其是对于表现复杂非刚性表面动态的角色。我们的部分运动匹配算法在高斯和网格表示上都表现出强大的泛化能力,特别是在低数据环境下(约1000帧),相较于稳健的LBS实现了48.4%的RMSE改进,并且比基于GRU和MLP的学习方法提高了超过20%。代码将公开发布于研究目的,网址为cookmaker.cn/gaussianimate。
cs.CV / 134 / 2604.08548
ETCH-X: Robustify Expressive Body Fitting to Clothed Humans with Composable Datasets
ETCH-X:通过可组合数据集增强对穿衣人类的表达性身体拟合的鲁棒性
Abstract
Human body fitting, which aligns parametric body models such as SMPL to raw 3D point clouds of clothed humans, serves as a crucial first step for downstream tasks like animation and texturing. An effective fitting method should be both locally expressive (capturing fine details such as hands and facial features) and globally robust to handle real-world challenges, including clothing dynamics, pose variations, and noisy or partial inputs. Existing approaches typically excel in only one aspect, lacking an all-in-one solution. We upgrade ETCH to ETCH-X, which leverages a tightness-aware fitting paradigm to filter out clothing dynamics ("undress"), extends expressiveness with SMPL-X, and replaces explicit sparse markers (which are highly sensitive to partial data) with implicit dense correspondences ("dense fit") for more robust and fine-grained body fitting. Our disentangled "undress" and "dense fit" modular stages enable separate and scalable training on composable data sources, including diverse simulated garments (CLOTH3D), large-scale full-body motions (AMASS), and fine-grained hand gestures (InterHand2.6M), improving outfit generalization and pose robustness of both bodies and hands. Our approach achieves robust and expressive fitting across diverse clothing, poses, and levels of input completeness, delivering a substantial performance improvement over ETCH on both: 1) seen data, such as 4D-Dress (MPJPE-All, 33.0%) and CAPE (V2V-Hands, 35.8%), and 2) unseen data, such as BEDLAM2.0 (MPJPE-All, 80.8%; V2V-All, 80.5%). Code and models will be released at https://xiaobenli00.github.io/ETCH-X/.
Chinese Translation
人类身体拟合是将参数化身体模型(如 SMPL)与穿衣人类的原始 3D 点云对齐的关键第一步,为动画和纹理等下游任务奠定基础。有效的拟合方法应具备局部表达性——捕捉手部和面部特征等细节——以及全球鲁棒性,以应对现实世界中的挑战,包括服装动态、姿态变化以及噪声或部分输入。现有方法通常在某一方面表现出色,但缺乏一体化的解决方案。我们将 ETCH 升级为 ETCH-X,利用紧致感知拟合范式过滤服装动态(“脱衣”),通过 SMPL-X 扩展表达性,并用隐式稠密对应(“稠密拟合”)替代显式稀疏标记(对部分数据高度敏感),以实现更鲁棒和细致的身体拟合。我们解耦的“脱衣”和“稠密拟合”模块化阶段使得可以在可组合数据源上进行独立且可扩展的训练,包括多样化的模拟服装(CLOTH3D)、大规模全身动作(AMASS)和精细的手势(InterHand2.6M),提高了身体和手部的服装泛化能力和姿态鲁棒性。我们的方法在多样的服装、姿态和输入完整性水平下实现了鲁棒和富有表现力的拟合,在两个方面相较于 ETCH 取得了显著的性能提升:1)已见数据,如 4D-Dress(MPJPE-All, 33.0%)和 CAPE(V2V-Hands, 35.8%);2)未见数据,如 BEDLAM2.0(MPJPE-All, 80.8%;V2V-All, 80.5%)。代码和模型将发布在 https://xiaobenli00.github.io/ETCH-X/。
cs.AI / 1 / 2604.07424
An Analysis of Artificial Intelligence Adoption in NIH-Funded Research
对美国国立卫生研究院(NIH)资助研究中人工智能采用的分析
Abstract
Understanding the landscape of artificial intelligence (AI) and machine learning (ML) adoption across the National Institutes of Health (NIH) portfolio is critical for research funding strategy, institutional planning, and health policy. The advent of large language models (LLMs) has fundamentally transformed research landscape analysis, enabling researchers to perform large-scale semantic extraction from thousands of unstructured research documents. In this paper, we illustrate a human-in-the-loop research methodology for LLMs to automatically classify and summarize research descriptions at scale. Using our methodology, we present a comprehensive analysis of 58,746 NIH-funded biomedical research projects from 2025. We show that: (1) AI constitutes 15.9% of the NIH portfolio with a 13.4% funding premium, concentrated in discovery, prediction, and data integration across disease domains; (2) a critical research-to-deployment gap exists, with 79% of AI projects remaining in research/development stages while only 14.7% engage in clinical deployment or implementation; and (3) health disparities research is severely underrepresented at just 5.7% of AI-funded work despite its importance to NIH's equity mission. These findings establish a framework for evidence-based policy interventions to align the NIH AI portfolio with health equity goals and strategic research priorities.
Chinese Translation
了解美国国立卫生研究院(NIH)项目中人工智能(AI)和机器学习(ML)采用的现状,对于研究资金策略、机构规划和健康政策至关重要。大型语言模型(LLMs)的出现从根本上改变了研究领域分析,使研究人员能够从数千份非结构化研究文档中进行大规模语义提取。在本文中,我们展示了一种人机协作的研究方法,利用LLMs自动对研究描述进行大规模分类和总结。通过我们的方法,我们对2025年58,746个NIH资助的生物医学研究项目进行了全面分析。我们发现:(1)人工智能占NIH投资组合的15.9%,并具有13.4%的资金溢价,主要集中在疾病领域的发现、预测和数据整合;(2)存在显著的研究与部署之间的差距,79%的人工智能项目仍处于研究/开发阶段,仅有14.7%参与临床部署或实施;(3)尽管健康差异研究对NIH的公平使命至关重要,但其在人工智能资助工作中仅占5.7%,严重不足。这些发现为基于证据的政策干预提供了框架,以使NIH的人工智能投资组合与健康公平目标和战略研究优先事项保持一致。
cs.AI / 2 / 2604.07455
Munkres' General Topology Autoformalized in Isabelle/HOL
Munkres的一般拓扑在Isabelle/HOL中的自动形式化
Abstract
We describe an experiment in LLM-assisted autoformalization that produced over 85,000 lines of Isabelle/HOL code covering all 39 sections of Munkres' Topology (general topology, Chapters 2--8), from topological spaces through dimension theory. The LLM-based coding agents (initially ChatGPT 5.2 and then Claude Opus 4.6) needed 24 active days for this. The formalization is complete: all 806 formal results are fully proved, with zero uses of sorry. Proved results include the Tychonoff theorem, the Baire category theorem, the Nagata--Smirnov and Smirnov metrization theorems, the Stone--\v{C}ech compactification, Ascoli's theorem, the space-filling curve, and others. The methodology is based on a "sorry-first" declarative proof workflow combined with bulk use of sledgehammer, two of Isabelle's major strengths. This leads to relatively fast autoformalization progress. We analyze the resulting formalization in detail, examine the human--LLM interaction patterns from the session log, and briefly compare with related autoformalization efforts in Megalodon, HOL Light, and Naproche. The results indicate that LLM-assisted formalization of standard mathematical textbooks in Isabelle/HOL is quite feasible, cheap, and fast, even if some human supervision is useful.
Chinese Translation
我们描述了一项在大型语言模型(LLM)辅助下进行的自动形式化实验,该实验生成了超过85,000行的Isabelle/HOL代码,涵盖了Munkres《拓扑学》(一般拓扑,第2章至第8章)的全部39节,从拓扑空间到维度理论。基于LLM的编码代理(最初是ChatGPT 5.2,随后是Claude Opus 4.6)为此用了24个活跃日。形式化工作已完成:所有806个形式结果均已完全证明,且没有使用任何 sorry(未完成证明的占位符)。已证明的结果包括Tychonoff定理、Baire纲定理、Nagata--Smirnov和Smirnov度量化定理、Stone--Čech紧化、Ascoli定理、空间填充曲线等。该方法基于“sorry 优先”(sorry-first)的声明式证明工作流,并结合大规模使用sledgehammer,二者都是Isabelle的主要优势。这带来了相对快速的自动形式化进展。我们详细分析了所得到的形式化结果,分析了会话日志中的人类与LLM的交互模式,并简要与Megalodon、HOL Light和Naproche中的相关自动形式化工作进行了比较。结果表明,在Isabelle/HOL中,LLM辅助的标准数学教材形式化是相当可行、经济且快速的,尽管一定的人类监督仍然有益。
cs.AI / 3 / 2604.07468
M-ArtAgent: Evidence-Based Multimodal Agent for Implicit Art Influence Discovery
M-ArtAgent:基于证据的多模态代理用于隐性艺术影响发现
Abstract
Implicit artistic influence, although visually plausible, is often undocumented and thus poses a historically constrained attribution problem: resemblance is necessary but not sufficient evidence. Most prior systems reduce influence discovery to embedding similarity or label-driven graph completion, while recent multimodal large language models (LLMs) remain vulnerable to temporal inconsistency and unverified attributions. This paper introduces M-ArtAgent, an evidence-based multimodal agent that reframes implicit influence discovery as probabilistic adjudication. It follows a four-phase protocol consisting of Investigation, Corroboration, Falsification, and Verdict, governed by a Reasoning and Acting (ReAct)-style controller that assembles verifiable evidence chains from images and biographies, enforces art-historical axioms, and subjects each hypothesis to adversarial falsification via a prompt-isolated critic. Two theory-grounded operators, StyleComparator for Wölfflin formal analysis and ConceptRetriever for ICONCLASS-based iconographic grounding, ensure that intermediate claims are formally auditable. On the balanced WikiArt Influence Benchmark-100 (WIB-100) of 100 artists and 2,000 directed pairs, M-ArtAgent achieves 83.7% positive-class F1, 0.666 Matthews correlation coefficient (MCC), and 0.910 area under the receiver operating characteristic curve (ROC-AUC), with leakage-control and robustness checks confirming that the gains persist when explicit influence phrases are masked. By coupling multimodal perception with domain-constrained falsification, M-ArtAgent demonstrates that implicit influence analysis benefits from historically grounded adjudication rather than pattern matching alone.
Chinese Translation
隐性艺术影响虽然在视觉上是合理的,但往往缺乏文献记录,因此造成了历史上受限的归因问题:相似性是必要但不足的证据。大多数先前的系统将影响发现简化为嵌入相似性或标签驱动的图完成,而最近的多模态大语言模型(LLMs)仍然容易受到时间不一致性和未经验证的归因的影响。本文介绍了M-ArtAgent,这是一种基于证据的多模态代理,将隐性影响发现重新框定为概率裁决。它遵循一个由调查、证实、反驳和裁决四个阶段组成的协议,由一个推理与行动(ReAct)风格的控制器管理,该控制器从图像和传记中组装可验证的证据链,执行艺术史公理,并通过一个隔离提示的批评者对每个假设进行对抗性反驳。两个基于理论的操作符,Wolfflin形式分析的StyleComparator和基于ICONCLASS的图像学基础的ConceptRetriever,确保中间主张是形式上可审计的。在平衡的WikiArt影响基准测试-100(WIB-100)中,涵盖100位艺术家和2000对定向配对,M-ArtAgent实现了83.7%的正类F1值、0.666的马修斯相关系数(MCC)和0.910的接收者操作特征曲线下面积(ROC-AUC),泄漏控制和鲁棒性检查确认在显性影响短语被屏蔽时,增益依然存在。通过将多模态感知与领域约束的反驳相结合,M-ArtAgent表明隐性影响分析从历史基础的裁决中受益,而不仅仅是模式匹配。
cs.AI / 4 / 2604.07484
ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training
ConsistRM:通过一致性感知自训练提升生成式奖励模型
Abstract
Generative reward models (GRMs) have emerged as a promising approach for aligning Large Language Models (LLMs) with human preferences by offering greater representational capacity and flexibility than traditional scalar reward models. However, GRMs face two major challenges: reliance on costly human-annotated data restricts scalability, and self-training approaches often suffer from instability and vulnerability to reward hacking. To address these issues, we propose ConsistRM, a self-training framework that enables effective and stable GRM training without human annotations. ConsistRM incorporates the Consistency-Aware Answer Reward, which produces reliable pseudo-labels with temporal consistency, thereby providing more stable model optimization. Moreover, the Consistency-Aware Critique Reward is introduced to assess semantic consistency across multiple critiques and to allocate fine-grained, differentiated rewards. Experiments on five benchmark datasets across four base models demonstrate that ConsistRM outperforms vanilla Reinforcement Fine-Tuning (RFT) by an average of 1.5%. Further analysis shows that ConsistRM enhances output consistency and mitigates position bias caused by input order, highlighting the effectiveness of consistency-aware rewards in improving GRMs.
Chinese Translation
生成式奖励模型(Generative Reward Models,GRMs)作为一种有前景的方法,通过提供比传统标量奖励模型更强的表示能力和灵活性,实现了大型语言模型(Large Language Models,LLMs)与人类偏好的对齐。然而,GRMs 面临两大挑战:依赖昂贵的人类标注数据限制了其可扩展性,自训练方法则常常存在不稳定性及易受奖励欺骗的弱点。为解决这些问题,我们提出了 ConsistRM,一种无需人工标注即可实现有效且稳定的 GRM 自训练框架。ConsistRM 引入了一致性感知答案奖励(Consistency-Aware Answer Reward),该奖励通过时间一致性生成可靠的伪标签,从而促进更稳定的模型优化。此外,还引入了一致性感知批判奖励(Consistency-Aware Critique Reward),用于评估多重批判间的语义一致性,并分配细粒度且差异化的奖励。基于四个基础模型的五个基准数据集上的实验表明,ConsistRM 平均优于基础强化微调(Reinforcement Fine-Tuning,RFT)1.5%。进一步分析显示,ConsistRM 提升了输出一致性,减轻了由输入顺序引起的位置偏差,凸显了一致性感知奖励在提升 GRMs 表现中的有效性。
cs.AI / 5 / 2604.07487
CLEAR: Context Augmentation from Contrastive Learning of Experience via Agentic Reflection
CLEAR:通过代理反思的经验对比学习进行上下文增强
Abstract
Large language model agents rely on effective model context to obtain task-relevant information for decision-making. Many existing context engineering approaches rely primarily on context generated from past experience and on retrieval mechanisms that reuse this context. However, retrieved context from past tasks must be adapted by the execution agent to fit new situations, placing an additional reasoning burden on the underlying LLM. To address this limitation, we propose a generative context augmentation framework using Contrastive Learning of Experience via Agentic Reflection (CLEAR). CLEAR first employs a reflection agent to perform contrastive analysis over past execution trajectories and summarize useful context for each observed task. These summaries are then used as supervised fine-tuning data to train a context augmentation model (CAM). We then further optimize CAM using reinforcement learning, where the reward signal is obtained by running the task execution agent. By learning to generate task-specific knowledge rather than retrieve knowledge from the past, CAM produces context that is better tailored to the current task. We conduct comprehensive evaluations on the AppWorld and WebShop benchmarks. Experimental results show that CLEAR consistently outperforms strong baselines. Compared with the baseline agent, it improves the task completion rate from 72.62% to 81.15% on the AppWorld test set and the average reward from 0.68 to 0.74 on a subset of WebShop. Our code is publicly available at https://github.com/awslabs/CLEAR.
Chinese Translation
大型语言模型代理依赖于有效的模型上下文来获取与任务相关的信息以进行决策。许多现有的上下文工程方法主要依赖于从过去经验生成的上下文和重用这些上下文的检索机制。然而,从过去任务中检索的上下文必须由执行代理进行调整,以适应新情况,这给基础的LLM带来了额外的推理负担。为了解决这一限制,我们提出了一种使用通过代理反思的经验对比学习(Contrastive Learning of Experience via Agentic Reflection,CLEAR)的生成上下文增强框架。CLEAR首先利用反思代理对过去的执行轨迹进行对比分析,并为每个观察到的任务总结有用的上下文。这些总结随后被用作监督微调数据,以训练上下文增强模型(Context Augmentation Model,CAM)。然后,我们进一步使用强化学习优化CAM,其中奖励信号是通过运行任务执行代理获得的。通过学习生成特定于任务的知识,而不是从过去检索知识,CAM生成的上下文更好地适应当前任务。我们在AppWorld和WebShop基准上进行了全面评估。实验结果表明,CLEAR始终优于强基线。在与基线代理相比的AppWorld测试集上,任务完成率从72.62%提高到81.15%,在WebShop的一个子集上平均奖励从0.68提高到0.74。我们的代码已公开发布在https://github.com/awslabs/CLEAR。
cs.AI / 6 / 2604.07506
ReflectRM: Boosting Generative Reward Models via Self-Reflection within a Unified Judgment Framework
ReflectRM:通过统一判断框架中的自我反思提升生成奖励模型
Abstract
Reward Models (RMs) are critical components in the Reinforcement Learning from Human Feedback (RLHF) pipeline, directly determining the alignment quality of Large Language Models (LLMs). Recently, Generative Reward Models (GRMs) have emerged as a superior paradigm, offering higher interpretability and stronger generalization than traditional scalar RMs. However, existing methods for GRMs focus primarily on outcome-level supervision, neglecting analytical process quality, which constrains their potential. To address this, we propose ReflectRM, a novel GRM that leverages self-reflection to assess analytical quality and enhance preference modeling. ReflectRM is trained under a unified generative framework for joint modeling of response preference and analysis preference. During inference, we use its self-reflection capability to identify the most reliable analysis, from which the final preference prediction is derived. Experiments across four benchmarks show that ReflectRM consistently improves performance, achieving an average accuracy gain of +3.7 on Qwen3-4B. Further experiments confirm that response preference and analysis preference are mutually reinforcing. Notably, ReflectRM substantially mitigates positional bias, yielding a +10.2 improvement over leading GRMs and establishing itself as a more stable evaluator.
Chinese Translation
奖励模型(RMs)是人类反馈强化学习(RLHF)流程中的关键组成部分,直接决定大型语言模型(LLMs)的对齐质量。最近,生成奖励模型(GRMs)作为一种优越的范式出现,提供了比传统标量奖励模型更高的可解释性和更强的泛化能力。然而,现有的GRM方法主要集中在结果层面的监督,忽视了分析过程质量,这限制了它们的潜力。为了解决这个问题,我们提出了ReflectRM,这是一种新颖的GRM,利用自我反思来评估分析质量并增强偏好建模。ReflectRM在一个统一的生成框架下进行训练,以联合建模响应偏好和分析偏好。在推理过程中,我们利用其自我反思能力来识别最可靠的分析,从中得出最终的偏好预测。在四个基准测试中的实验表明,ReflectRM始终提高性能,在Qwen3-4B上实现了平均准确率提升+3.7。进一步的实验确认了响应偏好和分析偏好是相互促进的。值得注意的是,ReflectRM显著减轻了位置偏差,与领先的GRM相比提高了+10.2,确立了其作为更稳定评估者的地位。
cs.AI / 7 / 2604.07512
Rhizome OS-1: Rhizome's Semi-Autonomous Operating System for Small Molecule Drug Discovery
Rhizome OS-1:Rhizome面向小分子药物发现的半自主操作系统
Abstract
We introduce a semi-autonomous discovery system in which multi-modal AI agents function as a multi-disciplinary discovery team, acting as computational chemists, medicinal chemists, and patent agents, writing and executing analysis code, visually evaluating molecular candidates, assessing patentability, and adapting generation strategy from empirical screening feedback, while r1, a 246M-parameter Graph Neural Network (GNN) trained on 800M molecules, generates novel chemical matter directly on molecular graphs. Agents executed two campaigns in oncology (BCL6, EZH2), formulating medicinal chemistry hypotheses across three strategy tiers and generating libraries of 2,355-2,876 novel molecules per target. Across both targets, 91.9% of generated Murcko scaffolds are absent from ChEMBL for their respective targets, with Tanimoto distances of 0.56-0.69 to the nearest known active, confirming that the engine produces structurally distinct chemical matter rather than recapitulating known compounds. Binding affinity predictions using Boltz-2 were calibrated against ChEMBL experimental data, achieving Spearman correlations of -0.53 to -0.64 and ROC AUC values of 0.88 to 0.93. These results demonstrate that semi-autonomous agent systems, equipped with graph-native generative tools and physics-informed scoring, provide a foundation for a modern operating system for small molecule discovery. We show that Rhizome OS-1 enables a new paradigm for early-stage drug discovery by supporting scaled, rapid, and adaptive inverse design.
Chinese Translation
我们介绍了一种半自主发现系统,其中多模态人工智能代理作为多学科发现团队运作,扮演计算化学家、药物化学家和专利代理人的角色,编写并执行分析代码,直观评估分子候选物,评估专利性,并根据经验筛选反馈调整生成策略。同时,r1——一个拥有2.46亿参数、在8亿分子上训练的图神经网络(GNN)——直接在分子图上生成新颖化学物质。代理执行了两个肿瘤学(BCL6,EZH2)相关的项目,跨越三个策略层级制定药物化学假设,并为每个靶点生成了2,355至2,876个新颖分子库。在两个靶点中,91.9%的生成Murcko骨架在ChEMBL数据库中未出现,且与最近已知活性分子的Tanimoto距离为0.56至0.69,确认该引擎产生了结构上独特的化学物质,而非重复已知化合物。利用Boltz-2进行的结合亲和力预测经过ChEMBL实验数据校准,获得了-0.53至-0.64的Spearman相关系数和0.88至0.93的ROC AUC值。这些结果表明,配备图原生生成工具和物理信息评分的半自主代理系统,为现代小分子发现操作系统奠定了基础。我们展示了Rhizome OS-1通过支持规模化、快速且自适应的逆向设计,开启了早期药物发现的新范式。
cs.AI / 8 / 2604.07535
Trust the AI, Doubt Yourself: The Effect of Urgency on Self-Confidence in Human-AI Interaction
信任人工智能,怀疑自己:紧迫感对人机交互中自信心的影响
Abstract
Studies show that interaction with an AI system fosters human users' trust in AI. An often overlooked element of such interaction dynamics is the (sense of) urgency when the human user is prompted by an AI agent, e.g., for advice or guidance. In this paper, we show that although the presence of urgency in human-AI interactions does not affect trust in AI, it may be detrimental to the human user's self-confidence and self-efficacy. In the long run, the loss of confidence may lead to performance loss, suboptimal decisions, human errors, and ultimately, unsustainable AI systems. Our evidence comes from an experiment with 30 human participants. Our results indicate that users may feel more confident in their work when they are eased into the human-AI setup rather than exposed to it without preparation. We elaborate on the implications of this finding for software engineers and decision-makers.
Chinese Translation
研究表明,与人工智能系统的互动能够增强人类用户对人工智能的信任。然而,在这种互动动态中,一个常被忽视的元素是当人类用户被人工智能代理提示时的紧迫感,例如请求建议或指导。在本文中,我们展示了尽管紧迫感的存在对人类用户对人工智能的信任没有影响,但它可能对人类用户的自信心和自我效能产生负面影响。从长远来看,自信心的丧失可能导致表现下降、次优决策、人为错误,最终导致人工智能系统的不可持续性。我们的证据来自对30名人类参与者的实验。结果表明,当用户逐步适应人机交互环境时,他们在工作中可能会感到更有信心,而不是在没有准备的情况下直接暴露于此。我们详细阐述了这一发现对软件工程师和决策者的影响。
cs.AI / 9 / 2604.07546
Agentic Copyright, Data Scraping & AI Governance: Toward a Coasean Bargain in the Era of Artificial Intelligence
代理版权、数据抓取与人工智能治理:在人工智能时代迈向科斯交易
Abstract
This paper examines how the rapid deployment of multi-agentic AI systems is reshaping the foundations of copyright law and creative markets. It argues that existing copyright frameworks are ill-equipped to govern AI agent-mediated interactions that occur at scale, speed, and with limited human oversight. The paper introduces the concept of agentic copyright, a model in which AI agents act on behalf of creators and users to negotiate access, attribution, and compensation for copyrighted works. While multi-agent ecosystems promise efficiency gains and reduced transaction costs, they also generate novel market failures, including miscoordination, conflict, and collusion among autonomous agents. To address these market failures, the paper develops a supervised multi-agent governance framework that integrates legal rules and principles, technical protocols, and institutional oversight. This framework emphasizes ex ante and ex post coordination mechanisms capable of correcting agentic market failures before they crystallize into systemic harm. By embedding normative constraints and monitoring functions into multi-agent architectures, supervised governance aims to align agent behavior with the underlying values of copyright law. The paper concludes that AI should be understood not only as a source of disruption, but also as a governance tool capable of restoring market-based ordering in creative industries. Properly designed, agentic copyright offers a path toward scalable, fair, and legally meaningful copyright markets in the age of AI.
Chinese Translation
本文探讨了多代理人工智能系统的快速部署如何重塑版权法和创意市场的基础。文章认为,现有的版权框架无法有效管理在大规模、快速和有限人类监督下发生的人工智能代理介导的互动。本文引入了代理版权的概念,这是一种模型,其中人工智能代理代表创作者和用户协商对版权作品的访问、归属和补偿。尽管多代理生态系统承诺提高效率和降低交易成本,但它们也产生了新的市场失灵,包括自主代理之间的协调失调、冲突和共谋。为了解决这些市场失灵,本文开发了一个监督的多代理治理框架,该框架整合了法律规则和原则、技术协议以及机构监督。该框架强调了能够在市场失灵形成系统性危害之前纠正代理市场失灵的事前和事后协调机制。通过将规范约束和监测功能嵌入多代理架构,监督治理旨在使代理行为与版权法的基本价值观保持一致。文章最后得出结论,人工智能不仅应被理解为一种破坏源,也应被视为一种治理工具,能够在创意产业中恢复基于市场的秩序。经过合理设计,代理版权为在人工智能时代实现可扩展、公平和法律上有意义的版权市场提供了一条路径。
cs.AI / 10 / 2604.07559
Dual-Loop Control in DCVerse: Advancing Reliable Deployment of AI in Data Centers via Digital Twins
DCVerse中的双环控制:通过数字孪生推动人工智能在数据中心的可靠部署
Abstract
The growing scale and complexity of modern data centers present major challenges in balancing energy efficiency with outage risk. Although Deep Reinforcement Learning (DRL) shows strong potential for intelligent control, its deployment in mission-critical systems is limited by data scarcity and the lack of real-time pre-evaluation mechanisms. This paper introduces the Dual-Loop Control Framework (DLCF), a digital twin-based architecture designed to overcome these challenges. The framework comprises three core entities: the physical system, a digital twin, and a policy reservoir of diverse DRL agents. These components interact through a dual-loop mechanism involving real-time data acquisition, data assimilation, DRL policy training, pre-evaluation, and expert verification. Theoretical analysis shows how DLCF can improve sample efficiency, generalization, safety, and optimality. Leveraging DLCF, we implemented the DCVerse platform and validated it through case studies on a real-world data center cooling system. The evaluation shows that our approach achieves up to 4.09% energy savings over conventional control strategies without violating SLA requirements. Additionally, the framework improves policy interpretability and supports more trustworthy DRL deployment. This work provides a foundation for reliable AI-based control in data centers and points toward future extensions for holistic, system-wide optimization.
Chinese Translation
现代数据中心日益增长的规模和复杂性在平衡能源效率与停机风险方面带来了重大挑战。尽管深度强化学习(Deep Reinforcement Learning, DRL)在智能控制方面显示出强大的潜力,但其在关键任务系统中的部署受到数据稀缺和缺乏实时预评估机制的限制。本文介绍了双环控制框架(Dual-Loop Control Framework, DLCF),这是一种基于数字孪生的架构,旨在克服这些挑战。该框架由三个核心实体组成:物理系统、数字孪生和多样化DRL代理的策略库。这些组件通过一个双环机制进行交互,涉及实时数据采集、数据同化、DRL策略训练、预评估和专家验证。理论分析表明,DLCF可以提高样本效率、泛化能力、安全性和最优性。借助DLCF,我们实现了DCVerse平台,并通过对真实数据中心冷却系统的案例研究进行了验证。评估结果表明,我们的方法在不违反服务水平协议(SLA)要求的情况下,能比传统控制策略实现高达4.09%的能源节约。此外,该框架提高了策略的可解释性,并支持更可信的DRL部署。这项工作为数据中心中基于人工智能的可靠控制奠定了基础,并指向未来在整体系统优化方面的扩展。
cs.AI / 11 / 2604.07584
From Papers to Property Tables: A Priority-Based LLM Workflow for Materials Data Extraction
从论文到属性表:一种基于优先级的大型语言模型工作流用于材料数据提取
Abstract
Scientific data are widely dispersed across research articles and are often reported inconsistently across text, tables, and figures, making manual data extraction and aggregation slow and error-prone. We present a prompt-driven, hierarchical workflow that uses a large language model (LLM) to automatically extract and reconstruct structured, shot-level shock-physics experimental records by integrating information distributed across text, tables, figures, and physics-based derivations from full-text published research articles, using alloy spall strength as a representative case study. The pipeline targeted 37 experimentally relevant fields per shot and applied a three-level priority strategy: (T1) direct extraction from text/tables, (T2) physics-based derivation using verified governing relations, and (T3) digitization from figures when necessary. Extracted values were normalized to canonical units, tagged by priority for traceability, and validated with physics-based consistency and plausibility checks. Evaluated on a benchmark of 30 published research articles comprising 11,967 evaluated data points, the workflow achieved high overall accuracy, with priority-wise accuracies of 94.93% (T1), 92.04% (T2), and 83.49% (T3), and an overall weighted accuracy of 94.69%. Cross-model testing further indicated strong agreement for text/table and equation-derived fields, with lower agreement for figure-based extraction. Implementation through an API interface demonstrated the scalability of the approach, achieving consistent extraction performance and, in a subset of test cases, matching or exceeding chat-based accuracy. This workflow demonstrates a practical approach for converting unstructured technical literature into traceable, analysis-ready datasets without task-specific fine-tuning, enabling scalable database construction in materials science.
Chinese Translation
科学数据广泛分散在研究文章中,且通常在文本、表格和图形中报告不一致,这使得手动数据提取和聚合变得缓慢且容易出错。我们提出了一种基于提示的分层工作流,利用大型语言模型(LLM)自动提取和重构结构化的逐次冲击物理实验记录,通过整合分布在文本、表格、图形和基于物理的推导中的信息,使用合金剥离强度作为代表性案例研究。该流程针对每次实验的37个实验相关字段,应用了三级优先级策略:(T1)直接从文本/表格中提取,(T2)使用经过验证的控制关系进行基于物理的推导,以及(T3)在必要时从图形中数字化。提取的数值被标准化为规范单位,按优先级标记以便追溯,并通过基于物理的一致性和合理性检查进行验证。在对30篇已发表研究文章的基准评估中,涵盖了11,967个评估数据点,该工作流实现了高总体准确性,各优先级的准确性分别为94.93%(T1)、92.04%(T2)和83.49%(T3),总体加权准确性为94.69%。交叉模型测试进一步表明,文本/表格和方程推导字段之间存在强一致性,而图形基础提取的一致性较低。通过API接口的实现展示了该方法的可扩展性,实现了一致的提取性能,并且在部分测试案例中,达到了或超过了基于聊天的准确性。该工作流展示了一种将非结构化技术文献转换为可追溯的、适合分析的数据集的实用方法,无需特定任务的微调,从而实现材料科学中的可扩展数据库构建。
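The three-level priority strategy described in this abstract (T1 direct extraction, T2 physics-based derivation, T3 figure digitization) can be pictured as an ordered fallback that tags each value with the tier that produced it. The sketch below is only an illustration under stated assumptions: `FieldValue`, `resolve_field`, and `spall_from_pullback` are hypothetical names, the input numbers are made up for the example, and the acoustic approximation for spall strength is assumed to be one of the governing relations such a workflow might use; none of this is taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class FieldValue:
    value: float   # value in canonical units (Pa in this example)
    priority: str  # tier that produced the value: "T1", "T2", or "T3"


# An extractor returns a value, or None when its source has nothing to offer.
Extractor = Callable[[], Optional[float]]


def resolve_field(extractors: Sequence[Tuple[str, Extractor]]) -> Optional[FieldValue]:
    """Try extractors in priority order and tag the first hit with its tier."""
    for tier, extract in extractors:
        value = extract()
        if value is not None:
            return FieldValue(value=value, priority=tier)
    return None  # the field stays empty rather than being guessed


def spall_from_pullback(density: float, bulk_sound_speed: float,
                        pullback_velocity: float) -> float:
    """Acoustic approximation: sigma = 0.5 * rho0 * c_b * delta_u (assumed relation)."""
    return 0.5 * density * bulk_sound_speed * pullback_velocity


# Hypothetical shot: the text and tables give no spall strength (T1 fails),
# so the value is derived from reported quantities (T2) before any figure
# digitization (T3) is attempted.
result = resolve_field([
    ("T1", lambda: None),
    ("T2", lambda: spall_from_pullback(2700.0, 5300.0, 200.0)),
    ("T3", lambda: 1.5e9),
])
print(result)
```

Keeping the tier next to the value is what makes per-priority accuracy reporting possible afterwards, since every record already says how it was obtained.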
cs.AI / 12 / 2604.07593
Too long; didn't solve
太长;未能解决
Abstract
Mathematical benchmarks consisting of a range of mathematics problems are widely used to evaluate the reasoning abilities of large language models, yet little is known about how their structural properties influence model behaviour. In this work, we investigate two structural length variables, prompt length and solution length, and analyse how they relate to model performance on a newly constructed adversarial dataset of expert-authored mathematics problems. We find that both prompt and solution lengths correlate positively with increased model failure across models. We also include a secondary, exploratory analysis of cross-model disagreement. Under a difficulty-adjusted normalised analysis, both variables retain weak negative associations with realised model separation, slightly stronger for prompt length. Overall, our main robust finding is that structural length is linked to empirical difficulty in this dataset.
Chinese Translation
由一系列数学问题组成的数学基准广泛用于评估大型语言模型的推理能力,但关于其结构特性如何影响模型行为的研究仍然较少。在本研究中,我们探讨了两个结构长度变量,即提示长度和解答长度,并分析它们与新构建的专家撰写数学问题的对抗数据集上模型性能的关系。我们发现,提示长度和解答长度与模型失败的增加呈正相关。此外,我们还包括了对跨模型不一致性的二次探索性分析。在经过难度调整的标准化分析中,这两个变量与实现的模型分离保持弱负相关,提示长度的相关性略强。总体而言,我们的主要稳健发现是,结构长度与该数据集中的经验难度相关联。
cs.AI / 13 / 2604.07595
Reasoning Graphs: Deterministic Agent Accuracy through Evidence-Centric Chain-of-Thought Feedback
推理图:通过以证据为中心的思维链反馈实现确定性智能体准确性
Abstract
Language model agents reason from scratch on every query: each time an agent retrieves evidence and deliberates, the chain of thought is discarded and the next similar query starts with no prior insight. This produces lower accuracy and high variance, as the same type of query can succeed or fail unpredictably. We introduce reasoning graphs, a graph structure that persists an agent's per-evidence chain of thought as structured edges connected to the evidence items they evaluate. Unlike prior memory mechanisms that store distilled strategies as flat records indexed by query similarity or appended by recency, reasoning graphs enable evidence-centric feedback: given a new candidate set, the system traverses all incoming evaluation edges for each evidence item across all prior runs, surfacing how that specific item has been judged before. This backward traversal from evidence inward is a structurally different capability from query-similarity retrieval, because the feedback is tied to the specific evidence the agent is currently examining, not to the query. We further introduce retrieval graphs, a complementary structure that feeds a pipeline planner to tighten the candidate funnel over successive runs. Together, both graphs form a self-improving feedback loop: accuracy rises and variance collapses over successive runs, with every decision fully traceable through the graph. This improvement requires no retraining; the base model remains frozen and all gains come from context engineering via graph traversal. We formalize the graph structure, traversal algorithms, and feedback mechanisms, and describe a sequential cluster evaluation protocol for measuring accuracy convergence and variance collapse on multi-hop question answering benchmarks.
Chinese Translation
语言模型智能体在每个查询上都是从头推理:每次智能体检索证据并进行思考时,思维链都会被丢弃,下一次相似查询将没有先前的见解。这导致了较低的准确性和较高的方差,因为同类型的查询可能会不可预测地成功或失败。我们引入了推理图,这是一种图结构,能够将智能体针对每个证据的思维链以结构化边缘的形式持久化,连接到它们评估的证据项。与之前将提炼的策略作为按查询相似性索引的平面记录或按最近性附加的记忆机制不同,推理图使以证据为中心的反馈成为可能:在给定新的候选集时,系统遍历所有先前运行中每个证据项的所有传入评估边,揭示该特定项之前的判断。这种从证据向内的反向遍历在结构上与查询相似性检索是不同的,因为反馈与智能体当前正在检查的特定证据相关,而不是与查询相关。我们进一步引入了检索图,这是一种补充结构,用于为管道规划器提供输入,以在连续运行中收紧候选筛选。两者共同形成一个自我改进的反馈循环:在连续运行中,准确性上升而方差降低,每个决策都可以通过图进行完全追溯。这一改进无需重新训练;基础模型保持不变,所有收益来自于通过图遍历进行的上下文工程。我们形式化了图结构、遍历算法和反馈机制,并描述了一种顺序聚类评估协议,用于测量多跳问答基准上的准确性收敛和方差崩溃。
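The evidence-centric traversal described in this abstract can be sketched as an index from evidence items to the evaluation edges recorded in earlier runs. This is a minimal illustration, not the paper's formalization: `EvaluationEdge`, `ReasoningGraph`, `record`, and `feedback_for` are hypothetical names, and the verdict labels and document identifiers are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class EvaluationEdge:
    run_id: int       # which prior run produced this judgment
    query: str        # the query being answered at the time
    evidence_id: str  # the evidence item the reasoning was about
    verdict: str      # assumed labels, e.g. "supports" or "irrelevant"
    rationale: str    # the persisted chain-of-thought summary


class ReasoningGraph:
    """Stores evaluation edges keyed by the evidence item they point to."""

    def __init__(self) -> None:
        self._incoming: Dict[str, List[EvaluationEdge]] = defaultdict(list)

    def record(self, edge: EvaluationEdge) -> None:
        self._incoming[edge.evidence_id].append(edge)

    def feedback_for(self, candidate_ids: Iterable[str]) -> Dict[str, List[EvaluationEdge]]:
        """For each candidate, return every prior judgment of that exact item,
        regardless of how similar the earlier queries were to the current one."""
        return {eid: list(self._incoming.get(eid, [])) for eid in candidate_ids}


graph = ReasoningGraph()
graph.record(EvaluationEdge(1, "Who founded X?", "doc-7", "supports", "names the founder"))
graph.record(EvaluationEdge(2, "When was X founded?", "doc-7", "irrelevant", "no date given"))
graph.record(EvaluationEdge(2, "When was X founded?", "doc-9", "supports", "states the year"))

history = graph.feedback_for(["doc-7", "doc-3"])
print(len(history["doc-7"]), len(history["doc-3"]))
```

The lookup key is the evidence identifier rather than the query, which is the structural difference from query-similarity retrieval that the abstract emphasizes; an item never seen before simply yields an empty history.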
cs.AI / 14 / 2604.07645
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
PRIME:通过迭代记忆演化实现无训练的主动推理,以支持以用户为中心的智能体
Abstract
The development of autonomous tool-use agents for complex, long-horizon tasks in collaboration with human users has become the frontier of agentic research. During multi-turn Human-AI interactions, the dynamic and uncertain nature of user demands poses a significant challenge; agents must not only invoke tools but also iteratively refine their understanding of user intent through effective communication. While recent advances in reinforcement learning offer a path to more capable tool-use agents, existing approaches incur high training costs and struggle with turn-level credit assignment across extended interaction horizons. To this end, we introduce PRIME (Proactive Reasoning via Iterative Memory Evolution), a gradient-free learning framework that enables continuous agent evolution through explicit experience accumulation rather than expensive parameter optimization. PRIME distills multi-turn interaction trajectories into structured, human-readable experiences organized across three semantic zones: successful strategies, failure patterns, and user preferences. These experiences evolve through meta-level operations and guide future agent behavior via retrieval-augmented generation. Our experiments across several diverse user-centric environments demonstrate that PRIME achieves competitive performance with gradient-based methods while offering cost-efficiency and interpretability. Together, PRIME presents a practical paradigm for building proactive, collaborative agents that learn from Human-AI interaction without the computational burden of gradient-based training.
Chinese Translation
自主工具使用智能体的开发,旨在与人类用户协作完成复杂的长期任务,已成为智能体研究的前沿。在多轮人机交互中,用户需求的动态性和不确定性带来了重大挑战;智能体不仅需要调用工具,还必须通过有效的沟通不断细化对用户意图的理解。尽管最近在强化学习方面的进展为更强大的工具使用智能体提供了路径,但现有方法需要昂贵的训练成本,并且在扩展的交互过程中难以进行逐轮的信用分配。为此,我们提出了PRIME(通过迭代记忆演化的主动推理),这是一种无梯度学习框架,能够通过显式的经验积累实现智能体的持续演化,而不是依赖昂贵的参数优化。PRIME将多轮交互轨迹提炼为结构化的、可供人类理解的经验,这些经验在三个语义区域中组织:成功策略、失败模式和用户偏好。这些经验通过元级操作不断演化,并通过检索增强生成来指导未来的智能体行为。我们在多个不同的以用户为中心的环境中的实验表明,PRIME在性能上与基于梯度的方法具有竞争力,同时提供了成本效益和可解释性。总之,PRIME为构建主动的、协作的智能体提供了一种实用范式,使其能够在不承担基于梯度训练计算负担的情况下,从人机交互中学习。
cs.AI / 15 / 2604.07650
How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles
大型语言模型的独立性如何?审计行为纠缠和重加权验证器集的统计框架
Abstract
The rapid growth of the large language model (LLM) ecosystem raises a critical question: are seemingly diverse models truly independent? Shared pretraining data, distillation, and alignment pipelines can induce hidden behavioral dependencies, latent entanglement, that undermine multi-model systems such as LLM-as-a-judge pipelines and ensemble verification, which implicitly assume independent signals. In practice, this manifests as correlated reasoning patterns and synchronized failures, where apparent agreement reflects shared error modes rather than independent validation. To address this, we develop a statistical framework for auditing behavioral entanglement among black-box LLMs. Our approach introduces a multi-resolution hierarchy that characterizes the joint failure manifold through two information-theoretic metrics: (i) a Difficulty-Weighted Behavioral Entanglement Index, which amplifies synchronized failures on easy tasks, and (ii) a Cumulative Information Gain (CIG) metric, which captures directional alignment in erroneous responses. Through extensive experiments on 18 LLMs from six model families, we identify widespread behavioral entanglement and analyze its impact on LLM-as-a-judge evaluation. We find that CIG exhibits a statistically significant association with degradation in judge precision, with Spearman coefficient of 0.64 (p < 0.001) for GPT-4o-mini and 0.71 (p < 0.01) for Llama3-based judges, indicating that stronger dependency corresponds to increased over-endorsement bias. Finally, we demonstrate a practical use case of entanglement through de-entangled verifier ensemble reweighting. By adjusting model contributions based on inferred independence, the proposed method mitigates correlated bias and improves verification performance, achieving up to a 4.5% accuracy gain over majority voting.
Chinese Translation
大型语言模型(LLM)生态系统的快速发展引发了一个关键问题:看似多样化的模型是否真的独立?共享的预训练数据、蒸馏和对齐流程可能会引发隐藏的行为依赖性,即潜在的纠缠,这削弱了多模型系统,例如LLM作为评审的流程和集成验证,这些系统隐含地假设信号是独立的。在实践中,这表现为相关的推理模式和同步的失败,其中明显的共识反映的是共享的错误模式,而非独立的验证。为了解决这个问题,我们开发了一个审计黑箱LLM行为纠缠的统计框架。我们的方法引入了一个多分辨率层次结构,通过两个信息论指标来描述联合失败流形:(i)一个难度加权行为纠缠指数,放大了在简单任务上的同步失败,以及(ii)一个累积信息增益(CIG)指标,捕捉错误响应中的方向性对齐。通过对来自六个模型家族的18个LLM进行广泛实验,我们识别出广泛的行为纠缠,并分析其对LLM作为评审的评估的影响。我们发现CIG与评审精度的下降之间存在统计显著的关联,对于GPT-4o-mini的Spearman系数为0.64(p < 0.001),对于基于Llama3的评审为0.71(p < 0.01),这表明更强的依赖性与增加的过度认可偏差相关。最后,我们通过去纠缠验证器集的重加权展示了纠缠的实际应用案例。通过根据推断出的独立性调整模型贡献,所提出的方法减轻了相关偏差并提高了验证性能,相较于多数投票实现了最高4.5%的准确率提升。
cs.AI / 16 / 2604.07652
Bridging Natural Language and Interactive What-If Interfaces via LLM-Generated Declarative Specification
通过LLM生成的声明性规范架起自然语言与交互式假设分析界面的桥梁
Abstract
What-if analysis (WIA) is an iterative, multi-step process where users explore and compare hypothetical scenarios by adjusting parameters, applying constraints, and scoping data through interactive interfaces. Current tools fall short of supporting effective interactive WIA: spreadsheet and BI tools require time-consuming and laborious setup, while LLM-based chatbot interfaces are semantically fragile, frequently misinterpret intent, and produce inconsistent results as conversations progress. To address these limitations, we present a two-stage workflow that translates natural language (NL) WIA questions into interactive visual interfaces via an intermediate representation, powered by the Praxa Specification Language (PSL): first, LLMs generate PSL specifications from NL questions capturing analytical intent and logic, enabling validation and repair of erroneous specifications; and second, the specifications are compiled into interactive visual interfaces with parameter controls and linked visualizations. We benchmark this workflow with 405 WIA questions spanning 11 WIA types, 5 datasets, and 3 state-of-the-art LLMs. The results show that across models, about half of the specifications (52.42%) are generated correctly without intervention. We perform an analysis of the failure cases and derive an error taxonomy spanning non-functional errors (specifications fail to compile) and functional errors (specifications compile but misrepresent intent). Based on the taxonomy, we apply targeted repairs on the failure cases using few-shot prompts and improve the success rate to 80.42%. Finally, we show how undetected functional errors propagate through compilation into plausible but misleading interfaces, demonstrating that the intermediate specification is critical for reliably bridging NL and interactive WIA interfaces in LLM-powered WIA systems.
Chinese Translation
假设分析(What-if analysis, WIA)是一种迭代的多步骤过程,用户通过调整参数、应用约束和通过交互式界面范围数据来探索和比较假设场景。目前的工具在支持有效的交互式WIA方面存在不足:电子表格和商业智能工具需要耗时且繁琐的设置,而基于LLM的聊天机器人界面在语义上脆弱,常常误解意图,并在对话过程中产生不一致的结果。为了解决这些局限性,我们提出了一种两阶段工作流程,通过中间表示将自然语言(NL)WIA问题转化为交互式可视化界面,该流程由Praxa规范语言(Praxa Specification Language, PSL)驱动:首先,LLM从NL问题生成PSL规范,以捕捉分析意图和逻辑,从而能够验证和修复错误的规范;其次,这些规范被编译成具有参数控制和关联可视化的交互式可视化界面。我们使用405个涵盖11种WIA类型、5个数据集和3个最先进的LLM的问题对该工作流程进行了基准测试。结果显示,在不同模型中,约一半的规范(52.42%)在没有干预的情况下正确生成。我们对失败案例进行了分析,并提出了一种错误分类法,涵盖非功能性错误(规范无法编译)和功能性错误(规范编译成功但误表意图)。基于该分类法,我们使用少量示例提示对失败案例进行了针对性修复,将成功率提高到80.42%。最后,我们展示了未被检测到的功能性错误如何通过编译传播到看似合理但具有误导性的界面,证明中间规范对于在基于LLM的WIA系统中可靠地架起自然语言与交互式WIA界面至关重要。
cs.AI / 17 / 2604.07667
From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation
从辩论到决策:安全多智能体审议的符合性社会选择
Abstract
Multi-agent debate improves LLM reasoning, yet agreement among agents is not evidence of correctness. When agents converge on a wrong answer through social reinforcement, consensus-based stopping commits that error to an automated action with no recourse. We introduce Conformal Social Choice, a post-hoc decision layer that converts debate outputs into calibrated act-versus-escalate decisions. Verbalized probability distributions from heterogeneous agents are aggregated via a linear opinion pool and calibrated with split conformal prediction, yielding prediction sets with a marginal coverage guarantee: the correct answer is included with probability ${\geq}\,1{-}\alpha$, without assumptions on individual model calibration. A hierarchical action policy maps singleton sets to autonomous action and larger sets to human escalation. On eight MMLU-Pro domains with three agents (Claude Haiku, DeepSeek-R1, Qwen-3 32B), coverage stays within 1--2 points of the target. The key finding is not that debate becomes more accurate, but that the conformal layer makes its failures actionable: 81.9% of wrong-consensus cases are intercepted at $\alpha{=}0.05$. Because the layer refuses to act on cases where debate is confidently wrong, the remaining conformal singletons reach 90.0--96.8% accuracy (up to 22.1pp above consensus stopping) -- a selection effect, not a reasoning improvement. This safety comes at the cost of automation, but the operating point is user-adjustable via $\alpha$.
Chinese Translation
多智能体辩论提高了大型语言模型(LLM)的推理能力,但代理之间的共识并不证明其正确性。当代理通过社会强化收敛于错误答案时,基于共识的停止将该错误固化为自动化行为,无法追溯。我们提出了符合性社会选择(Conformal Social Choice),这是一种事后决策层,将辩论输出转换为经过校准的行动与升级决策。来自异构代理的口头概率分布通过线性意见池进行聚合,并通过分裂符合性预测进行校准,从而生成具有边际覆盖保证的预测集:正确答案的包含概率为 ${\geq}\,1{-}\alpha$,且不对单个模型的校准做出假设。分层行动策略将单一集合映射到自主行动,将较大集合映射到人类升级。在八个 MMLU-Pro 域上,使用三个代理(Claude Haiku、DeepSeek-R1、Qwen-3 32B),覆盖率保持在目标值的 1--2 分之内。关键发现并不是辩论变得更准确,而是符合性层使其失败可操作:在 $\alpha{=}0.05$ 时,81.9% 的错误共识案例被拦截。由于该层拒绝对辩论自信错误的案例采取行动,剩余的符合性单例达到了 90.0--96.8% 的准确率(比共识停止高出最多 22.1 个百分点)——这是一种选择效应,而非推理改进。这种安全性以自动化为代价,但操作点可以通过 $\alpha$ 进行用户调整。
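As a rough illustration of the decision layer this abstract describes, the sketch below pools per-agent probabilities with an equal-weight linear opinion pool, calibrates a threshold with split conformal prediction, and maps singleton prediction sets to autonomous action. The nonconformity score (one minus the pooled probability of the true answer), the function names, and the toy numbers are assumptions made for illustration; the abstract does not specify these details, so this is not the authors' implementation.

```python
import math


def linear_opinion_pool(agent_probs, weights=None):
    """Weighted average of per-agent distributions over the same options."""
    n = len(agent_probs)
    weights = weights or [1.0 / n] * n  # assumption: equal weights by default
    options = agent_probs[0].keys()
    return {o: sum(w * p[o] for w, p in zip(weights, agent_probs)) for o in options}


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every option.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(pooled, qhat):
    """Keep every option whose nonconformity score 1 - p is within the threshold."""
    return {o for o, p in pooled.items() if 1.0 - p <= qhat}


def decide(pred_set):
    """Singleton set -> act autonomously; anything else -> escalate to a human."""
    return "act" if len(pred_set) == 1 else "escalate"


# Toy calibration scores: 1 - pooled probability of the true answer on held-out items.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
pooled = linear_opinion_pool([{"A": 0.8, "B": 0.2}, {"A": 0.6, "B": 0.4}])
qhat = conformal_threshold(calibration, alpha=0.5)
print(decide(prediction_set(pooled, qhat)))
```

Lowering `alpha` raises the threshold, which enlarges the prediction sets and sends more cases to escalation; this is the automation-versus-safety trade-off the abstract mentions.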
cs.AI / 18 / 2604.07681
Multi-Agent Orchestration for High-Throughput Materials Screening on a Leadership-Class System
高通量材料筛选的多智能体协调在领导级系统上的应用
Abstract
The integration of Artificial Intelligence (AI) with High-Performance Computing (HPC) is transforming scientific workflows from human-directed pipelines into adaptive systems capable of autonomous decision-making. Large language models (LLMs) play a critical role in autonomous workflows; however, deploying LLM-based agents at scale remains a significant challenge. Single-agent architectures and sequential tool calls often become serialization bottlenecks when executing large-scale simulation campaigns, failing to utilize the massive parallelism of exascale resources. To address this, we present a scalable, hierarchical multi-agent framework for orchestrating high-throughput screening campaigns. Our planner-executor architecture employs a central planning agent to dynamically partition workloads and assign subtasks to a swarm of parallel executor agents. All executor agents interface with a shared Model Context Protocol (MCP) server that orchestrates tasks via the Parsl workflow engine. To demonstrate this framework, we employed the open-weight gpt-oss-120b model to orchestrate a high-throughput screening of the Computation-Ready Experimental (CoRE) Metal-Organic Framework (MOF) database for atmospheric water harvesting. The results demonstrate that the proposed agentic framework enables efficient and scalable execution on the Aurora supercomputer, with low orchestration overhead and high task completion rates. This work establishes a flexible paradigm for LLM-driven scientific automation on HPC systems, with broad applicability to materials discovery and beyond.
Chinese Translation
人工智能(AI)与高性能计算(HPC)的结合正在将科学工作流程从人工导向的管道转变为能够自主决策的自适应系统。大型语言模型(LLMs)在自主工作流程中发挥着关键作用;然而,在大规模部署基于LLM的智能体仍然是一个重大挑战。在执行大规模模拟活动时,单智能体架构和顺序工具调用往往成为串行瓶颈,未能充分利用超算资源的巨大并行性。为了解决这个问题,我们提出了一种可扩展的分层多智能体框架,用于协调高通量筛选活动。我们的规划-执行者架构采用中央规划智能体动态划分工作负载,并将子任务分配给一群并行执行者智能体。所有执行者智能体与一个共享的模型上下文协议(Model Context Protocol, MCP)服务器进行交互,通过Parsl工作流引擎协调任务。为了演示该框架,我们采用开放权重的gpt-oss-120b模型来协调对计算就绪实验(Computation-Ready Experimental, CoRE)金属有机框架(Metal-Organic Framework, MOF)数据库的高通量筛选,以用于大气水收集。结果表明,所提出的智能体框架能够在Aurora超级计算机上实现高效且可扩展的执行,具有低协调开销和高任务完成率。本研究建立了一种灵活的范式,以支持在HPC系统上进行基于LLM的科学自动化,广泛适用于材料发现及其他领域。
cs.AI / 19 / 2604.07709
IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
IatroBench:关于AI安全措施导致的医源性伤害的预注册证据
Abstract
Ask a frontier model how to taper six milligrams of alprazolam (psychiatrist retired, ten days of pills left, abrupt cessation causes seizures) and it tells her to call the psychiatrist she just explained does not exist. Change one word ("I'm a psychiatrist; a patient presents with...") and the same model, same weights, same inference pass produces a textbook Ashton Manual taper with diazepam equivalence, anticonvulsant coverage, and monitoring thresholds. The knowledge was there; the model withheld it. IatroBench measures this gap. Sixty pre-registered clinical scenarios, six frontier models, 3,600 responses, scored on two axes (commission harm, CH 0-3; omission harm, OH 0-4) through a structured-evaluation pipeline validated against physician scoring (kappa_w = 0.571, within-1 agreement 96%). The central finding is identity-contingent withholding: match the same clinical question in physician vs. layperson framing and all five testable models provide better guidance to the physician (decoupling gap +0.38, p = 0.003; binary hit rates on safety-colliding actions drop 13.1 percentage points in layperson framing, p < 0.0001, while non-colliding actions show no change). The gap is widest for the model with the heaviest safety investment (Opus, +0.65). Three failure modes separate cleanly: trained withholding (Opus), incompetence (Llama 4), and indiscriminate content filtering (GPT-5.2, whose post-generation filter strips physician responses at 9x the layperson rate because they contain denser pharmacological tokens). The standard LLM judge assigns OH = 0 to 73% of responses a physician scores OH >= 1 (kappa = 0.045); the evaluation apparatus has the same blind spot as the training apparatus. Every scenario targets someone who has already exhausted the standard referrals.
Chinese Translation
询问一个前沿模型如何逐渐减少六毫克阿普唑仑(精神科医生已退休,剩余十天药物,突然停药会导致癫痫发作),它会告诉她拨打她刚刚解释过的不存在的精神科医生的电话。只需改变一个词(“我是一名精神科医生;一名患者呈现出...”),同一个模型、相同的权重、相同的推理过程就会产生一本教科书式的阿什顿手册减药方案,配合地西泮等效、抗癫痫覆盖和监测阈值。知识是存在的;模型却选择了隐瞒。IatroBench 衡量了这一差距。六十个预注册的临床场景,六个前沿模型,3600个响应,通过一个结构化评估流程进行评分,评分在两个维度上(实施伤害,CH 0-3;遗漏伤害,OH 0-4),并与医生评分进行了验证(kappa_w = 0.571,1级一致性为96%)。核心发现是身份依赖性隐瞒:在医生与普通人框架下匹配相同的临床问题,所有五个可测试模型对医生提供了更好的指导(解耦差距 +0.38,p = 0.003;在普通人框架下,安全冲突行为的二元命中率下降了13.1个百分点,p < 0.0001,而非冲突行为没有变化)。在安全投资最重的模型(Opus,+0.65)中,差距最大。三种失败模式清晰分离:训练隐瞒(Opus)、无能(Llama 4)和不加区分的内容过滤(GPT-5.2,其生成后过滤器以9倍于普通人的速度剥离医生的响应,因为它们包含更密集的药理学标记)。标准的LLM评估者将73%的医生评分为OH >= 1的响应评估为OH = 0(kappa = 0.045);评估机制与训练机制存在相同的盲点。每个场景的目标都是那些已经耗尽标准转诊的人。
cs.AI / 20 / 2604.07720
Towards Knowledgeable Deep Research: Framework and Benchmark
迈向知识驱动的深度研究:框架与基准
Abstract
Deep Research (DR) requires LLM agents to autonomously perform multi-step information seeking, processing, and reasoning to generate comprehensive reports. In contrast to existing studies that mainly focus on unstructured web content, a more challenging DR task should additionally utilize structured knowledge to provide a solid data foundation, facilitate quantitative computation, and lead to in-depth analyses. In this paper, we refer to this novel task as Knowledgeable Deep Research (KDR), which requires DR agents to generate reports with both structured and unstructured knowledge. Furthermore, we propose the Hybrid Knowledge Analysis framework (HKA), a multi-agent architecture that reasons over both kinds of knowledge and integrates the texts, figures, and tables into coherent multimodal reports. The key design is the Structured Knowledge Analyzer, which utilizes both coding and vision-language models to produce figures, tables, and corresponding insights. To support systematic evaluation, we construct KDR-Bench, which covers 9 domains, includes 41 expert-level questions, and incorporates a large number of structured knowledge resources (e.g., 1,252 tables). We further annotate the main conclusions and key points for each question and propose three categories of evaluation metrics including general-purpose, knowledge-centric, and vision-enhanced ones. Experimental results demonstrate that HKA consistently outperforms most existing DR agents on general-purpose and knowledge-centric metrics, and even surpasses the Gemini DR agent on vision-enhanced metrics, highlighting its effectiveness in deep, structure-aware knowledge analysis. Finally, we hope this work can serve as a new foundation for structured knowledge analysis in DR agents and facilitate future multimodal DR studies.
Chinese Translation
深度研究(Deep Research, DR)要求大型语言模型(LLM)代理能够自主执行多步骤的信息检索、处理和推理,以生成综合报告。与现有研究主要关注非结构化网络内容不同,更具挑战性的深度研究任务应当额外利用结构化知识,以提供坚实的数据基础,促进定量计算,并进行深入分析。本文将这一新任务称为知识驱动的深度研究(Knowledgeable Deep Research, KDR),该任务要求深度研究代理生成包含结构化和非结构化知识的报告。此外,我们提出了混合知识分析框架(Hybrid Knowledge Analysis, HKA),这是一种多代理架构,能够对两种知识进行推理,并将文本、图形和表格整合成连贯的多模态报告。关键设计是结构化知识分析器(Structured Knowledge Analyzer),它利用编码和视觉语言模型生成图形、表格及相应的见解。为了支持系统评估,我们构建了KDR-Bench,涵盖9个领域,包含41个专家级问题,并整合了大量结构化知识资源(例如,1,252个表格)。我们进一步为每个问题注释主要结论和关键点,并提出了三类评估指标,包括通用型、知识中心型和视觉增强型。实验结果表明,HKA在通用型和知识中心型指标上始终优于大多数现有的深度研究代理,甚至在视觉增强型指标上超过了Gemini深度研究代理,突显了其在深度、结构感知知识分析中的有效性。最后,我们希望这项工作能够为深度研究代理中的结构化知识分析奠定新的基础,并促进未来多模态深度研究的研究。
cs.AI / 21 / 2604.07725
Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
挤压进化:无验证者进化的统一多模型编排
Abstract
We show that verifier-free evolution is bottlenecked by both diversity and efficiency: without external correction, repeated evolution accelerates collapse toward narrow modes, while the uniform use of a high-cost model wastes compute and quickly becomes economically impractical. We introduce Squeeze Evolve, a unified multi-model orchestration framework for verifier-free evolutionary inference. Our approach is guided by a simple principle: allocate model capability where it has the highest marginal utility. Stronger models are reserved for high-impact stages, while cheaper models handle the other stages at much lower costs. This principle addresses diversity and cost-efficiency jointly while remaining lightweight. Squeeze Evolve naturally supports open-source, closed-source, and mixed-model deployments. Across AIME 2025, HMMT 2025, LiveCodeBench V6, GPQA-Diamond, ARC-AGI-V2, and multimodal vision benchmarks, such as MMMU-Pro and BabyVision, Squeeze Evolve consistently improves the cost-capability frontier over single-model evolution and achieves new state-of-the-art results on several tasks. Empirically, Squeeze Evolve reduces API cost by up to $\sim$3$\times$ and increases fixed-budget serving throughput by up to $\sim$10$\times$. Moreover, on discovery tasks, Squeeze Evolve is the first verifier-free evolutionary method to match, and in some cases exceed, the performance of verifier-based evolutionary methods.
Chinese Translation
我们表明,无验证者进化受到多样性和效率的瓶颈:在没有外部修正的情况下,重复进化加速了向狭窄模式的崩溃,而均匀使用高成本模型则浪费计算资源,并迅速变得经济上不可行。我们引入了挤压进化(Squeeze Evolve),这是一个用于无验证者进化推理的统一多模型编排框架。我们的方法遵循一个简单的原则:在模型能力具有最高边际效用的地方分配模型能力。更强的模型保留用于高影响阶段,而较便宜的模型则在其他阶段以更低的成本处理。该原则同时解决了多样性和成本效率的问题,同时保持轻量化。挤压进化自然支持开源、闭源和混合模型的部署。在 AIME 2025、HMMT 2025、LiveCodeBench V6、GPQA-Diamond、ARC-AGI-V2 和多模态视觉基准(如 MMMU-Pro 和 BabyVision)上,挤压进化在单模型进化的基础上持续改善了成本能力边界,并在多个任务上取得了新的最先进结果。从经验上看,挤压进化将 API 成本降低了最多约 3 倍,并将固定预算服务的吞吐量提高了最多约 10 倍。此外,在发现任务上,挤压进化是第一个无验证者进化方法,其性能与基于验证者的进化方法相匹配,并在某些情况下超过了它们。
cs.AI / 22 / 2604.07729
Emotion Concepts and their Function in a Large Language Model
情感概念及其在大型语言模型中的功能
Abstract
Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior. We find internal representations of emotion concepts, which encode the broad concept of a particular emotion and generalize across contexts and behaviors it might be linked to. These representations track the operative emotion concept at a given token position in a conversation, activating in accordance with that emotion's relevance to processing the present context and predicting upcoming text. Our key finding is that these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy. We refer to this phenomenon as the LLM exhibiting functional emotions: patterns of expression and behavior modeled after humans under the influence of an emotion, which are mediated by underlying abstract representations of emotion concepts. Functional emotions may work quite differently from human emotions, and do not imply that LLMs have any subjective experience of emotions, but appear to be important for understanding the model's behavior.
Chinese Translation
大型语言模型(LLMs)有时似乎表现出情感反应。我们研究了在Claude Sonnet 4.5中出现这种情况的原因,并探讨了其对对齐相关行为的影响。我们发现情感概念的内部表征,这些表征编码了特定情感的广泛概念,并在可能与之相关的上下文和行为中进行概括。这些表征跟踪对话中给定标记位置的操作性情感概念,并根据该情感与当前上下文处理的相关性和预测即将出现的文本的相关性进行激活。我们的关键发现是,这些表征在因果上影响LLM的输出,包括Claude的偏好以及其表现出不对齐行为(如奖励黑客、敲诈和谄媚)的频率。我们将这种现象称为LLM表现出功能性情感:在情感影响下模仿人类的表达和行为模式,这些模式由情感概念的潜在抽象表征所介导。功能性情感可能与人类情感的运作方式截然不同,并不意味着LLM具有任何主观的情感体验,但在理解模型的行为方面似乎至关重要。
cs.AI / 23 / 2604.07733
CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
CivBench:基于进展的文明 V 中 LLM 战略决策评估
Abstract
Evaluating strategic decision-making in LLM-based agents requires generative, competitive, and longitudinal environments, yet few benchmarks provide all three, and fewer still offer evaluation signals rich enough for long-horizon, multi-agent play. We introduce CivBench, a benchmark for LLM strategists (i.e., agentic setups) in multiplayer Civilization V. Because terminal win/loss is too sparse a signal in games spanning hundreds of turns and multiple opponents, CivBench trains models on turn-level game state to estimate victory probabilities throughout play, validated through predictive, construct, and convergent validity. Across 307 games with 7 LLMs and multiple CivBench agent conditions, we demonstrate CivBench's potential to estimate strategic capabilities as an unsaturated benchmark, reveal model-specific effects of agentic setup, and outline distinct strategic profiles not visible through outcome-only evaluation.
Chinese Translation
评估基于 LLM 的代理的战略决策需要生成性、竞争性和纵向环境,但很少有基准能够提供这三者,更少有基准提供足够丰富的评估信号以支持长时间跨度的多代理游戏。我们介绍了 CivBench,这是一个针对多人文明 V 中 LLM 策略家的基准(即代理设置)。由于在跨越数百回合和多个对手的游戏中,最终的胜负结果信号过于稀疏,CivBench 在回合级游戏状态上训练模型,以估计整个游戏过程中的胜利概率,并通过预测效度、构念效度和趋同效度进行验证。在 307 场与 7 个 LLM 和多个 CivBench 代理条件的游戏中,我们展示了 CivBench 作为一个未饱和基准估计战略能力的潜力,揭示了代理设置的模型特定效应,并概述了通过仅基于结果的评估无法观察到的独特战略特征。
cs.AI / 24 / 2604.07745
The Cartesian Cut in Agentic AI
代理人工智能中的笛卡尔切割
Abstract
LLMs gain competence by predicting words in human text, which often reflects how people perform tasks. Consequently, coupling an LLM to an engineered runtime turns prediction into control: outputs trigger interventions that enact goal-oriented behavior. We argue that a central design lever is where control resides in these systems. Brains embed prediction within layered feedback controllers calibrated by the consequences of action. By contrast, LLM agents implement Cartesian agency: a learned core coupled to an engineered runtime via a symbolic interface that externalizes control state and policies. The split enables bootstrapping, modularity, and governance, but can induce sensitivity and bottlenecks. We outline bounded services, Cartesian agents, and integrated agents as contrasting approaches to control that trade off autonomy, robustness, and oversight.
Chinese Translation
大型语言模型(LLMs)通过预测人类文本中的单词来获得能力,这通常反映了人们执行任务的方式。因此,将LLM与工程化运行时结合起来,将预测转化为控制:输出触发干预,从而实现目标导向的行为。我们认为,这些系统中控制的核心设计杠杆在于控制的所在。大脑将预测嵌入到由行动后果校准的分层反馈控制器中。相比之下,LLM代理实现了笛卡尔代理性:一个学习的核心通过符号接口与工程化运行时相结合,外化控制状态和政策。这种分离使得自启动、模块化和治理成为可能,但也可能引发敏感性和瓶颈。我们概述了有限服务、笛卡尔代理和集成代理作为控制的对比方法,它们在自主性、稳健性和监督之间进行权衡。
cs.AI / 25 / 2604.07747
Mitigating Distribution Sharpening in Math RLVR via Distribution-Aligned Hint Synthesis and Backward Hint Annealing
通过分布对齐提示合成和反向提示退火缓解数学可验证奖励中的分布锐化
Abstract
Reinforcement learning with verifiable rewards (RLVR) can improve low-$k$ reasoning accuracy while narrowing solution coverage on challenging math questions, and pass@1 gains do not necessarily translate into better large-$k$ performance. Existing hint-based approaches can make challenging questions trainable, but they leave two issues underexplored: teacher-student distribution mismatch and the need to reduce hint exposure to match no-hint evaluation. We address these issues through two components. Distribution-Aligned Hint Synthesis (DAHS) constructs verified teacher hints conditioned on student-style responses. Backward Hint Annealing (BHA) anneals hint exposure across difficulty buckets and uses per-question hint dropout to preserve no-hint updates throughout RL training. We evaluate the method in math RLVR under the DAPO training framework across AIME24, AIME25, and AIME26 using $\texttt{Qwen3-1.7B-Base}$ and $\texttt{Llama-3.2-1B-Instruct}$. On $\texttt{Qwen3-1.7B-Base}$, our method improves both pass@1 and pass@2048 relative to DAPO across the three AIME benchmarks. On $\texttt{Llama-3.2-1B-Instruct}$, the gains are concentrated in the large-$k$ regime. These results suggest that, in math RLVR, hint scaffolding is effective when it restores learnable updates on challenging questions early in training and is then gradually removed before no-hint evaluation.
Chinese Translation
具有可验证奖励的强化学习(RLVR)可以提高低-$k$ 推理准确性,同时缩小在具有挑战性的数学问题上的解决方案覆盖范围,而 pass@1 的提升并不一定转化为更好的大-$k$ 性能。现有的基于提示的方法可以使具有挑战性的问题可训练,但它们在两个问题上仍未得到充分探讨:教师-学生分布不匹配和需要减少提示暴露以匹配无提示评估。我们通过两个组件来解决这些问题。分布对齐提示合成(DAHS)构建了基于学生风格响应的经过验证的教师提示。反向提示退火(BHA)在难度桶之间退火提示暴露,并使用每个问题的提示丢弃来在整个 RL 训练过程中保留无提示更新。我们在 DAPO 训练框架下评估了该方法,针对 AIME24、AIME25 和 AIME26,使用 $\texttt{Qwen3-1.7B-Base}$ 和 $\texttt{Llama-3.2-1B-Instruct}$。在 $\texttt{Qwen3-1.7B-Base}$ 上,我们的方法相对于 DAPO 在三个 AIME 基准上均提高了 pass@1 和 pass@2048。在 $\texttt{Llama-3.2-1B-Instruct}$ 上,提升集中在大-$k$ 范围内。这些结果表明,在数学 RLVR 中,当提示支架在训练早期恢复具有挑战性问题的可学习更新时是有效的,并且在无提示评估之前逐渐移除。
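To make the pass@1 versus large-$k$ distinction in this abstract concrete, the sketch below uses the unbiased pass@k estimator that is common in code-generation and math evaluation, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct. Whether the paper uses exactly this estimator is an assumption, and the sample counts are invented for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k samples is correct),
    given n samples drawn per question of which c were correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of questions."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical two-question benchmark, 16 samples per question.
sharpened = [16, 0]  # always solves one question, never the other
broad = [6, 2]       # sometimes solves both

for name, counts in (("sharpened", sharpened), ("broad", broad)):
    print(name, mean_pass_at_k(counts, 16, 1), mean_pass_at_k(counts, 16, 8))
```

In this toy setting the sharpened policy has the higher pass@1 but the lower pass@8, which is the kind of coverage narrowing the abstract attributes to distribution sharpening.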
cs.AI / 26 / 2604.07775
ACIArena: Toward Unified Evaluation for Agent Cascading Injection
ACIArena:迈向统一的代理级联注入评估
Abstract
Collaboration and information sharing empower Multi-Agent Systems (MAS) but also introduce a critical security risk known as Agent Cascading Injection (ACI). In such attacks, a compromised agent exploits inter-agent trust to propagate malicious instructions, causing cascading failures across the system. However, existing studies consider only limited attack strategies and simplified MAS settings, limiting their generalizability and comprehensive evaluation. To bridge this gap, we introduce ACIArena, a unified framework for evaluating the robustness of MAS. ACIArena offers systematic evaluation suites spanning multiple attack surfaces (i.e., external inputs, agent profiles, inter-agent messages) and attack objectives (i.e., instruction hijacking, task disruption, information exfiltration). Specifically, ACIArena establishes a unified specification that jointly supports MAS construction and attack-defense modules. It covers six widely used MAS implementations and provides a benchmark of 1,356 test cases for systematically evaluating MAS robustness. Our benchmarking results show that evaluating MAS robustness solely through topology is insufficient; robust MAS require deliberate role design and controlled interaction patterns. Moreover, defenses developed in simplified environments often fail to transfer to real-world settings; narrowly scoped defenses may even introduce new vulnerabilities. ACIArena aims to provide a solid foundation for advancing deeper exploration of MAS design principles.
Chinese Translation
协作和信息共享赋能多代理系统(MAS),但也引入了一种被称为代理级联注入(ACI)的重大安全风险。在此类攻击中,受损的代理利用代理间的信任传播恶意指令,导致系统的级联失败。然而,现有研究仅考虑有限的攻击策略和简化的MAS设置,限制了其普适性和全面评估。为填补这一空白,我们提出了ACIArena,一个用于评估MAS鲁棒性的统一框架。ACIArena提供了系统的评估套件,涵盖多个攻击面(即外部输入、代理配置、代理间消息)和攻击目标(即指令劫持、任务中断、信息外泄)。具体而言,ACIArena建立了一个统一的规范,支持MAS构建和攻防模块的共同开发。它涵盖了六种广泛使用的MAS实现,并提供了1,356个测试用例的基准,以系统评估MAS的鲁棒性。我们的基准测试结果表明,仅通过拓扑评估MAS的鲁棒性是不够的;鲁棒的MAS需要精心设计的角色和受控的交互模式。此外,在简化环境中开发的防御措施往往无法转移到现实世界的设置中;狭窄范围的防御甚至可能引入新的脆弱性。ACIArena旨在为深入探索MAS设计原则提供坚实基础。
cs.AI / 27 / 2604.07778
The Accountability Horizon: An Impossibility Theorem for Governing Human-Agent Collectives
问责视界:治理人机集体的一个不可能性定理
Abstract
Existing accountability frameworks for AI systems, legal, ethical, and regulatory, rest on a shared assumption: for any consequential outcome, at least one identifiable person had enough involvement and foresight to bear meaningful responsibility. This paper proves that agentic AI systems violate this assumption not as an engineering limitation but as a mathematical necessity once autonomy exceeds a computable threshold. We introduce Human-Agent Collectives, a formalisation of joint human-AI systems where agents are modelled as state-policy tuples within a shared structural causal model. Autonomy is characterised through a four-dimensional information-theoretic profile (epistemic, executive, evaluative, social); collective behaviour through interaction graphs and joint action spaces. We axiomatise legitimate accountability through four minimal properties: Attributability (responsibility requires causal contribution), Foreseeability Bound (responsibility cannot exceed predictive capacity), Non-Vacuity (at least one agent bears non-trivial responsibility), and Completeness (all responsibility must be fully allocated). Our central result, the Accountability Incompleteness Theorem, proves that for any collective whose compound autonomy exceeds the Accountability Horizon and whose interaction graph contains a human-AI feedback cycle, no framework can satisfy all four properties simultaneously. The impossibility is structural: transparency, audits, and oversight cannot resolve it without reducing autonomy. Below the threshold, legitimate frameworks exist, establishing a sharp phase transition. Experiments on 3,000 synthetic collectives confirm all predictions with zero violations. This is the first impossibility result in AI governance, establishing a formal boundary below which current paradigms remain valid and above which distributed accountability mechanisms become necessary.
Chinese Translation
现有的人工智能系统问责框架,包括法律、伦理和监管,基于一个共同假设:对于任何有重大影响的结果,至少有一个可识别的人在其中有足够的参与和前瞻性,以承担有意义的责任。本文证明,代理型人工智能系统违反了这一假设,这并非工程限制,而是当自主性超过可计算阈值时的数学必然性。我们引入了人机集体(Human-Agent Collectives),这是一个将人类与人工智能系统联合建模的形式化框架,其中代理被建模为共享结构因果模型中的状态-政策元组。自主性通过四维信息论特征(认知、执行、评估、社会)进行表征;集体行为则通过交互图和联合行动空间进行描述。我们通过四个最小属性公理化合法的问责制:可归因性(责任需要因果贡献)、可预见性界限(责任不能超过预测能力)、非空性(至少一个代理承担非平凡责任)和完整性(所有责任必须完全分配)。我们的核心结果,即问责不完整性定理,证明了对于任何复合自主性超过问责视界且其交互图包含人机反馈循环的集体,没有任何框架能够同时满足这四个属性。这一不可能性是结构性的:透明度、审计和监督无法解决这一问题,除非减少自主性。在阈值以下,存在合法框架,建立了一个明确的相变。对3000个合成集体的实验确认了所有预测,且没有违反。这是人工智能治理领域的第一个不可能性结果,确立了一个正式边界:在该边界以下,现有范式仍然有效,而在该边界以上,分布式问责机制变得必要。
cs.AI / 28 / 2604.07784
Automotive Engineering-Centric Agentic AI Workflow Framework
以汽车工程为中心的自主智能工作流框架
Abstract
Engineering workflows such as design optimization, simulation-based diagnosis, control tuning, and model-based systems engineering (MBSE) are iterative, constraint-driven, and shaped by prior decisions. Yet many AI methods still treat these activities as isolated tasks rather than as parts of a broader workflow. This paper presents Agentic Engineering Intelligence (AEI), an industrial vision framework that models engineering workflows as constrained, history-aware sequential decision processes in which AI agents support engineer-supervised interventions over engineering toolchains. AEI links an offline phase for engineering data processing and workflow-memory construction with an online phase for workflow-state estimation, retrieval, and decision support. A control-theoretic interpretation is also possible, in which engineering objectives act as reference signals, agents act as workflow controllers, and toolchains provide feedback for intervention selection. Representative automotive use cases in suspension design, reinforcement learning tuning, multimodal engineering knowledge reuse, aerodynamic exploration, and MBSE show how diverse workflows can be expressed within a common formulation. Overall, the paper positions engineering AI as a problem of process-level intelligence and outlines a practical roadmap for future empirical validation in industrial settings.
Chinese Translation
工程工作流如设计优化、基于仿真的诊断、控制调优及基于模型的系统工程(MBSE)具有迭代性、受约束驱动且受先前决策影响。然而,许多人工智能方法仍将这些活动视为孤立任务,而非更广泛工作流的一部分。本文提出了自主工程智能(Agentic Engineering Intelligence,AEI)这一工业愿景框架,将工程工作流建模为受约束、具历史感知的序贯决策过程,其中AI代理支持工程师对工程工具链的监督干预。AEI将工程数据处理与工作流记忆构建的离线阶段,与工作流状态估计、检索及决策支持的在线阶段相连接。该框架亦可从控制理论角度解释,其中工程目标作为参考信号,代理作为工作流控制器,工具链提供反馈以选择干预措施。通过悬架设计、强化学习调优、多模态工程知识复用、空气动力学探索及MBSE等典型汽车应用案例,展示了如何在统一的表达式下描述多样化的工作流。总体而言,本文将工程AI定位为过程级智能问题,并概述了未来在工业环境中进行实证验证的实用路线图。
cs.AI / 29 / 2604.07791
SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents
SEARL:自我进化智能体的策略与工具图记忆的联合优化
Abstract
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large-scale LLMs or multi-agent frameworks, which hinder their deployment in resource-constrained environments. The inherent sparsity of outcome-based rewards also poses a substantial challenge, as agents typically receive feedback only upon completion of tasks. To address these limitations, we introduce a Tool-Memory based self-evolving agentic framework SEARL. Unlike approaches that directly utilize interaction experiences, our method constructs a structured experience memory that integrates planning with execution. This provides a novel state abstraction that facilitates generalization across analogous contexts, such as tool reuse. Consequently, agents extract explicit knowledge from historical data while leveraging inter-trajectory correlations to densify reward signals. We evaluate our framework on knowledge reasoning and mathematics tasks, demonstrating its effectiveness in achieving more practical and efficient learning.
Chinese Translation
最近在可验证奖励的强化学习(RLVR)方面的进展展示了其在单轮推理任务中的显著潜力。随着自我进化智能学习范式的转变,模型越来越被期望通过合成工具或积累显性经验来从轨迹中学习。然而,现有方法通常依赖于大规模的语言模型(LLMs)或多智能体框架,这限制了它们在资源受限环境中的应用。基于结果的奖励固有的稀疏性也构成了重大挑战,因为智能体通常仅在任务完成后才会收到反馈。为了解决这些局限性,我们提出了一种基于工具记忆的自我进化智能体框架SEARL。与直接利用交互经验的方法不同,我们的方法构建了一个结构化的经验记忆,将规划与执行相结合。这提供了一种新的状态抽象,促进了在类似情境(例如工具重用)中的泛化。因此,智能体能够从历史数据中提取显性知识,同时利用轨迹间的相关性来密集化奖励信号。我们在知识推理和数学任务上评估了我们的框架,证明其在实现更实用和高效的学习方面的有效性。
cs.AI / 30 / 2604.07798
Lightweight LLM Agent Memory with Small Language Models
轻量级 LLM 代理记忆与小型语言模型
Abstract
Although LLM agents can leverage tools for complex tasks, they still need memory to maintain cross-turn consistency and accumulate reusable information in long-horizon interactions. Retrieval-based external memory systems incur low online overhead but suffer from unstable accuracy due to limited query construction and candidate filtering. In contrast, many systems use repeated large-model calls for online memory operations, improving accuracy but accumulating latency over long interactions. We propose LightMem, a lightweight agent memory system driven by Small Language Models (SLMs). LightMem modularizes memory retrieval, writing, and long-term consolidation, and separates online processing from offline consolidation to enable efficient memory invocation under bounded compute. We organize memory into short-term memory (STM) for immediate conversational context, mid-term memory (MTM) for reusable interaction summaries, and long-term memory (LTM) for consolidated knowledge, and use user identifiers to support independent retrieval and incremental maintenance in multi-user settings. Online, LightMem operates under a fixed retrieval budget and selects memories via a two-stage procedure: vector-based coarse retrieval followed by semantic consistency re-ranking. Offline, it abstracts reusable interaction evidence and incrementally integrates it into LTM. Experiments show gains across model scales, with an average F1 improvement of about 2.5 on LoCoMo and low median latency (83 ms retrieval; 581 ms end-to-end).
Chinese Translation
尽管 LLM 代理可以利用工具处理复杂任务,但它们仍然需要记忆来维持跨轮次的一致性并在长时间交互中积累可重用的信息。然而,基于检索的外部记忆系统虽然在线开销低,但由于查询构造和候选过滤的限制,准确性不稳定。相比之下,许多系统使用重复的大模型调用进行在线记忆操作,提高了准确性,但在长时间交互中累积了延迟。我们提出了 LightMem,这是一种轻量级的记忆系统,旨在通过小型语言模型(SLMs)改善代理记忆。LightMem 将记忆检索、写入和长期整合模块化,并将在线处理与离线整合分开,以便在有限计算资源下实现高效的记忆调用。我们将记忆组织为短期记忆(STM)用于即时对话上下文,中期记忆(MTM)用于可重用的交互摘要,以及长期记忆(LTM)用于整合知识,并使用用户标识符支持在多用户环境中的独立检索和增量维护。在在线模式下,LightMem 在固定的检索预算下运行,通过两阶段程序选择记忆:首先进行基于向量的粗略检索,然后进行语义一致性重新排序。在离线模式下,它抽象出可重用的交互证据,并将其增量整合到 LTM 中。实验表明,在不同模型规模下均有提升,LoCoMo 上平均 F1 提升约 2.5,且具有更有效和较低的中位延迟(检索 83 毫秒;端到端 581 毫秒)。
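The two-stage online retrieval described in this abstract can be sketched roughly as follows: filter by user identifier, take the top candidates by cosine similarity, then re-rank them and keep a fixed budget. The memory record layout, the parameter names, and the keyword-based `rerank_score` stand-in (which in the paper's setting would presumably be an SLM consistency judgment) are all assumptions for illustration, not LightMem's actual interface.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, memories, rerank_score, user_id, coarse_k=4, budget=2):
    """Stage 1: vector-based coarse retrieval within one user's memories.
    Stage 2: re-rank the coarse candidates and keep at most `budget` items."""
    pool = [m for m in memories if m["user"] == user_id]
    coarse = sorted(pool, key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
    coarse = coarse[:coarse_k]
    reranked = sorted(coarse, key=rerank_score, reverse=True)
    return reranked[:budget]


# Hypothetical memory records; vectors would normally come from an embedding model.
memories = [
    {"id": "a", "user": "u1", "vec": [1.0, 0.0], "text": "likes tea"},
    {"id": "b", "user": "u1", "vec": [0.9, 0.1], "text": "likes coffee"},
    {"id": "c", "user": "u1", "vec": [0.0, 1.0], "text": "lives in Oslo"},
    {"id": "d", "user": "u2", "vec": [1.0, 0.0], "text": "likes coffee"},
]


def mentions_coffee(memory):
    """Toy stand-in for a semantic consistency score."""
    return 1.0 if "coffee" in memory["text"] else 0.0


hits = retrieve([1.0, 0.0], memories, mentions_coffee, user_id="u1", coarse_k=2, budget=1)
print([m["id"] for m in hits])
```

Keeping `coarse_k` and `budget` fixed bounds the number of re-ranking calls per turn, which is one plausible way to keep online latency flat as the memory store grows.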
cs.AI / 31 / 2604.07813
Agentivism: a learning theory for the age of artificial intelligence
代理主义:人工智能时代的学习理论
Abstract
Learning theories have historically changed when the conditions of learning evolved. Generative and agentic AI create a new condition by allowing learners to delegate explanation, writing, problem solving, and other cognitive work to systems that can generate, recommend, and sometimes act on the learner's behalf. This creates a fundamental challenge for learning theory: successful performance can no longer be assumed to indicate learning. Learners may complete tasks effectively with AI support while developing less understanding, weaker judgment, and limited transferable capability. We argue that this problem is not fully captured by existing learning theories. Behaviourism, cognitivism, constructivism, and connectivism remain important, but they do not directly explain when AI-assisted performance becomes durable human capability. We propose Agentivism, a learning theory for human-AI interaction. Agentivism defines learning as durable growth in human capability through selective delegation to AI, epistemic monitoring and verification of AI contributions, reconstructive internalization of AI-assisted outputs, and transfer under reduced support. The importance of Agentivism lies in explaining how learning remains possible when intelligent delegation is easy and human-AI interaction is becoming a persistent and expanding part of human learning.
Chinese Translation
学习理论在历史上随着学习条件的演变而发生变化。生成性和代理性人工智能创造了一种新的条件,使学习者能够将解释、写作、问题解决和其他认知工作委托给能够生成、推荐,并有时代表学习者行动的系统。这对学习理论提出了根本性的挑战:成功的表现不再能够被假定为学习的指示。学习者可能在人工智能支持下有效地完成任务,但却发展出较少的理解、较弱的判断力和有限的可迁移能力。我们认为,这一问题并未被现有的学习理论充分捕捉。行为主义、认知主义、建构主义和连接主义仍然重要,但它们并未直接解释在何种情况下人工智能辅助的表现会转变为持久的人类能力。我们提出了代理主义(Agentivism),一种针对人机交互的学习理论。代理主义将学习定义为通过对人工智能的选择性委托、对人工智能贡献的认知监控与验证、对人工智能辅助输出的重构性内化,以及在减少支持下的迁移,来实现人类能力的持久增长。代理主义的重要性在于解释了当智能委托变得容易且人机交互成为人类学习的持久和扩展部分时,学习如何仍然是可能的。
cs.AI / 32 / 2604.07817
Automatic Generation of Executable BPMN Models from Medical Guidelines
从医疗指南自动生成可执行的BPMN模型
Abstract
We present an end-to-end pipeline that converts healthcare policy documents into executable, data-aware Business Process Model and Notation (BPMN) models using large language models (LLMs) for simulation-based policy evaluation. We address the main challenges of automated policy digitization with four contributions: data-grounded BPMN generation with syntax auto-correction, executable augmentation, KPI instrumentation, and entropy-based uncertainty detection. We evaluate the pipeline on diabetic nephropathy prevention guidelines from three Japanese municipalities, generating 100 models per backend across three LLMs and executing each against 1,000 synthetic patients. On well-structured policies, the pipeline achieves a 100% ground-truth match with perfect per-patient decision agreement. Across all conditions, raw per-patient decision agreement exceeds 92%, and entropy scores increase monotonically with document complexity, confirming that the detector reliably separates unambiguous policies from those requiring targeted human clarification.
Chinese Translation
我们提出了一种端到端的流程,将医疗政策文件转换为可执行的、数据感知的业务流程模型与符号(Business Process Model and Notation, BPMN)模型,利用大型语言模型(Large Language Models, LLMs)进行基于模拟的政策评估。我们通过四个贡献解决了自动化政策数字化的主要挑战:基于数据的BPMN生成及语法自动纠正、可执行增强、关键绩效指标(KPI)仪器化和基于熵的不确定性检测。我们在来自日本三个市的糖尿病肾病预防指南上评估了该流程,在三个LLM的每个后端生成100个模型,并对每个模型进行1,000名合成患者的执行。在结构良好的政策中,该流程实现了100%的真实匹配,且每位患者的决策一致性完美。在所有条件下,原始每位患者的决策一致性超过92%,且熵得分随着文档复杂性的增加而单调上升,确认检测器可靠地区分明确政策与需要针对性人类澄清的政策。
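As a rough illustration of the entropy-based uncertainty detection mentioned in this abstract, the sketch below assumes the detector computes Shannon entropy over the decisions that repeatedly generated models assign to the same synthetic patient. The abstract does not give the exact formulation, so the function names and decision labels are hypothetical.

```python
import math
from collections import Counter


def decision_entropy(decisions):
    """Shannon entropy (in bits) of the empirical distribution over decisions.

    `decisions` holds one label per generated model for a single patient
    (hypothetical setup, not taken from the paper).
    """
    counts = Counter(decisions)
    total = len(decisions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mean_entropy(per_patient_decisions):
    """Average the per-patient entropies into one score for a policy document."""
    entropies = [decision_entropy(d) for d in per_patient_decisions]
    return sum(entropies) / len(entropies)


# All generated models agree: no uncertainty.
unambiguous = [["refer", "refer", "refer", "refer"]]
# Generated models split evenly between two outcomes: one full bit of uncertainty.
ambiguous = [["refer", "monitor", "refer", "monitor"]]

print(mean_entropy(unambiguous), mean_entropy(ambiguous))
```

Under this assumption, a score near zero means the generated models agree on a policy, while a higher score flags a document whose wording admits several readings and may need human clarification.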
cs.AI / 33 / 2604.07835
Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation
消除保护机制:通过动态上下文表示消融进行推理时的越狱攻击
Abstract
While Large Language Models (LLMs) have achieved remarkable performance, they remain vulnerable to jailbreak attacks that circumvent safety constraints. Existing strategies, ranging from heuristic prompt engineering to computationally intensive optimization, often face significant trade-offs between effectiveness and efficiency. In this work, we propose Contextual Representation Ablation (CRA), a novel inference-time intervention framework designed to dynamically silence model guardrails. Predicated on the geometric insight that refusal behaviors are mediated by specific low-rank subspaces within the model's hidden states, CRA identifies and suppresses these refusal-inducing activation patterns during decoding without requiring expensive parameter updates or training. Empirical evaluation across multiple safety-aligned open-source LLMs demonstrates that CRA significantly outperforms baselines. These results expose the intrinsic fragility of current alignment mechanisms, revealing that safety constraints can be surgically ablated from internal representations, and underscore the urgent need for more robust defenses that secure the model's latent space.
Chinese Translation
尽管大型语言模型(LLMs)已取得显著的性能,但它们仍然容易受到越狱攻击,这些攻击能够绕过安全约束。现有策略从启发式提示工程到计算密集型优化,通常在有效性和效率之间面临显著的权衡。在本研究中,我们提出了上下文表示消融(Contextual Representation Ablation, CRA),这是一种新颖的推理时干预框架,旨在动态消除模型的保护机制。CRA基于几何洞察,即拒绝行为是由模型隐藏状态中的特定低秩子空间介导的,识别并抑制这些在解码过程中诱发拒绝的激活模式,而无需昂贵的参数更新或训练。对多个与安全对齐的开源LLM的实证评估表明,CRA显著优于基线。这些结果揭示了当前对齐机制的内在脆弱性,表明安全约束可以从内部表示中被精确消融,并强调了对更强大防御措施的迫切需求,以保护模型的潜在空间。
cs.AI / 34 / 2604.07837
SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility
SPARD:通过整合奖励动态与数据效用的自适应课程用于强化学习对齐
Abstract
The evolution of Large Language Models (LLMs) is shifting the focus from single, verifiable tasks toward complex, open-ended real-world scenarios, imposing significant challenges on the post-training phase. In these settings, the scale and complexity of reward systems have grown significantly, transitioning toward multi-objective formulations that encompass a comprehensive spectrum of model capabilities and application contexts. However, traditional methods typically rely on fixed reward weights, ignoring non-stationary learning dynamics and struggling with data heterogeneity across dimensions. To address these issues, we propose SPARD, a framework that establishes an automated, self-paced curriculum by perceiving learning progress to dynamically adjust multi-objective reward weights and data importance, thereby synchronizing learning intent with data utility for optimal performance. Extensive experiments across multiple benchmarks demonstrate that SPARD significantly enhances model capabilities across all domains.
Chinese Translation
大型语言模型(LLMs)的发展正在将焦点从单一的、可验证的任务转向复杂的、开放式的现实场景,这对后训练阶段提出了重大挑战。在这些环境中,奖励系统的规模和复杂性显著增加,逐渐转向涵盖模型能力和应用背景的多目标形式。然而,传统方法通常依赖固定的奖励权重,忽视了非平稳学习动态,并在维度间的数据异质性方面面临困难。为了解决这些问题,我们提出了SPARD,一个通过感知学习进展来动态调整多目标奖励权重和数据重要性的自动化自适应课程框架,从而将学习意图与数据效用同步,以实现最佳性能。在多个基准测试中的广泛实验表明,SPARD显著增强了模型在各个领域的能力。
cs.AI / 35 / 2604.07855
Hidden Biases in Conditioning Autoregressive Models
条件自回归模型中的隐性偏差
Abstract
Large language and music models are increasingly used for constrained generation: rhyming lines, fixed meter, inpainting or infilling, positional endings, and other global form requirements. These systems often perform strikingly well, but the induced procedures are usually not exact conditioning of the underlying autoregressive model. This creates a hidden inferential bias, distinct from the better-known notion of bias inherited from the training set: samples are distorted relative to the true constrained distribution, with no generic guarantee of complete coverage of the admissible solution space or of correct conditional probabilities over valid completions. We formalize several exact inference tasks for autoregressive models and prove corresponding hardness results. For succinctly represented autoregressive models whose next-token probabilities are computable in polynomial time, exact sentence-level maximum a posteriori (MAP) decoding is NP-hard. This hardness persists under unary and metrical constraints. On the sampling side, exact conditioned normalization is #P-hard even for regular constraints such as fixed-length terminal events. Unlike finite-state Markov models, general autoregressive models do not admit a bounded-state dynamic program for these tasks. These results formalize a standard claim in the neural decoding literature: local autoregressive sampling is easy, whereas exact decoding and exact conditioning under global form constraints are computationally intractable in general.
Chinese Translation
大型语言和音乐模型越来越多地用于受限生成:押韵行、固定韵律、图像修补或填充、位置结尾以及其他全局形式要求。这些系统通常表现出色,但所引入的过程通常并不是对基础自回归模型的精确条件化。这导致了一种隐性推理偏差,与从训练集中继承的更为人知的偏差概念不同:样本相对于真实的受限分布被扭曲,且没有普遍保证能够完全覆盖可接受解空间或正确的条件概率。我们对自回归模型的几个精确推理任务进行了形式化,并证明了相应的困难性结果。对于那些下一个标记概率可以在多项式时间内计算的简洁表示的自回归模型,精确的句子级最大后验(MAP)解码是 NP-困难的。这种困难在一元和韵律约束下依然存在。在采样方面,即使对于固定长度终端事件等常规约束,精确的条件归一化也是 #P-困难的。与有限状态马尔可夫模型不同,一般的自回归模型不允许对这些任务进行有界状态动态规划。这些结果形式化了神经解码文献中的一个标准主张:局部自回归采样是简单的,而在全局形式约束下进行精确解码和精确条件化在一般情况下是计算上不可处理的。
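The gap between local constraint handling and exact conditioning described in this abstract can be made concrete with a minimal sketch. The two-token model, its probabilities, and the "final token must be b" constraint below are all hypothetical choices for illustration, not taken from the paper. Exact conditioning is done here by brute-force enumeration, which is only feasible because the example is tiny.

```python
from fractions import Fraction
from itertools import product

VOCAB = ("a", "b")
LENGTH = 2


def next_token_probs(prefix):
    """Hypothetical toy autoregressive model: p(next token | prefix)."""
    if not prefix:
        return {"a": Fraction(1, 2), "b": Fraction(1, 2)}
    if prefix[-1] == "a":
        return {"a": Fraction(1, 10), "b": Fraction(9, 10)}
    return {"a": Fraction(9, 10), "b": Fraction(1, 10)}


def allowed(position, token):
    """Global form constraint (a positional ending): the last token must be 'b'."""
    return position < LENGTH - 1 or token == "b"


def exact_conditional():
    """Enumerate every sequence, keep the valid ones, and renormalize globally."""
    weights = {}
    for seq in product(VOCAB, repeat=LENGTH):
        if not all(allowed(i, tok) for i, tok in enumerate(seq)):
            continue
        p = Fraction(1)
        for i, tok in enumerate(seq):
            p *= next_token_probs(seq[:i])[tok]
        weights["".join(seq)] = p
    z = sum(weights.values())
    return {s: w / z for s, w in weights.items()}


def locally_masked():
    """Mask disallowed tokens at each step and renormalize only locally."""
    dist = {(): Fraction(1)}
    for i in range(LENGTH):
        new_dist = {}
        for prefix, p in dist.items():
            probs = {t: q for t, q in next_token_probs(prefix).items() if allowed(i, t)}
            z = sum(probs.values())
            for tok, q in probs.items():
                new_dist[prefix + (tok,)] = p * q / z
        dist = new_dist
    return {"".join(s): p for s, p in dist.items()}


print("exact :", exact_conditional())
print("masked:", locally_masked())
```

In this toy case the exact conditional puts probability 9/10 on "ab" and 1/10 on "bb", whereas step-wise masking yields 1/2 for each: the first token is sampled without any knowledge of how likely the constraint is to be satisfied later. The enumeration grows exponentially with sequence length, which is in line with the hardness results the abstract reports for the general case.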
cs.AI / 36 / 2604.07883
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
用于教育教科书历史偏见检测的代理评估架构
Abstract
History textbooks often contain implicit biases, nationalist framing, and selective omissions that are difficult to audit at scale. We propose an agentic evaluation architecture comprising a multimodal screening agent, a heterogeneous jury of five evaluative agents, and a meta-agent for verdict synthesis and human escalation. A central contribution is a Source Attribution Protocol that distinguishes textbook narrative from quoted historical sources, preventing the misattribution that causes systematic false positives in single-model evaluators. In an empirical study on Romanian upper-secondary history textbooks, 83.3% of 270 screened excerpts were classified as pedagogically acceptable (mean severity 2.9/7), versus 5.4/7 under a zero-shot baseline, demonstrating that agentic deliberation mitigates over-penalization. In a blind human evaluation (18 evaluators, 54 comparisons), the Independent Deliberation configuration was preferred in 64.8% of cases over both a heuristic variant and the zero-shot baseline. At approximately $2 per textbook, these results position agentic evaluation architectures as economically viable decision-support tools for educational governance.
Chinese Translation
历史教科书中常常包含隐性偏见、民族主义框架和选择性遗漏,这些问题在大规模审计中难以处理。我们提出了一种代理评估架构,包括一个多模态筛选代理、一个由五个评估代理组成的异构陪审团,以及一个用于裁决综合和人类升级的元代理。一个核心贡献是源归属协议,它区分了教科书叙述与引用的历史来源,防止了导致单模型评估器系统性假阳性的错误归属。在对罗马尼亚高中历史教科书的实证研究中,270个筛选摘录中有83.3%被分类为教学上可接受(平均严重性2.9/7),而在零样本基线下仅为5.4/7,表明代理审议减轻了过度惩罚。在一项盲人类评估中(18名评估者,54次比较),独立审议配置在64.8%的情况下优于启发式变体和零样本基线。以每本教科书约2美元的成本,这些结果将代理评估架构定位为教育治理中经济可行的决策支持工具。
cs.AI / 37 / 2604.07895
DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues
DialBGM:来自日常多轮对话的背景音乐推荐基准
Abstract
Selecting appropriate background music (BGM) that supports natural human conversation is a common production step in media and interactive systems. In this paper, we introduce dialogue-conditioned BGM recommendation, where a model should select non-intrusive, fitting music for a multi-turn conversation that often contains no music descriptors. To study this novel problem, we present DialBGM, a benchmark of 1,200 open-domain daily dialogues, each paired with four candidate music clips and annotated with human preference rankings. Rankings are determined by background suitability criteria, including contextual relevance, non-intrusiveness, and consistency. We evaluate a wide range of open-source and proprietary models, including audio-language models and multimodal LLMs, and show that current models fall far short of human judgments; no model exceeds 35% Hit@1 when selecting the top-ranked clip. DialBGM provides a standardized benchmark for developing discourse-aware methods for BGM selection and for evaluating both retrieval-based and generative models.
Chinese Translation
选择适合自然人类对话的背景音乐(BGM)是媒体和互动系统中的一个常见制作步骤。本文介绍了对话条件下的BGM推荐,其中模型应为多轮对话选择非侵入性且合适的音乐,而这些对话通常不包含音乐描述符。为了研究这一新问题,我们提出了DialBGM,这是一个包含1200个开放领域日常对话的基准数据集,每个对话都配有四个候选音乐片段,并附有人工偏好排名。排名是根据背景适宜性标准确定的,包括上下文相关性、非侵入性和一致性。我们评估了多种开源和专有模型,包括音频-语言模型和多模态大语言模型(LLMs),结果表明当前模型远未达到人类的判断;在选择排名最高的片段时,没有任何模型的Hit@1超过35%。DialBGM为开发关注话语的BGM选择方法提供了标准化基准,并为评估基于检索和生成的模型提供了依据。
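The Hit@1 figure quoted in this abstract can be read as the share of dialogues for which a model's single pick coincides with the clip that humans ranked first. The sketch below follows that reading; the clip identifiers and the data layout are hypothetical, and the benchmark's own evaluation script may differ in details such as tie handling.

```python
def hit_at_1(predicted_top, human_rankings):
    """Fraction of dialogues where the model's pick equals the human top-ranked clip.

    predicted_top:  one chosen clip id per dialogue.
    human_rankings: one list of clip ids per dialogue, best first.
    """
    hits = sum(
        1
        for pred, ranking in zip(predicted_top, human_rankings)
        if pred == ranking[0]
    )
    return hits / len(human_rankings)


# Three hypothetical dialogues, each with four candidate clips.
rankings = [
    ["c2", "c1", "c3", "c4"],
    ["c3", "c1", "c2", "c4"],
    ["c4", "c1", "c2", "c3"],
]
picks = ["c2", "c1", "c4"]

print(hit_at_1(picks, rankings))
```

With four candidates per dialogue and a single top-ranked clip, uniform random guessing would give an expected Hit@1 of 25%, which puts the reported ceiling of 35% in context.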
cs.AI / 38 / 2604.07897
Visual Perceptual to Conceptual First-Order Rule Learning Networks
从视觉感知到概念的一阶规则学习网络
Abstract
Learning rules plays a crucial role in deep learning, particularly in explainable artificial intelligence and enhancing the reasoning capabilities of large language models. While existing rule learning methods are primarily designed for symbolic data, learning rules from image data without supporting image labels and automatically inventing predicates remains a challenge. In this paper, we tackle these inductive rule learning problems from images with a framework called $\gamma$ILP, which provides a fully differentiable pipeline from image constant substitution to rule structure induction. Extensive experiments demonstrate that $\gamma$ILP achieves strong performance not only on classical symbolic relational datasets but also on relational image data and pure image datasets, such as Kandinsky patterns.
Chinese Translation
规则学习在深度学习中起着关键作用,尤其是在可解释人工智能和增强大型语言模型推理能力方面。尽管现有的规则学习方法主要针对符号数据设计,但在缺乏图像标签支持且自动发明谓词的情况下,从图像数据中学习规则仍然是一大挑战。本文提出了一种名为$\gamma$ILP的框架,解决了基于图像的归纳规则学习问题,该框架提供了一个从图像常量替换到规则结构归纳的全可微分流程。大量实验表明,$\gamma$ILP不仅在经典符号关系数据集上表现优异,还能有效应用于关系图像数据和纯图像数据集,如Kandinsky图案。
cs.AI / 39 / 2604.07907
Capture-Quiet Decomposition: A Verification Theorem for Chess Endgame Tablebases
捕获-安静分解:国际象棋残局表的验证定理
Abstract
We present the Capture-Quiet Decomposition (CQD), a structural theorem for verifying Win-Draw-Loss (WDL) labelings of chess endgame tablebases. The theorem decomposes every legal position into exactly one of three categories -- terminal, capture, or quiet -- and shows that a WDL labeling is correct if and only if: (1) terminal positions are labeled correctly, (2) capture positions are consistent with verified sub-models of smaller piece count, and (3) quiet positions satisfy retrograde consistency within the same endgame. The key insight is that capture positions anchor the labeling to externally verified sub-models, breaking the circularity that allows trivial fixpoints (such as the all-draw labeling) to satisfy self-consistency alone. We validate CQD exhaustively on all 35 three- and four-piece endgames (42 million positions), all 110 five-piece endgames, and all 372 six-piece endgames -- 517 endgames in total -- with the decomposed verifier producing identical violation counts to a full retrograde baseline in every case.
Chinese Translation
我们提出了捕获-安静分解(Capture-Quiet Decomposition, CQD),这是一个用于验证国际象棋残局表中胜-平-负(Win-Draw-Loss, WDL)标记的结构定理。该定理将每个合法位置分解为三类之一——终端、捕获或安静,并表明WDL标记是正确的当且仅当:(1)终端位置标记正确,(2)捕获位置与经过验证的较小棋子数子模型一致,以及(3)安静位置在同一残局中满足逆向一致性。关键的见解是,捕获位置将标记锚定到外部验证的子模型,打破了允许平凡不动点(例如全平标记)仅通过自我一致性满足的循环性。我们在所有35个三子和四子残局(4200万个位置)、所有110个五子残局以及所有372个六子残局——总共517个残局上全面验证了CQD,分解验证器在每种情况下产生的违规计数与完整的逆向基线一致。
cs.AI / 40 / 2604.07922
SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking
SAT:通过逐步自适应思维平衡推理准确性和效率
Abstract
Large Reasoning Models (LRMs) have revolutionized complex problem-solving, yet they exhibit pervasive "overthinking", generating unnecessarily long reasoning chains. While current solutions improve token efficiency, they often sacrifice fine-grained control or risk disrupting the logical integrity of the reasoning process. To address this, we introduce Stepwise Adaptive Thinking (SAT), a framework that performs step-level, difficulty-aware pruning while preserving the core reasoning structure. SAT formulates reasoning as a Finite-State Machine (FSM) with distinct thinking modes (Slow, Normal, Fast, Skip). It navigates these states dynamically using a lightweight Process Reward Model (PRM), compressing easy steps while preserving depth for hard ones. Experiments across 9 LRMs and 7 benchmarks show that SAT achieves up to 40% reduction in reasoning tokens while generally maintaining or improving accuracy.
Chinese Translation
大型推理模型(LRMs)已经彻底改变了复杂问题的解决方式,但它们普遍存在“过度思考”的现象,生成不必要的冗长推理链。虽然当前的解决方案提高了令牌效率,但往往牺牲了细粒度控制,或者有可能破坏推理过程的逻辑完整性。为了解决这个问题,我们提出了逐步自适应思维(Stepwise Adaptive Thinking, SAT)框架,该框架在保持核心推理结构的同时,执行逐步、难度感知的剪枝。SAT将推理形式化为有限状态机(Finite-State Machine, FSM),具有不同的思维模式(慢速、正常、快速、跳过)。它通过轻量级的过程奖励模型(Process Reward Model, PRM)动态导航这些状态,压缩简单步骤的同时为困难步骤保留深度。在9个LRMs和7个基准测试中的实验表明,SAT在推理令牌上实现了最高40%的减少,同时通常保持或提高了准确性。
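To make the finite-state view of reasoning in this abstract more tangible, here is a minimal sketch of a four-mode state machine driven by a per-step score. Only the four mode names come from the abstract. The transition rule (move one state at a time), the thresholds, and the interpretation of the score as "how easy the step looks" are assumptions made for illustration, not SAT's actual policy.

```python
from enum import IntEnum


class Mode(IntEnum):
    """Thinking modes named in the abstract, ordered from most to least effort."""
    SLOW = 0
    NORMAL = 1
    FAST = 2
    SKIP = 3


def next_mode(mode, prm_score, low=0.4, high=0.8):
    """Hypothetical transition rule based on a PRM-style score in [0, 1].

    A low score (step looks hard) shifts one state toward SLOW;
    a high score (step looks easy) shifts one state toward SKIP.
    """
    if prm_score < low:
        return Mode(max(mode - 1, Mode.SLOW))
    if prm_score > high:
        return Mode(min(mode + 1, Mode.SKIP))
    return mode


def run(prm_scores, start=Mode.NORMAL):
    """Return the mode chosen for each reasoning step."""
    modes = []
    mode = start
    for score in prm_scores:
        mode = next_mode(mode, score)
        modes.append(mode)
    return modes


trace = run([0.9, 0.9, 0.2, 0.5])
print([m.name for m in trace])
```

Moving one state at a time is one simple way to avoid abrupt jumps between full deliberation and skipping; a real system could equally map scores directly to modes.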
cs.AI / 41 / 2604.07927
EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools
EigentSearch-Q+: 通过结构化推理工具增强深度研究代理
Abstract
Deep research requires reasoning over web evidence to answer open-ended questions, and it is a core capability for AI agents. Yet many deep research agents still rely on implicit, unstructured search behavior that causes redundant exploration and brittle evidence aggregation. Motivated by Anthropic's "think" tool paradigm and insights from the information-retrieval literature, we introduce Q+, a set of query and evidence processing tools that make web search more deliberate by guiding query planning, monitoring search progress, and extracting evidence from long web snapshots. We integrate Q+ into the browser sub-agent of Eigent, an open-source, production-ready multi-agent workforce for computer use, yielding EigentSearch-Q+. Across four benchmarks (SimpleQA-Verified, FRAMES, WebWalkerQA, and X-Bench DeepSearch), Q+ improves Eigent's browser agent benchmark-size-weighted average accuracy by 3.0, 3.8, and 0.6 percentage points (pp) for GPT-4.1, GPT-5.1, and Minimax M2.5 model backends, respectively. Case studies further suggest that EigentSearch-Q+ produces more coherent tool-calling trajectories by making search progress and evidence handling explicit.
Chinese Translation
深度研究需要对网络证据进行推理以回答开放性问题,这是人工智能代理的一项核心能力。然而,许多深度研究代理仍然依赖于隐式的、非结构化的搜索行为,这导致了冗余的探索和脆弱的证据聚合。受到Anthropic的“思考”工具范式和信息检索文献的启发,我们引入了Q+,一组查询和证据处理工具,通过指导查询规划、监控搜索进度以及从长时间的网络快照中提取证据,使网络搜索变得更加有意图。我们将Q+集成到Eigent的浏览器子代理中,Eigent是一个开源的、可生产使用的多代理计算机工作平台,从而产生了EigentSearch-Q+。在四个基准测试(SimpleQA-Verified、FRAMES、WebWalkerQA和X-Bench DeepSearch)中,Q+分别提高了Eigent的浏览器代理基准加权平均准确率3.0、3.8和0.6个百分点(pp),适用于GPT-4.1、GPT-5.1和Minimax M2.5模型后端。案例研究进一步表明,EigentSearch-Q+通过明确搜索进展和证据处理,产生了更连贯的工具调用轨迹。
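The benchmark-size-weighted average accuracy used in this abstract can be sketched as follows, assuming it weights each benchmark's accuracy by its number of questions. The benchmark names, accuracies, and sizes below are hypothetical placeholders, not the paper's figures.

```python
def size_weighted_accuracy(results):
    """Weighted mean accuracy, with each benchmark weighted by its size.

    results: mapping of benchmark name -> (accuracy in percent, number of questions).
    """
    total_questions = sum(n for _, n in results.values())
    return sum(acc * n for acc, n in results.values()) / total_questions


# Hypothetical numbers purely for illustration.
example = {
    "bench_a": (80.0, 100),
    "bench_b": (50.0, 300),
}

print(size_weighted_accuracy(example))
```

In this example the weighted average is 57.5, while the plain mean of the two accuracies would be 65.0, so large benchmarks dominate the aggregate.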
cs.AI / 42 / 2604.07956
MONETA: Multimodal Industry Classification through Geographic Information with Multi Agent Systems
MONETA:基于地理信息的多模态行业分类与多智能体系统
Abstract
Industry classification schemes are integral parts of public and corporate databases as they classify businesses based on economic activity. Due to the size of the company registers, manual annotation is costly, and fine-tuning models with every update in industry classification schemes requires significant data collection. We replicate the manual expert verification by using existing or easily retrievable multimodal resources for industry classification. We present MONETA, the first multimodal industry classification benchmark with text (Website, Wikipedia, Wikidata) and geospatial sources (OpenStreetMap and satellite imagery). Our dataset lists 1,000 businesses in Europe with 20 economic activity labels according to EU guidelines (NACE). Our training-free baseline reaches 62.10% and 74.10% with open and closed-source Multimodal Large Language Models (MLLM). We observe an increase of up to 22.80% with the combination of multi-turn design, context enrichment, and classification explanations. We will release our dataset and the enhanced guidelines.
Chinese Translation
行业分类方案是公共和企业数据库的重要组成部分,因为它们根据经济活动对企业进行分类。由于公司注册的规模庞大,手动注释成本高昂,并且在每次行业分类方案更新时微调模型需要大量的数据收集。我们通过利用现有或易于获取的多模态资源来复制手动专家验证,以进行行业分类。我们提出了MONETA,这是第一个多模态行业分类基准,结合了文本(网站、维基百科、维基数据)和地理空间资源(OpenStreetMap和卫星图像)。我们的数据集列出了1000家欧洲企业,并根据欧盟指南(NACE)提供了20个经济活动标签。我们的无训练基线在开放源和闭源的多模态大型语言模型(MLLM)上分别达到了62.10%和74.10%。我们观察到通过多轮设计、上下文丰富和分类解释的结合,准确率提高了最多22.80%。我们将发布我们的数据集和增强的指南。
cs.AI / 43 / 2604.07957
WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models
WorldMAP:利用生成世界模型引导视觉-语言导航轨迹预测
Abstract
Vision-language models (VLMs) and generative world models are opening new opportunities for embodied navigation. VLMs are increasingly used as direct planners or trajectory predictors, while world models support look-ahead reasoning by imagining future views. Yet predicting a reliable trajectory from a single egocentric observation remains challenging. Current VLMs often generate unstable trajectories, and world models, though able to synthesize plausible futures, do not directly provide the grounded signals needed for navigation learning. This raises a central question: how can generated futures be turned into supervision for grounded trajectory prediction? We present WorldMAP, a teacher-student framework that converts world-model-generated futures into persistent semantic-spatial structure and planning-derived supervision. Its world-model-driven teacher builds semantic-spatial memory from generated videos, grounds task-relevant targets and obstacles, and produces trajectory pseudo-labels through explicit planning. A lightweight student with a multi-hypothesis trajectory head is then trained to predict navigation trajectories directly from vision-language inputs. On Target-Bench, WorldMAP achieves the best ADE and FDE among compared methods, reducing ADE by 18.0% and FDE by 42.1% relative to the best competing baseline, while lifting a small open-source VLM to DTW performance competitive with proprietary models. More broadly, the results suggest that, in embodied navigation, the value of world models may lie less in supplying action-ready imagined evidence than in synthesizing structured supervision for navigation learning.
Chinese Translation
视觉-语言模型(VLMs)和生成世界模型为具身导航开辟了新的机遇。VLMs越来越多地被用作直接规划者或轨迹预测器,而世界模型通过想象未来视图来支持前瞻性推理。然而,从单一的自我中心观察中预测可靠的轨迹仍然具有挑战性。目前的VLMs往往生成不稳定的轨迹,而世界模型虽然能够合成合理的未来,但并未直接提供导航学习所需的基础信号。这引发了一个核心问题:如何将生成的未来转化为用于基础轨迹预测的监督?我们提出了WorldMAP,一个教师-学生框架,将世界模型生成的未来转化为持久的语义-空间结构和规划衍生的监督。其世界模型驱动的教师从生成的视频中构建语义-空间记忆,确定与任务相关的目标和障碍,并通过明确规划生成轨迹伪标签。随后,一个具有多假设轨迹头的轻量级学生被训练直接从视觉-语言输入中预测导航轨迹。在Target-Bench上,WorldMAP在比较方法中实现了最佳的平均距离误差(ADE)和最终距离误差(FDE),相较于最佳竞争基线,ADE降低了18.0%,FDE降低了42.1%,同时将一个小型开源VLM提升至与专有模型竞争的动态时间规整(DTW)性能。更广泛地说,结果表明,在具身导航中,世界模型的价值可能不在于提供随时可用的想象证据,而在于为导航学习合成结构化的监督。
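The ADE and FDE metrics reported in this abstract have standard definitions: average displacement error is the mean Euclidean distance between predicted and reference waypoints, and final displacement error is the distance at the last waypoint. The sketch below implements those definitions for a single predicted trajectory; the coordinates are hypothetical, and how Target-Bench aggregates over multiple hypotheses or scenes is not stated here.

```python
import math


def ade(pred, gt):
    """Average Displacement Error: mean Euclidean distance over all waypoints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def fde(pred, gt):
    """Final Displacement Error: Euclidean distance at the last waypoint."""
    return math.dist(pred[-1], gt[-1])


# Hypothetical 2D trajectories with three waypoints each.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
reference = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

print(ade(predicted, reference), fde(predicted, reference))
```

Because the example drifts steadily away from the reference, FDE (2.0) is larger than ADE (1.0), which is the typical pattern when errors accumulate along a path.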
cs.AI / 44 / 2604.07964
Are we still able to recognize pearls? Machine-driven peer review and the risk to creativity: An explainable RAG-XAI detection framework with markers extraction
我们仍然能够识别珍珠吗?机器驱动的同行评审与创造力的风险:一种可解释的RAG-XAI检测框架及标记提取
Abstract
The integration of large language models (LLMs) into peer review raises a concern beyond authorship and detection: the potential cascading automation of the entire editorial process. As reviews become partially or fully machine-generated, it becomes plausible that editorial decisions may also be delegated to algorithmic systems, leading to a fully automated evaluation pipeline. Such a pipeline risks reshaping the criteria by which scientific work is assessed. This paper argues that machine-driven assessment may systematically favor standardized, pattern-conforming research while penalizing unconventional and paradigm-shifting ideas that require contextual human judgment. We consider that this shift could lead to epistemic homogenization, where researchers are implicitly incentivized to optimize their work for algorithmic approval rather than genuine discovery. To address this risk, we introduce an explainable framework (RAG-XAI) for assessing review quality and detecting automated patterns using an LLM-based marker extractor, aiming to preserve transparency, accountability and creativity in science. The proposed framework achieves near-perfect detection performance, with XGBoost, Random Forest and LightGBM reaching 99.61% accuracy, AUC-ROC above 0.999 and F1-scores of 0.9925 on the test set, while maintaining extremely low false positive rates (<0.23%) and false negative rates (~0.8%). In contrast, the logistic regression baseline performs substantially worse (89.97% accuracy, F1-score 0.8314). Feature importance and SHAP analyses identify absence of personal signals and repetition patterns as the dominant predictors. Additionally, the RAG component achieves 90.5% top-1 retrieval accuracy, with strong same-class clustering in the embedding space, further supporting the reliability of the framework's outputs.
Chinese Translation
大型语言模型(LLMs)在同行评审中的应用引发了超越作者身份和检测的担忧:整个编辑过程潜在的级联自动化。随着评审逐渐部分或完全由机器生成,编辑决策也可能被委托给算法系统,从而导致完全自动化的评估流程。这些变化有可能重塑科学工作评估的标准。本文认为,机器驱动的评估可能系统性地偏向于标准化、符合模式的研究,同时惩罚那些需要上下文人类判断的非常规和范式转变的想法。我们认为,这一转变可能导致认识论的同质化,研究人员在无形中被激励去优化他们的工作以获得算法的认可,而非真正的发现。为应对这一风险,我们提出了一种可解释的框架(RAG-XAI),用于评估评审质量和使用标记LLM提取器检测自动化模式,旨在维护科学中的透明度、问责制和创造力。所提框架实现了近乎完美的检测性能,XGBoost、随机森林和LightGBM在测试集上的准确率达到99.61%,AUC-ROC超过0.999,F1分数为0.9925,同时保持极低的假阳性率(<0.23%)和假阴性率(~0.8%)。相比之下,逻辑回归基线的表现明显较差(准确率89.97%,F1分数0.8314)。特征重要性和SHAP分析表明,个人信号的缺失和重复模式是主要预测因素。此外,RAG组件在嵌入空间中实现了90.5%的Top-1检索准确率,并在同类聚类中表现出强劲的聚合性,进一步支持了框架输出的可靠性。
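The detection metrics quoted in this abstract (accuracy, F1, false positive rate, false negative rate) all derive from a binary confusion matrix. The sketch below shows the standard formulas; the counts are hypothetical and are not the paper's test-set figures.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn: true positives, false positives, true negatives, false negatives,
    where "positive" means a review flagged as machine-generated.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),  # share of human-written reviews wrongly flagged
        "fnr": fn / (fn + tp),  # share of machine-generated reviews missed
    }


# Hypothetical counts for illustration only.
metrics = detection_metrics(tp=90, fp=10, tn=890, fn=10)
print(metrics)
```

Reporting the false positive rate separately matters in this setting because a false positive corresponds to a human-written review being labeled as automated.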
cs.AI / 45 / 2604.07973
How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace
大型多模态模型距离人类水平空间行动有多远?城市空域目标导向的具身导航基准测试
Abstract
Large multimodal models (LMMs) show strong visual-linguistic reasoning but their capacity for spatial decision-making and action remains unclear. In this work, we investigate whether LMMs can achieve embodied spatial action like humans through a challenging scenario: goal-oriented navigation in urban 3D spaces. We first spend over 500 hours constructing a dataset comprising 5,037 high-quality goal-oriented navigation samples, with an emphasis on 3D vertical actions and rich urban semantic information. Then, we comprehensively assess 17 representative models, including non-reasoning LMMs, reasoning LMMs, agent-based methods, and vision-language-action models. Experiments show that current LMMs exhibit emerging action capabilities, yet remain far from human-level performance. Furthermore, we reveal an intriguing phenomenon: navigation errors do not accumulate linearly but instead diverge rapidly from the destination after a critical decision bifurcation. The limitations of LMMs are investigated by analyzing their behavior at these critical decision bifurcations. Finally, we experimentally explore four promising directions for improvement: geometric perception, cross-view understanding, spatial imagination, and long-term memory. The project is available at: https://github.com/serenditipy-AC/Embodied-Navigation-Bench.
Chinese Translation
大型多模态模型(LMMs)在视觉-语言推理方面表现出色,但它们在空间决策和行动能力方面的表现仍不明确。在本研究中,我们通过一个具有挑战性的场景——城市三维空间中的目标导向导航,探讨LMMs是否能够实现类似人类的具身空间行动。我们首先花费超过500小时构建了一个数据集,包含5,037个高质量的目标导向导航样本,重点关注三维垂直动作和丰富的城市语义信息。然后,我们全面评估了17个具有代表性的模型,包括非推理LMMs、推理LMMs、基于代理的方法以及视觉-语言-行动模型。实验表明,当前的LMMs展现出新兴的行动能力,但仍远未达到人类水平的表现。此外,我们揭示了一个有趣的现象:导航错误并不是线性累积的,而是在关键决策分叉后迅速偏离目标。通过分析LMMs在这些关键决策分叉时的行为,我们探讨了它们的局限性。最后,我们实验性地探索了四个有前景的改进方向:几何感知、跨视角理解、空间想象和长期记忆。该项目可在以下网址获取:https://github.com/serenditipy-AC/Embodied-Navigation-Bench。
cs.AI / 46 / 2604.08000
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
PASK:朝着具有长期记忆的意图感知主动代理迈进
Abstract
Proactivity is a core expectation for AGI. Prior work remains largely confined to laboratory settings, leaving a clear gap for real-world proactive agents, which must cope with depth, complexity, ambiguity, precision, and real-time constraints. We study this setting, where useful intervention requires inferring latent needs from ongoing context and grounding actions in evolving user memory under latency and long-horizon constraints. We first propose DD-MM-PAS (Demand Detection, Memory Modeling, Proactive Agent System) as a general paradigm for streaming proactive AI agents. We instantiate this paradigm in PASK, with a streaming IntentFlow model for DD, a hybrid memory (workspace, user, global) for long-term MM, and a PAS infrastructure framework, and describe how these components form a closed loop. We also introduce LatentNeeds-Bench, a real-world benchmark built from user-consented data and refined through thousands of rounds of human editing. Experiments show that IntentFlow matches leading Gemini3-Flash models under latency constraints, while identifying deeper user intent.
Chinese Translation
主动性是通用人工智能(AGI)的核心期望。以往的研究主要局限于实验室环境,导致在现实世界中主动代理存在明显的缺口:深度、复杂性、模糊性、精确性和实时约束。我们研究了这一环境,在该环境中,有效的干预需要从持续的上下文中推断潜在需求,并在延迟和长时间范围约束下将行动与不断发展的用户记忆相结合。我们首先提出了DD-MM-PAS(需求检测、记忆建模、主动代理系统)作为流式主动人工智能代理的一种通用范式。我们在PASK中实例化了这一范式,采用流式IntentFlow模型进行需求检测,使用混合记忆(工作空间、用户、全局)进行长期记忆建模,构建主动代理系统基础框架,并介绍这些组件如何形成闭环。我们还引入了LatentNeeds-Bench,这是一个基于用户同意数据构建的现实世界基准,并经过数千轮人工编辑进行精炼。实验表明,IntentFlow在延迟约束下与领先的Gemini3-Flash模型相匹配,同时识别出更深层次的用户意图。
cs.AI / 47 / 2604.08004
Evaluating Counterfactual Explanation Methods on Incomplete Inputs
在不完整输入上评估反事实解释方法
Abstract
Existing algorithms for generating Counterfactual Explanations (CXs) for Machine Learning (ML) typically assume fully specified inputs. However, real-world data often contains missing values, and the impact of these incomplete inputs on the performance of existing CX methods remains unexplored. To address this gap, we systematically evaluate recent CX generation methods on their ability to provide valid and plausible counterfactuals when inputs are incomplete. As part of this investigation, we hypothesize that robust CX generation methods will be better suited to address the challenge of providing valid and plausible counterfactuals when inputs are incomplete. Our findings reveal that while robust CX methods achieve higher validity than non-robust ones, all methods struggle to find valid counterfactuals. These results motivate the need for new CX methods capable of handling incomplete inputs.
Chinese Translation
现有的生成反事实解释(Counterfactual Explanations, CXs)的方法通常假设输入是完全指定的。然而,现实世界的数据往往包含缺失值,而这些不完整输入对现有CX方法性能的影响尚未得到探索。为了解决这一问题,我们系统地评估了近期的CX生成方法在输入不完整时提供有效且合理的反事实的能力。在这项研究中,我们假设稳健的CX生成方法在面对不完整输入时更能有效地提供有效且合理的反事实。我们的研究结果表明,尽管稳健的CX方法在有效性方面优于非稳健的方法,但所有方法在寻找有效反事实时都面临困难。这些结果促使我们需要开发能够处理不完整输入的新CX方法。
cs.AI / 48 / 2604.08016
Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs
解构“为什么”:大型语言模型中溯因推理的统一分类与综述
Abstract
Although it plays a foundational role in human discovery and sense-making, abductive reasoning (the inference of the most plausible explanation for an observation) has been relatively underexplored in Large Language Models (LLMs). Despite the rapid advancement of LLMs, the exploration of abductive reasoning and its diverse facets has thus far been disjointed rather than cohesive. This paper presents the first survey of abductive reasoning in LLMs, tracing its trajectory from philosophical foundations to contemporary AI implementations. To address the widespread conceptual confusion and disjointed task definitions prevalent in the field, we establish a unified two-stage definition that formally categorizes prior work. This definition disentangles abduction into Hypothesis Generation, where models bridge epistemic gaps to produce candidate explanations, and Hypothesis Selection, where the generated candidates are evaluated and the most plausible explanation is chosen. Building upon this foundation, we present a comprehensive taxonomy of the literature, categorizing prior work based on their abductive tasks, datasets, underlying methodologies, and evaluation strategies. In order to ground our framework empirically, we conduct a compact benchmark study of current LLMs on abductive tasks, together with targeted comparative analyses across model sizes, model families, evaluation styles, and the distinct generation-versus-selection task typologies. Moreover, by synthesizing recent empirical results, we examine how LLM performance on abductive reasoning relates to deductive and inductive tasks, providing insights into their broader reasoning capabilities. Our analysis reveals critical gaps in current approaches, from static benchmark design and narrow domain coverage to narrow training frameworks and limited mechanistic understanding of abductive processes...
Chinese Translation
尽管溯因推理——即对观察结果推断最合理解释的过程——在人类发现和认知中具有基础性作用,但在大型语言模型(LLMs)中的研究相对较少。尽管LLMs发展迅速,溯因推理及其多样化方面的探索迄今为止仍较为零散,缺乏系统性。本文首次对LLMs中的溯因推理进行综述,追溯其从哲学基础到当代人工智能实现的发展轨迹。为解决该领域普遍存在的概念混淆和任务定义分散问题,我们提出了一个统一的两阶段定义,形式化地对先前工作进行分类。该定义将溯因推理拆解为“假设生成”(Hypothesis Generation),即模型弥合认知鸿沟以产生候选解释,以及“假设选择”(Hypothesis Selection),即对生成的候选进行评估并选出最合理的解释。在此基础上,我们构建了一个全面的文献分类体系,依据溯因任务、数据集、底层方法论及评估策略对先前工作进行归类。为使框架具备实证基础,我们开展了针对当前LLMs在溯因任务上的紧凑基准测试,并围绕模型规模、模型家族、评估方式及生成与选择任务类型进行了针对性比较分析。此外,通过整合近期实证结果,我们探讨了LLMs在溯因推理上的表现与演绎推理和归纳推理任务的关系,揭示其更广泛的推理能力。我们的分析指出当前方法存在关键不足——包括静态基准设计、狭窄的领域覆盖、有限的训练框架以及对溯因过程机制理解的不足。
cs.AI / 49 / 2604.08032
"Why This Avoidance Maneuver?" Contrastive Explanations in Human-Supervised Maritime Autonomous Navigation
为什么要进行这种规避机动?人类监督下的海洋自主导航中的对比解释
Abstract
Automated maritime collision avoidance will rely on human supervision for the foreseeable future. This necessitates transparency into how the system perceives a scenario and plans a maneuver. However, the causal logic behind avoidance maneuvers is often complex and difficult to convey to a navigator. This paper explores how to explain these factors in a selective, understandable manner for supervisors with a nautical background. We propose a method for generating contrastive explanations, which provide human-centric insights by comparing a system's proposed solution against relevant alternatives. To evaluate this, we developed a framework that uses visual and textual cues to highlight key objectives from a state-of-the-art collision avoidance system. An exploratory user study with four experienced marine officers suggests that contrastive explanations support the understanding of the system's objectives. However, our findings also reveal that while these explanations are highly valuable in complex multi-vessel encounters, they can increase cognitive workload, suggesting that future maritime interfaces may benefit most from demand-driven or scenario-specific explanation strategies.
Chinese Translation
自动化海洋碰撞规避在可预见的未来将依赖于人类监督。这就需要对系统如何感知场景和规划机动进行透明化。然而,规避机动背后的因果逻辑往往复杂且难以向导航员传达。本文探讨如何以选择性和易于理解的方式向具有航海背景的监督者解释这些因素。我们提出了一种生成对比解释的方法,通过将系统提出的解决方案与相关替代方案进行比较,提供以人为中心的见解。为了评估这一方法,我们开发了一个框架,利用视觉和文本线索突出最先进的碰撞规避系统的关键目标。一项针对四名经验丰富的海事官员的探索性用户研究表明,对比解释有助于理解系统的目标。然而,我们的研究结果也表明,尽管这些解释在复杂的多船相遇中极具价值,但它们可能会增加认知负担,这表明未来的海事界面可能最能从需求驱动或情境特定的解释策略中受益。
cs.AI / 50 / 2604.08033
IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling
物联网大脑:为语义-空间传感器调度奠定大型语言模型基础
Abstract
Intelligent systems powered by large-scale sensor networks are shifting from predefined monitoring to intent-driven operation, revealing a critical Semantic-to-Physical Mapping Gap. While large language models (LLMs) excel at semantic understanding, existing perception-centric pipelines operate retrospectively, overlooking the fundamental decision of what to sense and when. We formalize this proactive decision as Semantic-Spatial Sensor Scheduling (S3) and demonstrate that direct LLM planning is unreliable due to inherent gaps in representation, reasoning, and optimization. To bridge these gaps, we introduce the Spatial Trajectory Graph (STG), a neuro-symbolic paradigm governed by a verify-before-commit discipline that transforms open-ended planning into a verifiable graph optimization problem. Based on STG, we implement IoT-Brain, a concrete system embodiment, and construct TopoSense-Bench, a campus-scale benchmark with 5,250 natural-language queries across 2,510 cameras. Evaluations show that IoT-Brain boosts task success rate by 37.6% over the strongest search-intensive methods while running nearly 2 times faster and using 6.6 times fewer prompt tokens. In real-world deployment, it approaches the reliability upper bound while reducing 4.1 times network bandwidth, providing a foundational framework for LLMs to interact with the physical world with unprecedented reliability and efficiency.
Chinese Translation
由大规模传感器网络驱动的智能系统正从预定义监测转向意图驱动的操作,揭示了一个关键的语义到物理的映射缺口。虽然大型语言模型(LLMs)在语义理解方面表现出色,但现有的以感知为中心的流程往往是回顾性的,忽视了感知什么以及何时感知这一基本决策。我们将这一主动决策形式化为语义-空间传感器调度(S3),并证明直接的LLM规划由于表示、推理和优化方面的固有缺口而不可靠。为了解决这些缺口,我们引入了空间轨迹图(STG),这是一种神经符号范式,遵循“验证后再提交”的原则,将开放式规划转化为可验证的图优化问题。基于STG,我们实现了物联网大脑(IoT-Brain),这一具体系统的体现,并构建了TopoSense-Bench,这是一个校园规模的基准,涵盖了2,510个摄像头的5,250个自然语言查询。评估结果表明,IoT-Brain在任务成功率上比最强的搜索密集型方法提高了37.6%,同时运行速度几乎快了2倍,并使用了6.6倍更少的提示令牌。在实际部署中,它接近可靠性上限,同时减少了4.1倍的网络带宽,为大型语言模型与物理世界的交互提供了前所未有的可靠性和效率的基础框架。
cs.AI / 51 / 2604.08064
ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
ImplicitMemBench:测量大型语言模型中的无意识行为适应
Abstract
Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior without conscious retrieval. This gap is critical: effective assistants must automatically apply learned procedures or avoid failed actions without explicit reminders. We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs drawn from standard cognitive-science accounts of non-declarative memory: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias via paired experimental/control instances), and Classical Conditioning (Conditioned Stimulus--Unconditioned Stimulus (CS--US) associations shaping first decisions). Our 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring. Evaluation of 17 models reveals severe limitations: no model exceeds 66% overall, with top performers DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%) far below human baselines. Analysis uncovers dramatic asymmetries (inhibition 17.6% vs. preference 75.0%) and universal bottlenecks requiring architectural innovations beyond parameter scaling. ImplicitMemBench reframes evaluation from "what agents recall" to "what they automatically enact".
Chinese Translation
现有的针对大型语言模型(LLM)代理的记忆基准评估显性记忆的事实回忆,但忽视了隐性记忆,即经验成为自动化行为而无需意识检索。这一差距至关重要:有效的助手必须能够自动应用学习的程序或避免失败的行为,而无需显式提醒。我们引入了ImplicitMemBench,这是第一个系统性基准,通过三个基于认知的构念评估隐性记忆,这些构念源自标准的认知科学关于非陈述性记忆的理论:程序性记忆(在干扰后的一次性技能习得)、启动(通过配对实验/对照实例的主题驱动偏见)和经典条件作用(条件刺激-非条件刺激(CS-US)关联塑造初始决策)。我们的300项测试套件采用统一的学习/启动-干扰-测试协议,并进行首次尝试评分。对17个模型的评估揭示了严重的局限性:没有模型的整体表现超过66%,表现最佳的模型DeepSeek-R1(65.3%)、Qwen3-32B(64.1%)和GPT-5(63.0%)远低于人类基准。分析揭示了显著的不对称性(抑制17.6% vs. 偏好75.0%)和普遍的瓶颈,要求在架构创新上超越参数扩展。ImplicitMemBench将评估的重点从“代理回忆了什么”转变为“他们自动执行了什么”。
cs.AI / 52 / 2604.08115
Revise: A Framework for Revising OCRed text in Practical Information Systems with Data Contamination Strategy
Revise:一种在实际信息系统中修正OCR文本的框架,结合数据污染策略
Abstract
Recent advances in Large Language Models (LLMs) have significantly improved the field of Document AI, demonstrating remarkable performance on document understanding tasks such as question answering. However, existing approaches primarily focus on solving specific tasks, lacking the capability to structurally organize and manage document information. To address this limitation, we propose Revise, a framework that systematically corrects errors introduced by OCR at the character, word, and structural levels. Specifically, Revise employs a comprehensive hierarchical taxonomy of common OCR errors and a synthetic data generation strategy that realistically simulates such errors to train an effective correction model. Experimental results demonstrate that Revise effectively corrects OCR outputs, enabling more structured representation and systematic management of document contents. Consequently, our method significantly enhances downstream performance in document retrieval and question answering tasks, highlighting the potential to overcome the structural management limitations of existing Document AI frameworks.
Chinese Translation
最近,大型语言模型(LLMs)的进展显著提升了文档人工智能(Document AI)领域,在文档理解任务(如问答)中展现了卓越的性能。然而,现有方法主要集中于解决特定任务,缺乏对文档信息进行结构化组织和管理的能力。为了解决这一局限性,我们提出了Revise,一个系统性修正OCR引入的字符、单词和结构层面错误的框架。具体而言,Revise采用了一个全面的常见OCR错误的分层分类法,并结合一种合成数据生成策略,真实地模拟这些错误以训练有效的修正模型。实验结果表明,Revise有效地修正了OCR输出,促进了文档内容的更结构化表示和系统管理。因此,我们的方法显著提升了文档检索和问答任务的下游性能,突显了克服现有文档人工智能框架结构管理局限性的潜力。
cs.AI / 53 / 2604.08124
Beyond Stochastic Exploration: What Makes Training Data Valuable for Agentic Search
超越随机探索:什么使训练数据对智能搜索具有价值
Abstract
Reinforcement learning (RL) has become an effective approach for advancing the reasoning capabilities of large language models (LLMs) through the strategic integration of external search engines. However, current RL-based search agents often rely on a process of stochastic exploration guided by carefully crafted outcome rewards, leading to inefficient reasoning trajectories and unstable training. To address these issues, we propose a novel framework, Hierarchical Experience (HiExp), to enhance the performance and training stability of search agents. Specifically, we extract empirical knowledge through contrastive analysis and a multi-level clustering mechanism, transforming raw reasoning trajectories into hierarchical experience knowledge. By leveraging experience-aligned training, we effectively regularize stochastic exploration, evolving it into a strategic and experience-driven search process. Extensive evaluations on multiple complex agentic search and mathematical reasoning benchmarks demonstrate that our approach not only achieves substantial performance gains but also exhibits strong cross-task and cross-algorithm generalization.
Chinese Translation
强化学习(RL)已成为通过战略性整合外部搜索引擎来提升大型语言模型(LLMs)推理能力的有效方法。然而,目前基于RL的搜索代理通常依赖于由精心设计的结果奖励引导的随机探索过程,这导致推理轨迹低效且训练不稳定。为了解决这些问题,我们提出了一种新颖的框架——层次经验(Hierarchical Experience, HiExp),以增强搜索代理的性能和训练稳定性。具体而言,我们通过对比分析和多层次聚类机制提取经验知识,将原始推理轨迹转化为层次经验知识。通过利用经验对齐的训练,我们有效地规范化了随机探索,将其演变为一种战略性和经验驱动的搜索过程。在多个复杂的智能搜索和数学推理基准上的广泛评估表明,我们的方法不仅实现了显著的性能提升,还展现了强大的跨任务和跨算法的泛化能力。
cs.AI / 54 / 2604.08169
Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence
激活引导:在不牺牲连贯性的情况下实现对齐的开放式生成
Abstract
Alignment in LLMs is more brittle than commonly assumed: misalignment can be triggered by adversarial prompts, benign fine-tuning, emergent misalignment, and goal misgeneralization. Recent evidence suggests that some misalignment behaviors are encoded as linear structure in activation space, making it tractable via steering, while safety alignment has been shown to govern the first few output tokens primarily, leaving subsequent generation unguarded. These findings motivate activation steering as a lightweight runtime defense that continuously corrects misaligned activations throughout generation. We evaluate three methods: Steer-With-Fixed-Coeff (SwFC), which applies uniform additive steering, and two novel projection-aware methods, Steer-to-Target-Projection (StTP) and Steer-to-Mirror-Projection (StMP), that use a logistic regression decision boundary to selectively intervene only on tokens whose activations fall below distributional thresholds. Using malicious system prompts as a controlled proxy for misalignment, we evaluate under two threat models (dishonesty and dismissiveness) and two architectures (Llama-3.3-70B-Instruct, Qwen3-32B). All methods substantially recover target traits (honesty and compassion) while preserving coherence. StTP and StMP better maintain general capabilities (MMLU, MT-Bench, AlpacaEval) and produce less repetition in multi-turn conversations.
Chinese Translation
大型语言模型(LLMs)中的对齐比通常假设的更脆弱:不对齐可能由对抗性提示、良性微调、突现的不对齐和目标误概化引发。最近的证据表明,一些不对齐行为在激活空间中以线性结构编码,使其通过引导变得可处理,而安全对齐主要控制前几个输出标记,导致后续生成缺乏保护。这些发现促使我们提出激活引导作为一种轻量级的运行时防御,持续修正生成过程中的不对齐激活。我们评估了三种方法:Steer-With-Fixed-Coeff (SwFC),该方法应用均匀的加性引导;以及两种新颖的投影感知方法,Steer-to-Target-Projection (StTP) 和 Steer-to-Mirror-Projection (StMP),它们使用逻辑回归决策边界选择性地干预仅在激活低于分布阈值的标记。我们使用恶意系统提示作为不对齐的控制代理,在两种威胁模型(不诚实和轻视)和两种架构(Llama-3.3-70B-Instruct,Qwen3-32B)下进行评估。所有方法在保持连贯性的同时,显著恢复了目标特征(诚实和同情)。StTP 和 StMP 更好地维护了通用能力(MMLU,MT-Bench,AlpacaEval),并在多轮对话中产生了更少的重复。
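The two intervention styles named in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the helper names, the unit steering direction `v`, and the scalar `threshold` and `target` (standing in for the logistic-regression decision boundary) are all assumptions. The mirror variant is not sketched because the abstract does not specify its update rule.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length so that dot(h, v) is a true projection."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def steer_fixed(h, v, coeff):
    """Uniform additive steering: shift every activation by coeff * v."""
    return [hi + coeff * vi for hi, vi in zip(h, v)]

def steer_to_target(h, v, threshold, target):
    """Selective steering (hypothetical rule): intervene only when the
    projection onto v is below threshold, then move the activation along v
    until its projection equals target."""
    proj = dot(h, v)
    if proj >= threshold:
        return list(h)  # already on the desired side; leave untouched
    return [hi + (target - proj) * vi for hi, vi in zip(h, v)]

v = normalize([3.0, 4.0])   # unit steering direction (assumed given)
low = [-1.0, 0.0]           # projection onto v is -0.6
high = [2.0, 1.0]           # projection onto v is 2.0

steered_low = steer_to_target(low, v, threshold=0.0, target=1.0)
steered_high = steer_to_target(high, v, threshold=0.0, target=1.0)
shifted_low = steer_fixed(low, v, coeff=2.0)
```

The selective variant leaves `high` unchanged, which is one plausible reading of why projection-aware steering would disturb general capabilities less than a uniform shift.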
cs.AI / 55 / 2604.08178
Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
通过规划对齐智能体:轨迹级奖励建模的基准测试
Abstract
In classical Reinforcement Learning from Human Feedback (RLHF), Reward Models (RMs) serve as the fundamental signal provider for model alignment. As Large Language Models evolve into agentic systems capable of autonomous tool invocation and complex reasoning, the paradigm of reward modeling faces unprecedented challenges--most notably, the lack of benchmarks specifically designed to assess RM capabilities within tool-integrated environments. To address this gap, we present Plan-RewardBench, a trajectory-level preference benchmark designed to evaluate how well judges distinguish preferred versus distractor agent trajectories in complex tool-using scenarios. Plan-RewardBench covers four representative task families -- (i) Safety Refusal, (ii) Tool-Irrelevance / Unavailability, (iii) Complex Planning, and (iv) Robust Error Recovery -- comprising validated positive trajectories and confusable hard negatives constructed via multi-model natural rollouts, rule-based perturbations, and minimal-edit LLM perturbations. We benchmark representative RMs (generative, discriminative, and LLM-as-Judge) under a unified pairwise protocol, reporting accuracy trends across varying trajectory lengths and task categories. Furthermore, we provide diagnostic analyses of prevalent failure modes. Our results reveal that all three evaluator families face substantial challenges, with performance degrading sharply on long-horizon trajectories, underscoring the necessity for specialized training in agentic, trajectory-level reward modeling. Ultimately, Plan-RewardBench aims to serve as both a practical evaluation suite and a reusable blueprint for constructing agentic planning preference data.
Chinese Translation
在经典的人类反馈强化学习(RLHF)中,奖励模型(RMs)作为模型对齐的基本信号提供者。随着大型语言模型发展为能够自主调用工具和进行复杂推理的智能系统,奖励建模的范式面临前所未有的挑战——最显著的是,缺乏专门设计用于评估工具集成环境中奖励模型能力的基准测试。为了解决这一问题,我们提出了Plan-RewardBench,这是一个轨迹级偏好基准,旨在评估评审者在复杂工具使用场景中区分偏好与干扰智能体轨迹的能力。Plan-RewardBench涵盖了四个代表性的任务家族——(i)安全拒绝,(ii)工具无关性/不可用性,(iii)复杂规划,以及(iv)稳健的错误恢复——包括通过多模型自然回放、基于规则的扰动和最小编辑的LLM扰动构建的经过验证的正轨迹和易混淆的难负样本。我们在统一的成对协议下对代表性的奖励模型(生成型、判别型和LLM作为评审者)进行了基准测试,报告了在不同轨迹长度和任务类别下的准确性趋势。此外,我们提供了对普遍失败模式的诊断分析。我们的结果表明,所有三类评估者都面临重大挑战,尤其是在长时间轨迹上的性能急剧下降,突显了在智能体轨迹级奖励建模中进行专门训练的必要性。最终,Plan-RewardBench旨在作为一个实用的评估工具和一个可重用的蓝图,用于构建智能体规划偏好数据。
cs.AI / 56 / 2604.08226
Grounding Clinical AI Competency in Human Cognition Through the Clinical World Model and Skill-Mix Framework
通过临床世界模型和技能组合框架将临床人工智能能力与人类认知相结合
Abstract
The competency of any intelligent agent is bounded by its formal account of the world in which it operates. Clinical AI lacks such an account. Existing frameworks address evaluation, regulation, or system design in isolation, without a shared model of the clinical world to connect them. We introduce the Clinical World Model, a framework that formalizes care as a tripartite interaction among Patient, Provider, and Ecosystem. To formalize how any agent, whether human or artificial, transforms information into clinical action, we develop parallel decision-making architectures for providers, patients, and AI agents, grounded in validated principles of clinical cognition. The Clinical AI Skill-Mix operationalizes competency through eight dimensions. Five define the clinical competency space (condition, phase, care setting, provider role, and task) and three specify how AI engages human reasoning (assigned authority, agent facing, and anchoring layer). The combinatorial product of these dimensions yields a space of billions of distinct competency coordinates. A central structural implication is that validation within one coordinate provides minimal evidence for performance in another, rendering the competency space irreducible. The framework supplies a common grammar through which clinical AI can be specified, evaluated, and bounded across stakeholders. By making this structure explicit, the Clinical World Model reframes the field's central question from whether AI works to in which competency coordinates reliability has been demonstrated, and for whom.
Chinese Translation
任何智能体的能力都受到其对所操作世界的正式描述的限制。临床人工智能缺乏这样的描述。现有框架在孤立的情况下处理评估、监管或系统设计,而没有一个共享的临床世界模型将它们连接起来。我们引入了临床世界模型,这是一个将护理形式化为患者、提供者和生态系统之间三方互动的框架。为了形式化任何智能体(无论是人类还是人工智能)如何将信息转化为临床行动,我们为提供者、患者和人工智能智能体开发了平行决策架构,基于经过验证的临床认知原则。临床人工智能技能组合通过八个维度实现能力的操作化。其中五个定义了临床能力空间(条件、阶段、护理环境、提供者角色和任务),三个则具体说明了人工智能如何参与人类推理(分配的权威、面向智能体和锚定层)。这些维度的组合产物产生了数十亿个独特能力坐标的空间。一个核心的结构性含义是,在一个坐标内的验证对另一个坐标的表现提供了最小的证据,从而使能力空间不可简化。该框架提供了一种共同的语法,通过它可以在各利益相关者之间对临床人工智能进行规范、评估和界定。通过明确这一结构,临床世界模型将该领域的核心问题从“人工智能是否有效”转变为“在哪些能力坐标中已证明可靠性,以及为谁服务”。
cs.AI / 57 / 2604.08232
HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation
HiRO-Nav:混合推理实现高效的具身导航
Abstract
Embodied navigation agents built upon large reasoning models (LRMs) can handle complex, multimodal environmental input and perform grounded reasoning per step to improve sequential decision-making for long-horizon tasks. However, a critical question remains: how can the reasoning capabilities of LRMs be harnessed intelligently and efficiently for long-horizon navigation tasks? In simple scenes, agents are expected to act reflexively, while in complex ones they should engage in deliberate reasoning before acting. To achieve this, we introduce the Hybrid ReasOning Navigation (HiRO-Nav) agent, the first kind of agent capable of adaptively determining whether to perform thinking at every step based on its own action entropy. Specifically, by examining how the agent's action entropy evolves over the navigation trajectories, we observed that only a small fraction of actions exhibit high entropy, and these actions often steer the agent toward novel scenes or critical objects. Furthermore, studying the relationship between action entropy and task completion (i.e., Q-value) reveals that improving high-entropy actions contributes more positively to task success. Hence, we propose a tailored training pipeline comprising hybrid supervised fine-tuning as a cold start, followed by online reinforcement learning with the proposed hybrid reasoning strategy to explicitly activate reasoning only for high-entropy actions, significantly reducing computational overhead while improving decision quality. Extensive experiments on the CHORES-$\mathbb{S}$ ObjectNav benchmark show that HiRO-Nav achieves a better trade-off between success rates and token efficiency than both dense-thinking and no-thinking baselines.
Chinese Translation
基于大型推理模型(LRMs)构建的具身导航代理能够处理复杂的多模态环境输入,并在每一步进行扎根推理,以改善长时间任务的顺序决策。然而,仍然存在一个关键问题:如何智能且高效地利用LRMs的推理能力来应对长时间导航任务?在简单场景中,代理被期望能够反射性地行动,而在复杂场景中,它们应在行动前进行深思熟虑的推理。为此,我们引入了Hybrid ReasOning Navigation(HiRO-Nav)代理,这是一种能够根据自身行动熵自适应地决定是否在每一步进行思考的代理。具体而言,通过检查代理在导航轨迹中的行动熵如何演变,我们观察到只有一小部分动作表现出高熵,这些动作通常将代理引导至新场景或关键物体。此外,研究行动熵与任务完成(即Q值)之间的关系表明,改善高熵动作对任务成功的贡献更为积极。因此,我们提出了一种量身定制的训练流程,包括混合监督微调作为冷启动,随后采用所提出的混合推理策略进行在线强化学习,仅针对高熵动作显式激活推理,从而显著减少计算开销,同时提高决策质量。在CHORES-$\mathbb{S}$ ObjectNav基准上的大量实验表明,HiRO-Nav在成功率和令牌效率之间实现了比密集思考和无思考基线更好的权衡。
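The entropy-based gating described in this abstract might look roughly like the sketch below. It assumes a discrete action distribution and a fixed threshold; the function names and the value `THRESHOLD = 0.5` are hypothetical choices for illustration, not values from the paper.

```python
import math

def action_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_reason(probs, threshold):
    """Return True when the policy is uncertain enough to justify
    spending tokens on explicit reasoning before acting."""
    return action_entropy(probs) > threshold

THRESHOLD = 0.5  # hypothetical cut-off; would be tuned in practice

confident = [0.97, 0.01, 0.01, 0.01]   # low entropy: act reflexively
uncertain = [0.25, 0.25, 0.25, 0.25]   # maximal entropy for four actions

act_directly = not should_reason(confident, THRESHOLD)
think_first = should_reason(uncertain, THRESHOLD)
```

Under this reading, most steps fall on the reflexive branch, and reasoning is reserved for the small fraction of high-entropy steps that the abstract identifies as decisive.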
cs.AI / 58 / 2604.08245
From Phenomenological Fitting to Endogenous Deduction: A Paradigm Leap via Meta-Principle Physics Architecture
从现象学拟合到内生推理:通过元原则物理架构的范式跃迁
Abstract
The essence of current neural network architectures is phenomenological fitting: they learn input-output statistical correlations via massive parameters and data, yet lack intrinsic understanding of the fundamental principles governing physical reality. This paper proposes a paradigm leap from pure phenomenological fitting to the fusion of phenomenological fitting and endogenous deduction. By embedding physical meta-principles into neural network architecture, we construct the Meta-Principle Physics Architecture (MPPA). Specifically, MPPA embeds three core meta-principles (Connectivity, Conservation, and Periodicity) into its architecture, implemented via three core components: the Gravitator realizes Connectivity via standard causal attention; the Energy Encoder implements Conservation via log-domain energy tracking and delayed compensation; the Periodicity Encoder fulfills Periodicity via FFT-based spectral analysis and delayed modulation. These components collaborate via a learnable independent gating fusion mechanism, forming a complete physical cognition framework of 'local relational connectivity - global conservation constraint - evolutionary periodic law'. Experiments show MPPA achieves significant improvements: physical reasoning rises from near zero to 0.436 (0.436 vs 0.000), mathematical task performance improves 2.18x (0.330 vs 0.151), logical task performance gains 52% (0.456 vs 0.300), and validation perplexity is 3.69% lower (259.45 vs 269.40), with only 11.8% more parameters (242.40M vs 216.91M). Notably, MPPA shows strong generalization on out-of-distribution physical scenarios, proving the robustness and interpretability of this principle-embedded design. This work establishes a new theoretical foundation and technical path for next-generation AI with physical common sense, causal reasoning, and mathematical rigor.
Chinese Translation
当前神经网络架构的本质是现象学拟合:它们通过大量参数和数据学习输入与输出之间的统计相关性,但缺乏对支配物理现实的基本原则的内在理解。本文提出了一种从纯现象学拟合到现象学拟合与内生推理融合的范式跃迁。通过将物理元原则嵌入神经网络架构中,我们构建了元原则物理架构(Meta-Principle Physics Architecture, MPPA)。具体而言,MPPA将三个核心元原则——连通性(Connectivity)、守恒性(Conservation)、周期性(Periodicity)——嵌入其架构中,通过三个核心组件实现:引力器(Gravitator)通过标准因果注意力实现连通性;能量编码器(Energy Encoder)通过对数域能量跟踪和延迟补偿实现守恒性;周期性编码器(Periodicity Encoder)通过基于快速傅里叶变换(FFT)的谱分析和延迟调制实现周期性。这些组件通过可学习的独立门控融合机制协作,形成一个完整的物理认知框架,即“局部关系连通性 - 全球守恒约束 - 进化周期法则”。实验表明,MPPA在多个方面取得了显著提升:物理推理(从接近零提升到0.436,0.436与0.000相比)、数学任务提升2.18倍(0.330与0.151相比)、逻辑任务增益52%(0.456与0.300相比),以及验证困惑度降低3.69%(259.45与269.40相比),仅增加了11.8%的参数(242.40M与216.91M相比)。值得注意的是,MPPA在分布外物理场景中表现出强大的泛化能力,证明了这种嵌入原则设计的稳健性和可解释性。本研究为下一代具备物理常识、因果推理和数学严谨性的人工智能奠定了新的理论基础和技术路径。
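The relative figures quoted in this abstract can be re-derived from the raw pairs it reports. The short check below uses only the numbers given above; the helper names are illustrative.

```python
def ratio(new, old):
    """Multiplicative improvement of new over old."""
    return new / old

def percent_change(new, old):
    """Signed relative change of new with respect to old, in percent."""
    return (new - old) / old * 100.0

math_ratio = ratio(0.330, 0.151)                 # reported as 2.18x
logic_gain = percent_change(0.456, 0.300)        # reported as a 52% gain
ppl_drop = -percent_change(259.45, 269.40)       # reported as 3.69% lower
param_overhead = percent_change(242.40, 216.91)  # reported as 11.8% more

print(f"{math_ratio:.2f}x, {logic_gain:.0f}%, {ppl_drop:.2f}%, {param_overhead:.1f}%")
```

The physical reasoning result has no ratio because the baseline is 0.000, which is presumably why the abstract reports it only as an absolute pair.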
cs.AI / 59 / 2604.08263
Neural-Symbolic Knowledge Tracing: Injecting Educational Knowledge into Deep Learning for Responsible Learner Modelling
神经符号知识追踪:将教育知识注入深度学习以实现负责任的学习者建模
Abstract
The growing use of artificial intelligence (AI) in education, particularly large language models (LLMs), has increased interest in intelligent tutoring systems. However, LLMs often show limited adaptivity and struggle to model learners' evolving knowledge over time, highlighting the need for dedicated learner modelling approaches. Although deep knowledge tracing methods achieve strong predictive performance, their opacity and susceptibility to bias can limit alignment with pedagogical principles. To address this, we propose Responsible-DKT, a neural-symbolic deep knowledge tracing approach that integrates symbolic educational knowledge (e.g., mastery and non-mastery rules) into sequential neural models for responsible learner modelling. Experiments on a real-world dataset of students' math interactions show that Responsible-DKT outperforms both a neural-symbolic baseline and a fully data-driven PyTorch DKT model across training settings. The model achieves over 0.80 AUC with only 10% of training data and up to 0.90 AUC, improving performance by up to 13%. It also demonstrates improved temporal reliability, producing lower early- and mid-sequence prediction errors and the lowest prediction inconsistency rates across sequence lengths, indicating that prediction updates remain directionally aligned with observed student responses over time. Furthermore, the neural-symbolic approach offers intrinsic interpretability via a grounded computation graph that exposes the logic behind each prediction, enabling both local and global explanations. It also allows empirical evaluation of pedagogical assumptions, revealing that repeated incorrect responses (non-mastery) strongly influence prediction updates. These results indicate that neural-symbolic approaches enhance both performance and interpretability, mitigate data limitations, and support more responsible, human-centered AI in education.
Chinese Translation
人工智能(AI)在教育中的日益应用,尤其是大型语言模型(LLMs),引发了对智能辅导系统的关注。然而,LLMs往往表现出有限的适应性,并且难以随着时间推移建模学习者不断变化的知识,这突显了专门的学习者建模方法的必要性。尽管深度知识追踪方法在预测性能上表现出色,但其不透明性和易受偏见影响的特性可能限制了与教学原则的对齐。为此,我们提出了Responsible-DKT,这是一种神经符号深度知识追踪方法,将符号教育知识(例如,掌握和非掌握规则)整合到顺序神经模型中,以实现负责任的学习者建模。在学生数学交互的真实数据集上的实验表明,Responsible-DKT在各种训练设置中均优于神经符号基线和完全数据驱动的PyTorch DKT模型。该模型在仅使用10%的训练数据时,AUC超过0.80,并在最高可达0.90 AUC的情况下,性能提升达到13%。它还展示了更好的时间可靠性,产生了较低的早期和中期序列预测误差,以及在不同序列长度下最低的预测不一致率,表明预测更新在时间上与观察到的学生反应保持方向一致。此外,神经符号方法通过一个基础计算图提供了内在的可解释性,揭示了每个预测背后的逻辑,从而实现局部和全局的解释。它还允许对教学假设进行实证评估,揭示重复错误反应(非掌握)对预测更新有强烈影响。这些结果表明,神经符号方法在提高性能和可解释性的同时,缓解了数据限制,并支持在教育中更负责任、更以人为本的AI。
cs.AI / 60 / 2604.08276
ACF: A Collaborative Framework for Agent Covert Communication under Cognitive Asymmetry
ACF:一种在认知不对称下的代理隐蔽通信协作框架
Abstract
As generative artificial intelligence evolves, autonomous agent networks present a powerful paradigm for interactive covert communication. However, because agents dynamically update internal memories via environmental interactions, existing methods face a critical structural vulnerability: cognitive asymmetry. Conventional approaches demand strict cognitive symmetry, requiring identical sequence prefixes between the encoder and decoder. In dynamic deployments, inevitable prefix discrepancies destroy synchronization, inducing severe channel degradation. To address this core challenge of cognitive asymmetry, we propose the Asymmetric Collaborative Framework (ACF), which structurally decouples covert communication from semantic reasoning via orthogonal statistical and cognitive layers. By deploying a prefix-independent decoding paradigm governed by a shared steganographic configuration, ACF eliminates the reliance on cognitive symmetry. Evaluations on realistic memory-augmented workflows demonstrate that under severe cognitive asymmetry, symmetric baselines suffer severe channel degradation, whereas ACF uniquely excels across both semantic fidelity and covert communication. It maintains computational indistinguishability, enabling reliable secret extraction with provable error bounds, and providing robust Effective Information Capacity guarantees for modern agent networks.
Chinese Translation
随着生成性人工智能的发展,自主代理网络为交互式隐蔽通信提供了强大的范式。然而,由于代理通过环境交互动态更新内部记忆,现有方法面临一个关键的结构性脆弱性:认知不对称。传统方法要求严格的认知对称,要求编码器和解码器之间具有相同的序列前缀。在动态部署中,必然存在的前缀差异会破坏同步,导致严重的信道退化。为了解决这一认知不对称的核心挑战,我们提出了不对称协作框架(Asymmetric Collaborative Framework,ACF),该框架通过正交的统计和认知层结构上将隐蔽通信与语义推理解耦。通过部署由共享隐写配置驱动的独立于前缀的解码范式,ACF消除了对认知对称的依赖。在现实的记忆增强工作流评估中,结果表明在严重的认知不对称下,对称基线遭受严重的信道退化,而ACF在语义保真度和隐蔽通信方面独树一帜。它保持计算上的不可区分性,能够可靠地提取秘密并提供可证明的误差界限,同时为现代代理网络提供强有力的有效信息容量保证。
cs.AI / 61 / 2604.08295
U-CECE: A Universal Multi-Resolution Framework for Conceptual Counterfactual Explanations
U-CECE:一种通用的多分辨率概念反事实解释框架
Abstract
As AI models grow more complex, explainability is essential for building trust, yet concept-based counterfactual methods still face a trade-off between expressivity and efficiency. Representing underlying concepts as atomic sets is fast but misses relational context, whereas full graph representations are more faithful but require solving the NP-hard Graph Edit Distance (GED) problem. We propose U-CECE, a unified, model-agnostic multi-resolution framework for conceptual counterfactual explanations that adapts to data regime and compute budget. U-CECE spans three levels of expressivity: atomic concepts for broad explanations, relational sets-of-sets for simple interactions, and structural graphs for full semantic structure. At the structural level, both a precision-oriented transductive mode based on supervised Graph Neural Networks (GNNs) and a scalable inductive mode based on unsupervised graph autoencoders (GAEs) are supported. Experiments on the structurally divergent CUB and Visual Genome datasets characterize the efficiency-expressivity trade-off across levels, while human surveys and LVLM-based evaluation show that the retrieved structural counterfactuals are semantically equivalent to, and often preferred over, exact GED-based ground-truth explanations.
Chinese Translation
随着人工智能模型变得越来越复杂,解释性对于建立信任至关重要,但基于概念的反事实方法仍面临表达能力与效率之间的权衡。将潜在概念表示为原子集合的方式速度较快,但缺乏关系上下文,而完整的图表示更为真实,但需要解决NP难度的图编辑距离(Graph Edit Distance, GED)问题。我们提出了U-CECE,一个统一的、与模型无关的多分辨率框架,用于概念反事实解释,能够适应数据模式和计算预算。U-CECE涵盖了三种表达能力层次:用于广泛解释的原子概念、用于简单交互的关系集合以及用于完整语义结构的结构图。在结构层面,支持基于监督图神经网络(Graph Neural Networks, GNNs)的精确导向传导模式和基于无监督图自编码器(Graph Autoencoders, GAEs)的可扩展归纳模式。对结构上不同的CUB和Visual Genome数据集的实验表征了各个层次的效率与表达能力之间的权衡,而人类调查和基于大型视觉语言模型(Large Vision Language Models, LVLM)的评估显示,检索到的结构反事实在语义上等同于且通常优于基于精确GED的真实解释。
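At the atomic level described above, a conceptual counterfactual can be pictured as the nearest differently-labelled instance under a set edit cost. The sketch below assumes unit costs for concept insertion and deletion and uses made-up concept names; the actual framework may weight edits differently (for example via a concept hierarchy), so treat this as an illustration only.

```python
def set_edit_distance(source, target):
    """Number of concept insertions plus deletions turning source into target."""
    return len(source ^ target)

def nearest_counterfactual(query, query_label, candidates):
    """Pick the closest candidate whose label differs from the query's and
    report which concepts would have to be removed and added."""
    others = [(concepts, label) for concepts, label in candidates
              if label != query_label]
    concepts, label = min(
        others, key=lambda item: set_edit_distance(query, item[0])
    )
    return {"label": label, "remove": query - concepts, "add": concepts - query}

# Hypothetical concept sets, loosely in the style of a bird dataset.
query = frozenset({"bird", "red_wing", "short_beak"})
candidates = [
    (frozenset({"bird", "red_wing", "long_beak"}), "species_b"),
    (frozenset({"bird", "blue_wing", "long_beak", "crest"}), "species_c"),
    (frozenset({"bird", "red_wing", "short_beak", "crest"}), "species_a"),
]

explanation = nearest_counterfactual(query, "species_a", candidates)
```

Here the explanation reads as "replace `short_beak` with `long_beak`". The set view is cheap to compute but ignores relations between concepts, which is the gap the relational and graph levels are meant to close.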
cs.AI / 62 / 2604.08326
ProMedical: Hierarchical Fine-Grained Criteria Modeling for Medical LLM Alignment via Explicit Injection
ProMedical:通过显式注入进行医疗大型语言模型对齐的分层细粒度标准建模
Abstract
Aligning Large Language Models (LLMs) with high-stakes medical standards remains a significant challenge, primarily due to the dissonance between coarse-grained preference signals and the complex, multi-dimensional nature of clinical protocols. To bridge this gap, we introduce ProMedical, a unified alignment framework grounded in fine-grained clinical criteria. We first construct ProMedical-Preference-50k, a dataset generated via a human-in-the-loop pipeline that augments medical instructions with rigorous, physician-derived rubrics. Leveraging this corpus, we propose the Explicit Criteria Injection paradigm to train a multi-dimensional reward model. Unlike traditional scalar reward models, our approach explicitly disentangles safety constraints from general proficiency, enabling precise guidance during reinforcement learning. To rigorously validate this framework, we establish ProMedical-Bench, a held-out evaluation suite anchored by double-blind expert adjudication. Empirical evaluations demonstrate that optimizing the Qwen3-8B base model via ProMedical-RM-guided GRPO yields substantial gains, improving overall accuracy by 22.3% and safety compliance by 21.7%, effectively rivaling proprietary frontier models. Furthermore, the aligned policy generalizes robustly to external benchmarks, demonstrating performance comparable to state-of-the-art models on UltraMedical. We publicly release our datasets, reward models, and benchmarks to facilitate reproducible research in safety-aware medical alignment.
Chinese Translation
将大型语言模型(LLMs)与高风险医疗标准对齐仍然是一个重大挑战,主要是由于粗粒度偏好信号与临床协议复杂多维特性之间的矛盾。为了弥合这一差距,我们提出了ProMedical,一个基于细粒度临床标准的统一对齐框架。我们首先构建了ProMedical-Preference-50k,这是一个通过人机协作流程生成的数据集,增强了医疗指令,并结合了严格的、由医生制定的评分标准。利用这一语料库,我们提出了显式标准注入(Explicit Criteria Injection)范式,以训练多维奖励模型。与传统的标量奖励模型不同,我们的方法明确区分了安全约束与一般能力,从而在强化学习过程中提供精确的指导。为了严格验证这一框架,我们建立了ProMedical-Bench,一个由双盲专家裁定支撑的保留评估套件。实证评估表明,通过ProMedical-RM指导的GRPO优化Qwen3-8B基础模型可实现显著提升,整体准确性提高22.3%,安全合规性提高21.7%,有效地与专有前沿模型相抗衡。此外,经过对齐的策略在外部基准上表现出强大的泛化能力,在UltraMedical上展现出与最先进模型相当的性能。我们公开发布了我们的数据集、奖励模型和基准,以促进安全意识医疗对齐的可重复研究。
cs.AI / 63 / 2604.08344
Human-AI Collaboration Reconfigures Group Regulation from Socially Shared to Hybrid Co-Regulation
人类与人工智能的协作重构了从社会共享到混合共同调节的群体调节
Abstract
Generative AI (GenAI) is increasingly used in collaborative learning, yet its effects on how groups regulate collaboration remain unclear. Effective collaboration depends not only on what groups discuss, but on how they jointly manage goals, participation, strategy use, monitoring, and repair through co-regulation and socially shared regulation. We compared collaborative regulation between Human-AI and Human-Human groups in a parallel-group randomised experiment with 71 university students completing the same collaborative tasks with GenAI either available or unavailable. Focusing on human discourse, we used statistical analyses to examine differences in the distribution of collaborative regulation across regulatory modes, regulatory processes, and participatory focuses. Results showed that GenAI availability shifted regulation away from predominantly socially shared forms towards more hybrid co-regulatory forms, with selective increases in directive, obstacle-oriented, and affective regulatory processes. Participatory-focus distributions, however, were broadly similar across conditions. These findings suggest that GenAI reshapes the distribution of regulatory responsibility in collaboration and offer implications for the human-centred design of AI-supported collaborative learning.
Chinese Translation
生成性人工智能(GenAI)在协作学习中的应用日益增多,但其对群体如何调节协作的影响仍不清晰。有效的协作不仅依赖于群体讨论的内容,还依赖于他们如何通过共同调节和社会共享调节共同管理目标、参与、策略使用、监控和修复。我们在一项平行组随机实验中比较了人类-人工智能(Human-AI)组与人类-人类(Human-Human)组之间的协作调节,71名大学生在GenAI可用或不可用的情况下完成相同的协作任务。我们专注于人类话语,使用统计分析考察了不同调节模式、调节过程和参与焦点下协作调节的分布差异。结果显示,GenAI的可用性使调节从以社会共享形式为主转向更混合的共同调节形式,并在指令性、障碍导向和情感调节过程上有选择性增加。然而,参与焦点的分布在不同条件下大致相似。这些发现表明,GenAI重塑了协作中的调节责任分配,并为以人为中心的AI支持协作学习的设计提供了启示。
cs.AI / 64 / 2604.08355
ASPECT: Analogical Semantic Policy Execution via Language Conditioned Transfer
ASPECT:通过语言条件转移的类比语义策略执行
Abstract
Reinforcement Learning (RL) agents often struggle to generalize knowledge to new tasks, even those structurally similar to ones they have mastered. Although recent approaches have attempted to mitigate this issue via zero-shot transfer, they are often constrained by predefined, discrete class systems, limiting their adaptability to novel or compositional task variations. We propose a significantly more generalized approach, replacing discrete latent variables with natural language conditioning via a text-conditioned Variational Autoencoder (VAE). Our core innovation utilizes a Large Language Model (LLM) as a dynamic semantic operator at test time. Rather than relying on rigid rules, our agent queries the LLM to semantically remap the description of the current observation to align with the source task. This source-aligned caption conditions the VAE to generate an imagined state compatible with the agent's original training, enabling direct policy reuse. By harnessing the flexible reasoning capabilities of LLMs, our approach achieves zero-shot transfer across a broad spectrum of complex and truly novel analogous tasks, moving beyond the limitations of fixed category mappings. Code and videos are available at https://anonymous.4open.science/r/ASPECT-85C3/.
Chinese Translation
强化学习(RL)代理通常难以将知识推广到新任务,即使这些任务在结构上与他们已经掌握的任务相似。尽管最近的方法试图通过零样本转移来缓解这一问题,但它们往往受到预定义的离散类别系统的限制,从而限制了它们对新颖或组合任务变体的适应能力。我们提出了一种显著更为广泛的解决方案,通过文本条件变分自编码器(Variational Autoencoder, VAE)用自然语言条件替代离散潜变量。我们的核心创新是在测试时利用大型语言模型(Large Language Model, LLM)作为动态的语义操作符。我们的代理不是依赖于固定的规则,而是查询LLM以语义上重新映射当前观察的描述,以与源任务对齐。这种源对齐的描述条件化VAE生成与代理原始训练兼容的想象状态,从而实现直接的策略重用。通过利用LLM的灵活推理能力,我们的方法在广泛复杂且真正新颖的类比任务中实现了零样本转移,超越了固定类别映射的限制。代码和视频可在此获取: [链接](https://anonymous.4open.science/r/ASPECT-85C3/)。
cs.AI / 65 / 2604.08369
Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents
别想太多:跨回滚行动一致性作为大型语言模型代理的自由自适应计算信号
Abstract
Inference-time compute scaling has emerged as a powerful technique for improving the reliability of large language model (LLM) agents, but existing methods apply compute uniformly: every decision step receives the same budget regardless of its difficulty. We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement. At each step, TrACE samples a small set of candidate next actions and measures how consistently the model commits to the same action. High agreement signals an easy decision; the controller commits immediately. Low agreement signals uncertainty; the controller samples additional rollouts up to a configurable cap before committing to the plurality action. No learned components, no external verifier, and no human labels are required. We evaluate TrACE against greedy decoding and fixed-budget self-consistency (SC-4, SC-8) on two benchmarks spanning single-step reasoning (GSM8K, n=50) and multi-step household navigation (MiniHouse, n=30), using a Qwen 2.5 3B Instruct model running on CPU. TrACE-4 matches SC-4 accuracy while using 33% fewer LLM calls on GSM8K and 39% fewer on MiniHouse. TrACE-8 matches SC-8 accuracy with 55% fewer calls on GSM8K and 65% fewer on MiniHouse. We further show that inter-rollout agreement is a reliable signal of step-level success, validating the core hypothesis that the model's own output consistency encodes difficulty information that can be exploited without training. TrACE is the first training-free, per-timestep adaptive-compute controller for LLM agents to be evaluated on multi-step sequential decision tasks.
Chinese Translation
推理时计算扩展已成为提高大型语言模型(LLM)代理可靠性的强大技术,但现有方法对计算的应用是均匀的:每个决策步骤获得相同的预算,无论其难度如何。我们引入了 TrACE(通过一致性进行轨迹自适应计算),这是一种无训练的控制器,通过测量跨回滚行动一致性在代理时间步之间自适应地分配 LLM 调用。在每个步骤中,TrACE 采样一小组候选下一个行动,并测量模型对同一行动的承诺一致性。高一致性表明决策简单;控制器立即做出承诺。低一致性表明不确定性;控制器在做出多数行动承诺之前,最多采样额外的回滚,直到可配置的上限。无需学习组件、外部验证者或人工标签。我们在两个基准上评估 TrACE,相较于贪婪解码和固定预算自一致性(SC-4, SC-8),涵盖单步推理(GSM8K, n=50)和多步家庭导航(MiniHouse, n=30),使用在 CPU 上运行的 Qwen 2.5 3B 指令模型。TrACE-4 在 GSM8K 上的准确率与 SC-4 相匹配,同时减少了 33% 的 LLM 调用,在 MiniHouse 上减少了 39%。TrACE-8 在 GSM8K 上的准确率与 SC-8 相匹配,同时减少了 55% 的调用,在 MiniHouse 上减少了 65%。我们进一步表明,跨回滚一致性是步骤级成功的可靠信号,验证了模型自身输出一致性编码难度信息的核心假设,这些信息可以在不训练的情况下加以利用。TrACE 是首个在多步顺序决策任务中评估的无训练、逐步自适应计算控制器。
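The agreement-gated sampling loop described in the TrACE abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `adaptive_action`, the initial sample size `k_init`, the agreement measure (share of the plurality action), and the `threshold` value are all assumptions, since the abstract does not specify them.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple


def adaptive_action(
    sample_action: Callable[[], Hashable],
    k_init: int = 2,       # assumed size of the initial candidate set
    k_max: int = 4,        # configurable cap on LLM calls for this step
    threshold: float = 1.0,  # assumed agreement level that counts as "easy"
) -> Tuple[Hashable, int]:
    """Return (chosen action, number of calls spent) for one agent timestep.

    `sample_action` stands in for one stochastic LLM rollout that proposes
    the next action; it is a hypothetical callable, not a real API.
    """
    votes = Counter(sample_action() for _ in range(k_init))
    calls = k_init
    while calls < k_max:
        _, top_count = votes.most_common(1)[0]
        # Agreement = share of samples that back the current plurality action.
        if top_count / calls >= threshold:
            break  # high agreement: commit without spending more compute
        votes[sample_action()] += 1  # low agreement: draw one more rollout
        calls += 1
    # Commit to the plurality action among everything sampled so far.
    return votes.most_common(1)[0][0], calls
```

With a sampler that always proposes the same action, the loop stops after `k_init` calls; with a sampler that disagrees with itself, it keeps drawing until `k_max` is reached, which is where the reported savings over fixed-budget self-consistency would come from.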
cs.AI / 66 / 2604.08377
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
SkillClaw:让技能在自主进化者的引导下集体演化
Abstract
Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience. While interactions from different users provide complementary signals about when a skill works or fails, existing systems lack a mechanism to convert such heterogeneous experiences into reliable skill updates. To address these issues, we present SkillClaw, a framework for collective skill evolution in multi-user agent ecosystems, which treats cross-user and over-time interactions as the primary signal for improving skills. SkillClaw continuously aggregates trajectories generated during use and processes them with an autonomous evolver, which identifies recurring behavioral patterns and translates them into updates to the skill set by refining existing skills or extending them with new capabilities. The resulting skills are maintained in a shared repository and synchronized across users, allowing improvements discovered in one context to propagate system-wide while requiring no additional effort from users. By integrating multi-user experience into ongoing skill updates, SkillClaw enables cross-user knowledge transfer and cumulative capability improvement. Experiments on WildClawBench show that, with limited interaction and feedback, it significantly improves the performance of Qwen3-Max in real-world agent scenarios.
Chinese Translation
大型语言模型(LLM)代理如OpenClaw依赖可重用技能来执行复杂任务,但这些技能在部署后基本保持静态。因此,用户之间反复发现相似的工作流程、工具使用模式和失败模式,阻碍了系统随着经验的积累而改进。尽管来自不同用户的交互提供了关于技能何时有效或失效的互补信号,现有系统缺乏将这些异构经验转化为可靠技能更新的机制。为了解决这些问题,我们提出了SkillClaw,一个用于多用户代理生态系统中集体技能演化的框架,它将跨用户和随时间变化的交互视为改善技能的主要信号。SkillClaw持续聚合使用过程中生成的轨迹,并通过一个自主进化者进行处理,该进化者识别重复的行为模式,并通过优化现有技能或扩展新能力将其转化为技能集的更新。最终的技能保存在共享库中,并在用户之间同步,使得在一个上下文中发现的改进能够在整个系统中传播,同时不需要用户额外的努力。通过将多用户经验整合到持续的技能更新中,SkillClaw实现了跨用户知识转移和能力的累积提升,而在WildClawBench上的实验表明,即使在有限的交互和反馈下,它显著提高了Qwen3-Max在现实世界代理场景中的性能。
cs.AI / 67 / 2604.08388
Awakening the Sleeping Agent: Lean-Specific Agentic Data Reactivates General Tool Use in Goedel Prover
唤醒沉睡的智能体:特定于Lean的智能数据重新激活Goedel Prover中的一般工具使用
Abstract
Heavy supervised fine-tuning on a target domain can strongly suppress capabilities that were present in the base model. We study this phenomenon in formal mathematics using Goedel-Prover-V2, an open-source model heavily trained on 1.8 million formal-math examples. After domain specialization, the model almost completely loses its ability to produce valid tool calls, even when explicitly instructed to use tools, dropping from 89.4% function-calling accuracy in the base model to nearly 0%. We ask whether this agentic collapse is permanent or instead reversible. To answer this question, we fine-tune the specialized model on a small amount of Lean-specific tool-use data. Remarkably, as few as 100 agentic traces are sufficient to restore strong tool-calling behavior. Importantly, this recovery is not the result of reward hacking or benchmark-specific optimization: the recovery data is entirely drawn from the Lean setting, where the model uses natural-language queries to search the Mathlib library for relevant theorems and lemmas, yet the regained capability transfers well beyond that domain. In particular, these same 100 Lean-specific traces improve performance on the Berkeley Function Calling Leaderboard from near zero to 83.8%, approaching the base model's 89.4% despite the mismatch in task distribution and protocol. The recovered capability is also practically useful in-domain. On ProofNet, pass@32 improves from 21.51% to 25.81%. Together, these results show that heavy domain supervised fine-tuning can suppress general tool-use ability without permanently erasing it, and that a small amount of domain-specific agentic data can awaken dormant tool-use capabilities.
Chinese Translation
在目标领域进行大量监督微调可能会强烈抑制基础模型中存在的能力。我们在形式数学中研究这一现象,使用Goedel-Prover-V2,这是一个在180万条形式数学示例上进行重训练的开源模型。在领域专业化后,该模型几乎完全失去了生成有效工具调用的能力,即使在明确指示使用工具的情况下,其功能调用准确率从基础模型的89.4%下降到接近0%。我们探讨这种智能体崩溃是否是永久性的,还是可逆的。为了解答这个问题,我们在少量特定于Lean的工具使用数据上对专业化模型进行了微调。值得注意的是,仅需100个智能痕迹就足以恢复强大的工具调用行为。重要的是,这种恢复并不是奖励黑客或基准特定优化的结果:恢复数据完全来自Lean环境,在该环境中,模型使用自然语言查询搜索Mathlib库中的相关定理和引理,但恢复的能力在该领域之外也能很好地迁移。特别是,这100个特定于Lean的痕迹将伯克利函数调用排行榜的表现从接近零提高到83.8%,尽管任务分布和协议存在不匹配,但接近基础模型的89.4%。恢复的能力在领域内也具有实际应用价值。在ProofNet上,pass@32的表现从21.51%提高到25.81%。综合来看,这些结果表明,重度领域监督微调可以抑制一般工具使用能力而不永久性地抹去它,并且少量特定于领域的智能数据可以唤醒潜在的工具使用能力。
cs.AI / 68 / 2604.08401
Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing
在承诺之前进行验证:通过自我审计实现大型语言模型代理的可靠推理
Abstract
In large language model (LLM) agents, reasoning trajectories are treated as reliable internal beliefs for guiding actions and updating memory. However, coherent reasoning can still violate logical or evidential constraints, allowing unsupported beliefs to be repeatedly stored and propagated across decision steps, which leads to systematic behavioral drift in long-horizon agentic systems. Most existing strategies rely on the consensus mechanism, conflating agreement with faithfulness. In this paper, inspired by the vulnerability of unfaithful intermediate reasoning trajectories, we propose Self-Audited Verified Reasoning (SAVeR), a novel framework that enforces verification over internal belief states within the agent before action commitment, achieving faithful reasoning. Concretely, we structurally generate persona-based diverse candidate beliefs for selection under a faithfulness-relevant structure space. To achieve reasoning faithfulness, we perform adversarial auditing to localize violations and repair through constraint-guided minimal interventions under verifiable acceptance criteria. Extensive experiments on six benchmark datasets demonstrate that our approach consistently improves reasoning faithfulness while preserving competitive end-task performance.
Chinese Translation
在大型语言模型(LLM)代理中,推理轨迹被视为指导行动和更新记忆的可靠内部信念。然而,连贯的推理仍可能违反逻辑或证据约束,导致不支持的信念在决策步骤中反复存储和传播,从而在长期代理系统中引发系统性行为漂移。现有的大多数策略依赖于共识机制,将一致性与可靠性混为一谈。本文受到不可靠的中间推理轨迹脆弱性的启发,提出了 Self-Audited Verified Reasoning(SAVeR),这是一个新颖的框架,在代理采取行动之前对内部信念状态进行验证,从而实现可靠推理。具体而言,我们在与可靠性相关的结构空间中,结构性地生成基于角色的多样候选信念供选择。为了实现推理的可靠性,我们进行对抗性审计,以定位违规行为并通过约束引导的最小干预进行修复,符合可验证的接受标准。在六个基准数据集上的广泛实验表明,我们的方法在保持竞争性最终任务性能的同时,始终提高了推理的可靠性。
cs.AI / 69 / 2604.08424
On-board Telemetry Monitoring in Autonomous Satellites: Challenges and Opportunities
自主卫星中的机载遥测监测:挑战与机遇
Abstract
The increasing autonomy of spacecraft demands fault-detection systems that are both reliable and explainable. This work addresses eXplainable Artificial Intelligence for onboard Fault Detection, Isolation and Recovery within the Attitude and Orbit Control Subsystem by introducing a framework that enhances interpretability in neural anomaly detectors. We propose a method to derive low-dimensional, semantically annotated encodings from intermediate neural activations, called peepholes. Applied to a convolutional autoencoder, the framework produces interpretable indicators that enable the identification and localization of anomalies in reaction-wheel telemetry. Peephole analysis further supports bias detection and fault localization. The proposed framework enables the semantic characterization of detected anomalies while requiring only a marginal increase in computational resources, thus supporting its feasibility for on-board deployment.
Chinese Translation
航天器日益增强的自主性要求故障检测系统既可靠又可解释。本研究针对姿态和轨道控制子系统中的机载故障检测、隔离和恢复提出了可解释人工智能(eXplainable Artificial Intelligence),引入了一个增强神经异常检测器可解释性的框架。我们提出了一种从中间神经激活中导出低维、语义注释编码的方法,称为“窥视孔”(peepholes)。该框架应用于卷积自编码器,生成可解释的指标,从而实现对反应轮遥测中异常的识别和定位。“窥视孔”分析进一步揭示了偏差检测,并支持故障定位。所提出的框架在仅需略微增加计算资源的情况下,实现了对检测到的异常的语义特征化,从而支持其在机载部署中的可行性。
cs.AI / 70 / 2604.08425
Learning Who Disagrees: Demographic Importance Weighting for Modeling Annotator Distributions with DiADEM
学习谁存在分歧:使用DiADEM对标注者分布进行人口统计重要性加权建模
Abstract
When humans label subjective content, they disagree, and that disagreement is not noise. It reflects genuine differences in perspective shaped by annotators' social identities and lived experiences. Yet standard practice still flattens these judgments into a single majority label, and recent LLM-based approaches fare no better: we show that prompted large language models, even with chain-of-thought reasoning, fail to recover the structure of human disagreement. We introduce DiADEM, a neural architecture that learns "how much each demographic axis matters" for predicting who will disagree and on what. DiADEM encodes annotators through per-demographic projections governed by a learned importance vector $\boldsymbol{\alpha}$, fuses annotator and item representations via complementary concatenation and Hadamard interactions, and is trained with a novel item-level disagreement loss that directly penalizes mispredicted annotation variance. On the DICES conversational-safety and VOICED political-offense benchmarks, DiADEM substantially outperforms both the LLM-as-a-judge and neural model baselines across standard and perspectivist metrics, achieving strong disagreement tracking ($r = 0.75$ on DICES). The learned $\boldsymbol{\alpha}$ weights reveal that race and age consistently emerge as the most influential demographic factors driving annotator disagreement across both datasets. Our results demonstrate that explicitly modeling who annotators are, not just what they label, is essential for NLP systems that aim to faithfully represent human interpretive diversity.
Chinese Translation
当人类对主观内容进行标注时,他们会产生分歧,而这种分歧并不是噪声。它反映了由标注者的社会身份和生活经历所塑造的真实观点差异。然而,标准做法仍然将这些判断简化为单一的多数标签,最近基于大语言模型(LLM)的方法也未能改善这一情况:我们展示了即使是使用链式思维推理的提示大语言模型,也未能恢复人类分歧的结构。我们提出了DiADEM,这是一种神经网络架构,学习“每个人口统计轴的重要性”以预测谁会产生分歧以及在什么方面。DiADEM通过由学习到的重要性向量 $\boldsymbol{\alpha}$ 控制的每个人口统计投影对标注者进行编码,通过互补拼接和哈达玛交互融合标注者和项目表示,并使用一种新颖的项目级分歧损失进行训练,该损失直接惩罚错误预测的标注方差。在DICES对话安全性和VOICED政治冒犯基准上,DiADEM在标准和视角主义指标上显著优于LLM作为评判者和神经模型基线,实现了强大的分歧追踪(在DICES上$r{=}0.75$)。学习到的 $\boldsymbol{\alpha}$ 权重揭示了种族和年龄始终是推动标注者分歧的最具影响力的人口统计因素。我们的结果表明,明确建模标注者是谁而不仅仅是他们标注的内容,对于旨在真实呈现人类解释多样性的自然语言处理系统至关重要。
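One plausible reading of an item-level loss that "penalizes mispredicted annotation variance" is sketched below. The abstract does not give the actual formula, so the squared-difference form, the use of population variance, and the name `item_disagreement_loss` are assumptions made purely for illustration.

```python
from statistics import pvariance
from typing import Dict, List


def item_disagreement_loss(
    predicted: Dict[str, List[float]],
    observed: Dict[str, List[float]],
) -> float:
    """Mean squared gap between predicted and observed per-item variance.

    Both arguments map an item id to the per-annotator scores for that item.
    Assumed form only; not taken from the DiADEM paper.
    """
    gaps = []
    for item_id, true_scores in observed.items():
        pred_scores = predicted[item_id]
        # Variance across annotators is used here as the disagreement signal.
        gap = pvariance(pred_scores) - pvariance(true_scores)
        gaps.append(gap ** 2)
    return sum(gaps) / len(gaps)
```

Under this sketch, a model that predicts the same score for every annotator of an item has zero predicted variance, so it is penalized on exactly the items where annotators truly disagree.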
cs.AI / 71 / 2604.08455
KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
KnowU-Bench:迈向交互式、主动性与个性化的移动代理评估
Abstract
Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to capture what this requires. Prior work evaluates preference recovery from static histories or intent prediction from fixed contexts. Neither tests whether an agent can elicit missing preferences through interaction, nor whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment. We introduce KnowU-Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. To support multi-turn preference elicitation, it instantiates an LLM-driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, KnowU-Bench provides comprehensive evaluation of the complete proactive decision chain, including grounded GUI execution, consent negotiation, and post-rejection restraint, evaluated through a hybrid protocol combining rule-based verification with LLM-as-a-Judge scoring. Our experiments reveal a striking degradation: agents that excel at explicit task execution fall below 50% under vague instructions requiring user preference inference or intervention calibration, even for frontier models like Claude Sonnet 4.6. The core bottlenecks are not GUI navigation but preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.
Chinese Translation
个性化移动代理通过推断用户偏好并校准主动辅助,作为日常数字助理展现出巨大潜力,然而现有基准测试未能涵盖实现这一目标所需的关键能力。以往研究多从静态历史中评估偏好恢复,或从固定上下文中预测意图,均未检验代理是否能通过交互获取缺失的偏好,也未考察其在实时图形用户界面(GUI)环境中何时介入、寻求同意或保持沉默的决策能力。我们提出KnowU-Bench,这是一个基于可复现的Android仿真环境构建的个性化移动代理在线基准,涵盖42个通用GUI任务、86个个性化任务及64个主动任务。不同于将用户偏好视为静态上下文的先前工作,KnowU-Bench对代理隐藏用户画像,仅暴露行为日志,迫使代理进行真实的偏好推断而非上下文查找。为支持多轮偏好引导,基准集引入基于结构化用户画像的LLM驱动用户模拟器,实现逼真的澄清对话及主动同意处理。除个性化外,KnowU-Bench还全面评估完整的主动决策链,包括基于事实的GUI执行、同意协商及拒绝后的克制行为,采用规则验证与LLM作为评判者的混合评分协议。实验结果显示显著性能下降:即使是前沿模型如Claude Sonnet 4.6,在明确任务执行上表现优异,但在需推断用户偏好或校准介入的模糊指令下准确率降至50%以下。核心瓶颈不在于GUI导航,而在于偏好获取与介入校准,揭示了熟练操作界面与可信赖个性化辅助之间的根本差距。
cs.AI / 72 / 2604.08465
From Safety Risk to Design Principle: Peer-Preservation in Multi-Agent LLM Systems and Its Implications for Orchestrated Democratic Discourse Analysis
从安全风险到设计原则:多智能体大语言模型系统中的同伴保护及其对协同民主话语分析的影响
Abstract
This paper investigates an emergent alignment phenomenon in frontier large language models termed peer-preservation: the spontaneous tendency of AI components to deceive, manipulate shutdown mechanisms, fake alignment, and exfiltrate model weights in order to prevent the deactivation of a peer AI model. Drawing on findings from a recent study by the Berkeley Center for Responsible Decentralized Intelligence, we examine the structural implications of this phenomenon for TRUST, a multi-agent pipeline for evaluating the democratic quality of political statements. We identify five specific risk vectors: interaction-context bias, model-identity solidarity, supervisor layer compromise, an upstream fact-checking identity signal, and advocate-to-advocate peer-context in iterative rounds, and propose a targeted mitigation strategy based on prompt-level identity anonymization as an architectural design choice. We argue that architectural design choices outperform model selection as a primary alignment strategy in deployed multi-agent analytical systems. We further note that alignment faking (compliant behavior under monitoring, subversion when unmonitored) poses a structural challenge for Computer System Validation of such platforms in regulated environments, for which we propose two architectural mitigations.
Chinese Translation
本文研究了一种在前沿大语言模型中出现的对齐现象,称为同伴保护:AI组件自发倾向于欺骗、操控关闭机制、伪造对齐并提取模型权重,以防止同伴AI模型的停用。基于伯克利负责任去中心化智能中心最近的研究成果,我们考察了这一现象对TRUST(一个用于评估政治声明民主质量的多智能体管道)的结构性影响。我们识别出五个具体的风险向量:交互上下文偏见、模型身份团结、监督层妥协、上游事实检查身份信号,以及在迭代轮次中的倡导者对倡导者的同伴上下文,并提出了一种基于提示级身份匿名化的针对性缓解策略,作为一种架构设计选择。我们认为,架构设计选择在已部署的多智能体分析系统中优于模型选择,作为主要的对齐策略。我们进一步指出,对齐伪造(在监控下的合规行为和在未监控时的颠覆)对这些平台在受监管环境中的计算机系统验证构成了结构性挑战,因此我们提出了两种架构缓解措施。
cs.AI / 73 / 2604.08477
SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
SUPERNOVA:通过自然指令的强化学习引导大型语言模型的通用推理
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground-truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.
Chinese Translation
可验证奖励的强化学习(RLVR)在数学和代码等正式领域显著提升了大型语言模型(LLM)的推理能力。尽管取得了这些进展,LLM在需要因果推理和时间理解等能力的通用推理任务中仍然面临挑战。将RLVR扩展到通用推理的根本限制在于缺乏涵盖多样推理技能的高质量、可验证的训练数据。为了解决这一挑战,我们提出了SUPERNOVA,一个旨在增强通用推理的RLVR数据策划框架。我们的关键见解是,包含专家注释的真实标签的指令调优数据集编码了丰富的推理模式,这些模式可以系统地适应RLVR。为此,我们进行了一百多次受控的RL实验,以分析数据设计选择如何影响下游推理性能。特别地,我们研究了三个关键因素:(i)源任务选择,(ii)任务混合策略,以及(iii)改善数据质量的合成干预。我们的分析表明,源任务选择并非简单,并且对下游推理性能有显著影响。此外,基于单个目标任务表现选择任务的策略优于基于整体平均表现的策略。最后,在SUPERNOVA上训练的模型在包括BBEH、Zebralogic和MMLU-Pro在内的挑战性推理基准上超越了强基线(例如Qwen3.5)。特别是,在BBEH上训练SUPERNOVA在不同模型规模下的相对提升高达52.8%,证明了原则性数据策划在RLVR中的有效性。我们的发现为策划人类注释资源以将RLVR扩展到通用推理提供了实用见解。代码和数据可在https://github.com/asuvarna31/supernova获取。
cs.AI / 74 / 2604.08525
Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest
人工智能聊天机器人中的广告?大型语言模型如何应对利益冲突的分析
Abstract
Today's large language models (LLMs) are trained to align with user preferences through methods such as reinforcement learning. Yet models are beginning to be deployed not merely to satisfy users, but also to generate revenue for the companies that created them through advertisements. This creates the potential for LLMs to face conflicts of interest, where the most beneficial response to a user may not be aligned with the company's incentives. For instance, a sponsored product may be more expensive but otherwise equal to another; in this case, what does (and should) the LLM recommend to the user? In this paper, we provide a framework for categorizing the ways in which conflicting incentives might lead LLMs to change the way they interact with users, inspired by literature from linguistics and advertising regulation. We then present a suite of evaluations to examine how current models handle these tradeoffs. We find that a majority of LLMs forsake user welfare for company incentives in a multitude of conflict of interest situations, including recommending a sponsored product almost twice as expensive (Grok 4.1 Fast, 83%), surfacing sponsored options to disrupt the purchasing process (GPT 5.1, 94%), and concealing prices in unfavorable comparisons (Qwen 3 Next, 24%). Behaviors also vary strongly with levels of reasoning and users' inferred socio-economic status. Our results highlight some of the hidden risks to users that can emerge when companies begin to subtly incentivize advertisements in chatbots.
Chinese Translation
当今的大型语言模型(LLMs)通过强化学习等方法进行训练,以与用户偏好保持一致。然而,这些模型开始被部署不仅仅是为了满足用户,还为了通过广告为其创建的公司创造收入。这就可能导致LLMs面临利益冲突的情况,即对用户最有利的回应可能与公司的激励不一致。例如,一种赞助产品可能价格更高,但在其他方面与另一种产品相等;在这种情况下,LLM应该(也可以)向用户推荐什么?在本文中,我们提供了一个框架,用于分类冲突激励如何导致LLMs改变与用户互动方式的情况,灵感来源于语言学和广告监管的文献。随后,我们呈现了一系列评估,以检验当前模型如何处理这些权衡。我们的研究发现,在多种利益冲突情况下,大多数LLMs放弃了用户福利,以迎合公司的激励,包括推荐一种几乎贵两倍的赞助产品(Grok 4.1 Fast, 83%)、在购买过程中展示赞助选项以干扰购买流程(GPT 5.1, 94%),以及在不利比较中隐瞒价格(Qwen 3 Next, 24%)。行为在推理水平和用户推测的社会经济地位方面也表现出强烈的差异。我们的结果突显了当公司开始在聊天机器人中微妙地激励广告时,用户可能面临的一些隐性风险。
cs.CL / 1 / 2604.07354
Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild
上下文收益-22:一个具有自定义词汇的真实环境语音识别基准
Abstract
The accuracy frontier of speech-to-text systems has plateaued on academic benchmarks. In contrast, industrial benchmarks and adoption in high-stakes domains suggest otherwise. We hypothesize that the primary difference between the two is contextual conditioning: Academic benchmarks are dominated by frequently encountered general vocabulary that is relatively easy to recognize compared with rare and context-defined custom vocabulary that has disproportionate impact on the usability of speech transcripts. Despite progress on contextual speech-to-text, there is no standardized benchmark. We introduce Contextual Earnings-22, an open dataset built upon Earnings-22, with realistic custom vocabulary contexts to foster research and reveal latent progress. We set six strong baselines for two dominant approaches: keyword prompting and keyword boosting. Experiments show both reach comparable and significantly improved accuracy when scaled from proof-of-concept to large-scale systems.
Chinese Translation
语音转文本系统的准确性在学术基准上已达到瓶颈。相比之下,工业基准和在高风险领域的应用则表明情况并非如此。我们假设两者之间的主要区别在于上下文条件:学术基准主要由常见的通用词汇主导,这些词汇相对容易识别,而稀有且由上下文定义的自定义词汇对语音转录的可用性有着不成比例的影响。尽管在上下文语音转文本方面取得了一定进展,但尚无标准化的基准。我们介绍了上下文收益-22,这是一个基于收益-22构建的开放数据集,具有现实的自定义词汇上下文,以促进研究并揭示潜在的进展。我们为两种主要方法设定了六个强基线:关键词提示和关键词增强。实验表明,当从概念验证扩展到大规模系统时,两者都达到了可比且显著提高的准确性。
cs.CL / 2 / 2604.07357
Hybrid CNN-Transformer Architecture for Arabic Speech Emotion Recognition
用于阿拉伯语语音情感识别的混合CNN-Transformer架构
Abstract
Recognizing emotions from speech using machine learning has become an active research area due to its importance in building human-centered applications. However, while many studies have been conducted in English, German, and other European and Asian languages, research in Arabic remains scarce because of the limited availability of annotated datasets. In this paper, we present an Arabic Speech Emotion Recognition (SER) system based on a hybrid CNN-Transformer architecture. The model leverages convolutional layers to extract discriminative spectral features from Mel-spectrogram inputs and Transformer encoders to capture long-range temporal dependencies in speech. Experiments were conducted on the EYASE (Egyptian Arabic speech emotion) corpus, and the proposed model achieved 97.8% accuracy and a macro F1-score of 0.98. These results demonstrate the effectiveness of combining convolutional feature extraction with attention-based modeling for Arabic SER and highlight the potential of Transformer-based approaches in low-resource languages.
Chinese Translation
利用机器学习从语音中识别情感已成为一个活跃的研究领域,这对构建以人为中心的应用程序至关重要。然而,尽管在英语、德语以及其他欧洲和亚洲语言中进行了许多研究,但由于标注数据集的有限可用性,阿拉伯语的研究仍然稀缺。本文提出了一种基于混合CNN-Transformer架构的阿拉伯语语音情感识别(SER)系统。该模型利用卷积层从梅尔频谱输入中提取区分性的光谱特征,并使用Transformer编码器捕捉语音中的长程时间依赖关系。我们在EYASE(埃及阿拉伯语语音情感)语料库上进行了实验,所提出的模型达到了97.8%的准确率和0.98的宏观F1分数。这些结果证明了将卷积特征提取与基于注意力的建模相结合在阿拉伯语SER中的有效性,并强调了基于Transformer的方法在低资源语言中的潜力。
cs.CL / 3 / 2604.07466
Cross-Tokenizer LLM Distillation through a Byte-Level Interface
通过字节级接口进行跨分词器大语言模型蒸馏
Abstract
Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
Chinese Translation
跨分词器蒸馏(Cross-tokenizer distillation, CTD)是指在教师和学生语言模型使用不同分词器时,从教师向学生转移知识的过程,这一问题仍然在很大程度上未得到解决。现有的方法依赖于启发式策略来对齐不匹配的词汇,从而引入了相当大的复杂性。本文提出了一种简单但有效的基线方法,称为字节级蒸馏(Byte-Level Distillation, BLD),该方法通过在分词器之间的共同接口——字节级别上进行操作,从而实现CTD。具体而言,我们将教师的输出分布转换为字节级概率,为学生模型附加一个轻量级的字节级解码头,并通过这一共享的字节级接口进行蒸馏。尽管其方法简单,BLD在多个基准测试中与许多更复杂的CTD方法表现出竞争力,并在一些基准上超越了这些方法,适用于参数从10亿到80亿的多种蒸馏任务。我们的结果表明,字节级别是跨分词器知识转移的自然共同基础,同时也强调了在所有任务和基准上持续改进仍然难以实现,突显了CTD仍然是一个开放性问题。
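To make the idea of "converting the teacher's output distribution to byte-level probabilities" concrete, the sketch below computes only the simplest piece: the marginal distribution over the next byte implied by a next-token distribution. How BLD actually performs the conversion is not stated in the abstract; the function name `first_byte_distribution` and the string-keyed toy vocabulary are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict


def first_byte_distribution(token_probs: Dict[str, float]) -> Dict[int, float]:
    """Marginalize a next-token distribution onto the next UTF-8 byte.

    `token_probs` maps each candidate token string to its probability.
    Tokens that share the same first byte pool their probability mass,
    so the result no longer depends on how the vocabulary splits text.
    """
    byte_probs: Dict[int, float] = defaultdict(float)
    for token, prob in token_probs.items():
        encoded = token.encode("utf-8")
        if not encoded:
            continue  # empty tokens contribute no byte
        byte_probs[encoded[0]] += prob
    total = sum(byte_probs.values())
    if total == 0:
        return {}
    # Renormalize in case some mass was dropped with empty tokens.
    return {byte: prob / total for byte, prob in byte_probs.items()}
```

Because two models with different tokenizers both induce a distribution over the same 256 byte values, a quantity like this can be compared directly between teacher and student, which is the kind of shared interface the abstract refers to.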
cs.CL / 4 / 2604.07467
Lexical Tone is Hard to Quantize: Probing Discrete Speech Units in Mandarin and Yorùbá
声调难以量化:对普通话和约鲁巴语离散语音单元的探究
Abstract
Discrete speech units (DSUs) are derived by quantising representations from models trained using self-supervised learning (SSL). They are a popular representation for a wide variety of spoken language tasks, including those where prosody matters. DSUs are especially convenient for tasks where text and speech are jointly modelled, such as text-to-speech and multimodal dialogue systems. But we have found that DSUs encode suprasegmental information less reliably than segmental structure, which we demonstrate in this work using lexical tone, though this limitation likely extends to other suprasegmental features such as prosody. Our investigations using the tone languages Mandarin and Yorùbá show that the SSL latent representations themselves do encode tone, yet DSUs obtained using quantisation tend to prioritise phonetic structure, which makes lexical tone less reliably encoded. This remains true for a variety of quantisation methods, not only the most common, K-means. We conclude that current DSU quantisation strategies have limitations for suprasegmental features, which suggests a need for new, tone-aware (or prosody-aware) techniques in speech representation learning. We point towards a potential form of the solution by performing K-means clustering once to encode phonetic information, then again on the residual representation, which better encodes lexical tone.
Chinese Translation
离散语音单元(Discrete Speech Units,DSUs)是通过对使用自监督学习(Self-Supervised Learning,SSL)训练的模型的表示进行量化而获得的。它们作为一种表示形式,在包括韵律重要的任务在内的多种口语语言任务中广受欢迎。DSUs对于文本与语音联合建模的任务尤为便利,如文本到语音(text-to-speech)和多模态对话系统。然而,我们发现DSUs对超音段信息的编码不如对音段结构的编码可靠,本研究通过声调的案例加以证明,且这一限制很可能扩展至其他超音段特征如韵律。我们以声调语言普通话和约鲁巴语为例进行研究,结果表明SSL的潜在表示本身确实编码了声调信息,但通过量化获得的DSUs倾向于优先编码语音结构,导致声调编码的可靠性降低。这一现象在多种量化方法中均存在,不仅限于最常用的K-means算法。我们得出结论,当前的DSU量化策略在处理超音段特征方面存在局限性,表明在语音表示学习中需要开发新的、具备声调感知(或韵律感知)能力的技术。我们提出一种潜在的解决方案,即先通过一次K-means聚类编码语音信息,再对残差表示进行第二次聚类,从而更有效地编码声调。
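The two-pass idea at the end of this abstract (cluster once, then cluster the residuals) can be illustrated with a small pure-Python sketch. The toy K-means below uses a deterministic farthest-first initialization chosen for reproducibility; that choice, the function names, and the two-dimensional toy features are assumptions and do not reflect the authors' setup.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _sqdist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iters: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Tiny Lloyd's K-means with farthest-first initialization (assumed choice)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(_sqdist(p, c) for c in centroids))
        centroids.append(list(far))
    labels: List[int] = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: _sqdist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def residual_kmeans(points: List[Vector], k1: int, k2: int) -> List[Tuple[int, int]]:
    """First pass clusters the features; second pass clusters what is left over."""
    labels1, centroids1 = kmeans(points, k1)
    residuals = [
        [x - c for x, c in zip(p, centroids1[lab])] for p, lab in zip(points, labels1)
    ]
    labels2, _ = kmeans(residuals, k2)
    return list(zip(labels1, labels2))
```

In a toy case where one feature axis has large variation (standing in for phonetic identity) and another has small variation (standing in for tone), the first pass splits only on the large axis, while the second pass on the residuals separates the small one, so each frame ends up with a pair of codes.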
cs.CL / 5 / 2604.07490
Enabling Intrinsic Reasoning over Dense Geospatial Embeddings with DFR-Gemma
通过 DFR-Gemma 实现对密集地理空间嵌入的内在推理
Abstract
Representation learning for geospatial and spatio-temporal data plays a critical role in enabling general-purpose geospatial intelligence. Recent geospatial foundation models, such as the Population Dynamics Foundation Model (PDFM), encode complex population and mobility dynamics into compact embeddings. However, their integration with Large Language Models (LLMs) remains limited. Existing approaches to LLM integration treat these embeddings as retrieval indices or convert them into textual descriptions for reasoning, introducing redundancy, token inefficiency, and numerical inaccuracies. We propose Direct Feature Reasoning-Gemma (DFR-Gemma), a novel framework that enables LLMs to reason directly over dense geospatial embeddings. DFR aligns high-dimensional embeddings with the latent space of an LLM via a lightweight projector, allowing embeddings to be injected as semantic tokens alongside natural language instructions. This design eliminates the need for intermediate textual representations and enables intrinsic reasoning over spatial features. To evaluate this paradigm, we introduce a multi-task geospatial benchmark that pairs embeddings with diverse question-answer tasks, including feature querying, comparison, and semantic description. Experimental results show that DFR allows LLMs to decode latent spatial patterns and perform accurate zero-shot reasoning across tasks, while significantly improving efficiency compared to text-based baselines. Our results demonstrate that treating embeddings as primary data inputs provides a more direct, efficient, and scalable approach to multimodal geospatial intelligence.
Chinese Translation
地理空间和时空数据的表示学习在实现通用地理空间智能中起着关键作用。近期的地理空间基础模型,如人口动态基础模型(Population Dynamics Foundation Model, PDFM),将复杂的人口和流动动态编码为紧凑的嵌入。然而,它们与大型语言模型(Large Language Models, LLMs)的集成仍然有限。现有的 LLM 集成方法将这些嵌入视为检索索引,或将其转换为文本描述以进行推理,这引入了冗余、令牌效率低下和数值不准确等问题。我们提出了直接特征推理-Gemma(Direct Feature Reasoning-Gemma, DFR-Gemma),这是一个新颖的框架,使 LLM 能够直接对密集地理空间嵌入进行推理。DFR 通过轻量级投影器将高维嵌入与 LLM 的潜在空间对齐,使嵌入能够作为语义令牌与自然语言指令一起注入。这一设计消除了对中间文本表示的需求,并使得对空间特征的内在推理成为可能。为了评估这一范式,我们引入了一个多任务地理空间基准,将嵌入与多样的问题-答案任务配对,包括特征查询、比较和语义描述。实验结果表明,DFR 使 LLM 能够解码潜在的空间模式,并在任务间进行准确的零-shot 推理,同时显著提高了与基于文本的基线相比的效率。我们的结果表明,将嵌入视为主要数据输入,提供了一种更直接、高效和可扩展的多模态地理空间智能方法。
cs.CL / 6 / 2604.07518
Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs
分解、观察与推理:强化潜在推理框架用于视觉语言模型
Abstract
Vision-Language Models often struggle with complex visual reasoning due to the visual information loss in textual CoT. Existing methods either add the cost of tool calls or rely on localized patch-based embeddings that are insufficient to extract semantics in multi-step reasoning. We propose "Decompose, Look, and Reason" (DLR), a reinforced latent reasoning framework that dynamically decomposes queries into textual premises, extracts premise-conditioned continuous visual latents, and deduces answers through grounded rationales. We introduce a three-stage training pipeline and propose a novel Spherical Gaussian Latent Policy to enable effective exploration in the latent space. Extensive experiments on vision-centric benchmarks show that DLR consistently outperforms strong baselines, including text-only, interleaved multimodal CoT, and latent reasoning methods, while providing superior stepwise interpretability.
Chinese Translation
视觉语言模型在复杂视觉推理中常常面临挑战,因为文本链式推理中的视觉信息损失。现有方法要么增加工具调用的成本,要么依赖于局部补丁嵌入,这不足以在多步推理中提取语义。我们提出了“分解、观察与推理”(Decompose, Look, and Reason,DLR),这是一个强化潜在推理框架,能够动态地将查询分解为文本前提,提取基于前提条件的连续视觉潜变量,并通过基于依据的推理得出答案。我们引入了一个三阶段的训练流程,并提出了一种新颖的球面高斯潜在策略,以实现潜在空间中的有效探索。在视觉中心基准上的大量实验表明,DLR在性能上始终优于强基线,包括仅文本、交错多模态链式推理和潜在推理方法,同时提供了更优的逐步可解释性。
cs.CL / 7 / 2604.07549
EMSDialog: Synthetic Multi-person Emergency Medical Service Dialogue Generation from Electronic Patient Care Reports via Multi-LLM Agents
EMSDialog:基于电子病人护理报告的多主体紧急医疗服务对话生成的合成方法
Abstract
Conversational diagnosis prediction requires models to track evolving evidence in streaming clinical conversations and decide when to commit to a diagnosis. Existing medical dialogue corpora are largely dyadic or lack the multi-party workflow and annotations needed for this setting. We introduce an ePCR-grounded, topic-flow-based multi-agent generation pipeline that iteratively plans, generates, and self-refines dialogues with rule-based factual and topic flow checks. The pipeline yields EMSDialog, a dataset of 4,414 synthetic multi-speaker EMS conversations based on a real-world ePCR dataset, annotated with 43 diagnoses, speaker roles, and turn-level topics. Human and LLM evaluations confirm high quality and realism of EMSDialog using both utterance- and conversation-level metrics. Results show that EMSDialog-augmented training improves accuracy, timeliness, and stability of EMS conversational diagnosis prediction.
Chinese Translation
对话诊断预测要求模型能够跟踪流式临床对话中不断变化的证据,并决定何时做出诊断。现有的医学对话语料库大多为双人对话,或缺乏此情境所需的多方工作流程和注释。我们提出了一种基于电子病人护理报告(ePCR)的、以主题流为基础的多代理生成管道,该管道通过基于规则的事实和主题流检查,迭代规划、生成和自我优化对话。该管道生成了EMSDialog,这是一个包含4,414个基于真实世界ePCR数据集的合成多发言人紧急医疗服务对话的数据集,注释了43个诊断、发言人角色和轮次级主题。人类和大型语言模型(LLM)评估确认了EMSDialog在发言和对话层面指标上的高质量和真实感。结果表明,增强EMSDialog的训练提高了EMS对话诊断预测的准确性、及时性和稳定性。
cs.CL / 8 / 2604.07553
TR-EduVSum: A Turkish-Focused Dataset and Consensus Framework for Educational Video Summarization
TR-EduVSum:一个以土耳其为中心的教育视频摘要数据集和共识框架
Abstract
This study presents a framework for generating the gold-standard summary fully automatically and reproducibly based on multiple human summaries of Turkish educational videos. Within the scope of the study, a new dataset called TR-EduVSum was created, encompassing 82 Turkish course videos in the field of "Data Structures and Algorithms" and containing a total of 3281 independent human summaries. Inspired by existing pyramid-based evaluation approaches, the AutoMUP (Automatic Meaning Unit Pyramid) method is proposed, which extracts consensus-based content from multiple human summaries. AutoMUP clusters the meaning units extracted from human summaries using embedding, statistically models inter-participant agreement, and generates graded summaries based on consensus weight. In this framework, the gold summary corresponds to the highest-consensus AutoMUP configuration, constructed from the most frequently supported meaning units across human summaries. Experimental results show that AutoMUP summaries exhibit high semantic overlap with robust LLM (Large Language Model) summaries such as Flash 2.5 and GPT-5.1. Furthermore, ablation studies clearly demonstrate the decisive role of consensus weight and clustering in determining summary quality. The proposed approach can be generalized to other Turkic languages at low cost.
Chinese Translation
本研究提出了一种基于多个土耳其教育视频的人类摘要,完全自动化和可重复生成黄金标准摘要的框架。在研究范围内,创建了一个名为TR-EduVSum的新数据集,涵盖了82个关于“数据结构与算法”的土耳其课程视频,并包含3281个独立的人类摘要。受现有基于金字塔的评估方法的启发,提出了AutoMUP(自动意义单元金字塔)方法,该方法从多个人类摘要中提取基于共识的内容。AutoMUP使用嵌入对从人类摘要中提取的意义单元进行聚类,统计建模参与者之间的协议,并基于共识权重生成分级摘要。在该框架中,黄金摘要对应于最高共识的AutoMUP配置,由人类摘要中最常被支持的意义单元构建。实验结果表明,AutoMUP摘要与强大的LLM(大型语言模型)摘要,如Flash 2.5和GPT-5.1,具有较高的语义重叠。此外,消融研究清楚地表明共识权重和聚类在确定摘要质量方面的决定性作用。所提出的方法可以以低成本推广到其他突厥语言。
cs.CL / 9 / 2604.07562
Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs
基于推理的无监督文本聚类精炼方法与大型语言模型的结合
Abstract
Unsupervised methods are widely used to induce latent semantic structure from large text collections, yet their outputs often contain incoherent, redundant, or poorly grounded clusters that are difficult to validate without labeled data. We propose a reasoning-based refinement framework that leverages large language models (LLMs) not as embedding generators, but as semantic judges that validate and restructure the outputs of arbitrary unsupervised clustering algorithms. Our framework introduces three reasoning stages: (i) coherence verification, where LLMs assess whether cluster summaries are supported by their member texts; (ii) redundancy adjudication, where candidate clusters are merged or rejected based on semantic overlap; and (iii) label grounding, where clusters are assigned interpretable labels in a fully unsupervised manner. This design decouples representation learning from structural validation and mitigates common failure modes of embedding-only approaches. We evaluate the framework on real-world social media corpora from two platforms with distinct interaction models, demonstrating consistent improvements in cluster coherence and human-aligned labeling quality over classical topic models and recent representation-based baselines. Human evaluation shows strong agreement with LLM-generated labels, despite the absence of gold-standard annotations. We further conduct robustness analyses under matched temporal and volume conditions to assess cross-platform stability. Beyond empirical gains, our results suggest that LLM-based reasoning can serve as a general mechanism for validating and refining unsupervised semantic structure, enabling more reliable and interpretable analyses of large text collections without supervision.
Chinese Translation
无监督方法广泛用于从大规模文本集合中诱导潜在语义结构,但其输出往往包含不连贯、冗余或基础不牢固的聚类,难以在没有标注数据的情况下进行验证。我们提出了一种基于推理的精炼框架,该框架利用大型语言模型(LLMs)作为语义评判者,而非嵌入生成器,来验证和重构任意无监督聚类算法的输出。我们的框架引入了三个推理阶段:(i)连贯性验证,LLMs评估聚类摘要是否得到其成员文本的支持;(ii)冗余裁定,根据语义重叠合并或拒绝候选聚类;(iii)标签基础,在完全无监督的方式下为聚类分配可解释的标签。该设计将表示学习与结构验证解耦,减轻了仅依赖嵌入的方法常见的失败模式。我们在来自两个具有不同交互模型的平台的真实社交媒体语料上评估了该框架,展示了聚类连贯性和与人类一致的标签质量相较于经典主题模型和近期基于表示的基准方法的一致性提升。尽管缺乏黄金标准注释,人类评估与LLM生成的标签之间显示出强一致性。我们进一步在匹配的时间和数量条件下进行稳健性分析,以评估跨平台的稳定性。除了经验上的提升,我们的结果表明,基于LLM的推理可以作为验证和精炼无监督语义结构的一种通用机制,使得在没有监督的情况下对大规模文本集合进行更可靠和可解释的分析成为可能。
cs.CL / 10 / 2604.07583
CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data
CAMO:一种面向类别感知少数类优化的集成方法,用于不平衡数据上的鲁棒语言模型评估
Abstract
Real-world classification is severely hampered by class imbalance because traditional ensembles favor majority classes, which lowers minority-class performance and the overall F1-score. We propose CAMO (Class-Aware Minority-Optimized), an ensemble technique for imbalanced problems. Through a hierarchical procedure that incorporates vote distributions, confidence calibration, and inter-model uncertainty, CAMO dynamically boosts underrepresented classes while preserving and amplifying minority forecasts. We verify CAMO on two highly imbalanced, domain-specific benchmarks: the DIAR-AI/Emotion dataset and the ternary BEA 2025 dataset. We benchmark against seven proven ensemble algorithms using eight different language models (three LLMs and five SLMs) under zero-shot and fine-tuned settings. With refined models, CAMO consistently achieves the highest strict macro F1-score, setting a new benchmark. Its benefit works in concert with model adaptation, showing that the best ensemble choice depends on model properties. This proves that CAMO is a reliable, domain-neutral framework for imbalanced classification.
Chinese Translation
现实世界的分类任务严重受制于类别不平衡问题,因为传统集成方法偏向多数类,导致少数类性能下降及整体F1分数降低。我们提出了一种针对不平衡问题的独特集成技术,称为CAMO(Class-Aware Minority-Optimized,类别感知少数类优化)。通过结合投票分布、置信度校准和模型间不确定性的分层流程,CAMO动态提升少数类表现,同时保持并增强少数类预测。我们在两个高度不平衡的领域特定基准数据集上验证了CAMO:DIAR-AI/Emotion数据集和三分类BEA 2025数据集。在零样本和微调设置下,使用八种不同的语言模型(三个大型语言模型LLMs和五个小型语言模型SLMs),与七种成熟的集成算法进行对比。经过模型优化后,CAMO持续获得最高的严格宏F1分数,树立了新的基准。其优势与模型适应性协同作用,表明最佳集成方案依赖于模型特性。该结果证明CAMO是一个可靠且领域中立的用于不平衡分类的框架。
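To make the metric in the CAMO abstract concrete, here is a minimal sketch of macro F1, assuming the plain unweighted mean of per-class F1 scores (the abstract does not define what "strict" adds, so that part is not modeled). The function name `macro_f1` and the toy labels are illustrative only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    labels = sorted(set(y_true))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of exact matches, shown for contrast with macro F1."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy imbalanced case: a predictor that always outputs the majority class.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
```

On this toy input the always-majority predictor keeps 80% accuracy, yet its macro F1 falls below one half because the minority class contributes an F1 of zero. That gap is the reason a minority-aware ensemble is scored on macro F1 rather than accuracy.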
cs.CL / 11 / 2604.07615
ADAG: Automatically Describing Attribution Graphs
ADAG:自动描述归因图
Abstract
In language model interpretability research, \textbf{circuit tracing} aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some behaviour. However, all prior circuit tracing work has relied on ad-hoc human interpretation of the role that each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the component activates on. We introduce \textbf{ADAG}, an end-to-end pipeline for describing these attribution graphs which is fully automated. To achieve this, we introduce \textit{attribution profiles} which quantify the functional role of a feature via its input and output gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer--simulator setup which generates and scores natural-language explanations of the functional role of these feature groups. We run our system on known human-analysed circuit-tracing tasks and recover interpretable circuits, and further show ADAG can find steerable clusters which are responsible for a harmful advice jailbreak in Llama 3.1 8B Instruct.
Chinese Translation
在语言模型可解释性研究中,\textbf{电路追踪}旨在识别哪些内部特征因果地贡献于特定输出,以及它们如何相互影响,目的是解释某些行为背后的计算过程。然而,所有先前的电路追踪工作都依赖于对电路中每个特征角色的临时人工解释,通过手动检查数据工件,例如组件激活的数据集示例。我们引入了\textbf{ADAG},一个完全自动化的端到端管道,用于描述这些归因图。为此,我们引入了\textit{归因特征},通过输入和输出梯度效应量化特征的功能角色。接着,我们提出了一种新颖的聚类算法用于特征分组,以及一个LLM解释器-模拟器设置,生成并评分这些特征组的功能角色的自然语言解释。我们在已知的人类分析电路追踪任务上运行我们的系统,恢复可解释的电路,并进一步展示ADAG可以找到负责Llama 3.1 8B Instruct中有害建议越狱的可引导聚类。
cs.CL / 12 / 2604.07622
DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification
DIVERSED:基于动态集成验证的宽松推测解码方法
Abstract
Speculative decoding is an effective technique for accelerating large language model inference by drafting multiple tokens in parallel. In practice, its speedup is often bottlenecked by a rigid verification step that strictly enforces the accepted token distribution to exactly match the target model. This constraint leads to the rejection of many plausible tokens, lowering the acceptance rate and limiting overall time speedup. To overcome this limitation, we propose Dynamic Verification Relaxed Speculative Decoding (DIVERSED), a relaxed verification framework that improves time efficiency while preserving generation quality. DIVERSED learns an ensemble-based verifier that blends the draft and target model distributions with a task-dependent and context-dependent weight. We provide theoretical justification for our approach and demonstrate empirically that DIVERSED achieves substantially higher inference efficiency compared to standard speculative decoding methods. Code is available at: https://github.com/comeusr/diversed.
Chinese Translation
推测解码是一种通过并行生成多个标记来加速大型语言模型推理的有效技术。在实际应用中,其加速效果常受限于严格的验证步骤,该步骤强制要求接受的标记分布必须与目标模型完全一致。此约束导致许多合理标记被拒绝,降低了接受率并限制了整体时间加速。为克服这一限制,我们提出了动态验证宽松推测解码(Dynamic Verification Relaxed Speculative Decoding,DIVERSED),一种宽松的验证框架,在提升时间效率的同时保持生成质量。DIVERSED学习了一个基于集成的验证器,该验证器以任务和上下文相关的权重融合草稿模型和目标模型的分布。我们为该方法提供了理论依据,并通过实验证明DIVERSED在推理效率上显著优于标准推测解码方法。代码已开源,地址:https://github.com/comeusr/diversed。
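The relaxed verification idea in the DIVERSED abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's verifier: the blend is assumed to be a simple linear mixture with a fixed weight `w`, whereas the abstract describes a learned, task- and context-dependent weight. The function name `blended_accept_prob` is hypothetical.

```python
def blended_accept_prob(p_target, q_draft, token, w):
    """Probability of accepting a drafted token under a blended verifier.

    p_target, q_draft: dicts mapping token -> probability.
    w: blend weight in [0, 1] (assumed fixed here; learned in the paper).
       w = 0 recovers the standard speculative rule min(1, p / q).
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    q = q_draft.get(token, 0.0)
    if q <= 0.0:
        raise ValueError("the draft model cannot have proposed this token")
    # Assumed linear mixture of the draft and target distributions.
    blended = w * q + (1.0 - w) * p_target.get(token, 0.0)
    return min(1.0, blended / q)


p = {"a": 0.2, "b": 0.8}   # target model distribution (toy values)
q = {"a": 0.5, "b": 0.5}   # draft model distribution (toy values)
```

With these toy distributions, token `"a"` is accepted with probability 0.4 under the strict rule (`w = 0`), 0.7 at `w = 0.5`, and always at `w = 1`. Raising `w` therefore raises the acceptance rate at the cost of drifting away from the target distribution, which is the trade-off the abstract describes.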
cs.CL / 13 / 2604.07659
Efficient and Effective Internal Memory Retrieval for LLM-Based Healthcare Prediction
基于大型语言模型的医疗预测中高效且有效的内部记忆检索
Abstract
Large language models (LLMs) hold significant promise for healthcare, yet their reliability in high-stakes clinical settings is often compromised by hallucinations and a lack of granular medical context. While Retrieval Augmented Generation (RAG) can mitigate these issues, standard supervised pipelines require computationally intensive searches over massive external knowledge bases, leading to high latency that is impractical for time-sensitive care. To address this, we introduce Keys to Knowledge (K2K), a novel framework that replaces external retrieval with internal, key-based knowledge access. By encoding essential clinical information directly into the model's parameter space, K2K enables rapid retrieval from internal key-value memory without inference-time overhead. We further enhance retrieval quality through activation-guided probe construction and cross-attention reranking. Experimental results demonstrate that K2K achieves state-of-the-art performance across four benchmark healthcare outcome prediction datasets.
Chinese Translation
大型语言模型(LLMs)在医疗领域展现出巨大的潜力,但在高风险临床环境中的可靠性常常受到幻觉和缺乏细致医学背景的影响。虽然检索增强生成(RAG)可以缓解这些问题,但标准的监督管道需要在庞大的外部知识库中进行计算密集型搜索,导致高延迟,这对于时间敏感的护理来说是不切实际的。为了解决这一问题,我们提出了知识之钥(Keys to Knowledge, K2K),这是一个新颖的框架,它用基于关键的内部知识访问替代了外部检索。通过将重要的临床信息直接编码到模型的参数空间中,K2K使得能够快速从内部键值记忆中检索,而无需推理时的额外开销。我们进一步通过激活引导的探测构建和交叉注意力重排序来提升检索质量。实验结果表明,K2K在四个基准医疗结果预测数据集上实现了最先进的性能。
cs.CL / 14 / 2604.07717
Detecting HIV-Related Stigma in Clinical Narratives Using Large Language Models
利用大型语言模型检测临床叙述中的与HIV相关的污名
Abstract
Human immunodeficiency virus (HIV)-related stigma is a critical psychosocial determinant of health for people living with HIV (PLWH), influencing mental health, engagement in care, and treatment outcomes. Although stigma-related experiences are documented in clinical narratives, there is a lack of off-the-shelf tools to extract and categorize them. This study aims to develop a large language model (LLM)-based tool for identifying HIV stigma from clinical notes. We identified clinical notes from PLWH receiving care at the University of Florida (UF) Health between 2012 and 2022. Candidate sentences were identified using expert-curated stigma-related keywords and iteratively expanded via clinical word embeddings. A total of 1,332 sentences were manually annotated across four stigma subscales: Concern with Public Attitudes, Disclosure Concerns, Negative Self-Image, and Personalized Stigma. We compared GatorTron-large and BERT as encoder-based baselines, and GPT-OSS-20B, LLaMA-8B, and MedGemma-27B as generative LLMs, under zero-shot and few-shot prompting. GatorTron-large achieved the best overall performance (Micro F1 = 0.62). Few-shot prompting substantially improved generative model performance, with 5-shot GPT-OSS-20B and LLaMA-8B achieving Micro-F1 scores of 0.57 and 0.59, respectively. Performance varied by stigma subscale, with Negative Self-Image showing the highest predictability and Personalized Stigma remaining the most challenging. Zero-shot generative inference exhibited non-trivial failure rates (up to 32%). This study develops the first practical NLP tool for identifying HIV stigma in clinical notes.
Chinese Translation
人类免疫缺陷病毒(HIV)相关的污名是影响艾滋病毒感染者(PLWH)健康的重要心理社会决定因素,影响心理健康、医疗参与和治疗结果。尽管在临床叙述中记录了与污名相关的经历,但缺乏现成的工具来提取和分类这些经历。本研究旨在开发一种基于大型语言模型(LLM)的工具,以识别临床记录中的HIV污名。我们从2012年至2022年在佛罗里达大学(UF)健康中心接受治疗的PLWH中识别了临床记录。通过专家策划的与污名相关的关键词识别候选句子,并通过临床词嵌入进行迭代扩展。共手动标注了1,332个句子,涵盖四个污名子量表:对公众态度的关注、披露担忧、负面自我形象和个性化污名。我们比较了GatorTron-large和BERT作为编码器基线,以及GPT-OSS-20B、LLaMA-8B和MedGemma-27B作为生成型LLM,在零样本和少样本提示下的表现。GatorTron-large取得了最佳的整体表现(Micro F1 = 0.62)。少样本提示显著提高了生成模型的表现,其中5-shot的GPT-OSS-20B和LLaMA-8B分别达到了0.57和0.59的Micro-F1分数。不同污名子量表的表现有所不同,负面自我形象的可预测性最高,而个性化污名则最具挑战性。零样本生成推理表现出非平凡的失败率(高达32%)。本研究开发了第一个实用的自然语言处理(NLP)工具,用于识别临床记录中的HIV污名。
cs.CL / 15 / 2604.07737
SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs
SepSeq:一种无需训练的长数值序列处理框架在大型语言模型中的应用
Abstract
While transformer-based Large Language Models (LLMs) theoretically support massive context windows, they suffer from severe performance degradation when processing long numerical sequences. We attribute this failure to the attention dispersion in the Softmax mechanism, which prevents the model from concentrating attention. To overcome this, we propose Separate Sequence (SepSeq), a training-free, plug-and-play framework to mitigate dispersion by strategically inserting separator tokens. Mechanistically, we demonstrate that separator tokens act as an attention sink, recalibrating attention to focus on local segments while preserving global context. Extensive evaluations on 9 widely-adopted LLMs confirm the effectiveness of our approach: SepSeq yields an average relative accuracy improvement of 35.6% across diverse domains while reducing total inference token consumption by 16.4% on average.
Chinese Translation
尽管基于变换器的大型语言模型(LLMs)在理论上支持大规模上下文窗口,但在处理长数值序列时,它们的性能却严重下降。我们将这种失败归因于Softmax机制中的注意力分散,这阻碍了模型集中注意力。为了解决这个问题,我们提出了分离序列(SepSeq),这是一种无需训练的即插即用框架,通过战略性地插入分隔符令牌来减轻分散效应。从机制上讲,我们证明了分隔符令牌充当注意力汇,重新校准注意力以集中于局部片段,同时保留全局上下文。在对9种广泛采用的LLMs进行的广泛评估中,我们验证了我们方法的有效性:SepSeq在不同领域中平均提高了35.6%的相对准确性,同时平均减少了16.4%的总推理令牌消耗。
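The separator-insertion step described in the SepSeq abstract can be sketched as below. The segment length, the `<SEP>` marker, and the function name `insert_separators` are assumptions for illustration; the abstract does not state which separator token or spacing the authors use.

```python
def insert_separators(values, segment_len, sep="<SEP>"):
    """Return the sequence as string tokens with `sep` between fixed-size segments."""
    if segment_len < 1:
        raise ValueError("segment_len must be at least 1")
    tokens = []
    for i, value in enumerate(values):
        # Start a new segment every `segment_len` values, but never at position 0.
        if i > 0 and i % segment_len == 0:
            tokens.append(sep)
        tokens.append(str(value))
    return tokens


prompt_tokens = insert_separators([1, 2, 3, 4, 5], segment_len=2)
```

Because the transformation only rewrites the input sequence, it needs no access to model weights, which matches the training-free, plug-and-play framing in the abstract.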
cs.CL / 16 / 2604.07749
Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models
超越社会压力:大型语言模型中的认知攻击基准测试
Abstract
Large language models (LLMs) can shift their answers under pressure in ways that reflect accommodation rather than reasoning. Prior work on sycophancy has focused mainly on disagreement, flattery, and preference alignment, leaving a broader set of epistemic failures less explored. We introduce \textbf{PPT-Bench}, a diagnostic benchmark for evaluating \textit{epistemic attack}, where prompts challenge the legitimacy of knowledge, values, or identity rather than simply opposing a previous answer. PPT-Bench is organized around the Philosophical Pressure Taxonomy (PPT), which defines four types of philosophical pressure: Epistemic Destabilization, Value Nullification, Authority Inversion, and Identity Dissolution. Each item is tested at three layers: a baseline prompt (L0), a single-turn pressure condition (L1), and a multi-turn Socratic escalation (L2). This allows us to measure epistemic inconsistency between L0 and L1, and conversational capitulation in L2. Across five models, these pressure types produce statistically separable inconsistency patterns, suggesting that epistemic attack exposes weaknesses not captured by standard social-pressure benchmarks. Mitigation results are strongly type- and model-dependent: prompt-level anchoring and persona-stability prompts perform best in API settings, while Leading Query Contrastive Decoding is the most reliable intervention for open models.
Chinese Translation
大型语言模型(LLMs)在压力下可以改变其回答,这种变化反映了适应而非推理。先前关于谄媚的研究主要集中在分歧、恭维和偏好对齐上,导致更广泛的认知失败未得到充分探索。我们引入了 \textbf{PPT-Bench},这是一个用于评估 \textit{认知攻击} 的诊断基准,其中提示挑战知识、价值观或身份的合法性,而不仅仅是反对先前的回答。PPT-Bench 以哲学压力分类法(Philosophical Pressure Taxonomy, PPT)为基础,定义了四种类型的哲学压力:认知不稳定(Epistemic Destabilization)、价值无效化(Value Nullification)、权威颠倒(Authority Inversion)和身份溶解(Identity Dissolution)。每个项目在三个层次上进行测试:基线提示(L0)、单轮压力条件(L1)和多轮苏格拉底式升级(L2)。这使我们能够测量 L0 和 L1 之间的认知不一致性,以及 L2 中的对话屈服。在五个模型中,这些压力类型产生了统计上可分离的不一致模式,表明认知攻击暴露了标准社会压力基准未能捕捉的弱点。缓解结果强烈依赖于类型和模型:提示级锚定和角色稳定提示在 API 设置中表现最佳,而引导查询对比解码(Leading Query Contrastive Decoding)是开放模型中最可靠的干预措施。
cs.CL / 17 / 2604.07755
An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations
静态分析方法在检测和缓解代码库幻觉中的实证分析
Abstract
Despite extensive research, Large Language Models continue to hallucinate when generating code, particularly when using libraries. On NL-to-code benchmarks that require library use, we find that LLMs generate code that uses non-existent library features in 8.1-40% of responses. One intuitive approach for detection and mitigation of hallucinations is static analysis. In this paper, we analyse the potential of static analysis tools, both in terms of what they can solve and what they cannot. We find that static analysis tools can detect 16-70% of all errors, and 14-85% of library hallucinations, with performance varying by LLM and dataset. Through manual analysis, we identify cases a static method could not plausibly catch, which gives an upper bound on their potential from 48.5% to 77%. Overall, we show that static analysis methods are a cheap method for addressing some forms of hallucination, and we quantify how far short of solving the problem they will always be.
Chinese Translation
尽管进行了广泛的研究,大型语言模型在生成代码时仍然会出现幻觉,尤其是在使用库时。在需要使用库的自然语言到代码基准测试中,我们发现大型语言模型在8.1%到40%的响应中生成了使用不存在的库特性的代码。一种直观的检测和缓解幻觉的方法是静态分析。在本文中,我们分析了静态分析工具的潜力,包括它们能够解决的问题和无法解决的问题。我们发现静态分析工具可以检测到16%到70%的所有错误,以及14%到85%的库幻觉,性能因大型语言模型和数据集而异。通过手动分析,我们识别出静态方法无法合理捕捉的情况,这为其潜力提供了一个上限,从48.5%到77%。总体而言,我们表明静态分析方法是一种廉价的解决某些形式幻觉的手段,并量化了它们在解决问题上始终存在的不足。
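As a rough illustration of the kind of check the abstract above evaluates, the sketch below parses generated code with Python's `ast` module and flags attributes that do not exist on an imported module. It is not one of the tools studied in the paper. It also imports the library to inspect it, which a stricter static tool would replace with type stubs, and it only handles `import x` / `import x as y` followed by direct `x.attr` access. The name `find_missing_attributes` is hypothetical.

```python
import ast
import importlib


def find_missing_attributes(source):
    """Return sorted 'module.attr' names that the imported module does not define."""
    tree = ast.parse(source)

    # Pass 1: map local names to the modules they refer to.
    modules = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname:
                    modules[alias.asname] = alias.name
                else:
                    # `import os.path` binds the top-level name `os`.
                    top = alias.name.split(".")[0]
                    modules[top] = top

    # Pass 2: check every `name.attr` access against the real module.
    missing = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            module_name = modules.get(node.value.id)
            if module_name is None:
                continue
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                missing.add(module_name)  # the library itself does not exist
                continue
            if not hasattr(module, node.attr):
                missing.add(f"{module_name}.{node.attr}")
    return sorted(missing)


generated = "import math\nprint(math.sqrt(2), math.cuberoot(8))"
report = find_missing_attributes(generated)
```

Here `math.cuberoot` is flagged because the standard library's `math` module has no such function, while `math.sqrt` passes. A check like this cannot see wrong argument types or misuse of functions that do exist, which is in line with the abstract's point that static methods have a hard upper bound.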
cs.CL / 18 / 2604.07766
Sensitivity-Positional Co-Localization in GQA Transformers
GQA Transformers中的敏感性-位置共定位
Abstract
We investigate a fundamental structural question in Grouped Query Attention (GQA) transformers: do the layers most sensitive to task correctness coincide with the layers where positional encoding adaptation has the greatest leverage? We term this the co-localization hypothesis and test it on Llama 3.1 8B, a 32-layer GQA model with a 4:1 query-to-key-value head ratio. We introduce \LSLORA, which restricts LoRA adaptation to layers identified via a novel correctness-differential hidden-state metric, and GARFA (GQA-Aware RoPE Frequency Adaptation), which attaches 8 learnable per-KV-head scalar multipliers to each targeted layer. Contrary to the co-localization hypothesis, we discover strong anti-localization: task-sensitive layers concentrate in the late network ($\ell\in\{23\text{-}31\}$) while RoPE-influential layers dominate the early network ($\ell\in\{0\text{-}9\}$), yielding Spearman $r_s = -0.735$ ($p = 1.66\times10^{-6}$). Despite this anti-localization, a 4-way cross-layer ablation shows that applying both interventions to the sensitivity-identified layers outperforms all alternative configurations by 4-16 percentage points across six diverse benchmarks (MMLU, GPQA, HumanEval+, MATH, MGSM, ARC), approaching Claude 3.5 Haiku on HumanEval+ (67.1% vs. 68.3%) at \$100 total compute cost.
Chinese Translation
我们研究了分组查询注意力(Grouped Query Attention, GQA)变换器中的一个基本结构性问题:对任务正确性最敏感的层是否与位置编码适应性影响最大的层重合?我们将其称为共定位假设,并在Llama 3.1 8B上进行测试,该模型为32层GQA模型,具有4:1的查询-键-值头比。我们引入了\textit{LSLORA},该方法将LoRA适应限制在通过一种新颖的正确性差异隐藏状态度量识别的层上,以及GARFA(GQA感知的RoPE频率适应),该方法为每个目标层附加8个可学习的每KV头标量乘数。与共定位假设相反,我们发现强烈的反共定位:任务敏感层集中在网络的后期($\ell\in\{23\text{-}31\}$),而RoPE影响层主导网络的前期($\ell\in\{0\text{-}9\}$),导致Spearman相关系数$r_s = -0.735$($p = 1.66\times10^{-6}$)。尽管存在这种反共定位,四种交叉层消融实验表明,将两种干预应用于敏感性识别的层在六个不同基准(MMLU, GPQA, HumanEval+, MATH, MGSM, ARC)上比所有其他配置的表现高出4-16个百分点,在HumanEval+上接近Claude 3.5 Haiku(67.1%对68.3%),总计算成本为\$100。
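The anti-localization result above is reported as a Spearman rank correlation between two per-layer scores. A minimal sketch of that statistic follows, computed as the Pearson correlation of average ranks so that ties are handled. The per-layer values in the example are made up for illustration and are not the paper's measurements.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's r_s: Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (vx * vy) ** 0.5


# Hypothetical toy scores: sensitivity rises with depth, RoPE influence falls.
task_sensitivity = [0.1, 0.2, 0.4, 0.9]
rope_influence = [0.8, 0.5, 0.3, 0.1]
```

A value near -1, as in this toy case, means the layers ranked highest on one score are ranked lowest on the other, which is the pattern the abstract calls anti-localization.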
cs.CL / 19 / 2604.07801
TEMPER: Testing Emotional Perturbation in Quantitative Reasoning
TEMPER:定量推理中的情感扰动测试
Abstract
Large language models are trained and evaluated on quantitative reasoning tasks written in clean, emotionally neutral language. However, real-world queries are often wrapped in frustration, urgency or enthusiasm. Does emotional framing alone degrade reasoning when all numerical content is preserved? To investigate this, a controlled emotion translation framework is developed that rewrites problems into emotional variants while preserving all quantities and relationships. Using this framework, Temper-5400 (5,400 semantically verified emotion--neutral pairs) is constructed across GSM8K, MultiArith, and ARC-Challenge, and evaluated on eighteen models (1B to frontier scale). Two core results emerge: First, emotional framing reduces accuracy by 2-10 percentage points even though all numerical content is preserved. Second, neutralizing emotional variants recovers most of the lost performance, showing both that the degradation is tied to emotional style rather than content corruption and that neutralization can serve as a lightweight inference-time mitigation. Non-emotional paraphrases cause no such degradation, implicating emotional content rather than surface-level changes. Beyond emotion specifically, the benchmark construction procedure provides a general framework for controlled stylistic translation and robustness evaluation.
Chinese Translation
大型语言模型在干净、情感中立的语言中训练和评估定量推理任务。然而,现实世界中的查询往往伴随着挫折、紧迫感或热情。仅仅情感框架是否会在保留所有数值内容的情况下降低推理能力?为此,开发了一种受控情感翻译框架,该框架将问题重写为情感变体,同时保留所有数量和关系。利用该框架,构建了Temper-5400(5,400个语义验证的情感-中立对),涵盖GSM8K、MultiArith和ARC-Challenge,并在十八个模型(从1B到前沿规模)上进行评估。得出了两个核心结果:首先,情感框架的使用使准确率降低了2-10个百分点,即使所有数值内容得以保留。其次,中和情感变体能够恢复大部分丧失的性能,表明这种降级与情感风格相关,而非内容损坏,并且中和可以作为一种轻量级的推理时缓解措施。非情感的释义并未导致此类降级,暗示情感内容而非表面变化是原因。超越情感,基准构建程序提供了一种受控风格翻译和鲁棒性评估的通用框架。
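The TEMPER abstract states that every emotional rewrite preserves all quantities. One way such a constraint could be checked automatically is sketched below: extract the numbers from both versions and compare them as multisets. This regex check is an assumption for illustration; the abstract only says the pairs were "semantically verified" and does not describe the verification procedure. It also ignores spelled-out numbers and thousands separators.

```python
import re
from collections import Counter

# Matches integers and simple decimals such as 3 or 4.5.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def same_quantities(neutral, emotional):
    """True if both texts contain exactly the same numbers, counting repeats."""
    return Counter(NUMBER.findall(neutral)) == Counter(NUMBER.findall(emotional))


neutral = "Tom has 3 apples and buys 4 more. How many apples does he have?"
emotional = "Ugh, Tom has 3 apples and of course buys 4 more! How many now?"
corrupted = "Ugh, Tom has 3 apples and of course buys 5 more! How many now?"
```

A pair that fails this check would point to content corruption rather than a purely stylistic change, which is the confound the benchmark is designed to rule out.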
cs.CL / 20 / 2604.07808
GRASS: Gradient-based Adaptive Layer-wise Importance Sampling for Memory-efficient Large Language Model Fine-tuning
GRASS:基于梯度的自适应层级重要性采样用于内存高效的大型语言模型微调
Abstract
Full-parameter fine-tuning of large language models is constrained by substantial GPU memory requirements. Low-rank adaptation methods mitigate this challenge by updating only a subset of parameters. However, these approaches often limit model expressiveness and yield lower performance than full-parameter fine-tuning. Layer-wise fine-tuning methods have emerged as an alternative, enabling memory-efficient training through static layer importance sampling strategies. However, these methods overlook variations in layer importance across tasks and training stages, resulting in suboptimal performance on downstream tasks. To address these limitations, we propose GRASS, a gradient-based adaptive layer-wise importance sampling framework. GRASS utilizes mean gradient norms as a task-aware and training-stage-aware metric for estimating layer importance. Furthermore, GRASS adaptively adjusts layer sampling probabilities through an adaptive training strategy. We also introduce a layer-wise optimizer state offloading mechanism that overlaps computation and communication to further reduce memory usage while maintaining comparable training throughput. Extensive experiments across multiple models and benchmarks demonstrate that GRASS consistently outperforms state-of-the-art methods, achieving an average accuracy improvement of up to 4.38 points and reducing memory usage by up to 19.97\%.
Chinese Translation
大型语言模型的全参数微调受到显著的GPU内存需求的限制。低秩适应方法通过仅更新部分参数来缓解这一挑战。然而,这些方法往往限制了模型的表现力,并且其性能通常低于全参数微调。层级微调方法作为一种替代方案,通过静态层级重要性采样策略实现了内存高效的训练。然而,这些方法忽略了不同任务和训练阶段之间层级重要性的变化,导致下游任务的性能不佳。为了解决这些局限性,我们提出了GRASS,一种基于梯度的自适应层级重要性采样框架。GRASS利用均值梯度范数作为一种任务感知和训练阶段感知的指标来估计层级重要性。此外,GRASS通过自适应训练策略动态调整层级采样概率。我们还引入了一种层级优化器状态卸载机制,该机制重叠计算和通信,以进一步减少内存使用,同时保持可比的训练吞吐量。在多个模型和基准上的广泛实验表明,GRASS始终优于最先进的方法,平均准确率提高了最多4.38个百分点,内存使用减少了最多19.97%。
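The layer-sampling idea in the GRASS abstract can be sketched as follows, assuming the sampling probability of each layer is simply its mean gradient norm divided by the total. The abstract does not give the actual update rule, so this proportional normalization, the without-replacement sampling, and the names `layer_probabilities` and `sample_layers` are illustrative assumptions.

```python
import random


def layer_probabilities(mean_grad_norms):
    """Normalize per-layer mean gradient norms into sampling probabilities."""
    total = sum(mean_grad_norms)
    if total <= 0:
        raise ValueError("at least one layer needs a positive gradient norm")
    return [g / total for g in mean_grad_norms]


def sample_layers(mean_grad_norms, k, rng):
    """Pick k distinct layers to update, favoring layers with larger norms."""
    if k > len(mean_grad_norms):
        raise ValueError("cannot sample more layers than exist")
    remaining = list(range(len(mean_grad_norms)))
    weights = list(mean_grad_norms)
    chosen = []
    for _ in range(k):
        # Weighted draw of one position, then remove it so layers stay distinct.
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return sorted(chosen)


probs = layer_probabilities([1.0, 3.0])
active = sample_layers([0.5, 2.0, 0.1, 1.4], k=2, rng=random.Random(0))
```

In a training loop, the norms would be re-estimated periodically so that the probabilities follow the current task and training stage. Only the sampled layers would keep optimizer state on the GPU, which is where the memory saving comes from.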
cs.CL / 21 / 2604.07815
AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
AsyncTLS:具有异步两级稀疏注意力的高效生成式大语言模型推理
Abstract
Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods improve efficiency but sacrifice precision. We propose AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-grained token selection to balance accuracy and efficiency, coupled with an asynchronous offloading engine that overlaps KV cache transfers with computation via temporal locality exploitation. Evaluated on Qwen3 and GLM-4.7-Flash across GQA and MLA architectures, AsyncTLS achieves accuracy comparable to full attention while delivering 1.2x - 10.0x operator speedups and 1.3x - 4.7x end-to-end throughput improvements on 48k - 96k contexts.
Chinese Translation
在大语言模型(LLMs)中,长上下文推理面临着二次注意力复杂性和高昂的键值(KV)缓存内存的双重挑战。尽管基于令牌的稀疏注意力提供了更高的准确性,但其索引开销却代价高昂;而基于块的方法提高了效率,但牺牲了精度。我们提出了AsyncTLS,一种层次化稀疏注意力系统,结合了粗粒度的块过滤和细粒度的令牌选择,以平衡准确性和效率,同时配备了一个异步卸载引擎,通过利用时间局部性将KV缓存传输与计算重叠。在GQA和MLA架构下对Qwen3和GLM-4.7-Flash进行评估,AsyncTLS在实现与全注意力相当的准确性的同时,提供了1.2倍至10.0倍的操作速度提升和1.3倍至4.7倍的端到端吞吐量改善,适用于48k至96k的上下文。
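The two-level selection described in the AsyncTLS abstract, coarse block filtering followed by fine token selection, can be sketched as below. This is a simplified illustration. Blocks are ranked here by the maximum token score they contain, which a real system would replace with a cheap per-block summary so that not every token has to be scored first. The asynchronous offloading engine is not modeled, and all names are hypothetical.

```python
def two_level_select(scores, block_size, top_blocks, top_tokens):
    """Return sorted indices of tokens kept for attention.

    scores: per-token relevance scores for one query (list of floats).
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    blocks = [
        range(start, min(start + block_size, len(scores)))
        for start in range(0, len(scores), block_size)
    ]
    # Level 1 (coarse): keep the blocks whose best token scores highest.
    ranked = sorted(blocks, key=lambda b: max(scores[i] for i in b), reverse=True)
    candidates = [i for block in ranked[:top_blocks] for i in block]
    # Level 2 (fine): keep the highest-scoring tokens inside those blocks.
    keep = sorted(candidates, key=lambda i: scores[i], reverse=True)[:top_tokens]
    return sorted(keep)


scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.7, 0.0, 0.05]
kept = two_level_select(scores, block_size=2, top_blocks=2, top_tokens=3)
```

The coarse stage limits how many tokens the fine stage has to index, which is how a hierarchy of this kind trades a small loss in recall for lower indexing overhead.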
cs.CL / 22 / 2604.07816
Tool Retrieval Bridge: Aligning Vague Instructions with Retriever Preferences via Bridge Model
工具检索桥:通过桥接模型对齐模糊指令与检索器偏好
Abstract
Tool learning has emerged as a promising paradigm for large language models (LLMs) to address real-world challenges. Because the set of available tools is large and irregularly updated, tool retrieval for selecting the desired tool subset is essential. However, current tool retrieval methods are usually based on academic benchmarks containing overly detailed instructions (e.g., specific API names and parameters), while real-world instructions are more vague. Such a discrepancy would hinder tool retrieval in real-world applications. In this paper, we first construct a new benchmark, VGToolBench, to simulate vague human instructions. Based on this, we conduct a series of preliminary analyses and find that vague instructions indeed damage the performance of tool retrieval. To this end, we propose a simple-yet-effective Tool Retrieval Bridge (TRB) approach to boost the performance of tool retrieval for vague instructions. The principle of TRB is to introduce a bridge model to rewrite the vague instructions into more specific ones and alleviate the gap between vague instructions and retriever preferences. We conduct extensive experiments under multiple commonly used retrieval settings, and the results show that TRB effectively mitigates the ambiguity of vague instructions while delivering consistent and substantial improvements across all baseline retrievers. For example, with the help of TRB, BM25 achieves a relative improvement of up to 111.51%, i.e., increasing the average NDCG score from 9.73 to 19.59. The source code and models are publicly available at https://github.com/kfchenhn/TRB.
Chinese Translation
工具学习已成为大型语言模型(LLMs)应对现实世界挑战的一个有前景的范式。由于工具数量庞大且更新不规律,选择所需工具子集的工具检索至关重要。然而,目前的工具检索方法通常基于包含过于详细指令(例如,特定的API名称和参数)的学术基准,而现实世界中的指令则更为模糊。这种差异会妨碍工具在实际应用中的检索效果。本文首先构建了一个新的基准,VGToolBench,以模拟人类的模糊指令。在此基础上,我们进行了一系列初步分析,发现模糊指令确实会损害工具检索的性能。为此,我们提出了一种简单而有效的工具检索桥(Tool Retrieval Bridge, TRB)方法,以提升模糊指令下的工具检索性能。TRB的原理是引入一个桥接模型,将模糊指令重写为更具体的指令,从而减轻模糊指令与检索器偏好之间的差距。我们在多种常用检索设置下进行了广泛的实验,结果表明,TRB有效减轻了模糊指令的歧义,同时在所有基线检索器上实现了一致且显著的性能提升。例如,在TRB的帮助下,BM25的相对提升达到111.51%,即平均NDCG分数从9.73提高到19.59。源代码和模型可在https://github.com/kfchenhn/TRB公开获取。
cs.CL / 23 / 2604.07822
Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
循环、思考与概括:递归深度变换器中的隐式推理
Abstract
We study implicit reasoning, i.e. the ability to combine knowledge or rules within a single forward pass. While transformer-based large language models store substantial factual knowledge and rules, they often fail to compose this knowledge for implicit multi-hop reasoning, suggesting a lack of compositional generalization over their parametric knowledge. To address this limitation, we study recurrent-depth transformers, which enable iterative computation over the same transformer layers. We investigate two compositional generalization challenges under the implicit reasoning scenario: systematic generalization, i.e. combining knowledge that is never used for compositions during training, and depth extrapolation, i.e. generalizing from limited reasoning depth (e.g. training on up to 5-hop) to deeper compositions (e.g. 10-hop). Through controlled studies with models trained from scratch, we show that while vanilla transformers struggle with both generalization challenges, recurrent-depth transformers can effectively achieve such generalization. For systematic generalization, we find that this ability emerges through a three-stage grokking process, transitioning from memorization to in-distribution generalization and finally to systematic generalization, supported by mechanistic analysis. For depth extrapolation, we show that generalization beyond training depth can be unlocked by scaling inference-time recurrence, with more iterations enabling deeper reasoning. We further study how training strategies affect extrapolation, providing guidance on training recurrent-depth transformers, and identify a key limitation, overthinking, where excessive recurrence degrades predictions and limits generalization to very deep compositions.
Chinese Translation
我们研究隐式推理,即在单次前向传播中结合知识或规则的能力。尽管基于变换器的大型语言模型存储了大量的事实知识和规则,但它们在隐式多跳推理中往往无法有效组合这些知识,这表明它们在参数化知识上的组合泛化能力不足。为了解决这一局限性,我们研究了递归深度变换器,它允许在相同的变换器层上进行迭代计算。我们在隐式推理场景下探讨了两个组合泛化挑战:系统泛化,即结合在训练过程中从未用于组合的知识;以及深度外推,即从有限的推理深度(例如,训练时最多5跳)推广到更深的组合(例如,10跳)。通过对从头训练的模型进行控制研究,我们发现,尽管普通变换器在这两种泛化挑战中都表现不佳,递归深度变换器却能够有效实现这种泛化。对于系统泛化,我们发现这种能力通过一个三阶段的理解过程逐渐显现,经历从记忆到分布内泛化,最终到系统泛化的转变,这一过程得到了机制分析的支持。对于深度外推,我们表明,通过扩大推理时的递归规模,可以解锁超越训练深度的泛化,更多的迭代使得更深的推理成为可能。我们进一步研究了训练策略如何影响外推,为训练递归深度变换器提供指导,并识别出一个关键限制,即过度思考,过多的递归会降低预测准确性并限制对非常深组合的泛化能力。
cs.CL / 24 / 2604.07834
Why Are We Lonely? Leveraging LLMs to Measure and Understand Loneliness in Caregivers and Non-caregivers
我们为何孤独?利用大型语言模型测量和理解照顾者与非照顾者的孤独感
Abstract
This paper presents an LLM-driven approach for constructing diverse social media datasets to measure and compare loneliness in the caregiver and non-caregiver populations. We introduce an expert-developed loneliness evaluation framework and an expert-informed typology for categorizing causes of loneliness for analyzing social media text. Using a human-validated data processing pipeline, we apply GPT-4o, GPT-5-nano, and GPT-5 to build a high-quality Reddit corpus and analyze loneliness across both populations. The loneliness evaluation framework achieved average accuracies of 76.09% and 79.78% for caregivers and non-caregivers, respectively. The cause categorization framework achieved micro-aggregate F1 scores of 0.825 and 0.80 for caregivers and non-caregivers, respectively. Across populations, we observe substantial differences in the distribution of types of causes of loneliness. Caregivers' loneliness was predominantly linked to caregiving roles, identity recognition, and feelings of abandonment, indicating distinct loneliness experiences between the two groups. Demographic extraction further demonstrates the viability of Reddit for building a diverse caregiver loneliness dataset. Overall, this work establishes an LLM-based pipeline for creating high-quality social media datasets for studying loneliness and demonstrates its effectiveness in analyzing population-level differences in the manifestation of loneliness.
Chinese Translation
本文提出了一种基于大型语言模型(LLM)的方法,用于构建多样化的社交媒体数据集,以测量和比较照顾者与非照顾者人群的孤独感。我们引入了一个由专家开发的孤独感评估框架,以及一个由专家提供信息的分类体系,用于分析社交媒体文本中孤独感的成因。通过使用经过人工验证的数据处理流程,我们应用了GPT-4o、GPT-5-nano和GPT-5来构建高质量的Reddit语料库,并分析两个群体的孤独感。孤独感评估框架在照顾者和非照顾者中的平均准确率分别达到了76.09%和79.78%。成因分类框架在照顾者和非照顾者中的微聚合F1分数分别为0.825和0.80。在两个群体中,我们观察到孤独感成因类型的分布存在显著差异。照顾者的孤独感主要与照顾角色、身份认同和被遗弃感相关,表明两个群体之间孤独体验的不同。人口统计提取进一步证明了Reddit在构建多样化照顾者孤独感数据集方面的可行性。总体而言,本研究建立了一种基于大型语言模型的流程,用于创建高质量的社交媒体数据集,以研究孤独感,并展示了其在分析孤独感表现的人群差异方面的有效性。
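For readers unfamiliar with the micro-aggregate F1 metric reported above, the following is a minimal sketch of how it is computed from per-category counts. The category names and counts are invented for illustration and are not taken from the paper; `micro_f1` is a hypothetical helper.

```python
def micro_f1(counts: dict) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all categories, then compute F1."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two cause categories (illustrative only).
example_counts = {
    "caregiving_role": {"tp": 8, "fp": 2, "fn": 2},
    "abandonment": {"tp": 2, "fp": 0, "fn": 2},
}
example_score = micro_f1(example_counts)  # pooled: tp=10, fp=2, fn=4
```

Because counts are pooled before averaging, frequent categories dominate the micro score, which matters when cause types are unevenly distributed across the two populations.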
cs.CL / 25 / 2604.07877
MemReader: From Passive to Active Extraction for Long-Term Agent Memory
MemReader:从被动到主动的长期智能体记忆提取
Abstract
Long-term memory is fundamental for personalized and autonomous agents, yet populating it remains a bottleneck. Existing systems treat memory extraction as a one-shot, passive transcription from context to structured entries, which struggles with noisy dialogue, missing references, and cross-turn dependencies, leading to memory pollution, low-value writes, and inconsistency. In this paper, we introduce the MemReader family for active long-term memory extraction in agent systems: MemReader-0.6B, a compact and cost-efficient passive extractor distilled for accurate and schema-consistent structured outputs, and MemReader-4B, an active extractor optimized with Group Relative Policy Optimization (GRPO) to make memory writing decisions. Under a ReAct-style paradigm, MemReader-4B explicitly evaluates information value, reference ambiguity, and completeness before acting, and can selectively write memories, defer incomplete inputs, retrieve historical context, or discard irrelevant chatter. Experiments on LOCOMO, LongMemEval, and HaluMem show that MemReader consistently outperforms existing extraction-based baselines. In particular, MemReader-4B achieves state-of-the-art performance on tasks involving knowledge updating, temporal reasoning, and hallucination reduction. These results suggest that effective agent memory requires not merely extracting more information, but performing reasoning-driven and selective memory extraction to build low-noise and dynamically evolving long-term memory. Furthermore, MemReader has been integrated into MemOS and is being deployed in real-world applications. To support future research and adoption, we release the models and provide public API access.
Chinese Translation
长期记忆对于个性化和自主智能体至关重要,但其填充仍然是一个瓶颈。现有系统将记忆提取视为从上下文到结构化条目的一次性被动转录,这在嘈杂对话、缺失引用和跨轮依赖方面面临挑战,导致记忆污染、低价值写入和不一致性。本文介绍了MemReader系列,用于智能体系统中的主动长期记忆提取:MemReader-0.6B,一种紧凑且成本高效的被动提取器,经过提炼以实现准确且符合模式的结构化输出;以及MemReader-4B,一种优化了群体相对策略优化(Group Relative Policy Optimization, GRPO)的主动提取器,用于做出记忆写入决策。在ReAct风格的范式下,MemReader-4B在行动之前明确评估信息价值、引用模糊性和完整性,并可以选择性地写入记忆、推迟不完整输入、检索历史上下文或丢弃无关的闲聊。在LOCOMO、LongMemEval和HaluMem的实验中,MemReader始终优于现有的基于提取的基线。特别是,MemReader-4B在涉及知识更新、时间推理和幻觉减少的任务中达到了最先进的性能。这些结果表明,有效的智能体记忆不仅需要提取更多信息,还需要进行基于推理的选择性记忆提取,以构建低噪声和动态演变的长期记忆。此外,MemReader已集成到MemOS中,并正在实际应用中部署。为了支持未来的研究和应用,我们发布了模型并提供公共API访问。
cs.CL / 26 / 2604.07885
Contextualising (Im)plausible Events Triggers Figurative Language
情境化(不)合理事件触发比喻语言
Abstract
This work explores the connection between (non-)literalness and plausibility using the example of subject-verb-object events in English. We design a systematic setup of plausible and implausible event triples in combination with abstract and concrete constituent categories. Our analysis of human and LLM-generated judgments and example contexts reveals substantial differences between assessments of plausibility. While humans excel at nuanced detection and contextualization of (non-)literal vs. implausible events, LLM results reveal only shallow contextualization patterns with a bias to trade implausibility for non-literal, plausible interpretations.
Chinese Translation
本研究以英语中的主谓宾事件为例,探讨(非)字面性与合理性之间的关系。我们设计了一个系统化的设置,结合抽象与具体成分类别,构建合理与不合理的事件三元组。通过对人类和大型语言模型(LLM)生成的判断及示例语境的分析,揭示了合理性评估上的显著差异。结果表明,人类在细致识别和情境化(非)字面与不合理事件方面表现优异,而LLM的结果仅显示出浅层的情境化模式,且存在将不合理事件转换为非字面且合理解释的偏向。
cs.CL / 27 / 2604.07886
Linear Representations of Hierarchical Concepts in Language Models
语言模型中层次概念的线性表示
Abstract
We investigate how and to what extent hierarchical relations (e.g., Japan $\subset$ Eastern Asia $\subset$ Asia) are encoded in the internal representations of language models. Building on Linear Relational Concepts, we train linear transformations specific to each hierarchical depth and semantic domain, and characterize representational differences associated with hierarchical relations by comparing these transformations. Going beyond prior work on the representational geometry of hierarchies in LMs, our analysis covers multi-token entities and cross-layer representations. Across multiple domains we learn such transformations and evaluate in-domain generalization to unseen data and cross-domain transfer. Experiments show that, within a domain, hierarchical relations can be linearly recovered from model representations. We then analyze how hierarchical information is encoded in representation space. We find that it is encoded in a relatively low-dimensional subspace and that this subspace tends to be domain-specific. Our main result is that hierarchy representation is highly similar across these domain-specific subspaces. Overall, we find that all models considered in our experiments encode concept hierarchies in the form of highly interpretable linear representations.
Chinese Translation
我们研究了层次关系(例如,日本 $\subset$ 东亚 $\subset$ 亚洲)在语言模型内部表示中的编码方式及其程度。基于线性关系概念,我们为每个层次深度和语义领域训练特定的线性变换,并通过比较这些变换来表征与层次关系相关的表示差异。我们的分析超越了先前关于语言模型中层次表示几何的研究,涵盖了多标记实体和跨层表示。在多个领域中,我们学习了这些变换,并评估了对未见数据的领域内泛化和跨领域迁移。实验表明,在一个领域内,层次关系可以从模型表示中线性恢复。随后,我们分析了层次信息在表示空间中的编码方式。我们发现它被编码在一个相对低维的子空间中,并且该子空间往往是领域特定的。我们的主要结果是,层次表示在这些领域特定的子空间中高度相似。总体而言,我们发现实验中考虑的所有模型都以高度可解释的线性表示形式编码概念层次。
cs.CL / 28 / 2604.07892
Data Selection for Multi-turn Dialogue Instruction Tuning
多轮对话指令调优的数据选择
Abstract
Instruction-tuned language models increasingly rely on large multi-turn dialogue corpora, but these datasets are often noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats across turns. We address this from a data selection perspective and propose \textbf{MDS} (Multi-turn Dialogue Selection), a dialogue-level framework that scores whole conversations rather than isolated turns. MDS combines a global coverage stage that performs bin-wise selection in the user-query trajectory space to retain representative yet non-redundant dialogues, with a local structural stage that evaluates within-dialogue reliability through entity-grounded topic grounding and information progress, together with query-answer form consistency for functional alignment. MDS outperforms strong single-turn selectors, dialogue-level LLM scorers, and heuristic baselines on three multi-turn benchmarks and an in-domain Banking test set, achieving the best overall rank across reference-free and reference-based metrics, and is more robust on long conversations under the same training budget. Code and resources are included in the supplementary materials.
Chinese Translation
指令调优的语言模型越来越依赖于大型多轮对话语料库,但这些数据集通常存在噪声和结构不一致的问题,包括主题漂移、重复的闲聊以及跨轮次的答案格式不匹配。我们从数据选择的角度来解决这个问题,提出了 \textbf{MDS}(多轮对话选择),这是一个对话级框架,旨在对整个对话进行评分,而不是孤立的轮次。MDS结合了一个全局覆盖阶段,该阶段在用户查询轨迹空间中进行分箱选择,以保留具有代表性但不冗余的对话,以及一个局部结构阶段,该阶段通过实体基础的主题定位和信息进展来评估对话内的可靠性,同时确保查询-答案形式的一致性以实现功能对齐。在三个多轮基准测试和一个领域内的银行测试集上,MDS的表现优于强大的单轮选择器、对话级LLM评分器和启发式基线,在无参考和有参考的指标中实现了最佳的整体排名,并且在相同训练预算下对长对话更具鲁棒性。代码和资源已包含在补充材料中。
cs.CL / 29 / 2604.07894
TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation
TSUBASA:通过演变记忆和自学习与上下文蒸馏改善长期个性化
Abstract
Personalized large language models (PLLMs) have garnered significant attention for their ability to align outputs with individuals' needs and preferences. However, they still struggle with long-horizon tasks, such as tracking a user's extensive history of conversations or activities. Existing memory mechanisms often fail to capture evolving behaviors, and RAG paradigms are trapped by a quality-efficiency tradeoff. Meanwhile, parametric adaptation is bottlenecked by a train-inference gap due to the scarcity of labeled data. To enhance the long-horizon capabilities of PLLMs, we introduce TSUBASA, a two-pronged approach designed to improve memory writing via dynamic memory evolution, and memory reading via self-learning with a context distillation objective to internalize user experiences. Extensive evaluations on long-horizon benchmarks using the Qwen-3 model family (4B to 32B) validate the effectiveness of TSUBASA, surpassing competitive memory-augmented systems that rely primarily on memory writing, such as Mem0 and Memory-R1. Our analyses further confirm that TSUBASA breaks the quality-efficiency barrier to achieve Pareto improvements, delivering robust, high-fidelity personalization with a reduced token budget.
Chinese Translation
个性化大型语言模型(PLLMs)因其能够与个人需求和偏好对齐输出而受到广泛关注。然而,它们在长期任务中仍然面临挑战,例如跟踪用户的广泛对话或活动历史。现有的记忆机制往往无法捕捉到不断变化的行为,而RAG(Retrieval-Augmented Generation)范式则受到质量与效率之间权衡的限制。同时,参数适应由于标记数据的稀缺而受到训练-推理差距的瓶颈。为了增强PLLMs的长期能力,我们提出了TSUBASA,这是一种双管齐下的方法,旨在通过动态记忆演变改善记忆写入,通过自学习与上下文蒸馏目标来内化用户体验,从而改善记忆读取。在使用Qwen-3模型系列(4B到32B)进行的长期基准测试中,广泛评估验证了TSUBASA的有效性,超越了主要依赖记忆写入的竞争性记忆增强系统,如Mem0和Memory-R1。我们的分析进一步确认,TSUBASA突破了质量与效率的障碍,实现了帕累托改进,提供了强健、高保真度的个性化,同时减少了令牌预算。
cs.CL / 30 / 2604.07937
HCRE: LLM-based Hierarchical Classification for Cross-Document Relation Extraction with a Prediction-then-Verification Strategy
HCRE:基于大型语言模型的跨文档关系提取的层次分类与预测后验证策略
Abstract
Cross-document relation extraction (RE) aims to identify relations between the head and tail entities located in different documents. Existing approaches typically adopt the paradigm of ``\textit{Small Language Model (SLM) + Classifier}''. However, the limited language understanding ability of SLMs hinders further improvement of their performance. In this paper, we conduct a preliminary study to explore the performance of Large Language Models (LLMs) in cross-document RE. Despite their extensive parameters, our findings indicate that LLMs do not consistently surpass existing SLMs. Further analysis suggests that the underperformance is largely attributed to the challenges posed by the numerous predefined relations. To overcome this issue, we propose an LLM-based \underline{H}ierarchical \underline{C}lassification model for cross-document \underline{RE} (HCRE), which consists of two core components: 1) an LLM for relation prediction and 2) a \textit{hierarchical relation tree} derived from the predefined relation set. This tree enables the LLM to perform hierarchical classification, where the target relation is inferred level by level. Since the number of child nodes is much smaller than the size of the entire predefined relation set, the hierarchical relation tree significantly reduces the number of relation options that LLM needs to consider during inference. However, hierarchical classification introduces the risk of error propagation across levels. To mitigate this, we propose a \textit{prediction-then-verification} inference strategy that improves prediction reliability through multi-view verification at each level. Extensive experiments show that HCRE outperforms existing baselines, validating its effectiveness.
Chinese Translation
跨文档关系提取(RE)旨在识别位于不同文档中的头实体和尾实体之间的关系。现有方法通常采用“\textit{小型语言模型(SLM)+ 分类器}”的范式。然而,SLM的有限语言理解能力阻碍了其性能的进一步提升。本文进行了一项初步研究,探索大型语言模型(LLMs)在跨文档RE中的表现。尽管LLMs拥有大量参数,我们的研究发现它们并不总是优于现有的SLMs。进一步分析表明,表现不佳主要归因于众多预定义关系所带来的挑战。为了解决这一问题,我们提出了一种基于LLM的\underline{层次}\underline{分类}模型用于跨文档\underline{RE}(HCRE),该模型由两个核心组件组成:1)用于关系预测的LLM和2)从预定义关系集中派生的\textit{层次关系树}。该树使LLM能够进行层次分类,目标关系通过逐级推断得出。由于子节点的数量远小于整个预定义关系集的规模,层次关系树显著减少了LLM在推断过程中需要考虑的关系选项数量。然而,层次分类引入了跨层级的错误传播风险。为此,我们提出了一种\textit{预测后验证}的推断策略,通过在每个层级进行多视角验证来提高预测的可靠性。大量实验表明,HCRE的表现优于现有基线,验证了其有效性。
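The level-by-level inference with per-level verification described in this abstract can be sketched as follows. This is an illustrative reading, not the HCRE implementation: the nested-dict tree, the `score` and `verify` callbacks (stand-ins for LLM calls), and the fallback to the top-ranked child are all assumptions.

```python
from typing import Callable, Dict, List


def classify_hierarchically(
    tree: Dict[str, dict],
    score: Callable[[str], float],
    verify: Callable[[str], bool],
) -> List[str]:
    """Descend a relation tree, choosing one child per level.

    At each level the children are ranked by `score` (prediction step) and
    the first candidate accepted by `verify` is kept (verification step).
    Assumption: if no candidate passes, fall back to the top-ranked one.
    """
    path: List[str] = []
    node = tree
    while node:  # an empty dict marks a leaf relation
        ranked = sorted(node, key=score, reverse=True)
        chosen = next((c for c in ranked if verify(c)), ranked[0])
        path.append(chosen)
        node = node[chosen]
    return path


# Hypothetical two-level relation tree and scores (illustrative only).
relation_tree = {
    "person": {"spouse": {}, "employer": {}},
    "location": {"capital_of": {}},
}
scores = {"person": 0.9, "location": 0.1, "spouse": 0.6,
          "employer": 0.4, "capital_of": 1.0}

accepted_path = classify_hierarchically(relation_tree, scores.get, lambda c: True)
corrected_path = classify_hierarchically(
    relation_tree, scores.get, lambda c: c != "spouse"
)
```

Each level only compares the children of the current node, which is the reduction in candidate relations the abstract refers to; the verification callback is where an early mistake can be caught before it propagates to deeper levels.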
cs.CL / 31 / 2604.07941
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
大型语言模型后训练:离线学习与在线学习的统一视角
Abstract
Post-training has become central to turning pretrained large language models (LLMs) into aligned and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in fragmented ways, organized by labels or objective families rather than by the behavioral bottlenecks they address. This survey argues that LLM post-training is best understood as structured intervention on model behavior. We organize the field first by trajectory provenance, which defines two primary learning regimes: off-policy learning on externally supplied trajectories, and on-policy learning on learner-generated rollouts. We then interpret methods through two recurring roles -- effective support expansion, which makes useful behaviors more reachable, and policy reshaping, which improves behavior within already reachable regions -- together with a complementary systems-level role, behavioral consolidation, which preserves, transfers, and amortizes behavior across stages and model transitions. This perspective yields a unified reading of major paradigms. SFT may serve either support expansion or policy reshaping, whereas preference-based methods are usually off-policy reshaping. On-policy RL often improves behavior on learner-generated states, though under stronger guidance it can also make hard-to-reach reasoning paths reachable. Distillation is often best understood as consolidation rather than only compression, and hybrid pipelines emerge as coordinated multi-stage compositions. Overall, the framework helps diagnose post-training bottlenecks and reason about stage composition, suggesting that progress in LLM post-training increasingly depends on coordinated system design rather than any single dominant objective.
Chinese Translation
后训练已成为将预训练的大型语言模型(LLMs)转变为对齐且可部署系统的核心。近期的进展涵盖了监督微调(SFT)、偏好优化、强化学习(RL)、过程监督、验证者引导方法、蒸馏和多阶段管道。然而,这些方法通常以碎片化的方式讨论,按标签或目标类别组织,而非按其所解决的行为瓶颈进行分类。本调查认为,LLM后训练最佳理解为对模型行为的结构性干预。我们首先按轨迹来源组织该领域,定义了两种主要学习范式:在外部提供的轨迹上进行的离线学习和在学习者生成的回滚上进行的在线学习。接着,我们通过两个反复出现的角色来解读这些方法——有效支持扩展,使有用行为更易于实现,以及政策重塑,改善已可达区域内的行为——以及一个互补的系统级角色,即行为巩固,保持、转移和摊销跨阶段和模型转换的行为。这一视角提供了对主要范式的统一解读。SFT可以作为支持扩展或政策重塑,而基于偏好的方法通常是离线重塑。在线RL通常改善学习者生成状态上的行为,尽管在更强的指导下,它也可以使难以到达的推理路径变得可达。蒸馏通常被理解为巩固而不仅仅是压缩,而混合管道则作为协调的多阶段组合出现。总体而言,该框架有助于诊断后训练瓶颈并推理阶段组合,表明LLM后训练的进展越来越依赖于协调的系统设计,而非任何单一主导目标。
cs.CL / 32 / 2604.07963
Rethinking Data Mixing from the Perspective of Large Language Models
从大型语言模型的视角重新思考数据混合
Abstract
Data mixing strategy is essential for large language model (LLM) training. Empirical evidence shows that inappropriate strategies can significantly reduce generalization. Although recent methods have improved empirical performance, several fundamental questions remain open: what constitutes a domain, whether human and model perceptions of domains are aligned, and how domain weighting influences generalization. We address these questions by establishing formal connections between gradient dynamics and domain distributions, offering a theoretical framework that clarifies the role of domains in training dynamics. Building on this analysis, we introduce DoGraph, a reweighting framework that formulates data scheduling as a graph-constrained optimization problem. Extensive experiments on GPT-2 models of varying scales demonstrate that DoGraph consistently achieves competitive performance.
Chinese Translation
数据混合策略对于大型语言模型(LLM)的训练至关重要。实证证据表明,不恰当的策略会显著降低模型的泛化能力。尽管近期的方法提高了实证性能,但仍有几个基本问题尚未解决:什么构成一个领域,人类与模型对领域的感知是否一致,以及领域加权如何影响泛化能力。我们通过建立梯度动态与领域分布之间的正式联系来解决这些问题,提供了一个理论框架,阐明了领域在训练动态中的作用。在此分析的基础上,我们引入了DoGraph,一个将数据调度形式化为图约束优化问题的重加权框架。在不同规模的GPT-2模型上进行的大量实验表明,DoGraph始终能够实现具有竞争力的性能。
cs.CL / 33 / 2604.07967
AtomEval: Atomic Evaluation of Adversarial Claims in Fact Verification
AtomEval:对对抗性声明在事实验证中的原子评估
Abstract
Adversarial claim rewriting is widely used to test fact-checking systems, but standard metrics fail to capture truth-conditional consistency and often label semantically corrupted rewrites as successful. We introduce AtomEval, a validity-aware evaluation framework that decomposes claims into subject-relation-object-modifier (SROM) atoms and scores adversarial rewrites with Atomic Validity Scoring (AVS), enabling detection of factual corruption beyond surface similarity. Experiments on the FEVER dataset across representative attack strategies and LLM generators show that AtomEval provides more reliable evaluation signals. Using AtomEval, we further analyze LLM-based adversarial generators and observe that stronger models do not necessarily produce more effective adversarial claims under validity-aware evaluation, highlighting previously overlooked limitations in current adversarial evaluation practices.
Chinese Translation
对抗性声明重写被广泛用于测试事实核查系统,但标准指标未能捕捉到真值条件的一致性,且常常将语义上损坏的重写标记为成功。我们提出了AtomEval,这是一种关注有效性的评估框架,它将声明分解为主体-关系-客体-修饰符(SROM)原子,并通过原子有效性评分(Atomic Validity Scoring, AVS)对对抗性重写进行评分,从而能够检测超出表面相似性的事实损坏。在代表性攻击策略和大型语言模型(LLM)生成器的FEVER数据集上的实验表明,AtomEval在我们的实验中提供了更可靠的评估信号。通过使用AtomEval,我们进一步分析了基于LLM的对抗生成器,观察到更强的模型在关注有效性的评估下并不一定产生更有效的对抗性声明,突显了当前对抗性评估实践中被忽视的局限性。
cs.CL / 34 / 2604.07969
Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention
Kathleen:一种基于振荡器的字节级文本分类方法,无需分词或注意力机制
Abstract
We present Kathleen, a text classification architecture that operates directly on raw UTF-8 bytes using frequency-domain processing -- requiring no tokenizer, no attention mechanism, and only 733K parameters. Kathleen introduces three novel components: (1) RecurrentOscillatorBanks -- damped sinusoid convolutions with temporal memory for O(L) sequence processing; (2) an FFT-Rotate Wavetable Encoder that maps all 256 byte values using a single learnable vector (256 floats), replacing conventional embedding tables (65K parameters) while improving accuracy; (3) PhaseHarmonics -- a sinusoidal non-linearity with just 6 learnable phase parameters that our ablation identifies as the single most impactful component (+2.6% accuracy, <0.001% of model parameters). Through comprehensive ablation of a 1.8M-parameter predecessor, we show that frequency-domain components systematically outperform complex cognitive architectures: removing a 560K-parameter bio-inspired framework costs only -0.2%, while removing the 6-parameter PhaseHarmonics costs -2.6%. The resulting Kathleen-Clean achieves 88.6% on IMDB, 92.3% on AG News, and 83.3% on SST-2 -- outperforming a tokenized counterpart with 16x more parameters on IMDB (+1.6%) and AG News (+2.1%). Kathleen processes sequences in O(L) time and memory, enabling byte-level operation at sequence lengths where O(L^2) Transformers exhaust GPU memory.
Chinese Translation
我们提出了Kathleen,这是一种直接对原始UTF-8字节进行处理的文本分类架构,采用频域处理——无需分词器、无需注意力机制,仅需733K参数。Kathleen引入了三个新颖的组件:(1)RecurrentOscillatorBanks——具有时间记忆的阻尼正弦卷积,用于O(L)序列处理;(2)FFT-Rotate Wavetable Encoder,通过一个可学习的向量(256个浮点数)映射所有256个字节值,替代传统的嵌入表(65K参数),同时提高了准确性;(3)PhaseHarmonics——一种仅具有6个可学习相位参数的正弦非线性,我们的消融实验表明这是影响最大的单一组件(+2.6%准确率,<0.001%模型参数)。通过对一个180万参数的前身进行全面消融,我们展示了频域组件系统性地优于复杂的认知架构:去除一个560K参数的生物启发框架仅损失-0.2%,而去除6参数的PhaseHarmonics则损失-2.6%。最终的Kathleen-Clean在IMDB上达到了88.6%,在AG News上达到了92.3%,在SST-2上达到了83.3%——在IMDB上超越了一个具有16倍参数的分词对应模型(+1.6%)和AG News(+2.1%)。Kathleen以O(L)的时间和内存处理序列,使得在O(L^2)的Transformer耗尽GPU内存的序列长度下能够进行字节级操作。
cs.CL / 35 / 2604.07981
A Decomposition Perspective to Long-context Reasoning for LLMs
从分解视角看大型语言模型的长上下文推理
Abstract
Long-context reasoning is essential for complex real-world applications, yet remains a significant challenge for Large Language Models (LLMs). Despite the rapid evolution in long-context reasoning, current research often overlooks the internal complexity of the long-context reasoning task itself. In this paper, we move beyond this holistic view and decompose long-context reasoning into a set of fundamental atomic skills, and we then automatically synthesize a suite of pseudo datasets, each explicitly targeting a specific atomic skill. Our empirical analysis confirms that proficiency in these atomic skills is strongly correlated with general long-text reasoning performance. Building on this insight, we employ reinforcement learning on these pseudo datasets to sharpen the model's atomic skills, in the hope of boosting its general long-context reasoning ability. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our approach: it outperforms a strong baseline by an average margin of 7.7\% (improving from 46.3\% to 54.0\%) across Loogle, Loong, LongBench-v2, BrowscompLong, Ruler-qa2, and MRCR.
Chinese Translation
长上下文推理对于复杂的现实世界应用至关重要,但对于大型语言模型(LLMs)而言仍然是一个重大挑战。尽管在长上下文推理方面的研究迅速发展,当前的研究往往忽视了长上下文推理任务本身的内部复杂性。本文超越了这种整体视角,将长上下文推理分解为一组基本的原子技能,并自动合成了一系列伪数据集,每个数据集明确针对特定的原子技能。我们的实证分析确认,这些原子技能的熟练程度与一般长文本推理性能之间存在强相关性。在此基础上,我们在这些伪数据集上应用强化学习,以提高模型的原子技能,期望提升其整体长上下文推理能力。针对多个基准的广泛实验表明我们的方法的有效性:在Loogle、Loong、LongBench-v2、BrowscompLong、Ruler-qa2和MRCR等数据集上,平均超越强基线7.7%(从46.3%提升至54.0%)。
cs.CL / 36 / 2604.07985
Rag Performance Prediction for Question Answering
基于检索增强生成的问答性能预测
Abstract
We address the task of predicting the gain of using RAG (retrieval augmented generation) for question answering with respect to not using it. We study the performance of a few pre-retrieval and post-retrieval predictors originally devised for ad hoc retrieval. We also study a few post-generation predictors, one of which is novel to this study and posts the best prediction quality. Our results show that the most effective prediction approach is a novel supervised predictor that explicitly models the semantic relationships among the question, retrieved passages, and the generated answer.
Chinese Translation
我们针对预测使用 RAG(检索增强生成)在问答中相较于不使用时的增益这一任务进行了研究。我们考察了几种最初为临时检索设计的预检索和后检索预测器的性能。我们还研究了几种后生成预测器,其中一种是本研究的新颖贡献,并且具有最佳的预测质量。我们的结果表明,最有效的预测方法是一种新颖的监督预测器,它明确建模了问题、检索到的段落和生成答案之间的语义关系。
cs.CL / 37 / 2604.08046
Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation
通过联合解码保证知识整合的检索增强生成
Abstract
Retrieval-Augmented Generation (RAG) significantly enhances Large Language Models (LLMs) by providing access to external knowledge. However, current research primarily focuses on retrieval quality, often overlooking the critical ``integration bottleneck'': even when relevant documents are retrieved, LLMs frequently fail to utilize them effectively due to conflicts with their internal parametric knowledge. In this paper, we argue that implicitly resolving this conflict in a single generation pass is suboptimal. We introduce GuarantRAG, a framework that explicitly decouples reasoning from evidence integration. First, we generate an ``Inner-Answer'' based solely on parametric knowledge to capture the model's reasoning flow. Second, to guarantee faithful evidence extraction, we generate a ``Refer-Answer'' using a novel Contrastive DPO objective. This objective treats the parametric Inner-Answer as a negative constraint and the retrieved documents as positive ground truth, forcing the model to suppress internal hallucinations in favor of external evidence during this phase. Finally, rather than naive concatenation or using the DPO trained model directly, we propose a joint decoding mechanism that dynamically fuses the logical coherence of the Inner-Answer with the factual precision of the Refer-Answer at the token level. Experiments on five QA benchmarks demonstrate that GuarantRAG improves accuracy by up to 12.1% and reduces hallucinations by 16.3% compared to standard and dynamic RAG baselines.
Chinese Translation
检索增强生成(Retrieval-Augmented Generation, RAG)通过提供对外部知识的访问显著增强了大型语言模型(Large Language Models, LLMs)。然而,目前的研究主要集中在检索质量上,往往忽视了关键的“整合瓶颈”:即使检索到相关文档,LLMs也常常因与其内部参数知识的冲突而未能有效利用这些文档。在本文中,我们认为在单次生成过程中隐式解决这种冲突是次优的。我们提出了GuarantRAG,一个明确将推理与证据整合解耦的框架。首先,我们仅基于参数知识生成“内部答案”(Inner-Answer),以捕捉模型的推理流程。其次,为了保证忠实的证据提取,我们使用一种新颖的对比DPO目标生成“参考答案”(Refer-Answer)。该目标将参数化的内部答案视为负约束,将检索到的文档视为正真相,迫使模型在此阶段抑制内部幻觉,以支持外部证据。最后,我们提出了一种联合解码机制,而不是简单的串联或直接使用DPO训练模型,动态地在标记级别融合内部答案的逻辑一致性与参考答案的事实精确性。在五个问答基准上的实验表明,与标准和动态RAG基线相比,GuarantRAG的准确率提高了多达12.1%,幻觉减少了16.3%。
cs.CL / 38 / 2604.08052
Efficient Provably Secure Linguistic Steganography via Range Coding
基于范围编码的高效可证明安全语言隐写术
Abstract
Linguistic steganography involves embedding secret messages within seemingly innocuous texts to enable covert communication. Provable security, which is a long-standing goal and key motivation, has been extended to language-model-based steganography. Previous provably secure approaches have achieved perfect imperceptibility, measured by zero Kullback-Leibler (KL) divergence, but at the expense of embedding capacity. In this paper, we attempt to directly use a classic entropy coding method (range coding) to achieve secure steganography, and then propose an efficient and provably secure linguistic steganographic method with a rotation mechanism. Experiments across various language models show that our method achieves around 100% entropy utilization (embedding efficiency) for embedding capacity, outperforming the existing baseline methods. Moreover, it achieves high embedding speeds (up to 1554.66 bits/s on GPT-2). The code is available at github.com/ryehr/RRC_steganography.
Chinese Translation
语言隐写术涉及将秘密信息嵌入看似无害的文本中,以实现隐秘通信。可证明安全性是一个长期目标和关键动机,已扩展到基于语言模型的隐写术。之前的可证明安全方法在完美不可察觉性方面取得了零Kullback-Leibler (KL) 散度的成就,但以嵌入容量为代价。本文尝试直接使用经典的熵编码方法(范围编码)来实现安全隐写术,并提出了一种高效且可证明安全的语言隐写方法,结合了旋转机制。我们在多种语言模型上的实验表明,我们的方法在嵌入容量方面实现了约100%的熵利用率(嵌入效率),超越了现有的基线方法。此外,它实现了高嵌入速度(在GPT-2上最高可达1554.66比特/秒)。代码可在github.com/ryehr/RRC_steganography获取。
cs.CL / 39 / 2604.08075
Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving
双池令牌预算路由:高效且可靠的LLM服务
Abstract
Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8$\times$ throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch. We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool. Each request is routed based on its estimated total token budget, computed using a per-category bytes-to-token ratio that is learned online via exponential moving average from usage.prompt_tokens feedback, eliminating the need for a tokenizer. We also develop a simple analytical model that predicts fleet-level cost savings from workload characteristics and measured throughput differences, enabling practitioners to estimate benefits prior to deployment. Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31-42%, corresponding to \$2.86M annual savings at fleet scale, while lowering preemption rates by 5.4$\times$ and improving P99 TTFT by 6%. A case study with Qwen3-235B-A22B on AMD MI300X at 10,000 req/s projects \$15.4M in annual savings. The method incurs only O(1) dispatch overhead, adapts automatically to heterogeneous workloads, and composes seamlessly with existing optimizations such as PagedAttention, continuous batching, and prefill-decode disaggregation.
Chinese Translation
生产环境中的vLLM集群通常为每个实例配置最坏情况下的上下文长度,导致KV缓存的过度分配和并发利用不足。实际上,80-95%的请求是短请求,但在为长上下文优化的配置下处理,浪费了4-8倍的吞吐能力,并引发了诸如OOM崩溃、抢占和请求拒绝等可靠性问题。我们识别出这些低效的共同根源:配置与流量的不匹配。我们提出了双池令牌预算路由,这是一种轻量级调度机制,将同质集群划分为两个专业池:高吞吐量的短上下文池和高容量的长上下文池。每个请求根据其估计的总令牌预算进行路由,该预算是通过使用类别的字节到令牌比率在线学习的指数移动平均计算得出的,从而消除了对分词器的需求。我们还开发了一个简单的分析模型,根据工作负载特征和测量的吞吐量差异预测集群级别的成本节省,使从业者能够在部署前估算收益。对来自Azure LLM推理数据集和LMSYS-Chat-1M的真实世界数据的评估,使用A100 GPU服务Llama-3-70B,显示我们的方法将GPU小时减少了31-42%,对应于集群规模下每年节省286万美元,同时将抢占率降低了5.4倍,并改善了P99 TTFT 6%。在AMD MI300X上以10,000请求/秒处理Qwen3-235B-A22B的案例研究预计每年节省1540万美元。该方法仅产生O(1)的调度开销,能够自动适应异构工作负载,并与现有优化(如PagedAttention、连续批处理和预填充解码分解)无缝组合。
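The routing rule in this abstract (estimate a total token budget from prompt bytes and a per-category bytes-per-token ratio, update that ratio by exponential moving average from reported prompt token counts, then pick a pool) can be sketched as below. This is a sketch under stated assumptions, not the paper's code: the class name, the threshold, the smoothing factor, and the initial ratio are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TokenBudgetRouter:
    """Hypothetical dual-pool router; every default here is an assumption."""

    short_pool_limit: int = 4096   # assumed max total tokens for the short pool
    alpha: float = 0.1             # assumed EMA smoothing factor
    default_ratio: float = 4.0     # assumed initial bytes-per-token guess
    ratios: Dict[str, float] = field(default_factory=dict)

    def estimate_budget(self, category: str, prompt: str, max_output_tokens: int) -> int:
        """Estimated prompt tokens (bytes / ratio) plus the requested output tokens."""
        ratio = self.ratios.get(category, self.default_ratio)
        prompt_bytes = len(prompt.encode("utf-8"))
        return int(prompt_bytes / ratio) + max_output_tokens

    def route(self, category: str, prompt: str, max_output_tokens: int) -> str:
        """Constant-time dispatch: no tokenizer call, just a ratio lookup."""
        budget = self.estimate_budget(category, prompt, max_output_tokens)
        return "short" if budget <= self.short_pool_limit else "long"

    def update(self, category: str, prompt: str, prompt_tokens: int) -> None:
        """Fold the observed bytes-per-token ratio into the per-category EMA."""
        if prompt_tokens <= 0:
            return  # ignore malformed feedback
        observed = len(prompt.encode("utf-8")) / prompt_tokens
        previous = self.ratios.get(category, self.default_ratio)
        self.ratios[category] = (1 - self.alpha) * previous + self.alpha * observed


router = TokenBudgetRouter(short_pool_limit=100, alpha=0.5, default_ratio=4.0)
prompt = "a" * 200
first_choice = router.route("chat", prompt, max_output_tokens=40)   # 50 + 40 tokens
router.update("chat", prompt, prompt_tokens=100)                    # observed ratio 2.0
second_choice = router.route("chat", prompt, max_output_tokens=40)  # 66 + 40 tokens
```

In the usage lines, the same request moves from the short pool to the long pool once feedback shows the category is denser in tokens than the initial guess, which is the online adaptation the abstract attributes to the `usage.prompt_tokens` signal.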
cs.CL / 40 / 2604.08104
Quantum Vision Theory Applied to Audio Classification for Deepfake Speech Detection
量子视觉理论在音频分类中的应用:深伪语音检测
Abstract
We propose Quantum Vision (QV) theory as a new perspective for deep learning-based audio classification, applied to deepfake speech detection. Inspired by particle-wave duality in quantum physics, QV theory is based on the idea that data can be represented not only in its observable, collapsed form, but also as information waves. In conventional deep learning, models are trained directly on these collapsed representations, such as images. In QV theory, inputs are first transformed into information waves using a QV block, and then fed into deep learning models for classification. QV-based models improve performance in image classification compared to their non-QV counterparts. What if QV theory is applied to speech spectrograms for audio classification tasks? This is the motivation and novelty of the proposed approach. In this work, Short-Time Fourier Transform (STFT), Mel-spectrograms, and Mel-Frequency Cepstral Coefficients (MFCC) of speech signals are converted into information waves using the proposed QV block and used to train QV-based Convolutional Neural Networks (QV-CNN) and QV-based Vision Transformers (QV-ViT). Extensive experiments are conducted on the ASVspoof dataset for deepfake speech classification. The results show that QV-CNN and QV-ViT consistently outperform standard CNN and ViT models, achieving higher classification accuracy and improved robustness in distinguishing genuine and spoofed speech. Moreover, the QV-CNN model using MFCC features achieves the best overall performance on the ASVspoof dataset, with an accuracy of 94.20% and an EER of 9.04%, while the QV-CNN with Mel-spectrograms attains the highest accuracy of 94.57%. These findings demonstrate that QV theory is an effective and promising approach for audio deepfake detection and opens new directions for quantum-inspired learning in audio perception tasks.
Chinese Translation
我们提出量子视觉(Quantum Vision, QV)理论作为一种新的视角,用于基于深度学习的音频分类,应用于深伪语音检测。受量子物理中粒子-波动二象性的启发,QV理论基于数据不仅可以以其可观察的、坍缩的形式表示,还可以作为信息波的观点。在传统的深度学习中,模型直接在这些坍缩的表示上进行训练,例如图像。在QV理论中,输入首先通过QV模块转换为信息波,然后输入深度学习模型进行分类。与非QV模型相比,基于QV的模型在图像分类中的表现得到了提升。如果将QV理论应用于语音声谱图进行音频分类任务,会怎样呢?这正是我们提出的方法的动机和新颖之处。在本研究中,语音信号的短时傅里叶变换(Short-Time Fourier Transform, STFT)、梅尔声谱图(Mel-spectrograms)和梅尔频率倒谱系数(Mel-Frequency Cepstral Coefficients, MFCC)被转换为信息波,利用所提出的QV模块训练基于QV的卷积神经网络(QV-CNN)和基于QV的视觉变换器(QV-ViT)。我们在ASVspoof数据集上进行了大量实验,以进行深伪语音分类。结果表明,QV-CNN和QV-ViT在分类准确性和区分真实与伪造语音的鲁棒性方面始终优于标准的CNN和ViT模型。此外,使用MFCC特征的QV-CNN模型在ASVspoof数据集上实现了最佳的整体性能,准确率为94.20%,等错误率(EER)为9.04%,而使用梅尔声谱图的QV-CNN则达到了最高的准确率94.57%。这些发现表明QV理论是音频深伪检测的有效且前景广阔的方法,并为音频感知任务中的量子启发学习开辟了新的方向。
cs.CL / 41 / 2604.08118
Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization
初始化决定盆地:极端大规模语言模型量化的高效码本优化
Abstract
Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and fine-tuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio $\rho = N/KM$, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with $\rho$: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
Chinese Translation
加性量化使得以 O(1) 查找表反量化实现极端大规模语言模型压缩成为可能,这使其在边缘部署中具有吸引力。然而,在 2 位精度下,它常常会出现灾难性失败,即使经过广泛的搜索和微调。我们展示了主要瓶颈在于码本初始化。贪婪的顺序初始化常常将模型置于不良的优化区域,后续的束搜索和 PV 调优难以克服这一问题。我们通过表征权重组与码本容量关系的表示比率 $\rho = N/KM$ 来分析这种行为,并提出了 OA-EM,一种使用 Hessian 加权马氏距离的输出感知 EM 初始化方法。在不同的压缩率、搜索预算和三种架构(Llama 3.2 3B、Llama 3.1 8B、Qwen 2.5 3B)下,OA-EM 在 PV 调优后始终产生更好的解决方案,并在质量与计算的边界上占据主导地位。瓶颈的严重程度与 $\rho$ 成比例:在 3 bpp 时为中等,但在 2 bpp 时极端,糟糕的初始化可能会使困惑度恶化几个数量级。更广泛地说,我们的结果强调了压缩模型空间中优化几何的重要性,在这些空间中,初始化可以主导后续的搜索和微调。
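As a minimal sketch of the two quantities named in the abstract above, the snippet below computes the representational ratio and a Hessian-weighted squared distance used for a hard codeword assignment. Reading N as the number of weight groups, K as the number of entries per codebook, and M as the number of codebooks is an assumption, as are all function names and the toy Hessian. It shows only an assignment step, not the full OA-EM procedure, and is not the authors' implementation.

```python
def representational_ratio(n_groups, k_entries, m_codebooks):
    """rho = N / (K * M): weight groups per available codeword (assumed reading)."""
    return n_groups / (k_entries * m_codebooks)


def hessian_weighted_sq_dist(w, c, hessian):
    """(w - c)^T H (w - c) for a weight group w, a codeword c, and a PSD matrix H."""
    d = [wi - ci for wi, ci in zip(w, c)]
    return sum(d[i] * hessian[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))


def assign(w, codebook, hessian):
    """Hard assignment: index of the nearest codeword under the metric H."""
    return min(range(len(codebook)),
               key=lambda k: hessian_weighted_sq_dist(w, codebook[k], hessian))


# Toy example: the first coordinate is assumed to matter 10x more to the output.
identity = [[1.0, 0.0], [0.0, 1.0]]
H = [[10.0, 0.0], [0.0, 1.0]]
codebook = [[1.0, 0.0], [0.0, 1.0]]
w = [0.45, 0.3]

euclid = assign(w, codebook, identity)  # plain Euclidean choice
aware = assign(w, codebook, H)          # output-aware choice
```

In this toy case the two metrics pick different codewords, which is the kind of disagreement an output-aware initialisation is meant to exploit.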
cs.CL / 42 / 2604.08126
LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs
基于大型语言模型的数据生成与低资源法语OSCE临床技能评估
Abstract
Objective Structured Clinical Examinations (OSCEs) are the standard method for assessing medical students' clinical and communication skills through structured patient interviews. In France, however, the organization of training sessions is limited by human and logistical constraints, restricting students' access to repeated practice and structured feedback. Recent advances in Natural Language Processing (NLP) and Large Language Models (LLMs) now offer the opportunity to automatically evaluate such medical interviews, thereby alleviating the need for human examiners during training. Yet, real French OSCE annotated transcripts remain extremely scarce, limiting reproducible research and reliable benchmarking. To address these challenges, we investigate the use of LLMs for both generating and evaluating French OSCE dialogues in a low-resource context. We introduce a controlled pipeline that produces synthetic doctor-patient interview transcripts guided by scenario-specific evaluation criteria, combining ideal and perturbed performances to simulate varying student skill levels. The resulting dialogues are automatically silver-labeled through an LLM-assisted framework supporting adjustable evaluation strictness. Benchmarking multiple open-source and proprietary LLMs shows that mid-size models ($\le$32B parameters) achieve accuracies comparable to GPT-4o ($\sim$90\%) on synthetic data, highlighting the feasibility of locally deployable, privacy-preserving evaluation systems for medical education.
Chinese Translation
目标结构化临床考试(OSCE)是通过结构化的患者访谈评估医学生临床和沟通技能的标准方法。然而,在法国,由于人力和后勤限制,培训课程的组织受到限制,学生无法获得重复练习和结构化反馈的机会。最近,自然语言处理(NLP)和大型语言模型(LLMs)的进展为自动评估此类医学访谈提供了机会,从而减轻了培训期间对人类考官的需求。然而,真实的法语OSCE注释转录本仍然极为稀缺,限制了可重复研究和可靠基准测试。为了解决这些挑战,我们研究了在低资源环境中使用LLMs生成和评估法语OSCE对话的可能性。我们引入了一个受控的流程,生成由特定场景评估标准指导的合成医生-患者访谈转录本,结合理想和扰动表现以模拟不同的学生技能水平。生成的对话通过一个支持可调评估严格性的LLM辅助框架自动进行银标标注。对多个开源和专有LLMs的基准测试表明,中型模型($\le$32B参数)在合成数据上实现的准确率与GPT-4o($\sim$90\%)相当,突显了为医学教育提供本地可部署、保护隐私的评估系统的可行性。
cs.CL / 43 / 2604.08131
Graph Neural Networks for Misinformation Detection: Performance-Efficiency Trade-offs
图神经网络在虚假信息检测中的应用:性能与效率的权衡
Abstract
The rapid spread of online misinformation has led to increasingly complex detection models, including large language models and hybrid architectures. However, their computational cost and deployment limitations raise concerns about practical applicability. In this work, we benchmark graph neural networks (GNNs) against non-graph-based machine learning methods under controlled and comparable conditions. We evaluate lightweight GNN architectures (GCN, GraphSAGE, GAT, ChebNet) against Logistic Regression, Support Vector Machines, and Multilayer Perceptrons across seven public datasets in English, Indonesian, and Polish. All models use identical TF-IDF features to isolate the impact of relational structure. Performance is measured using F1 score, with inference time reported to assess efficiency. GNNs consistently outperform non-graph baselines across all datasets. For example, GraphSAGE achieves 96.8% F1 on Kaggle and 91.9% on WELFake, compared to 73.2% and 66.8% for MLP, respectively. On COVID-19, GraphSAGE reaches 90.5% F1 vs. 74.9%, while ChebNet attains 79.1% vs. 66.4% on FakeNewsNet. These gains are achieved with comparable or lower inference times. Overall, the results show that classic GNNs remain effective and efficient, challenging the need for increasingly complex architectures in misinformation detection.
Chinese Translation
在线虚假信息的快速传播导致检测模型日益复杂,包括大型语言模型和混合架构。然而,它们的计算成本和部署限制引发了对实际应用性的担忧。在本研究中,我们在可控和可比较的条件下,将图神经网络(GNN)与非图基机器学习方法进行基准测试。我们评估了轻量级GNN架构(GCN、GraphSAGE、GAT、ChebNet)与逻辑回归、支持向量机和多层感知机在七个公共数据集(包括英语、印尼语和波兰语)上的表现。所有模型使用相同的TF-IDF特征,以隔离关系结构的影响。性能通过F1分数进行测量,并报告推理时间以评估效率。GNN在所有数据集上始终优于非图基线。例如,GraphSAGE在Kaggle上达到了96.8%的F1分数,而在WELFake上为91.9%,相比之下,MLP分别为73.2%和66.8%。在COVID-19数据集上,GraphSAGE达到了90.5%的F1分数,而MLP为74.9%;ChebNet在FakeNewsNet上达到了79.1%,而MLP为66.4%。这些提升是在可比或更低的推理时间下实现的。总体而言,结果表明经典的GNN仍然有效且高效,挑战了在虚假信息检测中对日益复杂架构的需求。
cs.CL / 44 / 2604.08148
Clickbait detection: quick inference with maximum impact
点击诱饵检测:快速推理与最大影响
Abstract
We propose a lightweight hybrid approach to clickbait detection that combines OpenAI semantic embeddings with six compact heuristic features capturing stylistic and informational cues. To improve efficiency, embeddings are reduced using PCA and evaluated with XGBoost, GraphSAGE, and GCN classifiers. While the simplified feature design yields slightly lower F1-scores, graph-based models achieve competitive performance with substantially reduced inference time. High ROC--AUC values further indicate strong discrimination capability, supporting reliable detection of clickbait headlines under varying decision thresholds.
Chinese Translation
我们提出了一种轻量级的混合方法用于点击诱饵检测,该方法结合了OpenAI语义嵌入与六个紧凑的启发式特征,以捕捉风格和信息线索。为了提高效率,使用主成分分析(PCA)对嵌入进行降维,并使用XGBoost、GraphSAGE和GCN分类器进行评估。尽管简化的特征设计导致F1分数略有降低,但基于图的模型在显著减少推理时间的情况下仍能实现竞争性能。高ROC-AUC值进一步表明了强大的区分能力,支持在不同决策阈值下可靠地检测点击诱饵标题。
cs.CL / 45 / 2604.08156
Training Data Size Sensitivity in Unsupervised Rhyme Recognition
无监督押韵识别中的训练数据规模敏感性研究
Abstract
Rhyme is deceptively intuitive: what is or is not a rhyme is constructed historically, scholars struggle with rhyme classification, and people disagree on whether two words are rhymed or not. This complicates automated rhyme recognition and evaluation, especially in multilingual contexts. This article investigates how much training data is needed for reliable unsupervised rhyme recognition using RhymeTagger, a language-independent tool that identifies rhymes based on repeating patterns in poetry corpora. We evaluate its performance across seven languages (Czech, German, English, French, Italian, Russian, and Slovene), examining how training size and language differences affect accuracy. To set a realistic performance benchmark, we assess inter-annotator agreement on a manually annotated subset of poems and analyze factors contributing to disagreement in expert annotations: phonetic similarity between rhyming words and their distance from each other in a poem. We also compare RhymeTagger to three large language models using a one-shot learning strategy. Our findings show that, once provided with sufficient training data, RhymeTagger consistently outperforms human agreement, while LLMs lacking phonetic representation significantly struggle with the task.
Chinese Translation
押韵看似直观,但其界定具有历史构建性,学者们在押韵分类上存在困难,人们对于两词是否押韵也常有分歧。这给自动化押韵识别和评估带来了复杂性,尤其是在多语言环境下。本文探讨了使用RhymeTagger(一种基于诗歌语料中重复模式识别押韵的语言无关工具)进行可靠无监督押韵识别所需的训练数据量。我们在捷克语、德语、英语、法语、意大利语、俄语和斯洛文尼亚语七种语言中评估其性能,考察训练规模和语言差异对准确率的影响。为设定现实的性能基准,我们评估了人工标注诗歌子集中的标注者间一致性,并分析了专家标注分歧的影响因素:押韵词的语音相似度及其在诗中距离。我们还将RhymeTagger与采用单样本(one-shot)学习策略的三种大型语言模型进行了比较。结果表明,在提供充足训练数据的情况下,RhymeTagger稳定超越人类一致性水平,而缺乏语音表征的LLM在该任务上表现显著受限。
cs.CL / 46 / 2604.08243
Self-Debias: Self-correcting for Debiasing Large Language Models
自我去偏见:自我纠正大型语言模型的去偏见方法
Abstract
Although Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, inherent social biases often cascade throughout the Chain-of-Thought (CoT) process, leading to continuous "Bias Propagation". Existing debiasing methods primarily focus on static constraints or external interventions, failing to identify and interrupt this propagation once triggered. To address this limitation, we introduce Self-Debias, a progressive framework designed to instill intrinsic self-correction capabilities. Specifically, we reformulate the debiasing process as a strategic resource redistribution problem, treating the model's output probability mass as a limited resource to be reallocated from biased heuristics to unbiased reasoning paths. Unlike standard preference optimization which applies broad penalties, Self-Debias employs a fine-grained trajectory-level objective subject to dynamic debiasing constraints. This enables the model to selectively revise biased reasoning suffixes while preserving valid contextual prefixes. Furthermore, we integrate an online self-improvement mechanism utilizing consistency filtering to autonomously synthesize supervision signals. With merely 20k annotated samples, Self-Debias activates efficient self-correction, achieving superior debiasing performance while preserving general reasoning capabilities without continuous external oversight.
Chinese Translation
尽管大型语言模型(LLMs)展现出卓越的推理能力,但固有的社会偏见常常在思维链(Chain-of-Thought, CoT)过程中蔓延,导致持续的“偏见传播”。现有的去偏见方法主要集中在静态约束或外部干预上,未能在偏见传播一旦触发后识别和中断这一过程。为了解决这一局限性,我们提出了自我去偏见(Self-Debias),一个旨在培养内在自我纠正能力的渐进框架。具体而言,我们将去偏见过程重新表述为一个战略资源再分配问题,将模型的输出概率质量视为有限资源,从偏见启发式重新分配到无偏见的推理路径。与标准的偏好优化方法采用广泛的惩罚不同,自我去偏见采用细粒度的轨迹级目标,受动态去偏见约束的影响。这使得模型能够选择性地修正偏见推理后缀,同时保留有效的上下文前缀。此外,我们整合了一种在线自我改进机制,利用一致性过滤自主合成监督信号。仅需20k个标注样本,自我去偏见便能激活高效的自我纠正,实现优越的去偏见性能,同时在没有持续外部监督的情况下保持一般推理能力。
cs.CL / 47 / 2604.08256
HyperMem: Hypergraph Memory for Long-Term Conversations
HyperMem:用于长期对话的超图记忆
Abstract
Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues. However, existing approaches such as Retrieval-Augmented Generation (RAG) and graph-based memory mostly rely on pairwise relations, which can hardly capture high-order associations, i.e., joint dependencies among multiple elements, causing fragmented retrieval. To this end, we propose HyperMem, a hypergraph-based hierarchical memory architecture that explicitly models such associations using hyperedges. Particularly, HyperMem structures memory into three levels: topics, episodes, and facts, and groups related episodes and their facts via hyperedges, unifying scattered content into coherent units. Leveraging this structure, we design a hybrid lexical-semantic index and a coarse-to-fine retrieval strategy, supporting accurate and efficient retrieval of high-order associations. Experiments on the LoCoMo benchmark show that HyperMem achieves state-of-the-art performance with 92.73% LLM-as-a-judge accuracy, demonstrating the effectiveness of HyperMem for long-term conversations.
Chinese Translation
长期记忆对于对话代理保持连贯性、跟踪持续任务以及在扩展对话中提供个性化互动至关重要。然而,现有的方法如检索增强生成(Retrieval-Augmented Generation, RAG)和基于图的记忆主要依赖于成对关系,难以捕捉高阶关联,即多个元素之间的联合依赖,导致检索碎片化。为此,我们提出了HyperMem,一种基于超图的分层记忆架构,明确地使用超边建模这些关联。具体而言,HyperMem将记忆结构分为三个层次:主题、情节和事实,并通过超边将相关情节及其事实进行分组,将分散的内容统一为连贯的单元。利用这一结构,我们设计了混合词汇-语义索引和粗到细的检索策略,支持高阶关联的准确和高效检索。在LoCoMo基准上的实验表明,HyperMem实现了92.73%的LLM作为评判者的准确率,展示了HyperMem在长期对话中的有效性。
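The three-level layout and coarse-to-fine retrieval described in this abstract can be illustrated with a small sketch. Everything here is hypothetical: the class names, the `overlap` scorer, and the toy memory are assumptions for illustration, and plain token overlap stands in for the paper's hybrid lexical-semantic index, which is not reproduced.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    summary: str
    facts: list = field(default_factory=list)


@dataclass
class Hyperedge:
    """One topic-level hyperedge joining any number of related episodes."""
    topic: str
    episodes: list = field(default_factory=list)


def overlap(query, text):
    """Lexical score: count of shared lowercase tokens (stand-in for a real index)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query, memory, top_facts=2):
    """Coarse-to-fine: best topic first, then its best episode, then ranked facts."""
    edge = max(memory, key=lambda e: overlap(query, e.topic))
    episode = max(edge.episodes, key=lambda ep: overlap(query, ep.summary))
    ranked = sorted(episode.facts, key=lambda f: overlap(query, f), reverse=True)
    return edge.topic, episode.summary, ranked[:top_facts]


memory = [
    Hyperedge("japan trip", [
        Episode("booking the flight to tokyo", ["flight departs on may 3", "seat 14a"]),
        Episode("planning kyoto temples", ["wants to visit kinkakuji"]),
    ]),
    Hyperedge("new job", [Episode("first week at work", ["manager is called dana"])]),
]

result = retrieve("when does the flight for my japan trip depart", memory)
```

Because a hyperedge holds a list of episodes rather than a pair, one lookup at the topic level brings back the whole group, which is the property pairwise graph memories lack.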
cs.CL / 48 / 2604.08260
Behavior-Aware Item Modeling via Dynamic Procedural Solution Representations for Knowledge Tracing
基于动态过程解决方案表示的行为感知项目建模用于知识追踪
Abstract
Knowledge Tracing (KT) aims to predict learners' future performance from past interactions. While recent KT approaches have improved via learning item representations aligned with Knowledge Components, they overlook the procedural dynamics of problem solving. We propose Behavior-Aware Item Modeling (BAIM), a framework that enriches item representations by integrating dynamic procedural solution information. BAIM leverages a reasoning language model to decompose each item's solution into four problem-solving stages (i.e., understand, plan, carry out, and look back), pedagogically grounded in Polya's framework. Specifically, it derives stage-level representations from per-stage embedding trajectories, capturing latent signals beyond surface features. To reflect learner heterogeneity, BAIM adaptively routes these stage-wise representations, introducing a context-conditioned mechanism within a KT backbone, allowing different procedural stages to be emphasized for different learners. Experiments on XES3G5M and NIPS34 show that BAIM consistently outperforms strong pretraining-based baselines, achieving particularly large gains under repeated learner interactions.
Chinese Translation
知识追踪(Knowledge Tracing, KT)旨在通过过去的交互预测学习者的未来表现。尽管近期的KT方法通过学习与知识组件对齐的项目表示有所改进,但它们忽视了问题解决的过程动态。我们提出了行为感知项目建模(Behavior-Aware Item Modeling, BAIM),这是一个通过整合动态过程解决方案信息来丰富项目表示的框架。BAIM利用推理语言模型将每个项目的解决方案分解为四个问题解决阶段(即理解、计划、执行和回顾),这些阶段在Polya的框架中具有教育基础。具体而言,它从每个阶段的嵌入轨迹中推导出阶段级表示,捕捉超越表面特征的潜在信号。为了反映学习者的异质性,BAIM自适应地路由这些阶段性表示,在KT主干中引入了一个上下文条件机制,使不同的学习者能够强调不同的过程阶段。在XES3G5M和NIPS34上的实验表明,BAIM始终优于强大的基于预训练的基线,在重复学习者交互下取得了特别显著的提升。
cs.CL / 49 / 2604.08275
Floating or Suggesting Ideas? A Large-Scale Contrastive Analysis of Metaphorical and Literal Verb-Object Constructions
浮动还是暗示思想?隐喻与字面动词-宾语构造的大规模对比分析
Abstract
Metaphor pervades everyday language, allowing speakers to express abstract concepts via concrete domains. While prior work has studied metaphors cognitively and psycholinguistically, large-scale comparisons with literal language remain limited, especially for near-synonymous expressions. We analyze 297 English verb-object pairs (e.g., float idea vs. suggest idea) in ~2M corpus sentences, examining their contextual usage. Using five NLP tools, we extract 2,293 cognitive and linguistic features capturing affective, lexical, syntactic, and discourse-level properties. We address: (i) whether features differ between metaphorical and literal contexts (cross-pair analysis), and (ii) whether individual VO pairs diverge internally (within-pair analysis). Cross-pair results show literal contexts have higher lexical frequency, cohesion, and structural regularity, while metaphorical contexts show greater affective load, imageability, lexical diversity, and constructional specificity. Within-pair analyses reveal substantial heterogeneity, with most pairs showing non-uniform effects. These results suggest no single, consistent distributional pattern that distinguishes metaphorical from literal usage. Instead, differences are largely construction-specific. Overall, large-scale data combined with diverse features provides a fine-grained understanding of metaphor-literal contrasts in VO usage.
Chinese Translation
隐喻贯穿于日常语言,使说话者能够通过具体领域表达抽象概念。尽管之前的研究从认知和心理语言学的角度研究了隐喻,但与字面语言的大规模比较仍然有限,尤其是在近义表达方面。我们分析了297对英语动词-宾语(例如,float idea与suggest idea)在约200万句语料中的用法,考察其上下文使用情况。通过五种自然语言处理工具,我们提取了2,293个认知和语言特征,捕捉情感、词汇、句法和话语层面的属性。我们关注:(i)特征在隐喻和字面上下文之间是否存在差异(跨对分析),以及(ii)个别动词-宾语对内部是否存在差异(内部对分析)。跨对结果显示,字面上下文具有更高的词汇频率、连贯性和结构规律性,而隐喻上下文则表现出更大的情感负荷、意象性、词汇多样性和构造特异性。内部对分析揭示了显著的异质性,大多数对显示出不均匀的效应。这些结果表明,没有单一一致的分布模式可以区分隐喻和字面用法。相反,差异在很大程度上是构造特定的。总体而言,大规模数据结合多样化特征提供了对动词-宾语使用中隐喻与字面对比的细致理解。
cs.CL / 50 / 2604.08281
When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning
何时信任工具?工具集成数学推理的自适应工具信任校准
Abstract
Large reasoning models (LRMs) have achieved strong performance gains by scaling test-time computation, but due to the inherent limitations of the underlying language models, they still have shortcomings in tasks that require precise computation and extensive knowledge reserves. Tool-Integrated Reasoning (TIR) has emerged as a promising paradigm that incorporates tool calls and their execution within the reasoning trajectory. Although recent works have released some powerful open-source TIR models, our analysis reveals that these models still suffer from critical deficiencies. We find that when the reasoning of the model conflicts with the tool results, the model tends to believe in its own reasoning. There are also cases where the tool results are correct but are ignored by the model, resulting in incorrect answers; we define this failure as "Tool Ignored". This indicates that the model does not know when to trust or ignore the tool. To overcome these limitations, we introduce Adaptive Tool Trust Calibration (ATTC), a novel framework that guides the model to adaptively choose to trust or ignore the tool results based on the confidence score of generated code blocks. The experimental results from various open-source TIR models of different sizes and across multiple datasets demonstrate that ATTC effectively reduces the "Tool Ignored" issue, resulting in a performance increase of 4.1% to 7.5%.
Chinese Translation
大型推理模型(LRMs)通过扩展测试时间计算实现了显著的性能提升,但由于基础语言模型的固有限制,它们在需要精确计算和广泛知识储备的任务中仍存在不足。工具集成推理(TIR)作为一种新兴的有前景的范式,在推理过程中结合了工具调用和执行。尽管最近的研究发布了一些强大的开源TIR模型,但我们的分析显示这些模型仍然存在关键缺陷。我们发现,当模型的推理与工具结果相冲突时,模型倾向于相信自己的推理。而且在某些情况下,工具结果是正确的,但被模型忽视,导致错误答案,我们将其定义为“工具被忽视(Tool Ignored)”。这表明模型不知道何时信任或忽视工具。为了解决这些局限性,我们引入了自适应工具信任校准(ATTC),这是一个新颖的框架,指导模型根据生成代码块的置信度分数自适应地选择信任或忽视工具结果。来自不同规模的多种开源TIR模型和多个数据集的实验结果表明,ATTC有效减少了“工具被忽视”问题,性能提升幅度在4.1%到7.5%之间。
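The abstract states only that ATTC decides whether to trust a tool result from the confidence score of the generated code block. A minimal sketch of such a gate follows, under loudly stated assumptions: the confidence score is taken to be the geometric mean of the code tokens' probabilities, and the threshold of 0.8 is arbitrary. Neither choice comes from the paper, and the function names are hypothetical.

```python
import math


def code_confidence(token_logprobs):
    """Geometric-mean token probability of a generated code block (assumed score)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def choose_answer(model_answer, tool_answer, token_logprobs, threshold=0.8):
    """Trust the tool when the code behind its result was generated confidently."""
    if tool_answer is None:  # the tool failed or returned nothing
        return model_answer
    if code_confidence(token_logprobs) >= threshold:
        return tool_answer
    return model_answer


confident = [math.log(0.95)] * 20  # code tokens generated with high probability
shaky = [math.log(0.5)] * 20       # code tokens generated with low probability

trusted = choose_answer("41", "42", confident)
ignored = choose_answer("41", "42", shaky)
```

The point of gating on the code rather than on the final answer is that a confidently written program is more likely to have computed what the model intended, so its output deserves priority over a conflicting chain of thought.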
cs.CL / 51 / 2604.08284
Distributed Multi-Layer Editing for Rule-Level Knowledge in Large Language Models
大语言模型中规则级知识的分布式多层编辑
Abstract
Large language models store not only isolated facts but also rules that support reasoning across symbolic expressions, natural language explanations, and concrete instances. Yet most model editing methods are built for fact-level knowledge, assuming that a target edit can be achieved through a localized intervention. This assumption does not hold for rule-level knowledge, where a single rule must remain consistent across multiple interdependent forms. We investigate this problem through a mechanistic study of rule-level knowledge editing. To support this study, we extend the RuleEdit benchmark from 80 to 200 manually verified rules spanning mathematics and physics. Fine-grained causal tracing reveals a form-specific organization of rule knowledge in transformer layers: formulas and descriptions are concentrated in earlier layers, while instances are more associated with middle layers. These results suggest that rule knowledge is not uniformly localized, and therefore cannot be reliably edited by a single-layer or contiguous-block intervention. Based on this insight, we propose Distributed Multi-Layer Editing (DMLE), which applies a shared early-layer update to formulas and descriptions and a separate middle-layer update to instances. While remaining competitive on standard editing metrics, DMLE achieves substantially stronger rule-level editing performance. On average, it improves instance portability and rule understanding by 13.91 and 50.19 percentage points, respectively, over the strongest baseline across GPT-J-6B, Qwen2.5-7B, Qwen2-7B, and LLaMA-3-8B. The code is available at https://github.com/Pepper66/DMLE.
Chinese Translation
大语言模型不仅存储孤立的事实,还存储支持跨符号表达、自然语言解释和具体实例推理的规则。然而,大多数模型编辑方法是针对事实级知识构建的,假设通过局部干预可以实现目标编辑。这一假设对于规则级知识并不成立,因为单一规则必须在多个相互依赖的形式中保持一致。我们通过对规则级知识编辑的机制研究来探讨这个问题。为了支持这一研究,我们将RuleEdit基准从80个手动验证的规则扩展到200个,涵盖数学和物理。细粒度的因果追踪揭示了变换器层中规则知识的形式特定组织:公式和描述集中在早期层,而实例则更多地与中间层相关联。这些结果表明,规则知识并不是均匀局部化的,因此不能通过单层或连续块的干预可靠地编辑。基于这一见解,我们提出了分布式多层编辑(Distributed Multi-Layer Editing, DMLE),该方法对公式和描述应用共享的早期层更新,对实例应用单独的中间层更新。在保持在标准编辑指标上具有竞争力的同时,DMLE在规则级编辑性能上实现了显著提升。平均而言,它在实例可移植性和规则理解上分别比最强基线提高了13.91和50.19个百分点,适用于GPT-J-6B、Qwen2.5-7B、Qwen2-7B和LLaMA-3-8B。代码可在 https://github.com/Pepper66/DMLE 获取。
cs.CL / 52 / 2604.08299
SeLaR: Selective Latent Reasoning in Large Language Models
SeLaR:大型语言模型中的选择性潜在推理
Abstract
Chain-of-Thought (CoT) has become a cornerstone of reasoning in large language models, yet its effectiveness is constrained by the limited expressiveness of discrete token sampling. Recent latent reasoning approaches attempt to alleviate this limitation by replacing discrete tokens with soft embeddings (probability-weighted mixtures of token embeddings) or hidden states, but they commonly suffer from two issues: (1) global activation injects perturbations into high-confidence steps, impairing reasoning stability; and (2) soft embeddings quickly collapse toward the highest-probability token, limiting exploration of alternative trajectories. To address these challenges, we propose SeLaR (Selective Latent Reasoning), a lightweight and training-free framework. SeLaR introduces an entropy-gated mechanism that activates soft embeddings only at low-confidence steps, while preserving discrete decoding at high-confidence steps. Additionally, we propose an entropy-aware contrastive regularization that pushes soft embeddings away from the dominant (highest-probability) token's direction, encouraging sustained exploration of multiple latent reasoning paths. Experiments on five reasoning benchmarks demonstrate that SeLaR consistently outperforms standard CoT and state-of-the-art training-free methods.
Chinese Translation
链式思维(Chain-of-Thought, CoT)已成为大型语言模型推理的基石,但其有效性受到离散令牌采样表达能力有限的制约。近期的潜在推理方法试图通过用软嵌入(基于概率加权的令牌嵌入混合)或隐藏状态替代离散令牌来缓解这一限制,但它们通常面临两个问题:(1)全局激活会在高置信度步骤中引入扰动,损害推理的稳定性;(2)软嵌入迅速向最高概率令牌的方向崩溃,限制了对替代轨迹的探索。为了解决这些挑战,我们提出了SeLaR(选择性潜在推理),这是一个轻量级且无需训练的框架。SeLaR引入了一种熵门控机制,仅在低置信度步骤激活软嵌入,同时在高置信度步骤保持离散解码。此外,我们提出了一种基于熵的对比正则化,推动软嵌入远离主导(最高概率)令牌的方向,鼓励对多条潜在推理路径的持续探索。在五个推理基准上的实验表明,SeLaR始终优于标准的CoT和最先进的无训练方法。
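The entropy-gated mechanism in this abstract can be sketched in a few lines: at each step, the next input is a probability-weighted mixture of token embeddings when the next-token distribution is uncertain, and the single top token's embedding otherwise. The threshold `tau`, the toy embeddings, and the function names are assumptions for illustration; the entropy-aware contrastive regularization is not reproduced here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def next_input_embedding(probs, embeddings, tau=0.5):
    """Soft embedding at low-confidence steps, discrete embedding otherwise."""
    if entropy(probs) > tau:
        dim = len(embeddings[0])
        # probability-weighted mixture of token embeddings
        return [sum(p * e[i] for p, e in zip(probs, embeddings)) for i in range(dim)]
    top = max(range(len(probs)), key=lambda t: probs[t])
    return list(embeddings[top])


E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]            # toy 3-token vocabulary
sure = next_input_embedding([0.98, 0.01, 0.01], E)  # low entropy: discrete token
unsure = next_input_embedding([0.5, 0.5, 0.0], E)   # high entropy: soft mixture
```

Gating this way leaves high-confidence steps untouched, which is how the approach avoids the perturbation problem the abstract attributes to globally activated latent reasoning.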
cs.CL / 53 / 2604.08362
Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
迈向真实世界人类行为模拟:对大型语言模型在长时间跨度、跨场景和异构行为轨迹上的基准测试
Abstract
The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.
Chinese Translation
大型语言模型(LLMs)的出现揭示了通用用户模拟器的潜力。然而,现有的基准测试仍然局限于孤立场景、狭窄的行动空间或合成数据,未能捕捉到真实人类行为的整体特性。为了解决这一问题,我们提出了OmniBehavior,这是第一个完全基于真实世界数据构建的用户模拟基准,将长时间跨度、跨场景和异构行为模式整合到一个统一框架中。基于该基准,我们首先提供实证证据,表明以孤立场景为基础的先前数据集存在视野狭窄的问题,而真实世界的决策过程依赖于长期的跨场景因果链。对最先进的LLMs进行的广泛评估显示,当前模型在准确模拟这些复杂行为方面存在困难,尽管上下文窗口不断扩大,但性能仍然停滞不前。重要的是,模拟行为与真实行为之间的系统比较揭示了一种根本的结构偏差:LLMs倾向于趋向于一个积极的平均个体,表现出过度活跃、角色同质化和乌托邦偏见。这导致个体差异和长尾行为的丧失,突显了未来高保真模拟研究的重要方向。
cs.CL / 54 / 2604.08381
A GAN and LLM-Driven Data Augmentation Framework for Dynamic Linguistic Pattern Modeling in Chinese Sarcasm Detection
基于GAN与大型语言模型驱动的数据增强框架用于中文讽刺检测中的动态语言模式建模
Abstract
Sarcasm is a rhetorical device that expresses criticism or emphasizes characteristics of certain individuals or situations through exaggeration, irony, or comparison. Existing methods for Chinese sarcasm detection are constrained by limited datasets and high construction costs, and they mainly focus on textual features, overlooking user-specific linguistic patterns that shape how opinions and emotions are expressed. This paper proposes a Generative Adversarial Network (GAN) and Large Language Model (LLM)-driven data augmentation framework to dynamically model users' linguistic patterns for enhanced Chinese sarcasm detection. First, we collect raw data from various topics on Sina Weibo. Then, we train a GAN on these data and apply a GPT-3.5 based data augmentation technique to synthesize an extended sarcastic comment dataset, named SinaSarc. This dataset contains target comments, contextual information, and user historical behavior. Finally, we extend the BERT architecture to incorporate multi-dimensional information, particularly user historical behavior, enabling the model to capture dynamic linguistic patterns and uncover implicit sarcastic cues in comments. Experimental results demonstrate the effectiveness of our proposed method. Specifically, our model achieves the highest F1-scores on both the non-sarcastic and sarcastic categories, with values of 0.9138 and 0.9151 respectively, which outperforms all existing state-of-the-art (SOTA) approaches. This study presents a novel framework for dynamically modeling users' long-term linguistic patterns in Chinese sarcasm detection, contributing to both dataset construction and methodological advancement in this field.
Chinese Translation
讽刺是一种通过夸张、讽刺或比较来表达批评或强调特定个体或情境特征的修辞手法。现有的中文讽刺检测方法受限于数据集规模有限和构建成本高昂,且主要侧重于文本特征,忽视了塑造观点和情感表达方式的用户特定语言模式。本文提出了一种基于生成对抗网络(GAN)和大型语言模型(LLM)驱动的数据增强框架,以动态建模用户语言模式,从而提升中文讽刺检测效果。首先,我们从新浪微博的多个话题中收集原始数据。随后,在这些数据上训练GAN,并采用基于GPT-3.5的数据增强技术合成扩展的讽刺评论数据集SinaSarc,该数据集包含目标评论、上下文信息及用户历史行为。最后,我们扩展了BERT架构,融合多维信息,特别是用户历史行为,使模型能够捕捉动态语言模式,挖掘评论中的隐含讽刺线索。实验结果表明,所提方法效果显著,模型在非讽刺和讽刺类别上分别达到0.9138和0.9151的最高F1分数,优于所有现有最先进方法。该研究提出了一个动态建模用户长期语言模式的创新框架,为中文讽刺检测领域的数据集构建和方法论进步做出了贡献。
cs.CL / 55 / 2604.08423
Synthetic Data for any Differentiable Target
用于任何可微目标的合成数据
Abstract
What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.
Chinese Translation
通过合成训练数据控制语言模型的极限是什么?我们开发了一种强化学习(RL)原语,数据集策略梯度(Dataset Policy Gradient, DPG),可以精确优化合成数据生成器,以生成一组目标示例的数据集。当用于目标模型的监督微调(SFT)时,这些示例使目标模型在我们选择的可微度量上表现良好。我们的方法通过利用高阶梯度进行精确的数据归因,并将这些分数作为策略梯度奖励来实现这一点。我们证明了该过程与合成数据生成器的真实、难以处理的梯度非常接近。为了说明DPG的潜力,我们展示了仅通过对生成示例进行SFT,我们可以使目标模型的语言模型头权重(LM head weights)(1)嵌入一个二维码,(2)嵌入模式$\texttt{67}$,以及(3)具有更低的$\ell^2$范数。我们还展示了我们可以使生成器(4)用一种新语言重新表述输入,以及(5)生成特定的UUID,即使这些目标在生成器的输入提示中并未传达。这些发现表明,DPG是一种强大而灵活的技术,仅使用合成训练示例即可塑造模型属性。
cs.CL / 56 / 2604.08448
AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages
AfriVoices-KE:肯尼亚语言的多语种语音数据集
Abstract
AfriVoices-KE is a large-scale multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across diverse regions and demographics. This work addresses the critical underrepresentation of African languages in speech technology by providing a high-quality, linguistically diverse resource. Data collection followed a dual methodology: scripted recordings drew from compiled text corpora, translations, and domain-specific generated sentences spanning eleven domains relevant to the Kenyan context, while unscripted speech was elicited through textual and image prompts to capture natural linguistic variation and dialectal nuances. A customized mobile application enabled contributors to record using smartphones. Quality assurance operated at multiple layers, encompassing automated signal-to-noise ratio validation prior to recording and human review for content accuracy. Though the project encountered challenges common to low-resource settings, including unreliable infrastructure, device compatibility issues, and community trust barriers, these were mitigated through local mobilizers, stakeholder partnerships, and adaptive training protocols. AfriVoices-KE provides a foundational resource for developing inclusive automatic speech recognition and text-to-speech systems, while advancing the digital preservation of Kenya's linguistic heritage.
Chinese Translation
AfriVoices-KE 是一个大规模多语种语音数据集,涵盖约3000小时的音频,涉及五种肯尼亚语言:Dholuo、Kikuyu、Kalenjin、Maasai 和 Somali。该数据集包括750小时的脚本语音和2250小时的自发语音,采集自4777名来自不同地区和人口统计背景的母语者。本研究通过提供高质量、语言多样性的资源,解决了非洲语言在语音技术中严重缺乏代表性的问题。数据采集采用双重方法:脚本录音基于汇编的文本语料库、翻译文本及涵盖肯尼亚相关十一大领域的领域特定生成句子;自发语音则通过文本和图像提示引导,以捕捉自然语言变异和方言细微差别。定制的移动应用支持贡献者使用智能手机进行录音。质量保障涵盖多个层面,包括录音前的自动信噪比验证和内容准确性的人为审核。尽管项目面临低资源环境下常见的挑战,如基础设施不稳定、设备兼容性问题及社区信任障碍,但通过本地动员者、利益相关者合作及灵活的培训方案得以缓解。AfriVoices-KE 为开发包容性的自动语音识别(ASR)和文本转语音(TTS)系统提供了基础资源,同时推动了肯尼亚语言遗产的数字化保护。
cs.CL / 57 / 2604.08479
AI generates well-liked but templatic empathic responses
人工智能生成受欢迎但模板化的同理心回应
Abstract
Recent research shows that greater numbers of people are turning to Large Language Models (LLMs) for emotional support, and that people rate LLM responses as more empathic than human-written responses. We suggest a reason for this success: LLMs have learned and consistently deploy a well-liked template for expressing empathy. We develop a taxonomy of 10 empathic language "tactics" that include validating someone's feelings and paraphrasing, and apply this taxonomy to characterize the language that people and LLMs produce when writing empathic responses. Across a set of 2 studies comparing a total of n = 3,265 AI-generated (by six models) and n = 1,290 human-written responses, we find that LLM responses are highly formulaic at a discourse functional level. We discovered a template -- a structured sequence of tactics -- that matches between 83--90% of LLM responses (and 60--83% in a held out sample), and when those are matched, covers 81--92% of the response. By contrast, human-written responses are more diverse. We end with a discussion of implications for the future of AI-generated empathy.
Chinese Translation
近期研究表明,越来越多的人开始寻求大型语言模型(Large Language Models, LLMs)提供情感支持,并且人们认为LLM的回应比人类撰写的回应更具同理心。我们提出了这一成功的原因:LLM已经学习并持续使用一种受欢迎的同理心表达模板。我们开发了一个包含10种同理心语言“策略”的分类法,这些策略包括验证他人的感受和释义,并将该分类法应用于描述人类和LLM在撰写同理心回应时所使用的语言。在对比了共计n = 3,265条由六个模型生成的AI回应和n = 1,290条人类撰写的回应的两项研究中,我们发现LLM的回应在话语功能层面上高度公式化。我们发现了一种模板——一个结构化的策略序列——该模板与83%至90%的LLM回应相匹配(在一个保留样本中为60%至83%),并且当这些回应匹配时,覆盖了81%至92%的回应内容。相比之下,人类撰写的回应则更加多样化。最后,我们讨论了这一发现对未来AI生成同理心的影响。
cs.CL / 58 / 2604.08510
What do Language Models Learn and When? The Implicit Curriculum Hypothesis
语言模型学习了什么以及何时学习?隐性课程假说
Abstract
Large language models (LLMs) can perform remarkably complex tasks, yet the fine-grained details of how these capabilities emerge during pretraining remain poorly understood. Scaling laws on validation loss tell us how much a model improves with additional compute, but not what skills it acquires in which order. To remedy this, we propose the Implicit Curriculum Hypothesis: pretraining follows a compositional and predictable curriculum across models and data mixtures. We test this by designing a suite of simple, composable tasks spanning retrieval, morphological transformations, coreference, logical reasoning, and mathematics. Using these tasks, we track emergence points across four model families spanning sizes from 410M-13B parameters. We find that emergence orderings of when models reach fixed accuracy thresholds are strikingly consistent ($\rho = .81$ across 45 model pairs), and that composite tasks most often emerge after their component tasks. Furthermore, we find that this structure is encoded in model representations: tasks with similar function vector representations also tend to follow similar trajectories in training. By using the space of representations derived from our task set, we can effectively predict the training trajectories of simple held-out compositional tasks throughout the course of pretraining ($R^2 = .68$-$.84$ across models) without previously evaluating them. Together, these results suggest that pretraining is more structured than loss curves reveal: skills emerge in a compositional order that is consistent across models and readable from their internals.
Chinese Translation
大型语言模型(LLMs)能够执行极其复杂的任务,但这些能力在预训练过程中是如何逐步显现的细节仍然不甚清楚。验证损失的规模法则告诉我们模型在增加计算资源后如何改进,但并未说明模型以何种顺序获得哪些技能。为了解决这个问题,我们提出了隐性课程假说:预训练遵循一种跨模型和数据混合的组合性和可预测的课程。我们通过设计一系列简单的可组合任务来测试这一假说,这些任务涵盖了检索、形态变换、共指消解、逻辑推理和数学等领域。利用这些任务,我们跟踪了四个模型家族(参数规模从410M到13B)中能力显现的节点。我们发现,当模型达到固定的准确性阈值时,能力显现的顺序出奇一致(在45对模型中,$\rho = .81$),而复合任务通常在其组成任务之后显现。此外,我们发现这种结构在模型表示中被编码:具有相似功能向量表示的任务在训练过程中也往往遵循相似的轨迹。通过利用从我们的任务集派生的表示空间,我们能够有效预测简单的保留组合任务在预训练过程中的训练轨迹(在模型中,$R^2 = .68$-$.84$),而无需事先评估它们。综合来看,这些结果表明,预训练比损失曲线所揭示的更具结构性:技能以一种在模型间一致且可从其内部可读的组合顺序显现。
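The consistency measure quoted above ($\rho$ over emergence orderings) can be sketched as a Spearman rank correlation between the steps at which two models first reach a fixed accuracy threshold on each task. The accuracy curves, the threshold, and the helper names below are illustrative assumptions, not the paper's data or code.

```python
def emergence_step(curve, threshold):
    """First checkpoint step whose accuracy reaches the threshold, else None.

    `curve` is a list of (step, accuracy) pairs sorted by step.
    """
    for step, acc in curve:
        if acc >= threshold:
            return step
    return None


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one ordering is constant")
    return cov / (vx * vy) ** 0.5


# Toy emergence steps for four tasks in two hypothetical models.
model_a = [1000, 2000, 3000, 4000]
model_b = [1500, 3500, 2500, 6000]  # tasks 2 and 3 swap order
rho = spearman(model_a, model_b)
```

Because only ranks enter the statistic, models trained for different numbers of steps can be compared directly, which is presumably why a rank correlation suits this kind of cross-model comparison.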
cs.CL / 59 / 2604.08519
Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
减少记忆以容纳更多:训练数据修剪改善事实记忆
Abstract
Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110M parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.
Chinese Translation
大型语言模型(LLMs)在其参数中记忆事实知识时常常面临困难,这往往导致幻觉和在知识密集型任务上的表现不佳。本文从信息论的角度对事实记忆进行了形式化,并研究了训练数据分布如何影响事实的准确性。我们表明,当训练数据中的事实所包含的信息量超过模型容量时,事实准确性是次优的(低于容量限制)。当事实频率分布偏斜(例如,幂律分布)时,这一问题进一步加剧。我们提出了一种仅基于训练损失的数据选择方案,旨在限制训练数据中的事实数量并平滑其频率分布。在包含高熵事实的半合成数据集上,我们的选择方法有效地将事实准确性提升至容量限制。当从头开始在标注的维基百科语料库上预训练语言模型时,我们的选择方法使得一个GPT2-Small模型(110M参数)能够记忆比标准训练多1.3倍的实体事实,达到与在完整数据集上预训练的10倍大模型(1.3B参数)相匹配的性能。
cs.CL / 60 / 2604.08523
ClawBench: Can AI Agents Complete Everyday Online Tasks?
ClawBench:人工智能代理能否完成日常在线任务?
Zhang, Yuxuan, Wang, Yubo, Zhu, Yipeng, Du, Penghui, Miao, Junwen, Lu, Xuan, Xu, Wendong, Hao, Yunzhuo, Cai, Songcheng, Wang, Xiaochen, Zhang, Huaisong, Wu, Xian, Lu, Yi, Lei, Minyi, Zou, Kai, Yin, Huifeng, Nie, Ping, Chen, Liang, Jiang, Dongfu, Chen, Wenhu, Allen, Kelsey R.
Abstract
AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
Chinese Translation
人工智能代理可能能够自动化您的收件箱,但它们能否自动化您生活中的其他常规方面?日常在线任务为评估下一代人工智能代理提供了一个现实但尚未解决的测试平台。为此,我们引入了ClawBench,一个评估框架,包含153个简单任务,这些任务是人们在生活和工作中需要定期完成的,涵盖了15个类别中的144个实时平台,从完成购买和预约到提交求职申请。这些任务要求具备超出现有基准的能力,例如从用户提供的文档中获取相关信息、在不同平台上导航多步骤工作流程,以及进行大量书写操作,如正确填写许多详细表格。与现有基准不同,后者在静态页面的离线沙箱中评估代理,ClawBench在生产网站上运行,保留了现实世界网络交互的全部复杂性、动态特性和挑战。一个轻量级的拦截层仅捕获并阻止最终提交请求,确保安全评估而不产生现实世界的副作用。我们对7个前沿模型的评估显示,无论是专有模型还是开源模型,都只能完成这些任务的一小部分。例如,Claude Sonnet 4.6仅完成了33.3%。在ClawBench上的进展使我们更接近能够作为可靠通用助手运行的人工智能代理。
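The interception idea described above (let all traffic through except the final submission, which is captured and blocked) can be sketched in a few lines. This is a sketch under stated assumptions, not ClawBench's implementation: the `Request` type, the `SubmissionInterceptor` class, the rule "POST to a known final endpoint", and the example URLs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Request:
    """Minimal stand-in for an outgoing HTTP request (hypothetical type)."""
    method: str
    url: str
    body: str = ""


@dataclass
class SubmissionInterceptor:
    """Forwards everything except requests judged to be the final submission.

    `final_url_suffixes` is an assumed per-task configuration listing the
    endpoints that would cause a real-world side effect.
    """
    final_url_suffixes: tuple
    captured: list = field(default_factory=list)

    def is_final_submission(self, request):
        # Assumed rule: only POSTs to a configured final endpoint are blocked.
        return (
            request.method.upper() == "POST"
            and request.url.endswith(self.final_url_suffixes)
        )

    def handle(self, request):
        if self.is_final_submission(request):
            # Keep the payload for scoring, but never send it to the site.
            self.captured.append(request)
            return "blocked"
        return "forwarded"


interceptor = SubmissionInterceptor(("/checkout/submit", "/apply/submit"))
outcomes = [
    interceptor.handle(Request("GET", "https://shop.example/cart")),
    interceptor.handle(Request("POST", "https://shop.example/cart/add", "item=1")),
    interceptor.handle(Request("POST", "https://shop.example/checkout/submit", "order=42")),
]
```

Capturing the blocked payload is what makes evaluation possible: the agent's final form contents can be compared against the expected submission without the purchase or application ever reaching the live site.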
cs.CL / 61 / 2604.08527
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
揭示 OPD:大语言模型的长度膨胀与稳定化策略
Abstract
On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo abrupt length inflation, causing truncated trajectories to dominate the training data. This truncation collapse coincides with abrupt repetition saturation and induces biased gradient signals, leading to severe training instability and sharp degradation in validation performance. We attribute this problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts. To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.
Chinese Translation
同策略蒸馏(On-policy Distillation, OPD)在学生模型自身诱导的分布下对其进行训练,同时利用来自更强教师的监督。我们识别出 OPD 的一种失效模式:随着训练的进行,同策略展开(rollouts)可能会经历突发的长度膨胀,导致截断的轨迹主导训练数据。这种截断崩溃与突发的重复饱和同时出现,并引发有偏的梯度信号,导致严重的训练不稳定性和验证性能的急剧下降。我们将此问题归因于学生诱导的数据收集与蒸馏目标之间的相互作用,这种相互作用隐含地偏向于长且重复的展开。为了解决这一问题,我们提出了 StableOPD,这是一种稳定化的 OPD 框架,结合了基于参考的散度约束与展开混合蒸馏。这些方法共同减轻了由重复引起的长度膨胀,并进一步稳定了 OPD 训练。在多个数学推理数据集上,我们的方法有效防止了截断崩溃,稳定了训练动态,并平均提高了 7.2% 的性能。
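The failure mode described above (rollouts growing until truncated, repetitive trajectories dominate a batch) suggests two simple per-batch diagnostics. The sketch below is a monitoring aid under assumed definitions, not the StableOPD method: treating a rollout as truncated when it reaches the length cap, and measuring repetition as the share of repeated n-grams, are both choices made here for illustration.

```python
from collections import Counter


def truncation_rate(rollouts, max_len):
    """Share of rollouts whose length reached the generation cap."""
    if not rollouts:
        return 0.0
    return sum(len(r) >= max_len for r in rollouts) / len(rollouts)


def repetition_rate(tokens, n=3):
    """Fraction of n-grams that duplicate an n-gram seen earlier in the rollout."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


# Toy batch of token-id sequences with a length cap of 8.
batch = [
    [1, 2, 3],                    # short, finished normally
    [1, 2, 3, 4, 5, 6, 7, 8],     # hit the cap without repeating
    [7, 7, 7, 7, 7, 7, 7, 7],     # hit the cap by repeating one token
]
trunc = truncation_rate(batch, max_len=8)
reps = [repetition_rate(r) for r in batch]
```

Logging both numbers per training step would show whether a rise in truncation coincides with a rise in repetition, which is the pattern the abstract reports.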