cs.RO / 1 / 2605.15298
PhysBrain 1.0 Technical Report
PhysBrain 1.0 技术报告
Abstract
Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.
Chinese Translation
视觉-语言-动作模型快速发展,但仅依赖机器人轨迹对广泛的物理理解学习覆盖有限。PhysBrain 1.0 研究了一条互补的路径:在机器人适应之前,将大规模人类自我中心视频转换为结构化的物理常识监督。我们的数据引擎提取场景元素、空间动态、动作执行和深度感知关系,然后将其转化为用于训练 PhysBrain VLMs 的问答监督。所得到的物理先验进一步通过一种保持能力和对语言敏感的适应设计转移到 VLA 策略中。在多模态问答基准和具身控制基准中,包括 ERQA、PhysBench、SimplerEnv-WidowX、LIBERO 和 RoboCasa,PhysBrain 1.0 实现了最先进的结果,并在 SimplerEnv 上表现出特别强的域外性能。这些结果表明,从人类互动视频中扩展物理常识可以为多模态理解与机器人动作之间提供有效的桥梁。
cs.RO / 2 / 2605.15336
HoloMotion-1 Technical Report
HoloMotion-1 技术报告
Abstract
In this report, we present HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. A key innovation of HoloMotion-1 is to scale control-policy training with a large-scale hybrid motion corpus, where video-reconstructed motions from in-the-wild videos provide the dominant source of motion diversity, while curated motion-capture and in-house motion data provide higher-fidelity supervision and deployment-oriented coverage. This data regime enables HoloMotion-1 to move beyond conventional MoCap-only training and exposes the policy to substantially broader behaviors, capture conditions, and motion styles. Learning from such heterogeneous data introduces new challenges, including reconstruction noise, source-domain mismatch, uneven motion quality, and the need for temporal modeling under large behavioral variation. To address these challenges, HoloMotion-1 integrates large-capacity temporal modeling, a sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, and a sequence-level training strategy that improves learning efficiency on extended motion sequences. Extensive experiments on multiple unseen motion benchmarks show that HoloMotion-1 generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning.
Chinese Translation
在本报告中,我们介绍了 HoloMotion-1,一个用于零样本全身运动跟踪的人形运动基础模型。HoloMotion-1 的一个关键创新是利用大规模混合运动语料库来扩展控制策略的训练,其中来自野外视频的重建运动提供了主要的运动多样性来源,而精选的运动捕捉和内部运动数据则提供了更高保真度的监督和面向部署的覆盖。这种数据模式使 HoloMotion-1 超越传统的仅基于运动捕捉(MoCap)的训练,并使策略接触到更广泛的行为、捕捉条件和运动风格。从如此异质的数据中学习带来了新的挑战,包括重建噪声、源域不匹配、运动质量不均,以及在大行为变化下进行时间建模的需求。为了解决这些挑战,HoloMotion-1 集成了大容量的时间建模、采用 KV-cache 推理以实现实时控制的稀疏激活专家混合 Transformer(Mixture-of-Experts Transformer),以及一种提升长运动序列学习效率的序列级训练策略。在多个未见运动基准上的广泛实验表明,HoloMotion-1 在多样化的运动类型和捕捉条件下具有良好的泛化能力,相较于之前的方法显著提高了跟踪精度,并且能够直接迁移到真实的人形机器人上,而无需特定任务的微调。
cs.RO / 3 / 2605.15352
Diffusion Policy for Coordinated Control of a Nonholonomic Mobile Base and Dual Arms in Door Opening and Passing
用于非完整移动底盘和双臂协调控制的扩散策略在开门和通过中的应用
Abstract
Opening heavy, self-closing doors, especially those that require pulling, remains a long-standing challenge in robotics. Humans naturally employ both arms in a dexterous manner, rotating the handle, widening the gap, holding the door, switching arms when needed, and moving through while maintaining clearance. To replicate such behaviors, a robot must perform a long sequence of motions spanning multiple stages and interactions with different parts of the door. Traditional approaches rely on state machines that transition between manually defined stages (e.g., pulling after the knob is rotated, passing after the gap is sufficiently wide). While intuitive, these methods lack robustness, as hand-crafted trajectories fail to generalize to the diversity of real-world conditions without extensive engineering effort. Recent advances in imitation learning offer a scalable alternative, yet no existing visual action model has demonstrated simultaneous coordination of a nonholonomic base and dual arms for the complete door opening and passing task. In this paper, we tackle this complex, highly constrained problem using a diffusion-based visuomotor control policy. Our results demonstrate that a single end-to-end policy can be learned to execute long-horizon tasks requiring tight coordination between manipulation and locomotion. The resulting policy not only achieves a high success rate in opening and traversing damped pull doors but also demonstrates strong robustness to external disturbances, capabilities that are difficult to realize with traditional methods.
Chinese Translation
打开沉重的自闭门,尤其是那些需要拉动的门,一直以来都是机器人技术中的一大挑战。人类自然地以灵活的方式使用双臂,旋转门把手、扩大缝隙、支撑门、在需要时切换手臂,并在保持间隙的同时通过。为了复制这种行为,机器人必须执行一系列跨越多个阶段并与门的不同部分进行交互的长序列动作。传统方法依赖于状态机在手动定义的阶段之间转换(例如,旋转把手后拉动,缝隙足够宽后通过)。尽管这些方法直观,但由于手工设计的轨迹难以适应现实世界条件的多样性,缺乏鲁棒性,通常需要大量的工程努力。最近在模仿学习方面的进展提供了一种可扩展的替代方案,但现有的视觉动作模型尚未展示出同时协调非完整底盘和双臂以完成开门和通过任务的能力。在本文中,我们使用基于扩散的视觉运动控制策略来解决这一复杂且高度受限的问题。我们的结果表明,可以学习到一个单一的端到端策略,以执行需要操作和移动之间紧密协调的长时间任务。所得到的策略不仅在打开和穿越阻尼拉门时实现了高成功率,而且在面对外部干扰时表现出强大的鲁棒性,这些能力是传统方法难以实现的。
cs.RO / 4 / 2605.15430
Where to Perch in a Tree: Vision-Guidance for Tree-Grasping Drones
树上栖息的最佳位置:基于视觉引导的树木抓取无人机
Abstract
This study demonstrates a method to locate an ideal perch location on a tree for vision-guided autonomous tree-perching drones. Various image processing algorithms, including those used for machine learning, image segmentation and binary image morphology, are implemented to assess the shape and structure of a tree. Rather than identifying the closest available branch, this study builds on vision methods by evaluating the potential of each branch, determining its suitability for perching based on factors such as branch width, slope (angle to the horizontal) and curvature. For a given tree-perching drone and a dataset of more than 10,000 urban tree images taken from February to October in a subtropical and temperate monsoon climate, the proposed method successfully produces a result for 76% of feasible targets. A feasible target is defined as a tree where the branch diameters are sufficiently thick and where the available perching space is at least equal to the width of a tendon-driven grasping claw. These successful preliminary results create a foundation from which a number of identified improvements and additional features can be developed to create a generalised method; this will involve the incorporation of supplementary data from depth perception and attitude sensors to enhance the branch assessment.
Chinese Translation
本研究展示了一种为视觉引导的自主树木栖息无人机定位理想栖息位置的方法。实施了多种图像处理算法,包括用于机器学习、图像分割和二值图像形态学的算法,以评估树木的形状和结构。本研究并非简单识别最近的可用树枝,而是通过评估每个树枝的潜力,基于树枝宽度、坡度(与水平面的角度)和曲率等因素来确定其栖息的适宜性。对于给定的树木栖息无人机以及在亚热带和温带季风气候下从二月到十月拍摄的超过10,000张城市树木图像的数据集,所提出的方法成功地为76%的可行目标生成了结果。可行目标被定义为树木的树枝直径足够粗,并且可用的栖息空间至少等于一个由腱驱动的抓取爪的宽度。这些成功的初步结果为一系列已识别的改进和附加特征的开发奠定了基础,以创建一种通用方法;这将涉及结合来自深度感知和姿态传感器的补充数据,以增强对树枝的评估。
cs.RO / 5 / 2605.15480
Residual Reinforcement Learning for Robot Teleoperation under Stochastic Delays
针对随机延迟的机器人遥操作的残差强化学习
Abstract
Stochastic communication delays in teleoperation introduce signal discontinuities that undermine control stability and degrade control performance. Consequently, conventional reinforcement learning (RL) methods struggle with the delayed, discontinuous observations, which leads to high-frequency chattering. To address this, we propose a hybrid control framework, delay-resilient RL, that integrates a Long Short-Term Memory (LSTM) state estimator with a residual RL policy and is resilient to stochastic delays. The LSTM reconstructs smooth, continuous state estimates from delayed observations, enabling the RL agent to learn a residual torque compensation policy that balances tracking accuracy with velocity smoothness. Experimental validation on Franka Panda robots demonstrates that our approach significantly outperforms the state-of-the-art baselines, ensuring robust and stable teleoperation even under high-variance stochastic delays.
Chinese Translation
遥操作中的随机通信延迟引入了信号的不连续性,削弱了控制的稳定性并降低了控制性能。因此,传统的强化学习(RL)方法在面对延迟观测时表现不佳,导致高频抖动。为了解决这个问题,我们提出了一种混合控制框架——延迟弹性强化学习(delay-resilient RL),该框架将利用长短期记忆(LSTM)的状态估计器与残差RL策略相结合,能够抵御随机延迟。LSTM能够从延迟观测中重构平滑、连续的状态估计,使得RL代理能够学习一种残差扭矩补偿策略,在跟踪精度与速度平滑性之间取得平衡。对Franka Panda机器人的实验验证表明,我们的方法显著优于现有的最先进基准,确保即使在高方差随机延迟下也能实现稳健和稳定的遥操作。
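The residual structure described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which nominal controller is used, so a PD law stands in for it, and the gains, the clipping limit, and the names `pd_torque` and `residual_command` are hypothetical. The state estimate is taken as given rather than produced by an LSTM.

```python
import numpy as np

def pd_torque(q, dq, q_ref, kp=50.0, kd=5.0):
    """Nominal torque from a PD law (an assumed stand-in for the base controller)."""
    return kp * (q_ref - q) - kd * dq

def residual_command(q_est, dq_est, q_ref, residual, limit=2.0):
    """Add a bounded learned correction on top of the nominal torque.

    q_est and dq_est are the smoothed state estimates (assumed given here);
    residual is the raw policy output, clipped to +/- limit.
    """
    nominal = pd_torque(q_est, dq_est, q_ref)
    return nominal + np.clip(residual, -limit, limit)

# Single-joint example: the policy asks for more than the allowed correction.
tau = residual_command(np.array([0.0]), np.array([0.0]),
                       np.array([0.1]), residual=np.array([5.0]))
```

Bounding the residual keeps the learned term a small correction around the nominal controller, which is one common way to limit the effect of a poorly trained policy.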
cs.RO / 6 / 2605.15486
Hybrid LLM-based Intelligent Framework for Robot Task Scheduling
基于混合大型语言模型的机器人任务调度智能框架
Abstract
This study introduces intelligent frameworks that use Large Language Models (LLMs) to improve task scheduling for construction robots. The LLM is fed with key data about the desired task, such as agent action abilities, and the desired end goal to be achieved. A well-balanced allocation strategy is developed, optimizing both time efficiency and resource utilization. Our system utilizes a Natural Language Processing interface to streamline communication with construction professionals and adapt in real-time to unexpected site conditions. We concurrently use two LLM agents, specifically generator (GPT-4) and supervisor (Gemma 3/Llama 4/Mistral 7b) LLM agents to provide a more precise task schedule. We evaluate the proposed methodology using a straightforward scenario and provide metric scores to prove the efficacy of the frameworks. Our results highlight that the implementation of LLMs is crucial in construction operational tasks including robots.
Chinese Translation
本研究介绍了一种智能框架,利用大型语言模型(LLMs)来改善建筑机器人的任务调度。该大型语言模型输入关于所需任务的关键数据,例如代理的行动能力和希望实现的最终目标。我们开发了一种平衡的分配策略,优化时间效率和资源利用。我们的系统利用自然语言处理接口,简化与建筑专业人士的沟通,并实时适应意外的现场条件。我们同时使用两个大型语言模型代理,分别是生成器(GPT-4)和监督者(Gemma 3/Llama 4/Mistral 7b)模型,以提供更精确的任务调度。我们使用一个简单的场景评估所提出的方法,并提供指标分数以证明框架的有效性。我们的结果强调了在建筑操作任务中实施大型语言模型的重要性,包括机器人任务。
cs.RO / 7 / 2605.15492
FLASH: Efficient Visuomotor Policy via Sparse Sampling
FLASH:通过稀疏采样实现高效的视觉运动策略
Abstract
Generative models such as diffusion and flow matching have become dominant paradigms for visuomotor policy learning, yet their reliance on iterative denoising incurs high inference latency incompatible with real-time robotic control. We present Fast Legendre-polynomial Action policy via Sparse History-anchored flow (FLASH Policy), which replaces discrete action-chunk generation with continuous Legendre polynomial trajectory representation. Specifically, by fitting expert demonstrations under sparse temporal sampling, FLASH enables a single inference to cover a significantly extended action horizon. To further accelerate generation, FLASH initiates the flow matching process from history polynomial coefficients rather than uninformative Gaussian noise, shortening the transport distance and enabling accurate single-step inference. Moreover, analytic polynomial differentiation directly provides desired velocity feed-forward signals to the torque controller without numerical approximation. Extensive experiments on five simulated and two real-world manipulation tasks demonstrate that FLASH achieves state-of-the-art success rates ($\ge 92\%$ across all tasks), a per-episode inference time of $31.40\,ms$ (up to $175\times$ faster than diffusion policies and $18\times$ faster than prior flow matching policies), up to $4\times$ faster training convergence than ACT, and $5\times$ to $7\times$ reduction in controller tracking error compared to discrete-action baselines.
Chinese Translation
生成模型如扩散和流匹配已成为视觉运动策略学习的主流范式,但它们对迭代去噪的依赖导致了高推理延迟,不适合实时机器人控制。我们提出了通过稀疏历史锚定流的快速勒让德多项式动作策略(FLASH Policy),该策略用连续的勒让德多项式轨迹表示替代了离散动作块的生成。具体而言,通过在稀疏时间采样下拟合专家演示,FLASH使得单次推理能够覆盖显著扩展的动作视野。为了进一步加速生成,FLASH从历史多项式系数而非无信息的高斯噪声开始流匹配过程,从而缩短了传输距离并实现了准确的单步推理。此外,解析多项式微分直接为扭矩控制器提供所需的速度前馈信号,而无需数值近似。在五个模拟和两个真实世界操控任务上的广泛实验表明,FLASH达到了最先进的成功率(所有任务均≥92%),每个回合的推理时间为31.40毫秒(比扩散策略快最多175倍,比之前的流匹配策略快18倍),训练收敛速度比ACT快最多4倍,并且与离散动作基线相比,控制器跟踪误差减少了5倍到7倍。
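A minimal sketch of the polynomial trajectory representation described in this abstract, assuming a single joint, a degree-5 fit, and NumPy's Legendre helpers. The sample data and the names `fit_trajectory`, `position_at`, and `velocity_at` are hypothetical and not taken from the FLASH paper. The flow-matching policy itself is not shown.

```python
import numpy as np
from numpy.polynomial import legendre as L

def to_unit_interval(t, span):
    """Map time t in [t0, t1] to s in [-1, 1], the natural Legendre domain."""
    t0, t1 = span
    return 2.0 * (t - t0) / (t1 - t0) - 1.0

def fit_trajectory(times, positions, degree=5):
    """Least-squares fit of Legendre coefficients to sparsely sampled positions."""
    span = (times[0], times[-1])
    coeffs = L.legfit(to_unit_interval(times, span), positions, degree)
    return coeffs, span

def position_at(coeffs, span, t):
    """Evaluate the continuous trajectory at any time inside the span."""
    return L.legval(to_unit_interval(t, span), coeffs)

def velocity_at(coeffs, span, t):
    """Analytic velocity: differentiate the series, then apply the chain rule ds/dt."""
    t0, t1 = span
    dcoeffs = L.legder(coeffs)
    return L.legval(to_unit_interval(t, span), dcoeffs) * 2.0 / (t1 - t0)

# Nine sparse samples of a smooth motion over two seconds (illustrative data).
times = np.linspace(0.0, 2.0, 9)
positions = np.sin(times)
coeffs, span = fit_trajectory(times, positions)
```

Because the velocity comes from differentiating the series rather than from finite differences between discrete actions, it can be queried at the controller's own rate.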
cs.RO / 8 / 2605.15496
LAPS: Improving Incremental LiDAR Mapping using Active Pooling and Sampling for Neural Distance Fields
LAPS:通过主动池化和采样改善增量LiDAR映射的神经距离场
Abstract
Neural distance fields offer a compact and continuous representation of 3D geometry, making them attractive for incremental LiDAR mapping. However, their online optimization is vulnerable to catastrophic forgetting, where new observations can degrade previously reconstructed geometry. Replay-based training is commonly used to address this issue, but existing methods typically rely on passive replay buffers and uniform sampling, which can waste memory on redundant observations and under-train poorly constrained regions. We propose LAPS, a replay management framework for incremental neural mapping that improves both replay retention and replay allocation during online updates. LAPS combines reliability-based active pooling to retain reliable historical samples under limited memory with uncertainty-guided active sampling to focus optimization on under-constrained regions. Experiments on synthetic and real-world benchmarks show that LAPS consistently improves reconstruction completeness while maintaining competitive geometric accuracy. On Oxford Spires, it improves recall by 4.66 pp and F1-score by 3.79 pp over PIN-SLAM on the Blenheim Palace 05 sequence. We release our open source implementation at: https://github.com/dongjae0107/LAPS.
Chinese Translation
神经距离场提供了一种紧凑且连续的3D几何表示,使其在增量LiDAR映射中具有吸引力。然而,它们的在线优化容易受到灾难性遗忘的影响,即新观察可能会降低先前重建几何体的质量。基于重放的训练通常用于解决此问题,但现有方法通常依赖于被动重放缓冲区和均匀采样,这可能会在冗余观察上浪费内存,并且对约束较差的区域训练不足。我们提出了LAPS,这是一种增量神经映射的重放管理框架,旨在改善在线更新过程中的重放保留和重放分配。LAPS结合了基于可靠性的主动池化,以在有限内存下保留可靠的历史样本,并结合不确定性引导的主动采样,以将优化重点放在约束不足的区域。对合成和真实世界基准的实验表明,LAPS在保持竞争几何精度的同时,始终提高重建完整性。在牛津尖塔上,它在布伦海姆宫05序列上相较于PIN-SLAM提高了4.66个百分点的召回率和3.79个百分点的F1-score。我们在以下网址发布了我们的开源实现:https://github.com/dongjae0107/LAPS。
cs.RO / 9 / 2605.15510
A QUBO Formulation Framework for Kinematic Structure-Based Robot Design Optimization: A Robotic Hand Case Study
基于运动学结构的机器人设计优化的 QUBO 公式框架:以机器人手为案例研究
Abstract
This paper presents a quadratic unconstrained binary optimization-based formulation framework for robot design optimization using kinematic structure-level evaluation metrics. In the proposed framework, classical computation is used to evaluate design-dependent metrics while the resulting combinatorial selection problem is formulated in a structure compatible with quantum annealing-based optimization. A robotic hand is adopted as a representative case study, as its performance is determined by both the individual kinematic characteristics of each finger and interaction terms. The proposed formulation incorporates individual design rewards, overlap workspace interactions, one-hot constraint, and structural dependency penalties into a unified quadratic model. A 27-variable robotic hand design problem is constructed, and simulated annealing is used as a classical baseline to verify the feasibility of the formulation. Quantum annealing is further performed to examine the applicability of the proposed formulation to annealing-based hardware execution. The results show that feasible design combinations satisfying both one-hot selection and pairwise constraints can be obtained, with the observed objective-value range becoming narrower as the number of reads increases. In addition, the formulation process is discussed for other robotic systems. The proposed framework provides a generalized approach for transforming kinematic structure-based robot design problems into combinatorial optimization problems.
Chinese Translation
本文提出了一种基于二次无约束二进制优化(QUBO)的公式框架,用于利用运动学结构层级评估指标进行机器人设计优化。在所提出的框架中,经典计算用于评估依赖于设计的指标,而由此产生的组合选择问题则以与量子退火优化兼容的结构进行公式化。机器人手被选为代表性案例研究,因为其性能受到每个手指的个体运动学特性和交互项的共同影响。所提出的公式将个体设计奖励、重叠工作空间交互、单热约束和结构依赖惩罚整合到一个统一的二次模型中。构建了一个包含27个变量的机器人手设计问题,并使用模拟退火作为经典基线来验证该公式的可行性。进一步进行量子退火,以检验所提出公式在基于退火的硬件执行中的适用性。结果表明,可以获得满足单热选择和成对约束的可行设计组合,并且随着读取次数的增加,观察到的目标值范围变得更窄。此外,讨论了该公式过程在其他机器人系统中的应用。所提出的框架为将基于运动学结构的机器人设计问题转化为组合优化问题提供了一种通用方法。
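To make the formulation concrete, here is a small sketch of how design rewards, a one-hot constraint, and a pairwise penalty can be folded into one QUBO matrix. The problem size (two fingers with three options each), all numeric values, and the function names are hypothetical. Exhaustive enumeration stands in for the simulated and quantum annealers used in the paper. The one-hot term uses the expansion of P(Σx − 1)², which for binary variables gives −P on the diagonal and +2P on each pair within a group, up to a constant.

```python
import itertools
import numpy as np

def build_qubo(rewards, groups, pair_penalties, one_hot_weight):
    """Assemble an upper-triangular QUBO matrix Q for minimizing x^T Q x."""
    n = len(rewards)
    Q = np.zeros((n, n))
    for i, r in enumerate(rewards):
        Q[i, i] -= r                      # reward lowers the energy
    for group in groups:                  # exactly one option per group
        for i in group:
            Q[i, i] -= one_hot_weight
        for i, j in itertools.combinations(group, 2):
            Q[i, j] += 2.0 * one_hot_weight
    for (i, j), w in pair_penalties.items():
        Q[min(i, j), max(i, j)] += w      # discourage conflicting choices
    return Q

def brute_force_minimum(Q):
    """Enumerate all bit strings; only practical for very small problems."""
    best_x, best_e = None, np.inf
    for bits in itertools.product((0, 1), repeat=Q.shape[0]):
        x = np.array(bits)
        e = float(x @ Q @ x)
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

rewards = [1.0, 3.0, 2.0, 2.5, 3.0, 1.0]      # illustrative design rewards
groups = [[0, 1, 2], [3, 4, 5]]               # options per finger
pair_penalties = {(1, 4): 5.0}                # options 1 and 4 conflict
Q = build_qubo(rewards, groups, pair_penalties, one_hot_weight=10.0)
best_x, best_e = brute_force_minimum(Q)
```

In this toy instance the two highest-reward options conflict, so the minimum selects option 1 for the first finger and option 3 for the second.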
cs.RO / 10 / 2605.15517
Terrain Consistent Reference-Guided RL for Humanoid Navigation Autonomy
基于地形一致性的参考引导强化学习在人形导航自主性中的应用
Abstract
We present a method for training reference-guided, perceptive reinforcement learning locomotion policies for humanoid robots in which reference trajectories are modulated in training to be consistent with terrain geometry. Aiming to deploy our method with standard navigation autonomy infrastructure, we synthesize SE(2)-controllable reference trajectories inside the RL training loop, projecting desired footsteps onto valid footholds and adjusting swing-foot and center-of-mass trajectories to match the terrain. The resulting policy exposes a clean SE(2) velocity interface compatible with standard navigation planners. In simulation, environmentally-conditioned references significantly improve reference tracking performance compared to environment agnostic references. On hardware, we integrate the policy with an MPC + control barrier function planner and demonstrate long-horizon (>70m) closed-loop autonomous navigation on the Unitree G1 through outdoor environments containing rough terrain and consecutive flights of stairs, with all sensing and computation onboard.
Chinese Translation
我们提出了一种训练参考引导的感知强化学习步态策略的方法,适用于人形机器人,其中参考轨迹在训练过程中被调节以与地形几何形状保持一致。为了将我们的方法应用于标准导航自主基础设施,我们在强化学习训练循环中合成可控的 SE(2) 参考轨迹,将期望的步伐投影到有效的支撑点上,并调整摆动脚和质心轨迹以匹配地形。最终得到的策略暴露出一个干净的 SE(2) 速度接口,与标准导航规划器兼容。在仿真中,与环境无关的参考相比,环境条件下的参考显著提高了参考跟踪性能。在硬件上,我们将该策略与 MPC + 控制障碍函数规划器集成,并在 Unitree G1 上展示了在包含崎岖地形和连续楼梯的户外环境中进行长距离(>70米)闭环自主导航的能力,所有传感和计算均在车载完成。
cs.RO / 11 / 2605.15528
Task-Semantic Graph-Driven Distributed Agent Networking for Underwater Target Tracking
基于任务语义图的分布式代理网络在水下目标跟踪中的应用
Abstract
Autonomous underwater vehicle (AUV) swarms are emerging as intelligent underwater networks, where each node must sense, communicate, process local data, and make decisions under severe acoustic constraints. Persistent underwater target tracking is a typical task with moving targets, changing communication topology, intermittent acoustic links, and limited observation for each AUV. Multi-agent reinforcement learning (MARL) is a natural candidate for distributed tracking, yet existing studies still lack a unified open-source platform for evaluating different MARL algorithms under six-degree-of-freedom AUV dynamics. In addition, policies trained with raw geometric states and low-level force actions often struggle to represent task phases, observation reliability, link quality, and local cooperation roles. This paper addresses these issues by developing an open-source MARL-AUV platform that integrates DI-engine with a six-degree-of-freedom underwater AUV target-tracking simulator. To the best of our knowledge, it is the first open platform that connects a public MARL training framework with physically modeled AUV swarm-based tasks, and provides a unified experimental protocol for fair training, testing, and comparison of representative RL and MARL algorithms. Based on this platform, we propose STG-MAPPO, a Semantic Task Graph-enhanced variant of Multi-Agent Proximal Policy Optimization. STG-MAPPO builds semantic policy inputs from tracking diagnostics, task phases, observation confidence, link availability, neighbor tracking quality, and local role advantage. A compact semantic task graph links communication-constrained network states to decentralized actor decisions, and a velocity-level action abstraction maps high-level cooperative decisions to executable six-degree-of-freedom AUV control inputs. The code is available at https://github.com/dasjsaj/MARL-AUV.
Chinese Translation
自主水下航行器(AUV)群体正在成为智能水下网络,其中每个节点必须在严苛的声学约束下进行感知、通信、处理本地数据并做出决策。持续的水下目标跟踪是一项典型任务,涉及移动目标、变化的通信拓扑、间歇性的声学链路以及每个AUV的有限观测。多智能体强化学习(MARL)是分布式跟踪的自然选择,然而现有研究仍缺乏一个统一的开源平台,以评估在六自由度AUV动态下的不同MARL算法。此外,使用原始几何状态和低级力动作训练的策略往往难以准确表示任务阶段、观测可靠性、链路质量和本地合作角色。本文通过开发一个开源的MARL-AUV平台,解决了这些问题,该平台将DI-engine与六自由度水下AUV目标跟踪模拟器集成在一起。据我们所知,这是第一个将公共MARL训练框架与物理建模的AUV群体任务连接起来的开放平台,并为代表性的强化学习(RL)和MARL算法的公平训练、测试和比较提供了统一的实验协议。在此基础上,我们提出了STG-MAPPO,一种基于语义任务图增强的多智能体近端策略优化变体。STG-MAPPO从跟踪诊断、任务阶段、观测置信度、链路可用性、邻居跟踪质量和本地角色优势构建语义策略输入。一个紧凑的语义任务图将受限于通信的网络状态与去中心化的行动者决策相连接,而速度级别的动作抽象则将高级合作决策映射到可执行的六自由度AUV控制输入。代码可在 https://github.com/dasjsaj/MARL-AUV 获取。
cs.RO / 12 / 2605.15536
SkiP: When to Skip and When to Refine for Efficient Robot Manipulation
SkiP:何时跳过与何时细化以实现高效的机器人操作
Abstract
Previous imitation learning policies predict future actions at every control step, whether in smooth motion phases or precise, contact-rich operation phases. This uniform treatment is wasteful: most steps in a manipulation trajectory traverse free space and carry little task-relevant information, while a small fraction of key steps around contacts, grasps, and alignment demand dense, high-resolution prediction. We propose a novel action relabeling mechanism: at each timestep in a skip segment, we replace the behavior cloning target with the action at the entrance of the next key segment, enabling the policy to leap over redundant steps in a single decision. The resulting Skip Policy (SkiP) dynamically leaps over skip segments and intensively refines actions in key segments, within a single unified network requiring no learned skip planner or hierarchical structure. To automatically partition demonstrations into key and skip segments without manual annotation, we introduce Motion Spectrum Keying (MSK), a fast, task-agnostic procedure that detects local motion complexity from action signals. Extensive experiments across 72 simulated manipulation tasks and three real-robot tasks show that SkiP reduces executed steps by 15% to 40% while matching or improving success rates across various policy backbones. Project page: https://pgq18.github.io/SkiP-page/.
Chinese Translation
以往的模仿学习策略在每个控制步骤中预测未来的动作,无论是在平滑运动阶段还是在精确的接触丰富操作阶段。这种统一的处理方式是浪费的:在操作轨迹中的大多数步骤穿越自由空间,携带的任务相关信息很少,而围绕接触、抓取和对齐的少数关键步骤则需要密集的高分辨率预测。我们提出了一种新颖的动作重标定机制:在跳过段的每个时间步中,我们用下一个关键段入口处的动作替换行为克隆目标,使得策略能够在一次决策中跳过冗余步骤。最终得到的跳过策略(SkiP)能够动态跳过跳过段,并在关键段中密集细化动作,且只需一个统一的网络,无需学习跳过规划器或层次结构。为了在没有手动标注的情况下自动将演示分割为关键段和跳过段,我们引入了运动频谱键控(Motion Spectrum Keying,MSK),这是一种快速的、与任务无关的程序,能够从动作信号中检测局部运动复杂性。在72个模拟操作任务和三个真实机器人任务中的广泛实验表明,SkiP将执行步数减少了15%至40%,同时在各种策略骨干上保持或提高了成功率。项目页面:https://pgq18.github.io/SkiP-page/。
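The action relabeling step can be sketched as below. This is an illustration only: the key/skip mask is assumed to be given (the paper derives it with Motion Spectrum Keying, which is not reproduced here), the function name `relabel_actions` is hypothetical, and leaving trailing skip steps unchanged when no key segment follows is an assumption, since the abstract does not cover that case.

```python
import numpy as np

def relabel_actions(actions, is_key):
    """Return behavior-cloning targets with skip steps relabeled.

    For each skip timestep, the target becomes the action at the entrance
    (first step) of the next key segment. Key steps keep their own action.
    """
    actions = np.asarray(actions, dtype=float)
    targets = actions.copy()
    next_key_action = None
    # Scan backwards so the nearest upcoming key step is always known.
    for t in range(len(actions) - 1, -1, -1):
        if is_key[t]:
            next_key_action = actions[t]
        elif next_key_action is not None:
            targets[t] = next_key_action
    return targets

# Six one-dimensional actions; steps 2 and 3 form the only key segment.
demo_actions = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
demo_is_key = [False, False, True, True, False, False]
demo_targets = relabel_actions(demo_actions, demo_is_key)
```

Here steps 0 and 1 both receive the action from step 2, the entrance of the key segment, while steps 4 and 5 keep their original actions.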
cs.RO / 13 / 2605.15548
KaRMA: A Kinematic Metric for Fine Manipulation Ability in Robotic Hands
KaRMA:用于机器人手精细操作能力的运动学度量
Abstract
Traditional robotic hand metrics focus on static properties such as workspace, manipulability, and grasp stability. However, these metrics do not directly measure dexterity under the standard definition in robotic manipulation: the ability to continuously change an object's pose within the hand while maintaining contact from an initial grasp. We introduce Kinematic Rolling Manipulation Ability (KaRMA), a kinematic-only metric for fine manipulation that quantifies reachable in-hand translation and reorientation of a spherical test object within a two-finger precision pinch through feasible rolling motions. KaRMA enforces joint limits, collision constraints, rolling contact, and antipodal force feasibility, then investigates reachable in-hand object poses via breadth-first search over translation and rotation primitives. KaRMA reports three scores: translational coverage (KaRMA-T), rotational coverage (KaRMA-R), and sensitivity to the initial grasp (KaRMA-S). We evaluate KaRMA on 16 widely used robotic hands and compare against static baselines, showing that KaRMA separates hands that rank identically under static proxies, reveals translation-rotation tradeoffs invisible to existing baselines, and is qualitatively consistent with selected published task benchmarks where Jacobian-based metrics can be misleading.
Chinese Translation
传统的机器人手度量关注静态属性,如工作空间、可操作性和抓握稳定性。然而,这些度量并未按照机器人操作中的标准定义直接测量灵巧性:即在保持初始抓握接触的情况下,持续改变物体在手中的姿态的能力。我们提出了运动学滚动操作能力(Kinematic Rolling Manipulation Ability,KaRMA),这是一种仅基于运动学的精细操作度量,量化了在两指精确夹持中,通过可行的滚动运动,能够达到的球形测试物体的手内平移和重新定向。KaRMA 强制施加关节限制、碰撞约束、滚动接触和对跖力的可行性,然后通过对平移和旋转原语的广度优先搜索,研究可达到的手内物体姿态。KaRMA 报告三个评分:平移覆盖率(KaRMA-T)、旋转覆盖率(KaRMA-R)和对初始抓握的敏感性(KaRMA-S)。我们在16个广泛使用的机器人手上评估了KaRMA,并与静态基线进行了比较,结果表明KaRMA能够区分在静态代理指标下排名相同的手,揭示了现有基线无法察觉的平移-旋转权衡,并且与选定的已发表任务基准在定性上一致,而在这些基准上基于雅可比矩阵的度量可能产生误导。
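The reachability search can be illustrated with a generic breadth-first search over discrete motion primitives. This is a toy sketch, not the KaRMA implementation: the state is reduced to a hypothetical (shift, roll) grid index, and `toy_feasible` is an invented placeholder for the joint-limit, collision, rolling-contact, and force checks named in the abstract.

```python
from collections import deque

def reachable_states(start, primitives, is_feasible):
    """Breadth-first search over states reachable through feasible primitives."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for step in primitives:
            nxt = tuple(s + d for s, d in zip(state, step))
            if nxt not in seen and is_feasible(nxt):
                seen.add(nxt)
                queue.append(nxt)
    return seen

# One translation primitive and one rotation primitive, in both directions.
PRIMITIVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def toy_feasible(state):
    """Placeholder feasibility test; real checks would query the hand model."""
    shift, roll = state
    return abs(shift) <= 2 and abs(roll) <= 3 and abs(shift) + abs(roll) <= 4

reached = reachable_states((0, 0), PRIMITIVES, toy_feasible)
translation_coverage = len({s for s, _ in reached})
rotation_coverage = len({r for _, r in reached})
```

Counting the distinct translation and rotation values among the reached states gives a simple analogue of separate coverage scores. The paper's exact definitions of KaRMA-T and KaRMA-R may differ.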
cs.RO / 14 / 2605.15559
NavRL++: A System-Level Framework for Improving Sim-to-Real Transfer in Reinforcement Learning-Based Robot Navigation
NavRL++:一种系统级框架,用于改善基于强化学习的机器人导航中的仿真到现实转移
Abstract
Recent years have witnessed significant progress in autonomous navigation using reinforcement learning. However, existing approaches largely emphasize reinforcement learning framework design, such as input representations, action spaces, and reward functions, while providing limited analysis of sim-to-real transfer and insufficient insight into how training strategies affect real-world deployment performance. To bridge this gap, we not only introduce an effective RL framework but also present a complete training and deployment pipeline, along with a systematic empirical study that disentangles the key factors affecting sim-to-real transfer in reinforcement learning-based navigation, including sensor noise, perception failures, system latency, and control response. Building on insights from this analysis, we introduce perturbation-aware fine-tuning, a post-training adaptation strategy that improves transfer robustness by explicitly accounting for empirically identified domain discrepancies. To further mitigate perception degradation and enhance control smoothness in real-world deployment, we propose a Transformer-based temporal reasoning policy that leverages short-horizon observation for navigation control. We quantitatively evaluate how individual sim-to-real perturbations and training design choices impact navigation performance across environments. Experimental results demonstrate that the proposed training strategy and policy architecture outperform learning-based baselines in both static and dynamic environments, while achieving performance comparable to optimization-based planners in static settings. We validate our approach through real-world deployment on multiple robotic platforms, including aerial and legged robots, across navigation-centric tasks such as exploration and inspection, demonstrating zero-shot sim-to-real transfer.
Chinese Translation
近年来,基于强化学习的自主导航取得了显著进展。然而,现有方法主要强调强化学习框架设计,例如输入表示、动作空间和奖励函数,对仿真到现实转移的分析有限,且对训练策略如何影响现实世界部署性能的洞察不足。为了解决这一问题,我们不仅引入了一个有效的强化学习框架,还提出了一个完整的训练和部署流程,并进行了一项系统的实证研究,剖析影响基于强化学习的导航中仿真到现实转移的关键因素,包括传感器噪声、感知失败、系统延迟和控制响应。在此分析的基础上,我们引入了扰动感知微调(perturbation-aware fine-tuning),这是一种后训练适应策略,通过明确考虑实证识别的领域差异来提高转移的鲁棒性。为了进一步减轻感知退化并增强现实世界部署中的控制平滑性,我们提出了一种基于Transformer的时间推理策略,该策略利用短期观察进行导航控制。我们定量评估了各个仿真到现实的扰动和训练设计选择如何影响不同环境中的导航性能。实验结果表明,所提出的训练策略和策略架构在静态和动态环境中均优于基于学习的基线,同时在静态环境中实现了与基于优化的规划器相当的性能。我们通过在多个机器人平台(包括空中和腿部机器人)上的现实世界部署进行验证,涵盖了探索和检查等导航中心任务,展示了零样本的仿真到现实转移。
cs.RO / 15 / 2605.15619
Wind-Aware Optimal Trajectory Planning for Efficient Gliding of Fixed-Wing Aerial Systems
考虑风场的固定翼航空系统高效滑翔最优轨迹规划
Abstract
Gliding offers small fixed-wing UAVs extended endurance and silent operation but requires accurate energy management, especially under wind disturbances and obstacle constraints. Traditional controllers based on the Total Energy Control System (TECS) regulate the trade between potential and kinetic energy reactively, often requiring fine-tuning and knowledge of trim conditions. In this work, we shift the regulation to the planning level and present a nonlinear, multi-cost trajectory planner for small UAV gliders. The method generates $\mathcal{C}^3$ continuous trajectories based on Bernstein polynomials, mapped into control commands through differential flatness, and re-planned online to match experimentally derived sink polar curves. A simulated netto variometer is integrated into the optimization to estimate air mass motion, constraining the glide to energy-balanced states. Consecutive gliding trajectories are linked by cruising segments computed through trajectories initialized on Dubins path-based waypoints, enabling hybrid missions that combine powered and unpowered flight. The approach is validated in CFD simulations and real-world experiments with a fixed-wing platform, showing reliable stabilization of sink rate, airspeed, and glide ratio under wind gusts and in the presence of obstacles.
Chinese Translation
滑翔为小型固定翼无人机提供了更长的续航时间和静音操作,但需要准确的能量管理,尤其是在风扰动和障碍物约束下。传统的基于总能量控制系统的控制器以反应式方式调节势能和动能之间的转换,通常需要精细调参和配平条件的知识。在本研究中,我们将调节转移到规划层面,提出了一种针对小型无人机滑翔机的非线性多成本轨迹规划器。该方法基于伯恩斯坦多项式生成 $\mathcal{C}^3$ 连续轨迹,通过微分平坦性映射为控制命令,并在线重新规划以匹配实验得出的下沉率极曲线。优化中集成了模拟的净升降速度表(netto variometer),以估计气团运动,将滑翔约束在能量平衡状态。连续的滑翔轨迹通过巡航段连接,巡航段由以基于 Dubins 路径的航点初始化的轨迹计算得到,从而实现结合动力和无动力飞行的混合任务。该方法在计算流体动力学(CFD)模拟和固定翼平台的实际实验中得到了验证,显示出在阵风和障碍物存在下对下沉率、空速和滑翔比的可靠稳定。
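A short sketch of a Bernstein polynomial trajectory and its analytic derivative, the building block the planner above is based on. The control points, the cubic degree, and the function names are hypothetical. The optimization, the wind model, and the continuity constraints between consecutive segments are not shown.

```python
from math import comb
import numpy as np

def bernstein_basis(n, t):
    """Values of the n+1 Bernstein basis polynomials of degree n at t in [0, 1]."""
    t = np.asarray(t, dtype=float)
    return np.stack(
        [comb(n, i) * t**i * (1.0 - t) ** (n - i) for i in range(n + 1)], axis=-1
    )

def curve(control_points, t):
    """Position on the curve defined by the control points."""
    P = np.asarray(control_points, dtype=float)
    return bernstein_basis(len(P) - 1, t) @ P

def curve_derivative(control_points, t):
    """Derivative with respect to t: a degree n-1 curve over control-point differences."""
    P = np.asarray(control_points, dtype=float)
    n = len(P) - 1
    return n * (bernstein_basis(n - 1, t) @ np.diff(P, axis=0))

# Illustrative descending glide segment: (x, y, altitude) in metres.
glide_points = [[0, 0, 100], [50, 0, 95], [100, 20, 90], [150, 40, 80]]
start = curve(glide_points, 0.0)
end = curve(glide_points, 1.0)
initial_velocity = curve_derivative(glide_points, 0.0)
```

The curve starts and ends exactly at the first and last control points, and its end derivatives depend only on the neighbouring control points, which is what makes continuity constraints between segments simple to express.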
cs.RO / 16 / 2605.15641
Propagating Unsafe Actions in LLM Controlled Multi-Robot Collaboration via Single Robot Compromise
通过单一机器人妥协在大型语言模型控制的多机器人协作中传播不安全行为
Abstract
Large language models (LLMs) are increasingly used as general planners in embodied intelligence, enabling high-level coordination and low-level task planning for both single-robot and multi-robot collaboration. This increasing reliance on embodied LLM planners also raises critical security concerns, since misaligned or manipulated instructions can be translated into physical actions. Prior work has studied such threats in single-robot settings, while security risks in LLM-controlled multi-robot collaboration, especially those propagated through inter-robot communication, remain largely unexplored. To bridge this gap, we propose a novel attack paradigm for multi-robot systems in which the adversary interacts with only a single entry robot. The compromised robot then propagates malicious intent through peer communication, leading to coordinated unsafe actions across the system. Our evaluation, covering high-risk dimensions of dereliction of duty, privacy compromise, and public safety hazards, reveals a persistent safety alignment gap in multi-robot planners. We quantify this process with three metrics: obedience, infectiousness, and stealthiness. Experiments demonstrate both persistent attacker control and rapid propagation: obedience reaches 1.00 in the strongest cases, and infectiousness rises to 0.90. Notably, the attack is highly efficient, requiring as few as 3.0 rounds to compromise all the robots while maintaining a stealthiness score of 0.81. Such risks are amplified when robots must resolve trade-offs in critical situations, such as emergencies or conflicts of rights, because the coordination mechanism can unintentionally allow adversarial instructions to override safety requirements. The code is available at https://github.com/TheFatInsect/InfectBot.
Chinese Translation
大型语言模型(LLMs)越来越多地被用作具身智能中的通用规划者,使得单机器人和多机器人协作中的高层协调与低层任务规划成为可能。这种对具身LLM规划者的日益依赖也引发了关键的安全隐患,因为不一致或被操控的指令可能会转化为物理行为。先前的研究主要集中在单机器人环境中的此类威胁,而LLM控制的多机器人协作中的安全风险,特别是通过机器人间通信传播的风险,仍然未得到充分探索。为填补这一空白,我们提出了一种新颖的多机器人系统攻击范式,其中对手仅与单一入口机器人进行交互。被妥协的机器人随后通过同伴通信传播恶意意图,导致系统内的协调不安全行为。我们的评估涵盖了失职、隐私泄露和公共安全危害等高风险维度,揭示了多机器人规划者中持续存在的安全对齐差距。我们通过服从性、传染性和隐蔽性三个指标对这一过程进行了量化。实验表明攻击者能够持续控制并迅速传播:在最强的案例中,服从性达到了1.00,传染性上升至0.90。值得注意的是,该攻击效率极高,仅需3.0轮即可妥协所有机器人,同时保持0.81的隐蔽性评分。当机器人在紧急情况或权利冲突等关键情况下必须解决权衡时,这种风险会被放大,因为协调机制可能无意中允许对手的指令覆盖安全要求。代码可在 https://github.com/TheFatInsect/InfectBot 获取。
cs.RO / 17 / 2605.15650
MyoChallenge 2025: A New Benchmark for Human Athletic Intelligence
MyoChallenge 2025:人类运动智能的新基准
Wang, Cheryl, Tan, Chun Kwang, Hodossy, Balint K., Lyu, Eric, Guo, Jun, Zhao, Wentao, Liu, Huaping, Li, Chengkun, Simos, Merkourios, Ziliotto, Bianca, Mathis, Alexander, Liu, Siyuan, Chen, Jiahao, Zhong, Shanlin, Jiang, Bo, Song, Ci, Zhu, Yaoye, Zuo, Chenhui, Sui, Yanan, Refai, Mohamed Irfan, Sartori, Massimo, Durandau, Guillaume, Kumar, Vikash, Caggiano, Vittorio
Abstract
Athletic performance represents the pinnacle of human motor intelligence, demanding rapid choices, precise control, agility, and coordinated physical execution. Replicating this seamless combination of capabilities remains elusive in current artificial intelligence and robotic systems. Concurrently, understanding the biological mastery of these movements is hindered because complex muscle coordination is rarely measured in vivo due to the limitations of physical equipment. To bridge this fundamental gap in understanding, MyoChallenge at NeurIPS 2025 established a pioneering benchmark for motor control intelligence in sports, leveraging high-fidelity musculoskeletal models within physics simulation combined with machine learning-driven algorithms. The competition introduces two distinct tracks emphasizing either upper- or lower-limb control: a table tennis rally task utilizing a biomechanical upper limb composed of an arm with a hand and a trunk; and a soccer penalty kick using a biomechanical model of legs and a trunk. Marking the fourth iteration of the MyoChallenge series, this event attracted almost 70 teams and over 560 submissions globally, uniting a diverse community ranging from physicians and neuroscientists to machine learning experts. The competition facilitated the development of several state-of-the-art control algorithms for a musculoskeletal system capable of sports agility, leveraging techniques such as physics-based motion planners, on-policy behaviour cloning, hierarchical planning, and muscle synergies. By integrating standardized tasks and physiologically realistic models into the open-source framework of MyoSuite, MyoChallenge'25 serves as a reproducible and reusable testbed to accelerate interdisciplinary research across machine learning, biomechanics, sports science, and neuroscience. Project page: https://www.myosuite.org//myochallenge/myochallenge-2025.
Chinese Translation
运动表现代表了人类运动智能的巅峰,要求快速决策、精确控制、敏捷性和协调的身体执行。在当前的人工智能和机器人系统中,复制这种无缝结合的能力仍然是一个难题。同时,由于物理设备的限制,复杂的肌肉协调在体内的测量很少,这也阻碍了对这些运动生物学掌握的理解。为了弥补这一理解上的根本缺口,MyoChallenge 在 NeurIPS 2025 上建立了一个开创性的运动控制智能基准,利用高保真的肌肉骨骼模型与物理仿真相结合,并结合机器学习驱动的算法。该竞赛引入了两个不同的赛道,分别强调上肢或下肢的控制:一个是利用生物力学上肢(由手臂、手和躯干组成)进行的乒乓球对打任务;另一个是使用生物力学腿部和躯干模型的足球点球。作为 MyoChallenge 系列的第四次迭代,本次活动吸引了近70个团队和超过560份全球提交,汇聚了从医生和神经科学家到机器学习专家的多元社区。该竞赛促进了多种先进控制算法的发展,这些算法能够实现运动灵活性的肌肉骨骼系统,利用物理基础的运动规划、在线行为克隆、分层规划和肌肉协同等技术。通过将标准化任务和生理上逼真的模型整合到 MyoSuite 的开源框架中,MyoChallenge'25 作为一个可重复和可重用的测试平台,加速了机器学习、生物力学、运动科学和神经科学等跨学科研究。项目页面:https://www.myosuite.org//myochallenge/myochallenge-2025.
cs.RO / 18 / 2605.15654
PCASim: Promptable Closed-loop Adversarial Simulation for Urban Traffic Environment
PCASim:可提示的城市交通环境闭环对抗仿真
Abstract
Real-world autonomous driving, particularly in urban environments with numerous corner cases, requires rigorous testing to ensure product safety and robustness. However, few studies have explored integrating adversarial scenario generation with the training of safety agents in closed-loop testing, enabling efficient co-evolution and mutual enhancement of both. To address this challenge, an adversarial behavior knowledge repository is constructed by applying rule-based filtering to an open-source dataset, combined with knowledge retrieval modules tailored for simulation environments. A large language model (LLM) is employed to integrate knowledge-, data-, and adversarial-driven approaches, generating safety-critical traffic scenarios customized to user needs. Additionally, while evaluating the generated scenarios, we employ reinforcement learning models to train the behaviors of different types of vehicles, thereby enriching scenario diversity beyond existing datasets while preserving realism. Experimental results demonstrate that the proposed framework improves the accuracy of domain-specific language generation by 12%. Moreover, the success rate of newly generated scenario transformations increases by 8%, while obstacle-avoidance capability is enhanced by 30%. For the complete manuscript, please refer to: https://zhenhaooo.github.io/PCASim.github.io/
Chinese Translation
现实世界中的自动驾驶,特别是在具有众多边缘案例的城市环境中,需要严格测试以确保产品的安全性和鲁棒性。然而,关于将对抗场景生成与安全代理的闭环测试训练相结合的研究仍然较少,这限制了两者的高效共同演化和相互增强。为了解决这一挑战,我们通过对开源数据集应用基于规则的过滤,构建了一个对抗行为知识库,并结合针对仿真环境的知识检索模块。我们采用大型语言模型(LLM)集成知识驱动、数据驱动和对抗驱动的方法,生成符合用户需求的安全关键交通场景。此外,在评估生成的场景时,我们利用强化学习模型训练不同类型车辆的行为,从而丰富场景的多样性,超越现有数据集,同时保持现实感。实验结果表明,所提出的框架使得领域特定语言生成的准确性提高了12%。此外,新生成场景转换的成功率提高了8%,而避障能力增强了30%。有关完整手稿,请参见:https://zhenhaooo.github.io/PCASim.github.io/
cs.RO / 19 / 2605.15705
Feedback World Model Enables Precise Guidance of Diffusion Policy
反馈世界模型实现扩散策略的精确指导
Abstract
World models aim to improve robotic decision making by predicting the consequences of actions. However, in practice, their predictions often become unreliable once the robot encounters states outside the training distribution, limiting their effectiveness at deployment. We observe that execution itself provides a natural but underutilized signal: after each action, the robot directly observes the true next state, revealing the mismatch between predicted and actual outcomes. Building on this insight, we propose the feedback world model, a new paradigm that closes the loop between prediction and observation at inference time. Instead of treating the world model as a static open-loop predictor, our method maintains a lightweight feedback state that is updated online to iteratively correct future predictions, compensating for model errors using real-time observations without additional training data or parameter updates. We show that this process can be interpreted as a latent-space observer and admits convergence guarantees under mild conditions. We further introduce action-aware guidance to better translate corrected predictions into control by emphasizing action-controllable components while suppressing irrelevant variations. Experiments on LIBERO-Plus, Robomimic, and real-world manipulation tasks demonstrate that our method substantially improves both prediction accuracy and policy performance under distribution shift. In particular, it reduces world model prediction error by up to 76.4% and improves out-of-distribution (OOD) success rate by 30%. These results show that incorporating real-time feedback at inference time provides a simple yet powerful alternative to static world modeling.
Chinese Translation
世界模型旨在通过预测行动的后果来改善机器人决策。然而,在实践中,一旦机器人遇到训练分布之外的状态,其预测往往变得不可靠,从而限制了其在部署中的有效性。我们观察到,执行本身提供了一种自然但未充分利用的信号:在每次行动后,机器人直接观察到真实的下一个状态,揭示了预测结果与实际结果之间的差异。基于这一见解,我们提出了反馈世界模型(feedback world model),这是一种在推理时闭合预测与观察之间循环的新范式。我们的方法不再将世界模型视为静态的开环预测器,而是维护一个轻量级的反馈状态,该状态在线更新,以迭代地修正未来的预测,利用实时观察来补偿模型误差,而无需额外的训练数据或参数更新。我们表明,这一过程可以被解释为潜在空间观察者,并在温和条件下具有收敛保证。我们进一步引入了基于动作的指导,以更好地将修正后的预测转化为控制,强调可控动作的组成部分,同时抑制无关的变化。在LIBERO-Plus、Robomimic和真实世界操作任务上的实验表明,我们的方法在分布转移下显著提高了预测准确性和策略性能。特别是,它将世界模型预测误差减少了多达76.4%,并将分布外(OOD)成功率提高了30%。这些结果表明,在推理时结合实时反馈为静态世界建模提供了一种简单而强大的替代方案。
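The closed-loop correction described in this abstract can be illustrated with a toy example. The sketch below is not the paper's method: it replaces the learned latent world model with a scalar linear system, and the function name `run_observer`, the gain, and the constant disturbance are all assumptions chosen for illustration. It only shows how a small feedback state, updated from the prediction residual after each step, can remove a persistent model error without retraining.

```python
def run_observer(steps=20, gain=0.5, use_feedback=True):
    """Return absolute one-step prediction errors of a biased model."""
    a, b = 0.9, 0.5            # dynamics shared by the model and the "real" system
    disturbance = 0.3          # unmodelled offset present only in the real system
    x, correction = 1.0, 0.0   # true state and lightweight feedback state
    errors = []
    for _ in range(steps):
        u = 0.1                                  # fixed action for simplicity
        predicted = a * x + b * u + correction   # model prediction plus feedback
        x_next = a * x + b * u + disturbance     # what the robot actually observes
        residual = x_next - predicted
        errors.append(abs(residual))
        if use_feedback:
            correction += gain * residual        # online update, no retraining
        x = x_next
    return errors

open_loop = run_observer(use_feedback=False)
closed_loop = run_observer(use_feedback=True)
print(f"open-loop final error:   {open_loop[-1]:.6f}")
print(f"closed-loop final error: {closed_loop[-1]:.6f}")
```

In this toy setting the residual shrinks by a factor of `1 - gain` per step, which is the kind of convergence behaviour a classical observer provides; the paper's guarantees concern its own latent-space formulation.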
cs.RO / 20 / 2605.15713
Learning Dynamic Pick-and-Place for a Legged Manipulator
学习腿式机械手的动态抓取与放置
Abstract
Legged manipulators extend robotic capabilities beyond static manipulation by integrating agile locomotion with versatile arm control. However, achieving precise manipulation while maintaining coordinated locomotion remains a major challenge. This work presents a hierarchical reinforcement learning framework for dynamic pick-and-place tasks using a quadruped equipped with a 6-DOF robotic arm. The framework incorporates an explicit mass estimation module enabling adaptive whole-body control for objects with varying weights. In simulation, the system achieves an 86.05% success rate with payloads up to 2.3 kg. The approach is further validated through real-world experiments across six representative scenarios with controlled variations in object physical properties (size and mass) and task heights. Specifically, within a wide vertical workspace ranging from ground level to 1.1 m high tabletops, the system demonstrates an average success rate of 73.3% for payloads up to 1.3 kg, with an average execution time of 4.06 s. Unlike prior works that handle lightweight objects and execute pick-and-place with slow, piecewise motions, the proposed framework exploits concurrent locomotion and manipulation for dynamic, continuous execution. These results demonstrate the potential of quadrupedal mobile manipulators for adaptive, whole-body pick-and-place with heavier payloads and extended workspaces.
Chinese Translation
腿式机械手通过将灵活的运动与多功能的臂部控制相结合,扩展了机器人在静态操作之外的能力。然而,在保持协调运动的同时实现精确操作仍然是一个主要挑战。本研究提出了一种层次化强化学习框架,用于使用配备6自由度(6-DOF)机械臂的四足机器人进行动态抓取与放置任务。该框架结合了显式的质量估计模块,使得对不同重量物体的自适应全身控制成为可能。在仿真中,该系统在负载高达2.3公斤的情况下实现了86.05%的成功率。该方法通过在六个具有代表性的场景中进行实际实验进一步验证,场景中控制了物体物理属性(大小和质量)及任务高度的变化。具体而言,在从地面到1.1米高的桌面范围内的广泛垂直工作空间中,该系统对负载高达1.3公斤的物体展示了73.3%的平均成功率,平均执行时间为4.06秒。与之前处理轻量物体并以缓慢、分段动作执行抓取与放置的研究不同,所提出的框架利用并行运动和操作实现动态、连续的执行。这些结果展示了四足移动机械手在适应性、全身抓取与放置重物及扩展工作空间方面的潜力。
cs.RO / 21 / 2605.15753
Hierarchical and Holistic Open-Vocabulary Functional 3D Scene Graphs for Indoor Spaces
用于室内空间的分层和整体开放词汇功能性3D场景图
Abstract
Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored due to the limited coverage of existing benchmarks and the overly straightforward design of previous pipelines, which primarily focus on large-scale furniture and lack hierarchical structures. Therefore, in this work, we extend the benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. This expansion introduces critical challenges involving small-scale, dense, and similar instances, including a lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization. Specifically, we anchor fine-grained functional edges from 2D visual evidence, and associate nodes across frames in 3D using multiple cues. Furthermore, edge association is formulated as temporal graph optimization, integrating evidence accumulation, entropy regularization, and temporal smoothing to robustly determine the functional connections of each node. Finally, global hierarchy shaping is performed to recover the hierarchical graph structure. Extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes, thereby further unlocking their potential for practical applications.
Chinese Translation
功能性3D场景图为3D场景理解和机器人操作提供了一种多功能和灵活的表示,主要由对象节点、交互元素和功能关系边缘构成。然而,由于现有基准的覆盖范围有限以及以往管道设计过于简单,主要集中于大规模家具而缺乏分层结构,其潜力仍未得到充分挖掘。因此,在本研究中,我们通过引入密集的桌面物体和明确的多层次功能关系来扩展基准的覆盖范围。这一扩展带来了关键挑战,包括小规模、密集和相似实例的处理、关系推理中缺乏视觉锚定、跨帧融合中的实例混淆以及动态视角下的归属不确定性。为了解决这些问题,我们提出了一种基于2D视觉锚定和3D图优化的开放词汇管道。具体而言,我们从2D视觉证据中锚定细粒度的功能边缘,并利用多种线索在3D中关联跨帧节点。此外,边缘关联被形式化为时间图优化,整合证据累积、熵正则化和时间平滑,以稳健地确定每个节点的功能连接。最后,进行全局层次塑造以恢复分层图结构。大量实验表明,所提出的方法能够在具有挑战性的真实场景中可靠地推断功能性3D场景图,从而进一步释放其在实际应用中的潜力。
cs.RO / 22 / 2605.15769
Lamarckian Inheritance in Dynamic Environments: How Key Variables Affect Evolutionary Dynamics
动态环境中的拉马克遗传:关键变量如何影响进化动态
Abstract
The co-optimization of a robot's body and brain presents a coupled challenge: the morphology constrains which control strategies are effective, while the control determines how well the morphology performs. To address this, we combine morphology optimization as evolution with controller optimization as lifetime learning, utilizing Lamarckian inheritance to transfer learned controller parameters from parent to offspring. In dynamic environments, existing literature presents conflicting evidence: while traditional evolutionary theory often suggests Lamarckian inheritance lacks benefit, recent studies in evolutionary robotics indicate it can improve performance. We hypothesize that this is because previous works have not included all relevant variables of dynamic environments. In this work, we show that the benefit of Lamarckian inheritance depends on two variables: how conflicting the environmental changes are to robot control, and the predictability of those changes for the robotic agent. Using virtual soft robots and two different learning approaches, Bayesian optimization and reinforcement learning, we show that Lamarckian inheritance only underperforms Darwinian inheritance when the changes are both conflicting and unpredictable. We find that adding a sensor to detect environmental changes restores the benefits for Lamarckian inheritance in conflicting environments, by allowing robotic agents to predict the need for a different behavior, thereby generalizing their control.
Chinese Translation
机器人的身体与大脑的共同优化提出了一个耦合挑战:形态学限制了有效的控制策略,而控制则决定了形态学的表现。为了解决这个问题,我们将形态优化视为进化,将控制器优化视为终身学习,利用拉马克遗传将学习到的控制器参数从父代转移到子代。在动态环境中,现有文献提供了相互矛盾的证据:虽然传统进化理论通常认为拉马克遗传没有优势,但最近在进化机器人领域的研究表明它可以提高性能。我们假设这可能是因为之前的研究没有考虑到动态环境中的所有相关变量。在本研究中,我们展示了拉马克遗传的优势依赖于两个变量:环境变化对机器人控制的冲突程度,以及这些变化对机器人代理的可预测性。通过使用虚拟软机器人和两种不同的学习方法,贝叶斯优化和强化学习,我们表明,只有在变化既冲突又不可预测时,拉马克遗传才会表现不如达尔文遗传。我们发现,通过添加传感器以检测环境变化,可以恢复拉马克遗传在冲突环境中的优势,使机器人代理能够预测不同行为的需求,从而推广其控制能力。
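The difference between the two inheritance schemes in this abstract can be sketched with a toy model. Everything below is an assumption for illustration: the controller is a two-element parameter list, `learn` is a simple stand-in for lifetime learning (not Bayesian optimization or reinforcement learning), and a "conflicting" environmental change is modelled as the optimal controller flipping sign. Predictability and morphology, which the paper also studies, are not modelled.

```python
def learn(params, target, steps=5, lr=0.3):
    """Toy lifetime learning: pull controller parameters toward the task optimum."""
    params = list(params)
    for _ in range(steps):
        params = [p + lr * (t - p) for p, t in zip(params, target)]
    return params

def child_error(mode, genome, parent_learned, child_target):
    """Squared error of a child's controller after its own lifetime learning."""
    # Lamarckian children start from the parent's learned parameters,
    # Darwinian children start from the unlearned genome.
    start = parent_learned if mode == "lamarckian" else genome
    learned = learn(start, child_target)
    return sum((p - t) ** 2 for p, t in zip(learned, child_target))

genome = [0.0, 0.0]                     # inherited (unlearned) controller parameters
parent_target = [1.0, -1.0]             # optimum in the parent's environment
parent_learned = learn(genome, parent_target)

for label, child_target in (("stable", [1.0, -1.0]), ("conflicting", [-1.0, 1.0])):
    for mode in ("darwinian", "lamarckian"):
        err = child_error(mode, genome, parent_learned, child_target)
        print(f"{label:12s} {mode:10s} error={err:.4f}")
```

In this toy, inherited learned parameters help when the child's environment matches the parent's and hurt when it demands the opposite behaviour, which mirrors the "conflicting" condition named in the abstract.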
cs.RO / 23 / 2605.15779
A Topology-Aware Spatiotemporal Handover Framework for Continuous Multi-UAV Tracking
一种拓扑感知的时空切换框架用于连续多无人机跟踪
Abstract
The integration of Unmanned Aerial Vehicles (UAVs) into Intelligent Transportation Systems (ITS) offers synoptic visibility for traffic monitoring, yet scalable deployment is hindered by trajectory fragmentation, where vehicle identity persistence is lost across multi-UAV Fields of View (FOV). While state-of-the-art frameworks excel in optimizing local trajectory extraction and stability for single-drone imagery, they often function as isolated data silos that generate disjointed trajectories, thereby precluding network-level analysis such as Origin-Destination estimation. This paper presents a real-time Multi-Camera Multi-Vehicle Tracking (MCMT) system designed to handle global identity persistence. Addressing the visual ambiguity and computational cost of appearance-based Re-Identification (Re-ID) in nadir views, we introduce a lightweight Topology-Based Spatiotemporal Handover mechanism. We implement a high-throughput parallel pipeline leveraging YOLO11 and ByteTrack to process concurrent 4K streams. Our core contribution is a deterministic queue-based matching algorithm that utilizes geometric overlaps and virtual lane discretization to predictively manage identity handover via FIFO queues. Experimental results in complex urban environments, including intersections and merging traffic, demonstrate a Handover Success Rate (HOSR) of 99.8% in continuous traffic flows, significantly outperforming Re-ID baselines (74.1%) while validating edge deployment feasibility. The source code is available at https://github.com/JYe9/multi-camera-multi-vehicle-tracking-system.
Chinese Translation
无人机(UAV)与智能交通系统(ITS)的集成为交通监测提供了全局可视性,但由于轨迹碎片化,车辆身份在多无人机视场(FOV)中丧失,导致可扩展部署受到阻碍。尽管最先进的框架在优化单无人机图像的局部轨迹提取和稳定性方面表现出色,但它们往往作为孤立的数据孤岛生成不连贯的轨迹,从而阻碍了网络级分析,例如起点-终点估计。本文提出了一种实时多摄像头多车辆跟踪(MCMT)系统,旨在处理全球身份持久性。针对俯视视角中基于外观的重新识别(Re-ID)所带来的视觉模糊和计算成本,我们引入了一种轻量级的基于拓扑的时空切换机制。我们实现了一个高吞吐量的并行处理管道,利用YOLO11和ByteTrack处理并发的4K流。我们的核心贡献是一个基于确定性队列的匹配算法,利用几何重叠和虚拟车道离散化,通过FIFO队列预测性地管理身份切换。在复杂的城市环境中进行的实验结果,包括交叉口和合流交通,显示出在连续交通流中切换成功率(HOSR)达到99.8%,显著优于Re-ID基线(74.1%),同时验证了边缘部署的可行性。源代码可在https://github.com/JYe9/multi-camera-multi-vehicle-tracking-system获取。
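The queue-based handover idea in this abstract can be sketched as follows. This is a minimal illustration, not the released implementation: the class `LaneHandover`, its method names, and the identifiers are hypothetical, and the sketch assumes that vehicles do not overtake within a virtual lane between the two fields of view, so first-in-first-out order is preserved.

```python
from collections import deque

class LaneHandover:
    """Per-lane FIFO queues linking the exit of one view to the entry of the next."""

    def __init__(self, lane_ids):
        self.queues = {lane: deque() for lane in lane_ids}

    def on_exit(self, lane, global_id):
        """A tracked vehicle leaves the upstream field of view in this lane."""
        self.queues[lane].append(global_id)

    def on_entry(self, lane, new_global_id):
        """A new track appears downstream; reuse the oldest pending identity if any."""
        queue = self.queues[lane]
        return queue.popleft() if queue else new_global_id

handover = LaneHandover(lane_ids=[0, 1])
handover.on_exit(lane=0, global_id="veh-7")
handover.on_exit(lane=1, global_id="veh-9")
handover.on_exit(lane=0, global_id="veh-12")

matched = [
    handover.on_entry(lane=0, new_global_id="veh-100"),
    handover.on_entry(lane=1, new_global_id="veh-101"),
    handover.on_entry(lane=0, new_global_id="veh-102"),
    handover.on_entry(lane=1, new_global_id="veh-103"),  # nothing pending in lane 1
]
print(matched)
```

Matching by lane and arrival order avoids any appearance comparison, which is the property the abstract contrasts with Re-ID baselines in nadir views.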
cs.RO / 24 / 2605.15782
Reactive Robot-Centric Safety for Autonomous Navigation in Constrained and Dynamic Environments
面向自主导航的反应式机器人中心安全性在受限和动态环境中的应用
Abstract
In this work, we address the problem of ensuring real-time safety in autonomous robot navigation in spatially constrained, dynamic environments using only onboard sensors. We present a real-time control architecture that integrates a composite control barrier function (CBF) safety filter, driven by 3D LIDAR perception, directly into the autonomy pipeline. The proposed perception-driven framework enforces collision avoidance constraints dynamically from onboard point cloud data, thus allowing a large number of constraints to be handled at the control frequency, while remaining minimally invasive to nominal task execution. The safety region is defined as an ellipsoid in the body frame, consistent with the geometry of the platform, which induces time-varying constraints in the world frame as the robot rotates; this effect is handled through a dedicated time-varying CBF formulation for each LIDAR point. We validate the system through multiple field experiments in underground environments using a quadruped platform performing a visual inspection task, demonstrating reliable operation in the presence of dynamic obstacles, unsafe high-level references, and abrupt localization anomalies, as well as while traversing narrow corridors.
Chinese Translation
在本研究中,我们通过仅利用机载传感器,解决了在空间受限的动态环境中确保自主机器人导航实时安全性的问题。我们提出了一种实时控制架构,将基于3D激光雷达感知的复合控制屏障函数(CBF)安全过滤器直接集成到自主导航流程中。所提出的感知驱动框架动态地从机载点云数据中强制执行碰撞避免约束,从而允许在控制频率下处理大量约束,同时对名义任务执行的干扰最小化。安全区域被定义为与平台几何形状一致的机体坐标系中的椭球体,这在机器人旋转时在世界坐标系中引入时间变化的约束;这一效应通过为每个激光雷达点专门制定的时间变化(CBF)公式进行处理。我们通过在地下环境中进行多次实地实验来验证该系统,利用四足平台执行视觉检查任务,展示了在动态障碍物、不安全的高层参考、突发的定位异常以及穿越狭窄走廊时的可靠操作。
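To make the per-point barrier constraint in this abstract concrete, the sketch below filters a nominal velocity against a single LIDAR point. It is a strong simplification and not the paper's formulation: the robot is treated as a velocity-controlled point that translates without rotating, the obstacle point is static, only one constraint is handled (so the minimal correction has a closed form instead of requiring a QP solver), and the ellipsoid semi-axes and gain `alpha` are hypothetical values.

```python
def barrier(point, axes):
    """h(p) >= 0 when the body-frame point lies outside the safety ellipsoid."""
    return sum((p / a) ** 2 for p, a in zip(point, axes)) - 1.0

def filter_velocity(v_nom, point, axes, alpha=1.0):
    """Minimally modify v_nom so that grad_h . v <= alpha * h for one static point."""
    h = barrier(point, axes)
    # For a static point and a translating robot, p_dot = -v, so
    # h_dot = -grad_h . v and the CBF condition h_dot >= -alpha * h
    # becomes grad_h . v <= alpha * h.
    grad = [2.0 * p / a ** 2 for p, a in zip(point, axes)]
    violation = sum(g * v for g, v in zip(grad, v_nom)) - alpha * h
    if violation <= 0.0:
        return list(v_nom)                       # nominal command already satisfies it
    scale = violation / sum(g * g for g in grad)
    return [v - scale * g for v, g in zip(v_nom, grad)]

axes = (0.6, 0.3, 0.3)            # hypothetical semi-axes of the safety region [m]
point = (0.9, 0.0, 0.0)           # one LIDAR return straight ahead [m]
v_safe = filter_velocity((1.0, 0.2, 0.0), point, axes)
v_away = filter_velocity((-1.0, 0.0, 0.0), point, axes)
print(v_safe, v_away)
```

Only the velocity component along the barrier gradient is reduced, so the lateral part of the nominal command passes through unchanged; this is the sense in which such a filter is minimally invasive.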
cs.RO / 25 / 2605.15836
GAP: Geometric Anchor Pre-training for Data-Efficient Visuomotor Learning of Manipulation Tasks
GAP:用于数据高效的操作任务视觉运动学习的几何锚点预训练
Abstract
Learning visuomotor policies from scarce expert demonstrations remains a core challenge in robotic manipulation. A primary hurdle lies in distilling high-dimensional RGB representations into control-relevant geometry without overfitting. While using frozen pre-trained Vision Foundation Models (VFMs) improves data efficiency, it also shifts most task adaptation onto a small spatial pooling module, which can latch onto task-irrelevant shortcuts and lose geometric grounding when finetuned with few data samples. More broadly, pre-trained visual representations used for policy learning have been observed to struggle under even minor scene perturbations, highlighting the need for robustness-oriented inductive biases. We propose Geometric Anchor Pre-training (GAP), a simple, action-free warm-up stage that regularizes the spatial adapter before downstream imitation learning. GAP pre-trains the pooling layer on a lightweight simulated proxy task where object masks are available at no cost, encouraging the adapter to produce keypoints that lie on the object, cover its spatial extent, and remain sharp and repeatable over time. This yields stable geometric anchors that provide a reliable coordinate interface for few-shot policy learning, while keeping the VFM frozen. We evaluate GAP on RoboMimic and ManiSkill under severe data scarcity (15-50 demonstrations) and domain shift. A simple adapter regularized with GAP consistently outperforms stronger attention-based poolers and end-to-end fine-tuning, achieving 62% success on RoboMimic Can with 15 demonstrations (+16% over AFA), 63% on the long-horizon high-precision Tool Hang task with 50 demonstrations, and 61% on ManiSkill StackCube with 30 demonstrations (+11% over full fine-tuning). The proxy stage is lightweight and fully decoupled from downstream tasks, making it practical to reuse across environments and manipulation skills.
Chinese Translation
从稀缺的专家演示中学习视觉运动策略仍然是机器人操作中的一个核心挑战。一个主要障碍在于如何将高维RGB表示提炼为与控制相关的几何信息,而不发生过拟合。虽然使用冻结的预训练视觉基础模型(Vision Foundation Models, VFMs)提高了数据效率,但这也将大部分任务适应转移到一个小的空间池模块上,该模块可能会依赖于与任务无关的捷径,并在用少量数据样本进行微调时失去几何基础。更广泛地说,用于策略学习的预训练视觉表示在面对即使是轻微的场景扰动时也表现出困难,这突显了对稳健性导向归纳偏差的需求。我们提出了几何锚点预训练(Geometric Anchor Pre-training, GAP),这是一种简单的无动作热身阶段,在下游模仿学习之前对空间适配器进行正则化。GAP在一个轻量级的模拟代理任务上对池层进行预训练,该任务中对象掩码可以无成本获得,鼓励适配器产生位于对象上的关键点,覆盖其空间范围,并在时间上保持清晰和可重复。这产生了稳定的几何锚点,为少样本策略学习提供了可靠的坐标接口,同时保持VFM不变。我们在RoboMimic和ManiSkill上评估GAP,在严重的数据稀缺(15-50次演示)和领域转移下,经过GAP正则化的简单适配器始终优于更强的基于注意力的池化器和端到端微调,在RoboMimic Can任务中以15次演示实现62%的成功率(比AFA高出16%),在长时间高精度的工具悬挂任务中以50次演示实现63%的成功率,在ManiSkill StackCube任务中以30次演示实现61%的成功率(比完全微调高出11%)。代理阶段轻量且与下游任务完全解耦,使其在不同环境和操作技能中可实用地重用。
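One way to picture the "keypoints that lie on the object" objective from this abstract is a spatial-softmax keypoint with a mask-based penalty. The sketch below is an assumption-laden illustration, not GAP's actual loss: the function names are hypothetical, the score map is a tiny hand-written grid, and only the on-object term is shown (coverage, sharpness, and repeatability terms are omitted).

```python
import math

def spatial_softmax(heatmap):
    """Convert a 2D score map into a probability map that sums to one."""
    peak = max(max(row) for row in heatmap)
    exps = [[math.exp(v - peak) for v in row] for row in heatmap]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def keypoint(prob):
    """Expected (row, col) location under the probability map."""
    r = sum(i * p for i, row in enumerate(prob) for p in row)
    c = sum(j * p for row in prob for j, p in enumerate(row))
    return r, c

def on_object_loss(prob, mask):
    """Probability mass that falls outside the object mask (0 is best)."""
    inside = sum(p * m for prow, mrow in zip(prob, mask) for p, m in zip(prow, mrow))
    return 1.0 - inside

mask = [[0, 0, 0], [0, 1, 1], [0, 0, 0]]            # object occupies two pixels
on_obj = spatial_softmax([[0, 0, 0], [0, 8, 0], [0, 0, 0]])
off_obj = spatial_softmax([[8, 0, 0], [0, 0, 0], [0, 0, 0]])
print(keypoint(on_obj), on_object_loss(on_obj, mask))
print(keypoint(off_obj), on_object_loss(off_obj, mask))
```

Because the mask comes for free in simulation, a term of this kind needs no action labels, which matches the action-free warm-up described in the abstract.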
cs.RO / 26 / 2605.15845
Structured Jacobian Construction for Motion Optimization with High-Order Time Derivatives in Multi-Link Systems
多链接系统中高阶时间导数运动优化的结构化雅可比矩阵构建
Abstract
This paper presents a novel framework for Jacobian computation in motion optimization problems involving multi-link systems, where physical quantities are represented using higher-order time derivatives. In motion optimization of robots and humans, cost functions may incorporate higher-order time derivatives, such as jerk or the time variation of forces, to capture smoothness and perceptual characteristics, particularly in motion skill analysis and expressive behaviors, thereby necessitating Jacobian computations involving these quantities. However, such Jacobians are typically computed using numerical or automatic differentiation without explicitly exploiting the underlying multi-link structure, which can lead to increased computational cost and numerical instability. To address this limitation, we propose a structured Jacobian formulation for motion optimization, based on the comprehensive motion computation framework, in which physical quantities and their higher-order time derivatives are systematically represented along the multi-link structure. The proposed method systematically derives analytical expressions for Jacobians of kinematic and dynamic quantities, including momentum, forces, and joint torques, with respect to generalized coordinates and their higher-order derivatives. The resulting framework is applicable to both direct and inverse optimization. Through numerical experiments, we demonstrate that the proposed method improves computational efficiency compared to numerical and automatic differentiation, while achieving comparable accuracy. Furthermore, we demonstrate its effectiveness in inverse optimization by recovering cost function weights from motion data. Together, these results indicate that the proposed formulation provides a scalable and structured computational foundation for motion optimization involving higher-order time derivatives in multi-link systems.
Chinese Translation
本文提出了一种新颖的雅可比矩阵计算框架,用于涉及多链接系统的运动优化问题,其中物理量使用高阶时间导数表示。在机器人和人类的运动优化中,成本函数可能包含高阶时间导数,例如抖动(jerk)或力的时间变化,以捕捉平滑性和感知特征,特别是在运动技能分析和表现行为中,因此需要涉及这些量的雅可比矩阵计算。然而,这种雅可比矩阵通常通过数值或自动微分计算,而未明确利用潜在的多链接结构,这可能导致计算成本增加和数值不稳定。为了解决这一限制,我们提出了一种基于综合运动计算框架的运动优化结构化雅可比矩阵公式,其中物理量及其高阶时间导数沿多链接结构系统地表示。所提出的方法系统地推导出与广义坐标及其高阶导数相关的运动学和动力学量(包括动量、力和关节扭矩)的雅可比矩阵的解析表达式。所得到的框架适用于直接和逆优化。通过数值实验,我们证明了所提出的方法在计算效率上优于数值和自动微分,同时实现了可比的准确性。此外,我们还通过从运动数据中恢复成本函数权重,展示了其在逆优化中的有效性。这些结果表明,所提出的公式为涉及多链接系统中高阶时间导数的运动优化提供了可扩展和结构化的计算基础。
cs.RO / 27 / 2605.15892
Designing for Robot Wranglers: A Synthesis of Literature and Practice
为机器人驯养者设计:文献与实践的综合
Abstract
Robots are increasingly present in human spaces, such as for conducting deliveries in hospitals, interacting with visitors at museums, and stocking items in warehouses. To ensure the seamless integration of robots into these spaces, a new role in human-robot interaction is emerging - the robot wrangler, namely an individual who is responsible for setting up, overseeing, and troubleshooting the robot. To understand the needs of this stakeholder, we conducted a scoping review that uncovered a typology of robot wrangling across the research literature, and discovered that wrangling is an umbrella term that collapses a highly complex and heterogeneous space of activities, often rendering this labor difficult to characterize and support. To further clarify and understand robot wrangling, we then reflected on our own firsthand and imagined experiences as robot wranglers within our own respective domains. Guided by the scoping review and our reflections, we devise a series of design implications for supporting wranglers directly as individuals and as members of a wider service ecology.
Chinese Translation
机器人在医院进行送货、在博物馆与访客互动以及在仓库中存放物品等人类空间中的应用日益普遍。为了确保机器人在这些空间中的无缝集成,一种新的角色在人与机器人互动中逐渐浮现——机器人驯养者,即负责设置、监督和排除机器人故障的个人。为了理解这一利益相关者的需求,我们进行了范围评估,揭示了研究文献中机器人驯养的类型学,并发现驯养是一个涵盖高度复杂和异质活动空间的总称,常常使这一劳动难以特征化和支持。为了进一步澄清和理解机器人驯养,我们反思了自己在各自领域作为机器人驯养者的第一手和想象中的经历。在范围评估和我们的反思的指导下,我们制定了一系列设计启示,以直接支持驯养者作为个体以及作为更广泛服务生态系统中的成员。
cs.RO / 28 / 2605.15935
Dynamic Plasma Shape Control with Arbitrary Sensor Subsets
任意传感器子集下的动态等离子体形状控制
Abstract
Plasma shape control in tokamaks requires a real-time controller that tracks dynamically changing shape targets while tolerating diagnostic failures. Classical approaches decompose the problem into equilibrium reconstruction followed by a linear controller, and assume a fixed, fully operational sensor set. We present a reinforcement learning agent that addresses both limitations simultaneously. The agent is trained in NSFsim, a high-fidelity tokamak simulator configured for DIII-D, on a curated dataset of 120 experimental plasma shapes. The shape targets are resampled as random step changes every 0.25 s, exposing the agent to diverse transitions across the full shape envelope. At test time the agent zero-shot tracks dynamic shape sequences; on a held-out static configuration in simulation it achieves a mean shape error of 2.01 cm, and dynamic trajectory following is demonstrated qualitatively in simulation and on the physical device. Diagnostic dropout randomly masks 30% of magnetic sensors per episode, yielding a single policy robust to arbitrary sensor subsets without backup controllers or mode-switching logic. An asymmetric actor-critic architecture with privileged equilibrium information improves value estimation under partial observability; an auxiliary shape reconstruction head on the actor enables end-to-end shape reconstruction from raw diagnostics and serves as an interpretability tool for policy analysis. The policy transfers to experimental DIII-D shots, where it directly commands the coil actuators on two dynamic shape maneuvers, and to the independent GSevolve simulator.
Chinese Translation
在托卡马克中,等离子体形状控制需要一个实时控制器,该控制器能够跟踪动态变化的形状目标,同时容忍诊断故障。经典方法将问题分解为平衡重建和线性控制器,并假设传感器集是固定且完全可操作的。我们提出了一种强化学习代理,能够同时解决这两个限制。该代理在NSFsim中进行训练,NSFsim是一个为DIII-D配置的高保真托卡马克模拟器,使用的是一个经过精心策划的120个实验等离子体形状的数据集。形状目标每0.25秒重新采样为随机步进变化,使代理能够接触到整个形状包络内的多样化过渡。在测试时,代理能够零样本跟踪动态形状序列;在模拟中的一个静态配置上,它实现了2.01厘米的平均形状误差,并且动态轨迹跟踪在模拟和物理设备上均得到了定性验证。诊断掉落在每个回合随机掩盖30%的磁传感器,从而产生一个对任意传感器子集具有鲁棒性的单一策略,无需备份控制器或模式切换逻辑。采用不对称的演员-评论家架构,并利用特权平衡信息改善部分可观测性下的价值估计;演员上的辅助形状重建头使得能够从原始诊断数据进行端到端的形状重建,并作为策略分析的可解释性工具。该策略转移到实验DIII-D拍摄中,直接指挥两个动态形状操作的线圈执行器,并转移到独立的GSevolve模拟器上。
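The per-episode diagnostic dropout described in this abstract can be sketched in a few lines. The function names are hypothetical, and two details are assumptions not stated in the abstract: masked readings are set to zero, and the binary mask is appended to the observation so the policy can tell a dead sensor from a true zero reading.

```python
import random

def sample_sensor_mask(num_sensors, drop_fraction=0.3, rng=random):
    """Pick which sensors are dead for one episode (1 = alive, 0 = masked)."""
    num_dropped = round(drop_fraction * num_sensors)
    dropped = set(rng.sample(range(num_sensors), num_dropped))
    return [0 if i in dropped else 1 for i in range(num_sensors)]

def mask_observation(readings, mask):
    """Zero the masked readings and append the mask to the observation."""
    return [r * m for r, m in zip(readings, mask)] + list(mask)

rng = random.Random(0)                                 # seeded for reproducibility
mask = sample_sensor_mask(num_sensors=10, rng=rng)     # held fixed for the whole episode
obs = mask_observation([0.5] * 10, mask)
print(mask, len(obs))
```

Resampling the mask at every episode reset exposes a single policy to many sensor subsets, which is how one policy can cover arbitrary failures without separate backup controllers.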
cs.RO / 29 / 2605.15944
FocalPolicy: Frequency-Optimized Chunking and Locally Anchored Flow Matching for Coherent Visuomotor Policy
FocalPolicy:针对一致性视觉运动策略的频率优化分块和局部锚定流匹配
Abstract
Visuomotor policies aim to learn complex manipulation tasks from expert demonstrations. However, generating smooth and coherent trajectories remains challenging, as it requires balancing proximal precision with distal foresight. Existing approaches typically focus on optimizing intra-chunk action distributions, often neglecting the inter-chunk coherence. Consequently, inter-chunk discontinuities significantly impede the learning of coherent long-horizon actions. To overcome this limitation and achieve a synergetic balance between precision and foresight, we propose FocalPolicy, a foresight-aware visuomotor policy that combines Frequency-Optimized Chunking with Locally Anchored flow matching. We introduce a foresight composite objective that supervises time-domain alignment within the proximal actions while regularizing frequency-domain structure over multiple future action chunks to improve cross-chunk coherence. To efficiently learn complex action distributions, we design locally anchored sampling to enhance target signal propagation efficiency during consistency flow matching training. Extensive experiments demonstrate that FocalPolicy outperforms existing approaches and confirm the generalizability of our modules to other baselines. Project website: https://focalpolicy.github.io/
Chinese Translation
视觉运动策略旨在从专家演示中学习复杂的操作任务。然而,生成平滑且一致的轨迹仍然具有挑战性,因为这需要在近端精确性和远端前瞻性之间取得平衡。现有方法通常侧重于优化内部分块的动作分布,往往忽视了分块之间的一致性。因此,分块之间的不连续性显著阻碍了一致性长时间动作的学习。为了克服这一限制,并在精确性与前瞻性之间实现协同平衡,我们提出了FocalPolicy,一种前瞻性意识的视觉运动策略,它结合了频率优化分块和局部锚定流匹配。我们引入了一种前瞻性复合目标,以监督近端动作中的时间域对齐,同时在多个未来动作分块上对频率域结构进行正则化,以改善跨分块的一致性。为了高效学习复杂的动作分布,我们设计了局部锚定采样,以增强在一致性流匹配训练期间目标信号传播的效率。大量实验表明,FocalPolicy优于现有方法,并确认我们的模块对其他基线的可推广性。项目网站:https://focalpolicy.github.io/
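The composite objective in this abstract pairs a time-domain term on the nearest actions with a frequency-domain term over a longer horizon. The sketch below is one possible reading, not FocalPolicy's actual loss: it assumes a one-dimensional action sequence, uses a naive DFT magnitude spectrum as the frequency-domain structure, and the names `foresight_loss`, `proximal_len`, and `freq_weight` are hypothetical.

```python
import cmath

def dft_magnitudes(signal):
    """Magnitude spectrum of a real 1D sequence via a naive DFT."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def foresight_loss(pred, target, proximal_len, freq_weight=0.1):
    """Time-domain MSE on the first chunk plus spectral MSE over the full horizon."""
    time_term = mse(pred[:proximal_len], target[:proximal_len])
    freq_term = mse(dft_magnitudes(pred), dft_magnitudes(target))
    return time_term + freq_weight * freq_term

target = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]   # smooth ramp over two chunks of 4
jumpy = [0.0, 0.1, 0.2, 0.3, 0.6, 0.3, 0.8, 0.5]    # same first chunk, jittery second
print(foresight_loss(target, target, proximal_len=4))
print(foresight_loss(jumpy, target, proximal_len=4))
```

The jittery prediction matches the target exactly on the proximal chunk, so its entire penalty comes from the spectral term; that is the role the abstract assigns to frequency-domain regularization across future chunks.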
cs.RO / 30 / 2605.15949
A Reproducible and Physically Feasible Dynamic Parameter Identification Framework for a Low-Cost Robot Arm
一种可重复且物理可行的低成本机器人臂动态参数识别框架
Abstract
This paper presents a reproducible and physically feasible dynamic parameter identification framework for CRANE-X7, a low-cost robot arm driven by modular smart actuators. To improve practical identifiability, products of inertia are removed according to approximate link symmetry, reducing the rigid-body model from 65 to 39 base parameters. Identification motions are hand-designed from structured single-joint and adjacent-joint primitives under practical joint-range limits. The proposed pipeline combines preprocessing, inverse-dynamics-regressor-based ordinary least squares (OLS), conditional semidefinite-programming (SDP) projection for feasibility recovery, and closed-loop input error (CLIE) refinement. Candidate solutions from 40 structured trajectories are analyzed in a common PCA space to select a statistically central representative model. Because statistical centrality alone does not ensure physical acceptability, the selected model is finally screened by an all-pose positive-definiteness audit of the inertia matrix and, when necessary, corrected by a localized post-CLIE SDP rescue step. Experiments show that the parameter cloud becomes progressively more concentrated from OLS to SDP and CLIE, while the final accepted model preserves high predictive accuracy on held-out validation motions. These results demonstrate a practical route to statistically coherent and physically feasible dynamic models for low-cost robot platforms.
Chinese Translation
本文提出了一种可重复且物理可行的动态参数识别框架,适用于CRANE-X7,这是一种由模块化智能驱动器驱动的低成本机器人臂。为了提高实际可识别性,根据近似的连杆对称性去除了惯性乘积,将刚体模型的基础参数从65个减少到39个。识别运动是根据实际关节范围限制,从结构化的单关节和相邻关节原语中手动设计的。所提出的流程结合了预处理、基于逆动力学回归的普通最小二乘法(OLS)、条件半正定规划(SDP)投影以恢复可行性,以及闭环输入误差(CLIE)精炼。从40条结构化轨迹中分析候选解,在一个共同的主成分分析(PCA)空间中选择统计中心的代表模型。由于统计中心性本身并不能确保物理可接受性,最终选定的模型通过对惯性矩阵进行全姿态正定性审计进行筛选,并在必要时通过局部的后CLIE SDP救援步骤进行修正。实验表明,从OLS到SDP和CLIE,参数云逐渐变得更加集中,而最终接受的模型在保留的验证运动上保持了高预测精度。这些结果展示了一条为低成本机器人平台提供统计一致且物理可行的动态模型的实用路径。
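The final screening step in this abstract, an all-pose positive-definiteness audit of the inertia matrix, can be illustrated on a much smaller system. The sketch below uses the textbook joint-space inertia matrix of a planar two-link arm as a stand-in for the 7-DOF CRANE-X7, samples elbow angles on a grid, and applies Sylvester's criterion. The parameter names and values are hypothetical, and a grid audit only approximates a check over all poses.

```python
import math

def inertia_matrix(q2, p):
    """Joint-space inertia matrix entries of a planar two-link arm at elbow angle q2."""
    c2 = math.cos(q2)
    m11 = (p["I1"] + p["I2"] + p["m1"] * p["lc1"] ** 2
           + p["m2"] * (p["l1"] ** 2 + p["lc2"] ** 2 + 2 * p["l1"] * p["lc2"] * c2))
    m12 = p["I2"] + p["m2"] * (p["lc2"] ** 2 + p["l1"] * p["lc2"] * c2)
    m22 = p["I2"] + p["m2"] * p["lc2"] ** 2
    return m11, m12, m22

def passes_audit(p, samples=181):
    """Check positive definiteness on a grid of elbow angles (Sylvester's criterion)."""
    for i in range(samples):
        q2 = -math.pi + 2 * math.pi * i / (samples - 1)
        m11, m12, m22 = inertia_matrix(q2, p)
        if m11 <= 0.0 or m11 * m22 - m12 ** 2 <= 0.0:
            return False
    return True

feasible = {"m1": 1.0, "m2": 0.8, "l1": 0.3, "lc1": 0.15, "lc2": 0.12,
            "I1": 0.01, "I2": 0.005}
infeasible = dict(feasible, I2=-0.02)   # a fit can return a negative inertia
print(passes_audit(feasible), passes_audit(infeasible))
```

An unconstrained least-squares fit can reproduce torques well while returning parameters like the second set, which is why the abstract follows statistical selection with a physical-feasibility check.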
cs.RO / 31 / 2605.15964
WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation
WorldVLN:用于空中视觉-语言导航的自回归世界动作模型
Abstract
Aerial vision-language navigation (VLN) requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and directly decodes them into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training framework that first grounds the video prior in instruction-conditioned navigation dynamics and then develops Action-aware GRPO, the first reinforcement learning method tailored to autoregressive WAMs, to optimize waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing Vision-Language-Action baselines with 12%+ success-rate gains and larger advantages on challenging cases. It further transfers zero-shot to real drone deployment, suggesting that the proposed WorldVLN offers a promising route for spatial action tasks. Demos and code are available at https://embodiedcity.github.io/WorldVLN/.
Chinese Translation
空中视觉-语言导航(VLN)要求智能体通过闭环感知和行动在三维环境中遵循自然语言指令。我们认为,空中VLN可以被表述为一个以预测为驱动的世界动作问题:智能体应当预见潜在的世界演变并根据预测的后果采取行动。为此,我们提出了WorldVLN,这是第一个用于空中VLN的自回归世界动作模型。与生成整个视觉片段的全序列视频生成世界模型不同,WorldVLN采用潜在的自回归视频骨干网络来预测短期的世界状态转变,并将其直接解码为可执行的航点动作。在每个动作片段执行后,新接收到的观察结果被编码回自回归上下文中,从而实现闭环的世界-动作预测。我们进一步引入了一个两阶段的训练框架,首先将视频先验与指令条件下的导航动态相结合,然后开发了Action-aware GRPO,这是一种针对自回归世界动作模型的首个强化学习方法,通过其下游展开后果优化航点决策。在公共户外和室内基准测试中,WorldVLN在成功率上比现有的视觉-语言-动作基线提高了12%以上,并在具有挑战性的案例中表现出更大的优势。它还能够在真实无人机部署中实现零样本迁移,这表明所提出的WorldVLN为空间动作任务提供了一条有前景的路径。演示和代码可在 https://embodiedcity.github.io/WorldVLN/ 获取。
cs.RO / 32 / 2605.15971
OHP-RL: Online Human Preference as Guidance in Reinforcement Learning for Robot Manipulation
OHP-RL:在线人类偏好作为机器人操作强化学习中的指导
Abstract
While reinforcement learning (RL) enables robots to acquire skills autonomously, its real-world deployment is severely limited by inefficient and unsafe exploration. Human-in-the-loop interventions offer a practical solution, yet existing methods typically exploit these interventions as auxiliary training signals, without fully capturing the richer information they provide about when and how autonomy should be guided. Human interventions often encode relative preferences over behavior under safety and task constraints, rather than prescribing exact actions to imitate. Motivated by this perspective, we propose Online Human Preference as Guidance in Reinforcement Learning (OHP-RL), a framework that leverages human interventions as preference information to guide policy learning. OHP-RL introduces a state-dependent preference gate that adaptively regulates when and to what extent human interventions should shape policy learning. This design enables the agent to benefit from intermittent and imperfect human feedback while preserving autonomous exploration and stable policy optimization. We evaluate OHP-RL on three challenging real-world contact-rich manipulation tasks on a Franka robot. Across all tasks, OHP-RL consistently achieves strong success rates, faster convergence, and substantially lower human intervention effort than prior approaches. Moreover, the learned policies exhibit more stable and human-aligned behavior throughout training.
Chinese Translation
虽然强化学习(RL)使机器人能够自主获取技能,但其在现实世界中的应用受到低效和不安全探索的严重限制。人机协作干预提供了一种实用的解决方案,但现有方法通常将这些干预视为辅助训练信号,而未能充分捕捉它们在何时以及如何引导自主性方面提供的更丰富信息。人类干预通常编码了在安全和任务约束下对行为的相对偏好,而不是规定要模仿的确切动作。基于这一视角,我们提出了在线人类偏好作为强化学习中的指导(OHP-RL),这一框架利用人类干预作为偏好信息来指导策略学习。OHP-RL引入了一种状态依赖的偏好门,能够自适应地调节人类干预在何时以及在多大程度上影响策略学习。这一设计使得代理能够在保持自主探索和稳定策略优化的同时,受益于间歇性和不完美的人类反馈。我们在Franka机器人上评估了OHP-RL在三个具有挑战性的现实世界接触丰富的操作任务中的表现。在所有任务中,OHP-RL始终实现了较高的成功率、更快的收敛速度,以及显著降低的人类干预工作量。此外,学习到的策略在整个训练过程中表现出更稳定和与人类一致的行为。
cs.RO / 33 / 2605.15999
Constrained MPC-Based Motion Planning for Morphing Quadrotors in Ultra-Narrow Passages under Limited Perception
基于约束模型预测控制的形态变化四旋翼在有限感知下通过超狭窄通道的运动规划
Abstract
This paper introduces a motion planning framework that plans both morphology and trajectory for morphing quadrotors in extremely constrained environments. We develop a novel obstacle avoidance cost function for nonlinear model predictive control (MPC) that enables navigation through extremely narrow gaps under limited perception from a 2D LiDAR. Classical artificial potential field-based costs typically assign a high cost inside narrow passages, artificially blocking the navigable path. In contrast, we propose a smooth exponential obstacle cost that preserves low traversal cost within narrow gaps while maintaining strong collision avoidance behavior. The formulation avoids hard activation thresholds and introduces a cost reduction factor to reduce the cost within narrow passages. Direct use of 2D LiDAR measurements in MPC allows navigation around arbitrarily shaped obstacles. The method is embedded within an acados-based nonlinear MPC framework. Simulation and experimental results demonstrate successful traversal of narrow corridors where typical repulsive cost functions would fail. The approach provides a computationally efficient and practical solution for navigating through tight spaces while keeping a safe distance from obstacles. Although we implement the framework on morphing quadrotors, the cost function formulation is general and applies to any mobile robot, not only morphing quadrotors. The implementation code is available at https://github.com/harshjmodi1996/morphocopter_mpc and a short video is available at https://zh.engr.tamu.edu/wp-content/uploads/sites/310/2026/03/MPC_MorphoCopter_video.mp4.
Chinese Translation
本文介绍了一种运动规划框架,用于在极其受限的环境中规划形态和轨迹,以适应形态变化的四旋翼。我们开发了一种新颖的障碍物避免成本函数,适用于非线性模型预测控制(MPC),使得在有限的二维激光雷达(LiDAR)感知下能够通过极窄的缝隙进行导航。经典的基于人工势场的成本函数在狭窄通道中通常会产生高成本,从而人为地阻碍可导航路径。相反,我们提出了一种平滑的指数障碍成本,该成本在狭窄缝隙中保持低通行成本,同时维持强大的避碰行为。该公式避免了硬激活阈值,并引入了成本降低因子,以减少狭窄通道内的成本。在MPC中直接使用二维激光雷达测量值,使得能够绕过任意形状的障碍物。该方法嵌入在基于acados的非线性MPC框架中。仿真和实验结果展示了在典型排斥成本函数失效的情况下成功穿越狭窄走廊的能力。该方法为在狭小空间中安全导航提供了一种计算高效且实用的解决方案。虽然我们正在将该框架应用于形态变化的四旋翼,但成本函数的公式具有通用性,适用于任何移动机器人应用,并不限于形态变化的四旋翼。实现代码可在 https://github.com/harshjmodi1996/morphocopter_mpc 获取,相关视频可在 https://zh.engr.tamu.edu/wp-content/uploads/sites/310/2026/03/MPC_MorphoCopter_video.mp4 查看。
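The contrast this abstract draws between a classical repulsive potential and a smooth exponential obstacle cost can be illustrated with a small sketch. The abstract does not give the actual formulation, so everything below is an assumption made for illustration: the functional forms, the helper names (`apf_cost`, `exp_cost`, `corridor_cost`), and the parameters `weight`, `decay`, and `reduction` are hypothetical and are not taken from the paper or its repository.

```python
import math

def apf_cost(d, d0=1.0, eta=1.0):
    """Classical repulsive potential: zero beyond d0, unbounded as d -> 0.
    (Textbook form, used here only as a reference point.)"""
    if d >= d0:
        return 0.0
    return 0.5 * eta * (1.0 / d - 1.0 / d0) ** 2

def exp_cost(d, weight=10.0, decay=8.0, reduction=1.0):
    """Hypothetical smooth exponential cost: no activation threshold,
    bounded above by weight * reduction at d = 0."""
    return reduction * weight * math.exp(-decay * d)

def corridor_cost(cost_fn, half_width, **kwargs):
    """Cost at the centre of a corridor with one wall on each side."""
    return 2.0 * cost_fn(half_width, **kwargs)

half_width = 0.15  # metres from the corridor centre to each wall (assumed)
apf_centre = corridor_cost(apf_cost, half_width)
exp_centre = corridor_cost(exp_cost, half_width)
exp_reduced = corridor_cost(exp_cost, half_width, reduction=0.3)
print(apf_centre, exp_centre, exp_reduced)
```

With these assumed parameters the repulsive potential is several times larger at the corridor centre than the exponential cost, and the reduction factor lowers it further, which is the qualitative behaviour the abstract describes.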
cs.RO / 34 / 2605.16009
Fast Expanding Safe Circular Regions for Efficient Local Path Planning
快速扩展安全圆形区域以实现高效的局部路径规划
Abstract
Local navigation is one of the fundamental problems in robot navigation, and numerous approaches have been proposed over the years, including the Dynamic Window Approach, Model Predictive Control, and, more recently, Control Barrier Functions and machine-learning-based techniques. While these methods perform well in simple environments, many of them rely on optimization- or learning-based procedures that can struggle in more complex scenarios. In contrast, this article proposes a geometric algorithmic approach that enables local navigation with faster computation times and longer planning horizons. The proposed method computes, from a local LiDAR scan, a sequence of circular regions that expand in the direction of the goal and capture free, locally navigable space. The method was implemented in the ROS2 framework and evaluated in a simulated environment.
Chinese Translation
局部导航是机器人导航中的一个基本问题,多年来提出了许多方法,包括动态窗口方法(Dynamic Window Approach)、模型预测控制(Model Predictive Control),以及最近的控制障碍函数(Control Barrier Functions)和基于机器学习的技术。虽然这些方法在简单环境中表现良好,但许多方法依赖于优化或学习基础的程序,在更复杂的场景中可能会遇到困难。相比之下,本文提出了一种更几何的算法方法,使局部导航方法能够实现更快的计算时间和更长的规划视野。所提方法基于从局部激光雷达(LiDAR)扫描中计算出一系列向目标方向扩展的圆形区域,以捕捉自由的局部可导航空间。该方法在ROS2框架中实现,并在模拟环境中进行了评估。
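The idea of chaining obstacle-free circles toward a goal can be sketched as follows. This is not the authors' algorithm, which the abstract does not specify. It is a minimal illustration that assumes the scan is already converted to 2D points, that each circle is the largest one free of scan points around its centre (minus a safety margin), and that the next centre is placed on the boundary of the current circle in the goal direction. The function names and parameters are hypothetical.

```python
import math

def clearance(center, points, margin):
    """Radius of the largest scan-point-free circle at center, minus a margin."""
    return min(math.dist(center, p) for p in points) - margin

def expand_circles(start, goal, points, margin=0.1, min_radius=0.05, max_circles=50):
    """Chain safe circles from start toward goal; returns [(center, radius), ...]."""
    circles = []
    center = start
    for _ in range(max_circles):
        radius = clearance(center, points, margin)
        if radius < min_radius:
            break  # too close to obstacles, stop expanding
        circles.append((center, radius))
        to_goal = math.dist(center, goal)
        if to_goal <= radius:
            break  # goal lies inside the current safe circle
        # Next centre: on the current circle's boundary, toward the goal.
        step = radius / to_goal
        center = (center[0] + step * (goal[0] - center[0]),
                  center[1] + step * (goal[1] - center[1]))
    return circles

# Synthetic "scan": two parallel walls at y = -1 and y = +1.
walls = [(0.25 * i, y) for i in range(-4, 25) for y in (-1.0, 1.0)]
goal = (4.0, 0.0)
circles = expand_circles((0.0, 0.0), goal, walls)
for c, r in circles:
    print(f"centre=({c[0]:.2f}, {c[1]:.2f}) radius={r:.2f}")
```

Because each new centre lies on the previous circle, consecutive circles always overlap, so a path through the centres stays inside the union of the safe regions.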
cs.RO / 35 / 2605.16015
Adaptive Outer-Loop Control of Quadrotors via Reinforcement Learning
基于强化学习的四旋翼自适应外环控制
Abstract
Deep Reinforcement Learning (DRL) for quadrotor flight control typically relies on Domain Randomization (DR) for sim-to-real transfer, resulting in overly conservative policies that struggle with dynamic disturbances. To overcome this, we propose a novel adaptive control architecture that actively perceives and reacts to instantaneous perturbations. First, we train an optimal outer-loop policy, then replace its reliance on ground-truth disturbance data with a Residual Dynamics Predictor (RDP). The RDP estimates online the external forces and moments acting on the aircraft in flight, using only the history of states and control actions. For seamless hardware transfer, we introduce a data-efficient linear calibration bridge and an online thrust correction mechanism that align the simulated latent space with reality using mere seconds of flight data. Real-world validations on a Crazyflie micro-quadrotor demonstrate that our adaptive controller significantly outperforms baselines, maintaining precise trajectory tracking under severe uncertainties including mass variations, asymmetric payloads, and dynamic slung loads.
Chinese Translation
深度强化学习(Deep Reinforcement Learning, DRL)在四旋翼飞行控制中通常依赖于领域随机化(Domain Randomization, DR)进行仿真到现实的迁移,这导致了过于保守的策略,难以应对动态干扰。为了解决这个问题,我们提出了一种新颖的自适应控制架构,能够主动感知并对瞬时扰动做出反应。首先,我们训练一个最优的外环策略,然后用残差动力学预测器(Residual Dynamics Predictor, RDP)替代其对真实扰动数据的依赖。RDP在线估计飞行中作用于飞机的外部力和力矩,仅使用状态和控制动作的历史数据。为了实现无缝的硬件迁移,我们引入了一种数据高效的线性校准桥接和在线推力修正机制,通过仅需几秒钟的飞行数据将仿真潜在空间与现实对齐。在Crazyflie微型四旋翼上的实际验证表明,我们的自适应控制器显著优于基线,在包括质量变化、不对称载荷和动态悬挂负载等严重不确定性下,保持了精确的轨迹跟踪。
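A linear calibration from a handful of samples can be sketched with ordinary least squares. The abstract does not describe how its calibration bridge is built, so this is only an assumed scalar version: it fits `real ≈ a * sim + b` from paired samples. The function name `fit_linear` and the numbers are hypothetical and do not come from the paper.

```python
def fit_linear(sim, real):
    """Closed-form least-squares fit of real ~= a * sim + b (scalar case)."""
    n = len(sim)
    mean_s = sum(sim) / n
    mean_r = sum(real) / n
    var_s = sum((s - mean_s) ** 2 for s in sim)
    cov_sr = sum((s - mean_s) * (r - mean_r) for s, r in zip(sim, real))
    a = cov_sr / var_s
    b = mean_r - a * mean_s
    return a, b

# Made-up paired samples: simulated feature value vs. value observed on hardware.
sim_values = [0.0, 1.0, 2.0, 3.0]
real_values = [0.5, 2.5, 4.5, 6.5]
a, b = fit_linear(sim_values, real_values)
calibrated = [a * s + b for s in sim_values]
print(a, b)
```

A fit of this kind has only two parameters per dimension, which is why a few seconds of data can be enough to estimate it.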
cs.RO / 36 / 2605.16043
Learning Sim-Grounded Policies for Bimanual Rope Manipulation from Human Teleoperation Data
从人类遥操作数据中学习基于仿真基础的双手绳索操控策略
Abstract
Deformable Linear Objects (DLOs) such as ropes and cables are widely encountered in both household and industrial applications, yet remain challenging to manipulate due to their infinite-dimensional configuration space and frequent self-occlusion. Imitation learning from teleoperation offers a practical path to bimanual DLO manipulation, but its scalability is limited by human effort, making the choice of observation space critical for generalization from small datasets. In this study, we investigate whether the lack of generalization in egocentric visual policies for the knot-untangling task stems from the observation space itself, rather than from the policy architecture or data scale. We compare two Action Chunking with Transformers policies trained on the same bimanual teleoperation data: a vision-based policy conditioned on two egocentric RGB streams from wrist-mounted cameras, and a state-based policy conditioned on the DLO's 3D particle state, extracted from an initial observation via multi-view fusion and evolved in a particle-based eXtended Position-Based Dynamics simulation. Evaluated open-loop on an unseen rope configuration, the state-based policy outperforms its visual counterpart with a 30.8% reduction in L1 error when predicting the initial grasp-and-pull action, quantifying the observability gap between pixels and physics-consistent state, and pointing toward more data-efficient robot learning for the DLO manipulation task from limited human demonstrations.
Chinese Translation
可变形线性物体(DLOs),如绳索和电缆,在家庭和工业应用中广泛存在,但由于其无限维配置空间和频繁的自遮挡,操控仍然具有挑战性。通过遥操作进行模仿学习为双手DLO操控提供了一条实用路径,但其可扩展性受到人类努力的限制,使得观察空间的选择对于从小数据集中进行泛化至关重要。在本研究中,我们探讨了在解结任务中,自我中心视觉策略泛化不足是否源于观察空间本身,而非策略架构或数据规模。我们比较了在相同双手遥操作数据上训练的两种基于Transformer的动作分块策略:一种是基于视觉的策略,依赖于来自手腕安装摄像头的两个自我中心RGB流;另一种是基于状态的策略,依赖于通过多视角融合从初始观察中提取的DLO的3D粒子状态,并在基于粒子的扩展位置动力学(eXtended Position-Based Dynamics)模拟中演变。在对未见绳索配置进行开环评估时,基于状态的策略在预测初始抓取和拉动动作时,相较于视觉策略减少了30.8%的L1误差,量化了像素与物理一致状态之间的可观察性差距,并指向了从有限人类示范中实现更数据高效的机器人学习以进行DLO操控任务的可能性。
cs.RO / 37 / 2605.16056
Health-Conditioned Vision-Language-Action Models for Malfunction-Aware Robot Control
健康状态感知的视觉-语言-动作模型用于故障意识机器人控制
Abstract
Research on Vision-Language-Action (VLA) models has grown rapidly in recent years. Although some of these models focus on detecting, preventing, and recovering from task failures, they usually do not adapt to physical failures of the robot itself. In real-life scenarios, most robots experience physical degradation in various forms, such as joint degradation, actuator failure, or a weak gripper. We introduce a malfunction-aware (health-conditioned) VLA that takes as input a health vector describing the operating angle and torque capability of the robot's joints, and adapts its predictions to complete tasks with the degraded joints. To achieve this, we inject a Health Projector module into the VLA-Adapter architecture and train it on malfunctioning-robot data that we collected in the LIBERO environment [1]. We collect 128 teleoperated episodes on Libero-Spatial tasks. Our results show that, with a very lightweight addition, the model can learn to operate successfully with different configurations of degraded joints, which the default pretrained VLA-Adapter Libero-Spatial-Pro model cannot. The code and dataset will be available soon at https://github.com/h-arslan/health-aware-vla
Chinese Translation
近年来,关于视觉语言动作(VLA)模型的研究迅速增加。尽管其中一些模型专注于检测、预防和恢复任务失败,但它们通常不处理适应机器人物理故障的问题。在现实场景中,大多数机器人以各种方式面临物理退化,例如关节退化、执行器故障或夹持器弱化。我们引入了一种故障意识(健康状态感知)VLA模型,该模型以健康向量作为输入,提供有关机器人关节操作角度和扭矩能力的信息,并根据退化的关节调整其预测以完成任务。为此,我们在VLA适配器架构中注入了一个健康投影模块,并在我们在LIBERO环境中收集的故障机器人数据上进行训练。我们在Libero-Spatial任务上收集了128个遥控操作的实验。我们的结果表明,通过非常轻量的添加,该模型能够学习在不同的退化关节配置下成功操作,而默认的预训练VLA适配器的Libero-Spatial-Pro模型则无法做到。代码和数据集将很快在https://github.com/h-arslan/health-aware-vla上发布。
cs.RO / 38 / 2605.16087
Towards Trustworthy and Explainable AI for Perception Models: From Concept to Prototype Vehicle Deployment
迈向可信赖和可解释的人工智能感知模型:从概念到原型车辆部署
Abstract
Deep Neural Networks have become the dominant solution for Autonomous Driving perception, but their opacity conflicts with emerging Trustworthy AI guidelines and complicates safety assurance, debugging, and human oversight. While theoretical frameworks for safe and Explainable AI (XAI) exist, concrete implementations of Trustworthy AI for 3D scene understanding remain scarce. We address this gap by proposing a Trustworthy AI perception module that is robust and integrates faithful explainability and calibrated uncertainty estimates. Building on a transformer-based detector, we derive explanations from the attention mechanism at inference time and validate their faithfulness using perturbation-based consistency tests. We further integrate an uncertainty estimation and calibration module, and apply robustness-enhancing training methods. Experiments show faithful saliency behavior, improved robustness, and well-calibrated uncertainty estimates. Finally, we deploy these Trustworthy AI elements in a prototype vehicle and provide an XAI Interface that visualizes documentation artifacts, model uncertainty state, and saliency maps, demonstrating the feasibility of trustworthy perception monitoring in real time. Supplementary materials are available at https://tillbeemelmanns.github.io/trustworthy_ai/.
Chinese Translation
深度神经网络已成为自动驾驶感知的主流解决方案,但其不透明性与新兴的可信赖人工智能指南相悖,且使得安全保障、调试和人类监督变得复杂。尽管存在安全和可解释人工智能(Explainable AI, XAI)的理论框架,但针对三维场景理解的可信赖人工智能的具体实现仍然稀缺。我们通过提出一个显著稳健的可信赖人工智能感知模块来填补这一空白,该模块集成了真实的可解释性和经过校准的不确定性估计。基于变换器(transformer)检测器,我们在推理时从注意力机制中推导解释,并使用基于扰动的一致性测试验证其真实性。我们进一步集成了不确定性估计和校准模块,并应用增强稳健性的训练方法。实验表明,模块具有真实的显著性行为、改善的稳健性和良好校准的不确定性估计。最后,我们在原型车辆中部署这些可信赖人工智能元素,并提供一个可解释人工智能接口,能够可视化文档工件、模型不确定性状态和显著性图,展示了实时可信赖感知监控的可行性。补充材料可在 https://tillbeemelmanns.github.io/trustworthy_ai/ 获取。
cs.RO / 39 / 2605.16115
Beyond Collision Avoidance: Multi-Robot Yielding and Spatial Affordance in Emergency Evacuations
超越碰撞避免:紧急疏散中的多机器人让路与空间适应性
Abstract
As mobile service robots increasingly coexist with pedestrians, ensuring passively safe behaviour during confined emergency evacuations is critical. Existing multi-robot yielding strategies often focus solely on collision avoidance and macroscopic flow optimisation, overlooking environmental affordances and human spatial expectations. To bridge the gap between macroscopic theory and micro-level perception, we conducted a game-based virtual evacuation experiment (N=56). We investigated individual psychological responses to four multi-robot yielding strategies (Hide, LineEscape, Freeze, ShortestPath) across confined corridors with and without refuge niches. Our results establish a robust preference hierarchy (Hide > LineEscape > Freeze > ShortestPath), demonstrating that proactive space-yielding significantly outperforms freezing and efficiency-first approaches. Crucially, we found that environmental affordances heavily shape cognitive expectations. Actively utilising available niches amplifies the psychological comfort of proactive yielding (Hide). Conversely, failing to use an obvious niche (e.g., executing LineEscape) may trigger Expectation Violation. This is reflected in a drastically increased perceived cognitive delay, despite objectively unimpeded trajectories. Furthermore, prior robot interaction experience helps users decode complex social intents. Ultimately, this research demonstrates that safe human-robot interaction during emergencies must evolve from pure trajectory optimisation to semantically aware navigation. Future work will extend this framework to investigate complex interactions between robot swarms and pedestrian crowds.
Chinese Translation
随着移动服务机器人与行人日益共存,在紧急疏散过程中确保被动安全行为至关重要。现有的多机器人让路策略往往仅关注于碰撞避免和宏观流动优化,忽视了环境适应性和人类空间期望。为了弥合宏观理论与微观感知之间的差距,我们进行了一项基于游戏的虚拟疏散实验(N=56)。我们研究了在有和没有避难空间的狭窄走廊中,个体对四种多机器人让路策略(Hide、LineEscape、Freeze、ShortestPath)的心理反应。我们的结果建立了一个稳健的偏好层次(Hide > LineEscape > Freeze > ShortestPath),表明主动让出空间显著优于冻结和以效率为先的方法。重要的是,我们发现环境适应性在很大程度上塑造了认知期望。积极利用可用的避难空间增强了主动让路(Hide)的心理舒适感。相反,未能使用明显的避难空间(例如,执行LineEscape)可能会引发期望违背。这反映在感知的认知延迟显著增加,尽管客观上路径并未受到阻碍。此外,之前的机器人互动经验帮助用户解码复杂的社会意图。最终,这项研究表明,在紧急情况下安全的人机互动必须从纯粹的轨迹优化演变为语义感知导航。未来的工作将扩展这一框架,以研究机器人群体与行人群体之间的复杂互动。
cs.RO / 40 / 2605.16257
DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo
DexJoCo:MuJoCo上面向任务的灵巧操控基准与工具包
Abstract
Achieving human-level manipulation requires dexterous robotic hands capable of complex object interactions. Advancing such capabilities further demands standardized benchmarks for systematic evaluation. However, existing dexterous benchmarks lack tasks that reflect the unique manipulation capabilities of dexterous hands over parallel grippers, as well as comprehensive evaluation pipelines. In this paper, we present DexJoCo, a benchmark and toolkit for task-oriented dexterous manipulation, comprising 11 functionally grounded tasks that evaluate tool-use, bimanual coordination, long-horizon execution, and reasoning. We develop a low-cost data collection system and collect 1.1K trajectories across these tasks, with support for domain randomization to assess robustness. We benchmark modern models under diverse settings, including visual and dynamics randomization, multi-task training, and action-head adaptation. Through extensive empirical analysis, we identify several important insights and common limitations of current policies in dexterous manipulation, highlighting key challenges for future research in dexterous hand robot learning. Project page available at: https://dexjoco.github.io
Chinese Translation
实现人类水平的操控需要具备复杂物体交互能力的灵巧机器人手。进一步提升这种能力需要标准化的基准以进行系统评估。然而,现有的灵巧基准缺乏能够反映灵巧手相较于平行夹持器独特操控能力的任务,以及全面的评估流程。本文提出了DexJoCo,一个面向任务的灵巧操控基准与工具包,包含11个功能性任务,用于评估工具使用、双手协调、长时间执行和推理能力。我们开发了一个低成本的数据收集系统,并在这些任务中收集了1.1K条轨迹,支持领域随机化以评估鲁棒性。我们在多种设置下对现代模型进行了基准测试,包括视觉和动态随机化、多任务训练以及动作头适应。通过广泛的实证分析,我们识别出当前灵巧操控策略的一些重要见解和共同局限性,突显了未来灵巧手机器人学习研究的关键挑战。项目页面可访问:https://dexjoco.github.io
cs.CV / 1 / 2605.15256
ReactiveGWM: Steering NPC in Reactive Game World Models
ReactiveGWM:在反应式游戏世界模型中引导非玩家角色
Abstract
Current game world models simulate environments from a subjective, player-centric perspective. However, by treating the Non-Player Character (NPC) merely as background pixels, these models cannot capture interactions between the player and the NPC. In that sense, they act as passive video renderers rather than real simulation engines, lacking the physical understanding needed to model action-induced NPC reactions. We introduce ReactiveGWM, a reactive game world model that synthesizes dynamic interactions between the player and the NPC. Instead of entangling all interaction dynamics, ReactiveGWM explicitly decouples player controls from NPC behaviors. Player actions are injected into the diffusion backbone via a lightweight additive bias, while high-level NPC responses (e.g., Offense, Control, Defense) are grounded through cross-attention modules. Crucially, these modules learn a game-agnostic representation of interactive logic. This enables zero-shot strategy transfer: our learned modules can be plugged directly into off-the-shelf, unannotated world models of different games. This instantly unlocks steerable NPC interactions without any domain-specific retraining. Evaluated on two Street Fighter games, ReactiveGWM maintains fine-grained player controllability while achieving robust, prompt-aligned NPC strategy adherence, paving the way for scalable, strategy-rich interaction with the NPC.
Chinese Translation
当前的游戏世界模型从主观的、以玩家为中心的视角模拟环境。然而,将非玩家角色(NPC)仅视为背景像素,这些模型无法捕捉玩家与NPC之间的互动。从这个意义上说,它们充其量只是被动的视频渲染器,而不是实际的模拟引擎,缺乏建模因行动而引发的NPC反应所需的物理理解。我们提出了ReactiveGWM,这是一种反应式游戏世界模型,合成玩家与NPC之间的动态互动。ReactiveGWM明确将玩家控制与NPC行为解耦,而不是将所有互动动态纠缠在一起。玩家的动作通过轻量级的附加偏置注入到扩散骨干中,而高层次的NPC响应(例如,攻击、控制、防御)则通过交叉注意模块进行基础化。关键在于,这些模块学习了一种与游戏无关的互动逻辑表示。这使得零样本策略转移成为可能:我们学习的模块可以直接插入不同游戏的现成、未注释的世界模型中。这瞬间解锁了可引导的NPC互动,而无需任何特定领域的再训练。在对两个街头霸王游戏的评估中,ReactiveGWM保持了细粒度的玩家可控性,同时实现了稳健、与提示对齐的NPC策略遵循,为与NPC的可扩展、丰富策略互动铺平了道路。
cs.CV / 2 / 2605.15300
Deep Pre-Alignment for VLMs
深度预对齐用于视觉语言模型
Abstract
Most Vision Language Models (VLMs) directly map outputs from ViT encoders to the LLM via a lightweight projector. While effective, recent analysis suggests this architecture suffers from an alignment challenge: visual features remain distant from the text space in the initial layers of the LLM, forcing the model to waste critical depth (Zhang et al., 2024; Artzy and Schwartz, 2024) on superficial modality alignment rather than deep understanding and complex reasoning. In this work, we propose Deep Pre-Alignment (DPA), a novel architecture that replaces the standard ViT encoder with a small VLM as perceiver, ensuring visual features are deeply aligned with the text space of the target large language model. Comprehensive experiments demonstrate the effectiveness of DPA. On the 4B parameter scale, DPA outperforms baselines by 1.9 points across 8 multimodal benchmarks, with gains widening to 3.0 points at the 32B scale. Moreover, by offloading alignment to the perceiver, DPA achieves a 32.9% reduction in language capability forgetting over 3 text benchmarks. We further demonstrate that these gains are consistent across different LLM families including Qwen3 and LLaMA 3.2, highlighting the generality of our approach. Beyond performance, DPA also offers a seamless upgrade path for current VLM development, requiring only a modular replacement for the visual encoder with marginal computation overhead.
Chinese Translation
大多数视觉语言模型(VLMs)通过轻量级投影器将ViT编码器的输出直接映射到大语言模型(LLM)。尽管这种方法有效,但最近的分析表明,该架构存在对齐挑战:在LLM的初始层中,视觉特征与文本空间之间的距离较远,迫使模型在表层模态对齐上浪费关键的深度,而不是进行深层理解和复杂推理。在本研究中,我们提出了深度预对齐(Deep Pre-Alignment, DPA),一种新颖的架构,用小型VLM作为感知器替代标准的ViT编码器,确保视觉特征与目标大语言模型的文本空间深度对齐。全面的实验表明DPA的有效性。在40亿参数规模下,DPA在8个多模态基准测试中比基线提高了1.9分,而在320亿规模下增幅扩大至3.0分。此外,通过将对齐任务转移到感知器,DPA在3个文本基准测试中实现了语言能力遗忘率降低32.9%。我们进一步证明,这些增益在不同的LLM家族中(包括Qwen3和LLaMA 3.2)是一致的,突显了我们方法的普适性。除了性能提升,DPA还为当前的VLM开发提供了无缝的升级路径,仅需对视觉编码器进行模块化替换,计算开销微乎其微。
cs.CV / 3 / 2605.15309
One Pass Is Not Enough: Recursive Latent Refinement for Generative Models
一次传递是不够的:生成模型的递归潜在细化
Abstract
Despite remarkable progress, image generation is far from solved. The dominant metric, FID, conflates sample fidelity with mode coverage and is close to being saturated. Yet a model can still exhibit mode collapse while achieving a low FID, since a handful of sharp, near-duplicate images can outscore a model that faithfully covers the full data distribution. We argue that precision and recall are essential complements to FID, and that because FID is already saturated, the more meaningful goal is to improve diversity and coverage. Achieving high recall requires a model that explicitly prioritizes mode coverage, unlike most generative models, which optimize sample fidelity. We introduce RTM, which replaces the single-pass latent mapping in style-based generators with an iterative refinement process, and show that this consistently improves both quality and diversity. Integrated with Implicit Maximum Likelihood Estimation (IMLE), which optimizes mode coverage by design, RTM achieves the highest precision and recall among current state-of-the-art approaches while maintaining competitive FID, with improvements across CIFAR-10, CelebA-HQ at 256x256, and nine few-shot benchmarks. RTM also improves StyleGAN2 and StyleGAN2-ADA on CIFAR-10 and AFHQ-v1 at 512x512, demonstrating that the benefit is not specific to IMLE. Unlike flow-matching baselines that achieve competitive FID at the expense of coverage, recursive refinement improves both quality and diversity simultaneously.
Chinese Translation
尽管取得了显著进展,图像生成仍远未解决。主导指标 FID 将样本保真度与模式覆盖混为一谈,并且接近饱和。然而,一个模型在实现低 FID 的同时仍可能表现出模式崩溃,因为少量清晰的近重复图像可以超越一个忠实覆盖完整数据分布的模型。我们认为,精确度和召回率是 FID 的重要补充,并且由于 FID 已经接近饱和,更有意义的目标是提高多样性和覆盖率。实现高召回率需要一个明确优先考虑模式覆盖的模型,这与大多数优化样本保真度的生成模型不同。我们提出了 RTM,它用迭代细化过程替代了基于风格生成器中的单次潜在映射,并且显示出这可以持续改善质量和多样性。与隐式最大似然估计(IMLE)集成,IMLE 通过设计优化模式覆盖,RTM 在当前最先进的方法中实现了最高的精确度和召回率,同时保持竞争力的 FID,在 CIFAR-10、CelebA-HQ(256x256)和九个少样本基准测试中均有改善。RTM 还在 CIFAR-10 和 AFHQ-v1(512x512)上改善了 StyleGAN2 和 StyleGAN2-ADA,证明这种好处并非特定于 IMLE。与在覆盖率上付出代价以实现竞争性 FID 的流匹配基线不同,递归细化同时改善了质量和多样性。
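The claim that a collapsed model can look sharp while missing modes can be made concrete with a k-nearest-neighbour precision and recall estimate. The abstract does not say which precision/recall definition it uses, so this sketch assumes one common choice: a sample counts as covered if it falls inside the k-NN ball of any sample from the other set. The 2D points stand in for image features and are made up for illustration.

```python
import math

def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour within the set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii

def coverage(queries, support, k=1):
    """Fraction of queries lying inside at least one k-NN ball of the support set."""
    radii = knn_radii(support, k)
    hits = sum(
        any(math.dist(q, s) <= r for s, r in zip(support, radii))
        for q in queries
    )
    return hits / len(queries)

def precision_recall(real, fake, k=1):
    """Precision: fake samples on the real manifold. Recall: real samples covered by fakes."""
    return coverage(fake, real, k), coverage(real, fake, k)

# Real data has two modes; the "generator" only produces near-duplicates in one of them.
real = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
fake = [(0.2, 0.2), (0.3, 0.1), (0.1, 0.3)]
precision, recall = precision_recall(real, fake)
print(precision, recall)
```

In this toy case every generated point is realistic (precision 1.0), yet no real point is covered by the tightly clustered generated set (recall 0.0), which is the failure mode a single fidelity-oriented score can hide.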
cs.CV / 4 / 2605.15325
COPRA: Conditional Parameter Adaptation with Reinforcement Learning for Video Anomaly Detection
COPRA:基于强化学习的条件参数适应用于视频异常检测
Abstract
Vision-language models (VLMs) have shown strong performance in video anomaly detection (VAD) while providing interpretable predictions. However, existing VLM-based VAD methods suffer from a fundamental mismatch between training and inference in both data distribution and model configuration. First, most approaches rely on static post-training adaptation, limiting generalization under distribution shifts such as unseen environments or anomaly types. Second, they train VLMs on sparse frames from long videos, but perform inference on densely sampled short segments, creating inconsistencies between training and testing. To address these limitations, we propose COPRA, a conditional parameter adaptation framework for VLM-based VAD. Instead of fixed prompts or shared parameter updates, COPRA generates input-specific parameter updates to dynamically adapt a frozen VLM for each video segment during both training and inference. Experiments show strong performance on standard VAD benchmarks, consistently outperforming static baselines in both in-domain and cross-domain settings. Moreover, COPRA generalizes beyond VAD to unseen tasks such as multiple-choice Video Question Answering and Dense Captioning. These results highlight COPRA as an effective weight-space generation framework for scalable, adaptive, and context-aware video understanding. The code will be released at https://github.com/THE-MALT-LAB/COPRA
Chinese Translation
视觉语言模型(VLMs)在视频异常检测(VAD)中表现出强大的性能,同时提供可解释的预测。然而,现有的基于VLM的VAD方法在训练和推理阶段的数据分布和模型配置之间存在根本的不匹配。首先,大多数方法依赖于静态的后训练适应,这限制了在分布变化(如未见环境或异常类型)下的泛化能力。其次,它们在长视频的稀疏帧上训练VLM,但在密集采样的短片段上进行推理,这造成了训练和测试之间的不一致。为了解决这些限制,我们提出了COPRA,一个用于基于VLM的VAD的条件参数适应框架。COPRA不是使用固定的提示或共享的参数更新,而是生成特定于输入的参数更新,以在训练和推理期间动态适应每个视频片段的冻结VLM。实验表明,在标准VAD基准测试中表现出强大的性能,在领域内和跨领域设置中始终优于静态基线。此外,COPRA超越VAD,能够推广到未见任务,如多选视频问答和密集字幕。这些结果突显了COPRA作为一个有效的权重空间生成框架,适用于可扩展、适应性强和上下文感知的视频理解。代码将发布在 https://github.com/THE-MALT-LAB/COPRA
cs.CV / 5 / 2605.15326
Multimodal Object Detection Under Sparse Forest-Canopy Occlusion
稀疏森林冠层遮挡下的多模态目标检测
Abstract
Reliable detection of humans beneath forest canopy remains a difficult remote-sensing challenge due to sparse, structured, and viewpoint-dependent occlusion. This paper presents a multimodal proof-of-concept pipeline that integrates three complementary approaches: (i) experimental evaluation of LiDAR returns through vegetation to assess the feasibility of active sensing, (ii) visible-thermal image fusion using a multi-scale transform and sparse-representation framework to enhance human saliency, and (iii) synthetic-aperture image formation via Airborne Optical Sectioning (AOS) to suppress canopy clutter. A YOLOv5 detector is fine-tuned on the Teledyne FLIR thermal dataset and evaluated on thermal and fused imagery. Results show that the tested terrestrial LiDAR configuration provides limited penetration for object-level detection, while visible-thermal fusion improves target visibility in low-contrast scenes and AOS enhances ground-plane detection in synthetic forest imagery. The fine-tuned YOLOv5 achieves a mean average precision of approximately 0.83 on the top three FLIR classes. These findings establish an initial baseline for UAV-deployable search-and-rescue and surveillance systems operating in forested environments, and motivate future work on dedicated forest datasets and real-time multimodal integration.
Chinese Translation
在森林冠层下可靠地检测人类仍然是一个困难的遥感挑战,原因在于稀疏、结构化和视角依赖的遮挡。本文提出了一种多模态概念验证管道,集成了三种互补的方法:(i)通过植被对激光雷达(LiDAR)返回信号进行实验评估,以评估主动传感的可行性;(ii)使用多尺度变换和稀疏表示框架进行可见光-热成像融合,以增强人类显著性;(iii)通过机载光学切片(Airborne Optical Sectioning, AOS)形成合成孔径图像,以抑制冠层杂波。对YOLOv5检测器进行了微调,使用Teledyne FLIR热成像数据集进行评估,结果显示,所测试的地面激光雷达配置在对象级检测方面提供了有限的穿透能力,而可见光-热成像融合在低对比度场景中提高了目标可见性,AOS在合成森林图像中增强了地面平面检测。微调后的YOLOv5在FLIR的前三个类别上达到了约0.83的平均精度。这些发现为在森林环境中部署无人机的搜索与救援及监视系统建立了初步基线,并激励未来在专用森林数据集和实时多模态集成方面的工作。
cs.CV / 6 / 2605.15342
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
Minerva-Ego:用于自我中心视频理解的时空提示
Abstract
Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatiotemporal mask annotations. Through extensive evaluations, we identify that prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance. Minerva-Ego can be downloaded at https://github.com/google-deepmind/neptune.
Chinese Translation
视频推理模型是自我中心和具身代理的核心组成部分。然而,评估模型的标准基准仅提供输出的评估(例如,问题的答案),而不评估中间推理步骤,并且大多数仅在文本领域提供答案。我们引入了Minerva-Ego,这是一个用于评估复杂自我中心视觉推理的基准。我们扩展了最近从自我中心/具身环境中录制的高质量视频数据源,并提供了一组具有挑战性的多步骤多模态问题和时空密集的人类注释推理轨迹。基准实验表明,最先进的模型与人类表现之间仍存在较大差距。为了详细调查这一差距,我们对数据集中每个推理轨迹进行了注释,标注了解决问题所需的感兴趣对象,作为时空掩码注释。通过广泛的评估,我们发现通过提示前沿模型“在哪里”和“何时”观察可以显著提高性能。Minerva-Ego可以在https://github.com/google-deepmind/neptune下载。
cs.CV / 7 / 2605.15368
Discretizing Group-Convolutional Neural Networks for 3D Geometry in Feature Space
在特征空间中离散化群卷积神经网络以处理三维几何
Abstract
Group-convolutional neural networks (GCNNs) are among the most important methods for introducing symmetry as an inductive bias in deep learning: In each linear layer, GCNNs sample a transformation group $G$ densely and correlate data and filters in different poses (with suitable anti-aliasing for steerable GCNNs) to maintain equivariance with respect to $G$. Unfortunately, applying filters to many data items resulting from this sampling is expensive (even for translations alone, i.e., in ordinary CNNs), and costs grow exponentially with increasing degrees of freedom (such as translations and rotations in 3D), which often hinders practical applications. In this paper, we propose sampling in feature space, i.e., replacing geometrically dense samples with representative samples selected by feature similarity. This decouples geometric resolution from memory and processing costs during training and inference, providing a novel way to trade off computational effort and accuracy. Our main empirical finding is that a coarse feature-space sampling already preserves classification accuracy remarkably well, which permits precomputation based on geometric similarity, accelerating the training of equivariant 3D classifiers substantially.
Chinese Translation
群卷积神经网络(GCNNs)是将对称性作为深度学习中的归纳偏置引入的重要方法之一:在每个线性层中,GCNNs 密集地对变换群 $G$ 进行采样,并在不同姿态下关联数据和滤波器(对于可引导的 GCNNs 进行适当的抗混叠处理),以保持相对于 $G$ 的等变性。不幸的是,对由此采样产生的许多数据项应用滤波器的成本很高(即使仅仅是平移,即在普通 CNNs 中),而随着自由度的增加(例如三维中的平移和旋转),成本呈指数增长,这常常阻碍实际应用。在本文中,我们提出在特征空间中进行采样,即用通过特征相似性选择的代表性样本替代几何上密集的样本。这将几何分辨率与训练和推理过程中的内存和处理成本解耦,提供了一种新的方式来权衡计算工作量和准确性。我们的主要实证发现是,粗略的特征空间采样已经能够显著保持分类准确性,这允许基于几何相似性进行预计算,从而大幅加速等变三维分类器的训练。
cs.CV / 8 / 2605.15375
ChangeFlow -- Latent Rectified Flow for Change Detection in Remote Sensing
ChangeFlow -- 潜在校正流用于遥感中的变化检测
Abstract
Remote sensing change detection (RSCD) aims to localise changes between two images of the same geographic region. In practice, change masks often follow region-level annotation conventions rather than purely local appearance differences, making them context-dependent and occasionally ambiguous. Most state-of-the-art methods utilise per-pixel discriminative classification, which produces a single prediction per input and fails to explicitly model the changed region as a coherent whole. A natural alternative is generative formulation, which can model a distribution of plausible masks, enabling sampling to capture ambiguity and encourage global consistency. However, existing generative RSCD approaches typically lag behind strong discriminative baselines due to the high computational cost of pixel-space generation and the complexity of their conditioning mechanisms. To address the limitations of prior discriminative and generative methods, we propose ChangeFlow, a generative framework that reformulates change detection as the synthesis of a change mask in latent space via rectified flow. ChangeFlow is guided by a structured yet lightweight conditioning signal, and its stochastic design naturally supports sampling-based prediction ensembling. Namely, aggregating multiple predicted change masks improves robustness, while sample agreement provides a practical confidence estimation that highlights ambiguous regions. Across four benchmarks, ChangeFlow achieves an average F1 of 80.4%, improving by 1.3 points on average over the previous best method, while maintaining inference speed comparable to recent strong baselines. Project page: https://blaz-r.github.io/changeflow_cd
Chinese Translation
遥感变化检测(RSCD)旨在定位同一地理区域的两幅图像之间的变化。在实际应用中,变化掩膜通常遵循区域级注释规范,而不仅仅是局部外观差异,使其依赖于上下文并且有时模糊不清。大多数最先进的方法采用逐像素的判别分类,这会对每个输入产生单一预测,并未明确将变化区域建模为一个连贯的整体。一个自然的替代方案是生成模型,它可以建模合理掩膜的分布,支持采样以捕捉模糊性并促进全局一致性。然而,现有的生成RSCD方法通常落后于强判别基线,原因在于像素空间生成的高计算成本以及其条件机制的复杂性。为了解决先前判别和生成方法的局限性,我们提出了ChangeFlow,一个生成框架,将变化检测重新表述为通过校正流在潜在空间中合成变化掩膜。ChangeFlow由一个结构化但轻量的条件信号引导,其随机设计自然支持基于采样的预测集成。具体而言,聚合多个预测的变化掩膜提高了鲁棒性,而样本一致性提供了一种实用的置信度估计,突出了模糊区域。在四个基准测试中,ChangeFlow的平均F1值达到80.4\%,比之前的最佳方法平均提高了1.3个百分点,同时保持了与近期强基线相当的推理速度。项目页面:https://blaz-r.github.io/changeflow_cd
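The sampling-based ensembling described in the ChangeFlow abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the aggregation rule, so the majority vote, the agreement score, and the name `ensemble_change_masks` are all assumptions.

```python
import numpy as np

def ensemble_change_masks(samples):
    """Aggregate K sampled binary change masks of shape (K, H, W).

    Hypothetical helper: the aggregation rule is assumed, not taken from the paper.
    """
    samples = np.asarray(samples, dtype=np.float64)
    # Per-pixel fraction of samples that predict "change".
    change_freq = samples.mean(axis=0)
    # Majority vote (assumed rule); an even split counts as "no change".
    mask = change_freq > 0.5
    # Agreement is 1.0 when all samples concur and 0.0 at an even split.
    agreement = np.abs(2.0 * change_freq - 1.0)
    return mask, agreement

# Four sampled 2x2 masks: the top row is unanimous, the bottom row is split.
samples = np.array([
    [[1, 0], [1, 0]],
    [[1, 0], [0, 0]],
    [[1, 0], [0, 1]],
    [[1, 0], [1, 1]],
])
mask, agreement = ensemble_change_masks(samples)
```

Pixels with low agreement are the ones a sampling-based model would flag as ambiguous.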
cs.CV / 9 / 2605.15383
MorphoHELM: A Comprehensive Benchmark for Evaluating Representations for Microscopy-Based Morphology Assays
MorphoHELM:评估显微镜基础形态测定法表示的综合基准
Abstract
Microscopy images contain rich information about how cells respond to perturbations, making them essential to applications like drug screening. To quantify images, researchers often use representation extraction methods, and recent years have seen a proliferation of deep learning methods. While measuring the quality of these representations is essential, evaluation remains fragmented, with each proposed model evaluated on different tasks and datasets, using custom pipelines and metrics, making it difficult to fairly compare models. Here, we introduce MorphoHELM, a comprehensive open benchmark for evaluating feature extraction methods for Cell Painting, the most widely-used morphological profiling assay. MorphoHELM consolidates evaluation standards in the field, extends and corrects them to be more robust, and evaluates on the widest range of methods to date. A defining feature of the benchmark is that each task is evaluated at different degrees of batch effects (or technical noise), directly quantifying how the ability of methods to detect biological signal degrades as noise increases. Together, these properties enable MorphoHELM to detect trade-offs between methods, and we demonstrate that models that excel at certain kinds of biological signal are weaker at others. We show that no existing model outperforms classic computer vision analytic strategies across all settings, which remain the strongest general use-case representations. All datasets, code, and evaluation tools are publicly available at https://github.com/microsoft/MorphoHELM.
Chinese Translation
显微镜图像包含丰富的信息,揭示细胞如何对扰动作出反应,这使其在药物筛选等应用中至关重要。为了量化图像,研究人员通常使用表示提取方法,近年来深度学习方法的涌现使得这一领域迅速发展。虽然评估这些表示的质量至关重要,但评估仍然是零散的,每个提出的模型在不同的任务和数据集上进行评估,使用自定义的流程和指标,这使得公平比较模型变得困难。在此,我们介绍MorphoHELM,一个用于评估细胞绘画(Cell Painting)特征提取方法的综合开放基准,这是最广泛使用的形态特征分析测定法。MorphoHELM整合了该领域的评估标准,并对其进行了扩展和修正,以提高其稳健性,并在迄今为止最广泛的方法范围内进行评估。该基准的一个显著特征是,每个任务在不同程度的批次效应(或技术噪声)下进行评估,直接量化方法在噪声增加时检测生物信号的能力如何下降。这些特性使MorphoHELM能够检测方法之间的权衡,我们展示了在某些类型的生物信号上表现优异的模型在其他类型上较弱。我们还表明,没有现有模型在所有设置中超越经典计算机视觉分析策略,而后者仍然是最强的通用用例表示。所有数据集、代码和评估工具均可在 https://github.com/microsoft/MorphoHELM 上公开获取。
cs.CV / 10 / 2605.15391
PanoWorld: Geometry-Consistent Panoramic Video World Modeling
PanoWorld:几何一致的全景视频世界建模
Abstract
We present PanoWorld, a panoramic video world model that generates geometry-consistent 360$\degree$ video from a single image and a caption. Existing panoramic video methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs that appear plausible yet exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. We address this gap by framing panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem rather than pure visual synthesis. Building on a pre-trained perspective video world model, we introduce two lightweight regularizers: a depth consistency loss against pseudo ground-truth panoramic depth, and a trajectory consistency loss that supervises the 3D world-frame positions of tracked points across time. We further apply spherical-geometry-aware adaptation to the conditioning and positional encoding. We additionally introduce PanoGeo, a unified geometry-aware panoramic video dataset with consistent depth, trajectory, and prompt annotations across diverse real and synthetic sources, used for both training and stratified evaluation. Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, establishing that panoramic video generation must be treated as a geometric modeling problem to support the holistic spatial understanding requirements of embodied AI applications. Code is available at https://github.com/ostadabbas/PanoWorld.
Chinese Translation
我们提出了PanoWorld,一种全景视频世界模型,它能够从单幅图像和一个标题生成几何一致的360$\degree$视频。现有的全景视频方法主要优化视觉真实感,并未明确约束底层3D场景状态,导致输出看似合理但深度不一致、对应关系破裂以及在球面上出现不合理运动。我们通过将全景视频生成框定为一个几何和动态一致的潜在状态建模问题,而非单纯的视觉合成,来填补这一空白。在预训练的透视视频世界模型基础上,我们引入了两个轻量级正则化器:一个针对伪真实全景深度的深度一致性损失,以及一个监督随时间变化的跟踪点的3D世界帧位置的轨迹一致性损失。我们进一步对条件和位置编码应用了球面几何感知适应。我们还引入了PanoGeo,这是一个统一的几何感知全景视频数据集,具有一致的深度、轨迹和提示注释,涵盖多种真实和合成来源,用于训练和分层评估。实验表明,PanoWorld在保持竞争性视觉真实感的同时,改善了几何一致性,确立了全景视频生成必须被视为几何建模问题,以支持具身人工智能应用的整体空间理解需求。代码可在 https://github.com/ostadabbas/PanoWorld 获取。
cs.CV / 11 / 2605.15397
ELDOR: A Dataset and Benchmark for Illegal Gold Mining in the Amazon Rainforest
ELDOR:亚马逊雨林非法采金的数据集与基准
Abstract
Illegal gold mining in the Amazon rainforest causes deforestation, water contamination, and long-term ecosystem disruption, yet remains difficult to monitor at fine spatial scales. Satellite imagery supports large-scale observation, but often misses small mining-related structures and subtle land-cover transitions, especially under frequent cloud cover. We introduce ELDOR, a large-scale UAV benchmark for monitoring environmental and landscape disturbance from illegal gold mining in the rainforest. ELDOR contains manually annotated orthomosaic imagery covering over 2,500 hectares, with pixel-level semantic labels for both mining-related activities and surrounding ecological structures. With this unified annotation source, we establish four benchmark tasks: semantic segmentation, segmentation-derived recognition, direct multi-label classification, and class-presence recognition with vision-language models. Across these tasks, we compare generic and remote-sensing-specific segmentation models, vision foundation model-related segmentation methods, direct multi-label classification methods, and vision-language models under a controlled closed-set protocol. Results show that current methods still struggle with rare small-scale mining structures and fine-grained recovery classes, suggesting the need for context-aware and multimodal modeling. To support domain analysis and practical use, we further build an interactive explorer for domain experts that provides a unified interface for data exploration and model inference.
Chinese Translation
亚马逊雨林中的非法采金导致了森林砍伐、水体污染和长期生态系统破坏,但在细尺度上监测仍然困难。卫星影像支持大规模观察,但往往无法捕捉到小型采矿相关结构和微妙的土地覆盖变化,尤其是在频繁的云层覆盖下。我们推出了ELDOR,这是一个用于监测雨林中非法采金引起的环境和景观干扰的大规模无人机基准。ELDOR包含手动标注的正射影像,覆盖超过2500公顷,具有针对采矿相关活动和周围生态结构的像素级语义标签。基于这一统一的标注源,我们建立了四个基准任务:语义分割、基于分割的识别、直接多标签分类以及使用视觉-语言模型的类别存在识别。在这些任务中,我们比较了通用和遥感特定的分割模型、与视觉基础模型相关的分割方法、直接多标签分类方法以及在受控闭集协议下的视觉-语言模型。结果表明,当前方法在处理稀有小规模采矿结构和细粒度恢复类别时仍然存在困难,这表明需要上下文感知和多模态建模。为了支持领域分析和实际应用,我们进一步为领域专家构建了一个交互式探索工具,提供统一的数据探索和模型推断界面。
cs.CV / 12 / 2605.15421
U-SEG: Uncertainty in SEGmentation -- A systematic multi-variable exploration
U-SEG:分割中的不确定性——系统的多变量探索
Abstract
In this study, we explore in depth a few under-studied topics at the intersection of uncertainty estimation and segmentation. Prior work has shown that the quality of uncertainty estimates can be very sensitive to a range of variables. As one of the main uses of uncertainty estimation is to help identify and deal with prediction errors in practical scenarios, any factors that affect this must be clearly identified. For example, do more challenging domains or different datasets and architectures result in worse performance when using uncertainty estimates? Can prior frames in a video sequence in fact provide useful uncertainty estimates comparable to other approaches? Is it possible to combine uncertainty estimation approaches, taking advantage of sample diversity, to get better estimates? Finally, when might it make sense to use an ensemble-based uncertainty estimate over a deterministic network? We address these questions by creating a framework for and executing a large scale study across many variables such as datasets, backbones, and downstream tasks, for both semantic and panoptic segmentation. We find that a) the more challenging task of panoptic segmentation usually results in worse performance while high performance variance between datasets and backbones indicates that generalization is not guaranteed, b) time series samples can be useful for specific configurations, but in many cases are not worth the cost, c) sample diversity shows the most promise in the downstream task of calibration, but otherwise fails to beat simpler alternatives, d) a deterministic approach is adequate for some downstream tasks, but ensembles allow for significant improvements if the right conditions can be achieved in deployment.
Chinese Translation
在本研究中,我们深入探讨了不确定性估计与分割交叉领域中一些尚未充分研究的主题。先前的研究表明,不确定性估计的质量对多种变量非常敏感。由于不确定性估计的主要用途之一是帮助识别和处理实际场景中的预测错误,因此必须清晰识别影响这一点的因素。例如,更具挑战性的领域或不同的数据集和架构在使用不确定性估计时是否会导致更差的性能?视频序列中的先前帧是否实际上能够提供可与其他方法相媲美的有用不确定性估计?是否有可能结合不确定性估计方法,利用样本多样性以获得更好的估计?最后,在什么情况下使用基于集成的不确定性估计而不是确定性网络是合理的?我们通过创建一个框架并在多个变量(如数据集、主干网络和下游任务)上执行大规模研究来解决这些问题,研究对象包括语义分割和全景分割。我们的发现包括:a) 更具挑战性的全景分割任务通常导致更差的性能,而数据集和主干网络之间的高性能方差表明泛化并不保证;b) 时间序列样本在特定配置下可能有用,但在许多情况下并不值得其成本;c) 样本多样性在校准的下游任务中显示出最大的潜力,但在其他情况下未能超越更简单的替代方案;d) 对于某些下游任务,确定性方法是足够的,但如果在部署中能够实现正确的条件,集成方法可以带来显著的改进。
cs.CV / 13 / 2605.15423
MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes
MR2-ByteTrack:基于CNN和Transformer的视频目标检测用于AI增强的嵌入式视觉传感器节点
Abstract
Modern smart vision sensors need on-device intelligence to process video streams, as cloud computing is often impractical due to bandwidth, latency, and privacy constraints. However, these sensory systems typically rely on ultra-low-power microcontrollers (MCUs) with limited memory and compute, making conventional video object detection methods, which require feature storage or multi-frame buffering, unfeasible. To address this challenge, we introduce Multi-Resolution Rescored ByteTrack (MR2-ByteTrack), a Video Object Detection (VOD) method tailored for MCU-based embedded vision nodes. MR2-ByteTrack reduces computational cost by alternating between full- and low-resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames. We apply our approach to both a CNN-based detector and a Transformer-based model, demonstrating its generality across architectures with fundamentally different spatial processing. Experiments on ImageNetVID demonstrate that MR2-ByteTrack maintains accuracy, achieving mAP scores of up to 49.0 for the CNN-based models and 48.7 for the Transformer, while reducing multiply-accumulate operations by as much as 53\% for the CNNs and 32\% for the Transformer. When deployed on GAP9, an ultra-low-power RISC-V multicore MCU, our method yields up to 55\% energy savings compared to processing only full-resolution images, enabling the first real-time Transformer-based VOD on an MCU-class embedded vision node. Code available at https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access
Chinese Translation
现代智能视觉传感器需要在设备上进行智能处理视频流,因为由于带宽、延迟和隐私限制,云计算通常不切实际。然而,这些传感系统通常依赖于具有有限内存和计算能力的超低功耗微控制器(MCUs),使得需要特征存储或多帧缓冲的传统视频目标检测方法变得不可行。为了解决这一挑战,我们提出了多分辨率重评分ByteTrack(MR2-ByteTrack),这是一种针对基于MCU的嵌入式视觉节点量身定制的视频目标检测(VOD)方法。MR2-ByteTrack通过在全分辨率和低分辨率推理之间交替,降低了计算成本,同时通过ByteTrack在帧之间链接检测,并通过重评分算法纠正误分类,该算法应用概率联合规则来聚合跨帧的检测置信度分数。我们将该方法应用于基于CNN的检测器和基于Transformer的模型,展示了其在具有根本不同空间处理的架构中的通用性。在ImageNetVID上的实验表明,MR2-ByteTrack保持了准确性,基于CNN的模型达到了高达49.0的mAP分数,基于Transformer的模型达到了48.7,同时将CNN的乘加操作减少了多达53\%,将Transformer的减少了32\%。在GAP9上部署时,我们的方法相比仅处理全分辨率图像可节省高达55\%的能量,使得在MCU级嵌入式视觉节点上实现了首个实时基于Transformer的视频目标检测。代码可在https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access获取。
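The abstract says only that Rescore applies "probability union rules" to confidence scores across frames. A minimal sketch of one such rule follows, assuming per-frame detections of the same track and class are treated as independent events, so that P(A ∪ B) = P(A) + P(B) − P(A)P(B). The function name `union_rescore` is hypothetical, and the paper's exact algorithm may differ.

```python
def union_rescore(confidences):
    """Combine per-frame confidences for one track and class.

    Assumes independence between frames (an illustrative simplification).
    """
    miss = 1.0  # probability that every frame "misses" the class
    for p in confidences:
        if not 0.0 <= p <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        miss *= 1.0 - p
    return 1.0 - miss

# Two moderately confident frames reinforce each other.
score = union_rescore([0.5, 0.5])
```

Under this rule a class seen repeatedly at moderate confidence accumulates a higher aggregate score than any single frame gives it, which is one way a tracker could override an isolated misclassification.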
cs.CV / 14 / 2605.15424
Social-Mamba: Socially-Aware Trajectory Forecasting with State-Space Models
Social-Mamba:基于状态空间模型的社会感知轨迹预测
Abstract
Human trajectory forecasting is crucial for safe navigation in crowded environments, requiring models that balance accuracy with computational efficiency. Efficiently modeling social interactions is key to performance in dense crowds. Yet, most recent methods rely on attention mechanisms, which are effective at capturing complex dependencies, but incur quadratic computational costs that scale poorly with the growing number of neighbors. Recently, Selective State-Space Models have provided a linear-time alternative; however, their inherently sequential design is misaligned with the unstructured and dynamic nature of social interactions. To address this challenge, we propose Social-Mamba, a forecasting architecture that reformulates social interactions as structured sequential processes. At its core is the Cycle Mamba block, a novel module that enables continuous bidirectional information flow. Social-Mamba organizes agents on an egocentric grid and introduces social triplet factorization, which decomposes interactions into temporal, egocentric, and goal-centric scans. These are dynamically integrated through a learnable social gate and global scan to generate accurate and efficient trajectory predictions. Extensive experiments on five trajectory forecasting benchmarks show that Social-Mamba achieves state-of-the-art accuracy while offering superior parameter efficiency and computational scalability. Furthermore, embedding Social-Mamba into a flow-matching framework further enhances both accuracy and efficiency, establishing it as a flexible and robust foundation for future trajectory forecasting research. The code is publicly available: https://github.com/vita-epfl/Social-Mamba
Chinese Translation
人类轨迹预测对于在拥挤环境中的安全导航至关重要,这需要在准确性与计算效率之间取得平衡。有效建模社会互动是提升密集人群中性能的关键。然而,大多数最新方法依赖于注意力机制,虽然能够有效捕捉复杂依赖关系,但其计算成本呈二次增长,随着邻居数量的增加而急剧上升。最近,选择性状态空间模型提供了一种线性时间的替代方案;然而,它们固有的顺序设计与社会互动的无结构和动态特性不相符。为了解决这一挑战,我们提出了Social-Mamba,一种将社会互动重新表述为结构化顺序过程的预测架构。其核心是Cycle Mamba模块,这是一种新颖的模块,能够实现持续的双向信息流。Social-Mamba在自我中心网格上组织代理,并引入社会三元组分解,将互动分解为时间、以自我为中心和以目标为中心的扫描。这些通过可学习的社会门和全局扫描动态整合,以生成准确且高效的轨迹预测。在五个轨迹预测基准上的广泛实验表明,Social-Mamba实现了最先进的准确性,同时提供了优越的参数效率和计算可扩展性。此外,将Social-Mamba嵌入流匹配框架中进一步提高了准确性和效率,确立了其作为未来轨迹预测研究的灵活和稳健基础。代码已公开可用: https://github.com/vita-epfl/Social-Mamba
cs.CV / 15 / 2605.15450
RIDE: Retinex-Informed Decoupling for Exposing Concealed Objects
RIDE:基于Retinex理论的解耦方法以揭示隐藏物体
Abstract
Concealed Object Segmentation (COS) encompasses a family of dense-prediction tasks, including camouflaged object detection, polyp segmentation, transparent object detection, and industrial defect inspection, where targets are visually entangled with their surroundings through different physical mechanisms. Existing methods either operate directly on RGB images or employ \emph{heterogeneous} decompositions (e.g., Fourier, wavelet) that redistribute spatial evidence across scale/frequency coefficients, making pixel-aligned cues less direct. We introduce a fundamentally different perspective: \textbf{homogeneous image decomposition} via Retinex theory, which factorizes an image into illumination and reflectance components within the \emph{same} spatial domain. Our key insight is that visual entanglement enforces appearance matching in the composite space, but this does \emph{not} necessitate simultaneous matching in both component spaces, a phenomenon we formalize as the \textbf{Discriminability Gap Theorem}. Crucially, we show that across diverse COS sub-tasks, the underlying physical processes systematically anti-correlate illumination and reflectance differences, yielding theoretical guarantees that Retinex decomposition preserves or strictly improves total foreground--background discriminability across the full physical regime, with anti-correlation maximizing the gain. Building on this, we propose \textbf{RIDE} comprising: (i) a Task-Driven Retinex Decomposition module that learns segmentation-optimal factorizations end-to-end; (ii) a Discriminability Gap Attention mechanism that adaptively exploits where decomposition helps; and (iii) a Camouflage-Breaking Contrastive loss operating in reflectance feature space.
Chinese Translation
隐藏物体分割(COS)涵盖了一系列密集预测任务,包括伪装物体检测、息肉分割、透明物体检测和工业缺陷检测,其中目标通过不同的物理机制与其周围环境视觉上交织在一起。现有方法要么直接在RGB图像上操作,要么采用\textit{异构}分解(例如,傅里叶、小波变换),这些方法在尺度/频率系数之间重新分配空间证据,使得像素对齐的线索不够直接。我们提出了一种根本不同的视角:通过Retinex理论进行\textbf{同质图像分解},将图像分解为\textit{相同}空间域内的照明和反射成分。我们的关键见解是,视觉交织在复合空间中强制执行外观匹配,但这并\textit{不}需要在两个成分空间中同时匹配,这一现象我们正式化为\textbf{可区分性差距定理}。至关重要的是,我们展示了在不同的COS子任务中,潜在的物理过程系统性地使照明差异与反射差异呈反相关,从而提供了理论保证,即Retinex分解在整个物理范围内保持或严格改善前景与背景的总可区分性,且反相关最大化了增益。在此基础上,我们提出了\textbf{RIDE},包括:(i)一个任务驱动的Retinex分解模块,端到端学习分割最优分解;(ii)一个可区分性差距注意力机制,自适应地利用分解有帮助的区域;(iii)一个在反射特征空间中操作的伪装破坏对比损失。
cs.CV / 16 / 2605.15458
Video Models Can Reason with Verifiable Rewards
视频模型可以进行可验证奖励的推理
Abstract
Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.
Chinese Translation
视频扩散模型在感知真实感和时间一致性方面取得了快速进展,但它们仍然主要针对合理生成而非可验证推理进行优化。这一局限性在生成视频必须满足明确的空间、时间或逻辑约束的任务中尤为明显。受到可验证奖励的强化学习(RLVR)在面向推理的语言模型中作用的启发,我们提出了VideoRLVR,这是一种通过基于规则的反馈优化视频扩散模型的实用方法。VideoRLVR将视频推理表述为可验证视觉轨迹的生成,并包括一个SDE-GRPO优化框架、密集分解奖励以及一种用于高效训练的早期步骤聚焦策略。早期步骤聚焦策略将策略优化限制在早期去噪阶段,减少了约40%的训练延迟,同时保持了性能。我们在Maze、FlowFree和Sokoban这三个具有客观成功标准的程序生成领域上评估了VideoRLVR。在这些任务中,VideoRLVR在监督微调基准上持续取得改进,密集分解奖励在低成功率设置中尤其重要。我们的RL优化模型在这些可验证推理基准和域外基准上也超越了评估的专有和开源视频生成模型。这些结果表明,可验证的强化学习可以推动视频模型超越感知模仿,朝着更可靠的规则一致视觉推理发展。
cs.CV / 17 / 2605.15466
Entity-Centric World Models: Interaction-Aware Masking for Causal Video Prediction
以实体为中心的世界模型:面向交互的因果视频预测掩蔽
Abstract
Learning predictive world models from unlabelled video is a foundational challenge in artificial intelligence. While Joint Embedding Predictive Architectures (JEPA) have set new benchmarks in semantic classification, they often remain physics-blind, failing to capture the causal dynamics necessary for downstream reasoning. We hypothesize that this stems from standard patch-based masking strategies, which prioritize visual texture over rare but informative kinematic events. We propose Interaction-Aware JEPA (IA-JEPA), which utilizes a self-supervised motion-centric masking strategy to prioritize physical interactions. By specifically targeting entities engaged in collisions or momentum transfers, we force the architecture to reconstruct latent trajectories rather than static background features. Evaluated on the CLEVRER benchmark, IA-JEPA achieves 14.26% accuracy on causal reasoning tasks, a significant lead over the 3.22% achieved by standard patch-masked baselines. Crucially, we demonstrate that IA-JEPA breaks the "static bias" of standard self-supervision by inducing a higher-entropy, more discriminative latent space (+10% entropy gain) that linearizes physical energy ($R^2=0.43$). We show that this interaction bias generalizes to real-world human actions (Something-Something V2) and zero-shot physical puzzles (PHYRE-Lite). Our results provide a scalable, fully self-supervised path toward building foundational world models that begin to internalize the causal structure of the physical world.
Chinese Translation
从未标记的视频中学习预测性世界模型是人工智能中的一个基础性挑战。尽管联合嵌入预测架构(Joint Embedding Predictive Architectures, JEPA)在语义分类方面设定了新的基准,但它们往往缺乏物理感知,无法捕捉下游推理所需的因果动态。我们假设这源于标准的基于补丁的掩蔽策略,这些策略优先考虑视觉纹理,而忽视了稀有但信息丰富的运动事件。我们提出了面向交互的JEPA(Interaction-Aware JEPA, IA-JEPA),该方法利用自监督的运动中心掩蔽策略来优先考虑物理交互。通过专门针对参与碰撞或动量转移的实体,我们迫使架构重建潜在轨迹,而非静态背景特征。在CLEVRER基准上评估,IA-JEPA在因果推理任务上达到了14.26%的准确率,显著领先于标准补丁掩蔽基线的3.22%。关键是,我们证明IA-JEPA打破了标准自监督的“静态偏见”,通过诱导更高熵、更具辨别力的潜在空间(熵增益+10%),使物理能量线性化($R^2=0.43$)。我们展示了这种交互偏见能够推广到真实世界的人类动作(Something-Something V2)和零样本物理难题(PHYRE-Lite)。我们的结果提供了一条可扩展的、完全自监督的路径,朝着构建基础性世界模型迈进,这些模型开始内化物理世界的因果结构。
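To make the idea of motion-centric masking concrete, the sketch below selects the video patches with the most temporal change and marks them for masking. It is a crude stand-in under stated assumptions: IA-JEPA targets entities involved in collisions or momentum transfer, whereas this example uses plain frame differencing, and the name `motion_patch_mask` and its parameters are hypothetical.

```python
import numpy as np

def motion_patch_mask(frames, patch, num_masked):
    """Mask the patches with the highest motion energy.

    frames: (T, H, W) grayscale clip; returns a boolean (H//patch, W//patch) mask.
    Frame differencing is an assumed proxy for "physical interaction".
    """
    frames = np.asarray(frames, dtype=np.float64)
    _, h, w = frames.shape
    gh, gw = h // patch, w // patch
    # Mean absolute temporal difference per pixel.
    motion = np.abs(np.diff(frames, axis=0)).mean(axis=0)
    # Sum the motion inside each non-overlapping patch.
    energy = motion[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch).sum(axis=(1, 3))
    # Pick the num_masked patches with the most motion.
    order = np.argsort(-energy.ravel(), kind="stable")[:num_masked]
    mask = np.zeros(gh * gw, dtype=bool)
    mask[order] = True
    return mask.reshape(gh, gw)

# A 3-frame clip where only the top-right 2x2 patch changes.
clip = np.zeros((3, 4, 4))
clip[1, 0:2, 2:4] = 1.0
mask = motion_patch_mask(clip, patch=2, num_masked=1)
```

Compared with uniform random patch masking, a selection like this forces the predictor to reconstruct the regions where something moves rather than static background.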
cs.CV / 18 / 2605.15475
A Unified Non-Parametric and Interpretable Point Cloud Analysis via t-FCW Graph Representation
通过 t-FCW 图表示实现统一的非参数和可解释点云分析
Abstract
We introduce an empowered transposed Fully Connected Weighted (t-FCW) graph representation to embed point clouds into a metric space. While the original t-FCW has shown promising results for point cloud classification, the reasons behind its effectiveness and its broader applicability have remained unclear. In this work, we analyze the properties that make the empowered and original t-FCW effective and design a network that uses the empowered t-FCW as its sole feature extractor. From an interpretability perspective, we build memory banks for classification, part segmentation, and semantic segmentation using the empowered t-FCW. Our analysis reveals that the empowered t-FCW inherits robustness from surface descriptors and provides interpretability through dimension-wise relations. These properties enable a highly efficient and interpretable network, which processes the ModelNet40 classification problem in approximately 7 seconds on an NVIDIA RTX A5000 GPU. Importantly, the empowered t-FCW can function both as a lightweight standalone baseline and as a complementary plug-in to existing deep models.
Chinese Translation
我们引入了一种增强的转置全连接加权(t-FCW)图表示,以将点云嵌入到度量空间中。尽管原始的 t-FCW 在点云分类中显示出了良好的效果,但其有效性的原因及其更广泛的适用性仍不清晰。在本研究中,我们分析了使增强的 t-FCW 和原始 t-FCW 有效的属性,并设计了一种网络,该网络专门使用增强的 t-FCW 作为特征提取器。从可解释性的角度出发,我们使用增强的 t-FCW 构建了用于分类、部件分割和语义分割的记忆库。我们的分析表明,增强的 t-FCW 继承了表面描述符的鲁棒性,并通过维度间关系提供了可解释性。这些特性使得网络具有高效性和可解释性,能够在 NVIDIA RTX A5000 GPU 上大约 7 秒内处理 ModelNet40 分类问题。重要的是,增强的 t-FCW 可以作为轻量级的独立基线,也可以作为现有深度模型的补充插件。
cs.CV / 19 / 2605.15477
EgoExo-WM: Unlocking Exo Video for Ego World Models
EgoExo-WM:为自我世界模型解锁外部视频
Abstract
Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space -- and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in-the-wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented-reality guidance.
Chinese Translation
自我中心的世界模型为使智能体能够进行预测和规划提供了一个有前景的方向,但其性能受到自我中心训练数据有限性及人类物理动作固有的部分可观测性限制。相比之下,外部视频丰富且能够很好地揭示身体姿态,但与智能体的动作空间缺乏直接对齐——并且不是自我中心的。我们提出了一种方法,通过从外部视频中提取结构化的身体姿态作为动作的表征,并根据人类运动学先验将外部视频转换为自我中心视频,从而弥补这一差距。这个过程解锁了在自然环境中使用外部数据进行自我中心世界模型训练的可能性。我们展示了使用我们转换的数据训练全身动作条件的自我中心世界模型显著提高了预测质量和下游规划性能,在此过程中我们推断出实现视觉目标状态所需的身体姿态序列。我们的方法为利用任意自然环境中的视频构建强大的自我中心世界模型铺平了道路,进一步推动了机器人规划和增强现实指导等应用。
cs.CV / 20 / 2605.15484
When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing
稀疏 MoE 在视觉中何时有效?稀疏路由中骨干计算杠杆的作用
Abstract
Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-$k$ routing with hard capacity constraints helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a \emph{compute-leverage pattern}: positive accuracy gaps require a substantial fraction $\rho$ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing ($k \geq 2$) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted sign reversals across standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet-1K ablation that varies only top-$k$ -- holding architecture, initialization, and $\rho$ fixed -- reverses the gap from positive to negative across all five seeds. A per-sample variant of Soft MoE that softmaxes over experts rather than the batch rescues CIFAR-100 above the dense baseline, identifying batch-axis dispatch as the dominant failure mode in per-sample CNN settings. Code and aggregate results: https://github.com/libophd/sparse-moe-vision-rho.
Chinese Translation
混合专家(Mixture-of-Experts, MoE)网络承诺提供良好的准确性与计算的权衡,但实际视觉部署受到专家崩溃和有限的端到端效率提升的阻碍。我们研究了在视觉分类中,带硬容量约束的稀疏 top-$k$ 路由何时有效,并在四个基准(CIFAR-10/100、Tiny-ImageNet、ImageNet-1K)上按多随机种子协议进行评估。我们观察到一种\textit{计算杠杆模式}:正的准确性差距需要总浮点运算(FLOPs)中相当大的比例 $\rho$ 被路由;在 ImageNet 规模下,这是必要但不充分的,因为还需要多专家路由($k \geq 2$)。两个对照实验隔离了这些因素。在 CIFAR-10 上进行的隐藏层大小扫描在标准骨干和逐深度(depthwise)骨干上都得到了预测的两种符号反转,排除了骨干家族作为起作用的变量。一个仅改变 top-$k$(固定架构、初始化和 $\rho$)的 ImageNet-1K 消融实验使差距在所有五个种子上从正转为负。Soft MoE 的一个逐样本变体在专家维度而非批次维度上做 softmax,使 CIFAR-100 的结果回升到稠密基线之上,并指出批次轴调度是逐样本 CNN 设置中的主要失败模式。代码和汇总结果见:https://github.com/libophd/sparse-moe-vision-rho。
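For readers unfamiliar with the terms, the following sketch illustrates sparse top-$k$ routing with a hard per-expert capacity, plus the routed-compute fraction $\rho$. It is an illustrative toy, not the paper's code: the first-come, first-served overflow policy and the names `topk_route` and `routed_fraction` are assumptions.

```python
import numpy as np

def topk_route(logits, k, capacity):
    """Assign each token to its top-k experts, dropping assignments past capacity.

    logits: (tokens, experts) router scores.
    Returns a boolean (tokens, experts) dispatch matrix.
    """
    logits = np.asarray(logits, dtype=np.float64)
    n_tokens, n_experts = logits.shape
    # Indices of the k highest-scoring experts per token, best first.
    topk = np.argsort(-logits, axis=1, kind="stable")[:, :k]
    dispatch = np.zeros((n_tokens, n_experts), dtype=bool)
    load = np.zeros(n_experts, dtype=int)
    for t in range(n_tokens):  # first-come, first-served (assumed policy)
        for e in topk[t]:
            if load[e] < capacity:
                dispatch[t, e] = True
                load[e] += 1
    return dispatch

def routed_fraction(routed_flops, total_flops):
    """rho: share of the model's total FLOPs spent inside routed (expert) layers."""
    return routed_flops / total_flops

scores = [[2, 1, 0], [3, 0, 1], [5, 4, 0]]
single = topk_route(scores, k=1, capacity=2)  # third token is dropped: expert 0 is full
multi = topk_route(scores, k=2, capacity=2)   # third token still reaches expert 1
```

With $k = 1$ a token whose preferred expert is full receives no expert computation at all, while $k \geq 2$ gives it a fallback, which is one intuition for why multi-expert routing can matter under hard capacity limits.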
cs.CV / 21 / 2605.15497
AnyAct: Towards Human Reenactment of Character Motion From Video
AnyAct: 实现从视频中对角色动作的人类重现
Abstract
We study the problem of directly deriving an initial human reenactment from a monocular video of a non-human character. Our goal is not to reconstruct the source character itself but to reinterpret its motion as a plausible and editable human performance for downstream animation authoring. This task is challenging because existing video-based motion capture methods are largely restricted to human-centric structural spaces, while motion retargeting methods typically require structured 3D source motions and known source topologies. Our key insight is that sparse local articulated motion cues can preserve essential dynamics across large structural differences, providing a stable bridge from character video to human reenactment. Based on this observation, we propose AnyAct, which formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. To make this practical, we introduce three key designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control. We further construct a benchmark primarily covering diverse non-human character videos. Experiments on the benchmark show that AnyAct produces high-fidelity initial human reenactments that preserve the essential dynamics of the characters in reference videos, and further ablation studies validate the effectiveness of its core designs.
Chinese Translation
我们研究了如何直接从单目视频中推导出初始的人类重现,视频中的角色为非人类角色。我们的目标不是重建源角色本身,而是将其动作重新解释为一种合理且可编辑的人类表演,以便用于后续的动画创作。这一任务具有挑战性,因为现有的视频基础动作捕捉方法主要局限于以人为中心的结构空间,而动作重定向方法通常需要结构化的3D源动作和已知的源拓扑。我们的关键见解是,稀疏的局部关节运动线索能够在大结构差异中保留基本的动态特征,从而为从角色视频到人类重现提供一个稳定的桥梁。基于这一观察,我们提出了AnyAct,该方法将基于角色视频的人类重现形式化为从可转移的稀疏局部2D关节运动生成条件人类动作。为了使这一方法具有实用性,我们引入了三个关键设计:通过增强的3D到2D投影进行人类动作单独监督,渐进式的3D到2D训练以缓解条件模糊性,以及全球-局部运动解耦以实现可靠的局部运动控制。我们进一步构建了一个基准,主要涵盖多样的非人类角色视频。在该基准上的实验表明,AnyAct能够生成高保真度的初始人类重现,保留参考视频中角色的基本动态特征,进一步的消融研究验证了其核心设计的有效性。
cs.CV / 22 / 2605.15519
DiffVAS: Diffusion-Guided Visual Active Search in Partially Observable Environments
DiffVAS:基于扩散引导的部分可观测环境中的视觉主动搜索
Abstract
Visual active search (VAS) has been introduced as a modeling framework that leverages visual cues to direct aerial (e.g., UAV-based) exploration and pinpoint areas of interest within extensive geospatial regions. Potential applications of VAS include detecting hotspots for rare wildlife poaching, aiding search-and-rescue missions, and uncovering illegal trafficking of weapons, among other uses. Previous VAS approaches assume that the entire search space is known upfront, which is often unrealistic due to constraints such as a restricted field of view and high acquisition costs, and they typically learn policies tailored to specific target objects, which limits their ability to search for multiple target categories simultaneously. In this work, we propose DiffVAS, a target-conditioned policy that searches for diverse objects simultaneously according to task requirements in partially observable environments, which advances the deployment of visual active search policies in real-world applications. DiffVAS leverages a diffusion model to reconstruct the entire geospatial area from sequentially observed partial glimpses, which enables a target-conditioned reinforcement learning-based planning module to effectively reason and guide subsequent search steps. Extensive experiments demonstrate that DiffVAS excels in searching diverse objects in partially observable environments, significantly surpassing state-of-the-art methods on several datasets.
Chinese Translation
视觉主动搜索(VAS)被引入作为一种建模框架,利用视觉线索指导空中(例如,无人机基础)探索,并在广泛的地理空间区域内确定感兴趣的区域。VAS的潜在应用包括检测稀有野生动物偷猎的热点、协助搜救任务以及揭露非法武器贩运等。以往的VAS方法假设整个搜索空间是事先已知的,这在视野受限和高获取成本等约束下往往不切实际,并且它们通常学习针对特定目标对象的策略,这限制了它们同时搜索多个目标类别的能力。在本研究中,我们提出了DiffVAS,这是一种目标条件策略,可以根据任务要求在部分可观测环境中同时搜索多样化对象,从而推动视觉主动搜索策略在现实世界应用中的部署。DiffVAS利用扩散模型从顺序观察到的部分视图重建整个地理空间区域,使得基于目标条件的强化学习规划模块能够有效推理并指导后续搜索步骤。大量实验表明,DiffVAS在部分可观测环境中搜索多样化对象方面表现优异,在多个数据集上显著超越了最先进的方法。
cs.CV / 23 / 2605.15523
Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning
自提示扩散变换器用于通过上下文学习进行开放词汇场景文本编辑
Abstract
Scene text editing aims to modify text in a target region of an image while preserving the surrounding background style and texture. Existing methods rely solely on image background information and neglect the visual details of the target region, which discards the stylistic features of the original text and essentially reduces the task to text rendering. Moreover, the conditions imposed by a pre-trained glyph encoder limit the scope of editable text. To address these issues, this paper proposes a self-prompting scene text editing method that constructs style and glyph prompts directly from the original image, without introducing additional style or glyph encoders. We employ a two-stage training strategy: the diffusion transformer is first trained on large-scale self-supervised data and then refined using a small set of paired images. By leveraging the in-context learning capability of the Multi-Modal Diffusion Transformer (MM-DiT), the method achieves open-vocabulary and style-consistent text editing. Experimental results on various languages demonstrate that our method achieves state-of-the-art performance in both text accuracy and style consistency. Project page: https://hongxiii.github.io/mstedit
Chinese Translation
场景文本编辑旨在修改图像目标区域中的文本,同时保留周围背景的风格和纹理。现有方法仅依赖于图像背景信息,而忽视了目标区域的视觉细节,这导致原始文本中的风格特征被丢弃,并本质上将任务降级为文本渲染。此外,预训练的字形编码器施加的条件限制了可编辑文本的范围。为了解决这些问题,本文提出了一种自提示的场景文本编辑方法,该方法直接从原始图像构建风格和字形提示,而无需引入额外的风格或字形编码器。我们采用了两阶段的训练策略:首先在大规模自监督数据上训练扩散变换器,然后使用一小组配对图像进行精细调整。通过利用多模态扩散变换器(Multi-Modal Diffusion Transformer, MM-DiT)的上下文学习能力,该方法实现了开放词汇和风格一致的文本编辑。在多种语言上的实验结果表明,我们的方法在文本准确性和风格一致性方面达到了最先进的性能。我们的项目页面: https://hongxiii.github.io/mstedit。
cs.CV / 24 / 2605.15533
Tuning-free Instruction-based Video Editing Via Structural Noise Initialization and Guidance
无调优的基于指令的视频编辑:结构噪声初始化与引导
Abstract
Video editing poses a significant challenge. While a series of tuning-free methods circumvent the need for extensive data collection and model training, they often underutilize the rich information embedded within the noisy latent, leading to unsatisfactory results. To address this, we propose a tuning-free, instruction-based video editing framework. We approach video editing from the perspective of the noisy latent: we design a Structural Noise Initialization Strategy (SNIS) to secure a superior editing starting point by assigning higher noise levels to edited regions (to facilitate content change) and lower noise levels to unedited regions (to maintain content consistency). We introduce a Noise Guidance Mechanism (NGM), which leverages the video prior in the generative model and effectively integrates rich information within the noisy latent to guide the denoising process, thereby preserving unedited content and overall visual coherence. Experiments show that our proposed method achieves better visual quality and state-of-the-art performance.
Chinese Translation
视频编辑面临着重大挑战。虽然一系列无调优的方法规避了大量数据收集和模型训练的需求,但它们往往未能充分利用嵌入在噪声潜变量中的丰富信息,导致结果不尽如人意。为此,我们提出了一种无调优的基于指令的视频编辑框架。我们从噪声潜变量的角度出发,设计了一种结构噪声初始化策略(Structural Noise Initialization Strategy, SNIS),通过对编辑区域分配更高的噪声水平(以促进内容变化)和对未编辑区域分配较低的噪声水平(以保持内容一致性),确保了一个优越的编辑起点。我们引入了一种噪声引导机制(Noise Guidance Mechanism, NGM),该机制利用生成模型中的视频先验,有效整合噪声潜变量中的丰富信息,以引导去噪过程,从而保留未编辑内容和整体视觉一致性。实验表明,我们提出的方法在视觉质量和前沿性能上均取得了更好的结果。
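The region-dependent noise assignment described for SNIS can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function name `structural_noise_init`, the binary `edit_mask`, the two noise levels, and the variance-preserving mix of clean latent and Gaussian noise.

```python
import numpy as np

def structural_noise_init(latent, edit_mask, sigma_edit=0.9, sigma_keep=0.3, seed=0):
    """Mix a clean latent with Gaussian noise using a per-location noise level.

    latent:    array of latent values, e.g. shape (H, W)
    edit_mask: array broadcastable to latent; values > 0.5 mark edited regions
    sigma_*:   hypothetical noise levels in [0, 1] (higher = more noise)
    """
    rng = np.random.default_rng(seed)
    # Edited regions get the higher noise level, unedited regions the lower one.
    sigma = np.where(edit_mask > 0.5, sigma_edit, sigma_keep)
    noise = rng.standard_normal(latent.shape)
    # Variance-preserving blend (an assumed form, not the paper's schedule).
    return np.sqrt(1.0 - sigma ** 2) * latent + sigma * noise

if __name__ == "__main__":
    latent = np.ones((4, 4))
    mask = np.zeros((4, 4))
    mask[:2] = 1  # top half is the edited region
    noisy = structural_noise_init(latent, mask)
    print(np.abs(noisy - latent).mean(axis=1))  # per-row deviation from the clean latent
```

Setting `sigma_keep=0.0` in this sketch leaves the unedited region untouched, which makes the intended asymmetry between the two regions easy to inspect.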
cs.CV / 25 / 2605.15535
Learning Dynamic Structural Specialization for Underwater Salient Object Detection
学习动态结构特化用于水下显著目标检测
Abstract
Underwater salient object detection (USOD) has attracted increasing attention for underwater visual scene understanding and vision-guided robotic applications. However, existing USOD methods still struggle with underwater image degradations, which often lead to inaccurate object localization, fragmented salient regions, and coarse boundary prediction. To address these challenges, this paper proposes DSS-USOD, a novel RGB-based USOD method built upon dynamic structural specialization. DSS-USOD extracts a shared base representation from a single underwater image, decomposes it into boundary-sensitive and region-coherent structural features, and dynamically coordinates their contributions according to local structural context. Specifically, the extracted shared base representation is decomposed into a boundary-sensitive branch for modeling fine-grained boundary details and a region-coherent branch for capturing region-level structural consistency. A spatial coordination module is then introduced to adaptively regulate the relative contributions of the two branches according to local structural context. Moreover, cooperative structural supervision is introduced to promote branch specialization and stabilize spatial coordination, enabling DSS-USOD to better balance boundary precision and region coherence under degraded underwater conditions. Extensive experiments show that DSS-USOD achieves superior performance on benchmark datasets. Finally, real-world deployment on an underwater robot validates the practical effectiveness of DSS-USOD for underwater object inspection.
Chinese Translation
水下显著目标检测(USOD)在水下视觉场景理解和视觉引导的机器人应用中引起了越来越多的关注。然而,现有的USOD方法在处理水下图像退化时仍然面临挑战,这常常导致目标定位不准确、显著区域破碎以及边界预测粗糙。为了解决这些问题,本文提出了一种新颖的基于RGB的USOD方法DSS-USOD,该方法基于动态结构特化。DSS-USOD从单幅水下图像中提取共享基础表示,将其分解为边界敏感和区域一致的结构特征,并根据局部结构上下文动态协调它们的贡献。具体而言,提取的共享基础表示被分解为一个用于建模细粒度边界细节的边界敏感分支和一个用于捕捉区域级结构一致性的区域一致分支。然后,引入一个空间协调模块,根据局部结构上下文自适应地调节两个分支的相对贡献。此外,引入了协同结构监督以促进分支特化并稳定空间协调,使DSS-USOD能够在退化的水下条件下更好地平衡边界精度和区域一致性。大量实验表明,DSS-USOD在基准数据集上表现优越。最后,在水下机器人上的实际部署验证了DSS-USOD在水下物体检查中的实际有效性。
cs.CV / 26 / 2605.15546
3DTMDet: A Dual-Path Synergy Network of Transformer and SSM for 3D Object Detection in Point Clouds
3DTMDet:一种结合变换器和状态空间模型的双路径协同网络,用于点云中的三维物体检测
Abstract
A fundamental challenge in point cloud object detection lies in the conflict between the extreme sparsity of distant points and the need for remote context understanding. Existing methods typically use 1D serialization to expand the receptive field, which inevitably discards already scarce local geometric details and degrades detection of distant and small objects. To address this issue, we propose 3DTMDet, a novel detection network that synergistically combines state space models (Mamba) with Transformers. The core idea is to utilize SSM's linear complexity and advantages in long sequence modeling to effectively capture global interactions between sparse and distant points, while using Transformer modules with local attention to encode fine-grained geometric structures in local point sets, preserving accurate shape information. We propose the 3D Hybrid Mamba Transformer (3DHMT) block, which uses an SSM-Attention-SSM pipeline to balance global context understanding and local detail preservation, effectively alleviating the tension between receptive field enlargement and geometric preservation in remote detection. In addition, we introduce a voxel generation block inspired by LiDAR physics, which diffuses features along the sensor observation direction to reconstruct the complete object structure in occluded and distant areas. Extensive experiments on the KITTI and ONCE datasets show that 3DTMDet outperforms state-of-the-art detectors. The code is available at https://github.com/QiuBingwen/3DTMDet.
Chinese Translation
点云物体检测中的一个基本挑战在于远处点的极端稀疏性与对远程上下文理解的需求之间的冲突。现有的方法通常使用一维序列化来扩展感受野,这不可避免地丢弃了已经稀缺的局部几何细节,并减少了对远处和小物体的检测。为了解决这个问题,我们提出了3DTMDet,一种新颖的检测网络,它协同结合了状态空间模型(Mamba)和变换器。其核心思想是利用状态空间模型在长序列建模中的线性复杂性和优势,有效捕捉稀疏和远处点之间的全局交互,同时使用具有局部注意力的变换器模块对局部点集中的细粒度几何结构进行编码,以保留准确的形状信息。我们提出了3D混合Mamba变换器(3DHMT)模块,该模块使用SSM-注意力-SSM管道来平衡全局上下文理解和局部细节保留,有效缓解了在远程检测中感受野扩大与几何保留之间的紧张关系。此外,我们引入了一个受激光雷达物理启发的体素生成模块,该模块沿传感器观察方向扩散特征,以重建遮挡和远处区域的完整物体结构。在KITTI和ONCE数据集上进行的广泛实验表明,3DTMDet的性能优于最先进的检测器。代码可在 https://github.com/QiuBingwen/3DTMDet 获取。
cs.CV / 27 / 2605.15561
RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding
RoiMAM:用于高效视觉-语言理解的感兴趣区域医学注意力模型
Abstract
Vision-Language Models (VLMs) facilitate medical visual question answering (MedVQA) by jointly interpreting images and text. However, existing models typically depend on large architectures and closed-set answers, which limits their efficiency and potential clinical applicability. To overcome these shortcomings, we introduce RoiMAM, an efficient VLM. It integrates a training-free ROI Generation Module with Semantic Selective Suppression to focus on lesion-relevant regions, alongside a Text Prompt Enhancer module that provides modality-specific context without introducing training parameters. Compared to the widely used MedVInT-TD model, our design achieves efficient and accurate diagnosis at less than 20% of the model size, while improving accuracy by approximately 2% on SLAKE and 4.6% on PMC-VQA.
Chinese Translation
视觉-语言模型(VLMs)通过共同解释图像和文本,促进医学视觉问答(MedVQA)。然而,现有模型通常依赖于大型架构和封闭集答案,这限制了它们的效率和潜在的临床适用性。为了解决这些缺点,我们提出了RoiMAM,一种高效的VLM。它集成了一个无训练的ROI生成模块与语义选择性抑制,以专注于与病变相关的区域,并配备了一个文本提示增强模块,提供特定模态的上下文而不引入训练参数。与广泛使用的MedVInT-TD模型相比,我们的设计在模型大小不到20%的情况下,实现了高效且准确的诊断,同时在SLAKE上提高了约2%的准确率,在PMC-VQA上提高了4.6%。
cs.CV / 28 / 2605.15574
MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays
MI-CXR:多时间间隔胸部X光片的纵向推理基准
Abstract
Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce MI-CXR, a benchmark for standardized evaluation of Multi-Interval longitudinal reasoning over multi-visit CXR sequences, without requiring free-form report generation or additional clinical context. MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization, which assess clinically grounded visual reasoning over time. Evaluating 14 state-of-the-art vision-language models (VLMs) shows low overall performance, with an average accuracy of 29.3%, only modestly above random guessing. Using stage-wise diagnostic probing, we find that models often produce locally plausible interval descriptions but fail to enforce temporal constraints or compose evidence into globally consistent decisions over the full timeline. These findings reveal key limitations of current VLMs and establish MI-CXR as a principled benchmark for longitudinal medical reasoning. The benchmark is available at https://github.com/AIDASLab/MI-CXR
Chinese Translation
纵向胸部X光片(CXR)解读需要对多次患者就诊中的疾病演变进行推理,然而大多数现有的医学视觉问答(VQA)基准集中于单幅图像或短时间范围的图像对。我们提出了MI-CXR,这是一个用于标准化评估多时间间隔纵向推理的基准,适用于多次就诊的CXR序列,而无需生成自由格式的报告或额外的临床背景信息。MI-CXR包含五次就诊患者时间线的五选一多项选择题,并实例化了三类互补的任务:时间事件定位、区间变化推理和全局轨迹总结,旨在评估临床基础的时间视觉推理。对14个最先进的视觉-语言模型(VLMs)的评估显示整体表现较低,平均准确率为29.3%,仅略高于随机猜测。通过阶段性诊断探测,我们发现模型通常生成局部合理的区间描述,但未能强制执行时间约束或将证据组合成覆盖完整时间线的全局一致决策。这些发现揭示了当前VLMs的关键局限性,并确立了MI-CXR作为纵向医学推理的原则性基准。该基准可在 https://github.com/AIDASLab/MI-CXR 获取。
cs.CV / 29 / 2605.15582
LDGuid: A Framework for Robust Change Detection via Latent Difference Guidance
LDGuid:通过潜在差异引导实现稳健变化检测的框架
Abstract
Modern deep learning models for change detection (CD) often struggle to explicitly represent task-relevant semantic differences. This paper proposes the Latent Difference Guidance (LDGuid) framework that explicitly learns and injects semantic differences into CD models. LDGuid deploys adversarial autoencoding to implement a difference embedding (DE) module. The DE module is pretrained via the information bottleneck method, restricting it to learn only task-relevant differences between pre- and post-event samples. The learned latent difference is then used as an explicit guidance signal in the CD model. We validate LDGuid by integrating it into U-Net, BIT, and AERNet baselines for CD and evaluating it on LEVIR-CD, WHU-CD, SVCD, and CaBuAr datasets. Experimental results show that LDGuid enhances segmentation performance across all benchmarks, with particularly remarkable gains in challenging settings affected by spectral noise. The results further highlight the ability of LDGuid in incorporating domain knowledge, such as task-specific spectral indices. Our findings suggest that semantic difference learning can drastically enhance the robustness of CD in remote sensing.
Chinese Translation
现代深度学习模型在变化检测(CD)中往往难以明确表示与任务相关的语义差异。本文提出了潜在差异引导(Latent Difference Guidance,LDGuid)框架,该框架明确学习并将语义差异注入到CD模型中。LDGuid采用对抗自编码技术实现差异嵌入(Difference Embedding,DE)模块。DE模块通过信息瓶颈方法进行预训练,限制其仅学习事件前后样本之间与任务相关的差异。学习到的潜在差异随后作为CD模型中的显式引导信号。我们通过将LDGuid集成到U-Net、BIT和AERNet基线模型中进行验证,并在LEVIR-CD、WHU-CD、SVCD和CaBuAr数据集上进行评估。实验结果表明,LDGuid在所有基准测试中提升了分割性能,尤其在受到光谱噪声影响的挑战性环境中表现出显著的提升。结果进一步突显了LDGuid在融入领域知识(如任务特定光谱指数)方面的能力。我们的研究结果表明,语义差异学习可以显著增强遥感变化检测的稳健性。
cs.CV / 30 / 2605.15583
Unsupervised 3D Human Pose Estimation via Conditional Multi-view Ancestral Sampling
通过条件多视角祖先采样进行无监督的3D人体姿态估计
Abstract
We propose a method of estimating a 3D human pose from a single view without 3D supervision. The key to our method is to leverage the 2D diffusion priors of motion diffusion models (MDMs) pre-trained on large 2D human pose datasets. Specifically, we extend multi-view ancestral sampling of diffusion models to the task of 2D-3D lifting of human pose. To this end, we newly propose a conditional multi-view ancestral sampling (cMAS) that optimizes the 3D pose such that its multi-view projections follow the manifold in 2D MDM noise space, while conditioning the 3D pose to match the given 2D poses and anatomical constraints of humans. Experiments on the Yoga dataset demonstrate that our method achieves better cross-domain performance compared to state-of-the-art supervised and unsupervised 3D pose estimation methods, including extreme human poses where 3D supervision is unavailable. Code is available at: https://github.com/asaa0001/c-MAS.
Chinese Translation
我们提出了一种从单视角估计3D人体姿态的方法,该方法不依赖于3D监督。我们方法的关键在于利用在大型2D人体姿态数据集上预训练的运动扩散模型(MDMs)的2D扩散先验。具体而言,我们将扩散模型的多视角祖先采样扩展到人体姿态的2D-3D提升任务。为此,我们新提出了一种条件多视角祖先采样(cMAS),该方法优化3D姿态,使其多视角投影遵循2D MDM噪声空间中的流形,同时将3D姿态条件化以匹配给定的2D姿态和人体解剖约束。在Yoga数据集上的实验表明,我们的方法在跨领域性能上优于最先进的监督和无监督3D姿态估计方法,包括在缺乏3D监督的极端人体姿态下。代码可在以下链接获取:https://github.com/asaa0001/c-MAS。
cs.CV / 31 / 2605.15584
AGC: Adaptive Geodesic Correction for Adversarial Robustness on Vision-Language Models
AGC:用于视觉-语言模型的自适应测地线校正以增强对抗鲁棒性
Abstract
Vision-language models like CLIP have demonstrated remarkable zero-shot transfer capabilities. However, their susceptibility to imperceptible adversarial perturbations remains a critical security concern. While test-time defenses offer a pragmatic solution for deployed models, existing approaches typically rely on gradient-based optimization during inference, incurring significant computational overhead. In this paper, we revisit the role of data augmentation in CLIP robustness and observe that augmentations are not equally effective: specific augmentations consistently provide robust geometric cues that align with correct class semantics in the hyperspherical feature space. Based on this, we propose Adaptive Geodesic Correction (AGC), a training-free defense mechanism that requires no parameter updates. AGC identifies a reliable augmentation as a geometric anchor and corrects the input feature towards it, utilizing an adaptive step size to balance robustness against clean accuracy preservation. AGC achieves superior performance across eight fine-grained datasets and three CLIP backbones, improving average robust accuracy by 44.4% over the state-of-the-art baseline while delivering a 10× reduction in inference latency. Our findings reveal a fundamental geometric property of CLIP features, offering a highly efficient and effective paradigm for robust multimodal deployment.
Chinese Translation
视觉-语言模型如CLIP展示了显著的零样本迁移能力。然而,它们对不可察觉的对抗扰动的敏感性仍然是一个关键的安全隐患。尽管测试时防御为已部署模型提供了一种务实的解决方案,但现有方法通常在推理过程中依赖基于梯度的优化,导致显著的计算开销。在本文中,我们重新审视了数据增强在CLIP鲁棒性中的作用,并观察到增强效果并不均衡:特定的增强方法始终提供与超球面特征空间中正确类别语义一致的鲁棒几何线索。基于此,我们提出了自适应测地线校正(AGC),这是一种无需训练的防御机制,不需要参数更新。AGC将可靠的增强视为几何锚点,并将输入特征朝其方向校正,利用自适应步长在鲁棒性与保持清晰准确性之间取得平衡。AGC在八个细粒度数据集和三个CLIP骨干网络上表现优越,平均鲁棒准确率比最先进的基线提高了44.4%,同时推理延迟减少了10倍。我们的发现揭示了CLIP特征的基本几何属性,为鲁棒的多模态部署提供了一种高效且有效的范式。
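Correcting a feature toward an anchor on the unit hypersphere, as the AGC abstract describes, can be sketched with spherical interpolation. This is an illustration under stated assumptions, not the authors' procedure: the function name `geodesic_correct`, the `max_step` parameter, and the rule that the step grows with angular distance from the anchor are all hypothetical.

```python
import numpy as np

def geodesic_correct(feat, anchor, max_step=0.5):
    """Move a feature along the great-circle arc toward an anchor feature.

    Both inputs are L2-normalised first. The fraction of the arc travelled,
    t, is adaptive: here it scales with (1 - cos) / 2, an assumed heuristic.
    """
    f = feat / np.linalg.norm(feat)
    a = anchor / np.linalg.norm(anchor)
    cos = float(np.clip(f @ a, -1.0, 1.0))
    theta = np.arccos(cos)
    # Degenerate cases: identical or antipodal points have no unique geodesic.
    if theta < 1e-6 or np.pi - theta < 1e-6:
        return f
    t = max_step * (1.0 - cos) / 2.0
    # Spherical linear interpolation (slerp) between f and a.
    out = (np.sin((1.0 - t) * theta) * f + np.sin(t * theta) * a) / np.sin(theta)
    return out / np.linalg.norm(out)

if __name__ == "__main__":
    f = np.array([1.0, 0.0])
    a = np.array([0.0, 1.0])
    print(geodesic_correct(f, a))
```

Interpolating along the arc rather than along the chord keeps the corrected feature on the sphere, which matters when similarities are computed as cosine scores.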
cs.CV / 32 / 2605.15592
Efficient Image Synthesis with Sphere Latent Encoder
基于球体潜在编码器的高效图像合成
Abstract
Few-step image generation has seen rapid progress, with consistency and meanflow-based methods significantly reducing the number of sampling steps. Despite their low inference cost, these approaches often suffer from training instability and limited scalability. Sphere Encoder is a recent alternative that produces high-quality images in only a few steps; however, it requires repeated transitions between the pixel space and latent space during inference while jointly optimizing reconstruction and generation within a single architecture. This design leads to computational inefficiency and objective conflict between reconstruction and generation. To address these limitations, we decouple the framework into a fixed pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space. Our approach eliminates repeated pixel-space operations during training and inference, improving efficiency and allowing reconstruction and generation to specialize independently. On Animal-Faces, Oxford-Flowers and ImageNet-1K datasets, our method significantly outperforms Sphere Encoder in both generation quality and inference speed, while achieving competitive results against strong few-step and multi-step baselines.
Chinese Translation
少步图像生成已经取得了快速进展,基于一致性和均流的方法显著减少了采样步骤的数量。尽管这些方法具有较低的推理成本,但它们往往面临训练不稳定和可扩展性有限的问题。球体编码器(Sphere Encoder)是一种最近提出的替代方案,它仅需少量步骤即可生成高质量图像;然而,它在推理过程中需要在像素空间和潜在空间之间进行重复转换,同时在单一架构内联合优化重建和生成。这种设计导致了计算效率低下以及重建与生成之间的目标冲突。为了解决这些限制,我们将框架解耦为一个固定的预训练图像编码器和一个完全在球形潜在空间中训练的独立潜在去噪模型。我们的方法消除了训练和推理过程中的重复像素空间操作,提高了效率,并允许重建和生成独立专业化。在Animal-Faces、Oxford-Flowers和ImageNet-1K数据集上,我们的方法在生成质量和推理速度上显著超越了球体编码器,同时在强大的少步和多步基准测试中也取得了竞争力的结果。
cs.CV / 33 / 2605.15597
CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage
CM-EVS:用于完整场景覆盖的稀疏全景RGB-D-姿态数据
Abstract
Modern 3D visual learning relies on observations sampled from metric 3D assets, yet existing scans, meshes, point clouds, simulations, and reconstructions do not directly provide a sparse, comparable, and geometry-consistent panoramic training interface. Dense trajectories duplicate nearby views, source-specific rendering policies yield heterogeneous annotations, and sparse heuristics may miss important regions or introduce depth-inconsistent observations. We study how to convert 3D assets into sparse panoramic RGB-D-pose data that preserves complete scene coverage with low redundancy and auditable provenance. We propose COVER (Coverage-Oriented Viewpoint curation with ERP Range-depth warping), a training-free ERP viewpoint curator that projects geometry observed from selected views into candidate ERP probes, scores incremental coverage, and penalizes depth conflicts. Under bounded proxy error, its greedy coverage proxy preserves the standard coverage-style approximation behavior up to an additive error term. Using COVER, we build CM-EVS (Coverage-curated Metric ERP View Set), a panoramic RGB-D-pose dataset with 36,373 curated ERP frames from 1,275 indoor scenes across Blender indoor, HM3D, and ScanNet++, complemented by outdoor panoramas from TartanGround and OB3D re-encoded into the same schema. Each frame provides full-sphere RGB, metric range depth, and calibrated pose; COVER-produced indoor frames also include per-step provenance logs. With a median of only 25 frames per indoor scene, CM-EVS covers all 13 unified room types while maintaining compact scene-level coverage. Experiments show that COVER improves the coverage-conflict trade-off, making CM-EVS a sparse, compact, and auditable RGB-D-pose resource for geometry-consistent panoramic 3D learning.
Chinese Translation
现代3D视觉学习依赖于从度量3D资产中采样的观察数据,然而现有的扫描、网格、点云、模拟和重建并未直接提供稀疏、可比较且几何一致的全景训练接口。密集轨迹重复了邻近视图,特定来源的渲染策略产生了异质注释,而稀疏启发式方法可能会遗漏重要区域或引入深度不一致的观察。我们研究如何将3D资产转换为稀疏全景RGB-D-姿态数据,以保持完整的场景覆盖,同时降低冗余和可审计的来源。我们提出了COVER(基于覆盖的视点策划与ERP范围深度扭曲),这是一种无训练的ERP视点策划器,它将从选定视图观察到的几何投影到候选ERP探针中,评分增量覆盖,并惩罚深度冲突。在有界代理误差下,其贪婪覆盖代理保持了标准覆盖风格的近似行为,最多增加一个误差项。使用COVER,我们构建了CM-EVS(覆盖策划的度量ERP视图集),这是一个包含来自Blender室内、HM3D和ScanNet++的1,275个室内场景的36,373个策划ERP帧的全景RGB-D-姿态数据集,并补充了来自TartanGround和OB3D的户外全景,这些全景重新编码为相同的模式。每个帧提供全球面RGB、度量范围深度和校准姿态;COVER生成的室内帧包括每步的来源日志。CM-EVS在每个室内场景中仅包含25帧的中位数,覆盖了所有13种统一房间类型,同时保持紧凑的场景级覆盖。实验表明,COVER改善了覆盖与冲突的权衡,使CM-EVS成为一个稀疏、紧凑且可审计的RGB-D-姿态资源,适用于几何一致的全景3D学习。
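The greedy selection that COVER is described as performing (score incremental coverage, penalize depth conflicts) can be sketched as a penalised set-cover loop. The data layout below is hypothetical: candidate probes are reduced to sets of visible surface-cell ids, and depth conflicts to a precomputed count per probe. The real method works on ERP range-depth warps, which this sketch does not model.

```python
def greedy_select(candidates, conflicts, budget, penalty=1.0):
    """Greedily pick viewpoints by marginal coverage minus a conflict penalty.

    candidates: dict mapping probe name -> set of surface-cell ids it observes
    conflicts:  dict mapping probe name -> count of depth-inconsistent cells
                (a hypothetical stand-in for the paper's conflict term)
    budget:     maximum number of viewpoints to keep
    """
    covered, chosen = set(), []
    remaining = dict(candidates)

    def score(name):
        gain = len(remaining[name] - covered)  # newly covered cells only
        return gain - penalty * conflicts.get(name, 0)

    while remaining and len(chosen) < budget:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        if score(best) <= 0:
            break  # no candidate adds net value
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered

if __name__ == "__main__":
    views = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2, 3, 4, 5}}
    print(greedy_select(views, {"c": 3}, budget=3))
```

In the toy input, probe `c` sees the most cells but carries three conflicts, so the loop prefers `a` and then `b` and stops before adding `c`.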
cs.CV / 34 / 2605.15599
Pretraining Objective Matters in Extreme Low-Data FGVC: A Backbone-Controlled Study
预训练目标在极低数据细粒度视觉分类中的重要性:一项基于骨干网络的研究
Abstract
Extreme low-data fine-grained classification is common in expert domains where labeling is expensive, yet practitioners still need principled guidance for selecting pretrained encoders. We study emerald inclusion grading with a custom dataset of labeled images across three classes and ask: under matched backbone capacity, how does pretraining objective affect downstream representation quality? We compare four frozen ViT-B/16 encoders trained with supervised classification, contrastive learning (SigLIP2), masked reconstruction (MAE), and self-distillation (DINOv3), and evaluate them with leave-one-out cross-validation using linear and nonlinear probes. To control statistical noise in the low-N regime, we use permutation testing (N=1000) on macro one-vs-rest AUC. Supervised and contrastive encoders provide the strongest linear separability (logistic AUC: 0.768 and 0.735; SVM AUC: 0.739 and 0.697), while MAE improves under nonlinear probes (XGBoost AUC: 0.713). We find that DINOv3 underperforms across probe families in this domain. These results support a practical recommendation for extreme low-data FGVC: prioritize margin-enforcing pretraining objectives when data scarcity restricts probing to linear decision rules, and consider reconstruction-style encoders when nonlinear classifiers are feasible given dataset constraints.
Chinese Translation
在标签成本高昂的专家领域,极低数据细粒度分类现象普遍存在,但从业者仍需有原则的指导来选择预训练编码器。我们研究了使用自定义数据集对三类图像进行的祖母绿内含物分级,并提出问题:在匹配的骨干网络容量下,预训练目标如何影响下游表示质量?我们比较了四个冻结的 ViT-B/16 编码器,这些编码器分别通过监督分类、对比学习(SigLIP2)、掩蔽重建(MAE)和自蒸馏(DINOv3)进行训练,并使用线性和非线性探针进行留一交叉验证评估。为了控制低样本量下的统计噪声,我们在宏观一对多 AUC 上进行了置换检验(N=1000)。监督和对比编码器提供了最强的线性可分性(逻辑回归 AUC:0.768 和 0.735;SVM AUC:0.739 和 0.697),而 MAE 在非线性探针下表现更佳(XGBoost AUC:0.713)。我们发现 DINOv3 在该领域的各类探针中表现不佳。这些结果支持了对极低数据细粒度视觉分类的实用建议:在数据稀缺限制探测为线性决策规则时,优先考虑边际强化的预训练目标,并在数据集约束下考虑重建风格的编码器,尤其是在非线性分类器可行时。
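The permutation test on macro one-vs-rest AUC mentioned in the abstract can be sketched as follows. This is a generic illustration, not the study's code: the function names are hypothetical, the scores are assumed to be held-out class scores (for example collected across leave-one-out folds), and the p-value uses the common add-one correction.

```python
import numpy as np

def macro_ovr_auc(y, scores):
    """Macro-averaged one-vs-rest AUC from an (n_samples, n_classes) score matrix."""
    aucs = []
    for k in range(scores.shape[1]):
        pos = scores[y == k, k]
        neg = scores[y != k, k]
        diff = pos[:, None] - neg[None, :]
        # Mann-Whitney form: share of (pos, neg) pairs ranked correctly; ties count 0.5.
        aucs.append(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)
    return float(np.mean(aucs))

def permutation_p_value(y, scores, n_perm=1000, seed=0):
    """Compare the observed AUC with AUCs obtained under shuffled labels."""
    rng = np.random.default_rng(seed)
    observed = macro_ovr_auc(y, scores)
    null = [macro_ovr_auc(rng.permutation(y), scores) for _ in range(n_perm)]
    # Add-one correction keeps the estimated p-value strictly positive.
    p = (1 + sum(a >= observed for a in null)) / (1 + n_perm)
    return observed, p

if __name__ == "__main__":
    y = np.repeat([0, 1, 2], 4)
    scores = np.eye(3)[y]  # toy, perfectly separable scores
    print(permutation_p_value(y, scores, n_perm=200))
```

Shuffling labels while keeping the scores fixed preserves class counts, so every permuted AUC is computed on the same class balance as the observed one.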
cs.CV / 35 / 2605.15615
Neutral-Reference Prompting for Vision-Language Models
视觉-语言模型的中性参考提示
Abstract
Efficient transfer learning of vision-language models (VLMs) commonly suffers from a Base-New Trade-off (BNT): improving performance on unseen (new) classes often degrades accuracy on known (base) classes. Addressing how to boost recognition of unseen classes without sacrificing known-class performance remains a central challenge. Existing work often simplistically attributes the BNT to overfitting on known classes. We observe an interesting phenomenon: VLMs frequently exhibit asymmetric confusion on certain downstream data, i.e., samples of class A are systematically mispredicted as class B, while the reverse confusion (B to A) rarely occurs. For known classes, this kind of bias can be mitigated by tuning using a cross-entropy loss, but for unseen classes, such pretraining-induced bias persists and harms generalization. Motivated by this, we propose NeRP, a plug-and-play prompting correction strategy that improves discrimination on unseen classes without modifying model parameters. NeRP leverages neutral text prompts and reference images to measure class-wise prior preferences along the pre-trained inter-class geometry, and combines them with the sample likelihood to obtain the model's surrogate score. If, for a given sample, the prior strongly favors the current prediction while the observed evidence is clearly insufficient, we perform a local flip between easily confusable class pairs, thereby correcting prior-dominated mispredictions. Extensive experiments across multiple backbones and 15 few-shot and cross-domain benchmarks show that NeRP substantially improves accuracy on unseen classes while preserving known-class prediction performance.
Chinese Translation
视觉-语言模型(VLMs)的高效迁移学习通常面临基类-新类权衡(Base-New Trade-off, BNT)的问题:在未见(新)类别上提高性能往往会降低已知(基)类别的准确性。如何在不牺牲已知类别性能的情况下提升未见类别的识别能力仍然是一个核心挑战。现有研究常常简单地将BNT归因于对已知类别的过拟合。我们观察到一个有趣的现象:VLMs在某些下游数据上经常表现出不对称的混淆,即A类样本系统性地被错误预测为B类,而反向混淆(B到A)则很少发生。对于已知类别,这种偏差可以通过使用交叉熵损失进行调优来缓解,但对于未见类别,这种由预训练引起的偏差仍然存在并损害了泛化能力。基于此,我们提出了NeRP,一种即插即用的提示修正策略,能够在不修改模型参数的情况下改善未见类别的区分能力。NeRP利用中性文本提示和参考图像来测量沿着预训练的类别间几何体的类别先验偏好,并将其与样本的似然性结合,以获得模型的替代评分。如果对于给定样本,先验强烈偏向当前预测,而观察到的证据明显不足,我们将在容易混淆的类别对之间进行局部翻转,从而纠正由先验主导的错误预测。在多个基础模型和15个少样本及跨领域基准上的广泛实验表明,NeRP显著提高了未见类别的准确性,同时保持了已知类别的预测性能。
cs.CV / 36 / 2605.15618
Latent Video Prediction Learns Better World Models
潜在视频预测学习更好的世界模型
Abstract
Self-supervised video models are increasingly framed as world models, yet their evaluation remains largely confined to a single top-1 accuracy score on clean benchmarks. This leaves a major gap in comprehending their potential as world models. We present the first systematic study addressing this gap, analyzing four matched-capacity frontier video foundation models, V-JEPA 2.1, V-JEPA 2, VideoPrism, and VideoMAEv2, across five robustness axes relevant to their deployment as video world models: feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, and sensitivity to temporal direction. Our evaluations establish that across all five axes, latent-prediction models form a distinct and consistent profile. They degrade more gracefully under pixel corruption, preserve usable class structure rather than mere geometric stability under occlusion, capture fine-grained physical contact cues without reconstructing pixels, and uniquely encode the arrow of time. These advantages can even survive task adaptation: a frozen V-JEPA 2 backbone with a lightweight attentive probe outperforms a fully fine-tuned VideoMAE and a supervised TimeSformer on corruption and occlusion robustness. Our extensive results offer concrete new evidence in favor of latent prediction for robust world modeling.
Chinese Translation
自监督视频模型越来越被视为世界模型,但其评估仍然主要局限于在干净基准上的单一 top-1 准确率。这使得我们在理解它们作为世界模型的潜力时存在重大缺口。我们提出了第一个系统性研究来解决这一缺口,分析了四个匹配容量的前沿视频基础模型:V-JEPA 2.1、V-JEPA 2、VideoPrism 和 VideoMAEv2,涵盖了五个与其作为视频世界模型部署相关的鲁棒性维度:特征可区分性、损坏鲁棒性、细粒度区分、遮挡鲁棒性和对时间方向的敏感性。我们的评估表明,在所有五个维度上,潜在预测模型形成了一个独特且一致的特征。它们在像素损坏下的降级表现更为优雅,在遮挡下保持可用的类别结构而不仅仅是几何稳定性,捕捉细粒度的物理接触线索而无需重建像素,并独特地编码时间的箭头。这些优势甚至可以在任务适应中存活:一个冻结的 V-JEPA 2 主干与一个轻量级的注意力探针在损坏和遮挡鲁棒性上超越了完全微调的 VideoMAE 和监督的 TimeSformer。我们广泛的结果为潜在预测在鲁棒世界建模中的应用提供了具体的新证据。
cs.CV / 37 / 2605.15621
LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs
LRCP:基于低秩可压缩性的视觉标记剪枝方法以提高大型视觉语言模型的效率
Abstract
Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods estimate token importance from attention scores, which may introduce positional bias, while representation-based methods reduce visual redundancy based on feature relations or reconstruction errors, overlooking the global structure of the visual token set. In this paper, we revisit visual token compression from the perspective of low-rank compressibility. Across models and datasets, we observe that visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, we propose LRCP, a training-free compression framework that first estimates the dominant low-rank subspace of visual tokens via PCA, and then scores each token by its projection residual onto this subspace, retaining tokens that are poorly explained by the low-rank background. Extensive experiments show that LRCP achieves superior results, preserving 94.7% of the original image-understanding performance with an 88.9% token reduction and 97.8% of the average video-understanding accuracy with an 87.5% token reduction.
Chinese Translation
大型视觉语言模型(LVLMs)在多模态理解方面表现出色,但其推理成本随着视觉标记数量的增加而迅速增长,尤其是在高分辨率图像和长视频的情况下。现有的基于注意力的方法通过注意力得分来估计标记的重要性,这可能引入位置偏差,而基于表示的方法则基于特征关系或重构误差来减少视觉冗余,却忽视了视觉标记集的全局结构。本文从低秩可压缩性的角度重新审视视觉标记压缩。我们观察到,在不同的模型和数据集上,视觉标记表示呈现出明显的低秩结构,存在一个主导子空间,即使在随机移除大量标记后仍保持稳定。基于这一发现,我们提出了LRCP,一个无训练的压缩框架,首先通过主成分分析(PCA)估计视觉标记的主导低秩子空间,然后通过标记在该子空间上的投影残差对每个标记进行评分,保留那些低秩背景无法很好解释的标记。大量实验表明,LRCP在保持94.7%原始图像理解性能的同时实现了88.9%的标记减少,并在保持97.8%的平均视频理解准确率的同时实现了87.5%的标记减少。
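The two steps the LRCP abstract describes (estimate a dominant low-rank subspace with PCA, then score each token by its projection residual) can be written in a few lines of NumPy. The sketch assumes tokens arrive as an `(n_tokens, dim)` array; the function name `lrcp_select` and the `rank` and `keep` parameters are illustrative choices, not values or an API from the paper.

```python
import numpy as np

def lrcp_select(tokens, rank, keep):
    """Keep the tokens that a rank-`rank` PCA subspace explains worst.

    tokens: (n_tokens, dim) array of visual token features
    rank:   number of principal directions treated as the low-rank background
    keep:   number of tokens to retain
    Returns (sorted indices of kept tokens, residual norm per token).
    """
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]                          # (rank, dim)
    projected = centered @ basis.T @ basis     # component inside the subspace
    residual = np.linalg.norm(centered - projected, axis=1)
    keep_idx = np.sort(np.argsort(-residual)[:keep])
    return keep_idx, residual

if __name__ == "__main__":
    base = np.zeros((20, 8))
    base[:, 0] = np.linspace(-10, 10, 20)      # tokens along one direction
    odd = np.zeros((2, 8))
    odd[0, 1] = 1.0                            # two tokens off that direction
    odd[1, 2] = 1.0
    idx, res = lrcp_select(np.vstack([base, odd]), rank=1, keep=2)
    print(idx)
```

In the toy input, twenty tokens lie along one axis and two point elsewhere, so the two off-axis tokens have the largest residuals and are the ones retained.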
cs.CV / 38 / 2605.15640
Learning Disentangled Representations for Generalized Multi-view Clustering
学习解耦表示以实现广义多视角聚类
Abstract
Multi-View Clustering (MVC) has gained significant attention for its ability to leverage complementary information across diverse views. However, existing deep MVC methods often struggle with view-distribution entanglement during cross-view fusion, which hampers the quality of the shared latent space and leads to suboptimal clustering results. To address this issue, we propose the Generalized Multi-view Auto-Encoder (GMAE), a framework designed to preserve cross-view complementarity through disentangled representation learning. Specifically, GMAE employs dual-path autoencoders to decouple source features into view-specific and view-common embeddings, facilitating the discovery of clearer clustering structures. We further construct cross-view adversarial discriminators to guide view-specific encoders in capturing more discriminative features. By strategically modulating mutual information, GMAE effectively aligns distributions and prevents representation collapse, ensuring the generation of robust, non-trivial embeddings. Comprehensive experiments on 13 benchmark datasets demonstrate that GMAE consistently outperforms state-of-the-art methods in both complete and incomplete MVC tasks. Our code implementation is available at the repository: https://github.com/obananas/GMAE.
Chinese Translation
多视角聚类(MVC)因其能够利用不同视角之间的互补信息而受到广泛关注。然而,现有的深度MVC方法在跨视角融合过程中常常面临视角分布纠缠的问题,这影响了共享潜在空间的质量,并导致次优的聚类结果。为了解决这一问题,我们提出了广义多视角自编码器(GMAE),该框架旨在通过解耦表示学习来保持跨视角的互补性。具体而言,GMAE采用双路径自编码器将源特征解耦为视角特定和视角共同的嵌入,从而促进更清晰的聚类结构的发现。我们进一步构建了跨视角对抗鉴别器,以指导视角特定的编码器捕捉更多的判别特征。通过战略性地调节互信息,GMAE有效地对齐分布并防止表示崩溃,确保生成稳健且非平凡的嵌入。在13个基准数据集上的全面实验表明,GMAE在完整和不完整的MVC任务中均持续优于最先进的方法。我们的代码实现可在以下仓库获取:https://github.com/obananas/GMAE。
cs.CV / 39 / 2605.15660
MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer
MaTe:图像是通过扩散变换器进行材料转移所需的一切
Abstract
Recent diffusion-based methods for material transfer rely on image fine-tuning or complex architectures with assistive networks, but face challenges including text dependency, extra computational costs, and feature misalignment. To address these limitations, we propose MaTe, a streamlined diffusion framework that eliminates textual guidance and reference networks. MaTe integrates input images at the token level, enabling unified processing via multi-modal attention in a shared latent space. This design removes the need for additional adapters, ControlNet, inversion sampling, or model fine-tuning. Extensive experiments demonstrate that MaTe achieves high-quality material generation under a zero-shot, training-free paradigm. It outperforms state-of-the-art methods in both visual quality and efficiency while preserving precise detail alignment, significantly simplifying inference prerequisites.
Chinese Translation
近期基于扩散的方法在材料转移中依赖于图像微调或复杂的辅助网络架构,但面临文本依赖、额外计算成本和特征不对齐等挑战。为了解决这些局限性,我们提出了MaTe,一个简化的扩散框架,消除了文本指导和参考网络。MaTe在标记级别整合输入图像,通过共享潜在空间中的多模态注意力实现统一处理。该设计消除了对额外适配器、ControlNet、反演采样或模型微调的需求。大量实验表明,MaTe在零-shot、无训练的范式下实现了高质量的材料生成。在视觉质量和效率方面超越了最先进的方法,同时保持了精确的细节对齐,显著简化了推理前提。
cs.CV / 40 / 2605.15661
VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation
VAGS:用于图像编辑和生成的速度自适应引导尺度
Abstract
Classifier-free guidance (CFG) is the primary control over how strongly text semantics move a flow-based sampler, yet standard practice holds its scale fixed across the entire ODE trajectory. This is a fundamental mismatch: early steps are noise-dominated and carry weak semantic signal, while late steps commit image structure and demand stronger directional commitment; more critically, the value of any guidance strength depends on whether the guided velocity is consistent with the model's current dynamics or working against them. We propose Velocity-Adaptive Guidance Scale (VAGS), a training-free replacement that multiplies the nominal scale by a bounded factor combining a temporal signal-level term with the cosine similarity between task-relevant velocity fields. For inversion-free editing, VAGS measures the alignment between source- and target-guided velocities, so edit strength at each step reflects local compatibility between preservation and transformation. For generation, VAGS-Gen uses the alignment between unconditional and conditional velocities as the analogous signal. Neither variant requires fine-tuning, auxiliary networks, or extra forward passes, and fixed CFG is recovered as a special case. On PIE-Bench and DIV2K for editing, and COCO17, CUB-200, and Flickr30K for generation, VAGS consistently improves structural fidelity and generation quality over fixed CFG and recent training-free guidance variants. The code is publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale.
Chinese Translation
无分类器引导(CFG)是控制文本语义如何强烈影响基于流的采样器的主要手段,但标准做法在整个常微分方程(ODE)轨迹中保持其尺度不变。这是一个根本的不匹配:早期步骤受噪声主导,携带较弱的语义信号,而后期步骤则承载图像结构并需要更强的方向性承诺;更关键的是,任何引导强度的价值取决于引导速度是与模型当前的动态一致还是相反。我们提出了速度自适应引导尺度(Velocity-Adaptive Guidance Scale,VAGS),这是一种无训练的替代方案,它将名义尺度乘以一个有界因子,该因子结合了时间信号水平项与任务相关速度场之间的余弦相似度。对于无反演编辑,VAGS测量源引导速度和目标引导速度之间的对齐,因此每一步的编辑强度反映了保持与变换之间的局部兼容性。对于生成,VAGS-Gen使用无条件和条件速度之间的对齐作为类似信号。两种变体均不需要微调、辅助网络或额外的前向传递,并且固定的CFG作为特例被恢复。在用于编辑的PIE-Bench和DIV2K以及用于生成的COCO17、CUB-200和Flickr30K上,VAGS在结构保真度和生成质量上始终优于固定CFG和最近的无训练引导变体。代码已公开发布在 https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale 。
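As a rough illustration of the scaling rule described in the VAGS abstract, the sketch below multiplies a nominal CFG scale by a clipped factor built from a time-dependent term and the cosine similarity of two velocity vectors. The abstract does not give the exact functional form, so the linear form `1 + beta * t * cos`, the parameter names `beta`, `f_min`, and `f_max`, and the convention that `t` runs from 0 (noise) to 1 (image) are all assumptions made for illustration, not the authors' formula.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two flattened velocity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def adaptive_scale(w, t, v_a, v_b, beta=0.5, f_min=0.5, f_max=1.5):
    """Hypothetical velocity-adaptive guidance scale.

    w    : nominal (fixed) CFG scale
    t    : assumed signal level in [0, 1], 0 = pure noise, 1 = clean image
    v_a, v_b : the two velocity fields being compared, flattened to lists
    beta : assumed strength of the adaptation; beta = 0 recovers fixed CFG
    """
    factor = 1.0 + beta * t * cosine_similarity(v_a, v_b)
    # Keep the multiplier bounded so the effective scale stays in a safe range.
    factor = min(max(factor, f_min), f_max)
    return w * factor
```

Under these assumptions, aligned velocities late in sampling raise the effective scale, opposed velocities lower it, and the earliest steps (small `t`) stay close to the nominal scale.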
cs.CV / 41 / 2605.15666
ChronoEarth-492K: A Large Scale and Long Horizon Spatiotemporal Hyperspectral Earth Observation Dataset and Benchmark
ChronoEarth-492K:一个大规模和长时间跨度的时空高光谱地球观测数据集及基准
Abstract
Hyperspectral imaging (HSI) provides dense spectral information for the Earth's surface, enabling material-level understanding of land cover and ecosystem dynamics. Despite recent progress in hyperspectral self-supervised learning (SSL), existing datasets remain temporally shallow, limiting the development of long-horizon spatiotemporal modeling. To address this gap, we introduce ChronoEarth-492K, the first large-scale, temporally calibrated hyperspectral SSL dataset built upon NASA's EO-1 Hyperion mission, the world's longest continuous hyperspectral archive to date (2001-2017). ChronoEarth-492K comprises 492,354 radiometrically harmonized patches across 185,398 global locations over 17 years, with 28,786 sites containing multi-temporal sequences ($\geq 3$ observations) that enable both short- and long-horizon temporal analysis. Building on this foundation, we establish the ChronoEarth-Benchmark, a unified evaluation suite spanning static, short-horizon, and long-horizon temporal tasks, constructed from six open-source geospatial products covering land cover, crop type, forest dynamics, and soil properties. We further introduce a standardized evaluation protocol and report extensive baseline results across state-of-the-art hyperspectral foundation models. Together, ChronoEarth-492K and the ChronoEarth-Benchmark provide the first large-scale, temporally grounded platform for systematic spatiotemporal hyperspectral representation learning.
Chinese Translation
高光谱成像(HSI)为地球表面提供了密集的光谱信息,使得对土地覆盖和生态系统动态的材料级理解成为可能。尽管在高光谱自监督学习(SSL)方面取得了近期进展,但现有数据集的时间深度仍然有限,限制了长时间跨度时空建模的发展。为了解决这一问题,我们推出了ChronoEarth-492K,这是第一个基于NASA的EO-1 Hyperion任务构建的大规模、时间校准的高光谱SSL数据集,该任务是迄今为止世界上最长的连续高光谱档案(2001-2017)。ChronoEarth-492K包含492,354个辐射学上协调的图像块,覆盖185,398个全球位置,跨越17年,其中28,786个地点包含多时相序列($\geq 3$次观测),能够进行短时间和长时间的时序分析。在此基础上,我们建立了ChronoEarth-Benchmark,这是一个统一的评估套件,涵盖静态、短时间和长时间的时序任务,构建于六个开放源代码的地理空间产品之上,涉及土地覆盖、作物类型、森林动态和土壤特性。我们进一步引入了标准化的评估协议,并报告了在最先进的高光谱基础模型上的广泛基线结果。ChronoEarth及其基准共同提供了第一个大规模、时间基础的平台,用于系统的时空高光谱表示学习。
cs.CV / 42 / 2605.15672
VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following
VLM 描迹而不追踪:诊断视觉路径跟随中的失败
Abstract
Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited gains, reasoning partially compensates through costly substitute strategies, and explicit tracing instructions fail to recover stable path following. Finally, tests on tangled-cable scenes and metro maps with richer visual complexity show that the same path-switching failure persists beyond our controlled settings.
Chinese Translation
视觉-语言模型(VLMs)在多模态基准测试中表现强劲,但在基本视觉操作的控制上可能仍然缺乏稳健性。我们研究了线条追踪(line tracing),其中模型必须通过连续的局部延续来跟随所选的视觉路径。为了隔离这种能力,我们设计了受控的追踪任务,这些任务引入了附近的竞争者,同时减少了诸如交叉和重叠等语义和拓扑歧义。在这些任务中,即使是最先进的VLMs也经常失去目标路径,并切换到附近的替代路径,特别是当这些替代路径在局部上与目标相似时。行为干预和内部分析表明,这些失败源于局部竞争:附近相似的干扰物使模型偏离真实的延续。标准的补救措施并未消除这一瓶颈:模型规模的扩大仅提供有限的收益,推理通过代价高昂的替代策略部分补偿,而明确的追踪指令未能恢复稳定的路径跟随。最后,在复杂视觉场景(如缠绕电缆的场景和地铁地图)上的测试表明,相同的路径切换失败在我们的受控设置之外仍然存在。
cs.CV / 43 / 2605.15682
DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer
DreamSR:通过感受野增强的扩散变换器实现超高分辨率图像超分辨率
Abstract
Large-scale pre-trained diffusion models have been extensively adopted for real-world image super-resolution (SR) because of their powerful generative priors through textual guidance. However, when super-resolving high-resolution images with a patch-wise inference strategy, most existing diffusion-based SR methods tend to suffer from over-generation, due to the misalignment between the global prompt from the LR image and the incomplete semantic information of local patches during each inference step. In addition, most existing methods fail to generate detailed textures in local patches because their network designs and training strategies overemphasize global generation capabilities. To address these issues, we present DreamSR, a novel SR model that suppresses local over-generation and improves fine-detail synthesis, thereby achieving visually faithful results with ultra-high-quality details. Specifically, we propose a dual-branch MM-ControlNet, where the ControlNet generates local textual features with patch-level prompts while the pre-trained DiT provides global textual features with global prompts, thereby mitigating over-generation and ensuring semantic consistency across patches. We also design a comprehensive training strategy with stage-specific data processing pipelines and a Receptive-Field Enhancement strategy, enhancing the model's capability to capture patch information and effectively restore local textures. Extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods, providing high-quality SR results. Code and model are available at https://github.com/jerrydong0219/DreamSR.
Chinese Translation
大规模预训练的扩散模型因其通过文本指导的强大生成先验而被广泛应用于实际图像超分辨率。然而,在采用块级推理策略对高分辨率图像进行超分辨率处理时,大多数现有的基于扩散的超分辨率方法往往会遭遇过度生成的问题,这主要是由于低分辨率(LR)图像的全局提示与每次推理步骤中局部块的不完整语义信息之间的错位。另一方面,大多数现有方法也未能在局部块中生成细致的纹理,这主要是由于网络设计和训练策略过于强调全局生成能力。为了解决这一问题,我们提出了DreamSR,这是一种新颖的超分辨率模型,能够抑制局部过度生成并改善细节合成,从而实现视觉上真实的超高质量细节。具体而言,我们提出了一种双分支的MM-ControlNet,其中ControlNet使用块级提示生成局部文本特征,而预训练的DiT则提供全局文本特征和全局提示,从而减轻过度生成并确保各块之间的语义一致性。我们还设计了一种全面的训练策略,结合阶段特定的数据处理流程和感受野增强策略,提升模型捕捉块信息的能力,有效恢复局部纹理。大量实验表明,DreamSR在性能上优于现有的最先进方法,提供高质量的超分辨率结果。代码和模型可在 https://github.com/jerrydong0219/DreamSR 获取。
cs.CV / 44 / 2605.15684
ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices
ElasticDiT:通过弹性架构和稀疏注意力实现高效扩散变换器,以便在移动设备上生成高分辨率图像
Abstract
The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models like Stable Diffusion-3 and FLUX.1. However, deploying these models on resource-constrained mobile devices entails prohibitive computational and memory overhead. While efficiency-driven approaches like Linear-DiT and static pruning alleviate bottlenecks, they often incur quality degradation. Unlike cloud environments, mobile constraints require a single-model paradigm that dynamically balances fidelity and latency. We introduce ElasticDiT, which achieves this dynamic trade-off by adjusting spatial compression ratios and DiT block depths. By integrating Shift Sparse Block Attention (SSBA) and a Tiny DWT-Distilled VAE (T-DVAE), ElasticDiT reduces inference latency and memory footprint while maintaining image quality. Experiments confirm that ElasticDiT effectively covers a wide range of fidelity-latency trade-offs within a single set of parameters. By jointly adjusting compression and depth, a single ElasticDiT model can be reconfigured on-the-fly to outperform task-specific baselines. Specifically, our flex lite variant achieves an HPS of 32.87, surpassing the Flux model, while maintaining competitive quality at 84.16 percent average sparsity through SSBA. Furthermore, the plug-and-play T-DVAE provides SD3-level reconstruction with only 1/8x the computational cost of standard VAEs, and Flow-GRPO boosts semantic alignment (GenEval: 66.93 to 73.62). These results demonstrate that ElasticDiT offers a versatile, hardware-adaptive solution that eliminates the need for multiple specialized models, providing a promising path for future high-resolution image generation on mobile devices.
Chinese Translation
扩散变换器(Diffusion Transformer,DiT)架构是高保真图像生成的最先进范式,支撑着如Stable Diffusion-3和FLUX.1等模型。然而,在资源受限的移动设备上部署这些模型会带来巨大的计算和内存开销。虽然像Linear-DiT和静态剪枝等以效率为驱动的方法缓解了瓶颈,但它们往往会导致质量下降。与云环境不同,移动设备的限制要求采用单模型范式,动态平衡保真度和延迟。我们提出了ElasticDiT,通过调整空间压缩比和DiT块深度来实现这种动态权衡。通过集成Shift Sparse Block Attention(SSBA)和Tiny DWT-Distilled VAE(T-DVAE),ElasticDiT在保持图像质量的同时减少了推理延迟和内存占用。实验结果确认,ElasticDiT有效覆盖了单一参数集内的广泛保真度-延迟权衡。通过联合调整压缩和深度,单个ElasticDiT模型可以实时重新配置,以超越特定任务的基线。具体而言,我们的flex lite变体的HPS达到32.87,超越了Flux模型,同时借助SSBA在84.16%的平均稀疏度下保持有竞争力的质量。此外,即插即用的T-DVAE以仅为标准VAE的1/8计算成本提供SD3级重建,而Flow-GRPO提升了语义对齐(GenEval:66.93至73.62)。这些结果表明,ElasticDiT提供了一种多功能的硬件自适应解决方案,消除了对多个专用模型的需求,为未来在移动设备上生成高分辨率图像提供了有希望的路径。
cs.CV / 45 / 2605.15689
How to Choose Your Teacher for Fine Grained Image Recognition
如何选择适合细粒度图像识别的教师模型
Abstract
Fine-grained image recognition classifies subcategories such as bird species or car models. While state-of-the-art (SOTA) models are accurate, they are often too resource-intensive for deployment on constrained devices. Knowledge distillation addresses this by transferring knowledge from a large teacher model to a smaller student model. A key challenge is selecting the right teacher, as it heavily impacts student performance. This paper introduces a teacher selection metric, Ratio 1-2, based on teacher prediction ratios. Extensive analysis of over one thousand experiments across 3 students, 8 teachers, and 8 datasets under 4 training strategies demonstrates that our metric improves teacher selection by 18% over previous methods, enabling small student models to achieve up to 17% accuracy gains. Experiment codebase is available at: https://github.com/arkel23/FGIR-KD-Teacher.
Chinese Translation
细粒度图像识别对鸟类物种或汽车型号等子类别进行分类。尽管最先进的(SOTA)模型具有较高的准确性,但它们通常对资源的需求过于庞大,难以在受限设备上部署。知识蒸馏通过将知识从大型教师模型转移到较小的学生模型来解决这一问题。选择合适的教师模型是一个关键挑战,因为这对学生的表现有很大影响。本文提出了一种基于教师预测比率的教师选择指标 Ratio 1-2。对3个学生、8个教师和8个数据集在4种训练策略下进行的超过一千次实验的广泛分析表明,我们的指标在教师选择上比以前的方法提高了18%,使得小型学生模型的准确性提升可达17%。实验代码库可在以下链接获取:https://github.com/arkel23/FGIR-KD-Teacher 。
cs.CV / 46 / 2605.15708
3D Segmentation Using Viewpoint-Dependent Spatial Relationships
基于视角依赖空间关系的3D分割
Abstract
Recent advances in 3D datasets and multimodal models have greatly improved natural language 3D scene understanding. However, most 3D referring segmentation methods do not explicitly represent the observer viewpoint, making spatial relations such as "left," "right," "front," and "behind" ambiguous and difficult to evaluate. We introduce a viewpoint-aware 3D referring segmentation dataset containing 220k benchmark samples, and scalable to tens of millions of viewpoint-conditioned samples through dense viewpoint sampling. In this dataset, target objects can only be identified through observer-centric spatial relations, making viewpoint-conditioned grounding necessary. We construct the benchmark by leveraging camera poses to automatically annotate observer-centric relations (left/right, front/behind) together with viewpoint-independent relations (above/under). Using this benchmark, we evaluate several existing 3D large multimodal models in a zero-shot setting and find that current models struggle with viewpoint-dependent spatial instructions. We further study how explicit viewpoint information can be incorporated into 3D large multimodal models. We introduce a viewpoint representation that encodes camera poses and conditions the model on the observation viewpoint, improving segmentation accuracy on viewpoint-dependent relations and increasing mIoU from 0.30 to 0.47 compared to a model without viewpoint conditioning. The dataset, code, and trained models will be made publicly available upon acceptance.
Chinese Translation
近年来,3D数据集和多模态模型的进展极大地提升了自然语言对3D场景的理解。然而,大多数3D指代分割方法并未明确表示观察者的视角,使得诸如“左”、“右”、“前”和“后”等空间关系变得模糊且难以评估。我们引入了一个视角感知的3D指代分割数据集,包含22万基准样本,并通过密集的视角采样扩展到数千万个视角条件样本。在该数据集中,目标对象只能通过观察者中心的空间关系来识别,从而使得视角条件的基础变得必要。我们通过利用相机姿态自动标注观察者中心的关系(左/右,前/后)以及视角无关的关系(上/下)来构建基准。利用该基准,我们在零-shot设置下评估了几种现有的3D大型多模态模型,发现当前模型在处理视角依赖的空间指令时表现不佳。我们进一步研究了如何将显式的视角信息融入3D大型多模态模型中。我们引入了一种视角表示,编码相机姿态并使模型以观察视角为条件,从而提高了对视角依赖关系的分割准确性,并将mIoU从0.30提高到0.47,相较于不进行视角条件的模型。该数据集、代码和训练模型将在接受后公开发布。
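To make the idea of annotating observer-centric relations from camera poses more concrete, here is a minimal sketch that labels one object as "left" or "right" of another for a given viewpoint. It is not the authors' pipeline: the function names are hypothetical, objects are reduced to single 3D center points, and the camera frame is assumed to follow an OpenCV-style convention (+x to the right, +z forward) with `cam_rot` given as a camera-to-world rotation matrix.

```python
def world_to_camera(point, cam_pos, cam_rot):
    """Express a world-space point in the camera frame.

    cam_rot is a 3x3 camera-to-world rotation (nested lists), so the inverse
    mapping is its transpose applied to (point - cam_pos).
    """
    d = [p - c for p, c in zip(point, cam_pos)]
    # Transpose multiply: output component j is column j of cam_rot dotted with d.
    return [sum(cam_rot[i][j] * d[i] for i in range(3)) for j in range(3)]


def horizontal_relation(target, anchor, cam_pos, cam_rot):
    """Return 'left' or 'right' for target relative to anchor, as seen by the camera."""
    x_target = world_to_camera(target, cam_pos, cam_rot)[0]
    x_anchor = world_to_camera(anchor, cam_pos, cam_rot)[0]
    return "left" if x_target < x_anchor else "right"


# Two object centers in world coordinates (illustrative values).
chair, table = (-1.0, 0.0, 2.0), (1.0, 0.0, 2.0)

# Viewpoint A: camera at the origin looking along world +z.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Viewpoint B: camera on the far side, rotated 180 degrees about the y axis.
flipped = [[-1, 0, 0], [0, 1, 0], [0, 0, -1]]

relation_a = horizontal_relation(chair, table, (0.0, 0.0, 0.0), identity)
relation_b = horizontal_relation(chair, table, (0.0, 0.0, 4.0), flipped)
```

The same pair of objects receives opposite labels from the two viewpoints, which is the ambiguity that viewpoint conditioning is meant to resolve; a relation such as above/under would use the vertical axis and would not change under this rotation.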
cs.CV / 47 / 2605.15711
EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy
EntropyScan:通过视觉注意熵实现LVLM中的模型级后门检测
Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to backdoor attacks. Existing defense methods predominantly focus on sample-level defense, which relies on the knowledge of training data or triggers. However, identifying whether a given model is backdoored remains a critical but unexplored task. To fill this gap, we propose EntropyScan, a lightweight and trigger-agnostic method for model-level backdoor detection in LVLMs. We first observe that backdoor injection disrupts the cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples. Based on this insight, EntropyScan detects backdoored models by quantifying such attention deviations. Specifically, it extracts visual attention distributions from the initial layers of the Large Language Model (LLM) and applies Tsallis entropy to capture these structural distortions. By employing a reference-anchored Z-score normalization on a small set of benign samples, it effectively identifies the backdoored model. Extensive experiments across two LVLM architectures and three advanced attack scenarios show that EntropyScan achieves an average F1 score of 98.5% and an AUC of 96.6%. Our code will be publicly available soon.
Chinese Translation
大型视觉语言模型(LVLM)在各种任务中展现了显著的能力,但仍然容易受到后门攻击。现有的防御方法主要集中在样本级防御,这依赖于对训练数据或触发器的了解。然而,识别给定模型是否存在后门仍然是一个关键但未被探索的任务。为填补这一空白,我们提出了EntropyScan,这是一种轻量级且与触发器无关的模型级后门检测方法。我们首先观察到,后门注入会破坏跨模态对齐,导致在良性样本上视觉注意分配出现明显的结构异常。基于这一见解,EntropyScan通过量化这种注意偏差来检测后门模型。具体而言,它从大型语言模型(LLM)的初始层提取视觉注意分布,并应用Tsallis熵来捕捉这些结构扭曲。通过对一小组良性样本进行参考锚定的Z-score标准化,它有效地识别出后门模型。在两个LVLM架构和三个高级攻击场景下的广泛实验表明,EntropyScan在平均F1分数上达到了98.5%,AUC为96.6%。我们的代码将很快公开发布。
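The two quantities named in the EntropyScan abstract, Tsallis entropy and a reference-anchored Z-score, can be sketched in a few lines. This is only an illustration of the standard definitions: the entropic index `q = 2`, the function names, and the assumption that the reference values are entropy scores collected from a known-clean reference on benign samples are not taken from the paper.

```python
import math
import statistics


def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i**q) / (q - 1)."""
    if abs(q - 1.0) < 1e-12:
        # The q -> 1 limit is the Shannon entropy (natural log).
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def reference_z_score(value, reference_values):
    """Z-score of `value` against the mean and sample std of reference scores."""
    mean = statistics.mean(reference_values)
    std = statistics.stdev(reference_values)
    return (value - mean) / std


# Attention spread evenly over four visual tokens vs. collapsed onto one.
uniform_entropy = tsallis_entropy([0.25, 0.25, 0.25, 0.25])
peaked_entropy = tsallis_entropy([1.0, 0.0, 0.0, 0.0])
```

A model whose attention entropy on benign inputs yields a large absolute Z-score relative to the reference would be flagged as suspicious; the decision threshold is left open here because the abstract does not state one.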
cs.CV / 48 / 2605.15720
Semi-MedRef: Semi-Supervised Medical Referring Image Segmentation with Cross-Modal Alignment
Semi-MedRef:基于跨模态对齐的半监督医学参考图像分割
Abstract
Medical referring image segmentation (MRIS) requires pixel-level masks aligned with textual descriptions of anatomical locations, making annotation costly in low-label regimes. Semi-supervised learning (SSL) can mitigate this burden by leveraging unlabeled data, but its success hinges on maintaining reliable image-text alignment under perturbations. Most existing SSL-based referred segmentation methods use either independent or simplistic multi-modal perturbations (e.g., left-right flips), without fully addressing cross-modal alignment under strong augmentation, while CutMix, highly effective in single-modal SSL, remains underexplored in multi-modal settings due to its tendency to disrupt image-text coherence. We propose Semi-MedRef, a teacher-student SSL framework designed to explicitly maintain consistency between medical images and positional language through three alignment-preserving components: T-PatchMix, a cross-modal CutMix-style augmentation that synchronizes patch mixing with referring expressions via position-constrained and probability-driven rules; PosAug, a position-aware text augmentation that masks or fuzzes anatomical phrases; and ITCL, a position-guided image-text contrastive learning module, which leverages positional pseudo-labels to construct soft anatomical positives and strengthen medically grounded cross-modal alignment. Experiments on QaTa-COV19 and MosMedData+ demonstrate that Semi-MedRef consistently outperforms both fully supervised and semi-supervised baselines across all label regimes.
Chinese Translation
医学参考图像分割(MRIS)需要与解剖位置的文本描述对齐的像素级掩膜,这使得在低标签条件下的注释成本高昂。半监督学习(SSL)可以通过利用未标记数据来减轻这一负担,但其成功依赖于在扰动下保持可靠的图像-文本对齐。现有的大多数基于SSL的参考分割方法要么使用独立的,要么使用简单的多模态扰动(例如,左右翻转),未能充分解决在强增强下的跨模态对齐问题,而CutMix在单模态SSL中非常有效,但由于其可能破坏图像-文本的一致性,在多模态环境中仍未得到充分探索。我们提出了Semi-MedRef,一个教师-学生SSL框架,旨在通过三个保持对齐的一致性组件显式维护医学图像与位置语言之间的一致性:T-PatchMix,一种跨模态CutMix风格的增强方法,通过位置约束和基于概率的规则将补丁混合与参考表达同步;PosAug,一种位置感知的文本增强方法,用于掩盖或模糊解剖短语;以及ITCL,一个位置引导的图像-文本对比学习模块,利用位置伪标签构建软解剖正样本,并增强医学基础的跨模态对齐。在QaTa-COV19和MosMedData+上的实验表明,Semi-MedRef在所有标签条件下始终优于完全监督和半监督基线。
cs.CV / 49 / 2605.15725
DiLA: Disentangled Latent Action World Models
DiLA:解耦潜在动作世界模型
Abstract
Latent Action Models (LAMs) enable the learning of world models from unlabeled video by inferring abstract actions between consecutive frames. However, LAMs face a fundamental trade-off between action abstraction and generation fidelity. Existing methods typically circumvent this issue by using two-stage training with pre-trained world models or by limiting predictions to optical flow. In this paper, we introduce DiLA, a novel Disentangled Latent Action world model that aims to resolve this trade-off via content-structure disentanglement. Our key insight is that disentanglement and latent action learning are co-evolving: the predictive bottleneck inherent in latent action learning serves as a driving force for disentanglement, compelling the model to distill spatial layouts into the structure pathway while offloading visual details to a separate content pathway for generation. This synergy yields a continuous, semantically structured latent action space without compromising generative quality. DiLA achieves superior results in video generation quality, action transfer, visual planning, and manifold interpretability. These findings establish DiLA as a unified framework that simultaneously achieves high-level action abstraction and high-fidelity generation, advancing the frontier of self-supervised world model learning.
Chinese Translation
潜在动作模型(Latent Action Models, LAMs)通过推断连续帧之间的抽象动作,使得从未标记视频中学习世界模型成为可能。然而,LAMs在动作抽象和生成保真度之间面临着根本性的权衡。现有方法通常通过使用预训练的世界模型进行两阶段训练,或将预测限制在光流上来规避这一问题。本文介绍了DiLA,一种新颖的解耦潜在动作世界模型,旨在通过内容-结构解耦来解决这一权衡。我们的关键见解是,解耦和潜在动作学习是共同演化的:潜在动作学习中固有的预测瓶颈成为了解耦的驱动力,迫使模型将空间布局提炼到结构路径中,同时将视觉细节卸载到单独的内容路径以进行生成。这种协同作用产生了一个连续的、语义结构化的潜在动作空间,而不损害生成质量。DiLA在视频生成质量、动作迁移、视觉规划和流形可解释性方面取得了优越的结果。这些发现确立了DiLA作为一个统一框架,能够同时实现高水平的动作抽象和高保真度的生成,推动了自监督世界模型学习的前沿。
cs.CV / 50 / 2605.15728
DecomPose: Disentangling Cross-Category Optimization Contention for Category-Level 6D Object Pose Estimation
DecomPose:解耦跨类别优化竞争以实现类别级6D物体姿态估计
Abstract
Category-level 6D object pose estimation is typically formulated as a multi-category joint learning problem with fully shared model parameters. However, pronounced geometric heterogeneity across categories entangles incompatible optimization signals in shared modules, resulting in gradient conflicts and negative transfer during training. To address this challenge, we first introduce gradient-based diagnostics to quantify module-level cross-category contention. Building on these diagnostics, we propose DecomPose, a difficulty-aware decomposition framework that mitigates optimization contention via: (1) difficulty-aware gradient decoupling, which groups categories using a data-driven difficulty proxy and routes each instance to a group-specific correspondence branch to isolate incompatible updates; and (2) stability-driven asymmetric branching, which assigns higher-capacity branches to structurally simple categories as stable optimization anchors while constraining complex categories with lightweight branches to suppress noisy updates and alleviate negative transfer. Extensive experiments on REAL275, CAMERA25, and HouseCat6D demonstrate that DecomPose effectively reduces cross-category optimization contention and delivers superior pose estimation performance across multiple benchmarks.
Chinese Translation
类别级6D物体姿态估计通常被表述为一个多类别联合学习问题,模型参数完全共享。然而,不同类别之间明显的几何异质性在共享模块中纠缠了不兼容的优化信号,导致训练过程中的梯度冲突和负迁移。为了解决这一挑战,我们首先引入基于梯度的诊断方法,以量化模块级跨类别竞争。在诊断结果的基础上,我们提出了DecomPose,一个基于难度感知的分解框架,通过以下方式减轻优化竞争:(1)难度感知的梯度解耦,利用数据驱动的难度代理对类别进行分组,并将每个实例路由到特定组的对应分支,以隔离不兼容的更新;(2)稳定性驱动的非对称分支,为结构简单的类别分配更高容量的分支作为稳定的优化锚点,同时用轻量级分支约束复杂类别,以抑制噪声更新并减轻负迁移。在REAL275、CAMERA25和HouseCat6D上的大量实验表明,DecomPose有效减少了跨类别优化竞争,并在多个基准测试中提供了卓越的姿态估计性能。
cs.CV / 51 / 2605.15735
UAM: A Dual-Stream Perspective on Forgetting in VLA Training
UAM:关于VLA训练中遗忘的双流视角
Abstract
Vision-language-action (VLA) models are typically built by fine-tuning a pretrained vision-language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over 95% of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object-target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.
Chinese Translation
视觉-语言-动作(VLA)模型通常通过在动作数据上微调预训练的视觉-语言模型(VLM)来构建。然而,我们表明,这种标准方法系统性地削弱了VLM的多模态能力,这种副作用我们称之为体现税。那么,VLA是否必须遗忘?受到生物视觉的双流组织的启发,我们将这种退化追溯到一个结构瓶颈:当前的VLA要求单个编码器同时支持语言基础的语义和与控制相关的视觉特征,而生物视觉则将识别和视觉运动控制分为不同的通路。在此基础上,我们提出了统一动作模型(UAM),它增加了一个平行的背侧专家(Dorsal Expert),这是大脑背侧通路的类比。为了使背侧专家成为有效的第二通路,并减少对VLM的控制学习负担,我们从一个预训练的生成模型初始化它,并用一个中层推理目标进行训练,该目标预测视觉动态。这个设计使我们能够仅在动作数据上端到端地训练整个VLA:没有参数冻结,没有梯度停止,也没有辅助的VL共同训练,UAM保留了超过95%的基础VLM的多模态能力,同时在各种操控任务的基准中实现了最高的平均成功率,这些任务探测了分布外的泛化,包括未见物体、新的物体-目标组合和指令变体。这些结果共同表明,VLA中的语义保留可以源于架构的分离本身,而不是通过冻结权重或辅助数据重放来强制实现,并且这种保留的语义能力可以自然地从VLM转移到动作的语义泛化中。
cs.CV / 52 / 2605.15736
BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation
BiomedAP:一种视觉信息引导的双锚框架,结合门控跨模态融合以实现稳健的医学视觉-语言适应
Abstract
Biomedical Vision-Language Models (VLMs) have shown remarkable promise in few-shot medical diagnosis but face a critical bottleneck: fragility to prompt variations. Existing adaptation frameworks typically optimize visual and textual prompts as independent streams, relying on ideal "Golden Prompts". In clinical reality, where descriptions are often noisy and heterogeneous, this modality isolation leads to unstable cross-modal alignment. To address this, we propose BiomedAP, a vision-informed dual-anchor framework with gated cross-modal fusion. BiomedAP enforces synergistic alignment through two mechanisms: (1) Gated Cross-Modal Fusion, which enables layer-wise interaction between modalities, acting as a dynamic noise regulator to suppress irrelevant textual cues; and (2) a Dual-Anchor Constraint that regularizes learnable prompts toward stable semantic centroids derived from both expert templates (High Anchors) and few-shot visual prototypes (Low Anchors). Extensive experiments across 11 benchmarks demonstrate that BiomedAP consistently surpasses baselines, achieving competitive few-shot accuracy and markedly enhanced robustness under prompt perturbations. Our code is available at: https://github.com/tongdiedie/BiomedAP. Keywords: Vision-Language Models; Prompt Learning; Parameter-Efficient Fine-Tuning; Few-shot Learning
Chinese Translation
生物医学视觉-语言模型(VLMs)在少量样本医学诊断中展现出显著的潜力,但面临一个关键瓶颈:对提示变化的脆弱性。现有的适应框架通常将视觉和文本提示作为独立流进行优化,依赖于理想的“黄金提示”。在临床现实中,描述往往是嘈杂和异质的,这种模态隔离导致跨模态对齐的不稳定性。为了解决这个问题,我们提出了BiomedAP,一种视觉信息引导的双锚框架,结合门控跨模态融合。BiomedAP通过两种机制强制协同对齐:(1) 门控跨模态融合,允许模态之间的层级交互,充当动态噪声调节器,以抑制无关的文本线索;(2) 双锚约束,规范可学习的提示朝向从专家模板(高锚点)和少量样本视觉原型(低锚点)派生的稳定语义质心。通过在11个基准测试中的广泛实验,BiomedAP始终超越基线,达到竞争性的少量样本准确率,并在提示扰动下显著增强稳健性。我们的代码可在以下链接获取:https://github.com/tongdiedie/BiomedAP。关键词:视觉-语言模型;提示学习;参数高效微调;少量样本学习
cs.CV / 53 / 2605.15737
BARRIER: Bounded Activation Regions for Robust Information Erasure
BARRIER:用于稳健信息抹除的有界激活区域
Abstract
Machine unlearning has reached a critical bottleneck. As traditional weight-space interventions focus primarily on erasing targeted concepts, they often fail to prevent the unintended suppression of other significant representations. This leads to substantial collateral damage, with essential knowledge being forgotten, because these methods lack formal mathematical guarantees for the preservation of neutral concepts. To avoid degradation, they are frequently forced into conservative updates. We propose BARRIER (Bounded Activation Regions for Robust Information Erasure), a paradigm-shifting framework that shifts the locus of intervention from static model weights to the dynamic geometry of hidden-layer activations. Unlike existing methods, BARRIER employs Interval Arithmetic (IA) on SVD-based projections of the activation space to encapsulate the specific target region within a bounding hypercube. By driving unlearning updates exclusively within this forget interval and mathematically bounding the model response on the complement, we ensure rigorous protection of the retain distribution. This geometric construction transforms the preservation of knowledge from an empirical heuristic into a formal optimization target with a probabilistic tail bound on functional drift. Crucially, this stability permits highly aggressive unlearning updates within the forget region. Empirical evaluations demonstrate that BARRIER matches state-of-the-art trade-offs across classifiers and diffusion models, maximizing targeted concept erasure while safeguarding the integrity of all other representations. Our code is available at https://github.com/OneAndZero24/BARRIER.
Chinese Translation
机器遗忘已达到一个关键瓶颈。传统的权重空间干预主要集中在抹除目标概念上,往往无法防止其他重要表征的意外抑制。这导致了显著的附带损害,重要知识被遗忘,因为这些方法缺乏对中性概念保存的正式数学保证。为了避免退化,它们常常被迫采用保守的更新。我们提出了BARRIER(Bounded Activation Regions for Robust Information Erasure),这是一个颠覆性的框架,将干预的重点从静态模型权重转移到隐藏层激活的动态几何结构。与现有方法不同,BARRIER在基于奇异值分解(SVD)的激活空间投影上采用区间算术(Interval Arithmetic),以在一个有界超立方体内封装特定的目标区域。通过仅在这个遗忘区间内进行遗忘更新,并在补集上对模型响应进行数学界定,我们确保了保留分布的严格保护。这种几何构造将知识的保存从经验启发式转变为具有概率尾界的正式优化目标。关键是,这种稳定性允许在遗忘区域内进行高度激进的遗忘更新。实证评估表明,BARRIER在分类器和扩散模型中匹配了最先进的权衡,最大化目标概念的抹除,同时保障所有其他表征的完整性。我们的代码可在 https://github.com/OneAndZero24/BARRIER 获取。
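The interval-arithmetic ingredient mentioned in the BARRIER abstract can be illustrated with the textbook rule for pushing an axis-aligned box through an affine map. The sketch below is not the authors' method (which operates on SVD-based projections of activations and adds a probabilistic bound); the function name and the plain-list representation are assumptions, and it only shows how lower and upper bounds on a layer's response over a hypercube can be computed exactly for a single linear layer.

```python
def affine_interval(weights, bias, lower, upper):
    """Propagate the box [lower, upper] through y = W x + b.

    Uses the center/radius form of interval arithmetic: the output center is
    W c + b and the output radius is |W| r, which gives tight per-coordinate
    bounds for an affine map over a box.
    """
    center = [(lo + hi) / 2.0 for lo, hi in zip(lower, upper)]
    radius = [(hi - lo) / 2.0 for lo, hi in zip(lower, upper)]
    out_lower, out_upper = [], []
    for row, b in zip(weights, bias):
        mid = sum(w * c for w, c in zip(row, center)) + b
        rad = sum(abs(w) * r for w, r in zip(row, radius))
        out_lower.append(mid - rad)
        out_upper.append(mid + rad)
    return out_lower, out_upper


# A 1x2 layer evaluated over the unit square [0, 1] x [0, 1].
bounds = affine_interval([[1.0, -1.0]], [0.0], [0.0, 0.0], [1.0, 1.0])
```

Bounds of this kind are what allow a constraint such as "the response on a given region may not move by more than some tolerance" to be written as an explicit optimization target rather than checked empirically.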
cs.CV / 54 / 2605.15741
HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion
HyperDiT:用于高保真像素空间扩散的超连接变换器
Abstract
Pixel-space diffusion models bypass the reconstruction bottleneck of Variational Autoencoders (VAEs) but face a fundamental "granularity dilemma": capturing global semantics favors large patch scales, while generating high-fidelity details demands fine-grained inputs. To address this issue, we propose HyperDiT, a unified framework establishing Hyper-Connected Cross-Scale Interactions to bridge the semantic and pixel manifold. Diverging from injecting semantics by AdaLN, HyperDiT utilizes Cross-Attention mechanisms, enabling fine-grained tokens to query multi-level semantic anchors globally. To resolve the spatial mismatch during multi-scale interactions, we introduce Scale-Aware Rotary Position Embedding (SA-RoPE) to ensure precise geometric alignment among tokens of varying patch sizes. Furthermore, we incorporate Registers to learn the dense semantics from a pretrained Visual Foundation Model (VFM), effectively reducing generation hallucination and artifacts. Extensive experiments demonstrate that HyperDiT achieves state-of-the-art (SoTA) FID of $\mathbf{1.56}$ on ImageNet $256\times256$ directly within the pixel space. By combining the fine-grained stream with semantic guidance, HyperDiT offers a superior paradigm for high-fidelity pixel generation.
Chinese Translation
像素空间扩散模型绕过了变分自编码器(Variational Autoencoders, VAEs)的重建瓶颈,但面临着一个根本的“粒度困境”:捕捉全局语义倾向于大块尺度,而生成高保真细节则需要细粒度输入。为了解决这个问题,我们提出了HyperDiT,一个统一框架,通过建立超连接跨尺度交互来桥接语义和像素流形。与通过AdaLN注入语义不同,HyperDiT利用跨注意力机制,使细粒度标记能够全局查询多层次的语义锚点。为了解决多尺度交互中的空间不匹配问题,我们引入了尺度感知旋转位置嵌入(Scale-Aware Rotary Position Embedding, SA-RoPE),以确保不同块大小的标记之间的精确几何对齐。此外,我们结合了寄存器,从预训练的视觉基础模型(Visual Foundation Model, VFM)中学习密集语义,有效减少生成的幻觉和伪影。大量实验表明,HyperDiT直接在像素空间中于ImageNet $256\times256$上实现了最先进的(SoTA)FID $\mathbf{1.56}$。通过将细粒度流与语义指导相结合,HyperDiT为高保真像素生成提供了一个卓越的范式。
cs.CV / 55 / 2605.15755
Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models
基于属性的选择性推理用于多模态大语言模型的艺术作品情感理解
Abstract
Multimodal large language models (MLLMs) can produce fluent artwork emotion explanations, but they often suffer from attribute flooding: they enumerate many visible formal attributes without identifying which cues actually support the affective judgment. We therefore formulate artwork emotion understanding as Attribute-Grounded Selective Reasoning (AGSR), where predefined formal attributes serve as evidence units and only emotionally operative attributes should enter the final interpretation. To make this problem measurable, we extend EmoArt, originally introduced at ACM MM 2025 as a 132,664-artwork resource with content, formal-attribute, valence-arousal, and emotion annotations, by adding a 1,400-artwork human salience extension annotated by 15 art-trained annotators. This extension provides instance-level supervision for distinguishing attributes that are merely present from those that are emotionally salient. We further propose FAB-G (Formal-Attribute Bottleneck-Guided reasoning), a supervised multi-agent framework that first predicts attribute-level salience and then constrains downstream emotional analysis to the retained cues. Experiments show that FAB-G yields consistent gains in emotion, arousal, and valence prediction, achieves stronger agreement with human-marked salient attributes under Dice and Tversky metrics, and produces substantially more compact final explanations than prompting-based baselines. Cross-dataset evaluation further suggests that attribute-grounded salience selection transfers beyond the source distribution of EmoArt, while also revealing attribute-specific boundary cases. The dataset and project page are available at https://zhiliangzhang.github.io/EmoArt-130k/
Chinese Translation
多模态大语言模型(MLLMs)能够生成流畅的艺术作品情感解释,但它们常常面临属性泛滥的问题:它们列举了许多可见的形式属性,却未能识别哪些线索实际上支持情感判断。因此,我们将艺术作品情感理解形式化为基于属性的选择性推理(AGSR),在该框架中,预定义的形式属性作为证据单元,只有情感上有效的属性应纳入最终解释。为了使这一问题可度量,我们扩展了EmoArt,该资源最初在ACM MM 2025上介绍,包含132,664件艺术作品的内容、形式属性、情感唤起和情感注释,并增加了一个由15名艺术训练的注释员标注的1,400件艺术作品的人类显著性扩展。该扩展为区分仅存在的属性与情感显著的属性提供了实例级监督。我们进一步提出FAB-G(形式属性瓶颈引导推理),这是一个监督的多智能体框架,首先预测属性级显著性,然后将下游情感分析限制在保留的线索上。实验表明,FAB-G在情感、唤起和价值预测方面均取得了一致的提升,在Dice和Tversky指标下与人类标记的显著属性达成更强的一致性,并且比基于提示的基线生成了更为紧凑的最终解释。跨数据集评估进一步表明,基于属性的显著性选择超越了EmoArt的源分布,同时揭示了属性特定的边界案例。数据集和项目页面可在 https://zhiliangzhang.github.io/EmoArt-130k/ 获取。
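For readers unfamiliar with the Dice and Tversky metrics used to compare predicted and human-marked salient attributes, the following sketch applies the standard set-based definitions. The attribute names and the example sets are made up for illustration, and the default weights `alpha = beta = 0.5` are an assumption; the abstract does not state which weights were used.

```python
def dice(predicted, reference):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # two empty selections agree perfectly
    return 2.0 * len(predicted & reference) / (len(predicted) + len(reference))


def tversky(predicted, reference, alpha=0.5, beta=0.5):
    """Tversky index: TP / (TP + alpha * FP + beta * FN)."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # attributes both sides marked salient
    fp = len(predicted - reference)   # predicted salient, not human-marked
    fn = len(reference - predicted)   # human-marked, missed by the model
    denom = tp + alpha * fp + beta * fn
    return 1.0 if denom == 0 else tp / denom


# Hypothetical attribute selections for one artwork.
model_salient = {"color", "line", "texture"}
human_salient = {"color", "line"}

dice_score = dice(model_salient, human_salient)
tversky_score = tversky(model_salient, human_salient)
```

With equal weights of 0.5 the Tversky index coincides with Dice; raising `alpha` penalizes extra predicted attributes more heavily, which is one way to penalize the over-enumeration the abstract calls attribute flooding.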
cs.CV / 56 / 2605.15760
Learn2Splat: Extending the Horizon of Learned 3DGS Optimization
Learn2Splat:扩展学习型3D高斯溅射优化的视野
Abstract
3D Gaussian Splatting (3DGS) optimization is most commonly performed using standard optimizers (Adam, SGD). While stable across diverse scenes, standard optimizers are general-purpose and not tailored to the structure of the problem. In particular, they produce independent parameter updates that do not capture the structural and spatial relationships within a scene, leading to inefficient optimization and slow convergence. Recent works introduced learned optimizers that predict correlated updates informed by inter-parameter and inter-Gaussian dependencies. However, these methods are trained for a fixed number of optimization iterations and rely on manually scheduled learning rates to avoid degradation. In this paper, we introduce a learned optimizer for 3DGS that avoids degradation over extended optimization horizons without auxiliary mechanisms. To enable this, we propose a meta-learning scheme that extends the optimization horizon via a checkpoint buffer and an optimizer rollout strategy, combined with an architecture that encodes gradient scale information in its latent states. Results show improved early novel view synthesis quality while remaining stable over long horizons, with zero-shot generalization to unseen reconstruction settings. To support our findings, we introduce the first unified framework for training and evaluating both learned and conventional optimizers across sparse and dense view settings. Code and models will be released publicly. Our project page is available at https://naamapearl.github.io/learn2splat .
Chinese Translation
3D高斯溅射(3DGS)优化通常使用标准优化器(如Adam、SGD)进行。虽然在多种场景中表现稳定,但标准优化器是通用型的,并未针对问题的结构进行优化。特别是,它们产生的参数更新是独立的,无法捕捉场景中的结构和空间关系,导致优化效率低下和收敛速度缓慢。近期的研究提出了学习型优化器,通过参数间和高斯间的依赖关系预测相关的更新。然而,这些方法是在固定的优化迭代次数上进行训练,并依赖手动调度的学习率以避免性能下降。本文提出了一种针对3DGS的学习型优化器,能够在没有辅助机制的情况下,在扩展的优化视野中避免性能下降。为实现这一目标,我们提出了一种元学习方案,通过检查点缓冲区和优化器展开策略来扩展优化视野,并结合一种在其潜在状态中编码梯度规模信息的架构。结果表明,在长时间范围内保持稳定的同时,早期的新视图合成质量得到了改善,并且在未见重建设置中实现了零样本泛化。为了支持我们的发现,我们引入了第一个统一框架,用于在稀疏和密集视图设置下训练和评估学习型和传统优化器。代码和模型将公开发布。我们的项目页面可在 https://naamapearl.github.io/learn2splat 上访问。
cs.CV / 57 / 2605.15764
GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
GRASP:学习在多人非语言互动中进行社会推理
Abstract
Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.
Chinese Translation
理解社会互动需要对微妙的非语言线索进行推理,然而当前的多模态大型语言模型(MLLMs)常常无法识别多人视频中谁与谁互动。我们提出了GRASP,一个大规模的社会推理数据集,将高层次的社会问答与细粒度的注视和指示性手势事件相连接。GRASP包含290,000对问题-答案,涵盖46,000个视频,总时长749小时,按照涵盖注视、手势和联合注视-手势推理的16类分类法进行组织,并配备GRASP-Bench用于评估。与以往关注孤立线索或高层次社会问答的资源不同,GRASP从身份一致的注视轨迹、指示性手势及其在社会事件中的联合组合中构建问题。此外,我们提出了社会基础奖励(Social Grounding Reward, SGR),这是一种学习信号,利用这些社会事件来鼓励模型推理每个互动中参与者的相关性。实验表明,SGR在GRASP-Bench上的表现有所提升,同时在相关的社会视频问答基准上保持零样本性能。
cs.CV / 58 / 2605.15792
Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models
逆转信息流:大规模多模态模型中的生成与理解协同
Abstract
The long-standing goal of multimodal AI is to build unified models in which visual understanding and visual generation mutually enhance one another. Although recent works such as BAGEL and BLIP3o have achieved remarkable progress, in practice this unification remains one-directional: understanding routinely guides generation, yet how and why generation can support understanding is rarely investigated. We revisit this asymmetry and propose Generation-to-Understanding (G2U) synergy, where visual generation becomes an explicit intermediate reasoning step. Our framework enables a model to perform controlled generative acts, such as detail enhancement, context expansion or structural visualisation, to produce self-generated visual thoughts, which are then fed back into the model to refine perception without retraining or external tools. Through a comprehensive evaluation on twelve benchmarks, this reversed information flow consistently improves multimodal understanding. We show that generative fidelity bounds perceptual gain and that distinct families of edit prompts govern transfer efficiency. We further analyse whether models can decide what to imagine. While they can produce plausible edits, these self-generated visual thoughts lack stable task alignment, revealing that current large multimodal models fall short of true self-reflection. This work exposes a missing mechanism in unified cognition and suggests that imagination is not the end of understanding but its beginning.
Chinese Translation
多模态人工智能的长期目标是构建统一模型,使视觉理解与视觉生成相互增强。尽管最近的研究如BAGEL和BLIP3o取得了显著进展,但在实践中,这种统一仍然是单向的:理解通常引导生成,而生成如何以及为什么能够支持理解却鲜有探讨。我们重新审视这种不对称性,并提出生成到理解(Generation-to-Understanding, G2U)协同,其中视觉生成成为一个明确的中间推理步骤。我们的框架使模型能够执行受控的生成行为,如细节增强、上下文扩展或结构可视化,以产生自生成的视觉思维,然后将其反馈到模型中,以在不重新训练或使用外部工具的情况下优化感知。通过对十二个基准的全面评估,这种逆向信息流持续改善多模态理解。我们展示了生成的保真度限制了感知增益,并且不同类型的编辑提示主导了转移效率。我们进一步分析模型是否能够决定想象的内容。尽管它们能够生成合理的编辑,但这些自生成的视觉思维缺乏稳定的任务对齐,揭示了当前的大规模多模态模型在真正的自我反思方面的不足。这项工作揭示了统一认知中缺失的机制,并表明想象不是理解的终点,而是其起点。
cs.CV / 59 / 2605.15796
Cross-Modal Registration Between 3D and 2D Fingerprints via Pose-Aware Unwrapping and Point-Cloud Fusion
基于姿态感知展开和点云融合的3D与2D指纹跨模态配准
Abstract
Three-dimensional (3D) fingerprints preserve global finger geometry and local ridge structure while avoiding contact-induced deformation, but they remain difficult to integrate with legacy two-dimensional (2D) fingerprint systems. This paper addresses the intermediate stage between 3D acquisition and cross-modal matching, and presents a unified framework for 3D fingerprint preprocessing and registration across contactless and contact-based 2D modalities. The framework combines four components: 1) a nonparametric visualization and unwrapping method that converts a 3D fingerprint point cloud into a rolled-equivalent 2D representation without relying on a global finger-shape model; 2) a point-cloud fusion pipeline that registers and mosaics multiple partial 3D captures into a more complete fingerprint model; 3) an ellipse-based pose normalization method for canonical finger alignment; and 4) a pose-aware cross-modal registration strategy that improves compatibility between 3D fingerprints and both contactless and contact-based 2D fingerprints. Experiments on a self-collected multimodal fingerprint database containing 150 fingers show that the proposed framework achieves ridge-level 3D registration accuracy, robust pose estimation, and consistent gains in 2D compatibility. In particular, the 3D fusion error is concentrated around 0.09 mm, contactless 2D-3D registration reaches ridge-scale projection accuracy, and pose-aware unwrapping improves genuine matching scores relative to generic 3D unwrapping. These results support the use of 3D fingerprints as an effective geometric bridge across heterogeneous fingerprint modalities.
Chinese Translation
三维(3D)指纹保留了全局手指几何形状和局部脊线结构,同时避免了接触引起的变形,但它们与传统的二维(2D)指纹系统的集成仍然困难。本文解决了3D采集与跨模态匹配之间的中间阶段,并提出了一个统一的框架,用于3D指纹的预处理和无接触及基于接触的2D模态之间的配准。该框架结合了四个组件:1)一种非参数可视化和展开方法,将3D指纹点云转换为等效的卷曲2D表示,而不依赖于全局手指形状模型;2)一个点云融合管道,将多个部分3D采集进行配准和拼接,形成更完整的指纹模型;3)一种基于椭圆的姿态归一化方法,用于规范手指对齐;4)一种姿态感知的跨模态配准策略,提高了3D指纹与无接触及基于接触的2D指纹之间的兼容性。在一个包含150个手指的自收集多模态指纹数据库上的实验表明,所提出的框架实现了脊线级别的3D配准精度、稳健的姿态估计以及在2D兼容性方面的一致提升。特别是,3D融合误差集中在0.09毫米左右,无接触的2D-3D配准达到了脊线级投影精度,而姿态感知展开相较于通用的3D展开提高了真实匹配得分。这些结果支持将3D指纹作为跨异构指纹模态的有效几何桥梁。
cs.CV / 60 / 2605.15803
Embedding-perturbed Exploration Preference Optimization for Flow Models
用于流模型的嵌入扰动探索偏好优化
Abstract
Recent advancements have established Reinforcement Learning (RL) as a pivotal paradigm for aligning generative models with human intent. However, group-based optimization frameworks (e.g., GRPO) face a critical limitation: the rapid decay of intra-group variance. As the distinctiveness among samples within a group diminishes, the variance approaches zero. This eliminates the very learning signal required for optimization, rendering the process unstable and forcing the policy into premature stagnation or reward hacking. Existing strategies, such as varying the initial noise or increasing group sizes, often fail to address this fundamental issue, resulting in training instability or diminishing returns. To overcome these challenges, we propose Embedding-perturbed Exploration Preference Optimization ($E^2$PO), a novel framework that sustains optimization through embedding-level perturbation. Our method introduces structured, embedding-level perturbations within sample groups, guaranteeing a robust variance that preserves the discriminative signal throughout the training process. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving a more faithful alignment with human preference.
Chinese Translation
最近的进展使得强化学习(Reinforcement Learning, RL)成为将生成模型与人类意图对齐的重要范式。然而,基于群体的优化框架(例如,GRPO)面临一个关键限制:组内方差的快速衰减。随着组内样本之间的独特性减弱,方差趋近于零。这消除了优化所需的学习信号,使得过程不稳定,并迫使策略陷入过早的停滞或奖励黑客行为。现有策略,如变化初始噪声或增加组大小,往往无法解决这一根本问题,导致训练不稳定或收益递减。为了解决这些挑战,我们提出了嵌入扰动探索偏好优化($E^2$PO),这是一个通过嵌入级扰动维持优化的新框架。我们的方法在样本组内引入结构化的嵌入级扰动,确保在整个训练过程中保持稳健的方差,从而保留区分信号。大量实验表明,我们的方法显著优于最先进的基线,能够更忠实地与人类偏好对齐。
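The variance-collapse problem described in this abstract can be seen in a minimal sketch of group-normalized advantages. The normalization below is the commonly used GRPO-style form and is an assumption here; the abstract does not give the paper's exact objective, and the reward values are made up.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group whose samples differ yields a usable ranking signal.
diverse = group_advantages([0.2, 0.5, 0.8, 0.9])

# A group whose samples all receive the same reward yields all-zero
# advantages, so the policy gradient for this group vanishes.
collapsed = group_advantages([0.7, 0.7, 0.7, 0.7])
```

Under this assumed formulation, any mechanism that keeps samples within a group distinct (such as the embedding-level perturbation the abstract proposes) keeps the group standard deviation, and therefore the advantages, away from zero.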
cs.CV / 61 / 2605.15824
FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization
时尚变色龙:面向实时和互动的人体服装视频定制
Abstract
Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot support low-latency and interactive garment control, which is crucial for applications such as e-commerce and content creation. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, where users can interactively switch garments during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference and garment image, the model is encouraged to implicitly preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model to interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which includes garment KV refresh, historical KV withdrawal, and reference KV disentanglement to achieve garment switching while preserving motion coherence. FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU, 30-180$\times$ faster than existing baselines.
Chinese Translation
以人为中心的视频定制,特别是在服装层面,展现出了显著的商业价值。然而,现有的方法无法支持低延迟和互动的服装控制,这对于电子商务和内容创作等应用至关重要。本文研究如何在仅使用单一服装视频数据的情况下,实现互动的多服装视频定制,同时保持运动一致性。我们提出了FashionChameleon,一个用于自回归视频生成的人体服装定制的实时互动框架,用户可以在生成过程中互动切换服装。FashionChameleon包含三项关键技术:(i)我们并不在多服装视频数据上进行训练,而是通过上下文学习在单一参考服装对上训练教师模型。通过保留图像到视频的训练范式,同时强制参考图像与服装图像之间的不匹配,模型被鼓励在单一服装切换过程中隐式保持一致性。(ii)为了在生成过程中实现一致性和效率,我们引入了基于上下文学习的流式蒸馏,该方法通过上下文教师强制微调模型,并通过梯度加权分布匹配蒸馏提高外推一致性。(iii)为了扩展模型以实现互动的多服装视频定制,我们提出了无训练的KV缓存重新调度方法,包括服装KV刷新、历史KV撤回和参考KV解耦,以实现服装切换的同时保持运动一致性。我们的FashionChameleon独特地支持互动定制和一致的长视频外推,同时在单个GPU上以23.8 FPS实现实时生成,比现有基准快30-180倍。
cs.CV / 62 / 2605.15828
Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer
并非所有任务的量化效果相同:基于费舍尔引导的视觉几何变换器量化
Abstract
Feed-forward 3D reconstruction models, represented by Visual Geometry Grounded Transformer (VGGT), jointly predict multiple visual geometry tasks such as depth estimation, camera pose prediction, and point cloud reconstruction in a single forward pass. They have been widely adopted in 3D vision applications, but their billion-scale parameters bring substantial memory and computation overhead, posing challenges for on-device deployment. Post-Training Quantization (PTQ) is an effective technique to reduce this overhead. Existing PTQ methods for feed-forward 3D models mainly focus on handling heavy-tailed activation distributions and constructing diverse calibration datasets. However, we observe that feed-forward 3D models predict multiple geometric attributes through a shared backbone, where different transformer blocks and hidden channels contribute distinctly to each task, resulting in substantially different sensitivities to quantization errors across tasks, blocks, and channels. Consequently, treating all tasks equally over-emphasizes insensitive tasks and causes significant accuracy loss on the sensitive ones. To address this issue, we propose Fisher-Guided Quantization (FGQ) for feed-forward 3D reconstruction models. Specifically, FGQ uses the diagonal Fisher information matrix to quantify the different sensitivities across tasks, blocks, and channels, and incorporates these sensitivities into the Learnable Affine Transformation during calibration to better preserve the channels and blocks most critical to each task. Extensive experiments across camera pose estimation, point map reconstruction, and depth estimation show that FGQ consistently outperforms state-of-the-art quantization baselines on VGGT, achieving up to 39% relative improvement under 4-bit quantization.
Chinese Translation
以视觉几何基础变换器(Visual Geometry Grounded Transformer, VGGT)为代表的前馈3D重建模型,在单次前向传播中共同预测多个视觉几何任务,如深度估计、相机姿态预测和点云重建。这些模型在3D视觉应用中得到了广泛应用,但其数十亿规模的参数带来了显著的内存和计算开销,给设备端部署带来了挑战。后训练量化(Post-Training Quantization, PTQ)是一种有效的减少这种开销的技术。现有的针对前馈3D模型的PTQ方法主要集中在处理重尾激活分布和构建多样化的校准数据集。然而,我们观察到前馈3D模型通过共享主干网络预测多个几何属性,其中不同的变换器块和隐藏通道对每个任务的贡献各不相同,导致任务、块和通道对量化误差的敏感性存在显著差异。因此,将所有任务视为相同会过分强调不敏感任务,并导致对敏感任务的显著准确性损失。为了解决这一问题,我们提出了针对前馈3D重建模型的费舍尔引导量化(Fisher-Guided Quantization, FGQ)。具体而言,FGQ使用对角费舍尔信息矩阵来量化任务、块和通道之间的不同敏感性,并在校准过程中将这些敏感性纳入可学习的仿射变换,以更好地保留对每个任务至关重要的通道和块。在相机姿态估计、点图重建和深度估计等多个任务上的广泛实验表明,FGQ在VGGT上始终优于最先进的量化基线,在4位量化下实现了高达39%的相对提升。
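As a rough illustration of how a diagonal Fisher estimate can act as a per-channel sensitivity weight, the sketch below uses the standard empirical estimate (mean of squared per-sample gradients) to weight a uniform quantization error. It is only an assumption about the general idea: the gradients, weights, and step size are made up, and the Learnable Affine Transformation used by FGQ is not modelled.

```python
def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def quantize(w, step):
    """Uniform round-to-nearest quantization with a fixed step size."""
    return round(w / step) * step


def weighted_quant_error(weights, fisher, step):
    """Sum of squared quantization errors, weighted by Fisher sensitivity."""
    return sum(f * (w - quantize(w, step)) ** 2 for w, f in zip(weights, fisher))


# Hypothetical gradients of one task's loss w.r.t. two channels:
# channel 0 is sensitive (large gradients), channel 1 is not.
grads = [[2.0, 0.1], [-2.0, 0.1]]
fisher = diagonal_fisher(grads)

weights = [0.3, 0.3]
uniform_error = weighted_quant_error(weights, [1.0, 1.0], step=0.5)
fisher_error = weighted_quant_error(weights, fisher, step=0.5)
```

Both channels have the same raw quantization error, but the Fisher-weighted objective is dominated by the sensitive channel, which is the kind of signal a calibration procedure could use to decide where precision matters most.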
cs.CV / 63 / 2605.15835
Community-aware evaluation and threshold calibration for open-set plankton image recognition
基于社区意识的开放集浮游生物图像识别评估与阈值校准
Abstract
Automated plankton image recognition is increasingly used in aquatic ecosystem monitoring, but deployed classifiers inevitably encounter unseen taxa and non-target particles. Open-set recognition methods are usually evaluated with sample-level metrics such as AUROC, AUPR, and FPR@95% unknown-recall operating points, whereas ecological monitoring depends on community-level estimates of taxon abundance and diversity. This study examines the mismatch between these objectives using controlled pseudo-communities and three datasets spanning marine zooplankton imaged by ZooScan, marine phytoplankton imaged by IFCB, and freshwater plankton imaged by an in-situ camera. We define Open-Set Community Distortion (OSCD), a Bray-Curtis-style error over known taxa plus an unknown bin, with directional components distinguishing known-taxon overestimation from underestimation. Closed-set classifiers achieved high known-class accuracy, but unknown samples were often absorbed with high confidence and in structured ways. Sample-level OOD metrics were not sufficient to select ecological operating points: for MSP, FPR@95% unknown-recall thresholds produced large test-community OSCD on all three datasets mainly because true known taxa were over-rejected into the unknown bin. Community-aware threshold calibration reduced MSP OSCD relative to fixed 95% known recall on SYKE-ZooScan 2024 and SYKE-IFCB 2022; on ZooLake the fixed-recall baseline was already close to the community-aware threshold, and the best community-level method was a prototype-distance variant rather than MSP. The benefit of community-aware calibration therefore depends on validation-community representativeness and the gap between fixed recall and the community optimum. These results show that open-set plankton recognition should be evaluated as an ecological measurement problem, not only as a sample-level detection task.
Chinese Translation
自动化浮游生物图像识别在水生态系统监测中的应用日益增多,但已部署的分类器不可避免地会遇到未见过的分类群和非目标颗粒。开放集识别方法通常使用样本级指标进行评估,如AUROC、AUPR和FPR@95%未知召回操作点,而生态监测则依赖于分类群丰度和多样性的社区级估计。本研究通过控制伪社区和三个数据集进行探讨,这些数据集涵盖了通过ZooScan成像的海洋浮游动物、通过IFCB成像的海洋浮游植物以及通过原位相机成像的淡水浮游生物。我们定义了开放集社区失真(Open-Set Community Distortion, OSCD),这是一个基于已知分类群和未知类别的Bray-Curtis风格误差,具有区分已知分类群高估与低估的方向性成分。闭集分类器在已知类别上实现了高准确率,但未知样本往往以高置信度和结构化方式被吸收。样本级OOD指标不足以选择生态操作点:对于MSP,FPR@95%未知召回阈值在所有三个数据集上产生了较大的测试社区OSCD,主要是因为真实的已知分类群被过多地拒绝到未知类别中。基于社区意识的阈值校准相对于固定的95%已知召回在SYKE-ZooScan 2024和SYKE-IFCB 2022上减少了MSP OSCD;在ZooLake上,固定召回基线已接近基于社区意识的阈值,最佳的社区级方法是原型距离变体而非MSP。因此,基于社区意识的校准的益处取决于验证社区的代表性以及固定召回与社区最优之间的差距。这些结果表明,开放集浮游生物识别应作为生态测量问题进行评估,而不仅仅是样本级检测任务。
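The community-level error described in this abstract can be sketched as a Bray-Curtis dissimilarity over known taxa plus an unknown bin, with separate over- and under-estimation terms for the known taxa. This is an illustrative reading of the abstract, not the paper's exact OSCD definition; the taxon names, counts, and the normalization of the directional terms are assumptions.

```python
def bray_curtis(true_counts, est_counts):
    """Bray-Curtis dissimilarity: sum|t - e| / sum(t + e) over all bins."""
    keys = set(true_counts) | set(est_counts)
    num = sum(abs(true_counts.get(k, 0) - est_counts.get(k, 0)) for k in keys)
    den = sum(true_counts.get(k, 0) + est_counts.get(k, 0) for k in keys)
    return 0.0 if den == 0 else num / den


def directional_components(true_counts, est_counts, unknown="unknown"):
    """Split the known-taxon error into over- and under-estimation parts."""
    keys = set(true_counts) | set(est_counts)
    den = sum(true_counts.get(k, 0) + est_counts.get(k, 0) for k in keys)
    if den == 0:
        return 0.0, 0.0
    known = keys - {unknown}
    over = sum(max(est_counts.get(k, 0) - true_counts.get(k, 0), 0) for k in known)
    under = sum(max(true_counts.get(k, 0) - est_counts.get(k, 0), 0) for k in known)
    return over / den, under / den


# Hypothetical community of 100 particles.
truth = {"copepod": 50, "diatom": 30, "unknown": 20}

# A strict threshold over-rejects known taxa into the unknown bin.
over_rejecting = {"copepod": 35, "diatom": 20, "unknown": 45}

# A closed-set-like classifier absorbs unknowns into known taxa.
absorbing = {"copepod": 60, "diatom": 38, "unknown": 2}

distortion_reject = bray_curtis(truth, over_rejecting)
distortion_absorb = bray_curtis(truth, absorbing)
```

In this toy case the over-rejecting threshold produces the larger community distortion and all of its known-taxon error is underestimation, which mirrors the failure mode the abstract reports for fixed unknown-recall operating points.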
cs.CV / 64 / 2605.15843
WorldAct: Activating Monolithic 3D Worlds into Interactive-Ready Object-Centric Scenes
WorldAct:将单体3D世界激活为互动准备的对象中心场景
Abstract
Recent 3D world modeling systems based on generative scene synthesis, such as Marble, can create coherent and explorable 3D environments, yet their outputs are typically static monolithic assets with limited editability and physical interaction. This restricts their use in immersive content creation and embodied simulation, where generated worlds must be actively modified and manipulated. To tackle this challenge, we present WorldAct, a framework that converts static generated 3D worlds into editable and interaction-ready scenes. WorldAct uses a multimodal agent to guide scene decomposition, identify actionable objects, reconstruct geometrically aligned object-level meshes for interaction, and restore the residual background via 3D inpainting. The resulting scenes support object-level editing, collision-aware manipulation, and embodied task execution while preserving global scene coherence. Experiments show that WorldAct enables richer interaction scenarios than the original generated scenes, suggesting a practical path toward editable and interactive 3D world models.
Chinese Translation
最近基于生成场景合成的3D世界建模系统,如Marble,可以创建连贯且可探索的3D环境,但其输出通常是静态的单体资产,编辑性和物理交互性有限。这限制了它们在沉浸式内容创作和具身模拟中的应用,因为生成的世界必须被主动修改和操控。为了解决这一挑战,我们提出了WorldAct,一个将静态生成的3D世界转换为可编辑和准备互动场景的框架。WorldAct使用多模态代理引导场景分解,识别可操作对象,重建几何对齐的对象级网格以便进行交互,并通过3D修复恢复剩余背景。生成的场景支持对象级编辑、碰撞感知操作和具身任务执行,同时保持全局场景的一致性。实验表明,WorldAct能够实现比原始生成场景更丰富的交互场景,暗示了朝向可编辑和互动3D世界模型的实际路径。
cs.CV / 65 / 2605.15852
GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction
GHOST:用于高效3D重建的几何层次在线流式令牌驱逐
Abstract
Streaming 3D reconstruction from long monocular video sequences requires maintaining a key-value (KV) cache that grows linearly with sequence length, creating a severe memory bottleneck. Existing approaches either truncate the cache to a fixed set of anchor frames, leading to reconstruction quality degradation, or rely on attention-score heuristics that are agnostic to 3D scene structure, failing to preserve geometrically valuable tokens. To address these problems, we present GHOST (Geometry-Hierarchical Online Streaming Token Eviction), a training-free KV cache management framework that exploits the model's own 3D geometry outputs to evict redundant tokens online. GHOST introduces three mutually reinforcing innovations: a hierarchical dual-level importance scoring scheme, a privilege mechanism that protects special tokens from eviction, and a cosine-similarity-guided layer-wise budget allocation. Experiments on various benchmarks show that GHOST preserves excellent reconstruction quality while cutting the KV cache by nearly half and delivering 1.75x faster inference compared to state-of-the-art methods. Our code is available at https://github.com/lokiniuniu/GHOST.
Chinese Translation
从长单目视频序列进行流式3D重建需要维护一个随着序列长度线性增长的键值(KV)缓存,这造成了严重的内存瓶颈。现有的方法要么将缓存截断为固定的一组锚帧,从而导致重建质量下降,要么依赖于与3D场景结构无关的注意力分数启发式方法,未能保留具有几何价值的令牌。为了解决这些问题,我们提出了GHOST(几何层次在线流式令牌驱逐),这是一种无训练的KV缓存管理框架,利用模型自身的3D几何输出在线驱逐冗余令牌。GHOST引入了三项相互增强的创新:一种分层的双层重要性评分机制、一种保护特殊令牌不被驱逐的特权机制,以及一种基于余弦相似度的层级预算分配。各种基准测试的实验表明,GHOST在将KV缓存减少近一半的同时,保持了优异的重建质量,并且与最先进的方法相比,推理速度提高了1.75倍。我们的代码可在 https://github.com/lokiniuniu/GHOST 获取。
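A budgeted eviction step with protected tokens, in the spirit of the privilege mechanism described in this abstract, might look like the sketch below. The function name, the scores, and the rule that protected tokens always survive are illustrative assumptions; the actual GHOST scoring scheme and layer-wise budget allocation are not specified in the abstract.

```python
def evict_tokens(scores, protected, budget):
    """Return the sorted indices of cached tokens to keep.

    scores:    per-token importance scores (higher means more valuable)
    protected: indices that must never be evicted
    budget:    maximum number of tokens to keep; protected tokens are kept
               even if they alone exceed the budget
    """
    keep = set(protected)
    candidates = sorted(
        (i for i in range(len(scores)) if i not in keep),
        key=lambda i: scores[i],
        reverse=True,
    )
    free_slots = max(budget - len(keep), 0)
    keep.update(candidates[:free_slots])
    return sorted(keep)


# Token 0 plays the role of a special token with a low score; it survives
# because it is protected, and the two best-scoring others fill the budget.
kept = evict_tokens([0.1, 0.9, 0.3, 0.8, 0.2, 0.5], protected={0}, budget=3)
```

Returning the kept indices in their original order keeps the cache's temporal layout intact, which matters when positions carry meaning in a streaming setting.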
cs.CV / 66 / 2605.15855
Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?
少做多得:我们是否需要对扩散模型进行每一步优化的强化学习微调?
Abstract
Despite strong image-generation performance, diffusion models' reconstruction objectives limit alignment with human preferences. RL enables such alignment through explicit rewards. However, most studies apply RL to the full denoising trajectory, making it computationally costly and weakening preference alignment, i.e., doing more but achieving less. We observe that the impact of RL fine-tuning varies significantly across denoising stages. In the early stage, image structures are unstable and distant from the final reward signal. Applying RL at this stage leads to delayed rewards and action-reward mismatching, resulting in high variance and inefficient updates. Conversely, in the later stage, reward gains saturate, and continued training tends to overfit local details, intensifying reward hacking. To tackle these challenges, we propose AdaScope, an RL-enhanced plug-in that improves generation quality while reducing computational cost. Specifically, AdaScope adaptively identifies the optimal intervention timing for RL by perceiving the structural evolution and semantic consistency during denoising, and dynamically terminates training once the denoising converges and reward gains saturate. As a result, it achieves a rare 'dual benefit': a reduction in computational costs alongside a significant performance improvement. We offer theoretical grounds for the design of AdaScope. Compared with state-of-the-art methods, AdaScope improves performance by 66% while cutting computational cost by 59%.
Chinese Translation
尽管扩散模型在图像生成方面表现出色,但其重建目标限制了与人类偏好的对齐。强化学习(RL)通过显式奖励实现这种对齐。然而,大多数研究将RL应用于完整的去噪轨迹,这使得计算成本高昂,并削弱了偏好对齐,即做得越多,成效越少。我们观察到,RL微调的影响在去噪阶段之间显著不同。在早期阶段,图像结构不稳定,距离最终奖励信号较远。在这一阶段应用RL会导致奖励延迟和动作-奖励不匹配,从而导致高方差和低效更新。相反,在后期阶段,奖励增益饱和,持续训练往往会过拟合局部细节,加剧奖励操控。为了解决这些挑战,我们提出了AdaScope,这是一种增强RL的插件,旨在提高生成质量,同时降低计算成本。具体而言,AdaScope通过感知去噪过程中的结构演变和语义一致性,自适应地识别RL的最佳干预时机,并在去噪收敛和奖励增益饱和后动态终止训练。因此,它实现了罕见的“双重收益”:在显著提高性能的同时降低计算成本。我们为AdaScope的设计提供了理论基础。与最先进的方法相比,AdaScope在提高性能66%的同时将计算成本降低了59%。
cs.CV / 67 / 2605.15860
On RGB-TIR Stereo Calibration under Extreme Resolution Asymmetry
在极端分辨率不对称下的RGB-热红外立体校准
Abstract
Accurate geometric calibration of RGB-thermal infrared (TIR) stereo camera systems is essential for multimodal building envelope analysis, yet remains challenging when low-cost thermal sensors with very low spatial resolution are employed. This paper presents a practical stereo calibration framework for an RGB camera (2028 x 1520 px) paired with a TIR camera operating at only 80 x 62 px - a pixel-count ratio of approximately 1:625. An active OLED screen dynamically switches modality-specific patterns (checkerboard for TIR, ChArUco for RGB) on a single physical surface, providing controlled and repeatable thermal contrast. A dedicated corner detection algorithm combining perspective rectification, Hessian saddle-point analysis, and Mean Shift localisation achieves reliable checkerboard detection at 80 x 62 px without per-frame parameter tuning. A baseline-constrained bundle adjustment enforces physically consistent rig geometry under the planar-calibration-object degeneracy, yielding a stereo baseline of 32.7 mm (nominal 30 mm) with an overall reprojection error of 0.382 px. The system is validated on a thermally active building mock-up using constant-depth and per-pixel depth estimation, demonstrating consistent TIR-to-RGB projection suitable for building energy performance assessment.
Chinese Translation
RGB-热红外(TIR)立体相机系统的准确几何校准对于多模态建筑外壳分析至关重要,但当使用低成本、空间分辨率极低的热传感器时,校准仍然具有挑战性。本文提出了一种实用的立体校准框架,适用于分辨率为2028 x 1520像素的RGB相机与仅为80 x 62像素的TIR相机配对,像素数量比约为1:625。一个主动OLED屏幕在单一物理表面上动态切换特定模态的模式(TIR使用棋盘格,RGB使用ChArUco),提供可控且可重复的热对比度。一个专用的角点检测算法结合了透视校正、Hessian鞍点分析和均值漂移定位,能够在80 x 62像素下可靠地检测棋盘格,而无需对每帧进行参数调优。基线约束的束调整在平面校准物体退化情况下强制执行物理一致的设备几何形状,获得了32.7毫米(标称30毫米)的立体基线,整体重投影误差为0.382像素。该系统在一个热活动的建筑模型上进行了验证,使用恒深度和逐像素深度估计,展示了适合建筑能效评估的TIR到RGB的投影一致性。
cs.CV / 68 / 2605.15864
Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination
视觉语言模型是在看还是仅仅在说?揭示视觉重检的幻觉
Abstract
Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io
Chinese Translation
视觉语言模型(VLMs)在推理过程中常常会产生自我反思的陈述,例如“让我再检查一下图形”。这些陈述是否触发了真正的视觉重检,还是仅仅是学习到的文本模式?我们通过VisualSwap,一个图像交换探测框架,来研究这个问题:在模型对图像进行推理后,我们用一个视觉上相似但语义上不同的图像替换它,并测试模型是否注意到这一点。我们引入了VS-Bench,包含从MathVista、MathVerse、MathVision和MMMU-Pro精心挑选的800对图像。对Qwen3-VL、Kimi-VL和ERNIE-VL的实验揭示了一个显著的失败:模型在绝大多数情况下未能察觉到交换,准确率下降了多达60%。反直觉的是,思考模型的脆弱性几乎是其指令模型的3倍,而规模扩展并未提供缓解。多轮用户指令恢复了视觉基础,但在持续生成过程中自我生成的反思陈述则没有。注意力分析解释了原因:用户指令显著提高了对视觉标记的注意力,而自我反思则没有。目前的VLMs在声称进行视觉重检时往往只是说而并非真正看到。我们的代码和数据集可在项目页面获取:https://visualswap.github.io
cs.CV / 69 / 2605.15868
SOLAR: Self-supervised Joint Learning for Symmetric Multimodal Retrieval
SOLAR:用于对称多模态检索的自监督联合学习
Abstract
In this work, we address the critical yet underexplored challenge of symmetric multimodal-to-multimodal (MM2MM) retrieval, where queries and contexts are interchangeable. Existing universal multimodal retrieval methods struggle with this task because they are constrained by the labeled asymmetric datasets they rely on. We propose SOLAR (Self-supervised jOint LeArning for symmetric multimodal Retrieval), a novel two-stage self-supervised framework that leverages readily available unlabeled web-scale image-text pairs. Based on the observation that both semantic alignment and discrepancies exist between the two modalities, in the first stage we learn the intersection mask of each image-text pair, which allows us to align the intersecting content while preserving the semantics of the differences. In the second stage, the learned mask is used to construct positive and hard-negative samples by masking different parts of the image or text, which enables self-supervised multimodal embedding learning. Complementing this framework, we present a new benchmark featuring high-quality human-verified positive and hard-negative pairs to evaluate symmetric MM2MM retrieval under realistic conditions, as well as the corresponding pipeline. Extensive experiments against ten SOTA methods show that SOLAR surpasses the strongest supervised VLM by 7.08 points on this benchmark, with over 50x fewer model parameters and a 5x smaller embedding dimension. Code and benchmark will be available soon.
Chinese Translation
在本研究中,我们解决了对称多模态到多模态(MM2MM)检索这一重要但尚未深入探讨的挑战,其中查询和上下文是可互换的。现有的通用多模态检索工作在这一任务上面临困难,因为它们受到所使用的标记不对称数据集的限制。我们提出了SOLAR(Self-supervised jOint LeArning for symmetric multimodal Retrieval),这是一个新颖的两阶段自监督框架,利用易于获取的未标记网络规模图像-文本对。基于观察到的两种模态之间既存在语义对齐又存在差异,在第一阶段,我们学习图像-文本对的交集掩码,使我们能够在保留差异语义的同时对交集进行对齐。在第二阶段,学习到的掩码进一步用于通过掩盖图像/文本的不同部分构建正样本和困难负样本,从而使我们能够进行自监督多模态嵌入学习。作为该框架的补充,我们提出了一个新的基准,包含高质量的人类验证的正样本和困难负样本,以在现实条件下评估对称MM2MM检索及其相应的流程。与十种最先进的方法进行的广泛实验表明,SOLAR在该基准上超越了最强的监督VLM 7.08分,同时模型参数减少超过50倍,嵌入维度减少5倍。代码和基准将很快发布。
cs.CV / 70 / 2605.15876
Unlocking Dense Metric Depth Estimation in VLMs
解锁视觉语言模型中的密集度量深度估计
Abstract
Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.
Chinese Translation
视觉语言模型(VLMs)在基础的2D任务(如定位和图像描述)中表现出色,但在3D理解方面仍然有限。一个关键的限制是它们的文本监督范式,这在细粒度视觉感知方面约束不足,阻碍了密集几何的恢复。以往的方法要么从外部视觉模型中提取几何信息,导致误差累积,要么通过低效的逐像素查询或粗糙的标记级输出实现直接预测。本文提出了DepthVLM,一个简单而有效的框架,将单一的VLM转变为原生的密集几何预测器,同时保留其多模态能力。通过在LLM主干上附加一个轻量级的深度头,并在统一的视觉-文本监督范式下采用两阶段的训练计划,DepthVLM能够在单次前向传播中生成全分辨率的深度图和语言输出。此外,我们还引入了一个统一的室内-室外度量深度基准,以VLM兼容的格式呈现。实验表明,DepthVLM在推理效率上显著优于现有的VLM,并超越了领先的纯视觉模型,同时改善了复杂的3D空间推理,朝着真正统一的基础模型迈进。所有代码和检查点将公开发布。
cs.CV / 71 / 2605.15880
FSCM: Frequency-Enhanced Spatial-Spectral Coupled Mamba for Infrared Hyperspectral Image Colorization
FSCM:频率增强的空间-光谱耦合曼巴算法用于红外高光谱图像着色
Abstract
Thermal infrared imaging is robust to illumination variations and smoke interference, making it important for all-weather perception. However, the lack of natural color and fine texture limits target recognition, human visual interpretation, and the transfer of visible-light models. Existing infrared colorization methods mainly rely on single-band images, where insufficient spectral cues may lead to structural distortion and semantic confusion. Although infrared hyperspectral images provide rich spectral responses and material information, existing single-band frameworks remain limited in modeling spatial-spectral coupling and weak texture details. To address these issues, this paper presents FSCM, a spectral-information-guided GAN framework. Within FSCM, a frequency-enhanced spatial-spectral state-space generator composed of cascaded FSB units is constructed. Each FSB integrates three complementary components: state-space modeling captures global spatial-spectral dependencies; the frequency enhancement module (FEM) combines multi-level wavelet decomposition and Fourier gating to recover structural contours, directional high-frequency details, and global frequency responses; and the dual-stream hybrid gating module (DGM) integrates deformation-aware sampling with sparse attention to enhance effective local structures and suppress background interference. Additionally, an online semantic segmentation-guided loss is introduced to constrain the generated results, improving semantic consistency in complex road scenes. Experiments show that FSCM outperforms existing infrared colorization methods in visual quality and semantic fidelity.
Chinese Translation
热红外成像对光照变化和烟雾干扰具有较强的鲁棒性,使其在全天候感知中具有重要意义。然而,缺乏自然色彩和细腻纹理限制了目标识别、人类视觉解读以及可见光模型的迁移。现有的红外着色方法主要依赖单波段图像,缺乏足够的光谱线索可能导致结构失真和语义混淆。尽管红外高光谱图像提供了丰富的光谱响应和材料信息,但现有的单波段框架在建模空间-光谱耦合和弱纹理细节方面仍然有限。为了解决这些问题,本文提出了FSCM,一个基于光谱信息引导的生成对抗网络(GAN)框架。在FSCM中,构建了一个由级联FSB单元组成的频率增强空间-光谱状态空间生成器。每个FSB集成了三个互补组件:状态空间建模捕捉全局空间-光谱依赖关系;频率增强模块(FEM)结合多层小波分解和傅里叶门控以恢复结构轮廓、方向性高频细节和全局频率响应;双流混合门控模块(DGM)将变形感知采样与稀疏注意力相结合,以增强有效局部结构并抑制背景干扰。此外,引入了一种在线语义分割引导损失以约束生成结果,提高复杂道路场景中的语义一致性。实验表明,FSCM在视觉质量和语义保真度方面优于现有的红外着色方法。
cs.CV / 72 / 2605.15894
Uncertainty-Aware Wildfire Smoke Density Classification from Satellite Imagery via CBAM-Augmented EfficientNet with Evidential Deep Learning
基于卫星图像的考虑不确定性的野火烟雾密度分类:通过CBAM增强的EfficientNet与证据深度学习
Abstract
Rapid and accurate wildfire smoke severity assessment from satellite images is essential for emergency response, air quality modeling, and human health risk management. Existing deep learning approaches treat smoke detection as a binary task, producing point estimates without any measure of prediction confidence. We propose a probabilistic framework to categorize a satellite patch into Light, Moderate, and Heavy severity classes and to provide decomposed epistemic and aleatoric uncertainty in a single forward pass. Our architecture uses the backbone of a pre-trained EfficientNet-B3 and a CBAM module with an evidential deep learning head that predicts Dirichlet concentration parameters, directly estimating vacuity (epistemic) and dissonance (aleatoric) without Monte Carlo sampling. Evaluated on 16,298 real satellite patches derived from the Wildfire Detection dataset, our model achieves 93.8% weighted test accuracy (91.1% unweighted) with ECE=0.0274. Selective prediction retaining the most certain 50% of patches achieves 96.7% accuracy. As image quality degrades, uncertainty increases monotonically, and vacuity is a practical scan quality measure. The Moderate class represents transitional smoke conditions that exhibit the highest epistemic uncertainty (mean vacuity = 0.187), confirming the model correctly identifies ambiguous smoke boundary regions. CBAM spatial attention maps localize to structurally distinctive scene regions, and t-SNE demonstrates the clear cluster separation of Light and Heavy smoke.
Chinese Translation
从卫星图像快速准确地评估野火烟雾的严重程度对于应急响应、空气质量建模和人类健康风险管理至关重要。现有的深度学习方法将烟雾检测视为二元任务,产生点估计而没有任何预测置信度的度量。我们提出了一种概率框架,将卫星图像块分类为轻度、中度和重度严重性类别,并在单次前向传播中提供分解的认知不确定性和随机不确定性。我们的架构使用预训练的EfficientNet-B3作为主干,并结合CBAM模块和一个证据深度学习头,预测Dirichlet浓度参数,直接估计空缺(认知)和不和谐(随机)而无需蒙特卡洛采样。在从野火检测数据集中获得的16,298个真实卫星图像块上进行评估,我们的模型实现了93.8%的加权测试准确率(未加权为91.1%),ECE=0.0274。选择性预测保留最确定的50%图像块,准确率达到96.7%。随着图像质量的下降,不确定性单调增加,而空缺是一个实用的扫描质量度量。中度类别代表过渡烟雾条件,表现出最高的认知不确定性(平均空缺 = 0.187),确认模型正确识别模糊的烟雾边界区域。CBAM空间注意力图定位于结构上独特的场景区域,t-SNE展示了轻度和重度烟雾的明显聚类分离。
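The abstract above states that vacuity and dissonance are read directly from the predicted Dirichlet concentration parameters, but it does not give the formulas. The following is a minimal sketch using the standard subjective-logic definitions, assuming the common convention `alpha = evidence + 1` (so every `alpha` is at least 1). The function name is hypothetical and this is not taken from the authors' code.

```python
import numpy as np

def vacuity_and_dissonance(alpha):
    """Return (vacuity, dissonance) for one Dirichlet concentration vector.

    Assumes alpha = evidence + 1, so each entry is >= 1 (an assumption,
    not something the abstract specifies).
    """
    alpha = np.asarray(alpha, dtype=float)
    num_classes = alpha.size
    strength = alpha.sum()                 # Dirichlet strength S
    belief = (alpha - 1.0) / strength      # belief mass per class
    vacuity = num_classes / strength       # mass left unassigned (lack of evidence)

    dissonance = 0.0
    for i in range(num_classes):
        others = np.delete(belief, i)
        denom = others.sum()
        if belief[i] <= 0.0 or denom <= 0.0:
            continue                       # no conflicting evidence for this class
        # Relative balance is 1 when two belief masses are equal, 0 when one is zero.
        balance = 1.0 - np.abs(others - belief[i]) / (others + belief[i])
        dissonance += belief[i] * (others * balance).sum() / denom
    return vacuity, dissonance

# No evidence at all: maximal vacuity, no dissonance.
print(vacuity_and_dissonance([1.0, 1.0, 1.0]))
# Strong but evenly split evidence for two classes: low vacuity, high dissonance.
print(vacuity_and_dissonance([10.0, 10.0, 1.0]))
```

Under these definitions, belief masses and vacuity sum to one, so a patch with little total evidence has high vacuity regardless of which class it leans toward.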
cs.CV / 73 / 2605.15906
A Causally Grounded Taxonomy for Image Degradation Robustness Evaluation
基于因果关系的图像降质鲁棒性评估分类法
Abstract
Image degradations can occur during acquisition, processing, and transmission, altering visual appearance and affecting downstream vision tasks. They are studied in several communities, including synthetic corruption benchmarks for robustness evaluation, perceptual image quality assessment, and physically grounded analyses of imaging systems or real camera failures. Although these areas address closely related phenomena, they often use incompatible grouping schemes and backend-specific severity definitions, making results difficult to compare across datasets, degradation sources, and tasks. We propose a causally grounded framework for organizing and interpreting image degradations across these settings. Instead of introducing new degradations or redefining existing benchmarks, we provide an interpretive representation and measurement layer that makes implicit assumptions explicit. Each degradation is described along two orthogonal axes: its dominant causal source in the imaging pipeline (environment, sensor/optics, ISP/renderer/codec, or transfer/system), and its resulting perceptual effect. This dual-axis abstraction yields a compact taxonomy spanning algorithmic corruptions, perceptual distortions, and physically motivated imaging artifacts. To address inconsistent severity semantics without changing existing implementations, we introduce a lightweight severity measurement layer. For every degradation and each native severity level of a given backend, we quantify degradation strength using full-reference image quality metrics: PSNR, SSIM, and LPIPS. This makes severity observable and comparable across sources while preserving native parameterizations. We demonstrate the framework through COCO Degradation, a taxonomy-aligned benchmark for evaluating object detector robustness under diverse imaging conditions.
Chinese Translation
图像降质可能在获取、处理和传输过程中发生,改变视觉外观并影响下游视觉任务。多个研究领域对此进行了研究,包括用于鲁棒性评估的合成损坏基准、感知图像质量评估,以及对成像系统或真实相机故障的物理基础分析。尽管这些领域研究的现象密切相关,但它们通常使用不兼容的分组方案和特定后端的严重性定义,使得在数据集、降质来源和任务之间比较结果变得困难。我们提出了一个基于因果关系的框架,用于组织和解释这些环境中的图像降质。我们并没有引入新的降质或重新定义现有基准,而是提供了一种解释性表示和测量层,使隐含假设变得明确。每种降质沿两个正交轴进行描述:其在成像流程中的主要因果来源(环境、传感器/光学、ISP/渲染器/编解码器或传输/系统),以及其导致的感知效果。这种双轴抽象产生了一个紧凑的分类法,涵盖了算法性损坏、感知失真和物理动机的成像伪影。为了解决不一致的严重性语义而不改变现有实现,我们引入了一种轻量级的严重性测量层。对于每种降质和给定后端的每个原生严重性水平,我们使用完整参考图像质量指标(PSNR、SSIM和LPIPS)量化降质强度。这使得严重性在不同来源之间可观察和可比较,同时保留了原生参数化。我们通过COCO Degradation展示了该框架,这是一个与分类法对齐的基准,用于评估在多种成像条件下对象检测器的鲁棒性。
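The severity measurement layer described above can be illustrated with the simplest of the three metrics, PSNR. This is a rough sketch only: the function names, the Gaussian-noise degradation, and the mapping `sigma = 5.0 * level` are hypothetical stand-ins for a backend's native severity levels, not the benchmark's actual parameterization.

```python
import numpy as np

def psnr(reference, degraded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between a reference and a degraded image."""
    reference = np.asarray(reference, dtype=np.float64)
    degraded = np.asarray(degraded, dtype=np.float64)
    mse = np.mean((reference - degraded) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

def add_gaussian_noise(image, level, seed=0):
    """Hypothetical degradation backend: noise sigma grows linearly with level."""
    rng = np.random.default_rng(seed)
    sigma = 5.0 * level  # assumed mapping, for illustration only
    noisy = np.asarray(image, dtype=np.float64) + rng.normal(0.0, sigma, np.shape(image))
    return np.clip(noisy, 0.0, 255.0)

def measure_severity(reference, degrade, levels):
    """Map each native severity level to a measured PSNR against the clean image."""
    return {level: psnr(reference, degrade(reference, level)) for level in levels}

reference = np.full((64, 64), 128.0)
for level, value in measure_severity(reference, add_gaussian_noise, [1, 2, 3, 4, 5]).items():
    print(f"level {level}: PSNR = {value:.2f} dB")
```

Reporting a measured quantity per native level is what lets two backends with different parameterizations be placed on a common scale without changing either implementation.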
cs.CV / 74 / 2605.15908
RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations
RaPD:通过语义增强隐式表示实现分辨率无关的像素扩散
Abstract
Natural images are continuous, yet most generative models synthesize them on discrete grids, limiting resolution-flexible generation. Continuous neural fields enable resolution-free rendering, but prior methods introduce continuity only at the decoding stage as an interpolation module, leaving the generative latent space discretized and reconstruction-oriented. We propose RaPD (Resolution-agnostic Pixel Diffusion), which performs diffusion in a continuous Neural Image Field (NIF) latent space. RaPD bridges this reconstruction-generation gap with Semantic Representation Guidance for generation-aware latent learning and a Coordinate-Queried Attention Renderer for coordinate-conditioned, scale-aware rendering. A single denoised latent can be rendered at arbitrary resolutions by changing only the query coordinates, keeping diffusion cost fixed. Experiments demonstrate superior generation quality and resolution scalability.
Chinese Translation
自然图像是连续的,但大多数生成模型在离散网格上合成图像,这限制了分辨率灵活的生成。连续神经场使得无分辨率渲染成为可能,但之前的方法仅在解码阶段引入连续性,作为插值模块,导致生成的潜在空间仍然是离散的且以重建为导向。我们提出了RaPD(分辨率无关的像素扩散),它在连续的神经图像场(NIF)潜在空间中执行扩散。RaPD通过语义表示引导弥合了重建与生成之间的鸿沟,实现了生成感知的潜在学习,并采用坐标查询注意力渲染器进行坐标条件、尺度感知的渲染。只需更改查询坐标,就可以以任意分辨率渲染单个去噪潜在,保持扩散成本不变。实验表明,生成质量和分辨率可扩展性优于现有方法。
cs.CV / 75 / 2605.15921
AdaEraser: Training-Free Object Removal via Adaptive Attention Suppression
AdaEraser:通过自适应注意力抑制实现无训练的物体移除
Abstract
Object removal aims to eliminate specified objects from images while plausibly inpainting the affected regions with background content. Current training-free methods typically block attention to object regions within self-attention layers during the image generation process, leveraging surrounding background information to restore the image. However, indiscriminate suppression of self-attention in the vacated areas can degrade generation quality, as the model must simultaneously reconstruct background content in these regions. To solve this conflict, we propose AdaEraser, an adaptive framework that dynamically modulates attention based on the estimated presence of target object concepts. Through analysis of self-attention map evolution across denoising timesteps before and during removal, we develop a token-wise adaptive attention suppression strategy. This approach enables progressive perception of object removal throughout the denoising process, with the suppression strength in self-attention layers adjusted adaptively. Extensive experiments demonstrate that AdaEraser achieves superior performance in object removal, outperforming even training-based methods.
Chinese Translation
物体移除旨在从图像中消除指定物体,同时合理地用背景内容填充受影响区域。目前的无训练方法通常在图像生成过程中阻止自注意力层对物体区域的关注,利用周围的背景信息来恢复图像。然而,在空缺区域对自注意力的无差别抑制可能会降低生成质量,因为模型必须同时重建这些区域的背景内容。为了解决这一冲突,我们提出了AdaEraser,一个自适应框架,根据目标物体概念的估计存在动态调节注意力。通过分析去噪时间步长前后自注意力图的演变,我们开发了一种逐标记的自适应注意力抑制策略。这种方法使得在去噪过程中逐步感知物体移除成为可能,同时自注意力层中的抑制强度也得到了自适应调整。大量实验表明,AdaEraser在物体移除方面表现优越,甚至超过了基于训练的方法。
cs.CV / 76 / 2605.15923
Invaria: Learning Scale and Density Invariance in Point Clouds via Next-Resolution Prediction
Invaria:通过下一分辨率预测学习点云的尺度和密度不变性
Abstract
Modern image encoders achieve high generalization by decoupling semantic meaning from resolution, an ability yet to be fully realized in the 3D domain. We investigate the failure of 3D point cloud encoders to achieve similar generalization and find that existing models are highly sensitive to sampling resolution and scale changes, leading to significant performance degradation. This sensitivity is a major bottleneck for real-world deployment in robotics, as it suggests models overfit to specific quantization densities and object scales rather than learning invariant semantic features. To mitigate this dependency, we propose Invaria, a point cloud encoder that achieves scale and density invariance through next-resolution prediction and receptive field calibration. While our objective is not the explicit generation of high-resolution point clouds, we find that this training objective encourages the model to learn robust, structural invariants. The resulting encoder achieves significant performance gains during resolution shifts while maintaining high efficiency through a compact model size and reduced token requirements. Specifically, on ScanNet, Invaria achieves a 56.0% higher mIoU at 3× lower resolution and a 20% improvement when the object scale is reduced by a factor of 3. These gains are achieved with a 45% smaller model size and an average reduction of 40% in input tokens.
Chinese Translation
现代图像编码器通过将语义意义与分辨率解耦来实现高泛化能力,而这一能力在3D领域尚未完全实现。我们研究了3D点云编码器未能实现类似泛化的原因,发现现有模型对采样分辨率和尺度变化高度敏感,导致显著的性能下降。这种敏感性是机器人技术在实际应用中的主要瓶颈,因为它表明模型过拟合于特定的量化密度和物体尺度,而不是学习不变的语义特征。为了减轻这种依赖性,我们提出了Invaria,这是一种通过下一分辨率预测和感受野校准来实现尺度和密度不变性的点云编码器。虽然我们的目标并不是显式生成高分辨率的点云,但我们发现这一训练目标鼓励模型学习稳健的结构不变性。最终的编码器在分辨率变化时实现了显著的性能提升,同时通过紧凑的模型大小和减少的token需求保持了高效率。具体而言,在ScanNet上,Invaria在3倍更低分辨率下实现了56.0%的mIoU提升,并在物体尺度减少3倍时提高了20%的性能。这些提升是在模型大小减少45%和输入token平均减少40%的情况下实现的。
cs.CV / 77 / 2605.15942
Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation
分解视觉-语言对齐用于细粒度开放词汇分割
Abstract
Open-vocabulary segmentation models often struggle to generalize to unseen combinations of object categories and attributes, because fine-grained descriptions are typically encoded as holistic sentences that entangle multiple semantic units. We propose a Decomposed Vision-Language Alignment framework that explicitly factorizes textual prompts into a concept token and multiple attribute tokens, enabling separate cross-modal interactions for each semantic unit. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps to fuse information in a multiplicative manner, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, producing a stable and interpretable compositional matching. The method can be seamlessly integrated into existing transformer-based segmentation architectures and significantly improves generalization to unseen attribute-category compositions in fine-grained open-vocabulary segmentation benchmarks.
Chinese Translation
开放词汇分割模型通常难以对未见过的物体类别和属性组合进行泛化,因为细粒度描述通常被编码为纠缠多个语义单元的整体句子。我们提出了一种分解视觉-语言对齐框架,该框架明确地将文本提示分解为概念标记和多个属性标记,使每个语义单元能够进行独立的跨模态交互。在特征层面,我们引入了一种特征门控交叉注意力模块,该模块生成特定于属性的门控图,以乘法方式融合信息,有效地强制执行组合语义。在评分层面,逐标记相似性在对数空间中聚合,产生稳定且可解释的组合匹配。该方法可以无缝集成到现有的基于变换器的分割架构中,并显著提高在细粒度开放词汇分割基准中对未见属性-类别组合的泛化能力。
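The scoring-level idea above, aggregating per-token similarities in log-space, can be sketched as follows. The abstract does not specify the similarity function or any weighting, so this assumes each per-token similarity already lies in (0, 1] (for example, a sigmoid output) and simply sums the logs. The function name and the `eps` floor are illustrative assumptions.

```python
import numpy as np

def compositional_score(token_similarities, eps=1e-6):
    """Sum of log-similarities over the concept token and attribute tokens.

    Equivalent to the log of the product of similarities, so one poorly
    matched token lowers the whole score. eps avoids log(0).
    """
    sims = np.clip(np.asarray(token_similarities, dtype=float), eps, 1.0)
    return float(np.sum(np.log(sims)))

# Both regions have the same mean similarity (0.5), but the second one
# fails on one token and is scored lower in log-space.
print(compositional_score([0.5, 0.5]))
print(compositional_score([0.9, 0.1]))
```

Working in log-space turns a product into a sum, which is numerically more stable than multiplying many small values and keeps each token's contribution separately inspectable.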
cs.CV / 78 / 2605.15951
From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
从失败到反馈:群体修订解锁物体级基础的难题
Abstract
Finetuning Large Vision-Language Models with reinforcement learning has emerged as a promising approach to enhance their capability in object-level grounding. However, existing methods, mainly based on GRPO, assign rewards at the response level. Such sparse reward, often criterion-induced, leads to minimal learning signals when all candidate responses fail in challenging scenarios. In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. It begins with a sampled initial response and generates a set of revised candidates to explore improved grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate's improvement over the initial attempt and converts it into informative shaping signals. These signals are used to both refine the reward and modulate the advantage, amplifying the influence of high-quality revisions. Our method achieves consistent gains across referring and reasoning segmentation, REC, and counting benchmarks compared with prior GRPO-based models. Our code is available at https://github.com/yyliu01/GroupRevision.
Chinese Translation
通过强化学习对大型视觉-语言模型进行微调已成为增强其在物体级基础能力方面的一种有前景的方法。然而,现有的方法主要基于GRPO,在响应层面分配奖励。这种稀疏奖励,往往是由标准引起的,在所有候选响应在具有挑战性的场景中失败时,导致学习信号极少。在本研究中,我们提出了一种群体修订优化范式,以增强在难例上的学习。该方法从一个采样的初始响应开始,并生成一组修订候选以探索改进的基础结果。受到奖励塑形的启发,我们引入了一个整合过程,该过程量化每个候选相对于初始尝试的改进,并将其转化为信息丰富的塑形信号。这些信号用于精炼奖励并调节优势,放大高质量修订的影响。与先前基于GRPO的模型相比,我们的方法在指称和推理分割、REC和计数基准上实现了一致的提升。我们的代码可在 https://github.com/yyliu01/GroupRevision 获取。
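To see why response-level rewards give almost no signal on hard cases, consider the group-normalized advantage commonly used with GRPO: when every candidate fails, all rewards are equal and every advantage is zero. The sketch below shows that, followed by one hypothetical way to shape rewards by each revision's improvement over the initial attempt. The `shaped_rewards` function, its `weight` parameter, and the IoU values are assumptions for illustration; the paper's actual consolidation process is not described in the abstract.

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-8):
    """Group-normalized advantage: (reward - group mean) / group std."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def shaped_rewards(candidate_scores, initial_score, rewards, weight=1.0):
    """Hypothetical shaping: add each revision's improvement over the initial attempt."""
    improvement = np.asarray(candidate_scores, dtype=float) - initial_score
    return np.asarray(rewards, dtype=float) + weight * improvement

# Hard case: every revision stays below the success threshold, so the
# sparse reward is 0 for all candidates and the advantages vanish.
sparse = [0.0, 0.0, 0.0, 0.0]
print(group_relative_advantage(sparse))

# The same candidates, scored by (made-up) IoU relative to the initial response.
iou = [0.30, 0.45, 0.20, 0.40]
dense = shaped_rewards(iou, initial_score=0.25, rewards=sparse)
print(group_relative_advantage(dense))
```

With the shaped rewards, the revision that improves most over the initial response receives the largest advantage even though no candidate counts as a success.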
cs.CV / 79 / 2605.15961
Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models
稀疏自编码器实现CLIP模型的稳健且可解释的微调
Abstract
Large-scale pre-trained vision-language models like CLIP demonstrate remarkable zero-shot performance across diverse tasks. However, fine-tuning these models to improve downstream performance often degrades robustness against distribution shifts. Recent approaches have attempted to mitigate this trade-off, but often rely on computationally expensive text-guidance. We propose a novel method for robust fine-tuning, SAE-FT, which operates only on the model's visual representations. SAE-FT regularizes changes to these representations by penalizing the addition and removal of semantically meaningful features identified by a Sparse Autoencoder trained on the pre-trained model. This constraint prevents catastrophic forgetting and makes the fine-tuning process interpretable, enabling direct analysis of semantic changes. SAE-FT is both mechanistically transparent and computationally efficient, matching or exceeding state-of-the-art performance on ImageNet and its associated distribution shift benchmarks. Code is publicly available at: https://github.com/Fabian-Mor/sae-ft.
Chinese Translation
大规模预训练的视觉-语言模型如CLIP在多种任务中展现出卓越的零样本性能。然而,微调这些模型以提升下游性能往往会降低其对分布变化的鲁棒性。近期的方法尝试缓解这一权衡,但通常依赖于计算成本高昂的文本引导。我们提出了一种新颖的稳健微调方法,SAE-FT,该方法仅在模型的视觉表示上进行操作。SAE-FT通过惩罚由在预训练模型上训练的稀疏自编码器识别的语义上有意义特征的添加和移除,来规范对这些表示的变化。这一约束防止了灾难性遗忘,并使微调过程具有可解释性,从而能够直接分析语义变化。SAE-FT在机制上透明且计算高效,在ImageNet及其相关分布变化基准上达到或超过了最先进的性能。代码可在以下网址公开获取:https://github.com/Fabian-Mor/sae-ft。
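The regularizer described above penalizes features that appear or disappear in the Sparse Autoencoder's view of the visual representation. A minimal sketch of that idea follows, assuming a ReLU SAE encoder and an L1-style penalty split into "added" and "removed" parts. The exact loss in SAE-FT is not given in the abstract, so the function names and this particular form are assumptions.

```python
import numpy as np

def sae_encode(x, w_enc, b_enc):
    """ReLU encoder of a sparse autoencoder (assumed architecture)."""
    return np.maximum(x @ w_enc + b_enc, 0.0)

def feature_change_penalty(x_pretrained, x_finetuned, w_enc, b_enc):
    """Return (added, removed) feature mass between two representations.

    added:   activation that exists after fine-tuning but not before.
    removed: activation that existed before fine-tuning but is now gone.
    """
    f_pre = sae_encode(x_pretrained, w_enc, b_enc)
    f_ft = sae_encode(x_finetuned, w_enc, b_enc)
    added = float(np.maximum(f_ft - f_pre, 0.0).sum())
    removed = float(np.maximum(f_pre - f_ft, 0.0).sum())
    return added, removed

# Toy example with an identity encoder so the features are easy to read.
w_enc, b_enc = np.eye(3), np.zeros(3)
added, removed = feature_change_penalty(
    np.array([1.0, 0.0, 2.0]), np.array([1.0, 3.0, 0.5]), w_enc, b_enc
)
print(added, removed)
```

Keeping the two terms separate is what makes the change interpretable: each nonzero entry points at a specific SAE feature that fine-tuning introduced or suppressed.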
cs.CV / 80 / 2605.15980
Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization
Flash-GRPO:通过一步策略优化实现视频扩散的高效对齐
Abstract
Group Relative Policy Optimization has emerged as essential for aligning video diffusion models with human preferences, but faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU days per experiment. Existing efficiency methods reduce costs through sliding-window subsampling of training timesteps, but fundamentally compromise optimization, exhibiting severe instability and failing to reach full-trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full-trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on 1.3B to 14B parameter models validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.
Chinese Translation
群体相对策略优化(Group Relative Policy Optimization)已成为将视频扩散模型与人类偏好对齐的重要方法,但面临着关键的计算瓶颈:训练一个拥有140亿参数的模型通常需要数百个GPU天的实验时间。现有的效率方法通过滑动窗口子采样训练时间步来降低成本,但在根本上妥协了优化,表现出严重的不稳定性,并未能达到完整轨迹性能。我们提出了Flash-GRPO,这是一种单步训练框架,在低计算预算下,其对齐质量优于完整轨迹训练,同时显著提高了训练效率。Flash-GRPO解决了两个关键挑战:等时组(iso-temporal grouping)通过强制提示间的时间一致性消除了时间步混淆的方差,将策略性能与时间步难度解耦;时间梯度校正(temporal gradient rectification)中和了导致时间步间梯度幅度极不一致的时间依赖缩放因子。对13亿到140亿参数模型的实验验证了Flash-GRPO的有效性,展示了显著的训练加速,同时保持了一致的稳定性和最先进的对齐质量。
cs.CV / 81 / 2605.15997
Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning
分割、检测与解释:CT外观推理的统一框架
Abstract
Recent progress in deep learning has significantly advanced CT image analysis, particularly for segmentation tasks. However, these advances are largely confined to image-level pattern recognition, with most methods lacking explicit anatomical or contextual reasoning. Large vision-language models introduce linguistic context into image analysis, yet most approaches focus on a single task, which is insufficient for clinical workflow analysis that requires multiple fine-grained types of analysis, such as anatomy detection and segmentation. In this paper, we propose a unified autoregressive framework that integrates language-guided visual reasoning into CT interpretation. Our method introduces task-routing tokens that trigger detection and segmentation heads conditioned on the hidden states of a large vision-language model, enabling coherent generation of visual outputs (e.g., masks and bounding boxes) and textual reasoning. To progressively enhance localisation accuracy and semantic clarity, we further design a "closer-look" mechanism that allows the model to perform progressive coarse-to-fine visits to regions of interest under refined fields of view. To support model training and evaluation, we curated a new multimodal CT dataset containing pixel-wise masks, bounding boxes, spatial prompts, and structured descriptions for visual objects constructed through an AI-assisted annotation process with human verification. Experiments on public benchmarks demonstrate consistent improvements over the SoTA, with gains of up to 1.0% Dice on BTCV and 1.7% Dice on MosMed+, while additionally providing appearance reasoning outputs. The code and dataset will be made available.
Chinese Translation
近年来,深度学习的进展显著推动了CT图像分析,尤其是在分割任务方面。然而,这些进展主要局限于图像级模式识别,大多数方法缺乏明确的解剖或上下文推理。大型视觉语言模型将语言上下文引入图像分析,但大多数方法通常专注于单一任务,这对于需要多种细粒度分析(如解剖检测和分割)的临床工作流程分析来说是不够的。本文提出了一种统一的自回归框架,将语言引导的视觉推理整合到CT解读中。我们的方法引入了任务路由标记,这些标记根据大型视觉语言模型的隐藏状态触发检测和分割头,从而实现视觉输出(例如,掩膜和边界框)和文本推理的一致生成。为了逐步提高定位准确性和语义清晰度,我们进一步设计了一种“更近观察”机制,使模型能够在细化视野下对感兴趣区域进行逐步的粗到细访问。为了支持模型的训练和评估,我们策划了一个新的多模态CT数据集,其中包含像素级掩膜、边界框、空间提示和通过人工验证的AI辅助注释过程构建的视觉对象的结构化描述。对公共基准的实验表明,相较于现有最优技术(SoTA),我们的模型在BTCV上提高了最高1.0%的Dice,在MosMed+上提高了1.7%的Dice,同时还提供了外观推理输出。代码和数据集将会公开。
cs.CV / 82 / 2605.16003
Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation
回声强制:一种用于交互式长视频生成的场景记忆框架
Abstract
Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods mainly focus on stable extension under a single prompt, making it difficult for them to handle interactive scenarios involving prompt switching, old scene forgetting, and historical scene recall. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, leading to outdated background contamination, delayed response to new prompts, and loss of long-range memory. To address this issue, we propose Echo-Forcing, a training-free scene memory framework specifically designed for interactive long video generation with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; (2) Scene Recall Frames, which compresses historical scenes into spatially structured KV representations to support long-term recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. Based on these designs, Echo-Forcing uniformly supports smooth transitions, hard cuts, and long-range scene recall under a bounded cache budget. Extensive evaluations on VBench-Long further demonstrate that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings. Our code is available at https://github.com/mingqiangWu/Echo-Forcing
Chinese Translation
自回归视频扩散模型通过局部注意力和KV缓存实现开放式生成。然而,现有的无训练长视频优化方法主要集中于在单一提示下的稳定扩展,使其难以处理涉及提示切换、旧场景遗忘和历史场景回忆的交互场景。我们将核心瓶颈识别为历史KV状态的功能纠缠:稳定锚点和近期动态由相同的缓存策略处理,导致过时的背景污染、对新提示的响应延迟以及长程记忆的丧失。为了解决这个问题,我们提出了回声强制(Echo-Forcing),这是一种专门为交互式长视频生成设计的无训练场景记忆框架,具有三个核心机制:(1) 层次时间记忆(Hierarchical Temporal Memory),在相对RoPE下解耦稳定锚点、压缩历史和近期窗口;(2) 场景回忆帧(Scene Recall Frames),将历史场景压缩为空间结构化的KV表示,以支持长期回忆;(3) 差异感知记忆衰减(Difference-aware Memory Decay),根据旧场景和新场景之间的差异自适应地遗忘冲突的标记。基于这些设计,回声强制在有限的缓存预算下均匀支持平滑过渡、硬切换和长程场景回忆。在VBench-Long上的广泛评估进一步证明,回声强制在长视频生成和交互视频生成设置中均实现了最佳整体性能。我们的代码已发布在 https://github.com/mingqiangWu/Echo-Forcing
cs.CV / 83 / 2605.16008
End-to-end plaque counting and virus titration from laboratory plate images with deep learning
基于深度学习的实验室平板图像中的端到端斑块计数和病毒滴度测定
Abstract
Plaque assays remain the gold standard readout of virus infectivity; however, plaque counting from plate images is labor-intensive and prone to inter-operator variability. We present an end-to-end, computer-aided workflow for cytopathic effect-based virus titration directly from laboratory plaque assay images. The proposed approach combines two models derived from the Segment Anything Model (SAM): a SAM2-based well-segmentation module that localizes assay wells across heterogeneous imaging conditions, and a SAM-based plaque-segmentation model that detects and enumerates plaques within each well. The method was evaluated on a mixed dataset comprising private plaque assay images of Mayaro virus and Coxsackievirus B3, together with public Vaccinia virus images from the VACVPlaque dataset. The pipeline outputs per-well plaque counts, automatically computes plaque-forming units per milliliter (PFU/mL), and is integrated into a web-based platform that allows users to review results and organize experiments. On held-out plates (17 from MAYV/CVB3 and 22 from VACV), the workflow generalized across two plate formats (6-well and 12-well) and showed strong agreement with manual annotations (Pearson correlation coefficients of 0.92 for MAYV/CVB3 and 0.88 for VACV). Automated plaque counts were further compared with annotations from four independent experts, demonstrating high concordance. The proposed system will be open sourced and publicly released upon acceptance of this manuscript to enable reproducible, scalable, and audit-ready plaque assay analysis while substantially reducing manual annotation effort.
Chinese Translation
斑块测定仍然是病毒感染性检测的金标准;然而,从平板图像中进行斑块计数既费时又容易受到操作员间的变异影响。我们提出了一种端到端的计算机辅助工作流程,用于直接从实验室斑块测定图像中基于细胞病变效应的病毒滴度测定。该方法结合了两个源自Segment Anything Model (SAM) 的模型:一个基于SAM2的孔位分割模块,能够在异质成像条件下定位测定孔,和一个基于SAM的斑块分割模型,能够检测并计数每个孔内的斑块。该方法在一个混合数据集上进行了评估,该数据集包括Mayaro病毒和Coxsackievirus B3的私有斑块测定图像,以及来自VACVPlaque数据集的公共Vaccinia病毒图像。该工作流程输出每孔的斑块计数,自动计算每毫升斑块形成单位 (PFU/mL),并集成到一个基于网络的平台中,允许用户查看结果并组织实验。在保留的平板上(17个来自MAYV/CVB3,22个来自VACV),该工作流程在两种平板格式(6孔和12孔)中具有良好的泛化能力,并与手动标注结果显示出强一致性(MAYV/CVB3的Pearson相关系数为0.92,VACV为0.88)。自动斑块计数进一步与四位独立专家的标注进行了比较,显示出高度一致性。该系统将在本手稿接受后开源并公开发布,以便实现可重复、可扩展和审计准备的斑块测定分析,同时显著减少手动标注的工作量。
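The pipeline above converts per-well plaque counts into plaque-forming units per milliliter. The standard titer formula divides the count by the product of the dilution factor and the inoculum volume. The sketch below assumes that formula; the function name, argument names, and example numbers are illustrative and not taken from the authors' platform.

```python
def pfu_per_ml(plaque_count, dilution_factor, inoculum_volume_ml):
    """Virus titer in PFU/mL from a single countable well.

    dilution_factor is the fraction of stock in the well (e.g. 1e-5 for a
    10^-5 dilution); inoculum_volume_ml is the volume plated, in mL.
    """
    if dilution_factor <= 0 or inoculum_volume_ml <= 0:
        raise ValueError("dilution factor and inoculum volume must be positive")
    return plaque_count / (dilution_factor * inoculum_volume_ml)

# 42 plaques counted at a 10^-5 dilution with 0.1 mL plated per well.
titer = pfu_per_ml(42, 1e-5, 0.1)
print(f"{titer:.2e} PFU/mL")
```

In practice a lab would usually average over replicate wells in the countable range before reporting a titer; that step is omitted here for brevity.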
cs.CV / 84 / 2605.16022
EndoGSim: Physics-Aware 4D Dynamic Endoscopic Scene Simulations via MLLM-Guided Gaussian Splatting
EndoGSim:基于物理的4D动态内窥镜场景模拟,通过MLLM引导的高斯喷溅
Abstract
In robot-assisted minimally invasive surgery, high-fidelity dynamic endoscopic scene reconstruction and simulation are crucial to enhancing downstream tasks and advancing surgical outcomes. However, existing methods primarily focus on visual reconstruction, lacking physics-based descriptions of the scene required for realistic simulation. We propose a unified framework that achieves physics-aware reconstruction and physical simulation of endoscopic scenes through Multi-modal Large Language Models (MLLMs)-guided Gaussian Splatting. Our approach utilizes 4D Gaussian Splatting (4DGS) integrated with pre-trained segmentation and depth estimation to represent deformable tissues and tools. To achieve automatic inference of physical properties, we introduce an object-wise material field that initializes material parameters via MLLM and refines them through a differentiable Material Point Method (MPM) under joint supervision from rendered images and optical flow. Validated on both open-source and in-house datasets, our framework achieves superior simulation fidelity and physical accuracy compared to state-of-the-art methods, underscoring its potential to advance robot-assisted surgical applications.
Chinese Translation
在机器人辅助的微创手术中,高保真动态内窥镜场景重建和模拟对于增强下游任务和改善手术结果至关重要。然而,现有方法主要集中于视觉重建,缺乏进行真实模拟所需的基于物理的场景描述。我们提出了一个统一框架,通过多模态大语言模型(MLLM)引导的高斯喷溅,实现内窥镜场景的基于物理的重建和物理模拟。我们的方法利用集成了预训练分割和深度估计的4D高斯喷溅(4DGS)来表示可变形的组织和工具。为了实现物理属性的自动推断,我们引入了一个对象级材料场,通过MLLM初始化材料参数,并通过可微分的材料点方法(MPM)在渲染图像和光流的联合监督下进行细化。在开源和内部数据集上进行验证,我们的框架在模拟保真度和物理准确性方面优于最先进的方法,突显了其在推进机器人辅助手术应用方面的潜力。
cs.CV / 85 / 2605.16065
Robust Prior-Guided Segmentation for Editable 3D Gaussian Splatting
基于鲁棒先验引导的可编辑3D高斯点云分割
Abstract
3D Gaussian Splatting (3D-GS) enables real-time 3D scene reconstruction but lacks robust segmentation for editing tasks such as object removal, extraction, and recoloring. Existing approaches that lift 2D segmentations to the 3D domain suffer from view inconsistencies and coarse masks. In this paper, we propose a novel framework that leverages the Segment Anything Model High Quality (SAM-HQ) to generate accurate 2D masks, addressing the limitations of the standard SAM in boundary fidelity and fine-structure preservation. To achieve robust 3D segmentation of any target object in a given scene, we introduce a prior-guided label reassignment method that assigns labels to 3D Gaussians by enforcing multiview consistency with learned priors. Our approach achieves state-of-the-art segmentation accuracy and enables interactive, real-time object editing while maintaining high visual fidelity. Qualitative results demonstrate superior boundary preservation and practical utility in Virtual Reality (VR) and robotics, advancing 3D scene editing.
Chinese Translation
3D高斯喷溅(3D Gaussian Splatting,3D-GS)实现了实时3D场景重建,但在对象移除、提取和重新上色等编辑任务中缺乏鲁棒的分割能力。现有方法将2D分割提升到3D领域时,常常面临视图不一致和粗糙掩膜的问题。本文提出了一种新颖的框架,利用高质量的Segment Anything Model(SAM-HQ)生成准确的2D掩膜,解决了标准SAM在边界保真度和细结构保留方面的局限性。为了实现给定场景中任何目标对象的鲁棒3D分割,我们引入了一种先验引导的标签重新分配方法,借助学习到的先验强制多视图一致性,为3D高斯分配标签。我们的方法实现了最先进的分割精度,并支持交互式实时对象编辑,同时保持高视觉保真度。定性结果展示了优越的边界保留能力和在虚拟现实(VR)及机器人领域的实际应用,推动了3D场景编辑的发展。
cs.CV / 86 / 2605.16076
AgriMind: An Ensemble Deep Learning Framework for Multi-Class Plant Disease Classification
AgriMind:一种用于多类别植物疾病分类的集成深度学习框架
Abstract
Plant disease detection is still largely manual in Bangladesh, where extension workers eyeball leaf samples across millions of smallholdings. We built AgriMind to automate this: an ensemble of ResNet50, EfficientNet-B0, and DenseNet121 trained on 20,638 PlantVillage images across 15 pepper, potato, and tomato disease classes. Transfer learning with frozen ImageNet backbones and 10 epochs of head-only training keeps the pipeline lightweight. Individual models hit 96-97% on the held-out test set, but averaging their softmax outputs pushes the ensemble to 99.23%, a two-thirds cut in error rate. We tried biasing the average toward the best validation model; it backfired. Dropping any single model also hurt. Pepper and potato classify perfectly; tomato, with ten visually similar classes, still reaches 99.01%. On an NVIDIA T4 GPU the full ensemble runs at 53 FPS. Whether that translates to real-time mobile use depends on TensorFlow Lite optimization, which we have not yet completed.
Chinese Translation
在孟加拉国,植物疾病检测仍然主要依赖人工,推广工作者需要在数百万个小农场中目测叶片样本。我们构建了AgriMind来实现自动化:该系统集成了ResNet50、EfficientNet-B0和DenseNet121,基于20,638张来自PlantVillage的图像进行训练,涵盖15种辣椒、土豆和番茄疾病类别。通过冻结ImageNet的主干网络并进行10个周期的头部训练,保持了管道的轻量化。各个模型在保留的测试集上达到了96%到97%的准确率,但对它们的softmax输出进行平均后,集成模型的准确率提升至99.23%——错误率减少了三分之二。我们尝试将平均值偏向最佳验证模型,但结果适得其反。去掉任何单一模型也会造成损失。辣椒和土豆的分类准确率达到100%;而番茄虽然有十个视觉上相似的类别,仍然达到了99.01%的准确率。在NVIDIA T4 GPU上,完整的集成模型以53帧每秒的速度运行。是否能实现实时移动使用取决于TensorFlow Lite的优化——这项工作我们尚未完成。
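The ensembling step described above is an unweighted average of the three models' softmax outputs. A minimal sketch of that step is given below, using made-up logits rather than real ResNet50, EfficientNet-B0, or DenseNet121 outputs. The function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_per_model):
    """Average per-model softmax probabilities, then take the argmax.

    logits_per_model: list of arrays, each of shape (n_samples, n_classes).
    Returns (predicted_labels, averaged_probabilities).
    """
    probs = np.mean(
        [softmax(np.asarray(logits, dtype=float)) for logits in logits_per_model],
        axis=0,
    )
    return probs.argmax(axis=-1), probs

# One sample, three classes, three models (illustrative logits only).
model_a = [[2.0, 1.0, 0.1]]
model_b = [[0.5, 2.5, 0.3]]
model_c = [[1.8, 1.5, 0.2]]
labels, probs = ensemble_predict([model_a, model_b, model_c])
print(labels, probs.round(3))
```

Averaging probabilities rather than taking a majority vote lets a single confident model outweigh two uncertain ones, which is one reason the two schemes can disagree on borderline samples.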
cs.CV / 87 / 2605.16079
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
VideoSeeker:通过原生代理工具调用激励实例级视频理解
Abstract
Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.
Chinese Translation
大型视觉语言模型(LVLMs)在视频理解方面取得了显著进展,但在需要精确时空定位的实例级任务中仍面临重大挑战。现有方法主要依赖文本提示进行人机交互,但这些提示难以提供精确的空间和时间参考,导致用户体验不佳。此外,目前的方法通常将视觉感知与语言推理解耦,推理主要围绕语言而非视觉内容展开,这限制了模型主动感知细粒度视觉证据的能力。为了解决这些挑战,我们提出了VideoSeeker,这是一种通过视觉提示进行实例级视频理解的新范式。VideoSeeker无缝地将代理推理与实例级视频理解任务结合,使模型能够主动感知并按需检索相关视频片段。我们构建了一个四阶段的全自动数据合成管道,以高效生成大规模、高质量的实例级视频数据。我们通过冷启动监督和强化学习训练将工具调用和主动感知能力内化到模型中,构建了一个强大的视频理解模型。实验表明,我们的模型在实例级视频理解任务上平均提高了+13.7%,超过了强大的闭源模型如GPT-4o和Gemini-2.5-Pro,同时在一般视频理解基准上也显示出有效的迁移能力。相关数据集和代码将公开发布。
cs.CV / 88 / 2605.16080
ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation
ReAlign:通过推理对齐表示实现可泛化的图像伪造检测
Abstract
The rise of AI-generated images (AIGIs) poses growing challenges for digital authenticity, prompting the need for efficient, generalizable image forgery detection systems. Existing methods, whether non-LLM-based or LLM-based, exhibit distinct advantages and limitations. While non-LLM-based models offer efficient low-level artifact detection, they often lack semantic understanding. Conversely, LLM-based methods provide strong semantic reasoning and explainability but are computationally intensive and less sensitive to subtle visual artifacts. Moreover, the true contribution of explanatory reasoning texts to forgery detection performance remains unclear. In this work, we investigate the intrinsic value and potential of LLM-generated reasoning texts, treating them as a source of generalization and semantic-error sensitivity. Based on these findings, we propose ReAlign, a novel framework that distills high-quality reasoning texts generated by a GRPO-optimized LLM into a lightweight AIGI detector via contrastive learning. ReAlign effectively inherits the generalization ability and semantic sensitivity of reasoning textual representations, while remaining efficient and lightweight for deployment. Moreover, ReAlign adopts a tailored joint optimization strategy that integrates contrastive loss for image-text alignment and classification loss for accurate forgery discrimination. Experimental results on AIGCDetectBenchmark, AIGI-Holmes, and our newly constructed UltraSynth-10k demonstrate that ReAlign consistently outperforms existing state-of-the-art detectors in both accuracy and generalization, particularly when facing complex, high-fidelity forgeries from modern generative models.
Chinese Translation
随着人工智能生成图像(AIGIs)的兴起,数字真实性面临日益严峻的挑战,这促使我们需要高效且可泛化的图像伪造检测系统。现有的方法,无论是非大语言模型(non-LLM)还是大语言模型(LLM)基础的,都展现出不同的优缺点。非LLM基础的模型提供高效的低级伪影检测,但往往缺乏语义理解。相反,LLM基础的方法提供强大的语义推理和可解释性,但计算资源消耗大,对细微视觉伪影的敏感性较低。此外,解释性推理文本对伪造检测性能的真实贡献仍不明确。在本研究中,我们探讨了LLM生成的推理文本的内在价值和潜力,认为其是泛化和语义错误敏感性的来源。基于这些发现,我们提出了ReAlign,一个新颖的框架,通过对比学习将由GRPO优化的LLM生成的高质量推理文本提炼成轻量级的AIGI检测器。ReAlign有效继承了推理文本表示的泛化能力和语义敏感性,同时在部署时保持高效和轻量。此外,ReAlign采用了定制的联合优化策略,结合了用于图像-文本对齐的对比损失和用于准确伪造区分的分类损失。在AIGCDetectBenchmark、AIGI-Holmes和我们新构建的UltraSynth-10k上的实验结果表明,ReAlign在准确性和泛化能力上始终优于现有的最先进检测器,尤其是在面对现代生成模型产生的复杂高保真伪造时。
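The joint objective described in the ReAlign abstract (a contrastive image-text alignment term plus a classification term) can be illustrated with a minimal pure-Python sketch. Everything here is an assumption for illustration: the one-directional InfoNCE form, the `temperature` value, the `weight` balancing factor, and all function names are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def _normalize(v: Vector) -> List[float]:
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(image_emb: Sequence[Vector], text_emb: Sequence[Vector],
                     temperature: float = 0.07) -> float:
    """Image-to-text InfoNCE: the i-th image should match the i-th reasoning text."""
    imgs = [_normalize(v) for v in image_emb]
    txts = [_normalize(v) for v in text_emb]
    total = 0.0
    for i, img in enumerate(imgs):
        logits = [_dot(img, t) / temperature for t in txts]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(imgs)


def classification_loss(fake_probs: Sequence[float], labels: Sequence[int]) -> float:
    """Binary cross-entropy on the detector's predicted probability of 'fake'."""
    eps = 1e-12
    terms = [y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
             for p, y in zip(fake_probs, labels)]
    return -sum(terms) / len(terms)


def joint_loss(image_emb, text_emb, fake_probs, labels, weight: float = 1.0) -> float:
    # `weight` is a hypothetical balancing factor between the two terms.
    return classification_loss(fake_probs, labels) + weight * contrastive_loss(image_emb, text_emb)


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
aligned = joint_loss(images, aligned_texts, fake_probs=[0.9, 0.1], labels=[1, 0])
swapped = joint_loss(images, swapped_texts, fake_probs=[0.9, 0.1], labels=[1, 0])
print(f"aligned pairs: {aligned:.4f}, swapped pairs: {swapped:.4f}")
```

With identical classifier outputs, the loss is lower when each image embedding sits next to its own reasoning-text embedding, which is the pressure that pulls the lightweight detector toward the text representation.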
cs.CV / 89 / 2605.16122
GenShield: Unified Detection and Artifact Correction for AI-Generated Images
GenShield:用于AI生成图像的统一检测与伪影修正
Abstract
Diffusion-based image synthesis has made AI-generated images (AIGI) increasingly photorealistic, raising urgent concerns about authenticity in applications such as misinformation detection, digital forensics, and content moderation. Despite the substantial advances in AIGI detection, how to correct detected AI-generated images with visible artifacts and restore realistic appearance remains largely underexplored. Moreover, few existing works have established the connection between AIGI detection and artifact correction. To fill this gap, we propose GenShield, a unified autoregressive framework that jointly performs explainable AIGI detection and controllable artifact correction in a closed loop from diagnosis to restoration, revealing a mutually reinforcing relationship between these two tasks. We further introduce a Visual Chain-of-Thought based curriculum learning strategy that enables self-explained, multi-step ``diagnose-then-repair'' correction with an explicit stopping criterion. A high-quality dataset with large-scale ``artifact-restored'' pairs is also constructed alongside a unified evaluation pipeline. Extensive experiments on our correction benchmark and mainstream AIGI detection benchmarks demonstrate state-of-the-art performance and strong generalization of our method. The code is available at https://github.com/zhipeixu/GenShield.
Chinese Translation
基于扩散的图像合成使得AI生成图像(AIGI)越来越逼真,这在虚假信息检测、数字取证和内容审核等应用中引发了对真实性的紧迫关注。尽管AIGI检测取得了显著进展,但如何修正具有明显伪影的检测到的AI生成图像并恢复其真实外观仍然在很大程度上未被深入研究。此外,现有的工作很少建立AIGI检测与伪影修正之间的联系。为填补这一空白,我们提出了GenShield,一个统一的自回归框架,能够在从诊断到修复的闭环中共同执行可解释的AIGI检测和可控的伪影修正,揭示了这两项任务之间的相互促进关系。我们进一步引入了一种基于视觉思维链的课程学习策略,使得能够进行自我解释的多步骤“诊断-然后-修复”修正,并设定明确的停止标准。同时,我们构建了一个包含大规模“伪影修复”对的高质量数据集,并配备统一的评估管道。在我们的修正基准和主流AIGI检测基准上的广泛实验表明,我们的方法具有最先进的性能和强大的泛化能力。代码可在 https://github.com/zhipeixu/GenShield 获取。
cs.CV / 90 / 2605.16127
WeatherOcc3D: VLM-Assisted Adverse Weather Aware 3D Semantic Occupancy Prediction
WeatherOcc3D:基于VLM的恶劣天气感知3D语义占用预测
Abstract
While multi-modal 3D semantic occupancy prediction typically enhances robustness by fusing camera and LiDAR inputs, its effectiveness is fundamentally constrained by environmental variability. Specifically, camera sensors suffer from severe low-light degradation, while LiDAR sensors encounter significant backscatter noise during heavy precipitation. These adverse conditions create a modality trust problem, as static fusion strategies fail to adaptively re-weight inputs when a specific sensor becomes unreliable. To address this, we propose a VLM-assisted framework leveraging the pre-trained CLIP latent space to guide multi-sensor integration via linguistic environmental cues. We utilize a parameter-efficient adapter to align weather-specific text embeddings with sensor features, coupled with a gating strategy that decomposes environmental uncertainty into two factors: visibility and illumination. This enables the model to dynamically modulate the fusion ratio - prioritizing semantic camera features in clear daylight and shifting to geometric LiDAR priors during rainy nights. Evaluations on the nuScenes dataset demonstrate the versatility of our approach, as implementing our proposed framework on the OccMamba and M-CONet architectures achieves mIoU scores of 26.3 and 21.1, respectively, significantly outperforming their traditional baselines.
Chinese Translation
尽管多模态3D语义占用预测通常通过融合相机和激光雷达(LiDAR)输入来增强鲁棒性,但其有效性在根本上受到环境变化的限制。具体而言,相机传感器在低光照条件下会遭受严重的性能下降,而激光雷达传感器在强降水时会遇到显著的后向散射噪声。这些不利条件造成了模态信任问题,因为静态融合策略无法在特定传感器变得不可靠时自适应地重新加权输入。为了解决这一问题,我们提出了一种基于VLM的框架,利用预训练的CLIP潜在空间,通过语言环境线索引导多传感器集成。我们使用一种参数高效的适配器,将天气特定的文本嵌入与传感器特征对齐,并结合一种门控策略,将环境不确定性分解为两个因素:可见性和光照。这使得模型能够动态调节融合比例——在清晰的白天优先考虑语义相机特征,而在雨夜则转向几何激光雷达先验。在nuScenes数据集上的评估表明我们方法的多样性,实施我们提出的框架在OccMamba和M-CONet架构上分别达到了26.3和21.1的mIoU分数,显著优于其传统基线。
cs.CV / 91 / 2605.16137
STABLE: Simulation-Ready Tabletop Layout Generation via a Semantics-Physics Dual System
STABLE:通过语义-物理双系统生成适用于仿真的桌面布局
Abstract
Generating simulation-ready tabletop scenes from task instructions is an intriguing and promising research direction in the field of Embodied AI. However, existing task-to-scene generation methods rely exclusively on large language models (LLMs) to predict scene layouts, inevitably yielding object collisions or floating objects due to LLMs' inherent limitations in 3D spatial reasoning. In this paper, we present STABLE, a semantics-physics dual system tailored for simulation-ready tabletop scene generation. STABLE consists of two complementary modules: (i) a Semantic Reasoner, a fine-tuned LLM trained on a structured tabletop scene dataset to generate coarse layouts from input task instructions, and (ii) a Physics Corrector, a physics-aware flow-based denoising model that outputs pose updates to refine layouts, which ensures the physical plausibility of scenes while preserving semantic alignment with task instructions. STABLE adopts a progressive generation paradigm: by alternating between the Semantic Reasoner and Physics Corrector, it incrementally expands the scene from task-critical objects to background objects. Experiments demonstrate that STABLE successfully generates simulation-ready tabletop scenes that strictly conform to task instructions and significantly enhances the physical validity of scenes over prior art.
Chinese Translation
从任务指令生成适用于仿真的桌面场景是具身人工智能领域中一个引人注目且前景广阔的研究方向。然而,现有的任务到场景生成方法完全依赖于大型语言模型(LLMs)来预测场景布局,因LLMs在三维空间推理方面的固有局限性,难免会导致物体碰撞或漂浮。在本文中,我们提出了STABLE,一个专为生成适用于仿真的桌面场景而设计的语义-物理双系统。STABLE由两个互补模块组成:(i)语义推理器(Semantic Reasoner),一个在结构化桌面场景数据集上进行微调的LLM,用于从输入的任务指令生成粗略布局;(ii)物理校正器(Physics Corrector),一个基于物理的流动去噪模型,输出姿态更新以精细化布局,确保场景的物理合理性,同时保持与任务指令的语义一致性。STABLE采用渐进生成范式:通过在语义推理器和物理校正器之间交替,逐步扩展场景,从任务关键对象到背景对象。实验表明,STABLE成功生成严格符合任务指令的适用于仿真的桌面场景,并显著提高了场景的物理有效性,相较于以往的研究成果有了显著提升。
cs.CV / 92 / 2605.16147
Registers Matter for Pixel-Space Diffusion Transformers
寄存器对像素空间扩散变换器的重要性
Abstract
Vision Transformers (ViTs) are known to exhibit high-norm patch-token outliers that degrade feature map quality, a problem effectively mitigated by \textit{register tokens}. As diffusion models increasingly adopt transformer architectures and move toward pixel-space training, they become closer in form to ViTs, raising the question of whether register tokens are also useful for Diffusion Transformers (DiTs). In this work, we show that DiTs differ from ViTs in a key respect: they do not exhibit patch-token outliers. Interestingly, register tokens significantly improve convergence and generation quality of pixel-space DiTs. By analyzing intermediate representations, we find that register tokens produce cleaner feature maps at high noise levels, which may contribute to their effectiveness in pixel-space generation. We further observe that recent pixel-space DiT architectures implicitly incorporate register-like mechanisms, which may partially account for their strong empirical performance. Motivated by these insights, we investigate a parameter-efficient dual-stream architecture that specializes processing for register tokens and improves pixel-space generation quality with negligible runtime overhead.
Chinese Translation
视觉变换器(ViTs)已知存在高范数的补丁令牌异常值,这会降低特征图的质量,而这一问题通过\textit{寄存器令牌}得到了有效缓解。随着扩散模型越来越多地采用变换器架构并向像素空间训练转变,它们的形式与ViTs越来越接近,这引发了寄存器令牌是否对扩散变换器(DiTs)也有用的疑问。在本研究中,我们展示了DiTs在一个关键方面与ViTs不同:它们不表现出补丁令牌异常值。有趣的是,寄存器令牌显著提高了像素空间DiTs的收敛性和生成质量。通过分析中间表示,我们发现寄存器令牌在高噪声水平下产生了更干净的特征图,这可能有助于它们在像素空间生成中的有效性。我们进一步观察到,最近的像素空间DiT架构隐式地纳入了类似寄存器的机制,这可能部分解释了它们强大的实证表现。基于这些见解,我们研究了一种参数高效的双流架构,专门处理寄存器令牌,并在几乎没有运行时开销的情况下提高像素空间生成质量。
cs.CV / 93 / 2605.16165
Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models
多模态模型中模态竞争的二阶多级方差修正
Abstract
Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose \emph{ML-FOP-SOAP}, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent gains across both modalities and stable training at batch size 8192. Compared with AdamW, our method improves sample efficiency by up to $1.4\times$ and accelerates wall-clock training by up to $1.5\times$, offering a robust optimizer for scaling multimodal foundation models.
Chinese Translation
自回归下一个标记训练为图像生成和文本理解提供了统一的公式,但也造成了强烈的模态竞争,这使得优化不稳定并限制了大批量的扩展。我们展示了像 AdamW 这样的一阶优化器容易受到跨模态梯度异质性的影响,而二阶预处理,特别是 SOAP,为多模态对齐提供了更稳定的基础。基于这一见解,我们提出了 \emph{ML-FOP-SOAP},一种具有多级方差修正的二阶优化框架。我们的费舍尔正交投影抑制了由方差引起的模态冲突,减少了视觉生成与文本理解之间的权衡。为了在大梯度累积下实现这一目标,我们引入了一种分层折叠策略,以低微步开销捕捉细粒度方差。在 Janus 和 Emu3 上的实验显示,两种模态均有一致的增益,并且在批量大小为 8192 时训练稳定。与 AdamW 相比,我们的方法提高了样本效率,最高可达 $1.4\times$,并加速了墙钟训练,最高可达 $1.5\times$,为扩展多模态基础模型提供了一个稳健的优化器。
cs.CV / 94 / 2605.16171
Res$^2$CLIP: Few-Shot Generalist Anomaly Detection with Residual-to-Residual Alignment
Res$^2$CLIP:基于残差对残差对齐的少样本通用异常检测
Abstract
Few-shot Generalist Anomaly Detection requires models to generalize to novel categories without retraining, posing significant challenges in real-world scenarios with scarce samples and rapidly changing categories. Existing CLIP-based methods face two major challenges: coarse-grained unified text prompts struggle to adapt to fine-grained foreground-background differences, causing cross-granularity mismatch; and fine-tuning on auxiliary datasets disrupts CLIP's inherent open-world generalization due to domain shift, leading to cross-category generalization degradation. To address these, we propose to shift multimodal alignment entirely into a unified residual space, where residual representations naturally eliminate fine-grained normal feature differences across regions and class-specific biases, simultaneously resolving both problems. Based on this insight, Res$^2$CLIP, the first residual-to-residual alignment framework that symmetrically bridges visual and text modalities within CLIP's residual space, is designed. The framework is developed from a residual perspective into three branches: a text prompt-based branch, a visual prompt-based branch, and a novel residual-to-residual alignment branch. All learnable optimizations are constrained within the residual domain, and the residual alignment optimization objectives are designed to force the model to focus on relative anomaly deviations rather than optimizing class-specific features. Experiments on multiple datasets demonstrate the effectiveness of our architecture. The code is available at https://github.com/hito2448/Res2CLIP.
Chinese Translation
少样本通用异常检测要求模型在不重新训练的情况下对新类别进行泛化,这在样本稀缺和类别快速变化的现实场景中带来了重大挑战。现有的基于CLIP的方法面临两个主要挑战:粗粒度的统一文本提示难以适应细粒度的前景-背景差异,导致跨粒度不匹配;而在辅助数据集上的微调由于领域转移而破坏了CLIP固有的开放世界泛化,导致跨类别泛化性能下降。为了解决这些问题,我们提出将多模态对齐完全转移到统一的残差空间,在该空间中,残差表示自然消除了区域间的细粒度正常特征差异和类别特定偏差,同时解决了这两个问题。基于这一见解,我们设计了Res$^2$CLIP,这是第一个在CLIP的残差空间内对视觉和文本模态进行对称桥接的残差对残差对齐框架。该框架从残差的角度发展为三个分支:基于文本提示的分支、基于视觉提示的分支,以及一个新颖的残差对残差对齐分支。所有可学习的优化都限制在残差域内,残差对齐优化目标旨在迫使模型关注相对异常偏差,而不是优化类别特定特征。多个数据集上的实验表明了我们架构的有效性。代码可在 https://github.com/hito2448/Res2CLIP 获取。
cs.CV / 95 / 2605.16179
MAgSeg: Segmentation of Agricultural Landscapes in High-Resolution Satellite Imagery using Multimodal Large Language Models
MAgSeg:利用多模态大型语言模型对高分辨率卫星影像中的农业景观进行分割
Abstract
Agricultural landscape segmentation in the Global South is challenging as it is characterized by fragmented plots, high intra-class variance, and a scarcity of labeled training data. Recent advances in segmentation have been made by Multimodal Large Language Models (MLLMs). However, current approaches encounter critical context length bottlenecks and a domain alignment gap in understanding satellite features. We address these limitations through MAgSeg, a novel, decoder-free MLLM segmentation approach. MAgSeg is an architecturally efficient approach that enables standard MLLMs to perform segmentation of complex smallholder agricultural landscapes from high-resolution satellite imagery, without requiring auxiliary vision decoders. We introduce a novel instruction tuning data format designed to enable scalable fine-tuning and post-training on high resolution satellite imagery, which enables MAgSeg to learn from the global context of the image while generating text tokens for only a patch within the image. Extensive evaluations on datasets spanning three countries in the Global South demonstrate that MAgSeg significantly outperforms state-of-the-art MLLM baselines, offering a scalable solution to map smallholder agricultural environments.
Chinese Translation
全球南方的农业景观分割面临挑战,因为其特征是地块碎片化、类别内部方差大以及标注训练数据稀缺。最近,多模态大型语言模型(MLLMs)在分割领域取得了进展。然而,当前的方法在理解卫星特征时遇到了关键的上下文长度瓶颈和领域对齐差距。我们通过MAgSeg解决了这些局限性,这是一种新颖的无解码器MLLM分割方法。MAgSeg是一种架构高效的方法,使标准MLLM能够在不需要辅助视觉解码器的情况下,从高分辨率卫星影像中对复杂的小农农业景观进行分割。我们引入了一种新颖的指令调优数据格式,旨在实现高分辨率卫星影像的可扩展微调和后训练,使MAgSeg能够在生成图像中仅一个补丁的文本标记时,从图像的全球上下文中学习。对涵盖全球南方三个国家的数据集进行的广泛评估表明,MAgSeg显著优于最先进的MLLM基线,提供了一种可扩展的解决方案来绘制小农农业环境。
cs.CV / 96 / 2605.16241
Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation
离线语义指导的高效视觉-语言-动作策略蒸馏
Abstract
Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce \textbf{VLA-AD}, a distillation framework that uses a Vision-Language Model as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a 158M-parameter student, yielding a $44\times$ reduction in model size while matching the teacher with only a $0.27\%$ average relative gap. The resulting policy runs at 12.5 Hz on an RTX 4090, achieving a $3.28\times$ inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different $\pi_{0.5}$-4B teacher, where the student outperforms the teacher on two suites and remains within $0.53\%$ on \texttt{libero\_goal}. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.
Chinese Translation
最近,拥有数十亿参数的视觉-语言-动作(VLA)策略在机器人操作中展现了令人印象深刻的性能,但其规模和推理成本仍然是实时闭环控制的主要障碍。我们提出了 \textbf{VLA-AD},一个蒸馏框架,利用视觉-语言模型(VLM)作为离线语义监督,将大型VLA教师转化为轻量级学生策略。VLA-AD不仅依赖于低级动作模仿,还通过高层语义指导增强教师提供的7自由度(7-DoF)动作目标,包括任务阶段锚点和多帧操作方向描述。这些辅助信号仅在训练期间使用:在测试时,学生策略独立运行,无需VLA教师或VLM。我们在三个LIBERO基准测试套件上评估了VLA-AD。以OpenVLA-7B作为教师,我们的方法生成了一个158M参数的学生策略,模型大小减少了$44\times$,同时与教师的平均相对差距仅为$0.27\%$。最终策略在RTX 4090上以12.5 Hz的频率运行,实现了相较于OpenVLA-7B的$3.28\times$推理加速。我们进一步展示了相同的语义蒸馏管道可以推广到不同的$\pi_{0.5}$-4B教师,学生在两个测试套件上超越了教师,并在 \texttt{libero\_goal}上保持在$0.53\%$以内。额外分析表明,阶段级监督和多帧方向线索使学生对教师的噪声动作(如错误的高频夹持器变化)不那么敏感。总体而言,VLA-AD证明了来自VLM的离线语义指导可以显著提高VLA策略蒸馏的效率、鲁棒性和可部署性。
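The size and speed ratios quoted for VLA-AD can be sanity-checked with simple arithmetic. This sketch assumes the nominal parameter counts implied by the model names (7B for the teacher, 158M for the student); the teacher control rate it prints is derived from the stated speedup and is not a figure reported in the abstract.

```python
# Figures from the abstract; the 7e9 teacher size is the nominal "7B", an assumption.
TEACHER_PARAMS = 7e9     # OpenVLA-7B, read as 7 billion parameters
STUDENT_PARAMS = 158e6   # 158M-parameter student
STUDENT_HZ = 12.5        # reported student control rate on an RTX 4090
SPEEDUP = 3.28           # reported inference speedup over the teacher


def size_reduction(teacher_params: float, student_params: float) -> float:
    """How many times smaller the student is than the teacher."""
    return teacher_params / student_params


def implied_teacher_hz(student_hz: float, speedup: float) -> float:
    """Teacher rate implied by the student rate and the speedup factor."""
    return student_hz / speedup


reduction = size_reduction(TEACHER_PARAMS, STUDENT_PARAMS)
teacher_hz = implied_teacher_hz(STUDENT_HZ, SPEEDUP)
print(f"size reduction: about {reduction:.1f}x")
print(f"implied teacher rate: about {teacher_hz:.2f} Hz")
```

Under these assumptions the ratio rounds to the reported 44x, and the implied teacher rate is a little under 4 Hz.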
cs.CV / 97 / 2605.16258
IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation
IVGT:用于神经场景表示的隐式视觉几何变换器
Abstract
Reconstructing coherent 3D geometry and appearance from unposed multi-view images is a fundamental yet challenging problem in computer vision. Most existing visual geometry foundation models predict explicit geometry by regressing pixel-aligned pointmaps, often suffering from redundancy and limited geometric continuity. We propose IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images. This formulation learns a continuous neural scene representation in a canonical coordinate system and supports continuous spatial queries at any 3D position, retrieving local features to predict signed distance (SDF) values and colors using lightweight decoders. It allows direct extraction of continuous and coherent surface geometry, enabling rendering of RGB images, depth maps, and surface normal maps from arbitrary viewpoints. We train IVGT via multi-dataset joint optimization with 2D supervision and 3D geometric regularization. IVGT demonstrates generalization across scenes and achieves strong performance on various tasks, including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation.
Chinese Translation
从未摆姿态的多视角图像重建一致的三维几何形状和外观是计算机视觉中的一个基本但具有挑战性的问题。现有的大多数视觉几何基础模型通过回归像素对齐的点图预测显式几何,通常面临冗余和有限几何连续性的问题。我们提出了IVGT,一种隐式视觉几何变换器,它从无姿态的多视角图像中隐式建模连续且一致的几何形状。该模型在标准坐标系中学习连续的神经场景表示,并支持在任意三维位置进行连续空间查询,利用轻量级解码器检索局部特征以预测有符号距离(SDF)值和颜色。它允许直接提取连续且一致的表面几何形状,从任意视点渲染RGB图像、深度图和表面法线图。我们通过多数据集联合优化与二维监督和三维几何正则化来训练IVGT。IVGT在场景间表现出良好的泛化能力,并在多个任务上取得了强劲的性能,包括网格和点云重建、新视角合成、深度和表面法线估计以及相机姿态估计。
cs.AI / 1 / 2605.15202
DeepSlide: From Artifacts to Presentation Delivery
DeepSlide:从工件到演示交付
Abstract
Presentations are a primary medium for scholarly communication, yet most AI slide generators optimize the artifact (a visually plausible deck) while under-optimizing the delivery process (pacing, narrative, and presentation preparation). We present DeepSlide, a human-in-the-loop multi-agent system that supports preparing the full presentation process, from requirement elicitation and time-budgeted narrative planning, to evidence-grounded slide--script generation, attention augmentation, and rehearsal support. DeepSlide integrates (i) a controllable logical-chain planner with per-node time budgets, (ii) a lightweight content-tree retriever for grounding, (iii) Markov-style sequential rendering with style inheritance, and (iv) sandboxed execution with minimal repair to ensure renderability. We further introduce a dual-scoreboard benchmark that cleanly separates static artifact quality from dynamic delivery excellence. Across 20 domains and diverse audience profiles, DeepSlide matches strong baselines on artifact quality while consistently achieving larger gains on delivery metrics, improving narrative flow, pacing precision, and slide--script synergy with clearer attention guidance.
Chinese Translation
演示是学术交流的主要媒介,但大多数人工智能幻灯片生成器优化的是工件(一个视觉上可行的幻灯片组),而在交付过程(节奏、叙事和演示准备)上优化不足。我们提出了DeepSlide,一个人机协作的多智能体系统,支持从需求获取和时间预算叙事规划,到基于证据的幻灯片-脚本生成、注意力增强和排练支持的完整演示准备过程。DeepSlide集成了(i)一个可控的逻辑链规划器,具有每个节点的时间预算,(ii)一个轻量级内容树检索器用于基础支撑,(iii)具有风格继承的马尔可夫风格序列渲染,以及(iv)最小修复的沙箱执行以确保可渲染性。我们进一步引入了一个双评分基准,清晰地将静态工件质量与动态交付卓越性分开。在20个领域和多样的受众特征中,DeepSlide在工件质量上与强基线相匹配,同时在交付指标上持续取得更大的提升,改善了叙事流畅性、节奏精确性和幻灯片-脚本的协同,提供了更清晰的注意力指导。
cs.AI / 2 / 2605.15204
SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch
SDOF:通过状态约束调度驯服多智能体编排中的对齐税
Abstract
Multi-agent orchestration frameworks such as LangChain, LangGraph, and CrewAI route tasks through graph-based pipelines but do not enforce the stage constraints that govern real business processes. We present SDOF, a framework that treats multi-agent execution as a constrained state machine. SDOF operates through two primary defensive layers, implemented by three components: (1) an Online-RLHF Specialized Intent Router trained via Generative Reward Modeling (GRPO) and (2) a StateAwareDispatcher with GoalStage finite-automaton checks and precondition/postcondition SkillRegistry validation for auditable execution control. On a recruitment system backed by the Beisen iTalent platform (6000+ enterprises), 185 expert-curated scenarios trigger 1671 live API calls. Our GSPO-aligned 7B Intent Router achieves higher joint accuracy than zero-shot GPT-4o on this FSM-constrained adversarial routing benchmark (80.9% versus 48.9%). In end-to-end execution, SDOF reaches 86.5% task completion (95% confidence interval 80.8 to 90.7) and blocks all 22 operations in the injection, illegal HR subset. Under a broader message-level blocking audit, SDOF attains precision 100% and recall 88%, expert agreement kappa=0.94. A separate evaluation on 960 SGD-derived dialogues spanning 8 service domains surfaces 201 stage-order conflicts under our FSM mapping, 41 of which arise in the normal split. This arXiv version reports the current validated scope; extended multi-seed training comparisons and deeper workflow evaluations will be released in a subsequent update.
Chinese Translation
多智能体编排框架如 LangChain、LangGraph 和 CrewAI 通过基于图的管道路由任务,但未能强制执行支配真实业务流程的阶段约束。我们提出了 SDOF,一个将多智能体执行视为受限状态机的框架。SDOF 通过两个主要的防御层运作,实施了三个组件:(1)通过生成奖励建模(Generative Reward Modeling, GRPO)训练的在线强化学习人类反馈(Online-RLHF)专用意图路由器;(2)具有目标阶段有限自动机检查和前置条件/后置条件技能注册验证的状态感知调度器(StateAwareDispatcher),用于可审计的执行控制。在一个由 Beisen iTalent 平台(6000 多家企业)支持的招聘系统中,185 个专家策划的场景触发了 1671 次实时 API 调用。我们的 GSPO 对齐的 7B 意图路由器在这个 FSM 约束的对抗路由基准测试中,达到了比零样本 GPT-4o 更高的联合准确率(80.9% 对比 48.9%)。在端到端执行中,SDOF 达到了 86.5% 的任务完成率(95% 置信区间为 80.8 到 90.7),并阻止了注入和非法人力资源子集中的所有 22 次操作。在更广泛的消息级阻塞审计下,SDOF 达到了 100% 的精确度和 88% 的召回率,专家一致性 kappa=0.94。对 960 个基于 SGD 的对话进行的单独评估,涵盖了 8 个服务领域,揭示了在我们的 FSM 映射下出现的 201 个阶段顺序冲突,其中 41 个出现在正常拆分中。此 arXiv 版本报告了当前验证的范围;扩展的多种种子训练比较和更深入的工作流评估将在后续更新中发布。
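The idea of treating multi-agent execution as a constrained state machine, with stage checks and precondition/postcondition validation before a skill runs, can be sketched as follows. This is a minimal illustration under stated assumptions, not SDOF's implementation: the `Skill` and `Dispatcher` classes, the recruitment stages, and the skill names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

State = Dict[str, bool]


@dataclass
class Skill:
    name: str
    stage: str        # stage in which the skill is allowed to run
    next_stage: str   # stage entered after a successful run
    precondition: Callable[[State], bool]
    postcondition: Callable[[State], bool]
    run: Callable[[State], State]


@dataclass
class Dispatcher:
    transitions: Set[Tuple[str, str]]   # allowed (from_stage, to_stage) pairs
    skills: Dict[str, Skill]
    stage: str
    audit_log: List[Tuple[str, str]] = field(default_factory=list)

    def dispatch(self, skill_name: str, state: State) -> State:
        skill = self.skills.get(skill_name)
        if skill is None:
            return self._block(skill_name, "unknown skill", state)
        if skill.stage != self.stage:
            return self._block(skill_name, "wrong stage", state)
        if (self.stage, skill.next_stage) not in self.transitions:
            return self._block(skill_name, "illegal transition", state)
        if not skill.precondition(state):
            return self._block(skill_name, "precondition failed", state)
        new_state = skill.run(dict(state))  # run on a copy so a rejected result is discarded
        if not skill.postcondition(new_state):
            return self._block(skill_name, "postcondition failed", state)
        self.stage = skill.next_stage
        self.audit_log.append((skill_name, "executed"))
        return new_state

    def _block(self, skill_name: str, reason: str, state: State) -> State:
        # Every refusal is recorded, which keeps execution auditable.
        self.audit_log.append((skill_name, f"blocked: {reason}"))
        return state


def _schedule(state: State) -> State:
    state["interview_scheduled"] = True
    return state


def _offer(state: State) -> State:
    state["offer_sent"] = True
    return state


skills = {
    "schedule_interview": Skill(
        name="schedule_interview", stage="screening", next_stage="interview",
        precondition=lambda s: s.get("resume_screened", False),
        postcondition=lambda s: s.get("interview_scheduled", False),
        run=_schedule,
    ),
    "send_offer": Skill(
        name="send_offer", stage="interview", next_stage="offer",
        precondition=lambda s: s.get("interview_scheduled", False),
        postcondition=lambda s: s.get("offer_sent", False),
        run=_offer,
    ),
}
dispatcher = Dispatcher(
    transitions={("screening", "interview"), ("interview", "offer")},
    skills=skills,
    stage="screening",
)
state: State = {"resume_screened": True}
state = dispatcher.dispatch("send_offer", state)          # out of order, so it is blocked
state = dispatcher.dispatch("schedule_interview", state)  # allowed in the screening stage
print(dispatcher.audit_log)
```

The design choice worth noting is that the router's output is only a request: the dispatcher, not the language model, decides whether the request is legal in the current stage.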
cs.AI / 3 / 2605.15205
Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations
理论心智的提升真的有利于人机交互吗?来自互动评估的实证发现
Abstract
Improving the Theory of Mind (ToM) capability of Large Language Models (LLMs) is crucial for effective social interactions between these AI models and humans. However, existing benchmarks often measure ToM capability improvement through story-reading, multiple-choice questions from a third-person perspective, while ignoring the first-person, dynamic, and open-ended nature of human-AI (HAI) interactions. To directly examine how ToM improvement techniques benefit HAI interactions, we first propose a new paradigm of interactive ToM evaluation with both perspective and metric shifts. Next, following the paradigm, we conduct a systematic study of four representative ToM enhancement techniques using four real-world datasets and a user study, covering both goal-oriented tasks (e.g., coding, math) and experience-oriented tasks (e.g., counseling). Our findings reveal that improvements on static benchmarks do not always translate to better performance in dynamic HAI interactions. This paper offers critical insights into ToM evaluation, showing the necessity of interaction-based assessments in developing next-generation, socially aware LLMs for HAI symbiosis.
Chinese Translation
提升大型语言模型(LLMs)的理论心智(ToM)能力对于这些人工智能模型与人类之间的有效社会交互至关重要。然而,现有的基准测试通常通过故事阅读和第三人称视角的多项选择题来衡量ToM能力的提升,而忽视了人机(HAI)交互的第一人称、动态和开放性特征。为了直接检验ToM提升技术如何有利于HAI交互,我们首先提出了具有视角和度量转变的新范式——互动ToM评估。接下来,基于该范式,我们对四种代表性的ToM增强技术进行了系统研究,使用了四个真实世界数据集和一项用户研究,涵盖了目标导向任务(如编程、数学)和体验导向任务(如咨询)。我们的研究发现,静态基准上的提升并不总是能转化为动态HAI交互中的更好表现。本文提供了对ToM评估的关键见解,显示了在开发下一代社会意识LLMs以实现HAI共生时,基于交互的评估的必要性。
cs.AI / 4 / 2605.15215
SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces
SkillSmith:将代理技能编译为边界引导的运行时接口
Abstract
Recently, skills have been widely adopted in large language model (LLM)-based agent systems across various domains. In existing frameworks, skills are typically injected into the agent reasoning loop as contextual guidance once matched to a runtime task, enabling specialized task-solving capabilities. We find that this execution paradigm introduces two major sources of redundancy: irrelevant context injection and repeated skill-specific reasoning and planning. To this end, we propose SkillSmith, a boundary-first compiler-runtime framework that compiles skill packages offline into minimal executable interfaces. By extracting fine-grained operational boundaries from skills, SkillSmith enables agents to dynamically access and execute only the relevant components at runtime, thereby minimizing unnecessary context injection and redundant reasoning overhead. In the evaluation on the SkillsBench benchmark, SkillSmith reduces solve-stage token usage by 57.44%, thinking iterations by 42.99%, solve time by 50.57% (2.02x faster), and token-proportional monetary cost by 57.44% compared with using raw skills. Moreover, compiled artifacts produced by a stronger model can be reused by a smaller or more efficient runtime model, improving task accuracy in cases where raw skill interpretation fails. The source code and data are available at https://github.com/AetherHeart-AI/Aeloon.
Chinese Translation
近年来,技能在基于大型语言模型(LLM)的代理系统中被广泛应用于各个领域。在现有框架中,技能通常在与运行时任务匹配后作为上下文指导注入到代理推理循环中,从而实现专门的任务解决能力。我们发现,这种执行范式引入了两个主要的冗余来源:无关上下文的注入和重复的技能特定推理与规划。为此,我们提出了SkillSmith,一个以边界为先的编译-运行时框架,能够将技能包离线编译成最小的可执行接口。通过从技能中提取细粒度的操作边界,SkillSmith使代理能够在运行时动态访问和执行仅相关的组件,从而最小化不必要的上下文注入和冗余的推理开销。在SkillsBench基准测试的评估中,与使用原始技能相比,SkillSmith将求解阶段的标记使用减少了57.44%,思考迭代减少了42.99%,求解时间减少了50.57%(速度提升2.02倍),以及标记比例的货币成本减少了57.44%。此外,由更强模型生成的编译工件可以被更小或更高效的运行时模型重用,从而在原始技能解释失败的情况下提高任务准确性。源代码和数据可在https://github.com/AetherHeart-AI/Aeloon获取。
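The SkillSmith abstract quotes the same result both as a percentage reduction in solve time and as a speedup factor. The conversion between the two is a one-line formula; the helper names below are hypothetical and the sketch only restates arithmetic on the reported numbers.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional time reduction (0 <= r < 1) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def cost_reduction(token_reduction: float, price_per_token: float = 1.0) -> float:
    """If cost is proportional to tokens, the fractional saving does not depend on price."""
    baseline_cost = price_per_token * 1.0
    new_cost = price_per_token * (1.0 - token_reduction)
    return (baseline_cost - new_cost) / baseline_cost


speedup = speedup_from_reduction(0.5057)
saving = cost_reduction(0.5744, price_per_token=3.0)
print(f"solve-time speedup: {speedup:.2f}x")
print(f"token-proportional cost saving: {saving:.2%}")
```

This is why the token-usage and monetary-cost reductions in the abstract are the same 57.44%: under a price that is linear in tokens, the two fractions coincide.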
cs.AI / 5 / 2605.15217
Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions
公平输出,偏见内部:高风险决策中大型语言模型潜在偏见的因果效能与不对称性
Abstract
Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, whether these suppressed representations can affect model outputs - and whether such causal potency is symmetric across demographic groups - remains unknown. We investigate the use of open-weight models for mortgage underwriting using matched applications that differ only in racially-associated names and reveal a critical disconnect: models show no output-level bias, yet retain and amplify demographic representations across model layers. Through activation steering and novel cross-layer interventions, we demonstrate that this suppressed information is decision-relevant: when reinjected at critical layers, it produces near-complete decision reversals. Critically, this latent bias is asymmetric - steering interventions affect decisions in one demographic direction, while producing minimal effects in reverse - and susceptible to adversarial prompt engineering and parameter-efficient fine-tuning. These findings demonstrate that behavioural audits focused on outputs are insufficient: fair outputs can mask exploitable internal biases. They also motivate dual-layer testing frameworks combining output evaluation with representational analysis for AI governance in high-stakes decisions.
Chinese Translation
经过指令调优的语言模型在高风险决策中表现出行为公平性,同时在其内部表征中保留了偏见关联。然而,这些被抑制的表征是否会影响模型输出,以及这种因果效能在不同人口群体之间是否对称,仍然未知。我们研究了使用开放权重模型进行抵押贷款承保的情况,利用仅在种族相关名称上有所不同的匹配申请,揭示了一个关键的脱节:模型在输出层面上没有偏见,但在模型层中保留并放大了人口表征。通过激活引导和新颖的跨层干预,我们证明了这些被抑制的信息与决策相关:当在关键层重新注入时,几乎会导致决策的完全反转。关键的是,这种潜在偏见是不对称的——引导干预在一个人口方向上影响决策,而在反向上产生的效果微乎其微——并且容易受到对抗性提示工程和参数高效微调的影响。这些发现表明,专注于输出的行为审计是不够的:公平的输出可能掩盖可利用的内部偏见。它们还促使我们构建双层测试框架,将输出评估与表征分析结合起来,以实现高风险决策中的人工智能治理。
cs.AI / 6 / 2605.15218
CAX-Agent: A Lightweight Agent Harness for Reliable APDL Automation
CAX-Agent:一种轻量级代理工具,用于可靠的APDL自动化
Abstract
Large language models deployed for MAPDL finite-element simulation face practical reliability challenges: without structured execution control, tool encapsulation, and fault recovery, outputs may be inconsistent and task failures are common. The Agent Harness paradigm addresses this by inserting domain-specific orchestration middleware that manages tool lifecycles, workflow state, and recovery escalation. This paper presents the architecture of CAX-Agent, a lightweight agent harness purpose-built for MAPDL automation, and empirically evaluates one of its core components -- the recovery policy. CAX-Agent organizes execution into three layers -- LLM service, agent harness, and solver backend -- with a recovery ladder that escalates from deterministic rule patching through model-driven regeneration to context enrichment and human intervention. We evaluate three recovery strategies (no_recovery, rule_only, and model_only) on 50 standard structural benchmarks with three repeated runs per strategy (450 case-runs total). Two independent human raters score task completion under blind conditions; inter-rater agreement is strong (quadratic weighted Cohen's kappa = 0.84, 96 percent of score pairs within one point). Model_only achieves the best completion rate (0.9267), task score (3.59/4), total score (9.16/10), and zero-intervention rate (0.84), outperforming rule_only (0.7733, 3.17/4, 7.03/10, 0.00) and no_recovery (0.6933, 2.74/4, 5.60/10, 0.00) with large effect sizes (Cliff's delta = 0.81-0.87). The benchmark uses deliberately simple geometries to isolate recovery-policy effects; we discuss the scope of these findings and directions for broader validation.
Chinese Translation
部署于MAPDL有限元仿真的大型语言模型面临实际的可靠性挑战:缺乏结构化的执行控制、工具封装和故障恢复,输出可能不一致,任务失败也很常见。代理工具架构通过插入特定领域的编排中间件来解决这一问题,该中间件管理工具生命周期、工作流状态和恢复升级。本文介绍了CAX-Agent的架构,这是一种专为MAPDL自动化构建的轻量级代理工具,并对其核心组件之一——恢复策略进行了实证评估。CAX-Agent将执行组织为三个层次——LLM服务、代理工具和求解器后端,并设有一个恢复梯级,从确定性规则修补逐步升级到模型驱动再生、上下文丰富和人工干预。我们在50个标准结构基准上评估了三种恢复策略(无恢复、仅规则和仅模型),每种策略进行了三次重复运行(总共450次案例运行)。两位独立的人类评审员在盲条件下对任务完成情况进行评分;评审员之间的一致性较强(加权二次Cohen's kappa = 0.84,96%的评分对在一分之内)。仅模型策略实现了最佳完成率(0.9267)、任务得分(3.59/4)、总得分(9.16/10)和零干预率(0.84),在大效应量下优于仅规则(0.7733,3.17/4,7.03/10,0.00)和无恢复(0.6933,2.74/4,5.60/10,0.00)(Cliff's delta = 0.81-0.87)。基准使用故意简单的几何形状以孤立恢复策略的影响;我们讨论了这些发现的范围和更广泛验证的方向。
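A recovery ladder of the kind described for CAX-Agent, where cheaper deterministic fixes are tried before model-driven ones and a human is the last resort, can be sketched in a few lines. This is an illustrative assumption, not the paper's code: `toy_execute` is a stand-in validator rather than a real MAPDL call, `model_regenerate` is a stub in place of an LLM, the context-enrichment rung is omitted for brevity, and all function names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Script = str
Executor = Callable[[Script], Optional[str]]        # returns an error message, or None on success
Rung = Callable[[Script, str], Optional[Script]]    # returns a repaired script, or None if not applicable


def run_with_recovery(script: Script, execute: Executor,
                      ladder: List[Tuple[str, Rung]]) -> Tuple[Script, str]:
    """Try each rung in order; return the final script and the name of the rung that fixed it."""
    error = execute(script)
    if error is None:
        return script, "no_recovery_needed"
    for name, rung in ladder:
        candidate = rung(script, error)
        if candidate is None:
            continue                      # this rung has no fix for the error, so escalate
        new_error = execute(candidate)
        if new_error is None:
            return candidate, name
        script, error = candidate, new_error   # keep the latest attempt and escalate
    return script, "human_intervention"


def toy_execute(script: Script) -> Optional[str]:
    # Toy check standing in for a solver run: the script must enter the solution phase and solve.
    lines = script.splitlines()
    if "/SOLU" not in lines:
        return "missing /SOLU"
    if "SOLVE" not in lines:
        return "missing SOLVE"
    return None


def rule_patch(script: Script, error: str) -> Optional[Script]:
    # Deterministic rule: only knows how to append a missing SOLVE command.
    if error == "missing SOLVE":
        return script + "\nSOLVE"
    return None


def model_regenerate(script: Script, error: str) -> Optional[Script]:
    # Stub for a model call that rewrites the whole script from the error message.
    return "/PREP7\n/SOLU\nSOLVE\nFINISH"


ladder = [("rule_patch", rule_patch), ("model_regenerate", model_regenerate)]
_, fixed_by_rule = run_with_recovery("/PREP7\n/SOLU", toy_execute, ladder)
_, fixed_by_model = run_with_recovery("/PREP7", toy_execute, ladder)
_, escalated = run_with_recovery("/PREP7", toy_execute, [("rule_patch", rule_patch)])
print(fixed_by_rule, fixed_by_model, escalated)
```

Restricting the ladder to a single rung, as in the last call, loosely mirrors how the abstract's rule_only and model_only conditions isolate one recovery strategy at a time.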
cs.AI / 7 / 2605.15219
NOVA: Fundamental Limits of Knowledge Discovery Through AI
NOVA:通过人工智能知识发现的基本限制
Abstract
Can AI systems discover genuinely new knowledge through iterative self improvement, and if so, at what cost? We introduce the NOVA framework, which models the common ``generate, verify, accumulate, retrain'' loop as an adaptive sampling process over a knowledge space. We identify sufficient conditions under which accumulated genuine knowledge eventually covers a finite domain, and show how their violations produce distinct failure modes: contamination, forgetting, exploration failure, and acceptance failure. We then analyze imperfect verification and identify a contamination trap: as easy-to-find knowledge is exhausted, the model mass assigned to new valid artifacts shrinks, so even small false-positive rates can cause invalid artifacts to enter the knowledge base faster than genuine discoveries. We clarify that Good--Turing estimation is a local batch-diversity diagnostic, not an estimator of the historically undiscovered valid mass that governs long-term discovery. Under a separate tail-equivalence assumption relating the model's effective discovery distribution to a Zipf law with exponent $\alpha>1$, we prove that the cumulative generation cost required to obtain $D$ distinct genuine discoveries satisfies $R_{\mathrm{cum}}(D)=\Theta(c_{\mathrm{gen}}D^\alpha)$, where $c_{\mathrm{gen}}$ is the per-candidate generation cost. This scaling law quantifies asymptotic diminishing returns as the discovery frontier advances. Finally, we formalize human amplification through guidance, generation, and verification, explaining why expert input is most valuable near autonomous exploration barriers.
Chinese Translation
人工智能系统能否通过迭代自我改进发现真正的新知识?如果可以,这种发现的代价是什么?我们引入了NOVA框架,该框架将常见的“生成、验证、积累、再训练”循环建模为知识空间上的自适应采样过程。我们确定了积累的真正知识最终覆盖有限领域的充分条件,并展示了这些条件的违反如何产生不同的失败模式:污染、遗忘、探索失败和接受失败。接着,我们分析了不完美的验证,并识别出一种污染陷阱:随着易于发现的知识被耗尽,分配给新有效产物的模型概率质量缩小,因此即使是小的假阳性率也会导致无效产物比真正的发现更快地进入知识库。我们澄清了Good-Turing估计是局部批量多样性诊断,而不是历史上未发现的有效质量的估计,这一质量支配着长期发现。在一个将模型有效发现分布与指数为$\alpha>1$的Zipf定律相关联的独立尾等价假设下,我们证明了获得$D$个不同真正发现所需的累积生成成本满足$R_{\mathrm{cum}}(D)=\Theta(c_{\mathrm{gen}}D^\alpha)$,其中$c_{\mathrm{gen}}$是每个候选的生成成本。该缩放法则量化了随着发现前沿推进而出现的渐近递减回报。最后,我们通过指导、生成和验证形式化了人类放大,解释了为什么专家输入在接近自主探索障碍时最为宝贵。
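The $D^\alpha$ scaling stated in the abstract can be illustrated numerically. The sketch below is not the paper's proof: it assumes an idealized process in which discoveries arrive in rank order from a truncated Zipf law, so that the expected number of candidates needed for the next discovery is the reciprocal of the still-undiscovered probability mass. The function names and the `support` truncation are hypothetical choices made for illustration.

```python
import math


def cumulative_cost(D, alpha, c_gen=1.0, support=100_000):
    """Idealized cumulative generation cost for D distinct discoveries.

    Assumption (not from the paper): items are found in rank order, so after
    d discoveries the undiscovered mass is the Zipf tail over ranks > d.
    """
    weights = [k ** -alpha for k in range(1, support + 1)]
    total = sum(weights)
    # tail[d] = normalized probability mass of ranks strictly greater than d
    tail = [0.0] * (support + 1)
    running = 0.0
    for k in range(support, 0, -1):
        running += weights[k - 1]
        tail[k - 1] = running / total
    # Expected candidates for the next discovery = 1 / undiscovered mass
    return c_gen * sum(1.0 / tail[d] for d in range(D))


def local_exponent(D, alpha):
    """Log-log slope of the cost curve between D and 2D."""
    return math.log2(cumulative_cost(2 * D, alpha) / cumulative_cost(D, alpha))
```

Under these assumptions, doubling $D$ multiplies the cost by roughly $2^\alpha$, which is the diminishing-returns behaviour the abstract describes.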
cs.AI / 8 / 2605.15224
ICRL: Learning to Internalize Self-Critique with Reinforcement Learning
ICRL:通过强化学习学习内化自我批评
Abstract
Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement. To address this, we propose learning to internalize self-critique with reinforcement learning (ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3-4B and Qwen3-8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks, and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at https://github.com/brick-pid/ICRL.
Chinese Translation
基于大型语言模型的智能体会犯错误,但批评往往可以引导同一模型朝着正确的行为发展。然而,当批评被移除时,模型可能在同一查询上再次失败,这表明它并没有将批评的指导内化为其基本能力。同时,固定的批评者无法随着时间的推移改善其反馈质量,从而限制了迭代自我改进的潜力。为了解决这个问题,我们提出了一种通过强化学习内化自我批评的学习方法(ICRL),这是一个新颖的框架,它共同训练一个求解器和一个批评者,利用共享的基础结构将批评引导的成功转化为无辅助求解器的能力。批评者根据求解器后续的性能提升获得奖励,从而激励可操作的反馈。为了应对批评条件行为与无批评行为之间的分布转变,ICRL引入了一种分布校准重加权比率,选择性地转移与求解器自身提示分布兼容的批评引导改进。此外,角色级别的组优势估计稳定了两个角色之间的联合优化。这些机制共同确保求解器学习在没有外部批评的情况下自我改进,而不是依赖于批评条件行为。我们在涵盖智能体和数学推理任务的多样基准上评估了ICRL,使用Qwen3-4B和Qwen3-8B作为基础模型。结果显示出一致的改进,在智能体任务上平均提高了6.4分,在数学推理上提高了7.0分。值得注意的是,学习到的8B批评者在使用显著较少的标记的情况下,其性能可与32B批评者相媲美。代码可在https://github.com/brick-pid/ICRL获取。
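Two mechanisms named in the abstract, the critic reward tied to the solver's gain and the role-wise group advantage, can be sketched as follows. This is one plausible reading, not the authors' implementation: the function names, the difference-based reward, and the per-role mean/standard-deviation normalization (in the style of group-relative advantages) are all assumptions.

```python
from statistics import mean, pstdev


def critic_reward(solver_score_before, solver_score_after):
    """Hypothetical critic reward: the solver's gain after seeing the critique."""
    return solver_score_after - solver_score_before


def role_wise_advantages(rollouts, eps=1e-8):
    """Normalize rewards separately within each role's group.

    rollouts: list of (role, reward) pairs, e.g. ("solver", 1.0).
    Returns a list of (role, advantage) pairs in the same order.
    """
    by_role = {}
    for role, reward in rollouts:
        by_role.setdefault(role, []).append(reward)
    stats = {role: (mean(rs), pstdev(rs)) for role, rs in by_role.items()}
    return [
        (role, (reward - stats[role][0]) / (stats[role][1] + eps))
        for role, reward in rollouts
    ]
```

Normalizing per role keeps the critic's typically smaller rewards from being swamped by the solver's reward scale, which is one way joint optimization could be stabilized.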
cs.AI / 9 / 2605.15227
NIMO Controller: a self-driving laboratory orchestrator based on the Model Context Protocol
NIMO 控制器:基于模型上下文协议的自驾实验室协调器
Abstract
Self-driving laboratories (SDLs) have attracted increasing attention as a means of accelerating scientific discovery; however, developing SDL software remains technically demanding. To improve accessibility, orchestration software frameworks have been proposed to coordinate SDL components. Nevertheless, existing frameworks are primarily designed for human interaction and do not provide standardized interfaces suitable for AI agents. In this work, we propose an SDL software architecture based on the Model Context Protocol (MCP), in which all SDL functionalities are exposed through MCP servers. Following this design principle, we introduce an MCP-based SDL orchestrator, named NIMO Controller. It provides a visual programming interface automatically generated through MCP-based tool discovery, allowing human users to design experimental workflows without writing code. The same MCP backend can also be accessed by AI agents, providing a unified interface for both human users and AI agents. We demonstrate the proposed system through a case study on a color-matching SDL. The results validate the usability of the proposed MCP-based SDL architecture.
Chinese Translation
自驾实验室(SDL)作为加速科学发现的一种手段,受到了越来越多的关注;然而,开发 SDL 软件仍然在技术上具有挑战性。为了提高可访问性,已经提出了协调 SDL 组件的软件框架。然而,现有框架主要是为人机交互设计的,并未提供适合 AI 代理的标准化接口。在本研究中,我们提出了一种基于模型上下文协议(MCP)的 SDL 软件架构,其中所有 SDL 功能通过 MCP 服务器进行暴露。遵循这一设计原则,我们引入了一种基于 MCP 的 SDL 协调器,命名为 NIMO 控制器。它提供了一个通过基于 MCP 的工具发现自动生成的可视化编程接口,使人类用户能够在不编写代码的情况下设计实验工作流程。同样的 MCP 后端也可以被 AI 代理访问,为人类用户和 AI 代理提供统一的接口。我们通过对一个颜色匹配 SDL 的案例研究展示了所提系统。结果验证了所提出的基于 MCP 的 SDL 架构的可用性。
cs.AI / 10 / 2605.15228
Verifiable Agentic Infrastructure: Proof-Derived Authorization for Sovereign AI Systems
可验证的自主基础设施:基于证明的主权人工智能系统授权
Abstract
Modern cloud and enterprise systems rely on identity-centric authorization, assuming that callers possessing valid credentials are safe to execute commands. The emergence of autonomous AI agents invalidates this assumption: agents can generate syntactically valid but semantically unsafe actions, making standing privileges a significant operational risk. This risk becomes especially acute in sovereign AI systems, where autonomous agents may interact with cloud infrastructure, regulated data, financial workflows, and national-scale digital services. Governed mutation substrates reduce this risk by interposing on agent actions: agents submit intents, infrastructure evaluates context and policy, and execution is mediated. However, this shifts the trust boundary: how can the decision to authorize an intent be made verifiable, distributed, and replayable? We introduce a Distributed Trust Framework (DTF), a verification framework for governed mutation systems that computes execution authority from structured, verifiable artifacts. DTF introduces a Justification Proof to encode the admissibility basis of an action, a consensus model for independent evaluation, an ephemeral Execution Identity derived from the approved proof, and an append-only Evidence Chain that preserves the authorization lifecycle. Under stated substrate assumptions, this architecture enforces a compact authorization invariant: no high-stakes execution without a proof object, no derived authority without consensus, and no valid mutation detached from evidence. We define the model, instantiate it over an OpenKedge-based governed mutation substrate, and show how it maps onto cloud-native environments. By shifting authorization from standing identity to proof-derived authority, DTF provides an infrastructure foundation for making agentic execution governable, auditable, and bounded in sovereign AI deployments.
Chinese Translation
现代云计算和企业系统依赖于以身份为中心的授权,假设拥有有效凭证的调用者是安全的,可以执行命令。然而,自主人工智能代理的出现使这一假设失效:代理可以生成语法上有效但语义上不安全的操作,从而使得持续的权限成为一个显著的操作风险。这一风险在主权人工智能系统中尤为严重,因为自主代理可能与云基础设施、受监管的数据、金融工作流以及国家级数字服务进行交互。受管变异基底通过对代理行为的干预来降低这一风险:代理提交意图,基础设施评估上下文和政策,执行则受到调解。然而,这改变了信任边界:如何使授权意图的决策可验证、分布式和可重放?我们提出了一个分布式信任框架(Distributed Trust Framework, DTF),这是一个用于受管变异系统的验证框架,它从结构化的、可验证的文档中计算执行权限。DTF引入了一种证明理由(Justification Proof),用于编码操作的可接受性基础,一个用于独立评估的共识模型,一个基于批准证明的短暂执行身份,以及一个仅附加的证据链,以保持授权生命周期。在所述基底假设下,该架构强制执行一个紧凑的授权不变性:没有证明对象就没有高风险执行,没有共识就没有派生权限,没有与证据脱离的有效变异。我们定义了该模型,在基于OpenKedge的受管变异基底上实例化,并展示其如何映射到云原生环境。通过将授权从持续身份转移到基于证明的权限,DTF为使自主执行在主权人工智能部署中可治理、可审计和有限提供了基础设施基础。
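The authorization invariant in the abstract (no execution without a proof object, no derived authority without consensus, no mutation detached from evidence) can be made concrete with a small sketch. Every class, field, and function name below is hypothetical, and the SHA-256 hash chain and simple vote count are illustrative stand-ins, not the DTF or OpenKedge design.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class JustificationProof:
    intent: str
    policy_basis: str

    def digest(self) -> str:
        payload = json.dumps(
            {"intent": self.intent, "basis": self.policy_basis}, sort_keys=True
        )
        return hashlib.sha256(payload.encode()).hexdigest()


@dataclass
class EvidenceChain:
    entries: list = field(default_factory=list)

    def append(self, record: dict) -> str:
        # Each entry commits to the previous one, making the log append-only.
        prev = self.entries[-1]["hash"] if self.entries else "genesis"
        body = json.dumps(record, sort_keys=True)
        entry_hash = hashlib.sha256((prev + body).encode()).hexdigest()
        self.entries.append({"prev": prev, "record": record, "hash": entry_hash})
        return entry_hash


def authorize(proof, votes, quorum, chain):
    """Return an ephemeral execution identity, or raise PermissionError."""
    if proof is None:
        raise PermissionError("no proof object")
    approvals = sum(1 for vote in votes if vote)
    # The decision is recorded whether or not it is approved.
    chain.append({"proof": proof.digest(), "approvals": approvals, "quorum": quorum})
    if approvals < quorum:
        raise PermissionError("consensus not reached")
    # Authority is derived from the approved proof, not from a standing credential.
    identity = hashlib.sha256(("exec:" + proof.digest()).encode()).hexdigest()[:16]
    chain.append({"execution_identity": identity})
    return identity
```

In this sketch a denied request still leaves an evidence entry, so the chain preserves the full authorization lifecycle rather than only the successes.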
cs.AI / 11 / 2605.15301
Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution
Solvita:通过智能进化增强大型语言模型在竞争编程中的表现
Abstract
Large language models (LLMs) still struggle with the rigorous reasoning demands of hard competitive programming. While recent multi-agent frameworks attempt to bridge this reliability gap, they remain fundamentally stateless: they rely on static retrieval and discard the valuable problem-solving and debugging experience gained from previous tasks. To address this, we present Solvita, an agentic evolution framework that enables continuous learning without requiring weight updates to the underlying LLM. Solvita reorganizes problem-solving into a closed-loop system of strategy selection, program synthesis, certified supervision, and targeted hacking, executed by four specialized agents: Planner, Solver, Oracle, and Hacker. Crucially, each agent is paired with a trainable, graph-structured knowledge network. As the system operates, outcome signals, such as pass/fail verdicts, test certification quality, and adversarial vulnerabilities discovered by the Hacker, are recast as reinforcement learning updates to these network weights. This allows the agents to dynamically route future queries based on past successes and failures, effectively accumulating transferable reasoning experience over time. Evaluated across CodeContests, APPS, AetherCode, and live Codeforces rounds, Solvita establishes a new state-of-the-art among code-generation agents, outperforming existing multi-agent pipelines and nearly doubling the accuracy of single-pass baselines.
Chinese Translation
大型语言模型(LLMs)在应对高难度竞争编程的严格推理要求时仍然面临挑战。尽管近期的多智能体框架试图弥补这一可靠性差距,但它们在根本上仍然是无状态的:它们依赖于静态检索,忽视了从先前任务中获得的宝贵问题解决和调试经验。为了解决这一问题,我们提出了Solvita,一个智能进化框架,能够实现持续学习,而无需对基础LLM进行权重更新。Solvita将问题解决重新组织为一个闭环系统,包括策略选择、程序合成、认证监督和针对性黑客攻击,由四个专门的智能体执行:规划者(Planner)、求解者(Solver)、神谕者(Oracle)和黑客(Hacker)。关键是,每个智能体都配备了一个可训练的图结构知识网络。在系统运行过程中,结果信号,如通过/失败裁决、测试认证质量以及黑客发现的对抗性漏洞,都会被重新转化为对这些网络权重的强化学习更新。这使得智能体能够根据过去的成功与失败动态地引导未来的查询,从而有效地积累可转移的推理经验。在CodeContests、APPS、AetherCode和实时Codeforces轮次的评估中,Solvita在代码生成智能体中建立了新的最先进水平,超越了现有的多智能体管道,并几乎将单次通过基线的准确性翻倍。
cs.AI / 12 / 2605.15308
SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution
SMCEvolve:通过序列蒙特卡洛演化实现原则性科学发现
Abstract
LLM-driven program evolution has emerged as a powerful tool for automated scientific discovery, yet existing frameworks offer no principled guide for designing their individual components and provide no guarantee that the search converges. We introduce SMCEvolve, which recasts program search as sampling from a reward-tilted target distribution and approximates it with a Sequential Monte Carlo (SMC) sampler. From this view, three core mechanisms emerge as principled components: adaptive parent resampling, mixture of mutation with acceptance, and automatic convergence control. We further provide a finite-sample complexity analysis that bounds the LLM-call budget required to reach a target approximation error. Across math, algorithm efficiency, symbolic regression, and end-to-end ML research benchmarks, SMCEvolve surpasses state-of-the-art evolving systems while using fewer LLM calls under self-determined termination. The code is available at https://github.com/kongwanbianjinyu/SMCEvolve.
Chinese Translation
基于大型语言模型(LLM)的程序演化已成为自动化科学发现的强大工具,但现有框架未能提供设计各个组件的原则性指导,也未能保证搜索过程的收敛性。我们提出了SMCEvolve,将程序搜索重新表述为从奖励倾斜的目标分布中采样,并使用序列蒙特卡洛(SMC)采样器进行近似。从这个角度来看,三个核心机制作为原则性组件浮现:自适应父代重采样、带接受的变异混合和自动收敛控制。我们进一步提供了有限样本复杂度分析,界定了达到目标近似误差所需的LLM调用预算。在数学、算法效率、符号回归和端到端机器学习研究基准测试中,SMCEvolve在自我决定终止的情况下,使用更少的LLM调用超越了最先进的演化系统。代码可在 https://github.com/kongwanbianjinyu/SMCEvolve 获取。
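The sampling view in the abstract can be sketched with standard SMC building blocks. The code below assumes a target of the form "prior times exp(beta times reward)" and triggers resampling when the effective sample size drops below a fraction of the population; both the tempering increment `delta_beta` and the `ess_fraction` threshold are hypothetical parameters, and this is a generic SMC step rather than the SMCEvolve algorithm itself.

```python
import math
import random


def tilted_weights(rewards, delta_beta):
    """Normalized importance weights proportional to exp(delta_beta * reward)."""
    top = max(rewards)  # subtract the max for numerical stability
    raw = [math.exp(delta_beta * (r - top)) for r in rewards]
    norm = sum(raw)
    return [w / norm for w in raw]


def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for normalized weights."""
    return 1.0 / sum(w * w for w in weights)


def resample_parents(population, rewards, delta_beta, rng, ess_fraction=0.5):
    """Resample parents only when the weights have degenerated."""
    weights = tilted_weights(rewards, delta_beta)
    if effective_sample_size(weights) >= ess_fraction * len(population):
        return list(population), weights
    parents = rng.choices(population, weights=weights, k=len(population))
    uniform = [1.0 / len(population)] * len(population)
    return parents, uniform
```

A mutation step (an LLM edit followed by an accept/reject test) would follow resampling; it is omitted here because its details are not given in the abstract.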
cs.AI / 13 / 2605.15315
Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning
通过多标准潜在推理对编码代理进行上下文剪枝
Abstract
LLM-powered coding agents spend the majority of their token budget reading repository files, yet much of the retrieved code is irrelevant to the task at hand. Existing learned pruners compress this context with a single-objective sequence labeler, collapsing all facets of code relevance into one score and one transition matrix. We show that this formulation creates a modeling bottleneck: a single CRF transition prior must serve heterogeneous retention patterns, including contiguous semantic spans and sparse structural support lines. We propose LaMR (Latent Multi-Rubric), a structured pruning framework that decomposes code relevance into two interpretable quality dimensions, semantic evidence and dependency support, each modeled by a dedicated CRF with dimension-specific transition dynamics. A mixture-of-experts gating network dynamically weights the per-rubric emissions conditioned on the query, and a final CRF layer on the fused emissions produces the aggregate keep-or-prune decision. To supervise each dimension without additional annotation cost, we derive multi-rubric labels from the existing training corpus via AST-based program analysis, simultaneously denoising the teacher's binary labels. By effectively filtering distracting noise, LaMR frequently matches or even outperforms unpruned full-context baselines. Experiments on four benchmarks (SWE-Bench Verified, SWE-QA, LCC, LongCodeQA) show that LaMR wins 12 of 16 head-to-head multi-turn comparisons. It saves up to 31% more tokens on multi-turn agent tasks and improves Exact Match by up to +3.5 on single-turn tasks, while performance is frequently enhanced by denoising the context, and any remaining drops are marginal.
Chinese Translation
基于大型语言模型(LLM)的编码代理在读取代码库文件时消耗了大部分的令牌预算,但检索到的代码往往与当前任务无关。现有的学习剪枝器使用单一目标序列标注器来压缩上下文,将代码相关性的所有方面合并为一个分数和一个转移矩阵。我们表明,这种表述方式造成了建模瓶颈:单一的条件随机场(CRF)转移先验必须服务于异质的保留模式,包括连续的语义跨度和稀疏的结构支持线。我们提出了LaMR(潜在多标准),一个结构化剪枝框架,将代码相关性分解为两个可解释的质量维度:语义证据和依赖支持,每个维度由一个专用的CRF建模,具有特定维度的转移动态。一个专家混合门控网络根据查询动态加权每个标准的输出,而在融合输出上的最终CRF层则生成聚合的保留或剪枝决策。为了在不增加额外标注成本的情况下监督每个维度,我们通过基于抽象语法树(AST)的程序分析从现有训练语料库中推导多标准标签,同时去噪教师的二元标签。通过有效过滤干扰噪声,LaMR经常与未剪枝的全上下文基线相匹配,甚至超越它们。在四个基准测试(SWE-Bench Verified、SWE-QA、LCC、LongCodeQA)上的实验表明,LaMR在16次正面多轮比较中赢得了12次。在多轮代理任务中,它节省了多达31%的令牌,并在单轮任务中提高了精确匹配(Exact Match)最多+3.5,同时通过去噪上下文经常提升性能,任何剩余的下降都是微不足道的。
cs.AI / 14 / 2605.15333
Zero-Shot Goal Recognition with Large Language Models
基于大型语言模型的零样本目标识别
Abstract
Large language models have recently reached near-parity with classical planners on well-known planning domains, yet this competence relies on world-knowledge exploitation rather than genuine symbolic reasoning. Goal recognition is a complementary abductive task structurally better suited to LLM strengths: it consists of evaluating consistency with world knowledge rather than generating novel action sequences. This paper provides the first systematic zero-shot evaluation of frontier LLMs as goal recognisers on key classical PDDL benchmarks. Our results show that LLM competence on goal recognition is uneven: some models scale with evidence and approach landmark-based accuracy at full observations, while others remain anchored to world-knowledge priors regardless of how much evidence accumulates. Qualitative analysis of model reasoning traces reveals that this divergence reflects a fundamental difference in evidence integration rather than domain familiarity. These findings position goal recognition as a principled benchmark for the foundational planning knowledge of LLMs.
Chinese Translation
大型语言模型最近在知名规划领域已接近与经典规划器的水平,但这种能力依赖于对世界知识的利用,而非真正的符号推理。目标识别是一项补充性的溯因任务,在结构上更适合大型语言模型的优势:它主要是评估与世界知识的一致性,而不是生成新的行动序列。本文首次对前沿大型语言模型作为目标识别器在关键经典PDDL基准上的零样本评估进行了系统研究。我们的结果表明,大型语言模型在目标识别上的能力不均衡:一些模型随着证据的增加而提升,并在完全观察时接近基于里程碑的准确性,而其他模型则无论证据如何积累,仍然固守于世界知识的先验。对模型推理轨迹的定性分析显示,这种差异反映了证据整合的根本差异,而非领域熟悉度。这些发现将目标识别定位为大型语言模型基础规划知识的一个原则性基准。
cs.AI / 15 / 2605.15343
Belief Engine: Configurable and Inspectable Stance Dynamics in Multi-Agent LLM Deliberation
信念引擎:可配置和可检查的多智能体大语言模型(LLM)审议中的立场动态
Abstract
LLM-based agents are increasingly used to simulate deliberative interactions such as negotiation, conflict resolution, and multi-turn opinion exchange. Yet generated transcripts often do not reveal why an agent's stance changes: movement may reflect evidence uptake, anchoring, role drift, echoing, or changed prompt and retrieval context. We introduce the Belief Engine (BE), an auditable belief-update layer that treats "belief" as an evidential state over a proposition and exposes it as scalar stance. BE extracts arguments into structured memory and updates stance with a log-odds rule controlled by evidence uptake u and prior anchoring a. Across multiple base LLMs, parameter sweeps show that these controls reliably shape stance dynamics while preserving an evidence-level update trail. On DEBATE, a human deliberation dataset with pre/post opinions, BE best reconstructs participants whose final stance follows extracted evidence; stable and evidence-opposed cases instead point to anchoring or factors outside the extracted evidence stream. BE provides configurable infrastructure for studying evidence-grounded deliberation, where openness, commitment, convergence, and disagreement can be tied to explicit update assumptions rather than hidden prompt effects.
Chinese Translation
基于大语言模型(LLM)的代理越来越多地用于模拟审议互动,例如谈判、冲突解决和多轮意见交流。然而,生成的记录通常无法揭示代理立场变化的原因:这种变化可能反映了证据采纳、锚定、角色漂移、回声效应或提示和检索上下文的变化。我们引入了信念引擎(Belief Engine,BE),这是一个可审计的信念更新层,将“信念”视为对命题的证据状态,并将其呈现为标量立场。BE 将论据提取到结构化记忆中,并通过一个由证据采纳 u 和先前锚定 a 控制的对数几率规则更新立场。在多个基础 LLM 上进行的参数扫描表明,这些控制可靠地塑造了立场动态,同时保留了证据级更新轨迹。在 DEBATE 数据集中,该数据集包含前后意见的人类审议,BE 最好地重构了最终立场遵循提取证据的参与者;而稳定和与证据相对立的案例则指向锚定或提取证据流之外的因素。BE 提供了可配置的基础设施,以研究基于证据的审议,其中开放性、承诺、收敛和分歧可以与明确的更新假设联系起来,而不是隐藏的提示效应。
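The abstract describes a log-odds update governed by evidence uptake u and prior anchoring a, without giving the exact rule. The sketch below shows one hypothetical form that has the stated controls: evidence shifts the log-odds in proportion to u, and the result is pulled back toward the prior in proportion to a. The functional form and the names are assumptions for illustration only.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def update_stance(stance, prior, evidence_llr, u, a):
    """One hypothetical log-odds update.

    stance, prior: probabilities in (0, 1) that the proposition holds.
    evidence_llr: signed log-likelihood ratio of an extracted argument.
    u: evidence uptake (0 ignores evidence); a: anchoring (1 pins to the prior).
    """
    moved = logit(stance) + u * evidence_llr
    anchored = a * logit(prior) + (1.0 - a) * moved
    return sigmoid(anchored)


def run_with_trail(prior, evidence_stream, u, a):
    """Apply updates in order and keep an inspectable per-argument trail."""
    stance, trail = prior, []
    for llr in evidence_stream:
        new_stance = update_stance(stance, prior, llr, u, a)
        trail.append({"evidence_llr": llr, "before": stance, "after": new_stance})
        stance = new_stance
    return stance, trail
```

Keeping the trail alongside the scalar stance is what would let a stance change be attributed to a specific argument rather than to prompt effects.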
cs.AI / 16 / 2605.15377
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
人工智能控制的集成监控:多样化信号胜过更多计算
Abstract
As AI systems are increasingly deployed in autonomous agentic settings at scale, it is important to ensure the actions they take are safe and aligned with user intent. Monitoring agent actions is a key safety mechanism, yet reliable monitors remain difficult to build and the scale of these systems makes human oversight impractical. We show that combining signals from diverse monitors into an ensemble improves detection of misaligned actions. We build 12 GPT-4.1-Mini monitors using both prompting and fine-tuning strategies. We evaluate them on coding tasks where candidate solutions pass standard tests but fail on adversarial inputs. In this setting, diverse ensembles outperform both individual monitors and homogeneous ensembles. Our best 3-monitor ensemble achieves 2.4x greater detection performance gain compared to an ensemble composed of three identical monitors, with the same ensemble performing strongly on an independent dataset. We contend that these results show that diversity - not scale - drives gains. The best ensembles combine strong individual performance with low correlation between monitors. Furthermore, fine-tuned monitors appear in every top-performing ensemble and maintain this advantage on out-of-distribution attack types, suggesting that fine-tuning enables detection capabilities that prompting alone does not elicit. These results support ensemble monitoring as a practical AI control strategy for safety gains at reasonable inference costs.
Chinese Translation
随着人工智能系统在自主代理环境中的大规模部署,确保其采取的行动安全且与用户意图一致变得愈发重要。监控代理行为是一个关键的安全机制,但可靠的监控器仍然难以构建,而这些系统的规模使得人工监督变得不切实际。我们展示了将来自不同监控器的信号组合成集成体可以改善对不一致行为的检测。我们使用提示和微调策略构建了12个GPT-4.1-Mini监控器。我们在编码任务上对它们进行了评估,这些任务中候选解决方案通过了标准测试,但在对抗输入上失败。在这种情况下,多样化的集成体在性能上优于单个监控器和同质集成体。我们最佳的3监控器集成体相比于由三个相同监控器组成的集成体,检测性能提升了2.4倍,并且在独立数据集上表现出色。我们认为这些结果表明,多样性而非规模推动了性能提升。最佳的集成体结合了强大的个体性能与监控器之间的低相关性。此外,微调监控器出现在每个表现最佳的集成体中,并在分布外攻击类型上保持这一优势,表明微调能够实现提示所无法引发的检测能力。这些结果支持集成监控作为一种实用的人工智能控制策略,以在合理的推理成本下实现安全收益。
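The claim that low correlation between monitors drives ensemble gains can be illustrated with a toy example. The sketch assumes each monitor emits a suspicion score in [0, 1] per action and that the ensemble simply averages scores; the averaging rule, the 0.5 threshold, and the function names are illustrative assumptions rather than the paper's aggregation method.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two score lists of equal length."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def ensemble_scores(monitor_scores):
    """Average the per-action scores across monitors."""
    return [mean(column) for column in zip(*monitor_scores)]


def flagged(scores, threshold=0.5):
    return [score >= threshold for score in scores]


# Toy data: the first two actions are attacks, the last two are benign.
monitor_a = [0.9, 0.2, 0.1, 0.1]  # misses the second attack
monitor_b = [0.3, 0.9, 0.2, 0.1]  # misses the first attack
monitor_c = [0.8, 0.7, 0.4, 0.3]  # weaker separation, different errors
```

Averaging three copies of `monitor_a` reproduces its blind spot, while averaging the three different monitors flags both attacks in this toy data, because their errors do not coincide.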
cs.AI / 17 / 2605.15400
Beyond Partner Diversity: An Influence-Based Team Steering Framework for Zero-Shot Human-Machine Teaming
超越合作伙伴多样性:基于影响的团队引导框架用于零样本人机协作
Abstract
While AI agents are rapidly advancing from isolated tools to interactive collaborators, data-driven human-machine teaming (HMT) methods remain costly in their reliance on human interaction data across domains, teammates, and team sizes. Zero-shot coordination (ZSC) addresses this bottleneck by simulating diverse partner populations to approximate how unseen partners might behave. However, partner coverage alone is insufficient as team settings scale and communication becomes degraded. To remedy this deficiency, we propose Influence-Based Team Steering (IBTS), a framework that uses influence shaping to incentivize agents to discover diverse, high-performing team interaction patterns and further steers ongoing trajectories toward stronger learned coordination modes. We assess IBTS on Overcooked-AI in both two-agent and three-agent settings, allowing us to test whether learned coordination structure transfers beyond dyadic interaction. Our evaluation includes simulated partners, synthetic partner-style variation, and, to our knowledge, the first 30-subject Overcooked-AI HMT study involving two real human teammates and one machine teammate. Across these evaluations, IBTS improves team performance against competing baselines, highlighting the need for scaled ZSC to combine sparse-reward coordination mechanisms with partner-variation coverage rather than relying on diversity alone.
Chinese Translation
随着人工智能代理从孤立工具迅速发展为互动协作者,基于数据的人机协作(HMT)方法在对人类交互数据的依赖上仍然成本高昂,涉及多个领域、队友和团队规模。零样本协调(ZSC)通过模拟多样化的合作伙伴群体来解决这一瓶颈,从而近似未见合作伙伴的行为。然而,单靠合作伙伴覆盖在团队设置规模扩大和沟通质量下降时是不够的。为了解决这一不足,我们提出了基于影响的团队引导(IBTS)框架,该框架利用影响塑造来激励代理发现多样化的高效团队互动模式,并进一步引导持续的轨迹朝向更强的学习协调模式。我们在Overcooked-AI的两代理和三代理设置中评估IBTS,这使我们能够测试学习到的协调结构是否能够超越二人互动。我们的评估包括模拟合作伙伴、合成合作伙伴风格变异,以及据我们所知,首个涉及两名真实人类队友和一名机器队友的30人Overcooked-AI HMT研究。在这些评估中,IBTS在与竞争基准的对比中提升了团队表现,突显了规模化ZSC的必要性,以将稀疏奖励协调机制与合作伙伴变异覆盖相结合,而不是仅仅依赖多样性。
cs.AI / 18 / 2605.15445
From LLM-Generated Conjectures to Lean Formalizations: Automated Polynomial Inequality Proving via Sum-of-Squares Certificates
从LLM生成的猜想到Lean形式化:通过平方和证书的自动多项式不等式证明
Abstract
Automated proving of polynomial inequalities is a fundamental challenge in automated mathematical reasoning, where rich algebraic structure and a rapidly growing certificate search space hinder scalability. Purely symbolic approaches provide strong guarantees but often scale poorly as the number of variables or the degree increases, due to expensive algebraic manipulations and rapidly growing intermediate expressions. In parallel, LLM-guided methods have made notable progress, particularly on competition-style inequalities with a small number of variables. To address the remaining scalability challenges, we propose NSPI, a neuro-symbolic framework that combines the complementary strengths of LLMs and symbolic computation for polynomial-inequality proving. Concretely, an LLM proposes a conjecture in the form of an approximate polynomial Sum-Of-Squares (SOS) decomposition; we refine it via symbolic computation to obtain an exact polynomial SOS representation, which directly proves the target inequality, and we further certify the proof in Lean, yielding an end-to-end pipeline from heuristic discovery to machine-checked proof. Experiments on challenging benchmarks involving polynomials with up to 10 variables demonstrate the effectiveness and scalability of the proposed method.
Chinese Translation
自动证明多项式不等式是自动数学推理中的一个基本挑战,其中丰富的代数结构和快速增长的证书搜索空间阻碍了可扩展性。纯符号方法提供了强有力的保证,但由于代数操作的高昂成本和快速增长的中间表达式,通常在变量数量或次数增加时扩展性较差。与此同时,基于LLM(大语言模型)的方法在变量数量较少的竞赛风格不等式上取得了显著进展。为了解决剩余的可扩展性挑战,我们提出了NSPI(神经符号框架),它结合了LLM和符号计算在多项式不等式证明中的互补优势。具体而言,LLM以近似多项式平方和(Sum-Of-Squares, SOS)分解的形式提出猜想;我们通过符号计算对其进行细化,以获得精确的多项式SOS表示,从而直接证明目标不等式,并进一步在Lean中对证明进行认证,形成从启发式发现到机器检查证明的端到端流程。在涉及多达10个变量的多项式的挑战性基准测试中进行的实验表明了所提方法的有效性和可扩展性。
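The propose-then-refine step can be illustrated on a textbook inequality, x^2 + y^2 + z^2 >= xy + yz + zx. The sketch assumes an approximate SOS guess with floating-point weights, rounds the weights to nearby rationals, and checks the identity exactly with rational arithmetic. The rounding via `limit_denominator` and the dictionary polynomial encoding are illustrative assumptions; they stand in for, and are much simpler than, the symbolic refinement and Lean certification described in the abstract.

```python
from collections import defaultdict
from fractions import Fraction


def mul(p, q):
    """Multiply polynomials stored as {exponent tuple: coefficient}."""
    out = defaultdict(Fraction)
    for ea, ca in p.items():
        for eb, cb in q.items():
            out[tuple(a + b for a, b in zip(ea, eb))] += ca * cb
    return {e: c for e, c in out.items() if c != 0}


def weighted_sum_of_squares(weights, polys):
    """Expand sum_i w_i * q_i^2 exactly."""
    out = defaultdict(Fraction)
    for w, q in zip(weights, polys):
        for e, c in mul(q, q).items():
            out[e] += w * c
    return {e: c for e, c in out.items() if c != 0}


X, Y, Z = (1, 0, 0), (0, 1, 0), (0, 0, 1)
one = Fraction(1)
# Target polynomial: x^2 + y^2 + z^2 - xy - yz - zx
target = {(2, 0, 0): one, (0, 2, 0): one, (0, 0, 2): one,
          (1, 1, 0): -one, (0, 1, 1): -one, (1, 0, 1): -one}
# Approximate guess: 0.4999 (x-y)^2 + 0.5001 (y-z)^2 + 0.5 (z-x)^2
squares = [{X: one, Y: -one}, {Y: one, Z: -one}, {Z: one, X: -one}]
approx_weights = [0.4999, 0.5001, 0.5]
# Refinement stand-in: snap each weight to a nearby small rational.
exact_weights = [Fraction(w).limit_denominator(10) for w in approx_weights]
certified = (all(w >= 0 for w in exact_weights)
             and weighted_sum_of_squares(exact_weights, squares) == target)
```

Because the final check uses exact rationals and nonnegative weights, a true result is a genuine certificate of nonnegativity for this polynomial, whereas the floating-point guess alone would not be.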
cs.AI / 19 / 2605.15505
X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Human Attention
X-SYNTH:超越检索——基于观察到的人类注意力的企业上下文合成
Abstract
In enterprise operations, the context required for an AI agent task is scattered across systems of record, static information stores, and communication channels. What is stored is system state, a lossy representation of the work that actually happened [2, 52]. The prevailing approach [17, 31, 34, 36] retrieves by matching request content to what is stored; for narrow requests this works well. But synthesis quality depends on knowing what to surface and how to interpret it: knowledge specific to each organization, team, and individual [5, 57, 61], present in behavioral patterns, absent from any retrieval index. For complex agentic tasks it breaks down: True Lead Rate is low, False Lead Rate is high, and the model has no mechanism to improve. We present X-SYNTH, a framework for enterprise context synthesis grounded in human attention, the digitally observable interaction signatures of each worker, encoding not just what they did but the sequence in which they did it, along with implicit reward signals. Behavioral traces preceding positive outcomes are distinguishable from those that did not, without external labeling. X-SYNTH models each individual's behavioral baseline as a Digital Twin Signature (DTS) and selects among seven qualitatively distinct attention filters: Proportional, Inverse, Differential, Recurrent, Comparative, Sequential, and Collective, per individual and per query, to identify causally relevant activity signatures. A four-stage pipeline assembles ranked context grounded in behavioral patterns rather than query embeddings. On a sales lead identification task, a frontier model unaided achieves 9.5% True Lead Rate (TLR) with 90.5% False Lead Rate (FLR). Augmented with X-SYNTH, TLR rises to 61.9% (6.5x) while FLR falls to 18.8%. Enterprise context synthesis is not a retrieval problem. It is a relevance problem, and human attention is its most reliable ground truth.
Chinese Translation
在企业运营中,AI代理任务所需的上下文散布在记录系统、静态信息存储和沟通渠道中。存储的信息是系统状态,是对实际发生工作的有损表示。当前的主流方法通过将请求内容与存储的信息进行匹配来进行检索;对于狭窄的请求,这种方法效果良好。但合成质量依赖于知道该呈现什么以及如何解读:这些知识特定于每个组织、团队和个人,存在于行为模式中,而在任何检索索引中都缺失。对于复杂的代理任务,这种方法会失效:真实线索率(True Lead Rate,TLR)低,虚假线索率(False Lead Rate,FLR)高,且模型没有改进的机制。我们提出了X-SYNTH,这是一个基于人类注意力的企业上下文合成框架,关注每个员工的数字可观察交互特征,不仅编码他们所做的事情,还编码他们所做事情的顺序,以及隐含的奖励信号。积极结果之前的行为痕迹与未产生积极结果的痕迹是可区分的,无需外部标记。X-SYNTH将每个个体的行为基线建模为数字双胞胎特征(Digital Twin Signature,DTS),并根据个体和查询选择七种质性不同的注意力过滤器:比例(Proportional)、反向(Inverse)、差异(Differential)、递归(Recurrent)、比较(Comparative)、顺序(Sequential)和集体(Collective),以识别因果相关的活动特征。一个四阶段的管道基于行为模式而非查询嵌入组装排名上下文。在销售线索识别任务中,一个未增强的前沿模型实现了9.5%的真实线索率(TLR)和90.5%的虚假线索率(FLR)。在X-SYNTH的增强下,TLR上升至61.9%(6.5倍),而FLR下降至18.8%。企业上下文合成不是一个检索问题,而是一个相关性问题,而人类注意力是其最可靠的真实依据。
cs.AI / 20 / 2605.15513
CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning
CAPS:用于高效并行推理的级联自适应成对选择
Abstract
Parallel reasoning, where a generator samples many candidate solutions and an aggregator selects the best, is one of the most effective forms of test-time scaling in large language models, and pairwise self-verification has become its strongest aggregation primitive. Yet pairwise verification carries a heavy cost: each judgment reads two complete solutions in full, and existing methods perform tens of such judgments per problem regardless of whether the comparison is informative. We introduce CAPS (Cascaded Adaptive Pairwise Selection), an inference-only framework that allocates verifier compute non-uniformly along two orthogonal axes: an evidence axis that adapts how much of each candidate the judge sees, and a distribution axis that adapts how comparisons are spread across the pool. CAPS instantiates these into a four-stage cascade with an optional rescue subroutine, and admits a closed-form verifier-token cost in which the per-candidate marginal cost is roughly halved relative to uniform full-evidence schedules. On four self-verifying models (Qwen3-14B, GPT-OSS-20B, Qwen3-4B-Instruct/Thinking) and five reasoning benchmarks spanning code (LiveCodeBench-v5/v6, CodeContests) and math (AIME 2025, HMMT 2025), CAPS outperforms the leading pairwise verifier on 14 of 20 suites while using 25.4% of its verifier-token budget on code, and outperforms pointwise self-verification on all 20. The trade-off suites admit an interpretable diagnostic in terms of the verifier's accuracy at partial versus full evidence, providing a concrete pre-deployment check for cascade suitability.
Chinese Translation
并行推理是一种生成器采样多个候选解决方案并由聚合器选择最佳方案的有效测试时扩展形式,在大型语言模型中,成对自我验证已成为其最强的聚合原语。然而,成对验证的成本很高:每个判断都需要完整读取两个解决方案,而现有方法在每个问题上执行数十次这样的判断,无论比较是否具有信息性。我们提出了CAPS(Cascaded Adaptive Pairwise Selection),这是一个仅用于推理的框架,它沿两个正交轴非均匀地分配验证器计算:一个证据轴适应评判者看到每个候选方案的多少,另一个分布轴适应比较在池中的分布。CAPS将这些实现为一个四阶段的级联,并带有可选的救援子程序,并允许一个封闭形式的验证器令牌成本,其中每个候选的边际成本相对于均匀的全证据计划大约减半。在四个自我验证模型(Qwen3-14B, GPT-OSS-20B, Qwen3-4B-Instruct/Thinking)和五个推理基准(涵盖代码(LiveCodeBench-v5/v6, CodeContests)和数学(AIME 2025, HMMT 2025))上,CAPS在20个套件中的14个上超越了领先的成对验证器,同时在代码上使用了25.4%的验证器令牌预算,并在所有20个套件上超越了逐点自我验证。这些权衡套件提供了一个可解释的诊断,涉及验证器在部分证据与全证据下的准确性,为级联适用性提供了具体的部署前检查。
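The evidence axis described in the abstract, adapting how much of each candidate the judge reads, can be sketched as a two-step comparison. The code assumes a hypothetical `judge(a, b)` callable returning the probability that candidate `a` is better, judges truncated prefixes first, and escalates to full solutions only when the prefix verdict is within a margin of 0.5. The prefix fraction, the margin, and the token accounting are illustrative assumptions, not the CAPS cascade.

```python
def cascaded_compare(a, b, judge, prefix_fraction=0.5, margin=0.2):
    """Compare two candidates (token lists); return (winner, tokens_read)."""

    def prefix(tokens):
        return tokens[: max(1, int(len(tokens) * prefix_fraction))]

    part_a, part_b = prefix(a), prefix(b)
    p_a_better = judge(part_a, part_b)
    tokens_read = len(part_a) + len(part_b)
    if abs(p_a_better - 0.5) < margin:
        # Partial evidence was not decisive: pay for a full-evidence judgment.
        p_a_better = judge(a, b)
        tokens_read += len(a) + len(b)
    return (a if p_a_better >= 0.5 else b), tokens_read


def toy_judge(a, b):
    """Stand-in judge: prefers the candidate with more "ok" tokens."""
    good_a, good_b = a.count("ok"), b.count("ok")
    if good_a + good_b == 0:
        return 0.5
    return good_a / (good_a + good_b)
```

Note the trade-off this exposes: an escalated comparison costs more than a single full-evidence judgment, so the scheme only saves tokens when the judge is usually decisive and accurate on partial evidence, which matches the pre-deployment diagnostic mentioned in the abstract.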
cs.AI / 21 / 2605.15537
RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision
RTL-BenchMT:通过代理辅助分析与修订动态维护RTL生成基准
Abstract
This paper introduces RTL-BenchMT, an agentic framework for dynamically maintaining RTL generation benchmarks. Large Language Models (LLMs) assisted automated RTL generation is one of the most important directions in EDA research. However, current RTL benchmarks face two critical challenges: (1) flawed cases in the benchmarks and (2) overfitting to the benchmarks. Both challenges are difficult to resolve purely by manual engineering effort. To address these issues and systematically reduce human maintenance costs, we propose an automated agentic framework, RTL-BenchMT. RTL-BenchMT focuses on two key applications: (1) automatically identifying and revising flawed benchmark cases and (2) automatically detecting and updating overfitting cases. With the assistance of RTL-BenchMT, we conduct a thorough, in-depth analysis of flawed and overfitting cases and produce a refined benchmark suite that will be open-sourced to the community.
Chinese Translation
本文介绍了RTL-BenchMT,这是一个用于动态维护RTL生成基准的代理框架。大型语言模型(LLMs)辅助的自动化RTL生成是电子设计自动化(EDA)研究中最重要的方向之一。然而,当前的RTL基准面临两个关键挑战:(1)基准中的缺陷案例和(2)对基准的过拟合。这两个挑战仅靠人工工程努力很难解决。为了解决这些问题并系统性地降低人工维护成本,我们提出了一个自动化的代理框架RTL-BenchMT。RTL-BenchMT专注于两个关键应用:(1)自动识别和修订缺陷基准案例,以及(2)自动检测和更新过拟合案例。在RTL-BenchMT的帮助下,我们对缺陷和过拟合案例进行了全面深入的分析,并生成了一个经过精炼的基准套件,该套件将向社区开源。
cs.AI / 22 / 2605.15542
DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding
DRS-GUI:无训练的动态区域搜索用于GUI定位
Abstract
GUI agents powered by Multimodal Large Language Models (MLLMs) have demonstrated impressive capability in understanding and executing user instructions. However, accurately grounding instruction-relevant elements from high-resolution screenshots cluttered with irrelevant UI components remains challenging for existing approaches. Inspired by how humans dynamically adjust their perceptual scope to locate task-related regions on complex screens, we propose DRS-GUI, a training-free dynamic region search framework for GUI grounding that can be seamlessly integrated into existing MLLMs. DRS-GUI introduces a lightweight UI Perceptor that performs three human-like perceptual actions (Focus, Shift, and Scatter) to progressively explore the interface and generate region proposals. To dynamically schedule these actions, we further design an Action Planner based on Monte Carlo Tree Search (MCTS). A region quality reward is employed to evaluate and select the highly instruction-relevant region, efficiently pruning redundant UI elements. Experiments demonstrate that DRS-GUI yields a 14% improvement on ScreenSpot-Pro for general and GUI-specific MLLMs (Qwen2.5-VL-7B and UGround-V1-7B), significantly enhancing grounding performance and generalization.
Chinese Translation
由多模态大型语言模型(MLLMs)驱动的GUI代理在理解和执行用户指令方面展现了令人印象深刻的能力。然而,从高分辨率的屏幕截图中准确定位与指令相关的元素,尤其是在杂乱的用户界面组件中,仍然是现有方法面临的挑战。受到人类如何动态调整感知范围以定位复杂屏幕上任务相关区域的启发,我们提出了DRS-GUI,这是一种无训练的动态区域搜索框架,用于GUI定位,可以无缝集成到现有的MLLMs中。DRS-GUI引入了一种轻量级的用户界面感知器(UI Perceptor),执行三种类人感知动作(聚焦、转移和散布),以逐步探索界面并生成区域提议。为了动态调度这些动作,我们进一步设计了一种基于蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)的动作规划器(Action Planner)。采用区域质量奖励来评估和选择高度与指令相关的区域,有效地修剪冗余的用户界面元素。实验表明,DRS-GUI在ScreenSpot-Pro上对一般和特定于GUI的MLLMs(Qwen2.5-VL-7B和UGround-V1-7B)实现了14%的提升,显著增强了定位性能和泛化能力。
cs.AI / 23 / 2605.15567
Position: Artificial Intelligence Needs Meta Intelligence -- the Case for Metacognitive AI
立场:人工智能需要元智能——元认知人工智能的案例
Abstract
This position paper argues for metacognition as a general design principle for creating more accurate, secure, and efficient AI. The metacognitive solution involves systems monitoring their own states and judiciously allocating resources depending on each problem instance's difficulty or cost of mistakes. Drawing inspiration both from past work on resource-rational AI and from well-documented metacognitive strategies in psychology and cognitive science, we identify specific challenges in embedding these strategies into AI design and highlight open theoretical and implementation problems. We showcase these principles through a tangible example of improved learning efficiency, effectiveness, and security in a Federated Learning (FL) case study. We show how these principles can be translated into practice with a novel software framework developed specifically to allow the community to design, deploy, and experiment with metacognition-enabled AI applications.
Chinese Translation
本文立场论文主张将元认知作为创建更准确、安全和高效的人工智能的一般设计原则。元认知解决方案涉及系统监控自身状态,并根据每个问题实例的难度或错误成本明智地分配资源。我们从以往关于资源理性人工智能的研究以及心理学和认知科学中充分记录的元认知策略中汲取灵感,识别出将这些策略嵌入人工智能设计中的具体挑战,并强调开放的理论和实施问题。我们通过一个具体的案例展示这些原则在联邦学习(Federated Learning, FL)中的学习效率、有效性和安全性的提升。我们展示了如何通过一个专门开发的新软件框架将这些原则转化为实践,以便社区能够设计、部署和实验具有元认知能力的人工智能应用。
cs.AI / 24 / 2605.15581
STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices
STAR:一种用于微服务中 RCA 代理的阶段属性 triage 和修复框架
Abstract
LLM-based root cause analysis (RCA) agents have recently emerged as a promising paradigm for incident diagnosis in microservice AIOps. However, their reliability remains fragile: an error in early evidence collection, hypothesis formulation, or causal analysis can propagate through the reasoning trace and eventually corrupt the final diagnosis. In this paper, we present \textbf{STAR}, a \emph{Stage-attributed Triage and Repair} framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages, namely \emph{Evidence Package} (EP), \emph{Hypothesis Set} (HS), \emph{Analysis Structure} (AS), and \emph{Decision Report} (DR), and treats agent failure as a stage-localizable reasoning bug rather than a monolithic end-to-end error. Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware \emph{Fast/Slow Routing}, \emph{decisive stage localization via counterfactual candidate evaluation}, and stage-specific patch-and-replay repair. We evaluate STAR on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models. Experimental results show that STAR consistently improves both root cause localization and fault type classification over strong baselines. Moreover, STAR identifies the decisive faulty stage with high accuracy, repairs most initially incorrect traces within one or two replay rounds, and benefits substantially from both Fast/Slow Routing and counterfactual stage evaluation. These results suggest that explicitly modeling \emph{where} an RCA agent fails is an effective path toward reliable, debuggable, and self-repairing agentic RCA systems.
Chinese Translation
基于大语言模型(LLM)的根本原因分析(RCA)代理最近作为微服务 AIOps 中事件诊断的有前景范式而出现。然而,它们的可靠性仍然脆弱:在早期证据收集、假设形成或因果分析中的错误可能会在推理轨迹中传播,并最终破坏最终诊断。在本文中,我们提出了 \textbf{STAR},一种用于修复错误 RCA 轨迹的 \textit{阶段属性 triage 和修复} 框架。STAR 明确将 RCA 工作流分解为四个结构化阶段,即 \textit{证据包}(EP)、\textit{假设集}(HS)、\textit{分析结构}(AS)和 \textit{决策报告}(DR),并将代理失败视为可局部化的推理错误,而不是单一的端到端错误。基于 LangGraph,STAR 进行阶段审计、预算感知的 \textit{快速/慢速路由}、\textit{通过反事实候选评估进行决定性阶段定位},以及阶段特定的补丁和重放修复。我们在一个公共的大规模基准和一个真实的生产数据集上评估 STAR,使用两个 RCA 代理工作流和三个基础模型。实验结果表明,STAR 在根本原因定位和故障类型分类方面始终优于强基线。此外,STAR 高精度地识别出决定性故障阶段,在一到两个重放轮次内修复大多数最初错误的轨迹,并且在快速/慢速路由和反事实阶段评估中受益匪浅。这些结果表明,明确建模 RCA 代理失败的 \textit{位置} 是实现可靠、可调试和自修复的代理 RCA 系统的有效途径。
cs.AI / 25 / 2605.15585
See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation
在编码之前先观察:学习空间感知教育动画生成的视觉先验
Abstract
Large language models can generate executable code for educational animations, but the resulting renders often exhibit visual defects, including element overlap, misalignment, and broken animation continuity. These defects cannot be reliably detected from the code alone and become apparent only after execution. We formalize this problem as render-feedback-aware constrained code generation: given a natural language specification, the model must generate executable code whose rendered output satisfies structured quality criteria that can be evaluated only after rendering. To address this problem, we introduce OmniManim, a render-feedback-aware educational animation generation framework built around a shared scene state, explicit visual planning, structured post-render diagnostics, and localized repair. Within OmniManim, the Vision Agent is a task-specific visual planning module: it predicts sparse keyframe layouts with coarse-to-fine bounding-box denoising and optimizes an interpolation-aware objective to reduce intermediate-frame failures induced by downstream animation interpolation. We further construct two datasets, ManimLayout-1K and EduRequire-500, and provide a reproducible evaluation protocol covering executability, instructional quality, visual quality, and efficiency. On EduRequire-500, OmniManim improves measured render quality over both single-model baselines and existing multi-agent frameworks. Systematic ablation studies further verify that explicit visual planning, especially its coarse spatial prior, bounding-box refinement, and interpolation-aware optimization, is central to these gains.
Chinese Translation
大型语言模型能够生成用于教育动画的可执行代码,但生成的渲染结果往往存在视觉缺陷,包括元素重叠、对齐错误和动画连续性中断。这些缺陷无法仅通过代码可靠检测,只有在执行后才会显现。我们将此问题形式化为渲染反馈感知的受限代码生成:给定自然语言规范,模型必须生成可执行代码,其渲染输出满足只能在渲染后评估的结构化质量标准。为了解决这个问题,我们引入了OmniManim,一个围绕共享场景状态、明确的视觉规划、结构化的后渲染诊断和局部修复构建的渲染反馈感知教育动画生成框架。在OmniManim中,视觉代理(Vision Agent)是一个特定任务的视觉规划模块:它通过粗到细的边界框去噪预测稀疏关键帧布局,并优化一个感知插值的目标,以减少由下游动画插值引起的中间帧失败。我们进一步构建了两个数据集,ManimLayout-1K和EduRequire-500,并提供了一个可重复的评估协议,涵盖可执行性、教学质量、视觉质量和效率。在EduRequire-500上,OmniManim在测量的渲染质量上优于单模型基线和现有的多代理框架。系统的消融研究进一步验证了明确的视觉规划,特别是其粗略空间先验、边界框细化和感知插值优化,对于这些提升至关重要。
cs.AI / 26 / 2605.15611
TopoEvo: A Topology-Aware Self-Evolving Multi-Agent Framework for Root Cause Analysis in Microservices
TopoEvo:一种面向拓扑的自我演化多智能体框架,用于微服务中的根因分析
Abstract
Root cause analysis (RCA) in microservices is challenging due to (i) noisy and heterogeneous multimodal observability (metrics, logs, traces), (ii) cascading failure propagation that amplifies downstream symptoms, and (iii) non-stationary topology drift induced by autoscaling and rolling updates. Recent LLM-based RCA agents can generate tool-grounded explanations, yet they often remain topology-agnostic and suffer from \emph{symptom-amplification bias}, misattributing the root cause to salient downstream victims. We propose \textbf{TopoEvo}, a topology-aware self-evolving multi-agent framework that couples graph representation learning with structured, topology-constrained reasoning. TopoEvo first introduces \emph{Metric-orthogonal Multimodal Alignment} (MOMA), which decomposes metric embeddings into complementary subspaces and contrastively aligns logs and traces to reduce modality redundancy and sparsity, yielding stable node representations for graph encoding. It then applies \emph{Vector Quantization} (VQ) to discretize topology-enhanced states into auditable \emph{symptom tokens} with a symptom lexicon, enabling reliable retrieval and token-level evidence grounding. On top of these discrete topology cues, TopoEvo performs a multi-agent \emph{Hypothesis--Evidence--Test} (HET) workflow to explicitly verify propagation-consistent explanations and separate initiating anomalies from amplified downstream symptoms. Finally, a \emph{Self-Evolving Mechanism} refreshes hierarchical incident memory and performs conservative test-time adaptation with high-confidence pseudo-labels to maintain robustness under drift.
Chinese Translation
微服务中的根因分析(RCA)面临诸多挑战,包括(i)嘈杂且异构的多模态可观测性(指标、日志、追踪),(ii)级联故障传播放大下游症状,以及(iii)由自动扩展和滚动更新引起的非平稳拓扑漂移。近期基于大型语言模型(LLM)的RCA智能体能够生成工具基础的解释,但它们通常对拓扑无感,并且受到 \textit{症状放大偏差}的影响,错误地将根因归因于显著的下游受害者。我们提出了 \textbf{TopoEvo},一种面向拓扑的自我演化多智能体框架,将图表示学习与结构化的拓扑约束推理相结合。TopoEvo首先引入了 \textit{指标正交多模态对齐}(MOMA),该方法将指标嵌入分解为互补子空间,并对日志和追踪进行对比对齐,以减少模态冗余和稀疏性,从而生成稳定的节点表示用于图编码。接着,它应用 \textit{向量量化}(VQ)将拓扑增强状态离散化为可审计的 \textit{症状标记},并构建症状词汇表,从而实现可靠的检索和标记级证据基础。在这些离散的拓扑线索之上,TopoEvo执行多智能体的 \textit{假设--证据--测试}(HET)工作流程,以明确验证传播一致的解释,并将初始异常与放大后的下游症状分开。最后,\textit{自我演化机制}刷新层次事件记忆,并在高置信度伪标签下进行保守的测试时适应,以在漂移情况下保持鲁棒性。
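The vector-quantization step in the TopoEvo abstract can be illustrated with a nearest-codeword lookup. This is only a minimal sketch: the codebook values, the lexicon names, and the `quantize` helper are hypothetical, and the abstract does not say how the paper learns or sizes its codebook.

```python
def quantize(state, codebook):
    """Return the index of the codeword closest to `state` (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(state, codebook[i]))


# Hypothetical 2-D codebook and symptom lexicon, for illustration only.
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
LEXICON = ["nominal", "latency_spike", "error_burst"]


def symptom_token(state):
    """Map a continuous node state to a discrete, human-readable symptom token."""
    return LEXICON[quantize(state, CODEBOOK)]
```

Because each token is just an index into a fixed lexicon, it can be logged and retrieved exactly, which is the auditability property the abstract emphasizes.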
cs.AI / 27 / 2605.15625
ColPackAgent: Agent-Skill-Guided Hard-Particle Monte Carlo Workflows for Colloidal Packing
ColPackAgent:基于代理技能指导的硬粒子蒙特卡洛胶体堆积工作流程
Abstract
We introduce ColPackAgent, an agent framework that autonomously runs Monte Carlo simulations of colloidal packing through a Model Context Protocol (MCP) tool server and an agent skill, whether as a standalone agent or inside an existing agent system. By harnessing the MCP server and agent skill, ColPackAgent executes a structured workflow for colloidal packing simulations, which are central to studies of phase behavior, self-assembly, and materials design. Without dedicated simulation tools and workflow instructions, general-purpose Large Language Model (LLM) agents tend to describe such workflows rather than execute them reliably. The MCP server exposes a custom-built colpack Python package that wraps HOOMD-blue hard-particle Monte Carlo, and the skill encodes a four-stage workflow contract. ColPackAgent can carry out the workflow interactively with human feedback, autonomously from an end-to-end prompt, or as autoresearch following a provided program file. We demonstrate the system in different modes with several colloidal packing simulation examples such as cube particles in 3D, a binary system of disks and capsules in 2D, and the 2D hard-disk freezing transition using autoresearch. We also compare model performance on this workflow across a panel of LLMs with 17 stage-specific prompts. This benchmark provides a stage-level check of how reliably different models follow the setup, planning, and analysis workflow. Together, these results show that pairing a domain Python package with MCP tools and a portable agent skill provides a practical route for turning a simulation toolkit into an agent-assisted research workflow.
Chinese Translation
我们介绍了ColPackAgent,这是一种代理框架,能够通过模型上下文协议(Model Context Protocol, MCP)工具服务器和代理技能自主运行胶体堆积的蒙特卡洛模拟,无论是作为独立代理还是在现有代理系统内。通过利用MCP服务器和代理技能,ColPackAgent执行了一个结构化的胶体堆积模拟工作流程,该流程对于相行为、自组装和材料设计的研究至关重要。在没有专用模拟工具和工作流程指令的情况下,通用的大型语言模型(Large Language Model, LLM)代理往往倾向于描述这些工作流程,而不是可靠地执行它们。MCP服务器提供了一个定制构建的colpack Python包,该包封装了HOOMD-blue硬粒子蒙特卡洛,而该技能则编码了一个四阶段的工作流程合同。ColPackAgent可以通过人类反馈进行交互式工作流程执行,或从端到端提示中自主执行,或者根据提供的程序文件进行自我研究。我们展示了该系统在不同模式下的应用,提供了多个胶体堆积模拟示例,如三维立方体粒子、二维盘子和胶囊的二元系统,以及使用自我研究的二维硬盘冻结转变。我们还比较了在该工作流程上不同模型的性能,使用了17个阶段特定的提示。这一基准测试提供了不同模型在设置、规划和分析工作流程中可靠性的阶段性检查。综上所述,这些结果表明,将领域特定的Python包与MCP工具和可移植的代理技能结合,提供了一条将模拟工具包转变为代理辅助研究工作流程的实用途径。
cs.AI / 28 / 2605.15665
PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI
PRISM:通过迭代仿真和监控实现企业对话式人工智能的提示可靠性
Abstract
Deploying large language model (LLM)-driven conversational agents in enterprise settings requires prompts that are simultaneously correct at launch and resilient to the non-deterministic behavioral drift that characterizes production LLM deployments. Existing prompt optimization frameworks address prompt quality as a one-time compile-time problem, leaving open the equally critical question of how to detect and repair prompt regressions caused by silent LLM behavior changes over time. We present PRISM (Prompt Reliability via Iterative Simulation and Monitoring), a closed-loop framework that treats prompt engineering as a continuous reliability engineering problem rather than a one-time authorship task. PRISM takes as input plain-language agent requirements, a set of configured tools and memory variables, and an initial draft prompt. It automatically generates test cases from requirements, simulates full multi-turn conversations against a platform-faithful LLM environment, evaluates pass/fail using an LLM-as-judge, diagnoses root causes of failures, and surgically repairs the prompt -- iterating until all tests pass. Critically, PRISM is designed to run on a scheduled basis (daily), treating LLM behavioral drift as a first-class reliability concern. We evaluate PRISM across 35 enterprise conversational agents over a three-week deployment period on the Yellow.ai V3 platform. PRISM reduces median prompt authoring time from 2 days to under 30 minutes, achieves 99% production reliability across all evaluated agents, and successfully identifies and repairs production regressions caused by LLM behavioral drift within a 24-hour detection window. Our results suggest that continuous, simulation-driven prompt optimization is both tractable and necessary for reliable enterprise conversational AI at scale.
Chinese Translation
在企业环境中部署大型语言模型(LLM)驱动的对话代理需要在启动时同时保证提示的正确性,并对生产LLM部署中非确定性行为漂移的韧性。现有的提示优化框架将提示质量视为一次性编译时问题,未能解决同样重要的如何检测和修复由于LLM行为变化而导致的提示回归的问题。我们提出了PRISM(通过迭代仿真和监控实现提示可靠性),这是一个闭环框架,将提示工程视为一个持续的可靠性工程问题,而非一次性的创作任务。PRISM以自然语言的代理需求、一组配置的工具和记忆变量以及初始草稿提示作为输入。它从需求中自动生成测试用例,在一个平台忠实的LLM环境中模拟完整的多轮对话,使用LLM作为评判者评估通过/失败,诊断失败的根本原因,并对提示进行精确修复——迭代直到所有测试通过。关键是,PRISM设计为定期(每日)运行,将LLM行为漂移视为一项重要的可靠性问题。我们在Yellow.ai V3平台上对35个企业对话代理进行了为期三周的部署,评估了PRISM的效果。PRISM将中位提示创作时间从2天减少到30分钟以内,在所有评估的代理中实现了99%的生产可靠性,并成功识别和修复了在24小时检测窗口内由LLM行为漂移引起的生产回归。我们的结果表明,持续的、基于仿真的提示优化对于大规模可靠的企业对话式人工智能既可行又必要。
cs.AI / 29 / 2605.15726
Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
超越舒适区的推动:针对RLVR的高效策略引导探索
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy-level contexts to induce diverse reasoning trajectories without relying on expensive oracle supervision. To effectively learn from such structured exploration, we further propose a unified objective, which decomposes the reward signal into inter- and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO with up to 8 times larger rollout budgets, while outperforming oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.
Chinese Translation
具有可验证奖励的强化学习(RLVR)已成为提升大型语言模型推理能力的可扩展范式。然而,其有效性在根本上受到探索的限制:策略只能在已采样的轨迹上进行改进。虽然增加回合数可以缓解这一问题,但这种粗暴的扩展在计算上代价高昂,而现有的通过修改优化目标的方法对探索内容的控制有限。在本研究中,我们提出了NudgeRL,一个用于RLVR中结构化和多样性驱动探索的框架。我们的方法引入了策略推动(Strategy Nudging),它在轻量级的策略级上下文中对每个回合进行条件设置,以诱导多样化的推理轨迹,而无需依赖昂贵的oracle监督。为了有效地从这种结构化探索中学习,我们进一步提出了一个统一目标,将奖励信号分解为上下文间和上下文内的组成部分,并结合蒸馏目标将发现的行为转移回基础策略。实证结果表明,NudgeRL在回合预算高达8倍的情况下优于标准的GRPO,同时在五个具有挑战性的数学基准上平均超越了oracle引导的RL基线。这些结果表明,结构化的、基于上下文的探索可以作为粗暴回合扩展和基于特权信息的可行性导向方法的高效且可扩展的替代方案。我们的代码可在https://github.com/tally0818/NudgeRL获取。
cs.AI / 30 / 2605.15734
Can We Trust AI-Inferred User States? A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments
我们能信任人工智能推断的用户状态吗?用于验证大型语言模型在操作环境中用户状态分类可靠性的心理测量框架
Abstract
The use of large language models to assess user states in conversational and adaptive systems is based on the assumption that the metrics used for such assessment are stable and interpretable at the level of individual scores. This paper empirically tests this assumption, focusing on the psychometric reliability of artificial intelligence (AI) measures of user states. This study employed replication evaluation procedures to assess the repeatability of a broad set of metrics across three different bimodal large language models (GPT-4o audio, Gemini 2.0 Flash, Gemini 2.5 Flash). Analyses include both individual score reliability and aggregated reliability, allowing us to distinguish metrics potentially useful for real-time adaptation from those that retain their value only in aggregated analyses. The results demonstrate that metric reliability cannot be considered a default property in interpretive domains. The lack of stability at the level of individual scores precludes the interpretation of such scores as indicators of user state in real-time adaptive systems, even if these metrics demonstrate stability after aggregation. At the same time, the study indicates that individually unstable metrics can retain analytical utility in post-hoc studies, identifying rules governing interactions and their relationships with user experience parameters such as satisfaction, trust, and engagement. The main contribution of this work, besides quantifying the severity of the problem (only 31 of 213 metrics met the criteria), is the proposal of a replicable evaluation framework, enabling measurable evaluations of metric applicability. This approach supports more responsible AI design of adaptive systems, in which the interpretation of results requires explicit validation of reliability and monitoring for violations over time.
Chinese Translation
使用大型语言模型评估对话和自适应系统中的用户状态是基于这样一个假设:用于此类评估的指标在个体评分层面上是稳定且可解释的。本文实证检验了这一假设,重点关注人工智能(AI)对用户状态测量的心理测量可靠性。本研究采用复制评估程序,评估三种不同双模态大型语言模型(GPT-4o音频、Gemini 2.0 Flash、Gemini 2.5 Flash)中广泛指标的重复性。分析包括个体评分的可靠性和聚合可靠性,使我们能够区分在实时适应中可能有用的指标与仅在聚合分析中保留其价值的指标。结果表明,指标可靠性不能被视为解释领域中的默认属性。在个体评分层面缺乏稳定性使得无法将这些评分解释为实时自适应系统中用户状态的指示,即使这些指标在聚合后表现出稳定性。同时,研究表明,个体不稳定的指标在事后研究中仍然可以保留分析效用,识别支配交互的规则及其与用户体验参数(如满意度、信任和参与度)的关系。除了量化问题的严重性(213个指标中仅有31个符合标准)之外,这项工作的主要贡献是提出了一个可复制的评估框架,使指标适用性的可测评估成为可能。这种方法支持更负责任的自适应系统AI设计,其中结果的解释需要明确的可靠性验证和随时间的违规监测。
cs.AI / 31 / 2605.15768
ALSO: Adversarial Online Strategy Optimization for Social Agents
ALSO:针对社会智能体的对抗性在线策略优化
Abstract
Social simulation provides a compelling testbed for studying social intelligence, where agents interact through multi-turn dialogues under evolving contexts and strategically adapting opponents. Such environments are inherently non-stationary, requiring agents to dynamically adjust their strategies over time. However, most Large Language Model (LLM) based social agents rely on static personas, while existing approaches for enhancing social intelligence, such as offline reinforcement learning or external planners, are ill-suited to these settings, typically assuming stationarity and incurring substantial training overhead. To bridge this gap, we propose \textbf{ALSO} (\textbf{A}dversarial on\textbf{L}ine \textbf{S}trategy \textbf{O}ptimization), the first framework for online strategy optimization in multi-agent social simulation. ALSO advances social adaptation through two key contributions. (1) ALSO formulates multi-turn interaction as an adversarial bandit problem, where combinations of static personas and dynamic strategy instructions are treated as arms, providing a principled solution to non-stationarity without relying on environmental stability assumptions. (2) To predict rewards and generalize sparse feedback in multi-turn dialogues, ALSO introduces a lightweight neural surrogate to predict rewards from interaction histories, enabling sample-efficient exploration and continuous online adaptation. Experiments on the Sotopia benchmark demonstrate that ALSO consistently outperforms static baselines and existing optimization methods in dynamic environments, validating the effectiveness of adversarial online strategy optimization for building robust social agents.
Chinese Translation
社会模拟为研究社会智能提供了一个引人注目的测试平台,在该平台上,智能体通过多轮对话在不断变化的背景下与适应性对手进行互动。这种环境本质上是非平稳的,要求智能体随着时间的推移动态调整其策略。然而,大多数基于大型语言模型(LLM)的社会智能体依赖于静态角色,而现有的增强社会智能的方法,如离线强化学习或外部规划器,通常假设环境是平稳的,并且会产生大量的训练开销,这使得它们不适合这些设置。为了解决这一问题,我们提出了 \textbf{ALSO}(\textbf{A}dversarial on\textbf{L}ine \textbf{S}trategy \textbf{O}ptimization),这是第一个用于多智能体社会模拟的在线策略优化框架。ALSO通过两个关键贡献推动了社会适应性的发展。(1) ALSO将多轮互动形式化为对抗性老虎机(adversarial bandit)问题,其中静态角色和动态策略指令的组合被视为臂,在不依赖环境稳定性假设的情况下为非平稳性提供了一种有原则的解决方案。(2) 为了预测奖励并在多轮对话中推广稀疏反馈,ALSO引入了一种轻量级神经替代模型,从互动历史中预测奖励,从而实现样本高效的探索和持续的在线适应。在Sotopia基准上的实验表明,ALSO在动态环境中始终优于静态基线和现有优化方法,验证了对抗性在线策略优化在构建鲁棒社会智能体方面的有效性。
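To make the adversarial-bandit framing in the ALSO abstract concrete, the sketch below uses EXP3, a standard adversarial-bandit algorithm, with one arm per (persona, strategy) combination. EXP3 is swapped in here purely for illustration: the abstract does not name the paper's algorithm, and its neural reward surrogate is not reproduced. The class name, `gamma`, and the seeding are assumptions.

```python
import math
import random


class Exp3:
    """EXP3 for adversarial bandits. Rewards are assumed to lie in [0, 1]."""

    def __init__(self, n_arms, gamma=0.1, seed=0):
        self.n_arms = n_arms
        self.gamma = gamma                # exploration rate (assumed value)
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        """Mix the normalized weights with uniform exploration."""
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self):
        """Sample an arm; also return its probability for the update step."""
        probs = self.probabilities()
        arm = self.rng.choices(range(self.n_arms), weights=probs)[0]
        return arm, probs[arm]

    def update(self, arm, prob, reward):
        """Importance-weighted update for the arm that was actually played."""
        estimate = reward / prob
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)
        # Rescale so the largest weight is 1.0; this keeps the numbers bounded
        # without changing the induced probabilities.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]
```

In a dialogue setting each arm would index one persona/strategy pair, and `reward` would be a per-episode social score rescaled to [0, 1]. EXP3 makes no stationarity assumption about how rewards are generated, which matches the motivation stated in the abstract.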
cs.AI / 32 / 2605.15777
SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?
SaaS-Bench:计算机使用代理能否利用现实世界的SaaS解决专业工作流程?
Abstract
Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess the capabilities of agents in realistic professional workflows. Software-as-a-Service (SaaS) environments are a natural choice for CUA evaluation, as they host a large share of modern digital work and naturally involve dynamic system states, cross-application coordination, domain-specific knowledge, and long-horizon dependencies. To this end, we introduce SaaS-Bench, a benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks require long-horizon execution, cover both text-only and multimodal settings, and are evaluated with weighted verification checkpoints that measure strict task completion and partial progress. Experiments show that representative LLM-based agents struggle on SaaS-Bench, with even the strongest model completing fewer than 4% of tasks end-to-end, exposing limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code is available at https://github.com/UniPat-AI/SaaS-Bench for reproduction.
Chinese Translation
计算机使用代理(CUAs)正在迅速将大型语言模型(LLMs)从基于文本的推理扩展到在更复杂环境中执行动作,例如网页浏览器和图形用户界面(GUIs)。然而,现有的网页和GUI代理基准通常依赖于简化的设置、孤立的任务或短期交互,这使得评估代理在现实专业工作流程中的能力变得困难。软件即服务(SaaS)环境是CUA评估的自然选择,因为它们承载了现代数字工作的很大一部分,并自然涉及动态系统状态、跨应用协调、特定领域知识和长期依赖关系。为此,我们引入了SaaS-Bench,这是一个基于六个专业领域中23个可部署SaaS系统构建的基准,包含106个基于现实工作场景的任务。这些任务需要长期执行,涵盖文本和多模态设置,并通过加权验证检查点进行评估,以衡量严格的任务完成和部分进展。实验表明,代表性的基于LLM的代理在SaaS-Bench上表现不佳,即使是最强的模型也仅能完成不到4%的任务,暴露出在规划、状态跟踪、跨应用上下文维护和错误恢复方面的局限性。代码可在https://github.com/UniPat-AI/SaaS-Bench获取以便复现。
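The SaaS-Bench abstract mentions weighted verification checkpoints that yield both a strict completion signal and a partial-progress score. One plausible way to compute the two is sketched below. The `(weight, passed)` representation and the `score_task` name are assumptions, not the benchmark's actual schema.

```python
def score_task(checkpoints):
    """Score one task from its verification checkpoints.

    `checkpoints` is a list of (weight, passed) pairs (a hypothetical format).
    Returns (partial_progress, strictly_completed), where partial_progress is
    the weighted fraction of passed checkpoints and strictly_completed is True
    only if every checkpoint passed.
    """
    total = sum(weight for weight, _ in checkpoints)
    if total <= 0:
        raise ValueError("checkpoint weights must sum to a positive value")
    progress = sum(weight for weight, passed in checkpoints if passed) / total
    completed = all(passed for _, passed in checkpoints)
    return progress, completed
```

Under this scheme an agent that passes most checkpoints but misses one still counts as a failure on the strict metric, which is consistent with the gap between partial progress and end-to-end completion that the abstract reports.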
cs.AI / 33 / 2605.15871
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design
自主发现神经架构:AIRA-Compose 和 AIRA-Design
Abstract
Toward recursive self-improvement, we investigate LLM agents autonomously designing foundation models beyond standard Transformers. We introduce a dual-framework approach: AIRA-Compose for high-level architecture search, and AIRA-Design for low-level mechanistic implementation. AIRA-Compose uses 11 agents to explore fundamental computational primitives under a 24-hour budget. Agents evaluate million-parameter candidates, extrapolating top designs to 350M, 1B, and 3B scales. This yields 14 architectures across two families: AIRAformers (Transformer-based) and AIRAhybrids (Transformer-Mamba). Pre-trained at 1B scale, these consistently outperform Llama 3.2 and Composer-found baselines. On downstream tasks, AIRAformer-D and AIRAhybrid-D improve accuracy by 2.4% and 3.8% over Llama 3.2. Furthermore, AIRA-Compose finds models with highly efficient scaling frontiers: AIRAformer-C scales 54% and 71% faster than Llama 3.2 and Composer's best Transformer, while AIRAhybrid-C outscales Nemotron-2 by 23% and Composer's best hybrid by 37%. AIRA-Design tasks 20 agents with writing novel attention mechanisms for long-range dependencies and high-performing training scripts. On the Long Range Arena benchmark, agent-designed architectures reach within 2.3% and 2.6% of human state-of-the-art on document matching and text classification. On the Autoresearch benchmark, Greedy Opus 4.5 achieves 0.968 validation bits-per-byte under a fixed time budget, surpassing the published minimum. Together, these frameworks show AI agents can autonomously discover architectures and algorithmic optimizations matching or surpassing hand-designed baselines. This establishes a powerful paradigm for discovering next-generation foundation models, marking a clear step toward recursive self-improvement.
Chinese Translation
为了实现递归自我改进,我们研究了大型语言模型(LLM)代理自主设计超越标准变换器(Transformers)的基础模型。我们提出了一种双框架方法:AIRA-Compose 用于高层次架构搜索,AIRA-Design 用于低层次机制实现。AIRA-Compose 利用 11 个代理在 24 小时的预算内探索基本计算原语。代理评估百万参数候选模型,并将最佳设计外推至 3.5 亿、10 亿和 30 亿规模。这产生了 14 种架构,分为两个家族:AIRAformers(基于变换器)和 AIRAhybrids(变换器-曼巴)。在 10 亿规模下进行预训练,这些模型在性能上始终优于 Llama 3.2 和 Composer 基准。针对下游任务,AIRAformer-D 和 AIRAhybrid-D 的准确率分别比 Llama 3.2 提高了 2.4% 和 3.8%。此外,AIRA-Compose 找到了具有高效扩展前沿的模型:AIRAformer-C 的扩展速度比 Llama 3.2 快 54%,比 Composer 的最佳变换器快 71%;而 AIRAhybrid-C 的扩展能力比 Nemotron-2 快 23%,比 Composer 的最佳混合模型快 37%。AIRA-Design 指派 20 个代理编写新颖的注意力机制,以处理长距离依赖和高性能训练脚本。在长距离竞技场(Long Range Arena)基准测试中,代理设计的架构在文档匹配和文本分类上分别达到了人类最先进水平的 2.3% 和 2.6% 的误差。在自研究(Autoresearch)基准测试中,Greedy Opus 4.5 在固定时间预算下实现了 0.968 的验证比特每字节,超越了已发布的最低值。这些框架共同表明,人工智能代理能够自主发现与手工设计基准相匹配或超越的架构和算法优化。这为发现下一代基础模型建立了一个强大的范式,标志着递归自我改进的明确一步。
cs.AI / 34 / 2605.15960
Imperfect World Models are Exploitable
不完美的世界模型是可以被利用的
Abstract
We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.
Chinese Translation
我们提出了一种强化学习中模型利用的新定义。非正式地说,如果一个世界模型暗示某一策略应被严格优于另一策略,而环境的真实转移模型则暗示相反,那么该世界模型就是可以被利用的。我们将我们的定义与之前对奖励黑客(reward hacking)的描述进行类比,但表明相关的不可避免性证明并不适用于利用。为了克服这一障碍,我们发展了一种奖励黑客和模型利用的一般理论,证明在大规模策略集合中,利用本质上是不可避免的,并将黑客作为特例得出相应的结论。不幸的是,我们还发现,保证有限策略集合中不可被黑客攻击的条件没有对应的条件来排除利用。因此,我们引入了一种放宽的利用概念,并推导出一个安全的时间范围,在此范围内可以避免利用。综合来看,我们的结果建立了奖励黑客与模型利用之间的正式桥梁,并阐明了在世界模型中安全规划的局限性。
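The informal definition in the abstract can be written out as follows. The notation is assumed rather than taken from the paper: $J_{M}(\pi)$ denotes the expected return of policy $\pi$ under transition model $M$, $\hat{M}$ is the learned world model, and $\Pi$ is the policy set under consideration. Whether the paper requires the reversed preference to be strict is not stated in the abstract, so the strict inequality below is an assumption.

```latex
\[
\hat{M} \text{ is exploitable on } \Pi
\iff
\exists\, \pi_1, \pi_2 \in \Pi :\;
J_{\hat{M}}(\pi_1) > J_{\hat{M}}(\pi_2)
\;\text{ and }\;
J_{M}(\pi_1) < J_{M}(\pi_2).
\]
```

Read this way, a planner that trusts $\hat{M}$ would pick $\pi_1$ over $\pi_2$ even though the real environment ranks them the other way.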
cs.AI / 35 / 2605.15963
PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
PAGER:弥合点精确几何图形用户界面控制中的语义-执行差距
Abstract
Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.
Chinese Translation
大型视觉-语言模型显著推动了图形用户界面(GUI)代理的发展,使其能够在网页、移动和桌面界面之间进行可执行交互。然而,这些进展在很大程度上依赖于宽容的区域容忍范式,其中同一组件内的许多相邻像素仍然有效。精确的几何构造打破了这一假设:动作必须落在连续画布空间中的点上,而不是宽容区域。由于几何原语具有本体依赖性,局部坐标误差可能引发级联拓扑失败,从而扭曲下游对象并使最终构造失效。我们将这一范畴定义为对精度敏感的GUI任务,要求点级精度、几何感知验证和对依赖驱动错误传播的鲁棒性。为了对其进行基准测试,我们引入了PAGE Bench,包含4906个问题和超过224K个过程监督的像素级GUI动作。我们进一步提出了PAGER,一个拓扑感知代理,它将构造分解为依赖结构规划和像素级执行。基于像素的监督调优建立了可执行动作语法,而精度对齐的强化学习通过状态条件的几何反馈减轻了因回滚引起的曝光偏差。实验揭示了显著的语义-执行差距:通用多模态模型的动作类型准确率可超过88%,但任务成功率仍低于6%。PAGER缩小了这一差距,任务成功率比最强的评估通用基线高出4.1倍,并将GUI专用代理的步骤成功率从低于9%提高到超过62%,为点精确的GUI控制建立了新的技术领先水平。
cs.AI / 36 / 2605.15967
Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning
确定性事件图基础作为反事实推理的世界模型
Abstract
We study event-graph substrates: a class of world models that represent agent state as an append-only log of typed RDF triples and answer counterfactual queries by forking the log under a structured intervention vocabulary. Substrates are inspectable at the triple level, support exact counterfactuals, and transfer across domains without learned components. We formalize the class, prove a duality between explanatory and counterfactual queries that reduces both to the same causal-ancestor traversal, and evaluate a 1,400-line CLEVRER-DSL interpreter atop a domain-agnostic substrate runtime at full CLEVRER validation scale (n=75,618). The substrate exceeds the NS-DR symbolic oracle on all four per-question categories (by 9.89, 20.26, 17.65, and 0.80 percentage points), and exceeds the parametric ALOE baseline on descriptive and explanatory while lagging on predictive and counterfactual. We also introduce twin-EventLog, a 500-specification Park-canonical Smallville counterfactual benchmark on which the substrate exceeds Llama-3.1-8B with full context by 18.80 points joint accuracy.
Chinese Translation
我们研究事件图基础:一种世界模型类别,通过将代理状态表示为仅追加的类型化RDF三元组日志,并通过在结构化干预词汇下分叉日志来回答反事实查询。该基础在三元组级别上可检查,支持精确的反事实,并且在没有学习组件的情况下跨领域转移。我们对该类别进行了形式化,证明了解释性查询与反事实查询之间的对偶关系,将两者都简化为相同的因果祖先遍历,并在全CLEVRER验证规模(n=75,618)下评估了一个1,400行的CLEVRER-DSL解释器,构建在一个领域无关的基础运行时上。该基础在所有四个每问题类别上超过了NS-DR符号神谕(分别提高了9.89、20.26、17.65和0.80个百分点),并在描述性和解释性上超过了参数化的ALOE基线,而在预测性和反事实上则落后。我们还引入了twin-EventLog,这是一个500个规范的Park-canonical Smallville反事实基准,在该基准上,基础在完整上下文下超过了Llama-3.1-8B,提升了18.80点的联合准确率。
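The append-only log, the forking operation, and the shared causal-ancestor traversal described in the abstract can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `Event` and `EventLog` names are hypothetical, triples are plain tuples rather than typed RDF terms, and the only intervention modeled is removing one event together with everything causally downstream of it, with no re-simulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    eid: int
    triple: tuple        # (subject, predicate, object)
    causes: tuple = ()   # ids of earlier events that caused this one


class EventLog:
    def __init__(self, events=()):
        self.events = list(events)

    def append(self, triple, causes=()):
        """Append-only write; ids increase monotonically."""
        eid = self.events[-1].eid + 1 if self.events else 0
        self.events.append(Event(eid, tuple(triple), tuple(causes)))
        return eid

    def ancestors(self, eid):
        """Explanatory query: every event that causally precedes `eid`."""
        by_id = {e.eid: e for e in self.events}
        seen, stack = set(), list(by_id[eid].causes)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(by_id[current].causes)
        return seen

    def fork_without(self, removed):
        """Counterfactual query: a new log lacking `removed` and its effects."""
        kept = [e for e in self.events
                if e.eid != removed and removed not in self.ancestors(e.eid)]
        return EventLog(kept)
```

Both query types call the same `ancestors` traversal: an explanation reads the ancestor set of one event, while the fork filters out every event whose ancestor set contains the removed event. The original log is never mutated by a fork.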
cs.AI / 37 / 2605.15975
Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning
基于符号世界模型的双层策略学习用于长时间规划
Abstract
We tackle the challenge of building embodied AI agents that can reliably solve long-horizon planning problems. Imitation learning from demonstrations has shown itself to be effective in training robots to solve a diversity of complex tasks requiring fine motor control and manipulation over low-level (LL), continuous environments. Yet, it remains a difficult endeavour to generate long-horizon plans from imitation learning alone. In contrast, high-level (HL), symbolic abstractions facilitate efficient and interpretable long-horizon planning. We propose to combine the strengths of LL imitation learning for manipulation and control, and HL symbolic abstractions for long-horizon planning. We realise this idea via \emph{bilevel policies} of the form $(\pi^{\mathrm{hl}}, \pi^{\mathrm{ll}})$, consisting of a neural policy $\pi^{\mathrm{ll}}$ learned from LL demonstrations, and an HL symbolic policy $\pi^{\mathrm{hl}}$ that is constructed from symbolic abstractions of the LL demonstrations combined with inductive generalisation. We implement these ideas in the BISON system. Experiments on extended MetaWorld benchmarks demonstrate that BISON generalises to long horizons and problems with greater numbers of objects than those solved by VLA and end-to-end methods, and is more time and memory efficient in training and inference. Notably, when ignoring LL execution, BISON's HL policies can solve HL problems with 10,000 relevant objects in under a minute. Project page: https://dillonzchen.github.io/bison
Chinese Translation
我们解决了构建能够可靠地解决长时间规划问题的具身人工智能代理的挑战。从示范中进行模仿学习已被证明在训练机器人解决需要精细运动控制和操作的多样复杂任务方面是有效的,尤其是在低层(LL)连续环境中。然而,仅依靠模仿学习生成长时间规划仍然是一项困难的工作。相比之下,高层(HL)符号抽象促进了高效且可解释的长时间规划。我们提议结合低层模仿学习在操作和控制方面的优势,以及高层符号抽象在长时间规划中的应用。我们通过形式为 $( heta^{ ext{hl}}, heta^{ ext{ll}})$ 的双层策略实现这一构想,其中包括从低层示范中学习的神经策略 $ heta^{ ext{ll}}$,以及从低层示范的符号抽象结合归纳泛化构建的高层符号策略 $ heta^{ ext{hl}}$。我们在BISON系统中实现了这些思想。在扩展的MetaWorld基准上的实验表明,BISON能够推广到比VLA和端到端方法解决的更多对象的长时间和更复杂的问题,并且在训练和推理中更加节省时间和内存。值得注意的是,当忽略低层执行时,BISON的高层策略可以在不到一分钟的时间内解决具有10,000个相关对象的高层问题。项目页面:https://dillonzchen.github.io/bison
cs.AI / 38 / 2605.15983
Petri Net Induced Heuristic Search for Resource Constrained Scheduling
基于 Petri 网的资源约束调度启发式搜索
Abstract
We formulate the Resource-Constrained Project Scheduling Problem (RCPSP) as optimal search over the reachability graph of a Timed Transition Petri Net with Resources, using relative-delay tokens so that scheduling decisions correspond to transition firings in the induced state space. We solve the resulting problem with $A^*$ guided by a heuristic that combines Critical Path and resource-based lower bounds, and prove that it is consistent under our token-based time semantics. Experiments on the PSPLIB benchmarks show that the approach outperforms strong exact Mixed-Integer Linear Programming (MIP) baselines (SCIP, CBC) in both success rate and solve time. Per-instance analysis shows that heuristic search and MIP degrade along independent axes, resource tightness for $A^*$ and formulation size for MIP, with resource strength mediating which solver benefits from scale.
Chinese Translation
我们将资源约束项目调度问题(RCPSP)表述为在带有资源的定时变迁 Petri 网的可达性图上的最优搜索,使用相对延迟令牌,使得调度决策对应于诱导状态空间中的变迁触发。我们使用 $A^*$ 算法解决所得到的问题,并通过结合关键路径和基于资源的下界的启发式方法来指导搜索,证明在我们的基于令牌的时间语义下该启发式是一致的。在 PSPLIB 基准测试中的实验表明,该方法在成功率和求解时间上均优于强大的精确混合整数线性规划(MIP)基线(SCIP,CBC)。逐实例分析显示,启发式搜索和 MIP 在相互独立的维度上退化:$A^*$ 受资源紧张度影响,MIP 受模型规模影响,而资源强度调节了哪个求解器从规模中受益。
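The abstract describes a heuristic that combines a critical-path bound with a resource-based bound. The following is a minimal sketch of that general idea, assuming the textbook forms of both bounds (longest remaining precedence chain, and total remaining work per resource divided by capacity). It is not the paper's heuristic: the Petri net state, the relative-delay tokens, and in-progress activities are all ignored, and every name and the toy instance are hypothetical.

```python
from functools import lru_cache
from math import ceil


def critical_path_bound(durations, successors, unscheduled):
    """Length of the longest precedence chain starting from any unscheduled activity."""
    @lru_cache(maxsize=None)
    def tail(activity):
        # Duration of this activity plus the longest chain among its successors.
        longest_after = max((tail(s) for s in successors.get(activity, ())), default=0)
        return durations[activity] + longest_after

    return max((tail(a) for a in unscheduled), default=0)


def resource_bound(durations, demand, capacity, unscheduled):
    """Largest value, over resources, of remaining work divided by capacity (rounded up)."""
    best = 0
    for resource, cap in capacity.items():
        work = sum(durations[a] * demand[a].get(resource, 0) for a in unscheduled)
        best = max(best, ceil(work / cap))
    return best


def lower_bound(durations, successors, demand, capacity, unscheduled):
    """The maximum of two lower bounds on remaining makespan is itself a lower bound."""
    return max(
        critical_path_bound(durations, successors, unscheduled),
        resource_bound(durations, demand, capacity, unscheduled),
    )


# Hypothetical toy instance: A must precede C; every activity needs the full resource.
DURATIONS = {"A": 3, "B": 2, "C": 4}
SUCCESSORS = {"A": ["C"]}
DEMAND = {"A": {"r": 2}, "B": {"r": 2}, "C": {"r": 2}}
CAPACITY = {"r": 2}
ALL = ("A", "B", "C")

cp = critical_path_bound(DURATIONS, SUCCESSORS, ALL)
lb = lower_bound(DURATIONS, SUCCESSORS, DEMAND, CAPACITY, ALL)
print(cp, lb)
```

In this toy instance the resource bound dominates, because no two activities can run in parallel. Taking the maximum keeps the estimate admissible whenever each component is; whether the combination is also consistent depends on the state semantics, which is the property the paper proves for its own formulation.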
cs.AI / 39 / 2605.16024
ScreenSearch: Uncertainty-Aware OS Exploration
ScreenSearch:基于不确定性的操作系统探索
Abstract
Desktop GUI agents operate under partial observability: visually similar screens can correspond to different underlying workflow states, so locally plausible actions can lead to sharply different outcomes. We frame this as a problem of computer/OS state exploration, where effective behavior requires both expanding the reachable frontier and reducing ambiguity before committing. We present ScreenSearch, a system that combines structural screen retrieval and deduplication with an ambiguity-aware PUCT graph-bandit for large-scale desktop exploration. The retrieval layer converts UIA trees into location-aware structural features, indexes related screens through sparse token search and metadata filters, and maintains a shared deduplicated state graph across VM workers. On top of this graph, we define a scalable ambiguity signal based on matched-action outcome dispersion. If similar screens produce different next states under the same action signature, the state should be probed further rather than treated as resolved. We use this signal together with frontier rewards to drive large-scale exploration and replay-start policy evaluation over the shared graph. Across 11 desktop applications, ScreenSearch collects over 1M screenshots and over 30K deduplicated states, yielding large exploration corpora with substantial cross-application and within-application diversity. On a fixed replay-start slice, we observe a clear novelty--ambiguity trade-off: some policies reduce ambiguity quickly while discovering little frontier. Ambiguity reduction alone is therefore not a sufficient exploration objective. Appendix ablations show that stronger proposal priors can materially improve unique-state discovery during corpus building. These results suggest that state identity, proposal quality, and ambiguity-aware search all matter when deciding when to probe and when to commit.
Chinese Translation
桌面图形用户界面(GUI)代理在部分可观察性下运行:视觉上相似的屏幕可能对应于不同的底层工作流状态,因此局部可行的操作可能导致截然不同的结果。我们将其框架视为计算机/操作系统状态探索的问题,其中有效的行为需要在承诺之前扩展可达边界并减少模糊性。我们提出了ScreenSearch,一个结合了结构化屏幕检索与去重的系统,并使用不确定性感知的PUCT图带算法进行大规模桌面探索。检索层将用户界面自动化(UIA)树转换为位置感知的结构特征,通过稀疏标记搜索和元数据过滤器索引相关屏幕,并在虚拟机(VM)工作者之间维护一个共享的去重状态图。在此图的基础上,我们定义了一种可扩展的不确定性信号,基于匹配操作结果的离散度。如果相似的屏幕在相同的操作签名下产生不同的下一个状态,则该状态应进一步探测,而不是视为已解决。我们将该信号与边界奖励结合使用,以驱动大规模探索和在共享图上的重放启动策略评估。在11个桌面应用程序中,ScreenSearch收集了超过100万张屏幕截图和超过3万个去重状态,生成了具有显著跨应用和应用内多样性的庞大探索语料库。在固定的重放启动切片上,我们观察到明显的新颖性与模糊性之间的权衡:某些策略快速减少模糊性,但发现的边界很少。因此,仅仅减少模糊性并不足以作为探索目标。附录中的消融实验表明,较强的提议先验可以显著改善语料库构建过程中的唯一状态发现。这些结果表明,在决定何时探测和何时承诺时,状态身份、提议质量和不确定性感知搜索都至关重要。
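The ambiguity signal above is described only as "matched-action outcome dispersion". One way to picture it, as a sketch under stated assumptions: group observed transitions by a screen cluster and an action signature, then score each group by the Shannon entropy of its next states. Using entropy as the dispersion measure, and the cluster and action names below, are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter, defaultdict
from math import log2


def outcome_dispersion(transitions):
    """Map (screen_cluster, action_signature) to the entropy, in bits, of observed next states."""
    outcomes = defaultdict(Counter)
    for cluster, action, next_state in transitions:
        outcomes[(cluster, action)][next_state] += 1

    scores = {}
    for key, counts in outcomes.items():
        total = sum(counts.values())
        # Zero when the action always leads to the same next state.
        scores[key] = -sum((c / total) * log2(c / total) for c in counts.values())
    return scores


# Hypothetical log: the same click on look-alike dialogs leads to two different states.
TRANSITIONS = [
    ("save_dialog", "click:OK", "editor"),
    ("save_dialog", "click:OK", "overwrite_prompt"),
    ("menu", "click:File", "file_menu"),
    ("menu", "click:File", "file_menu"),
]

SCORES = outcome_dispersion(TRANSITIONS)
for key, score in SCORES.items():
    print(key, round(score, 3))
```

A group with high dispersion is a candidate for further probing, while a group with zero dispersion can be treated as resolved under this simplified reading.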
cs.AI / 40 / 2605.16052
Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law
推理者还是翻译者?考虑污染的评估与税法中的神经符号鲁棒性
Abstract
Recent advances in large language models (LLMs) have significantly enhanced automated legal reasoning. Yet, it remains unclear whether their performance reflects genuine legal reasoning ability or artifacts of data contamination. We present a comprehensive empirical study of tax law reasoning approaches and implement a contamination detection protocol to rigorously assess LLM reliability. We show that performance can be inflated by contamination. Building on this analysis, we conduct a systematic evaluation, comparing monolithic LLMs with hybrid systems that translate statutory text into formal representations and delegate inference to symbolic solvers. We build a novel test suite designed to probe generalization to unseen documents via case and rule variations. Our findings indicate that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable and robust foundation for legal AI, as well as improved generalization to unobserved situations.
Chinese Translation
近期大型语言模型(LLMs)的进展显著提升了自动化法律推理的能力。然而,目前尚不清楚它们的表现是否反映了真正的法律推理能力,还是数据污染的伪影。我们呈现了一项关于税法推理方法的全面实证研究,并实施了一种污染检测协议,以严格评估LLM的可靠性。我们的研究表明,污染可能会夸大性能。在此分析的基础上,我们进行了一项系统评估,比较了单一的LLM与将法定文本翻译为形式化表示并将推理委托给符号求解器的混合系统。我们构建了一套新颖的测试套件,旨在通过案例和规则变体探测对未见文档的泛化。我们的发现表明,法律推理本质上是组合性的,而神经符号框架为法律人工智能提供了更可靠和稳健的基础,并改善了对未观察情况的泛化能力。
cs.AI / 41 / 2605.16103
Sign-Separated Finite-Time Error Analysis of Q-Learning
Q学习的符号分离有限时间误差分析
Abstract
This paper develops a sign-separated finite-time error analysis for constant step-size Q-learning. Starting from the switching-system representation, the error is decomposed into its componentwise negative and positive parts. The negative part is dominated by a lower comparison linear time-invariant (LTI) system associated with a fixed optimal policy, whereas the positive part is controlled by a linear switching system. The resulting bounds show that the negative-side LTI certificate is no slower than the positive-side switching certificate and may produce a faster exponential envelope. The analysis identifies a max-induced asymmetry in Q-learning error dynamics. This asymmetry is connected to overestimation: positive action-wise errors can be selected and propagated by the Bellman maximum, whereas negative errors admit an optimal-policy lower comparison. Finite-time bounds are provided for both deterministic and stochastic constant-step-size recursions.
Chinese Translation
本文针对常步长Q学习发展了一种符号分离的有限时间误差分析。基于切换系统表示,误差被分解为其分量的负部分和正部分。负部分由与固定最优策略相关的下比较线性时不变(LTI)系统主导,而正部分则由线性切换系统控制。结果显示,负侧LTI证书的速度不慢于正侧切换证书,并且可能产生更快的指数包络。该分析识别了Q学习误差动态中的最大诱导不对称性。这种不对称性与高估有关:正向动作误差可以通过贝尔曼最大值被选择和传播,而负误差则允许一个最优策略下的下比较。为确定性和随机常步长递归提供了有限时间界限。
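The sign separation can be made concrete on a toy problem. The sketch below assumes a hypothetical two-state, two-action MDP and the deterministic (synchronous, model-based) constant-step-size recursion $Q_{k+1} = Q_k + \alpha (T Q_k - Q_k)$, where $T$ is the Bellman optimality operator. It only illustrates the componentwise decomposition of the error into positive and negative parts; it does not reproduce the paper's comparison systems or bounds.

```python
GAMMA = 0.9   # discount factor
ALPHA = 0.5   # constant step size
STATES = (0, 1)
ACTIONS = (0, 1)

# Hypothetical MDP: P[(s, a)] is a distribution over next states, R[(s, a)] a reward.
P = {(0, 0): [1.0, 0.0], (0, 1): [0.2, 0.8], (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0]}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 2.0}


def bellman(q):
    """Apply the Bellman optimality operator to a Q-table."""
    v = [max(q[(s, a)] for a in ACTIONS) for s in STATES]
    return {sa: R[sa] + GAMMA * sum(p * v[s2] for s2, p in enumerate(P[sa])) for sa in P}


def q_step(q, alpha=ALPHA):
    """One deterministic constant-step-size update."""
    target = bellman(q)
    return {sa: q[sa] + alpha * (target[sa] - q[sa]) for sa in q}


def solve_q_star(iters=2000):
    """Approximate Q* by plain value iteration."""
    q = {sa: 0.0 for sa in P}
    for _ in range(iters):
        q = bellman(q)
    return q


def split_error(q, q_star):
    """Split the error Q - Q* into componentwise positive and negative parts."""
    err = {sa: q[sa] - q_star[sa] for sa in q}
    pos = {sa: max(e, 0.0) for sa, e in err.items()}
    neg = {sa: min(e, 0.0) for sa, e in err.items()}
    return pos, neg


Q_STAR = solve_q_star()
Q0 = {(0, 0): 50.0, (0, 1): -50.0, (1, 0): 50.0, (1, 1): -50.0}

q = dict(Q0)
for k in range(200):
    if k % 50 == 0:
        pos, neg = split_error(q, Q_STAR)
        print(k, max(pos.values()), min(neg.values()))
    q = q_step(q)
```

Printing the largest positive and the most negative component side by side gives a quick empirical view of the two envelopes. Any difference seen on this toy problem is only an illustration of the asymmetry, not evidence for the stated bounds.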
cs.AI / 42 / 2605.16116
ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
ShopGym:一个用于电子商务网络代理的真实模拟和可扩展基准测试的综合框架
Abstract
Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non-stationary, difficult to inspect, and irreproducible, while hand-built sandbox benchmarks provide control but cover only a narrow range of layouts, catalogs, policies, and interaction patterns. We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. We introduce ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. ShopGym is a framework for constructing e-commerce simulation environments and grounded benchmark tasks. Its simulation layer, ShopArena, converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. We validate the framework through graph-based structural analysis and agent-based behavioral evaluation with 224 generated tasks across six sandbox shops: three constructed with synthetic data and three with real data. Our results show that the synthetic shops preserve key structural properties of live storefronts, with agent performance on synthetic shops positively correlated with performance on live storefronts.
Chinese Translation
开发和评估电子商务网络代理需要保持有意义的任务结构的环境,同时能够进行可控、可重复和可扩展的科学比较。现有的方法论迫使我们做出权衡:实时店面提供了现实性,但是不稳定的、难以检查和不可重复的,而手工构建的沙盒基准测试提供了控制,但仅覆盖狭窄的布局、目录、策略和交互模式。我们认为核心瓶颈在于方法论:该领域缺乏一种可扩展的方式来构建同时真实、多样、可控、可检查和可重复的评估设置。我们引入了ShopGym,一个用于电子商务网络代理的真实模拟和可扩展基准测试的综合框架。ShopGym是一个构建电子商务模拟环境和基础基准任务的框架。其模拟层ShopArena通过匿名化的商店规格和分阶段的验证生成过程,将实时种子店面转换为自包含的沙盒商店。在这些模拟的店面之上,ShopGuru在七个技能类别中合成基准任务,将每个任务与商店的目录、导航结构、政策和交互能力相结合。ShopArena和ShopGuru共同生成自包含、可重置、可检查和稳定的评估工件,保留与购物任务相关的结构属性和代理评估信号。我们通过基于图的结构分析和基于代理的行为评估对框架进行了验证,涉及在六个沙盒商店中生成的224个任务:三个使用合成数据构建,三个使用真实数据构建。我们的结果表明,合成商店保留了实时店面的关键结构属性,代理在合成商店上的表现与在实时店面上的表现呈正相关。
cs.AI / 43 / 2605.16142
Property-Guided LLM Program Synthesis for Planning
基于属性引导的LLM程序合成用于规划
Abstract
LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers no guidance on why a program failed, the system must generate and evaluate many candidates hoping some succeed, increasing LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads a hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.
Chinese Translation
大型语言模型(LLMs)在程序合成方面取得了令人瞩目的成功,发现的程序超越了以往的解决方案。然而,这些方法依赖简单的数值评分来指示程序质量,例如解决方案的值或通过测试的数量。由于评分无法提供程序失败的原因指导,系统必须生成并评估许多候选程序,希望其中一些能够成功,从而增加了LLM推理和评估的成本。我们研究了一种不同的方法:基于属性引导的LLM程序合成。我们不再在评估后对程序进行评分,而是检查候选程序是否满足正式定义的属性。当属性被违反时,我们提前停止评估,并向LLM提供一个具体的反例,准确展示程序失败的原因。这种反馈显著减少了程序生成的数量和评估成本,并可以引导LLM生成更强的程序。我们在PDDL规划领域评估了这种方法,要求LLM合成直接启发式函数:每个通过严格改进转移可达的状态都有一个严格改进的后继状态。具有此属性的启发式函数可以直接将爬山算法引导到目标状态。反例引导的修复循环生成一个候选程序,在训练集上检查属性,并返回第一个违反该属性的案例。我们在十个规划领域以及一个分布外测试集上评估了我们的方法。合成的启发式函数在几乎所有测试任务中有效地表现为直接,并且与最佳的先前生成方法相比,我们的方法在每个领域平均生成的程序数量减少了七倍,解决了更多任务而无需搜索,并且评估候选程序所需的计算量减少了几个数量级。每当问题允许可验证的属性时,基于属性引导的LLM合成可以降低成本并提高程序质量。
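The directness property stated above lends itself to a small checker. The sketch below assumes a finite state space, that goal states are exempt from the successor requirement, and a hypothetical one-dimensional toy domain. It returns the first violating state as a counterexample, mirroring the early-stop feedback described in the abstract without claiming to match the paper's implementation.

```python
from collections import deque


def find_counterexample(initial, successors, h, is_goal):
    """Return a non-goal state, reachable via strictly improving transitions,
    that has no strictly improving successor; return None if there is none."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if is_goal(state):
            continue
        improving = [t for t in successors(state) if h(t) < h(state)]
        if not improving:
            return state  # stop early: this state breaks the property
        for t in improving:
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return None


# Hypothetical toy domain: walk along positions 0..5, with the goal at 5.
def line_successors(s):
    return [t for t in (s - 1, s + 1) if 0 <= t <= 5]


def is_goal(s):
    return s == 5


def h_good(s):
    return 5 - s  # distance to the goal: strictly decreases on every step right


def h_bad(s):
    return 0 if s == 3 else abs(5 - s)  # spurious local minimum at position 3


print(find_counterexample(0, line_successors, h_good, is_goal))
print(find_counterexample(0, line_successors, h_bad, is_goal))
```

In a repair loop, the returned state would be handed back to the LLM together with the heuristic values of the state and its successors, so the next candidate can address that specific failure.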
cs.AI / 44 / 2605.16143
Look Before You Leap: Autonomous Exploration for LLM Agents
三思而后行:大语言模型代理的自主探索
Abstract
Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information-gathering from task execution: agents first utilize an interaction budget to acquire grounded environmental knowledge, then leverage it for task resolution. Our results demonstrate that learning to systematically explore is imperative for building generalizable and real-world-ready agents.
Chinese Translation
基于大型语言模型的代理在不熟悉的环境中常常因过早利用先前知识而失败,这种倾向在于在获取足够的环境特定信息之前就采取行动。我们将自主探索视为构建自适应代理的一个关键但尚未充分探索的能力。为了形式化和量化这一能力,我们引入了探索检查点覆盖(Exploration Checkpoint Coverage),这是一个可验证的指标,用于衡量代理发现关键状态、对象和可用性(affordances)的广度。我们的系统评估表明,使用标准任务导向强化学习训练的代理表现出狭窄且重复的行为,这妨碍了下游性能。为了解决这一局限性,我们开发了一种训练策略,该策略交替进行任务执行回放和探索回放,每种回放类型均通过其对应的可验证奖励进行优化。在此训练策略的基础上,我们提出了探索-再行动(Explore-then-Act)范式,该范式将信息收集与任务执行解耦:代理首先利用交互预算获取扎根的环境知识,然后利用这些知识解决任务。我们的结果表明,系统性探索的学习对于构建可推广和适应现实世界的代理至关重要。
cs.AI / 45 / 2605.16153
An Algebraic Exposition of the Theory of Dyadic Morality
双重道德理论的代数阐述
Abstract
This paper provides an algebraic exposition of the theory of dyadic morality (TDM), a psychological model of moral judgment grounded in a simple two-node template: an intentional agent causing harm to a vulnerable patient. We formalize TDM using structural causal modeling (SCM) notation and identify three psychological operators (typecasting operator, completion operator, and valence-dependent inference mechanism) that extend standard SCM to capture how people compute moral judgments under constraints. We address scalability challenges arising from TDM's dyadic limitation, showing how moral cognition compresses multi-node scenarios through node collapse and sequential processing. Drawing on this algebraic framework, we demonstrate concrete applications to AI policy design: detecting conflicting obligations, structuring helpfulness policies to preserve user agency, and designing post-failure communication as causal interventions. Finally, we recommend scoped, contextual measurement of mind perception over universal averaging to operationalize the theory empirically. This algebraic formalization enables neurosymbolic AI systems to compute morality in a way that is both mathematically rigorous and faithful to human moral cognition.
Chinese Translation
本文提供了双重道德理论(TDM)的代数阐述,这是一种基于简单双节点模板的道德判断心理模型:一个意图明确的主体对一个脆弱的受害者造成伤害。我们使用结构因果建模(SCM)符号对TDM进行形式化,并识别出三种心理操作符(类型化操作符、完成操作符和效价依赖推理机制),这些操作符扩展了标准SCM,以捕捉人们在约束条件下如何计算道德判断。我们解决了由于TDM的双重限制而产生的可扩展性挑战,展示了道德认知如何通过节点压缩和顺序处理来简化多节点场景。基于这一代数框架,我们展示了对人工智能政策设计的具体应用:检测冲突义务、构建有助于保持用户自主性的帮助政策,以及设计作为因果干预的失败后沟通。最后,我们建议在理论的实证化过程中采用有范围的、情境化的心智感知测量,而非普遍平均。这一代数形式化使神经符号人工智能系统能够以既数学严谨又忠实于人类道德认知的方式计算道德。
cs.AI / 46 / 2605.16198
Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems
形式化方法与大型语言模型的结合:高级人工智能系统合规性的审计、监控与干预
Abstract
We examine one particular dimension of AI governance: how to monitor and audit AI-enabled products and services throughout the AI development lifecycle, from pre-deployment testing to post-deployment auditing. Combining principles from formal methods with SoTA machine learning, we propose techniques that enable AI-enabled product and service developers, as well as third party AI developers and evaluators, to perform offline auditing and online (runtime) monitoring of product-specific (temporally extended) behavioral constraints such as safety constraints, norms, rules and regulations with respect to black-box advanced AI systems, notably LLMs. We further provide practical techniques for predictive monitoring, such as sampling-based methods, and we introduce intervening monitors that act at runtime to preempt and potentially mitigate predicted violations. Experimental results show that by exploiting the formal syntax and semantics of Linear Temporal Logic (LTL), our proposed auditing and monitoring techniques are superior to LLM baseline methods in detecting violations of temporally extended behavioral constraints; with our approach, even small-model labelers match or exceed frontier LLM judges. Our predictive and intervening monitors significantly reduce the violation rates of LLM-based agents while largely preserving task performance. We further show through controlled experiments that LLMs' temporal reasoning shows a pronounced degradation in accuracy with increasing event distance, number of constraints, and number of propositions.
Chinese Translation
我们考察人工智能治理的一个特定维度:如何在人工智能开发生命周期内监控和审计人工智能驱动的产品和服务,从部署前测试到部署后审计。结合形式化方法的原则与最先进的机器学习技术,我们提出了使人工智能产品和服务开发者,以及第三方人工智能开发者和评估者能够对特定产品的(时间扩展的)行为约束(如安全约束、规范、规则和法规)进行离线审计和在线(运行时)监控的技术,特别是针对黑箱高级人工智能系统,尤其是大型语言模型(LLMs)。我们进一步提供了预测监控的实用技术,例如基于采样的方法,并引入了在运行时进行干预的监控器,以预防和潜在减轻预测到的违规行为。实验结果表明,通过利用线性时序逻辑(LTL)的形式语法和语义,我们提出的审计和监控技术在检测时间扩展的行为约束违规方面优于大型语言模型的基线方法;在我们的方法下,即使是小模型标注者的表现也能与前沿大型语言模型评判者相匹敌或超越。我们的预测和干预监控器显著降低了基于大型语言模型的代理的违规率,同时在很大程度上保持了任务性能。我们还通过控制实验表明,随着事件距离、约束数量和命题数量的增加,大型语言模型的时间推理准确性显著下降。
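To illustrate what runtime monitoring of an LTL constraint can look like, the sketch below uses formula progression, a standard technique swapped in here for illustration; the abstract does not say how the paper's monitors are built. The tuple encoding of formulas, the proposition names, and the example constraint ("no delete until confirm") are all assumptions.

```python
TRUE, FALSE = ("true",), ("false",)


def neg(f):
    if f == TRUE:
        return FALSE
    if f == FALSE:
        return TRUE
    return ("not", f)


def conj(f, g):
    if FALSE in (f, g):
        return FALSE
    if f == TRUE:
        return g
    if g == TRUE:
        return f
    return ("and", f, g)


def disj(f, g):
    if TRUE in (f, g):
        return TRUE
    if f == FALSE:
        return g
    if g == FALSE:
        return f
    return ("or", f, g)


def progress(f, state):
    """Rewrite formula f into what must hold from the next step on, given the current state."""
    op = f[0]
    if op in ("true", "false"):
        return f
    if op == "ap":
        return TRUE if f[1] in state else FALSE
    if op == "not":
        return neg(progress(f[1], state))
    if op == "and":
        return conj(progress(f[1], state), progress(f[2], state))
    if op == "or":
        return disj(progress(f[1], state), progress(f[2], state))
    if op == "next":
        return f[1]
    if op == "until":
        return disj(progress(f[2], state), conj(progress(f[1], state), f))
    if op == "always":
        return conj(progress(f[1], state), f)
    if op == "eventually":
        return disj(progress(f[1], state), f)
    raise ValueError(f"unknown operator: {op}")


def monitor(formula, trace):
    """Return (verdict, index) where the verdict is first decided, else 'undecided'."""
    for i, state in enumerate(trace):
        formula = progress(formula, state)
        if formula == FALSE:
            return ("violated", i)
        if formula == TRUE:
            return ("satisfied", i)
    return ("undecided", len(trace))


# Hypothetical constraint: the agent must not delete anything until the user confirms.
SPEC = ("until", ("not", ("ap", "delete")), ("ap", "confirm"))

print(monitor(SPEC, [{"read"}, {"delete"}]))
print(monitor(SPEC, [{"read"}, {"confirm"}, {"delete"}]))
print(monitor(SPEC, [{"read"}]))
```

Each step is a set of propositions that hold at that point of the agent's trace. In a setup like the one described, a labeler (possibly a small model) would map raw agent events to these propositions, and the symbolic monitor would do the temporal reasoning.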
cs.AI / 47 / 2605.16205
Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP
上下文、推理与层次结构:复合 LLM 代理设计在对抗性 POMDP 中的成本-性能研究
Abstract
Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4$\times$ worse mean return while using 1.8-2.7$\times$ more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.
Chinese Translation
在对抗性、部分可观察的序列环境中部署复合 LLM 代理需要在多个设计维度中进行权衡:(1)代理所见的内容,(2)其推理方式,以及(3)任务在各个组件之间的分解方式。然而,实践者缺乏关于哪些设计选择能够提高性能而不仅仅是增加推理成本的指导。我们在 CybORG CAGE-2 中进行了复合 LLM 代理设计的受控研究,该环境被建模为部分可观察的马尔可夫决策过程(POMDP)。由于奖励为非正值,因此所有配置均在故障缓解模式下运行。我们的评估涵盖了五个模型系列、六个模型和十二个配置(3,475 个回合),并进行了基于令牌级别的成本核算。我们变化了上下文表示(原始观察与具有压缩历史的确定性状态跟踪层)、深思熟虑(自我提问、自我批评和自我改进工具,带有可选的思维链提示)以及层次分解(单体 ReAct 与委派给专门子代理)。我们发现:(1)程序化状态抽象在每个令牌支出中提供了最大的回报(RPTS),比原始观察提高了多达 76% 的平均回报。(2)在层次结构中分配深思熟虑工具相较于单独的层次结构会降低性能,对于所有五个模型系列,平均回报最多下降 3.4 倍,同时使用的令牌数量增加 1.8-2.7 倍。我们将这种破坏性模式称为深思熟虑级联。(3)在没有深思熟虑的情况下,层次分解为大多数模型实现了最佳的绝对性能,而上下文工程通常比深思熟虑更具成本效益。这些发现为结构化对抗性 POMDP 提出了一个设计原则:投资于程序化基础设施和清晰的任务分解,而不是更深入的每代理推理,因为这些策略在结合时可能会相互干扰。
cs.AI / 48 / 2605.16207
Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most
确认正确,忽视其余:大型语言模型辅导智能体在反馈至关重要的地方挣扎
Abstract
Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution--feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited for hybrid architectures where KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.
Chinese Translation
有效的辅导需要区分最佳、有效但次优和不正确的学生解答,这一区分对于智能辅导系统(ITS)至关重要,但尚未在基于大型语言模型(LLM)的辅导中进行测试。随着LLM越来越多地被探索作为ITS的对话补充,评估其诊断精度变得至关重要。我们展示了七个LLM反馈智能体在命题逻辑中的基准,使用知识图谱推导的真实数据,涵盖了10,836个解答-反馈对和三种反馈条件。模型在最佳步骤上达到了接近上限的表现,但系统性地过度拒绝有效但次优的推理,并过度验证不正确的解答,恰恰是在自适应辅导最为重要的地方。这些失败在不同模型中持续存在,无论解答的上下文如何,表明其限制更多是架构性的而非信息性的。此外,准确的诊断并未可靠地产生具有教育意义的可操作反馈,揭示了诊断判断与教学有效性之间的差距。我们的研究结果表明,LLM更适合用于混合架构,其中基于知识图谱的模型负责诊断,而LLM则支持开放式的支架和对话。
cs.AI / 49 / 2605.16215
Fully Open Meditron: An Auditable Pipeline for Clinical LLMs
完全开放的 Meditron:可审计的临床大语言模型管道
Abstract
Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.
Chinese Translation
临床决策支持系统(CDSS)需要可审查、可审计的管道,以实现严格、可重复的验证。然而,目前基于大语言模型(LLM)的 CDSS 在很大程度上仍然不透明。大多数“开放”模型仅开放权重,发布参数而隐瞒数据来源、整理程序和决定模型行为的生成管道。完全开放(FO)模型,即端到端暴露完整训练堆栈的模型,在医学领域尚不存在。我们介绍了完全开放的 Meditron,这是第一个用于构建 LLM-CDSS 的完全开放管道,包含经过临床医生审核的训练语料库、可重复的数据构建和训练框架,以及与使用对齐的评估协议。该语料库将八个公共医学问答数据集统一为标准化的对话格式,并通过三个经过临床医生验证的合成扩展进行覆盖:考试风格问答、基于 46,469 条临床实践指南的指南基础问答,以及临床小插曲。该管道强制实施系统范围的去污染、教师生成的金标签重采样,以及由四位医生小组进行的端到端验证。我们使用 LLM 作为评判者的协议对专家撰写的临床小插曲进行评估,并与 204 名人类评审员进行校准。我们将该方法应用于五个 FO 基础模型(Apertus-70B/8B-Instruct,OLMo-2-32B-SFT,EuroLLM-22B/9B-Instruct)。所有 MeditronFO 变体均优于其基础模型。Apertus-70B-MeditronFO 在综合医学基准上比其基础模型提高了 6.6 分(从 47.2% 提升至 53.8%),确立了新的 FO 最先进技术(SoTA)。Gemma-3-27B-MeditronFO 在 58.6% 的 LLM 作为评判者比较中优于 MedGemma,并在 HealthBench 上表现优于它(58% 对 55.9%)。这些结果表明,完全开放的管道能够在不牺牲可审计性或可重复性的情况下,实现领域特定的最先进性能。
cs.AI / 50 / 2605.16233
FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
FORGE:无权重更新的自我演化代理记忆通过种群广播
Abstract
Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7$\times$ over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below $-100$) to as low as $\sim$1%. We find that (1) population broadcast is a critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, Rules offers the best cost-reliability profile with $\sim$40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
Chinese Translation
大型语言模型(LLM)代理是否可以通过自生成的记忆在不进行梯度更新的情况下改善决策?我们提出了FORGE(失败优化反思毕业与演化),这是一种分阶段的基于种群的协议,旨在为层次化的ReAct代理演化注入提示的自然语言记忆。FORGE 包裹了一个反思风格的内循环,其中一个专门的反思代理(使用相同的基础LLM,而不是从更强模型中提取)将失败的轨迹转换为可重用的知识工件:文本启发式(规则)、少量示范(示例)或两者结合(混合),并通过一个外循环在阶段之间将表现最佳实例的记忆传播到种群中,并通过毕业标准冻结收敛的实例。我们在CybORG CAGE-2上进行评估,这是一个在30步视野下针对B线攻击者的随机网络防御部分可观测马尔可夫决策过程(POMDP),所有四个测试的LLM家族(Gemini-2.5-Flash-Lite、Grok-4-Fast、Llama-4-Maverick、Qwen3-235B)都表现出强烈的负面、重尾的零-shot奖励。与零-shot基线和反思基线(孤立的单流学习)相比,FORGE在所有12种模型表示条件下将平均评估回报提高了1.7-7.7倍,相较于零-shot提高了29-72%,将主要失败率(低于-100)降低至约1%。我们发现(1)种群广播是关键机制,未进行毕业的消融实验确认广播带来了性能提升,而毕业主要节省计算资源;(2)示例在四个模型中的三个模型中实现了最强的回报,规则在大约减少40%令牌的情况下提供了最佳的成本可靠性配置;(3)较弱的基线模型受益不成比例,表明FORGE可能减轻能力差距,而不是放大强模型。所有证据均限于CAGE-2 B线;跨家族的发现为方向性证据。
cs.AI / 51 / 2605.16238
Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search
基于自主大型语言模型引导树搜索的前瞻性多病原体疾病预测
Abstract
Probabilistic forecasting of infectious diseases is crucial for public health but relies on labor-intensive manual model curation by expert modeling teams. This bespoke development bottlenecks scalability to granular geographic resolutions or emerging pathogens. Here, we present an autonomous system using Large Language Model (LLM)-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a fully prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus (RSV). Aggregating these machine-generated models yielded an ensemble that consistently matched or outperformed the gold-standard, human-curated Centers for Disease Control and Prevention (CDC) hub ensembles out-of-sample. The system successfully navigated data-scarce "cold start" scenarios for RSV. Moreover, controlled retrospective ablations revealed that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop ensures structural fidelity to complex scientific theories. By autonomously translating epidemiological theory into accurate, transparent code, this framework overcomes the modeling labor bottleneck, enabling rapid deployment of expert-level disease forecasting at unprecedented scales.
Chinese Translation
传染病的概率预测对公共卫生至关重要,但依赖于专家建模团队进行劳动密集型的手动模型管理。这种定制开发限制了在细粒度地理分辨率或新兴病原体方面的可扩展性。在此,我们提出了一种自主系统,利用大型语言模型(LLM)引导的树搜索,迭代生成、评估和优化可执行的预测软件。在2025-2026年美国呼吸季节的全面前瞻性实时评估中,该系统自主发现了针对流感、COVID-19和呼吸道合胞病毒(RSV)的方法学多样化模型。聚合这些机器生成的模型产生的集成模型在样本外的表现始终与金标准的人工策划的疾病控制与预防中心(CDC)集成模型相匹配或超越。该系统成功应对了RSV的数据稀缺“冷启动”场景。此外,受控的回顾性消融实验表明,优化对数尺度距离度量可以防止奖励黑客行为,而自动化的循环评判确保了对复杂科学理论的结构保真性。通过自主将流行病学理论转化为准确、透明的代码,该框架克服了建模劳动瓶颈,使得以空前规模快速部署专家级疾病预测成为可能。
cs.CL / 1 / 2605.15220
Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time
始终学习,始终混合:高效且简单的数据混合方法
Abstract
Data mixing decides how to combine different sources or types of data and is a consequential problem throughout language model training. In pretraining, data composition is a key determinant of model quality; in continual learning and adaptation, it governs what is retained and acquired. Yet existing data mixing methods address only one phase of this lifecycle at a time: some require smaller proxy models tied to a single training phase, others assume a fixed domain set, and continual learning lacks principled guidance altogether. We argue that data mixing is fundamentally an online decision making problem -- one that recurs throughout training and demands a single, unified solution. We introduce OP-Mix (On-Policy Mix), a data mixing algorithm that operates across the entire language model training lifecycle. Our main insight is that candidate data mixtures can be cheaply simulated by interpolating between low-rank adapters trained directly on the current model, eliminating separate proxy models and ensuring the search is always grounded in the model's actual learning dynamics. Across pretraining, continual midtraining, and continual instruction tuning, OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of the baselines. In pretraining, OP-Mix improves upon training without mixing by 6.3% in average perplexity. For continual learning, OP-Mix matches the performance of both retraining and on-policy distillation while using 66% and 95% less overall compute, respectively. OP-Mix suggests a different view of language model training: not a sequence of distinct phases, but a single continuous process of learning from data.
Chinese Translation
数据混合决定了如何组合不同来源或类型的数据,并且在语言模型训练中是一个重要问题。在预训练阶段,数据组成是模型质量的关键决定因素;在持续学习和适应中,它决定了保留和获取的内容。然而,现有的数据混合方法仅针对这一生命周期的一个阶段:一些方法需要与单一训练阶段相关联的小型代理模型,另一些则假设固定的领域集,而持续学习则完全缺乏原则性指导。我们认为数据混合本质上是一个在线决策问题——在整个训练过程中反复出现,并且需要一个统一的解决方案。我们提出了 OP-Mix(On-Policy Mix),一种在整个语言模型训练生命周期中运行的数据混合算法。我们的主要见解是,通过在当前模型上直接训练的低秩适配器之间进行插值,可以廉价地模拟候选数据混合,从而消除单独的代理模型,并确保搜索始终基于模型的实际学习动态。在预训练、持续中训练和持续指令调优中,OP-Mix 始终能够找到近乎最优的混合,同时使用的计算资源仅为基线的一个小部分。在预训练中,OP-Mix 在平均困惑度上比不混合训练提高了 6.3%。对于持续学习,OP-Mix 的性能与重新训练和在线蒸馏相匹配,同时分别使用了 66% 和 95% 更少的计算资源。OP-Mix 提出了对语言模型训练的不同看法:不是一系列不同的阶段,而是一个从数据中持续学习的单一连续过程。
cs.CL / 2 / 2605.15282
Fluency and Faithfulness in Human and Machine Literary Translation
人类与机器文学翻译中的流畅性与忠实性
Abstract
Literary translation requires balancing target-language fluency with faithfulness to the source. Recent large language models (LLMs) often produce fluent translations, but it remains unclear whether fluency corresponds to semantic preservation in literary text. We examine this relationship using 130,486 translated paragraphs from 106 novels in 16 source languages, including human, Google Translate, and TranslateGemma translations. Fluency is measured as original-likeness with a translationese classifier trained on paragraph part-of-speech n-grams, and faithfulness with the automatic translation evaluation metric COMET-KIWI. We control for paragraph length and find a consistent negative correlation between fluency and faithfulness. The pattern appears for both human and Google Translate, but is weaker and often non-significant for TranslateGemma. These results show that segment length matters for automatic evaluation and suggest a tradeoff between fluency and faithfulness in literary translation.
Chinese Translation
文学翻译需要在目标语言的流畅性与对源语言的忠实性之间取得平衡。近期的大型语言模型(LLMs)通常能够生成流畅的翻译,但流畅性是否与文学文本的语义保留相对应仍不明确。我们使用来自16种源语言的106部小说中的130,486个翻译段落来考察这一关系,包括人类翻译、Google Translate和TranslateGemma的翻译。流畅性通过一个基于段落词性n-grams训练的翻译特征分类器来衡量,而忠实性则使用自动翻译评估指标COMET-KIWI进行评估。我们控制了段落长度,发现流畅性与忠实性之间存在一致的负相关关系。该模式在人工翻译和Google Translate中均出现,但在TranslateGemma中则较弱且通常不显著。这些结果表明段落长度对自动评估的重要性,并暗示在文学翻译中流畅性与忠实性之间存在权衡关系。
cs.CL / 3 / 2605.15304
DiscoExplorer: An Open Interface for the Study of Multilingual Discourse Relations
DiscoExplorer:多语言话语关系研究的开放接口
Abstract
The relations connecting propositions in discourse such as cause (A because B) or concession (A although B) are a subject of intense interest in Computational Linguistics and Pragmatics, but challenging to study and compare across languages. Recent progress in standardizing discourse relation inventories across datasets offers the potential to facilitate such studies, but is hindered by the complexity of relevant data and the lack of easily accessible interfaces to analyze it. In this paper we present DiscoExplorer, a new open source web interface, capable of running on local computers, which we use to make datasets from the DISRPT Shared Task on discourse relation classification publicly available, covering 16 different languages. We present the query language, search and visualization facilities for relations and signaling devices such as connectives, as well as some example studies.
Chinese Translation
话语中连接命题的关系,如因果关系(A 因为 B)或让步关系(A 尽管 B),在计算语言学和语用学中引起了广泛关注,但在跨语言研究和比较中面临挑战。最近在标准化数据集中的话语关系清单方面取得的进展,为促进此类研究提供了潜力,但由于相关数据的复杂性以及缺乏易于访问的分析接口而受到阻碍。在本文中,我们介绍了 DiscoExplorer,一个新的开源网络接口,能够在本地计算机上运行,我们利用它将 DISRPT 共享任务中关于话语关系分类的数据集公开,涵盖 16 种不同语言。我们展示了查询语言、关系和信号装置(如连接词)的搜索和可视化功能,以及一些示例研究。
cs.CL / 4 / 2605.15362
Automatic Construction of a Legal Citation Graph from 100 Million Ukrainian Court Decisions: Large-Scale Extraction, Topological Analysis, and Ontology-Driven Clustering
从1亿条乌克兰法院裁决自动构建法律引用图:大规模提取、拓扑分析与本体驱动聚类
Abstract
Half a billion citation edges extracted from 100.7 million Ukrainian court decisions reveal that judicial citation structure encodes legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy. We construct the first large-scale citation graph from the complete EDRSR registry (99.5 million full texts, 1.1 TB), extracting 502 million citation links across six types via regex on commodity hardware in approximately 5 hours, with precision of 1.00 on a 200-decision validation sample (95% Wilson CI: [0.982, 1.000]). Three principal findings emerge. (1) The degree distribution follows a power law (alpha = 1.57 +/- 0.008), placing the Ukrainian court network near the EU Court of Justice and below the US Supreme Court, with hub articles cited by millions of decisions. (2) Louvain community detection on the co-citation projection recovers legal domain boundaries (civil, criminal, administrative, commercial) with modularity Q = 0.44-0.55 and temporal stability (NMI = 0.83-0.86 across periods), constituting an automatically constructed legal ontology grounded in judicial practice. (3) Citation features predict top-1000 articles with AUC = 0.9984, substantially outperforming a naive frequency baseline (P@1000 = 0.655); temporal dynamics detect legislative regime changes as phase transitions and the 2022 invasion as a citation entropy spike (H: 11.02 -> 13.49) with emergent wartime legislation nodes. The citation-derived ontology is operationalized as the domain layer of a workflow memory system for LLM-assisted legal analysis, connecting to the ontology-controlled paradigm. The extraction pipeline, analysis code, and aggregated statistics are released as open data.
Chinese Translation
从1.007亿条乌克兰法院裁决中提取的五亿条引用边揭示了司法引用结构在没有监督的情况下编码法律领域边界,并以近乎完美的准确性预测未来立法的重要性。我们从完整的EDRSR注册表(9950万份全文,1.1 TB)构建了第一个大规模引用图,通过在普通硬件上使用正则表达式提取了六种类型共5.02亿条引用链接,耗时约5小时,在200个裁决的验证样本中精度达到1.00(95% Wilson置信区间:[0.982, 1.000])。研究得出三项主要发现。(1) 引用度分布遵循幂律(alpha = 1.57 +/- 0.008),使乌克兰法院网络接近欧盟法院、低于美国最高法院,中心文章被数百万个裁决引用。(2) 在共引投影上进行的Louvain社区检测恢复了法律领域边界(民事、刑事、行政、商业),模块度Q = 0.44-0.55,且具有时间稳定性(跨时期NMI = 0.83-0.86),构成了基于司法实践自动构建的法律本体。(3) 引用特征以AUC = 0.9984预测前1000篇文章,显著优于简单频率基线(P@1000 = 0.655);时间动态将立法制度变化检测为相变,并将2022年入侵检测为引用熵峰值(H: 11.02 -> 13.49),伴随新兴的战时立法节点。引用导出的本体被操作化为LLM辅助法律分析的工作流记忆系统的领域层,连接到本体控制的范式。提取管道、分析代码和汇总统计数据已作为开放数据发布。
cs.CL / 5 / 2605.15365
Greedy or not, here I come: Language production under vocabulary constraints in humans and resource-rational models
贪婪还是不贪婪,我来了:在词汇限制下人类和资源理性模型的语言生成
Abstract
Communicating using only a limited vocabulary is a common but challenging cognitive phenomenon, requiring an ideal communicator to plan carefully to optimize for intelligibility while circumventing a constrained lexicon. In this work, we investigate how humans respond to a broad array of questions under variable vocabulary limitations, consisting of only 250 highly frequent words at the most restrictive. We provide theoretically motivated comparisons to greedy and globally optimal sampling algorithms using Sequential Monte Carlo inference with large language models. Humans generally resemble greedy sampling more than globally optimal sampling, though more skilled humans are more likely to backtrack and revise -- a non-greedy behavior. An observed human pattern of leaning on semantically light words in high-constraint settings falls out of both greedy and globally optimal sampling. We discuss the results and their broader implications for resource-rational cognition, psycholinguistics, L2 communication, and language impairments.
Chinese Translation
仅使用有限词汇进行交流是一种常见但具有挑战性的认知现象,这要求理想的交流者仔细规划,以优化可理解性,同时规避受限的词汇。在本研究中,我们调查了人类在不同词汇限制下对广泛问题的反应,其中最严格的限制仅允许使用250个高频词。我们提供了理论驱动的比较,涉及贪婪采样和全局最优采样算法,使用大语言模型的序列蒙特卡洛推断。总体而言,人类的行为更类似于贪婪采样而非全局最优采样,尽管更熟练的人类更可能进行回溯和修正,而这是一种非贪婪行为。在高约束环境中,人类倾向于依赖语义轻的词汇,这一模式在贪婪采样和全局最优采样中都会自然出现。我们讨论了这些结果及其对资源理性认知、心理语言学、第二语言交流和语言障碍的更广泛影响。
cs.CL / 6 / 2605.15376
Adesua: Development and Feasibility Study of an AI WhatsApp Bot for Science Learning in West Africa
Adesua:西非科学学习的人工智能WhatsApp机器人开发与可行性研究
Abstract
Sub-Saharan Africa faces persistently high student-teacher ratios and shortages of qualified teachers, limiting students' access to personalized learning support and formative assessment. To address this challenge, we present Adesua, a WhatsApp-based AI Teaching Assistant for science education that extends the Kwame for Science platform. Adesua leverages WhatsApp's widespread adoption in Africa to provide accessible, curriculum-aligned learning support for Junior High School (JHS) and Senior High School (SHS) students across West Africa. The system integrates curated textbooks and 33 years of national examination questions with generative AI to enable conversational question answering and automated assessment with feedback via a WhatsApp bot. Students can ask science questions, take timed or untimed multiple-choice tests by topic or exam year, and receive instant grading and detailed explanations of correct and incorrect responses. A 6-month feasibility deployment in 2025 had 56 active users in Ghana, including students and parents. Quantitative evaluation showed high perceived usefulness, with a helpfulness score of 93.75% for AI-generated answers, albeit with a small number of ratings (n=16). These preliminary results provide a basis for more extensive future evaluation of a WhatsApp-based AI assistant to assess its potential to offer scalable, low-cost personalized learning support and formative assessment in resource-constrained educational contexts.
Chinese Translation
撒哈拉以南非洲面临着持续高企的师生比例和合格教师短缺的问题,这限制了学生获得个性化学习支持和形成性评估的机会。为了解决这一挑战,我们提出了Adesua,一个基于WhatsApp的人工智能教学助手,旨在扩展Kwame for Science平台。Adesua利用WhatsApp在非洲的广泛应用,为西非的初中(JHS)和高中(SHS)学生提供可获取的、与课程对齐的学习支持。该系统整合了经过筛选的教科书和33年的国家考试题目,并结合生成式人工智能,实现了通过WhatsApp机器人进行对话式问答和自动评估反馈。学生可以提出科学问题,按主题或考试年份进行限时或不限时的多项选择测试,并即时获得评分及正确和错误答案的详细解释。2025年进行的为期6个月的可行性部署在加纳有56名活跃用户,包括学生和家长。定量评估显示,AI生成答案的感知有用性很高,帮助评分达到93.75%,尽管评分数量较少(n=16)。这些初步结果为未来更广泛评估基于WhatsApp的人工智能助手提供了基础,以评估其在资源有限的教育环境中提供可扩展、低成本的个性化学习支持和形成性评估的潜力。
cs.CL / 7 / 2605.15380
Eskwai for Students: Generative AI Assistant for Legal Education in Ghana
Eskwai for Students:加纳法律教育的生成式人工智能助手
Abstract
Recent advances in generative AI have shown their potential to be leveraged for legal education. Yet, work on the development and deployment of such systems for legal education in the Global South is limited. In this work, we developed Eskwai for Students, a generative AI assistant to help law students with their legal education. Eskwai for Students is a retrieval augmented generation (RAG) system that provides answers to a wide range of legal questions for law students grounded in a curated database of over 12K case laws and 1.4K legislation in Ghana. We deployed Eskwai for Students in a longitudinal study of 30 months (2.5 years) used by 3.1K law students in Ghana who made 32K queries. We evaluated the helpfulness of our AI, and provided insight into the kinds of queries law students submit to this generative AI tool, which raises some ethical concerns. This work contributes to an understanding of how law students in the Global South are using generative AI for their studies and the ways it could be leveraged responsibly to advance legal education.
Chinese Translation
最近在生成式人工智能领域的进展显示了其在法律教育中应用的潜力。然而,针对全球南方地区法律教育的此类系统的开发和部署工作仍然有限。在本研究中,我们开发了Eskwai for Students,一个生成式人工智能助手,旨在帮助法学院学生进行法律教育。Eskwai for Students是一个检索增强生成(RAG)系统,能够基于一个包含超过12,000个案例法和1,400个立法的精心策划数据库,为法学院学生提供广泛的法律问题解答。我们在一项为期30个月(2.5年)的纵向研究中部署了Eskwai for Students,参与者为3,100名加纳法学院学生,他们共提交了32,000个查询。我们评估了该人工智能的帮助程度,并深入分析了法学院学生向这一生成式人工智能工具提交的查询类型,这引发了一些伦理问题。本研究有助于理解全球南方地区法学院学生如何利用生成式人工智能进行学习,以及如何负责任地利用这一技术来推动法律教育的发展。
cs.CL / 8 / 2605.15404
Capability Conditioned Scaffolding for Professional Human LLM Collaboration
基于能力条件的专业人类大语言模型协作支架
Abstract
Large language model personalization typically adapts outputs to user preferences and style but does not account for differences in user evaluation capacity across domains of expertise. This limitation can encourage Professional Domain Drift, where users rely on AI generated reasoning in domains they cannot reliably evaluate. We introduce Capability Conditioned Scaffolding, a typed framework that partitions expertise into strong, mixed, and weak domains and conditions intervention behavior on structured capability profiles. A pilot evaluation across multiple MMLU subsets and four LLM substrates shows consistent profile conditioned intervention behavior, including categorical inversion under profile swapping and selective activation in mixed domain risk zones. These findings suggest that capability aware scaffolding can support more reliable professional human AI collaboration beyond stylistic personalization.
Chinese Translation
大语言模型个性化通常会根据用户的偏好和风格调整输出,但并未考虑用户在不同专业领域的评估能力差异。这一局限性可能导致专业领域漂移(Professional Domain Drift),使用户在无法可靠评估的领域依赖于AI生成的推理。我们提出了基于能力条件的支架(Capability Conditioned Scaffolding),这是一个将专业知识划分为强、混合和弱领域的类型化框架,并根据结构化的能力档案调整干预行为。在多个MMLU子集和四种大语言模型(LLM)基础上进行的初步评估显示出一致的基于档案的干预行为,包括在档案交换下的类别反转和在混合领域风险区的选择性激活。这些发现表明,关注能力的支架可以支持更可靠的专业人类与AI协作,超越单纯的风格个性化。
cs.CL / 9 / 2605.15436
Neural Activation Patterns Across Language Model Architectures: A Comprehensive Analysis of Cognitive Task Performance
语言模型架构中的神经激活模式:认知任务表现的综合分析
Abstract
This paper presents a comprehensive analysis of neural activation patterns across six distinct large language model (LLM) architectures, examining their performance on twelve cognitive task categories. Through systematic measurement of final activation values, attention entropy, and sparsity patterns, we reveal fundamental differences in how encoder and decoder architectures process diverse cognitive tasks. Our analysis of 144 task-model combinations demonstrates that mathematical reasoning consistently produces the highest attention entropy across all architectures, while decoder models exhibit significantly higher sparsity patterns compared to encoder models. The findings provide critical insights into the computational characteristics of modern language models and their task-specific neural behaviors, with implications for model selection and optimization in big data applications.
Chinese Translation
本文对六种不同的大型语言模型(LLM)架构中的神经激活模式进行了全面分析,考察了它们在十二个认知任务类别上的表现。通过系统测量最终激活值、注意力熵和稀疏性模式,我们揭示了编码器和解码器架构在处理多样认知任务时的基本差异。对144个任务-模型组合的分析表明,数学推理在所有架构中始终产生最高的注意力熵,而解码器模型的稀疏性模式显著高于编码器模型。这些发现为现代语言模型的计算特性及其任务特定的神经行为提供了关键见解,对大数据应用中的模型选择和优化具有重要意义。
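The abstract names attention entropy and sparsity as measurements but does not define them. The sketch below shows one common reading, offered as an assumption rather than the paper's definition: Shannon entropy of a single attention row, and sparsity as the fraction of activations whose magnitude falls below a threshold. The function names and the threshold value are hypothetical.

```python
import math
from typing import Sequence


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one attention row that sums to 1."""
    # Zero weights contribute nothing, so they are skipped to avoid log(0).
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def sparsity(activations: Sequence[float], threshold: float = 1e-3) -> float:
    """Fraction of activations with magnitude below the threshold."""
    if not activations:
        raise ValueError("activations must be non-empty")
    near_zero = sum(1 for a in activations if abs(a) < threshold)
    return near_zero / len(activations)


uniform_row = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly
peaked_row = [1.0, 0.0, 0.0, 0.0]       # attention on a single token
```

Under this reading, a uniform row reaches the maximum entropy `log(n)` and a one-hot row has entropy zero, so "highest attention entropy" would correspond to the most diffuse attention.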
cs.CL / 10 / 2605.15440
Why are language models less surprised than humans? Testing the Parse Multiplicity Mismatch Hypothesis
为什么语言模型的惊讶程度低于人类?测试解析多重性不匹配假说
Abstract
Surprisal theory posits that the processing difficulty of a word is determined by its predictability in context, offering a potential link between human sentence processing and next-word predictions from language models. While language model (LM) surprisals successfully predict reading times in naturalistic text, they systematically underpredict the magnitude of difficulty observed in controlled studies of syntactic ambiguity, particularly in garden path sentences. This mismatch might arise from differences in the computational constraints between humans and LMs. Here we test one such hypothesis, specifically, that LMs may be able to simultaneously consider a greater number of distinct sentence interpretations at once, compared to humans. Using Recurrent Neural Network Grammars (RNNGs) with word-synchronous beam search, we systematically vary the number of simultaneous parses used to compute word surprisal, and then use these surprisals to predict human reading times. Reducing the number of simultaneous active parses indeed increases the magnitude of predicted garden path effects, but not nearly enough to capture the full magnitude of the effects in humans. This suggests that differences in the number of simultaneous parses available to LMs and humans cannot reconcile LM-based surprisal with human sentence processing.
Chinese Translation
惊讶理论认为,一个词的处理难度由其在上下文中的可预测性决定,这为人类句子处理与语言模型的下一个词预测之间提供了潜在的联系。尽管语言模型(LM)的惊讶值成功预测了自然文本中的阅读时间,但它们在控制语法模糊性的研究中,尤其是在花园路径句子中,系统性地低估了观察到的难度幅度。这种不匹配可能源于人类与语言模型之间计算约束的差异。在这里,我们测试了一个这样的假设,具体来说,语言模型可能能够同时考虑比人类更多的不同句子解释。我们使用带有词同步束搜索的递归神经网络语法(RNNGs),系统性地改变用于计算词惊讶的同时解析数量,然后利用这些惊讶值来预测人类的阅读时间。减少同时激活的解析数量确实增加了预测的花园路径效应的幅度,但远不足以捕捉人类中效应的全部幅度。这表明,语言模型和人类可用的同时解析数量的差异无法调和基于语言模型的惊讶值与人类句子处理之间的关系。
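The link between the number of simultaneous parses and word surprisal can be seen in a toy calculation. This is a simplified sketch, not the RNNG word-synchronous beam search used in the paper: it assumes each parse carries a prefix probability before the word and a conditional probability of the word, and that surprisal is computed from the probability mass retained in the top-k parses. The numbers for the garden-path example are invented for illustration.

```python
import math
from typing import List, Tuple

# Each parse: (prefix probability before the word, P(word | parse)).
Parse = Tuple[float, float]


def beam_surprisal(parses: List[Parse], beam_size: int) -> float:
    """Surprisal in bits of the next word, using only the top-k parses."""
    # Keep the k parses with the highest prefix probability.
    beam = sorted(parses, key=lambda p: p[0], reverse=True)[:beam_size]
    before = sum(prior for prior, _ in beam)
    after = sum(prior * likelihood for prior, likelihood in beam)
    return -math.log2(after / before)


# Hypothetical garden path: the preferred parse (0.9) makes the
# disambiguating word very unlikely; the dispreferred parse (0.1) does not.
garden_path = [(0.9, 0.001), (0.1, 0.2)]

wide = beam_surprisal(garden_path, beam_size=2)    # both parses available
narrow = beam_surprisal(garden_path, beam_size=1)  # only the preferred parse
```

With both parses available, the dispreferred parse absorbs most of the probability of the disambiguating word, so surprisal stays moderate. With a single parse, surprisal rises, which is the direction of the effect the abstract reports.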
cs.CL / 11 / 2605.15454
Reasoning Models Don't Just Think Longer, They Move Differently
推理模型不仅思考更长时间,它们的移动方式也不同
Abstract
Reasoning-trained language models often spend more tokens on harder problems, but longer chains of thought do not show whether a model is merely computing for more steps or following a different internal trajectory. We study this distinction through hidden-state trajectories during chain-of-thought generation across competitive programming, mathematics, and Boolean satisfiability. Raw trajectory geometry is strongly shaped by generation length: longer generations mechanically alter path statistics, so difficulty-dependent comparisons are misleading without adjustment. After residualizing trajectory statistics on length, difficulty remains systematically coupled to corrected trajectory geometry across all domains studied. The clearest reasoning-specific separation appears in the code domain, where harder problems show more direct corrected trajectories and less heterogeneous local curvature in reasoning-trained models than in matched instruction-tuned baselines. Corrected difficulty-geometry coupling is weaker, but still present, in mathematics and Boolean satisfiability. Prompt-stage linear probes do not mirror the code-domain separation, and behavioral annotations show that stronger corrected coupling co-occurs with strategy shifts and uncertainty monitoring. Together, these findings establish length correction as a prerequisite for generation-time trajectory analysis and show that reasoning training can be associated with distinct corrected trajectory geometry, with the strength of the effect depending on the domain.
Chinese Translation
经过推理训练的语言模型在处理更难的问题时通常会消耗更多的标记,但较长的思维链并不能表明模型只是计算更多步骤,还是遵循不同的内部轨迹。我们通过在竞争编程、数学和布尔可满足性中的思维链生成过程中的隐藏状态轨迹研究这一区别。原始轨迹几何形状受到生成长度的强烈影响:较长的生成在机械上改变路径统计,因此在没有调整的情况下,基于难度的比较具有误导性。在对轨迹统计进行长度残差化后,难度仍然系统性地与所有研究领域的修正轨迹几何形状相耦合。在代码领域,推理特定的分离最为明显,较难的问题在推理训练模型中显示出更直接的修正轨迹和较少的局部曲率异质性,而在匹配的指令调优基线中则相反。在数学和布尔可满足性中,修正的难度-几何耦合较弱,但仍然存在。提示阶段的线性探测并未反映代码领域的分离,而行为注释表明,较强的修正耦合与策略转变和不确定性监测同时发生。综合来看,这些发现确立了长度修正作为生成时间轨迹分析的前提,并表明推理训练可以与独特的修正轨迹几何形状相关联,其效果的强度取决于具体领域。
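Residualizing a trajectory statistic on generation length can be sketched as an ordinary least-squares fit followed by subtraction of the fitted values. The abstract does not state the functional form of the correction, so the linear fit here is an assumption, and the sample numbers and names (`residualize`, `path_stat`) are hypothetical.

```python
from typing import List


def residualize(values: List[float], lengths: List[float]) -> List[float]:
    """Remove the linear effect of length from a trajectory statistic."""
    n = len(values)
    if n != len(lengths) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(lengths) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    if sxx == 0:
        raise ValueError("lengths must not all be equal")
    # Closed-form OLS slope and intercept for a single predictor.
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(lengths, values)) / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(lengths, values)]


lengths = [100.0, 200.0, 300.0, 400.0]   # tokens generated per problem
path_stat = [12.0, 21.0, 33.0, 40.0]     # e.g. a raw path-length statistic
residuals = residualize(path_stat, lengths)
```

The residuals are uncorrelated with length by construction, so any remaining association with problem difficulty cannot be explained by a linear length effect alone.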
cs.CL / 12 / 2605.15467
Retrieval-Augmented Large Language Models for Schema-Constrained Clinical Information Extraction
基于检索增强的大型语言模型用于受限模式的临床信息提取
Abstract
Conversational nurse-patient transcripts contain actionable observations, but converting these transcripts into structured representations at scale remains challenging. Documentation burden is substantial, with prior studies showing clinicians spend large portions of their workday on documentation and related desk work rather than direct patient care. MEDIQA-SYNUR focuses on observation extraction from conversational nurse-patient transcripts, requiring systems to normalize these narratives into a predefined schema with value-type constraints. We propose a modular retrieval-augmented generation (RAG) pipeline that uses the training set as an exemplar corpus, combines schema-constrained prompting (full schema vs. pruned candidate schema), deterministic schema-based postprocessing, and a second-pass audit, with two LLM backbones: Llama-4-Scout-17B-16E-Instruct and GPT-5.2 with corresponding embedding models for RAG. Our best configuration uses GPT-5.2 with full schema, RAG, and a second-pass auditing, achieving 80.36% F1 score. Overall, our results show that RAG consistently improves performance, while the optimal degree of schema constraint depends on the model, and second-pass auditing yields modest additional gains by correcting residual schema-adherence errors.
Chinese Translation
对话护士与患者的记录包含可操作的观察结果,但将这些记录转换为结构化表示在规模上仍然具有挑战性。文档负担相当大,以前的研究表明,临床医生在文档记录和相关的桌面工作上花费了大量的工作时间,而不是直接进行患者护理。MEDIQA-SYNUR专注于从对话护士与患者的记录中提取观察结果,这要求系统将这些叙述规范化为具有值类型约束的预定义模式。我们提出了一种模块化的检索增强生成(RAG)管道,该管道使用训练集作为示例语料库,结合受限模式提示(完整模式与修剪的候选模式)、确定性基于模式的后处理和第二轮审核,并使用两个大型语言模型(LLM)作为基础:Llama-4-Scout-17B-16E-Instruct和GPT-5.2,以及相应的嵌入模型用于RAG。我们最佳的配置使用了完整模式的GPT-5.2、RAG和第二轮审核,达到了80.36%的F1分数。总体而言,我们的结果表明RAG始终提高了性能,而模式约束的最佳程度依赖于模型,第二轮审核通过纠正残余的模式遵循错误带来了适度的额外收益。
cs.CL / 13 / 2605.15482
FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models
FINESSE-Bench:用于大型语言模型金融领域知识和技术分析的分层基准套件
Abstract
Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of financial competencies in LLMs. FINESSE-Bench combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design enables evaluation of domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains. We also describe a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, together with an automated scoring scheme for freeform answers based on the LLM-as-judge paradigm. FINESSE-Bench is intended both as a complement to existing open financial benchmarks and as a tool for more substantive evaluation of professionally relevant financial competencies in large language models.
Chinese Translation
大型语言模型(LLMs)越来越多地应用于金融分析、报告、投资决策支持、风险管理、合规性和专业培训。然而,对其在金融领域能力的稳健评估仍然不够完善。广泛使用的开放基准,如FinQA、ConvFinQA和TAT-QA,在推动金融问答和数值推理方面发挥了重要作用,但它们主要集中在对金融报告的问答上,并未提供明确的专业难度层级。更广泛的资源,包括FinanceBench、PIXIU、FinBen和FLaME,扩展了金融任务的覆盖范围,但评估从基础知识到专家级金融推理的过渡问题仍然未得到解决。在本研究中,我们提出了FINESSE-Bench,一个由8个专业基准组成的套件,包含3,993个问题,用于对LLMs中金融能力的分层评估。FINESSE-Bench结合了受专业认证(类似CFA的1-3级、类似CMT的2级和类似CFTe的1级)启发的考试导向数据集、应用交易任务集合以及一个俄语奥林匹克基准。这种设计使得能够评估领域广度、随着难度增加而导致的性能下降、解决计算任务的能力以及模型在专业金融领域的行为。我们还描述了一个统一的评估协议,涵盖多项选择题、数值答案和简短开放式回答,以及基于LLM作为评判者范式的自由形式答案的自动评分方案。FINESSE-Bench旨在作为现有开放金融基准的补充,并作为更实质性评估大型语言模型中专业相关金融能力的工具。
cs.CL / 14 / 2605.15514
RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably
RoPE在长上下文中既无法区分位置也无法区分标记,且有理论证明
Abstract
We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today's long-context models, helps distinguish different tokens, but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer long-context language models.
Chinese Translation
我们识别出旋转位置嵌入(Rotary Positional Embeddings, RoPE)在基于Transformer的长上下文语言模型中的内在局限性。我们的理论分析抽象化了上下文的具体内容,仅依赖于其长度。我们证明,随着上下文长度的增加,基于RoPE的注意力变得不可预测,并失去了两个对其有效性至关重要的特性。首先,它失去了局部性偏向:RoPE不再更倾向于靠近的位置,而是与远位置的偏向相当。其次,它失去了标记相关性的一致性:在某一位置上获得更高注意力分数的关键向量,在另一位置上可能获得更低的分数。在这两种情况下,失败的概率接近0.5,和随机猜测没有区别。我们进一步证明,当一个关键标记移动到不同位置或甚至被不同标记替换时,注意力分数可以保持不变,这表明无法区分位置或标记。调整RoPE基数在区分位置和区分标记之间进行权衡,但无法同时保留两者。增加RoPE基数超参数,这在当今的长上下文模型中是常见做法,有助于区分不同的标记,但不可避免地牺牲了区分位置的能力。我们的实证分析表明,多头、多层架构不足以克服这些局限性。我们的研究结果表明,未来的Transformer长上下文语言模型可能需要根本新的机制来编码位置和标记顺序。
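For readers unfamiliar with the mechanism under analysis, the sketch below shows the standard RoPE construction and the property it is designed to have: the attention logit depends only on the offset between query and key positions. It does not reproduce the paper's proofs. Each pair of dimensions is stored as one complex number, the frequencies follow the usual `base ** (-2j/d)` schedule, and the query and key values are arbitrary illustrative numbers.

```python
import cmath
from typing import List


def rope_rotate(vec: List[complex], position: int,
                base: float = 10000.0) -> List[complex]:
    """Rotate each 2-D pair (one complex number) by position * theta_j."""
    half_dim = len(vec)  # full dimension d = 2 * half_dim
    return [v * cmath.exp(1j * position * base ** (-j / half_dim))
            for j, v in enumerate(vec)]


def rope_score(query: List[complex], key: List[complex],
               q_pos: int, k_pos: int, base: float = 10000.0) -> float:
    """Dot-product attention logit between a rotated query and key."""
    q = rope_rotate(query, q_pos, base)
    k = rope_rotate(key, k_pos, base)
    # Re(a * conj(b)) equals the real dot product of the two 2-D pairs.
    return sum((a * b.conjugate()).real for a, b in zip(q, k))


query = [1 + 2j, 0.5 - 1j]
key = [0.3 + 0.7j, -1 + 0.2j]

near = rope_score(query, key, q_pos=10, k_pos=7)       # offset 3
shifted = rope_score(query, key, q_pos=110, k_pos=107)  # offset 3 again
other = rope_score(query, key, q_pos=10, k_pos=6)      # offset 4
```

Because each component is periodic in the offset, distinct offsets can produce similar or identical scores. The abstract's claims concern how this behaves as context length grows.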
cs.CL / 15 / 2605.15518
DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection
DetectRL-X:迈向可靠的多语言和现实世界的LLM生成文本检测
Abstract
The effective detection and governance of Large Language Model (LLM) generated content has become increasingly critical due to the growing risk of misuse. Despite the impressive performance of existing detectors, their reliability and potential in multilingual, real-world scenarios remain largely underexplored. In this study, we introduce DetectRL-X, a comprehensive multilingual benchmark designed to evaluate advanced detectors across 8 dimensions. The benchmark encompasses 8 languages commonly used in commercial contexts and collects human-written texts from 6 domains highly susceptible to LLM misuse. To better align with real-world applications, we create LLM-generated texts using 4 popular commercial LLMs, and include typical AI-assisted writing operations such as polishing, expanding, and condensing to capture authentic usage patterns. Furthermore, we develop a multilingual framework for paraphrasing and perturbation attacks to simulate diverse human modifications and writing noise, enabling stress testing of detectors across languages. Experimental results on DetectRL-X reveal the strengths and limitations of current state-of-the-art detectors when applied to diverse linguistic resources. We further analyze how domains, generators, attack strategies, text length, and refinement operations influence performance in different languages, underscoring DetectRL-X as an effective benchmark for strengthening multilingual and language-specific detectors.
Chinese Translation
由于滥用风险的不断增加,有效检测和治理大型语言模型(LLM)生成内容变得愈加重要。尽管现有检测器表现出色,但它们在多语言和现实世界场景中的可靠性和潜力仍然未得到充分探索。在本研究中,我们介绍了DetectRL-X,这是一个全面的多语言基准,旨在评估先进检测器在8个维度上的表现。该基准涵盖了8种在商业环境中常用的语言,并从6个高度易受LLM滥用的领域收集了人工撰写的文本。为了更好地与现实世界应用对接,我们使用4种流行的商业LLM生成文本,并包括润色、扩展和压缩等典型的AI辅助写作操作,以捕捉真实的使用模式。此外,我们开发了一个多语言框架,用于释义和扰动攻击,以模拟多样的人类修改和写作噪声,从而对检测器在不同语言中的表现进行压力测试。在DetectRL-X上的实验结果揭示了当前最先进检测器在应用于多样语言资源时的优缺点。我们进一步分析了领域、生成器、攻击策略、文本长度和精炼操作如何影响不同语言中的性能,强调了DetectRL-X作为加强多语言和特定语言检测器的有效基准。
cs.CL / 16 / 2605.15529
Process Rewards with Learned Reliability
具有学习可靠性的过程奖励
Abstract
Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy--token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
Chinese Translation
过程奖励模型(PRMs)为推理提供逐步反馈,但当前的PRMs通常仅为每个步骤输出一个单一的奖励分数。因此,下游方法必须将不完美的逐步奖励预测视为可靠的决策信号,而没有指示这些预测何时应被信任。我们提出了BetaPRM,这是一种分布型PRM,能够同时预测逐步成功概率及该预测的可靠性。在通过蒙特卡洛延续获得的步骤成功监督下,BetaPRM学习一种Beta信念,通过Beta-二项式似然来解释观察到的成功延续数量,而不是将有限样本成功比率作为点目标进行回归。这种学习到的可靠性信号指示何时应信任步骤奖励,使下游应用能够区分可靠奖励与不确定奖励。作为一种应用,我们引入了自适应计算分配(ACA),用于PRM引导的Best-of-N推理。ACA利用学习到的可靠性信号在高奖励解决方案可靠时停止计算,并在不确定的候选前缀上花费额外的计算。跨四个基础模型和四个推理基准的实验表明,BetaPRM在保持标准逐步错误检测的同时,改善了PRM引导的Best-of-N选择。基于该信号,ACA相比固定预算的Best-of-16改善了准确性与token用量之间的权衡,将token使用减少了多达33.57%,同时提高了最终答案的准确性。
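The Beta-Binomial likelihood mentioned in the abstract has a closed form that is easy to write down. The sketch below computes the negative log-likelihood of observing k successful continuations out of n under a Beta(alpha, beta) belief, plus the mean and variance of that belief. Treating the variance as the reliability signal is an assumption made here for illustration; the abstract does not give the exact definition, and all function names are hypothetical.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_nll(successes: int, trials: int,
                      alpha: float, beta: float) -> float:
    """Negative log-likelihood of k successes in n trials under Beta(alpha, beta)."""
    log_choose = (math.lgamma(trials + 1)
                  - math.lgamma(successes + 1)
                  - math.lgamma(trials - successes + 1))
    log_lik = (log_choose
               + log_beta(successes + alpha, trials - successes + beta)
               - log_beta(alpha, beta))
    return -log_lik


def beta_mean(alpha: float, beta: float) -> float:
    """Predicted step-level success probability."""
    return alpha / (alpha + beta)


def beta_variance(alpha: float, beta: float) -> float:
    """Spread of the belief; smaller values mean a more confident prediction."""
    total = alpha + beta
    return alpha * beta / (total * total * (total + 1.0))
```

Two beliefs with the same mean, such as Beta(2, 2) and Beta(20, 20), differ in variance, which is the kind of distinction a single point-valued reward cannot express.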
cs.CL / 17 / 2605.15557
When Latent Geometry Is Not Enough: Draft-Conditioned Latent Refinement for Non-Autoregressive Text Generation
当潜在几何不足时:基于草稿条件的潜在精炼用于非自回归文本生成
Abstract
Continuous diffusion and flow models are attractive for non-autoregressive text generation because they can update all positions in parallel. A major difficulty is the interface between continuous latent states and discrete tokens. This report studies a draft-conditioned latent refinement model built from a frozen BERT encoder, a parallel decoder, a denoising DraftPrior, a local FlowNet, and a learned diagonal MetricNet. Early Gaussian-start experiments showed that good latent-space metrics, such as scale matching or cosine similarity, do not guarantee good decoding. Generated latents can be close to real encoder latents but still produce high-entropy, biased, or repetitive token distributions. We therefore frame the task as controlled local refinement rather than full generation from noise. On ROCStories, using the first two sentences as prompt and the last three as target, full 768-dimensional BERT latents recover tokens much better than compressed 256-dimensional latents. With 768-dimensional latents, DraftPrior target-token probability is 0.938 for clean drafts, 0.613 for 3% token dropout, 0.483 for 5% dropout, and 0.272 for 10% dropout. Local flow refinement and fused decoder-aware readout give modest additional gains, while metric learning and OT-style alignment improve geometry but do not close the decoder gap. The main result is a diagnostic one: latent geometry alone is not enough. Continuous latent text generation should be evaluated by decoder recoverability, the quality of the start distribution, and whether refinement preserves decoder-readable structure.
Chinese Translation
连续扩散和流模型因其能够并行更新所有位置而在非自回归文本生成中颇具吸引力。一个主要的难点是连续潜在状态与离散标记之间的接口。本报告研究了一种基于冻结的BERT编码器、并行解码器、去噪草稿先验(DraftPrior)、局部流网络(FlowNet)和学习的对角度度量网络(MetricNet)构建的草稿条件潜在精炼模型。早期的高斯起始实验表明,良好的潜在空间度量,如尺度匹配或余弦相似性,并不能保证良好的解码。生成的潜在向量可能接近真实的编码器潜在向量,但仍然会产生高熵、有偏或重复的标记分布。因此,我们将任务框架设定为受控的局部精炼,而不是从噪声中完全生成。在ROCStories数据集上,使用前两句作为提示,最后三句作为目标,完整的768维BERT潜在向量在恢复标记方面表现优于压缩的256维潜在向量。使用768维潜在向量时,草稿先验目标标记概率在干净草稿下为0.938,在3%标记丢失下为0.613,在5%丢失下为0.483,在10%丢失下为0.272。局部流精炼和融合的解码器感知读取提供了适度的额外增益,而度量学习和OT风格的对齐改善了几何结构,但并未缩小解码器之间的差距。主要结果是一个诊断性的发现:单靠潜在几何是不够的。连续潜在文本生成应通过解码器可恢复性、起始分布的质量以及精炼是否保留解码器可读结构来进行评估。
cs.CL / 18 / 2605.15562
GiLT: Augmenting Transformer Language Models with Dependency Graphs
GiLT:通过依赖图增强变换器语言模型
Abstract
Augmenting Transformers with linguistic structures effectively enhances the syntactic generalization performance of language models. Previous work in this direction focuses on syntactic tree structures of languages, in particular constituency tree structures. We propose Graph-Infused Layers Transformer Language Model (GiLT) which leverages dependency graphs for augmenting Transformer language models. Unlike most previous work, GiLT does not insert extra structural tokens in language modeling; instead, it injects structural information into language modeling by modulating attention weights in the Transformer with features extracted from the dependency graph that is incrementally constructed along with token prediction. In our experiments, GiLT with semantic dependency graphs achieves better syntactic generalization while maintaining competitive perplexity in comparison with Transformer language model baselines. In addition, GiLT can be finetuned from a pretrained language model to achieve improved downstream task performance. Our code is released at https://github.com/cookie-pie-oops/GiLT-LM.
Chinese Translation
通过语言结构增强变换器有效提升了语言模型的句法泛化性能。以往在这一方向的研究主要集中于语言的句法树结构,特别是成分树结构。我们提出了图注入层变换器语言模型(Graph-Infused Layers Transformer Language Model,GiLT),该模型利用依赖图来增强变换器语言模型。与大多数以往的研究不同,GiLT并不在语言建模中插入额外的结构性标记;相反,它通过调节变换器中的注意力权重,将从依赖图中提取的特征注入到语言建模中,这些依赖图是随着标记预测逐步构建的。在我们的实验中,使用语义依赖图的GiLT在保持与变换器语言模型基线相当的困惑度的同时,取得了更好的句法泛化效果。此外,GiLT可以从预训练语言模型进行微调,以提高下游任务的性能。我们的代码已发布在 https://github.com/cookie-pie-oops/GiLT-LM。
cs.CL / 19 / 2605.15572
Measuring Maximum Activations in Open Large Language Models
测量开放大型语言模型中的最大激活值
Abstract
The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the downstream activation-quantization stack inherits that picture without revisiting it for the post-LLaMA open-model boom. We ask the deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (5,000-sample multi-domain corpus, family-specific tokenization, identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching ~7 x 10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit 14.0-23.4x lower peaks than matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT-8 sanity check shows that measured maxima co-vary with low-bit reconstruction error via activation-scale selection. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, not a simple byproduct of size, and should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.
Chinese Translation
激活值的动态范围是低位量化、激活缩放和稳定大型语言模型(LLM)推理的首要约束。之前的研究对2024年前的LLaMA风格模型中的异常特征和大规模激活进行了表征,而下游的激活量化流程在后LLaMA开放模型繁荣的背景下并未重新审视这一情况。我们提出了一个面向部署的问题:现代开放LLM中的激活值最大可以达到多大,这一幅度在不同模型家族、代际和训练阶段之间是如何变化的?在一个统一的流程下(5000样本的多领域语料库、特定家族的标记化、嵌入、隐藏状态、注意力、MLP/MoE、SwiGLU门和最终归一化的相同钩子),我们在来自8个开放家族的27个检查点上测量了全局和层级的最大值,这些家族涵盖了稠密、MoE、视觉-语言、中间训练和指令调优的变体。我们发现:(i) 在可比参数数量下,全局最大值跨越了近四个数量级,其中Qwen3.5和MoE检查点在10^2到10^3范围内,而Gemma3-27B-it达到了约7 x 10^5;(ii) 跨家族和跨代际的比较打破了简单的单调缩放;(iii) MoE检查点的峰值比匹配规模的稠密对应物低14.0-23.4倍,而残差流在22/24个检查点中承载了全局最大值。一项轻量级的INT-8合理性检查表明,测得的最大值与通过激活缩放选择的低位重构误差呈共变关系。我们得出结论,最大激活幅度是与模型家族、架构和训练阶段相关的模型特性,而不是简单的规模副产品,并且在低位部署之前应与任何开放权重发布一起测量和报告。代码可在https://github.com/clx1415926/Max_act_llm上公开获取。
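The link between maximum activation and low-bit reconstruction error can be illustrated with a minimal sketch. It assumes symmetric per-tensor absmax INT8 quantization, which the abstract does not confirm as the exact scheme used; the function names and sample values below are hypothetical and chosen only for illustration.

```python
def quantize_int8_absmax(values):
    """Symmetric per-tensor INT8 quantization; the scale is set by the largest magnitude."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to floating-point values."""
    return [q * scale for q in quantized]


def mean_abs_error(reference, reconstructed):
    """Average absolute reconstruction error."""
    return sum(abs(a - b) for a, b in zip(reference, reconstructed)) / len(reference)


# Hypothetical activations: five ordinary values, then the same values plus one outlier.
typical = [0.5, -1.2, 0.8, 2.0, -0.3]
with_outlier = typical + [700.0]

q_typical, scale_typical = quantize_int8_absmax(typical)
q_outlier, scale_outlier = quantize_int8_absmax(with_outlier)

# Error on the five ordinary values, before and after the outlier sets the scale.
err_typical = mean_abs_error(typical, dequantize(q_typical, scale_typical))
err_with_outlier = mean_abs_error(typical, dequantize(q_outlier[:5], scale_outlier))
```

In this toy case the single outlier stretches the scale so far that every ordinary value rounds to zero, which is the kind of effect that makes the measured maximum relevant when choosing an activation scale.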
cs.CL / 20 / 2605.15573
Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems
响应条件的并行到顺序编排用于多智能体系统
Abstract
Multi-agent systems can solve complex tasks through collaboration between multiple Large Language Model agents. Existing collaboration frameworks typically operate in either a parallel or a sequential mode. In the parallel mode, agents respond independently to queries, and their responses are then aggregated. In contrast, sequential systems allow agents to communicate via a directed topology and refine one another step by step. However, neither mode adequately minimizes communication and latency while simultaneously maximizing the accuracy of the final response. In this work, we introduce a hybrid paradigm called Nexa, a trainable response-conditioned policy that bridges the gap between the two modes. Nexa begins with a parallel execution stage, embeds the resulting responses into a shared semantic space, and then predicts a sparse directed acyclic communication graph. If the graph is empty, the system remains purely parallel; if it is non-empty, the system performs one sequential message propagation. The policy is a lightweight transformer model, and the method avoids the need for external LLM judges or reward models, as well as hand-crafted test-time topology search. We formalize this hybrid execution problem, show that the resulting graph is acyclic by construction and that the framework strictly subsumes pure parallel execution, and present a training procedure based on policy-gradient optimization. Results demonstrate that the response-conditioned policy learned by Nexa under one setting can be reused when the number of agents, the task, or the underlying agent changes, thus emphasizing the generalizability of the learned communication policy.
Chinese Translation
多智能体系统可以通过多个大型语言模型(Large Language Model)智能体之间的协作来解决复杂任务。现有的协作框架通常以并行或顺序模式运行。在并行模式下,智能体独立响应查询,然后聚合响应。相比之下,顺序系统允许智能体通过有向拓扑进行通信,并逐步相互完善。然而,这两种模式都不足以实现最小化通信和延迟的目标,同时最大化最终响应的准确性。在本研究中,我们引入了一种名为Nexa的混合范式,这是一种可训练的响应条件策略,弥合了这两种模式之间的差距。Nexa首先进行并行执行阶段,将结果响应嵌入共享语义空间,然后预测一个稀疏的有向无环通信图。如果图为空,系统保持纯并行;如果图非空,系统执行一次顺序消息传播。该策略是一个轻量级的变换器模型,并且该方法避免了对外部大型语言模型评判者或奖励模型的需求,以及手工设计的测试时拓扑搜索。我们对这一混合执行问题进行了形式化,证明了所得到的图在构造上是无环的,并且该框架严格包含了纯并行执行,并提出了一种基于策略梯度优化的训练过程。结果表明,在一种设置下由Nexa学习的响应条件策略可以在智能体数量、任务或基础智能体变化时重复使用,从而强调了所学通信策略的可推广性。
cs.CL / 21 / 2605.15588
Calibrating LLMs with Semantic-level Reward
使用语义级奖励对大型语言模型进行校准
Abstract
As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and thus exhibits inconsistency across textual variations with the same semantic meaning. We propose Calibration with Semantic Reward (CSR), a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower ECE and higher AUROC than verbalized-confidence baselines in nearly all settings, reducing ECE by up to 40% and improving AUROC by up to 31%, with calibration behavior generalizing robustly across all four evaluation settings.
Chinese Translation
随着大型语言模型(LLMs)在医疗问答和法律推理等重要场景中的应用,估计其输出何时可能正确的能力对于安全可靠的使用至关重要,这需要良好校准的不确定性。标准的可验证奖励强化学习(RLVR)使用二元正确性奖励训练模型,对置信度无动于衷,对自信但错误的预测没有惩罚,从而降低了校准效果。近期的研究通过训练模型在回答的同时生成口头化的置信度分数,并奖励与正确性的一致性来解决这一问题。然而,口头化的置信度是在词元级别进行校准的,因此在具有相同语义意义的文本变体之间表现出不一致性。我们提出了使用语义奖励进行校准(Calibration with Semantic Reward,CSR)框架,该框架直接在语义空间中对语言模型进行校准,而无需口头化的置信度接口。CSR将正确性奖励与一种新颖的语义校准奖励相结合:通过促进语义一致性鼓励在正确的结果之间进行利用,通过抑制虚假的一致性鼓励在错误的结果之间进行探索。在三个模型系列上针对HotpotQA(同分布)以及TriviaQA、MSMARCO、NQ-Open(异分布)的实验表明,CSR在几乎所有设置中始终实现了比口头化置信度基线更低的ECE和更高的AUROC,将ECE降低了多达40%,并将AUROC提高了多达31%,其校准行为在所有四个评估设置中均表现出强健的泛化能力。
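Since the results above are reported in terms of ECE, a short sketch of the standard equal-width-bin expected calibration error may help. This is the textbook definition, not code from the CSR authors; how CSR derives a confidence value for each answer is not described in the abstract, so confidences are simply taken as given inputs here, and the function name and bin count are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece
```

For example, four answers all given confidence 0.9 with three of them correct yield an ECE of 0.15, the gap between the stated confidence and the observed 0.75 accuracy.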
cs.CL / 22 / 2605.15589
MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models
MHGraphBench:基于知识图谱的心理健康知识在大型语言模型中的基准评估
Abstract
Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.
Chinese Translation
大型语言模型(LLMs)在心理健康领域的应用日益增多,但它们在捕捉相关生物医学知识方面的能力以及在临床显著结构判断中的可靠应用仍不明确。在此,我们提出了一个基于知识图谱(KG)的基准,用于评估LLMs在心理健康实体识别、关系判断和两跳推理方面的表现。该基准源自PrimeKG,包含九个任务类别,提供KG支持的答案和受控的负面选项。对15个闭源和开源LLMs的实验揭示了一个持续存在的识别与判断差距:领先模型在实体类型识别和小规模关系类型子集上接近顶尖表现,但在关系预测和两跳推理方面仍面临挑战。此外,短小的KG衍生片段对某些模型有益,但对其他模型则降低了性能。此外,在受限的多项选择设置下,输出格式的可靠性可以显著影响测量性能,突显了响应有效性在基准评估中的关键作用。因此,MHGraphBench应被理解为在受限的多项选择界面下评估与精心策划的PrimeKG心理健康片段的一致性,而非对现实世界临床安全的直接评估。
cs.CL / 23 / 2605.15607
Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language
没有语义的语法:教大型语言模型在未见语言中编码
Abstract
Large language models (LLMs) achieve high pass rates on code generation benchmarks, yet whether they can transfer this ability to languages absent from pretraining remains poorly understood. We introduce PyLang, a minimal imperative language absent from all pretraining corpora, and evaluate frontier models zero-shot and fine-tuned Qwen3 (4B, 8B, 32B) on 352 problems. We find that fine-tuning quickly teaches syntax but fails to transfer semantic competence: Python outperforms PyLang by up to 19% across all configurations, and no intervention (multi-task learning, preference tuning, code infilling, or latent-space objectives) closes the gap. An LLM judge reveals that frontier models select an identical algorithm to Python 80% of the time, yet cannot translate it into a working PyLang implementation, and CKA analysis confirms that fine-tuned models converge to nearly identical internal representations across languages (CKA > 0.97) while diverging at the output stage. We term this the implementation fidelity gap: models possess language-agnostic algorithmic understanding but cannot express it in an unfamiliar language. Our findings highlight the need for training methods that decouple reasoning from language-specific realization.
Chinese Translation
大型语言模型(LLMs)在代码生成基准测试中取得了高通过率,但它们是否能够将这种能力转移到预训练中缺失的语言仍然不甚明了。我们引入了 PyLang,这是一种在所有预训练语料库中都不存在的最小命令式语言,并对前沿模型进行了零-shot 和微调评估,包括 Qwen3(4B、8B、32B)在 352 个问题上的表现。我们发现微调能够迅速教授语法,但未能转移语义能力:在所有配置中,Python 的表现比 PyLang 高出多达 19%,而且没有任何干预(多任务学习、偏好调优、代码填充或潜在空间目标)能够缩小这一差距。一项 LLM 判别分析显示,前沿模型在 80% 的情况下选择与 Python 相同的算法,但无法将其翻译为有效的 PyLang 实现。CKA 分析证实,微调模型在不同语言间收敛到几乎相同的内部表示(CKA > 0.97),而在输出阶段却出现分歧。我们将此称为实现保真度差距:模型具备语言无关的算法理解能力,但无法在不熟悉的语言中表达。我们的研究结果强调了需要训练方法来将推理与特定语言的实现解耦。
cs.CL / 24 / 2605.15609
PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding
PSD:通过并行推测解码推动扩散大型语言模型的帕累托前沿
Abstract
Diffusion large language models (dLLMs) generate text by iteratively denoising masked token sequences. Although dLLMs can predict all masked positions in parallel within each step, the large number of denoising iterations still makes inference expensive. This cost can be reduced spatially by unmasking multiple tokens per step, or temporally by collapsing multiple denoising steps into one verification call. We propose Parallel Speculative Decoding (PSD), a training-free framework that jointly improves inference along both axes. Using the confidence scores from a single forward pass, PSD selects positions to unmask via a configurable, adaptive unmasking policy and constructs multi-depth speculative drafts without extra model calls. A final batched verification pass then applies hierarchical acceptance, keeping the deepest draft that remains consistent with the updated predictions. Experiments on three dLLMs across reasoning and code generation tasks show that PSD achieves favorable trade-offs between inference efficiency and generation quality, reaching up to 5.5x tokens per forward pass with accuracy comparable to greedy decoding.
Chinese Translation
扩散大型语言模型(dLLMs)通过迭代去噪掩蔽的标记序列生成文本。尽管dLLMs可以在每一步中并行预测所有掩蔽位置,但大量的去噪迭代仍然使得推理成本较高。通过在每一步中解掩多个标记,可以在空间上降低这一成本;而通过将多个去噪步骤合并为一次验证调用,则可以在时间上降低成本。我们提出了并行推测解码(Parallel Speculative Decoding, PSD),这是一个无训练的框架,能够在这两个方面共同提高推理效率。PSD利用单次前向传播的置信度分数,通过可配置的自适应解掩策略选择解掩位置,并在不增加额外模型调用的情况下构建多深度的推测草稿。最后的批量验证步骤应用层级接受机制,保留与更新预测一致的最深草稿。在推理和代码生成任务上对三种dLLMs的实验表明,PSD在推理效率和生成质量之间实现了良好的权衡,达到每次前向传播高达5.5倍的标记数,同时准确性与贪婪解码相当。
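The idea of unmasking several tokens per step from one pass of confidence scores can be sketched as follows. This is only one plausible policy (a fixed confidence threshold with a fallback to the single most confident position); the abstract does not specify PSD's actual unmasking rule, and the function name, the threshold value, and the input layout are assumptions made for illustration.

```python
def select_unmask_positions(confidences, masked_positions, threshold=0.9):
    """Pick masked positions to reveal this step.

    confidences: list of top-1 probabilities, indexed by sequence position.
    masked_positions: positions that are still masked.
    Returns every masked position at or above the threshold; if none qualifies,
    returns the single most confident masked position so decoding always advances.
    """
    masked = list(masked_positions)
    if not masked:
        return []
    chosen = [pos for pos in masked if confidences[pos] >= threshold]
    if not chosen:
        chosen = [max(masked, key=lambda pos: confidences[pos])]
    return sorted(chosen)
```

The fallback guarantees progress of at least one token per step, while a lower threshold trades quality risk for more tokens revealed per forward pass.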
cs.CL / 25 / 2605.15613
Toward LLMs Beyond English-Centric Development
迈向超越英语中心发展的大型语言模型
Abstract
Through an analysis of sequences generated by open-weight large language models (LLMs), we demonstrate that LLMs are heavily biased toward English. While continual pre-training is commonly used to adapt LLMs to a target language, we show that it does not offer a cost advantage over training from scratch, even for improving cultural understanding in the target language. These findings suggest that dedicated per-language investment may become increasingly important for future LLM development, rather than relying primarily on the expansion of English-centric resources.
Chinese Translation
通过对开放权重的大型语言模型(LLMs)生成的序列进行分析,我们证明了LLMs在很大程度上偏向于英语。尽管持续的预训练通常用于将LLMs适应目标语言,但我们表明,这种方法在成本上并不优于从头开始训练,即使是在提高目标语言的文化理解方面。这些发现表明,针对每种语言的专门投资在未来LLM的发展中可能变得越来越重要,而不是主要依赖于英语中心资源的扩展。
cs.CL / 26 / 2605.15635
Evaluating Chinese Ambiguity Understanding in Large Language Models
评估大型语言模型对中文歧义理解的能力
Abstract
Linguistic ambiguity is critical to the robustness of Large Language Models (LLMs), yet existing research focuses mostly on English, with limited attention devoted to Chinese. Existing Chinese ambiguity datasets (e.g., CHAmbi) suffer from poor scalability. Guided by Potential Ambiguity (PA) Theory, we design a semi-automatic pipeline to construct CHA-Gen, the first PA Theory-grounded Chinese ambiguity dataset, which comprises 5,712 sentences (2,414 ambiguous, 3,298 unambiguous) across 18 potentially ambiguous structures. Evaluating LLMs (e.g., Gemma 3 and the Qwen 2.5/3 series) via direct querying and machine translation, we find that LLMs struggle with ambiguity detection, although CoT prompting improves it. Analysis of Qwen3-32B's CoT rationales reveals three common failure modes: ambiguity blindness, misattribution, and premature resolution. Uncertainty quantification with a semantic entropy metric shows higher uncertainty for ambiguous sentences. Moreover, instruction tuning induces overconfidence, whereas Base models better capture semantic diversity. We further observe that models exhibit a bias toward dominant interpretations. Our work provides a scalable approach for constructing Chinese ambiguity corpora and insights into how LLMs handle ambiguity, laying a foundation for further research on Chinese ambiguity in LLMs.
Chinese Translation
语言歧义对大型语言模型(LLMs)的鲁棒性至关重要,但现有研究主要集中在英语上,对中文的关注有限。现有的中文歧义数据集(如 CHAmbi)存在扩展性差的问题。在潜在歧义(PA)理论的指导下,我们设计了一种半自动化流程来构建 CHA-Gen。这是第一个基于 PA 理论的中文歧义数据集,包含 5,712 个句子(2,414 个歧义句,3,298 个非歧义句),涵盖 18 种潜在歧义结构。通过直接查询和机器翻译评估 LLMs(如 Gemma 3、Qwen 2.5/3 系列),我们发现 LLMs 在歧义检测方面存在困难(通过链式思维提示(CoT prompting)有所改善)。对 Qwen3-32B 的链式思维推理分析揭示了三种常见的失败模式:歧义盲点、错误归因和过早解决。使用语义熵度量的不确定性量化显示,歧义句的不确定性更高。此外,指令调优会导致过度自信,而基础模型则更好地捕捉语义多样性。我们进一步观察到,模型对主导解释存在偏向。我们的工作为中文歧义语料库提供了一种可扩展的方法,并为 LLMs 处理歧义提供了见解,为增强 LLMs 中的中文歧义研究奠定了基础。
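The semantic entropy metric mentioned above can be sketched in a few lines. This minimal version assumes that sampled model outputs have already been grouped into meaning clusters (in practice this grouping is usually done with an entailment model, which is outside the scope of the sketch) and estimates cluster probabilities by simple frequency; the function name and the cluster labels are hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    cluster_ids: one cluster label per sampled output; outputs that share a
    label are treated as expressing the same interpretation.
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Ten samples for an unambiguous sentence: all land in one interpretation.
unambiguous = ["reading_a"] * 10
# Ten samples for an ambiguous sentence: split evenly across two readings.
ambiguous = ["reading_a"] * 5 + ["reading_b"] * 5
```

Under this estimate the unambiguous example has zero entropy and the evenly split ambiguous example has entropy log 2, matching the intuition that ambiguous sentences should show higher uncertainty.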
cs.CL / 27 / 2605.15676
Dynamic Chunking for Diffusion Language Models
扩散语言模型的动态分块
Abstract
Block discrete diffusion language models factorize a sequence autoregressively over fixed-size positional blocks, decoupling within-block parallel denoising from across-block conditioning. We argue that this rigid partition wastes structure already present in the sequence: blocks defined by position rather than by content separate semantically coherent tokens and group unrelated ones together. We introduce the Dynamic Chunking Diffusion Model (DCDM), which replaces positional blocks with content-defined semantic chunks. At its core is Chunking Attention, a differentiable layer that routes tokens into K clusters parameterized by learnable subspaces and shaped end-to-end by the diffusion objective. The resulting cluster assignments induce a chunk-causal attention mask under which a discrete diffusion denoiser factorizes the sequence likelihood autoregressively over semantic chunks, strictly generalizing block discrete diffusion. On downstream benchmarks at parameter scales up to 1.5B, DCDM consistently improves over both unstructured and positional-block diffusion baselines, with the advantage stable across scales and visible early in training.
Chinese Translation
块离散扩散语言模型在固定大小的位置块上自回归地对序列进行因式分解,将块内并行去噪与块间条件解耦。我们认为这种刚性分区浪费了序列中已经存在的结构:由位置而非内容定义的块将语义上连贯的标记分开,并将无关的标记聚集在一起。我们提出了动态分块扩散模型(Dynamic Chunking Diffusion Model,DCDM),它用内容定义的语义块替代位置块。其核心是Chunking Attention,一个可微分的层,它将标记路由到由可学习子空间参数化的K个集群中,并通过扩散目标端到端地塑造。由此产生的集群分配引入了一个块因果注意力掩码,在该掩码下,离散扩散去噪器自回归地在语义块上因式分解序列似然,从而严格推广了块离散扩散。在参数规模高达15亿的下游基准测试中,DCDM在无结构和位置块扩散基线之上始终表现出改进,且这种优势在不同规模间保持稳定,并在训练初期即可观察到。
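A chunk-causal attention mask of the kind described above can be sketched under a simple assumption: each token carries an integer chunk index, chunks are ordered by that index, and a token may attend to every token in its own chunk or in an earlier chunk. The abstract does not state how DCDM orders its learned clusters, so this ordering and the function names are illustrative assumptions rather than the authors' construction.

```python
def chunk_causal_mask(chunk_ids):
    """mask[i][j] is True when token i may attend to token j.

    Attention is bidirectional inside a chunk and causal across chunks:
    token i sees token j only if j's chunk index is not later than i's.
    """
    n = len(chunk_ids)
    return [[chunk_ids[j] <= chunk_ids[i] for j in range(n)] for i in range(n)]


def positional_blocks(seq_len, block_size):
    """Fixed-size positional blocks expressed as chunk ids (the special case)."""
    return [position // block_size for position in range(seq_len)]


# Content-defined chunks need not be contiguous: tokens 0 and 2 share chunk 0.
content_mask = chunk_causal_mask([0, 1, 0, 1])
# Positional blocks of size 2 recover the usual block-causal pattern.
block_mask = chunk_causal_mask(positional_blocks(4, 2))
```

Passing position-derived chunk ids reproduces the block-causal mask, which is one way to see why content-defined chunks can be viewed as a generalization of positional blocks.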
cs.CL / 28 / 2605.15677
VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing
VCG-Bench:迈向统一的以视觉为中心的结构化生成与编辑基准
Abstract
Despite the rapid advancements in Vision-Language Models (VLMs), a critical gap remains in their ability to handle structured, controllable diagrammatic tasks essential for professional workflows. Existing methods predominantly rely on pixel-based synthesis, which operates in probabilistic pixel spaces and is inherently limited in editability and fidelity. Instead, we propose a new Diagram-as-Code paradigm with symbolic logic that leverages mxGraph Extensible Markup Language (XML) for precise diagram generation and editing. We present VCG-Bench, a unified benchmark for visual-centric mxGraph tasks. VCG-Bench comprises: (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code), and (3) a Tailored Evaluation Protocol employing multi-dimensional metrics such as mxGraph Execution Success Rate and Style Consistency Score (SCS). Experimental results highlight the challenges faced by current State-of-the-Art (SOTA) VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.
Chinese Translation
尽管视觉-语言模型(VLMs)迅速发展,但它们在处理专业工作流程中至关重要的结构化、可控图示任务方面仍存在显著差距。现有方法主要依赖于基于像素的合成,这种方法在概率像素空间中运行,固有地限制了可编辑性和保真度。因此,我们提出了一种新的图示即代码(Diagram-as-Code)范式,利用符号逻辑和 mxGraph 可扩展标记语言(XML)进行精确的图示生成与编辑。我们推出了 VCG-Bench,这是一个针对以视觉为中心的 mxGraph 任务的统一基准。VCG-Bench 包括:(1)一个分类的数据集,包含 1,449 个涵盖 6 个领域和 15 个子领域的多样化图示;(2)一个整合生成(视觉到代码)和可编辑性(代码到代码)的范式定义;(3)一个量身定制的评估协议,采用多维度指标,如 mxGraph 执行成功率、风格一致性评分(SCS)等。实验结果突显了当前最先进(SOTA)VLMs 在结构保真度和指令遵循方面面临的挑战,反映了它们的视觉和推理能力。
cs.CL / 29 / 2605.15680
Few-Shot Large Language Models for Actionable Triage Categorization of Online Patient Inquiries
用于在线患者咨询可操作性分诊分类的少样本大型语言模型
Abstract
Online patient inquiries are often informal, incomplete, and written before professional assessment, yet they must still be routed to an appropriate level of clinical follow-up. We study this as a four-class actionable triage task (self-care, schedule-visit, urgent-clinician-review, or emergency-referral) and ask whether prompted large language models (LLMs) can support such routing under low-resource labeling conditions. Using the public HealthCareMagic-100K corpus, we construct a 300-example human-calibrated gold evaluation set, a 700-example auto-labeled silver training set, and a 40-example few-shot pool. We compare Term Frequency-Inverse Document Frequency (TF-IDF) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) baselines trained on silver labels against six prompted LLMs under 0-shot, 4-shot, and 12-shot conditions. We evaluate with macro-F1 alongside safety-aware metrics, including emergency recall, under-triage rate, and severe under-triage rate. The strongest LLM (Claude Haiku 4.5, 12-shot) reaches a macro-F1 of 0.475, exceeding the best supervised baseline (BioBERT, 0.378) on point estimate, although the confidence intervals overlap. Few-shot prompting and two-model agreement help in label-dependent ways: self-care agreement is reliable, urgent-clinician-review is not. We conclude that LLMs can support triage prioritization and selective human review, but not autonomous deployment.
Chinese Translation
在线患者咨询通常是非正式的、不完整的,并且是在专业评估之前撰写的,但仍然必须将其引导到适当的临床跟进级别。我们将此研究作为一个四类可操作性分诊任务——自我护理、预约就诊、紧急临床审查或紧急转诊,并探讨在低资源标注条件下,提示的大型语言模型(LLMs)是否能够支持这种引导。利用公共的HealthCareMagic-100K语料库,我们构建了一个包含300个示例的人类校准金标准评估集、一个包含700个示例的自动标注银标准训练集,以及一个包含40个示例的少样本池。我们比较了在银标准标签上训练的词频-逆文档频率(TF-IDF)和生物医学文本挖掘的双向编码器表示(BioBERT)基线与六个提示的LLMs在0-shot、4-shot和12-shot条件下的表现。因此,我们使用宏平均F1以及安全意识指标进行评估,包括紧急召回、低分诊率和严重低分诊率。最强的LLM(Claude Haiku 4.5,12-shot)达到了宏平均F1 0.475,超过了最佳监督基线(BioBERT,0.378)的点估计,且置信区间重叠。少样本提示和双模型一致性在标签依赖的方式上有所帮助:自我护理的一致性可靠,而紧急临床审查则不然。我们得出结论,LLMs可以支持分诊优先级和选择性的人类审查,但不能实现自主部署。
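The evaluation metrics named above can be made concrete with a small sketch. Macro-F1 follows its standard definition; the under-triage and severe under-triage definitions are assumptions, since the abstract does not define them: here a case counts as under-triaged when the predicted level is less urgent than the gold level, and as severely under-triaged when it is at least two levels less urgent. All function names are hypothetical.

```python
# Triage levels ordered from least to most urgent (labels taken from the abstract).
LEVELS = ["self-care", "schedule-visit", "urgent-clinician-review", "emergency-referral"]
RANK = {label: index for index, label in enumerate(LEVELS)}


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1; a class with no support and no predictions scores 0."""
    scores = []
    for label in LEVELS:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


def under_triage_rate(gold, pred, min_gap=1):
    """Share of cases predicted at least `min_gap` levels less urgent than gold.

    min_gap=1 is the assumed under-triage rate; min_gap=2 is the assumed severe variant.
    """
    misses = sum(1 for g, p in zip(gold, pred) if RANK[g] - RANK[p] >= min_gap)
    return misses / len(gold)


def emergency_recall(gold, pred):
    """Fraction of gold emergency-referral cases that were predicted as such."""
    predictions = [p for g, p in zip(gold, pred) if g == "emergency-referral"]
    if not predictions:
        return 0.0
    return sum(1 for p in predictions if p == "emergency-referral") / len(predictions)
```

Reporting the safety-aware rates next to macro-F1 matters because the error types are asymmetric: sending an emergency case to self-care is far costlier than the reverse, yet macro-F1 alone treats both as equal misclassifications.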
cs.CL / 30 / 2605.15687
ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models
ASRU:激活引导与强化遗忘相结合的多模态大型语言模型
Abstract
Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU) crucial. Existing methods typically evaluate unlearning effectiveness based on output deviations, while overlooking the generation quality after unlearning. This can easily lead to hallucinated or rigid responses, thereby affecting the usability and safety of the unlearned model. To address this issue, we propose ASRU, a controllable multimodal unlearning framework that incorporates generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation redirection, and then optimizes fine-grained refusal boundaries using a customized reward function, thereby achieving a better trade-off between target knowledge unlearning and model utility. Experiments on Qwen3-VL show that ASRU significantly improves unlearning effectiveness (+24.6% on average) and generation quality (5.8x on average) while effectively preserving model utility, using only a small amount of retained supervision data.
Chinese Translation
多模态大型语言模型(MLLMs)在预训练过程中可能会记忆敏感的跨模态信息,因此机器遗忘(MU)变得至关重要。现有方法通常基于输出偏差评估遗忘的有效性,而忽视了遗忘后生成质量的重要性。这可能导致产生幻觉或僵化的响应,从而影响遗忘后模型的可用性和安全性。为了解决这个问题,我们提出了ASRU,一个可控的多模态遗忘框架,将生成质量作为核心评估目标。ASRU首先通过激活重定向引导初始拒绝行为,然后使用定制的奖励函数优化细粒度的拒绝边界,从而在目标知识遗忘与模型效用之间实现更好的权衡。在Qwen3-VL上的实验表明,ASRU在遗忘有效性上平均提高了24.6%,在生成质量上平均提高了5.8倍,同时有效保留了模型的效用,仅使用少量保留的监督数据。
cs.CL / 31 / 2605.15701
H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure
H-Mem:一种通过混合结构演变和检索智能体记忆的新型记忆机制
Abstract
Memory data are ubiquitous in Large Language Model (LLM)-based agents (e.g., OpenClaw and Manus). A few recent works have attempted to exploit agents' memory to improve their performance on the question-answering (QA) task, but they lack a principled mechanism for modeling how memory data evolve over time and for retrieving them effectively, leading to poor memory utilization. To fill this gap, we present H-Mem, a novel memory mechanism based on a hybrid structure that can not only effectively model the evolution of agent memory over a long period of time, but also provide an efficient memory retrieval approach. In particular, H-Mem builds a temporal and semantic tree structure that allows short-term memory data to evolve progressively into long-term memory data, where the latter provide summarized information about the former, while simultaneously constructing a knowledge graph to capture the relationships between entities in memory. Moreover, it offers an effective memory retrieval approach that exploits the combined tree and graph structure. Extensive experiments on three agent memory benchmarks show that H-Mem achieves state-of-the-art performance on the QA task.
Chinese Translation
记忆数据在基于大型语言模型(LLM)的智能体中无处不在(例如,OpenClaw 和 Manus)。一些近期的研究尝试利用智能体的记忆来提高它们在问答(QA)任务上的表现,但缺乏有效建模记忆数据随时间演变的原则性机制以及有效检索记忆数据的方法,导致记忆利用效率低下。为填补这一空白,我们提出了 H-Mem,这是一种通过混合结构实现的新型记忆机制,能够有效建模智能体记忆在较长时间内的演变,并提供高效的记忆检索方法。具体而言,H-Mem 构建了一个时间和语义树结构,使短期记忆数据能够逐步演变为长期记忆数据,后者提供前者的摘要信息,同时构建知识图谱以捕捉记忆中实体之间的关系。此外,它通过利用树和图结构的混合结构提供了一种有效的记忆检索方法。在三个智能体记忆基准上的大量实验表明,H-Mem 在问答任务上达到了最先进的性能。
cs.CL / 32 / 2605.15710
SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory
SMMBench:一种源分布式多模态智能体记忆基准
Abstract
Existing benchmarks for multimodal memory reasoning largely evaluate systems within pre-assembled contexts, but under-evaluate whether agents can use evidence distributed across independently originated sources. We argue that source-distributed memory composition is an important and under-examined bottleneck in multimodal agent memory, especially when relevant evidence is fragmented across heterogeneous artifacts such as conversations, profiles, screenshots, tables, images, and documents. To address this gap, we introduce the Source-distributed Multimodal Memory Benchmark (SMMBench), which measures whether agents can retrieve, align, and compose multimodal evidence scattered across multiple sources rather than reason within a single curated context. SMMBench evaluates four core capabilities: (1) cross-source multimodal reasoning; (2) conflict resolution; (3) preference reasoning; (4) memory-grounded action prediction. The benchmark contains 1,877 samples grounded in 264 sources. Experiments on representative memory-style and retrieval-based baselines show that current systems still struggle with these capabilities, positioning source-distributed multimodal memory as an important and still under-evaluated challenge for multimodal agents. Our data are available at https://huggingface.co/datasets/HuacanChai/SMMBench.
Chinese Translation
现有的多模态记忆推理基准主要在预先组装的上下文中评估系统,但对智能体能否利用分布于独立来源的证据的评估不足。我们认为,源分布式记忆组合是多模态智能体记忆中一个重要且未被充分研究的瓶颈,尤其是在相关证据分散于异构文档(如对话、个人资料、截图、表格、图像和文档)时。为了解决这一问题,我们引入了源分布式多模态记忆基准(SMMBench),该基准测量智能体是否能够检索、对齐和组合散布于多个来源的多模态证据,而不是在单一策划的上下文中进行推理。SMMBench评估四个核心能力:(1)跨源多模态推理;(2)冲突解决;(3)偏好推理;(4)基于记忆的行动预测。该基准包含基于264个来源的1877个样本。在代表性的记忆风格和基于检索的基线上的实验表明,当前系统在这些能力上仍然存在困难,标志着源分布式多模态记忆是多模态智能体面临的重要且仍未被充分评估的挑战。我们的数据可在 https://huggingface.co/datasets/HuacanChai/SMMBench 获取。
cs.CL / 33 / 2605.15721
Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering
作为推荐的情境化:用于情境工程的进化协同过滤
Abstract
Large Language Models (LLMs) are highly sensitive to their input contexts, motivating the development of automated context engineering. However, existing methods predominantly treat this as a global search problem, seeking a single context strategy that maximizes average performance across a dataset. This restrictive assumption overlooks the fact that different inputs often require distinct guidance, leaving substantial instance-level performance gains untapped. In this paper, we propose a paradigm shift by formulating context engineering as a recommendation problem. We introduce Neural Collaborative Context Engineering (NCCE), a framework that transitions optimization from a static global search to dynamic, instance-wise routing. NCCE first bootstraps a diverse catalog of anchor contexts and then employs a novel Context-CF Co-Evolution mechanism. This stage establishes a synergistic feedback loop: a lightweight Neural Collaborative Filtering (NCF) model learns instance-context preferences to guide the generation of specialized context variants, while the newly evaluated contexts continuously refine the NCF model's understanding of latent preferences. At inference time, the trained NCF model acts as a context router, dynamically assigning the most suitable context strategy to each unseen instance. Theoretical proofs and comprehensive experiments demonstrate that by matching individual inputs with their optimal contexts, NCCE significantly improves task accuracy, highlighting the critical importance of personalization in LLM context engineering.
Chinese Translation
大型语言模型(LLMs)对输入情境高度敏感,这推动了自动化情境工程的发展。然而,现有方法主要将其视为一个全局搜索问题,寻求一种能够最大化数据集平均性能的单一情境策略。这一限制性假设忽视了不同输入通常需要不同指导的事实,导致大量实例级性能提升未被挖掘。在本文中,我们提出了一种范式转变,将情境工程表述为推荐问题。我们引入了神经协同情境工程(Neural Collaborative Context Engineering,NCCE),这是一个将优化从静态全局搜索转变为动态实例级路由的框架。NCCE首先引导出多样化的锚定情境目录,然后采用一种新颖的情境-协同过滤共同演化(Context-CF Co-Evolution)机制。该阶段建立了一个协同反馈循环:一个轻量级的神经协同过滤(NCF)模型学习实例-情境偏好,以指导生成特定的情境变体,而新评估的情境则不断完善NCF模型对潜在偏好的理解。在推理时,训练好的NCF模型充当情境路由器,动态地为每个未见实例分配最合适的情境策略。理论证明和全面实验表明,通过将个别输入与其最佳情境匹配,NCCE显著提高了任务准确性,突显了个性化在LLM情境工程中的关键重要性。
cs.CL / 34 / 2605.15759
DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory
DimMem:高效长期智能体记忆的维度结构化
Abstract
Large language model (LLM) agents require long-term memory to leverage information from past interactions. However, existing memory systems often face a fidelity-efficiency trade-off: raw dialogue histories are expensive, while flat facts or summaries may discard the structure needed for precise recall. We propose DimMem, a lightweight dimensional memory framework that represents each memory as an atomic, typed, and self-contained unit with explicit fields such as time, location, reason, purpose, and keywords. This representation exposes the structure needed for dimension-aware retrieval, memory update, and selective assistant-context recall without storing full histories in the model context. Across LoCoMo-10 and LongMemEval-S, DimMem achieves 81.43% and 78.20% overall accuracy, respectively, outperforming existing lightweight memory systems while reducing LoCoMo per-query token cost by 24%. We further show that dimensional memory extraction is learnable by compact models: after fine-tuning on the DimMem schema, a Qwen3-4B extractor surpasses LightMem with GPT-4.1-mini on both benchmarks and reaches performance comparable to, or better than, much larger extractors in key settings. These results suggest that explicit dimensional structuring is an effective and efficient foundation for long-term memory in LLM agents. Code is available at https://github.com/ChowRunFa/DimMem.
Chinese Translation
大型语言模型(LLM)智能体需要长期记忆以利用过去交互中的信息。然而,现有的记忆系统常常面临忠实度与效率的权衡:原始对话历史成本高昂,而平面事实或摘要可能会丢失精确回忆所需的结构。我们提出了DimMem,一种轻量级的维度记忆框架,将每个记忆表示为一个原子、类型化且自包含的单元,具有明确的字段,如时间、地点、原因、目的和关键词。这种表示方式揭示了维度感知检索、记忆更新和选择性助手上下文回忆所需的结构,而无需在模型上下文中存储完整历史。在LoCoMo-10和LongMemEval-S上,DimMem分别实现了81.43%和78.20%的整体准确率,超越了现有的轻量级记忆系统,同时将LoCoMo每查询的令牌成本降低了24%。我们进一步展示了维度记忆提取是可通过紧凑模型学习的:在对DimMem架构进行微调后,一个Qwen3-4B提取器在两个基准测试中超越了LightMem与GPT-4.1-mini,并在关键设置中达到了与更大提取器相当或更好的性能。这些结果表明,明确的维度结构化是LLM智能体长期记忆的有效且高效的基础。代码可在https://github.com/ChowRunFa/DimMem获取。
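The idea of an atomic, typed memory unit with explicit fields can be illustrated with a small data-structure sketch. Only the field names time, location, reason, purpose, and keywords come from the abstract; the `content` and `memory_type` fields, the class name, and the exact-match retrieval helpers are hypothetical and are not the DimMem implementation, which is available in the linked repository.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MemoryUnit:
    """One self-contained memory with explicit dimensions (illustrative schema)."""
    content: str                      # hypothetical: the remembered statement itself
    memory_type: str                  # hypothetical: e.g. "event" or "preference"
    time: Optional[str] = None
    location: Optional[str] = None
    reason: Optional[str] = None
    purpose: Optional[str] = None
    keywords: List[str] = field(default_factory=list)


def retrieve(memories, **filters):
    """Return memories whose named fields equal every given filter value."""
    return [
        m for m in memories
        if all(getattr(m, name) == value for name, value in filters.items())
    ]


def retrieve_by_keyword(memories, keyword):
    """Return memories tagged with the given keyword."""
    return [m for m in memories if keyword in m.keywords]


store = [
    MemoryUnit("Booked a dentist appointment", "event",
               time="2024-05-02", location="Berlin", keywords=["health"]),
    MemoryUnit("Prefers vegetarian restaurants", "preference",
               reason="dietary choice", keywords=["food"]),
    MemoryUnit("Attended a conference", "event",
               time="2024-06-10", location="Berlin", keywords=["work"]),
]
berlin_events = retrieve(store, memory_type="event", location="Berlin")
```

Because each dimension is a separate field, a query about place or time can filter on that field directly instead of searching free text, which is the intuition behind dimension-aware retrieval.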
cs.CL / 35 / 2605.15763
CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs
CompactQE:通过小型开放权重大语言模型实现可解释的翻译质量估计
Abstract
Current state-of-the-art Quality Estimation (QE) in machine translation relies on massive, proprietary LLMs, raising data privacy concerns. We demonstrate that smaller, open-source LLMs (<30B parameters) are a viable, cost-effective and privacy-preserving alternative. Using a single-pass prompting strategy, our models simultaneously generate quality scores, MQM error annotations, suggested error corrections, and full post-editions. Our analysis shows these models achieve highly competitive system-level correlations with human judgments that outperform traditional neural metrics, fine-tuned models, and human inter-annotator agreement, effectively approximating the capabilities of much larger proprietary LLMs.
Chinese Translation
当前最先进的机器翻译质量估计(QE)依赖于庞大的专有大语言模型(LLMs),这引发了数据隐私问题。我们证明了较小的开源大语言模型(<30B参数)是一个可行的、具有成本效益且保护隐私的替代方案。通过单次提示策略,我们的模型同时生成质量评分、MQM错误注释、建议的错误修正和完整的后期编辑。我们的分析表明,这些模型在系统级别上与人类判断的相关性高度竞争,超越了传统神经度量、微调模型和人类标注者之间的一致性,有效地接近了更大专有大语言模型的能力。
cs.CL / 36 / 2605.15794
ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation
ForMaT:视觉基础的多语言PDF翻译数据集
Abstract
We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs, proposed for multimodal machine translation, that preserves the original layout metadata. To ensure structural diversity in the dataset, we employ K-Medoids sampling over 45 geometric features, which capture complex elements like nested tables and formulas, so that only visually diverse PDF documents are selected. Our evaluation reveals that current MT systems struggle with spatial grounding and geometric synchronization, often losing the link between text and its visual context. ForMaT provides a benchmark for developing layout-aware translation models that integrate visual and textual context for high-fidelity document reconstruction.
Chinese Translation
我们提出了ForMaT(格式保留的多语言翻译),这是一个包含3,956个PDF文档的平行语料库,涵盖15对语言,旨在为多模态机器翻译保留原始布局元数据。为了确保数据集的结构多样性,我们在45个几何特征上采用K-Medoids抽样,捕捉复杂元素,如嵌套表格和公式,专注于视觉上多样的PDF文档。我们的评估表明,当前的机器翻译系统在空间定位和几何同步方面存在困难,常常失去文本与其视觉上下文之间的联系。ForMaT为开发布局感知的翻译模型提供了基准,这些模型整合了视觉和文本上下文,以实现高保真度的文档重建。
cs.CL / 37 / 2605.15886
Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches
俄罗斯国内与外交政策演讲的关联多模型数据
Abstract
This paper introduces a dataset of interlinked multimodal political communications from the Russian government, addressing persistent deficiencies in the availability of social text- and image-based data for authoritarian politics contexts. The dataset comprises two large corpora of official speeches delivered by senior actors within the Kremlin and the Russian Ministry of Foreign Affairs over multiple decades. For each speech, we provide Russian- and English-language texts, associated images and captions where available, and harmonized metadata including (e.g.) dates, speakers, (geo)locations, and official government content tags. Unique identifiers link images to speeches and align Russian and English versions of the same communication texts. We further augment these linked datasets with validated topical annotations for both speech texts and speech images, which are generated via transformer-based multimodal topic modeling and refined by a Russian politics expert. The resulting data resources support multimodal, multilingual, temporal, and/or spatial analyses of (authoritarian) political communication and offer a valuable testbed for social science research and large language model (LLM) applications in political domains.
Chinese Translation
本文介绍了一个来自俄罗斯政府的互联多模态政治传播数据集,旨在解决在威权政治背景下社会文本和图像数据可用性方面的持续不足。该数据集包含了多个十年间克里姆林宫和俄罗斯外交部高级官员发表的两大类官方演讲语料库。对于每篇演讲,我们提供了俄文和英文文本、相关图像及其说明(如有),以及包括日期、发言人、(地理)位置和官方政府内容标签等统一的元数据。独特的标识符将图像与演讲关联,并对同一传播文本的俄文和英文版本进行对齐。我们进一步通过基于变换器的多模态主题建模生成并由俄罗斯政治专家精炼的主题注释,增强了这些关联数据集,涵盖了演讲文本和演讲图像。最终生成的数据资源支持对(威权)政治传播的多模态、多语言、时间和/或空间分析,并为社会科学研究和政治领域的大型语言模型(LLM)应用提供了宝贵的试验平台。
cs.CL / 38 / 2605.15913
Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation
通过自动分割和块蒸馏实现块注意力的泛化
Abstract
Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self-contained blocks, and the inefficiency of existing block fine-tuning methods, which risk degrading performance. To address these, we first construct SemanticSeg, a large and diverse semantic segmentation dataset containing over 30k instances across 16 categories, including books, code, web text, and conversations, with text lengths ranging from 2k to 32k. Using this dataset, we train a lightweight segmenter to automatically partition text into human-instinct-aligned blocks with controllable granularity. Second, we propose block distillation, a training framework more efficient than block fine-tuning, in which a frozen full-attention teacher model guides the block-attention student. This framework integrates three novel components: block sink tokens to mitigate information loss at block boundaries, block dropout to leverage training signals from all blocks, and token-level loss weighting to focus learning on block-attention-sensitive tokens. Experiments across multiple models and benchmarks demonstrate that our segmenter outperforms heuristic and statistical baselines, and block distillation achieves near-full-attention performance under block attention, establishing a practical and scalable pathway for deploying block attention.
Chinese Translation
块注意力将输入处理为相互无法关注的独立块,具有在检索增强生成(Retrieval-Augmented Generation, RAG)等长上下文场景中提高键值缓存重用的显著潜力。然而,其更广泛的应用受到两个主要挑战的阻碍:将输入文本分割成有意义的、自包含的块的困难,以及现有块微调方法的低效性,这可能导致性能下降。为了解决这些问题,我们首先构建了SemanticSeg,这是一个大型多样的语义分割数据集,包含超过30,000个实例,涵盖16个类别,包括书籍、代码、网络文本和对话,文本长度范围从2,000到32,000。利用该数据集,我们训练了一个轻量级的分割器,以自动将文本划分为与人类直觉对齐的、可控粒度的块。其次,我们提出了块蒸馏,这是一种比块微调更高效的训练框架,使用一个冻结的全注意力教师模型来指导块注意力学生。该框架整合了三个新颖的组件:块汇聚令牌以减轻块边界的信息损失、块丢弃以利用来自所有块的训练信号,以及令牌级损失加权以集中学习于对块注意力敏感的令牌。多个模型和基准的实验表明,我们的分割器优于启发式和统计基线,而块蒸馏在块注意力下实现了接近全注意力的性能,为部署块注意力建立了一条实用且可扩展的路径。
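The core constraint of block attention, that tokens in one block cannot attend to tokens in another, can be expressed as an attention mask. The sketch below is a simplified illustration in plain Python: it assumes causal attention inside each block and ignores any special handling of a final query block, sink tokens, or other components described in the abstract. The function name `block_causal_mask` is hypothetical.

```python
def block_causal_mask(block_lengths):
    """Build a boolean mask where mask[i][j] is True if token i may attend to token j.

    Attention is allowed only within the same block, and only to the
    current or earlier positions (causal), under the assumptions above.
    """
    # Assign each token position the index of the block it belongs to.
    block_id = [b for b, n in enumerate(block_lengths) for _ in range(n)]
    total = len(block_id)
    return [
        [block_id[i] == block_id[j] and j <= i for j in range(total)]
        for i in range(total)
    ]
```

Because no block depends on any other block under this mask, the keys and values of a block can in principle be computed once and reused wherever that block appears, which is the KV cache reuse the abstract refers to.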
cs.CL / 39 / 2605.15976
Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective
无参考强化学习微调机器翻译:基于序列到序列的视角
Abstract
Production machine translation relies overwhelmingly on encoder-decoder Seq2Seq models, yet reinforcement learning approaches to MT fine-tuning have largely targeted decoder-only LLMs at ≥7B parameters, with limited systematic study of encoder-decoder architectures. We apply Group Relative Policy Optimization (GRPO) to NLLB-200 (600M and 1.3B) using a hybrid reference-free reward (LaBSE and COMET-Kiwi) that requires no parallel data at fine-tuning time, evaluating across 13 typologically diverse languages. GRPO yields consistent improvements on all 13 languages, up to +5.03 chrF++ for Traditional Chinese, and, without any target-language data, competes with 3-epoch supervised fine-tuning on morphologically complex languages. We identify a consistent empirical pattern in which gains are largest where baseline performance is weakest and reward discriminability is highest, making this approach most effective precisely where parallel data is scarcest, and replicate this pattern across English and Spanish source languages.
Chinese Translation
生产环境中的机器翻译在很大程度上依赖于编码器-解码器的序列到序列(Seq2Seq)模型,但针对机器翻译微调的强化学习方法主要集中在参数量不低于70亿(≥7B)的仅解码器大型语言模型(LLMs)上,对编码器-解码器架构的系统性研究有限。我们将组相对策略优化(Group Relative Policy Optimization,GRPO)应用于NLLB-200(600M和1.3B),使用一种混合的无参考奖励(LaBSE和COMET-Kiwi),在微调时不需要平行数据,并在13种类型学上多样的语言中进行评估。GRPO在所有13种语言上均取得了一致的改进,繁体中文的chrF++提升幅度最高达到+5.03,并且在没有任何目标语言数据的情况下,在形态复杂的语言上可与3轮(epoch)监督微调相媲美。我们识别出一个一致的经验模式,即在基线性能最弱和奖励可区分性最高的地方,增益最大,这使得该方法恰恰在平行数据最稀缺的情况下最为有效,并在英语和西班牙语源语言中复现了这一模式。
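The group-relative step in GRPO can be illustrated with a short Python sketch: several candidate translations are sampled for one source sentence, each receives a reward, and the rewards are normalized within that group. The equal weighting in `hybrid_reward` and the use of the population standard deviation are assumptions made for illustration. The abstract does not state how the LaBSE and COMET-Kiwi signals are combined, and both scores are passed in here as plain numbers rather than computed by those models.

```python
from statistics import mean, pstdev


def hybrid_reward(similarity, qe_score, weight=0.5):
    """Blend a cross-lingual similarity score with a QE score (assumed weighting)."""
    return weight * similarity + (1.0 - weight) * qe_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Candidates that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged. No reference translation enters the computation, which is what makes the reward reference-free.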
cs.CL / 40 / 2605.15978
Ontology for Policing: Conceptual Knowledge Learning for Semantic Understanding and Reasoning in Law Enforcement Reports
警务本体:用于执法报告语义理解与推理的概念知识学习
Abstract
Law enforcement reports contain structured fields and written narratives. However, many incident facts that are needed for review, police training, and investigations appear only in natural language and require manual reading. We propose a framework that uses symbolic methods to convert narratives into evidence-linked facts. Our objective is to measure how much incident detail can be recovered from the unstructured narrative alone, and to build temporal graphs from time cues and domain axioms. The pipeline consists of redaction of personal identifiers, semantic parsing, predicate-to-ontology mapping, and reasoning. We evaluate the symbolic approach on 450 property crime reports and a short human review. Of the events extracted by the system, 54.1% had a confidence score of at least 0.80 and 93.7% were mapped through the PropBank--VerbNet--WordNet semantic path. Reviewers reached 100% agreement on incident initiation, stolen items, and temporal cues, with lower agreement on forced entry interpretation.
Chinese Translation
执法报告包含结构化字段和书面叙述。然而,许多用于审查、警察培训和调查的事件事实以自然语言呈现,需要人工阅读。我们提出了一种框架,使用符号方法将叙述转换为与证据相关的事实。我们的目标是评估叙述的价值,以仅从非结构化文本中恢复事件细节,并构建带有时间线索和领域公理的时间图。我们通过编辑个人标识符、语义解析、谓词映射到本体以及推理来实现这一目标。我们在450份财产犯罪报告和简短的人类审查中评估了该符号方法。从系统提取的事件中,54.1%的事件具有至少0.80的置信度分数,93.7%的事件通过PropBank--VerbNet--WordNet语义路径进行了映射。对于事件的启动、被盗物品和时间线索达成了100%的一致,而对于强行进入的解释则一致性较低。
cs.CL / 41 / 2605.15990
Defining Cultural Capabilities for AI Evaluation: A Taxonomy Grounded in Intercultural Communication Theory
为人工智能评估定义文化能力:基于跨文化传播理论的分类法
Abstract
Tremendous efforts have been put into evaluating the inclusivity and effectiveness of AI systems across cultures. However, the cultural capabilities considered in much of the literature remain vaguely defined, are referred to using interchangeable terminology, and are typically limited to recalling accurate information about various demographics, regions, and nationalities. To address this construct ambiguity, we draw from Intercultural Communication scholarship and propose a three-level taxonomy of AI-relevant cultural capabilities: Cultural Awareness answers "Does the model know?", Cultural Sensitivity answers "How does it frame its knowledge?", and Cultural Competence answers "Can it adapt as the interaction evolves?". Beyond conceptual clarification, we position this taxonomy as a practical tool for improving the validity and interpretability of AI evaluation in real-world, multicultural settings. Without such construct clarity, evaluation results risk overstating model capabilities and may lead to inappropriate deployment decisions in culturally sensitive contexts.
Chinese Translation
在评估人工智能系统在不同文化中的包容性和有效性方面,已经付出了巨大的努力。然而,许多文献中考虑的文化能力仍然定义模糊,使用可互换的术语,通常仅限于回忆关于各种人口统计、地区和国籍的准确信息。为了解决这一构念模糊性,我们借鉴跨文化传播学的研究,提出了一种与人工智能相关的文化能力的三级分类法:文化意识回答“模型是否知道?”;文化敏感性回答“它如何框定其知识?”;文化能力回答“随着互动的发展,它能否适应?”除了概念上的澄清,我们将这一分类法定位为改善人工智能在现实多元文化环境中评估的有效性和可解释性的实用工具。如果没有这样的构念清晰性,评估结果可能会夸大模型的能力,并可能导致在文化敏感的背景下做出不当的部署决策。
cs.CL / 42 / 2605.16011
Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric Study
视觉语言模型在数学教育中能否实现适应性?基于学习者模型的评分标准研究
Abstract
Adaptive learning refers to educational technologies that track learners' learning progress and adapt the instructional process based on individual learners' learning performance. It is increasingly recognized as critical for developing an effective learning support tool. Vision language models (VLMs) have seen adoption in mathematics education, and students have been using them as learning aids for personalized instruction. However, it is unknown whether VLMs have the ability to adapt to different learner profiles when providing mathematical instructions. Current VLMs lack a systematic evaluation framework for this adaptivity to different learner profiles in mathematics tutoring tasks. To address this gap, we draw on the learner model from the adaptive learning framework (Shute and Towle, 2018) and propose a learner model-based rubric. Our rubric formalizes adaptivity assessment into three aspects: cognitive aspects, motivational aspects, and complexity. We also evaluate two additional dimensions of VLM responses: correctness (of answers and solutions) and quality (of the response itself). Our experimental results show measurable differences in adaptivity across models and also reveal that current VLMs struggle to consistently produce learner model-based instructional responses, especially when receiving limited learner information.
Chinese Translation
适应性学习是指跟踪学习者学习进度并根据个体学习者的学习表现调整教学过程的教育技术。它越来越被认为是开发有效学习支持工具的关键。视觉语言模型(VLMs)已在数学教育中得到应用,学生们将其作为个性化教学的学习辅助工具。然而,目前尚不清楚VLMs在提供数学指导时是否能够适应不同的学习者特征。目前的VLMs缺乏针对数学辅导任务中对不同学习者特征的适应性进行系统评估的框架。为了解决这一问题,我们借鉴了适应性学习框架中的学习者模型(Shute and Towle, 2018),并提出了一种基于学习者模型的评分标准。我们的评分标准将适应性评估形式化为三个方面:认知方面、动机方面和复杂性。我们还评估了VLM响应的两个额外维度:正确性(答案和解决方案的正确性)和质量(响应本身的质量)。我们的实验结果显示,不同模型之间在适应性方面存在可测量的差异,并且还揭示了当前的VLMs在根据学习者模型生成教学响应时面临困难,尤其是在接收到有限的学习者信息时。
cs.CL / 43 / 2605.16023
Judge Circuits
判决电路
Abstract
LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1-5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight models we study: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, enabling format-independent preference to be isolated downstream of the requested output format. Our findings imply that benchmark-level reliability comparisons across formats are partially measuring formatter geometry rather than evaluation quality.
Chinese Translation
作为评估模型输出的主导范式,LLM(大语言模型)作为评判者在大规模评分中已占据主导地位,然而当其输出格式改变时(例如,1-5评分与真/假标签),同一模型却系统性地赋予不同的分数。现有对这些格式引起的不一致性的诊断仅停留在输入-输出层面。通过使用位置感知边缘归因修补(Position-aware Edge Attribution Patching, PEAP),我们对Gemma-3、Qwen2.5和Llama-3的内部机制进行了因果调查。我们发现,在结构化理解和开放式偏好任务中,判断共享了一个稀疏的、广义的潜在评估子图,该子图位于中到后期的多层感知器(MLPs)中;零消融该子图会导致判断崩溃,而在结构上模块化的模型中保留了世界知识。通过结构性地将抽象判断与输出格式解耦,我们为我们研究的开放权重模型中的格式引起的不一致性提供了机制性解释:在共享主干中计算出的连续判断信号通过脆弱的、格式特定的终端分支进行映射,从而使格式独立的偏好能够在请求的输出格式下游被隔离。我们的发现意味着,不同格式之间的基准级可靠性比较部分上是在测量格式化几何而非评估质量。
cs.CL / 44 / 2605.16026
From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation
从平面语言标签到类型学先验:多语言语音到语音翻译的结构化语言条件
Abstract
Compositional speech-to-speech translation (S2ST) systems built upon speech large language models (SpeechLLMs) have recently shown promising performance. However, existing S2ST systems often either neglect source-language information or encode it through a language-as-label paradigm, representing each source language as an independent flat embedding. Such a design overlooks systematic linguistic structure shared across languages, which may limit data-efficient multilingual adaptation when supervised S2ST data are scarce. To address this issue, we propose S2ST-Omni 2, a many-to-one compositional S2ST framework that systematically reformulates multilingual language conditioning from flat language labels to structured typological priors. Specifically, S2ST-Omni 2 revisits language conditioning at three levels: typology-informed hierarchical language encoding for structured source-language representation, dynamically-gated language-aware Dual-CTC for content-adaptive acoustic modulation, and typology-aware LLM prompting for decoder-side linguistic guidance. Experiments on CVSS-C show that S2ST-Omni 2 achieves superior average performance among representative S2ST approaches across BLEU, COMET, ASR-BLEU, and BLASER 2.0 under the adopted evaluation protocol. Ablation studies indicate that the proposed representation-level, acoustic-level, and decoding-level strategies provide complementary benefits. Moreover, controlled data-budget analyses and a Japanese-to-English evaluation using only approximately 3 hours of supervised training data suggest that explicit typological priors provide useful inductive biases for data-efficient multilingual S2ST.
Chinese Translation
基于语音大型语言模型(SpeechLLMs)的组合语音到语音翻译(S2ST)系统最近显示出良好的性能。然而,现有的S2ST系统往往要么忽视源语言信息,要么通过语言作为标签的范式对其进行编码,将每种源语言表示为独立的平面嵌入。这种设计忽视了跨语言共享的系统性语言结构,这可能在监督的S2ST数据稀缺时限制了数据高效的多语言适应。为了解决这个问题,我们提出了S2ST-Omni 2,一个多对一的组合S2ST框架,系统性地将多语言语言条件从平面语言标签重新构造为结构化的类型学先验。具体而言,S2ST-Omni 2在三个层面上重新审视语言条件:基于类型学的分层语言编码用于结构化的源语言表示、动态门控的语言感知双重CTC(Dual-CTC)用于内容自适应的声学调制,以及类型学感知的LLM提示用于解码侧的语言指导。在CVSS-C上的实验表明,S2ST-Omni 2在所采用的评估协议下,在BLEU、COMET、ASR-BLEU和BLASER 2.0等代表性S2ST方法中实现了优越的平均性能。消融研究表明,所提出的表示层、声学层和解码层策略提供了互补的好处。此外,受控数据预算分析和仅使用约3小时监督训练数据的日语到英语评估表明,显式的类型学先验为数据高效的多语言S2ST提供了有用的归纳偏置。
cs.CL / 45 / 2605.16045
RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents
RecMem:基于重复性的记忆巩固方法用于高效且有效的长时间运行大语言模型代理
Abstract
Memory systems often organize user-agent interactions as retrievable external memory and are crucial for long-running agents because they overcome the limited context windows of LLMs. However, existing memory systems invoke LLMs to process every incoming interaction for memory extraction, and such an eager memory consolidation scheme leads to substantial token consumption. To tackle this problem, we propose RecMem, which rethinks when memory consolidation should be conducted. RecMem stores incoming interactions in a subconscious memory layer and encodes them using lightweight embedding models for retrieval. LLMs are only invoked to extract episodic and semantic memory when sustained recurrence is observed for semantically similar interactions. Such recurrence-based consolidation works because these interactions correspond to a semantic cluster with rich information and thus are worth extraction and summarization. To improve accuracy, RecMem also incorporates a semantic refinement mechanism that recovers the fine-grained facts omitted by memory extraction. Experiments show that RecMem reduces the memory construction token cost of three SOTA memory systems by up to 87% while exceeding their accuracy.
Chinese Translation
记忆系统通常将用户与代理的交互组织为可检索的外部记忆,对于长时间运行的代理至关重要,因为它们克服了大语言模型(LLMs)有限的上下文窗口。然而,现有的记忆系统在每次接收交互时都调用LLMs进行记忆提取,这种急切的记忆巩固方案导致了大量的标记消耗。为了解决这个问题,我们提出了RecMem,通过重新思考何时进行记忆巩固。RecMem将接收到的交互存储在潜意识记忆层中,并使用轻量级嵌入模型进行编码以便检索。只有在观察到语义相似交互的持续重复时,才会调用LLMs提取情节记忆和语义记忆。这种基于重复性的巩固方法有效,因为这些交互对应于一个信息丰富的语义簇,因此值得进行提取和总结。为了提高准确性,RecMem还结合了一种语义精炼机制,恢复了记忆提取中遗漏的细粒度事实。实验表明,RecMem将三种最先进记忆系统的记忆构建标记成本降低了多达87%,同时超越了它们的准确性。
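The recurrence trigger described above can be sketched in a few lines of Python. Interactions are buffered together with their embeddings, and a cluster is handed to the (expensive) LLM extraction step only once enough semantically similar interactions have accumulated. The class name `RecurrenceBuffer`, the cosine threshold, and the minimum recurrence count are hypothetical. The abstract does not specify how recurrence is detected, and embeddings are supplied by the caller here instead of being produced by an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class RecurrenceBuffer:
    """Cheap buffer that defers consolidation until similar items recur."""

    def __init__(self, threshold=0.9, min_recurrence=3):
        self.threshold = threshold            # assumed similarity cut-off
        self.min_recurrence = min_recurrence  # assumed recurrence count
        self.items = []                       # list of (text, embedding)

    def add(self, text, embedding):
        """Store an interaction; return a cluster to consolidate, or None."""
        self.items.append((text, embedding))
        cluster = [t for t, e in self.items
                   if cosine(e, embedding) >= self.threshold]
        return cluster if len(cluster) >= self.min_recurrence else None
```

In this sketch a non-`None` return value is the point at which an LLM would be called; every other interaction costs only an embedding and a similarity scan. Consolidated items are not removed from the buffer here, a detail a fuller implementation would need to decide.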
cs.CL / 46 / 2605.16077
Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction
大型语言模型能否模仿人类语言进行临床评估?基于LLM的数据增强用于认知评分预测
Abstract
Accurate assessment of cognitive decline from spontaneous speech remains challenging due to limited dataset size and class imbalance. In this work, we propose a large language model (LLM)-driven data augmentation framework to improve the prediction of cognitive scores from speech. Experiments are conducted on a Japanese corpus in which each participant provides both a spontaneous oral narrative and a written response to the same clinical prompt. The written responses serve as semantic anchors to generate multiple oral-like monologues in different styles using GPT-5. We then predict Hasegawa Dementia Scale scores, a widely used cognitive screening tool in Japan, using a Partial Least Squares regression model trained on Sentence-BERT speech embeddings. We investigate two augmentation strategies: random class-balanced selection, which yields moderate but unstable improvements, and similarity-guided class-balanced selection. The latter prioritizes semantically close synthetic samples, leading to more consistent improvements and substantially reducing prediction error for minority low-score participants while maintaining performance for the majority group. Overall, our findings demonstrate the potential of semantically guided LLM-driven augmentation as a principled approach for addressing class imbalance and improving data efficiency in clinical speech analysis.
Chinese Translation
由于数据集规模有限和类别不平衡,从自发语言中准确评估认知衰退仍然具有挑战性。在本研究中,我们提出了一种基于大型语言模型(LLM)的数据增强框架,以改善从语言中预测认知评分的效果。实验在一个日本语料库上进行,每位参与者提供自发的口头叙述和对同一临床提示的书面回应。书面回应作为语义锚点,利用GPT-5生成多种风格的口语化独白。随后,我们使用基于Sentence-BERT语音嵌入训练的偏最小二乘回归模型预测长谷川痴呆量表(Hasegawa Dementia Scale)分数,这是日本广泛使用的认知筛查工具。我们研究了两种增强策略:随机类别平衡选择,产生适度但不稳定的改善,以及相似性引导的类别平衡选择。后者优先考虑语义上相近的合成样本,导致更一致的改善,并显著减少低分少数参与者的预测误差,同时保持多数群体的表现。总体而言,我们的研究结果表明,语义引导的基于LLM的增强作为解决类别不平衡和提高临床语言分析数据效率的原则性方法具有潜力。
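Similarity-guided class-balanced selection can be sketched as follows: for each under-represented class, the synthetic samples most similar to their semantic anchor are added until the class reaches a target size. This is a minimal Python illustration under stated assumptions. The function name, the tuple layout `(label, similarity, text)`, and the single `target` count are hypothetical, and the similarity values are taken as given instead of being computed from Sentence-BERT embeddings.

```python
def select_synthetic(real_counts, candidates, target):
    """Pick synthetic samples per class, most similar first, up to a target size.

    real_counts: dict mapping class label to number of real samples.
    candidates:  list of (label, similarity, text) synthetic samples.
    target:      desired number of samples per class after augmentation.
    """
    selected = []
    for label, count in real_counts.items():
        need = max(0, target - count)
        # Rank this class's synthetic candidates by similarity, highest first.
        pool = sorted((c for c in candidates if c[0] == label),
                      key=lambda c: c[1], reverse=True)
        selected.extend(pool[:need])
    return selected
```

Replacing the sort with a random shuffle gives the random class-balanced baseline that the abstract describes as yielding moderate but unstable improvements.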
cs.CL / 47 / 2605.16107
Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection
机器生成文本检测的多层次上下文令牌关系建模
Abstract
Machine-generated texts (MGTs) pose risks such as disinformation and phishing, underscoring the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. Given their diverse designs, we first place representative metric-based methods within a unified framework, enabling a clear assessment of their advantages and limitations. Our analysis identifies a core challenge across these methods: the token-level detection score is easily biased by the inherent randomness of the MGT generation process. Then, we theoretically derive the multi-hop transitions of the token-level detection score and explore their local and global relations. Based on these findings, we propose a multi-level contextual token relation modeling framework for MGT detection. Specifically, we model local relations through a lightweight Markov-informed calibration module that refines token-level evidence before aggregation. For global relations, we introduce a rule-support reasoning module that uses explicit logical rules derived from contextual score statistics. Finally, we combine the local calibrated score and the global rule-support reasoning signal in a joint multi-level inference framework. Extensive experiments show broad and substantial improvements across various real-world scenarios, including cross-LLM and cross-domain settings, with low computational overhead.
Chinese Translation
机器生成文本(MGTs)带来了虚假信息和网络钓鱼等风险,强调了可靠检测的必要性。基于度量的方法通过提取MGTs的统计可区分特征,通常比复杂的基于模型的方法更为实用,因为后者容易出现过拟合。鉴于这些方法设计的多样性,我们首先将代表性的基于度量的方法置于一个统一框架内,从而能够清晰评估它们的优缺点。我们的分析识别出这些方法的一个核心挑战:令牌级检测分数容易受到MGTs生成过程固有随机性的偏倚。接着,我们理论上推导了令牌级检测分数的多跳转移,并探讨了它们的局部和全局关系。基于这些发现,我们提出了一种用于MGT检测的多层次上下文令牌关系建模框架。具体而言,对于局部关系,我们通过一个轻量级的马尔可夫信息校准模块对其建模,该模块在聚合之前细化令牌级证据。对于全局关系,我们引入一个规则支持推理模块,该模块使用从上下文分数统计中推导出的显式逻辑规则。最后,我们在一个联合多层次推理框架中结合局部校准分数和全局规则支持推理信号。大量实验表明,在包括跨LLM和跨领域设置的各种真实场景中,具有低计算开销的广泛且显著的改进。
cs.CL / 48 / 2605.16113
DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation
DebiasRAG:通过检索增强生成实现大型语言模型公平生成的无调优路径
Abstract
Large language models (LLMs) have achieved unprecedented success due to their exceptional generative capabilities. However, because they depend on knowledge encapsulated from training corpora, they may produce hallucinations, stereotypes, and socially biased content. In particular, LLMs are prone to prejudiced responses involving race, gender, and age, which are collectively referred to as social biases. Prior studies have used fine-tuning and prompt engineering to mitigate such biases in LLMs, but these methods require additional training resources or domain knowledge to design the framework. Moreover, they may degrade the original capabilities of LLMs and often overlook the need for dynamic debiasing contexts for fairer inference. In this paper, we propose DebiasRAG, a novel tuning-free and dynamic query-specific debiasing framework based on retrieval-augmented generation (RAG). DebiasRAG improves fairness while preserving the intrinsic properties of LLMs, such as representation ability. DebiasRAG consists of three stages: (1) query-specific debiasing candidate generation; (2) context candidate pool construction; and (3) gradient-updated debiasing-guided context piece reranking. First, DebiasRAG leverages self-diagnosed bias contexts relevant to the query through regular retrieval, where the bias contexts are prepared offline by the DebiasRAG provider. Given the query-specific bias contexts, DebiasRAG reversely produces debiasing contexts, which are provided as additional fairness constraints for LLM outputs. Second, a regular RAG retrieval process produces query-related contexts from the regular RAG document database, such as a chunked Wikipedia dataset.
Chinese Translation
大型语言模型(LLMs)因其卓越的生成能力而取得了前所未有的成功。然而,由于它们依赖于训练语料库中封装的知识,可能会产生幻觉、刻板印象和社会偏见内容。特别是,LLMs容易产生涉及种族、性别和年龄的偏见反应,这些反应统称为社会偏见。之前的研究通过微调和提示工程来减轻LLMs中的这些偏见,但这些方法需要额外的训练资源或领域知识来设计框架。此外,它们可能会降低LLMs的原始能力,并且往往忽视了动态去偏见上下文在实现更公平推理中的必要性。在本文中,我们提出了DebiasRAG,这是一种基于检索增强生成(RAG)的新型无调优和动态查询特定去偏见框架。DebiasRAG在保持LLMs内在属性(如表示能力)的同时,提高了公平性。DebiasRAG由三个阶段组成:(1)查询特定的去偏见候选生成;(2)上下文候选池构建;(3)梯度更新的去偏见引导上下文片段重排序。首先,DebiasRAG通过常规检索利用与查询相关的自我诊断偏见上下文,其中偏见上下文由DebiasRAG提供者离线准备。根据查询特定的偏见上下文,DebiasRAG反向生成去偏见上下文,这些上下文作为LLM输出的额外公平性约束。其次,常规RAG检索过程从常规RAG文档数据库中生成与查询相关的上下文,例如分块的维基百科数据集。
cs.CL / 49 / 2605.16117
SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation
SGR:一种用于外部子图生成的逐步推理框架
Abstract
Large Language Models (LLMs) have demonstrated strong capabilities across diverse NLP applications, such as translation, text generation, and question answering. Nevertheless, they remain limited in complex settings that demand deep reasoning and logical inference. Since these models are trained on large-scale text corpora, their generation process may still introduce irrelevant, noisy, or factually inconsistent content. To mitigate this problem, we introduce SGR, a stepwise framework that enhances LLM reasoning through external subgraph generation. SGR builds query-specific subgraphs from external knowledge bases and uses their semantic structure to support multi-step inference. By grounding intermediate reasoning steps in structured external knowledge, the framework helps the model concentrate on relevant entities, relations, and supporting evidence. In particular, SGR first constructs a subgraph tailored to the input question. It then guides the model to reason progressively over the generated structure and combines multiple reasoning trajectories to obtain the final prediction. Experimental results across several benchmark datasets show that SGR achieves consistent improvements over competitive baselines, highlighting its value for improving both reasoning accuracy and factual reliability.
Chinese Translation
大型语言模型(LLMs)在翻译、文本生成和问答等多种自然语言处理(NLP)应用中展现出了强大的能力。然而,在需要深度推理和逻辑推断的复杂场景中,它们仍然存在局限性。由于这些模型是在大规模文本语料库上训练的,它们的生成过程可能仍会引入无关、嘈杂或事实不一致的内容。为了解决这个问题,我们提出了SGR,一种通过外部子图生成增强LLM推理的逐步框架。SGR从外部知识库构建特定于查询的子图,并利用其语义结构支持多步推理。通过将中间推理步骤与结构化的外部知识相结合,该框架帮助模型集中关注相关实体、关系和支持证据。具体而言,SGR首先构建一个针对输入问题量身定制的子图。然后,它引导模型在生成的结构上逐步推理,并结合多个推理轨迹以获得最终预测。在多个基准数据集上的实验结果表明,SGR在竞争基线之上实现了一致的改进,突显了其在提高推理准确性和事实可靠性方面的价值。
cs.CL / 50 / 2605.16191
Optimized Three-Dimensional Photovoltaic Structures with LLM guided Tree Search
基于LLM引导的树搜索优化三维光伏结构
Abstract
We present a case study of how AI coding systems can be used to generate novel scientific hypotheses. We combine a generic coding agent (Google's AntiGravity) with an LLM-driven tree search algorithm (Empirical Research Assistance / ERA) to autonomously generate high-efficiency three-dimensional photovoltaic (3DPV) structures that overcome losses limiting flat solar panels at mid-latitudes. These structures operate by presenting favorable angles to the sun throughout the day, and for illustrative purposes we focus on optimizing performance for a single solar day. Our workflow begins by using AntiGravity to reproduce calculations (Bernardi et al., 2012) showing that 3DPV can have energy densities much higher than stationary flat PV panels. We use these initial designs as the starting point for large-scale tree search, where we seek improved solutions and score them for their diurnal yield. The initial tree search leads to nominally more efficient solutions, yet the gains are caused by algorithmic reward hacking, arising from non-physical design features such as structurally levitating disconnected tiers and exploitation of the discretizations in the optics solver. To counteract this, we develop a workflow where the coding agent iteratively patches the physics engine with constraints to eliminate reward hacking. With reward hacking eliminated, ERA discovers a series of designs with various constraints and improved performance, including optimal designs with different fixed collector areas, optimized zenith tracking, and avoidance of self-shadowing. Combining coding agents with tree search (ERA) provides a powerful platform for scientific discovery, for problems whose solutions can be empirically evaluated with a score function.
Chinese Translation
我们展示了一个案例研究,探讨人工智能编码系统如何用于生成新颖的科学假设。我们将一个通用编码代理(Google的AntiGravity)与一个基于LLM的树搜索算法(经验研究助手/ERA)结合起来,自动生成高效的三维光伏(3DPV)结构,以克服限制中纬度平面太阳能电池板的损失。这些结构通过在一天中以有利的角度朝向太阳来工作,为了说明,我们专注于优化单个太阳日的性能。我们的工作流程首先使用AntiGravity重现计算结果,显示3DPV的能量密度远高于静态平面光伏面板。我们将这些初始设计作为大规模树搜索的起点,寻求改进的解决方案并根据其昼夜产量进行评分。初步的树搜索导致了名义上更高效的解决方案,但这些效率是由于算法奖励黑客行为造成的,源于非物理设计特征,如结构上悬浮的断开层和光学求解器中的离散化利用。为了应对这一问题,我们开发了一种工作流程,使编码代理通过约束迭代修补物理引擎,以消除奖励黑客行为。在消除奖励黑客行为后,ERA发现了一系列具有不同约束和改进性能的设计,包括具有不同固定收集器面积的最佳设计,优化天顶跟踪并避免自我遮蔽。将编码代理与树搜索(ERA)结合,为科学发现提供了一个强大的平台,适用于那些可以通过评分函数进行经验评估的问题。
cs.CL / 51 / 2605.16193
Improving Cross-Cultural Survey Simulation with Calibrated Value Personas
通过校准价值角色改善跨文化调查模拟
Abstract
Large language models (LLMs) are increasingly used to simulate human opinions and survey responses, but their ability to reproduce population responses across cultures remains limited. Existing persona-based prompting methods typically rely on sociodemographic or personality traits, which are only indirect proxies for the values that shape human responses. We propose a value-based persona construction method that derives textual descriptors from survey responses capturing core cultural dimensions. By sampling value profiles from target populations and aggregating LLM responses across personas, we obtain population-level predictions grounded in observed value distributions. We further introduce a calibration procedure that improves response diversity while preserving estimated opinions. We show that our approach reduces prediction error across countries, with the largest improvements observed in underrepresented populations. This substantially narrows the performance gap between countries aligned with dominant LLM priors and those that are less represented in training data, while also yielding response distributions that closely match human diversity.
Chinese Translation
大型语言模型(LLMs)越来越多地用于模拟人类意见和调查响应,但它们在不同文化中再现人口响应的能力仍然有限。现有的基于角色的提示方法通常依赖于社会人口或个性特征,这些特征只是间接代理了塑造人类响应的价值观。我们提出了一种基于价值观的角色构建方法,该方法从捕捉核心文化维度的调查响应中推导出文本描述。通过从目标人群中抽样价值档案,并在不同角色之间聚合LLM响应,我们获得了基于观察到的价值分布的群体级预测。我们进一步引入了一种校准程序,该程序在保留估计意见的同时改善响应的多样性。我们展示了我们的方法在不同国家减少了预测误差,其中在代表性不足的人群中观察到最大的改善。这大大缩小了与主流LLM先验一致的国家与在训练数据中代表性较少的国家之间的性能差距,同时也产生了与人类多样性紧密匹配的响应分布。
cs.CL / 52 / 2605.16217
Argus: Evidence Assembly for Scalable Deep Research Agents
Argus:可扩展深度研究代理的证据组装
Abstract
Deep research agents have achieved remarkable progress on complex information-seeking tasks. Even long ReAct-style rollouts explore only a single trajectory, while recent state-of-the-art systems scale inference-time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute-forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.
Chinese Translation
深度研究代理在复杂信息检索任务上取得了显著进展。即使是长时间的 ReAct 风格的展开也仅探索单一轨迹,而最近的最先进系统通过并行搜索和聚合来扩展推理时间计算。然而,深度研究的答案由互补的证据片段组成,而并行展开往往重复而非补充这些片段,从而导致收益递减,同时将聚合上下文推向模型的极限。我们提出了 Argus,一个代理系统,其中搜索者(Searcher)和导航者(Navigator)合作,将深度研究视为从互补证据片段中组装拼图,而不是并行强行求解整个答案。搜索者通过 ReAct 风格的交互收集给定子查询的证据轨迹。导航者维护一个共享的证据图,验证哪些片段仍然缺失,派遣搜索者去收集这些片段,并对完成的图进行推理,以生成源追踪的最终答案。我们通过强化学习训练导航者以验证、派遣和综合,同时独立训练搜索者以保持标准的 ReAct 代理。最终的导航者支持单个搜索者或多个并行搜索者的展开,而无需重新训练。在一个 35B-A3B MoE 主干上构建的搜索者和导航者,Argus 在单个搜索者上获得了 5.5 分,在 8 个并行搜索者上获得了 12.7 分,平均在八个基准测试中表现。使用 64 个搜索者时,它在 BrowseComp 上达到了 86.2,超越了我们基准测试的每个专有代理,而导航者的推理上下文保持在 21.5K 令牌以下。
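Illustrative sketch (not from the paper): the verify, dispatch, and synthesize loop described in the Argus abstract can be pictured roughly as below. Everything here is an assumption made for illustration: the names `EvidenceGraph`, `toy_searcher`, and `navigate`, the dictionary-based graph, and the toy corpus are hypothetical, and the rule-based loop merely stands in for the reinforcement-learned Navigator and the ReAct-style Searcher.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    claim: str
    source: str


@dataclass
class EvidenceGraph:
    # Hypothetical structure: sub-query -> evidence piece, or None while missing.
    pieces: dict = field(default_factory=dict)

    def missing(self):
        """Sub-queries that still have no evidence attached."""
        return [q for q, ev in self.pieces.items() if ev is None]


def toy_searcher(sub_query, corpus):
    """Stand-in for a ReAct-style Searcher: a plain lookup in a toy corpus."""
    hit = corpus.get(sub_query)
    return Evidence(claim=hit[0], source=hit[1]) if hit else None


def navigate(sub_queries, corpus, num_searchers=1, max_rounds=8):
    """Rule-based stand-in for the Navigator (the paper trains it with RL)."""
    graph = EvidenceGraph({q: None for q in sub_queries})
    attempted = set()
    for _ in range(max_rounds):
        # Verify: which pieces are still missing and not yet tried?
        todo = [q for q in graph.missing() if q not in attempted]
        if not todo:
            break
        # Dispatch: up to num_searchers sub-queries per round
        # (run sequentially here; a real system would run them in parallel).
        for q in todo[:num_searchers]:
            attempted.add(q)
            graph.pieces[q] = toy_searcher(q, corpus)
    # Synthesize: join the collected claims, each tagged with its source.
    found = [ev for ev in graph.pieces.values() if ev is not None]
    answer = "; ".join(f"{ev.claim} [{ev.source}]" for ev in found)
    return answer, graph.missing()


corpus = {
    "who founded X": ("X was founded by A", "doc-1"),
    "when was X founded": ("X was founded in 1990", "doc-2"),
}
sub_queries = ["who founded X", "when was X founded", "where is X based"]

answer_one, missing_one = navigate(sub_queries, corpus, num_searchers=1)
answer_many, missing_many = navigate(sub_queries, corpus, num_searchers=2)
print(answer_many)
print(missing_many)
```

In this toy setting the same `navigate` function runs unchanged with one Searcher or several, which loosely mirrors the abstract's claim that the Navigator supports both modes without retraining.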
cs.CL / 53 / 2605.16222
Artificial Aphasias in Lesioned Language Models
损伤语言模型中的人工失语症
Abstract
Aphasias, selective language impairments which can arise from brain damage, reveal the functional organization of human language by providing causal links between affected brain regions and specific symptom profiles. Drawing on this literature, we introduce an aphasia-inspired technique to characterize the emergent functional organization of language models (LMs). We "lesion" (zero out) model parameters and measure the effects of this intervention against clinical aphasia symptoms, as diagnosed by the Text Aphasia Battery (TAB). When applied to 112,426 outputs from five 1B-scale LMs, the full range of evaluated symptoms surfaces, but in distributions largely distinct from those of humans. Our method uncovers broad symptom-profile differences between attention components (query, key, value, output) and feed-forward components (up, gate, down), with weaker evidence for differences among components within the same mechanism. We also find an effect of depth, where lesions in early layers disproportionately cause syntactic and semantic symptoms while late-middle layers yield higher rates of phonological and fluency deficits. Although some LM lesions induce quantitatively more similar profiles to some human aphasia types than others, qualitative differences in symptom patterns between LMs and humans suggest that aphasia syndromes are heavily influenced by the details of learning and processing rather than being a domain-invariant consequence of disrupted language processing.
Chinese Translation
失语症是由脑损伤引起的选择性语言障碍,通过提供受影响脑区与特定症状特征之间的因果联系,揭示了人类语言的功能组织。基于这一文献,我们引入了一种受失语症启发的技术,以表征语言模型(LMs)中出现的功能组织。我们对模型参数进行“损伤”(归零处理),并将这种干预的效果与通过文本失语症测评(Text Aphasia Battery, TAB)诊断的临床失语症症状进行比较。当应用于来自五个1B规模LMs的112,426个输出时,评估的症状范围全面呈现,但其分布与人类的分布在很大程度上是不同的。我们的方法揭示了注意力组件(查询、键、值、输出)与前馈组件(上、门、下)之间的广泛症状特征差异,而同一机制内组件之间的差异证据较弱。我们还发现了深度的影响,早期层的损伤不成比例地导致句法和语义症状,而中后期层则产生更高比例的音韵和流利性缺陷。尽管一些LM的损伤在定量上与某些人类失语症类型的特征更为相似,但LM与人类之间症状模式的定性差异表明,失语综合症在很大程度上受到学习和处理细节的影响,而不是语言处理中断的领域不变后果。
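Illustrative sketch (not from the paper): the lesioning intervention in the abstract above, zeroing out the parameters of one component, can be pictured on a toy parameter dictionary. The key naming scheme (`layers.<n>.attn.query`, `layers.<n>.mlp.gate`), the `lesion` helper, and the toy weights are all hypothetical; the abstract does not say how the authors select or store parameters.

```python
from copy import deepcopy


def lesion(params, layer, component):
    """Return a copy of `params` with one component of one layer zeroed out.

    `params` maps hypothetical names such as "layers.0.attn.query" to
    weight matrices stored as lists of rows. The input is left untouched.
    """
    target = f"layers.{layer}.{component}"
    if target not in params:
        raise KeyError(target)
    lesioned = deepcopy(params)
    lesioned[target] = [[0.0 for _ in row] for row in lesioned[target]]
    return lesioned


# Toy weights for illustration only; real models hold far larger tensors.
toy_params = {
    "layers.0.attn.query": [[0.1, -0.2], [0.3, 0.4]],
    "layers.0.mlp.gate": [[0.5, 0.6], [-0.7, 0.8]],
    "layers.1.attn.query": [[0.9, 1.0], [1.1, -1.2]],
}

lesioned = lesion(toy_params, layer=0, component="attn.query")
print(lesioned["layers.0.attn.query"])
```

Returning a copy keeps the intact model available, so outputs from the lesioned and unlesioned parameters can be compared side by side.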
cs.CL / 54 / 2605.16232
A Unified Generative-AI Framework for Smart Energy Infrastructure: Intelligent Gas Distribution, Utility Billing, Carbon Analytics, and Quantum-Inspired Optimisation
智能能源基础设施的统一生成式人工智能框架:智能气体分配、公用事业计费、碳分析与量子启发优化
Abstract
The accelerating convergence of smart metering, generative artificial intelligence, and quantum-inspired combinatorial optimisation is reshaping how energy utilities manage physical infrastructure, customer engagement, and environmental accountability.
Chinese Translation
智能计量、生成式人工智能和量子启发组合优化的加速融合正在重塑能源公用事业管理物理基础设施、客户参与和环境责任的方式。
cs.CL / 55 / 2605.16250
A Generative AI Framework for Intelligent Utility Billing, CO2 Analytics, and Sustainable Resource Optimisation
用于智能公用事业账单二氧化碳分析和可持续资源优化的生成式人工智能框架
Abstract
Distribution utilities are now expected to deliver bills that customers can actually read, attach a defensible carbon number to every kWh sold, and schedule load against grid stress and emissions constraints. We propose an end-to-end framework that unifies four production-grade capabilities under one architectural roof: a generative-AI agent that drafts each customer's natural-language billing statement from structured numeric inputs under a constrained decoding policy; a transformer-based forecaster that supplies the day-ahead consumption estimate with calibrated quantile bands.
Chinese Translation
配电公用事业现在被期望提供客户能够实际阅读的账单,为每千瓦时(kWh)销售附上可辩护的碳排放数字,并根据电网压力和排放约束安排负荷。我们提出了一个端到端框架,将四个生产级能力统一在一个架构之下:一个生成式人工智能(generative-AI)代理,根据结构化的数值输入,在受限解码策略下起草每位客户的自然语言账单声明;一个基于变换器(transformer)的预测器,提供经过校准的分位数区间的次日消费估计。