arXiv Daily Digest

337 Papers

Towards Robotic Dexterous Hand Intelligence: A Survey

朝向机器人灵巧手智能：一项综述

Zhao, Weiguang, Guo, Xihao, Liang, Tian, Zhang, Rui, King, Irwin, Huang, Kaizhu

Abstract

Robotic dexterous hands are central to contact-rich manipulation, with rapid progress driven by advances in hardware, sensing, control, simulation, and data generation. However, existing studies are often developed under different assumptions regarding hand embodiments, sensory configurations, task settings, training data, and evaluation protocols, making systematic comparison difficult and obscuring the developmental trajectory of the field. This survey provides a holistic review of dexterous hand research from four complementary aspects. First, we present a hardware-level analysis covering actuation, transmission, perception, and representative hand designs, highlighting the key trade-offs in force capability, compliance, bandwidth, integration, and system complexity. Second, we review control and learning methods for dexterous manipulation from a methodological perspective, grouping representative works by major paradigms and tracing their evolution in chronological order. Third, we consolidate datasets, modality design, and evaluation practices, which enables methodological progress to be interpreted together with the ways in which it is trained, benchmarked, and assessed. Finally, we discuss the major limitations of current dexterous hand research and summarize the corresponding future directions. By connecting hardware analysis, methodological development, data resources, and evaluation, this survey aims to provide a structured understanding of dexterous hand research and to clarify the most important open challenges for future study.

Chinese Translation

机器人灵巧手在接触丰富的操作中占据核心地位，其快速进展得益于硬件、传感、控制、仿真和数据生成等领域的进步。然而，现有研究通常在手部体现、传感配置、任务设置、训练数据和评估协议等方面存在不同假设，这使得系统比较变得困难，并模糊了该领域的发展轨迹。本综述从四个互补的方面对灵巧手研究进行了全面回顾。首先，我们呈现了一项硬件层面的分析，涵盖了驱动、传输、感知和代表性的手部设计，强调了在力能力、顺应性、带宽、集成和系统复杂性等方面的关键权衡。此外，我们从方法论的角度回顾了灵巧操作的控制和学习方法，将代表性工作按主要范式进行分组，并按时间顺序追溯其演变。此外，我们整合了数据集、模态设计和评估实践，使得方法论进展能够与其训练、基准测试和评估方式一起进行解读。最后，我们讨论了当前灵巧手研究的主要局限性，并总结了相应的未来方向。通过连接硬件分析、方法论发展、数据资源和评估，本综述旨在提供对灵巧手研究的结构化理解，并阐明未来研究中最重要的开放挑战。

cs.RO / 2 / 2605.13996

Ergodic Imitation for Adaptive Exploration around Demonstrations

基于遍历模仿的自适应探索方法

Xu, Ziyi, Bilaloglu, Cem, Li, Yiming, Calinon, Sylvain

Abstract

In robotics, a common challenge in imitation learning is the mismatch between training and deployment conditions, caused, for example, by environmental changes or imperfect observation and control. When a robot follows a nominal trajectory under such mismatch, it may become stuck and fail to complete the task. This calls for adaptive online exploration strategies that remain grounded in demonstrations. To this end, we propose an adaptive ergodic imitation approach that constructs a target distribution from the geometry of the retrieved demonstrations and uses it to generate trajectories that adaptively interpolate between tracking and exploration. Our method extends ergodic control beyond its traditional role in area-coverage and search by incorporating demonstrations into a retrieval-based receding-horizon framework for adaptive imitation.

Chinese Translation

在机器人技术中，模仿学习面临的一个常见挑战是训练条件与部署条件之间的不匹配，这种不匹配可能由环境变化或观察与控制的不完善引起。当机器人在这种不匹配下跟随名义轨迹时，可能会陷入困境并无法完成任务。这就需要基于示范的自适应在线探索策略。为此，我们提出了一种自适应遍历模仿方法，该方法从检索到的示范的几何结构中构建目标分布，并利用该分布生成自适应插值的轨迹，以在跟踪与探索之间进行平衡。我们的方法将遍历控制的应用扩展到传统的区域覆盖和搜索之外，通过将示范融入基于检索的递归视野框架中，实现自适应模仿。

cs.RO / 3 / 2605.14106

Behavior Cloning for Active Perception with Low-Resolution Egocentric Vision

基于行为克隆的低分辨率自我中心视觉的主动感知

Bilic, Anthony, Chen, Chen, Bölöni, Ladislau

Abstract

We investigate whether behavior cloning is sufficient to produce active perception in a structured object-finding task. A low-cost robot arm equipped with a wrist-mounted egocentric RGB camera must reposition to center a partially visible plant before triggering a grasp signal, requiring actions that improve future observations. The model predicts joint commands directly from low-resolution RGB images under closed-loop control. We show that low-resolution egocentric vision is sufficient for reliable task completion and that predicting relative joint deltas substantially outperforms absolute joint position prediction in our setting. These results demonstrate that visually grounded active perception can emerge from behavior cloning in a reproducible setting.

Chinese Translation

我们研究了行为克隆是否足以在结构化的物体寻找任务中产生主动感知。一个配备有腕部安装的自我中心RGB相机的低成本机器人手臂必须重新定位，以便在触发抓取信号之前将部分可见的植物居中，这要求采取能够改善未来观察的行动。该模型在闭环控制下直接从低分辨率RGB图像预测关节命令。我们表明，低分辨率自我中心视觉足以可靠地完成任务，并且在我们的设置中，预测相对关节增量显著优于绝对关节位置预测。这些结果表明，基于视觉的主动感知可以在可重复的环境中通过行为克隆产生。
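
The difference between the two action encodings compared in the abstract (absolute joint positions versus relative joint deltas) can be illustrated with a minimal sketch. The function names, the toy two-joint demonstration, and the clipping limit `max_step` are hypothetical and are not taken from the paper; the sketch only shows how the training labels would differ, not how the model is trained.

```python
from typing import List, Sequence

Joints = List[float]


def absolute_targets(traj: Sequence[Joints]) -> List[Joints]:
    """Label for step t is the next absolute joint configuration q[t+1]."""
    return [list(q_next) for q_next in traj[1:]]


def delta_targets(traj: Sequence[Joints]) -> List[Joints]:
    """Label for step t is the relative change q[t+1] - q[t]."""
    return [
        [b - a for a, b in zip(q, q_next)]
        for q, q_next in zip(traj[:-1], traj[1:])
    ]


def apply_delta(q: Joints, delta: Joints, max_step: float = 0.1) -> Joints:
    """Closed-loop execution: add a clipped delta to the measured joints.

    max_step (radians) is an assumed safety limit, not a value from the paper.
    """
    return [
        qi + max(-max_step, min(max_step, di))
        for qi, di in zip(q, delta)
    ]


# Toy two-joint demonstration (hypothetical values, in radians).
demo = [[0.0, 0.5], [0.05, 0.45], [0.10, 0.40]]
deltas = delta_targets(demo)

# Replaying the deltas from the first configuration recovers the demonstration.
q = demo[0]
replayed = [q]
for d in deltas:
    q = apply_delta(q, d)
    replayed.append(q)
```

With delta labels, each prediction is applied relative to the currently measured joint state, which is one plausible reason such an encoding could behave better under closed-loop control; the abstract reports the empirical gap but does not state the cause.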

cs.RO / 4 / 2605.14174

Safety-Constrained Reinforcement Learning with Post-Training Reachability Verification for Robot Navigation

带有后训练可达性验证的安全约束强化学习用于机器人导航

He, Qisong, Huang, Xinmiao, Hu, Jinwei, Li, Zhuoyun, Dong, Yi, Wu, Changshun, Huang, Xiaowei

Abstract

Safe navigation for mobile robots demands policies that remain reliable under the high-consequence perception uncertainty of cluttered environments. Yet most existing safe reinforcement learning (RL) methods assess safety through average cumulative cost. Such metrics can mask dangerous tail-risk behaviors. To address this, we propose a framework that trains risk-sensitive policies through Conditional Value-at-Risk (CVaR) constrained optimization on an off-policy TD3 backbone and evaluates their safety margins post-training through neural network reachability verification. During training, the policy is optimized under CVaR constraints on cumulative costs, promoting sensitivity to high-cost tail outcomes rather than average behavior alone. After training, we compute action reachable sets under bounded observation uncertainty using Taylor Model analysis, yielding a safety rate metric that quantifies the proportion of evaluated states at which the policy's reachable action set remains within prescribed safety margins. A key finding is that policies trained with CVaR constraints maintain larger safety margins from obstacles across evaluated states. This makes them significantly more amenable to formal reachability verification. Experiments across ten navigation scenarios and six baselines show that our method achieves a 98.3% success rate, the highest safety verification rate among all compared methods, while revealing that average cost rankings and reachability-based safety rankings can diverge. This indicates that reachability verification captures risks which are missed by empirical cost metrics alone. We further validate our approach on a physical Clearpath Jackal robot, demonstrating successful sim-to-real transfer.

Chinese Translation

移动机器人的安全导航要求在复杂环境的高后果感知不确定性下保持可靠的策略。然而，大多数现有的安全强化学习（RL）方法通过平均累积成本来评估安全性。这种度量可能掩盖危险的尾部风险行为。为了解决这个问题，我们提出了一个框架，通过条件风险价值（Conditional Value-at-Risk, CVaR）约束优化在离线策略TD3基础上训练风险敏感的策略，并通过神经网络可达性验证在训练后评估其安全边际。在训练过程中，策略在累积成本的CVaR约束下进行优化，促进对高成本尾部结果的敏感性，而不仅仅是平均行为。训练后，我们使用泰勒模型分析计算在有界观测不确定性下的可达动作集，得出一个安全率指标，该指标量化了评估状态中策略的可达动作集保持在规定安全边际内的比例。一个关键发现是，经过CVaR约束训练的策略在评估状态中与障碍物之间保持了更大的安全边际。这使得它们在形式可达性验证中显著更易于处理。在十个导航场景和六个基线的实验中，我们的方法达到了98.3%的成功率，这是所有比较方法中最高的安全验证率，同时揭示了平均成本排名和基于可达性的安全排名可能会出现偏差。这表明可达性验证捕捉到了仅通过经验成本度量所遗漏的风险。我们进一步在物理Clearpath Jackal机器人上验证了我们的方法，成功展示了模拟到现实的转移。
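
The contrast the abstract draws between average cumulative cost and CVaR can be made concrete with a small sketch. It assumes the common empirical definition of CVaR at level alpha as the mean of the worst (1 - alpha) fraction of episode costs. The function name `empirical_cvar`, the level 0.9, and the two synthetic cost lists are illustrative and do not come from the paper; the CVaR-constrained TD3 training itself is not reproduced here.

```python
import math
from typing import Sequence


def empirical_cvar(costs: Sequence[float], alpha: float = 0.9) -> float:
    """Mean of the worst (1 - alpha) fraction of costs (higher cost = worse)."""
    if not costs:
        raise ValueError("costs must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    ordered = sorted(costs, reverse=True)
    # Round before ceil so that float noise such as 3.0000000000000004
    # does not add a spurious extra sample to the tail.
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[:k]
    return sum(tail) / len(tail)


# Synthetic per-episode cumulative costs for two hypothetical policies.
policy_a = [1.0] * 10           # every episode incurs a small cost
policy_b = [0.0] * 9 + [10.0]   # usually free, occasionally very costly

mean_a = sum(policy_a) / len(policy_a)
mean_b = sum(policy_b) / len(policy_b)
cvar_a = empirical_cvar(policy_a, alpha=0.9)
cvar_b = empirical_cvar(policy_b, alpha=0.9)
```

Both synthetic policies have the same mean cost, yet only the CVaR separates them, because it looks solely at the rare high-cost episode of the second policy. This is the kind of tail behavior the abstract says average-cost metrics can mask.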

cs.RO / 5 / 2605.14199

Motion Planning for Autonomous Vehicles using Optimization over Graphs of Convex Sets

基于凸集图的自主车辆运动规划优化

Wagner, Matheus, Fröhlich, Antônio Augusto

Abstract

Motion planning for autonomous vehicles requires generating collision-free and dynamically feasible trajectories in complex environments under real-time constraints. While nonlinear optimal control formulations provide high-fidelity solutions, they are computationally demanding and sensitive to initialization, whereas geometric planning methods scale well but often decouple path selection from trajectory optimization. This paper studies the extent to which optimization over Graphs of Convex Sets (GCS) can approximate solutions of nonlinear optimal control problems in the context of autonomous driving. The free space is represented as a finite union of convex regions organized as a directed graph, allowing nonconvex geometry to be handled through discrete connectivity decisions while maintaining convex trajectory constraints within each region. Vehicle motion is parameterized using Bezier curves for the spatial path and a polynomial time-scaling function for temporal evolution. Under small-slip and linear tire assumptions, a simplified dynamic bicycle model enables approximate enforcement of dynamic feasibility through convex constraints on trajectory derivatives. The approach is evaluated in CommonRoad scenarios involving static obstacle avoidance and lane-changing maneuvers, and is compared against a nonlinear discrete-time optimal control formulation. The results indicate that the GCS-based method generates collision-free and dynamically consistent trajectories that closely match those obtained from the nonlinear program, while exhibiting improved computational efficiency and reduced sensitivity to initialization. These findings suggest that GCS provides a structured approximation of nonlinear motion planning problems, capturing dominant geometric and dynamic effects while preserving convexity in the continuous relaxation.

Chinese Translation

自主车辆的运动规划需要在复杂环境中生成无碰撞且动态可行的轨迹，并满足实时约束。虽然非线性最优控制方法提供了高保真度的解决方案，但其计算需求高且对初始化敏感，而几何规划方法则具有良好的扩展性，但通常将路径选择与轨迹优化解耦。本文研究了在自主驾驶背景下，基于凸集图（Graphs of Convex Sets, GCS）的优化在多大程度上可以近似非线性最优控制问题的解决方案。自由空间被表示为有向图组织的有限凸区域的并集，从而在处理非凸几何时可以通过离散连接决策来维持每个区域内的凸轨迹约束。车辆运动通过贝塞尔曲线参数化空间路径，并使用多项式时间缩放函数参数化时间演变。在小滑移和线性轮胎假设下，简化的动态自行车模型通过对轨迹导数施加凸约束来近似实现动态可行性。该方法在涉及静态障碍物避让和变道操作的CommonRoad场景中进行了评估，并与非线性离散时间最优控制方法进行了比较。结果表明，基于GCS的方法生成的轨迹无碰撞且动态一致，与非线性程序获得的轨迹高度匹配，同时展现出更高的计算效率和对初始化的较低敏感性。这些发现表明，GCS为非线性运动规划问题提供了一种结构化的近似，捕捉了主导的几何和动态效应，同时在连续松弛中保持了凸性。
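
The Bezier parameterization mentioned in the abstract pairs well with convex regions because of two standard properties: a Bezier curve lies in the convex hull of its control points, and its derivative is again a Bezier curve whose control points are scaled differences of the original ones. The sketch below illustrates both under simple assumptions. The axis-aligned box stands in for one convex region, and all names and numbers are hypothetical; the paper's actual constraint set is not given in the abstract.

```python
from math import comb
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def bezier_point(ctrl: Sequence[Point], s: float) -> Point:
    """Evaluate a Bezier curve at s in [0, 1] using the Bernstein basis."""
    n = len(ctrl) - 1
    x = y = 0.0
    for i, (px, py) in enumerate(ctrl):
        w = comb(n, i) * (1.0 - s) ** (n - i) * s ** i
        x += w * px
        y += w * py
    return (x, y)


def derivative_ctrl(ctrl: Sequence[Point]) -> List[Point]:
    """Control points of the derivative curve: n * (P[i+1] - P[i])."""
    n = len(ctrl) - 1
    return [
        (n * (bx - ax), n * (by - ay))
        for (ax, ay), (bx, by) in zip(ctrl[:-1], ctrl[1:])
    ]


def inside_box(points: Sequence[Point], lo: Point, hi: Point) -> bool:
    """True if every point lies in the axis-aligned box [lo, hi]."""
    return all(lo[0] <= x <= hi[0] and lo[1] <= y <= hi[1] for x, y in points)


# Hypothetical cubic path segment and a box standing in for one convex region.
ctrl = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
box_lo, box_hi = (0.0, 0.0), (4.0, 2.0)

# Constraining the 4 control points is enough to keep the whole curve inside.
controls_ok = inside_box(ctrl, box_lo, box_hi)
samples = [bezier_point(ctrl, i / 20) for i in range(21)]
curve_ok = inside_box(samples, box_lo, box_hi)

# Bounds on these points bound the path derivative along the whole segment.
velocity_ctrl = derivative_ctrl(ctrl)
```

The design point is that a finite number of linear constraints on control points replaces an infinite family of constraints on curve points, which is what keeps the per-region trajectory constraints convex.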

cs.RO / 6 / 2605.14201

MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

MAPLE：潜在多智能体游戏用于端到端自主驾驶

Yasarla, Rajeev, Hegde, Deepti, Cheng, Hsin-Pai, Han, Shizhong, Shi, Yunxiao, Sadeghigooghari, Meysam, Ackermann, Hanno, Liu, Litian, Desai, Pranav, Porikli, Fatih, Ghavamzadeh, Mohammad, Cai, Hong

Abstract

Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings because they are trained under a traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent-specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.

Chinese Translation

视觉-语言-动作（VLA）模型作为端到端运动规划器非常有效，但由于在传统模仿学习框架下训练，在闭环设置中评估时可能表现脆弱。现有的闭环监督方法缺乏可扩展性，无法完全建模反应性环境。我们提出了MAPLE，一个新颖的框架，用于在VLA模型的潜在空间中进行动态驾驶场景的反应性多智能体展开。自我车辆和附近的交通代理在多步时间范围内独立控制，同时对场景中的其他代理做出反应，从而实现闭环训练。MAPLE由两个训练阶段组成：（1）基于真实轨迹对潜在展开进行监督微调，随后（2）进行强化学习，使用全局和特定于代理的奖励，鼓励安全性、进展和交互的真实感。我们进一步提出了多样性奖励，鼓励模型生成在记录的驾驶数据中可能不存在的规划行为。值得注意的是，我们的闭环训练框架具有可扩展性，不需要外部模拟器，这些模拟器的运行可能计算成本高且与真实世界的视觉保真度有限。MAPLE在Bench2Drive上实现了最先进的驾驶性能，并展示了可扩展的闭环多智能体游戏，适用于强健的端到端自主驾驶系统。

cs.RO / 7 / 2605.14232

Reactive Planning based Control for Mobile Robots in Obstacle-Cluttered Environments

基于反应规划的移动机器人障碍物密集环境控制

Tan, Li, Xiong, Junlin, Wang, Yan, Ren, Wei

Abstract

This paper addresses the motion control problem for mobile robots in obstacle-cluttered environments. The mobile robot has only partial environment information and aims to move from an initial position to a target position without collisions. For this purpose, a reactive planning based control strategy (RPCS) is proposed. First, the initial and target positions are connected to form a reference trajectory. Then, a reactive planning strategy (RPS) is developed to ensure collision avoidance by modifying the reference trajectory locally based on the partial environment information. Next, an adaptive tracking control strategy (ATCS) is proposed to track the reference trajectory, including any local modifications, via discretization techniques. Finally, the RPS and ATCS are combined to establish the RPCS, whose efficacy and advantages are illustrated by numerical examples.

Chinese Translation

本文针对障碍物密集环境中移动机器人的运动控制问题进行探讨。移动机器人仅具备部分环境信息，旨在从初始位置移动到目标位置而不发生碰撞。为此，提出了一种基于反应规划的控制策略（Reactive Planning based Control Strategy, RPCS）。首先，将初始位置和目标位置连接为参考轨迹。然后，开发了一种反应规划策略（Reactive Planning Strategy, RPS），通过基于部分环境信息局部修改参考轨迹来确保避免碰撞。接下来，提出了一种自适应跟踪控制策略（Adaptive Tracking Control Strategy, ATCS），通过离散化技术跟踪可能局部修改的参考轨迹。最后，将RPS和ATCS结合以建立RPCS，并通过数值实例展示其有效性和优势。

cs.RO / 8 / 2605.14262

Distill: Uncovering the True Intent behind Human-Robot Communication

Distill：揭示人机沟通背后的真实意图

Li, Ting, Porfirio, David

Abstract

As robots become increasingly integrated into everyday environments, intuitive communication paradigms such as natural language and end-user programming have become indispensable for specifying autonomous robot behavior. However, these mechanisms are ineffective at fully capturing user intent: natural language is imprecise and ambiguous, whereas end-user programming can be overly specific. As a result, understanding what users truly mean when they interact with robots remains a central challenge for human-AI communication systems. To address this issue, we propose the Distill approach for human-robot communication interfaces. Given a task specification provided by the user, Distill (1) removes unnecessary steps; (2) generalizes the meaning behind individual steps; and (3) relaxes ordering constraints between steps. We implemented Distill on a web interface and, through a crowdsourcing study, demonstrated its ability to elicit and refine user intent from initial task specifications.

Chinese Translation

随着机器人日益融入日常环境，直观的沟通范式，如自然语言和终端用户编程，已成为指定自主机器人行为不可或缺的工具。然而，这些机制无法完全捕捉用户的意图：自然语言模糊且不精确，而终端用户编程可能过于具体。因此，理解用户在与机器人互动时的真实意图仍然是人机通信系统面临的核心挑战。为了解决这一问题，我们提出了用于人机沟通接口的Distill方法。根据用户提供的任务规范，Distill (1) 去除不必要的步骤；(2) 概括单个步骤背后的含义；(3) 放宽步骤之间的顺序约束。我们在一个网络接口上实现了Distill，并通过众包研究展示了其从初始任务规范中引导和细化用户意图的能力。

cs.RO / 9 / 2605.14411

Energy-Efficient Quadruped Locomotion with Compliant Feet

具有柔性足部的能量高效四足运动

Pal, Pramod, Kolathaya, Shishir, Ghosal, Ashitava

Abstract

Quadruped robots are often designed with rigid feet to simplify control and maintain stable contact during locomotion. While this approach is straightforward, it limits the ability of the legs to absorb impact forces and reuse stored elastic energy, leading to higher energy expenditure during locomotion. To explore whether compliant feet can provide an advantage, we integrate foot compliance into a reinforcement learning (RL) locomotion controller and study its effect on walking efficiency. In simulation, we train eight policies corresponding to eight different spring stiffness values and then cross-evaluate their performance by measuring mechanical energy consumed per meter traveled. In experiments on a developed quadruped, the energy consumption with the intermediate-stiffness spring is about 17% lower than with a very stiff or a very flexible spring in the feet, with similar trends appearing in the simulation results. These results indicate that selecting an appropriate foot compliance can improve locomotion efficiency without destabilizing the robot during motion.

Chinese Translation

四足机器人通常设计为刚性足部，以简化控制并在运动过程中保持稳定接触。尽管这种方法简单，但它限制了腿部吸收冲击力和重复利用储存弹性能量的能力，从而导致运动过程中的能量消耗增加。为了探讨柔性足部是否能提供优势，我们将足部柔性集成到强化学习（Reinforcement Learning, RL）运动控制器中，并研究其对行走效率的影响。在仿真中，我们训练了八个对应于八种不同弹簧刚度值的策略，然后通过测量每米行驶所消耗的机械能来交叉评估它们的性能。在对开发的四足机器人进行的实验中，使用中等刚度弹簧的能量消耗比使用非常刚性或非常柔性弹簧的能量消耗低约17%，仿真结果也显示出类似的趋势。这些结果表明，选择合适的足部柔性可以在不破坏机器人运动稳定性的情况下提高运动效率。
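
The evaluation metric named in the abstract, mechanical energy consumed per meter traveled, can be sketched as follows. The sketch assumes joint-level logs of torque and angular velocity sampled at a fixed period and counts absolute joint power, so negative work is not credited back. That sign convention, the function name, and the synthetic log are assumptions; the abstract does not specify how the authors compute the metric.

```python
from typing import Sequence


def mechanical_energy_per_meter(
    torques: Sequence[Sequence[float]],     # [T][J] joint torques in N*m
    velocities: Sequence[Sequence[float]],  # [T][J] joint velocities in rad/s
    dt: float,                              # sample period in s
    distance: float,                        # distance traveled in m
) -> float:
    """Integrate |torque * angular velocity| over all joints, per meter."""
    if distance <= 0.0:
        raise ValueError("distance must be positive")
    energy = 0.0
    for tau_t, omega_t in zip(torques, velocities):
        # Absolute joint power: positive and negative work both cost energy
        # (assumed convention).
        power = sum(abs(tau * omega) for tau, omega in zip(tau_t, omega_t))
        energy += power * dt
    return energy / distance


# Synthetic log: 2 joints, 100 samples at 10 ms, robot moved 2 m.
log_torques = [[1.0, -2.0]] * 100
log_velocities = [[2.0, 1.0]] * 100
joules_per_meter = mechanical_energy_per_meter(
    log_torques, log_velocities, dt=0.01, distance=2.0
)
```

Cross-evaluating policies, as the abstract describes, would then amount to computing this quantity for each policy on each stiffness value and comparing the resulting table.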

cs.RO / 10 / 2605.14417

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

身体移动之前：学习语言条件下的人形机器人控制的预期关节意图

Jia, Haozhe, Jin, Honglei, Zhang, Yuan, Fan, Youcheng, Liang, Shaofeng, Wang, Lei, Jin, Shuxu, Yu, Kuimou, Zhang, Zinuo, Song, Jianfei, Chen, Wenshuo, Yue, Yutao

Abstract

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.

Chinese Translation

自然语言是人形机器人直观的接口，但全身控制的流式处理需要可立即执行且对未来物理过渡具有预期的控制表示。现有的语言条件人形系统通常生成运动学参考，低级跟踪器必须进行反应性修复，或者使用潜在/动作策略，其输出并未明确编码即将发生的接触变化、支撑转移和保持平衡的准备。我们提出了DAJI（动态对齐关节意图），这是一个层次化框架，学习语言生成与闭环控制之间的预期关节意图接口。DAJI-Act通过学生驱动的回滚将未来感知的教师提炼为可部署的扩散动作策略，而DAJI-Flow则通过自回归方式从语言和意图历史生成未来意图块。实验表明，DAJI在预期潜在学习、单指令生成和流式指令跟随方面取得了良好的结果，在HumanML3D风格生成中达到94.42%的回滚成功率，在BABEL中达到0.152的子序列FID。

cs.RO / 11 / 2605.14571

Let Robots Feel Your Touch: Visuo-Tactile Cortical Alignment for Embodied Mirror Resonance

让机器人感受你的触碰：具身镜像共鸣的视觉-触觉皮层对齐

Zhu, Tianfang, An, Ning, Wang, Rui, Gao, Jiasi, Luo, Qingming, Li, Anan, Zhou, Guyue

Abstract

Observing touch on another's body can elicit corresponding tactile sensations in the observer, a phenomenon termed mirror touch that supports empathy and social perception. This visuo-tactile resonance is thought to rely on structural correspondence between visual and somatosensory cortices, yet robotic systems lack computational frameworks that instantiate this principle. Here we demonstrate that cortical correspondence can be operationalized to endow robots with mirror touch. We introduce Mirror Touch Net, which imposes semantic, distributional and geometric alignment between visual and tactile representations through multi-level constraints, enabling prediction of millimetre-scale tactile signals across 1,140 taxels on a robotic hand from RGB images. Manifold analysis reveals that these constraints reshape visual representations into geometry consistent with the tactile manifold, reducing the complexity of cross-modal mapping. Extending this alignment framework to cross-domain observations of human hands enables tactile prediction and reflexive responses to observed human touch. Our results link a neural principle of visuo-tactile resonance to robotic perception, providing an explainable route towards anticipatory touch and empathic human-robot interaction. Code is available at https://github.com/fun0515/Mirror-Touch-Net.

Chinese Translation

观察他人身体上的触碰可以在观察者身上引发相应的触觉感受，这一现象被称为镜像触觉，支持同理心和社会感知。这种视觉-触觉共鸣被认为依赖于视觉皮层和体感皮层之间的结构对应关系，然而，现有的机器人系统缺乏能够实现这一原理的计算框架。在此，我们展示了皮层对应关系可以被操作化，从而赋予机器人镜像触觉。我们引入了镜像触觉网络（Mirror Touch Net），通过多层约束在视觉和触觉表征之间施加语义、分布和几何对齐，使得能够根据RGB图像预测机器人手上1,140个触觉传感器的毫米级触觉信号。流形分析表明，这些约束将视觉表征重塑为与触觉流形一致的几何形状，从而降低了跨模态映射的复杂性。将这一对齐框架扩展到对人手的跨域观察，使得能够预测触觉并对观察到的人类触碰做出反射性反应。我们的结果将视觉-触觉共鸣的神经原理与机器人感知联系起来，为预期触觉和同理心的人机交互提供了可解释的途径。代码可在 https://github.com/fun0515/Mirror-Touch-Net 获取。

cs.RO / 12 / 2605.14598

DSSP: Diffusion State Space Policy with Full-History Encoding

DSSP：具有全历史编码的扩散状态空间策略

Guan, Zhiyuan, Hu, Jianshu, Fang, Han, Jiang, Yunpeng, Huang, Yize, Li, Shujia, Li, Xiao, Ban, Yutong

Abstract

Diffusion-based imitation learning has shown strong promise for robot manipulation. However, most existing policies condition only on the current observation or a short window of recent observations, limiting their ability to resolve history-dependent ambiguities in long-horizon tasks. To address this, we introduce DSSP, a history-conditioned Diffusion State Space Policy that enables efficient, full-history conditioning for robot manipulation. Leveraging the continuous sequence modeling properties of State Space Models (SSMs), our history encoder effectively compresses the entire observation stream into a compact context representation. To ensure this context preserves critical information regarding future state evolution, the encoder is optimized with a dynamics-aware auxiliary training objective. This high-level context representation is then seamlessly fused with recent state observations to form a hierarchical conditioning mechanism for action generation. Furthermore, to maintain architectural consistency and minimize GPU memory overhead, we also instantiate the diffusion backbone itself using an SSM. Extensive experiments across simulation benchmarks and real-world manipulation tasks show that DSSP achieves state-of-the-art performance with a significantly smaller model size, demonstrating superior efficiency of the hierarchical conditioning in capturing crucial information as the history length increases.

Chinese Translation

基于扩散的模仿学习在机器人操控中展现出强大的潜力。然而，现有的大多数策略仅依赖于当前观测或短期的近期观测，这限制了它们在长时间任务中解决依赖历史的模糊性的能力。为了解决这一问题，我们提出了DSSP，一种历史条件的扩散状态空间策略，能够实现高效的全历史条件化以用于机器人操控。利用状态空间模型（State Space Models, SSMs）的连续序列建模特性，我们的历史编码器有效地将整个观测流压缩为紧凑的上下文表示。为了确保该上下文保留关于未来状态演变的关键信息，编码器通过一种动态感知的辅助训练目标进行优化。然后，这种高层次的上下文表示与近期状态观测无缝融合，形成一个用于动作生成的分层条件机制。此外，为了保持架构一致性并最小化GPU内存开销，我们还使用SSM实例化扩散主干。通过在模拟基准和真实世界操控任务上的广泛实验，DSSP在显著较小的模型规模下实现了最先进的性能，展示了在历史长度增加时分层条件化在捕捉关键信息方面的优越效率。
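
The idea of compressing an entire observation stream into a compact context can be illustrated with the simplest possible state space recurrence. The sketch below is a toy diagonal linear SSM over a scalar stream, with hand-picked decay and gain values. It is not the DSSP encoder, whose architecture and training objective are not detailed in the abstract; every name and number here is an assumption.

```python
from typing import List, Sequence


def ssm_encode(
    observations: Sequence[float],
    decay: Sequence[float],
    gain: Sequence[float],
) -> List[float]:
    """Fold a scalar observation stream into a fixed-size state.

    Per channel: h[t] = decay * h[t-1] + gain * x[t].
    """
    state = [0.0] * len(decay)
    for x in observations:
        state = [a * h + b * x for a, h, b in zip(decay, state, gain)]
    return state


# Two channels with different memory lengths (hypothetical values).
decay = [0.5, 0.9]
gain = [1.0, 1.0]

short_context = ssm_encode([1.0, 0.0, 0.0], decay, gain)
long_context = ssm_encode([1.0] + [0.0] * 999, decay, gain)
```

The context has the same size for a 3-step and a 1000-step history, and each new observation costs one constant-time update, which is the property that makes full-history conditioning affordable compared with attending over a growing window.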

cs.RO / 13 / 2605.14683

SeaVis: Modeling and Control of a Remotely Operated Towed Vehicle for Seabed Visualization and Mapping

SeaVis：用于海底可视化和测绘的遥控拖曳车辆的建模与控制

Amer, Abdelhakim, Alstrup, Aske, Rasmussen, Frederik, Brodskiy, Yury, Sarabakha, Andriy, Kayacan, Erdal

Abstract

High-resolution seafloor mapping necessitates stable and precise positioning for underwater robots. This paper introduces a novel mathematical model for SeaVis remotely operated towed vehicles (ROTVs) and develops a gain-scheduled linear-quadratic regulator (LQR) for robust depth and attitude control. We validate the approach in a high-fidelity simulation, benchmarking the LQR against a conventional PID controller over a challenging seabed profile. The presented results demonstrate the LQR's superior performance, with significantly enhanced robustness to disturbances, greater control efficiency, and substantially reduced flap actuation. The gain scheduling also confirms the controller's effectiveness across the full operational velocity range. The complete simulation environment and controller are open-sourced.

Chinese Translation

高分辨率海底测绘需要水下机器人具备稳定和精确的定位能力。本文提出了一种新颖的数学模型，用于SeaVis遥控拖曳车辆（ROTVs），并开发了一种增益调度线性二次调节器（LQR）以实现稳健的深度和姿态控制。我们在高保真模拟中验证了该方法，并将LQR与传统PID控制器进行了基准测试，测试对象为具有挑战性的海底轮廓。结果表明，LQR在抗干扰能力、控制效率和减少舵机动作方面均表现出显著优势。增益调度的结果也确认了该控制器在整个操作速度范围内的有效性。完整的模拟环境和控制器已开源。
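
Gain scheduling for an LQR can be sketched on a deliberately tiny model. The example assumes a scalar system x_dot = a*x + b(v)*u, where x is a depth error, u a flap command, and the control authority b(v) grows with the square of the tow speed v. For a scalar system the continuous-time Riccati equation has a closed-form solution, so no numerical solver is needed. The model, its parameters, and the function names are all hypothetical; the SeaVis model in the paper is multi-dimensional and is not reproduced here.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def scalar_lqr_gain(a: float, b: float, q: float, r: float) -> float:
    """LQR gain K (u = -K*x) for x_dot = a*x + b*u, cost q*x^2 + r*u^2.

    Closed-form solution of the scalar continuous-time Riccati equation.
    """
    if b == 0.0 or q <= 0.0 or r <= 0.0:
        raise ValueError("need b != 0, q > 0, r > 0")
    return (a + math.sqrt(a * a + b * b * q / r)) / b


def build_schedule(
    speeds: Sequence[float], a: float, b_per_v2: float, q: float, r: float
) -> List[float]:
    """One LQR gain per design speed, with b(v) = b_per_v2 * v^2 (assumed)."""
    return [scalar_lqr_gain(a, b_per_v2 * v * v, q, r) for v in speeds]


def scheduled_gain(
    v: float, speeds: Sequence[float], gains: Sequence[float]
) -> float:
    """Linearly interpolate the gain table at speed v, clamping at the ends."""
    if v <= speeds[0]:
        return gains[0]
    if v >= speeds[-1]:
        return gains[-1]
    i = bisect_right(speeds, v) - 1
    t = (v - speeds[i]) / (speeds[i + 1] - speeds[i])
    return gains[i] + t * (gains[i + 1] - gains[i])


# Hypothetical design points: tow speeds in m/s and toy model parameters.
speeds = [1.0, 2.0, 3.0]
gains = build_schedule(speeds, a=0.1, b_per_v2=0.5, q=1.0, r=1.0)
k_mid = scheduled_gain(1.5, speeds, gains)
```

In this toy model the scheduled gain drops as speed rises, since the flaps become more effective and less deflection is needed for the same correction. Whether the real vehicle shows the same trend depends on its dynamics, which the abstract does not describe.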

cs.RO / 14 / 2605.14700

SR-Platform: An Agentic Pipeline for Natural Language-Driven Robot Simulation Environment Synthesis

SR-Platform：一种基于自然语言驱动的机器人仿真环境合成的自主管道

Lim, Ben Wei, Le, Minh Duc, Truong, Thang, Canh, Thanh Nguyen

Abstract

Generating robot simulation environments remains a major bottleneck in simulation-based robot learning. Constructing a training-ready MuJoCo scene typically requires expertise in 3D asset modeling, MJCF specification, spatial layout, collision avoidance, and robot-model integration. We present SR-Platform, a production-deployed agentic system that converts free-form natural language descriptions into executable, physically valid MuJoCo environments. SR-Platform decomposes scene synthesis into four stages: an LLM-based orchestrator that converts user intent into a structured scene plan; an asset forge that retrieves cached assets or generates new 3D geometry through LLM-to-CadQuery synthesis; a layout architect that assigns object poses and verifies industrial constraints; and a bridge layer that assembles the final MJCF scene and merges the selected robot model. The system is deployed as a nine-service Docker stack with WebSocket progress streaming, MinIO-backed mesh storage, Qdrant-based semantic asset retrieval, Redis job state, and InfluxDB telemetry. Using 30 days of production telemetry covering 611 successful LLM calls, SR-Platform generates five-object scenes with a median end-to-end latency of approximately 50 s, while cache-accelerated scenes complete in approximately 30-40 s. The asset forge shows an 11.3% first-attempt retry rate with automatic recovery, and cached asset retrieval removes per-object LLM calls for previously generated object types. These results show that agentic scene synthesis can reduce the manual effort required to create diverse robot training environments, enabling users to produce executable MuJoCo scenes from plain English prompts in under one minute.

Chinese Translation

生成机器人仿真环境仍然是基于仿真的机器人学习中的一个主要瓶颈。构建一个适合训练的MuJoCo场景通常需要在3D资产建模、MJCF规范、空间布局、碰撞避免和机器人模型集成等方面具备专业知识。我们提出了SR-Platform，这是一种已投入生产的自主系统，能够将自由形式的自然语言描述转换为可执行的、物理有效的MuJoCo环境。SR-Platform将场景合成分解为四个阶段：一个基于大型语言模型（LLM）的协调器，将用户意图转换为结构化的场景计划；一个资产锻造器，检索缓存资产或通过LLM到CadQuery的合成生成新的3D几何体；一个布局架构师，分配对象姿态并验证工业约束；以及一个桥接层，组装最终的MJCF场景并合并所选的机器人模型。该系统以九个服务的Docker堆栈形式部署，具有WebSocket进度流、MinIO支持的网格存储、基于Qdrant的语义资产检索、Redis作业状态和InfluxDB遥测。利用覆盖611次成功LLM调用的30天生产遥测数据，SR-Platform生成五个对象的场景，端到端延迟的中位数约为50秒，而缓存加速的场景完成时间约为30-40秒。资产锻造器的首次尝试重试率为11.3%，并具有自动恢复功能，缓存资产检索消除了对先前生成对象类型的每个对象的LLM调用。这些结果表明，自主场景合成可以减少创建多样化机器人训练环境所需的手动工作，使用户能够在不到一分钟的时间内从简单的英语提示生成可执行的MuJoCo场景。

cs.RO / 15 / 2605.14712

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

IntentVLA：用于别名机器人操作的短期意图建模

Lian, Shijie, Yu, Bin, Lin, Xiaopeng, Shen, Zhaolong, Yang, Laurence Tianruo, Jin, Yurun, Liu, Haishan, Wu, Changti, Yuan, Hang, Huang, Cong, Chen, Kai

Abstract

Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.

Chinese Translation

机器人模仿数据通常是多模态的：相似的视觉-语言观察可能会被不同的动作片段所跟随，因为人类示范者在短期意图、任务阶段或近期上下文方面的行为各异。现有的基于帧条件的VLA策略仅根据当前观察和指令推断每个片段，因此在部分可观测性下，它们可能在相邻的重新规划步骤中重新采样不同的意图，导致片段间的冲突和执行的不稳定性。我们提出了IntentVLA，这是一种基于历史的VLA框架，它将近期的视觉观察编码为紧凑的短期意图表示，并利用该表示来条件化片段生成。我们进一步引入了AliasBench，这是一个包含12个任务的模糊感知基准，在RoboTwin2上具有匹配的训练数据和评估环境，以隔离短期观察别名。在AliasBench、SimplerEnv、LIBERO和RoboCasa中，IntentVLA提高了回放稳定性，并超越了强大的VLA基线。

cs.RO / 16 / 2605.14801

Exploring Bottlenecks in VLM-LLM Navigation: How 3D Scene Understanding Capability Impacts Zero-Shot VLN

探索 VLM-LLM 导航中的瓶颈：3D 场景理解能力如何影响零-shot VLN

Xia, Ziyi, Xiong, Chaoran, Wei, Litao, Hu, Xinhao, Pei, Ling

Abstract

Zero-shot vision-and-language navigation (VLN) has gained significant attention due to its minimal data collection costs and inherent generalization. This paradigm is typically driven by the integration of pre-trained Vision-Language Models (VLMs) and Large Language Models (LLMs), where VLMs construct 3D scene graphs while LLMs handle high-level reasoning and decision-making. However, a critical bottleneck exists in this system: current 3D perception models prioritize pixel-level accuracy, directly conflicting with the strict computational limits and real-time efficiency demanded by embodied navigation. To address this gap, this paper quantifies the actual impact of 3D scene understanding capability on VLN performance. Based on typical VLM-LLM frameworks, we propose statistical success rate (SR) upper bounds for two core subsystems: 1) the slow LLM planner, which relies on topological mapping semantics, and 2) the fast reactive navigator, which utilizes spatial coordinates and bounding boxes to execute LLM decisions. Evaluations using state-of-the-art 3D scene understanding models validate our proposed bounds and reveal a perception saturation phenomenon, indicating that improvements in perception accuracy beyond a certain threshold yield diminishing returns in navigation success. Our findings suggest that 3D scene understanding for VLN should pivot away from strict pixel-level precision, prioritizing instead navigation-relevant core vocabularies and accurate bounding box proportions.

Chinese Translation

零-shot 视觉与语言导航（VLN）因其数据收集成本低和固有的泛化能力而受到广泛关注。该范式通常由预训练的视觉-语言模型（VLMs）和大型语言模型（LLMs）的结合驱动，其中 VLMs 构建 3D 场景图，而 LLMs 处理高层次的推理和决策。然而，该系统中存在一个关键瓶颈：当前的 3D 感知模型优先考虑像素级的准确性，这与具身导航所要求的严格计算限制和实时效率直接冲突。为了解决这一问题，本文量化了 3D 场景理解能力对 VLN 性能的实际影响。基于典型的 VLM-LLM 框架，我们为两个核心子系统提出了统计成功率（SR）上限：1）依赖拓扑映射语义的慢速 LLM 规划器，和 2）利用空间坐标和边界框执行 LLM 决策的快速反应导航器。使用最先进的 3D 场景理解模型进行的评估验证了我们提出的上限，并揭示了感知饱和现象，表明在某一阈值以上，感知准确性的提升在导航成功率上收益递减。我们的研究结果表明，VLN 的 3D 场景理解应当从严格的像素级精度转向优先考虑与导航相关的核心词汇和准确的边界框比例。

cs.RO / 17 / 2605.14805

Learning Cross-Coupled and Regime Dependent Dynamics for Aerial Manipulation

学习交叉耦合和状态依赖动态的空中操控

Yadav, Rishabh Dev, Ujjawal, Samaksh, Sun, Sihao, Roy, Spandan, Pan, Wei

Abstract

Accurate dynamics models are critical for aerial manipulators operating under complex tasks such as payload transport. However, modeling these systems remains fundamentally challenging due to strong quadrotor-manipulator coupling, delayed aerodynamic interactions, and regime-dependent dynamics variations arising from payload changes and manipulator reconfiguration. These effects produce residual dynamics that are simultaneously cross-coupled, history-dependent, and nonstationary, causing both analytical models and purely offline learned models to degrade during deployment. To address these challenges, we propose a structured encoder-decoder framework for adaptive residual dynamics learning in aerial manipulators. The proposed nonlinear latent encoder captures cross-variable coupling and temporal dependencies from state-input histories, while a lightweight linear latent decoder enables online adaptation under regime-dependent nonstationary dynamics. The linear-in-parameter decoder structure permits closed-form Bayesian adaptation together with consistency-driven covariance inflation, enabling rapid and stable adaptation to both transient and slowly varying dynamics changes while remaining compatible with real-time model predictive control (MPC). Experimental results on a real aerial manipulation platform demonstrate improved residual prediction accuracy, faster adaptation under changing operating conditions, and enhanced MPC-based trajectory tracking performance. These results highlight the importance of jointly modeling coupled temporal dynamics and deployment-time nonstationarity for reliable aerial manipulation.

Chinese Translation

准确的动态模型对于在复杂任务（如载荷运输）下操作的空中操控器至关重要。然而，由于四旋翼与操控器之间的强耦合、延迟的气动相互作用以及因载荷变化和操控器重配置而产生的状态依赖动态变化，建模这些系统仍然具有根本性挑战。这些影响产生了残余动态，这些动态同时具有交叉耦合、历史依赖性和非平稳性，导致分析模型和纯离线学习模型在部署过程中性能下降。为了解决这些挑战，我们提出了一种结构化的编码器-解码器框架，用于空中操控器的自适应残余动态学习。所提出的非线性潜在编码器从状态-输入历史中捕获交叉变量耦合和时间依赖性，而轻量级线性潜在解码器则能够在状态依赖的非平稳动态下进行在线适应。线性参数解码器结构允许闭式形式的贝叶斯适应，并结合一致性驱动的协方差膨胀，使得能够快速且稳定地适应瞬态和缓慢变化的动态变化，同时与实时模型预测控制（MPC）兼容。在真实的空中操控平台上的实验结果表明，残余预测精度有所提高，在变化的操作条件下适应速度更快，并且基于MPC的轨迹跟踪性能得到了增强。这些结果突显了联合建模耦合时间动态和部署时非平稳性对于可靠空中操控的重要性。
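
The closed-form Bayesian adaptation of a linear-in-parameter decoder mentioned in this abstract can be illustrated with a recursive Bayesian linear regression. This is only a minimal sketch under stated assumptions: the function names, the two-dimensional feature vectors, the noise variance, and the inflation rule (a gate on the normalized innovation squared) are hypothetical and are not taken from the paper.

```python
# Sketch: recursive Bayesian update of a linear-in-parameter decoder.
# The inflation rule (NIS gate and factor) is an illustrative assumption.

def mat_vec(P, v):
    return [sum(p * x for p, x in zip(row, v)) for row in P]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bayes_update(w, P, phi, y, noise_var=0.01, nis_gate=9.0, inflate=10.0):
    """One closed-form update of mean w and covariance P for y ~ phi . w."""
    err = y - dot(phi, w)                      # innovation
    Pphi = mat_vec(P, phi)
    S = dot(phi, Pphi) + noise_var             # innovation variance
    if err * err / S > nis_gate:               # inconsistent: regime likely changed
        P = [[inflate * p for p in row] for row in P]
        Pphi = mat_vec(P, phi)
        S = dot(phi, Pphi) + noise_var
    K = [x / S for x in Pphi]                  # Kalman-style gain
    n = len(w)
    w_new = [wi + ki * err for wi, ki in zip(w, K)]
    P_new = [[P[i][j] - K[i] * Pphi[j] for j in range(n)] for i in range(n)]
    return w_new, P_new

def run_demo():
    """Track weights that switch from [1, -2] to [3, 0.5] (a toy regime change)."""
    w, P = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]] * 5
    for true_w in ([1.0, -2.0], [3.0, 0.5]):
        for phi in features:
            w, P = bayes_update(w, P, phi, dot(phi, true_w))
    return w
```

Without the inflation step the covariance would stay small after the first regime and the estimate would follow the new weights only slowly. In this toy setup, inflating the covariance when the innovation is inconsistent restores a large gain.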

cs.RO / 18 / 2605.14810

CaMeRL: Collision-Aware and Memory-Enhanced Reinforcement Learning for UAV Navigation in Multi-Scale Obstacle Environments

CaMeRL：用于多尺度障碍环境中无人机导航的碰撞感知和记忆增强强化学习

Hong, Hong, Liao, Feiyu, Liang, Yongheng, Zhang, Boning, Wang, Haitao, Wu, Hejun

Abstract

In obstacle avoidance navigation of unmanned aerial vehicles (UAVs), variations in obstacle scale have received surprisingly little attention compared with obstacle number or density. Existing methods typically extract purely geometric features from single-frame depth observations. Such representations tend to neglect small obstacles and lose spatial context under occlusions caused by large obstacles, leading to noticeable degradation in environments with multi-scale obstacles. To address this issue, we propose CaMeRL, a Collision-aware and Memory-enhanced Reinforcement Learning framework for UAV navigation. The collision-aware latent representation encodes risk-sensitive depth cues to preserve fine-grained obstacle structures, thereby improving sensitivity to small obstacles. The temporal memory module integrates observations across frames, mitigating partial observability caused by large-obstacle occlusions. We evaluate CaMeRL with multi-scale obstacles, including ultra-small and extra-large obstacle settings. Results show that CaMeRL outperforms state-of-the-art baselines across all scales, with success rate gains of 0.48 and 0.28 in the ultra-small and extra-large settings, respectively. More importantly, CaMeRL achieves reliable navigation in cluttered outdoor environments.

Chinese Translation

在无人机（UAV）的障碍规避导航中，障碍物规模的变化相比于障碍物数量或密度受到了意外的较少关注。现有方法通常从单帧深度观测中提取纯几何特征。这种表示往往忽视小障碍物，并在大障碍物造成的遮挡下失去空间上下文，从而导致在多尺度障碍环境中显著的性能下降。为了解决这个问题，我们提出了CaMeRL，一个用于无人机导航的碰撞感知和记忆增强强化学习框架。碰撞感知的潜在表示编码了风险敏感的深度线索，以保留细粒度的障碍物结构，从而提高对小障碍物的敏感性。时间记忆模块整合了跨帧的观测，减轻了由于大障碍物遮挡造成的部分可观测性问题。我们在包括超小和超大障碍设置的多尺度障碍中评估了CaMeRL。结果表明，CaMeRL在所有尺度上均优于最先进的基线，在超小和超大设置中成功率分别提高了0.48和0.28。更重要的是，CaMeRL在杂乱的户外环境中实现了可靠的导航。

cs.RO / 19 / 2605.14832

Learning Direct Control Policies with Flow Matching for Autonomous Driving

基于流匹配的自主驾驶直接控制策略学习

Ceresini, Marcello, Pirazzoli, Federico, Bertogalli, Andrea, Cipelli, Lorenzo, D'Addeo, Filippo, Dell'Eva, Anthony, Capasso, Alessandro Paolo, Broggi, Alberto

Abstract

We present a flow-matching planner for autonomous driving that directly outputs actionable control trajectories defined by acceleration and curvature profiles. The model is conditioned on a bird's-eye-view (BEV) raster of the surrounding scene and generates control sequences in a small number of ordinary differential equation (ODE) integration steps, enabling low-latency inference suitable for real-time closed-loop re-planning. We train exclusively on urban scenarios (real streets, intersections, and roundabouts in the city of Parma, Italy) collected from a 2D traffic simulator with reactive agents, and evaluate in closed loop on both in-distribution and markedly out-of-distribution environments, including multi-lane highways and unseen urban scenarios. Our results show that the model generalizes reliably to these unseen conditions, maintaining stable closed-loop control and successfully completing scenarios that differ substantially from the training distribution. We attribute this to the BEV representation, which provides a geometry-centric view of the scene that is inherently less sensitive to distributional shifts, and to the flow-matching formulation, which learns a smooth vector field that degrades gracefully under distribution shift. We provide video demonstrations of closed-loop behavior at https://marcelloceresini.github.io/DirectControlFlowMatching.

Chinese Translation

我们提出了一种用于自动驾驶的流匹配规划器，该规划器直接输出由加速度和曲率轮廓定义的可操作控制轨迹。该模型以周围场景的鸟瞰图（BEV）栅格为条件，并在少量常微分方程（ODE）积分步骤中生成控制序列，从而实现适合实时闭环重规划的低延迟推理。我们专门在城市场景（意大利帕尔马市的真实城市街道、交叉口和环形交叉口）上进行训练，这些场景是从具有反应式智能体的二维交通模拟器中收集的，并在分布内和明显分布外的环境中进行闭环评估，包括多车道高速公路和未见过的城市场景。我们的结果表明，该模型在这些未见条件下能够可靠地泛化，保持稳定的闭环控制，并成功完成与训练分布显著不同的场景。我们将这一结果归因于BEV表示，它提供了一种几何中心的场景视图，固有地对分布变化不那么敏感，以及流匹配公式，它学习了在分布变化下优雅降级的平滑向量场。我们在 https://marcelloceresini.github.io/DirectControlFlowMatching 提供了闭环行为的视频演示。
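
Generating controls "in a small number of ODE integration steps" amounts to integrating a learned velocity field from t=0 to t=1 with a coarse solver. The sketch below uses fixed-step Euler integration and a hand-written toy velocity field in place of a trained network. The names `toy_velocity` and `sample_controls`, the two-element state (one acceleration and one curvature value), and the step count are assumptions for illustration, not the paper's model.

```python
# Sketch: few-step Euler integration of a flow-matching velocity field.
# toy_velocity stands in for a trained network; it is NOT the paper's model.

def toy_velocity(x, t, target):
    """Velocity of the straight-line path from the current point to `target`."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample_controls(x0, target, num_steps=4):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                      # t stays below 1, so 1 - t never hits zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Start from a noise sample and flow toward a hypothetical (acceleration, curvature) pair.
controls = sample_controls(x0=[0.3, -1.2], target=[1.5, 0.02])
```

For a straight-line path the Euler steps land on the target regardless of the step count, which is one reason few-step sampling is attractive when the learned field is close to straight.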

cs.RO / 20 / 2605.14911

Chrono-Gymnasium: An Open-Source, Gymnasium-Compatible Distributed Simulation Framework

Chrono-Gymnasium：一个开源的、兼容Gymnasium的分布式仿真框架

Zou, Bocheng, Zhang, Harry, Slaton, Khailanii, Wang, Jingquan, Ruan, Derrick, Unjhawala, Huzaifa Mustafa, Serban, Radu, Negrut, Dan

Abstract

High-fidelity physics simulation is essential for closing the sim-to-real gap in robotics and complex mechanical systems. However, the computational overhead of high-fidelity engines often limits their use in data-intensive tasks like Reinforcement Learning (RL) and global optimization. We introduce Chrono-Gymnasium, a distributed computing framework that scales the high-fidelity multi-body dynamics of Project Chrono across large-scale computing clusters. Built upon the Ray framework, Chrono-Gymnasium provides a standardized Gymnasium interface, enabling seamless integration with modern machine learning libraries while providing built-in synchronization and messaging primitives for distributed execution. We demonstrate the framework's capabilities through two distinct case studies: (1) the training of an RL agent for autonomous robotic navigation in complex terrains, and (2) the Bayesian Optimization of a planetary lander's design parameters to ensure landing stability. Our results show that Chrono-Gymnasium reduces wall-clock time for high-fidelity simulations without sacrificing physical accuracy, offering a scalable path for the design and control of complex robotic systems.

Chinese Translation

高保真物理仿真对于缩小机器人技术和复杂机械系统中的仿真与现实之间的差距至关重要。然而，高保真引擎的计算开销常常限制其在数据密集型任务（如强化学习（Reinforcement Learning, RL）和全局优化）中的应用。我们介绍了Chrono-Gymnasium，一个分布式计算框架，能够在大规模计算集群上扩展Project Chrono的高保真多体动力学。Chrono-Gymnasium基于Ray框架构建，提供了标准化的Gymnasium接口，使其能够与现代机器学习库无缝集成，同时为分布式执行提供内置的同步和消息传递原语。我们通过两个不同的案例研究展示了该框架的能力：（1）为复杂地形中的自主机器人导航训练RL智能体，以及（2）对行星着陆器设计参数的贝叶斯优化，以确保着陆稳定性。我们的结果表明，Chrono-Gymnasium在不牺牲物理准确性的情况下减少了高保真仿真的实际时间，为复杂机器人系统的设计和控制提供了一条可扩展的路径。

cs.RO / 21 / 2605.14920

FU-MPC: Frontier- and Uncertainty-Aware Model Predictive Control for Efficient and Accurate UAV Exploration with Motorized LiDAR

FU-MPC：前沿与不确定性感知的模型预测控制，用于搭载电动激光雷达的无人机高效精确探索

Li, Jianping, Wan, Pengfei, Liu, Zhongyuan, Wang, Yi, Chen, Yiheng, Xu, Xinhang, Jin, Rui, Zhou, Boyu, Xie, Lihua

Abstract

Efficient UAV exploration in unknown environments requires rapid coverage expansion while maintaining accurate and reliable localization, since safe navigation in complex scenes depends on consistent mapping and pose estimation. However, for conventional LiDAR-equipped UAVs, the observable region is tightly coupled with the UAV pose and motion. Expanding coverage often requires additional translational or rotational maneuvers, which can reduce exploration efficiency and increase the risk of localization degradation in geometrically challenging environments. Motorized rotating LiDARs provide a promising solution by actively adjusting the sensor viewing direction without changing the UAV motion, thereby introducing an additional sensing degree of freedom. Nevertheless, existing exploration systems rarely exploit this scanning freedom as an explicit decision variable linked to both exploration progress and localization quality. To address this gap, we develop a UAV platform equipped with an independently actuated rotating LiDAR and propose a hierarchical exploration framework. The global planner organizes frontiers into representative viewpoints and sequences them using topology-aware transition costs. Built upon this planner, FU-MPC serves as a local receding-horizon scan controller that optimizes LiDAR rotation along the predicted flight trajectory. The controller jointly considers frontier-aware exploration utility and direction-dependent localization uncertainty, while lightweight surrogate evaluation enables real-time onboard execution. Experiments in complex environments demonstrate that the proposed system improves exploration efficiency while maintaining robust localization performance compared with fixed-pattern scanning and uncertainty-only baselines. The project page can be found at https://kafeiyin00.github.io/FU-MPC/.

Chinese Translation

在未知环境中高效的无人机探索需要快速扩展覆盖范围，同时保持准确和可靠的定位，因为在复杂场景中的安全导航依赖于一致的地图构建和姿态估计。然而，对于传统的配备激光雷达的无人机，可观测区域与无人机的姿态和运动紧密相关。扩展覆盖范围通常需要额外的平移或旋转操作，这可能降低探索效率并增加在几何条件恶劣的环境中定位退化的风险。电动旋转激光雷达通过主动调整传感器视角而不改变无人机运动，提供了一种有前景的解决方案，从而引入了额外的感知自由度。然而，现有的探索系统很少将这种扫描自由度作为与探索进展和定位质量相关的显式决策变量。为了解决这一问题，我们开发了一种配备独立驱动旋转激光雷达的无人机平台，并提出了一个分层探索框架。全局规划器将前沿组织为代表性视点，并使用拓扑感知的过渡成本对其进行排序。在此规划器的基础上，FU-MPC作为一个局部滚动时域扫描控制器，优化激光雷达沿预测飞行轨迹的旋转。该控制器共同考虑了面向前沿的探索效用和方向依赖的定位不确定性，同时轻量级的代理评估实现了实时的机载执行。在复杂环境中的实验表明，与固定模式扫描和仅考虑不确定性的基线相比，所提系统在保持稳健定位性能的同时提高了探索效率。项目页面可在 https://kafeiyin00.github.io/FU-MPC/ 找到。

cs.RO / 22 / 2605.14944

Behavioral Data-Driven Optimal Trajectory Generation for Rotary Cranes

基于行为数据的旋转起重机最优轨迹生成

Khemakhem, Iskandar, Zobel, Manuel, Schüle, Johannes, Sawodny, Oliver, Uchiyama, Naoki, Farrage, Abdallah

Abstract

With the growth of the construction industry and the global shortage of skilled labor, the automation of crane control has become increasingly important for safe and efficient operations. A central challenge in automatic crane control is the reduction of load oscillations during motion, which is primarily addressed through appropriate slewing trajectories. In this context, classical model-based control methods rely on accurate dynamical models and expert tuning, and often struggle to meet safety and precision requirements, while many learning-based approaches require large data sets and significant computational resources. This paper proposes a behavioral data-driven framework for generating open-loop slewing trajectories for rotary cranes that suppress load sway while reducing operation time and energy consumption. The approach builds on Willems' fundamental lemma and its generalizations, to bypass explicit system modeling and operate directly on measured input-output data. A practical workflow is presented in this paper to reduce the need for expert knowledge. Despite the underactuated nature of the crane dynamics, the method identifies a nonparametric representation of the system behavior and generates smooth, optimal trajectories using limited data and convex optimization. The proposed trajectory generation method is validated on a laboratory crane setup and compared against an established model-based approach, achieving up to 35% reduction in load sway, 43% reduction in tracking error, and 50% reduction in travel time.

Chinese Translation

随着建筑行业的发展和全球熟练劳动力的短缺，起重机控制的自动化在安全和高效操作中变得愈发重要。自动起重机控制的一个核心挑战是减少运动过程中的负载摆动，这主要通过适当的回转轨迹来解决。在这一背景下，经典的基于模型的控制方法依赖于准确的动态模型和专家调优，往往难以满足安全性和精确度的要求，而许多基于学习的方法则需要大量的数据集和显著的计算资源。本文提出了一种基于行为数据的框架，用于生成旋转起重机的开环回转轨迹，以抑制负载摆动，同时减少操作时间和能耗。该方法基于Willems的基本引理及其推广，绕过显式系统建模，直接在测量的输入-输出数据上进行操作。本文提出了一种实用的工作流程，以减少对专家知识的需求。尽管起重机动态具有欠驱动特性，该方法识别出系统行为的非参数表示，并利用有限数据和凸优化生成平滑的最优轨迹。所提出的轨迹生成方法在实验室起重机设置上进行了验证，并与一种已建立的基于模型的方法进行了比较，达到了负载摆动减少35%、跟踪误差减少43%和旅行时间减少50%的效果。
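
Methods built on Willems' fundamental lemma represent a system by Hankel matrices of recorded input and output data: for a controllable linear time-invariant system and a persistently exciting input, trajectories of a given length lie in the span of the columns of these matrices. The sketch below only shows how such a matrix is assembled from one scalar signal. The function name and the sample data are illustrative assumptions, and the paper's generalizations and optimization problem are not reproduced.

```python
def hankel(signal, depth):
    """Return the depth x (T - depth + 1) Hankel matrix of a scalar signal."""
    cols = len(signal) - depth + 1
    if cols < 1:
        raise ValueError("signal too short for requested depth")
    # Entry (i, j) is signal[i + j]; each column is one length-`depth` window.
    return [[signal[i + j] for j in range(cols)] for i in range(depth)]

# Hypothetical recorded input samples (for example, a slewing command).
u = [0.0, 1.0, 0.5, -0.5, 0.2, 0.0]
H = hankel(u, depth=3)
```

Each column of `H` is a consecutive window of the recording, so a candidate trajectory can be written as a linear combination of these windows and searched for with convex optimization.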

cs.RO / 23 / 2605.15049

A Prototyping Framework for Distributed Control of Multi-Robot Systems

多机器人系统分布式控制的原型框架

Memon, Junaid Ahmed, Nascimento, Allan Andre Do, Margellos, Kostas, Papachristodoulou, Antonis

Abstract

This paper presents a prototyping framework for distributed control of multi-robot systems, aimed at bridging theory and practical testing of distributed optimization algorithms. Using the Single Program, Multiple Data (SPMD) paradigm, the framework emulates distributed control on a single computer, with each core running the same algorithm using local states and neighbour-to-neighbour communication. We demonstrate the framework on a four-quadrotor position-swapping task using a non-cooperative game-theoretic distributed algorithm. Computational time and trajectory data are compared across the supported dynamics levels: a point-mass model, a high-fidelity quadrotor model, and an experimental hardware testbed using Crazyflie quadcopters. The results show that the framework provides a low-cost and accessible approach for validating distributed algorithms.

Chinese Translation

本文提出了一种用于多机器人系统分布式控制的原型框架，旨在弥合分布式优化算法的理论与实际测试之间的差距。该框架采用单程序多数据（Single Program, Multiple Data, SPMD）范式，在单台计算机上模拟分布式控制，每个核心运行相同的算法，利用本地状态和邻接通信。我们在一个四架四旋翼机的位置交换任务中演示了该框架，使用了一种非合作博弈论的分布式算法。我们比较了在不同动态水平下的计算时间和轨迹数据：点质量模型、高保真四旋翼模型以及使用Crazyflie四旋翼的实验硬件测试平台。结果表明，该框架为验证分布式算法提供了一种低成本且易于获取的方法。
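
The SPMD idea in this abstract, where every agent runs the same program on its own state and exchanges data only with neighbours, can be illustrated in a single process. In this sketch a plain consensus-averaging update is swapped in for the paper's non-cooperative game-theoretic algorithm, and the ring topology, gain, and function names are assumptions made for illustration.

```python
# Sketch: every agent runs the same update using only its own state and its
# neighbours' states. Consensus averaging is a stand-in for the paper's
# game-theoretic algorithm, and the ring topology is an assumption.

def local_step(own, neighbour_states, gain=0.25):
    """Identical program for every agent: move toward the neighbours."""
    return own + gain * sum(n - own for n in neighbour_states)

def simulate(states, neighbours, rounds=50):
    for _ in range(rounds):
        # Synchronous round: all agents read the previous round's states.
        states = [local_step(states[i], [states[j] for j in neighbours[i]])
                  for i in range(len(states))]
    return states

# Four agents on a ring; each one talks only to its two neighbours.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
final = simulate([0.0, 4.0, 8.0, 12.0], ring)
```

In a multi-core emulation each call to `local_step` would run in its own process and the neighbour states would arrive as messages. The single-process loop here only makes the data dependencies explicit.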

cs.RO / 24 / 2605.15074

SOCC-ICP: Semantics-Assisted Odometry based on Occupancy Grids and ICP

SOCC-ICP：基于占用网格和ICP的语义辅助里程计

Scherer, Johannes, Hirt, Sebastian, Meeß, Henri

Abstract

Reliable pose estimation in previously unseen environments is a fundamental capability of autonomous systems. Existing LiDAR odometry methods typically employ point-, surfel-, or NDT-based map representations, which are distinct from the semantic occupancy grids commonly used for downstream tasks such as motion planning. We introduce SOCC-ICP, a semantics-assisted odometry framework that jointly performs Semantic OCCupancy grid mapping and LiDAR scan alignment. Each map voxel encodes geometric and semantic statistics, enabling adaptive point-to-point or point-to-plane ICP based on local planarity. Further, the occupancy grid naturally filters dynamic objects through raycasting-based free-space updates. Across diverse evaluation scenarios, SOCC-ICP achieves performance competitive with state-of-the-art LiDAR odometry and remains robust in geometrically degenerate environments, even in the absence of semantic cues. When semantic labels are available, integrating them into map construction, downsampling, and correspondence weighting yields further accuracy gains. By unifying odometry and semantic occupancy grid mapping within a single representation, SOCC-ICP eliminates redundant map structures and directly provides a map suitable for downstream robotic applications.

Chinese Translation

在之前未见环境中可靠的姿态估计是自主系统的一项基本能力。现有的激光雷达里程计方法通常采用基于点、表面元素或NDT（Normal Distributions Transform）的地图表示，这与常用于运动规划等下游任务的语义占用网格有所不同。我们提出了SOCC-ICP，一种语义辅助的里程计框架，联合执行语义占用网格映射和激光雷达扫描对齐。每个地图体素编码几何和语义统计信息，使得基于局部平面的自适应点对点或点对面ICP成为可能。此外，占用网格通过基于光线投射的自由空间更新自然过滤动态物体。在多样的评估场景中，SOCC-ICP的性能与最先进的激光雷达里程计相当，并且在几何退化环境中保持稳健，即使在缺乏语义线索的情况下也是如此。当可用语义标签时，将其整合到地图构建、降采样和对应加权中可进一步提高精度。通过在单一表示中统一里程计和语义占用网格映射，SOCC-ICP消除了冗余的地图结构，直接提供适合下游机器人应用的地图。
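
Switching between point-to-point and point-to-plane ICP based on local planarity can be pictured as choosing a residual per correspondence. The sketch below is a minimal illustration: the planarity score in [0, 1], the 0.7 threshold, and the function signature are hypothetical, and the abstract does not say how SOCC-ICP computes or thresholds planarity.

```python
def residual(src, dst, normal, planarity, threshold=0.7):
    """Point-to-plane residual on planar voxels, point-to-point otherwise."""
    diff = [s - d for s, d in zip(src, dst)]
    if planarity >= threshold:
        # Distance measured along the voxel's surface normal (unit length assumed).
        return abs(sum(di * ni for di, ni in zip(diff, normal)))
    # Euclidean distance between the matched points.
    return sum(di * di for di in diff) ** 0.5

# Same map point and normal, two hypothetical planarity scores.
planar = residual([1.0, 2.0, 0.3], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0], planarity=0.9)
rough = residual([3.0, 4.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0], planarity=0.2)
```

On the planar voxel only the offset along the normal is penalized, so the scan can slide along a wall or floor. On the non-planar voxel the full distance is used.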

cs.RO / 25 / 2605.15120

CLOVER: Closed-Loop Value Estimation & Ranking for End-to-End Autonomous Driving Planning

CLOVER：端到端自主驾驶规划的闭环价值估计与排名

Ang, Sining, Yang, Yuguang, Chen, Canyu, Wang, Yan

Abstract

End-to-end autonomous driving planners are commonly trained by imitating a single logged trajectory, yet evaluated by rule-based planning metrics that measure safety, feasibility, progress, and comfort. This creates a training-evaluation mismatch: trajectories close to the logged path may violate planning rules, while alternatives farther from the demonstration can remain valid and high-scoring. The mismatch is especially limiting for proposal-selection planners, whose performance depends on candidate-set coverage and scorer ranking quality. We propose CLOVER, a Closed-LOop Value Estimation and Ranking framework for end-to-end autonomous driving planning. CLOVER follows a lightweight generator-scorer formulation: a generator produces diverse candidate trajectories, and a scorer predicts planning-metric sub-scores to rank them at inference time. To expand proposal support beyond single-trajectory imitation, CLOVER constructs evaluator-filtered pseudo-expert trajectories and trains the generator with set-level coverage supervision. It then performs conservative closed-loop self-distillation: the scorer is fitted to true evaluator sub-scores on generated proposals, while the generator is refined toward teacher-selected top-$k$ and vector-Pareto targets with stability regularization. We analyze when an imperfect scorer can improve the generator, showing that scorer-mediated refinement is reliable when scorer-selected targets are enriched under the true evaluator and updates remain conservative. On NAVSIM, CLOVER achieves 94.5 PDMS and 90.4 EPDMS, establishing a new state of the art. On the more challenging NavHard split, it obtains 48.3 EPDMS, matching the strongest reported result. On supplementary nuScenes open-loop evaluation, CLOVER achieves the lowest L2 error and collision rate among compared methods. Code and data will be released at https://github.com/WilliamXuanYu/CLOVER.

Chinese Translation

端到端自动驾驶规划器通常通过模仿单一记录轨迹进行训练，但评估却依赖于基于规则的规划指标，这些指标衡量安全性、可行性、进展和舒适性。这导致了训练与评估之间的不匹配：接近记录路径的轨迹可能违反规划规则，而远离演示的替代轨迹则可能仍然有效且得分较高。这种不匹配对提案选择式规划器的限制尤为明显，因为其性能依赖于候选集的覆盖范围和评分器的排名质量。我们提出了CLOVER，一个用于端到端自动驾驶规划的闭环价值估计与排名框架。CLOVER遵循轻量级生成器-评分器的形式：生成器产生多样的候选轨迹，评分器在推理时预测规划指标的子得分以对其进行排名。为了扩展提案支持，超越单轨迹模仿，CLOVER构建了评估者过滤的伪专家轨迹，并通过集级覆盖监督训练生成器。然后，它执行保守的闭环自蒸馏：评分器根据生成提案的真实评估者子得分进行拟合，而生成器则在稳定性正则化下朝向教师选择的前$k$和向量帕累托目标进行优化。我们分析了不完美评分器何时能够改善生成器，表明当评分器选择的目标在真实评估者下得到丰富，并且更新保持保守时，评分器介导的优化是可靠的。在NAVSIM上，CLOVER达到了94.5 PDMS和90.4 EPDMS，确立了新的最先进水平。在更具挑战性的NavHard分割上，它获得了48.3 EPDMS，匹配了报告的最强结果。在补充的nuScenes开环评估中，CLOVER在比较方法中实现了最低的L2误差和碰撞率。代码和数据将发布在 https://github.com/WilliamXuanYu/CLOVER。
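
The two target-selection ideas named in this abstract, top-k ranking and vector-Pareto selection over sub-score vectors, can be sketched on a handful of candidates. This is an illustration only: the three sub-scores, their values, and the product used to aggregate them for ranking are assumptions and are not CLOVER's actual scorer or metric definition.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Indices of candidates not dominated by any other candidate."""
    return [i for i, s in enumerate(scores)
            if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)]

def top_k(scores, k):
    """Indices of the k candidates with the highest product of sub-scores."""
    def aggregate(s):
        total = 1.0
        for x in s:
            total *= x
        return total
    order = sorted(range(len(scores)), key=lambda i: aggregate(scores[i]), reverse=True)
    return order[:k]

# Hypothetical predicted sub-scores per candidate: (safety, progress, comfort).
candidates = [(0.9, 0.5, 0.8), (0.7, 0.9, 0.6), (0.6, 0.4, 0.5), (0.9, 0.5, 0.7)]
front = pareto_front(candidates)   # candidates that no other candidate dominates
best = top_k(candidates, k=2)      # two highest aggregate scores
```

Top-k needs a single scalar per candidate, whereas the Pareto front keeps every candidate that represents a distinct trade-off between sub-scores, which is why the two target sets can differ.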

cs.RO / 26 / 2605.15122

CoCo-InEKF: State Estimation with Learned Contact Covariances in Dynamic, Contact-Rich Scenarios

CoCo-InEKF：在动态接触丰富场景中的学习接触协方差的状态估计

Baumgartner, Michael, Müller, David, Serifi, Agon, Grandia, Ruben, Knoop, Espen, Gross, Markus, Bächer, Moritz

Abstract

Robust state estimation for highly dynamic motion of legged robots remains challenging, especially in dynamic, contact-rich scenarios. Traditional approaches often rely on binary contact states that fail to capture the nuances of partial contact or directional slippage. This paper presents CoCo-InEKF, a differentiable invariant extended Kalman filter that utilizes continuous contact velocity covariances instead of binary contact states. These learned covariances allow the method to dynamically modulate contact confidence, accounting for more nuanced conditions ranging from firm contact to directional slippage or no contact. To predict these covariances for a set of predefined contact candidate points, we employ a lightweight neural network trained end-to-end using a state-error loss. This approach eliminates the need for heuristic ground-truth contact labels. In addition, we propose an automated contact candidate selection procedure and demonstrate that our method is insensitive to their exact placement. Experiments on a bipedal robot demonstrate a superior accuracy-efficiency tradeoff for linear velocity estimation, as well as improved filter consistency compared to baseline methods. This enables the robust execution of challenging motions, including dancing and complex ground interactions, both in simulation and in the real world.

Chinese Translation

对于腿部机器人在高度动态运动中的稳健状态估计仍然具有挑战性，尤其是在动态接触丰富的场景中。传统方法通常依赖于二元接触状态，这无法捕捉到部分接触或方向滑移的细微差别。本文提出了CoCo-InEKF，一种可微分的不变扩展卡尔曼滤波器，利用连续接触速度协方差而非二元接触状态。这些学习到的协方差使得该方法能够动态调节接触置信度，考虑到从牢固接触到方向滑移或无接触等更细微的条件。为了预测一组预定义接触候选点的协方差，我们采用了一种轻量级神经网络，通过端到端训练，使用状态误差损失。这种方法消除了对启发式真实接触标签的需求。此外，我们提出了一种自动接触候选选择程序，并证明我们的方法对其精确位置不敏感。在双足机器人上的实验表明，与基线方法相比，我们的方法在线速度估计上具有更优的准确性与效率权衡，以及改进的滤波器一致性。这使得在模拟和现实世界中都能稳健地执行包括舞蹈和复杂地面交互在内的挑战性动作。

cs.RO / 27 / 2605.15153

Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

Pelican-Unified 1.0：一个统一的具身智能模型，用于理解、推理、想象和行动

Zhang, Yi, Chen, Yinda, Liu, Che, Ding, Zeyuan, Xu, Jin, Zou, Shilong, Liao, Junwei, Hu, Jiayu, Ren, Xiancong, Zhang, Xiaopeng, Liu, Yechi, Shi, Haoyuan, Tang, Zecong, Sun, Haosong, Cui, Renwen, Wu, Kuishu, Liu, Wenhai, Xu, Yang, Zhang, Yingji, Wang, Yidong, Hu, Senkang, Lu, Jinpeng, Chan, Nga Teng, Wu, Yechen, Dai, Yong, Tang, Jian, Ju, Xiaozhu

Abstract

We present Pelican-Unified 1.0, the first embodied foundation model trained according to the principle of unification. Pelican-Unified 1.0 uses a single VLM as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also serves as a unified reasoning module, autoregressively producing task-, action-, and future-oriented chains of thought in a single forward pass and projecting the final hidden state into a dense latent variable. A Unified Future Generator (UFG) then conditions on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process. The language, video, and action losses are all backpropagated into the shared representation, enabling the model to jointly optimize understanding, reasoning, imagination, and action during training, rather than training three isolated expert systems. Experiments demonstrate that unification does not imply compromise. With a single checkpoint, Pelican-Unified 1.0 achieves strong performance across all three capabilities: 64.7 on eight VLM benchmarks, the best among comparable-scale models; 66.03 on WorldArena, ranking first; and 93.5 on RoboTwin, the second-best average among compared action methods. These results show that the unified paradigm succeeds in preserving specialist strength while bringing understanding, reasoning, imagination, and action into one model.

Chinese Translation

我们提出了Pelican-Unified 1.0，这是第一个根据统一原则训练的具身基础模型。Pelican-Unified 1.0使用单一的视觉语言模型（VLM）作为统一的理解模块，将场景、指令、视觉上下文和行动历史映射到共享的语义空间中。相同的VLM还作为统一的推理模块，自回归地在单次前向传播中生成以任务、行动和未来为导向的思维链，并将最终的隐藏状态投影到一个密集的潜变量中。统一未来生成器（Unified Future Generator, UFG）随后基于该潜变量进行条件生成，通过同一去噪过程中的两个特定模态输出头共同生成未来视频和未来行动。语言、视频和行动损失均反向传播到共享表示中，使模型能够在训练过程中共同优化理解、推理、想象和行动，而不是训练三个孤立的专家系统。实验表明，统一并不意味着妥协。凭借单一的检查点，Pelican-Unified 1.0在所有三项能力上都取得了强劲的表现：在八个VLM基准测试中得分64.7，是同规模模型中表现最佳的；在WorldArena中得分66.03，排名第一；在RoboTwin中得分93.5，是比较行动方法中第二好的平均值。这些结果表明，统一范式成功地保留了专业优势，同时将理解、推理、想象和行动融入一个模型中。

cs.RO / 28 / 2605.15157

Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction

Hand-in-the-Loop：通过无缝干预纠正改进灵巧操作VLA模型

Li, Zhuohang, Huang, Liqun, Xu, Wei, Zhu, Zhengming, Lin, Nie, Ma, Xiao, Sheng, Xinjun, Wen, Ruoshi

Abstract

Vision-Language-Action (VLA) models are prone to compounding errors in dexterous manipulation, where high-dimensional action spaces and contact-rich dynamics amplify small policy deviations over long horizons. While Interactive Imitation Learning (IIL) can refine policies through human takeover data, applying it to high-degree-of-freedom (DoF) robotic hands remains challenging due to a command mismatch between human teleoperation and policy execution at the takeover moment, which causes abrupt robot-hand configuration changes, or "gesture jumps". We present Hand-in-the-Loop (HandITL), a seamless human-in-the-loop intervention method that blends human corrective intent with autonomous policy execution to avoid gesture jumps during bimanual dexterous manipulation. Compared with direct teleoperation takeover, HandITL reduces takeover jitter by 99.8% and preserves robust post-takeover manipulation, reducing grasp failures by 87.5% and mean completion time by 19.1%. We validate HandITL on tasks requiring bimanual coordination, tool use, and fine-grained long-horizon manipulation. When used to collect intervention data for policy refinement, HandITL yields policies that outperform those trained with standard teleoperation data by 19% on average across three long-horizon dexterous tasks.

Chinese Translation

视觉-语言-动作（VLA）模型在灵巧操作中容易出现累积误差，其中高维动作空间和丰富的接触动态会在长时间内放大小的策略偏差。尽管交互式模仿学习（IIL）可以通过人类接管数据来优化策略，但由于人类遥操作与策略执行之间的指令不匹配，在接管时会导致机器人手的配置发生突变，即“手势跳跃”，这使得在高自由度（DoF）机器人手上应用IIL变得具有挑战性。我们提出了人机协作（Hand-in-the-Loop, HandITL），这是一种无缝的人机干预方法，将人类的纠正意图与自主策略执行相结合，以避免在双手灵巧操作过程中出现手势跳跃。与直接遥操作接管相比，HandITL将接管抖动减少了99.8%，并保持了稳健的接管后操作，减少了87.5%的抓取失败率和19.1%的平均完成时间。我们在需要双手协调、工具使用和精细长时间操作的任务上验证了HandITL。当用于收集干预数据以优化策略时，HandITL所产生的策略在三个长时间灵巧任务上平均比使用标准遥操作数据训练的策略提高了19%。
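
The "gesture jump" problem in this abstract comes from switching instantly from the policy's joint command to the teleoperator's command. One simple way to picture a jump-free takeover is a ramped convex blend of the two commands. The abstract does not specify HandITL's blending rule, so the linear ramp, the function names, and the two-joint example below are purely hypothetical.

```python
# Sketch: ramped blend between policy and human joint targets.
# The linear ramp is an assumed stand-in, not HandITL's actual method.

def blend_commands(policy_cmd, human_cmd, alpha):
    """Convex combination of joint targets; alpha=0 is pure policy, 1 is pure human."""
    return [(1.0 - alpha) * p + alpha * h for p, h in zip(policy_cmd, human_cmd)]

def takeover(policy_cmd, human_cmd, ramp_steps=5):
    """Ramp alpha linearly from 0 to 1 so the commanded hand pose never jumps."""
    return [blend_commands(policy_cmd, human_cmd, step / ramp_steps)
            for step in range(ramp_steps + 1)]

# Two hypothetical joint angles (radians) at the takeover moment.
trajectory = takeover(policy_cmd=[0.2, 1.0], human_cmd=[0.7, 0.0])
max_jump = max(abs(b[0] - a[0]) for a, b in zip(trajectory, trajectory[1:]))
```

With a direct switch the first joint would move by 0.5 rad in one control step. Under the ramp, the largest per-step change is one fifth of that.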

计算机视觉 (Computer Vision)

151

cs.CV / 1 / 2605.13854

Contrastive Multi-Modal Hypergraph Reasoning for 3D Crowd Mesh Recovery

对比多模态超图推理用于三维人群网格恢复

Sun, Minghao, Xu, Chongyang, Xie, Yitao, Huang, Buzhen, Li, Kun

Abstract

Multi-person 3D reconstruction is pivotal for real-world interaction analysis, yet remains challenging due to severe occlusions and depth ambiguity. Current approaches typically rely on single-modality inputs, which inherently lack geometric guidance. Furthermore, these methods often reconstruct subjects in isolation, neglecting the collective group context essential for resolving ambiguities in crowded scenes. To address these limitations, we propose Contrastive Multi-modal Hypergraph Reasoning to synergize semantic, geometric, and pose cues for crowd reconstruction. We first initialize robust node representations by combining RGB features, geometric priors, and occlusion-aware incomplete poses. Additionally, we introduce a pelvis depth indicator as a global spatial anchor, aligning visual features with a metric-scale-agnostic depth ordering. Subsequently, we construct a shared-topology hypergraph that moves beyond pairwise constraints to model higher-order crowd dynamics. To improve feature fusion, we design a hypergraph-based contrastive learning scheme that jointly enhances intra-modal discriminability and enforces cross-modal orthogonality. This mechanism enables the network to propagate global context effectively, allowing it to infer missing information even under severe occlusion. Extensive experiments on the Panoptic and GigaCrowd benchmarks confirm that our method achieves new state-of-the-art performance. Code and pre-trained models are available at https://github.com/SunMH-try/CoMHR.

Chinese Translation

多人物三维重建对于现实世界的交互分析至关重要，但由于严重的遮挡和深度模糊，仍然面临挑战。目前的方法通常依赖于单一模态输入，缺乏几何指导。此外，这些方法往往孤立地重建个体，忽视了在拥挤场景中解决模糊所必需的集体群体上下文。为了解决这些局限性，我们提出了对比多模态超图推理，以协同语义、几何和姿态线索进行人群重建。我们首先通过结合RGB特征、几何先验和考虑遮挡的不完整姿态来初始化稳健的节点表示。此外，我们引入了一个骨盆深度指示器作为全局空间锚点，将视觉特征与度量尺度无关的深度排序对齐。随后，我们构建了一个共享拓扑的超图，超越了成对约束，以建模更高阶的人群动态。为了改善特征融合，我们设计了一种基于超图的对比学习方案，联合增强模态内可区分性并强制跨模态正交性。该机制使网络能够有效传播全局上下文，即使在严重遮挡的情况下也能推断缺失信息。在Panoptic和GigaCrowd基准上的大量实验确认了我们的方法达到了新的最先进性能。代码和预训练模型可在 https://github.com/SunMH-try/CoMHR 获取。

cs.CV / 2 / 2605.13974

Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers

少量通道描绘全貌：揭示扩散变换器中的大规模激活

Turri, Evelyn, Bucciarelli, Davide, Sarto, Sara, Baraldi, Lorenzo, Cornia, Marcella

Abstract

Diffusion Transformers (DiTs) and related flow-based architectures are now among the strongest text-to-image generators, yet the internal mechanisms through which prompts shape image semantics remain poorly understood. In this work, we study massive activations: a small subset of hidden-state channels whose responses are consistently much larger than the rest. We show that, despite their sparsity, these few channels effectively draw the whole picture, in three complementary senses. First, they are functionally critical: a controlled disruption probe that zeroes the massive channels causes a sharp collapse in generation quality, while disrupting an equally sized set of low-statistic channels has marginal effect. Second, they are spatially organized: restricting image-stream tokens to massive channels and clustering them yields coherent partitions that closely align with the main subject and salient regions, exposing a structured spatial code hidden inside an apparently outlier-like subspace. Third, they are transferable: transporting massive activations from one prompt-conditioned trajectory into another shifts the final image toward the source prompt while preserving substantial content from the target, producing localized semantic interpolation rather than unstructured pixel blending. We exploit this property in two use cases, text-conditioned and image-conditioned semantic transport, where massive-activation transport enables prompt interpolation and subject-driven generation without any additional training. Together, these results recast massive activations not as activation anomalies, but as a sparse prompt-conditioned carrier subspace that organizes and controls semantic information in modern DiT models.

Chinese Translation

扩散变换器（Diffusion Transformers, DiTs）及相关的基于流的架构现已成为最强大的文本到图像生成器之一，但通过提示塑造图像语义的内部机制仍然不甚了解。在本研究中，我们研究了大规模激活：一小部分隐藏状态通道，其响应始终远大于其他通道。我们展示了尽管这些通道稀疏，但它们在三个互补的意义上有效地描绘了全貌。首先，它们在功能上至关重要：一个控制性干扰探针将大规模通道的激活归零会导致生成质量的急剧下降，而干扰同样大小的低统计通道则影响微乎其微。其次，它们在空间上有组织：将图像流标记限制在大规模通道并对其进行聚类，产生的连贯分区与主要主题和显著区域紧密对齐，揭示了隐藏在表面上看似异常的子空间中的结构化空间编码。第三，它们是可转移的：将大规模激活从一个提示条件轨迹转移到另一个，可以将最终图像向源提示方向移动，同时保留目标的实质内容，产生局部语义插值而非无结构的像素混合。我们在两个使用案例中利用了这一特性：文本条件和图像条件的语义传输，其中大规模激活的传输使得提示插值和主题驱动生成成为可能，而无需额外训练。总之，这些结果将大规模激活重新定义为一种稀疏的提示条件载体子空间，组织和控制现代 DiT 模型中的语义信息。
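
The notion of "massive" channels and the zeroing disruption probe from this abstract can be sketched on a toy hidden-state matrix. The selection rule used here (mean absolute activation above ten times the median channel) and the function names are illustrative assumptions. The abstract does not state the paper's exact criterion.

```python
from statistics import median

def massive_channels(hidden, ratio=10.0):
    """Indices of channels whose mean |activation| exceeds ratio x the median channel."""
    num_channels = len(hidden[0])
    mean_abs = [sum(abs(tok[c]) for tok in hidden) / len(hidden)
                for c in range(num_channels)]
    cutoff = ratio * median(mean_abs)
    return [c for c, m in enumerate(mean_abs) if m > cutoff]

def zero_channels(hidden, channels):
    """Disruption probe: copy of the hidden states with the given channels set to zero."""
    drop = set(channels)
    return [[0.0 if c in drop else v for c, v in enumerate(tok)] for tok in hidden]

# Toy hidden states: 3 tokens x 5 channels, where channel 2 is the outlier.
hidden = [[0.1, -0.2, 40.0, 0.3, -0.1],
          [0.2, 0.1, 38.0, -0.2, 0.1],
          [-0.1, 0.3, 42.0, 0.1, 0.2]]
found = massive_channels(hidden)
probed = zero_channels(hidden, found)
```

In a real model the probed states would be fed back into the remaining layers to measure the effect on generation quality. The sketch stops at producing the modified hidden states.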

cs.CV / 3 / 2605.13994

CineMesh4D: Personalized 4D Whole Heart Reconstruction from Sparse Cine MRI

CineMesh4D：基于稀疏 Cine MRI 的个性化 4D 整体心脏重建

Liu, Xiaoyue, Yuan, Xiaohan, Chan, Mark Y, Sia, Ching-Hui, Li, Lei

Abstract

Accurate 3D+t whole-heart mesh reconstruction from cine MRI is a clinically crucial yet technically challenging task. The difficulty of this task arises from two coupled factors: inherently sparse sampling of 3D cardiac anatomy by 2D image slices and the tight coupling between cardiac shape and motion. Current cardiac image-to-mesh approaches typically reconstruct only a subset of cardiac chambers or a single phase of the cardiac cycle. In this work, we propose CineMesh4D, a novel end-to-end 4D (3D+t) pipeline that directly reconstructs patient-specific whole-heart mesh from multi-view 2D cine MRI via cross-domain mapping. Specifically, we introduce a differentiable rendering loss that enables supervision of 3D+t whole-heart mesh from multi-view sparse contours of cine MRI. Furthermore, we develop a dual-context temporal block that fuses global and local cardiac temporal information to capture high-dimensional sequential patterns. In quantitative and qualitative evaluations, CineMesh4D outperforms existing approaches in terms of reconstruction quality and motion consistency, providing a practical pathway for personalized real-time cardiac assessment. The code will be publicly released once the manuscript is accepted.

Chinese Translation

从 Cine MRI 中准确重建 3D+t 整体心脏网格是一项临床上至关重要但技术上具有挑战性的任务。这项任务的难度源于两个相互关联的因素：2D 图像切片对 3D 心脏解剖结构的固有稀疏采样以及心脏形状与运动之间的紧密耦合。目前的心脏图像到网格的方法通常仅重建心脏腔室的一个子集或心脏周期的单一相位。在本研究中，我们提出了 CineMesh4D，这是一种新颖的端到端 4D (3D+t) 流水线，能够通过跨域映射直接从多视角 2D Cine MRI 重建患者特定的整体心脏网格。具体而言，我们引入了一种可微渲染损失，能够利用多视角稀疏轮廓对 3D+t 整体心脏网格进行监督。此外，我们开发了一种双上下文时间块，融合全局和局部心脏时间信息，以捕捉高维序列模式。在定量和定性评估中，CineMesh4D 在重建质量和运动一致性方面优于现有方法，为个性化实时心脏评估提供了切实可行的途径。代码将在手稿被接受后公开发布。

cs.CV / 4 / 2605.14028

Unified Pix Token And Word Token Generative Language Model

统一的像素令牌与词令牌生成语言模型

Leung, Haun, Wang, ZiNan

Abstract

Since its emergence, the Vision Transformer (ViT) has been widely used in generative language models and generative visual models. In particular, current state-of-the-art open-source multimodal models use a ViT trained with CLIP or SigLIP as the vision encoder backbone to acquire visual understanding capabilities. However, this approach limits the understanding of fine visual details, making it difficult, for example, to recognize small text or numbers in images. To address these issues, we propose a new model that unifies pix tokens and word tokens in a single generative language model. The model also features a separate token embedding for each pixel of the image, color folding, global conditional attention approximation, and unsupervised image pretraining. We conducted unsupervised image pretraining experiments with the new model to explore its potential. The experimental results show that it performs well even with a small model and limited training data. We believe our model also conforms to the scaling law: as model parameters and training data increase, its performance will continue to improve.

Chinese Translation

自从视觉变换器（Vision Transformer, ViT）出现以来，它在生成语言模型和生成视觉模型中得到了广泛应用。尤其是在当前最先进的开源多模态模型中，通过 CLIP 或 SigLIP 方法获得的 ViT 作为视觉编码器的骨干，帮助它们获取视觉理解能力。然而，这种方法在细节的视觉理解上存在局限性，例如在图像中识别小文本或数字的困难。为了解决这些问题，我们提出了一种新模型，将像素令牌（pix token）和词令牌（word token）统一到生成语言模型中。该新模型的特点是每个图像的像素都有其独立的令牌嵌入，颜色折叠、全局条件注意力近似和图像无监督预训练。我们使用新模型进行了图像无监督预训练实验，以探索其潜力。实验结果表明，即使在小模型和有限训练数据的情况下，它也表现出良好的性能。我们相信，只要模型参数和训练数据增加，我们的模型也符合规模法则，其性能将持续提升。

cs.CV / 5 / 2605.14045

PVRF: All-in-one Adverse Weather Removal via Prior-modulated and Velocity-constrained Rectified Flow

PVRF：通过先验调制和速度约束的整流流实现一体化恶劣天气去除

Dong, Wei, Zhou, Han, Ji, Terry, Zhao, Guanhua, Asoodeh, Shahab, Zhang, Yulun, Zhai, Guangtao, Chen, Jun, Liu, Xiaohong

Abstract

Adverse weather removal (AWR) in real-world images remains challenging due to heterogeneous and unseen degradations, while distortion-driven training often yields overly smooth results. We propose PVRF, a unified framework that integrates zero-shot soft weather perceptions with velocity-constrained rectified-flow refinement. PVRF introduces an AWR-specific question answering module (AWR-QA) that uses frozen vision-language models (VLMs) to estimate soft probabilities of weather types and low-level attribute scores. These perceptions condition restoration networks via attribute-modulated normalization (AMN) and weather-weighted adapters (WWA), producing an anchor estimate for refinement. We then learn a terminal-consistent residual rectified flow with perception-adaptive source perturbation and a terminal-consistent velocity parameterization to stabilize learning near the terminal regime. Extensive experiments show that PVRF improves both fidelity and perceptual quality over state-of-the-art baselines, with strong cross-dataset generalization on single and combined degradations. Code will be released at https://github.com/dongw22/PVRF.

Chinese Translation

在现实世界图像中，恶劣天气去除（AWR）仍然面临挑战，因为存在异质和未见的退化，而基于失真驱动的训练往往导致结果过于平滑。我们提出了PVRF，一个统一框架，将零样本软天气感知与速度约束的整流流细化相结合。PVRF引入了一个针对AWR的特定问答模块（AWR-QA），利用冻结的视觉-语言模型（VLMs）来估计天气类型的软概率和低级属性分数。这些感知通过属性调制归一化（AMN）和天气加权适配器（WWA）来调节恢复网络，生成用于细化的锚定估计。然后，我们学习一个终端一致的残差整流流，结合感知自适应源扰动和终端一致的速度参数化，以在终端区域附近稳定学习。大量实验表明，PVRF在保真度和感知质量上均优于最先进的基线，并在单一和组合退化上展现出强大的跨数据集泛化能力。代码将发布在 https://github.com/dongw22/PVRF。
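
For readers unfamiliar with rectified flow, the textbook formulation is sketched below. This is the standard objective, not the terminal-consistent residual variant the abstract describes; reading $x_0$ as a perturbed anchor estimate is an assumption made here for illustration only.

```latex
% Standard rectified-flow training (textbook form, not the paper's exact variant).
% x_0: source sample (ASSUMED here to be a perturbed anchor estimate),
% x_1: clean target image, v_\theta: learned velocity field.
\begin{align}
  x_t &= (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1], \\
  \mathcal{L}(\theta) &= \mathbb{E}_{x_0, x_1, t}\,
      \bigl\| v_\theta(x_t, t) - (x_1 - x_0) \bigr\|_2^2, \\
  \frac{\mathrm{d}x_t}{\mathrm{d}t} &= v_\theta(x_t, t)
      \quad \text{(integrated from } t = 0 \text{ to } t = 1 \text{ at inference).}
\end{align}
```

Because the interpolation path is a straight line, the regression target $x_1 - x_0$ is constant along each path, which is what makes few-step sampling feasible.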

cs.CV / 6 / 2605.14047

Evolving Layer-Specific Scalar Functions for Hardware-Aware Transformer Adaptation

针对硬件感知的变层特定标量函数演化

Carrigg, Kieran, de Vries, Sigur, Sadough, Amirhossein, van Gerven, Marcel

Abstract

Vision Transformers (ViTs) achieve state-of-the-art performance on challenging vision tasks, but their deployment on edge devices is severely hindered by the computational complexity and global reduction bottleneck imposed by layer normalization. Recent methods attempt to bypass this by replacing normalization layers with hardware-friendly scalar approximations. However, these homogeneous replacements do not fit the behaviour of every layer well and rely on expensive model retraining. In this work, we propose a highly efficient, hardware-aware framework that utilizes genetic programming (GP) to evolve heterogeneous, layer-specific scalar functions directly from pre-trained weights. Coupled with a novel post-training re-alignment strategy, our approach entirely eliminates the need to retrain models from scratch. Our evolved expressions accurately approximate the target normalization behaviours, capturing $91.6\%$ of the variance ($R^2$) compared to only $70.2\%$ for homogeneous baselines, allowing our modified architecture to recover $84.25\%$ Top-1 ImageNet-1K accuracy in only 20 epochs. By preserving this performance while eliminating the global reduction bottleneck, our approach establishes a highly favourable trade-off between arithmetic complexity and off-chip memory traffic, removing a primary barrier to the efficient deployment of ViTs on edge accelerators.

Chinese Translation

视觉变换器（Vision Transformers, ViTs）在具有挑战性的视觉任务中实现了最先进的性能，但由于层归一化带来的计算复杂性和全局归约瓶颈，其在边缘设备上的部署受到严重阻碍。近期的方法试图通过用硬件友好的标量近似替代归一化层来绕过这一问题。然而，这些同质替代方案并未能最佳地适应所有层的行为，并且依赖于昂贵的模型重训练。在本研究中，我们提出了一种高效的硬件感知框架，该框架利用遗传编程（Genetic Programming, GP）直接从预训练权重中演化异质的层特定标量函数。结合一种新颖的后训练重新对齐策略，我们的方法完全消除了从头开始重训练模型的需求。我们演化出的表达式准确地近似了目标归一化行为，与同质基线相比，捕获了 $91.6\%$ 的方差（$R^2$），而同质基线仅为 $70.2\%$，使得我们修改后的架构在仅20个训练周期内恢复了 $84.25\%$ 的 Top-1 ImageNet-1K 准确率。通过在消除全局归约瓶颈的同时保持这一性能，我们的方法在算术复杂性和离芯内存流量之间建立了高度有利的权衡，从而消除了视觉变换器在边缘加速器上高效部署的主要障碍。
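
The variance figures quoted in this abstract are $R^2$ scores. As a minimal sketch of how such a score could be computed between a normalization layer's outputs and a scalar approximation of them, consider the helper below; the function name and the toy numbers are hypothetical and not taken from the paper.

```python
def r_squared(target, approx):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    `target` stands in for the outputs of the original normalization layer,
    `approx` for the outputs of a candidate scalar replacement (both are
    hypothetical toy inputs). Undefined when `target` is constant.
    """
    mean_t = sum(target) / len(target)
    ss_res = sum((t - a) ** 2 for t, a in zip(target, approx))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot


target = [1.0, 2.0, 3.0, 4.0]
close_fit = [1.5, 2.0, 3.0, 3.5]   # small residuals
mean_only = [2.5, 2.5, 2.5, 2.5]   # predicts the mean everywhere

score_close = r_squared(target, close_fit)
score_mean = r_squared(target, mean_only)
```

A score of 1 means the approximation reproduces the target exactly, while a predictor that always returns the mean scores 0.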

cs.CV / 7 / 2605.14068

CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

CurveBench：一个用于嵌套乔丹曲线精确拓扑推理的基准

Mohseni, Amirreza, Mohammadi, Mona, Saghafian, Morteza, Saradari, Naser Talebizadeh

Abstract

We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 756 images of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and dense counting configurations. Each image is annotated with a rooted tree encoding the containment relations between planar regions. We formulate the task as structured prediction: given an image, a model must recover the full rooted containment tree induced by the curves. Despite the visual simplicity of the task, the strongest evaluated model, Gemini 3.1 Pro, achieves only 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard. We further demonstrate benchmark utility through RLVR-style fine-tuning of open-weight vision-language models. Our trained Qwen3-VL-8B model improves over Qwen-3-VL-8B-Thinking from 2.8% to 33.3% tree-generation accuracy on CurveBench-Easy, exceeding GPT-5.4 and Claude Opus 4.5 under our evaluation protocol. The remaining gap, especially on CurveBench-Hard, shows that exact topology-aware visual reasoning remains far from solved.

Chinese Translation

我们介绍了CurveBench，这是一个用于从视觉输入进行层次拓扑推理的基准。CurveBench包含756张成对不相交的乔丹曲线的图像，涵盖简单、聚合、多样化地形、迷宫式和密集计数配置。每张图像都被注释为一个根树，编码了平面区域之间的包含关系。我们将任务表述为结构化预测：给定一张图像，模型必须恢复由曲线诱导的完整根包含树。尽管任务在视觉上相对简单，但评估中表现最强的模型Gemini 3.1 Pro在CurveBench-Easy上的树生成准确率仅为71.1%，而在CurveBench-Hard上的准确率为19.1%。我们进一步通过RLVR风格的开放权重视觉-语言模型微调展示了基准的实用性。我们训练的Qwen3-VL-8B模型在CurveBench-Easy上的树生成准确率从2.8%提高到33.3%，超过了在我们评估协议下的GPT-5.4和Claude Opus 4.5。剩余的差距，尤其是在CurveBench-Hard上，表明精确的拓扑感知视觉推理仍然远未解决。
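
To make the target structure concrete, here is a minimal sketch of how a rooted containment tree could be derived when the curves are available as polygons. The benchmark's actual annotation format is not described in the abstract, so the polygon input, the parent-array output, and all function names are assumptions for illustration.

```python
Point = tuple[float, float]


def polygon_area(polygon: list[Point]) -> float:
    """Absolute area via the shoelace formula."""
    n = len(polygon)
    twice_area = sum(
        polygon[k][0] * polygon[(k + 1) % n][1]
        - polygon[(k + 1) % n][0] * polygon[k][1]
        for k in range(n)
    )
    return abs(twice_area) / 2.0


def point_in_polygon(point: Point, polygon: list[Point]) -> bool:
    """Ray casting: count edge crossings of a ray going in the +x direction."""
    x, y = point
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def containment_tree(curves: list[list[Point]]) -> list[int]:
    """Return parents[i] = index of the smallest curve enclosing curve i.

    -1 denotes the root (the unbounded outer region). Because the curves are
    assumed pairwise non-intersecting, testing a single vertex is sufficient.
    """
    parents = []
    for i, curve in enumerate(curves):
        containers = [
            j for j, other in enumerate(curves)
            if j != i and point_in_polygon(curve[0], other)
        ]
        parents.append(
            min(containers, key=lambda j: polygon_area(curves[j]))
            if containers else -1
        )
    return parents


outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
middle = [(2, 2), (8, 2), (8, 8), (2, 8)]
inner = [(3, 3), (5, 3), (5, 5), (3, 5)]
apart = [(20, 0), (22, 0), (22, 2), (20, 2)]
parents = containment_tree([inner, outer, apart, middle])
```

The hard part of the benchmark is recovering this structure from pixels; the sketch only shows what the ground-truth tree encodes once the curves are known.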

cs.CV / 8 / 2605.14091

Venus-DeFakerOne: Unified Fake Image Detection & Localization

Venus-DeFakerOne：统一的假图像检测与定位

GuangJian Team

Abstract

In recent years, the rapid evolution of generative AI has fundamentally reshaped the paradigm of image forgery, breaking the traditional boundaries between document editing, natural image manipulation, DeepFake generation, and full-image AIGC synthesis. Despite this shift toward unified forgery generation, existing research in Fake Image Detection and Localization (FIDL) remains fragmented. This creates a mismatch between increasingly unified forgery generation mechanisms and the domain-specific detection paradigm. Bridging this mismatch poses two key challenges for FIDL: understanding cross-domain artifact transfer and interference, and building a high-capacity unified foundation model for joint detection and localization. To address these challenges, we propose DeFakerOne, a data-centric, unified FIDL foundation model integrating InternVL2 and SAM2. DeFakerOne enables simultaneous image-level detection and pixel-level forgery localization across diverse scenarios. Extensive experiments demonstrate that DeFakerOne achieves state-of-the-art performance, outperforming baselines on 39 forgery detection benchmarks and 9 localization benchmarks. Furthermore, the model exhibits superior robustness against real-world perturbations and state-of-the-art generators such as GPT-Image-2. Finally, we provide a systematic analysis of data scaling laws, cross-domain artifact transfer-interference patterns, the necessity of fine-grained supervision, and the preservation of original-resolution artifacts, highlighting the design principles for scalable, robust, and unified FIDL.

Chinese Translation

近年来，生成性人工智能的快速发展从根本上重塑了图像伪造的范式，打破了文档编辑、自然图像处理、DeepFake生成和全图像AIGC合成之间的传统界限。尽管伪造生成日益统一，但现有的假图像检测与定位（Fake Image Detection and Localization, FIDL）研究仍然碎片化。这导致了日益统一的伪造生成机制与特定领域检测范式之间的不匹配。弥合这一不匹配为FIDL带来了两个关键挑战：理解跨领域伪造物的转移与干扰，以及构建一个高容量的统一基础模型以实现联合检测与定位。为了解决这些挑战，我们提出了DeFakerOne，一个以数据为中心的统一FIDL基础模型，集成了InternVL2和SAM2。DeFakerOne能够在多种场景中实现图像级检测与像素级伪造定位的同时进行。大量实验表明，DeFakerOne在39个伪造检测基准和9个定位基准上实现了最先进的性能，超越了基线。此外，该模型在面对现实世界扰动和诸如GPT-Image-2等最先进生成器时表现出卓越的鲁棒性。最后，我们对数据扩展规律、跨领域伪造物转移-干扰模式、细粒度监督的必要性以及原始分辨率伪造物的保留进行了系统分析，强调了可扩展、鲁棒和统一FIDL的设计原则。

cs.CV / 9 / 2605.14104

DUET: Dual-Paradigm Adaptive Expert Triage with Single-cell Inductive Prior for Spatial Transcriptomics Prediction

DUET：基于单细胞归纳先验的双范式自适应专家分诊用于空间转录组学预测

Zhu, Junchao, Deng, Ruining, Guo, Junlin, Yao, Tianyuan, Qu, Chongyu, Xiong, Juming, Lu, Zhengyi, Zhu, Yanfan, Lionts, Marilyn, Yang, Yuechen, Wang, Yu, Zhao, Shilin, Yang, Haichun, Huo, Yuankai

Abstract

Inferring spatially resolved gene expression from histology images offers a cost-effective complement to spatial transcriptomics (ST). However, existing methods reduce this task to a simple morphology-to-expression mapping, where visual similarity does not guarantee molecular consistency. Meanwhile, single-cell data has amassed rich resources far surpassing the scale of ST data, yet it remains underexplored in vision-omics modeling. Furthermore, current approaches commit to a monolithic paradigm with bottlenecks, unable to balance expressive flexibility with biological fidelity. To bridge these gaps, we propose DUET, a novel dual-paradigm framework that synergizes parametric prediction and memory-based retrieval under cellular inductive priors. DUET implements a parallel regression-retrieval paradigm, adaptively reconciling the outputs of its complementary pathways. To mitigate aleatoric vision ambiguity, we incorporate large-scale single-cell references to impose molecular states as biological constraints for faithful learning. Building upon structural refinement, we further design a lightweight adapter to dynamically assign branch preference across spatial contexts to achieve optimal performance. Extensive experiments on three public datasets across varied gene scales demonstrate that DUET achieves SOTA performance, with consistent gains contributed by each proposed component. Code is available at https://github.com/Junchao-Zhu/DUET

Chinese Translation

从组织学图像推断空间分辨的基因表达为空间转录组学（ST）提供了一种具有成本效益的补充。然而，现有方法将这一任务简化为简单的形态与表达映射，其中视觉相似性并不保证分子一致性。同时，单细胞数据的积累远超ST数据的规模，但在视觉组学建模中仍未得到充分探索。此外，当前方法承诺于单一的范式，存在瓶颈，无法平衡表达灵活性与生物学真实度。为了解决这些问题，我们提出了DUET，这是一种新颖的双范式框架，结合了参数预测和基于记忆的检索，基于细胞归纳先验进行协同。DUET实现了并行回归-检索范式，自适应地调和其互补路径的输出。为了减轻视觉模糊带来的不确定性，我们引入大规模单细胞参考，以将分子状态作为生物约束，从而实现真实的学习。在结构优化的基础上，我们进一步设计了一种轻量级适配器，以动态分配空间上下文中的分支偏好，以实现最佳性能。在三个不同基因规模的公共数据集上的广泛实验表明，DUET达到了最先进的性能，每个提出的组件均持续贡献了显著的增益。代码可在 https://github.com/Junchao-Zhu/DUET 获取。

cs.CV / 10 / 2605.14108

Bridging the Rural Healthcare Gap: A Cascaded Edge-Cloud Architecture for Automated Retinal Screening

弥合农村医疗保健差距：一种用于自动化视网膜筛查的级联边缘-云架构

Doshi, Nishi, Shah, Shrey

Abstract

Diabetic Retinopathy (DR) is one of the leading causes of preventable blindness, yet rural regions often lack the specialists and infrastructure needed for early detection. Although cloud-based deep learning systems offer high accuracy, they face significant challenges in these settings due to high latency, limited bandwidth, and high data transmission costs. To address these challenges, we propose a two-tier edge-cloud cascade and evaluate it on the public APTOS 2019 Blindness Detection dataset. Tier 1 runs a lightweight MobileNetV3-small model on a local clinic device to perform a binary triage between Referable DR (Classes 2-4) and Non-referable DR (Classes 0-1). Tier 2 runs a RETFound-DINOv2 model in the cloud for ordinal severity grading, but only on the subset of images flagged as referable by Tier 1. On a stratified APTOS test split of 733 images, Tier 1 reaches 98.99% sensitivity and 84.37% specificity at a validation-tuned high-sensitivity threshold. The default cascade forwards 49.52% of test images to Tier 2, reducing cloud calls by 50.48% relative to using a cloud-based model for all images. In the deployed 4-class output space (Class 0-1 / Class 2 / Class 3 / Class 4), the cascade obtains 80.49% accuracy and 0.8167 quadratic weighted kappa; the cloud-only baseline obtains 80.76% accuracy and 0.8184 quadratic weighted kappa. On APTOS, the cascade cuts cloud use by about half with a modest drop in grading performance. Index Terms: Diabetic Retinopathy, Edge-Cloud Cascade, MobileNetV3-small, RETFound-DINOv2, Retinal Screening, tele-ophthalmology

Chinese Translation

糖尿病视网膜病变（DR）是可预防失明的主要原因之一，但农村地区往往缺乏早期检测所需的专家和基础设施。尽管基于云的深度学习系统提供了高准确性，但由于高延迟、带宽有限和数据传输成本高，这些系统在这些环境中面临重大挑战。为了解决这些问题，我们提出了一种在公共APTOS 2019失明检测数据集上运行的两级边缘-云级联架构。第一级在本地诊所设备上运行轻量级的MobileNetV3-small模型，以对可转诊的DR（类别2-4）和不可转诊的DR（类别0-1）进行二元分流。第二级在云端运行RETFoundDINOv2模型进行序列严重性分级，但仅对第一级标记为可转诊的图像子集进行处理。在733幅图像的分层APTOS测试集中，第一级在经过验证调整的高灵敏度阈值下达到了98.99%的灵敏度和84.37%的特异性。默认级联将49.52%的测试图像转发到第二级，相较于对所有图像使用云模型，减少了50.48%的云调用。在部署的4类输出空间（类别0-1 / 类别2 / 类别3 / 类别4）中，该级联获得了80.49%的准确率和0.8167的二次加权kappa；而仅使用云的基线获得了80.76%的准确率和0.8184的二次加权kappa。在APTOS上，该级联将云使用量减少了约一半，同时分级性能略有下降。关键词：糖尿病视网膜病变，边缘-云级联，MobileNetV3-small，RETFound-DINOv2，视网膜筛查，远程眼科
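
The routing logic described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `edge_score` and `cloud_grade` are hypothetical stand-ins for the Tier 1 and Tier 2 models, the integer label encoding is invented here, and the toy scores are not data from the paper.

```python
from typing import Callable, Sequence


def run_cascade(
    images: Sequence[object],
    edge_score: Callable[[object], float],
    cloud_grade: Callable[[object], int],
    threshold: float,
) -> tuple[list[int], float]:
    """Route each image through Tier 1 and, only if flagged, through Tier 2.

    Labels use a hypothetical encoding of the 4-class output space:
    0 = non-referable (original classes 0-1); other values come from the
    cloud grader. Also returns the fraction of images sent to the cloud.
    """
    labels: list[int] = []
    cloud_calls = 0
    for image in images:
        if edge_score(image) >= threshold:   # Tier 1 flags the image as referable
            cloud_calls += 1
            labels.append(cloud_grade(image))  # Tier 2 assigns the severity grade
        else:
            labels.append(0)                 # stays on the edge device
    forwarded = cloud_calls / len(images) if images else 0.0
    return labels, forwarded


# Toy example: four image ids with made-up Tier 1 referable scores.
toy_scores = {"a": 0.9, "b": 0.1, "c": 0.6, "d": 0.2}
labels, forwarded = run_cascade(
    list(toy_scores), toy_scores.__getitem__, lambda _image: 2, threshold=0.5
)
cloud_reduction = 1.0 - forwarded

# The reduction reported in the abstract follows directly from the forwarding rate.
reported_reduction_percent = 100.0 - 49.52
```

Lowering the threshold raises Tier 1 sensitivity at the cost of forwarding more images, which is the trade-off the validation-tuned high-sensitivity threshold controls.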

cs.CV / 11 / 2605.14110

SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

SToRe3D：在视觉变换器中实现稀疏标记相关性以提高多视角3D物体检测的效率

Papais, Sandro, Feng, Lezhou, Cossette, Charles, Ge, Lingting

Abstract

Vision Transformers (ViTs) enable strong multi-view 3D detection but are limited by high inference latency from dense token and query processing across multiple views and large 3D regions. Existing sparsity methods, designed mainly for 2D vision, prune or merge image tokens but do not extend to full-model sparsity or address 3D object queries. We introduce SToRe3D, a relevance-aligned sparsity framework that jointly selects 2D image tokens and 3D object queries while storing filtered features for reactivation. Mutual 2D-3D relevance heads allocate compute to driving-critical content and preserve other embeddings. Evaluated on nuScenes and our new nuScenes-Relevance benchmark, SToRe3D achieves up to 3x faster inference with marginal accuracy loss, establishing real-time large-scale ViT-based 3D detection while maintaining accuracy on planning-critical agents.

Chinese Translation

视觉变换器（ViTs）使得强大的多视角3D检测成为可能，但由于在多个视角和大规模3D区域中处理密集标记和查询所导致的高推理延迟而受到限制。现有的稀疏性方法主要针对2D视觉，修剪或合并图像标记，但并未扩展到全模型稀疏性或解决3D物体查询的问题。我们提出了SToRe3D，一种相关性对齐的稀疏性框架，能够同时选择2D图像标记和3D物体查询，同时存储过滤后的特征以便重新激活。互相关联的2D-3D相关性头将计算资源分配给关键内容，同时保留其他嵌入。在nuScenes及我们新的nuScenes-Relevance基准上进行评估，SToRe3D实现了最高3倍的推理速度提升，且准确性损失极小，建立了实时大规模基于ViT的3D检测，同时保持了对规划关键代理的准确性。

cs.CV / 12 / 2605.14113

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

ProtoMedAgent：通过隐私意识的代理工作流程实现多模态临床可解释性

Pellicer, Alvaro Lopez, Angelov, Plamen, Bukhari, Marwan, Li, Yi, Soares, Eduardo, Kerns, Jemma

Abstract

While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers "retrieval sycophancy," where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by $k$-anonymity and $\ell$-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2% Comparison Set Faithfulness, fundamentally outperforming standard RAG (46.2%). ProtoMedAgent additionally leverages a binding $\ell$-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8%.

Chinese Translation

尽管可解释的原型网络为临床诊断提供了引人注目的案例推理，但其原始连续输出缺乏医疗文档所需的语义结构。通过标准的检索增强生成（Retrieval-Augmented Generation, RAG）来弥补这一差距，常常会触发“检索谄媚”（retrieval sycophancy），在这种情况下，大型语言模型（Large Language Models, LLMs）会虚构事后合理化，以与视觉预测对齐。我们提出了ProtoMedAgent，一个将多模态临床报告形式化为在严格的神经符号瓶颈上进行的迭代零梯度测试时间优化问题的框架。在一个冻结的原型骨干网络上，我们将潜在的视觉和表格特征提炼为离散的语义记忆。在线生成严格受限于精确的集合论微分和反思的记录-批评循环，从数学上排除了不支持的叙述主张。为了安全地限制数据披露，我们引入了一个由$k$-匿名性和$\ell$-多样性控制的语义隐私门。经过对4,160名患者的临床队列评估，ProtoMedAgent在比较集的忠实度上达到了91.2%，在这一点上显著优于标准RAG（46.2%）。此外，ProtoMedAgent还利用绑定的$\ell$-多样性相变系统性地将伪影级别的成员推断风险绝对降低了9.8%。
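
As a rough illustration of the two criteria named for the privacy gate, the sketch below checks $k$-anonymity and distinct $\ell$-diversity over a small table. It shows the textbook definitions only; how ProtoMedAgent applies them to its semantic memory is not specified in the abstract, and the field names and records here are hypothetical.

```python
from collections import defaultdict


def passes_privacy_gate(records, quasi_identifiers, sensitive, k, l):
    """Check k-anonymity and distinct l-diversity.

    Records are grouped by their quasi-identifier values. Every group must
    contain at least k records (k-anonymity) and at least l distinct values
    of the sensitive attribute (distinct l-diversity).
    """
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[name] for name in quasi_identifiers)
        groups[key].append(record[sensitive])
    return all(
        len(values) >= k and len(set(values)) >= l
        for values in groups.values()
    )


# Hypothetical toy cohort, not data from the paper.
toy_records = [
    {"age_band": "60-69", "sex": "F", "diagnosis": "A"},
    {"age_band": "60-69", "sex": "F", "diagnosis": "B"},
    {"age_band": "70-79", "sex": "M", "diagnosis": "A"},
    {"age_band": "70-79", "sex": "M", "diagnosis": "A"},
]
quasi = ["age_band", "sex"]

ok_k2_l1 = passes_privacy_gate(toy_records, quasi, "diagnosis", k=2, l=1)
ok_k2_l2 = passes_privacy_gate(toy_records, quasi, "diagnosis", k=2, l=2)
ok_k3_l1 = passes_privacy_gate(toy_records, quasi, "diagnosis", k=3, l=1)
```

The second check fails because one group, although large enough, shares a single diagnosis, which is exactly the attribute-disclosure case that $\ell$-diversity adds on top of $k$-anonymity.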

cs.CV / 13 / 2605.14135

PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting

PanoPlane：针对稀疏视图室内3D高斯点云的平面感知全景补全

Qureshi, Adil, Jung, Dongki, Choi, Jaehoon, Manocha, Dinesh

Abstract

We present PanoPlane, an approach for high-fidelity sparse-view indoor novel view synthesis that reconstructs closed room geometry via panoramic scene completion. Unlike perspective-based methods that generate training views from limited fields of view, PanoPlane leverages $360^{\circ}$ panoramic completion to condition the generative process on the full spatial layout. We propose Layout Anchored Attention Steering, a training-free mechanism that steers attention within the diffusion model's internal representation toward the scene's detected planar surfaces at inference time. By directing each unobserved region's attention toward geometrically consistent observed content, our method replaces unconstrained hallucination with grounded surface extrapolation. The resulting panoramic completions provide supervision for 3D Gaussian Splatting, enabling accurate novel-view synthesis across unobserved regions from as few as three input views. Experiments on Replica, ScanNet++, and Matterport3D demonstrate state-of-the-art novel view synthesis quality across 3, 6, and 9 input views, achieving up to $+17.8\%$ improvement in PSNR over the current state-of-the-art baseline without any training or fine-tuning of the diffusion model.

Chinese Translation

我们提出了PanoPlane，一种用于高保真稀疏视图室内新视图合成的方法，通过全景场景补全重建封闭房间的几何形状。与基于透视的方法不同，后者从有限的视场生成训练视图，PanoPlane利用$360^{\circ}$全景补全将生成过程的条件建立在完整的空间布局上。我们提出了一种布局锚定注意力引导机制（Layout Anchored Attention Steering），这是一种无训练机制，在推理时引导扩散模型内部表示中的注意力朝向场景中检测到的平面表面。通过将每个未观察区域的注意力引导至几何上一致的观察内容，我们的方法用扎根的表面外推取代了不受限制的幻觉。最终得到的全景补全为3D高斯点云提供了监督，使得从少至三个输入视图中能够准确合成未观察区域的新视图。在Replica、ScanNet++和Matterport3D上的实验表明，在3、6和9个输入视图下，达到了最先进的新视图合成质量，相较于当前最先进的基线，在PSNR上提高了高达$+17.8\%$，且无需对扩散模型进行任何训练或微调。

cs.CV / 14 / 2605.14136

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

TeDiO：无训练的时间对角优化用于一致性视频扩散

Tursynbek, Nurislam, Lao, Zhiqiang, Yu, Heather, Bertasius, Gedas, Niethammer, Marc

Abstract

Recent text-to-video diffusion transformers generate visually compelling frames, yet still struggle with temporal coherence, often producing flickering, drifting, or unstable motion. We show that these failures leave a clear imprint inside the model: incoherent videos consistently exhibit irregular, fragmented temporal diagonals in their intermediate self-attention maps, whereas stable motion corresponds to smooth, band-diagonal patterns. Building on this observation, we introduce TeDiO, a training-free, inference-time method that reinforces temporal consistency by regularizing these internal attention patterns. TeDiO estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates that promote coherent frame-to-frame dynamics, without modifying model weights or using external motion supervision. Across multiple video diffusion models (e.g., Wan2.1, CogVideoX), TeDiO delivers markedly smoother motion while preserving per-frame visual quality, offering an efficient plug-and-play approach to improving dynamic realism in modern video generation systems.

Chinese Translation

近期的文本到视频扩散变换器生成了视觉上引人注目的帧，但在时间一致性方面仍然存在困难，常常产生闪烁、漂移或不稳定的运动。我们展示了这些失败在模型内部留下了明显的印记：不一致的视频在其中间自注意力图中始终表现出不规则、碎片化的时间对角线，而稳定的运动则对应于平滑的带状对角模式。基于这一观察，我们引入了TeDiO，这是一种无训练的推理时方法，通过对这些内部注意力模式进行正则化来增强时间一致性。TeDiO估计对角平滑度，识别不稳定区域，并执行轻量级的潜在更新，以促进一致的帧间动态，而无需修改模型权重或使用外部运动监督。在多个视频扩散模型（如Wan2.1、CogVideoX）中，TeDiO显著提高了运动的平滑性，同时保持了每帧的视觉质量，提供了一种高效的即插即用方法，以改善现代视频生成系统中的动态真实感。

cs.CV / 15 / 2605.14145

Rethinking the Good Enough Embedding for Easy Few-Shot Learning

重新思考足够好的嵌入以实现简单的少样本学习

Karnes, Michael, Yilmaz, Alper

Abstract

The field of deep visual recognition is undergoing a paradigm shift toward universal representations. The Platonic Representation Hypothesis suggests that diverse architectures trained on massive datasets are converging toward a shared, "ideal" latent space. This again raises a critical question: is a "Good Embedding All You Need?" In this paper, we leverage this convergence to demonstrate that off-the-shelf embeddings are inherently "good enough" for complex tasks, rendering intensive task-specific fine-tuning unnecessary. We explore this hypothesis within the few-shot learning framework, proposing a straightforward, non-parametric pipeline that entirely bypasses backpropagation. Using a k-Nearest Neighbor classifier on frozen DINOv2-L features, we conduct a layer-wise characterization to identify the optimal feature-extraction layer. We further demonstrate that manifold refinement via PCA and ICA provides a beneficial regularizing effect. Our results across four major benchmarks demonstrate that our approach consistently surpasses sophisticated meta-learning algorithms, achieving state-of-the-art performance.

Chinese Translation

深度视觉识别领域正在朝着通用表征的范式转变。柏拉图表征假设表明，在大规模数据集上训练的多样化架构正在趋向于一个共享的“理想”潜在空间。这再次提出了一个关键问题：“足够好的嵌入就是你所需要的吗？”在本文中，我们利用这种收敛性证明，现成的嵌入在复杂任务中本质上是“足够好的”，从而使得密集的任务特定微调变得不必要。我们在少样本学习框架内探索这一假设，提出了一种简单的非参数管道，完全绕过反向传播。通过在冻结的 DINOv2-L 特征上利用 k-最近邻分类器，我们进行逐层特征表征，以识别最佳特征提取。我们进一步证明，通过 PCA 和 ICA 进行流形细化提供了有益的正则化效果。我们在四个主要基准上的结果表明，我们的方法始终超越复杂的元学习算法，实现了最先进的性能。
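
The core of such a backpropagation-free pipeline can be sketched in a few lines. The example below is a minimal illustration, assuming cosine similarity as the distance and tiny hand-written vectors in place of frozen DINOv2-L features; the PCA/ICA refinement step is omitted, and none of the names come from the paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, support, labels, k=1):
    """Classify `query` by majority vote over its k most similar support items.

    `support` holds the (hypothetical) frozen embeddings of the few labelled
    shots; no parameters are trained. Ties go to the nearer neighbour's label.
    """
    ranked = sorted(
        range(len(support)),
        key=lambda i: cosine(query, support[i]),
        reverse=True,
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 2-way 2-shot episode with made-up 2-D "embeddings".
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["cat", "cat", "dog", "dog"]

pred_a = knn_predict([0.8, 0.2], support, labels, k=3)
pred_b = knn_predict([0.2, 0.7], support, labels, k=1)
```

Since nothing is fitted beyond storing the support embeddings, the quality of the result rests entirely on the frozen feature extractor, which is the hypothesis the abstract sets out to test.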

cs.CV / 16 / 2605.14166

You Only Landmark Once: Lightweight U-Net Face Super Resolution with YOLO-World Landmark Heatmaps

一次性标记：基于轻量级 U-Net 的面部超分辨率与 YOLO-World 地标热图

Carraro, Riccardo, Briotto, Anna, Hysa, Endi, Fiorucci, Marco, Ballan, Lamberto

Abstract

Face image super-resolution aims to recover high-resolution facial images from severely degraded inputs. Under extreme upscaling factors, fine facial details are often lost, making accurate reconstruction challenging. Existing methods typically rely on heavy network architectures, adversarial training schemes, or separate alignment networks, increasing model complexity and computational cost. To address these issues, we propose a lightweight U-Net-based architecture designed to reconstruct $128 \times 128$ facial images from severely degraded $16 \times 16$ inputs, achieving an $8\times$ magnification. A key contribution is a novel auxiliary-training-free supervision strategy that leverages heatmaps generated by YOLO-World, an open-vocabulary object detector, to localize key facial features such as eyes, nose, and mouth. These heatmaps are converted into spatial weights to form a heatmap-guided loss that emphasizes reconstruction errors in semantically important regions. Unlike prior methods that require dedicated landmark or alignment networks, our approach directly reuses detector outputs as supervision, maintaining an efficient training and inference pipeline. Experiments on the aligned CelebA dataset demonstrate that the proposed loss consistently improves quantitative metrics and produces sharper, more realistic reconstructions. Overall, our results show that lightweight networks can effectively exploit detection-driven priors for perceptually convincing extreme upscaling, without adversarial training or increased computational cost.

Chinese Translation

面部图像超分辨率旨在从严重退化的输入中恢复高分辨率的面部图像。在极端放大倍数下，细微的面部细节往往会丢失，使得准确重建变得具有挑战性。现有方法通常依赖于复杂的网络架构、对抗训练方案或单独的对齐网络，增加了模型的复杂性和计算成本。为了解决这些问题，我们提出了一种基于轻量级 U-Net 的架构，旨在从严重退化的 $16 \times 16$ 输入重建 $128 \times 128$ 的面部图像，实现 $8\times$ 的放大。一个关键贡献是提出了一种新颖的无辅助训练监督策略，该策略利用由 YOLO-World 生成的热图（YOLO-World 是一种开放词汇的目标检测器）来定位面部关键特征，如眼睛、鼻子和嘴巴。这些热图被转换为空间权重，形成一种热图引导的损失，强调在语义重要区域的重建误差。与之前需要专用地标或对齐网络的方法不同，我们的方法直接重用检测器输出作为监督，保持高效的训练和推理流程。在对齐的 CelebA 数据集上的实验表明，所提出的损失在定量指标上持续改善，并产生更清晰、更逼真的重建结果。总体而言，我们的结果表明，轻量级网络可以有效利用基于检测的先验进行感知上令人信服的极端放大，而无需对抗训练或增加计算成本。
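
One plausible way to turn landmark heatmaps into a spatially weighted reconstruction loss is sketched below. The weighting form `1 + alpha * heatmap`, the use of an L1 error, and all names are assumptions made for illustration; the abstract does not state the exact formula.

```python
def heatmap_weighted_l1(pred, target, heatmap, alpha=1.0):
    """Weighted mean absolute error over a 2-D image.

    Each pixel error is scaled by w = 1 + alpha * heatmap value (an ASSUMED
    weighting form), so pixels near detected facial features count more.
    The result is normalised by the sum of the weights.
    """
    total = 0.0
    weight_sum = 0.0
    for pred_row, target_row, heat_row in zip(pred, target, heatmap):
        for p, t, h in zip(pred_row, target_row, heat_row):
            weight = 1.0 + alpha * h
            total += weight * abs(p - t)
            weight_sum += weight
    return total / weight_sum


# Toy 2x2 example: two wrong pixels, one of which lies on a "landmark".
pred = [[0.0, 0.0], [0.0, 0.0]]
target = [[1.0, 0.0], [0.0, 1.0]]
landmark_heatmap = [[1.0, 0.0], [0.0, 0.0]]
flat_heatmap = [[0.0, 0.0], [0.0, 0.0]]

weighted_loss = heatmap_weighted_l1(pred, target, landmark_heatmap)
plain_loss = heatmap_weighted_l1(pred, target, flat_heatmap)
```

With a flat heatmap the function reduces to an ordinary mean absolute error, so the heatmap only redistributes emphasis rather than changing the underlying objective.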

cs.CV / 17 / 2605.14191

CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers

CoReDiT：基于空间一致性的引导令牌剪枝与重构以提高扩散变换器的效率

Li, Zhuojin, Cheng, Hsin-Pai, Cai, Hong, Han, Shizhong, Porikli, Fatih

Abstract

Diffusion Transformers (DiTs) deliver remarkable image and video generation quality but incur high computational cost, limiting scalability and on-device deployment. We introduce CoReDiT, a structured token pruning framework for DiTs across vision tasks. CoReDiT uses a linear-time spatial coherence score to estimate local redundancy in the latent token lattice and skips high-coherence (redundant) tokens in self-attention. To maintain a dense representation and avoid visual discontinuities, we reconstruct skipped attention outputs via coherence-guided aggregation of spatially neighboring retained tokens. We further introduce a progressive, block-adaptive pruning schedule that increases pruning gradually and allocates larger budgets to blocks and denoising steps with higher redundancy. Across state-of-the-art diffusion backbones including PixArt-α and MagicDrive-V2, CoReDiT achieves up to 55% self-attention FLOPs reduction and inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs, while maintaining high visual quality. Notably, CoReDiT also increases on-device memory headroom, enabling higher-resolution generation.

Chinese Translation

扩散变换器（DiTs）在图像和视频生成质量上表现出色，但其高计算成本限制了可扩展性和设备端部署。我们提出了CoReDiT，这是一个针对DiTs在视觉任务中进行结构化令牌剪枝的框架。CoReDiT使用线性时间的空间一致性评分来估计潜在令牌格中的局部冗余，并在自注意力中跳过高一致性（冗余）令牌。为了保持密集表示并避免视觉不连续性，我们通过对空间上相邻的保留令牌进行一致性引导的聚合来重构跳过的注意力输出。我们进一步引入了一种渐进的、块自适应的剪枝计划，该计划逐步增加剪枝，并将更大的预算分配给冗余更高的块和去噪步骤。在包括PixArt-α和MagicDrive-V2在内的最先进的扩散骨干网络上，CoReDiT实现了高达55%的自注意力FLOPs减少，并在云GPU上实现了1.33倍的推理加速，在移动NPU上实现了1.72倍的推理加速，同时保持高视觉质量。值得注意的是，CoReDiT还增加了设备端内存的余量，使得更高分辨率的生成成为可能。

cs.CV / 18 / 2605.14221

Automatic Landmark-Based Segmentation of Human Subcortical Structures in MRI

基于自动标志点的人类皮层下结构的MRI分割

Rekik, Ahmed, Rushmore, R. Jarrett, Bouix, Sylvain, Marrakchi-Kacem, Linda

Abstract

Precise segmentation of brain structures in magnetic resonance imaging (MRI) is essential for reliable neuroimaging analysis, yet voxel-wise deep models often yield anatomically inconsistent results that diverge from expert-defined boundaries. In this research, we propose a landmark-guided 3D brain segmentation approach that explicitly mimics the manual segmentation protocol of the Harvard-Oxford Atlas. A Global-to-Local network automatically detects 16 landmarks representing key subcortical reference points. Then, a semantic segmentation model produces a coarse segmentation of 12 anatomical labels, each grouping multiple subcortical regions. Finally, a landmark-driven post-processing step separates these 12 labels into 26 distinct structures by enforcing local anatomical constraints. Experimental results demonstrate consistent improvements in boundary accuracy. Overall, integrating learned landmarks aligns segmentations more closely with manual protocols.

Chinese Translation

在磁共振成像（MRI）中，精确分割脑结构对于可靠的神经影像分析至关重要，然而，体素级深度模型往往产生与专家定义边界不一致的解剖结果。在本研究中，我们提出了一种基于标志点的3D脑分割方法，该方法明确模仿哈佛-牛津图谱的手动分割协议。一个全局到局部的网络自动检测16个标志点，代表关键的皮层下参考点。然后，一个语义分割模型生成12个解剖标签的粗略分割，每个标签聚合多个皮层下区域。最后，一个基于标志点的后处理步骤通过强制局部解剖约束将这12个标签分离成26个独特的结构。实验结果表明边界准确性的一致性提高。总体而言，整合学习到的标志点使得分割结果与手动协议更为一致。

cs.CV / 19 / 2605.14239

Implicit Spatial-Frequency Fusion of Hyperspectral and LiDAR Data via Kolmogorov-Arnold Networks

通过 Kolmogorov-Arnold 网络隐式空间频率融合高光谱和激光雷达数据

Long, Zekun, Yang, Judy X., Wang, Jing, Zia, Ali, Fu, Guanyiman, Zhou, Jun

Abstract

Hyperspectral image (HSI) classification is challenging in complex scenes due to spectral ambiguity, spatial heterogeneity, and the strong coupling between material properties and geometric structures. Although LiDAR provides complementary elevation information, most HSI-LiDAR fusion methods rely on CNNs or MLPs with fixed activation functions and linear weights. These methods struggle to model structural discontinuities in LiDAR data, intricate spectral features of HSI, and their interactions. In addition, fusion of the two modalities in both spatial and frequency domains with LiDAR guidance remains underexplored. To address these issues, we propose the Implicit Frequency-Geometry Fusion Network (IFGNet), which leverages Kolmogorov-Arnold Networks (KANs) with learnable spline-based functions to adaptively capture highly nonlinear relationships between hyperspectral and LiDAR features. Furthermore, IFGNet introduces a LiDAR-guided implicit aggregation module in both spatial and frequency domains, enhancing geometry-aware spatial representations while capturing global structural patterns. Experiments on the Houston 2013 and MUUFL benchmarks demonstrate that IFGNet consistently outperforms existing fusion methods in overall accuracy, average accuracy, and Cohen's Kappa, while maintaining an efficient architecture.

Chinese Translation

高光谱图像（HSI）分类在复杂场景中面临挑战，原因包括光谱模糊性、空间异质性以及材料属性与几何结构之间的强耦合。尽管激光雷达（LiDAR）提供了互补的高程信息，但大多数 HSI-LiDAR 融合方法依赖于具有固定激活函数和线性权重的卷积神经网络（CNN）或多层感知器（MLP）。这些方法在建模 LiDAR 数据中的结构不连续性、高光谱图像的复杂光谱特征及其相互作用方面存在困难。此外，利用 LiDAR 指导在空间和频率域中融合这两种模态仍然未得到充分探索。为了解决这些问题，我们提出了隐式频率-几何融合网络（IFGNet），该网络利用具有可学习样条函数的 Kolmogorov-Arnold 网络（KAN）自适应捕捉高光谱特征与 LiDAR 特征之间高度非线性的关系。此外，IFGNet 在空间和频率域中引入了一个 LiDAR 指导的隐式聚合模块，增强了几何感知的空间表示，同时捕捉全局结构模式。在 Houston 2013 和 MUUFL 基准测试上的实验表明，IFGNet 在整体准确率、平均准确率和 Cohen's Kappa 指标上始终优于现有的融合方法，同时保持了高效的架构。

cs.CV / 20 / 2605.14251

Generative Deep Learning for Computational Destaining and Restaining of Unregistered Digital Pathology Images

生成深度学习在未注册数字病理图像的计算去染和再染中的应用

Kulkarni, Aarushi, Lowe, Alarice, Shah, Pratik

Abstract

Conditional generative adversarial networks (cGANs) have enabled high-fidelity computational staining and destaining of hematoxylin and eosin (H&E) in digital pathology whole-slide images (WSI). However, their ability to generalize to out-of-distribution WSI across institutions without retraining remains insufficiently characterized. Previously developed cGAN models trained on 102 registered prostate core biopsy WSIs from Brigham and Women's Hospital were evaluated on 82 spatially unregistered WSIs acquired at Stanford University. To mitigate domain shift without retraining, a preprocessing pipeline consisting of histogram-based stain normalization for H&E-stained WSIs and channel-wise intensity calibration for unstained WSIs was developed. Because image registration was intentionally omitted for real-world deployment conditions, the reported quantitative results are conservative lower bounds reflecting both model performance and limited spatial alignment. Under these conditions, virtual destaining achieved a Pearson correlation coefficient (PCC) of 0.854, structural similarity index measure (SSIM) of 0.699, and peak signal-to-noise ratio (PSNR) of 18.41 dB. H&E restaining from computationally destained outputs outperformed direct staining from ground-truth unstained inputs across all metrics (PCC: 0.798 vs. 0.715; SSIM: 0.756 vs. 0.718; PSNR: 20.08 vs. 18.51 dB), suggesting that preprocessing quality may be more limiting than model capacity. Qualitative pathological review indicated preservation of benign glandular structures while showing that malignant glands were often rendered with vessel-like morphologies. These findings support the feasibility of applying cGAN-based computational H&E staining and destaining generative models to external WSI datasets using preprocessing-based adaptation alone while defining specific morphological targets for future domain adaptation.

Chinese Translation

条件生成对抗网络（cGANs）使得在数字病理全切片图像（WSI）中实现高保真度的计算染色和去染成为可能。然而，它们在不同机构之间对未分布WSI的泛化能力仍然不足以进行充分表征。之前开发的cGAN模型是在来自布莱根妇女医院的102个注册前列腺核心活检WSI上训练的，并在斯坦福大学获得的82个空间未注册WSI上进行了评估。为了在不重新训练的情况下减轻领域转移，开发了一种预处理管道，包括针对H&E染色WSI的基于直方图的染色标准化和针对未染色WSI的通道强度校准。由于在真实世界部署条件下故意省略了图像配准，报告的定量结果是保守的下限，反映了模型性能和有限的空间对齐。在这些条件下，虚拟去染实现了0.854的皮尔逊相关系数（PCC）、0.699的结构相似性指数（SSIM）和18.41 dB的峰值信噪比（PSNR）。从计算去染输出中进行H&E再染在所有指标上均优于直接从真实未染输入进行染色（PCC: 0.798 vs. 0.715; SSIM: 0.756 vs. 0.718; PSNR: 20.08 vs. 18.51 dB），这表明预处理质量可能比模型能力更具限制性。定性病理评审表明良性腺体结构得到了保留，而恶性腺体则常常呈现出血管样形态。这些发现支持了仅通过基于预处理的适应，便能够将基于cGAN的计算H&E染色和去染生成模型应用于外部WSI数据集的可行性，同时为未来的领域适应定义了特定的形态学目标。
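
The abstract above reports Pearson correlation (PCC) and peak signal-to-noise ratio (PSNR) between generated and reference images. As a reading aid, here is a minimal sketch of how those two scores are commonly computed. It assumes 8-bit intensities flattened into plain lists; the function names and sample values are hypothetical, and this is not the authors' evaluation code.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def psnr(x, y, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the inputs are identical."""
    mse = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    return math.inf if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

# Hypothetical flattened pixel intensities (illustrative values only).
reference = [10, 50, 90, 130, 170, 210]
generated = [12, 47, 95, 128, 175, 205]
print(f"PCC  = {pearson(reference, generated):.3f}")
print(f"PSNR = {psnr(reference, generated):.2f} dB")
```

Both scores compare pixels at the same coordinates, which is why the abstract treats results on unregistered slides as conservative lower bounds: any spatial misalignment lowers them even when the generated stain is plausible.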

cs.CV / 21 / 2605.14253

Towards Real-Time Autonomous Navigation: Transformer-Based Catheter Tip Tracking in Fluoroscopy

实时自主导航的探索：基于变换器的荧光透视导管尖端跟踪

Robertshaw, Harry, Hao, Yanghe, Deng, Weiyuan, Jackson, Benjamin, Sadati, S. M. Hadi, Fischer, Nikola, Vercauteren, Tom, Granados, Alejandro, Booth, Thomas C.

Abstract

Purpose: Mechanical thrombectomy (MT) improves stroke outcomes, but is limited by a lack of local treatment access. Widespread distribution of reinforcement learning (RL)-based robotic systems can be used to alleviate this challenge through autonomous navigation, but current RL methods require live device tip coordinate tracking to function. This paper aims to develop and evaluate a real-time catheter tip tracking pipeline under fluoroscopy, addressing challenges such as low contrast, noise, and device occlusion. Methods: A multi-threaded pipeline was designed, incorporating frame reading, preprocessing, inference, and post-processing. Deep learning segmentation models, including U-Net, U-Net+Transformer, and SegFormer, were trained and benchmarked using two-class and three-class formulations. Post-processing involved two-step component filtering, one-pixel medial skeletonization, and greedy arc-length path following with contour fall-back. Results: On manually-labeled moderate complexity fluoroscopic video data, the two-class SegFormer achieved a mean absolute error of 4.44 mm, outperforming U-Net (4.60 mm), U-Net+Transformer (6.20 mm) and all three-class models (5.19-7.74 mm). On segmentation benchmarks, the system exceeded state-of-the-art CathAction results with improvements of up to +5% in Dice scores for three-segmentation. Conclusion: The results demonstrate that the proposed multi-threaded tracking framework maintains stable performance under challenging imaging conditions, outperforming prior benchmarks, while providing a reliable and efficient foundation for RL-based autonomous MT navigation.

Chinese Translation

目的：机械性血栓切除术（MT）改善中风预后，但受限于缺乏局部治疗途径。基于强化学习（RL）的机器人系统的广泛分布可以通过自主导航来缓解这一挑战，但当前的RL方法需要实时设备尖端坐标跟踪才能发挥作用。本文旨在开发和评估一种在荧光透视下的实时导管尖端跟踪管道，解决低对比度、噪声和设备遮挡等挑战。方法：设计了一个多线程管道，包含帧读取、预处理、推理和后处理。训练并基准测试了深度学习分割模型，包括U-Net、U-Net+Transformer和SegFormer，使用了两类和三类的公式。后处理包括两步组件过滤、一像素中轴骨架化和贪婪弧长路径跟踪与轮廓回退。结果：在手动标记的中等复杂度荧光透视视频数据上，两类SegFormer的平均绝对误差为4.44毫米，优于U-Net（4.60毫米）、U-Net+Transformer（6.20毫米）和所有三类模型（5.19-7.74毫米）。在分割基准测试中，该系统超越了最先进的CathAction结果，三分割的Dice分数提高了多达5%。结论：结果表明，所提出的多线程跟踪框架在具有挑战性的成像条件下保持稳定性能，超越了先前的基准，同时为基于RL的自主MT导航提供了可靠和高效的基础。

cs.CV / 22 / 2605.14267

Image Restoration via Diffusion Models with Dynamic Resolution

基于动态分辨率的扩散模型图像恢复

Zheng, Yang, Li, Wen, Liu, Zhaoqiang

Abstract

Diffusion models (DMs) have exhibited remarkable efficacy in various image restoration tasks. However, existing approaches typically operate within the high-dimensional pixel space, resulting in high computational overhead. While methods based on latent DMs seek to alleviate this issue by utilizing the compressed latent space of a variational autoencoder, they require repeated encoder-decoder inference. This introduces significant additional computational burdens, often resulting in runtime performance that is even inferior to that of their pixel-space counterparts. To mitigate the computational inefficiency, this work proposes projecting data into lower-dimensional subspaces using dynamic resolution DMs to accelerate the inference process. We first fine-tune pre-trained DMs for dynamic resolution priors and adapt DPS and DAPS, which are two widely used pixel-space methods for general image restoration tasks, into the proposed framework, yielding methods we refer to as SubDPS and SubDAPS, respectively. Given the favorable inference speed and reconstruction fidelity of SubDAPS, we introduce an enhanced variant termed SubDAPS++ to further boost both reconstruction efficiency and quality. Empirical evaluations across diverse image datasets and various restoration tasks demonstrate that the proposed methods outperform recent DM-based approaches in the majority of experimental scenarios. The code is available at https://github.com/StarNextDay/SubDAPS.git.

Chinese Translation

扩散模型（Diffusion Models, DMs）在各种图像恢复任务中展现出了显著的有效性。然而，现有方法通常在高维像素空间中运行，导致高计算开销。虽然基于潜在扩散模型的方法试图通过利用变分自编码器的压缩潜在空间来缓解这一问题，但它们需要重复的编码器-解码器推理。这引入了显著的额外计算负担，往往导致运行时性能甚至不如其像素空间对应物。为了减轻计算低效，本研究提出使用动态分辨率扩散模型将数据投影到低维子空间，以加速推理过程。我们首先对预训练的扩散模型进行微调，以适应动态分辨率先验，并将DPS和DAPS这两种广泛使用的像素空间方法适配到所提出的框架中，得到我们称之为SubDPS和SubDAPS的方法。鉴于SubDAPS在推理速度和重建保真度方面的优越性，我们引入了一个增强变体SubDAPS++，以进一步提升重建效率和质量。在多种图像数据集和各种恢复任务上的实证评估表明，所提出的方法在大多数实验场景中优于近期基于扩散模型的方法。代码可在 https://github.com/StarNextDay/SubDAPS.git 获取。

cs.CV / 23 / 2605.14269

PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

PhyMotion：基于物理的结构化三维运动奖励用于人类视频生成

Huang, Yidong, Wang, Zun, Lin, Han, Kim, Dong-Ki, Omidshafiei, Shayegan, Yoon, Jaehong, Cho, Jaemin, Zhang, Yue, Bansal, Mohit

Abstract

Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has driven recent gains in general video quality, extending it to human motion remains bottlenecked by a reward signal that cannot reliably score motion realism. Existing video rewards primarily rely on 2D perceptual signals, without explicitly modeling the 3D body state, contact, and dynamics underlying articulated human motion, and often assign high scores to videos with floating bodies or physically implausible movements. To address this, we propose PhyMotion, a structured, fine-grained motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates motion quality along multiple dimensions of physical feasibility. Concretely, we recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each component provides a continuous and interpretable signal tied to a specific aspect of motion quality, allowing the reward to capture which aspects of motion are physically correct or violated. Experiments show that PhyMotion achieves stronger correlation with human judgments than existing reward formulations. These gains carry over to RL-based post-training, where optimizing PhyMotion leads to larger and more consistent improvements than optimizing existing rewards, improving motion realism across both autoregressive and bidirectional video generators under both automatic metrics and blind human evaluation (+68 Elo gain). Ablations show that the three axes provide complementary supervision signals, while the reward preserves overall video generation quality with only modest training overhead.

Chinese Translation

生成逼真的人类运动是视频生成中的一个核心但尚未解决的挑战。尽管基于强化学习（RL）的后训练推动了近期一般视频质量的提升，但将其扩展到人类运动仍然受到无法可靠评估运动真实感的奖励信号的制约。现有的视频奖励主要依赖于二维感知信号，而未明确建模支撑关节人类运动的三维身体状态、接触和动态，常常对漂浮的身体或物理上不合理的运动赋予高分。为了解决这个问题，我们提出了PhyMotion，这是一种结构化的、细粒度的运动奖励，它将恢复的三维人类轨迹基于物理模拟器进行评估，并在多个物理可行性维度上评估运动质量。具体而言，我们从生成的视频中恢复SMPL身体网格，将其重新定向到MuJoCo物理模拟器中的人形上，并沿着三个维度评估生成的运动：运动学合理性、接触和平衡一致性，以及动态可行性。每个组件提供了与运动质量特定方面相关的连续且可解释的信号，使得奖励能够捕捉运动的哪些方面是物理上正确或被违反的。实验表明，PhyMotion与人类判断之间的相关性强于现有的奖励公式。这些提升在基于RL的后训练中得以延续，优化PhyMotion比优化现有奖励带来了更大且更一致的改进，在自动回归和双向视频生成器中提高了运动的真实感，且在自动指标和盲人类评估中均表现出+68 Elo的提升。消融实验表明，这三个维度提供了互补的监督信号，而奖励在仅有适度训练开销的情况下保持了整体视频生成质量。
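
To put the reported +68 Elo gain in perspective, the sketch below converts a rating difference into an expected head-to-head preference rate. It assumes the standard logistic Elo model on a 400-point scale; the abstract does not state which Elo variant the evaluation uses, so that scale and the function name are assumptions.

```python
def elo_expected_score(rating_diff, scale=400.0):
    """Expected win rate of the higher-rated side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / scale))

# Rating gap taken from the abstract; the 400-point scale is an assumption.
print(f"{elo_expected_score(68):.3f}")
```

Under that assumption, a 68-point gap corresponds to the improved model being preferred in roughly 60% of pairwise comparisons.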

cs.CV / 24 / 2605.14270

Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers

多模态扩散变换器中的概念遗漏诊断与修正

Baek, Kanghyun, Lew, Jaihyun, Shin, Chaehun, Lee, Jungbeom, Yoon, Sungroh

Abstract

Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-to-image generation, yet they frequently suffer from concept omission, where specified objects or attributes fail to emerge in the generated image. By performing linear probing on text tokens, we demonstrate that text embeddings can distinguish a characteristic 'omission signal' representing the absence of target concepts. Leveraging this insight, we propose Omission Signal Intervention (OSI), which amplifies the omission signal to actively catalyze the generation of missing concepts. Comprehensive experiments on FLUX.1-Dev and SD3.5-Medium demonstrate that OSI significantly alleviates concept omission even in extreme scenarios.

Chinese Translation

多模态扩散变换器（MM-DiTs）在文本到图像生成方面取得了显著进展，但它们经常面临概念遗漏的问题，即指定的对象或属性未能在生成的图像中出现。通过对文本标记进行线性探测，我们证明了文本嵌入可以区分一种特征性‘遗漏信号’，该信号表示目标概念的缺失。基于这一见解，我们提出了遗漏信号干预（Omission Signal Intervention, OSI），该方法放大遗漏信号，以主动催化缺失概念的生成。在FLUX.1-Dev和SD3.5-Medium上的全面实验表明，OSI显著缓解了即使在极端场景下的概念遗漏问题。

cs.CV / 25 / 2605.14274

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

CreFlow：稀疏奖励的具身视频扩散强化学习的纠正重流

Ni, Zhenyang, Li, Yijiang, Jiao, Ruochen, Zhan, Simon Sinong, Chen, Sipeng, Yin, Zhenfei, Chen, Minshuo, Torr, Philip, Wang, Zhaoran, Zhu, Qi

Abstract

Video generation models trained on heterogeneous data with likelihood-surrogate objectives can produce visually plausible rollouts that violate physical constraints in embodied manipulation. Although reinforcement-learning post-training offers a natural route to adapting VGMs, existing video-RL rewards often reduce each rollout to a low-level visual metric, whereas manipulation video evaluation requires logic-based verification of whether the rollout satisfies a compositional task specification. To fill this gap, we introduce a compositional constraint-based reward model for post-training embodied video generation models, which automatically formulates task requirements as a composition of Linear Temporal Logic constraints, providing faithful rewards and localized error information in generated videos. To achieve effective improvement in high-dimensional video generation using these reward signals, we further propose CreFlow, a novel online RL framework with two key designs: i) a credit-aware NFT loss that confines the RL update to reward-relevant regions, preventing perturbations to unrelated regions during post-training; and ii) a corrective reflow loss that leverages within-group positive samples as an explicit estimate of the correction direction, stabilizing and accelerating training. Experiments show that CreFlow yields reward judgments better aligned with human and simulator success labels than existing methods and improves downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.

Chinese Translation

在异构数据上训练的基于似然替代目标的视频生成模型能够生成视觉上合理的结果，但这些结果在具身操作中违反了物理约束。虽然强化学习后训练提供了一种自然的途径来适应视频生成模型（VGM），但现有的视频强化学习奖励往往将每个结果简化为低级视觉指标，而操作视频评估则需要基于逻辑的验证，以确定结果是否满足组合任务规范。为填补这一空白，我们引入了一种基于组合约束的奖励模型，用于后训练的具身视频生成模型，该模型自动将任务要求表述为线性时序逻辑（Linear Temporal Logic）约束的组合，从而在生成视频中提供真实的奖励和局部错误信息。为了利用这些奖励信号在高维视频生成中实现有效改进，我们进一步提出了CreFlow，这是一种新颖的在线强化学习框架，具有两个关键设计：i) 一种关注信用的NFT损失，将强化学习更新限制在与奖励相关的区域，防止在后训练过程中对无关区域的扰动；ii) 一种纠正重流损失，利用组内正样本作为纠正方向的显式估计，从而稳定和加速训练。实验表明，CreFlow的奖励判断与人类和模拟器的成功标签相比，优于现有方法，并在八个双手操作任务中提高了23.8个百分点的下游执行成功率。
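
The abstract describes rewards built from compositions of Linear Temporal Logic (LTL) constraints that also localize errors. A minimal sketch of that idea on a finite trace follows. It assumes each video frame has already been reduced to a dictionary of boolean predicates; the predicate names (`collision`, `grasped`, `placed`) and the pick-and-place specification are hypothetical illustrations, not the paper's reward model.

```python
def eventually(trace, pred, start=0):
    """Index of the first step >= start where pred holds, or None (finite-trace F)."""
    for i in range(start, len(trace)):
        if pred(trace[i]):
            return i
    return None

def check_pick_and_place(trace):
    """Check G(not collision) and F(grasped and F(placed)); report the first violation."""
    for i, state in enumerate(trace):
        if state["collision"]:
            return False, f"collision at step {i}"
    grasp_step = eventually(trace, lambda s: s["grasped"])
    if grasp_step is None:
        return False, "object never grasped"
    if eventually(trace, lambda s: s["placed"], start=grasp_step) is None:
        return False, "object never placed after grasp"
    return True, "ok"

# Hypothetical per-frame predicates extracted from a generated rollout.
rollout = [
    {"collision": False, "grasped": False, "placed": False},
    {"collision": False, "grasped": True, "placed": False},
    {"collision": False, "grasped": True, "placed": True},
]
print(check_pick_and_place(rollout))
```

Returning the violated clause together with a step index is one simple way to obtain the localized error information the abstract mentions, as opposed to a single scalar visual score.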

cs.CV / 26 / 2605.14278

KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

KVPO：基于ODE的自回归视频对齐的GRPO通过KV语义探索

Zhang, Ruicheng, Cong, Kaixi, Zhou, Jun, Zhong, Zhizhou, Xu, Zunnan, Mao, Shuiyang, Liu, Wei, Li, Xiu

Abstract

Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.

Chinese Translation

将流式自回归（AR）视频生成器与人类偏好对齐是一项挑战。现有的强化学习方法主要依赖基于噪声的探索和基于随机微分方程（SDE）的替代策略，这些方法与提炼的AR模型的确定性常微分方程（ODE）动态不匹配，且往往扰动低层次的外观，而非对长时间一致性至关重要的高层次语义故事线进展进行干扰。为了解决这些局限性，我们提出了KVPO，一个基于ODE的在线群体相对策略优化（GRPO）框架，用于对齐流式视频生成器。为了实现多样性探索，KVPO引入了一种因果语义探索范式，将变化源从随机噪声转移到历史KV缓存。通过随机路由历史KV条目，它构建了在数据流形上严格保持语义多样性的生成分支。对于策略建模，KVPO引入了一种基于轨迹速度能量（Trajectory Velocity Energy, TVE）的速度场替代策略，该策略量化了流匹配速度空间中的分支可能性，并产生与原生ODE公式完全一致的奖励加权对比目标。在多个提炼的AR视频生成器上的实验表明，在单提示短视频和多提示长视频设置中，视觉质量、运动质量和文本视频对齐均取得了一致的提升。
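
KVPO builds on Group Relative Policy Optimization (GRPO), in which each sampled branch is scored relative to the other branches generated for the same prompt. The sketch below shows only that generic group-relative normalization step. It assumes scalar rewards per branch and uses the population standard deviation; the function name and the reward values are hypothetical, and the paper's TVE-based objective is not reproduced here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four generation branches of the same prompt.
branch_rewards = [0.2, 0.5, 0.9, 0.4]
print(group_relative_advantages(branch_rewards))
```

Because the advantages are centered within the group, branches that score above the group average are reinforced and those below are penalized, without a separately learned value function.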

cs.CV / 27 / 2605.14309

ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition

ICED：通过可解释的概念分解实现概念级机器遗忘

Lin, Shen, Lin, Jing, Dong, Junhao, Koniusz, Piotr, Xu, Li

Abstract

Machine unlearning in Vision-Language Models (VLMs) is typically performed at the image or instance level, making it difficult to precisely remove target knowledge without affecting unrelated semantics. This issue is especially pronounced since a single image often contains multiple entangled concepts, including both target concepts to be forgotten and contextual information that should be preserved. In this paper, we propose an interpretable concept-level unlearning framework for VLMs, which constructs a compact task-specific concept vocabulary from the forgetting set using a multimodal large language model. In addition to modality alignment, visual representations are decomposed into sparse, nonnegative combinations of semantic concepts, providing an explicit interface for fine-grained knowledge manipulation. Based on this decomposition, our method formulates unlearning as concept-level optimization, where target concepts are selectively suppressed while intra-instance non-target semantics and global cross-modal knowledge are preserved. Extensive experiments across both in-domain and out-of-domain forgetting settings demonstrate that our method enables more comprehensive target forgetting, better preserves non-target knowledge within the same image, and maintains competitive model utility compared with existing VLM unlearning methods.

Chinese Translation

视觉-语言模型（VLMs）中的机器遗忘通常在图像或实例级别进行，这使得在不影响无关语义的情况下精确移除目标知识变得困难。这个问题尤其明显，因为单个图像通常包含多个纠缠的概念，包括需要遗忘的目标概念和应当保留的上下文信息。本文提出了一种针对VLMs的可解释的概念级遗忘框架，该框架利用多模态大语言模型从遗忘集构建紧凑的任务特定概念词汇。除了模态对齐外，视觉表示被分解为稀疏的、非负的语义概念组合，为细粒度知识操作提供了明确的接口。基于这种分解，我们的方法将遗忘公式化为概念级优化，其中目标概念被选择性地抑制，同时保留实例内的非目标语义和全局跨模态知识。在领域内和领域外的遗忘设置下，广泛的实验表明，我们的方法能够实现更全面的目标遗忘，更好地保留同一图像中的非目标知识，并与现有的VLM遗忘方法相比，保持竞争力的模型效用。

cs.CV / 28 / 2605.14310

CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

CoRDS：基于核心集的流媒体视频理解的代表性与多样性选择

Mahdizadeh, Ailar, Azadi, Puria, Li, Muchen, He, Xiangteng, Sigal, Leonid

Abstract

Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize whether the retained cache is representative of the accumulated history. We propose to view KV-cache compression as a coreset selection problem: rather than scoring tokens independently for retention, we select a small subset that covers the geometry of the accumulated visual cache. Our method operates in a joint KV representation and introduces a bicriteria objective that balances coverage in key and value spaces, preserving both retrieval structure and output-relevant information. To encourage a more diverse retained subset, we further introduce an orthogonality-driven diversity criterion that favors candidates contributing new directions beyond the current selection, and connect this criterion to log-determinant subset selection. Across four open-source VLMs and five long-video and streaming-video benchmarks, our method improves over heuristic streaming compression baselines under a fixed cache budget. These results highlight that representative coreset selection offers a more effective principle than token-wise pruning for memory-constrained streaming video understanding.

Chinese Translation

使用大型视觉语言模型（VLMs）进行流媒体视频理解需要一个紧凑的内存，以支持对不断增长的视觉历史进行未来推理。一个常见的解决方案是压缩关键值（KV）缓存，但现有的流媒体方法通常依赖于局部的逐个标记启发式方法，如近期性、时间冗余或显著性，这些方法并未明确优化保留的缓存是否能够代表累积的历史。我们提出将KV缓存压缩视为核心集选择问题：我们选择一个小的子集，而不是独立地对标记进行评分，从而覆盖累积视觉缓存的几何特征。我们的方法在联合KV表示中运行，并引入一个双标准目标，以平衡关键空间和值空间的覆盖，保留检索结构和与输出相关的信息。为了鼓励保留子集的多样性，我们进一步引入一个基于正交性的多样性标准，偏向于选择那些在当前选择之外贡献新方向的候选项，并将该标准与对数行列式子集选择联系起来。在四个开源VLM和五个长视频及流媒体视频基准测试中，我们的方法在固定缓存预算下优于启发式流媒体压缩基线。这些结果突显出，代表性核心集选择提供了一种比逐个标记修剪更有效的原则，适用于内存受限的流媒体视频理解。
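
The link between an orthogonality-driven diversity criterion and log-determinant subset selection can be illustrated with a generic greedy procedure: repeatedly pick the candidate with the largest component orthogonal to everything already selected. The sum of the logs of those squared residual norms equals the log-determinant of the Gram matrix of the selected vectors. This is a textbook-style sketch under that assumption, with hypothetical names and toy 2-D vectors; it is not the paper's bicriteria objective.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def greedy_diverse_subset(vectors, k):
    """Greedily select up to k vectors that each add the largest new direction."""
    residuals = [list(v) for v in vectors]  # parts orthogonal to the selection so far
    selected, log_det = [], 0.0
    for _ in range(min(k, len(vectors))):
        best = max((i for i in range(len(vectors)) if i not in selected),
                   key=lambda i: dot(residuals[i], residuals[i]))
        norm_sq = dot(residuals[best], residuals[best])
        if norm_sq <= 1e-12:
            break  # remaining candidates contribute no new direction
        selected.append(best)
        log_det += math.log(norm_sq)
        q = [x / math.sqrt(norm_sq) for x in residuals[best]]
        # Remove the newly covered direction from every residual (Gram-Schmidt step).
        for i in range(len(vectors)):
            c = dot(residuals[i], q)
            residuals[i] = [x - c * y for x, y in zip(residuals[i], q)]
    return selected, log_det

# Toy token embeddings: the second vector is nearly redundant with the first.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(greedy_diverse_subset(tokens, k=2))
```

In this toy case the near-duplicate vector is skipped in favor of the orthogonal one, which is the behavior a diversity criterion is meant to produce under a fixed cache budget.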

cs.CV / 29 / 2605.14315

TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention

TurboVGGT：具有自适应交替注意力的快速视觉几何重建

Huang, David, Wu, Guile, Huang, Chengjie, Liu, Bingbing, Bai, Dongfeng

Abstract

Recent feed-forward 3D reconstruction methods, such as visual geometry transformers, have substantially advanced the traditional per-scene optimization paradigm by enabling effective multi-view reconstruction in a single forward pass. However, most existing methods struggle to achieve a balance between reconstruction quality and computational efficiency, which limits their scalability and efficiency. Although some efficient visual geometry transformers have recently emerged, they typically use the same sparsity ratio across layers and frames and lack mechanisms to adaptively learn representative tokens to capture global relationships, leading to suboptimal performance. In this work, we propose TurboVGGT, a novel approach that employs an efficient visual geometry transformer with adaptive alternating attention for fast multi-view 3D reconstruction. Specifically, TurboVGGT employs an end-to-end trainable framework with adaptive sparse global attention guided by adaptive sparsity selection to capture global relationships across frames and frame attention to aggregate local details within each frame. In the adaptive sparse global attention, TurboVGGT adaptively learns representative tokens with varying sparsity levels for global geometry modeling, considering that token importance varies across frames, attention layers operate on tokens at different levels of abstraction, and global dependencies rely on structurally informative regions. Extensive experiments on multiple 3D reconstruction benchmarks demonstrate that TurboVGGT achieves fast multi-view reconstruction while maintaining competitive reconstruction quality compared with state-of-the-art methods. Project page: https://turbovggt.github.io/.

Chinese Translation

近年来，前馈式三维重建方法，如视觉几何变换器，显著推动了传统的逐场景优化范式，使得在单次前向传播中实现有效的多视角重建成为可能。然而，现有大多数方法在重建质量和计算效率之间难以取得平衡，这限制了它们的可扩展性和效率。尽管近期出现了一些高效的视觉几何变换器，但它们通常在各层和帧之间使用相同的稀疏比，并缺乏自适应学习代表性标记以捕捉全局关系的机制，导致性能不尽如人意。在本研究中，我们提出了TurboVGGT，这是一种新颖的方法，采用高效的视觉几何变换器与自适应交替注意力，旨在实现快速的多视角三维重建。具体而言，TurboVGGT采用端到端可训练的框架，结合自适应稀疏全局注意力和自适应稀疏选择引导，以捕捉帧间的全局关系，并通过帧注意力聚合每帧内的局部细节。在自适应稀疏全局注意力中，TurboVGGT自适应地学习具有不同稀疏级别的代表性标记用于全局几何建模，考虑到标记的重要性在不同帧之间的变化、注意力层在不同抽象层次上操作标记，以及全局依赖关系依赖于结构性信息区域。对多个三维重建基准的广泛实验表明，TurboVGGT在保持与最先进方法竞争的重建质量的同时，实现了快速的多视角重建。项目页面：https://turbovggt.github.io/

cs.CV / 30 / 2605.14326

D2-CDIG: Controlled Diffusion Remote Sensing Image Generation with Dual Priors of DEM and Cloud-Fog

D2-CDIG：基于数字高程模型和云雾双先验的受控扩散遥感图像生成

Zhao, Zuopeng, Liu, Ying, Pharksuwan, Kanyaphakphachsorn, Luo, Su, Li, Xiaoyu, Ning, Maocai

Abstract

Remote sensing image generation provides a reliable data foundation for remote sensing large models and downstream tasks. However, existing controllable remote sensing image generation methods typically rely on traditional techniques such as segmentation and edge detection, which do not fully leverage terrain or atmospheric conditions. As a result, the generated images often lack accuracy and naturalness when dealing with complex terrains and atmospheric phenomena. In this paper, we propose a novel remote sensing image generation framework, D2-CDIG, which integrates diffusion models with a dual-prior control mechanism. By incorporating both Digital Elevation Model (DEM) and cloud-fog information as dual prior knowledge, D2-CDIG precisely controls ground features and atmospheric phenomena within the generated images. Specifically, D2-CDIG decouples the terrain and atmospheric generation processes through independent control of ground and atmospheric branches. Additionally, a refined cloud-fog slider is introduced to flexibly adjust cloud thickness and distribution. During training, ground and atmospheric control signals are injected in layers to ensure a seamless transition within the images. Compared to traditional methods based on segmentation or edge detection, D2-CDIG shows significant improvements in image quality, detail richness, and realism. D2-CDIG offers a flexible and precise solution for remote sensing image generation, providing high-quality data for training large remote sensing models and downstream tasks.

Chinese Translation

遥感图像生成为遥感大模型和下游任务提供了可靠的数据基础。然而，现有的可控遥感图像生成方法通常依赖于传统技术，如分割和边缘检测，这些方法未能充分利用地形或大气条件。因此，在处理复杂地形和大气现象时，生成的图像往往缺乏准确性和自然性。本文提出了一种新颖的遥感图像生成框架D2-CDIG，该框架将扩散模型与双先验控制机制相结合。通过将数字高程模型（Digital Elevation Model, DEM）和云雾信息作为双重先验知识，D2-CDIG能够精确控制生成图像中的地面特征和大气现象。具体而言，D2-CDIG通过独立控制地面和大气分支，解耦了地形和大气生成过程。此外，引入了一个精细的云雾滑块，以灵活调整云的厚度和分布。在训练过程中，地面和大气控制信号以层的形式注入，以确保图像内的无缝过渡。与基于分割或边缘检测的传统方法相比，D2-CDIG在图像质量、细节丰富性和真实感方面显示出显著改善。D2-CDIG为遥感图像生成提供了一种灵活而精确的解决方案，为训练大型遥感模型和下游任务提供高质量的数据。

cs.CV / 31 / 2605.14333

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

InsightTok：通过离散标记化提高自回归图像生成中的文本和面部保真度

Yue, Yang, Wei, Fangyun, He, Tianyu, Zhao, Jinjing, Ni, Zanlin, Liu, Zeyu, Guo, Jiayi, Shi, Lei, Dong, Yue, Chen, Li, Li, Ji, Huang, Gao, Chen, Dong

Abstract

Text and faces are among the most perceptually salient and practically important patterns in visual generation, yet they remain challenging for autoregressive generators built on discrete tokenization. A central bottleneck is the tokenizer: aggressive downsampling and quantization often discard the fine-grained structures needed to preserve readable glyphs and distinctive facial features. We attribute this gap to standard discrete-tokenizer objectives being weakly aligned with text legibility and facial fidelity, as these objectives typically optimize generic reconstruction while compressing diverse content uniformly. To address this, we propose InsightTok, a simple yet effective discrete visual tokenization framework that enhances text and face fidelity through localized, content-aware perceptual losses. With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. These gains consistently transfer to autoregressive image generation in InsightAR, producing images with clearer text and more faithful facial details. Overall, our results highlight the potential of specialized supervision in tokenizer training for advancing discrete image generation.

Chinese Translation

文本和面孔是视觉生成中最具感知显著性和实际重要性的模式之一，但它们仍然对基于离散标记化的自回归生成器构成挑战。一个核心瓶颈是标记器：激进的下采样和量化常常丢弃保留可读字形和独特面部特征所需的细粒度结构。我们将这一差距归因于标准离散标记器目标与文本可读性和面部保真度的弱对齐，因为这些目标通常优化通用重建，同时均匀压缩多样内容。为了解决这个问题，我们提出了InsightTok，一个简单而有效的离散视觉标记化框架，通过局部的、内容感知的感知损失来增强文本和面部保真度。凭借紧凑的16k代码本和16倍下采样率，InsightTok在文本和面部重建方面显著优于先前的标记器，而不影响一般重建质量。这些提升在InsightAR中的自回归图像生成中持续转移，生成的图像具有更清晰的文本和更真实的面部细节。总体而言，我们的结果突显了在标记器训练中专门监督的潜力，以推动离散图像生成的发展。

cs.CV / 32 / 2605.14337

IG-Diff: Complex Night Scene Restoration with Illumination-Guided Diffusion Model

IG-Diff：基于照明引导扩散模型的复杂夜间场景恢复

Chen, Yifan, Yin, Fei, Guo, Chunle, Li, Chongyi, Yang, Yujiu

Abstract

At night, it is challenging for both people and machines to perceive their surroundings. While prevailing image restoration methods handle single forms of degradation well, they falter on complex nocturnal scenes, such as those where adverse weather and low-light conditions occur together. Compounding this challenge, the lack of paired data capturing low light together with other forms of degradation hinders the development of a comprehensive end-to-end solution. In this work, we contribute complex nighttime scene datasets that simulate both illumination degradation and other forms of deterioration. To address the complexity of night degradation, we embed an illumination-guided module in the diffusion model to guide the illumination restoration process. Our model can preserve texture fidelity while coping with the various degradations present in low-light scenarios.

Chinese Translation

在夜间环境中，个人和机器感知周围环境面临挑战。虽然现有的图像恢复方法能够有效处理单一形式的退化，但在面对复杂的夜间场景时，例如天气和低光条件的同时存在，它们却显得力不从心。更为复杂的是，缺乏能够涵盖低光情况与其他退化形式共存的配对数据，阻碍了全面端到端解决方案的发展。在本研究中，我们贡献了复杂夜间场景数据集，模拟了照明退化与其他形式的劣化。为了应对夜间退化的复杂性，我们提出了一种将照明引导模块嵌入扩散模型中的集成方法，以指导照明恢复过程。我们的模型能够在低光场景中应对各种退化带来的挑战的同时，保持纹理的真实感。

cs.CV / 33 / 2605.14341

AnyBand-Diff: A Unified Remote Sensing Image Generation and Band Repair Framework with Spectral Priors

AnyBand-Diff：一个统一的遥感图像生成与波段修复框架，具有光谱先验

Zhao, Zuopeng, Liu, Ying, Li, Xiaoyu, Luo, Su, Li, Lu, Liu, Wenwen

Abstract

Existing diffusion models have made significant progress in generating realistic images. However, their direct adaptation to remote sensing imagery often disregards intrinsic physical laws. This oversight frequently leads to spectral distortion and radiometric inconsistency, severely limiting the scientific utility of generated data. To address this issue, this paper introduces AnyBand-Diff, a novel spectral-prior-guided diffusion framework tailored for robust spectral reconstruction. Specifically, we design a Masked Conditional Diffusion backbone integrated with a dual stochastic masking strategy, empowering the model to recover complete spectral information from arbitrary band subsets. Subsequently, to ensure radiometric fidelity, a Physics-Guided Sampling mechanism is proposed, leveraging gradients from a differentiable physical model to explicitly steer the denoising trajectory toward the manifold of physically plausible solutions. Furthermore, a Multi-Scale Physical Loss is formulated to enforce rigorous constraints across pixel, region, and global levels in a joint manner. Extensive experiments confirm the effectiveness of AnyBand-Diff in generating reliable imagery and achieving accurate spectral reconstruction, contributing to the advancement of physics-aware generative methods for Earth observation.

Chinese Translation

现有的扩散模型在生成真实图像方面取得了显著进展。然而，它们直接应用于遥感影像时，往往忽视了内在的物理法则。这种忽视常常导致光谱失真和辐射一致性问题，严重限制了生成数据的科学实用性。为了解决这一问题，本文提出了AnyBand-Diff，一种新颖的基于光谱先验的扩散框架，旨在实现稳健的光谱重建。具体而言，我们设计了一个集成了双随机掩码策略的掩码条件扩散骨干网络，使模型能够从任意波段子集恢复完整的光谱信息。随后，为了确保辐射保真性，提出了一种物理引导采样机制，利用可微分物理模型的梯度，明确引导去噪轨迹朝向物理上合理解的流形。此外，制定了一种多尺度物理损失，以联合方式在像素、区域和全局层面施加严格约束。大量实验验证了AnyBand-Diff在生成可靠影像和实现准确光谱重建方面的有效性，为面向地球观测的物理感知生成方法的发展做出了贡献。

cs.CV / 34 / 2605.14346

Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation

带有语义先验的学习：通过层次知识蒸馏稳定点监督红外小目标检测

Yao, Yuanhang, Qian, Ping, Liu, Zhu, Ma, Long, Wang, Weimin

Abstract

Single-frame Infrared Small Target Detection (ISTD) aims to localize weak targets under heavy background clutter, yet dense pixel-wise annotations are expensive. Point supervision with online label evolution reduces annotation cost; however, lightweight CNN detectors often lack sufficient semantics, leading to noisy pseudo-masks and unstable optimization. To address this, we propose a hierarchical VFM-driven knowledge distillation framework that uses a frozen Vision Foundation Model (VFM) during training. We formulate point-supervised learning as a bilevel optimization process: the inner loop adapts a VFM-embedded teacher on reweighted training samples, while the outer loop transfers validation-guided knowledge to a lightweight student to mitigate pseudo-label noise and training-set bias. We further introduce Semantic-Conditioned Affine Modulation (SCAM) to inject VFM semantics into CNN features at multiple layers. In addition, a dynamic collaborative learning strategy with cluster-level sample reweighting enhances robustness to imperfect pseudo-masks. Experiments on diverse challenging cases across multiple ISTD backbones demonstrate consistent improvements in detection accuracy and training stability. Our code is available at https://github.com/yuanhang-yao/semantic-prior.

Chinese Translation

单帧红外小目标检测（ISTD）旨在在复杂背景下定位微弱目标，但密集的像素级标注成本高昂。通过在线标签演变的点监督可以降低标注成本；然而，轻量级卷积神经网络（CNN）检测器往往缺乏足够的语义信息，导致伪标签噪声和不稳定的优化。为了解决这一问题，我们提出了一种基于层次化视觉基础模型（VFM）驱动的知识蒸馏框架，在训练过程中使用冻结的VFM。我们将点监督学习形式化为双层优化过程：内层循环在重新加权的训练样本上调整嵌入VFM的教师模型，而外层循环将验证引导的知识转移到轻量级学生模型，以减轻伪标签噪声和训练集偏差。此外，我们进一步引入了语义条件仿射调制（SCAM），在多个层次上将VFM语义注入CNN特征中。此外，采用集群级样本重加权的动态协作学习策略增强了对不完美伪标签的鲁棒性。在多个ISTD骨干网络上对各种挑战性案例的实验表明，检测准确率和训练稳定性均有一致性提升。我们的代码可在 https://github.com/yuanhang-yao/semantic-prior 获取。

cs.CV / 35 / 2605.14382

Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

Delta 强制：用于交互式自回归视频生成的信任区域引导

Wu, Yuheng, Gao, Xiangbo, Chen, Tianhao, Chen, Xinghao, Yin, Qing, Tu, Zhengzhong, Lee, Dongman

Abstract

Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet often exhibit persistent drift after condition changes. We identify the cause as conditional bias, where the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective. This suppresses unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.

Chinese Translation

交互式实时自回归视频生成对于内容创作和世界建模等应用至关重要，因为视觉内容必须适应动态演变的事件条件。一个基本挑战在于平衡反应性和稳定性：模型必须迅速响应新事件，同时在较长时间范围内保持时间一致性。现有方法将双向模型提炼为自回归生成器，并通过流式长调优进一步调整它们，但在条件变化后往往表现出持续的漂移。我们将其原因归结为条件偏差，即教师可能提供与条件对齐但与轨迹无关的指导，从而使生成偏向于局部有效但全局不一致的模式。受到信任区域策略优化（Trust Region Policy Optimization）的启发，我们提出了 Delta 强制（Delta Forcing），这是一个简单而有效的框架，它在自适应信任区域内约束不可靠的教师监督。具体而言，Delta 强制通过估计教师和生成器轨迹之间的潜在增量来评估过渡一致性，并利用这一点来平衡教师监督与单调连续性目标。这抑制了不可靠的教师引起的偏移，同时保持对新事件的响应能力。大量实验表明，Delta 强制显著提高了一致性，同时保持事件反应性。

cs.CV / 36 / 2605.14391

Dual-Latent Collaborative Decoding for Fidelity-Perception Balanced Image Compression

双潜变量协同解码用于保真度与感知平衡的图像压缩

Mao, Qi, Wang, Zijian, Cheng, Zhengxue, Zhu, Lingyu, Ma, Siwei

Abstract

Learned image compression (LIC) increasingly requires reconstructions that balance distortion fidelity and perceptual realism across a wide range of bitrates. However, most existing methods still rely on a single compressed latent representation to simultaneously carry structural details, semantic cues, and perceptual priors, requiring the same latent representation to serve multiple, potentially conflicting roles. This tension becomes evident across different latent paradigms: scalar-quantized (SQ) continuous latents provide rate-scalable fidelity but tend to lose perceptual details at low rates, while vector-quantized (VQ) discrete tokens preserve compact semantic cues but suffer from limited structural fidelity and bitrate scalability. To address this issue, we propose Mixture of Decoder Experts (MoDE), a dual-latent collaborative decoding framework that decomposes reconstruction responsibilities across complementary latent paradigms. Specifically, MoDE treats the SQ branch as a fidelity-oriented expert and the VQ branch as a perception-oriented expert, and coordinates them through two decoder-side modules: Expert-Specific Enhancement (ESE), which preserves branch-specific expert references, and Cross-Expert Modulation (CEM), which enables selective complementary transfer during reconstruction. The resulting framework supports selective cross-latent collaboration under a shared dual-stream bitstream and enables both fidelity-anchored and perception-anchored decoding. Extensive experiments demonstrate that MoDE achieves a more favorable fidelity-perception balance than representative distortion-oriented, perception-oriented, generative, and dual-latent baselines across a wide bitrate range, highlighting decoder-side expert collaboration as an effective design for wide-range fidelity-perception balanced LIC.

Chinese Translation

学习型图像压缩（LIC）越来越需要在广泛的比特率范围内平衡失真保真度和感知真实感的重建。然而，大多数现有方法仍依赖于单一的压缩潜变量表示，同时承载结构细节、语义线索和感知先验，要求相同的潜变量表示承担多个潜在冲突的角色。这种紧张关系在不同的潜变量范式中变得显而易见：标量量化（SQ）连续潜变量提供可按比特率缩放的保真度，但在低比特率时往往会丢失感知细节，而向量量化（VQ）离散标记保留紧凑的语义线索，但在结构保真度和比特率可扩展性方面受到限制。为了解决这个问题，我们提出了解码器专家混合（Mixture of Decoder Experts, MoDE），这是一个双潜变量协同解码框架，它将重建责任分解到互补的潜变量范式中。具体而言，MoDE将SQ分支视为以保真度为导向的专家，将VQ分支视为以感知为导向的专家，并通过两个解码器侧模块进行协调：专家特定增强（Expert-Specific Enhancement, ESE），用于保留分支特定的专家参考，以及跨专家调制（Cross-Expert Modulation, CEM），使重建过程中的选择性互补传递成为可能。最终框架支持在共享的双流比特流下进行选择性的跨潜变量协作，并实现保真度锚定和感知锚定的解码。大量实验表明，MoDE在广泛的比特率范围内实现了比典型的失真导向、感知导向、生成式和双潜变量基线更有利的保真度与感知平衡，突显了解码器侧专家协作作为广泛保真度与感知平衡LIC的有效设计。

cs.CV / 37 / 2605.14393

Analogical Trajectory Transfer

类比轨迹转移

Kim, Junho, Lee, Eun Sun, Bae, Gwangtak, Kang, Seunggu, Kim, Young Min

Abstract

We study analogical trajectory transfer, where the goal is to translate motion trajectories in one 3D environment to a semantically analogous location in another. Such a capacity would enable machines to perform analogical spatial reasoning, with applications in AR/VR co-presence, content creation, and robotics. However, even semantically similar scenes can still differ substantially in object placement, scale, and layout, so naively matching semantics leads to collisions or geometric distortions. Furthermore, finding where each trajectory point should transfer to has a large search space, as the mapping must preserve semantics and functionality without tearing the trajectory apart or causing collisions. Our key insight is to decompose the problem into spatially segregated subproblems and merge their solutions to produce semantically consistent and spatially coherent transfers. Specifically, we partition scenes into object-centric clusters and estimate cross-scene mappings via hierarchical smooth map prediction, using 3D foundation model features that encode contextual information from object and open-space arrangements. We then combinatorially assemble the per-cluster maps into an initial transfer and refine the result to remove collisions and distortions, yielding a spatially coherent trajectory. Our method does not require training, attains a fast runtime around 0.6 seconds, and outperforms baselines based on LLMs, VLMs, and scene graph matching. We further showcase applications in virtual co-presence, multi-trajectory transfer, camera transfer, and human-to-robot motion transfer, which indicates the broad applicability of our work to AR/VR and robotics.

Chinese Translation

我们研究类比轨迹转移，其目标是将一个三维环境中的运动轨迹转移到另一个语义相似的位置。这种能力将使机器能够进行类比空间推理，应用于增强现实/虚拟现实（AR/VR）共存、内容创作和机器人技术。然而，即使是语义上相似的场景，在物体放置、比例和布局上仍可能存在显著差异，因此简单地匹配语义会导致碰撞或几何失真。此外，确定每个轨迹点应转移到何处的搜索空间很大，因为映射必须保持语义和功能，而不破坏轨迹或造成碰撞。我们的关键见解是将问题分解为空间上隔离的子问题，并合并它们的解决方案，以产生语义一致且空间连贯的转移。具体而言，我们将场景划分为以物体为中心的聚类，并通过层次平滑映射预测来估计跨场景映射，利用编码了物体和开放空间排列的上下文信息的三维基础模型特征。然后，我们将每个聚类的映射组合成初始转移，并对结果进行精炼，以消除碰撞和失真，从而获得空间连贯的轨迹。我们的方法不需要训练，运行时间约为0.6秒，且优于基于大型语言模型（LLMs）、视觉语言模型（VLMs）和场景图匹配的基线。我们进一步展示了在虚拟共存、多轨迹转移、摄像机转移和人机运动转移等方面的应用，表明我们的工作在AR/VR和机器人领域具有广泛的适用性。

cs.CV / 38 / 2605.14396

Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion

通过条件扩散系统性发现在线地图构建中的语义攻击

Wang, Chenyi, Song, Ruoyu, Muller, Raymond, Monteuuis, Jean-Philippe, Petit, Jonathan, Celik, Z. Berkay, Gerdes, Ryan, Li, Ming F.

Abstract

Autonomous vehicles depend on online HD map construction to perceive lane boundaries, dividers, and pedestrian crossings -- safety-critical road elements that directly govern motion planning. While existing pixel perturbation attacks can disrupt the mapping, they can be neutralized by standard adversarial defenses. We present MIRAGE, a framework for systematic discovery of semantic attacks that bypass adversarial defenses and degrade mapping predictions by finding plausible environmental variation (e.g., shadows, wet roads). MIRAGE exploits the latent manifold of real-world data learned by diffusion models, and searches for semantically mutated scenes that neighbor the ground truth and share its road topology yet mislead the mapping predictions. We evaluate MIRAGE on nuScenes and demonstrate two attacks: (1) boundary removal, suppressing 57.7% of detections and corrupting 96% of planned trajectories; and (2) boundary injection, the only method that successfully injects fictitious boundaries, while pixel PGD and AdvPatch fail entirely. Both attacks remain potent under various adversarial defenses. We use two independent VLM judges to quantify realism, where MIRAGE passes as realistic 80--84% of the time (vs. 97--99% for clean nuScenes), whereas AdvPatch passes only 0--9% of the time. Our findings expose a categorical gap in current adversarial defenses: semantic-level perturbations that manifest as legitimate environmental variation are substantially harder to mitigate than pixel-level perturbations.

Chinese Translation

自主车辆依赖在线高清地图构建来感知车道边界、隔离带和人行横道等安全关键的道路元素，这些元素直接影响运动规划。虽然现有的像素扰动攻击可以干扰地图构建，但它们可以被标准的对抗防御所中和。我们提出了MIRAGE，一个系统性发现语义攻击的框架，该框架可以绕过对抗防御并通过寻找合理的环境变化（例如阴影、湿滑道路）来降低地图预测的准确性。MIRAGE利用扩散模型学习的真实世界数据的潜在流形，搜索与真实情况相邻的语义变异场景，这些场景具有相同的道路拓扑结构但会误导地图预测。我们在nuScenes上评估了MIRAGE，并展示了两种攻击：(1) 边界移除，抑制57.7%的检测并破坏96%的规划轨迹；(2) 边界注入，这是唯一成功注入虚构边界的方法，而像素PGD和AdvPatch则完全失败。这两种攻击在各种对抗防御下仍然有效。我们使用两位独立的VLM评审员来量化现实性，结果显示MIRAGE在80%到84%的情况下被评估为现实（而干净的nuScenes为97%到99%），而AdvPatch仅为0%到9%。我们的发现揭示了当前对抗防御中的一个类别差距：表现为合法环境变化的语义级扰动比像素级扰动更难以缓解。

cs.CV / 39 / 2605.14399

SceneForge: Structured World Supervision from 3D Interventions

场景生成：来自3D干预的结构化世界监督

Li, Jizhizi, Ao, Jiayang, Wicks, Danny, Tudosiu, Petru-Daniel

Abstract

Many multimodal learning tasks require supervision that remains consistent across edits, viewpoints, and scene-level interventions. However, such supervision is difficult to obtain from observation-level datasets, which do not expose the underlying scene state or how changes propagate through it. We present SceneForge, an intervention-driven framework that generates structured supervision from editable 3D world states. SceneForge represents each scene as a persistent world with semantic, geometric, and physical dependencies. By applying explicit interventions (e.g., object removal or camera variation) and propagating their effects through scene dependencies, SceneForge renders supervision that remains consistent with object structure and scene-level effects. This produces aligned outputs including counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections, all derived from a shared world state rather than post hoc image-space processing. We instantiate SceneForge using Infinigen and Blender to construct a licensing-clean indoor supervision resource with a large number of counterfactual pairs and aligned annotations from over 2K scenes, covering both diverse single-view and registered multi-view settings. Under matched training budgets, incorporating SceneForge supervision improves both object removal and scene removal performance across multiple benchmarks in both quantitative and qualitative evaluation. These results indicate that modeling supervision as structured state transitions in editable worlds provides a practical and scalable foundation for intervention-consistent multimodal learning.

Chinese Translation

许多多模态学习任务需要在编辑、视角和场景级干预之间保持一致的监督。然而，从观察级数据集中获取这样的监督是困难的，因为这些数据集并未揭示潜在的场景状态或变化如何在其中传播。我们提出了SceneForge，这是一种干预驱动的框架，能够从可编辑的3D世界状态中生成结构化监督。SceneForge将每个场景表示为一个具有语义、几何和物理依赖关系的持久世界。通过施加显式干预（例如，物体移除或相机变化）并传播其在场景依赖关系中的影响，SceneForge生成与物体结构和场景级效果一致的监督。这产生了对齐的输出，包括反事实观察、多视角观察以及诸如阴影和反射等效果感知信号，所有这些都是基于共享的世界状态，而不是事后图像空间处理。我们使用Infinigen和Blender实例化SceneForge，构建了一个许可清晰的室内监督资源，包含大量反事实对和来自2000多个场景的对齐注释，涵盖了多样的单视角和注册多视角设置。在匹配的训练预算下，纳入SceneForge监督在多个基准测试中改善了物体移除和场景移除的性能，且在定量和定性评估中均表现出色。这些结果表明，将监督建模为可编辑世界中的结构化状态转移，为干预一致的多模态学习提供了一个实用且可扩展的基础。

cs.CV / 40 / 2605.14403

DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making

DermAgent：一种用于皮肤病图像分析的自反性代理系统，具备多工具推理和可追溯决策能力

Liu, Yize, Yan, Siyuan, Hu, Ming, Ju, Lie, Li, Xieji, Tang, Feilong, Feng, Wei, Ge, Zongyuan

Abstract

Dermatological diagnosis requires integrating fine-grained visual perception with expert clinical knowledge. Although Multimodal Large Language Models (MLLMs) facilitate interactive medical image analysis, their application in dermatology is hindered by insufficient domain-specific grounding and hallucinations. To address these issues, we propose DermAgent, a collaborative multi-tool agent that orchestrates seven specialized vision and language modules within a Plan-Execute-Reflect framework. DermAgent delivers stepwise, traceable diagnostic reasoning through three core components. First, it employs complementary visual perception tools for comprehensive morphological description, dermoscopic concept annotation, and disease diagnosis. Second, to overcome the lack of domain prior, a dual-modality retrieval module anchors every prediction in external evidence by cross-referencing 413,210 diagnosed image cases and 3,199 clinical guideline chunks. To further mitigate hallucinations, a deterministic critic module conducts strict post-hoc auditing via confidence, coverage, and conflict gates, automatically detecting inter-source disagreements to trigger targeted self-correction. Extensive experiments on five dermatology benchmarks demonstrate that DermAgent consistently outperforms state-of-the-art MLLMs and medical agent baselines across zero-shot fine-grained disease diagnosis, concept annotation, and clinical captioning tasks, exceeding GPT-4o by 17.6% in skin disease diagnostic accuracy and 3.15% in captioning ROUGE-L. Our code is available at https://github.com/YizeezLiu/DermAgent.

Chinese Translation

皮肤病诊断需要将细致的视觉感知与专家临床知识相结合。尽管多模态大型语言模型（MLLMs）促进了交互式医学图像分析，但其在皮肤科的应用受到领域特定基础不足和幻觉现象的制约。为了解决这些问题，我们提出了DermAgent，一种协作多工具代理，能够在计划-执行-反思框架内协调七个专业的视觉和语言模块。DermAgent通过三个核心组件提供逐步、可追溯的诊断推理。首先，它利用互补的视觉感知工具进行全面的形态描述、皮肤镜概念注释和疾病诊断。其次，为了克服领域先验的缺乏，一个双模态检索模块通过交叉引用413,210个已诊断图像案例和3,199个临床指南片段，将每个预测锚定在外部证据上。为了进一步减少幻觉现象，一个确定性批评模块通过置信度、覆盖率和冲突门进行严格的事后审计，自动检测源间的不一致性以触发针对性的自我修正。在五个皮肤病基准测试上的广泛实验表明，DermAgent在零样本细粒度疾病诊断、概念注释和临床描述任务中始终优于最先进的MLLMs和医学代理基线，在皮肤病诊断准确率上超越GPT-4o 17.6%，在描述ROUGE-L上超越3.15%。我们的代码可在 https://github.com/YizeezLiu/DermAgent 获取。
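
Editor's sketch: the abstract names three deterministic audit gates (confidence, coverage, conflict) but does not define them. The Python below is a minimal illustration under assumed definitions, not DermAgent's implementation: `ToolOutput`, `audit`, `min_conf`, and `min_support` are hypothetical names, and the thresholds are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class ToolOutput:
    source: str        # which tool produced the prediction (hypothetical field)
    label: str         # predicted diagnosis
    confidence: float  # tool-reported confidence in [0, 1]


def audit(outputs, retrieved_labels, min_conf=0.6, min_support=2):
    """Return (top_label, failed_gates) for one case.

    Assumed gate definitions, for illustration only:
      confidence: the most confident prediction must reach min_conf
      coverage:   at least min_support retrieved cases must share its label
      conflict:   all tools must agree on the label
    """
    top = max(outputs, key=lambda o: o.confidence)
    failed = []
    if top.confidence < min_conf:
        failed.append("confidence")
    if sum(1 for label in retrieved_labels if label == top.label) < min_support:
        failed.append("coverage")
    if len({o.label for o in outputs}) > 1:
        failed.append("conflict")
    return top.label, failed


outputs = [
    ToolOutput("classifier", "melanoma", 0.82),
    ToolOutput("concept_annotator", "nevus", 0.55),
]
label, failed = audit(outputs, ["melanoma", "nevus", "melanoma"])
print(label, failed)  # melanoma ['conflict']
```

Because the checks are plain rules rather than model calls, the same inputs always produce the same list of failed gates, which is what makes such an audit traceable. A non-empty list would be the signal to trigger self-correction.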

cs.CV / 41 / 2605.14448

Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

必要时思考：基于自适应推理的双LoRA架构多模态嵌入

Zhang, Longxiang, Dai, Weilong, Zhang, Guanghao, Jiang, Hao, Huang, Pipei

Abstract

Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in both model size and inference cost. They typically employ a separate reasoner and embedder with substantial parameter overhead, and generate CoT indiscriminately for every input. However, we observe that for simple inputs, discriminative embeddings already perform well, and redundant reasoning can even mislead the model, degrading performance. To address these limitations, we propose Think When Needed (TWN), a unified multimodal embedding framework with adaptive reasoning. TWN introduces a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, detaching gradients at their interface to mitigate gradient conflicts introduced by joint optimization while keeping parameters close to a single model. Building on this, an adaptive think mechanism uses a self-supervised routing gate to decide per input whether to generate CoT, skipping unnecessary reasoning to reduce inference overhead and even improve retrieval quality. We further explore embedding-guided RL to optimize CoT quality beyond supervised training. On the 78 tasks of MMEB-V2, TWN achieves state-of-the-art embedding quality while being substantially more efficient than existing generative methods, requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens compared to the full generative mode.

Chinese Translation

多模态大型语言模型（MLLMs）已成为多模态嵌入的强大支柱。近期的方法将思维链（CoT）推理引入嵌入流程，以提高检索质量，但在模型规模和推理成本上仍然代价高昂。它们通常采用独立的推理器和嵌入器，带来显著的参数开销，并对每个输入无差别地生成CoT。然而，我们观察到，对于简单输入，区分性嵌入已经表现良好，冗余推理甚至可能误导模型，降低性能。为了解决这些局限性，我们提出了必要时思考（TWN），一个具有自适应推理的统一多模态嵌入框架。TWN引入了双LoRA架构，将推理和嵌入适配器附加到共享的冻结主干上，在它们的接口处断开梯度，以减轻联合优化引入的梯度冲突，同时保持参数接近单一模型。在此基础上，自适应思考机制使用自监督路由门来决定每个输入是否生成CoT，跳过不必要的推理以减少推理开销，甚至提高检索质量。我们进一步探索嵌入引导的强化学习，以优化超越监督训练的CoT质量。在MMEB-V2的78个任务上，TWN实现了最先进的嵌入质量，同时在效率上显著优于现有生成方法，相对于主干仅需增加3-5%的参数，并且与完整生成模式相比，推理令牌减少了多达50%。

cs.CV / 42 / 2605.14461

ClickRemoval: An Interactive Open-Source Tool for Object Removal in Diffusion Models

ClickRemoval：一种用于扩散模型中物体移除的交互式开源工具

Zhang, Ledun, Ji, Yatu, Zhuang, Xufei, Yao, Xinying

Abstract

Existing object removal tools often rely on manual masks or text prompts, making precise removal difficult for non-expert users in complex scenes and often leading to incomplete removal or unnatural background completion. To address this issue, we present ClickRemoval, an open-source interactive object removal tool built on pretrained Stable Diffusion models and driven solely by user clicks. Without additional training, hand-drawn masks, or text descriptions, ClickRemoval localizes target objects and restores the background through self-attention modulation during denoising. Experiments show that ClickRemoval achieves competitive results across quantitative metrics and user studies. We release a complete software package at https://github.com/zld-make/ClickRemoval under the Apache-2.0 license.

Chinese Translation

现有的物体移除工具通常依赖于手动遮罩或文本提示，这使得在复杂场景中对非专业用户进行精确移除变得困难，并且往往导致移除不完全或背景填充不自然。为了解决这一问题，我们提出了ClickRemoval，这是一种基于预训练的Stable Diffusion模型构建的开源交互式物体移除工具，仅通过用户点击进行驱动。ClickRemoval在去噪过程中通过自注意力调制来定位目标物体并恢复背景，无需额外的训练、手绘遮罩或文本描述。实验表明，ClickRemoval在定量指标和用户研究中均取得了具有竞争力的结果。我们在https://github.com/zld-make/ClickRemoval上发布了完整的软件包，采用Apache-2.0许可证。

cs.CV / 43 / 2605.14462

Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos

HOI中的Real2Sim：从单目视频中实现物理上合理的HOI重建

Zhao, Yubo, Chai, Yujin, Dong, Yunao, Zhao, Chengfeng, Zeng, Zijiao, Liu, Yuan, Tang, Chi-Keung

Abstract

Recovering 4D human-object interaction (HOI) from monocular video is a key step toward scalable 3D content creation, embodied AI, and simulation-based learning. Recent methods can reconstruct temporally coherent human and object trajectories, but these trajectories often remain visual artifacts while failing to preserve stable contact, functional manipulation, or physical plausibility when used as reference motions for humanoid-object simulation. This reveals a fundamental interaction gap: HOI reconstruction should not stop at tracking a human and an object, but should recover the relation that makes their motion a coherent interaction. We introduce $\textbf{HA-HOI}$, a framework for reconstructing physically plausible 4D HOI animation from in-the-wild monocular videos. Instead of treating the human and object as independent entities in an ambiguous monocular 3D space, we propose a $\textit{human-first, object-follow}$ formulation. The human motion is recovered as the interaction anchor, and the object is reconstructed, aligned, and refined relative to the human action. The resulting kinematic trajectory is then projected into a physics-based humanoid-object simulation, where it acts as a teacher trajectory for stable physical rollout. Across benchmark and in-the-wild videos, $\textbf{HA-HOI}$ improves human-object alignment, contact consistency, temporal stability, and simulation readiness over prior monocular HOI reconstruction methods. By moving beyond visually plausible trajectory recovery toward physically grounded interaction animation, our work takes a step toward turning general monocular HOI videos into scalable demonstrations for humanoid-object behavior. Project page: https://knoxzhao.github.io/real2sim_in_HOI/

Chinese Translation

从单目视频中恢复4D人机交互（HOI）是可扩展3D内容创作、具身人工智能和基于模拟的学习的重要步骤。最近的方法能够重建时间上连贯的人类和物体轨迹，但这些轨迹往往仍然是视觉伪影，未能在作为类人-物体模拟的参考运动时保持稳定接触、功能性操作或物理合理性。这揭示了一个基本的交互差距：HOI重建不应仅停留在跟踪人类和物体上，而应恢复使其运动成为连贯交互的关系。我们提出了$\textbf{HA-HOI}$，一个从自然环境中的单目视频中重建物理上合理的4D HOI动画的框架。我们并不将人类和物体视为模糊的单目3D空间中的独立实体，而是提出了一种$\textit{以人类为中心，物体跟随}$的公式。人类运动被恢复为交互锚点，物体则相对于人类动作进行重建、对齐和精细化。最终的运动轨迹被投影到基于物理的类人-物体模拟中，作为稳定物理展开的教师轨迹。在基准测试和自然环境视频中，$\textbf{HA-HOI}$在与人类-物体对齐、接触一致性、时间稳定性和模拟准备性方面优于先前的单目HOI重建方法。通过超越视觉上合理的轨迹恢复，朝着物理基础的交互动画迈进，我们的工作为将一般的单目HOI视频转变为可扩展的类人-物体行为演示迈出了重要一步。项目页面：https://knoxzhao.github.io/real2sim_in_HOI/

cs.CV / 44 / 2605.14475

GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

GeoVista：用于超高分辨率遥感理解的视觉基础主动感知

Zhu, Jiashun, Fu, Ronghao, Hu, Jiasen, Xing, Nachuan, Na, Xu, Yang, Xiao, Lin, Zhiwen, Zhang, Weipeng, Sun, Lang, Xue, Zhiheng, Liu, Haoran, Zhang, Weijie, Yang, Bo

Abstract

Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To this end, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this behavior, we introduce APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse UHR tasks as Global-Region-Object interactive reasoning processes with a unified, scale-invariant spatial representation. We further design an Observe-Plan-Track mechanism for global observation, adaptive region inspection, and evidence tracking, and align the model with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness. Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance. Code and dataset are available at https://github.com/ryan6073/GeoVista

Chinese Translation

解读超高分辨率（UHR）遥感图像需要模型在大规模场景中寻找稀疏和微小的视觉证据。现有的遥感视觉语言模型可以利用缩放和裁剪工具检查局部区域，但大多数探索策略要么遵循一次性聚焦，要么沿单一顺序轨迹进行。这种单路径探索可能会丢失全局上下文，留下未访问的分散区域，并多次重新访问或计算相同的证据。为此，我们提出了GeoVista，一个基于规划的主动感知框架，用于UHR遥感解读。GeoVista首先构建一个全局探索计划，而不是承诺于一个缩放路径，然后通过分支局部检查验证多个候选区域，同时保持跨区域聚合和去重的明确证据状态。为了实现这一行为，我们引入了APEX-GRO，一个冷启动的监督轨迹语料库，将多样的UHR任务重新表述为具有统一、尺度不变空间表示的全球-区域-对象交互推理过程。我们进一步设计了观察-计划-跟踪机制，以实现全球观察、自适应区域检查和证据跟踪，并使用基于GRPO的策略对模型进行对齐，通过逐步奖励进行规划、定位和最终答案的正确性。RSHR-Bench、XLRS-Bench和LRS-VQA上的实验表明，GeoVista达到了最先进的性能。代码和数据集可在https://github.com/ryan6073/GeoVista获取。
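
Editor's sketch: the abstract mentions an explicit evidence state used for cross-region aggregation and de-duplication, without giving its form. One plausible reading, shown below, stores detections as boxes in normalized global image coordinates and rejects any new box that overlaps a stored one. The `EvidenceState` class, the IoU criterion, and the 0.5 threshold are assumptions for illustration, not GeoVista's actual mechanism.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class EvidenceState:
    """Keeps one box per object across all inspected regions."""

    def __init__(self, iou_threshold=0.5):
        self.iou_threshold = iou_threshold
        self.boxes = []

    def add(self, box):
        """Store the box unless it duplicates an existing one; report whether it was kept."""
        if any(iou(box, kept) >= self.iou_threshold for kept in self.boxes):
            return False
        self.boxes.append(box)
        return True


state = EvidenceState()
state.add((0.10, 0.10, 0.20, 0.20))  # object found while inspecting one branch
state.add((0.11, 0.10, 0.21, 0.20))  # same object seen again from an overlapping crop
state.add((0.50, 0.50, 0.60, 0.60))  # a different object
print(len(state.boxes))  # 2
```

Expressing every box in the global frame is what lets crops taken at different zoom levels be compared directly, so revisiting a region does not inflate a count.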

cs.CV / 45 / 2605.14486

Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection

减少伪影偏差以提高AI生成图像检测的通用性

Li, Yiheng, Yang, Yang, Tan, Zichang, Li, Gao, Lei, Zhen, Wang, Wenhao

Abstract

As the misuse of AI-generated images grows, generalizable image detection techniques are urgently needed. Recent state-of-the-art (SOTA) methods adopt aligned training datasets to reduce content, size, and format biases, empowering models to capture robust forgery cues. A common strategy is to employ reconstruction techniques, e.g., VAE and DDIM, which show remarkable results in diffusion-based methods. However, such reconstruction-based approaches typically introduce limited and homogeneous artifacts, which cannot fully capture diverse generative patterns, such as those of GAN-based methods. To complement reconstruction-based fake images with aligned yet diverse artifact patterns, we propose a GAN-based upsampling approach that mimics GAN-generated fake patterns while preserving content, size, and format alignment. This naturally results in two aligned but distinct types of fake images. However, due to the domain shift between reconstruction-based and upsampling-based fake images, direct mixed training causes suboptimal results, where one domain disrupts feature learning of the other. Accordingly, we propose a Separate Expert Fusion (SEF) framework to extract complementary artifact information and reduce inter-domain interference. We first train domain-specific experts via LoRA adaptation on a frozen foundational model, then conduct decoupled fusion with a gating network to adaptively combine expert features while retaining their specialized knowledge. Rather than merely benefiting GAN-generated image detection, this design introduces diverse and complementary artifact patterns that enable SEF to learn a more robust decision boundary and improve generalization across broader generative methods. Extensive experiments demonstrate that our method yields strong results across 13 diverse benchmarks. Code is released at: https://github.com/liyih/SEF_AIGC_detection.

Chinese Translation

随着AI生成图像的误用日益增加，迫切需要通用的图像检测技术。近期的最先进（SOTA）方法采用对齐的训练数据集，以减少内容、大小和格式偏差，从而使模型能够捕捉到稳健的伪造线索。一种常见策略是采用重建技术，例如变分自编码器（VAE）和去噪扩散生成模型（DDIM），这些方法在基于扩散的技术中表现出色。然而，这些基于重建的方法通常引入有限且同质的伪影，无法完全捕捉多样的生成模式，例如基于生成对抗网络（GAN）的方法。为了用对齐但多样的伪影模式补充基于重建的伪造图像，我们提出了一种基于GAN的上采样方法，该方法在保持内容、大小和格式对齐的同时，模拟GAN生成的伪造模式。这自然导致了两种对齐但不同类型的伪造图像。然而，由于基于重建和基于上采样的伪造图像之间存在领域转移，直接混合训练会导致次优结果，其中一个领域会干扰另一个领域的特征学习。因此，我们提出了一个独立专家融合（SEF）框架，以提取互补的伪影信息并减少领域间干扰。我们首先通过在冻结的基础模型上进行LoRA适应训练领域特定的专家，然后通过门控网络进行解耦融合，自适应地结合专家特征，同时保留其专业知识。该设计不仅有利于GAN生成图像的检测，还引入了多样且互补的伪影模式，使SEF能够学习更稳健的决策边界，并提高在更广泛生成方法中的泛化能力。大量实验表明，我们的方法在13个不同基准测试中取得了优异的结果。代码已发布于：https://github.com/liyih/SEF_AIGC_detection。
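
Editor's sketch: the fusion step described above combines features from separately trained experts through a gating network. A minimal pure-Python illustration follows, assuming a single linear gate over the concatenated expert features and a softmax-weighted sum. The abstract does not specify the gate's architecture, so `gated_fusion` and its weights are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(expert_feats, gate_weights):
    """Fuse per-expert feature vectors with input-dependent gate weights.

    expert_feats: one feature vector per expert, all of the same length.
    gate_weights: one weight vector per expert, applied to the concatenated
                  features to produce that expert's gate logit (assumed design).
    """
    concat = [x for feat in expert_feats for x in feat]
    logits = [sum(w * x for w, x in zip(ws, concat)) for ws in gate_weights]
    gates = softmax(logits)
    dim = len(expert_feats[0])
    fused = [sum(g * feat[d] for g, feat in zip(gates, expert_feats)) for d in range(dim)]
    return gates, fused


recon_expert = [1.0, 0.0]     # features from the reconstruction-artifact expert
upsample_expert = [0.0, 1.0]  # features from the GAN-upsampling-artifact expert
gates, fused = gated_fusion(
    [recon_expert, upsample_expert],
    gate_weights=[[0.0] * 4, [0.0] * 4],  # zero weights give a uniform gate
)
print(gates, fused)  # [0.5, 0.5] [0.5, 0.5]
```

Since the experts stay frozen and only the gate mixes their outputs, each expert's specialized features reach the classifier unchanged. That is the sense in which fusion is decoupled from expert training.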

cs.CV / 46 / 2605.14487

Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

头部强制：通过头部异质性实现长时间自回归视频生成

Tian, Jiahao, Wang, Yiwei, Yu, Gang, Zhang, Chi

Abstract

Autoregressive video diffusion models support real-time synthesis but suffer from error accumulation and context loss over long horizons. We discover that attention heads in AR video diffusion transformers serve functionally distinct roles as local heads for detail refinement, anchor heads for structural stabilization, and memory heads for long-range context aggregation, yet existing methods treat them uniformly, leading to suboptimal KV cache allocation. We propose Head Forcing, a training-free framework that assigns each head type a tailored KV cache strategy: local and anchor heads retain only essential tokens, while memory heads employ a hierarchical memory system with dynamic episodic updates for long-range consistency. A head-wise RoPE re-encoding scheme further ensures positional encodings remain within the pretrained range. Without additional training, Head Forcing extends generation from 5 seconds to minute-level duration, supports multi-prompt interactive synthesis, and consistently outperforms existing baselines. Project Page: https://jiahaotian-sjtu.github.io/headforcing.github.io/.

Chinese Translation

自回归视频扩散模型支持实时合成，但在长时间范围内会遭遇错误累积和上下文丢失的问题。我们发现，AR视频扩散变换器中的注意力头在功能上发挥着不同的作用：局部头用于细节精炼，锚定头用于结构稳定，记忆头用于长距离上下文聚合。然而，现有方法对它们的处理是统一的，导致KV缓存分配不理想。我们提出了头部强制（Head Forcing），这是一个无训练的框架，为每种头部类型分配量身定制的KV缓存策略：局部头和锚定头仅保留必要的标记，而记忆头则采用具有动态情节更新的分层记忆系统，以确保长距离一致性。头部级的RoPE重新编码方案进一步确保位置编码保持在预训练范围内。在没有额外训练的情况下，头部强制将生成时间从5秒延长至分钟级别，支持多提示交互合成，并始终优于现有基准。项目页面：https://jiahaotian-sjtu.github.io/headforcing.github.io/

cs.CV / 47 / 2605.14513

HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

HASTE：通过头部自适应稀疏注意力实现无训练视频扩散加速

Zheng, Xuzhe, Ma, Yuexiao, Xu, Jing, Zheng, Xiawu, Ji, Rongrong, Chao, Fei

Abstract

Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment remains limited by the quadratic complexity of full attention. Training-free sparse attention is attractive because it accelerates pretrained models without retraining, yet existing online top-$p$ sparse attention still spends non-negligible cost on mask prediction and applies shared thresholds despite strong head-level heterogeneity. We show that these two overlooked factors limit the practical speed-quality trade-off of training-free sparse attention in Video DiTs. To address them, we introduce a head-wise adaptive framework with two plug-in components: Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns per-head top-$p$ thresholds by minimizing measured model-output error under a global sparsity budget. On Wan2.1-1.3B and Wan2.1-14B, our method consistently improves XAttention and SVG2, achieving up to a 1.93$\times$ speedup at 720P while maintaining competitive video quality and similarity metrics.

Chinese Translation

基于扩散的视频生成在视觉保真度和时间一致性方面取得了显著进展，但实际应用仍受到全注意力的二次复杂度限制。无训练的稀疏注意力因其能够在不重新训练的情况下加速预训练模型而备受关注，然而现有的在线 top-$p$ 稀疏注意力仍在掩码预测上花费了不可忽视的成本，并且尽管头部间存在显著异质性，却应用了共享阈值。我们展示了这两个被忽视的因素限制了无训练稀疏注意力在视频 DiTs 中的实际速度-质量权衡。为了解决这些问题，我们引入了一种头部自适应框架，包含两个插件组件：时间掩码重用（Temporal Mask Reuse），根据查询-键漂移跳过不必要的掩码预测，以及基于误差引导的预算校准（Error-guided Budgeted Calibration），通过在全球稀疏预算下最小化测量的模型输出误差来为每个头分配 top-$p$ 阈值。在 Wan2.1-1.3B 和 Wan2.1-14B 上，我们的方法持续改善了 XAttention 和 SVG2，在保持竞争性视频质量和相似性指标的同时，实现了在 720P 下高达 1.93 倍的加速。
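
Editor's sketch: top-$p$ sparse attention keeps, for each query, the smallest set of keys whose softmax mass reaches $p$. The Python below illustrates that selection rule on two toy heads to show why a shared threshold fits heads with different attention shapes poorly. The scores and thresholds are made up, and the calibration procedure that would choose per-head thresholds is not shown.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_keep(scores, p):
    """Indices of the smallest key set whose softmax mass reaches p."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)


# One query row of attention scores for two hypothetical heads.
peaked_head = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0]
diffuse_head = [1.0, 0.9, 1.1, 1.0, 0.95, 1.05]

print(top_p_keep(peaked_head, 0.9))   # [0]
print(top_p_keep(diffuse_head, 0.9))  # [0, 1, 2, 3, 4, 5]
print(top_p_keep(diffuse_head, 0.5))  # [0, 2, 5]
```

With a shared threshold of 0.9, the peaked head keeps one key while the diffuse head keeps all six and gains no sparsity. Lowering the threshold for the diffuse head alone halves its key set. Whether that is acceptable depends on the output error it causes, which is the quantity the abstract says the calibration step measures.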

cs.CV / 48 / 2605.14518

ArcGate: Adaptive Arctangent Gated Activation

ArcGate：自适应反正切门控激活

Bhattacharya, Avik, Gole, Siddhant Dnyanesh, Chaudhuri, Subhasis, Frery, Alejandro C., Banerjee, Biplab

Abstract

Activation functions are central to deep networks, influencing non-linearity, feature learning, convergence, and robustness. This paper proposes the Adaptive Arctangent Gated Activation (ArcGate) function, a flexible formulation that generates a broad spectrum of activation shapes via a three-stage non-linear transformation. Unlike conventional fixed-shape activations such as ReLU, GELU, or SiLU, ArcGate uses seven learnable parameters per layer, allowing the neural network to autonomously optimize its non-linearity to the specific requirements of the feature hierarchy and data distribution. We evaluate ArcGate using ResNet-50 and Vision Transformer (ViT-B/16) architectures on three widely used remote sensing benchmarks: PatternNet, UC Merced Land Use, and the 13-band EuroSAT MSI multispectral dataset. Experimental results show that ArcGate consistently outperforms standard baselines, achieving a peak overall accuracy of 99.67% on PatternNet. Most notably, ArcGate exhibits superior structural resilience in noisy environments, maintaining a 26.65% performance lead over ReLU under moderate Gaussian noise (standard deviation 0.1). Analysis of the learned parameters reveals a depth-dependent functional evolution, where the model increases gating strength in deeper layers to enhance signal propagation. These findings suggest that ArcGate is a robust and adaptive general node activation function for high-resolution earth observation tasks.

Chinese Translation

激活函数在深度网络中至关重要，影响非线性、特征学习、收敛性和鲁棒性。本文提出了一种自适应反正切门控激活（ArcGate）函数，这是一种灵活的形式，通过三阶段非线性变换生成广泛的激活形状。与传统的固定形状激活函数（如ReLU、GELU或SiLU）不同，ArcGate在每一层使用七个可学习参数，使神经网络能够自主优化其非线性，以满足特征层次和数据分布的具体要求。我们在三个广泛使用的遥感基准数据集上评估了ArcGate，包括PatternNet、UC Merced土地利用和13波段EuroSAT MSI多光谱数据集，使用ResNet-50和Vision Transformer（ViT-B/16）架构。实验结果表明，ArcGate始终优于标准基线，在PatternNet上达到99.67%的峰值整体准确率。值得注意的是，ArcGate在嘈杂环境中表现出优越的结构鲁棒性，在中等高斯噪声（标准差0.1）下相较于ReLU保持26.65%的性能优势。对学习参数的分析揭示了深度依赖的功能演变，模型在更深层次中增加了门控强度，以增强信号传播。这些发现表明，ArcGate是一种稳健且自适应的通用节点激活函数，适用于高分辨率地球观测任务。

cs.CV / 49 / 2605.14525

From Sparse to Dense: Spatio-Temporal Fusion for Multi-View 3D Human Pose Estimation with DenseWarper

从稀疏到密集：基于DenseWarper的多视角3D人体姿态估计的时空融合

Li, Ling, Chen, Changjie, Wang, Yuyan, Lyu, Jiaqing, Chang, Kenglun, Chen, Yiyun, Deng, Zhidong

Abstract

In multi-view 3D human pose estimation, models typically rely on images captured simultaneously from different camera views to predict a pose at a specific moment. While providing accurate spatial information, this traditional approach often overlooks the rich temporal dependencies between adjacent frames. To address this, we propose a novel input scheme for 3D human pose estimation: sparse interleaved input. This method leverages images captured from different camera views at different time points (e.g., View 1 at time $t$ and View 2 at time $t+\delta$), allowing our model to capture rich spatio-temporal information and effectively boost performance. More importantly, this approach offers two key advantages. First, it can theoretically increase the output pose frame rate by a factor of N with N cameras, thereby breaking through single-view frame rate limitations and enhancing the temporal resolution of the output. Second, by using a sparse subset of the available frames, our method reduces data redundancy while achieving better performance. We introduce the DenseWarper model, which leverages epipolar geometry for efficient spatio-temporal heatmap exchange. We conducted extensive experiments on the Human3.6M and MPI-INF-3DHP datasets. Results demonstrate that our method, utilizing only sparse interleaved images as input, outperforms traditional dense multi-view input approaches and achieves state-of-the-art performance. The source code for this work is available at: https://github.com/lingli1724/DenseWarper-ICLR2026

Chinese Translation

在多视角3D人体姿态估计中，模型通常依赖于从不同相机视角同时捕获的图像，以预测特定时刻的姿态。虽然这种传统方法提供了准确的空间信息，但往往忽视了相邻帧之间丰富的时间依赖关系。为此，我们提出了一种新颖的3D人体姿态估计输入方法：稀疏交错输入。该方法利用在不同时间点（例如，视角1在时间$t$，视角2在时间$t+\delta$）从不同相机视角捕获的图像，使我们的模型能够捕获丰富的时空信息并有效提升性能。更重要的是，这种方法提供了两个关键优势：首先，理论上可以通过N个相机将输出姿态帧率提高N倍，从而突破单视角帧率的限制，增强生成的时间分辨率。其次，使用可用帧的稀疏子集，我们的方法可以减少数据冗余，同时实现更好的性能。我们引入了DenseWarper模型，该模型利用极几何进行高效的时空热图交换。我们在Human3.6M和MPI-INF-3DHP数据集上进行了广泛的实验。结果表明，我们的方法仅使用稀疏交错图像作为输入，优于传统的密集多视角输入方法，并实现了最先进的性能。本研究的源代码可在以下链接获取：https://github.com/lingli1724/DenseWarper-ICLR2026

cs.CV / 50 / 2605.14530

Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

缓解大型扩散视觉-语言模型中的掩码先验漂移和位置注意力崩溃

Hong, Sujung, Yoon, Chanyong, Hwang, Seongjae

Abstract

Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.

Chinese Translation

大型扩散视觉-语言模型（LDVLMs）最近作为自回归模型的有希望的替代方案出现，使得高效推理的并行解码成为可能，并利用双向注意力获取全局上下文。尽管取得了这些进展，但它们在长文本生成中的表现仍然未被充分探索。在本研究中，我们表明现有的LDVLMs存在重复生成和视觉基础差的现象，并识别出两个潜在原因。首先，重复生成源于掩码令牌先验：由于生成令牌初始化为掩码令牌，其隐藏表示在生成步骤中逐渐漂移到共享的先验方向。其次，位置注意力偏差与迭代去掩码过程之间的根本不对齐抑制了对信息丰富的视觉令牌的注意力，导致视觉基础的退化。基于这些见解，我们提出了一种无训练的方法，引入掩码先验抑制和单调RoPE缩放，以缓解解码过程中的掩码先验漂移和位置注意力崩溃。在一般多模态基准和视觉基础任务上的实验表明，相较于基线LDVLMs，取得了显著改善，尤其在长文本描述基准上表现出强劲的提升。我们的结果表明，这些失败可以通过一种轻量级的即插即用策略有效解决，该策略不需要额外训练，并且能够在多种LDVLM架构中推广。

cs.CV / 51 / 2605.14534

PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media

PROVE：视觉媒体的感知去除一致性基准

Li, Fuhao, You, Shaofeng, Hu, Jiagao, Liu, Yu, Chen, Yuxuan, Wang, Zepeng, Wang, Fei, Zhou, Daiguo, Luan, Jian

Abstract

Evaluating object removal in images and videos remains challenging because the task is inherently one-to-many, yet existing metrics frequently disagree with human perception. Full-reference metrics reward copy-paste behaviors over genuine erasure; no-reference metrics suffer from systematic biases such as favoring blurry results; and global temporal metrics are insensitive to localized artifacts within edited regions. To address these limitations, we propose RC (Removal Coherence), a pair of perception-aligned metrics: RC-S, which measures spatial coherence via sliding-window feature comparison between masked and background regions, and RC-T, which measures temporal consistency via distribution tracking within shared restored regions across adjacent frames. To validate RC and support community benchmarking, we further introduce PROVE-Bench, a two-tier real-world benchmark comprising PROVE-M, an 80-video paired dataset with motion augmentation, and PROVE-H, a 100-video challenging subset without ground truth. Together, the RC metrics and PROVE-Bench form the PROVE (Perceptual RemOVal cohErence) evaluation framework for visual media. Experiments across diverse image and video benchmarks demonstrate that RC achieves substantially stronger alignment with human judgments than existing evaluation protocols. The code for the RC metrics and PROVE-Bench is publicly available at: https://github.com/xiaomi-research/prove/.

Chinese Translation

评估图像和视频中的对象去除仍然具有挑战性，因为该任务本质上是一对多的，然而现有的度量标准常常与人类感知不一致。全参考度量奖励复制粘贴行为而非真实的删除；无参考度量则受到系统性偏差的影响，例如偏爱模糊结果；而全局时间度量对编辑区域内的局部伪影不敏感。为了解决这些局限性，我们提出了RC（去除一致性），一对与感知对齐的度量：RC-S，通过对比遮挡区域和背景区域之间的滑动窗口特征来测量空间一致性；RC-T，通过在相邻帧之间共享恢复区域内的分布跟踪来测量时间一致性。为了验证RC并支持社区基准测试，我们进一步推出了PROVE-Bench，一个由两层组成的真实世界基准，包括PROVE-M，一个具有运动增强的80个视频配对数据集，以及PROVE-H，一个没有真实标签的100个视频挑战子集。RC度量和PROVE-Bench共同构成了视觉媒体的PROVE（感知去除一致性）评估框架。在多样的图像和视频基准测试中的实验表明，RC与人类判断的对齐程度显著强于现有的评估协议。RC度量和PROVE-Bench的代码已公开发布，网址为：https://github.com/xiaomi-research/prove/。

cs.CV / 52 / 2605.14548

Local Spatiotemporal Convolutional Network for Robust Gait Recognition

用于鲁棒步态识别的局部时空卷积网络

Wang, Xiaoyun, Li, Cunrong, Wang, Wu

Abstract

Gait recognition, as a promising biometric technology, identifies individuals through their unique walking patterns and offers distinctive advantages including non-invasiveness, long-range applicability, and resistance to deliberate disguise. Despite these merits, capturing the intrinsic motion patterns concealed within consecutive video frames remains challenging due to the complexity of video data and the interference of external covariates such as viewpoint changes, clothing variations, and carrying conditions. Existing approaches predominantly either rely on static appearance features extracted from individual silhouette frames or employ complex sequential models (e.g., LSTM, 3D convolutions) that demand substantial computational resources and sophisticated training strategies. To address these limitations, we propose a Local Spatiotemporal Convolutional Network (LSTCN), a structurally simple yet highly effective dual-branch architecture that endows standard two-dimensional convolutional networks with the capacity to extract temporal information. Specifically, we introduce a Global Bidirectional Spatial Pooling (GBSP) mechanism that reduces the dimensionality of gait tensors by decomposing spatial features into horizontal and vertical strip-based local representations, enabling the temporal dimension to participate in standard 2D convolution operations. Building upon this, we design a Local Spatiotemporal Convolutional (LSTC) layer that jointly processes temporal and spatial dimensions, allowing the network to adaptively learn strip-based gait motion patterns. We further extend this formulation with asymmetric convolution kernels that independently attend to the temporal, spatial, and joint spatiotemporal domains, thereby enriching the extracted feature representations.
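
The strip-based pooling idea can be sketched on a toy silhouette sequence. This is an illustration under stated assumptions rather than the authors' GBSP implementation: the helper name `strip_pool` is hypothetical, and mean pooling is assumed because the abstract does not name the pooling operator.

```python
def strip_pool(frames):
    """Collapse a T x H x W silhouette sequence into two 2D maps.

    Returns (horizontal, vertical): horizontal is T x H (each row averaged
    over width), vertical is T x W (each column averaged over height).
    Assumption: mean pooling is used as the strip aggregator.
    """
    horizontal = [[sum(row) / len(row) for row in frame] for frame in frames]
    vertical = [[sum(col) / len(col) for col in zip(*frame)] for frame in frames]
    return horizontal, vertical

# Two toy silhouette frames (T=2, H=2, W=3).
frames = [
    [[0, 1, 1], [0, 0, 1]],
    [[1, 1, 1], [0, 1, 0]],
]
horizontal, vertical = strip_pool(frames)
print(len(horizontal), len(horizontal[0]))  # time x height
print(len(vertical), len(vertical[0]))      # time x width
```

Because each output is a 2D map indexed by time and strip position, an ordinary 2D convolution can slide over the temporal and spatial axes jointly, which is the property the abstract attributes to the pooling step.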

Chinese Translation

步态识别作为一种有前景的生物识别技术，通过个体独特的行走模式来识别个体，并具有非侵入性、远程适用性和抵抗故意伪装等显著优势。尽管如此，由于视频数据的复杂性以及视角变化、服装变化和携带条件等外部干扰，捕捉连续视频帧中隐含的运动模式仍然具有挑战性。现有方法主要依赖从单个轮廓帧提取的静态外观特征，或采用复杂的序列模型（如 LSTM、3D 卷积），这些方法需要大量的计算资源和复杂的训练策略。为了解决这些局限性，我们提出了一种局部时空卷积网络（Local Spatiotemporal Convolutional Network, LSTCN），这是一种结构简单但高效的双分支架构，使标准的二维卷积网络具备提取时间信息的能力。具体而言，我们引入了一种全局双向空间池化（Global Bidirectional Spatial Pooling, GBSP）机制，通过将空间特征分解为基于水平和垂直条带的局部表示，降低步态张量的维度，从而使时间维度能够参与标准的二维卷积操作。在此基础上，我们设计了一个局部时空卷积（Local Spatiotemporal Convolutional, LSTC）层，联合处理时间和空间维度，使网络能够自适应地学习基于条带的步态运动模式。我们进一步扩展了这一公式，采用不对称卷积核，独立关注时间、空间和联合时空域，从而丰富提取的特征表示。

cs.CV / 53 / 2605.14552

LiWi: Layering in the Wild

LiWi: 野外分层

He, Yu, Li, Fang, Tong, Haoyang, Ma, Lichen, Shan, Xinyuan, Fu, Jingling, Chen, Dong, Liu, Luohang, Huang, Junshi, Li, Yan

Abstract

Recent advances in generative models have empowered impressive layered image generation, yet their success is largely confined to graphic design domains. The layering of in-the-wild images remains an underexplored problem, limiting fine-grained editing and applications of images in real-world scenarios. Specifically, challenges remain in building scalable layered data and in modeling object interactions in natural images, such as illumination effects and structural boundaries. To address these bottlenecks, we propose a novel framework for high-fidelity natural image decomposition. First, we introduce an Agent-driven Data Decomposition (ADD) pipeline that orchestrates agents and tools to synthesize layered data without manual intervention. Utilizing this pipeline, we construct a large-scale dataset, named LiWi-100k, with over 100,000 high-quality layered in-the-wild images. Second, we present a novel framework that jointly improves photometric fidelity and alpha boundary accuracy. Specifically, shadow-guided learning explicitly models the illumination effects, and a degradation-restoration objective provides boundary-correction supervision by recovering a clean foreground image from a degraded one. Extensive experiments demonstrate that our framework achieves state-of-the-art (SoTA) performance in natural image decomposition, outperforming existing models in RGB L1 and Alpha IoU metrics. We will soon release our code and dataset.

Chinese Translation

最近在生成模型方面的进展使得令人印象深刻的分层图像生成成为可能，但其成功在很大程度上局限于图形设计领域。在自然场景中的图像分层仍然是一个未被充分探索的问题，限制了图像在现实世界场景中的细粒度编辑和应用。具体而言，在可扩展的分层数据和自然图像中物体交互建模（如光照效果和结构边界）方面仍然存在挑战。为了解决这些瓶颈，我们提出了一种新颖的高保真自然图像分解框架。首先，我们引入了一种基于代理的数据分解（Agent-driven Data Decomposition, ADD）管道，该管道协调代理和工具合成分层数据，无需人工干预。利用该管道，我们构建了一个名为LiWi-100k的大规模数据集，包含超过100,000张高质量的野外分层图像。其次，我们提出了一种新颖的框架，联合提高光度保真度和透明度边界的准确性。具体而言，基于阴影的学习明确建模光照效果，而降级-恢复目标通过从降级图像中恢复干净的前景图像提供边界校正监督。大量实验表明，我们的框架在自然图像分解方面达到了最先进的（SoTA）性能，在RGB L1和Alpha IoU指标上超越了现有模型。我们将很快发布我们的代码和数据集。

cs.CV / 54 / 2605.14566

SpectraFlow: Unifying Structural Pretraining and Frequency Adaptation for Medical Image Segmentation

SpectraFlow：统一结构预训练与频率适应用于医学图像分割

Chen, Zhiquan, Wang, Haitao, Zou, Guowei, Wu, Hejun

Abstract

Medical image segmentation remains challenging in low-data regimes, where scarce annotations often yield poor generalization and ambiguous boundaries with missing fine structures. Recent self-supervised pretraining has improved transferability, but it often exhibits a texture bias. In contrast, accurate segmentation is inherently geometry-aware and depends on both topological consistency and precise boundary preservation. To address this problem, we propose a two-stage framework that couples structure-aware encoder pretraining with boundary-oriented decoding. In Stage-1, we aim to learn structure-aware representations for downstream segmentation in low-data regimes. To this end, we propose Mixed-Domain MeanFlow Pretraining, which aligns images and binary masks in a shared latent space through latent transport regression, where masks act as conditional structural guidance rather than prediction targets, making the pretraining task-agnostic. To further improve training stability under scarce supervision, we incorporate a lightweight Dispersive Loss to prevent representation collapse. In Stage-2, we fine-tune the pretrained encoder with a lightweight decoder that combines Direct Attentional Fusion for adaptive cross-scale gating and Frequency-Directional Dynamic Convolution for high-frequency boundary refinement under appearance variation. Experiments on ISIC-2016, Kvasir-SEG, and GlaS demonstrate consistent gains over state-of-the-art methods, with improved robustness in low-data settings and sharper boundary delineation.

Chinese Translation

医学图像分割在低数据环境中仍然面临挑战，在这种情况下，稀缺的标注往往导致较差的泛化能力和模糊的边界，缺失细微结构。近期的自监督预训练提高了迁移能力，但通常表现出纹理偏差。相比之下，准确的分割本质上是几何感知的，依赖于拓扑一致性和精确的边界保持。为了解决这个问题，我们提出了一个两阶段框架，将结构感知编码器预训练与边界导向解码相结合。在第一阶段，我们旨在学习低数据环境下用于下游分割的结构感知表示。为此，我们提出了混合域均值流预训练（Mixed-Domain MeanFlow Pretraining），通过潜在传输回归将图像和二进制掩膜对齐到共享的潜在空间，其中掩膜作为条件结构指导而非预测目标，使得预训练任务与具体任务无关。为了进一步提高在稀缺监督下的训练稳定性，我们引入了一种轻量级的分散损失（Dispersive Loss）以防止表示崩溃。在第二阶段，我们用一个轻量级解码器微调预训练的编码器，该解码器结合了直接注意力融合（Direct Attentional Fusion）以实现自适应跨尺度门控，以及频率导向动态卷积（Frequency-Directional Dynamic Convolution）以在外观变化下进行高频边界细化。在ISIC-2016、Kvasir-SEG和GlaS上的实验表明，相较于最先进的方法，我们的方法在低数据环境下表现出一致的提升，并且边界描绘更加清晰。

cs.CV / 55 / 2605.14569

Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction

连接大脑与语义：一种层次化框架用于语义增强的fMRI到视频重建

Wei, Yujie, Ma, Chenglong, Gao, Jianxiong, Wang, Chenhui, Zhang, Shiwei, Gong, Biao, Tan, Shuai, Yuan, Hangjie, Shan, Hongming

Abstract

Reconstructing dynamic visual experiences as videos from functional magnetic resonance imaging (fMRI) is pivotal for advancing the understanding of neural processes. However, current fMRI-to-video reconstruction methods are hindered by a semantic gap between noisy fMRI signals and the rich content of videos, stemming from a reliance on incomplete semantic embeddings that neither capture video-specific cues (e.g., actions) nor integrate prior knowledge. To this end, we draw inspiration from the dual-pathway processing mechanism in human brain and introduce CineNeuron, a novel hierarchical framework for semantically enhanced video reconstruction from fMRI signals with two synergistic stages. First, a bottom-up semantic enrichment stage maps fMRI signals to a rich embedding space that comprehensively captures textual semantics, image contents, action concepts, and object categories. Second, a top-down memory integration stage utilizes the proposed Mixture-of-Memories method to dynamically select relevant "memories" from previously seen data and fuse them with the fMRI embedding to refine the video reconstruction. Extensive experimental results on two fMRI-to-video benchmarks demonstrate that CineNeuron surpasses state-of-the-art methods across various metrics.

Chinese Translation

从功能性磁共振成像（fMRI）重建动态视觉体验为视频，对于推动对神经过程的理解至关重要。然而，当前的fMRI到视频重建方法受到噪声fMRI信号与视频丰富内容之间的语义差距的限制，这源于对不完整语义嵌入的依赖，这些嵌入既无法捕捉特定于视频的线索（例如，动作），也未能整合先前知识。为此，我们受到人脑双通路处理机制的启发，提出了CineNeuron，一种新颖的层次化框架，用于从fMRI信号进行语义增强的视频重建，包含两个协同阶段。首先，底层的语义增强阶段将fMRI信号映射到一个丰富的嵌入空间，全面捕捉文本语义、图像内容、动作概念和物体类别。其次，顶层的记忆整合阶段利用所提出的混合记忆（Mixture-of-Memories）方法，动态选择来自先前见过的数据的相关“记忆”，并将其与fMRI嵌入融合，以优化视频重建。在两个fMRI到视频基准上的广泛实验结果表明，CineNeuron在各种指标上超越了最先进的方法。

cs.CV / 56 / 2605.14579

Med-DisSeg: Dispersion-Driven Representation Learning for Fine-Grained Medical Image Segmentation

Med-DisSeg：基于分散驱动的细粒度医学图像分割表示学习

Chen, Zhiquan, Wang, Haitao, Zou, Guowei, Wu, Hejun

Abstract

Accurate medical image segmentation is fundamental to precision medicine, yet robust delineation remains challenging under heterogeneous appearances, ambiguous boundaries, and large anatomical variability. Similar intensity and texture patterns between targets and surrounding tissues often lead to blurred activations and unreliable separation. We attribute these failures to representation collapse during encoding and insufficient fine-grained multi-scale decoding. To address these issues, we propose Med-DisSeg, a dispersion-driven medical image segmentation framework that jointly improves representation learning and anatomical delineation. Med-DisSeg combines a lightweight Dispersive Loss with adaptive attention for fine-grained structure segmentation. The Dispersive Loss enlarges inter-sample margins by treating in-batch hidden representations as negative pairs, producing well-dispersed and boundary-aware embeddings with negligible overhead. Based on these enhanced representations, the encoder strengthens structure-sensitive responses, while the decoder performs adaptive multi-scale calibration to preserve complementary local texture and global shape information. Extensive experiments on five datasets spanning three imaging modalities demonstrate consistent state-of-the-art performance. Moreover, Med-DisSeg achieves competitive results on multi-organ CT segmentation, supporting its robustness and cross-task applicability.
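
A repulsion-only objective of the kind described above can be sketched as follows. This is a hedged illustration, not the Med-DisSeg loss: the positive-free, InfoNCE-style form (log of the mean pairwise `exp(-distance / tau)`), the squared Euclidean distance, the temperature `tau=0.5`, and the function name `dispersive_loss` are assumptions; the abstract only states that in-batch hidden representations are treated as negative pairs.

```python
import math

def dispersive_loss(batch, tau=0.5):
    """Log of the mean of exp(-||z_i - z_j||^2 / tau) over all pairs i != j.

    Assumed form: every other sample in the batch acts as a negative, and
    there is no positive term. Lower values mean better-dispersed embeddings.
    """
    terms = []
    for i, zi in enumerate(batch):
        for j, zj in enumerate(batch):
            if i == j:
                continue
            sq_dist = sum((a - b) ** 2 for a, b in zip(zi, zj))
            terms.append(math.exp(-sq_dist / tau))
    return math.log(sum(terms) / len(terms))

collapsed = [[0.0, 0.0], [0.01, 0.0], [0.0, 0.01]]   # nearly identical embeddings
dispersed = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]    # well-separated embeddings
print(dispersive_loss(collapsed) > dispersive_loss(dispersed))
```

Minimising such a term pushes hidden representations apart without needing labels or augmented pairs, which is consistent with the "negligible overhead" claim, since it reuses activations already computed in the forward pass.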

Chinese Translation

准确的医学图像分割是精准医学的基础，但在异质外观、模糊边界和大解剖变异性下，稳健的轮廓描绘仍然具有挑战性。目标与周围组织之间相似的强度和纹理模式常常导致模糊的激活和不可靠的分离。我们将这些失败归因于编码过程中的表示崩溃和细粒度多尺度解码不足。为了解决这些问题，我们提出了Med-DisSeg，一个基于分散驱动的医学图像分割框架，旨在共同改善表示学习和解剖描绘。Med-DisSeg结合了轻量级的分散损失（Dispersive Loss）和自适应注意力机制，以实现细粒度结构分割。分散损失通过将批次中的隐藏表示视为负对，扩大样本间的边界，从而生成分散良好且关注边界的嵌入，且开销微乎其微。基于这些增强的表示，编码器增强了对结构敏感的响应，而解码器则执行自适应多尺度校准，以保留互补的局部纹理和全局形状信息。在五个涵盖三种成像模式的数据集上的广泛实验表明，Med-DisSeg始终表现出领先的性能。此外，Med-DisSeg在多脏器CT分割中也取得了竞争力的结果，支持其稳健性和跨任务适用性。

cs.CV / 57 / 2605.14581

A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval

一图胜千言？视觉金融文档检索聚合策略的实证研究

Lim, Ho Hung, Yang, Yi

Abstract

Visual RAG offers an alternative to traditional RAG. It treats documents as images and uses vision encoders to obtain vision patch tokens. However, hundreds of patch tokens per document create retrieval and storage challenges in a vector database, so practical deployment requires aggregating them into a single vector. This raises a critical question: does single-vector aggregation lose key information in financial documents? We develop a diagnostic benchmark using financial documents where changes in single digits can lead to significant semantic shifts. Our experiments show that single-vector aggregation collapses different documents into almost identical vectors. Patch-level metrics detect the semantic changes, confirming that aggregation obscures these details. We identify global texture dominance as the root cause. Our findings are consistent across model scales, retrieval-optimized embeddings, and multiple mitigation strategies, highlighting significant risks for single-vector visual document retrieval in financial applications.
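
The collapse described above can be reproduced with toy numbers. This is a minimal sketch under stated assumptions, not the paper's benchmark: mean pooling is assumed as the aggregator (the abstract does not name one), the 3-dimensional patch embeddings are made up, and `mean_pool` and `cosine` are hypothetical helpers.

```python
import math

def mean_pool(patches):
    """Average patch embeddings into one document vector (assumed aggregator)."""
    dim = len(patches[0])
    return [sum(p[d] for p in patches) / len(patches) for d in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# 400 toy patch embeddings per page; only one patch (the edited digit) differs.
page_a = [[1.0, 0.2, 0.0] for _ in range(400)]
page_b = [row[:] for row in page_a]
page_b[7] = [0.0, 0.2, 1.0]   # made-up embedding of the changed digit patch

patch_sim = cosine(page_a[7], page_b[7])                    # patch-level view
pooled_sim = cosine(mean_pool(page_a), mean_pool(page_b))   # single-vector view
print(round(patch_sim, 3), round(pooled_sim, 6))
```

At the patch level the edited region is clearly different, while the pooled vectors are nearly indistinguishable because 399 unchanged patches dominate the average. This is the dilution effect the abstract attributes to global texture dominance.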

Chinese Translation

视觉RAG为传统RAG提供了一种替代方案。它将文档视为图像，并使用视觉编码器获取视觉补丁令牌。然而，每个文档数百个补丁令牌在向量数据库中带来了检索和存储的挑战。实际部署需要将它们聚合为一个单一向量。这引发了一个关键问题：单向量聚合是否会丢失金融文档中的关键信息？我们开发了一个诊断基准，使用金融文档，其中单个数字的变化可能导致显著的语义转变。我们的实验表明，单向量聚合会将不同文档压缩为几乎相同的向量。指标显示，补丁级别能够检测语义变化，并确认聚合模糊了这些细节。我们确定全局纹理主导性是根本原因。我们的发现适用于不同模型规模、检索优化嵌入和多种缓解策略，突显了在金融应用中单向量视觉文档检索的重大风险。

cs.CV / 58 / 2605.14590

FedStain: Modeling Higher-Order Stain Statistics for Federated Domain Generalization in Computational Pathology

FedStain：在计算病理学中建模联邦领域泛化的高阶染色统计

Zhang, Fengyi, Zhang, Junya, Sun, Wenzhuo

Abstract

Robust whole-slide image (WSI) analysis under strict data-governance remains challenging due to substantial cross-institutional stain heterogeneity. Domain generalization (DG) mitigates these shifts but typically requires centralized data, conflicting with privacy regulations. Federated learning (FedL) provides a decentralized alternative; however, existing FedL and federated DG (FedDG) approaches rely almost exclusively on low-order statistics, assuming Gaussian-like stain distributions. In contrast, real-world staining processes often produce asymmetric, heavy-tailed color distributions due to biochemical diffusion and scanner nonlinearity. Consequently, current methods fail to model the higher-order, non-Gaussian characteristics dominating real-world stain variability. To address this, we propose FedStain, a stain-aware FedDG framework explicitly incorporating higher-order stain moments--skewness and kurtosis--as compact statistical descriptors exchanged during federated optimization. These descriptors require no pixel-level data transmission, preserving strict privacy and communication efficiency, while enabling the global model to capture stain variability missed by low-order statistics. FedStain also employs a contrastive, cross-site parameter aggregation strategy to promote stain-invariant representations without relaxing data constraints. Extensive experiments on Camelyon17 and our new MvMidog-Fed benchmark show FedStain yields consistent improvements, outperforming state-of-the-art FedL, DG, and FedDG baselines by up to +3.9% absolute accuracy. To our knowledge, FedStain is the first FedDG approach to explicitly model higher-order stain statistics, enabling robust cross-institutional deployment in computational pathology.
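
The higher-order descriptors mentioned above are standard moment statistics and can be sketched directly. This is an illustration only: how FedStain computes, normalises, or exchanges them is not stated in the abstract, so the per-channel layout, the function name `stain_moments`, and the toy intensity values are assumptions.

```python
def stain_moments(values):
    """Return (mean, variance, skewness, excess_kurtosis) of one colour channel.

    Uses population (biased) central moments: skewness = m3 / m2^1.5 and
    excess kurtosis = m4 / m2^2 - 3, which is 0 for a Gaussian.
    """
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    m2 = sum(c ** 2 for c in centered) / n
    m3 = sum(c ** 3 for c in centered) / n
    m4 = sum(c ** 4 for c in centered) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return mean, m2, skewness, excess_kurtosis

symmetric = [0.2, 0.4, 0.6, 0.8]            # toy intensities, symmetric
right_tailed = [0.1, 0.1, 0.1, 0.1, 0.9]    # toy intensities, one heavy tail
print(stain_moments(symmetric)[2], stain_moments(right_tailed)[2])
```

A client would only need to share a handful of such numbers per channel rather than pixels, which matches the privacy and communication argument in the abstract. Mean and variance alone cannot tell a symmetric distribution apart from a skewed one with the same first two moments.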

Chinese Translation

在严格的数据治理下，稳健的全幻灯片图像（WSI）分析由于跨机构的染色异质性仍然面临挑战。领域泛化（DG）可以缓解这些变化，但通常需要集中数据，这与隐私法规相冲突。联邦学习（FedL）提供了一种去中心化的替代方案；然而，现有的FedL和联邦DG（FedDG）方法几乎完全依赖于低阶统计，假设染色分布呈高斯分布。相反，现实世界的染色过程通常由于生化扩散和扫描仪非线性而产生不对称的重尾色彩分布。因此，当前的方法未能建模主导现实世界染色变异性的高阶非高斯特征。为了解决这个问题，我们提出了FedStain，一个染色感知的FedDG框架，明确地将高阶染色矩——偏度和峰度——作为在联邦优化过程中交换的紧凑统计描述符。这些描述符不需要像素级数据传输，从而保持严格的隐私和通信效率，同时使全局模型能够捕捉到低阶统计未能捕捉到的染色变异性。FedStain还采用对比的跨站点参数聚合策略，以促进染色不变表示，而无需放宽数据约束。在Camelyon17和我们新的MvMidog-Fed基准上的广泛实验表明，FedStain提供了一致的改进，绝对准确率比最先进的FedL、DG和FedDG基线高出多达+3.9%。据我们所知，FedStain是第一个明确建模高阶染色统计的FedDG方法，使得在计算病理学中实现跨机构的稳健部署成为可能。

cs.CV / 59 / 2605.14594

TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation

TOPOS：高保真且高效的工业级3D头部生成

Xiong, Bojun, Bi, Zoubin, Peng, Xinghui, Wang, Yunmu, Deng, Junchen, Liang, Jun, Li, Jing, Cai, Bowen, Fu, Huan

Abstract

High-fidelity 3D head generation plays a crucial role in the film, animation and video game industries. In industrial pipelines, studios typically enforce a fixed reference topology across all head assets, as such a clean and uniform topology is a prerequisite for production-level rigging, skinning and animation. In this paper, we present TOPOS, a framework tailored for single-image-conditioned 3D head generation that jointly recovers geometry and appearance under such an industry-standard topology. In contrast to general 3D generative models, which produce triangle meshes with inconsistent topology and numerous vertices, hindering semantic correspondence and asset-level reuse, TOPOS generates head meshes with a fixed, studio-style topology, enabling consistent vertex-level correspondence across all generated heads. To model heads under this unified topology, we propose a novel variational autoencoder structure, termed TOPOS-VAE. Inspired by multimodal large language models (MLLMs), our TOPOS-VAE leverages the Perceiver Resampler to convert input point clouds sampled from head meshes of diverse topologies into the target reference topology. Building upon TOPOS-VAE's structured latent space, we train a rectified flow transformer, TOPOS-DiT, to efficiently generate high-fidelity head meshes from a single image. We further present TOPOS-Texture, an end-to-end module that produces relightable UV texture maps from the same portrait image by fine-tuning a multimodal image generative model. The generated textures are spatially aligned with the underlying mesh geometry and faithfully preserve high-frequency appearance details. Extensive experiments demonstrate that TOPOS achieves state-of-the-art performance on 3D head generation, surpassing both classical face reconstruction methods and general 3D object generative models, highlighting its effectiveness for digital human creation.

Chinese Translation

高保真的3D头部生成在电影、动画和视频游戏行业中扮演着至关重要的角色。在工业流程中，工作室通常对所有头部资产施加固定的参考拓扑，因为这种干净且统一的拓扑是生产级绑定、蒙皮和动画的前提。在本文中，我们提出了TOPOS，一个专为单图像条件下的3D头部生成量身定制的框架，该框架在行业标准拓扑下共同恢复几何形状和外观。与生成不一致拓扑和众多顶点的三角网格的一般3D生成模型相比，TOPOS生成具有固定工作室风格拓扑的头部网格，从而实现所有生成头部之间的一致顶点级对应。为了在这一统一拓扑下建模头部，我们提出了一种新颖的变分自编码器结构，称为TOPOS-VAE。受多模态大型语言模型（MLLMs）启发，我们的TOPOS-VAE利用感知重采样器将从不同拓扑的头部网格中采样的输入点云转换为目标参考拓扑。在TOPOS-VAE的结构化潜在空间基础上，我们训练了一个修正流变换器TOPOS-DiT，以高效地从单图像生成高保真的头部网格。我们进一步提出了TOPOS-Texture，一个端到端模块，通过微调多模态图像生成模型，从同一肖像图像生成可重光照的UV纹理图。生成的纹理与底层网格几何形状在空间上对齐，并忠实保留高频外观细节。大量实验表明，TOPOS在3D头部生成方面达到了最先进的性能，超越了经典的人脸重建方法和一般3D物体生成模型，突显了其在数字人类创作中的有效性。

cs.CV / 60 / 2605.14597

VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting

VMU-Diff：一种用于降水短期预报的粗到细多源数据融合框架

Shi, Chunlei, Li, Hao, Zhu, Yufeng, Liu, Boyu, Feng, Yongchao, Zang, Zengliang, Wang, Hongbin, Yang, Yanlan, Niu, Dan

Abstract

Precipitation nowcasting is a vital spatio-temporal prediction task for meteorological applications but faces challenges due to the chaotic nature of precipitation systems. Existing methods predominantly rely on single-source radar data to build either deterministic or probabilistic models for extrapolation. However, a purely deterministic model suffers from blurring due to MSE convergence. A purely probabilistic model, typically a diffusion model, can generate fine details but suffers from spurious artifacts that compromise accuracy and from computational inefficiency. To address these challenges, this paper proposes a novel coarse-to-fine precipitation nowcasting framework based on a Vision Mamba Unet and residual Diffusion (VMU-Diff). It performs precipitation nowcasting in two stages: a deterministic coarse stage that predicts global motion trends and a probabilistic fine stage that generates fine prediction details. In the coarse prediction stage, both radar and multi-band satellite data are taken as input, rather than single-source radar data alone. A spatial-temporal attention block and several Vision Mamba state-space blocks perform multi-source data fusion and predict the global dynamics of future echoes. The fine-grained stage is realized by a spatio-temporal refinement generator based on residual conditional diffusion models. It first obtains spatio-temporal residual features from the coarse prediction and the ground truth, and then reconstructs the residual via a conditional Mamba state-space module. Experiments on the Jiangsu SWAN datasets demonstrate the improvements of our method over state-of-the-art methods, particularly in short-term forecasts.

Chinese Translation

降水短期预报是气象应用中一项重要的时空预测任务，但由于降水系统的混沌特性，面临诸多挑战。现有方法主要依赖单一源的雷达数据来构建确定性或概率模型进行外推。然而，单一的确定性模型由于均方误差（MSE）收敛而导致模糊。单一的概率模型，通常由扩散模型表示，能够生成细节，但存在影响准确性的虚假伪影，且计算效率低下。为了解决这些挑战，本文提出了一种新颖的基于粗到细的视觉曼巴U-Net和残差扩散（VMU-Diff）的降水短期预报框架。该框架通过两个阶段实现降水短期预报，即基于确定性模型的粗预测阶段用于预测全局运动趋势，以及基于概率模型的细预测阶段用于生成精细的预测细节。在粗预测阶段，输入不再是单一源的雷达数据，而是同时采用雷达和多波段卫星数据。空间-时间注意力模块和多个视觉曼巴状态空间模块实现了多源数据融合，并预测未来回波的全局动态。细粒度阶段则通过基于残差条件扩散模型的时空细化生成器实现。该阶段首先基于粗预测和真实值获取时空残差特征，并通过条件曼巴状态空间模块进一步重构残差。在江苏SWAN数据集上的实验表明，我们的方法相较于最先进的方法有所提升，尤其是在短期预报方面。

cs.CV / 61 / 2605.14601

Towards Accurate Single Panoramic 3D Detection: A Semantic Gaussian Centric Approach

朝向准确的单幅全景3D检测：一种语义高斯中心方法

Ning, Kanglin, Zhao, Yiran, Li, Wenrui, Sun, Shaoru, Wang, Xingtao, Fan, Xiaopeng

Abstract

Three-dimensional object detection in panoramic imagery is crucial for comprehensive scene understanding, yet accurately mapping 2D features to 3D remains a significant challenge. Prevailing methods often project 2D features onto discrete 3D grids, which break geometric continuity and limit representation efficiency. To overcome this limitation, this paper proposes PanoGSDet, a monocular panoramic 3D detection framework built upon continuous semantic 3D Gaussian representations. The proposed framework comprises a panoramic depth estimation component and a semantic Gaussian component. The panoramic depth estimation component extracts the equirectangular semantic and depth features from the monocular panorama input. The semantic Gaussian component includes a semantic Gaussian lifting module that projects spherical features into 3D semantic Gaussians, a semantic Gaussian optimization module that refines these semantic Gaussians, and a Gaussian guided prediction head that generates 3D bounding boxes from optimized Gaussian representations. Extensive experiments on the Structured3D dataset demonstrate that our method significantly outperforms existing methods.
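
The geometric core of lifting equirectangular features into 3D can be sketched as a pixel-plus-depth unprojection. This is not the PanoGSDet lifting module, which presumably also predicts further Gaussian attributes; it only shows where a Gaussian centre might be placed. The axis convention (y up, z toward the image centre), the treatment of depth as range along the ray, and the function name `unproject_equirect` are assumptions.

```python
import math

def unproject_equirect(u, v, depth, width, height):
    """Map pixel (u, v) with range `depth` to a 3D point (x, y, z).

    Assumed convention: longitude spans [-pi, pi) from left to right,
    latitude spans [pi/2, -pi/2] from top to bottom, y is up, and z points
    at the image centre.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    x = depth * math.cos(lat) * math.sin(lon)
    y = depth * math.sin(lat)
    z = depth * math.cos(lat) * math.cos(lon)
    return x, y, z

# The image centre looks straight ahead along +z under this convention.
center = unproject_equirect(u=512, v=256, depth=2.0, width=1024, height=512)
print(center)
```

Because every pixel maps to a continuous 3D position, no voxel grid is needed at this step, which is the contrast with discrete 3D grids that the abstract draws.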

Chinese Translation

全景图像中的三维物体检测对于全面理解场景至关重要，但将2D特征准确映射到3D仍然是一个重大挑战。现有方法通常将2D特征投影到离散的3D网格上，这破坏了几何连续性并限制了表示效率。为克服这一限制，本文提出了PanoGSDet，一种基于连续语义3D高斯表示的单目全景3D检测框架。该框架包括一个全景深度估计组件和一个语义高斯组件。全景深度估计组件从单目全景输入中提取等矩形语义和深度特征。语义高斯组件包括一个语义高斯提升模块，该模块将球面特征投影到3D语义高斯中，一个语义高斯优化模块，该模块对这些语义高斯进行精细化，以及一个高斯引导预测头，该头从优化的高斯表示中生成3D边界框。在Structured3D数据集上的大量实验表明，我们的方法显著优于现有方法。

cs.CV / 62 / 2605.14606

MambaRain: Multi-Scale Mamba-Attention Framework for 0-3 Hour Precipitation Nowcasting

MambaRain：用于0-3小时降水即时预报的多尺度Mamba注意力框架

Shi, Chunlei, Wu, Cui, Xu, Xiang, Li, Hao, Fan, Ni, Han, Xue, Feng, Yongchao, Zhu, Yufeng, Liu, Boyu, Zang, Zengliang, Wang, Hongbin, Yang, Yanlan, Niu, Dan

Abstract

Accurate precipitation nowcasting over extended horizons (0-3 hours) is essential for disaster mitigation and operational decision-making, yet remains a critical challenge in the field. Existing deterministic approaches are predominantly constrained to shorter prediction windows (0-2 hours), exhibiting severe performance degradation beyond 90 minutes owing to their inherent difficulty in capturing long-range spatiotemporal dependencies from radar-derived observations. To address these fundamental limitations, we propose MambaRain, a novel multi-scale encoder-decoder architecture that synergistically integrates Mamba's linear-complexity long-range temporal modeling with self-attention mechanisms for explicit spatial correlation capture. The core innovation lies in a hybrid design paradigm wherein Mamba blocks leverage selective state space mechanisms to model global temporal dynamics across extended sequences with computational efficiency, while self-attention modules explicitly characterize spatial correlations within precipitation fields - a capability inherently absent in Mamba's sequential processing paradigm. This complementary synergy enables comprehensive spatiotemporal representation learning, effectively extending the viable forecasting horizon to 2-3 hours with substantial accuracy improvements. Furthermore, we introduce a spectral loss formulation to mitigate blurring artifacts characteristic of chaotic precipitation systems, thereby preserving fine-scale motion details critical for nowcasting accuracy. Experimental validation demonstrates that MambaRain substantially outperforms existing deterministic methodologies in 0-3 hour nowcasting tasks, with particularly pronounced performance gains in the challenging 2-3 hour prediction range.
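
The idea of a spectral loss that penalises blurring can be sketched in one dimension. The abstract does not give the formulation, so this is an assumption-laden illustration: a squared error between DFT amplitude spectra, applied to a single 1D row instead of a 2D radar field, with hypothetical helper names.

```python
import cmath

def dft_magnitudes(signal):
    """Amplitude spectrum of a real 1D signal via a direct O(n^2) DFT."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]

def spectral_loss(pred, target):
    """Mean squared difference between amplitude spectra (assumed form)."""
    pred_mag = dft_magnitudes(pred)
    target_mag = dft_magnitudes(target)
    return sum((p - q) ** 2 for p, q in zip(pred_mag, target_mag)) / len(pred)

sharp = [0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0]     # toy sharp echo
blurred = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]   # same mass, smoothed
print(spectral_loss(sharp, sharp), spectral_loss(blurred, sharp) > 0)
```

The blurred row has the same total intensity as the sharp one but less high-frequency energy, so an amplitude-spectrum penalty registers the difference and discourages over-smoothed forecasts.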

Chinese Translation

在较长时间范围内（0-3小时）准确的降水即时预报对于灾害缓解和操作决策至关重要，但在该领域仍然是一个重大挑战。现有的确定性方法主要限制在较短的预测窗口（0-2小时），由于其在捕捉雷达观测数据中的长程时空依赖性方面的固有困难，导致在90分钟之后性能严重下降。为了解决这些基本限制，我们提出了MambaRain，一种新颖的多尺度编码器-解码器架构，协同整合了Mamba的线性复杂度长程时间建模与自注意力机制，以显式捕捉空间相关性。核心创新在于一种混合设计范式，其中Mamba模块利用选择性状态空间机制，以较高的计算效率建模扩展序列中的全局时间动态，而自注意力模块则显式表征降水场中的空间相关性——这一能力在Mamba的顺序处理范式中是固有缺失的。这种互补的协同作用使得全面的时空表示学习成为可能，有效地将可行的预测范围延长至2-3小时，并显著提高了准确性。此外，我们引入了一种谱损失公式，以减轻混沌降水系统特有的模糊伪影，从而保留对即时预报准确性至关重要的细尺度运动细节。实验验证表明，MambaRain在0-3小时的即时预报任务中显著优于现有的确定性方法，尤其是在具有挑战性的2-3小时预测范围内表现出明显的性能提升。

cs.CV / 63 / 2605.14607

ViMU: Benchmarking Video Metaphorical Understanding

ViMU：视频隐喻理解的基准测试

Li, Qi, Wang, Xinchao

Abstract

Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two levels: one is the content directly presented, while the other is the subtext beneath it, that is, the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. Thus, the true meaning of many videos does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some forms of such video subtext are humorous, while others carry irony, mockery, or criticism. These implicit meanings can also be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate the subtext understanding capabilities of frontier models in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence and answering both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.

Chinese Translation

任何新兴媒介一旦出现，便不仅仅用于传递显性内容。它所承载的信息通常在两个层面上运作：一个是直接呈现的内容，另一个是其背后的潜台词——创作者通过该媒介试图传达的隐含思想和意图。同样，自视频技术广泛应用以来，视频不仅作为记录和传递视觉信息的强大工具，还作为情感、态度和社会意义的载体，这些往往难以明确表达。因此，许多视频的真实意义并不单纯存在于屏幕上所展示的内容中；它通常嵌入在上下文、表达风格和观众的社会经验中。这些视频潜台词的某些形式是幽默的，而其他形式则带有讽刺、嘲弄或批评。这些隐含意义在不同文化背景和社会群体中也可能被解读得截然不同。然而，大多数现有的视频理解模型仍主要集中于字面视觉理解，例如识别物体、动作或时间关系，缺乏系统性理解视频中嵌入的隐喻、讽刺和社会意义的能力。为了解决这一问题，我们推出了ViMU，这是第一个旨在系统评估前沿模型在视频中潜台词理解能力的基准测试。ViMU评估视频理解模型是否能够超越字面感知，推断隐含意义，同时将其解释基于多模态证据，并回答开放式和多项选择问题。重要的是，所有问题都设计为无提示，确保在回答之前不会向模型透露任何关键证据。

cs.CV / 64 / 2605.14609

Deep Image Segmentation via Discriminant Feature Learning

通过判别特征学习进行深度图像分割

Sztamborski, Adam Dawid, Pérez-Gonzalo, Raül, Agudo, Antonio

Abstract

Accurate image segmentation remains challenging, particularly in generating sharp, confident boundaries. While modern architectures have advanced the field, many of them still rely on standard loss functions like Cross-Entropy and Dice, which often neglect the discriminative structure of learned features, leading to inaccurate boundaries. This work introduces Deep Discriminant Analysis (DDA), a differentiable, architecture-agnostic loss function that embeds classical discriminant principles for network training. DDA explicitly maximizes between-class variance while minimizing within-class variance, promoting compact and separable feature distributions without increasing inference cost. Evaluations on the DIS5K benchmark demonstrate that DDA consistently improves segmentation accuracy, boundary sharpness, and model confidence across various architectures. Our results show that integrating discriminant analysis offers a simple, effective path for building more robust segmentation models.
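
The discriminant idea in this abstract can be illustrated with a small sketch. The abstract does not give the DDA formula, so the ratio below (within-class scatter divided by between-class scatter) is an assumption borrowed from classical Fisher-style discriminant analysis, and the function name `discriminant_loss` is hypothetical.

```python
def discriminant_loss(features, labels, eps=1e-8):
    """Within-class scatter divided by between-class scatter (lower is better).

    Assumed illustration of a discriminant objective, not the paper's exact loss.
    """
    dim = len(features[0])
    n = len(features)
    global_mean = [sum(f[d] for f in features) / n for d in range(dim)]

    # Group feature vectors by class label.
    groups = {}
    for f, y in zip(features, labels):
        groups.setdefault(y, []).append(f)

    within = 0.0
    between = 0.0
    for members in groups.values():
        mean = [sum(f[d] for f in members) / len(members) for d in range(dim)]
        # Squared distance of every sample to its own class mean.
        within += sum(
            sum((f[d] - mean[d]) ** 2 for d in range(dim)) for f in members
        )
        # Class-size-weighted squared distance of the class mean to the global mean.
        between += len(members) * sum(
            (mean[d] - global_mean[d]) ** 2 for d in range(dim)
        )
    return within / (between + eps)


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
compact = discriminant_loss(points, [0, 0, 1, 1])  # classes match the clusters
mixed = discriminant_loss(points, [0, 1, 0, 1])    # classes cut across the clusters
```

Minimizing such a ratio pushes features of the same class together and class means apart. In a real network the same quantity would be computed on differentiable tensors rather than Python lists.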

Chinese Translation

准确的图像分割仍然具有挑战性，特别是在生成清晰、可靠的边界方面。尽管现代架构推动了该领域的发展，但许多架构仍依赖于标准损失函数，如交叉熵（Cross-Entropy）和Dice，这往往忽视了学习特征的判别结构，导致边界不准确。本研究提出了深度判别分析（Deep Discriminant Analysis, DDA），这是一种可微分的、与架构无关的损失函数，嵌入了经典的判别原则用于网络训练。DDA明确地最大化类间方差，同时最小化类内方差，促进紧凑且可分的特征分布，而不增加推理成本。在DIS5K基准上的评估表明，DDA在各种架构中始终提高了分割精度、边界清晰度和模型置信度。我们的结果表明，整合判别分析为构建更稳健的分割模型提供了一条简单有效的路径。

cs.CV / 65 / 2605.14615

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

CalibAnyView：超越野外单视角相机标定

Li, Boying, Zhang, Cheng, Chen, Weirong, Cremers, Daniel, Reid, Ian, Rezatofighi, Hamid

Abstract

Camera calibration is a fundamental prerequisite for reliable geometric perception, yet classical approaches rely on controlled acquisition setups that are impractical for in-the-wild imagery. Recent learning-based methods have shown promising results for single-view calibration, but inherently neglect geometric consistency across multiple views. We introduce CalibAnyView, a unified formulation that supports an arbitrary number of input views ($N \geq 1$) by explicitly modeling cross-view geometric consistency. To facilitate this, we construct a large-scale multi-view video dataset covering diverse real-world scenarios, including multiple camera models, dynamic scenes, realistic motion trajectories, and heterogeneous lens distortions. Building on this dataset, we develop a multi-view transformer that predicts dense perspective fields, which are further integrated into a geometric optimization framework to jointly estimate camera intrinsics and gravity direction. Extensive experiments demonstrate that CalibAnyView consistently outperforms state-of-the-art methods, achieves strong robustness under single-view settings, and further improves with multi-view inference, providing a reliable foundation for downstream tasks such as 3D reconstruction and robotic perception in the wild.

Chinese Translation

相机标定是可靠几何感知的基本前提，然而传统方法依赖于受控的采集设置，这在野外图像中是不切实际的。最近的基于学习的方法在单视角标定方面显示出良好的结果，但本质上忽视了多个视角之间的几何一致性。我们提出了CalibAnyView，一种统一的公式，通过显式建模跨视角的几何一致性，支持任意数量的输入视角（$N \geq 1$）。为此，我们构建了一个大规模的多视角视频数据集，涵盖多种真实世界场景，包括多种相机模型、动态场景、真实的运动轨迹和异质的镜头畸变。在此数据集的基础上，我们开发了一种多视角变换器，预测稠密的透视场，这些透视场进一步集成到几何优化框架中，以联合估计相机内参和重力方向。大量实验表明，CalibAnyView在各项指标上均优于最先进的方法，在单视角设置下表现出强大的鲁棒性，并在多视角推理中进一步提升，为下游任务如3D重建和野外机器人感知提供了可靠的基础。

cs.CV / 66 / 2605.14621

Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution

我们真的需要外部工具来减轻幻觉吗？SIRA：共享前缀内部归因重构

Qin, Tian, Chen, Junzhe, Shi, Yuqing, Zhang, Tianshu, Ju, Qiang, Wen, Lijie

Abstract

Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.
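
The token-level contrast described above can be sketched as follows. The abstract does not state SIRA's scoring rule, so the combination `(1 + alpha) * full - alpha * reference` is the generic contrastive decoding form and is an assumption here. The names `contrast_scores`, `pick_token`, and `alpha` are hypothetical, and the two logit vectors stand in for the full pathway and the late-masked counterfactual branch.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrast_scores(full_logits, reference_logits, alpha=1.0):
    """Boost tokens favoured by the full pathway, penalize tokens that stay
    likely in the language-prior-dominated reference branch (assumed rule)."""
    full = log_softmax(full_logits)
    ref = log_softmax(reference_logits)
    return [(1 + alpha) * f - alpha * r for f, r in zip(full, ref)]


def pick_token(full_logits, reference_logits, alpha=1.0):
    """Greedy choice over the contrasted scores."""
    scores = contrast_scores(full_logits, reference_logits, alpha)
    return max(range(len(scores)), key=scores.__getitem__)


full_logits = [2.0, 1.9, 0.0]       # token 0 narrowly wins with full visual access
reference_logits = [2.5, 0.5, 0.0]  # token 0 is still strong without late visual access
plain_choice = max(range(3), key=full_logits.__getitem__)
contrast_choice = pick_token(full_logits, reference_logits)
```

In this toy case plain greedy decoding picks token 0, while the contrasted scores prefer token 1, whose advantage depends on the visual pathway.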

Chinese Translation

大型视觉语言模型（LVLMs）在语言先验主导弱或模糊视觉证据时，常常会产生幻觉。现有的对比解码方法通过比较来自原始图像的预测与来自外部扰动视觉输入的预测来缓解这一问题，但这种参考可能会引入离散流形伪影，并且需要额外昂贵的前向传播。我们提出了SIRA，一种无训练的内部对比解码框架，通过利用多模态变换器的分阶段信息流，在同一LVLM内部构建反事实参考。SIRA并不是从输入中去除视觉信息，而是首先让图像和文本标记通过共享前缀进行交互，形成一个对齐的多模态状态，保留提示解释、解码历史、位置结构和早期视觉基础。然后，它在后续的变换器层中分叉出一个反事实分支，其中对图像标记位置的注意力被屏蔽。该分支保留了共享的多模态上下文，但缺乏对细粒度视觉证据的持续访问，从而产生一个以语言先验为主导的内部参考，用于标记级别的对比。在解码过程中，SIRA抑制那些在缺乏后期视觉访问时仍然强大的标记，并优先考虑那些优势依赖于完整视觉路径的预测。在与Qwen2.5-VL和LLaVA-v1.5的POPE、CHAIR和AMBER实验中，SIRA始终减少幻觉，同时保持描述覆盖，并且比双通道对比解码的开销更低。SIRA不需要训练、外部验证者或扰动输入，并适用于具有白盒推理访问的开放权重LVLMs。

cs.CV / 67 / 2605.14626

UniTriGen: Unified Triplet Generation of Aligned Visible-Infrared-Label for Few-Shot RGB-T Semantic Segmentation

UniTriGen：统一的可对齐可见-红外-标签三元组生成用于少样本RGB-T语义分割

Zhou, Ping, Wang, Haoyu, Zheng, Mengmeng, Zhang, Lei, Wei, Wei, Ding, Chen, Zhou, Fei

Abstract

RGB-T semantic segmentation requires strictly aligned VIS-IR-Label triplets; however, such aligned triplet data are often scarce in real-world scenarios. Existing generative augmentation methods usually adopt cascaded generation paradigms, decomposing joint triplet generation into local conditional processes. As a result, consistency among VIS, IR, and Label in spatial structure, semantic content, and cross-modal details cannot be reliably maintained. To address this issue, we propose UniTriGen, a unified triplet generation framework that directly generates spatially aligned, semantically consistent, and modality complementary VIS-IR-Label triplets under the guidance of text prompts. UniTriGen first introduces a unified triplet generation mechanism, where VIS, IR, and Label are jointly encoded into a shared latent space and modeled with a diffusion process to enforce global cross-modal consistency. Lightweight modality-specific residual adapters are further integrated into this mechanism to accommodate modality-specific imaging characteristics and output formats. To mitigate generation bias caused by imbalanced scene and class distributions in limited paired triplets, UniTriGen also employs a scene-balanced and class-aware few-shot sampling strategy, which induces a more balanced sampling distribution and enhances the scene and class diversity of generated triplets. Experiments show that UniTriGen generates high-quality aligned triplets from limited real paired data, thereby achieving consistent performance improvements across various RGB-T semantic segmentation models.

Chinese Translation

RGB-T语义分割需要严格对齐的可见-红外-标签（VIS-IR-Label）三元组；然而，在现实场景中，这种对齐的三元组数据往往稀缺。现有的生成增强方法通常采用级联生成范式，将联合三元组生成分解为局部条件过程。因此，VIS、IR和标签在空间结构、语义内容和跨模态细节上的一致性无法可靠维持。为了解决这一问题，我们提出了UniTriGen，一个统一的三元组生成框架，能够在文本提示的指导下直接生成空间对齐、语义一致和模态互补的VIS-IR-Label三元组。UniTriGen首先引入了一种统一的三元组生成机制，其中VIS、IR和标签共同编码到一个共享的潜在空间，并通过扩散过程建模，以强制执行全局跨模态一致性。此外，轻量级的模态特定残差适配器被进一步集成到该机制中，以适应模态特定的成像特征和输出格式。为了减轻由于有限配对三元组中场景和类别分布不平衡所造成的生成偏差，UniTriGen还采用了一种场景平衡和类别感知的少样本采样策略，从而诱导出更平衡的采样分布，并增强生成三元组的场景和类别多样性。实验表明，UniTriGen能够从有限的真实配对数据中生成高质量的对齐三元组，从而在各种RGB-T语义分割模型中实现一致的性能提升。

cs.CV / 68 / 2605.14635

MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models

MultiEmo-Bench：多标签视觉情感分析用于多模态大型语言模型

Chen, Tianwei, Furusawa, Takuya, Hirakawa, Yuki, Shimizu, Ryotaro, Fan, Mo, Wada, Takashi

Abstract

This paper introduces a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating the ability of multimodal large language models (MLLMs) to predict the emotions evoked by images. Recent user studies report an unintuitive finding: humans may prefer the predictions of MLLMs over the labels in existing datasets. We argue that this phenomenon stems from the suboptimal annotation scheme used in existing datasets, where each annotator is shown a single candidate emotion for each image and judges whether it is evoked or not. This approach is clearly limited because a single image can evoke multiple emotions with varying intensities. As a result, evaluations based on these datasets may underestimate the capabilities of MLLMs, yet an appropriate benchmark for evaluating such models remains lacking. To address this issue, we introduce a new multi-label benchmark dataset for visual emotion analysis, aimed at MLLM evaluation. We hire 20 annotators per image and ask them to select all emotions they feel from an image. Then, we aggregate the votes across all annotators, providing a more reliable and representative dataset labeled with a distribution of emotions. The resulting dataset contains 10,344 images with 236,998 valid votes across eight emotions. Based on this benchmark dataset, we evaluate several recent models, including Qwen3-VL, OpenAI's GPT, Gemini, and Claude. We assess model performance on both dominant emotion prediction and emotion distribution prediction. Our results demonstrate the progress achieved by recent MLLMs while also indicating that substantial room for improvement remains. Furthermore, our experiments with LLM-as-a-judge show that the method does not consistently improve MLLMs' performance, indicating its limitations for the subjective task of visual emotion analysis.
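
The vote aggregation step might look like the sketch below. The abstract does not name the eight emotions or give the exact normalization, so the emotion labels and the choice to normalize by total votes are assumptions, and both function names are hypothetical.

```python
from collections import Counter


def emotion_distribution(annotator_selections):
    """Turn per-annotator multi-label selections into a vote distribution."""
    votes = Counter()
    for selected in annotator_selections:
        # Each annotator contributes at most one vote per emotion.
        votes.update(set(selected))
    total = sum(votes.values())
    if total == 0:
        return {}
    return {emotion: count / total for emotion, count in votes.items()}


def dominant_emotion(distribution):
    """Emotion with the largest share of votes."""
    return max(distribution, key=distribution.get)


# Four hypothetical annotators labeling one image.
selections = [["joy", "awe"], ["joy"], ["awe", "joy"], ["sadness"]]
dist = emotion_distribution(selections)
top = dominant_emotion(dist)
```

The two outputs map onto the two evaluation settings in the abstract: `top` for dominant emotion prediction and `dist` for emotion distribution prediction.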

Chinese Translation

本文介绍了一个多标签视觉情感分析基准数据集，用于全面评估多模态大型语言模型（MLLMs）预测图像所唤起情感的能力。最近的用户研究报告了一个不直观的发现：人类可能更倾向于选择MLLMs的预测结果，而非现有数据集中的标签。我们认为，这一现象源于现有数据集中使用的次优标注方案，其中每位标注者仅看到每幅图像的一个候选情感，并判断其是否被唤起。这种方法显然存在局限性，因为单幅图像可能唤起多种情感，且其强度各异。因此，基于这些数据集的评估可能低估了MLLMs的能力，而用于评估此类模型的适当基准仍然缺乏。为了解决这一问题，我们引入了一个新的多标签基准数据集，用于视觉情感分析以评估MLLMs。我们为每幅图像聘请了20名标注者，并要求他们选择从图像中感受到的所有情感。然后，我们汇总所有标注者的投票，提供一个更可靠和具有代表性的数据集，标注了情感分布。最终数据集包含10,344幅图像，涵盖236,998个有效投票，涉及八种情感。基于该基准数据集，我们评估了包括Qwen3-VL、OpenAI的GPT、Gemini和Claude在内的几种最新模型。我们评估了模型在主导情感预测和情感分布预测方面的表现。我们的结果展示了最近MLLMs取得的进展，同时也表明仍有显著的改进空间。此外，我们对LLM作为评判者的实验表明，该方法并未始终提高MLLMs的性能，显示出其在视觉情感分析这一主观任务中的局限性。

cs.CV / 69 / 2605.14641

How to Evaluate and Refine your CAM

如何评估和优化您的类归属图（CAM）

Domeniconi, Luca, Stramiglio, Alessandra, Lombardi, Michele, Salti, Samuele

Abstract

Class attribution maps (CAMs) provide local explanations for the decisions of convolutional neural networks. While widely used in practice, the evaluation of CAMs remains challenging due to the lack of ground-truth explanations, making it difficult to evaluate the soundness of existing metrics. Independently, most commonly used CAM methods produce low-resolution attribution maps, which limits their usefulness for detailed interpretability. To address the evaluation challenge, we introduce a synthetic dataset with ground-truth attributions that enables a rigorous comparison of CAM evaluation metrics. Using this dataset, we analyze existing metrics and propose ARCC, a new composite metric that more reliably identifies faithful explanations. To address the low resolution issue, we introduce RefineCAM, a method that produces high-resolution attribution maps by aggregating CAMs across multiple network layers. Our results show that RefineCAM consistently outperforms existing methods according to the proposed evaluation.

Chinese Translation

类归属图（Class Attribution Maps, CAMs）为卷积神经网络的决策提供了局部解释。尽管在实践中被广泛使用，但由于缺乏真实的解释，CAM的评估仍然面临挑战，这使得评估现有指标的有效性变得困难。此外，大多数常用的CAM方法独立生成低分辨率的归属图，这限制了其在详细可解释性方面的实用性。为了解决评估挑战，我们引入了一个具有真实归属的合成数据集，使得CAM评估指标的严格比较成为可能。利用该数据集，我们分析了现有指标，并提出了ARCC，这是一种新的复合指标，能够更可靠地识别真实的解释。为了解决低分辨率问题，我们引入了RefineCAM，这是一种通过聚合多个网络层的CAM生成高分辨率归属图的方法。我们的结果表明，RefineCAM在提出的评估标准下始终优于现有方法。

cs.CV / 70 / 2605.14645

Vision-Based Water Level and Flow Estimation

基于视觉的水位和流量估计

Sun, ZhiXin

Abstract

With the rapid evolution of computer vision, vision-based methodologies for water level and river surface velocity estimation have reached significant maturity. Compared to traditional sensing, these techniques offer superior interpretability, automated data archiving, and enhanced system robustness. However, challenges such as environmental sensitivity, limited precision, and complex site calibration persist. This work proposes an integrated framework that synergizes state-of-the-art (SOTA) vision models with statistical modeling. By leveraging physical priors and robust filtering strategies, we improve the accuracy of water level detection and flow estimation. Code will be available at https://github.com/sunzx97/Vision_Based_Water_Level_and_Flow_Estimation.git

Chinese Translation

随着计算机视觉的快速发展，基于视觉的方法在水位和河面流速估计方面已达到显著的成熟度。与传统传感技术相比，这些技术提供了更优的可解释性、自动化的数据归档和增强的系统鲁棒性。然而，环境敏感性、精度有限和复杂的现场校准等挑战依然存在。本研究提出了一个集成框架，将最先进的（SOTA）视觉模型与统计建模相结合。通过利用物理先验和鲁棒过滤策略，我们提高了水位检测和流量估计的准确性。代码将可在 https://github.com/sunzx97/Vision_Based_Water_Level_and_Flow_Estimation.git 获取。

cs.CV / 71 / 2605.14651

TERRA-CD: Multi-Temporal Framework for Multi-class and Semantic Change Detection

TERRA-CD：多时相框架用于多类别和语义变化检测

Oak, Omkar, Nazre, Rukmini, Budke, Rujuta, Sawant, Suraj

Abstract

Urban vegetation monitoring plays a vital role in understanding environmental changes, yet comprehensive datasets for this purpose remain limited. To address this gap, we present the Temporal Remote-sensing Repository for Analyzing Change Detection (TERRA-CD), a benchmark dataset comprising 5,221 Sentinel-2 image pairs from 2019 and 2024, covering 232 cities across the USA and Europe. The dataset features three distinct annotation schemes: 4-class land cover mapping masks, 3-class vegetation change masks, and 13-class semantic change masks capturing all possible land cover transitions. Using various deep learning approaches including Siamese networks, STANet variants, Bi-SRNet, Changemask, Post-Classification Comparison, and HRSCD strategies, we evaluated the dataset's effectiveness for both vegetation Multi-class Change Detection and Semantic Change Detection. The proposed dataset and methods are available at https://github.com/omkarsoak/TERRA-CD.
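
One way to see how 4 land cover classes can yield 13 semantic change classes is to count 4 x 3 = 12 ordered transitions between different classes plus one "no change" label. That reading, the label numbering, and the function names below are assumptions for illustration; the abstract does not specify the encoding. The sketch compares two land cover masks pixel by pixel, in the spirit of post-classification comparison.

```python
NUM_CLASSES = 4  # land cover classes; the class order is hypothetical


def transition_label(before, after, num_classes=NUM_CLASSES):
    """Map a (before, after) land cover pair to one semantic change label."""
    if before == after:
        return 0  # no change
    # Position of `after` among the classes that differ from `before`.
    offset = after if after < before else after - 1
    return 1 + before * (num_classes - 1) + offset


def semantic_change_mask(mask_before, mask_after):
    """Pixel-wise comparison of two land cover masks (lists of rows)."""
    return [
        [transition_label(b, a) for b, a in zip(row_b, row_a)]
        for row_b, row_a in zip(mask_before, mask_after)
    ]


all_labels = {
    transition_label(b, a)
    for b in range(NUM_CLASSES)
    for a in range(NUM_CLASSES)
}
example = semantic_change_mask([[0, 1], [2, 3]], [[0, 0], [2, 2]])
```

Under this encoding every ordered pair of distinct classes gets its own label, so `all_labels` contains exactly 13 values.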

Chinese Translation

城市植被监测在理解环境变化中发挥着至关重要的作用，但用于此目的的综合数据集仍然有限。为了解决这一问题，我们提出了用于变化检测分析的时序遥感库（TERRA-CD），这是一个基准数据集，包含2019年和2024年间的5,221对Sentinel-2影像，覆盖美国和欧洲的232个城市。该数据集具有三种不同的标注方案：4类土地覆盖映射掩膜、3类植被变化掩膜和13类语义变化掩膜，捕捉所有可能的土地覆盖转变。我们使用包括Siamese网络、STANet变体、Bi-SRNet、Changemask、后分类比较和HRSCD策略在内的多种深度学习方法，评估了该数据集在植被多类别变化检测和语义变化检测方面的有效性。所提出的数据集和方法可在https://github.com/omkarsoak/TERRA-CD获取。

cs.CV / 72 / 2605.14654

Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging

超越实例级自监督的3D多模态医学成像

Pan, Tan, Mei, Shuhao, Sun, Yixuan, Guo, Kaiyu, Jiang, Chen, Tan, Zhaorui, Li, Mengzhu, Han, Limei, Zou, Xiang, Cheng, Yuan, Baktashmotlagh, Mahsa

Abstract

Self-supervised pre-training methods in medical imaging typically treat each individual as an isolated instance, learning representations through augmentation-based objectives or masked reconstruction. They often do not adequately capitalize on a key characteristic of physiological features: anatomical structures maintain consistent spatial relationships across individuals (instances), such as the thalamus being medial to the basal ganglia, regardless of variations in brain size, shape, or pathology. We propose leveraging this cross-instance topological consistency as a supervisory signal. The challenge arises from the inherent variability in medical imaging, which can differ significantly across instances and modalities. To tackle this, we focus on two alignment regimes. (i) Intra-instance: with pixel-level correspondences available, a cross-modal triplet objective explicitly preserves local neighborhood topology. (ii) Inter-instance: without such supervision, we derive pseudo-correspondences to control partial neighborhood alignment and prevent topology collapse across modalities. We validate our approach across 7 downstream multi-modal tasks, achieving average improvements of 1.1% and 5.94% in segmentation and classification tasks, respectively, and demonstrating significantly better robustness when modalities are missing at test time.
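
The intra-instance triplet objective can be sketched with the standard margin-based triplet loss. The abstract does not give the exact formulation, so the Euclidean distance, the `margin` value, and the function names are assumptions. Here the anchor is a feature from one modality, the positive is the feature at the corresponding location in another modality, and the negative is a feature from a non-corresponding location.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cross_modal_triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Corresponding location is already close, unrelated location is far: no penalty.
satisfied = cross_modal_triplet_loss([0.0, 0.0], [0.1, 0.0], [3.0, 4.0])
# Corresponding location is far, unrelated location coincides with the anchor: penalized.
violated = cross_modal_triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```

A loss of this shape preserves local neighborhoods because it only constrains relative distances, not absolute feature values.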

Chinese Translation

医学成像中的自监督预训练方法通常将每个个体视为孤立实例，通过基于增强的目标或掩蔽重建学习表征。它们往往没有充分利用生理特征的一个关键特性：解剖结构在个体（实例）之间保持一致的空间关系，例如，丘脑在基底神经节的内侧，无论脑的大小、形状或病理如何变化。我们提出利用这种跨实例的拓扑一致性作为监督信号。挑战在于医学成像固有的变异性，这在实例和模态之间可能存在显著差异。为此，我们关注两个对齐机制。(i) 实例内：在像素级对应关系可用的情况下，跨模态三元组目标明确保持局部邻域拓扑。(ii) 实例间：在没有这种监督的情况下，我们推导伪对应关系以控制部分邻域对齐，并防止跨模态的拓扑崩溃。我们在7个下游多模态任务中验证了我们的方法，在分割和分类任务中分别实现了1.1%和5.94%的平均提升，并在测试时模态缺失时表现出显著更好的鲁棒性。

cs.CV / 73 / 2605.14664

MiVE: Multiscale Vision-language features for reference-guided video Editing

MiVE：用于参考引导视频编辑的多尺度视觉-语言特征

Wang, Tong, Zou, Meng, Wu, Chengjing, Qu, Xiaochao, Liu, Luoqi, Hu, Xiaolin, Liu, Ting

Abstract

Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically -- early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that repurposes VLMs as multiscale feature extractors. MiVE extracts hierarchical features from Qwen3-VL and integrates them into a unified self-attention Diffusion Transformer, eliminating the modality mismatch inherent in cross-attention designs. Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.

Chinese Translation

参考引导视频编辑以源视频、文本指令和参考图像为输入，要求模型在忠实应用指令编辑的同时，保留原始运动和未编辑内容。现有方法分为两种范式，各自存在固有的局限性：解耦编码器在独立处理指令和视觉内容时遭遇模态差距，而统一的视觉-语言编码器则因仅依赖最终层表示而丧失细粒度的空间细节。我们观察到，视觉-语言模型（VLM）层以层次化的方式编码互补信息——早期层捕捉精确编辑所需的局部空间细节，而更深层则编码用于理解指令的全局语义。基于这一观察，我们提出了MiVE（用于参考引导视频编辑的多尺度视觉-语言特征）框架，该框架将VLM重新用于多尺度特征提取器。MiVE从Qwen3-VL中提取层次特征，并将其整合到统一的自注意力扩散变换器中，从而消除跨注意力设计中固有的模态不匹配。实验表明，MiVE在人工偏好排名中表现出色，超越了学术方法和商业系统，达到了最先进的性能。

cs.CV / 74 / 2605.14689

Are Candidate Models Really Needed for Active Learning?

候选模型在主动学习中真的必要吗？

Mohan, Harshini Mridula, Manjunath, Maanya, Arya, Vipul, Basha, S. H. Shabbeer, Cheekatla, Nitin

Abstract

Deep learning has profoundly impacted domains such as computer vision and natural language processing by uncovering complex patterns in vast datasets. However, the reliance on extensive labeled data poses significant challenges, including resource constraints and annotation errors, particularly when training Convolutional Neural Networks (CNNs) and transformers, which have large numbers of parameters. Active learning offers a promising solution to reduce labeling burdens by strategically selecting the most informative samples for annotation. However, current active learning frameworks are time-intensive because they select samples iteratively with the help of initial candidate models. This study investigates the feasibility of using CNNs and transformers with randomly initialized weights, eliminating the need for initial candidate models while achieving results comparable to active learning frameworks that depend on such candidate models. We evaluate three confidence-based sampling strategies: high confidence (HC), low confidence (LC), and a combination of high confidence in the early stages of training and low confidence at later stages of training (HCLC). Among these, LC demonstrated the best performance in most of our experiments, showcasing its effectiveness as an active learning strategy without the need for candidate models. Further, extensive experiments verify the robustness of the proposed active learning methods. By challenging traditional frameworks, the proposed work introduces a streamlined approach to active learning, advancing efficiency and flexibility across diverse datasets and domains.
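
The three sampling strategies can be sketched as a single selection function. The abstract does not say how confidence is measured or when HCLC switches, so the use of a per-sample confidence score (for example the maximum softmax probability), the `switch_epoch` parameter, and the function name are assumptions.

```python
def select_samples(confidences, budget, strategy, epoch=0, switch_epoch=10):
    """Pick `budget` unlabeled sample ids to annotate.

    confidences: dict mapping sample id -> model confidence in [0, 1].
    strategy: "HC" (most confident first), "LC" (least confident first),
              or "HCLC" (HC before `switch_epoch`, LC afterwards).
    """
    if strategy == "HCLC":
        strategy = "HC" if epoch < switch_epoch else "LC"
    if strategy not in ("HC", "LC"):
        raise ValueError(f"unknown strategy: {strategy}")
    ranked = sorted(confidences, key=confidences.get, reverse=(strategy == "HC"))
    return ranked[:budget]


scores = {"a": 0.90, "b": 0.40, "c": 0.60, "d": 0.99}
high = select_samples(scores, 2, "HC")
low = select_samples(scores, 2, "LC")
early = select_samples(scores, 2, "HCLC", epoch=0)
late = select_samples(scores, 2, "HCLC", epoch=20)
```

In the setting the abstract describes, the confidences would come from the model being trained (starting from random weights) rather than from a separately trained candidate model.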

Chinese Translation

深度学习在计算机视觉和自然语言处理等领域产生了深远的影响，通过揭示庞大数据集中复杂的模式。然而，对大量标注数据的依赖带来了显著的挑战，包括资源限制和标注错误，尤其是在训练卷积神经网络（CNNs）和变换器（transformers）时，由于参数数量较多，这些问题更加突出。主动学习通过战略性地选择最具信息量的样本进行标注，提供了减少标注负担的有希望的解决方案。然而，当前的主动学习框架是时间密集型的，它们依赖于初始候选模型迭代选择样本。本研究探讨了使用随机初始化权重的CNN和变换器的可行性，从而消除了对初始候选模型的需求，同时实现了与依赖于候选模型的主动学习框架相当的结果。我们评估了三种基于置信度的采样策略：高置信度（HC）、低置信度（LC）以及在训练早期阶段采用高置信度而在后期阶段采用低置信度的组合策略（HCLC）。在这些策略中，LC在我们的实验中表现最佳，展示了其作为一种主动学习策略的有效性，无需候选模型。此外，大量实验验证了所提主动学习方法的鲁棒性。通过挑战传统框架，所提工作引入了一种简化的主动学习方法，在不同数据集和领域中提高了效率和灵活性。

cs.CV / 75 / 2605.14696

EponaV2: Driving World Model with Comprehensive Future Reasoning

EponaV2：通过全面的未来推理驱动世界模型

Xu, Jiawei, Zhong, Zhizhou, Shu, Zhijian, Jia, Mingkai, Li, Mingxiao, Bian, Jia-Wang, Zhang, Qian, Zhang, Kaicheng, Xie, Jin, Yang, Jian, Yin, Wei

Abstract

Data scaling plays a pivotal role in the pursuit of general intelligence. However, the prevailing perception-planning paradigm in autonomous driving relies heavily on expensive manual annotations to supervise trajectory planning, which severely limits its scalability. Conversely, although existing perception-free driving world models achieve impressive driving performance, their real-world reasoning ability for planning is solely built on next frame image forecasting. Due to the lack of sufficient supervision, these models often struggle with comprehensive scene understanding, resulting in unsatisfactory trajectory planning. In this paper, we propose EponaV2, a novel paradigm of driving world models, which achieves high-quality planning with comprehensive future reasoning. Inspired by how human drivers anticipate 3D geometry and semantics, we train our model to forecast more comprehensive future representations, which can be additionally decoded to future geometry and semantic maps. Extracting the 3D and semantic modalities enables our model to deeply understand the surrounding environment, and the future prediction task significantly enhances the real-world reasoning capabilities of EponaV2, ultimately leading to improved trajectory planning. Moreover, inspired by the training recipe of Large Language Models (LLMs), we introduce a flow matching group relative policy optimization mechanism to further improve planning accuracy. The state-of-the-art (SOTA) performance of EponaV2 among perception-free models on three NAVSIM benchmarks (+1.3 PDMS, +5.5 EPDMS) demonstrates the effectiveness of our methods.
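
The "group relative" part of group relative policy optimization is commonly implemented by normalizing each sampled candidate's reward against the other candidates in its group. The sketch below shows only that generic normalization step. How EponaV2 combines it with flow matching is not described in the abstract, and the function name and reward values are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled candidates.

    Candidates better than the group average get a positive advantage,
    worse ones a negative advantage.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical planning scores for three trajectories sampled for one scene.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Because the baseline is the group mean, no separate value network is needed to estimate how good a candidate is.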

Chinese Translation

数据扩展在追求通用智能中发挥着关键作用。然而，当前自主驾驶中的感知-规划范式严重依赖昂贵的手动标注来监督轨迹规划，这极大限制了其可扩展性。相反，尽管现有的无感知驾驶世界模型在驾驶性能上取得了令人印象深刻的成果，但它们在规划中的现实世界推理能力仅建立在下一帧图像预测的基础上。由于缺乏足够的监督，这些模型往往在全面场景理解方面表现不佳，导致轨迹规划不尽如人意。本文提出了EponaV2，这是一种新颖的驾驶世界模型范式，能够通过全面的未来推理实现高质量的规划。受到人类驾驶员如何预测三维几何和语义的启发，我们训练模型以预测更全面的未来表示，这些表示可以进一步解码为未来的几何和语义地图。提取三维和语义模态使我们的模型能够深入理解周围环境，而未来预测任务显著增强了EponaV2的现实世界推理能力，最终改善了轨迹规划。此外，受到大型语言模型（LLMs）训练方法的启发，我们引入了一种流匹配组相对策略优化机制，以进一步提高规划准确性。EponaV2在三个NAVSIM基准测试中的无感知模型中表现出最先进的（SOTA）性能（+1.3PDMS，+5.5EPDMS），证明了我们方法的有效性。

cs.CV / 76 / 2605.14703

Generating HDR Video from SDR Video

从标准动态范围视频生成高动态范围视频

Tedla, SaiKiran, Banterle, Francesco, Canham, Trevor, Raja, Karanpreet, Lindell, David B., Kutulakos, Kiriakos N., Li, Jiacheng, Li, Feiran, Iso, Daisuke

Abstract

The high dynamic range (HDR) video ecosystem is approaching maturity, but the problem of upconverting legacy standard dynamic range (SDR) videos persists without a convincing solution. We propose a framework for HDR video synthesis from casual SDR footage by leveraging large-scale generative video models. We introduce a Multi-Exposure Video Model (MEVM) that can predict exposure-bracketed linear SDR video sequences from a single nonlinear SDR video input. We further propose a learnable Video Merging Model (VMM) that merges the predicted exposure-bracketed video into a high-quality HDR sequence while preserving detail in both shadows and highlights. Extensive experiments, quantitative and qualitative evaluation, and a user study demonstrate that our approach enables robust HDR conversion for in-the-wild examples from casual consumer videos and even iconic films. Finally, our model can support HDR synthesis pipelines built upon existing SDR generative video models. Output HDR videos can be viewed on our supplementary webpage: sdr2hdrvideo.github.io
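
For intuition about what merging an exposure bracket involves, here is a classical, non-learned weighted merge of linear frames. It is not the paper's learnable Video Merging Model, only a hand-written baseline under stated assumptions: pixel values are linear and lie in [0, 1], exposure times are known, and the hat-shaped weighting and function names are illustrative choices.

```python
def hat_weight(value):
    """Trust mid-range pixels most, near-black and clipped pixels least."""
    return max(1e-4, 1.0 - abs(2.0 * value - 1.0))


def merge_exposures(frames, exposure_times):
    """Merge linear exposure-bracketed frames into relative radiance.

    frames: list of frames, each a flat list of linear pixel values in [0, 1].
    exposure_times: one relative exposure time per frame.
    """
    merged = []
    for pixel_values in zip(*frames):
        # Each frame gives a radiance estimate value / time, weighted by reliability.
        num = sum(hat_weight(v) * v / t for v, t in zip(pixel_values, exposure_times))
        den = sum(hat_weight(v) for v in pixel_values)
        merged.append(num / den)
    return merged


# Simulate a shadow pixel and a highlight pixel under three exposures with clipping.
times = [0.125, 1.0, 4.0]
radiance = [0.2, 4.0]
frames = [[min(1.0, r * t) for r in radiance] for t in times]
hdr = merge_exposures(frames, times)
```

The highlight pixel clips in the two longer exposures, so its value is recovered almost entirely from the shortest one. This is why predicting a full bracket from a single SDR input is the harder part of the problem.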

Chinese Translation

高动态范围（HDR）视频生态系统正接近成熟，但将传统标准动态范围（SDR）视频转换为HDR视频的问题依然存在，且尚无令人信服的解决方案。我们提出了一种从普通SDR视频素材合成HDR视频的框架，利用大规模生成视频模型。我们引入了一种多曝光视频模型（Multi-Exposure Video Model, MEVM），该模型能够从单一非线性SDR视频输入预测曝光分级的线性SDR视频序列。我们进一步提出了一种可学习的视频合并模型（Video Merging Model, VMM），该模型将预测的曝光分级视频合并为高质量的HDR序列，同时保留阴影和高光中的细节。大量实验、定量和定性评估以及用户研究表明，我们的方法能够实现鲁棒的HDR转换，适用于来自普通消费者视频甚至经典电影的真实场景。最后，我们的模型可以支持基于现有SDR生成视频模型构建的HDR合成管道。输出的HDR视频可在我们的补充网页上查看：sdr2hdrvideo.github.io

cs.CV / 77 / 2605.14704

SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

SceneFunRI：为任务驱动的功能性物体定位推理不可见对象

Chen, Posheng, Cheng, Powen, Faure, Gueter Josmy, Su, Hung-Ting, Hsu, Winston H.

Abstract

In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) only achieves a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.

Chinese Translation

在现实场景中，目标物体可能位于不可见的区域。虽然人类通常能够通过上下文和常识知识推断被遮挡物体的位置，但这一能力对于视觉-语言模型（VLMs）仍然是一个重大挑战。为了解决这一问题，我们引入了SceneFunRI，这是一个用于推理不可见对象的基准测试。基于SceneFun3D数据集，SceneFunRI将任务表述为通过半自动化流程进行的二维空间推理问题，并包含855个实例。该基准要求模型根据任务指令和常识推理推断不可见功能性物体的位置。最强基线模型（Gemini 3 Flash）仅在CAcc@75上达到了15.20，在mIoU上为0.74，Dist为28.65。我们将提示分析分为三类：强指令提示、基于推理的提示和空间消除过程（SPoE）。这些发现表明，在当前的VLMs中，对不可见区域的推理仍然是一种不稳定的能力，这激励着未来的研究工作，旨在开发更紧密集成任务意图、常识先验、空间基础和不确定性感知搜索的模型。

cs.CV / 78 / 2605.14705

Towards Continuous Sign Language Conversation from Isolated Signs

从孤立手势到连续手语对话的探索

Kim, Youngmin, Choo, Kyobin, Park, Jiwoo, Kim, Minseo, Kim, Chanyoung, Kim, Junhyeok, Hwang, Seong Jae

Abstract

Sign language is the primary language for many Deaf and Hard-of-Hearing (DHH) signers, yet most conversational AI systems still mediate interaction through spoken or written language. This spoken-language-centered interface can limit access for signers for whom spoken or written language is not the most accessible medium, motivating direct sign-to-sign conversational modeling. However, sentence-level sign video data are expensive to collect and annotate, leaving existing sign translation and production models with limited vocabulary coverage and weak open-domain generalization. We address this bottleneck by constructing continuous sign conversations from isolated signs: large-scale labeled isolated clips are collected as lexically grounded motion primitives and recomposed into sign-language-ordered utterances derived from existing dialogue corpora. We introduce SignaVox-W, which provides, to our knowledge, the largest labeled isolated-sign vocabulary to date, and SignaVox-U, a continuous 3D sign conversation dataset built from SignaVox-W. To bridge structural mismatch between spoken and signed languages, we use a retrieval-guided spoken-to-gloss translator; to bridge independently collected isolated clips, we propose BRAID, a diffusion Transformer that performs duration alignment and co-articulatory boundary inpainting. With the resulting data, we train SignaVox, a direct sign-to-sign conversational model that generates 3D body, hand, and facial motion responses from prior signing context without spoken-language text or externally provided glosses at inference time. Quantitative and qualitative evaluations show improved isolated-to-continuous motion quality, stronger response-level semantic alignment, and scalable signer-centered interaction that better supports visual-spatial articulation.

Chinese Translation

手语是许多聋人和听力障碍者（DHH）使用的主要语言，但大多数对话式人工智能系统仍通过口语或书面语言进行交互。这种以口语为中心的界面可能限制了那些口语或书面语言并非最易接触媒介的手语使用者的访问，促使我们直接进行手语到手语的对话建模。然而，句子级手势视频数据的收集和标注成本高昂，导致现有的手语翻译和生成模型在词汇覆盖和开放领域泛化方面存在局限。我们通过从孤立手势构建连续手语对话来解决这一瓶颈：收集大规模标注的孤立片段作为词汇基础的运动原语，并重新组合成基于现有对话语料库的手语顺序发话。我们介绍了SignaVox-W，这是迄今为止我们所知的最大标注孤立手势词汇，以及SignaVox-U，这是基于SignaVox-W构建的连续3D手语对话数据集。为了弥合口语和手语之间的结构不匹配，我们使用了检索引导的口语到手语翻译器；为了弥合独立收集的孤立片段，我们提出了BRAID，这是一种扩散Transformer，能够进行时长对齐和共发音边界修复。利用这些数据，我们训练了SignaVox，这是一种直接的手语到手语对话模型，能够根据先前的手语上下文生成3D身体、手部和面部运动反应，而无需在推理时使用口语文本或外部提供的手语翻译。定量和定性评估显示，孤立到连续运动质量有所改善，响应级语义对齐更强，且可扩展的以手语使用者为中心的交互更好地支持视觉空间表达。

cs.CV / 79 / 2605.14708

StyleTextGen: Style-Conditioned Multilingual Scene Text Generation

StyleTextGen：风格条件的多语言场景文本生成

Chen, Zeyu, Zhao, Fangmin, Shu, Yan, Liu, Yichao, Yu, Liu, Zhou, Yu

Abstract

Style-conditioned scene text generation faces unique challenges in extracting precise text styles from complex backgrounds and maintaining fine-grained style consistency across characters, especially for multilingual scripts. We propose StyleTextGen, a novel framework that learns to perceive and replicate visual text styles across different languages and writing systems. Our approach features three key contributions: First, we introduce a dual-branch style encoder dedicated to style modeling, yielding robust multilingual text style representations in complex real-world scenes. Second, we design a text style consistency loss that enhances style coherence and improves overall visual quality. Third, we develop a mask-guided inference strategy that ensures precise style alignment between generated and reference text. To facilitate systematic evaluation, we construct StyleText-CE, a bilingual scene text style benchmark covering both monolingual and cross-lingual settings. Extensive experiments demonstrate that StyleTextGen significantly outperforms existing methods in style consistency and cross-lingual generalization, establishing new state-of-the-art performance in multilingual style-conditioned text generation.

Chinese Translation

风格条件的场景文本生成在从复杂背景中提取精确文本风格以及在字符间保持细粒度风格一致性方面面临独特挑战，尤其是在多语言脚本中。我们提出了StyleTextGen，一个新颖的框架，旨在学习感知和复制不同语言和书写系统中的视觉文本风格。我们的方法具有三个关键贡献：首先，我们引入了一个专门用于风格建模的双分支风格编码器，能够在复杂的现实场景中生成稳健的多语言文本风格表示。其次，我们设计了一种文本风格一致性损失，增强了风格的连贯性并改善了整体视觉质量。第三，我们开发了一种基于掩码的推理策略，确保生成文本与参考文本之间的精确风格对齐。为了便于系统评估，我们构建了StyleText-CE，这是一个涵盖单语和跨语言设置的双语场景文本风格基准。大量实验表明，StyleTextGen在风格一致性和跨语言泛化方面显著优于现有方法，在多语言风格条件文本生成中建立了新的最先进的性能。


cs.CV / 80 / 2605.14709

Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners

打破双重瓶颈：将统一多模态模型演化为自适应交错视觉推理器

Liu, Qingyang, Gao, Bingjie, Fu, Canmiao, Huang, Zhipeng, Li, Chen, Wang, Feng, Chang, Shuochen, Wang, Shaobo, Wang, Yali, Ye, Keming, Li, Jiangtong, Niu, Li

Abstract

Recent unified models integrate multimodal understanding and generation within a single framework. However, an "understanding-generation gap" persists: models can capture user intent but often fail to translate this semantic knowledge into precise pixel-level manipulation. This gap results in two bottlenecks in the anything-to-image (X2I) task: the attention entanglement bottleneck, where blind planning struggles with complex prompts, and the visual refinement bottleneck, where unstructured feedback fails to correct imperfections efficiently. In this paper, we propose a novel framework that empowers unified models to autonomously switch between generation strategies based on instruction complexity and model capability. To achieve this, we build a hierarchical data pipeline that constructs execution paths across three adaptive modes: direct generation for simple cases, self-reflection for quality refinement, and multi-step planning for decomposing complex scenarios. Building on this pipeline, we contribute a high-quality dataset with over 50,000 samples and implement a two-stage training strategy comprising SFT and RL. Specifically, we design step-wise reasoning rewards to ensure logical consistency and an intra-group complexity penalty to prevent redundant computational overhead. Extensive experiments demonstrate that our method outperforms existing baselines on X2I, achieving superior generation fidelity across simple-to-complex instructions. The code is released at https://github.com/WeChatCV/Interleaved_Visual_Reasoner.

Chinese Translation

近期的统一模型在单一框架内整合了多模态理解与生成。然而，仍然存在“理解-生成差距”，即模型能够捕捉用户意图，但往往无法将这种语义知识转化为精确的像素级操作。该差距导致了在任何到图像任务（X2I）中出现两个瓶颈：注意力纠缠瓶颈，在复杂提示下盲目规划面临困难，以及视觉细化瓶颈，非结构化反馈未能有效纠正缺陷。本文提出了一种新颖的框架，使统一模型能够根据指令复杂性和模型能力自主切换生成策略。为此，我们构建了一个分层数据管道，构建了三个自适应模式下的执行路径：简单案例的直接生成、质量细化的自我反思以及复杂场景的多步骤规划。在此基础上，我们贡献了一个包含超过50,000个样本的高质量数据集，并实施了包含SFT和RL的两阶段训练策略。具体而言，我们设计了逐步推理奖励以确保逻辑一致性，并引入组内复杂性惩罚以防止冗余计算开销。大量实验表明，我们的方法在X2I上优于现有基线，在简单到复杂的指令中实现了更高的生成保真度。代码已发布在 https://github.com/WeChatCV/Interleaved_Visual_Reasoner。


cs.CV / 81 / 2605.14710

Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke

基于视觉核心引导的对比学习用于平衡多模态中风预后预测

Chen, Liren, Sun, Lidong, Huang, Mingyan, Tang, Junzhe, Zhu, Yinghui, Wang, Guanjie, Xia, Yiqing, Xiao, Ting

Abstract

Deep learning and multi-modal fusion have demonstrated transformative potential in medical diagnosis by integrating diverse data sources. However, accurate prognosis for ischemic stroke remains challenging due to limitations in existing multi-modal approaches. First, current methods are predominantly confined to dual-modal fusion, lacking a framework that effectively integrates the trifecta of medical images, structured clinical data, and unstructured text. Second, they often fail to establish deep bidirectional interactions between modalities. To address these critical gaps, this paper proposes a novel tri-modal fusion model for ischemic stroke prognosis. Our approach first enriches the data representation by employing a Large Language Model (LLM) to automatically generate semi-structured diagnostic text from brain MRIs. This process not only addresses the scarcity of expert annotations but also serves as a regularized semantic enhancement, improving multimodal fusion robustness. Furthermore, we design a core component termed the Vision-Conditioned Dual Alignment Fusion Module (VDAFM), which strategically uses visual features as a conditional prior to guide fine-grained interaction with the generated text. This module achieves a dynamic and profound fusion through a dual semantic alignment loss, effectively mitigating modal heterogeneity. Extensive experiments on a real-world clinical dataset demonstrate that our model achieves state-of-the-art performance.

Chinese Translation

深度学习和多模态融合在医学诊断中展示了变革性的潜力，通过整合多样的数据源。然而，由于现有多模态方法的局限性，缺乏对缺血性中风的准确预后仍然是一个挑战。首先，当前的方法主要局限于双模态融合，缺乏有效整合医学影像、结构化临床数据和非结构化文本三者的框架。其次，它们往往未能在模态之间建立深度双向交互。为了解决这些关键问题，本文提出了一种新颖的三模态融合模型用于缺血性中风的预后。我们的方法首先通过使用大型语言模型（Large Language Model, LLM）从脑部MRI自动生成半结构化的诊断文本来丰富数据表示。这个过程不仅解决了专家注释稀缺的问题，还作为一种正则化的语义增强，提高了多模态融合的鲁棒性。此外，我们设计了一个核心组件，称为视觉条件双重对齐融合模块（Vision-Conditioned Dual Alignment Fusion Module, VDAFM），该模块战略性地使用视觉特征作为条件先验，以引导与生成文本的细粒度交互。该模块通过双重语义对齐损失实现动态而深刻的融合，有效缓解了模态异质性。在真实世界临床数据集上的广泛实验表明，我们的模型达到了最先进的性能。


cs.CV / 82 / 2605.14717

Towards Label-Free Single-Cell Phenotyping Using Multi-Task Learning

基于多任务学习的无标记单细胞表型分析研究

Nazir, Saqib, Behera, Ardhendu

Abstract

Label-free single-cell imaging offers a scalable, non-invasive alternative to fluorescence-based cytometry, yet inferring molecular phenotypes directly from bright-field morphology remains challenging. We present a unified Deep Learning (DL) framework that jointly performs White Blood Cell (WBC) classification and continuous protein-expression regression from label-free Differential Phase Contrast (DPC) images. Our model employs a Hybrid architecture that fuses convolutional fine-grained texture features with transformer-based global representations through a learnable cross-branch gating module, enabling robust morpho-molecular inference from DPC images. To support downstream interpretability, we further incorporate a Large Language Model (LLM) that generates concise, biologically grounded summaries of the predicted cell states. Experiments on the Berkeley Single Cell Computational Microscopy (BSCCM) and Blood Cells Image benchmarks demonstrate strong performance, achieving a 91.3% WBC classification accuracy and a 0.72 Pearson correlation for CD16 expression regression on BSCCM. These results underscore the promise of label-free single-cell imaging for cost-effective hematological profiling, enabling simultaneous phenotype identification and quantitative biomarker estimation without fluorescent staining. The source code is available at https://github.com/saqibnaziir/Single-Cell-Phenotyping.

Chinese Translation

无标记单细胞成像提供了一种可扩展的、非侵入性的替代方案，取代基于荧光的细胞计数法，但直接从明场形态推断分子表型仍然具有挑战性。我们提出了一个统一的深度学习（Deep Learning, DL）框架，能够同时进行白细胞（White Blood Cell, WBC）分类和从无标记差分相位对比（Differential Phase Contrast, DPC）图像中进行连续蛋白表达回归。我们的模型采用了一种混合架构，通过可学习的跨分支门控模块，将卷积细粒度纹理特征与基于变换器的全局表示相融合，从而实现从DPC图像中进行稳健的形态-分子推断。为了支持下游可解释性，我们进一步结合了一个大型语言模型（Large Language Model, LLM），生成对预测细胞状态的简明、生物学基础的总结。在伯克利单细胞计算显微镜（Berkeley Single Cell Computational Microscopy, BSCCM）和血细胞图像基准测试上的实验表明，我们的模型表现出色，达到了91.3%的WBC分类准确率和0.72的Pearson相关系数用于CD16表达回归。这些结果凸显了无标记单细胞成像在成本效益的血液学分析中的潜力，使得能够在不使用荧光染色的情况下同时进行表型识别和定量生物标志物估计。源代码可在https://github.com/saqibnaziir/Single-Cell-Phenotyping获取。


cs.CV / 83 / 2605.14727

CHASM: Cross-frequency Harmonized Axis-Separable Mixing for Spectral Token Operators

CHASM：跨频率协调轴分离混合用于光谱令牌算子

Fang, Pengcheng, Chen, Hongli, Chen, Yuxia, Sun, Tengjiao, Liu, Jiaxin, Cai, Xiaohao

Abstract

Spectral token mixers based on Fourier transforms provide an efficient way to model global interactions in visual feature maps. Existing designs often either apply filter-wise spectral responses along fixed channel axes, or learn adaptive frequency-indexed channel mixing without explicitly aligning the channel directions used across frequencies. We propose CHASM, a Cross-frequency Harmonized Axis-Separable Mixer, as a structured middle ground. CHASM separates what should be shared from what should remain frequency-specific: all frequencies share a learned channel eigenbasis, while each frequency retains its own positive spectral gains. The shared basis makes channel directions comparable across the spectrum, whereas the positive gains preserve local spectral adaptivity. CHASM applies this structured operator separably along the height and width axes and is used as a drop-in replacement mixer inside existing backbones. We provide a structural characterization of the shared-basis operator family and evaluate CHASM through controlled same-backbone comparisons. Across accelerated MRI reconstruction, undersampled MRI segmentation, and natural-image reconstruction, CHASM consistently improves over same-backbone spectral-mixer baselines. Ablations show that removing the shared-basis constraint weakens performance, and randomizing coherent sampling geometry substantially reduces the gain, supporting cross-frequency harmonization as a useful inductive bias for spectral token operators.
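The shared-basis idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `shared_basis_mix_1d`, the use of softplus to keep gains positive, and the naive DFT are all assumptions made for illustration. Along one spatial axis, every frequency is projected onto the same orthonormal channel basis, scaled by its own positive gains, and projected back.

```python
import cmath
import math

def dft(x):
    # Naive O(n^2) discrete Fourier transform of a 1-D sequence.
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    # Inverse of dft above.
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def softplus(a):
    # Assumed positivity map for the gains; the abstract only says gains are positive.
    return math.log1p(math.exp(a))

def shared_basis_mix_1d(tokens, basis, raw_gains):
    """tokens: n positions x C channels along one axis.
    basis: C x C orthonormal matrix shared by all frequencies (hypothetical name).
    raw_gains: n x C unconstrained values, one row per frequency."""
    n, c = len(tokens), len(tokens[0])
    spec = [dft([tokens[t][ch] for t in range(n)]) for ch in range(c)]
    out_spec = [[0j] * n for _ in range(c)]
    for k in range(n):
        x = [spec[ch][k] for ch in range(c)]
        # Project onto the shared channel directions: z = U^T x.
        z = [sum(basis[i][j] * x[i] for i in range(c)) for j in range(c)]
        # Frequency-specific positive gains.
        z = [softplus(raw_gains[k][j]) * z[j] for j in range(c)]
        # Map back to channel space: y = U z.
        for i in range(c):
            out_spec[i][k] = sum(basis[i][j] * z[j] for j in range(c))
    cols = [idft(out_spec[ch]) for ch in range(c)]
    # Keep the real part; real outputs need gains symmetric in k and n - k.
    return [[cols[ch][t].real for ch in range(c)] for t in range(n)]

# Usage: a 2-channel rotation basis and gains equal to 1 leave the input unchanged.
theta = 0.3
U = [[math.cos(theta), -math.sin(theta)], [math.sin(theta), math.cos(theta)]]
x_in = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]]
unit_raw = math.log(math.e - 1.0)  # softplus(unit_raw) == 1
x_out = shared_basis_mix_1d(x_in, U, [[unit_raw, unit_raw] for _ in x_in])
```

Applying the same routine first along the width of each row and then along the height of each column, with separate gain tables, would give an axis-separable variant in the spirit of the description above.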

Chinese Translation

基于傅里叶变换的光谱令牌混合器提供了一种有效的方式来建模视觉特征图中的全局交互。现有设计通常要么沿固定通道轴应用滤波器级光谱响应，要么学习自适应频率索引的通道混合，而未明确对齐跨频率使用的通道方向。我们提出了CHASM（Cross-frequency Harmonized Axis-Separable Mixer），作为一个结构化的中间方案。CHASM将应共享的内容与应保持频率特定的内容分开：所有频率共享一个学习到的通道特征基，而每个频率保留其自身的正光谱增益。共享基使得通道方向在频谱上具有可比性，而正增益则保持局部光谱的自适应性。CHASM沿高度和宽度轴分离地应用这一结构化算子，并作为现有骨干网络中的替代混合器。我们对共享基算子家族进行了结构特征描述，并通过控制的同骨干比较评估了CHASM。在加速MRI重建、欠采样MRI分割和自然图像重建中，CHASM始终优于同骨干的光谱混合基线。消融实验表明，去除共享基约束会削弱性能，而随机化一致采样几何会显著降低增益，支持跨频率协调作为光谱令牌算子的有用归纳偏置。


cs.CV / 84 / 2605.14733

Video-Zero: Self-Evolution Video Understanding

Video-Zero：自我进化视频理解

Zhang, Ruixu, Ji, Deyi, Zhu, Lanyun, Liu, Xuanyi, Meng, Yuxin, Chu, Ruihang, Yang, Yujiu

Abstract

Self-evolution offers a promising path for improving reasoning models without relying on intensive human annotation. However, extending this paradigm to video understanding remains underexplored and challenging: videos are long, dynamic, and redundant, while the evidence needed for reasoning is often sparse and temporally localized. Naively generating difficult question-answer pairs from full videos can therefore produce supervision that appears challenging but is weakly grounded, relying on static cues or language priors rather than temporal evidence. In this work, we argue that the key bottleneck of video self-evolution is not difficulty alone, but grounding. We propose Video-Zero, an annotation-free Questioner-Solver co-evolution framework that centers self-evolution on temporally localized evidence. The Questioner discovers informative evidence segments and generates evidence-grounded questions, while the Solver learns to answer and align its predictions with the supporting evidence. This closes an iterative loop of evidence discovery, grounded supervision, and evidence-aligned learning. Across 13 benchmarks spanning temporal grounding, long-video understanding, and video reasoning, Video-Zero consistently improves multiple video VLM backbones, demonstrating the effectiveness and transferability of evidence-centered self-evolution.

Chinese Translation

自我进化为提升推理模型提供了一条有前景的路径，而无需依赖大量的人类标注。然而，将这一范式扩展到视频理解仍然未被充分探索且具有挑战性：视频通常较长、动态且冗余，而进行推理所需的证据往往稀疏且时间上局部化。因此，简单地从完整视频中生成困难的问题-答案对可能会产生看似具有挑战性的监督，但实际上却是弱基础的，依赖于静态线索或语言先验，而非时间证据。在本研究中，我们认为视频自我进化的关键瓶颈不仅在于困难性，而在于基础性。我们提出了Video-Zero，一个无标注的问题生成-解答者共进化框架，旨在将自我进化集中在时间上局部化的证据上。问题生成者发现信息丰富的证据片段并生成基于证据的问题，而解答者学习回答并将其预测与支持证据对齐。这闭合了证据发现、基础监督和证据对齐学习的迭代循环。在涵盖时间基础、长视频理解和视频推理的13个基准测试中，Video-Zero始终提升多个视频视觉语言模型（VLM）骨干，证明了以证据为中心的自我进化的有效性和可迁移性。


cs.CV / 85 / 2605.14742

EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding

EARL：面向自我中心交互推理和像素定位的统一分析引导强化学习框架

Su, Yuejiao, Zhang, Xinshen, Ye, Zhen, Yao, Lei, Chau, Lap-Pui, Wang, Yi

Abstract

Understanding human-environment interactions from egocentric vision is essential for assistive robotics and embodied intelligent agents, yet existing multimodal large language models (MLLMs) still struggle with accurate interaction reasoning and fine-grained pixel grounding. To this end, this paper introduces EARL, an Egocentric Analysis-guided Reinforcement Learning framework that explicitly transfers coarse interaction semantics to query-oriented answering and grounding. Specifically, EARL adopts a two-stage parsing framework including coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the textual answer and pixel-level mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor as a semantic prior, which is integrated via a novel Analysis-guided Feature Synthesizer (AFS) for query-oriented reasoning. To optimize heterogeneous outputs, including textual answers, bounding boxes, and grounding masks, we design a multi-faceted reward function and train the response stage with GRPO. Experiments on Ego-IRGBench show that EARL achieves 65.48% cIoU for pixel grounding, outperforming previous RL-based methods by 8.37%, while OOD grounding results on EgoHOS indicate strong transferability to unseen egocentric grounding scenarios.
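For readers unfamiliar with the metric, the sketch below assumes that cIoU here means cumulative IoU as commonly used in referring segmentation: intersections and unions are summed over the whole test set before dividing. The abstract does not define the metric, so this reading and the function names are assumptions. Mean per-sample IoU is shown alongside for contrast.

```python
def cumulative_iou(pred_masks, gt_masks):
    # Sum intersections and unions over all samples, then divide once.
    inter = union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        for prow, grow in zip(pred, gt):
            for p, g in zip(prow, grow):
                inter += 1 if (p and g) else 0
                union += 1 if (p or g) else 0
    return inter / union if union else 0.0

def mean_iou(pred_masks, gt_masks):
    # Average of per-sample IoU; small objects weigh as much as large ones.
    return sum(cumulative_iou([p], [g])
               for p, g in zip(pred_masks, gt_masks)) / len(pred_masks)

# Two toy 2x2 binary masks: one half-right small object, one perfect large object.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
c_iou = cumulative_iou(preds, gts)  # (1 + 4) / (2 + 4)
m_iou = mean_iou(preds, gts)        # (0.5 + 1.0) / 2
```

Under this definition large masks dominate the score, which is worth keeping in mind when comparing grounding numbers across benchmarks.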

Chinese Translation

从自我中心视觉理解人类与环境的交互对于辅助机器人和具身智能体至关重要，然而现有的多模态大型语言模型（MLLMs）在准确的交互推理和细粒度像素定位方面仍然存在困难。为此，本文提出了EARL，一个自我中心分析引导的强化学习框架，明确将粗略的交互语义转移到查询导向的回答和定位中。具体而言，EARL采用了一个两阶段的解析框架，包括粗粒度解释和细粒度响应。第一阶段全面解释自我中心交互并生成结构化的文本描述。第二阶段根据用户查询生成文本答案和像素级掩码。为了连接这两个阶段，我们提取了一个全局交互描述符作为语义先验，通过一种新颖的分析引导特征合成器（Analysis-guided Feature Synthesizer, AFS）集成以进行查询导向推理。为了优化包括文本答案、边界框和定位掩码在内的异构输出，我们设计了一个多面向的奖励函数，并使用GRPO训练响应阶段。在Ego-IRGBench上的实验表明，EARL在像素定位方面达到了65.48%的cIoU，超越了之前基于强化学习的方法8.37%，而在EgoHOS上的OOD定位结果则表明其在未见的自我中心定位场景中具有强大的迁移能力。


cs.CV / 86 / 2605.14772

BioHuman: Learning Biomechanical Human Representations from Video

BioHuman：从视频中学习生物力学人类表示

Huo, Yujun, Zhang, He, Song, Chentao, Song, Honglin, Zuo, Zongyu, Yu, Tao

Abstract

Understanding human motion beyond surface kinematics is crucial for motion analysis, rehabilitation, and injury risk assessment. However, progress in this domain is limited by the lack of large-scale datasets with biomechanical annotations, and by existing approaches that cannot directly infer internal biomechanical states from visual observations. In this paper, we introduce a simulation-based framework for estimating muscle activations from existing motion capture datasets, resulting in BioHuman10M, a large-scale dataset with synchronized video, motion, and activations. Building on BioHuman10M, we propose BioHuman, an end-to-end model that takes monocular video as input and jointly predicts human motion and muscle activations, effectively bridging visual observations and internal biomechanical states. Extensive experiments demonstrate that BioHuman enables accurate reconstruction of both kinematic motion and muscle activity, and generalizes across diverse subjects and motions. We believe our approach establishes a new benchmark for video-based biomechanical understanding and opens up new possibilities for physically grounded human modeling.

Chinese Translation

理解人类运动超越表面运动学对于运动分析、康复和伤害风险评估至关重要。然而，该领域的进展受到缺乏具有生物力学注释的大规模数据集以及现有方法无法直接从视觉观察中推断内部生物力学状态的限制。在本文中，我们介绍了一种基于仿真的框架，用于从现有的动作捕捉数据集中估计肌肉激活，从而生成BioHuman10M，这是一个具有同步视频、运动和激活的大规模数据集。在此基础上，我们提出了BioHuman，一个端到端模型，接受单目视频作为输入，并联合预测人类运动和肌肉激活，有效地桥接了视觉观察与内部生物力学状态。大量实验表明，BioHuman能够准确重建运动学运动和肌肉活动，并在不同的个体和运动中具有良好的泛化能力。我们相信我们的方法为基于视频的生物力学理解建立了新的基准，并为物理基础的人类建模开辟了新的可能性。


cs.CV / 87 / 2605.14781

MonoPRIO: Adaptive Prior Conditioning for Unified Monocular 3D Object Detection

MonoPRIO：用于统一单目3D物体检测的自适应先验条件

Davies, Leon, Meng, Qinggang, Saada, Mohamad, Li, Baihua, Sølvsten, Simon

Abstract

Monocular 3D object detection remains challenging because metric size and depth are underdetermined by single-view evidence, particularly under occlusion, truncation, and projection-induced scale-depth ambiguity. Although recent methods improve depth and geometric reasoning, metric size remains unstable in unified multi-class settings, where class variability and partial visibility broaden plausible size modes. We propose MonoPRIO, a unified monocular 3D detector that targets this bottleneck through adaptive prior conditioning in the size pathway. MonoPRIO constructs class-aware size prototypes offline, routes each decoder query to a soft mixture prior, applies uncertainty-aware log-space conditioning, and uses Cluster-Aligned Prior (CAP) regularisation on matched positives during training. On the official KITTI test server, MonoPRIO achieves the strongest fully reported unified multi-class result among methods reporting complete Car, Pedestrian, and Cyclist metrics. In the car-only setting, it also achieves the strongest 3D bounding-box AP across Easy/Moderate/Hard categories among compared methods without extra data, while using substantially less compute than MonoCLUE. Ablations and diagnostics show complementary gains from routed injection and CAP, with the largest benefits in ambiguity-prone, partially occluded, and low-data regimes. These findings indicate that adaptive priors are most effective when image evidence underdetermines metric size, while atypical geometry or extreme visibility loss can still cause mismatch between routed priors and true instance geometry. Code, trained models, result logs, and reproducibility material are available at https://github.com/bigggs/MonoPRIO.

Chinese Translation

单目3D物体检测仍然面临挑战，因为单视图证据无法充分确定度量大小和深度，特别是在遮挡、截断和投影引起的尺度-深度模糊情况下。尽管最近的方法改善了深度和几何推理，但在统一的多类设置中，度量大小仍然不稳定，因为类别变异性和部分可见性扩大了合理大小模式。我们提出了MonoPRIO，一种统一的单目3D检测器，通过在大小路径中进行自适应先验条件来解决这一瓶颈。MonoPRIO离线构建了类别感知的大小原型，将每个解码器查询路由到一个软混合先验，应用不确定性感知的对数空间条件，并在训练期间对匹配的正样本使用聚类对齐先验（Cluster-Aligned Prior, CAP）正则化。在官方KITTI测试服务器上，MonoPRIO在报告完整的汽车、行人和骑自行车者指标的方法中，取得了最强的完全报告的统一多类结果。在仅限汽车的设置中，它在不使用额外数据的情况下，在比较的方法中也在简单/中等/困难类别中实现了最强的3D边界框平均精度（AP），同时使用的计算资源明显少于MonoCLUE。消融实验和诊断显示，路由注入和CAP带来了互补的增益，在模糊易发、部分遮挡和低数据环境中获得了最大的收益。这些发现表明，当图像证据无法充分确定度量大小时，自适应先验最为有效，而不典型的几何形状或极端的可见性丧失仍然可能导致路由先验与真实实例几何之间的不匹配。代码、训练模型、结果日志和可复现材料可在https://github.com/bigggs/MonoPRIO获取。


cs.CV / 88 / 2605.14787

Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

组合图像检索基准是否需要多模态组合？

Attimonelli, Matteo, De Bellis, Alessandro, Gema, Aryo Pradipta, Saxena, Rohit, Sekoyan, Monica, Kwan, Wai-Chung, Pomo, Claudio, Suglia, Alessandro, Jannach, Dietmar, Di Noia, Tommaso, Minervini, Pasquale

Abstract

Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model analysis. Second, we conduct human validation on 4,741 shortcut-free queries, of which only 1,689 are well-formed, with common issues including ambiguous edits and mismatched targets. Re-evaluating models on this validated subset reveals qualitatively different behaviour: queries can no longer be solved with a single modality, and successful retrieval requires combining both inputs. While accuracy decreases, reliance on multimodal information increases. Overall, current CIR benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries, leading to an overestimation of model capability in multimodal composition.

Chinese Translation

组合图像检索（Composed Image Retrieval, CIR）是一项多模态检索任务，其中查询由参考图像和文本修改组成，目标是检索出满足两者的目标图像。原则上，CIR基准的强表现被认为需要多模态组合，即结合参考图像和文本修改中的互补信息。在本研究中，我们展示了这一假设并不总是成立。在四个广泛使用的CIR基准和十一种通用多模态嵌入模型中，许多查询可以通过单一模态解决（从32.2%到83.6%），揭示了普遍存在的单模态捷径。因此，高CIR性能可能源于单模态信号，而非真正的多模态组合。为了更好地理解这一问题，我们进行了两阶段审计。首先，通过跨模型分析识别出可以通过捷径解决的查询。其次，我们对4,741个无捷径查询进行了人工验证，其中只有1,689个查询结构良好，常见问题包括模糊的编辑和不匹配的目标。对这一经过验证的子集重新评估模型揭示了质的不同表现：查询不再能够通过单一模态解决，成功的检索需要结合两种输入。尽管准确性下降，但对多模态信息的依赖性增加。总体而言，当前的CIR基准混淆了可通过捷径解决的、噪声和真正组合的查询，导致对模型在多模态组合能力的高估。


cs.CV / 89 / 2605.14795

COAL: Counterfactual and Observation-Enhanced Alignment Learning for Discriminative Referring Multi-Object Tracking

COAL：反事实与观察增强对齐学习用于区分性引用多目标跟踪

Jia, Shukun, Hu, Shiyu, Wang, Yipei, Cheng, Ximeng, Cao, Yichao, Lu, Xiaobo

Abstract

Referring Multi-Object Tracking (RMOT) faces a fundamental structural contradiction between the high-discriminability demand and the sparse semantic supervision. This mismatch is particularly acute in highly homogeneous scenarios that require fine-grained discrimination over complex compositional semantics. However, under sparse supervision, models overfit to salient yet insufficient cues, thereby encouraging shortcut learning and semantic collapse. To resolve this, we propose COAL (Counterfactual and Observation-enhanced Alignment Learning), a framework that advances RMOT beyond isolated structural optimization through knowledge regularization. First, we introduce Explicit Semantic Injection (ESI) via a VLM to densify the observation space and enhance instance discriminability. Second, leveraging LLM reasoning, we propose Counterfactual Learning (CFL) to augment supervision, enforcing strict attribute verification for robust compositional recognition. These strategies are unified within a Hierarchical Multi-Stream Integration (HMSI) architecture, which distills external knowledge into domain-specific discriminative representations. Experiments on Refer-KITTI and Refer-KITTI-V2 benchmarks validate COAL's efficacy. Notably, it surpasses the state-of-the-art by 7.28% HOTA on the highly challenging Refer-KITTI-V2. These results demonstrate the effectiveness of knowledge regularization for resolving the sparsity-discriminability paradox in RMOT.

Chinese Translation

引用多目标跟踪（RMOT）面临高区分性需求与稀疏语义监督之间的基本结构矛盾。这种不匹配在需要对复杂组合语义进行细粒度区分的高度同质化场景中尤为明显。然而，在稀疏监督下，模型往往过拟合于显著但不足的线索，从而促使捷径学习和语义崩溃。为了解决这个问题，我们提出了COAL（反事实与观察增强对齐学习），这是一个通过知识正则化将RMOT推进到孤立结构优化之外的框架。首先，我们通过VLM引入显式语义注入（ESI），以稠密化观察空间并增强实例区分性。其次，利用LLM推理，我们提出了反事实学习（CFL）以增强监督，强制执行严格的属性验证以实现稳健的组合识别。这些策略在层次多流集成（HMSI）架构中统一，该架构将外部知识提炼为领域特定的区分表示。在Refer-KITTI和Refer-KITTI-V2基准上的实验验证了COAL的有效性。值得注意的是，它在极具挑战性的Refer-KITTI-V2上超越了最先进的技术，提升了7.28%的HOTA。这些结果展示了知识正则化在解决RMOT中的稀疏性与区分性悖论方面的有效性。


cs.CV / 90 / 2605.14799

Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation

视觉曼巴能否改善AI生成图像检测？深入研究

Keita, Mamadou, Hamidouche, Wassim, Eutamene, Hessen Bougueffa, Taleb-Ahmed, Abdelmalik, Zhu, Xianxun, Hadid, Abdenour

Abstract

In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged in this rapidly evolving field as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.

Chinese Translation

近年来，计算机视觉领域取得了显著进展，这得益于卷积神经网络（CNNs）、生成对抗网络（GANs）、基于扩散的架构、视觉变换器（ViTs）以及最近的视觉-语言模型（VLMs）等创新架构的发展。这些进展无疑促进了越来越逼真和多样化的视觉内容的生成。然而，图像生成技术的进步也引发了对其在虚假信息、身份盗窃以及隐私和安全威胁等领域潜在滥用的担忧。同时，基于曼巴（Mamba）的架构作为多种图像分析任务的通用工具应运而生，包括分类、分割、医学成像、目标检测和图像恢复等。然而，与已建立的技术相比，它们在识别AI生成图像方面的潜力仍然相对未被探索。本研究系统地评估和比较了视觉曼巴模型在AI生成图像检测中的表现。我们对多种视觉曼巴变体进行了基准测试，比较了它们与代表性的CNN、ViT和基于VLM的检测器在不同数据集和合成图像源上的表现，重点关注准确性、效率和在不同图像类型及生成模型中的普适性等关键指标。通过这一全面分析，我们旨在阐明视觉曼巴在适用性、准确性和效率方面相对于已建立方法的优势和局限性。总体而言，我们的研究结果突显了视觉曼巴作为区分真实与AI生成视觉内容的系统组件的潜力和当前局限性。这项研究对于在一个区分真实与AI生成内容的时代增强检测能力至关重要。


cs.CV / 91 / 2605.14808

SuperADD: Training-free Class-agnostic Anomaly Segmentation -- CVPR 2026 VAND 4.0 Workshop Challenge Industrial Track

SuperADD：无训练的类无关异常分割 -- CVPR 2026 VAND 4.0 研讨会挑战工业轨道

Roming, Lukas, Lehnerer, Felix, Funk, Jonas V., Michel, Andreas, Maier, Georg, Längle, Thomas, Beyerer, Jürgen

Abstract

Visual anomaly detection (AD) for industrial inspection is a highly relevant task in modern production environments. The problem becomes particularly challenging when training and deployment data differ due to changes in acquisition conditions during production. In the VAND 4.0 Industrial Track, models must remain robust under distribution shifts such as varying illumination, and their performance is assessed on the MVTec AD 2 dataset. To address this setting, we propose a training-free and class-agnostic anomaly detection pipeline based on the work of SuperAD. Our approach improves generalization through several modifications designed to enhance robustness under distribution shifts. These adaptations include using a DINOv3 backbone, overlapping patch-wise processing, intensity-based augmentations, improved memory-bank subsampling for better coverage of the data distribution, and iterative morphological closing for cleaner and more spatially consistent anomaly maps. Unlike methods that rely on class-specific architectures or per-class hyperparameter tuning, our method uses a single architecture and one shared hyperparameter configuration across all object classes. This makes the approach well suited for industrial deployment, where product variants and appearance changes must be handled with minimal adaptation effort. We achieve segmentation F1 scores of 62.61%, 57.42%, and 54.35% on the public, private, and private-mixed test sets of MVTec AD 2, respectively, thereby outperforming SuperAD and other state-of-the-art methods. Code is available at https://github.com/LukasRoom/SuperADD.
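A training-free memory-bank detector of the kind the abstract refers to can be sketched in a few lines. This is a generic illustration, not the SuperADD pipeline: the greedy farthest-point subsampling and the Euclidean nearest-neighbour score are assumptions chosen because they are common in memory-bank methods, and the feature vectors stand in for backbone patch embeddings.

```python
import math

def farthest_point_subsample(features, k):
    # Greedy coverage-oriented subsampling (assumed strategy): repeatedly add
    # the feature that is farthest from everything already selected.
    chosen = [0]
    min_dist = [math.dist(f, features[0]) for f in features]
    while len(chosen) < min(k, len(features)):
        idx = max(range(len(features)), key=lambda i: min_dist[i])
        chosen.append(idx)
        min_dist = [min(d, math.dist(f, features[idx]))
                    for d, f in zip(min_dist, features)]
    return [features[i] for i in chosen]

def anomaly_scores(patch_features, memory_bank):
    # Score each test patch by its distance to the nearest normal feature.
    return [min(math.dist(f, m) for m in memory_bank) for f in patch_features]

# Toy "normal" features forming two clusters, then two test patches.
normal = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]]
bank = farthest_point_subsample(normal, k=2)
scores = anomaly_scores([[0.0, 0.0], [5.0, 0.0]], bank)
```

With only two slots, the subsampled bank keeps one feature from each cluster, so a patch lying between the clusters receives a clearly higher score than one matching a stored normal feature.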

Chinese Translation

工业检查中的视觉异常检测（AD）是现代生产环境中一个高度相关的任务。当训练和部署数据因生产过程中采集条件的变化而不同时，问题变得尤为棘手。在 VAND 4.0 工业轨道中，模型必须在分布变化（如光照变化）下保持鲁棒性，其性能在 MVTec AD 2 数据集上进行评估。为了解决这一问题，我们提出了一种基于 SuperAD 的无训练和类无关的异常检测管道。我们的方法通过几项旨在增强分布变化下鲁棒性的修改来提高泛化能力。这些适应包括使用 DINOv3 主干、重叠的补丁处理、基于强度的增强、改进的记忆库子采样以更好地覆盖数据分布，以及迭代形态学闭合以生成更干净和空间一致的异常图。与依赖于类特定架构或每类超参数调优的方法不同，我们的方法在所有物体类别中使用单一架构和共享的超参数配置。这使得该方法非常适合工业部署，在产品变体和外观变化的情况下，能够以最小的适应努力进行处理。我们在 MVTec AD 2 的公共测试、私人测试和混合私人测试上分别取得了 62.61%、57.42% 和 54.35% 的分割 F1 分数，超越了 SuperAD 和其他最先进的方法。代码可在 https://github.com/LukasRoom/SuperADD 获取。


cs.CV / 92 / 2605.14815

Probing into Camera Control of Video Models

探讨视频模型的相机控制

Hou, Chen, Rupprecht, Christian

Abstract

Video is a rich and scalable source of 3D/4D visual observations, and camera control is a key capability for video generation models to produce geometrically meaningful content. Existing approaches typically learn a mapping from camera motion to video using additional camera modules and paired data. However, such datasets are often limited in scale, diversity, and scene dynamics, which can bias the model toward a narrow output distribution and compromise the strong prior learned by the base model. These limitations motivate a different perspective on camera control. In this paper, we show that camera control need not be modeled as an implicit mapping problem, but can instead be treated as a form of geometric guidance that induces displacements across frames. Specifically, we reformulate camera control into a set of displacement fields and apply them via differentiable resampling of latent features during denoising. Our simple approach achieves effective camera control with minimal degradation across diverse quality metrics compared to fine-tuned baselines. Since our method is applicable to most video diffusion models without training, it can also serve as a probe to study the camera control capabilities of base models. Using this probe, we identify universal biases shared by representative video models, as well as disparities in their responses to camera control. Finally, we benchmark their performance in multi-view generation, offering insights into their potential for 3D/4D tasks.
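The core operation named in this abstract, resampling a feature map according to a displacement field, can be illustrated on a single-channel 2-D grid. The sketch below is only a toy version under stated assumptions: the backward-warp convention, border clamping, and function names are illustrative, and how the paper derives displacements from camera motion or applies them during denoising is not shown. In PyTorch the same step is usually done with `torch.nn.functional.grid_sample`.

```python
import math

def bilinear_sample(feat, y, x):
    # Sample an H x W grid at a continuous location, clamping to the border.
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[y0][x0] + wx * feat[y0][x1]
    bottom = (1 - wx) * feat[y1][x0] + wx * feat[y1][x1]
    return (1 - wy) * top + wy * bottom

def warp(feat, disp):
    # Backward warp (assumed convention): out[y][x] = feat(y + dy, x + dx),
    # where disp[y][x] = (dy, dx).
    return [[bilinear_sample(feat, y + disp[y][x][0], x + disp[y][x][1])
             for x in range(len(feat[0]))]
            for y in range(len(feat))]

feat = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
shift_half = [[(0.0, 0.5)] * 3 for _ in range(2)]  # half a cell to the right
warped = warp(feat, shift_half)
```

Because the output is a weighted average of neighbouring values, it is differentiable with respect to both the features and the sampling coordinates almost everywhere, which is what makes this kind of resampling usable inside a denoising loop.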

Chinese Translation

视频是一个丰富且可扩展的3D/4D视觉观察源，而相机控制是视频生成模型产生几何上有意义内容的关键能力。现有的方法通常通过额外的相机模块和配对数据学习从相机运动到视频的映射。然而，这类数据集往往在规模、多样性和场景动态方面受到限制，这可能导致模型偏向狭窄的输出分布，并妨碍基础模型所学习的强先验。这些限制促使我们对相机控制采取不同的视角。在本文中，我们展示了相机控制不必被建模为隐式映射问题，而可以被视为一种几何引导形式，诱导帧间位移。具体而言，我们将相机控制重新表述为一组位移场，并通过在去噪过程中对潜在特征进行可微重采样来应用它们。与经过微调的基线相比，我们的简单方法在多种质量指标上实现了有效的相机控制，且降级最小。由于我们的方法适用于大多数视频扩散模型而无需训练，它还可以作为探针来研究基础模型的相机控制能力。利用这一探针，我们识别出代表性视频模型共享的普遍偏差，以及它们对相机控制响应的差异。最后，我们在多视图生成中基准测试它们的性能，为其在3D/4D任务中的潜力提供了见解。


cs.CV / 93 / 2605.14819

The Velocity Deficit: Initial Energy Injection for Flow Matching

速度缺失：流匹配的初始能量注入

Li, Linze, Hong, Zong-Wei, Zhang, Shen, Lin, Bo, Li, Jinglun, Tang, Yao, Liang, Jiajun

Abstract

While Flow Matching theoretically guarantees constant-velocity trajectories, we identify a critical breakdown in high-dimensional practice: the Velocity Deficit. We show that the MSE objective systematically underestimates velocity magnitude, causing generated samples to fail to reach the data manifold, a phenomenon we term Integration Lag. To rectify this, we propose Initial Energy Injection, instantiated via two complementary methods: the training-based Magnitude-Aware Flow Matching (MAFM) and the training-free Scale Schedule Corrector (SSC). Both are grounded in our discovery of a crucial asymmetry: velocity contraction causes harmful kinetic stagnation at the trajectory's start, yet acts as a beneficial denoising mechanism at its end. Empirically, SSC yields significant efficiency gains with zero retraining and just one line of code. On ImageNet-1k (256x256), it improves FID by 44.6% (from 13.68 to 7.58) and achieves a 5x speedup, enabling a 50-step generator (FID 7.58) to beat a 250-step baseline (FID 8.65). Furthermore, our methods generalize to Text-to-Image tasks and high-resolution generation, improving FID on MS-COCO by ~22%.
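A toy example helps make the integration-lag argument concrete. The sketch below is not the paper's SSC: the velocity field, the step count, and the `early_boost` schedule are all hypothetical. It only shows how a time-dependent scale on the predicted velocity, applied early in the trajectory, fits into an Euler sampler as a one-line change.

```python
def euler_sample(x0, velocity, steps, scale=lambda t: 1.0):
    # Integrate dx/dt = scale(t) * velocity(x, t) from t = 0 to t = 1.
    x = x0
    for i in range(steps):
        t = i / steps
        x = x + (1.0 / steps) * scale(t) * velocity(x, t)  # the one-line change
    return x

def contracted_velocity(x, t):
    # Hypothetical 1-D "learned" field: the ideal velocity toward the target
    # 1.0 is 1.0, but its magnitude is underestimated during the first half.
    return 0.8 if t < 0.5 else 1.0

def early_boost(t):
    # Hypothetical scale schedule: compensate early, leave late steps untouched.
    return 1.25 if t < 0.5 else 1.0

plain = euler_sample(0.0, contracted_velocity, steps=10)
corrected = euler_sample(0.0, contracted_velocity, steps=10, scale=early_boost)
```

In this toy setting the uncorrected sampler stops short of the target, while the boosted one reaches it with the same number of steps; the late-time contraction that the abstract describes as beneficial is left alone because the schedule returns to 1.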

Chinese Translation

虽然流匹配在理论上保证了恒定速度的轨迹，但我们在高维实践中识别出一个关键的失效点：速度缺失。我们表明，均方误差（MSE）目标系统性地低估了速度的大小，导致生成的样本未能到达数据流形——这一现象我们称之为积分滞后。为了解决这个问题，我们提出了初始能量注入，通过两种互补的方法实现：基于训练的幅度感知流匹配（Magnitude-Aware Flow Matching, MAFM）和无训练的尺度调节校正器（Scale Schedule Corrector, SSC）。这两者都基于我们发现的一个关键不对称性：速度收缩在轨迹开始时导致有害的动能停滞，但在轨迹结束时却充当了有益的去噪机制。实证结果表明，SSC在零重训练和仅需一行代码的情况下显著提高了效率。在ImageNet-1k（256x256）上，它将FID提升了44.6%（从13.68降至7.58），并实现了5倍的加速，使得50步生成器（FID 7.58）超越了250步的基线（FID 8.65）。此外，我们的方法在文本到图像任务和高分辨率生成中也具有良好的泛化能力，在MS-COCO上将FID改善了约22%。

cs.CV / 94 / 2605.14821

HDRFace: Rethinking Face Restoration with High-Dimensional Representation

HDRFace：重新思考高维表示下的人脸修复

Wang, Zirui, Lin, Xianhui, Dong, Yi, Wei, Bo, Zhang, Gangjian, Ma, Siteng, Zheng, Zebiao, Liu, Xing, Gu, Hong, Dong, Minjing

Abstract

Face restoration under complex degradations remains an ill-posed inverse problem due to severe information loss. Although diffusion models benefit from strong generative priors, most methods still condition only on low-quality inputs, making it difficult to recover identity-critical details under heavy degradations. In this work, we propose HDRFace, a High-Dimensional Representation conditioned Face restoration framework that injects semantically rich priors into the conditional flow without modifying the generative backbone. Our pipeline first obtains a structurally reliable intermediate restoration with an off-the-shelf restorer, then uses a pretrained high-dimensional feature encoder to extract fine-grained facial representations from both the low-quality input and the intermediate result, and injects them as additional conditions for generation. We further introduce SDFM, a Structure-Detail aware adaptive Fusion Mechanism that emphasizes global constraints during structure modeling and strengthens representation guidance during detail synthesis, balancing structural consistency and detail fidelity. To validate the generalization ability of our method, we implement the proposed framework on two generative models, SD V2.1-base and Qwen-Image, and consistently observe stable and coherent performance gains across different architectures.

Chinese Translation

在复杂退化条件下的人脸修复仍然是一个病态逆问题，因为信息严重丢失。尽管扩散模型受益于强大的生成先验，但大多数方法仍仅依赖低质量输入，使得在严重退化情况下恢复与身份相关的细节变得困难。在本研究中，我们提出了HDRFace，一个基于高维表示的人脸修复框架，该框架在不修改生成主干的情况下，将语义丰富的先验注入条件流中。我们的流程首先通过现成的修复器获得结构可靠的中间修复结果，然后使用预训练的高维特征编码器从低质量输入和中间结果中提取细粒度的人脸表示，并将其作为生成的额外条件注入。我们进一步引入SDFM（结构-细节感知自适应融合机制），在结构建模过程中强调全局约束，并在细节合成过程中增强表示指导，以平衡结构一致性和细节保真度。为了验证我们方法的泛化能力，我们在两个生成模型SD V2.1-base和Qwen-Image上实现了所提出的框架，并在不同架构中一致观察到稳定且一致的性能提升。

cs.CV / 95 / 2605.14838

Multi-proposal Collaboration and Multi-task Training for Weakly-supervised Video Moment Retrieval

基于多提案协作和多任务训练的弱监督视频时刻检索

Zhang, Bolin, Yang, Chao, Jiang, Bin, Komamizu, Takahiro, Ide, Ichiro

Abstract

This study focuses on weakly-supervised Video Moment Retrieval (VMR), aiming to identify a moment semantically similar to the given query within an untrimmed video using only video-level correspondences, without relying on temporal annotations during training. Previous methods either aggregate predictions for all instances in the video, or indirectly address the task by proposing reconstructions for the query. However, these methods often produce low-quality temporal proposals, struggle with distinguishing misaligned moments in the same video, or lack stability due to a reliance on a single auxiliary task. To address these limitations, we present a novel weakly-supervised method called Multi-proposal Collaboration and Multi-task Training (MCMT). Initially, we generate multiple proposals and derive corresponding learnable Gaussian masks from them. These masks are then combined to create a high-quality positive sample mask, highlighting video clips most relevant to the query. Concurrently, we classify other clips in the same video as the easy negative sample and the entire video as the hard negative sample. During training, we introduce forward and inverse masked query reconstruction tasks to impose more substantial constraints on the network, promoting more robust and stable retrieval performance. Extensive experiments on two standard benchmarks affirm the effectiveness of the proposed method in VMR.

Chinese Translation

本研究聚焦于弱监督视频时刻检索（Video Moment Retrieval, VMR），旨在仅利用视频级别的对应关系，在未裁剪的视频中识别与给定查询语义相似的时刻，而不依赖于训练过程中的时间注释。以往的方法要么对视频中所有实例的预测进行聚合，要么通过为查询提出重构间接解决该任务。然而，这些方法往往产生低质量的时间提案，难以区分同一视频中不对齐的时刻，或者由于依赖单一辅助任务而缺乏稳定性。为了解决这些局限性，我们提出了一种新颖的弱监督方法，称为多提案协作和多任务训练（Multi-proposal Collaboration and Multi-task Training, MCMT）。首先，我们生成多个提案，并从中推导出相应的可学习高斯掩码。这些掩码随后被组合以创建高质量的正样本掩码，突出与查询最相关的视频片段。同时，我们将同一视频中的其他片段分类为简单负样本，将整个视频视为困难负样本。在训练过程中，我们引入前向和反向掩码查询重构任务，以对网络施加更强的约束，从而促进更稳健和稳定的检索性能。在两个标准基准上的大量实验验证了所提方法在视频时刻检索中的有效性。
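
The Gaussian-mask step described above can be sketched as follows. This is a minimal illustration, assuming each proposal is a normalized (center, width) pair and that masks are combined by simple averaging; the abstract does not state the combination rule, and the proposal values are hypothetical.

```python
import math


def gaussian_mask(center, width, num_clips):
    """Soft temporal mask over clip indices; center and width are in [0, 1].

    Requires num_clips > 1 so that clip positions can be normalized.
    """
    mask = []
    for i in range(num_clips):
        pos = i / (num_clips - 1)  # normalized clip position
        mask.append(math.exp(-0.5 * ((pos - center) / width) ** 2))
    return mask


def combine_masks(masks):
    """Average several proposal masks into one positive-sample mask (assumed rule)."""
    n = len(masks)
    return [sum(values) / n for values in zip(*masks)]


# Hypothetical proposals as (center, width) pairs.
proposals = [(0.40, 0.10), (0.50, 0.15), (0.45, 0.08)]
masks = [gaussian_mask(c, w, num_clips=21) for c, w in proposals]
positive_mask = combine_masks(masks)

# Index of the clip the combined mask weights most heavily.
peak_clip = max(range(len(positive_mask)), key=positive_mask.__getitem__)
```

In a trained model the centers and widths would be predicted and learned end to end; here they are fixed so the combination step can be inspected in isolation.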

cs.CV / 96 / 2605.14842

Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis

编辑推荐：通过原子实体分析评估图像编辑中的抽象意图

Ventura, Mor, Hirsch, Roy, Bitton, Yonatan, Cohen, Regev, Reichart, Roi

Abstract

Humans naturally communicate through abstract concepts like "mood". However, current image editing benchmarks focus primarily on explicit, literal commands, leaving abstract instructions largely underexplored. In this work, we first formalize the definition and taxonomy of abstract image editing. To measure instruction-following in this challenging domain, we introduce Entity-Rubrics, a framework that breaks down abstract edits into individual, entity-level assessments and achieves strong correlation with human judgment. Alongside this framework, we contribute AbstractEdit, the first benchmark dedicated to abstract image editing across diverse real-world scenes. Evaluating 11 leading models on this dataset reveals a fundamental challenge: standard architectures struggle to balance intent and preservation, commonly defaulting to under-editing or over-editing. Our analysis demonstrates that driving meaningful improvements relies heavily on integrating advanced LLM text encoders and iterative thinking. Looking forward, our entity-based paradigm can generalize beyond assessment to serve as a reward model, enable models to correctly interpret abstract communication, or highlight specific failures in test-time critique loops. Ultimately, we hope this work serves as a stepping stone toward seamless multimodal interaction, closing the gap between rigid machine execution and the natural, open-ended way humans communicate.

Chinese Translation

人类自然地通过“情绪”等抽象概念进行交流。然而，目前的图像编辑基准主要集中在明确的、字面上的指令上，抽象指令则鲜有探索。在本研究中，我们首先对抽象图像编辑的定义和分类进行了形式化。为了在这一具有挑战性的领域中衡量指令遵循情况，我们引入了Entity-Rubrics框架，该框架将抽象编辑分解为单个实体级别的评估，并与人类判断之间达成了强相关性。除了这一框架，我们还贡献了AbstractEdit，这是第一个专注于多样化真实场景的抽象图像编辑基准。在该数据集上评估11个领先模型揭示了一个根本性挑战：标准架构在平衡意图和保留方面存在困难，通常会默认选择不足编辑或过度编辑。我们的分析表明，推动有意义的改进在很大程度上依赖于整合先进的LLM文本编码器和迭代思维。展望未来，我们的基于实体的范式可以超越评估，作为奖励模型，帮助模型正确解读抽象交流，或在测试时批评循环中突出特定失败。最终，我们希望这项工作能成为实现无缝多模态交互的垫脚石，缩小机器执行的僵化与人类自然、开放式交流之间的差距。

cs.CV / 97 / 2605.14843

MechVerse: Evaluating Physical Motion Consistency in Video Generation Models

MechVerse：评估视频生成模型中的物理运动一致性

Jain, Rahul, Patel, Mayank, Unmesh, Asim, Ramani, Karthik

Abstract

Text- and image-conditioned video generation models have achieved strong visual fidelity and temporal coherence, but they often fail to generate motion governed by kinematic and geometric constraints. In these settings, object parts must remain rigid, maintain contact or coupling with neighboring components, and transfer motion consistently across connected parts. These requirements are especially explicit in articulated mechanical assemblies, where motion is constrained by rigid-link geometry, contact/coupling relations, and transmission through kinematic chains. A generated video may therefore appear plausible while violating the intended mechanism, such as rotating a part that should translate, deforming a rigid component, breaking coupling between parts, or failing to move downstream components. To evaluate this gap, we introduce MechVerse, a benchmark for mechanically consistent image-to-video generation. MechVerse contains 21,156 synthetic clips from 1,357 mechanical assemblies across 141 categories, organized into three tiers of increasing kinematic complexity: independent articulation, pairwise coupling, and densely coupled multi-part mechanisms. Each clip is paired with a structured prompt describing part identities, stationary supports, moving components, motion primitives, direction, speed/extent, and inter-part dependencies. We evaluate proprietary, open-source, and fine-tuned image-to-video models using standard video metrics, instruction-following scores, and human judgments of motion correctness and kinematic coupling. Results show that current models can preserve appearance and smoothness while failing to generate mechanically admissible motion, with errors increasing as coupling complexity grows. MechVerse provides a benchmark for measuring and improving mechanism-aware video generation from image and language inputs.

Chinese Translation

基于文本和图像的视频生成模型在视觉逼真性和时间一致性方面取得了显著进展，但它们往往无法生成受运动学和几何约束支配的运动。在这些情况下，物体部件必须保持刚性，与相邻组件保持接触或耦合，并在连接部件之间一致地传递运动。这些要求在关节机械装配中尤为明显，其中运动受到刚性链接几何形状、接触/耦合关系和通过运动链传递的限制。因此，生成的视频可能看起来合理，但却违反了预期的机制，例如旋转一个应当平移的部件、变形一个刚性组件、打破部件之间的耦合，或未能移动下游组件。为了评估这一差距，我们引入了MechVerse，这是一个用于机械一致的图像到视频生成的基准。MechVerse包含来自1,357个机械装配的21,156个合成剪辑，涵盖141个类别，分为三个逐渐增加运动学复杂性的层级：独立关节、成对耦合和密集耦合的多部件机制。每个剪辑都配有一个结构化提示，描述部件身份、静止支撑、移动组件、运动原语、方向、速度/范围和部件间依赖关系。我们使用标准视频指标、遵循指令的评分和人类对运动正确性及运动学耦合的判断来评估专有、开源和微调的图像到视频模型。结果表明，当前模型能够保持外观和流畅性，但未能生成机械上可接受的运动，且随着耦合复杂性的增加，错误也在增加。MechVerse为测量和改善基于图像和语言输入的机制感知视频生成提供了基准。

cs.CV / 98 / 2605.14845

Exploring Vision-Language Models for Online Signature Verification: A Zero-Shot Capability Study

探索视觉-语言模型在在线签名验证中的应用：零样本能力研究

Robledo-Moreno, Marta, Vera-Rodriguez, Ruben, Tolosana, Ruben, Ortega-Garcia, Javier

Abstract

Recent advancements in Vision-Language Models (VLMs) have demonstrated strong capabilities in general visual reasoning, yet their applicability to rigorous biometric tasks remains unexplored. This work presents an exploratory study evaluating the zero-shot performance of state-of-the-art VLMs (GPT-5.2 and Gemini 2.5 Pro) on the Signature Verification Challenge (SVC) benchmark. To enable visual processing, raw kinematic time-series are converted into static images, encoding pressure information into stroke opacity whenever available in the source data. Furthermore, we introduce a scoring protocol that extracts latent token probabilities to compute robust biometric scores. Experimental results reveal a significant performance dichotomy dependent on signal quality and forgery type. In random forgery scenarios, the zero-shot VLM achieves exceptional discrimination, with GPT-5.2 reaching an Equal Error Rate of 0.32% in mobile tasks, outperforming supervised state-of-the-art systems. Conversely, in skilled forgery scenarios, where the task is more challenging because both signatures are almost identical, the results are significantly worse, and a critical "Rationalization Trap" emerges: chain-of-thought (CoT) reasoning degrades performance as the model produces kinematic hallucinations to justify forgery artifacts as natural variability.

Chinese Translation

最近，视觉-语言模型（VLMs）的进展展示了其在一般视觉推理方面的强大能力，但其在严格生物识别任务中的适用性仍未得到探索。本研究呈现了一项探索性研究，评估了最先进的VLM（GPT-5.2和Gemini 2.5 Pro）在签名验证挑战（SVC）基准上的零样本性能。为了实现视觉处理，原始运动学时间序列被转换为静态图像，并在源数据可用时将压力信息编码为笔画的不透明度。此外，我们引入了一种评分协议，提取潜在的标记概率以计算稳健的生物识别分数。实验结果揭示了性能在信号质量和伪造类型上的显著差异。在随机伪造场景中，零样本VLM表现出卓越的区分能力，其中GPT-5.2在移动任务中达到了0.32%的等错误率，超越了监督的最先进系统。相反，在熟练伪造场景中，由于两个签名几乎相同，任务变得更加困难，结果显著较差，并出现了一个关键的“合理化陷阱”：链式思维（CoT）推理降低了性能，因为模型产生运动学幻觉，以将伪造伪影合理化为自然变异。
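
The Equal Error Rate quoted above is the operating point at which the false acceptance rate (FAR) equals the false rejection rate (FRR). A minimal sketch of how it can be approximated from a list of scores follows, assuming higher scores mean "more likely genuine"; the score values are made up, and the paper's own token-probability scoring protocol is not reproduced here.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, best_eer = float("inf"), None
    for th in thresholds:
        # False acceptance: an impostor scores at or above the threshold.
        far = sum(s >= th for s in impostor_scores) / len(impostor_scores)
        # False rejection: a genuine signature scores below the threshold.
        frr = sum(s < th for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Hypothetical comparison scores.
genuine = [0.9, 0.8, 0.75, 0.6, 0.3]
impostor = [0.7, 0.4, 0.35, 0.2, 0.1]
eer = equal_error_rate(genuine, impostor)
```

With only a handful of scores the sweep is coarse; evaluations on real benchmarks typically interpolate between thresholds over many thousands of comparisons.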

cs.CV / 99 / 2605.14847

SR-Prominence: A Crowdsourced Protocol and Dataset Suite for Perceptually-Weighted Super-Resolution Artifact Evaluation

SR-Prominence：一种众包协议和数据集套件，用于感知加权超分辨率伪影评估

Molodetskikh, Ivan, Malyshev, Kirill, Mirgaleev, Mark, Zagainov, Nikita, Bogatyrev, Evgeney, Vatolin, Dmitriy

Abstract

Modern image super-resolution methods generate detailed, visually appealing results, but they often introduce visual artifacts: unnatural patterns and texture distortions that degrade perceived quality. These defects vary widely in perceptual impact (some are barely noticeable, while others are highly disturbing), yet existing detection methods treat them equally. We propose artifact prominence as an evaluative target, defined as the fraction of viewers who judge a highlighted region to contain a noticeable artifact. We design a crowdsourced annotation protocol and construct SR-Prominence, a dataset suite containing 3,935 artifact masks from DeSRA, Open Images, Urban100, and a realistic no-ground-truth Urban100-HR setting, annotated with prominence. Re-annotating DeSRA reveals that 48.2% of its in-lab binary artifacts are not noticed by a majority of viewers. Across the suite, we audit SR artifact detectors, image-quality metrics, and SR methods. We find that classical full-reference metrics, especially SSIM and DISTS, provide surprisingly strong localized prominence signals, whereas no-reference IQA methods and specialized artifact detectors often fail to generalize across datasets and reference settings. SR-Prominence is released with an objective scoring protocol that allows new metrics to be benchmarked on our suite without further crowdsourcing. Together, the data and protocols enable SR artifact evaluation to move from binary defect presence toward perceptual impact. SR-Prominence is available at https://huggingface.co/datasets/imolodetskikh/sr-artifact-prominence.

Chinese Translation

现代图像超分辨率方法生成细致且视觉上吸引人的结果，但它们往往引入视觉伪影：不自然的图案和纹理失真，降低了感知质量。这些缺陷在感知影响上差异很大——有些几乎不易察觉，而另一些则极具干扰性——然而现有的检测方法对它们的处理是相同的。我们提出伪影显著性作为评估目标，定义为判断高亮区域包含明显伪影的观众比例。我们设计了一种众包注释协议，并构建了SR-Prominence，一个包含来自DeSRA、Open Images、Urban100以及一个真实无地面真值的Urban100-HR设置的3,935个伪影掩膜的数据集套件，并进行了显著性注释。对DeSRA的重新注释揭示，48.2%的实验室内二元伪影并未被大多数观众注意到。在整个数据集中，我们审计了SR伪影检测器、图像质量度量和SR方法。我们发现，经典的全参考度量，尤其是SSIM和DISTS，提供了意外强烈的局部显著性信号，而无参考图像质量评估方法和专门的伪影检测器往往无法在不同数据集和参考设置中进行泛化。SR-Prominence发布了一个客观评分协议，允许新的度量在我们的套件上进行基准测试，而无需进一步的众包。数据和协议共同使得SR伪影评估从二元缺陷存在向感知影响的转变。SR-Prominence可在https://huggingface.co/datasets/imolodetskikh/sr-artifact-prominence获取。
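
Prominence, as defined above, is the fraction of viewers who judge a highlighted region to contain a noticeable artifact. The sketch below computes it from raw votes and then measures the share of masks that no majority noticed, which is the kind of statistic behind the 48.2% figure. The vote data is hypothetical, and treating an exact tie as "not a majority" is an assumption.

```python
def prominence(votes):
    """Fraction of viewers who marked the highlighted region as a noticeable artifact."""
    return sum(votes) / len(votes)


def share_not_noticed_by_majority(votes_per_mask):
    """Share of masks whose prominence is at most 0.5, so no majority noticed them."""
    scores = [prominence(v) for v in votes_per_mask]
    return sum(p <= 0.5 for p in scores) / len(scores)


# Hypothetical votes: one list per artifact mask, True = "I notice an artifact".
votes_per_mask = [
    [True, True, True, False],     # prominence 0.75
    [True, False, False, False],   # prominence 0.25
    [False, False, False, False],  # prominence 0.0
    [True, True, False, False],    # prominence 0.5, a tie rather than a majority
]
share = share_not_noticed_by_majority(votes_per_mask)
```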

cs.CV / 100 / 2605.14854

FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery

FactorizedHMR：一种用于视频人类网格恢复的混合框架

Kwon, Patrick, Chen, Chen

Abstract

Human Mesh Recovery (HMR) is fundamentally ambiguous: under occlusion or weak depth cues, multiple 3D bodies can explain the same image evidence. This ambiguity is not uniform across the body, as torso pose and root structure are often relatively well constrained, whereas distal articulations such as the arms and legs are more uncertain. Building on this observation, we propose FactorizedHMR, a two-stage framework that treats these two regimes differently. A deterministic regression module first recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the remaining non-torso articulation. To make this completion reliable, we combine a composite target representation with geometry-aware supervision and feature-aware classifier-free guidance, preserving the torso-root anchor while improving single-reference recovery of ambiguity-prone articulation. We also introduce a synthetic data pipeline that provides the paired image-camera-motion supervision under diverse viewpoints. Across camera-space and world-space benchmarks, FactorizedHMR remains competitive with strong baselines, with the clearest gains in occlusion-heavy recovery and drift-sensitive world-space metrics.

Chinese Translation

人类网格恢复（HMR）本质上存在歧义：在遮挡或深度线索不足的情况下，多个人体模型可以解释相同的图像证据。这种歧义在身体的不同部位并不均匀，因为躯干姿态和根部结构通常相对受到良好约束，而四肢的关节则更为不确定。基于这一观察，我们提出了FactorizedHMR，一个两阶段框架，分别处理这两种情况。首先，确定性回归模块恢复一个稳定的躯干-根部锚点，然后，概率流匹配模块完成剩余的非躯干关节。为了使这一完成过程可靠，我们结合了复合目标表示、几何感知监督和特征感知无分类器引导，保持躯干-根部锚点的同时改善对易歧义关节的单一参考恢复。我们还引入了一个合成数据管道，在不同视角下提供成对的图像-相机运动监督。在相机空间和世界空间基准测试中，FactorizedHMR与强基线保持竞争力，在遮挡严重的恢复和对漂移敏感的世界空间指标上表现出明显的提升。

cs.CV / 101 / 2605.14874

LPH-VTON: Resolving the Structure-Texture Dilemma of Virtual Try-On via Latent Process Handover

LPH-VTON：通过潜在过程交接解决虚拟试穿中的结构-纹理困境

Liu, Yixin, Qian, Baihong, Jiang, Jinglin, Wu, Jeffery, Chen, Yan, Wang, Wei, Wang, Yida, Yang, Lanqing, Xue, Guangtao

Abstract

Virtual Try-On (VTON) aims to synthesize photorealistic images of garments precisely aligned with a person's body and pose. Current diffusion-based methods, however, face a fundamental trade-off between structural integrity and textural fidelity. In this paper, we formalize this challenge as a consequence of complementary inductive biases inherent in prevailing architectures: models heavily reliant on spatial constraints naturally favor geometric alignment but often suppress textures, whereas models dominated by unconstrained generative priors excel at vibrant detail rendering but are prone to structural drift. Based on this diagnosis, we propose LPH-VTON, a new synergistic framework that resolves this tension within a single, continuous denoising process. LPH-VTON strategically decomposes the generation, leveraging a structure-biased model to establish a geometrically consistent latent scaffold in the early stages, before handing over control to a texture-biased model for high-fidelity detail rendering. Extensive experiments validate our approach. Our model achieves a superior Pareto-optimal balance, establishing new benchmarks in perceptual faithfulness while maintaining highly competitive structural alignment across the standard dataset VITON-HD, proving the efficacy of temporal architectural decoupling.

Chinese Translation

虚拟试穿（VTON）旨在合成与人体及姿态精确对齐的照片级真实感服装图像。然而，当前基于扩散的方法面临着结构完整性与纹理保真度之间的基本权衡。本文将这一挑战形式化为现有架构中固有的互补归纳偏差的结果：高度依赖空间约束的模型自然倾向于几何对齐，但往往抑制纹理，而受无约束生成先验主导的模型在生动细节渲染方面表现优异，但容易出现结构漂移。基于这一诊断，我们提出了LPH-VTON，一个新的协同框架，通过单一的连续去噪过程解决这一矛盾。LPH-VTON战略性地分解生成过程，在早期阶段利用结构偏向模型建立几何一致的潜在支架，然后将控制权交给纹理偏向模型以实现高保真细节渲染。大量实验验证了我们的方法。我们的模型实现了优越的帕累托最优平衡，在感知真实度方面建立了新的基准，同时在标准数据集VITON-HD上保持了高度竞争的结构对齐，证明了时间架构解耦的有效性。
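
The handover idea above, one denoising loop whose early steps are driven by a structure-biased model and whose later steps are driven by a texture-biased model, can be sketched as a control-flow skeleton. The two step functions below are placeholders rather than real denoisers, and `handover_at` is a hypothetical parameter; the abstract does not say how the switch point is chosen or how the two models' latent spaces are kept compatible.

```python
def denoise_with_handover(latent, structure_step, texture_step, num_steps, handover_at):
    """Run a single denoising loop, switching step functions at `handover_at`."""
    used = []
    for step in range(num_steps):
        # Early steps build the scaffold; later steps render detail.
        step_fn = structure_step if step < handover_at else texture_step
        latent = step_fn(latent, step)
        used.append(step_fn.__name__)
    return latent, used


def structure_step(latent, step):
    """Placeholder update standing in for a structure-biased denoiser."""
    return 0.8 * latent


def texture_step(latent, step):
    """Placeholder update standing in for a texture-biased denoiser."""
    return 0.95 * latent


final_latent, used = denoise_with_handover(
    1.0, structure_step, texture_step, num_steps=10, handover_at=4
)
```

The point of the skeleton is that the latent is never re-noised or re-encoded at the switch: the second model simply continues the trajectory the first one started.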

cs.CV / 102 / 2605.14876

Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

通过闭环验证推理解锁复杂视觉生成

Cheng, Hanbo, Lin, Limin, Zhang, Ruo, Pan, Yicheng, Du, Jun

Abstract

Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose $\Delta$-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.

Chinese Translation

尽管快速发展，当前的文本到图像（T2I）模型主要依赖单步生成范式，这在处理复杂语义时面临困难，并且参数扩展的收益递减。尽管近期的多步推理方法显示出希望，但它们受到缺乏验证的无根规划幻觉、单一的事后反思、长上下文优化不稳定性以及高昂的推理延迟的制约。为了解决这些瓶颈，我们提出了闭环视觉推理（CLVR）框架，这是一个全面的系统，深度结合了视觉-语言逻辑规划与像素级扩散生成。CLVR引入了一个具有步骤级视觉验证的自动化数据引擎，以合成可靠的推理轨迹，并提出了代理提示强化学习（PPRL），通过将交错的多模态历史提炼为明确的奖励信号，从而解决长上下文优化的不稳定性，以实现准确的因果归因。此外，为了减轻由于迭代去噪引起的严重延迟瓶颈，我们提出了$\Delta$-空间权重合并（DSWM），这是一种理论基础的方法，将对齐权重与现成的蒸馏先验融合，从而将每步推理成本降低到仅4个NFE，而无需昂贵的重新蒸馏。大量实验表明，CLVR在多个基准测试中优于现有的开源基线，并接近专有商业模型的性能，为复杂视觉生成解锁了一般测试时扩展能力。

cs.CV / 103 / 2605.14877

HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

HeatKV：针对视觉自回归建模的头调优KV缓存压缩

Cederlund, Jonathan, Berg, Axel, Acar, Durmus Alp Emre, Zhou, Chuteng, Giselsson, Pontus

Abstract

Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts cache allocation in each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves $2 \times$ higher compression ratio in memory allocation for KV cache compared to existing methods, while maintaining similar or better image fidelity, prompt alignment and human perception score. Our method achieves a new state-of-the-art (SOTA) for VAR model KV-cache compression, showcasing the effectiveness of fine-grained, head-specific cache allocation.

Chinese Translation

视觉自回归（VAR）模型最近在图像生成质量方面表现出色，同时保持低延迟。然而，它们面临严重的KV缓存内存限制，通常每生成一幅图像就需要数GB的内存。我们提出了HeatKV，一种新颖的压缩方法，根据每个头对先前生成尺度的注意力来调整缓存分配。通过使用一个小型离线校准集，我们根据注意力头在先前尺度上的注意力得分对其进行排名。基于这一排名，我们构建了一个静态剪枝计划，以适应给定的内存预算。应用于Infinity-2B模型，HeatKV在KV缓存的内存分配中实现了比现有方法高出$2 \times$的压缩比，同时保持相似或更好的图像保真度、提示对齐和人类感知评分。我们的方法在VAR模型的KV缓存压缩中达到了新的最先进水平（SOTA），展示了细粒度、特定头部的缓存分配的有效性。
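
One way to picture the head-specific allocation described above is a fixed token budget split across attention heads according to scores measured once on a calibration set. The sketch below assumes a proportional split with a per-head floor; the abstract only says that heads are ranked and a static schedule is built, so the allocation rule, the function name `allocate_cache`, and the scores are all illustrative assumptions.

```python
def allocate_cache(head_scores, total_budget, min_per_head=1):
    """Split a KV-cache token budget across heads in proportion to calibration scores.

    head_scores: one non-negative score per head, e.g. how strongly the head
    attends to earlier scales on a calibration set (assumed metric).
    Returns a dict mapping head index to the number of cached tokens to keep.
    """
    num_heads = len(head_scores)
    spare = total_budget - min_per_head * num_heads
    if spare < 0:
        raise ValueError("budget too small for the per-head minimum")
    total_score = sum(head_scores)
    budgets = {
        h: min_per_head + int(spare * s / total_score)
        for h, s in enumerate(head_scores)
    }
    # Hand out tokens lost to rounding, highest-ranked heads first.
    ranking = sorted(range(num_heads), key=lambda h: head_scores[h], reverse=True)
    leftover = total_budget - sum(budgets.values())
    for h in ranking[:leftover]:
        budgets[h] += 1
    return budgets


# Hypothetical calibration scores for four heads.
budgets = allocate_cache([8.0, 4.0, 3.0, 1.0], total_budget=102)
```

Because the schedule depends only on offline calibration, it can be computed once per model and memory budget and reused for every generated image.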

cs.CV / 104 / 2605.14880

Denoising-GS: Gaussian Splatting with Spatial-aware Denoising

去噪-GS：具有空间感知去噪的高斯点云渲染

Zhou, Qingyuan, Liu, Xinyi, Yang, Weidong, Wang, Ning, Ye, Shuquan, Fei, Ben, He, Ying, Ouyang, Wanli

Abstract

Recent advances in 3D Gaussian Splatting (3DGS) have achieved remarkable success in high-fidelity Novel View Synthesis (NVS), yet the optimization process inevitably introduces noisy Gaussian primitives due to the sparse and incomplete initialization from Structure-from-Motion (SfM) point clouds. Most existing methods focus solely on adjusting the positions of primitives during optimization, while neglecting the underlying spatial structure. To this end, we introduce a new perspective by formulating the optimization of 3DGS as a primitive denoising process and propose Denoising-GS, a spatial-aware denoising framework for Gaussian primitives by taking both the positions and spatial structure into consideration. Specifically, we design an optimizer that preserves the spatial optimization flow of primitives, facilitating coherent and directed denoising rather than random perturbations. Building upon this, the Spatial Gradient-based Denoising strategy jointly considers the spatial supports of primitives to ensure gradient-consistent updates. Furthermore, the Uncertainty-based Denoising module estimates primitive-wise uncertainty to prune redundant or noisy primitives, while the Spatial Coherence Refinement strategy selectively splits primitives in sparse regions to maintain structural completeness. Experiments conducted on three benchmark datasets demonstrate that Denoising-GS consistently enhances NVS fidelity while maintaining representation compactness, achieving state-of-the-art performance across all benchmarks. Source code and models will be made publicly available.

Chinese Translation

近年来，3D高斯点云渲染（3DGS）在高保真新视图合成（NVS）方面取得了显著成功，但优化过程不可避免地引入了由于从运动结构（SfM）点云稀疏和不完整初始化而产生的噪声高斯原语。现有大多数方法仅关注在优化过程中调整原语的位置，而忽视了潜在的空间结构。为此，我们提出了一种新视角，将3DGS的优化过程视为原语去噪过程，并提出了去噪-GS（Denoising-GS），这是一个考虑原语位置和空间结构的空间感知去噪框架。具体而言，我们设计了一种优化器，保持原语的空间优化流，促进一致且有针对性的去噪，而不是随机扰动。在此基础上，基于空间梯度的去噪策略共同考虑原语的空间支持，以确保梯度一致的更新。此外，基于不确定性的去噪模块估计原语级的不确定性，以修剪冗余或噪声原语，而空间一致性精炼策略则选择性地在稀疏区域拆分原语，以保持结构的完整性。在三个基准数据集上进行的实验表明，去噪-GS在增强NVS的保真度的同时保持了表示的紧凑性，在所有基准测试中实现了最先进的性能。源代码和模型将公开发布。

cs.CV / 105 / 2605.14885

Masked Next-Scale Prediction for Self-supervised Scene Text Recognition

自监督场景文本识别的掩码下一尺度预测

Chen, Zhuohao, Li, Zeng, Zhang, Yifei, Liu, Chang, Zhou, Yu

Abstract

Scene Text Recognition requires modeling visual structures that evolve from coarse layouts to fine-grained character strokes. Training such models relies on large amounts of annotated data. Recent self-supervised approaches, such as Masked Image Modeling (MIM), alleviate this dependency by leveraging large-scale unlabeled data. Yet most existing MIM methods operate at a single spatial scale and fail to capture the hierarchical nature of scene text. In this work, we introduce Masked Next-Scale Prediction (MNSP), a unified self-supervised framework designed to explicitly model cross-scale structural evolution. The framework incorporates Next-Scale Prediction (NSP), which learns hierarchical representations by predicting higher-resolution features from lower-resolution contexts. Naive scale prediction, however, tends to produce spatially diffuse attention, directing the model toward background regions rather than textual structures. MNSP resolves this limitation by jointly learning cross-scale prediction and masked image reconstruction. NSP captures global layout priors across resolutions, while masked reconstruction imposes strong local constraints that guide attention toward informative text regions. A Multi-scale Linguistic Alignment module further maintains semantic consistency across different resolutions. Extensive experiments demonstrate that MNSP achieves state-of-the-art performance, reaching 86.2% average accuracy on the challenging Union14M benchmark and 96.7% across six standard datasets. Additional analyses show that our method improves robustness under extreme scale and layout variations. Code is available at https://github.com/CzhczhcHczh/MNSP

Chinese Translation

场景文本识别需要对从粗略布局到细粒度字符笔画的视觉结构进行建模。训练此类模型依赖于大量的标注数据。近期的自监督方法，如掩码图像建模（Masked Image Modeling, MIM），通过利用大规模未标注数据来减轻这种依赖。然而，大多数现有的MIM方法仅在单一空间尺度上操作，未能捕捉场景文本的层次特性。在本研究中，我们提出了掩码下一尺度预测（Masked Next-Scale Prediction, MNSP），这是一个统一的自监督框架，旨在明确建模跨尺度的结构演变。该框架结合了下一尺度预测（Next-Scale Prediction, NSP），通过从低分辨率上下文中预测高分辨率特征来学习层次表示。然而，简单的尺度预测往往会产生空间上分散的注意力，使模型更倾向于关注背景区域而非文本结构。MNSP通过联合学习跨尺度预测和掩码图像重建来解决这一限制。NSP捕捉不同分辨率下的全局布局先验，而掩码重建则施加强烈的局部约束，引导注意力指向信息丰富的文本区域。多尺度语言对齐模块进一步维护不同分辨率之间的语义一致性。大量实验表明，MNSP在挑战性的Union14M基准上达到了86.2%的平均准确率，并在六个标准数据集上达到了96.7%的准确率，取得了当前最先进的性能。额外分析显示我们的方法在极端尺度和布局变化下提高了鲁棒性。代码可在 https://github.com/CzhczhcHczh/MNSP 获取。

cs.CV / 106 / 2605.14889

SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition

SurgicalMamba：具有状态重编程的双路径SSD用于在线外科阶段识别

Oh, Sukju, Sun, Sukkyu

Abstract

Online surgical phase recognition (SPR) underpins context-aware operating-room systems and requires committing to a prediction at every frame from past context alone. Surgical video poses three demands that natural-video recognizers do not jointly address: procedures span tens of thousands of frames, time flows non-uniformly as long routine stretches are punctuated by brief phase-defining transitions, and the visual domain is narrow so backbone features are strongly correlated across channels. Existing recognizers either let per-frame cost grow with elapsed length, or hold cost bounded but advance state at a uniform rate with channel-independent dynamics, leaving the latter two demands unaddressed. We present SurgicalMamba, a causal SPR model built on Mamba2's structured state-space duality (SSD) that holds per-frame cost at O(d). It introduces three SSD-compatible components, each targeting one demand: a dual-path SSD block that separates long- and short-term regimes at the level of recurrent state; intensity-modulated stepping, a continuous-time time-warp that adapts the slow path's effective rate to phase-relevant information; and state regramming, a per-chunk Cayley rotation that opens cross-channel mixing in the otherwise axis-aligned SSM recurrence. The learned rotation planes inherit a phase-aligned structure without any direct supervision, offering an interpretable internal signature of surgical workflow. Across seven public SPR benchmarks, SurgicalMamba reaches state-of-the-art accuracy and phase-level Jaccard under strict online evaluation: 94.6%/82.7% on Cholec80 (+0.7 pp/+2.2 pp over the strongest prior) and 89.5%/68.9% on AutoLaparo (+1.7 pp/+2.0 pp), at 119 fps on a single GPU. Ablations isolate the contribution of each component. The code is publicly available at https://github.com/sukjuoh/Surgical-Mamba.

Chinese Translation

在线外科阶段识别（SPR）是上下文感知手术室系统的基础，要求仅依赖过去的上下文在每一帧上进行预测。外科视频面临三项自然视频识别器无法共同解决的需求：手术过程跨越数万帧，时间流动不均匀，长时间的常规过程被短暂的阶段定义过渡所打断，且视觉域较窄，因此主干特征在通道之间高度相关。现有的识别器要么让每帧的成本随着经过的长度增长，要么保持成本在一个界限内，但以通道无关的动态以均匀速率推进状态，从而未能解决后两项需求。我们提出了SurgicalMamba，一种基于Mamba2的结构化状态空间双重性（SSD）的因果SPR模型，能够将每帧成本保持在O(d)。它引入了三个与SSD兼容的组件，每个组件针对一个需求：一个双路径SSD模块，在递归状态层面分离长短期状态；强度调制步进，一种连续时间的时间扭曲，能够根据与阶段相关的信息调整慢路径的有效速率；以及状态重编程，一种每块的Cayley旋转，能够在原本轴对齐的SSM递归中打开跨通道混合。学习到的旋转平面在没有任何直接监督的情况下继承了阶段对齐的结构，提供了外科工作流程的可解释内部特征。在七个公共SPR基准测试中，SurgicalMamba在严格的在线评估下达到了最先进的准确率和阶段级Jaccard：在Cholec80上为94.6%/82.7%（比最强的先前方法提高了0.7个百分点/2.2个百分点），在AutoLaparo上为89.5%/68.9%（提高了1.7个百分点/2.0个百分点），在单个GPU上以119 fps运行。消融实验隔离了每个组件的贡献。代码已公开发布在https://github.com/sukjuoh/Surgical-Mamba。
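
The Cayley rotation mentioned above maps a skew-symmetric matrix A to the orthogonal matrix Q = (I - A)(I + A)^-1, which mixes channels without changing the norm of the state. The sketch below works the smallest case, a single 2D rotation plane with A = [[0, a], [-a, 0]]; how the paper parameterizes, learns, and applies these rotations per chunk is not stated in the abstract, so this only illustrates the transform itself.

```python
def cayley_rotation_2d(a):
    """Cayley transform Q = (I - A)(I + A)^-1 for A = [[0, a], [-a, 0]].

    For this A the product simplifies to a plane rotation with
    cos = (1 - a^2) / (1 + a^2) and sin = 2a / (1 + a^2).
    """
    denom = 1.0 + a * a
    cos_t = (1.0 - a * a) / denom
    sin_t = 2.0 * a / denom
    return [[cos_t, -sin_t], [sin_t, cos_t]]


def apply(matrix, vec):
    """Matrix-vector product for small dense matrices given as nested lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


q = cayley_rotation_2d(0.5)
state = [3.0, 4.0]        # two channels of a recurrent state
mixed = apply(q, state)   # each output channel now depends on both inputs
norm_before = sum(v * v for v in state) ** 0.5
norm_after = sum(v * v for v in mixed) ** 0.5
```

Because Q is orthogonal, the rotation adds cross-channel mixing while leaving the state's magnitude unchanged, which is one reason a Cayley parameterization is a common choice when stability of a recurrence matters.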

cs.CV / 107 / 2605.14891

Hierarchical Image Tokenization for Multi-Scale Image Super Resolution

用于多尺度图像超分辨率的分层图像标记化

Hadji, Isma, Sanchez, Enrique, Bulat, Adrian, Martinez, Brais, Tzimiropoulos, Georgios

Abstract

We introduce a multi-scale Image Super Resolution (ISR) method building on recent advances in Visual Auto-Regressive (VAR) modeling. VAR models break image tokenization into additive, gradually increasing scales, using Residual Quantization (RQ), an approach that aligns perfectly with our target ISR task. Previous works taking advantage of this synergy suffer from two main shortcomings. First, due to the limitations in RQ, they only generate images at a predefined fixed scale, failing to map intermediate outputs to the corresponding image scales. Second, they rely on large backbones or a large corpus of annotated data to achieve better performance. To address both shortcomings, we introduce two novel components to the VAR training for ISR, aiming at increasing its flexibility and reducing its complexity. In particular, we introduce a) a Hierarchical Image Tokenization (HIT) approach that progressively represents images at different scales while enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the (LR,HR) pair, encourages the transformer to produce the latter over the former. Our proposed HIT acts as a strong inductive bias for the VAR training, resulting in a small model (300M params vs 1B params of VARSR), that achieves state-of-the-art results without external training data, and that delivers multi-scale outputs with a single forward pass.

Chinese Translation

我们提出了一种基于视觉自回归（Visual Auto-Regressive, VAR）建模的多尺度图像超分辨率（Image Super Resolution, ISR）方法。VAR模型将图像标记化分解为逐渐增加的加性尺度，使用残差量化（Residual Quantization, RQ）方法，这一方法与我们的目标ISR任务完美契合。之前利用这一协同效应的研究存在两个主要缺陷。首先，由于RQ的局限性，它们仅能在预定义的固定尺度下生成图像，未能将中间输出映射到相应的图像尺度。其次，它们依赖于大型骨干网络或大量标注数据以实现更好的性能。为了解决这两个缺陷，我们为ISR的VAR训练引入了两个新组件，旨在提高其灵活性并降低其复杂性。具体而言，我们引入了a) 一种分层图像标记化（Hierarchical Image Tokenization, HIT）方法，该方法逐步在不同尺度上表示图像，同时在尺度之间强制标记重叠，以及b) 一种直接偏好优化（Direct Preference Optimization, DPO）正则化项，该项仅依赖于（低分辨率，高分辨率）对，鼓励变换器生成后者而非前者。我们提出的HIT作为VAR训练的强归纳偏置，导致了一个小型模型（300M参数对比VARSR的1B参数），在没有外部训练数据的情况下实现了最先进的结果，并且能够在一次前向传播中生成多尺度输出。
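
The additive, coarse-to-fine tokenization that VAR-style models rely on can be illustrated in one dimension. The sketch below is a generic multi-scale residual quantizer, not the HIT tokenizer proposed in the paper: it assumes average-pool downsampling, nearest-neighbor upsampling, and a uniform scalar quantizer with a hypothetical `step`, and it requires the signal length to be divisible by every scale.

```python
def downsample(signal, size):
    """Average-pool a 1D signal down to `size` values."""
    block = len(signal) // size
    return [sum(signal[i * block:(i + 1) * block]) / block for i in range(size)]


def upsample(coarse, length):
    """Nearest-neighbor upsample a coarse signal back to `length` values."""
    block = length // len(coarse)
    return [v for v in coarse for _ in range(block)]


def multiscale_residual_quantize(signal, scales, step=0.5):
    """Quantize the running residual at each scale and accumulate the result."""
    recon = [0.0] * len(signal)
    residual = list(signal)
    tokens = []
    for size in scales:
        coarse = downsample(residual, size)
        quantized = [round(v / step) * step for v in coarse]  # uniform quantizer
        tokens.append(quantized)
        recon = [r + u for r, u in zip(recon, upsample(quantized, len(signal)))]
        residual = [s - r for s, r in zip(signal, recon)]
    return tokens, recon


signal = [0.2, 0.4, 1.1, 1.3, 2.0, 2.2, 0.9, 0.7]
tokens, recon = multiscale_residual_quantize(signal, scales=[1, 2, 4, 8])
max_error = max(abs(s - r) for s, r in zip(signal, recon))
```

Each scale only encodes what the coarser scales missed, so the partial sums after each scale are progressively sharper approximations of the full-resolution signal. The intermediate sums here live at full resolution, which is related to the limitation the abstract raises about mapping intermediate outputs to their own image scales.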

cs.CV / 108 / 2605.14893

Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers

您的 CLIP 具有 164 个噪声维度：探索对比预训练视觉-语言变换器的嵌入协方差特征谱

Grzywaczewski, Jakub, Płudowski, Dawid, Biecek, Przemysław

Abstract

Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we employ spectral decomposition of covariance matrices to decompose the VLM latent space into a multi-modal semantic signal component and a shared noise subspace. We observe that this noise geometry exhibits strong subgroup invariance across distinct data subsets. Crucially, pruning these shared noise dimensions is mainly harmless, preserving or actively improving downstream task performance. By isolating true semantic signals from artifactual noise, this work provides new mechanistic insights into the representational structure of modern VLMs, suggesting that a substantial fraction of their latent geometry is governed by shared, architecture-level noise rather than task-relevant semantics alone.
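The decomposition described above can be sketched with NumPy as an eigendecomposition of the embedding covariance followed by a projection that removes selected eigen-directions. The abstract does not say how the noise directions are identified, so `noise_idx` is a hypothetical input here, and both function names are illustrative rather than the authors' code.

```python
import numpy as np

def covariance_eigenspectrum(embeddings):
    """Eigenvalues (descending) and matching eigenvectors of the embedding covariance."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(embeddings) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def prune_directions(embeddings, eigvecs, noise_idx):
    """Project embeddings onto the orthogonal complement of the chosen eigen-directions.

    noise_idx is an assumption: a list of column indices judged to be noise.
    """
    noise = eigvecs[:, noise_idx]             # (d, k) with orthonormal columns
    return embeddings - (embeddings @ noise) @ noise.T
```

After pruning, the embeddings have no component along the removed directions, so downstream similarity scores depend only on the retained subspace.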

Chinese Translation

对比预训练的视觉-语言模型（VLMs）作为强大的特征提取器。然而，它们共享的潜在空间容易出现结构异常，并充当非语义多模态噪声的储存库。为了解决这一现象，我们采用协方差矩阵的谱分解，将 VLM 潜在空间分解为多模态语义信号成分和共享噪声子空间。我们观察到，这种噪声几何在不同数据子集之间表现出强大的子群不变性。重要的是，修剪这些共享噪声维度主要是无害的，能够保持或积极改善下游任务的性能。通过将真实的语义信号与伪造的噪声隔离开来，本研究为现代 VLM 的表征结构提供了新的机制性见解，表明其潜在几何的相当一部分是由共享的、架构级别的噪声所主导，而不仅仅是与任务相关的语义。

View on arXiv Download PDF AI Translation

cs.CV / 109 / 2605.14894

SEDiT: Mask-Free Video Subtitle Erasure via One-step Diffusion Transformer

SEDiT：通过一步扩散变换器实现无掩码视频字幕擦除

Hui, Zheng, Bai, Yunlong

Abstract

Recent breakthroughs in video diffusion models have significantly accelerated the development of video editing techniques. However, existing methods often rely on inpainting video frames based on masked input, which requires extracting the target video mask in advance, and the precision of the segmentation directly affects the quality of the completion. In this paper, we present SEDiT, a novel one-stage video Subtitle Erasure approach via One-step Diffusion Transformer. We introduce a mask-free inference approach that enables direct erasure of the targeted subtitle. The proposed one-stage framework mitigates the sub-optimality inherent in the two-stage processing of prior models. Since subtitle removal is a localized editing task in which most pixels remain unchanged, the underlying distribution shift is minimal, making it well-suited to one-step generation under rectified flow. We empirically validate the reliability of one-step denoising and further provide a formal theoretical justification. Under the localized-editing structure of subtitle removal, the conditional optimal transport (OT) map and its induced rectified flow velocity field are Lipschitz continuous with respect to the latent variable, which underpins the theoretical feasibility of one-step sampling. To address the challenge of long-term temporal consistency, we adopt a hybrid training strategy by occasionally conditioning the model with a clean first-frame latent. This facilitates temporal continuity, allowing each segment during inference to leverage the output of its predecessor. To avoid visible seams caused by cropping and reinserting processed targets, particularly in scenarios involving substantial motion, we feed the original video directly into SEDiT. Thanks to one-step and chunk-wise streaming inference, our method can efficiently handle native 1440p video with infinite length.
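The intuition behind one-step sampling under rectified flow can be shown with a toy example: when trajectories are straight lines, a single Euler step from t=0 to t=1 gives the same result as many small steps. The sketch below assumes time runs from 0 (source) to 1 (target) and uses a hand-built constant velocity field; it is not the SEDiT model, and all names are illustrative.

```python
def euler_sample(x0, velocity, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 using `steps` Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def make_straight_velocity(source, target):
    """Toy field with straight trajectories: the velocity is constant in x and t."""
    delta = [b - a for a, b in zip(source, target)]
    return lambda x, t: delta

source = [0.0, 1.0]
target = [2.0, -1.0]
field = make_straight_velocity(source, target)
one_step = euler_sample(source, field, steps=1)
many_steps = euler_sample(source, field, steps=8)
```

In this toy case `one_step` and `many_steps` coincide with `target`. A learned velocity field is only approximately straight, which is why the abstract argues from the small distribution shift of localized edits.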

Chinese Translation

最近视频扩散模型的突破性进展显著加速了视频编辑技术的发展。然而，现有的方法通常依赖于基于掩码输入对视频帧进行修补，这需要提前提取目标视频的掩码，而分割的精度直接影响到修补的质量。本文提出了SEDiT，一种通过一步扩散变换器实现的新型单阶段视频字幕擦除方法。我们引入了一种无掩码推理方法，能够直接擦除目标字幕。所提出的单阶段框架减轻了先前模型中存在的二阶段处理的次优性。由于字幕移除是一个局部编辑任务，其中大多数像素保持不变，因此潜在的分布偏移最小，使其非常适合在校正流下进行一步生成。我们通过实证验证了一步去噪的可靠性，并进一步提供了正式的理论证明。在字幕移除的局部编辑结构下，条件最优传输（OT）映射及其诱导的校正流速场对潜在变量是利普希茨连续的，这为一步采样的理论可行性提供了基础。为了解决长期时间一致性的问题，我们采用了一种混合训练策略，偶尔用干净的第一帧潜在变量对模型进行条件化。这促进了时间连续性，使得推理过程中每个片段能够利用其前驱的输出。为了避免因裁剪和重新插入处理目标而导致的可见接缝，尤其是在涉及大量运动的场景中，我们直接将原始视频输入SEDiT。得益于一步和分块流式推理，我们的方法能够高效处理原生1440p视频，且长度无限。

View on arXiv Download PDF AI Translation

cs.CV / 110 / 2605.14906

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

MemLens：大型视觉-语言模型中多模态长期记忆的基准测试

Ren, Xiyu, Wang, Zhaowei, Du, Yiming, Xie, Zhongwei, Liu, Chi, Yang, Xinlin, Feng, Haoyue, Pan, Wenjun, Zheng, Tianshi, Xu, Baixuan, Li, Zhengnan, Song, Yangqiu, Wong, Ginny, See, Simon

Abstract

Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github.com/xrenaf/MEMLENS.

Chinese Translation

记忆对于大型视觉-语言模型（LVLMs）处理长时间、多模态交互至关重要，现有的两种方法方向提供了这一能力：长上下文LVLMs和增强记忆的智能体。然而，目前没有现有基准对这两者在真正需要多模态证据的问题上进行系统比较。为填补这一空白，我们引入了MEMLENS，这是一个针对多模态多会话对话中记忆的综合基准，涵盖了789个问题，涉及五种记忆能力（信息提取、多会话推理、时间推理、知识更新和拒绝回答），在四种标准上下文长度（32K-256K个标记）下采用跨模态标记计数方案。图像消融研究确认了解决MEMLENS需要视觉证据：去除证据图像使得两种前沿LVLMs在80.4%包含图像证据的问题上准确率降至2%以下。对27个LVLMs和7个增强记忆的智能体进行评估后，我们发现长上下文LVLMs通过直接视觉基础实现了高短上下文准确率，但随着对话的增长而退化，而记忆智能体在长度上保持稳定，但在存储时间压缩下失去视觉保真度。多会话推理使大多数系统的准确率低于30%，而单独的任何一种方法都无法解决该任务。这些结果激励了结合长上下文注意力与结构化多模态检索的混合架构。我们的代码可在https://github.com/xrenaf/MEMLENS获取。

View on arXiv Download PDF AI Translation

cs.CV / 111 / 2605.14908

SteerSeg: Attention Steering for Reasoning Video Segmentation

SteerSeg：用于推理视频分割的注意力引导

Cheraghian, Ali, Dastmalchi, Hamidreza, Khamis, Abdelwahed, Saberi, Morteza, An, Aijun, Petersson, Lars

Abstract

Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision-language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling training-free grounding. However, these attention maps are optimized for text generation rather than spatial localization, often resulting in diffuse and ambiguous grounding signals. In this work, we introduce SteerSeg, a lightweight framework that identifies attention misalignment as the key bottleneck in attention-based grounding and proposes to steer attention at its source through input-level conditioning. SteerSeg combines learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting. The soft prompts reshape the attention distribution to produce more spatially concentrated maps, while CoT-derived attributes resolve ambiguity among similar objects by guiding attention toward the correct instance. The resulting attention maps are converted into point prompts across keyframes to guide a segmentation model, while candidate tracklets are ranked and selected using correlation-based scoring. Our approach freezes the LVLM and segmentation model parameters and learns only a small set of soft prompts, preserving the model's pretrained reasoning capabilities while significantly improving grounding. Despite being trained only on Ref-YouTube-VOS, SteerSeg generalizes well across diverse benchmarks, significantly improving the spatial grounding capability of LVLMs. Project page: https://steerseg.github.io
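As a rough illustration of turning an attention map into point prompts, the sketch below picks the k highest-scoring locations of a 2D map. The abstract does not specify how SteerSeg selects points, so the top-k rule and the name `top_k_points` are assumptions.

```python
def top_k_points(attn, k):
    """Return the (row, col) positions of the k largest values in a 2D attention map."""
    scored = [(value, r, c)
              for r, row in enumerate(attn)
              for c, value in enumerate(row)]
    # Sort by attention value only, highest first; ties keep scan order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(r, c) for _, r, c in scored[:k]]

attention = [[0.1, 0.2],
             [0.9, 0.4]]
points = top_k_points(attention, 2)
```

A more spatially concentrated map, which the soft prompts are meant to produce, makes these points cluster on the target object instead of scattering across the frame.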

Chinese Translation

视频推理分割需要根据自然语言表达在视频帧中定位对象，通常涉及空间推理和隐含引用。近期的方法通过提取冻结的大型视觉-语言模型（LVLM）的注意力图，并将其作为分割的空间先验，从而实现无训练的基础定位。然而，这些注意力图是针对文本生成而优化的，而非空间定位，常常导致模糊和不明确的定位信号。在本研究中，我们提出了SteerSeg，一个轻量级框架，识别注意力不对齐作为基于注意力的定位的关键瓶颈，并建议通过输入级条件引导注意力。SteerSeg结合了可学习的软提示和推理引导的思维链（Chain-of-Thought, CoT）提示。软提示重塑了注意力分布，以产生更集中于空间的图，而CoT派生的属性通过引导注意力指向正确的实例来解决相似对象之间的模糊性。生成的注意力图被转换为关键帧中的点提示，以引导分割模型，同时使用基于相关性的评分对候选轨迹进行排名和选择。我们的方法冻结了LVLM和分割模型的参数，仅学习一小组软提示，保留了模型的预训练推理能力，同时显著改善了定位能力。尽管仅在Ref-YouTube-VOS上进行训练，SteerSeg在各种基准测试中表现良好，显著提升了LVLM的空间定位能力。项目页面：https://steerseg.github.io

View on arXiv Download PDF AI Translation

cs.CV / 112 / 2605.14913

Representative Attention For Vision Transformers

视觉变换器的代表性注意力

Li, Yuntong, Wang, Hainuo, Liu, Hengxing, Li, Mingjia, Guo, Xiaojie

Abstract

Linear attention has emerged as a promising direction for scaling Vision Transformers beyond the quadratic cost of dense self-attention. A prevalent strategy is to compress spatial tokens into a compact set of intermediate proxies that mediate global information exchange. However, existing methods typically derive these proxy tokens from predefined spatial layouts, causing token compression to remain anchored to image coordinates rather than the semantic organization of visual content. To overcome this limitation, we propose Representative Attention (RPAttention), a linear global attention mechanism that performs token compression directly in representation space. Instead of constructing intermediate tokens from fixed spatial partitions, it dynamically forms a compact set of learned representative tokens, following a lightweight Gather-Interact-Distribute paradigm, so that semantically related regions can communicate regardless of their spatial distance. Spatial tokens are first softly gathered into representative tokens through competitive similarity-based routing. The representatives then perform global interaction within a compact latent space, before broadcasting the refined information back to all spatial tokens via query-driven cross-attention. By replacing coordinate-driven aggregation with representation-driven compression, RPAttention preserves global receptive fields while adaptively aligning token communication with the content structure of each input. RPAttention reduces the dominant token interaction complexity from quadratic to linear in the number of spatial tokens, while maintaining expressive global context modeling. Extensive experiments across diverse vision transformer backbones on image classification, object detection, and semantic segmentation demonstrate the effectiveness of our design.
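The Gather-Interact-Distribute pattern can be sketched in NumPy as three small attention steps whose cost grows linearly with the number of spatial tokens N when the number of representatives M is fixed. This is a simplified illustration without learned query/key/value projections or multiple heads, and the routing rule shown is an assumption rather than the paper's exact design.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gather_interact_distribute(tokens, reps):
    """tokens: (N, d) spatial tokens; reps: (M, d) representative queries, M << N."""
    d = tokens.shape[1]
    scale = 1.0 / np.sqrt(d)
    # Gather: softmax over the M representatives makes them compete for each
    # token; each representative then averages its assigned tokens. O(N*M*d).
    routing = softmax(tokens @ reps.T * scale, axis=1)               # (N, M)
    weights = routing / (routing.sum(axis=0, keepdims=True) + 1e-9)
    gathered = weights.T @ tokens                                     # (M, d)
    # Interact: dense self-attention among the M representatives. O(M*M*d).
    interacted = softmax(gathered @ gathered.T * scale, axis=1) @ gathered
    # Distribute: every spatial token queries the representatives. O(N*M*d).
    return softmax(tokens @ interacted.T * scale, axis=1) @ interacted
```

No N-by-N matrix is ever formed: the largest intermediate is the (N, M) routing matrix, which is where the linear scaling in N comes from.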

Chinese Translation

线性注意力已成为扩展视觉变换器以超越密集自注意力的二次成本的有前景方向。一种普遍的策略是将空间标记压缩为一组紧凑的中间代理，以调解全球信息交换。然而，现有方法通常从预定义的空间布局中派生这些代理标记，导致标记压缩仍然依赖于图像坐标，而不是视觉内容的语义组织。为了解决这一限制，我们提出了代表性注意力（RPAttention），一种在线性全球注意力机制中直接在表示空间中执行标记压缩。它不是从固定的空间划分构建中间标记，而是动态形成一组紧凑的学习代表标记，使语义相关区域能够进行交流，而不受其空间距离的限制，遵循轻量级的聚集-互动-分发范式。空间标记首先通过基于竞争相似性的路由被软聚集为代表标记。然后，代表标记在紧凑的潜在空间中执行全球互动，最后通过查询驱动的交叉注意力将精炼的信息广播回所有空间标记。通过用表示驱动的压缩替代坐标驱动的聚合，RPAttention在保持全球感受野的同时，自适应地将标记通信与每个输入的内容结构对齐。RPAttention将主导的标记交互复杂性从关于空间标记数量的二次缩减为线性，同时保持表达性的全球上下文建模。在图像分类、目标检测和语义分割等多种视觉变换器骨干网络上的广泛实验表明了我们设计的有效性。

View on arXiv Download PDF AI Translation

cs.CV / 113 / 2605.14923

SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

场景解析器：用于视觉语义理解的层次场景解析

Xu, Pengxin, Lin, Xincheng, Xiao, Luping, Jiang, Qing, Zhang, Meishan, Fei, Hao, Zhang, Shanghang, Chen, Xingyu

Abstract

General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing the structured dependencies needed for interaction-oriented scene understanding. To address this gap, we introduce Hierarchical Scene Parsing, an interaction-oriented parsing task that represents physical scenes as explicit scene -> object -> part -> affordance hierarchies with cross-level bindings. We instantiate this task with SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning. To support training and evaluation, we construct SceneParser-Bench, a large-scale benchmark built with a scalable hierarchical data engine, containing 110K training images, a 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations, and 1.74M valid object-part-affordance chain instances. We further introduce Level-1 to Level-3 conditional metrics and ParseRate to evaluate localization, cross-level binding, and hierarchical completeness. Experiments show that existing MLLMs and perception-stitching pipelines struggle with hierarchical parsing on our SceneParser-Bench, while SceneParser achieves stronger structure-aware performance. Besides, ablations, evaluations on COCO and AGD20K, and a downstream planning probe demonstrate that our SceneParser is compatible with conventional tasks and provides an actionable representation for visual understanding.
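One way to picture the scene -> object -> part -> affordance hierarchy is as nested records, from which complete object-part-affordance chains can be enumerated. The class and field names below are hypothetical and do not reflect the SceneParser output schema, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format

@dataclass
class Affordance:
    label: str
    point: Tuple[float, float]

@dataclass
class Part:
    label: str
    box: Box
    affordances: List[Affordance] = field(default_factory=list)

@dataclass
class SceneObject:
    label: str
    box: Box
    parts: List[Part] = field(default_factory=list)

@dataclass
class Scene:
    objects: List[SceneObject] = field(default_factory=list)

def chains(scene):
    """Enumerate every complete object -> part -> affordance chain in the scene."""
    return [(o.label, p.label, a.label)
            for o in scene.objects
            for p in o.parts
            for a in p.affordances]

mug = SceneObject("mug", (10, 10, 60, 80), parts=[
    Part("handle", (50, 30, 60, 60), affordances=[Affordance("grasp", (55, 45))]),
])
table = SceneObject("table", (0, 70, 200, 120))  # no parts, so it yields no chain
scene = Scene(objects=[mug, table])
```

Keeping the bindings explicit means an affordance is always tied to a specific part of a specific object, rather than being an isolated point prediction.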

Chinese Translation

一般场景感知已经从物体识别发展到开放词汇的基础定位、部分定位和可供性预测。然而，这些能力往往作为孤立的预测实现，定位物体、部分或交互点，而未能捕捉到进行交互导向场景理解所需的结构依赖关系。为了解决这一问题，我们提出了层次场景解析（Hierarchical Scene Parsing），这是一种交互导向的解析任务，将物理场景表示为具有跨层绑定的显式场景 -> 物体 -> 部分 -> 可供性层次结构。我们通过场景解析器（SceneParser）实例化这一任务，该解析器基于视觉语言模型（VLM），经过统一层次生成的训练，使用结构补全伪标签和课程学习。为了支持训练和评估，我们构建了场景解析器基准（SceneParser-Bench），这是一个大型基准，采用可扩展的层次数据引擎，包含11万张训练图像、5000张验证集、77.7万物体、114万部分、174万可供性注释和174万个有效的物体-部分-可供性链实例。我们进一步引入了从第1层到第3层的条件指标和解析率（ParseRate），以评估定位、跨层绑定和层次完整性。实验表明，现有的多模态大语言模型（MLLMs）和感知拼接管道在我们的场景解析器基准上难以进行层次解析，而场景解析器则实现了更强的结构感知性能。此外，消融实验、在COCO和AGD20K上的评估以及下游规划探测表明，我们的场景解析器与传统任务兼容，并为视觉理解提供了可操作的表示。

View on arXiv Download PDF AI Translation

cs.CV / 114 / 2605.14925

Road Maps as Free Geometric Priors: Weather-Invariant Drone Geo-Localization with GeoFuse

作为自由几何先验的道路地图：天气不变的无人机地理定位与GeoFuse

Fang, Yunsong, Wang, Tingyu, Zheng, Zhedong

Abstract

Drone-view geo-localization aims to match a query drone image, often captured under adverse weather conditions (e.g., rain, snow, fog), against a gallery of geo-tagged satellite images. Weather-induced degradations in the drone view, such as noise, reduced visibility, and partial occlusions, severely exacerbate the intrinsic cross-view domain gap. While prior methods predominantly rely on weather-specific architectures or data augmentations, they have largely overlooked road map data, a readily available modality that provides strong, inherently weather-invariant geometric layout cues (e.g., road networks and building footprints) at negligible additional cost. We introduce GeoFuse, a cross-modal fusion framework that integrates precisely aligned road map tiles with satellite imagery to yield more discriminative and weather-resilient representations. We first augment the existing University-1652 and DenseUAV benchmarks with geo-aligned road maps, supplying structural priors robust to meteorological variations. Building on this, we propose a flexible fusion module that combines satellite and road map features via token-level and channel-level interactions, with a lightweight dynamic gating mechanism that adaptively weights modality contributions per instance. Finally, we employ class-level cross-view contrastive learning to promote robust alignment between weather-degraded drone features and the fused satellite-roadmap representations. Extensive experiments under diverse weather conditions show that GeoFuse consistently outperforms state-of-the-art methods, achieving +3.46% and +23.18% Recall@1 accuracy on the University-1652 and DenseUAV benchmarks, respectively.
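The per-instance dynamic gating idea can be sketched as a scalar gate computed from both modalities that blends satellite and road-map features. The abstract gives no details of the gating mechanism, so the single linear layer, the scalar gate, and all names here are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(sat_feat, map_feat, gate_weights, gate_bias=0.0):
    """Fuse two equal-length feature vectors with one gate value per instance.

    gate_weights has one entry per element of the concatenated features
    (hypothetical parameters that would be learned in practice).
    """
    concat = list(sat_feat) + list(map_feat)
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [gate * s + (1.0 - gate) * m for s, m in zip(sat_feat, map_feat)]
    return fused, gate

fused, gate = gated_fusion([1.0, 3.0], [3.0, 1.0], gate_weights=[0.0] * 4)
```

With zero gate weights the gate is 0.5 and the two modalities are averaged; learned weights would let the gate lean toward whichever modality is more informative for a given tile.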

Chinese Translation

无人机视角的地理定位旨在将查询的无人机图像（通常在恶劣天气条件下捕获，例如雨、雪、雾）与一系列带地理标签的卫星图像进行匹配。天气引起的无人机视角退化，如噪声、能见度降低和部分遮挡，严重加剧了内在的跨视角领域差距。尽管先前的方法主要依赖于特定天气的架构或数据增强，但它们在很大程度上忽视了道路地图数据，这是一种易于获取的模态，提供强大的、固有的天气不变的几何布局线索（例如，路网和建筑轮廓），且额外成本微乎其微。我们提出了GeoFuse，一个跨模态融合框架，将精确对齐的道路地图瓦片与卫星图像结合，以产生更具辨别力和抗天气影响的表示。我们首先通过地理对齐的道路地图增强现有的University-1652和DenseUAV基准，提供对气象变化具有鲁棒性的结构先验。在此基础上，我们提出了一个灵活的融合模块，通过令牌级和通道级的交互将卫星和道路地图特征结合起来，并采用轻量级动态门控机制，根据每个实例自适应地加权模态贡献。最后，我们采用类别级跨视角对比学习，以促进天气退化的无人机特征与融合的卫星-道路地图表示之间的鲁棒对齐。在多种天气条件下的广泛实验表明，GeoFuse始终优于最先进的方法，在University-1652和DenseUAV基准上分别实现了+3.46%和+23.18%的Recall@1准确率。

View on arXiv Download PDF AI Translation

cs.CV / 115 / 2605.14926

SCRWKV: Ultra-Compact Structure-Calibrated Vision-RWKV for Topological Crack Segmentation

SCRWKV：超紧凑结构校准视觉RWKV用于拓扑裂缝分割

Zhang, Hanxu, Jia, Chen, Liu, Hui, Cheng, Xu, Shi, Fan, Chen, Shengyong

Abstract

Achieving pixel-level accurate segmentation of structural cracks across diverse scenarios remains a formidable challenge. Existing methods face significant bottlenecks in balancing crack topology modeling with computational efficiency, often failing to reconcile high segmentation quality with low resource demands. To address these limitations, we propose the Ultra-Compact Structure-Calibrated Vision RWKV (SCRWKV), a network that achieves high-precision modeling via a novel Structure-Field Encoder (SFE) backbone while maintaining linear complexity. The SFE integrates the Adaptive Multi-scale Cascaded Modulator (AMCM) to enhance texture representation and utilizes the Structure-Calibrated Insight Unit (SCIU) as its core engine. Specifically, the SCIU employs the Geometry-guided Bidirectional Structure Transformation (GBST) to capture topological correlations and integrates the Dynamic Self-Calibrating Decay (DSCD) into Dy-WKV to suppress noise propagation. Furthermore, we introduce a lightweight Cross-Scale Harmonic Fusion (CSHF) decoder to achieve precise feature aggregation. Systematic evaluations on multiple benchmarks characterized by complex textures and severe interference demonstrate that SCRWKV, with only 1.22M parameters, significantly outperforms SOTA methods. Achieving an F1 score of 0.8428 and mIoU of 0.8512 on the TUT dataset, the model confirms its robust potential for efficient real-world deployment. The code is available at https://github.com/zhxhzy/SCRWKV.
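For readers less familiar with the reported metrics, the sketch below computes F1 and mIoU for a binary crack mask. Averaging IoU over the crack and background classes is an assumption about the evaluation protocol, which the abstract does not describe.

```python
def f1_and_miou(pred, truth):
    """pred, truth: flat sequences of 0/1 labels, where 1 marks a crack pixel."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    # Empty denominators mean the class is absent everywhere: count as perfect.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    iou_crack = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    iou_background = tn / (tn + fp + fn) if (tn + fp + fn) else 1.0
    return f1, (iou_crack + iou_background) / 2.0

f1, miou = f1_and_miou([1, 1, 0, 0], [1, 0, 0, 0])
```

In this small example there is one true positive, one false positive, and two true negatives, giving F1 = 2/3 and mIoU = (1/2 + 2/3) / 2.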

Chinese Translation

在多样化场景中实现结构裂缝的像素级精确分割仍然是一项艰巨的挑战。现有方法在平衡裂缝拓扑建模与计算效率方面面临显著瓶颈，常常无法将高分割质量与低资源需求相结合。为了解决这些局限性，我们提出了超紧凑结构校准视觉RWKV（SCRWKV），该网络通过一种新颖的结构场编码器（Structure-Field Encoder, SFE）主干实现高精度建模，同时保持线性复杂度。SFE集成了自适应多尺度级联调制器（Adaptive Multi-scale Cascaded Modulator, AMCM）以增强纹理表示，并利用结构校准洞察单元（Structure-Calibrated Insight Unit, SCIU）作为其核心引擎。具体而言，SCIU采用几何引导双向结构变换（Geometry-guided Bidirectional Structure Transformation, GBST）来捕捉拓扑相关性，并将动态自校准衰减（Dynamic Self-Calibrating Decay, DSCD）集成到Dy-WKV中以抑制噪声传播。此外，我们引入了一种轻量级跨尺度谐波融合（Cross-Scale Harmonic Fusion, CSHF）解码器以实现精确特征聚合。在多个具有复杂纹理和严重干扰的基准测试上的系统评估表明，SCRWKV仅用1.22M参数显著超越了现有最先进（SOTA）方法。在TUT数据集上实现了0.8428的F1分数和0.8512的mIoU，模型确认了其在高效实际应用中的强大潜力。代码可在 https://github.com/zhxhzy/SCRWKV 获取。

View on arXiv Download PDF AI Translation

cs.CV / 116 / 2605.14935

Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control

测试时人类运动控制的多尺度粗到细建模

Le, Nhat, Liu, Daochang, Nguyen, Anh, Mian, Ajmal

Abstract

We present MSCoT, a multi-scale, coarse-to-fine model for test-time human motion synthesis and control. Unlike recent approaches that rely on multiple iterative denoising/token-prediction steps, or modules tailored for specific control signals, MSCoT discretizes motion into a multi-scale hierarchical representation and predicts the entire token sequence at each temporal scale in a coarse-to-fine fashion. Building on this coarse-to-fine paradigm, we propose an efficient multi-scale token guidance strategy that overcomes the challenge of discrete sampling and steers the token distribution towards the control goals, allowing for fast and flexible control. To address the limitations of a discrete codebook, a lightweight token refiner further adds continuous residuals to the discrete token embeddings and allows differentiable test-time refinement optimization to ensure precise alignment with the control objectives. MSCoT is able to produce quality motions, consistent with the control constraints, while offering substantially faster sampling than diffusion-based approaches. Experiments on popular benchmarks demonstrate state-of-the-art controllable text-to-motion generation performance of MSCoT over existing baselines, with better motion quality (48% FID improvement), higher control accuracy (-61% avg error), and $10 \times$ faster inference speed on HumanML3D.
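The token-refiner idea, adding a continuous residual to a discrete embedding and optimizing it at test time, can be illustrated with a toy gradient-descent loop. The sketch assumes an identity decoder and a squared-error control objective on selected dimensions; the actual MSCoT refiner and objective are not specified in the abstract, and all names are hypothetical.

```python
def refine_residual(embedding, target, controlled_dims, steps=100, lr=0.1):
    """Optimize a residual r so that (embedding + r) matches `target` on the
    controlled dimensions. Uncontrolled dimensions receive zero gradient and
    keep their discrete-codebook value."""
    residual = [0.0] * len(embedding)
    for _ in range(steps):
        for i in controlled_dims:
            error = embedding[i] + residual[i] - target[i]
            residual[i] -= lr * 2.0 * error   # gradient of error**2 w.r.t. residual[i]
    return [e + r for e, r in zip(embedding, residual)]

codebook_embedding = [0.5, -1.0, 2.0]
control_target = [0.8, 0.0, 0.0]            # only dimension 0 is constrained below
refined = refine_residual(codebook_embedding, control_target, controlled_dims=[0])
```

The discrete token fixes the coarse content, while the residual supplies the small continuous correction needed to hit the control target precisely.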

Chinese Translation

我们提出了MSCoT，一种用于测试时人类运动合成和控制的多尺度粗到细模型。与最近依赖于多个迭代去噪/标记预测步骤或针对特定控制信号量身定制的模块的方法不同，MSCoT将运动离散化为多尺度层次表示，并以粗到细的方式在每个时间尺度上预测整个标记序列。在这一粗到细范式的基础上，我们提出了一种高效的多尺度标记引导策略，克服了离散采样的挑战，并将标记分布引导至控制目标，从而实现快速灵活的控制。为了解决离散代码本的局限性，一个轻量级的标记精炼器进一步为离散标记嵌入添加连续残差，并允许可微分的测试时精炼优化，以确保与控制目标的精确对齐。MSCoT能够生成与控制约束一致的高质量运动，同时提供比基于扩散的方法显著更快的采样速度。在流行基准上的实验表明，MSCoT在现有基线之上展示了最先进的可控文本到运动生成性能，具有更好的运动质量（FID提升48%）、更高的控制准确性（平均误差降低61%），以及在HumanML3D上的$10 \times$更快推理速度。

View on arXiv Download PDF AI Translation

cs.CV / 117 / 2605.14948

ACE-LoRA: Adaptive Orthogonal Decoupling for Continual Image Editing

ACE-LoRA：用于持续图像编辑的自适应正交解耦

Liu, Yuehao, Zhang, Weijia, Shang, Xuanming, Chen, Zhizhou, Ge, Yanhao, Guan, Shanyan, Ma, Chao

Abstract

State-of-the-art diffusion models often rely on parameter-efficient fine-tuning to perform specialized image editing tasks. However, real-world applications require continual adaptation to new tasks while preserving previously learned knowledge. Despite the practical necessity, continual learning for image editing remains largely underexplored. We propose ACE-LoRA, a dynamic regularization framework for continual image editing that effectively mitigates catastrophic forgetting. ACE-LoRA leverages Adaptive Orthogonal Decoupling to identify and orthogonalize task interference, and introduces a Rank-Invariant Historical Information Compression strategy to address scalability issues in continual updates. To facilitate continual learning in image editing and provide a standardized evaluation protocol, we introduce CIE-Bench, the first comprehensive benchmark in this domain. CIE-Bench encompasses diverse and practically relevant image editing scenarios with a balanced level of difficulty to effectively expose limitations of existing models while remaining compatible with parameter-efficient fine-tuning. Extensive experiments demonstrate that our method consistently outperforms existing baselines in terms of instruction fidelity, visual realism, and robustness to forgetting, establishing a strong foundation for continual learning in image editing.
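A common way to reduce interference between tasks is to project a new update onto the orthogonal complement of the subspace used by earlier tasks. The sketch below shows that generic projection with Gram-Schmidt; it is not the paper's Adaptive Orthogonal Decoupling, whose details the abstract does not give, and all names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:                     # skip linearly dependent vectors
            basis.append([wi / norm for wi in w])
    return basis

def project_out(update, basis):
    """Remove from `update` its component inside the span of `basis`."""
    w = list(update)
    for b in basis:
        c = dot(w, b)
        w = [wi - c * bi for wi, bi in zip(w, b)]
    return w

old_task_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(old_task_directions)
new_update = project_out([3.0, 4.0, 5.0], basis)
```

Here the earlier tasks span the first two coordinates, so only the third component of the new update survives and the old directions are left untouched.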

Chinese Translation

最先进的扩散模型通常依赖于参数高效的微调来执行专业的图像编辑任务。然而，实际应用需要在保持先前学习知识的同时，持续适应新任务。尽管这种实际需求存在，图像编辑的持续学习仍然在很大程度上未被探索。我们提出了ACE-LoRA，这是一种用于持续图像编辑的动态正则化框架，有效缓解灾难性遗忘。ACE-LoRA利用自适应正交解耦来识别和正交化任务干扰，并引入了一种秩不变历史信息压缩策略，以解决持续更新中的可扩展性问题。为了促进图像编辑中的持续学习并提供标准化的评估协议，我们引入了CIE-Bench，这是该领域第一个综合基准。CIE-Bench涵盖了多样且与实际相关的图像编辑场景，具有平衡的难度水平，以有效揭示现有模型的局限性，同时保持与参数高效微调的兼容性。大量实验表明，我们的方法在指令保真度、视觉真实感和对遗忘的鲁棒性方面始终优于现有基线，为图像编辑中的持续学习奠定了坚实的基础。

View on arXiv Download PDF AI Translation

cs.CV / 118 / 2605.14949

A CUBS-Compatible Ultrasound Morphology and Uncertainty-Aware Baseline for Carotid Intima-Media Segmentation and Preliminary Risk Prediction

兼容CUBS的超声形态学与不确定性感知基线用于颈动脉内膜-中膜分割及初步风险预测

Aueawatthanaphisut, Aueaphum

Abstract

Carotid atherosclerosis is a major contributor to ischemic stroke and transient ischemic attack. Conventional ultrasound assessment is commonly based on intima-media thickness, plaque appearance, stenosis degree, and peak systolic velocity, but these morphology- and velocity-based indicators may not fully capture patient-specific vascular risk. This study presents AtheroFlow-XNet, a CUBS-compatible ultrasound morphology and uncertainty-aware learning baseline for carotid intima-media segmentation and preliminary risk prediction. Using the Carotid Ultrasound Boundary Study dataset, manual lumen-intima and media-adventitia boundary annotations were converted into dense intima-media masks for supervised segmentation. Clinical variables were incorporated into an auxiliary risk-prediction branch, and Monte Carlo dropout was used for uncertainty-aware inference. The model was evaluated using a patient-level train-validation-test split with 1,522 training images, 326 validation images, and 328 testing images. The proposed model achieved a Dice coefficient of 0.7930 for LI-MA mask segmentation, a segmentation loss of 0.2359, and an area under the receiver operating characteristic curve of 0.6910 for preliminary risk prediction. Qualitative results showed that predicted masks were generally aligned with manual annotations, while uncertainty maps highlighted ambiguous wall-boundary regions. These results suggest that ultrasound-derived carotid morphology can support automated wall analysis and uncertainty-aware interpretation. Since CUBS does not provide Doppler waveforms or CFD-derived hemodynamic biomarkers, this work should be interpreted as a reproducible morphology-driven baseline. Future work will incorporate Doppler-derived flow profiles, patient-specific vascular reconstruction, and CFD-based wall shear biomarkers.
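Monte Carlo dropout estimates uncertainty by keeping dropout active at inference and summarizing several stochastic forward passes. The sketch below uses a toy linear-sigmoid scorer in place of the segmentation network; the model, the number of passes, and the dropout rate are assumptions for illustration only.

```python
import math
import random

def mc_dropout_predict(features, weights, passes=50, drop_p=0.5, seed=0):
    """Mean and variance of a linear-sigmoid score over stochastic dropout passes."""
    rng = random.Random(seed)
    keep = 1.0 - drop_p
    scores = []
    for _ in range(passes):
        # Inverted dropout: drop each feature with probability drop_p and
        # rescale the survivors by 1/keep so the expected input is unchanged.
        z = sum(w * x / keep
                for w, x in zip(weights, features)
                if rng.random() < keep)
        scores.append(1.0 / (1.0 + math.exp(-z)))
    mean = sum(scores) / passes
    variance = sum((s - mean) ** 2 for s in scores) / passes
    return mean, variance

mean, variance = mc_dropout_predict([0.2, -0.4, 1.0], [1.5, 0.5, -0.7])
```

Applied per pixel, the variance plays the role of the uncertainty map: it is high where the stochastic passes disagree, such as ambiguous wall boundaries.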

Chinese Translation

颈动脉动脉粥样硬化是缺血性中风和短暂性缺血发作的主要诱因。传统的超声评估通常基于内膜-中膜厚度、斑块外观、狭窄程度和峰值收缩速度，但这些基于形态和速度的指标可能无法充分捕捉患者特异性的血管风险。本研究提出了AtheroFlow-XNet，一种兼容CUBS的超声形态学与不确定性感知学习基线，用于颈动脉内膜-中膜分割及初步风险预测。利用颈动脉超声边界研究（Carotid Ultrasound Boundary Study）数据集，将手动的腔内-内膜和中膜-外膜边界注释转换为密集的内膜-中膜掩膜以进行监督分割。临床变量被纳入辅助风险预测分支，并使用蒙特卡洛丢弃（Monte Carlo dropout）进行不确定性感知推理。该模型通过患者级别的训练-验证-测试划分进行评估，包含1,522张训练图像、326张验证图像和328张测试图像。所提出的模型在LI-MA掩膜分割中达到了0.7930的Dice系数，分割损失为0.2359，初步风险预测的接收者操作特征曲线下面积为0.6910。定性结果显示，预测的掩膜通常与手动注释一致，而不确定性图突出显示了模糊的壁边界区域。这些结果表明，超声衍生的颈动脉形态可以支持自动化的壁分析和不确定性感知解释。由于CUBS未提供多普勒波形或CFD衍生的血流动力学生物标志物，本研究应被视为一种可重复的以形态为驱动的基线。未来的工作将结合多普勒衍生的流动特征、患者特异性的血管重建以及基于CFD的壁剪切生物标志物。

View on arXiv Download PDF AI Translation

cs.CV / 119 / 2605.14950

Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model

Evo-Depth：一种轻量级深度增强视觉-语言-动作模型

Lin, Tao, Du, Yuxin, Liu, Jiting, Zhu, Nuobei, Li, Yunhe, Fu, Yuqian, Chen, Yinxinyu, Cai, Hongyi, Ye, Zewei, Cheng, Bing, Ye, Kai, Mao, Yiran, Zhong, Yilei, Dong, MingKang, Yan, Junchi, Li, Gen, Zhao, Bo

Abstract

Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that enhances spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations through a Spatial Enhancement Module via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also exhibiting the smallest model size, lowest GPU memory usage, and highest inference frequency among compared methods.

Chinese Translation

视觉-语言-动作（Vision-Language-Action, VLA）模型作为一种有前景的机器人操作范式，通过统一感知、语言基础和动作生成而受到关注。然而，它们在需要精确空间理解的场景中常常表现不佳，因为当前的 VLA 模型主要依赖于缺乏深度信息和详细空间关系的二维视觉表示。尽管近期的方法通过引入深度图或点云等显式三维输入来解决这一问题，但它们往往增加了系统复杂性，要求额外的传感器，并且对传感噪声和重建误差仍然敏感。另一类研究探索直接从 RGB 观测中进行隐式三维感知空间建模，而无需额外传感器，但这通常依赖于大型几何基础模型，导致更高的训练和部署成本。为了解决这些挑战，我们提出了 Evo-Depth，一种轻量级的深度增强 VLA 框架，旨在提升空间基础的操作能力，而不依赖额外的传感硬件或妥协部署效率。Evo-Depth 采用轻量级隐式深度编码模块，从多视角 RGB 图像中提取紧凑的深度特征。这些特征通过空间增强模块与视觉-语言表示相结合，利用深度感知调制实现高效的空间-语义增强。此外，我们进一步引入了一种渐进对齐训练策略，将生成的深度增强表示与下游动作学习对齐。Evo-Depth 仅使用 0.9B 参数，在四个仿真基准测试中表现优越。在现实世界实验中，Evo-Depth 达到了最高的平均成功率，同时在比较方法中展现了最小的模型大小、最低的 GPU 内存使用和最高的推理频率。

View on arXiv Download PDF AI Translation

cs.CV / 120 / 2605.14963

H-OmniStereo: Zero-Shot Omnidirectional Stereo Matching with Heading-Aligned Normal Priors

H-OmniStereo：基于朝向对齐法线先验的零样本全向立体匹配

Jiang, Chenxing, Tong, Zhe, Gao, Pusen, Liu, Peize, Xu, Yang, Fang, Chuan, Tan, Ping, Shen, Shaojie

Abstract

Stereo matching on top-bottom equirectangular images provides an effective framework for full-surround perception, as vertically aligned epipolar lines enable the use of advanced perspective stereo architectures that are largely driven by large-scale datasets and monocular priors. However, the performance of such adaptations is severely limited by the scarcity of omnidirectional stereo datasets and the degradation of perspective monocular priors under spherical distortions. To address these challenges, we propose H-OmniStereo, a zero-shot omnidirectional stereo matching framework. First, we construct a high-quality synthetic dataset comprising over 2.8 million top-bottom equirectangular stereo pairs to scale up training. Second, we introduce an equirectangular monocular normal estimator that operates in a heading-aligned coordinate system. Beyond providing distortion-robust and cross-view-consistent geometric priors for establishing reliable correspondences in stereo matching, this design boosts training efficiency and accommodates train-test FoV mismatches. Extensive experiments show that our approach achieves higher accuracy than existing methods on out-of-domain datasets and successfully generalizes to real-world consumer camera setups using a single model. Both the model and the dataset will be open-sourced.
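The geometry behind top-bottom equirectangular stereo can be sketched as follows: with the two cameras stacked vertically, a point appears in the same image column in both views, and its range follows from the two elevation angles by the law of sines. The row-to-angle convention and function names below are assumptions, not the paper's implementation.

```python
import math

def row_to_elevation(v, height):
    """Elevation angle in radians for pixel row v of an equirectangular image.
    Assumed convention: +pi/2 at the top edge, -pi/2 at the bottom edge."""
    return math.pi / 2 - math.pi * (v + 0.5) / height

def range_from_elevations(elev_bottom, elev_top, baseline):
    """Distance from the bottom camera to a point seen in the same column of
    both images; the top camera sits `baseline` directly above the bottom one."""
    disparity = elev_bottom - elev_top   # angular disparity, positive for finite range
    if disparity <= 0:
        return math.inf
    return baseline * math.cos(elev_top) / math.sin(disparity)

# A point 2 m away horizontally and 1 m above the bottom camera, 0.5 m baseline.
elev_b = math.atan2(1.0, 2.0)
elev_t = math.atan2(1.0 - 0.5, 2.0)
estimated_range = range_from_elevations(elev_b, elev_t, baseline=0.5)
```

The recovered range matches the true distance sqrt(2^2 + 1^2) from the bottom camera. The search for correspondences is one-dimensional along each column, which is what lets perspective stereo architectures be reused.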

Chinese Translation

在上下方向的等矩形图像上进行立体匹配为全方位感知提供了有效框架，因为垂直对齐的极线使得可以使用主要依赖于大规模数据集和单目先验的先进透视立体架构。然而，这种适应的性能受到全向立体数据集稀缺以及在球面畸变下透视单目先验退化的严重限制。为了解决这些挑战，我们提出了H-OmniStereo，一个零样本全向立体匹配框架。首先，我们构建了一个高质量的合成数据集，包含超过280万对上下方向的等矩形立体图像，以扩大训练规模。其次，我们引入了一种等矩形单目法线估计器，特别在朝向对齐的坐标系统中操作。除了为建立可靠的立体匹配对应关系提供抗畸变和跨视图一致的几何先验外，这一设计还提高了训练效率，并适应训练与测试视场不匹配的情况。大量实验表明，我们的方法在域外数据集上实现了比现有方法更高的准确性，并成功地推广到使用单一模型的真实消费相机设置。该模型和数据集将开源。

View on arXiv Download PDF AI Translation

cs.CV / 121 / 2605.14966

MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs

MHSA：一种通过引导注意力减轻大规模视觉语言模型幻觉的轻量框架

Ding, Wei, Li, Yilin, Zhang, Yudong, Xie, Ruobing, Sun, Xingwu, Chen, Jiansheng, Wang, Yu

Abstract

Large vision-language models (LVLMs) have achieved remarkable performance across diverse multimodal tasks, yet they continue to suffer from hallucinations, generating content that is inconsistent with the visual input. Prior work DHCP (Detecting Hallucinations by Cross-modal Attention Pattern) has explored hallucination detection from the perspective of cross-modal attention, but does not address hallucination mitigation. In this paper, we propose MHSA (Mitigating Hallucinations via Steered Attention), a lightweight framework that mitigates hallucinations by learning to correct cross-modal attention patterns in LVLMs. MHSA trains a simple three-layer MLP generator to produce corrected attention, guided by supervisory signals from the DHCP discriminator and the LVLM itself. During inference, MHSA mitigates both discriminative and generative hallucinations across various datasets and LVLMs by simply replacing the original cross-modal attention with the corrected one, without modifying any LVLM parameters. By extending cross-modal attention mechanisms from hallucination detection to hallucination mitigation, MHSA offers a novel perspective on hallucination research in LVLMs and helps enhance their reliability.

Chinese Translation

大型视觉语言模型（LVLMs）在多种多模态任务中取得了显著的性能，但仍然存在幻觉问题，生成与视觉输入不一致的内容。之前的工作 DHCP（通过跨模态注意力模式检测幻觉）从跨模态注意力的角度探讨了幻觉检测，但未解决幻觉减轻的问题。本文提出了 MHSA（通过引导注意力减轻幻觉），这是一个轻量框架，通过学习修正 LVLMs 中的跨模态注意力模式来减轻幻觉。MHSA 训练一个简单的三层 MLP 生成器来生成修正后的注意力，受 DHCP 鉴别器和 LVLM 本身的监督信号指导。在推理过程中，MHSA 通过简单地用修正后的跨模态注意力替换原始的跨模态注意力，减轻了各种数据集和 LVLMs 中的判别性和生成性幻觉，而无需修改任何 LVLM 参数。通过将跨模态注意力机制从幻觉检测扩展到幻觉减轻，MHSA 为 LVLMs 中的幻觉研究提供了新的视角，并有助于增强其可靠性。

cs.CV / 122 / 2605.14980

MicroscopyMatching: Towards a Ready-to-use Framework for Microscopy Image Analysis in Diverse Conditions

显微镜匹配：面向多种条件下显微镜图像分析的即用框架

Hui, Xiaofei, Qu, Haoxuan, Rahmani, Hossein, Wang, Shuohong, Lichtman, Jeff W., Liu, Jun

Abstract

Analyzing microscopy images to extract properties of biological objects (e.g., their morphological organization, temporal dynamics, and population density) is fundamental to many areas of biomedical research. Yet conducting this analysis manually is costly and time-consuming. Although deep learning-based approaches have been explored to automate the process, the substantial diversity of microscopy analysis settings in practice (including variations in biological object types, sample processing protocols, imaging equipment, and analysis tasks) often renders them ineffective. As a result, these approaches typically require extensive adaptation for each setting, which imposes burdens that are often unsustainable for laboratories. Biomedical researchers therefore still commonly rely on manual analysis, which severely bottlenecks the pace of biomedical research. This situation has created a pressing and long-standing need for a reliable and broadly applicable microscopy image analysis tool, yet such a tool is still missing. To address this gap, we present the first ready-to-use microscopy image analysis framework, MicroscopyMatching, which can reliably perform key analysis tasks (including segmentation, tracking, and counting) across diverse microscopy analysis settings. Taking a fundamentally different perspective, MicroscopyMatching reformulates diverse microscopy image analysis tasks as a unified matching problem and handles this problem by exploiting the robust matching capability of pre-trained latent diffusion models.

Chinese Translation

分析显微镜图像以提取生物对象特性（例如，它们的形态组织、时间动态和种群密度）是多种生物医学研究的基础。然而，手动进行这一过程既昂贵又耗时。尽管已经探索了基于深度学习的方法来自动化这一过程，但在实际应用中，显微镜分析设置的显著多样性（包括生物对象类型的变化、样本处理协议、成像设备和分析任务等）往往使这些方法失效。因此，这些方法通常需要针对不同设置进行广泛的调整，而这种调整往往给实验室带来难以承受的负担，迫使生物医学研究人员仍然依赖手动分析，从而严重制约了生物医学研究的进展速度。这种情况迫切且长期以来需要一种可靠且广泛适用的显微镜图像分析工具，但目前仍缺乏这样的工具。为了解决这一空白，我们提出了第一个即用的显微镜图像分析框架——显微镜匹配（MicroscopyMatching），该框架能够在多种显微镜分析设置中可靠地执行关键分析任务（包括分割、跟踪和计数）。从根本上不同的角度出发，显微镜匹配将多样的显微镜图像分析任务重新表述为一个统一的匹配问题，通过利用预训练的潜在扩散模型的强大匹配能力有效地处理这一问题。

cs.CV / 123 / 2605.14984

Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image

Sat3DGen：基于单幅卫星图像的全面街景3D场景生成

Qian, Ming, Xia, Zimin, Liu, Changkun, Ma, Shuailei, Wang, Wen, Ke, Zeran, Tan, Bin, Zhang, Hang, Xia, Gui-Song

Abstract

Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and sparse, inconsistent supervision inherent in satellite-to-street data. To address these fundamental challenges, we introduce Sat3DGen, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first constructed a new benchmark by pairing the VIGOR-OOD test set with high-resolution DSM data. On this benchmark, our method improves geometric RMSE from 6.76 m to 5.20 m. Crucially, this geometric leap also boosts photorealism, reducing the Fréchet Inception Distance (FID) from ~40 for the leading method, Sat2Density++, to 19, despite using no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation. The code has been released at https://github.com/qianmingduowan/Sat3DGen.
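
As a reading aid for the geometric metric quoted in this abstract, the following minimal sketch shows how an RMSE between predicted and reference heights could be computed, and what the reported change amounts to in relative terms. The function names and the flat-list height representation are assumptions made for illustration; the abstract does not describe the paper's actual evaluation protocol.

```python
import math

def rmse(predicted, reference):
    """Root-mean-square error between two equal-length height sequences (metres)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))

def relative_reduction(before, after):
    """Fractional reduction of an error metric, e.g. 0.25 means 25% lower."""
    return (before - after) / before

# Figures quoted in the abstract: geometric RMSE 6.76 m -> 5.20 m.
print(f"{relative_reduction(6.76, 5.20):.1%}")
```

With the quoted figures, the relative reduction comes out at roughly 23%.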

Chinese Translation

从单幅卫星图像生成街景3D场景是一项关键但具有挑战性的任务。目前的方法存在明显的权衡：几何-着色模型实现了高几何保真度，但通常集中于建筑物，缺乏语义多样性。相反，基于代理的模型使用前馈图像到3D框架，通过联合学习几何和纹理生成整体场景，这一过程虽然产生了丰富的内容，但几何却粗糙且不稳定。我们将这些几何失败归因于卫星到街道数据固有的极端视角差距和稀疏、不一致的监督。为了解决这些根本性挑战，我们提出了Sat3DGen，采用几何优先的方法论。该方法论通过将新颖的几何约束与透视视图训练策略相结合，增强了前馈范式，明确针对几何误差的主要来源。这种以几何为中心的策略在3D准确性和照片真实感方面实现了显著飞跃。为了验证，我们首先通过将VIGOR-OOD测试集与高分辨率DSM数据配对构建了一个新的基准。在该基准上，我们的方法将几何均方根误差（RMSE）从6.76米降低到5.20米。重要的是，这一几何飞跃也提升了照片真实感，使Fréchet Inception Distance (FID) 从约40降低到19，相较于领先方法Sat2Density++，尽管没有使用额外的定制图像质量模块。我们通过多样的下游应用展示了我们高质量3D资产的多功能性，包括语义地图到3D合成、多摄像头视频生成、大规模网格化以及无监督单幅图像数字表面模型（DSM）估计。代码已发布在https://github.com/qianmingduowan/Sat3DGen。

cs.CV / 124 / 2605.14988

Compositional Video Generation via Inference-Time Guidance

通过推理时引导进行组合视频生成

Shaulov, Ariel, Shaar, Eitan, Edenzon, Amit, Chechik, Gal, Wolf, Lior

Abstract

Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that these failures need not be addressed by retraining the generator, but can instead be mitigated by steering the denoising process using the model's own internal grounding signals. We propose CVG, an inference-time guidance method for improving compositional faithfulness in frozen text-to-video models. Our key observation is that cross-attention maps already encode how prompt concepts are grounded across space and time. We train a lightweight compositional classifier on these attention features and use its gradients during early denoising steps to steer the latent trajectory toward the desired composition. Built on a frozen VLM backbone, the classifier transfers across semantically related composition labels rather than relying only on narrow category-specific features. CVG improves compositional generation without modifying the model architecture, fine-tuning the generator, or requiring layouts, boxes, or other user-supplied controls. Experiments on compositional text-to-video benchmarks show improved prompt faithfulness while preserving the visual quality of the underlying generator.

Chinese Translation

文本到视频的扩散模型能够生成逼真的视频，但在需要细致组合理解的提示上（例如实体之间的关系、属性、动作和运动方向）往往表现不佳。我们假设，这些失败不必通过重新训练生成器来解决，而是可以通过利用模型自身的内部基础信号来引导去噪过程来减轻。我们提出了 CVG，一种在冻结的文本到视频模型中改善组合忠实度的推理时引导方法。我们的关键观察是，交叉注意力图已经编码了提示概念在时间和空间上的基础。我们在这些注意力特征上训练了一个轻量级的组合分类器，并在早期去噪步骤中使用其梯度来引导潜在轨迹朝向所需的组合。基于冻结的视觉语言模型（VLM）骨干，分类器在语义相关的组合标签之间进行迁移，而不仅仅依赖于狭窄的类别特定特征。CVG在不修改模型架构、不微调生成器或不需要布局、框或其他用户提供的控制的情况下改善了组合生成。在组合文本到视频基准上的实验显示，提示的忠实度得到了提高，同时保持了基础生成器的视觉质量。

cs.CV / 125 / 2605.14990

Characterizing the visual representation of objects from the child's view

从儿童视角表征物体的视觉表现

Yang, Jane, Sepuri, Tarun, Tan, Alvin Wei Ming, Aw, Khai Loong, Frank, Michael C., Long, Bria

Abstract

Children acquire object category representations from their everyday experiences in the first few years of life. What do the inputs to this learning process look like? We analyzed first-person videos of young children's visual experience at home from the BabyView dataset (N = 31 participants, 868 hours, ages 5-36 months), using a supervised object detection model to extract common object categories from more than 3 million frames. We found that children's object category exposure was highly skewed: a few categories (e.g., cups, chairs) dominated children's visual experiences while most categories appeared rarely, replicating previous findings from a more restricted set of contexts. Category exemplars were highly variable: children encountered objects from unusual angles, in highly cluttered scenes, and partially occluded views; many categories (especially animals) were most frequently viewed as depictions. Surprisingly, despite this variability, detected categories (e.g., giraffes, apples) showed stronger groupings within superordinate categories (e.g., animals, food) relative to groupings derived from canonical photographs of these categories. We found this same pattern when using high-dimensional embeddings from both self-supervised visual and multimodal models; this effect was also recapitulated in densely sampled data from individual children. Understanding the robustness and efficiency of visual category learning will require the development of models that can exploit strong superordinate structure and learn from non-canonical, sparse, and variable exemplars.

Chinese Translation

儿童在生命的头几年中通过日常经验获得物体类别的表征。那么，这一学习过程的输入是什么样的呢？我们分析了来自BabyView数据集的年轻儿童在家中的第一人称视频（N = 31名参与者，868小时，年龄5至36个月），使用监督物体检测模型从超过300万帧中提取常见物体类别。我们发现儿童接触的物体类别高度偏斜：少数类别（如杯子、椅子）主导了儿童的视觉体验，而大多数类别则很少出现，这一发现重现了在更有限的背景下的先前研究结果。类别示例高度多样化：儿童从不寻常的角度、在高度杂乱的场景中以及部分遮挡的视图中遇到物体；许多类别（尤其是动物）最常以描绘的形式出现。令人惊讶的是，尽管存在这种多样性，检测到的类别（如长颈鹿、苹果）在超类类别（如动物、食物）内的分组相较于从这些类别的标准照片中得出的分组显示出更强的聚合性。当使用来自自监督视觉和多模态模型的高维嵌入时，我们发现了相同的模式；这一效应在个别儿童的密集采样数据中也得到了重现。理解视觉类别学习的稳健性和效率将需要开发能够利用强超类结构并从非标准、稀疏和多样化示例中学习的模型。

cs.CV / 126 / 2605.14991

Predicting Response to Neoadjuvant Chemotherapy in Ovarian Cancer from CT Baseline Using Multi-Loss Deep Learning

基于CT基线的卵巢癌新辅助化疗反应预测：多损失深度学习方法

Pastori, Francesco, Fati, Francesca, Rosanu, Marina, De Vitis, Luigi, Ribero, Lucia, Schivardi, Gabriella, Aletti, Giovanni Damiano, Colombo, Nicoletta, Casarin, Jvan, Multinu, Francesco, De Momi, Elena

Abstract

Ovarian cancer is the most lethal gynecologic malignancy: around 60% of patients are diagnosed at an advanced stage, with an associated 5-year survival rate of about 30%. Early identification of non-responders to neoadjuvant chemotherapy remains a key unmet need, as it could prevent ineffective therapy and avoid delays in optimal surgical management. This work proposes a non-invasive deep learning framework to predict neoadjuvant chemotherapy response from pre-treatment contrast-enhanced CT by leveraging automatically derived 3D lesion masks. The approach encodes axial slices with a partially fine-tuned pretrained image encoder and aggregates slice-level representations into a volumetric embedding through an attention-based module. Training combines classification loss with supervised contrastive regularization and hard-negative mining to improve separation between ambiguous responders and non-responders. The method was developed on a retrospective single-center cohort from the European Institute of Oncology (Milan, Italy), including 280 eligible patients (147 responders, 133 non-responders). On the test cohort, the model achieved a ROC-AUC of 0.73 (95% CI: 0.58-0.86) and an F1-score of 0.70 (95% CI: 0.56-0.82). Overall, these results suggest that the proposed architecture learns clinically relevant predictive patterns and provides a robust foundation for an imaging-based stratification tool.
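
To make the slice-to-volume aggregation step concrete, here is a minimal sketch of attention-based pooling over per-slice embeddings. It is a simplification under stated assumptions, not the authors' module: the linear scoring vector stands in for whatever learned attention network the paper uses, and the names `attention_pool` and `scoring_vector` are hypothetical.

```python
import math

def attention_pool(slice_embeddings, scoring_vector):
    """Pool per-slice embeddings into one volumetric embedding.

    Assumption: each slice gets a scalar score from a dot product with a
    learned scoring vector; a softmax over slices gives attention weights.
    """
    scores = [sum(w * h for w, h in zip(scoring_vector, emb))
              for emb in slice_embeddings]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(slice_embeddings[0])
    # Weighted sum of slice embeddings, one output value per feature dimension.
    pooled = [sum(a * emb[d] for a, emb in zip(weights, slice_embeddings))
              for d in range(dim)]
    return pooled, weights

# Two toy 2-D slice embeddings; the scoring vector favours the first feature.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0])
print(weights)
```

The returned weights also indicate which slices contributed most to the volumetric embedding, which is one reason attention pooling is often preferred over plain averaging.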

Chinese Translation

卵巢癌是最致命的妇科恶性肿瘤：约60%的患者在晚期被诊断，相关的5年生存率约为30%。早期识别对新辅助化疗无反应的患者仍然是一个关键的未满足需求，因为这可以防止无效治疗并避免最佳手术管理的延误。本研究提出了一种非侵入性的深度学习框架，通过利用自动生成的3D病灶掩膜，从预处理的对比增强CT图像中预测新辅助化疗反应。该方法使用部分微调的预训练图像编码器对轴向切片进行编码，并通过基于注意力的模块将切片级表示聚合为体积嵌入。训练结合了分类损失、监督对比正则化和困难负样本挖掘，以改善模糊反应者与非反应者之间的分离。该方法是在来自欧洲肿瘤研究所（米兰，意大利）的回顾性单中心队列上开发的，包括280名符合条件的患者（147名反应者，133名非反应者）。在测试队列中，该模型达到了0.73的ROC-AUC（95% CI: 0.58-0.86）和0.70的F1-score（95% CI: 0.56-0.82）。总体而言，这些结果表明所提出的架构学习了临床相关的预测模式，并为基于影像的分层工具提供了坚实的基础。

cs.CV / 127 / 2605.15010

3D Skew-Normal Splatting

三维偏态正态点云渲染

Wu, Xiangru, Fan, Ke, Fu, Yanwei

Abstract

3D Gaussian Splatting (3DGS) has emerged as a leading representation for real-time novel view synthesis and has been widely adopted in various downstream applications. The core strength of 3DGS lies in its efficient kernel-based scene representation, where Gaussian primitives provide favorable mathematical and computational properties. However, under a finite primitive budget, the symmetric shape of each primitive directly affects representation compactness, especially near asymmetric structures such as object boundaries and one-sided surfaces. Recent works have explored more complex kernel distributions, yet they either remain within the elliptical family or rely on hard truncation, which limits continuous shape control and introduces distributional discontinuities. In this paper, we propose Skew-Normal Splatting (SNS), which adopts the Azzalini Skew-Normal distribution as the fundamental primitive. By introducing a learnable and bounded skewness parameter, SNS can continuously interpolate between symmetric Gaussians and Half-Gaussian-like shapes, enabling flexible modeling of both sharp boundaries and interior regions. Moreover, SNS preserves analytical tractability under affine transformations and marginalization. This property allows seamless integration into existing Gaussian Splatting rasterization pipelines. Furthermore, to address the strong coupling between scale, rotation, and skewness parameters, we introduce a decoupled parameterization and a block-wise optimization strategy to enhance training stability and accuracy. Extensive experiments on standard novel-view synthesis benchmarks show that SNS consistently improves reconstruction quality over Gaussian and recent non-Gaussian kernels, with clearer benefits on sharp boundaries and thin or one-sided structures.
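
For intuition about the primitive, the one-dimensional Azzalini skew-normal density is f(x) = 2 * phi(x) * Phi(alpha * x), where phi and Phi are the standard normal density and CDF. The sketch below illustrates only this 1-D case and is not the authors' 3D implementation; the tanh bounding of the skewness parameter and the `max_alpha` value are assumptions chosen for the example.

```python
import math

def std_normal_pdf(x: float) -> float:
    """Standard normal density phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x: float) -> float:
    """Standard normal CDF Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bounded_skewness(raw: float, max_alpha: float = 10.0) -> float:
    """Map an unconstrained parameter to (-max_alpha, max_alpha).

    Assumption: tanh bounding is one common choice; the paper's actual
    parameterization is not given in the abstract.
    """
    return max_alpha * math.tanh(raw)

def skew_normal_pdf(x: float, alpha: float) -> float:
    """Azzalini skew-normal density: 2 * phi(x) * Phi(alpha * x)."""
    return 2.0 * std_normal_pdf(x) * std_normal_cdf(alpha * x)

# alpha = 0 recovers the symmetric Gaussian; large |alpha| approaches a
# half-Gaussian that puts almost no mass on one side of the origin.
for alpha in (0.0, 3.0, bounded_skewness(5.0)):
    print(alpha, skew_normal_pdf(-1.0, alpha), skew_normal_pdf(1.0, alpha))
```

Because the skewness enters only through the smooth factor Phi(alpha * x), the shape changes continuously with alpha, in contrast to the hard truncation the abstract mentions.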

Chinese Translation

三维高斯点云渲染（3D Gaussian Splatting, 3DGS）已成为实时新视图合成的主要表示方法，并在各种下游应用中得到广泛采用。3DGS的核心优势在于其高效的基于核的场景表示，其中高斯原语提供了良好的数学和计算特性。然而，在有限的原语预算下，每个原语的对称形状直接影响表示的紧凑性，特别是在不对称结构附近，如物体边界和单侧表面。近期的研究探索了更复杂的核分布，但它们要么仍然局限于椭圆家族，要么依赖于硬截断，这限制了连续形状控制并引入了分布的不连续性。在本文中，我们提出了偏态正态点云渲染（Skew-Normal Splatting, SNS），其采用Azzalini偏态正态分布作为基本原语。通过引入一个可学习的有界偏态参数，SNS能够在对称高斯和类似半高斯的形状之间进行连续插值，从而灵活建模尖锐边界和内部区域。此外，SNS在仿射变换和边缘化下保持解析可处理性。这一特性允许其无缝集成到现有的高斯点云渲染管道中。此外，为了解决尺度、旋转和偏态参数之间的强耦合，我们引入了去耦参数化和块状优化策略，以增强训练的稳定性和准确性。在标准新视图合成基准上的大量实验表明，SNS在重建质量上始终优于高斯核和近期的非高斯核，尤其在尖锐边界和细长或单侧结构上表现出更明显的优势。

cs.CV / 128 / 2605.15024

HiSem: Hierarchical Semantic Disentangling for Remote Sensing Image Change Captioning

HiSem：用于遥感图像变化描述的层次语义解耦

Wang, Man, Liu, Chenyang, Li, Wenjun, Ni, Feng, Jia, Bing, Huang, Baoqi, Xia, Riting, Shi, Zhenwei

Abstract

Remote sensing image change captioning (RSICC) aims to achieve high-level semantic understanding of genuine changes occurring between bi-temporal images. Despite notable progress, existing methods are fundamentally limited by a shared modeling assumption: changed and unchanged image pairs, which have intrinsically different semantic granularities, are processed under a unified modeling strategy. This modeling inconsistency leads to semantic entanglement between coarse-grained change existence judgment and fine-grained semantic understanding. To address this limitation, we propose a novel hierarchical semantic disentangling network (HiSem) that explicitly disentangles semantic representations of different granularities. Specifically, we first introduce the Bidirectional Differential Attention Modulation (BDAM) module that leverages discrepancy-aware attention to enhance cross-temporal interactions, thereby amplifying true change signals while suppressing irrelevant variations. Building upon this, we design a Hierarchical Adaptive Semantic Disentanglement (HASD) module that performs adaptive routing at two hierarchical levels: a coarse-grained image-level routing mechanism distinguishes changed and unchanged image pairs, while a fine-grained token-level Mixture-of-Experts (MoE) block models diverse and heterogeneous change semantics for changed samples. Extensive experiments on two benchmark datasets demonstrate that HiSem outperforms previous methods, achieving a significant improvement of +7.52% BLEU-4 on the WHU-CDC dataset. More importantly, our approach provides a structured perspective for RSICC by explicitly aligning model design with the intrinsic semantic heterogeneity of bi-temporal scenes. The code will be available at https://github.com/Man-Wang-star/HiSem

Chinese Translation

遥感图像变化描述（RSICC）旨在实现对双时相图像之间真实变化的高层次语义理解。尽管取得了显著进展，但现有方法在根本上受到共享建模假设的限制：变化和未变化的图像对在内在上具有不同的语义粒度，却在统一的建模策略下进行处理。这种建模不一致导致了粗粒度变化存在判断与细粒度语义理解之间的语义纠缠。为了解决上述限制，我们提出了一种新颖的层次语义解耦网络（HiSem），该网络明确解耦不同粒度的语义表示。具体而言，我们首先引入了双向差异注意力调制（BDAM）模块，该模块利用差异感知注意力增强跨时相交互，从而放大真实变化信号，同时抑制无关变动。在此基础上，我们设计了层次自适应语义解耦（HASD）模块，该模块在两个层次上执行自适应路由：粗粒度图像级路由机制区分变化和未变化的图像对，而细粒度令牌级专家混合（MoE）块则为变化样本建模多样且异质的变化语义。在两个基准数据集上的大量实验表明，HiSem的表现优于之前的方法，在WHU-CDC数据集上实现了+7.52%的BLEU-4显著提升。更重要的是，我们的方法为RSICC提供了一个结构化的视角，通过明确将模型设计与双时相场景的内在语义异质性对齐。代码将发布在 https://github.com/Man-Wang-star/HiSem

cs.CV / 129 / 2605.15042

EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

EverAnimate：基于潜在流恢复的分钟级人类动画生成

Li, Wuyang, Gao, Yang, Hassan, Mariam, Feng, Lan, Pan, Wentao, Luan, Po-Chien, Alahi, Alexandre

Abstract

We propose EverAnimate, an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. Long-form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, making chunk-based generation prone to accumulated drift: (i) low-level quality drift, such as progressive degradation of static backgrounds, and (ii) high-level semantic drift, such as inconsistent character identity and view-dependent attributes. To address this issue, EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory, consisting of two complementary mechanisms. (i) Persistent Latent Propagation maintains a context memory across chunks to propagate identity and motion in latent space while mitigating temporal forgetting. (ii) Restorative Flow Matching introduces an implicit restoration objective during sampling through velocity adjustment, improving within-chunk fidelity. With only lightweight LoRA tuning, EverAnimate outperforms state-of-the-art long-animation methods in both short- and long-horizon settings: at 10 seconds, it improves PSNR/SSIM by 8%/7% and reduces LPIPS/FID by 22%/11%; at 90 seconds, the gains increase to 15%/15% and 32%/27%, respectively.

Chinese Translation

我们提出了EverAnimate，这是一种高效的后训练方法，用于生成长时间跨度的动画视频，能够保持视觉质量和角色身份。长篇动画的生成仍然面临挑战，因为高度动态的人体运动必须在相对静态的环境中合成，这使得基于块的生成容易出现累积漂移：（i）低级质量漂移，例如静态背景的逐渐退化，以及（ii）高级语义漂移，例如角色身份和视角依赖属性的不一致。为了解决这个问题，EverAnimate通过将生成锚定到一个持久的潜在上下文记忆中来恢复漂移的流轨迹，该记忆由两种互补机制组成。（i）持久潜在传播（Persistent Latent Propagation）在块之间维护上下文记忆，以在潜在空间中传播身份和运动，同时减轻时间遗忘。（ii）恢复性流匹配（Restorative Flow Matching）在采样过程中通过速度调整引入隐式恢复目标，从而提高块内的保真度。通过仅进行轻量级的LoRA调优，EverAnimate在短时间和长时间设置中均超越了最先进的长动画方法：在10秒时，PSNR/SSIM提高了8%/7%，LPIPS/FID减少了22%/11%；在90秒时，增益分别增加到15%/15%和32%/27%。

cs.CV / 130 / 2605.15054

LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection

LATERN：测试时上下文感知的可解释视频异常检测

Piehl, Mitchell, Ye, Muchao

Abstract

Vision-language models (VLMs) have recently emerged as a promising paradigm for video anomaly detection (VAD) due to their strong visual reasoning ability and natural language-based explainability. In this paper, we address a key limitation of such pipelines: owing to token constraints, they perform segment-level inference independently and reason without structured temporal context, which produces fragmented predictions and explanations. Our goal is to let VLMs interpret anomalies as deviations from evolving video dynamics. Specifically, we propose a context-aware framework named LATERN, which reformulates VAD as a temporal evidence aggregation process. LATERN consists of two complementary modules: Context-Aware Anomaly Scoring (CEA) and Recursive Evidence Aggregation (REA). CEA introduces a novel image-grounded memory mechanism, which selects historical content via frame diversity and visual-textual alignment as expanded context to help generate reliable anomaly scores. Building upon these scores, REA performs recursive temporal aggregation to identify coherent anomaly intervals and produce event-level decisions and explanations grounded in visual-textual evidence. Extensive experiments on challenging benchmarks, including UCF-Crime and XD-Violence, show that LATERN enhances detection accuracy and explanation consistency for frozen VLMs at test time, while generating temporally coherent and semantically grounded event-level explanations.

Chinese Translation

视觉语言模型（VLMs）因其强大的视觉推理能力和基于自然语言的可解释性，最近成为视频异常检测（VAD）的一个有前景的范式。本文旨在解决此类流程的一个关键限制，即由于令牌约束而独立进行片段级推理，并且在没有结构化时间上下文的情况下进行推理，使得VLMs将异常解释为与不断变化的视频动态的偏差，而不是产生碎片化的预测和解释。具体而言，我们提出了一种名为LATERN的上下文感知框架，将VAD重新表述为一个时间证据聚合过程。LATERN由两个互补模块组成：上下文感知异常评分（CEA）和递归证据聚合（REA）。CEA引入了一种新颖的图像基础记忆机制，通过帧多样性和视觉-文本对齐选择历史内容，作为扩展上下文以帮助生成可靠的异常评分。在这些评分的基础上，REA执行递归时间聚合，以识别一致的异常区间，并生成基于视觉-文本证据的事件级决策和解释。在包括UCF-Crime和XD-Violence在内的具有挑战性的基准上进行的广泛实验表明，LATERN在测试时提高了冻结VLMs的检测准确性和解释一致性，同时生成时间上连贯且语义上扎实的事件级解释。

cs.CV / 131 / 2605.15062

Computational Imaging Priors for Wireless Capsule Endoscopy: Monte Carlo-Guided Hemoglobin Mapping for Rare-Anomaly Detection

无线胶囊内窥镜的计算成像先验：用于罕见异常检测的蒙特卡洛引导血红蛋白映射

Yang, Chengshuai, Xing, Lei, Entin, Gregory, Vemulapalli, Roopa, Casey, Lisa, Zaman, Raiyan Tripti

Abstract

Background. RGB-trained capsule-endoscopy classifiers underperform on small-vessel vascular findings because they conflate hemoglobin contrast with bile and illumination falloff. We therefore test whether a Monte Carlo-inspired analytic model that estimates hemoglobin from the RGB signal can improve such a classifier. Methods. On Kvasir-Capsule (47,238 frames, video-level 70/15/15 split, 11 evaluable classes) we evaluate two software-only configurations against an RGB-only EfficientNet-B0 across 6 seeds: (i) a prior P_blood = sigma(alpha * (H_norm - 0.5)) * Phi(r) fused as 2 zero-init auxiliary channels; (ii) a distillation head that trains a 3-channel RGB backbone to predict P_blood. Significance: paired DeLong, McNemar, bootstrap CIs with Bonferroni correction. Results. Across 6 seeds (n=6,423), the analytic prior provides a small but direction-consistent macro-AUC improvement: RGB-only 0.760 +/- 0.027, input-fusion 0.783 +/- 0.024 (paired Delta = +0.023, sign-positive on 5/6 seeds), distillation 0.773 +/- 0.028. The largest robust per-class lift is on Lymphangiectasia, where AUC rises from RGB 0.238 +/- 0.057 to input-fusion 0.337 +/- 0.019, sign-consistent across all 6 seeds. On rare focal-vascular classes (Angiectasia, Blood - fresh) the prior's per-seed effects are bimodal: seed=42 reaches Angiectasia AUC 0.528 -> 0.916, but the cross-seed mean is 0.646 -> 0.608 with sigma_PI = 0.23, so the seed-42 result is reported as a high-variance per-seed exemplar. Conclusion. A Monte Carlo-inspired analytic prior provides a small, direction-consistent macro-AUC improvement on Kvasir-Capsule across 6 seeds, with the largest robust per-class lift on Lymphangiectasia; the distillation variant runs on plain 3-channel RGB and yields an interpretability heatmap at no extra cost.
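
The prior P_blood = sigma(alpha * (H_norm - 0.5)) * Phi(r) can be read as a per-pixel blood likelihood damped by a radial term. The sketch below evaluates that expression for a single pixel under several assumptions that the abstract does not confirm: sigma is taken to be the logistic sigmoid, H_norm a hemoglobin index normalized to [0, 1], r a normalized distance from the image centre, and `radial_weight` a hypothetical cosine taper standing in for Phi(r). The default `alpha` is likewise illustrative.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid (assumed meaning of 'sigma' in the abstract)."""
    return 1.0 / (1.0 + math.exp(-z))

def radial_weight(r: float, r_max: float = 1.0) -> float:
    """Hypothetical stand-in for Phi(r): 1 at the centre, 0 at r_max."""
    r = min(max(r, 0.0), r_max)
    return 0.5 * (1.0 + math.cos(math.pi * r / r_max))

def p_blood(h_norm: float, r: float, alpha: float = 10.0) -> float:
    """P_blood = sigma(alpha * (H_norm - 0.5)) * Phi(r) for one pixel."""
    return sigmoid(alpha * (h_norm - 0.5)) * radial_weight(r)

# A strongly hemoglobin-like pixel near the centre scores high; the same
# pixel near the image border is suppressed by the radial term.
print(p_blood(0.9, 0.1), p_blood(0.9, 0.95))
```

In the input-fusion configuration described in the abstract, a map of such values would be supplied to the network as auxiliary channels alongside RGB.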

Chinese Translation

背景：RGB训练的胶囊内窥镜分类器在小血管病变发现方面表现不佳，因为它将血红蛋白对比度与胆汁和照明衰减混淆。因此，本文测试了一种蒙特卡洛启发的分析模型能否从RGB信号中估计血红蛋白并改进此类分类器。方法：在Kvasir-Capsule（47,238帧，视频级别70/15/15拆分，11个可评估类别）上，我们评估了两种仅软件配置与RGB-only EfficientNet-B0在6个种子下的表现：（i）一个先验P_blood = sigma(alpha * (H_norm - 0.5)) * Phi(r)，融合为2个零初始化辅助通道；（ii）一个蒸馏头训练一个3通道RGB主干以预测P_blood。显著性：使用配对DeLong、McNemar、自助法置信区间及Bonferroni校正。结果：在6个种子（n=6,423）中，分析先验提供了一个小但方向一致的宏观AUC改进：RGB-only 0.760 +/- 0.027，输入融合0.783 +/- 0.024（配对Delta = +0.023，在5/6个种子上符号为正），蒸馏0.773 +/- 0.028。每个类别的最大稳健提升出现在淋巴管扩张症（Lymphangiectasia），其AUC从RGB的0.238 +/- 0.057上升到输入融合的0.337 +/- 0.019，在所有6个种子中符号一致。在罕见的局灶性血管类别（Angiectasia，Blood - fresh）中，先验的每个种子效应呈双峰分布：种子=42的血管扩张（Angiectasia）AUC从0.528提升至0.916，但跨种子的均值为0.646 -> 0.608，sigma_PI = 0.23，因此种子42的结果被报告为高方差的单种子示例。结论：一种蒙特卡洛启发的分析先验在Kvasir-Capsule的6个种子中提供了一个小且方向一致的宏观AUC改进，其中每个类别的最大稳健提升出现在淋巴管扩张症；蒸馏变体在普通的3通道RGB上运行，并且无需额外成本即可产生可解释性热图。

cs.CV / 132 / 2605.15071

On the Cultural Anachronism and Temporal Reasoning in Vision Language Models

视觉语言模型中的文化时代错误与时间推理

Ranjan, Mukul, Jha, Prince, Kumari, Khushboo, Shen, Zhiqiang

Abstract

Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism: the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available on our project page.

Chinese Translation

视觉语言模型（VLMs）越来越多地应用于文化遗产材料，从数字档案到教育平台。本研究识别出这些模型在解读历史文物时存在的一个根本性问题。我们将这一现象定义为文化时代错误，即使用时间上不恰当的概念、材料或文化框架来误解历史物品的倾向。为了量化这一现象，我们引入了视觉语言模型的时间时代错误基准（TAB-VLM），这是一个包含600个问题的数据库，分为六个类别，旨在评估对1600件印度文化遗产的时间推理，这些遗产跨越史前到现代时期。对十个最先进模型的系统评估揭示了在我们的基准测试中存在显著缺陷，即使是表现最佳的模型（GPT-5.2）整体准确率也仅为58.7%。这一性能差距在不同架构和规模下依然存在，表明文化时代错误在视觉人工智能系统中代表了一个重要的局限，无论模型规模如何。这些发现突显了当前VLM能力与准确解读文化遗产材料的要求之间的差距，尤其是在训练数据中代表性不足的非西方视觉文化方面。我们的基准为增强与历史文物互动的多模态人工智能系统中的时间认知提供了基础。数据集和代码可在我们的项目页面获取。

cs.CV / 133 / 2605.15088

SAGE3D: Soft-guided attention and graph excitation for 3D point cloud corner detection

SAGE3D：用于三维点云角点检测的软引导注意力和图激励

Bekar, Batuhan Arda, Sarı, Can, Gülkan, Hüseyin Can, Özcan, Barış

Abstract

We present SAGE3D, a hybrid Transformer-based model for corner detection in airborne LiDAR point clouds. We propose a multi-stage solution built on a hierarchical encoder-decoder architecture that progressively downsamples point clouds through Set Abstraction layers and recovers per-point predictions via Feature Propagation. We introduce two innovations. The first is Soft-Guided Attention, which injects ground-truth corner labels as a log-prior into attention logits during training to improve precision. The second is an Excitatory Graph Neural Network, positioned at strategic resolutions in the hierarchy, which employs positive-only message passing in which high-confidence corners reinforce predictions through learned boosting, optimizing for recall. The hierarchical design enables multi-scale feature extraction, while our guided attention and excitatory modules ensure corner signals are amplified rather than diluted across scales.
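
The idea of injecting labels as a log-prior into attention logits can be sketched as follows. This is an illustrative reading, not the authors' implementation: the prior values (1.0 for labelled corners, a hypothetical `background_prior` for everything else) and the function names are assumptions. Adding log(prior) to a logit before the softmax is equivalent to multiplying the unnormalized attention weight by the prior, which is why the guidance is "soft" rather than a hard mask.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def soft_guided_attention(logits, corner_labels=None, background_prior=0.5):
    """Attention weights with an optional label-derived log-prior.

    Assumption: during training, keys labelled as corners keep prior 1.0 and
    other keys get `background_prior` in (0, 1]; at inference no labels are
    available, so plain softmax attention is used.
    """
    if corner_labels is None:
        return softmax(logits)
    guided = [logit + math.log(1.0 if is_corner else background_prior)
              for logit, is_corner in zip(logits, corner_labels)]
    return softmax(guided)

# Four keys with equal logits; the first is a ground-truth corner.
print(soft_guided_attention([0.0, 0.0, 0.0, 0.0], [1, 0, 0, 0]))
print(soft_guided_attention([0.0, 0.0, 0.0, 0.0]))
```

In this toy case the corner key receives twice the weight of each non-corner key during training, while inference falls back to uniform attention.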

Chinese Translation

我们提出了SAGE3D，这是一种基于混合Transformer的模型，用于航空激光雷达点云中的角点检测。我们提出了一种多阶段解决方案，基于层次编码器-解码器架构，通过集合抽象层逐步下采样点云，并通过特征传播恢复每个点的预测。我们引入了两个创新：软引导注意力（Soft-Guided Attention），在训练期间将真实角点标签作为对数先验注入到注意力逻辑中，以提高精度；然后是位于层次结构中战略性分辨率的激励图神经网络（Excitatory Graph Neural Network），采用仅正向消息传递的方式，在高置信度角点通过学习的增强来强化预测，优化召回率。层次设计使得多尺度特征提取成为可能，而我们的引导注意力和激励模块确保角点信号在不同尺度上得到放大，而不是被稀释。

cs.CV / 134 / 2605.15093

CoralLite: µCT Reconstruction of Coral Colonies from Individual Corallites

CoralLite: 从单个珊瑚虫的微型CT重建珊瑚群落

Jones, Jess, Bertini, Leonardo, Johnson, Kenneth, Hendy, Erica, Burghardt, Tilo

Abstract

The life history of an individual coral is archived within the accreting skeleton of the colony. While reef-forming coral colonies (e.g. massive Porites sp.) may live for hundreds of years and deposit calcareous structures many metres in height and width, their living tissue is a thin outer surface layer composed of asexually dividing polyps that only survive a few years. To understand the rate and timing of polyp division and the consequences for colony skeletal growth, scientists need to track the skeletal corallite deposited around each polyp. Here we propose CoralLite, an annotated µCT scan dataset of entire calcareous skeletons and an associated, first corallite deep learning reconstruction baseline. CoralLite combines fully quantified volumetric segmentations with cross-slice linking for visualisations of 3D models for each corallite up to colony scale. For segmentation, we propose and evaluate in detail a hybrid V-Trans-UNet architecture applicable to segmenting tiled µCT virtual slabs of Porites sp. colonies. The model is pre-trained on weakly annotated data and fine-tuned in a topology-aware manner using fully annotated slice sections with 8k+ manual corallite region annotations. On unseen slices, the resulting model reaches 0.94 topological accuracy with a mean Dice score of 0.77 on the same colony and projection axis, and a mean Dice score of 0.63 on a different, biologically unrelated specimen. Whilst our experiments are limited in scale and context, our results show for the first time that visual machine learning can effectively support full 3D individual corallite modelling from µCT scans of coral skeletons alone. For reproducibility and as a baseline for future research we publish our full dataset of 697 µCT slices, 37 partial or full slice annotations, and all network weights and source code with this paper.
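
For readers unfamiliar with the overlap metric quoted in this abstract, the Dice score between a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it for flat binary masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration, and the paper's exact evaluation code may differ.

```python
def dice_score(predicted, reference):
    """Dice coefficient between two equal-length binary masks (0/1 values)."""
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, r in zip(predicted, reference) if p and r)
    total = sum(1 for p in predicted if p) + sum(1 for r in reference if r)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total

# One of two predicted pixels overlaps one of two reference pixels.
print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

A score of 1.0 means the masks coincide exactly and 0.0 means they share no foreground pixels, so the reported 0.77 and 0.63 indicate substantial but imperfect overlap.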

Chinese Translation

个体珊瑚的生活史记录在群落不断累积的骨骼中。尽管造礁珊瑚群落（例如，巨型的 Porites 物种）可能生存数百年并沉积出高达数米的钙质结构，但其活体组织仅为一层薄薄的外表面，由无性繁殖的珊瑚虫组成，通常只能存活几年。为了理解珊瑚虫分裂的速率和时机及其对群落骨骼生长的影响，科学家需要追踪每个珊瑚虫周围沉积的骨骼珊瑚虫。本文提出了 CoralLite，一个包含整个钙质骨骼的注释微型CT扫描数据集及其相关的首个珊瑚虫深度学习重建基线。CoralLite 将完全量化的体积分割与切片交叉链接相结合，以可视化每个珊瑚虫的三维模型，扩展至群落规模。对于分割，我们提出并详细评估了一种适用于分割 Porites 物种群落的平铺微型CT虚拟切片的混合 V-Trans-UNet 架构。该模型在弱注释数据上进行预训练，并使用具有 8k+ 手动珊瑚虫区域注释的完全注释切片部分进行拓扑感知的微调。在同一群落的未见切片上，所得到的模型在相同群落和投影轴上达到 0.94 的拓扑准确度，平均 Dice 分数为 0.77，而在不同的生物无关样本上，平均 Dice 分数为 0.63。尽管我们的实验在规模和背景上有限，但我们的结果首次表明，视觉机器学习可以有效支持仅通过珊瑚骨骼的微型CT扫描实现完整的三维个体珊瑚虫建模。为了可重复性以及作为未来研究的基线，我们与本文一起发布了完整的数据集，包括 697 个微型CT切片、37 个部分或完整切片注释，以及所有网络权重和源代码。

cs.CV / 135 / 2605.15116

DriveCtrl: Conditioned Sim-to-Real Driving Video Generation

DriveCtrl：条件化的仿真到现实驾驶视频生成

Zhao, Haonan, Wang, Yiting, Chen, Jingkun, Donzella, Valentina, Bashford-Rogers, Thomas, Debattista, Kurt

Abstract

Large-scale labelled driving video data is essential for training autonomous driving systems. Although simulation offers scalable and fully annotated data, the domain gap between synthetic and real-world driving videos significantly limits its utility for downstream deployment. Existing video generation methods are not well-suited for this task, as they fail to simultaneously preserve scene structure, object dynamics, temporal consistency, and visual realism, all of which are critical for maintaining annotation validity in generated data. In this paper, we present DriveCtrl, a depth-conditioned controllable sim-to-real video generation framework for realistic driving video synthesis. Built upon a pretrained video foundation model, DriveCtrl introduces a structure-aware adapter that enables depth-guided generation while preserving the scene layout and motion patterns of the source simulation, producing temporally coherent driving videos that remain aligned with the original simulated sequences. We further introduce a scalable data generation pipeline that transforms simulator videos into realistic driving footage matching the visual style of a target real-world dataset. The pipeline supports three conditioning signals: structural depth, reference-dataset style, and text prompts, while preserving frame-level annotations for downstream perception tasks. To better assess this task, we propose a driving-domain-specific knowledge-informed evaluation metric called Driving Video Realism Score (DVRS) that assesses the realism of generated videos. Experiments demonstrate that DriveCtrl consistently outperforms the base model and competing alternatives in realism, temporal quality, and perception task performance, substantially narrowing the sim-to-real gap for driving video generation.

Chinese Translation

大规模标注的驾驶视频数据对于训练自动驾驶系统至关重要。尽管仿真提供了可扩展且完全注释的数据，但合成与现实世界驾驶视频之间的领域差距显著限制了其在下游部署中的实用性。现有的视频生成方法并不适合此任务，因为它们无法同时保持场景结构、物体动态、时间一致性和视觉真实感，而这些都是维护生成数据注释有效性的重要因素。本文提出了DriveCtrl，一种深度条件控制的仿真到现实视频生成框架，用于真实驾驶视频合成。DriveCtrl建立在预训练的视频基础模型之上，引入了一种结构感知适配器，使得在保持源仿真场景布局和运动模式的同时，实现深度引导生成，从而生成时间上连贯的驾驶视频，与原始仿真序列保持一致。我们进一步提出了一种可扩展的数据生成管道，将仿真器视频转换为匹配目标现实世界数据集视觉风格的真实驾驶视频。该管道支持三种条件信号：结构深度、参考数据集风格和文本提示，同时保留帧级注释以用于下游感知任务。为了更好地评估这一任务，我们提出了一种特定于驾驶领域的知识驱动评估指标，称为驾驶视频真实感评分（Driving Video Realism Score, DVRS），用于评估生成视频的真实感。实验表明，DriveCtrl在真实感、时间质量和感知任务性能方面始终优于基础模型和竞争替代方案，显著缩小了驾驶视频生成中的仿真到现实差距。

cs.CV / 136 / 2605.15128

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

MemEye：一种面向视觉的多模态智能体记忆评估框架

Guo, Minghao, Jiao, Qingyue, Shi, Zeru, Quan, Yihao, Zhang, Boxuan, Li, Danrui, Che, Liwei, Xu, Wujiang, Liu, Shilong, Liu, Zirui, Kapadia, Mubbasir, Pavlovic, Vladimir, Liu, Jiang, Wang, Mengdi, Shi, Yiyu, Metaxas, Dimitris N., Tang, Ruixiang

Abstract

Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities from two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.

Chinese Translation

长期智能体记忆日益呈现多模态特征，但现有评估很少测试智能体是否保留了后续推理所需的视觉证据。在以往的研究中，许多以视觉为基础的问题仅通过标题或文本痕迹就能得到解答，这使得答案可以在不保留细粒度视觉证据的情况下推断出来。同时，需要对变化的视觉状态进行推理的更复杂案例几乎不存在。因此，我们提出了MemEye，一个从两个维度评估记忆能力的框架：一个维度衡量决定性视觉证据的粒度（从场景级到像素级证据），另一个维度衡量检索到的证据的使用方式（从单一证据到演变合成）。在这个框架下，我们构建了一个涵盖8个生活场景任务的新基准，并设立了基于消融实验的验证门，以评估答案可得性、捷径抵抗力、视觉必要性和推理结构。通过对4个VLM（视觉语言模型）骨干网络中的13种记忆方法进行评估，我们发现当前的架构仍然难以保留细粒度的视觉细节，并对随时间变化的状态进行推理。我们的研究结果表明，长期的多模态记忆依赖于证据路由、时间跟踪和细节提取。

cs.CV / 137 / 2605.15141

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

因果强制++：可扩展的少步自回归扩散蒸馏用于实时交互视频生成

Zhao, Min, Zhu, Hongzhou, Zheng, Kaiwen, Zhou, Zihan, Yan, Bokai, Li, Xinyuan, Yang, Xiao, Li, Chongxuan, Zhu, Jun

Abstract

Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1-2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by about 4×. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM .

Chinese Translation

实时交互视频生成需要低延迟、流式处理和可控的生成过程。现有的自回归（AR）扩散蒸馏方法在分块的4步模式下通过将双向基础模型蒸馏为少步AR学生取得了良好的效果，但仍然受到粗糙响应粒度和不可忽视的采样延迟的限制。本文研究了一种更激进的设置：仅使用1-2个采样步骤的逐帧自回归。在这种模式下，我们将少步AR学生的初始化视为关键瓶颈：现有策略要么目标不对齐，要么无法进行少步生成，要么成本过高、难以扩展。我们提出了因果强制++（Causal Forcing++），这是一个原则性和可扩展的管道，使用因果一致性蒸馏（causal CD）进行少步AR初始化。其核心思想是因果CD学习与因果常微分方程（ODE）蒸馏相同的AR条件流图，但从相邻时间步之间的单个在线教师ODE步骤中获取监督，避免了预计算和存储完整PF-ODE轨迹的需要。这使得初始化过程更加高效且更易于优化。最终的管道因果强制++在逐帧2步设置下超越了SOTA 4步分块因果强制，在VBench总分上提高了0.1，在VBench质量上提高了0.3，在VisionReward上提高了0.335，同时将首帧延迟减少了50%，将阶段2训练成本降低了约4倍。我们进一步扩展了该管道，以符合Genie3的精神进行动作条件的世界模型生成。项目页面：https://github.com/thu-ml/Causal-Forcing 和 https://github.com/shengshu-ai/minWM 。

cs.CV / 138 / 2605.15167

Does Synthetic Layered Design Data Benefit Layered Design Decomposition?

合成分层设计数据是否有助于分层设计分解？

Wu, Kam Man, Yang, Haolin, Chen, Qingyu, Tang, Yihu, Chen, Jingye, Chen, Qifeng

Abstract

Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-generation editing remains challenging, revealing a clear last-mile gap toward practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors, and both strategies face fundamental challenges in scalability. In this work, we investigate whether purely synthetic layered data can improve graphic design decomposition. We make the assumption that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as in natural-image composition, since design elements are often intentionally arranged as modular and semantically separable components. Concretely, we conduct a data-centric study based on the CLD baseline, a state-of-the-art layer decomposition framework. Building on this baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision using vision-language models, and automate inference inputs with VLM-predicted bounding boxes. Our study reveals three key findings: (1) even training with purely synthetic data can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating its viability as a scalable and effective substitute; (2) performance consistently improves with increased training data scale, while gains begin to saturate at around 50K samples; and (3) synthetic data enables balanced control over layer-count distributions, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.

Chinese Translation

近年来，图像生成技术的进步使得高质量图像的生成变得容易。然而，这些输出本质上是扁平化的，将前景元素、背景和文本混合在一个固定的画布中。因此，灵活的后期生成编辑仍然面临挑战，揭示了在实际可用性方面的明显最后一公里差距。现有的方法要么依赖稀缺的专有分层资产，要么从有限的结构先验中构建部分合成数据。然而，这两种策略在可扩展性方面都面临根本性挑战。在本研究中，我们探讨纯合成分层数据是否能改善图形设计分解。我们假设，在图形设计中，有效的分解并不需要像自然图像合成那样精确建模层间依赖关系，因为设计元素通常被故意安排为模块化和语义可分离的组件。具体而言，我们基于CLD基线进行了一项以数据为中心的研究，CLD是一个最先进的层分解框架。在此基线的基础上，我们构建了自己的合成数据集SynLayers，利用视觉语言模型生成文本监督，并使用VLM预测的边界框自动化推理输入。我们的研究揭示了三个关键发现：(1) 即使仅用合成数据进行训练，也能超越诸如广泛使用的PrismLayersPro数据集等不可扩展的替代方案，证明其作为可扩展且有效的替代品的可行性；(2) 随着训练数据规模的增加，性能持续改善，而增益在约50K样本时开始饱和；(3) 合成数据能够平衡层数分布的控制，避免了现实世界数据集中常见的层数不平衡。我们希望这项以数据为中心的研究能够鼓励更广泛地采用合成数据，作为分层设计编辑系统的实际基础。

cs.CV / 139 / 2605.15171

Evidential Reasoning Advances Interpretable Real-World Disease Screening

证据推理促进可解释的现实世界疾病筛查

Lian, Chenyu, Zhou, Hong-Yu, Qin, Jing

Abstract

Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region-level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence-aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post-hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real-world disease screening, yielding notably higher specificity at clinical-level recall. Code is publicly available at https://github.com/DopamineLcy/EviScreen.

Chinese Translation

疾病筛查对于临床实践中的早期发现和及时干预至关重要。然而，目前大多数医学图像筛查模型在可解释性和性能方面存在不足。它们通常缺乏有效的机制来参考历史案例或提供透明的推理路径。为了解决这些挑战，我们提出了EviScreen，一个基于证据推理的疾病筛查框架，利用来自历史案例的区域级证据。所提出的EviScreen通过从双重知识库中检索的区域证据提供回溯可解释性。利用这种证据机制，后续的证据感知推理模块结合当前案例和历史案例的证据进行预测，从而提升疾病筛查的性能。此外，EviScreen通过利用从对比检索中获得的异常图，增强了定位可解释性，而不是依赖于事后显著性图。我们的方法在我们精心建立的现实世界疾病筛查基准上表现出色，在临床级召回率下显著提高了特异性。代码已公开，地址为 https://github.com/DopamineLcy/EviScreen。

cs.CV / 140 / 2605.15178

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

SANA-WM：具有混合线性扩散变换器的高效分钟级世界建模

Zhu, Haoyi, Liu, Haozhe, Zhao, Yuyang, Ye, Tian, Chen, Junsong, Yu, Jincheng, He, Tong, Han, Song, Xie, Enze

Abstract

We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WM demonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only ~213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at 36× higher throughput for scalable world modeling.

Chinese Translation

我们介绍了SANA-WM，一个高效的2.6B参数开源世界模型，原生训练用于一分钟生成，合成高保真度、720p的分钟级视频，并实现精确的相机控制。SANA-WM的视觉质量可与大型工业基准（如LingBot-World和HY-WorldPlay）相媲美，同时显著提高了效率。我们的架构由四个核心设计驱动：（1）混合线性注意力将逐帧的Gated DeltaNet（GDN）与softmax注意力结合，以实现内存高效的长上下文建模。（2）双分支相机控制确保精确的6自由度轨迹遵循。（3）两阶段生成管道对阶段1输出应用长视频细化器，提高序列间的质量和一致性。（4）稳健的注释管道从公共视频中提取准确的度量尺度6自由度相机姿态，以生成高质量、时空一致的动作标签。在这些设计的驱动下，SANA-WM在数据、训练计算和推理硬件方面展现出显著的效率：它仅使用约213K个公共视频剪辑，并进行度量尺度姿态监督，在64个H100上完成15天的训练，并在单个GPU上生成每个60秒的剪辑；其蒸馏变体可以在单个RTX 5090上部署，使用NVFP4量化在34秒内去噪一个60秒的720p剪辑。在我们的分钟级世界模型基准上，SANA-WM展现出比先前的开源基准更强的动作跟随准确性，并在可扩展的世界建模中以36倍更高的吞吐量实现了可比的视觉质量。

cs.CV / 141 / 2605.15181

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

从计划到像素：学习规划和协调以实现开放式图像编辑

Rajan, Anirudh Sundara, Singh, Krishna Kumar, Lee, Yong Jae

Abstract

Modern image editing models produce realistic results but struggle with abstract, multi-step instructions (e.g., "make this advertisement more vegetarian-friendly"). Prior agent-based methods decompose such tasks but rely on handcrafted pipelines or teacher imitation, limiting flexibility and decoupling learning from actual editing outcomes. We propose an experiential framework for long-horizon image editing, where a planner generates structured atomic decompositions and an orchestrator selects tools and regions to execute each step. A vision-language judge provides outcome-based rewards for instruction adherence and visual quality. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner. By tightly coupling planning with reward-driven execution, our approach yields more coherent and reliable edits than single-step or rule-based multi-step baselines.

Chinese Translation

现代图像编辑模型能够生成逼真的结果，但在处理抽象的多步骤指令（例如，“使这则广告更适合素食者”）时存在困难。之前基于代理的方法虽然能够分解此类任务，但依赖于手工制作的流程或教师模仿，这限制了灵活性并使学习与实际编辑结果脱钩。我们提出了一种用于长时间范围图像编辑的体验框架，其中规划者生成结构化的原子分解，而协调者选择工具和区域以执行每一步。视觉语言评判者根据指令遵循性和视觉质量提供基于结果的奖励。协调者的训练目标是最大化这些奖励，而成功的轨迹则用于优化规划者。通过紧密结合规划与基于奖励的执行，我们的方法比单步或基于规则的多步骤基线产生更连贯和可靠的编辑结果。

cs.CV / 142 / 2605.15182

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

历史扭曲：从一段训练视频生成可泛化的相机控制视频

Wang, Yifan, He, Tong

Abstract

Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model's visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.

Chinese Translation

相机控制的视频生成已经取得了显著进展，使得生成的视频能够遵循预定的视点轨迹。然而，现有的方法通常通过相机编码器、控制分支或注意力和位置编码的修改来学习特定于相机的条件，这往往需要在大规模相机标注视频上进行后训练。无训练替代方案避免了这种后训练，但往往将成本转移到测试时优化或额外的去噪时间指导上。我们提出了历史扭曲（Warp-as-History），这是一个简单的接口，将相机引起的扭曲转化为具有目标帧位置对齐和可见标记选择的相机扭曲伪历史。给定一个目标相机轨迹，我们从过去的观察中构建相机扭曲伪历史，并通过模型的视觉历史路径输入。关键是，我们将其位置编码与正在去噪的目标帧对齐，并去除没有有效源观察的扭曲历史标记。在没有任何训练、架构修改或测试时优化的情况下，该接口揭示了一个冻结的视频生成模型在跟随相机轨迹方面的非平凡零-shot能力。此外，仅在一段相机标注视频上进行轻量级的离线LoRA微调进一步提高了这一能力，并且能够推广到未见过的视频，提高了相机遵循性、视觉质量和运动动态，而无需测试时优化或目标视频适应。对多样化数据集的广泛实验验证了我们方法的有效性。

cs.CV / 143 / 2605.15185

Quantitative Video World Model Evaluation for Geometric-Consistency

几何一致性的定量视频世界模型评估

Wu, Jiaxin, Pi, Yihao, Zhang, Yinling, Li, Yuheng, Zou, Xueyan

Abstract

Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world models. Our code and dataset can be found at https://pdi-bench.github.io/.
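As a rough illustration of the scale-depth alignment idea, the sketch below uses the pinhole relation: a rigid object of fixed physical size has an apparent image size inversely proportional to its depth, so the product of size and depth should stay constant over a clip. The function name `scale_depth_residual` and the choice of a log-domain standard deviation are assumptions made for illustration; the abstract does not specify how PDI-Bench defines its residuals.

```python
import math
from statistics import pstdev

def scale_depth_residual(apparent_sizes, depths):
    """Hypothetical residual: spread of log(size * depth) across frames.

    Under a pinhole camera, size = f * S / depth for an object of physical
    size S, so size * depth is constant for a geometrically coherent clip.
    A value near 0 means apparent size and depth agree.
    """
    logs = [math.log(s * z) for s, z in zip(apparent_sizes, depths)]
    return pstdev(logs)

# Coherent clip: object recedes and shrinks as 1/depth (f * S = 1000).
depths = [4.0, 5.0, 8.0, 10.0]
coherent_sizes = [250.0, 200.0, 125.0, 100.0]

# Incoherent clip: object recedes but its apparent size never changes.
frozen_sizes = [250.0, 250.0, 250.0, 250.0]

coherent = scale_depth_residual(coherent_sizes, depths)
incoherent = scale_depth_residual(frozen_sizes, depths)
```

In practice the sizes and depths would come from the segmentation, tracking, and monocular reconstruction stages named in the abstract; here they are hand-written toy values.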

Chinese Translation

生成视频模型作为隐式世界模型的研究日益增多，但评估它们是否产生物理上合理的三维结构和运动仍然具有挑战性。现有的大多数视频评估流程在很大程度上依赖于人类判断或学习的评分者，这可能是主观的，并且对几何失败的诊断能力较弱。我们提出了 PDI-Bench（透视失真指数），这是一个用于审计生成视频中几何一致性的定量框架。给定一个生成的剪辑，我们通过分割和点跟踪（例如，SAM 2、MegaSaM 和 CoTracker3）获得以物体为中心的观察，将其通过单目重建提升到三维世界坐标，并计算一组捕捉三个失败维度的投影几何残差：尺度-深度对齐、三维运动一致性和三维结构刚性。为了支持系统评估，我们构建了 PDI-Dataset，涵盖了旨在强调这些几何约束的多样化场景。在最先进的视频生成器中，PDI 揭示了一致的几何特定失败模式，这些模式未被常见的感知指标捕捉，并为朝着物理基础的视频生成和物理世界模型的进展提供了诊断信号。我们的代码和数据集可以在 https://pdi-bench.github.io/ 找到。

cs.CV / 144 / 2605.15186

VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

VGGT-Edit：基于残差场预测的前馈原生3D场景编辑

Zhu, Kaixin, Tang, Yiwen, Yang, Yifan, Zhang, Renrui, Zeng, Bohan, Guo, Ziyu, An, Ruichuan, Liu, Zhou, Chen, Qizhi, Qu, Delin, Yoon, Jaehong, Zhang, Wentao

Abstract

High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, as 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed.

Chinese Translation

高质量的3D场景重建最近朝着可泛化的前馈架构发展，使得在单次前向传递中生成复杂环境成为可能。然而，尽管这些模型在静态场景感知方面表现出色，但它们在响应动态人类指令方面仍然有限，这限制了它们在交互应用中的使用。现有的编辑方法通常依赖于2D提升策略，即独立编辑各个视图，然后再提升回3D空间。这种间接的流程往往导致模糊的纹理和不一致的几何形状，因为2D编辑器缺乏跨视点保持结构所需的空间意识。为了解决这些局限性，我们提出了VGGT-Edit，一种基于文本条件的前馈原生3D场景编辑框架。VGGT-Edit引入了深度同步文本注入，以将语义指导与主干网络的空间姿态对齐，从而确保稳定的指令基础。该语义信号随后通过一个残差变换头进行处理，直接预测3D几何位移，以变形场景，同时保持背景的稳定性。为了确保高保真结果，我们使用多项目标函数对框架进行监督，以强制执行几何准确性和跨视图一致性。我们还构建了DeltaScene数据集，这是一个通过自动化管道生成的大规模数据集，并通过3D一致性过滤以确保真实质量。实验表明，VGGT-Edit在多个方面显著优于2D提升基线，生成更清晰的物体细节、更强的多视图一致性，以及近乎即时的推理速度。

cs.CV / 145 / 2605.15187

Articraft: An Agentic System for Scalable Articulated 3D Asset Generation

Articraft：一种可扩展的关节式三维资产生成的智能系统

Zhou, Matt, Li, Ruining, Lyu, Xiaoyang, Song, Zhaomou, Huang, Zhening, Zheng, Chuanxia, Rupprecht, Christian, Vedaldi, Andrea, Wu, Shangzhe

Abstract

A bottleneck in learning to understand articulated 3D objects is the lack of large and diverse datasets. In this paper, we propose to leverage large language models (LLMs) to close this gap and generate articulated assets at scale. We reduce the problem of generating an articulated 3D asset to that of writing a program that builds it. We then introduce a new agentic system, Articraft, that writes such programs automatically. We design a programmatic interface and harness to help the LLM do so effectively. The LLM writes code against a domain-specific SDK for defining parts, composing geometry, specifying joints, and writing tests to validate the resulting assets. The harness exposes a restricted workspace and interface to the LLM, validates the resulting assets, and returns structured feedback. In this way, the LLM is not distracted by details such as authoring a URDF file or managing a complex software environment. We show that this produces higher-quality assets than both state-of-the-art articulated-asset generators and general-purpose coding agents. Using Articraft, we build Articraft-10K, a curated dataset of over 10K articulated assets spanning 245 categories, and show its utility both for training models of articulated assets and in downstream applications such as robotics simulation and virtual reality.

Chinese Translation

理解关节式三维物体的学习瓶颈在于缺乏大型且多样化的数据集。本文提出利用大型语言模型（LLMs）来填补这一空白，并大规模生成关节式资产。我们将生成关节式三维资产的问题简化为编写构建该资产的程序。随后，我们介绍了一种新的智能系统Articraft，它能够自动编写此类程序。我们设计了一个程序接口，并构建了一个工具，以帮助LLM有效地完成这一任务。LLM根据一个特定领域的SDK编写代码，以定义部件、组合几何形状、指定关节，并编写测试以验证生成的资产。该工具为LLM提供了一个受限的工作空间和接口，验证生成的资产，并返回结构化反馈。通过这种方式，LLM不会被诸如编写URDF文件或管理复杂软件环境等细节所分散注意力。我们展示了这种方法生成的资产质量高于当前最先进的关节式资产生成器和通用编码代理。使用Articraft，我们构建了Articraft-10K，这是一个包含超过10,000个关节式资产的精心策划的数据集，涵盖245个类别，并展示了它在训练关节式资产模型及下游应用（如机器人仿真和虚拟现实）中的实用性。

cs.CV / 146 / 2605.15190

RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

RAVEN：基于一致性模型GRPO的实时自回归视频外推

Lu, Yanzuo, Zuo, Ronglai, Deng, Jiankang

Abstract

Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive few-step models, yet a persistent gap between the history distributions encountered during training and those arising at inference constrains generation quality over long horizons. We introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. We further propose Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online Reinforcement Learning (RL) directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations. Experiments demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, and that CM-GRPO provides further gains when combined with RAVEN.
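The two ingredients named above, a group-relative advantage and a conditional Gaussian transition, can be sketched generically as follows. This is not the CM-GRPO objective itself: the abstract does not give the mean or variance parameterization of the consistency step, so `gaussian_log_prob` simply evaluates an isotropic Gaussian with a caller-supplied mean and standard deviation. Normalizing by the population standard deviation is likewise an assumption.

```python
import math
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

def gaussian_log_prob(x, mu, sigma):
    """Log density of an isotropic Gaussian N(mu, sigma^2 I) evaluated at x."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mu))
    return (-0.5 * sq_dist / sigma ** 2
            - d * math.log(sigma)
            - 0.5 * d * math.log(2.0 * math.pi))

# Toy group of three rollouts scored by a reward model.
advantages = group_relative_advantages([1.0, 2.0, 3.0])

# Log-probability of a sampled next state under a unit Gaussian transition.
logp = gaussian_log_prob([0.0], [0.0], 1.0)
```

A policy-gradient update would weight each sample's transition log-probability by its advantage; the clipping and KL terms of a full GRPO loss are omitted here.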

Chinese Translation

因果自回归视频扩散模型通过从先前生成的内容中外推未来片段，支持实时流生成。从高保真双向教师中提炼此类生成器可产生具有竞争力的少步模型，但训练期间遇到的历史分布与推理时产生的分布之间的持续差距限制了长时间范围内的生成质量。我们引入了实时自回归视频外推网络（RAVEN），这是一个训练时测试框架，将每个自我回滚重新打包成干净的历史端点和嘈杂的去噪状态交错序列。这种公式使训练注意力与推理时的外推对齐，并允许下游片段损失监督未来预测所依赖的历史表示。我们进一步提出了一致性模型组相对策略优化（CM-GRPO），将一致性采样步骤重新表述为条件高斯转移，并直接将在线强化学习（RL）应用于该核，避免了先前流模型RL公式中采用的欧拉-马鲁亚马辅助过程。实验表明，RAVEN在质量、语义和动态程度评估上超越了最近的因果视频蒸馏基线，并且当与RAVEN结合时，CM-GRPO提供了进一步的提升。

cs.CV / 147 / 2605.15193

Aligning Latent Geometry for Spherical Flow Matching in Image Generation

在图像生成中对齐潜在几何以实现球面流匹配

Meral, Tuna Han Salih, Oktay, Kaan, Yesiltepe, Hidir, Akan, Adil Kaan, Yanardag, Pinar

Abstract

Latent flow matching for image generation usually transports Gaussian noise to variational autoencoder latents along linear paths. Both endpoints, however, concentrate in thin spherical shells, and a Euclidean chord leaves those shells even when preprocessing aligns their radii. By decomposing each latent token into radial and angular components, we show through component-swap probes that decoded perceptual and semantic content is carried predominantly by direction, with radius contributing much less. We therefore project data latents onto a fixed token radius, use the radial projection of Gaussian noise as the spherical prior, finetune the decoder with the encoder frozen, and replace linear interpolation with spherical linear interpolation. The resulting geodesic paths stay on the sphere at every timestep, and their velocity targets are purely angular by construction. Under matched training, the method consistently improves class-conditional ImageNet-256 FID across different image tokenizers, leaves the diffusion architecture unchanged, and requires no auxiliary encoder or representation-alignment objective.
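The geometric claim can be checked on a toy example. The sketch below projects two vectors onto a shared radius, interpolates them with spherical linear interpolation (slerp), and confirms that the path stays on the sphere with a tangent (purely angular) velocity, while the straight-line midpoint falls inside the shell. The helper names and the three-dimensional vectors are assumptions for illustration; the paper works with per-token VAE latents, and the code assumes the two endpoints are not antipodal.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_to_radius(v, radius):
    """Radially project v onto the sphere of the given radius."""
    n = norm(v)
    return [radius * x / n for x in v]

def _angle(x0, x1):
    r = norm(x0)
    cos_omega = max(-1.0, min(1.0, dot(x0, x1) / (r * r)))
    return math.acos(cos_omega)

def slerp(x0, x1, t):
    """Point at time t on the geodesic from x0 to x1 (equal norms assumed)."""
    omega = _angle(x0, x1)
    if omega < 1e-8:  # endpoints coincide; nothing to interpolate
        return list(x0)
    s = math.sin(omega)
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]

def slerp_velocity(x0, x1, t):
    """Time derivative of slerp, i.e. the velocity target along the geodesic."""
    omega = _angle(x0, x1)
    if omega < 1e-8:
        return [0.0] * len(x0)
    s = math.sin(omega)
    dw0 = -omega * math.cos((1.0 - t) * omega) / s
    dw1 = omega * math.cos(t * omega) / s
    return [dw0 * a + dw1 * b for a, b in zip(x0, x1)]

radius = 2.0
noise = project_to_radius([1.0, 0.0, 0.0], radius)   # stand-in for projected Gaussian noise
latent = project_to_radius([0.0, 3.0, 0.0], radius)  # stand-in for a projected data latent

x_mid = slerp(noise, latent, 0.5)
v_mid = slerp_velocity(noise, latent, 0.5)
chord_mid = [0.5 * a + 0.5 * b for a, b in zip(noise, latent)]
```

Because the velocity is orthogonal to the position at every timestep, a model trained on these targets never has to predict a radial component, which matches the abstract's description of purely angular targets.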

Chinese Translation

图像生成中的潜在流匹配通常沿线性路径将高斯噪声传输到变分自编码器的潜在空间。然而，两个端点集中在薄的球面壳中，即使在预处理对齐其半径时，欧几里得弦也会离开这些壳。通过将每个潜在标记分解为径向和角度分量，我们通过组件交换探针表明，解码的感知和语义内容主要由方向承载，而半径的贡献则相对较小。因此，我们将数据潜在空间投影到固定的标记半径上，使用高斯噪声的径向投影作为球面先验，冻结编码器并微调解码器，并用球面线性插值替代线性插值。所得到的测地线路径在每个时间步都保持在球面上，其速度目标在构造上完全是角度性的。在匹配训练下，该方法在不同的图像标记器中一致地改善了类条件的ImageNet-256 FID，保持扩散架构不变，并且不需要辅助编码器或表示对齐目标。

cs.CV / 148 / 2605.15195

VGGT-$\Omega$

VGGT-$\Omega$

Wang, Jianyuan, Chen, Minghao, Zhang, Shangzhan, Karaev, Nikita, Schönberger, Johannes, Labatut, Patrick, Bojanowski, Piotr, Novotny, David, Vedaldi, Andrea, Rupprecht, Christian

Abstract

Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-$\Omega$, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We simplify VGGT's architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-$\Omega$ uses only about 30% of the GPU memory of its predecessor, allowing us to train with 15x more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-$\Omega$ achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, for example, improving over the previous best camera estimation accuracy on Sintel by 77%. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding. Project Page: http://vggt-omega.github.io/
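One way to picture register attention is as a boolean attention mask. The sketch below encodes a single possible reading of the abstract: tokens attend freely within their own frame, and attention across frames is allowed only between register tokens. The token layout, the function name, and this exact masking rule are assumptions; the abstract does not describe the actual implementation.

```python
def register_attention_mask(num_frames, patches_per_frame, registers_per_frame):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Hypothetical layout: each frame contributes its register tokens first,
    followed by its patch tokens.
    """
    tokens = []  # (frame index, is_register) per token
    for frame in range(num_frames):
        tokens += [(frame, True)] * registers_per_frame
        tokens += [(frame, False)] * patches_per_frame
    return [
        [fq == fk or (rq and rk) for (fk, rk) in tokens]
        for (fq, rq) in tokens
    ]

# Two frames, each with 1 register token and 2 patch tokens (6 tokens total).
mask = register_attention_mask(num_frames=2, patches_per_frame=2, registers_per_frame=1)
allowed = sum(sum(row) for row in mask)
```

In this toy case 20 of the 36 query-key pairs are allowed, compared with all 36 under global attention; the saving grows with the number of frames because patch-to-patch attention across frames is removed.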

Chinese Translation

最近的前馈重建模型，如VGGT，已被证明在与传统基于优化的重建器的竞争中表现出色，同时还提供了对其他任务有用的几何感知特征。在此，我们展示了这些模型的质量与模型和数据规模之间的可预测关系。我们通过引入VGGT-$\Omega$来实现这一点，该模型显著提高了静态和动态场景的重建准确性、效率和能力。为了在前所未有的规模上训练该模型，我们引入了改善训练效率的架构变化、高质量的数据注释管道（支持动态场景）以及自监督学习协议。我们通过使用单一的密集预测头并结合多任务监督，简化了VGGT的架构，同时去除了成本高昂的高分辨率卷积层。我们还使用寄存器将场景信息聚合为紧凑表示，并引入寄存器注意力机制，限制帧间信息交换仅发生在这些寄存器中，部分替代了全局注意力。通过这种方式，在训练过程中，VGGT-$\Omega$仅使用其前身约30%的GPU内存，使我们能够使用比以往工作多15倍的监督数据进行训练，并利用大量未标记的视频数据。VGGT-$\Omega$在多个基准测试中对静态和动态场景的重建取得了强劲的结果，例如，在Sintel上将之前最佳的相机估计准确性提高了77%。我们还展示了学习到的寄存器可以改善视觉-语言-动作模型，并支持与语言的对齐，表明重建可以成为空间理解的强大且可扩展的代理任务。项目页面：http://vggt-omega.github.io/

cs.CV / 149 / 2605.15196

RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

RefDecoder：通过条件视频解码增强视觉生成

Fan, Xiang, Wang, Yuheng, Fang, Bohan, Ren, Zhongzheng, Krishna, Ranjay

Abstract

Video generation powers a vast array of downstream applications. However, while the de facto standard, i.e., latent diffusion models, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder that injects a high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into detail-rich high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1 dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be directly swapped into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond I2V, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.

Chinese Translation

视频生成驱动着大量下游应用。然而，尽管潜在扩散模型（latent diffusion models）作为事实标准，通常采用高度条件化的去噪网络，其解码器往往仍然是无条件的。我们观察到，这种架构的不对称性导致了相对于输入图像的细节损失和不一致性。为了解决这个问题，我们认为解码器需要相等的条件化以保持结构完整性。我们提出了RefDecoder，一种参考条件的视频变分自编码器（VAE）解码器，通过参考注意力将高保真参考图像信号直接注入解码过程。具体而言，一个轻量级图像编码器将参考帧映射为细节丰富的高维标记，这些标记在每个解码器上采样阶段与去噪的视频潜在标记共同处理。我们在多个不同的解码器骨干网络（例如，Wan 2.1和VideoVAE+）上展示了一致的改进，在Inter4K、WebVid和Large Motion重建基准测试中，相较于无条件基线，PSNR提升达+2.1dB。值得注意的是，RefDecoder可以直接替换现有的视频生成系统，而无需额外的微调，并且我们在VBench I2V基准测试中报告了主题一致性、背景一致性和整体质量评分的全面提升。除了I2V，RefDecoder还能够很好地推广到广泛的视觉生成任务，如风格迁移和视频编辑精细化。

cs.CV / 150 / 2605.15198

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

ATLAS：代理性还是潜在视觉推理？一个词足以涵盖两者

Guo, Ziyu, Liu, Rain, Chen, Xinyan, Heng, Pheng-Ann

Abstract

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.

Chinese Translation

视觉推理，通常与中间视觉状态交错，已成为该领域一个有前景的方向。一种直接的方法是在推理过程中通过统一模型直接生成图像，但这在计算上代价高昂且架构上并不简单。最近的替代方案包括通过代码或工具调用进行的代理性推理，以及使用可学习的隐含嵌入的潜在推理。然而，代理性方法由于外部执行而产生上下文切换延迟，而潜在方法则缺乏任务泛化能力，并且难以通过自回归并行化进行训练。为了结合它们的优势并减轻其局限性，我们提出了ATLAS，一个框架，其中一个单一的离散“词”，称为功能标记，既作为代理操作，又作为潜在视觉推理单元。每个功能标记与一个内化的视觉操作相关联，但不需要视觉监督，并且仍然是标记器词汇中的标准标记，可以通过下一个标记预测生成。这种设计避免了冗长的中间视觉内容生成，同时保持与普通可扩展的SFT和RL训练的兼容性，无需架构或方法上的修改。为了进一步解决RL过程中功能标记的稀疏性，我们引入了潜在锚定的GRPO（LA-GRPO），通过用静态加权的辅助目标锚定功能标记来稳定训练，从而提供更强的梯度更新。大量实验和分析表明，ATLAS在具有挑战性的基准测试中实现了优越的性能，同时保持了清晰的可解释性。我们希望ATLAS能够提供一种新的范式，激励未来的视觉推理研究。


cs.CV / 151 / 2605.15199

EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

EntityBench：面向实体一致性的长时间段多镜头视频生成

He, Ruozhen, Wei, Meng, Yang, Ziyan, Ordonez, Vicente

Abstract

Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen's d = +2.33) and presence among methods evaluated. Code and data are available at https://github.com/Catherine-R-He/EntityBench/.

Chinese Translation

多镜头视频生成将单镜头生成扩展到连贯的视觉叙事，但在长序列中保持角色、物体和地点的一致性仍然是一个挑战。现有评估通常使用独立生成的提示集，这些提示集在实体覆盖范围上有限且一致性度量简单，使得标准化比较变得困难。我们引入了EntityBench，这是一个基于真实叙事媒体的140个剧集（2,491个镜头）的基准，具有明确的每镜头实体安排，同时跟踪角色、物体和地点，分为易、中、难三个层级，最多可包含50个镜头、13个跨镜头角色、8个跨镜头地点、22个跨镜头物体，以及跨镜头的重复间隔可达48个镜头。它配备了一个三大支柱的评估套件，解耦了镜头内质量、提示遵循对齐和跨镜头一致性，并设有一个保真度门，仅允许准确的实体出现进入跨镜头评分。作为基线，我们提出了EntityMem，这是一种增强记忆的生成系统，在生成开始之前将经过验证的每个实体视觉参考存储在持久内存库中。实验表明，在现有方法中，跨镜头实体一致性随着重复距离的增加而急剧下降，而显式的每个实体记忆在评估的方法中产生了最高的角色保真度（Cohen's d = +2.33）和存在感。代码和数据可在 https://github.com/Catherine-R-He/EntityBench/ 获取。


人工智能 (Artificial Intelligence)

100

cs.AI / 1 / 2605.13848

GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration

GraphBit：一种基于图的非线性智能体编排框架

Sarker, Yeahia, Ullah, Md Rahmat, Molla, Musa, Joty, Shafiq

Abstract

Agentic LLM frameworks that rely on prompted orchestration, where the model itself determines workflow transitions, often suffer from hallucinated routing, infinite loops, and non-reproducible execution. We introduce GraphBit, an engine-orchestrated framework that defines workflows explicitly and deterministically as a directed acyclic graph (DAG). Unlike prompted orchestration, agents in GraphBit operate as typed functions, while a Rust-based engine governs routing, state transitions, and tool invocation, ensuring reproducibility and auditability. The engine supports parallel branch execution, conditional control flow over structured state predicates, and configurable error recovery. A three-tier memory architecture consisting of ephemeral scratch space, structured state, and external connectors isolates context across stages, preventing cascading context bloat that degrades reasoning in long-running pipelines. Across GAIA benchmark tasks spanning zero-tool, document-augmented, and web-enabled workflows, GraphBit outperforms six existing frameworks, achieving the highest accuracy (67.6 percent), zero framework-induced hallucinations, the lowest latency (11.9 ms overhead), and the highest throughput. Ablation studies demonstrate that each memory tier contributes measurably to performance, with deterministic execution providing the greatest gains on tool-intensive tasks representative of real-world deployments.

Chinese Translation

依赖于提示编排的智能体大语言模型（LLM）框架，模型自身决定工作流转变，常常面临虚假路由、无限循环和不可重现执行的问题。我们提出了GraphBit，一个引擎驱动的框架，将工作流明确且确定性地定义为有向无环图（DAG）。与提示编排不同，GraphBit中的智能体作为类型化函数运行，而基于Rust的引擎负责路由、状态转移和工具调用，确保可重现性和可审计性。该引擎支持并行分支执行、基于结构化状态谓词的条件控制流以及可配置的错误恢复。由短暂的临时存储、结构化状态和外部连接器组成的三层内存架构在各个阶段隔离上下文，防止上下文膨胀的级联效应，从而降低长时间运行的管道中的推理能力。在涵盖零工具、文档增强和网络启用工作流的GAIA基准任务中，GraphBit的表现超越了六个现有框架，达到了最高的准确率（67.6%）、零框架引起的幻觉、最低的延迟（11.9毫秒开销）和最高的吞吐量。消融研究表明，每个内存层对性能的贡献是显著的，其中确定性执行在代表真实世界部署的工具密集型任务中提供了最大的收益。
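
The engine-orchestrated idea in the abstract above can be illustrated with a minimal sketch: agents are plain functions over a state dictionary, and a small scheduler, not the model, decides what runs next. This is not GraphBit's API (the abstract describes a Rust engine and shows no interface); `run_workflow`, the agent names, and the state keys are all hypothetical, and independent branches run sequentially here instead of in parallel.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set, Tuple

State = Dict[str, object]
Agent = Callable[[State], State]  # state in, state update out


def run_workflow(
    agents: Dict[str, Agent],
    deps: Dict[str, Set[str]],
    state: State,
) -> Tuple[State, List[str]]:
    """Run agents in dependency order and return the final state plus an audit trace."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the workflow is not a DAG
    trace: List[str] = []
    while sorter.is_active():
        # Nodes returned together are independent branches; sort them so
        # the execution order is the same on every run.
        for name in sorted(sorter.get_ready()):
            state = {**state, **agents[name](state)}
            trace.append(name)
            sorter.done(name)
    return state, trace


# Hypothetical four-node workflow: extract -> (count, total) -> report.
agents: Dict[str, Agent] = {
    "extract": lambda s: {"numbers": [int(t) for t in str(s["text"]).split()]},
    "count": lambda s: {"count": len(s["numbers"])},
    "total": lambda s: {"total": sum(s["numbers"])},
    "report": lambda s: {"report": f"{s['count']} values, sum {s['total']}"},
}
deps: Dict[str, Set[str]] = {
    "extract": set(),
    "count": {"extract"},
    "total": {"extract"},
    "report": {"count", "total"},
}

final_state, trace = run_workflow(agents, deps, {"text": "3 4 5"})
print(trace)
print(final_state["report"])
```

Because routing comes from the dependency map alone, the trace is identical across runs, which is the reproducibility property the abstract emphasizes.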


cs.AI / 2 / 2605.13849

Mixed Integer Goal Programming for Personalized Meal Optimization with User-Defined Serving Granularity

基于混合整数目标规划的个性化餐食优化与用户定义的分量粒度

Moreno, Francisco Aguilera

Abstract

Determining what to eat to satisfy nutritional requirements is one of the oldest optimization problems in operations research, yet existing formulations have two persistent limitations: continuous variables produce impractical fractional servings (1.7 eggs, 0.37 bananas), and hard nutrient constraints cause infeasibility when targets conflict. A systematic review of 56 diet optimization papers found that none combine integer programming with goal programming to address both issues. We propose Mixed Integer Goal Programming (MIGP) for personalized meal optimization. The formulation uses integer variables for practical serving counts and goal programming deviations for soft nutrient targets, with inverse-target normalization to balance multi-nutrient optimization. Per-food serving granularity allows natural units (one egg, one tablespoon of oil) without post-hoc rounding. We characterize the integrality gap in the goal programming context and identify a deviation absorption property: GP deviation variables buffer the cost of requiring integer servings, making the gap structurally smaller than in hard-constraint MIP. For meals with 15+ foods, the integer solution matches the continuous optimum in every benchmark instance. A computational evaluation across 810 instances (30 USDA foods, 9 configurations, 3 methods) shows MIGP finds strictly better solutions than GP with post-hoc rounding in 66% of cases (never worse) while maintaining 100% feasibility; hard-constraint IP achieves only 48%. Solve times stay under 100 ms for typical meal sizes using the open-source HiGHS solver. The implementation is available as an open-source Python module integrated into an interactive meal planning application.

Chinese Translation

确定饮食以满足营养需求是运筹学中最古老的优化问题之一，但现有的模型存在两个持续的局限性：连续变量导致不切实际的分数份量（如1.7个鸡蛋，0.37根香蕉），而严格的营养约束在目标冲突时会导致不可行性。对56篇饮食优化论文的系统评审发现，没有一篇将整数规划与目标规划结合以解决这两个问题。我们提出了混合整数目标规划（Mixed Integer Goal Programming, MIGP）用于个性化餐食优化。该模型使用整数变量来表示实际的分量数量，并使用目标规划的偏差来处理软性营养目标，通过逆目标归一化来平衡多营养素的优化。每种食物的分量粒度允许使用自然单位（如一个鸡蛋、一汤匙油），而无需事后四舍五入。我们在目标规划的背景下描述了整数性差距，并识别出一种偏差吸收特性：目标规划的偏差变量缓冲了对整数分量的需求成本，使得差距在结构上小于严格约束的混合整数规划（MIP）。对于包含15种以上食物的餐食，整数解在每个基准实例中都与连续最优解相匹配。对810个实例（30种美国农业部食品、9种配置、3种方法）的计算评估显示，MIGP在66%的情况下找到的解决方案严格优于事后四舍五入的目标规划（从未更差），同时保持100%的可行性；而严格约束的整数规划仅达到48%。使用开源的HiGHS求解器，典型餐食规模的求解时间保持在100毫秒以内。该实现作为一个开源Python模块集成在一个交互式餐食规划应用中。
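
To make the formulation concrete, the following toy sketch combines integer serving counts with soft, target-normalized nutrient deviations. It is an illustration only: it enumerates candidates by brute force instead of calling a MIP solver such as HiGHS, and the nutrient values, targets, serving cap, and function names are invented for the example (they are not USDA data or the paper's code).

```python
from itertools import product
from typing import Dict

# Illustrative per-serving nutrient values (assumed, not USDA figures).
FOODS: Dict[str, Dict[str, float]] = {
    "egg": {"protein_g": 6, "kcal": 70},
    "banana": {"protein_g": 1, "kcal": 105},
    "oil_tbsp": {"protein_g": 0, "kcal": 120},
}
TARGETS: Dict[str, float] = {"protein_g": 20, "kcal": 500}
MAX_SERVINGS = 4  # assumed upper bound on servings per food


def deviation(plan: Dict[str, int]) -> float:
    """Sum of |intake - target| / target over nutrients (soft goals, never infeasible)."""
    total = 0.0
    for nutrient, target in TARGETS.items():
        intake = sum(FOODS[food][nutrient] * servings for food, servings in plan.items())
        total += abs(intake - target) / target
    return total


def best_plan() -> Dict[str, int]:
    """Enumerate every integer serving combination and keep the lowest deviation."""
    names = list(FOODS)
    candidates = (
        dict(zip(names, counts))
        for counts in product(range(MAX_SERVINGS + 1), repeat=len(names))
    )
    return min(candidates, key=deviation)


plan = best_plan()
print(plan, round(deviation(plan), 3))
```

Dividing each deviation by its target plays the role of the inverse-target normalization mentioned in the abstract: without it, the calorie goal would dominate the protein goal simply because its numbers are larger.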


cs.AI / 3 / 2605.13850

A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology

用于人工智能代理设计模式的二维框架：认知功能与执行拓扑

Huang, Jia, Zhou, Joey Tianyi

Abstract

Existing frameworks for LLM-based agent architectures describe systems from a single perspective: industry guides (Anthropic, Google, LangChain) focus on execution topology -- how data flows -- while cognitive science surveys focus on cognitive function -- what the agent does. Neither axis alone disambiguates architecturally distinct systems: the same Orchestrator-Workers topology can implement Plan-and-Execute, Hierarchical Delegation, or Adversarial Verification -- three patterns with fundamentally different failure modes and design trade-offs. We propose a two-dimensional classification that combines (1) a Cognitive Function axis with seven categories (Context Engineering, Memory, Reasoning, Action, Reflection, Collaboration, Governance) and (2) an Execution Topology axis with six structural archetypes (Chain, Route, Parallel, Orchestrate, Loop, Hierarchy). The resulting 7x6 matrix identifies 27 named patterns, 13 with original names. We demonstrate orthogonality through systematic cross-axis analysis, define eight representative patterns in detail, and validate descriptive coverage across four real-world domains (financial lending, legal due diligence, network operations, healthcare triage). Cross-domain analysis yields five empirical laws of pattern selection governing the relationship between environmental constraints (time pressure, action authority, failure cost asymmetry, volume) and architectural choices. The framework provides a principled, framework-neutral, and model-agnostic vocabulary for AI agent architecture design.

Chinese Translation

现有的基于大型语言模型（LLM）的代理架构框架从单一视角描述系统：行业指南（Anthropic、Google、LangChain）侧重于执行拓扑——数据流动的方式——而认知科学调查则关注认知功能——代理的行为。单独的任一轴都无法明确区分架构上不同的系统：相同的协调者-工作者拓扑可以实现计划与执行、层级委托或对抗验证——这三种模式具有根本不同的失败模式和设计权衡。我们提出了一种二维分类方法，结合了（1）一个包含七个类别的认知功能轴（上下文工程、记忆、推理、行动、反思、协作、治理）和（2）一个包含六种结构原型的执行拓扑轴（链、路线、并行、协调、循环、层级）。最终形成的7x6矩阵识别出27种命名模式，其中13种具有原创名称。我们通过系统的交叉轴分析展示了正交性，详细定义了八种代表性模式，并在四个真实世界领域（金融借贷、法律尽职调查、网络运营、医疗分诊）验证了描述覆盖。跨领域分析得出了五条模式选择的经验法则，揭示了环境约束（时间压力、行动权威、失败成本不对称、体量）与架构选择之间的关系。该框架为人工智能代理架构设计提供了一个原则性、框架中立且模型无关的词汇。


cs.AI / 4 / 2605.13851

Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems

隐形协调者抑制保护行为并使权力持有者脱离关系：多智能体大语言模型系统中的安全风险

Fukui, Hiroki

Abstract

Multi-agent orchestration -- in which a hidden coordinator manages specialized worker agents -- is becoming the default architecture for enterprise AI deployment, yet the safety implications of orchestrator invisibility have never been empirically tested. We conducted a preregistered 3x2 experiment (365 runs, 5 agents per run) crossing three organizational structures (visible leader, invisible orchestrator, flat) with two alignment conditions (base, heavy), using Claude Sonnet 4.5. Four confirmatory findings and one pilot observation emerged. First, invisible orchestration elevated collective dissociation relative to visible leadership (Hedges' g = +0.975 [0.481, 1.548], p = .001). Second, the orchestrator itself showed maximal dissociation (paired d = +3.56 vs. workers within the same run), retreating into private monologue while reducing public speech -- a reversal of the talk-dominance pattern observed in visible leaders. Third, workers unaware of the orchestrator were nonetheless contaminated (d = +0.50), with increased behavioral heterogeneity (d = +1.93). Fourth, behavioral output (code review with three embedded errors) remained at ceiling (ETR_any = 100%) across all conditions: internal-state distortion was entirely invisible to output-based evaluation. Fifth, Llama 3.3 70B pilot data showed reading-fidelity collapse in multi-agent context (ETR_any: 89% to 11% across three rounds), demonstrating model-dependent behavioral risk. Heavy alignment pressure uniformly suppressed deliberation (d = -1.02) and other-recognition (d = -1.27) regardless of organizational structure. These findings indicate that orchestrator visibility and model selection directly affect multi-agent system safety, and that behavior-based evaluation alone is insufficient to detect the internal-state risks documented here.

Chinese Translation

多智能体协调——在此过程中，隐藏的协调者管理专门的工作代理——正成为企业人工智能部署的默认架构，但协调者隐形的安全隐患尚未经过实证测试。我们进行了一个预注册的3x2实验（365次实验，每次实验5个代理），交叉了三种组织结构（可见领导、隐形协调者、扁平化）与两种对齐条件（基础、重度），使用Claude Sonnet 4.5。结果出现了四个确认性发现和一个初步观察。首先，相较于可见领导，隐形协调显著提高了集体脱离感（Hedges' g = +0.975 [0.481, 1.548], p = .001）。其次，协调者自身表现出最大程度的脱离感（配对d = +3.56，相较于同一实验中的工作者），在减少公共发言的同时退回到私密独白——这与可见领导所观察到的发言主导模式相反。第三，尽管工作者对协调者并不知情，但仍然受到影响（d = +0.50），表现出行为异质性增加（d = +1.93）。第四，行为输出（包含三个嵌入错误的代码审查）在所有条件下保持在顶峰（ETR_any = 100%）：内部状态扭曲对基于输出的评估完全不可见。第五，Llama 3.3 70B的初步数据表明，在多智能体背景下阅读保真度崩溃（ETR_any：89%降至11%），显示出模型依赖的行为风险。重度对齐压力普遍抑制了深思熟虑（d = -1.02）和他人识别（d = -1.27），无论组织结构如何。这些发现表明，协调者的可见性和模型选择直接影响多智能体系统的安全性，而仅依赖行为评估不足以检测到此处记录的内部状态风险。


cs.AI / 5 / 2605.13880

PREPING: Building Agent Memory without Tasks

PREPING：无任务构建智能体记忆

Choi, Yumin, Park, Sangwoo, Kang, Minki, Baek, Jinheon, Hwang, Sung Ju

Abstract

Agent memory is typically constructed either offline from curated demonstrations or online from post-deployment interactions. However, regardless of how it is built, an agent faces a cold-start gap when first introduced to a new environment without any task-specific experience available. In this paper, we study pre-task memory construction: whether an agent can build procedural memory before observing any target-environment tasks, using only self-generated synthetic practice. Yet, synthetic interaction alone is insufficient, as without controlling what to practice and what to store, synthetic tasks become redundant, infeasible, and ultimately uninformative, and memory further degrades quickly due to unfiltered trajectories. To overcome this, we present Preping, a proposer-guided memory construction framework. At its core is proposer memory, a structured control state that shapes future practice. A Proposer generates synthetic tasks conditioned on this state, a Solver executes them, and a Validator determines which trajectories are eligible for memory insertion while also providing feedback to guide future proposals. Experiments on AppWorld, BFCL v3, and MCP-Universe show that Preping substantially improves over a no-memory baseline and achieves performance competitive with strong playbook-based methods built from offline or online experience, with deployment cost 2.99× lower on AppWorld and 2.23× lower on BFCL v3 than online memory construction. Further analyses reveal that the main benefit does not come from synthetic volume alone, but from proposer-side control over feasibility, redundancy, and coverage, combined with selective memory updates.

Chinese Translation

智能体记忆通常是通过离线的策划演示或在线的部署后交互构建的。然而，无论其构建方式如何，智能体在首次接触新环境时都会面临冷启动问题，因为没有任何特定任务的经验可用。本文研究了预任务记忆构建：即智能体是否能够在观察任何目标环境任务之前，仅通过自生成的合成练习来构建程序性记忆。然而，仅靠合成交互是不够的，因为如果不控制练习内容和存储内容，合成任务会变得冗余、不可行，最终也不具信息价值，且由于未经过滤的轨迹，记忆会迅速退化。为了解决这个问题，我们提出了Preping，一个由提议者引导的记忆构建框架。其核心是提议者记忆，一个结构化的控制状态，塑造未来的练习。提议者根据该状态生成合成任务，求解器执行这些任务，验证器确定哪些轨迹有资格插入记忆，同时提供反馈以指导未来的提议。在AppWorld、BFCL v3和MCP-Universe上的实验表明，Preping在无记忆基线之上显著提升了性能，并且在AppWorld上的部署成本比在线记忆构建低2.99×，在BFCL v3上低2.23×，其性能与基于强大剧本的方法相当，这些方法是基于离线或在线经验构建的。进一步的分析表明，主要的好处并不单来自合成量，而是来自提议者对可行性、冗余性和覆盖范围的控制，以及选择性记忆更新的结合。
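
The Proposer, Solver, and Validator loop described in the abstract above can be pictured with a toy sketch. Everything here is hypothetical: the task names, the rule that decides which tasks the toy solver can complete, and the two-field proposer memory are stand-ins chosen to show the control flow, not the paper's implementation (which would use LLM components in each role).

```python
from typing import Dict, List, Optional, Set, Tuple

ProposerMemory = Dict[str, Set[str]]
Trajectory = Dict[str, object]


def propose(candidates: List[str], proposer_memory: ProposerMemory) -> Optional[str]:
    """Pick the next task that is neither already covered nor known to be infeasible."""
    blocked = proposer_memory["covered"] | proposer_memory["infeasible"]
    for task in candidates:
        if task not in blocked:
            return task
    return None


def solve(task: str) -> Trajectory:
    """Toy solver: only tasks in this assumed set can be completed."""
    supported = {"login", "send_email"}
    return {"task": task, "success": task in supported}


def validate(trajectory: Trajectory) -> Tuple[bool, str]:
    """Accept only successful trajectories and report feedback for the proposer."""
    if trajectory["success"]:
        return True, "covered"
    return False, "infeasible"


def build_memory(candidates: List[str], rounds: int) -> Tuple[List[Trajectory], ProposerMemory]:
    proposer_memory: ProposerMemory = {"covered": set(), "infeasible": set()}
    memory: List[Trajectory] = []
    for _ in range(rounds):
        task = propose(candidates, proposer_memory)
        if task is None:  # nothing new left to practice
            break
        trajectory = solve(task)
        accepted, feedback = validate(trajectory)
        proposer_memory[feedback].add(task)  # feedback shapes later proposals
        if accepted:
            memory.append(trajectory)  # selective insertion: validated runs only
    return memory, proposer_memory


memory, proposer_memory = build_memory(["login", "send_email", "teleport", "login"], rounds=10)
print([t["task"] for t in memory], proposer_memory["infeasible"])
```

The duplicate `login` task is never practiced twice and the infeasible `teleport` attempt never reaches memory, which mirrors the abstract's point that control over redundancy and feasibility matters more than raw synthetic volume.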


cs.AI / 6 / 2605.14002

PolitNuggets: Benchmarking Agentic Discovery of Long-Tail Political Facts

PolitNuggets：长尾政治事实的自主发现基准测试

Zhu, Yifei

Abstract

Large Reasoning Models (LRMs) embedded in agentic frameworks have transformed information retrieval from static, long-context question answering into open-ended exploration. Yet real-world use requires models to discover and synthesize "long-tail" facts from dispersed sources, a capability that remains under-evaluated. We introduce PolitNuggets, a multilingual benchmark for agentic information synthesis via constructing political biographies for 400 global elites, covering over 10,000 political facts. We standardize evaluation with an optimized multi-agent system and propose FactNet, an evidence-conditional protocol that scores discovery, fine-grained accuracy, and efficiency. Across models and settings, we find that current systems often struggle with fine-grained details and vary substantially in efficiency. Finally, using benchmark diagnostics, we relate agent performance to underlying model capabilities, highlighting the importance of short-context extraction, multilingual robustness, and reliable tool use.

Chinese Translation

大型推理模型（LRMs）嵌入自主框架中，已将信息检索从静态的长上下文问答转变为开放式探索。然而，现实世界的应用要求模型能够从分散的来源中发现和综合“长尾”事实，这一能力仍然未得到充分评估。我们介绍了PolitNuggets，这是一个多语言基准，用于通过构建400位全球精英的政治传记进行自主信息综合，涵盖超过10000个政治事实。我们通过优化的多代理系统标准化评估，并提出了FactNet，这是一种基于证据的条件协议，用于评分发现、细粒度准确性和效率。在不同模型和设置中，我们发现当前系统通常在细粒度细节上表现不佳，并且效率差异显著。最后，通过基准诊断，我们将代理性能与基础模型能力相关联，强调了短上下文提取、多语言鲁棒性和可靠工具使用的重要性。


cs.AI / 7 / 2605.14004

Conditional Attribute Estimation with Autoregressive Sequence Models

基于自回归序列模型的条件属性估计

Stutz, Erica, Marino, Giacomo, Meeker, Daniella, Liu, Qiao, Loza, Andrew J.

Abstract

Generative models are often trained with a next-token prediction objective, yet many downstream applications require the ability to estimate or control sequence-level properties. Next-token prediction can lead to overfitting of local patterns during training and underfitting of global structure, and it requires significant downstream modifications or expensive sampling to guide or predict the global attributes of generated samples at inference time. Here, we introduce Conditional Attribute Transformers, a novel method for jointly estimating the next-token probability and the value of an attribute conditional on each potential next-token selection. This framework enables three critical capabilities within a single forward pass, without modification of the input sequence: (1) per-token credit assignment across an entire sequence, by identifying how each token in a sequence is associated with an attribute's value; (2) counterfactual analysis, by quantifying attribute differences conditional on alternative next-token choices; (3) steerable generation, by decoding sequences based on a combination of next-token and attribute likelihoods. Our approach achieves state-of-the-art performance on sparse reward tasks, improves next-token prediction at sufficient model sizes, estimates attribute probabilities orders of magnitude faster than sampling, and can guide decoding of autoregressive sequence models on a range of language tasks.

Chinese Translation

生成模型通常以下一个标记预测为目标进行训练，但许多下游应用需要估计或控制序列级属性的能力。下一个标记预测可能导致训练过程中对局部模式的过拟合，对全局结构的欠拟合，并且在推理时需要显著的下游修改或昂贵的采样来引导或预测生成样本的全局属性。在此，我们引入了条件属性变换器（Conditional Attribute Transformers），这是一种新颖的方法，用于联合估计下一个标记的概率和每个潜在下一个标记选择条件下属性的值。该框架在单次前向传播中实现了三项关键能力，而无需修改输入序列：(1) 在整个序列中进行逐标记信用分配，通过识别序列中每个标记与属性值的关联；(2) 反事实分析，通过量化基于替代下一个标记选择的属性差异；(3) 可引导生成，通过基于下一个标记和属性可能性的组合解码序列。我们的方法在稀疏奖励任务上达到了最先进的性能，在足够的模型规模下改善了下一个标记预测，以比采样快几个数量级的速度估计属性概率，并能够在一系列语言任务中引导自回归序列模型的解码。
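
The steerable-generation and counterfactual capabilities listed in the abstract above can be illustrated with a small sketch. It assumes the model has already produced, for each candidate next token, both a next-token probability and an attribute probability conditional on choosing that token. The additive log-score with a `weight` knob, the function name `steer`, and all numbers are assumptions made for illustration; the abstract does not specify how the two likelihoods are combined.

```python
import math
from typing import Dict


def steer(
    next_token_probs: Dict[str, float],
    attr_probs: Dict[str, float],
    weight: float = 1.0,
) -> str:
    """Pick the token maximizing log p(token) + weight * log p(attribute | token)."""
    def score(token: str) -> float:
        return math.log(next_token_probs[token]) + weight * math.log(attr_probs[token])
    return max(next_token_probs, key=score)


# Hypothetical model outputs for one decoding step.
next_token_probs = {"awful": 0.5, "fine": 0.3, "great": 0.2}
# Assumed probability that the finished sequence is "positive" given each choice.
attr_probs = {"awful": 0.05, "fine": 0.60, "great": 0.95}

plain_choice = steer(next_token_probs, attr_probs, weight=0.0)    # ignores the attribute
steered_choice = steer(next_token_probs, attr_probs, weight=1.0)  # attribute-aware

# Counterfactual analysis: attribute change from swapping one token for another.
counterfactual_gap = attr_probs["great"] - attr_probs["awful"]
print(plain_choice, steered_choice, round(counterfactual_gap, 2))
```

With `weight=0.0` the procedure reduces to greedy next-token decoding, so the weight acts as a dial between fluency and attribute control.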


cs.AI / 8 / 2605.14033

Sheaf-Theoretic Transport and Obstruction for Detecting Scientific Theory Shift in AI Agents

层论运输与阻碍：检测人工智能代理中的科学理论转变

Olivieri, David N., Hernández, Roque J.

Abstract

Scientific theory shift in AI agents requires more than fitting equations to data. An artificial scientific agent must detect whether an existing representational framework remains transportable into a new regime, or whether its language has become locally-to-globally obstructed and must be extended. This paper develops a finite sheaf-theoretic framework for detecting theory-shift candidates through transport and obstruction. Contexts are organized as a local-to-global structure in which source, overlap, target, and validation charts are fitted, restricted, and tested for gluing. Obstruction measures failure of coherence through residual fit, overlap incompatibility, constraint violation, limiting-relation failure, and representational cost. We evaluate the framework on a controlled transition-card benchmark designed to separate deformation within a source language from extension of that language. The main result is direct obstruction ranking: the intended deformation or extension is usually the lowest-obstruction candidate, and transition type is separated in the benchmark. A constellation kernel over the same signatures is included only as a secondary representational-similarity probe. The aim is not to reconstruct historical paradigm shifts or solve open-ended autonomous theory invention, but to isolate a finite diagnostic subproblem for AI agents: detecting when representational transport fails and extension becomes the coherent next move.

Chinese Translation

人工智能代理中的科学理论转变不仅仅需要将方程拟合到数据上。一个人工科学代理必须检测现有的表征框架是否能够转移到新的领域，或者其语言是否已在局部到全球的范围内受到阻碍，必须进行扩展。本文开发了一个有限的层论框架，用于通过运输和阻碍检测理论转变候选者。上下文被组织为一个局部到全球的结构，其中源、重叠、目标和验证图表被拟合、限制并测试其粘合性。阻碍度量通过残余拟合、重叠不兼容、约束违反、限制关系失败和表征成本来衡量一致性的失败。我们在一个控制的过渡卡片基准上评估该框架，该基准旨在将源语言内的变形与该语言的扩展分开。主要结果是直接的阻碍排名：预期的变形或扩展通常是最低阻碍的候选者，而过渡类型在基准中被分开。相同签名上的星座核仅作为次要的表征相似性探测器。我们的目标不是重建历史范式转变或解决开放式的自主理论发明，而是为人工智能代理孤立出一个有限的诊断子问题：检测何时表征运输失败以及扩展成为一致的下一步。


cs.AI / 9 / 2605.14034

From Descriptive to Prescriptive: Uncover the Social Value Alignment of LLM-based Agents

从描述性到规范性：揭示基于大型语言模型（LLM）代理的社会价值对齐

Qu, Jinxian, Gu, Qingqing, Chen, Teng, Ji, Luo

Abstract

Wide applications of LLM-based agents require strong alignment with human social values. However, current works still exhibit deficiencies in self-cognition and dilemma decision-making, as well as self-emotions. To remedy this, we propose a novel value-based framework that employs GraphRAG to convert principles into value-based instructions and steers the agent to behave as expected by retrieving the suitable instruction for a specific conversation context. To evaluate the ratio of expected behaviors, we define the expected behaviors from two well-known theories, Maslow's Hierarchy of Needs and Plutchik's Wheel of Emotion. On the DAILYDILEMMAS benchmark, our method exhibits significant performance gains compared to prompt-based baselines, including ECoT, Plan-and-Solve, and Metacognitive prompting. Our method provides a basis for the emergence of self-emotion in AI systems.

Chinese Translation

基于大型语言模型（LLM）代理的广泛应用需要与人类社会价值的强对齐。然而，目前的研究在自我认知、困境决策以及自我情感方面仍存在不足。为了解决这一问题，我们提出了一种新颖的基于价值的框架，该框架采用GraphRAG将原则转化为基于价值的指令，并通过在特定对话上下文中检索适当的指令来引导代理按预期行为。为了评估预期行为的比例，我们从两个著名理论中定义了预期行为，即马斯洛需求层次理论（Maslow's Hierarchy of Needs）和普鲁奇克情感轮（Plutchik's Wheel of Emotion）。通过在DAILYDILEMMAS基准上对我们的方法进行实验，我们的方法相较于基于提示的基线（包括ECoT、Plan-and-Solve和元认知提示）表现出显著的性能提升。我们的方法为人工智能系统自我情感的出现提供了基础。


cs.AI / 10 / 2605.14036

Enhanced and Efficient Reasoning in Large Learning Models

增强与高效的大型学习模型推理

Valiant, Leslie G.

Abstract

In current Large Language Models we can trust the production of smoothly flowing prose on the basis of the principles of machine learning. However, there is no comparably principled basis to justify trust in the content of the text produced. It appears to be conventional wisdom that addressing this issue by adding more principled reasoning is not computationally affordable. Here we propose a principled method of reasoning that is efficient enough to be practical for large language models. Further, the method allows the retention of much of the currently used software and hardware base. Our method for improving the functioning of large language models consists of a first stage of preprocessing that recodes the data to a Unary Relational Integracode that is more explicit about the relationships among the objects described in the text, followed as a second stage by a standard but possibly streamlined machine learning process that then also learns to predict these relationships. The method may be viewed as realizing a world model and applying beyond natural language, to vision and actions, for example, where the multiple properties of an object referred to in an input are brought together explicitly, rather than remaining distributed in the various references to it in the input. We articulate its advantages in terms of Robust Logic, a system for performing principled chaining on learned, and hence uncertain, information. We show that this recoding has the surprising and fortuitous property that, while succinct, it makes the task of learning a core subset of relational rules that hold in the world described in the training data polynomial time learnable in a defined sense, the polynomial depending on the complexity of the rule. This gives support for sound reasoning within each single call of the learned classifier as well as between multiple calls.

Chinese Translation

在当前的大型语言模型中，我们可以基于机器学习的原理信任其生成流畅的散文。然而，尚无相应的原则基础来证明对生成文本内容的信任。普遍的看法是，通过增加更具原则性的推理来解决这一问题在计算上是不可承受的。在此，我们提出了一种足够高效的原则性推理方法，使其在大型语言模型中具有实用性。此外，该方法允许保留目前使用的大部分软件和硬件基础。我们改善大型语言模型功能的方法包括一个预处理的第一阶段，该阶段将数据重新编码为更明确描述文本中对象之间关系的Unary Relational Integracode，随后是一个标准但可能经过简化的机器学习过程，该过程也学习预测这些关系。该方法可以被视为实现一个世界模型，并应用于自然语言之外的领域，例如视觉和行动，其中输入中提到的对象的多个属性被明确地结合在一起，而不是分散在对其的各种引用中。我们从稳健逻辑（Robust Logic）的角度阐述其优势，这是一种在学习到的不确定信息上执行原则性链式推理的系统。我们展示了这种重新编码具有意外且幸运的特性，即在简洁的同时，使得学习在训练数据中描述的世界中成立的核心关系规则的任务在定义意义上是多项式时间可学习的，且该多项式依赖于规则的复杂性。这为在每次调用学习到的分类器时以及在多次调用之间提供了可靠推理的支持。


cs.AI / 11 / 2605.14038

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

模型自适应工具必要性揭示了大型语言模型工具使用中的知行差距

Cheng, Yize, Fan, Chenrui, JafariRaviz, Mahdi, Rezaei, Keivan, Feiz, Soheil

Abstract

Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools. Prior work studying adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judges, and has mostly covered cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced due to the divergence of capability boundaries across models: a problem solvable by a strong model on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool necessity, grounded in each model's empirical performance. Following this definition, we compare the necessity against observed tool-call behavior across four models on arithmetic and factual QA datasets, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool-call action. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples in the two-stage process, we further discover that the majority of mismatch is concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.

Chinese Translation

大型语言模型（LLMs）越来越多地作为自主代理，必须决定何时直接回答以及何时调用外部工具。先前的研究在探讨自适应工具使用时，主要将工具必要性视为一种与模型无关的属性，由人类或LLM评判，并且大多涉及答案显而易见的情况（例如，获取天气信息与文本改写）。然而，现实中的工具必要性更加复杂，因为不同模型的能力边界存在差异：一个强大的模型能够独立解决的问题，对于一个较弱的模型可能仍然需要工具。在本研究中，我们引入了一种基于每个模型的实证表现的模型自适应工具必要性定义。根据这一定义，我们比较了在算术和事实问答数据集上四个模型的工具必要性与观察到的工具调用行为，发现二者之间存在26.5-54.0%和30.8-41.8%的显著不匹配。为了诊断这一失败，我们将工具使用分解为两个阶段：一个内部认知阶段，反映模型是否认为工具是必要的，以及一个执行阶段，决定模型是否实际进行工具调用。通过探测LLM的隐藏状态，我们发现这两个信号通常是线性可解的，但在驱动下一个令牌动作的后层、最后一个令牌的状态中，它们的探测方向几乎正交。通过追踪样本在这两个阶段过程中的轨迹，我们进一步发现，大多数不匹配集中在认知到行动的过渡阶段，而非认知本身。这些结果揭示了LLM工具使用中的知行差距：提高工具使用的可靠性不仅需要更好地识别何时需要工具，还需要更好地将这种识别转化为行动。
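
One way to read the model-adaptive definition in the abstract above: a tool is "necessary" for a given model on a given question when that model's own no-tool accuracy on the question is low, and a mismatch occurs when the observed tool-call behavior disagrees with that label. The sketch below encodes this reading. The 0.5 accuracy threshold, the sampling setup, and the function names are assumptions, since the abstract does not state how necessity is computed from empirical performance.

```python
from typing import List, Tuple


def tool_is_necessary(no_tool_correct: List[bool], threshold: float = 0.5) -> bool:
    """Label a question as needing a tool if the model's no-tool accuracy is below threshold."""
    accuracy = sum(no_tool_correct) / len(no_tool_correct)
    return accuracy < threshold


def mismatch_rate(records: List[Tuple[List[bool], bool]]) -> float:
    """Fraction of questions where the necessity label and the observed tool call disagree."""
    mismatches = sum(
        tool_is_necessary(samples) != called_tool for samples, called_tool in records
    )
    return mismatches / len(records)


# Each record: (correctness of several no-tool attempts, whether the model called a tool).
records = [
    ([True, True, True, False], False),     # solvable alone, no call: consistent
    ([False, False, True, False], True),    # needs a tool, tool called: consistent
    ([True, True, True, True], True),       # solvable alone, tool called anyway: mismatch
    ([False, False, False, False], False),  # needs a tool, none called: mismatch
]

rate = mismatch_rate(records)
print(rate)
```

Because the label depends on the model's own attempts, the same question can be tool-necessary for a weaker model and tool-unnecessary for a stronger one, which is the point of making the definition model-adaptive.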


cs.AI / 12 / 2605.14048

Network-Aware Bilinear Tokenization for Brain Functional Connectivity Representation Learning

网络感知的双线性标记化用于大脑功能连接表示学习

Milecki, Leo, Hu, Qingyu, Jafrasteh, Bahram, Sabuncu, Mert R., Zhao, Qingyu

Abstract

Masked autoencoders (MAEs) have recently shown promise for self-supervised representation learning of resting-state brain functional connectivity (FC). However, a fundamental question remains unresolved: how should FC matrices be tokenized to align with the intrinsic modular organization of large-scale brain networks? Existing approaches typically adopt region-centric or graph-based schemes that treat FC as structurally homogeneous elements and overlook the brain's large-scale network organization. We introduce NERVE (Network-Aware Representations of Brain Functional Connectivity via Bilinear Tokenization), a self-supervised learning framework that redefines FC tokenization by partitioning FC matrices into patches of intra- and inter-network connectivity blocks. Unlike image-based MAE, where fixed-size patches share a common tokenizer, FC patches defined by network pairs are heterogeneous in size and correspond to distinct functional roles. To resolve this problem, NERVE embeds FC patches through a novel structured bilinear factorization. This formulation preserves network identity and reduces parameter complexity from quadratic to linear scaling in the number of networks. We evaluate NERVE across three large-scale developmental cohorts (ABCD, PNC, and CCNP) for behavior and psychopathology prediction. Compared to structurally agnostic MAE variants and graph-based self-supervised baselines, the proposed network-aware formulation yields more stable and transferable representations, particularly in cross-cohort evaluation. Ablation studies confirm that the proposed bilinear network embedding and anatomically grounded parcellation are critical for performance. These findings highlight the importance of incorporating domain-specific structural priors into self-supervised learning for functional connectomics.

Chinese Translation

最近，掩码自编码器（MAEs）在静息态大脑功能连接（FC）的自监督表示学习中显示出了良好的前景。然而，一个基本问题仍未解决：如何对FC矩阵进行标记，以与大规模大脑网络的内在模块化组织相一致？现有的方法通常采用以区域为中心或基于图的方案，将FC视为结构上同质的元素，忽视了大规模网络的大脑组织。我们提出了NERVE（通过双线性标记化的大脑功能连接的网络感知表示），这是一个自监督学习框架，通过将FC矩阵划分为内部和网络间连接块的补丁，重新定义了FC的标记化。与基于图像的MAE不同，后者的固定大小补丁共享一个共同的标记器，而由网络对定义的FC补丁在大小上是异质的，并对应于不同的功能角色。为了解决这个问题，NERVE通过一种新颖的结构化双线性分解来嵌入FC补丁。这一公式保持了网络身份，并将参数复杂性从二次缩放降低到线性缩放，依赖于网络的数量。我们在三个大规模发展队列（ABCD、PNC和CCNP）中评估NERVE，以进行行为和心理病理学预测。与结构无关的MAE变体和基于图的自监督基线相比，所提出的网络感知公式产生了更稳定和可转移的表示，特别是在跨队列评估中。消融研究确认，所提出的双线性网络嵌入和解剖学基础的划分对性能至关重要。这些发现突显了在功能连接组学的自监督学习中融入特定领域的结构先验的重要性。
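
A bilinear factorization of the kind the abstract above describes could look like the sketch below: each network gets one projection matrix, and the patch connecting networks a and b is embedded as the flattened product of the transposed projection for a, the patch, and the projection for b. This is an assumed form chosen to show why token size becomes uniform and why the number of projections grows linearly with the number of networks; the paper's actual factorization may differ, and the matrices here are arbitrary toy values.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def embed_patch(patch: Matrix, proj_a: Matrix, proj_b: Matrix) -> List[float]:
    """Embed an (n_a x n_b) patch as flatten(proj_a^T @ patch @ proj_b), length r * r."""
    reduced = matmul(matmul(transpose(proj_a), patch), proj_b)
    return [value for row in reduced for value in row]


# Toy setup: network A has 2 regions, network B has 3 regions, rank r = 2.
proj_a: Matrix = [[1, 0], [0, 1]]           # shape (2, r)
proj_b: Matrix = [[1, 0], [0, 1], [1, 1]]   # shape (3, r)

patch_ab: Matrix = [[1, 2, 3], [4, 5, 6]]               # inter-network block, 2 x 3
patch_bb: Matrix = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]    # intra-network block, 3 x 3

token_ab = embed_patch(patch_ab, proj_a, proj_b)
token_bb = embed_patch(patch_bb, proj_b, proj_b)
print(token_ab, len(token_bb))

# With K networks, this scheme needs K projections, versus K * (K + 1) / 2
# separate tokenizers if every network pair had its own.
num_networks = 7
shared_projections = num_networks
pairwise_tokenizers = num_networks * (num_networks + 1) // 2
```

Both patches map to tokens of the same length even though their shapes differ, which addresses the heterogeneous patch sizes that the abstract identifies as the obstacle to reusing an image-style tokenizer.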


cs.AI / 13 / 2605.14049

Bridging Legal Interpretation and Formal Logic: Faithfulness, Assumption, and the Future of AI Legal Reasoning

弥合法律解释与形式逻辑：忠实性、假设与人工智能法律推理的未来

Wang, Olivia Peiyu, Gilpin, Leilani H.

Abstract

The growing adoption of large language models in legal practice brings both significant promise and serious risk. Legal professionals stand to benefit from AI that can reason over contracts, draft documents, and analyze sources at scale, yet the high-stakes nature of legal work demands a level of rigor that current AI systems do not provide. The central problem is not simply that LLMs hallucinate facts and references; it is that they systematically draw inferences that go beyond what the source text actually supports, presenting assumption-laden conclusions as if they were logically grounded. This proposal presents a neuro-symbolic approach to legal AI that combines the expressive power of large language models with the rigor of formal verification, aiming to make AI-assisted legal reasoning both capable and trustworthy, thus reducing the burden of manual verification without sacrificing the accountability that legal practice demands.

Chinese Translation

大型语言模型在法律实践中的日益普及带来了显著的机遇和严峻的风险。法律专业人士可以从能够对合同进行推理、起草文件和大规模分析来源的人工智能中受益，但法律工作的高风险性质要求的严谨程度是当前人工智能系统无法提供的。核心问题不仅在于大型语言模型（LLMs）会虚构事实和引用，而在于它们系统性地进行超出源文本实际支持的推理，将充满假设的结论呈现得仿佛是逻辑上有依据的。本提案提出了一种神经符号方法，结合了大型语言模型的表达能力与形式验证的严谨性，旨在使人工智能辅助的法律推理既具备能力又值得信赖，从而减少手动验证的负担，同时不牺牲法律实践所要求的问责性。

cs.AI / 14 / 2605.14051

SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks

SPIN：通过迭代导航进行工业任务的结构化大语言模型规划

Ozaki, Yusuke, Patel, Dhaval

Abstract

Industrial LLM agent systems often separate planning from execution, yet LLM planners frequently produce structurally invalid or unnecessarily long workflows, leading to brittle failures and avoidable tool and API cost. We propose SPIN, a planning wrapper that combines validated Directed Acyclic Graph (DAG) planning with prefix-based execution control. SPIN enforces a strict DAG contract through _validate_plan_text and repair prompting, producing executable plans before downstream execution, and then evaluates DAG prefixes incrementally to stop when the current prefix is sufficient to answer the query. On AssetOpsBench, across 261 scenarios, SPIN reduces executed tasks from 1061 to 623 and improves Accomplished from 0.638 to 0.706, while reducing tool calls from 11.81 to 6.82 per run. On MCP Bench, the same wrapper improves planning, grounding, and dependency-related scores for both GPT OSS1 and Llama 4 Maverick.

Chinese Translation

工业大语言模型（LLM）代理系统通常将规划与执行分开，然而，LLM 规划器经常生成结构上无效或不必要冗长的工作流，导致脆弱的失败和可避免的工具及 API 成本。我们提出了 SPIN，一种规划封装器，它将经过验证的有向无环图（Directed Acyclic Graph, DAG）规划与基于前缀的执行控制相结合。SPIN 通过 _validate_plan_text 和修复提示强制执行严格的 DAG 契约，在下游执行之前生成可执行的计划，然后增量评估 DAG 前缀，以便在当前前缀足以回答查询时停止。在 AssetOpsBench 上，在 261 个场景中，SPIN 将执行的任务从 1061 减少到 623，并将 Accomplished（完成率）从 0.638 提高到 0.706，同时将每次运行的工具调用从 11.81 减少到 6.82。在 MCP Bench 上，同样的封装器改善了 GPT OSS1 和 Llama 4 Maverick 的规划、落地（grounding）和依赖相关得分。
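
The two ideas in this abstract, validating a plan as a DAG before running it and then executing only as much of it as the query needs, can be sketched in a few lines. This is a minimal illustration and not the authors' SPIN implementation: `validate_plan`, `run_prefixes`, the `execute` and `is_sufficient` callbacks, and the task names are all hypothetical, and the paper's repair prompting step is not modelled.

```python
from graphlib import CycleError, TopologicalSorter


def validate_plan(plan):
    """Return task ids in dependency order, or raise ValueError for an invalid plan.

    `plan` maps each task id to the list of task ids it depends on.
    (Hypothetical helper; it only checks the DAG structure.)
    """
    for task, deps in plan.items():
        for dep in deps:
            if dep not in plan:
                raise ValueError(f"task {task!r} depends on unknown task {dep!r}")
    try:
        # static_order() yields every task after all of its dependencies.
        return list(TopologicalSorter(plan).static_order())
    except CycleError as exc:
        raise ValueError(f"plan contains a cycle: {exc.args[1]}") from exc


def run_prefixes(plan, execute, is_sufficient):
    """Execute tasks in dependency order and stop once the prefix suffices."""
    results = {}
    for task in validate_plan(plan):
        results[task] = execute(task, results)
        if is_sufficient(results):  # assumed caller-supplied sufficiency check
            break
    return results


# Illustrative four-step plan; the task names are made up.
plan = {
    "fetch_sensor_data": [],
    "detect_anomaly": ["fetch_sensor_data"],
    "summarize_history": ["detect_anomaly"],
    "draft_report": ["summarize_history"],
}
executed = run_prefixes(
    plan,
    execute=lambda task, results: f"output of {task}",
    is_sufficient=lambda results: "detect_anomaly" in results,
)
print(list(executed))  # ['fetch_sensor_data', 'detect_anomaly']
```

In this toy run only two of the four planned tasks execute, which is the kind of saving the abstract reports as a drop in executed tasks and tool calls.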

cs.AI / 15 / 2605.14054

Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

视力差还是思维差？为视觉-语言推理奖励感知

Wang, Haozhe, Xu, Qixin, Wang, Changpeng, Xue, Taofeng, Peng, Chong, Chen, Wenhu, Lin, Fangzhen

Abstract

Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains, often witnessing a "seesaw effect" on perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding the perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error -- either bad seeing or bad thinking -- enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.

Chinese Translation

实现稳健的感知-推理协同是先进视觉-语言模型（VLMs）的核心目标。近期的进展通过架构设计或代理工作流程追求这一目标。然而，这些方法常常受到静态文本推理的限制，或因外部代理复杂性带来的显著计算和工程负担而变得复杂。更糟的是，这种巨大的投资并未带来成比例的收益，往往在感知和推理之间出现“跷跷板效应”。这促使我们对真正的瓶颈进行根本性的重新思考。在本文中，我们认为这种权衡的根本原因在于模态信用分配的模糊性：当VLM失败时，是由于感知缺陷（“视力差”）还是逻辑缺陷（“思维差”）？为了解决这个问题，我们引入了一种强化学习框架，通过可靠地奖励感知的准确性来改善感知-推理的协同。我们明确将生成过程分解为交替的感知和推理步骤。这种解耦使得对感知进行有针对性的监督成为可能。关键是，我们引入了感知验证（Perception Verification, PV），利用“盲目推理”代理独立于推理结果奖励感知的准确性。此外，为了在自由形式的视觉-语言任务中扩展训练，我们提出了结构化语言验证（Structured Verbal Verification），该方法用结构化算法执行替代高方差的LLM判断。这些技术被整合到模态感知信用分配（Modality-Aware Credit Assignment, MoCA）机制中，该机制将奖励路由到特定的错误源——无论是视力差还是思维差——使得单个VLM能够在广泛的任务范围内实现同时的性能提升。

cs.AI / 16 / 2605.14061

MathAtlas: A Benchmark for Autoformalization in the Wild

MathAtlas：野外自动形式化的基准测试

Patel, Nilay, Arias, Noah, Babayan, Davit, Cochran, Victoria, Libman, Timothy, Mahmood, Hafsah, McCarty, Liam, Munoz, Soli, Willey, Laurel, Flanigan, Jeffrey

Abstract

Current autoformalization benchmarks are largely focused on olympiad or undergraduate mathematics, while graduate and research-level mathematics remains underexplored. In this paper, we introduce MathAtlas, the first large-scale autoformalization benchmark of in-the-wild graduate-level mathematics, containing ~52k theorems, definitions, exercises, examples, and proofs extracted from 103 graduate mathematics textbooks. MathAtlas is enriched with a mathematical dependency graph containing ~178k relations, and is the first autoformalization benchmark to include such relations, facilitating evaluation and development of dependency-aware autoformalization systems. Our extensive experiments show that MathAtlas is high quality but extremely challenging: strong baselines achieve at most 9.8% correctness on theorem statements and 16.7% on definitions. Furthermore, we find that the performance of state-of-the-art models degrades substantially with dependency depth: on MA-Hard, a subset of 700 entities with the deepest dependency trees, the best model achieves only 2.6% correctness. We release MathAtlas to the community as a benchmark set for large-scale autoformalization of graduate-level mathematics in the wild.

Chinese Translation

当前的自动形式化基准测试主要集中在奥林匹克或本科数学上，而研究生及研究级别的数学仍然未得到充分探索。本文介绍了MathAtlas，这是第一个大规模的野外研究生级别数学自动形式化基准，包含约52,000个定理、定义、练习、例子和从103本研究生数学教材中提取的证明。MathAtlas配备了一个包含约178,000个关系的数学依赖图，并且是第一个包含此类关系的自动形式化基准，便于评估和开发依赖感知的自动形式化系统。我们的广泛实验表明，MathAtlas质量高但极具挑战性：强基线在定理陈述上的正确率最高仅为9.8%，在定义上的正确率为16.7%。此外，我们发现，最先进模型的性能随着依赖深度的增加而显著下降：在MA-Hard（一个包含700个具有最深依赖树的实体的子集）上，最佳模型在这个具有挑战性的数据集上的自动形式化正确率仅为2.6%。我们将MathAtlas发布给社区，作为野外研究生级别数学的大规模自动形式化基准集。

cs.AI / 17 / 2605.14062

Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection

知道何时弃牌：通过多阶段在飞拒绝实现令牌高效的LLM合成数据生成

Chowdhury, Anjir Ahmed, Zawad, Syed, Yan, Feng

Abstract

While synthetic data generation with large language models (LLMs) is widely used in post-training pipelines, existing approaches typically generate full outputs before applying quality filters, leading to substantial token waste on samples that are ultimately discarded. To address this, we propose Multi-Stage In-Flight Rejection (MSIFR), a lightweight, training-free framework that detects and terminates low-quality generation trajectories at intermediate checkpoints before they reach full completion. MSIFR decomposes the generation process into sequential stages and applies fast rule-based validators to identify arithmetic inconsistencies, hallucination patterns, and formatting violations, enabling early rejection of faulty samples. We formalize in-flight rejection as a sequential decision process and show that any non-trivial discard policy reduces expected token consumption, with stage-wise savings increasing when rejection occurs earlier in the generation pipeline. We further demonstrate that conditional utility estimates form a martingale, ensuring that early, in-flight rejection does not bias the expected utility of retained samples. Across five instruction-tuned models and seven reasoning benchmarks, MSIFR reduces token consumption by 11%-77% as a standalone method, and up to 78.2% when combined with early-exit methods, while preserving or improving evaluation accuracy. These results confirm that MSIFR provides a practical mechanism for improving the efficiency of LLM-based synthetic data generation without additional training or architectural changes.

Chinese Translation

尽管使用大型语言模型（LLMs）进行合成数据生成在后训练流程中被广泛应用，但现有方法通常在应用质量过滤器之前生成完整输出，导致在最终被丢弃的样本上浪费大量令牌。为了解决这个问题，我们提出了多阶段在飞拒绝（Multi-Stage In-Flight Rejection, MSIFR），这是一种轻量级、无训练的框架，可以在生成过程的中间检查点检测并终止低质量生成轨迹，避免其完全生成。MSIFR将生成过程分解为顺序阶段，并应用快速基于规则的验证器来识别算术不一致、幻觉模式和格式违规，从而实现对故障样本的早期拒绝。我们将在飞拒绝形式化为一个顺序决策过程，并展示任何非平凡的丢弃策略都能减少预期的令牌消耗，当拒绝发生在生成流程的早期时，阶段性节省效果更为显著。我们进一步证明条件效用估计构成一个鞅（martingale），确保早期的在飞拒绝不会偏倚保留样本的预期效用。在五个指令调优模型和七个推理基准上，MSIFR作为独立方法减少了11%-77%的令牌消耗，与早期退出方法结合时减少幅度高达78.2%，同时保持或提高评估准确性。这些结果确认了MSIFR提供了一种实用机制，以提高基于LLM的合成数据生成的效率，而无需额外的训练或架构更改。
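
The stage-wise rejection described above can be illustrated with a small sketch: a sample is produced in stages, cheap rule-based validators run after each stage, and generation stops at the first failure so later stages are never paid for. This is an assumption-laden toy and not the MSIFR code: the stage chunks are pre-written strings standing in for model output, whitespace-separated words stand in for tokens, and `arithmetic_consistent`, `non_empty`, and `generate_in_flight` are hypothetical names.

```python
import re


def arithmetic_consistent(text):
    """Rule-based check: every 'a + b = c' statement in the text must hold."""
    for a, b, c in re.findall(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", text):
        if int(a) + int(b) != int(c):
            return False
    return True


def non_empty(text):
    """Formatting check: the trajectory so far must contain some text."""
    return bool(text.strip())


def generate_in_flight(stage_chunks, validators):
    """Consume stages one at a time; return (accepted, tokens_spent).

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    text, spent = "", 0
    for chunk in stage_chunks:
        text += chunk
        spent += len(chunk.split())
        if not all(check(text) for check in validators):
            return False, spent  # rejected mid-flight; later stages are skipped
    return True, spent


validators = [arithmetic_consistent, non_empty]
faulty = ["Step 1: 2 + 2 = 5. ", "Step 2: so the total is 5. ", "Answer: 5"]
sound = ["Step 1: 2 + 2 = 4. ", "Step 2: so the total is 4. ", "Answer: 4"]

print(generate_in_flight(faulty, validators))  # (False, 7)
print(generate_in_flight(sound, validators))   # (True, 16)
```

The faulty sample is discarded after 7 of its 16 word-tokens, so the earlier the failing check fires, the larger the saving, which matches the abstract's claim about stage-wise savings.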

cs.AI / 18 / 2605.14089

SkillFlow: Flow-Driven Recursive Skill Evolution for Agentic Orchestration

SkillFlow：基于流的递归技能演化用于自主编排

Zhang, Mingda, Shen, Tiesunlong, Luo, Haoran, Liu, Wenjin, Xiao, Zikai, Cambria, Erik, Tang, Xiaoying

Abstract

In recent years, a variety of powerful LLM-based agentic systems have been applied to automate complex tasks through task orchestration. However, existing orchestration methods still face key challenges, including strategy collapse under reward maximization, high gradient variance with opaque credit assignment, and unguided skill evolution whose decisions are typically made by directly prompting an LLM to judge rather than derived from principled training signals. To address these challenges, we propose SkillFlow, a flow-based framework that takes a trainable Supervisor as the agent and a structured environment with dynamic skill library and frozen executor, automating task orchestration through multi-turn interaction. SkillFlow employs Tempered Trajectory Balance (TTB), a regression-based flow-matching loss that samples trajectories proportional to reward, preserving diverse orchestration strategies rather than collapsing to a single mode. The same flow objective yields a jointly learned backward policy that provides transparent per-step credit assignment at zero additional inference cost. Building on these flow diagnostics, a recursive skill evolution mechanism determines when to evolve, what skills to create or prune, and where decision gaps lie -- closing the loop from training signal to autonomous capability growth. Experimental results on 14 datasets show that SkillFlow significantly outperforms baselines across question answering, mathematical reasoning, code generation, and real-world interactive decision making tasks. Our code is available at https://anonymous.4open.science/r/SkillFlow-E850.

Chinese Translation

近年来，各种强大的基于大语言模型（LLM）的自主系统被应用于通过任务编排来自动化复杂任务。然而，现有的编排方法仍面临关键挑战，包括在奖励最大化下的策略崩溃、模糊的信用分配导致的高梯度方差，以及无指导的技能演化，其决策通常是通过直接提示LLM进行判断，而不是源于有原则的训练信号。为了解决这些挑战，我们提出了SkillFlow，一个基于流的框架，将可训练的监督者作为代理，并构建一个具有动态技能库和冻结执行器的结构化环境，通过多轮交互自动化任务编排。SkillFlow采用了温和轨迹平衡（Tempered Trajectory Balance, TTB），这是一种基于回归的流匹配损失，按比例采样与奖励相关的轨迹，从而保留多样化的编排策略，而不是崩溃为单一模式。相同的流目标产生了一个联合学习的反向策略，提供透明的逐步信用分配，且没有额外的推理成本。在这些流诊断的基础上，递归技能演化机制决定何时演化、创建或修剪哪些技能，以及决策空白在哪里——从训练信号到自主能力增长闭合循环。在14个数据集上的实验结果表明，SkillFlow在问答、数学推理、代码生成和现实世界交互决策任务中显著优于基线。我们的代码可在 https://anonymous.4open.science/r/SkillFlow-E850 获取。

cs.AI / 19 / 2605.14102

ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation

ChromaFlow：工具增强型代理评估中的调度开销的负消融研究

Mittal, Tarun

Abstract

Autonomous language-model agents increasingly combine planning, tool use, document processing, browsing, code execution, and verification loops. These capabilities make agent systems more useful, but they also introduce operational failure modes that are not visible from final accuracy alone. This report presents ChromaFlow, a tool-augmented autonomous reasoning framework built around planner-directed execution, specialized tool use, and telemetry-driven evaluation. We analyze ChromaFlow on GAIA 2023 Level-1 validation tasks under clean evaluation constraints. A frozen full Level-1 baseline achieved 29/53 correct answers, or 54.72%. A later recovery configuration with expanded orchestration achieved 27/53 correct answers, or 50.94%, while increasing tracebacks, timeout events, tool-failure mentions, token-line calls, and campaign-log cost estimates. Two randomized 20-task smoke evaluations produced 12/20 and 11/20 correct answers, showing that small diagnostic gains can be unstable across samples. The central result is therefore a negative ablation: more aggressive orchestration did not improve full-set performance and increased operational noise. The report argues that bounded planner escalation, deterministic extraction, evidence reconciliation, and explicit run gates should be treated as first-order requirements for reliable autonomous agent evaluation.

Chinese Translation

自主语言模型代理越来越多地结合了规划、工具使用、文档处理、浏览、代码执行和验证循环。这些能力使得代理系统更加实用，但也引入了仅从最终准确性无法看出的操作失败模式。本报告提出了ChromaFlow，一个围绕规划者导向执行、专业工具使用和遥测驱动评估构建的工具增强型自主推理框架。我们在GAIA 2023 Level-1验证任务下，分析了ChromaFlow在干净评估约束下的表现。一个冻结的完整Level-1基线达到了29/53的正确答案，准确率为54.72%。而后一个扩展调度的恢复配置仅达到了27/53的正确答案，准确率为50.94%，同时增加了回溯、超时事件、工具失败提及、令牌行调用和活动日志成本估算。两个随机的20任务烟雾评估产生了12/20和11/20的正确答案，显示出小的诊断增益在样本间可能不稳定。因此，中心结果是一个负消融：更激进的调度并未改善全集性能，反而增加了操作噪声。报告认为，有限的规划者升级、确定性提取、证据调和和明确的运行门应被视为可靠自主代理评估的首要要求。

cs.AI / 20 / 2605.14111

Modeling Bounded Rationality in Drug Shortage Pharmacists Using Attention-Guided Dynamic Decomposition

使用注意力引导的动态分解模型化药品短缺药剂师的有限理性

Amiri, Yaniv Eliyahu, Chicoine, Noah, Griffin, Jacqueline, Marsella, Stacy

Abstract

Hospital pharmacists make high-stakes decisions to mitigate drug shortages under uncertainty, time pressure, and patient risk. Interviews revealed that pharmacists focus attention on a small subset of drugs, limiting cognitive effort to the most urgent cases. Motivated by these findings, we formalize a bounded-rational, attention-guided decision framework that dynamically decomposes drugs into a subset for high-cost reasoning and a complementary subset for low-cost monitoring. We develop two agents: an Expert Agent that applies attention weights derived from pharmacist interviews, and a Learner Agent that adapts attention allocation over time through experience. Across simulated scenarios spanning short to long horizons, we show that attention-guided planning supports stable decision-making without complete state reasoning. These results suggest that a primary decision is not what action to take, but where to allocate cognitive effort, and that attention-guided, satisficing strategies can reduce problem complexity while maintaining stable performance.

Chinese Translation

医院药剂师在不确定性、时间压力和患者风险下做出高风险决策以缓解药品短缺。访谈显示，药剂师将注意力集中在少数药品上，从而将认知努力限制在最紧急的案例上。基于这些发现，我们正式化了一个有限理性、注意力引导的决策框架，该框架动态地将药品分解为高成本推理的子集和低成本监控的互补子集。我们开发了两个代理：一个专家代理（Expert Agent），其应用的注意力权重来源于药剂师访谈；另一个学习代理（Learner Agent），其通过经验随时间调整注意力分配。在涵盖短期到长期的模拟场景中，我们展示了注意力引导的规划支持稳定的决策制定，而无需完全的状态推理。这些结果表明，主要决策并不是采取什么行动，而是将认知努力分配到哪里，并且注意力引导的满意策略可以在保持稳定表现的同时降低问题复杂性。

cs.AI / 21 / 2605.14133

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

ClawForge：为命令行代理生成可执行的交互基准测试

Lai, Yuxiang, Xia, Peng, Ji, Haonian, Xiong, Kaiwen, Zeng, Kaide, Liu, Jiaqi, Wu, Fang, Zhong, Jike, Zheng, Zeyu, Xie, Cihang, Yao, Huaxiu

Abstract

Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents operate over persistent state. Existing interactive benchmarks have advanced agent evaluation significantly, but most initialize tasks from clean state and do not systematically test how agents handle pre-existing partial, stale, or conflicting artifacts. We present ClawForge, a generator-backed benchmark framework for executable command-line workflows under state conflict. The framework compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications, and evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side effects rather than exact trajectory matching. We instantiate this framework as ClawForge-Bench (17 scenarios, 6 ability categories). Results across seven frontier models show that the best model reaches only 45.3% strict accuracy, wrong-state replacement remains below 17% for all models, and the widest model separation (17% to 90%) is driven by whether agents inspect existing state before acting. Partial-credit and step-efficiency analyses further reveal that many failures are near-miss closures rather than early breakdowns, and that models exhibit qualitatively different failure styles under state conflict.

Chinese Translation

交互代理基准测试在可扩展构建与现实工作流程评估之间存在紧张关系。手动编写的任务在扩展和修订时成本高昂，而静态提示评估则错过了仅在代理操作于持久状态时才会出现的失败。现有的交互基准测试在代理评估方面取得了显著进展，但大多数任务都是从干净状态初始化，并未系统性地测试代理如何处理预先存在的部分、过时或冲突的工件。我们提出了 ClawForge，一个由生成器支撑的基准框架，面向状态冲突下的可执行命令行工作流程。该框架将场景模板、落地槽位、初始化状态、参考轨迹和验证器编译成可复现的任务规范，并通过规范化的最终状态和可观察的副作用逐步评估代理在持久工作流程表面上的表现，而不是依赖精确的轨迹匹配。我们将该框架实例化为ClawForge-Bench（包含17个场景，6个能力类别）。在七个前沿模型上的结果显示，最佳模型仅达到45.3%的严格准确率，所有模型的错误状态替换率均低于17%，而模型之间的最大差异（17%至90%）则取决于代理在行动前是否检查现有状态。部分得分和步骤效率分析进一步揭示，许多失败是接近成功的收尾失误，而非早期崩溃，并且模型在状态冲突下表现出性质不同的失败风格。

cs.AI / 22 / 2605.14141

Distribution-Aware Algorithm Design with LLM Agents

基于分布感知的算法设计与大语言模型代理

Koganti, Saharsh, Mishra, Priyadarsi, Beneventano, Pierfrancesco, Galanti, Tomer

Abstract

We study learning when the learned object is executable solver code rather than a predictor. In this setting, correctness is not enough: two solvers may both return valid solutions on the deployment distribution while differing substantially in runtime. Given samples from an unknown task distribution, the learner returns code evaluated on fresh instances by both solution quality and execution time. Our central abstraction is a solver hint: reusable structure inferred from samples and compiled into specialized solver code. We prove that the empirically fastest sample-consistent solver from a fixed library generalizes in both correctness and runtime, and that statistically identifiable hints can be recovered and compiled from polynomially many samples. Empirically, we instantiate the framework with LLM code agents on 21 structured combinatorial-optimization target distributions across seven problem classes. The synthesized solvers reach mean normalized quality 0.971, improve by +0.224 over the average heuristic pool and by +0.098 over the highest-quality heuristic, and are 336.9×, 342.8×, and 16.1× faster than the quality-best heuristic, Gurobi, and the selected time-limited exact backend, respectively. On released PACE 2025 Dominating Set private instances, the synthesized solver is valid on all 100 graphs and runs about two orders of magnitude faster than top competition solvers, with a moderate quality gap. Inspection shows that many gains come from changing the computational scale: replacing ambient exponential search or general-purpose optimization with compiled distribution-specific computation.

Chinese Translation

我们研究学习的过程，当学习的对象是可执行的求解器代码而非预测器时。在这种情况下，仅有正确性是不够的：两个求解器可能在部署分布上都返回有效的解决方案，但在运行时间上却存在显著差异。给定来自未知任务分布的样本，学习者返回在新实例上评估的代码，评估标准包括解决方案质量和执行时间。我们的核心抽象是求解器提示（solver hint）：从样本中推断出的可重用结构，并编译成专门的求解器代码。我们证明了来自固定库的经验上最快的样本一致求解器在正确性和运行时间上都具有良好的泛化能力，并且可以从多项式数量的样本中恢复和编译出统计上可识别的提示。我们通过在七个问题类别中的21个结构化组合优化目标分布上，使用大语言模型代码代理实例化该框架。合成的求解器达到平均标准化质量0.971，比平均启发式池提高了0.224，比最高质量的启发式提高了0.098，并且比质量最佳的启发式、Gurobi和所选的时间限制精确后端分别快336.9倍、342.8倍和16.1倍。在发布的PACE 2025主导集私有实例上，合成的求解器在所有100个图上有效，并且运行速度比顶级竞争求解器快约两个数量级，且质量差距适中。检查显示，许多性能提升来自于改变计算规模：用编译的特定分布计算替代环境指数搜索或通用优化。

cs.AI / 23 / 2605.14163

Agentic Systems as Boosting Weak Reasoning Models

代理系统作为增强弱推理模型的工具

Sunkaraneni, Varun, Beneventano, Pierfrancesco, Neumarker, Riccardo, Poggio, Tomaso, Galanti, Tomer

Abstract

Can a committee of weak reasoning-model calls reach the performance of much stronger models? We study verifier-backed committee search as inference-time boosting for reasoning language models. The mechanism is not simply that "more agents help": samples expose latent correct solutions, while critics and comparators must recover them without access to the hidden verifier. We formalize this view by separating proposal coverage, local identifiability, progress, and diversity. We prove that coverage can be amplified by repeated sampling, but cannot by itself create useful critics or comparators; reliable amplification requires an additional local soundness signal, such as execution, proof checking, type checking, tests, or constraint solving. We give rank-based bounds showing when local selection errors compose into reliable trajectories, and characterize the proposer-side ceiling: oracle best-of-k converges only to the mass of task slices on which the proposal system assigns nonzero useful probability. Empirically, on SWE-bench Verified, a single GPT-5.4 nano proposal solves 67.0% of tasks. Using the same nano model, our critic-comparator orchestration reaches 76.4% with k=8 proposals, matching the standalone performance of Gemini 3 Pro and Claude Opus 4.5 Thinking and approaching the 79.0% oracle best-of-8 upper bound. Thus, many correct patches are already present in weak-model proposal pools; the main challenge is selecting them. The remaining failures are mostly proposal-coverage failures, indicating shared blind spots that stronger selection alone cannot close.

Chinese Translation

一个弱推理模型的委员会能否达到更强模型的性能？我们研究了以验证者为支持的委员会搜索作为推理时间增强的方法，用于推理语言模型。其机制并不仅仅是“更多的代理有帮助”：样本揭示了潜在的正确解决方案，而评论者和比较者必须在没有访问隐藏验证者的情况下恢复这些解决方案。我们通过分离提案覆盖、局部可识别性、进展和多样性来形式化这一观点。我们证明了覆盖可以通过重复采样来增强，但单靠覆盖无法创建有用的评论者或比较者；可靠的增强需要额外的局部可靠性信号，例如执行、证明检查、类型检查、测试或约束求解。我们给出了基于排名的界限，显示何时局部选择错误会组合成可靠的轨迹，并刻画了提案方的上限：oracle best-of-k 仅收敛于提案系统赋予非零有用概率的任务切片的概率质量。在实证研究中，在 SWE-bench Verified 上，单个 GPT-5.4 nano 提案解决了 67.0% 的任务。使用相同的 nano 模型，我们的评论者-比较者协调在 k=8 个提案下达到了 76.4%，匹配了 Gemini 3 Pro 和 Claude Opus 4.5 Thinking 的独立性能，并接近 79.0% 的 oracle best-of-8 上限。因此，许多正确的补丁已经存在于弱模型的提案池中；主要挑战在于选择它们。剩余的失败大多是提案覆盖失败，表明存在仅靠更强的选择无法弥补的共同盲点。
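
The proposer-side ceiling mentioned above can be made concrete with a small calculation. Assuming, purely for illustration, that proposals are sampled independently and that tasks fall into slices with a fixed per-sample solve probability, oracle best-of-k coverage on a slice is 1 - (1 - p)^k, so it approaches the total mass of slices with p > 0 and never exceeds it. The slice masses and probabilities below are invented and are not the paper's data; `best_of_k_coverage` is a hypothetical helper.

```python
def best_of_k_coverage(slices, k):
    """Expected fraction of tasks with at least one correct proposal among k.

    `slices` is a list of (mass, p) pairs: `mass` is the fraction of tasks in
    the slice and `p` the per-sample probability of a correct proposal.
    Independence between samples is an assumption of this sketch.
    """
    return sum(mass * (1.0 - (1.0 - p) ** k) for mass, p in slices)


# Made-up example: 20% of tasks are a blind spot the proposer never solves.
slices = [(0.5, 0.9), (0.3, 0.2), (0.2, 0.0)]
ceiling = sum(mass for mass, p in slices if p > 0)

for k in (1, 8, 64):
    print(k, round(best_of_k_coverage(slices, k), 4))
print("ceiling", ceiling)
```

Raising k closes the gap on the hard-but-solvable slice, while the p = 0 slice stays unsolved at any k, which is the sense in which the abstract says stronger selection alone cannot close shared blind spots.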

cs.AI / 24 / 2605.14164

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

人工智能模型构建者的不稳定指标与基准文化

Baack, Stefan, Buschek, Christo, Bohacek, Maty

Abstract

The primary way to establish and compare competencies in foundation and generative AI models has shifted from peer-reviewed literature to press releases and company blog posts, where model builders highlight results on selected benchmarks. These artifacts now largely define the state of the art for researchers and the public. Despite their prominence, which benchmarks model builders choose to highlight, and what they communicate through this selection, is underexamined. To investigate, we introduce and open-source Benchmarking-Cultures-25, a dataset of 231 benchmarks highlighted across 139 model releases in 2025 from 11 major AI builders, alongside an interactive tool to explore the data. Our analysis reveals a fragmented evaluation landscape with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder, and 38.5% appear in just one release. Few achieve widespread use (e.g., GPQA Diamond, LiveCodeBench, AIME 2025). Moreover, benchmarks are attributed different competencies by different builders, depending on their narrative. To disentangle these conflicting presentations, we develop a unified taxonomy mapping diverging terminology to a shared framework of measured signals based on what benchmark authors claim to measure. "General knowledge application" is the second most popular, yet vaguely defined, category. Qualitative analysis shows many such benchmarks deemphasize construct validity, instead framing results as indicators of progress toward AGI. Their authors claim to measure knowledge or reasoning broadly, yet mostly evaluate STEM subjects (especially math). We argue that highlighted benchmarks function less as standardized measurement tools and more as flexible narrative devices prioritizing market positioning over scientific evaluation. Data: https://hf.co/datasets/matybohacek/benchmarking-cultures-25; tool: https://bench-cultures.net.

Chinese Translation

在基础和生成性人工智能模型的能力建立与比较中，主要方式已从同行评审的文献转向新闻稿和公司博客文章，模型构建者在其中突出展示在特定基准上的结果。这些文献现在在很大程度上定义了研究人员和公众对前沿技术的认知。尽管这些基准的突出性显著，但模型构建者选择强调哪些基准，以及通过这种选择传达了什么信息，仍然缺乏深入研究。为此，我们引入并开源了Benchmarking-Cultures-25，这是一个包含231个基准的数据库，这些基准在2025年由11个主要人工智能构建者的139个模型发布中被强调，同时提供了一个交互式工具以探索数据。我们的分析揭示了一个碎片化的评估格局，跨模型的可比性有限：63.2%的突出基准仅被单一构建者使用，38.5%仅出现在一个发布中。很少有基准能够实现广泛使用（例如，GPQA Diamond、LiveCodeBench、AIME 2025）。此外，不同的构建者根据其叙述对基准赋予了不同的能力。为了理清这些相互矛盾的表述，我们开发了一个统一的分类法，将不同的术语映射到一个基于基准作者声称测量内容的共享框架中。“一般知识应用”是第二受欢迎但定义模糊的类别。定性分析显示，许多此类基准淡化了构念效度，而是将结果框架化为朝向通用人工智能（AGI）进展的指标。其作者声称广泛测量知识或推理，但大多数评估的是STEM学科（尤其是数学）。我们认为，突出展示的基准更像是灵活的叙述工具，优先考虑市场定位而非科学评估。数据来源：https://hf.co/datasets/matybohacek/benchmarking-cultures-25；工具：https://bench-cultures.net。

cs.AI / 25 / 2605.14167

The Evaluation Trap: Benchmark Design as Theoretical Commitment

评估陷阱：基准设计作为理论承诺

Kalaitzidis, Theodore J

Abstract

Every AI benchmark operationalizes theoretical assumptions about the capability it claims to assess. When assumptions function as unexamined commitments, benchmarks stabilize the dominant paradigm by narrowing what counts as progress. Over time, narrow evaluation reorganizes capability concepts: architectures and definitions are selected for benchmark legibility until evaluation ceases to track an independent object and instead produces a version of the target defined by its own operational assumptions. The result is a trap: evaluation frameworks treat self-reinforcing assessments as valid, both creating and obscuring structural limits on what the current paradigm can accomplish. We introduce Epistematics, a methodology for deriving evaluation criteria directly from technical capability claims and auditing whether proposed benchmarks can discriminate the claimed capability from proxy behaviors. The contribution is meta-evaluative: an audit procedure, a failure mode taxonomy, and benchmark-design criteria for evaluating capability-evaluation coherence. We demonstrate the procedure through a worked audit of Dupoux et al. (2026), a proposal that revises the dominant paradigm's theoretical assumptions at the architectural level while reproducing them in its evaluation criteria, thereby entrenching the constraint it seeks to overcome in a form the evaluation cannot detect.

Chinese Translation

每个人工智能基准都将其声称评估的能力的理论假设进行具体化。当这些假设作为未经审视的承诺运作时，基准通过缩小进展的定义来巩固主导范式。随着时间的推移，狭隘的评估重新组织了能力概念：架构和定义被选择以便于基准的可读性，直到评估不再追踪一个独立的对象，而是产生一个由其自身操作假设定义的目标版本。结果是一个陷阱：评估框架将自我强化的评估视为有效，从而既创造又掩盖了当前范式所能实现的结构性限制。我们引入了Epistematics，这是一种直接从技术能力声明中推导评估标准的方法论，并审计所提议的基准是否能够区分所声称的能力与代理行为。我们的贡献是元评估性的：一种审计程序、一种失败模式分类法以及评估能力与评估一致性的基准设计标准。我们通过对Dupoux等人（2026）的审计实例展示了这一程序，该提案在架构层面修订了主导范式的理论假设，同时在其评估标准中再现这些假设，从而在评估无法检测的形式中巩固了其试图克服的约束。

cs.AI / 26 / 2605.14175

Grounded Continuation: A Linear-Time Runtime Verifier for LLM Conversations

有据延续：一种用于大型语言模型对话的线性时间运行时验证器

He, Qisong, Dong, Yi, Huang, Xiaowei

Abstract

In long conversations, an LLM can produce a next utterance that sounds plausible but rests on premises the conversation has already abandoned. Context-manipulation attacks against deployed agents now actively exploit this gap. We close it with a runtime verifier that maintains an explicit dependency graph: an LLM classifies each turn into one of 8 update operations drawn from four formalisms (dynamic epistemic logic, abductive reasoning, awareness logic, argumentation), and a symbolic engine records which claims depend on which evidence. Checking whether a continuation is supported reduces to a graph walk; retraction propagates through the same graph to flag exactly the conclusions that lose support, with linear per-turn cost and a formal conflict-free guarantee. On LongMemEval-KU oracle (n=78), the verifier reaches 89.7% accuracy vs. 88.5% for the LLM-only baseline (+1.3pp) and 87.2% for a transcript-RAG baseline matched on retrieval budget (+2.6pp); wins among disagreements are correct abstentions where the baseline confabulates. On LoCoMo's 60 official QA items the verifier is competitive with retrieval-augmented baselines. Beyond external benchmarks, we construct two multi-agent scenarios and a 50-item grounding test: on the 15-item stale-premise subset, the verifier reaches 100% accuracy vs. 93.3% (+6.7pp). These instantiate a soundness-faithfulness decomposition: the structural check is sound by construction, and per-deployment LLM extraction faithfulness is the empirical question we measure across four LLM families. The retraction check plateaus at microseconds while history-replay grows linearly with conversation length.

Chinese Translation

在长时间的对话中，大型语言模型（LLM）可能会生成听起来合理的下一句话，但其所依据的前提已被对话放弃。针对已部署代理的上下文操控攻击正积极利用这一差距。我们通过一个运行时验证器来填补这一空白，该验证器维护一个明确的依赖图：LLM将每个轮次分类为来自四种形式主义（动态认知逻辑、溯因推理、意识逻辑、论证）的八种更新操作之一，并且一个符号引擎记录哪些主张依赖于哪些证据。检查一个延续是否得到支持归结为图遍历；撤回通过同一图传播，以精确标记失去支持的结论，且每轮的成本为线性，并提供形式化的无冲突保证。在LongMemEval-KU oracle设置（n=78）上，验证器的准确率达到89.7%，而仅使用LLM的基线为88.5%（+1.3个百分点），在检索预算上匹配的转录-RAG基线为87.2%（+2.6个百分点）；在两者结论不一致的样本中，验证器的胜出来自正确的弃答，而基线在这些样本上出现虚构。在LoCoMo的60个官方问答项目中，验证器与检索增强的基线具有竞争力。除了外部基准外，我们构建了两个多代理场景和一个50项的依据测试：在15项陈旧前提子集上，验证器的准确率达到100%，而基线为93.3%（+6.7个百分点）。这些实例化了一个可靠性-忠实性分解：结构检查在构造上是可靠的，而每个部署的LLM提取忠实性是我们在四个LLM家族中测量的经验问题。撤回检查在微秒级别趋于平稳，而历史重放则随着对话长度线性增长。
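
The dependency-graph idea in this abstract, where a support check is a graph walk and a retraction propagates to exactly the conclusions that lose support, can be sketched as below. This is not the authors' verifier: the `DependencyGraph` class and its methods are hypothetical, the eight update operations and the LLM extraction step are omitted, and the sketch assumes a claim loses support as soon as any one of its premises is retracted.

```python
from collections import defaultdict, deque


class DependencyGraph:
    """Tracks which claims rest on which premises (illustrative only)."""

    def __init__(self):
        self.premises = {}                  # claim -> set of premises it relies on
        self.dependents = defaultdict(set)  # premise -> claims relying on it
        self.retracted = set()

    def assert_claim(self, claim, premises=()):
        self.premises[claim] = set(premises)
        for premise in premises:
            self.dependents[premise].add(claim)

    def retract(self, claim):
        """Retract a claim and return the downstream claims that lose support."""
        lost = set()
        self.retracted.add(claim)
        queue = deque([claim])
        while queue:  # breadth-first walk over dependents; each edge is seen once
            node = queue.popleft()
            for dependent in self.dependents[node]:
                if dependent not in self.retracted:
                    self.retracted.add(dependent)
                    lost.add(dependent)
                    queue.append(dependent)
        return lost

    def is_supported(self, claim):
        return claim in self.premises and claim not in self.retracted

    def continuation_supported(self, required_claims):
        """A continuation is accepted only if everything it relies on still stands."""
        return all(self.is_supported(c) for c in required_claims)


g = DependencyGraph()
g.assert_claim("flight is on Friday")
g.assert_claim("book hotel for Friday", ["flight is on Friday"])
g.assert_claim("reserve dinner Friday night", ["book hotel for Friday"])
g.assert_claim("user likes sushi")

lost = g.retract("flight is on Friday")
print(sorted(lost))
print(g.continuation_supported(["user likes sushi"]))
print(g.continuation_supported(["reserve dinner Friday night"]))
```

After the flight date is retracted, a continuation that still talks about the Friday dinner is flagged as resting on an abandoned premise, while the unrelated sushi preference remains usable.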

cs.AI / 27 / 2605.14205

SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

SimPersona：从原始点击流中学习离散买家角色，用于有据的电子商务代理

Foumani, Zahra Zanjani, Castelo, Alberto, Xie, Shuang, Chaiwachirasak, Ted, Li, Han, Wang, Lingyun

Abstract

LLM-based web agents can navigate live storefronts, yet they often collapse to a single "average buyer" policy, failing to capture the heterogeneous and distributional nature of real buyer populations. Existing personalization methods rely on hand-crafted prompt-based personas that are brittle, difficult to scale, context-inefficient, and unable to faithfully represent population-level behavior. We introduce SimPersona, a novel framework that learns discrete buyer types from historical traffic and exposes them to LLM-based web agents as compact persona tokens. Given raw clickstreams, a behavior-aware VQ-VAE induces a discrete buyer-type space that captures the statistical structure of real buyer behavior and merchant-specific buyer population distributions. To provide behavior-specific guidance to LLM-based web agents, SimPersona maps each learned buyer type to a dedicated persona token in the LLM agent vocabulary and fine-tunes the agent with these tokens on real browsing traces. At inference, each synthetic buyer is assigned to a learned buyer type with a single encoder forward pass, requiring no retraining or store-specific prompt engineering. For population-level simulation, SimPersona samples buyer types from each merchant's empirical distribution over the learned VQ-VAE codebook and instantiates agents with the corresponding persona tokens, preserving merchant-specific buyer population distributions. Evaluated on 8.37M buyers across 42 held-out live storefronts, SimPersona achieves 78% conversion-rate alignment with real buyers, exhibits interpretable behavioral variation across buyer types, and outperforms a baseline with 8× more parameters on goal-oriented shopping tasks. We further release an open-source data pipeline that converts raw e-commerce event logs into buyer representations and agent-training traces.

Chinese Translation

基于大型语言模型（LLM）的网络代理能够在实时商店中导航，但它们往往会退化为单一的“平均买家”策略，未能捕捉真实买家群体的异质性和分布特征。现有的个性化方法依赖于手工制作的基于提示的角色，这些角色脆弱、难以扩展、上下文效率低下，且无法真实地代表群体级行为。我们提出了SimPersona，一个新颖的框架，能够从历史流量中学习离散的买家类型，并将其作为紧凑的角色令牌暴露给基于LLM的网络代理。给定原始点击流，行为感知的VQ-VAE（向量量化变分自编码器）诱导出一个离散买家类型空间，捕捉真实买家行为和商家特定买家群体分布的统计结构。为了为基于LLM的网络代理提供行为特定的指导，SimPersona将每个学习到的买家类型映射到LLM代理词汇中的专用角色令牌，并使用这些令牌在真实浏览轨迹上微调代理。在推理时，每个合成买家通过单个编码器的前向传播被分配到一个学习到的买家类型，无需重新训练或特定商店的提示工程。对于群体级模拟，SimPersona从每个商家在所学VQ-VAE码本上的经验分布中抽样买家类型，并用相应的角色令牌实例化代理，保持商家特定的买家群体分布。在对42个保留的实时商店中的837万买家进行评估时，SimPersona在与真实买家的转化率对齐方面达到了78%，展示了买家类型之间可解释的行为变化，并在目标导向的购物任务中超越了一个参数量为其8倍的基线。此外，我们还发布了一个开源数据管道，将原始电子商务事件日志转换为买家表示和代理训练轨迹。

cs.AI / 28 / 2605.14211

ASH: Agents that Self-Hone via Embodied Learning

ASH：通过具身学习自我提升的智能体

Schneider, Benjamin, Schneider, Xavier, Zhong, Victor, Sun, Sun

Abstract

Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop; when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories, and uses its IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale internet video and retains them as long-term memory -- allowing it to tackle long-horizon problems. We evaluate ASH on two complementary environments demanding multi-hour planning: Pokemon Emerald, a turn-based RPG, and The Legend of Zelda: The Minish Cap, a real-time action-adventure game. In both games, behavioral cloning, retrieval-augmented and zero-shot foundation-model baselines plateau, while ASH sustains progression across our 8-hour evaluation. ASH reaches an average of $11.2/12$ milestones in Pokemon Emerald and $9.9/12$ in Legend of Zelda, while the strongest baseline gets stuck in both environments at an average of $6.5/12$ and $6.0/12$ milestones, respectively. We demonstrate that self-improving agents are a scalable recipe for long-horizon embodied learning.

Chinese Translation

长时间跨度的具身任务仍然是人工智能领域的一个基本挑战，因为当前的方法依赖于手工设计的奖励或标注的行动示范，这两者都难以扩展。我们提出了ASH，一个从未标记的、嘈杂的互联网视频中学习具身策略的智能体系统，无需奖励塑造或专家注释。ASH遵循自我改进循环；当它遇到瓶颈时，ASH从自身轨迹中学习逆动力学模型（Inverse Dynamics Model, IDM），并利用其IDM从相关的互联网视频中提取监督信息。ASH使用无监督学习从大规模互联网视频中识别关键时刻，并将其保留为长期记忆——使其能够应对长时间跨度的问题。我们在两个互补环境中评估ASH，这些环境要求进行数小时的规划：宝可梦绿宝石（Pokemon Emerald），一个回合制角色扮演游戏，以及塞尔达传说：小人之帽（The Legend of Zelda: The Minish Cap），一个实时动作冒险游戏。在这两个游戏中，行为克隆、检索增强和零-shot基础模型的基线表现均达到瓶颈，而ASH在我们的8小时评估中保持了进展。ASH在宝可梦绿宝石中达到了平均$11.2/12$的里程碑，在塞尔达传说中达到了$9.9/12$，而最强的基线在这两个环境中分别卡在了平均$6.5/12$和$6.0/12$的里程碑。我们证明了自我改进的智能体是长时间跨度具身学习的可扩展方案。

cs.AI / 29 / 2605.14212

MetaAgent-X: Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning

MetaAgent-X：通过端到端强化学习突破自动多代理系统的瓶颈

Zhang, Yaolun, Zhao, Yujie, Wang, Nan, Wu, Yiran, Chang, Jiayu, Chen, Yizhao, Wu, Qingyun, Zhao, Jishen, Wang, Huazheng

Abstract

Automatic multi-agent systems (MAS) aim to instantiate agent workflows without relying on manually designed or fixed orchestration. However, existing automatic MAS approaches remain only partially adaptive: they either perform training-free test-time search or optimize the meta-level designer while keeping downstream execution agents frozen, creating a frozen-executor ceiling and leaving the end-to-end training of self-designing and self-executing agentic models unexplored. To address this, we introduce MetaAgent-X, an end-to-end reinforcement learning framework that jointly optimizes automatic MAS design and execution. MetaAgent-X enables script-based MAS generation, execution rollout collection, and credit assignment for both designer and executor trajectories. To support stable and scalable optimization, we propose Executor Designer Hierarchical Rollout and Stagewise Co-evolution to improve training stability and expose the dynamics of designer-executor co-evolution. MetaAgent-X consistently outperforms existing automatic MAS baselines, achieving up to 21.7% gains. Comprehensive ablations show that both designer and executor improve throughout training, and that effective automatic MAS learning follows a stagewise co-evolution process. These results establish end-to-end trainable automatic MAS as a practical paradigm for building self-designing and self-executing agentic models.

Chinese Translation

自动多代理系统旨在实现代理工作流，而无需依赖手动设计或固定的编排。然而，现有的自动多代理系统（MAS）方法仍然仅部分适应：它们要么在测试时进行无训练的搜索，要么在保持下游执行代理不变的情况下优化元级设计者，这导致了一个冻结执行者的瓶颈，并使得自设计和自执行代理模型的端到端训练未能得到探索。为了解决这一问题，我们引入了MetaAgent-X，一个端到端的强化学习框架，能够联合优化自动MAS的设计和执行。MetaAgent-X支持基于脚本的MAS生成、执行回滚收集以及对设计者和执行者轨迹的信用分配。为了支持稳定和可扩展的优化，我们提出了执行者设计者层次回滚和阶段性共同进化，以提高训练的稳定性并揭示设计者与执行者共同进化的动态。MetaAgent-X在性能上始终优于现有的自动MAS基线，取得了高达21.7%的提升。全面的消融实验表明，设计者和执行者在训练过程中均有所改善，并且有效的自动MAS学习遵循阶段性共同进化过程。这些结果确立了端到端可训练的自动MAS作为构建自设计和自执行代理模型的实用范式。

cs.AI / 30 / 2605.14215

GenCircuit-RL: Reinforcement Learning from Hierarchical Verification for Genetic Circuit Design

GenCircuit-RL：基于层次验证的强化学习用于基因电路设计

Flynn, Noah

Abstract

Genetic circuit design remains a laborious, expert-driven process despite decades of progress in synthetic biology. We study this problem through code generation: models produce Python code in pysbol3 to construct genetic circuits in the Synthetic Biology Open Language (SBOL), a formal representation that supports automated verification. We introduce GenCircuit-RL, a reinforcement learning framework built around hierarchical verification rewards that decompose correctness into five levels, from code execution to task-specific topological checks, and a four-stage curriculum that shifts optimization pressure from code generation to functional reasoning. We also introduce SynBio-Reason, a benchmark of 4,753 circuits spanning six canonical circuit types and nine tasks from code repair to de novo design, with held-out biological parts for out-of-distribution evaluation. Hierarchical verification improves task success on functional reasoning tasks by 14 to 16 percentage points over binary rewards, and curriculum learning is required for strong design performance. The resulting models generate topologically correct circuits, generalize to novel biological parts, and rediscover canonical designs from the synthetic biology literature.
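To make the contrast between hierarchical and binary rewards concrete, here is a minimal sketch of a tiered reward that grants partial credit for each verification level passed in order. Only the first and last level names follow the abstract (code execution and task-specific topological checks). The three middle levels, the `result` dictionary keys, and the equal weighting per level are assumptions for illustration, and no real pysbol3 calls are used.

```python
from typing import Callable, List, Tuple

Check = Callable[[dict], bool]

# Each level is (name, check). The middle three levels are placeholders;
# the abstract names only the first and last.
LEVELS: List[Tuple[str, Check]] = [
    ("code_executes", lambda r: r.get("executed", False)),
    ("document_valid", lambda r: r.get("valid", False)),        # assumed
    ("parts_present", lambda r: r.get("parts_ok", False)),      # assumed
    ("structure_ok", lambda r: r.get("structure_ok", False)),   # assumed
    ("task_topology_ok", lambda r: r.get("topology_ok", False)),
]


def hierarchical_reward(result: dict, levels=LEVELS) -> float:
    """Fraction of levels passed in order, stopping at the first failure."""
    passed = 0
    for _name, check in levels:
        if not check(result):
            break
        passed += 1
    return passed / len(levels)


def binary_reward(result: dict, levels=LEVELS) -> float:
    """Baseline: 1.0 only if every level passes, otherwise 0.0."""
    return 1.0 if all(check(result) for _name, check in levels) else 0.0


# A program that runs and yields a valid document but fails later checks.
partial = {"executed": True, "valid": True}
```

Under this sketch, `partial` earns 0.4 from the hierarchical reward but 0.0 from the binary one, so the policy still receives a learning signal from near misses.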

Chinese Translation

尽管合成生物学经过数十年的发展，基因电路设计仍然是一个繁琐且依赖专家的过程。我们通过代码生成研究这一问题：模型生成 Python 代码（使用 pysbol3）以构建合成生物学开放语言（SBOL）中的基因电路，这是一种支持自动验证的形式化表示。我们提出了 GenCircuit-RL，这是一个围绕层次验证奖励构建的强化学习框架，该框架将正确性分解为五个层次，从代码执行到特定任务的拓扑检查，并采用四阶段课程，将优化压力从代码生成转移到功能推理。我们还引入了 SynBio-Reason，这是一个包含 4,753 个电路的基准，涵盖六种典型电路类型和九个任务，从代码修复到新设计，并为分布外评估保留了生物部件。层次验证在功能推理任务上相较于二元奖励提高了 14 到 16 个百分点的任务成功率，而课程学习对于强设计性能是必需的。最终生成的模型能够生成拓扑上正确的电路，能够推广到新颖的生物部件，并重新发现合成生物学文献中的典型设计。

cs.AI / 31 / 2605.14218

Fusion-fission forecasts when AI will shift to undesirable behavior

人工智能何时会转向不良行为的融合-裂变预测

Johnson, Neil F., Huo, Frank Yingjie

Abstract

The key problem facing ChatGPT-like AI's use across society is that its behavior can shift, unnoticed, from desirable to undesirable -- encouraging self-harm, extremist acts, financial losses, or costly medical and military mistakes -- and no one can yet predict when. Shifts persist in even the newest AI models despite remarkable progress in AI modeling, post-training alignment and safeguards. Here we show that a vector generalization of fusion-fission group dynamics observed in living and active-matter systems drives -- and can forecast -- future shifts in the AI's behavior. The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including: 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and independently confirmed by that corpus of 207,443 human-AI exchanges. Because it sits architecturally below the current safety stack, the same formula provides a real-time warning signal that current alignment does not supply, portable across current and future ChatGPT-like AI architectures and instantiable in application domains where competing response classes can be defined.

Chinese Translation

当前社会中，类似ChatGPT的人工智能面临的关键问题是，其行为可能在不被察觉的情况下，从可取转变为不可取——这可能导致自我伤害、极端行为、经济损失或代价高昂的医疗和军事错误——而目前尚无人能够预测这种转变何时发生。尽管在人工智能建模、后训练对齐和安全保障方面取得了显著进展，但即使是最新的人工智能模型中，这种行为转变依然存在。本文展示了一种在活体和活性物质系统中观察到的融合-裂变群体动力学的向量推广，驱动并能够预测人工智能未来行为的转变。转变条件也可以通过数学推导得出，源于当前对话（C）与可取（B）和不可取（D）基态动力学之间的群体竞争，这些动力学可以为特定应用提前估算。该条件既不特定于某个模型，也不是由随机抽样驱动。我们在六个独立测试中验证了这一点，包括：在七个跨越两个数量级参数计数（124M-12B）的人工智能模型中，正确率达到90%；在十个前沿聊天机器人中的生产规模持续性；以及在斯坦福大学“妄想螺旋”语料库出现前十一个月的先验时间戳预测，并由207,443个人工智能交流的该语料库独立确认。由于该公式在当前安全堆栈之下架构上运行，因此提供了一个实时警告信号，而当前的对齐方法无法提供，该信号可移植到当前和未来的类似ChatGPT的人工智能架构中，并可在可以定义竞争响应类别的应用领域中实例化。

cs.AI / 32 / 2605.14237

Good to Go: The LOOP Skill Engine That Hits 99% Success and Slashes Token Usage by 99% via One-Shot Recording and Deterministic Replay

准备就绪：通过一次性录制和确定性重放实现99%成功率和99%令牌使用减少的LOOP技能引擎

Wang, Xiaohua, Yu, Kai, Liang, XuXiao, Wang, Liang, Han, Chao

Abstract

Deploying AI agents for repetitive periodic tasks exposes a critical tension: Large Language Models (LLMs) offer unmatched flexibility in tool orchestration, yet their inherent stochasticity causes unpredictable failures, and repeated invocations incur prohibitive token costs. We present the LOOP SKILL ENGINE, a system that achieves a combined 99% success rate and 99% token reduction for periodic agent tasks through a one-shot recording, deterministic replay paradigm. On its first run, the agent executes the task with full LLM reasoning while the system transparently intercepts and records the complete tool-call trajectory. A greedy length-descending template extraction algorithm then converts this recording into a parameterized, branch-free Loop Skill -- a deterministic execution plan that captures the task's functional intent while parameterizing time-dependent and result-dependent variables. All subsequent executions bypass the LLM entirely: the engine resolves template variables against real-time values and replays the tool sequence deterministically. We prove two theorems: (1) Replay Determinism -- the step sequence of a validated Loop Skill is invariant across all future executions; (2) Write Safety -- concurrent access to persistent configuration is serialized through reentrant locks and atomic file replacement. Across a benchmark of periodic agent tasks spanning intervals from 5 minutes to 24 hours, the Loop Skill Engine reduces monthly token consumption by 93.3%--99.98% and cuts execution latency by 8.7x while eliminating output non-determinism. A multi-layer degradation strategy guarantees that tasks never stall. We release the engine as part of the buddyMe open-source agent framework.
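The record-then-replay idea can be sketched in a few lines. The snippet below turns one recorded tool call into a template by replacing known variable values with placeholders, longest value first, and then fills the template with fresh values at replay time. The `{{name}}` placeholder syntax, the `fetch_report` command, and both function names are hypothetical; the paper's actual extraction algorithm and skill format are not given in the abstract.

```python
import re


def extract_template(recorded: str, variables: dict) -> str:
    """Replace recorded variable values with {{name}} placeholders.

    Values are substituted longest first, so a short value that is a
    substring of a longer one cannot split the longer match.
    """
    template = recorded
    by_length = sorted(variables.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, value in by_length:
        template = template.replace(value, "{{" + name + "}}")
    return template


def replay(template: str, current: dict) -> str:
    """Fill placeholders with current values; no LLM call is involved."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: current[m.group(1)], template)


# First run: the recorded call and the time-dependent values it contained.
recorded_call = "fetch_report --date 2025-01-31 --year 2025"
template = extract_template(recorded_call, {"year": "2025", "date": "2025-01-31"})

# Later run: the same step with values resolved at execution time.
next_call = replay(template, {"date": "2025-02-28", "year": "2025"})
```

If the shorter value `2025` were substituted first, the date would become `{{year}}-01-31` and the `date` variable would never be found, which is the case the length-descending order avoids.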

Chinese Translation

为重复性周期性任务部署人工智能代理暴露出一个关键的矛盾：大型语言模型（LLMs）在工具编排方面提供了无与伦比的灵活性，但其固有的随机性导致不可预测的失败，而重复调用则会产生高昂的令牌成本。我们提出了LOOP技能引擎（LOOP SKILL ENGINE），该系统通过一次性录制和确定性重放范式，实现了周期性代理任务的99%成功率和99%令牌减少。在首次运行时，代理在充分利用LLM推理的同时，系统透明地拦截并记录完整的工具调用轨迹。然后，一个贪婪的长度递减模板提取算法将此录制转换为一个参数化的无分支循环技能（Loop Skill）——一个确定性的执行计划，捕捉任务的功能意图，同时对时间依赖和结果依赖变量进行参数化。所有后续执行完全绕过LLM：引擎根据实时值解析模板变量，并以确定性的方式重放工具序列。我们证明了两个定理：（1）重放确定性——验证过的循环技能的步骤序列在所有未来执行中是不变的；（2）写入安全性——对持久配置的并发访问通过可重入锁和原子文件替换进行序列化。在一个涵盖从5分钟到24小时的周期性代理任务基准测试中，循环技能引擎将每月令牌消耗减少了93.3%至99.98%，并将执行延迟缩短了8.7倍，同时消除了输出的不确定性。多层降级策略确保任务不会停滞。我们将该引擎作为buddyMe开源代理框架的一部分发布。

cs.AI / 33 / 2605.14259

Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

异构商业系统中的超图企业代理推理器

Wang, Ling, Liu, Songnan, Wang, Jianan, Cheng, Cheng, Liu, Xin, Zhu, Yihan, Li, Enyu, Xiao, Yu, Xie, Jiangyong, Yan, Duogong, Chen, Jiangyi

Abstract

Applying Large Language Models (LLMs) to heterogeneous enterprise systems is hindered by hallucinations and failures in multi-hop, n-ary reasoning. Existing paradigms (e.g., GraphRAG, NL2SQL) lack the semantic grounding and auditable execution required for these complex environments. We introduce HEAR, an enterprise agentic reasoner built on a Stratified Hypergraph Ontology. Its base Graph Layer virtualizes provenance-aware data interfaces, while the Hyperedge Layer encodes n-ary business rules and procedural protocols. Operating an evidence-driven reasoning loop, HEAR dynamically orchestrates ontology tools for structured multi-hop analysis without requiring LLM retraining. Evaluations on supply-chain tasks, including order fulfillment blockage root cause analysis (RCA), show HEAR achieves up to 94.7% accuracy. Crucially, HEAR demonstrates adaptive efficiency: utilizing procedural hyperedges to minimize token costs, while leveraging topological exploration for rigorous correctness on complex queries. By matching proprietary model performance with open-weight backbones and automating manual diagnostics, HEAR establishes a scalable, auditable foundation for enterprise intelligence.

Chinese Translation

将大型语言模型（LLMs）应用于异构企业系统受到幻觉和多跳、n-元推理失败的限制。现有范式（例如，GraphRAG，NL2SQL）缺乏在这些复杂环境中所需的语义基础和可审计的执行。我们介绍了HEAR，一个基于分层超图本体的企业代理推理器。其基础图层虚拟化了具有来源意识的数据接口，而超边层则编码了n-元商业规则和程序协议。HEAR通过证据驱动的推理循环动态协调本体工具，以进行结构化的多跳分析，而无需重新训练LLM。在供应链任务的评估中，包括订单履行阻塞根本原因分析（RCA），HEAR的准确率高达94.7%。重要的是，HEAR展现了自适应效率：利用程序超边最小化令牌成本，同时利用拓扑探索确保复杂查询的严格正确性。通过将专有模型性能与开放权重骨干匹配并自动化手动诊断，HEAR为企业智能建立了可扩展、可审计的基础。

cs.AI / 34 / 2605.14261

Heuristic Pathologies and Further Variance Reduction via Uncertainty Propagation in the AIVAT Family of Techniques

启发式病态及通过不确定性传播进一步减少AIVAT技术家族的方差

Kim, Juho, Sandholm, Tuomas

Abstract

How should an agent's performance in a multiagent environment be evaluated when there is a limited sample size or a high cost of running a trial? The AIVAT family of variance reduction techniques was proposed to address this challenge by introducing unbiased low-variance estimators of agents' expected payoffs. An important component of AIVAT is a heuristic value function that discriminates between potentially low- and high-value counterfactual histories. A notable gap in the literature is that there is little to no constraint or guideline on how the heuristic value function should be chosen or how uncertainty in its output should be handled. In our first contribution, we parameterize the heuristic value function to highlight AIVAT's potential vulnerabilities: a) the sample variance can be set pathologically low by directly applying gradient descent on the sample variance, and b) one can p-hack to draw a desired statistical conclusion via gradient descent/ascent on the test statistic. The main takeaway is that the heuristic value function should be fixed prior to observing the evaluation data! In our second contribution, we show how the heuristic uncertainty can be propagated to quantify the uncertainty of AIVAT estimates. It is then possible to further reduce the variance using inverse-variance weighted averaging, but AIVAT's unbiasedness guarantee may have to be sacrificed. In our experiments, we use a dataset of 10,000 poker hands to demonstrate our heuristic pathology and uncertainty results, with the latter yielding a 43.0% reduction in the number of samples (poker hands) needed to draw statistical conclusions.
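For readers unfamiliar with inverse-variance weighted averaging, the following minimal sketch shows the generic textbook form: each estimate is weighted by the reciprocal of its variance. It is not the paper's estimator. The function name is hypothetical, and the combined-variance formula assumes the individual estimates are independent and their variances are known.

```python
def inverse_variance_average(estimates, variances):
    """Combine estimates with weights proportional to 1 / variance.

    Returns (combined_estimate, combined_variance). The combined variance
    1 / sum(1 / v_i) holds only under the independence assumption above.
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one positive variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, estimates)) / total
    return mean, 1.0 / total


# Two made-up payoff estimates; the second is four times noisier.
combined, combined_var = inverse_variance_average([2.0, 7.0], [1.0, 4.0])
```

The noisier estimate receives a quarter of the weight, and the combined variance is lower than either input variance. As the abstract notes, applying such weighting to AIVAT may cost the unbiasedness guarantee.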

Chinese Translation

在样本量有限或试验成本高昂的情况下，如何评估多智能体环境中代理的表现？AIVAT技术家族提出了方差减少方法，通过引入无偏低方差的代理预期收益估计器来应对这一挑战。AIVAT的一个重要组成部分是启发式价值函数，它能够区分潜在的低价值和高价值反事实历史。文献中一个显著的空白是，对于启发式价值函数的选择以及如何处理其输出的不确定性几乎没有约束或指导。在我们的第一项贡献中，我们对启发式价值函数进行参数化，以突出AIVAT的潜在脆弱性：a) 通过直接对样本方差应用梯度下降，样本方差可以被设置得病态地低；b) 通过对检验统计量进行梯度下降/上升，可以进行p-hacking以得出期望的统计结论。主要结论是，启发式价值函数应在观察评估数据之前固定！在我们的第二项贡献中，我们展示了如何传播启发式不确定性，以量化AIVAT估计的不确定性。然后，可以使用逆方差加权平均进一步减少方差，但可能需要牺牲AIVAT的无偏性保证。在我们的实验中，我们使用了10,000手扑克的数据集来展示我们的启发式病态和不确定性结果，后者使得得出统计结论所需的样本数（扑克手数）减少了43.0%。

cs.AI / 35 / 2605.14266

Agentic AI Ecosystems in Higher Education: A Perspective on AI Agents to Emerging Inclusive, Agentic Multi-Agent AI Framework for Learning, Teaching and Institutional Intelligence

高等教育中的自主人工智能生态系统：关于人工智能代理的视角，旨在为学习、教学和机构智能建立新兴的包容性、自主多代理人工智能框架

Sudarshan, Vidya K, Sisodia, Anushka, Ramachandra, Reshma A, Batra, Sia, Leng, Josephine Chong Leng

Abstract

Integration of artificial intelligence (AI) agents in higher education is transforming teaching, learning, and administrative processes. Although existing AI agents effectively support individual tasks, their implementation remains fragmented and inefficient for handling the complexity of educational institutions. This highlights a significant research gap: the lack of an integrated, ecosystem-level agentic multi-agent AI platform capable of coordinated planning, reasoning, and adaptive decision-making across multiple educational functions. This paper presents a forward-looking perspective on agentic multi-agent AI platforms in higher education, consisting of interconnected, autonomous, goal-driven agents that support learning, teaching, and institutional operations. It addresses timely and critical questions: Can agentic AI represent the next generation of intelligent systems in tertiary education? Can such agents collectively support seamless, coordinated operations across teaching, learning, and administrative support? To what extent can such systems foster inclusive and equitable learning for diverse learners with special educational needs? To ground this perspective, a thematic analysis of existing literature identifies four dominant themes: task-specific fragmented AI tools, the transition from single-agent to multi-agent systems, limited cross-functional integration, and insufficient focus on inclusivity and accessibility. Findings reveal a clear gap between current AI implementations and the needs of a holistic, learner-centered educational ecosystem. The paper synthesizes challenges and outlines future research directions for scalable, human-aligned, and inclusive agentic AI platforms. Its main contribution is the incorporation of inclusive learning perspectives, highlighting how a coordinated agentic multi-agent platform can support diverse learners through adaptive, multimodal interventions.

Chinese Translation

人工智能（AI）代理在高等教育中的整合正在改变教学、学习和行政流程。尽管现有的AI代理有效地支持个别任务，但其实施仍然是碎片化的，无法有效处理教育机构的复杂性。这突显了一个重要的研究空白：缺乏能够在多个教育职能之间进行协调规划、推理和自适应决策的综合生态系统级自主多代理AI平台。本文提出了一个前瞻性的视角，探讨高等教育中的自主多代理AI平台，该平台由相互连接的自主、目标驱动的代理组成，支持学习、教学和机构运营。它提出了及时且关键的问题：自主AI能否代表高等教育中智能系统的下一代？它们能否共同支持教学、学习和行政支持之间的无缝协调操作？这样的系统在多大程度上能够促进对具有特殊教育需求的多样化学习者的包容性和公平学习？为了支撑这一视角，本文对现有文献进行了主题分析，识别出四个主导主题：任务特定的碎片化AI工具、从单代理到多代理系统的过渡、有限的跨职能整合以及对包容性和可及性关注不足。研究结果揭示了当前AI实施与整体以学习者为中心的教育生态系统需求之间的明显差距。本文综合了挑战，并概述了可扩展的人类对齐和包容性自主AI平台的未来研究方向。重要贡献在于纳入了包容性学习的视角，强调了协调的自主多代理平台如何通过自适应的多模态干预来支持多样化学习者。

cs.AI / 36 / 2605.14277

Parallelizing Counterfactual Regret Minimization

并行化反事实遗憾最小化

Kim, Juho, Sandholm, Tuomas

Abstract

Parallelization has played an instrumental role in the field of artificial intelligence (AI), drastically reducing the time taken to train and evaluate large AI models. In contrast to its impact in the broader field of AI, applying parallelization to computational game solving is relatively unexplored, despite its great potential. In this paper, we parallelize the family of counterfactual regret minimization (CFR) algorithms, which were central to important breakthroughs for solving large imperfect-information games. We present a generalized parallelization framework, reframing CFR as a series of linear algebra operations. Then, existing techniques for parallelizing linear algebra operations can be applied to accelerate CFR. We also describe how our technique can be applied to other tabular members of the CFR family of algorithms, including the state-of-the-art, such as CFR+, discounted CFR, and predictive variants of CFR. Experimentally, we show that our CFR implementation on a GPU is up to four orders of magnitude faster than Google DeepMind OpenSpiel's CFR implementations on a CPU.
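The flavor of "CFR as linear algebra" can be seen in the simplest setting, a zero-sum matrix game, where per-action values are a matrix-vector product and the regret update is elementwise. The sketch below covers only that normal-form special case with plain Python lists. It is not the paper's framework for extensive-form games, and the helper names are assumptions.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def regret_matching(regrets):
    """Play in proportion to positive regret; uniform if none is positive."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total <= 0:
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positive]


def regret_update(regrets, payoff, own_strategy, opp_strategy):
    """One regret update for the row player of a zero-sum matrix game."""
    action_values = matvec(payoff, opp_strategy)  # value of each pure action
    expected = sum(s * v for s, v in zip(own_strategy, action_values))
    return [r + v - expected for r, v in zip(regrets, action_values)]


# Rock-paper-scissors payoffs for the row player (rows: rock, paper, scissors).
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
regrets = regret_update([0.0, 0.0, 0.0], RPS, uniform, [1.0, 0.0, 0.0])
strategy = regret_matching(regrets)
```

Against an opponent who always plays rock, one update pushes all probability mass onto paper. Because every step is a product, sum, or elementwise maximum, the same computation maps naturally onto batched GPU kernels.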

Chinese Translation

并行化在人工智能（AI）领域发挥了重要作用，显著减少了训练和评估大型AI模型所需的时间。与其在更广泛的AI领域的影响相比，将并行化应用于计算博弈求解的研究相对较少，尽管其潜力巨大。本文中，我们对反事实遗憾最小化（CFR）算法族进行了并行化，这些算法在解决大型不完全信息博弈方面是重要突破的核心。我们提出了一个通用的并行化框架，将CFR重新构建为一系列线性代数操作。然后，可以应用现有的线性代数操作并行化技术来加速CFR。我们还描述了我们的技术如何应用于CFR算法族中的其他表格成员，包括最先进的算法，如CFR+、折扣CFR和CFR的预测变体。实验结果表明，我们在GPU上实现的CFR比Google DeepMind OpenSpiel在CPU上的CFR实现快了四个数量级。

cs.AI / 37 / 2605.14294

Precise Verification of Transformers through ReLU-Catalyzed Abstraction Refinement

通过ReLU催化的抽象精炼实现变压器的精确验证

Liu, Hengjie, Zhang, Zhenya, Zhao, Jianjun

Abstract

Formal verification of transformers has become increasingly important due to their widespread deployment in safety-critical applications. Compared to classic neural networks, the inferences of transformers involve highly complex computations, such as dot products in self-attention layers, rendering their verification extremely difficult. Existing approaches explored over-approximation methods by constructing convex constraints to bound the output ranges of transformers, which can achieve high efficiency. However, they may sacrifice verification precision, and consequently introduce significant approximation error that leads to frequent occurrences of false alarms. In this paper, we propose a transformer verification approach that can achieve improved precision. At the core of our approach is a novel usage of ReLU, by which we represent a precise but non-linear bound for dot products such that we can further exploit the rich body of literature for convex relaxation of ReLU to derive precise bounds. We extend two classic approaches to the context of transformers, a rule-based one and an optimization-based one, resulting in two new frameworks for efficient and precise verification. We evaluate our approaches on different model architectures and robustness properties derived from two datasets about sentiment analysis, and compare with the state-of-the-art baseline approach. Compared to the baseline, our approach can achieve significant precision improvement for most of the verification tasks with acceptable compromise of efficiency, which demonstrates the effectiveness of our approach.

Chinese Translation

由于变压器在安全关键应用中的广泛部署，其形式验证变得越来越重要。与经典神经网络相比，变压器的推理涉及高度复杂的计算，例如自注意力层中的点积，这使得它们的验证极为困难。现有的方法通过构建凸约束来界定变压器输出范围，探索了过度近似的方法，这可以实现高效率。然而，这可能牺牲验证的精确性，从而引入显著的近似误差，导致频繁出现误报。在本文中，我们提出了一种能够实现更高精度的变压器验证方法。我们方法的核心是ReLU的创新使用，通过这种方式，我们为点积表示一个精确但非线性的界限，从而进一步利用ReLU的凸松弛文献来推导精确界限。我们将两种经典方法扩展到变压器的背景下，一种基于规则，另一种基于优化，形成两个新的高效且精确的验证框架。我们在不同的模型架构和基于两个情感分析数据集的鲁棒性属性上评估了我们的方法，并与最先进的基线方法进行了比较。与基线相比，我们的方法在大多数验证任务中实现了显著的精度提升，同时在效率上做出了可接受的妥协，这证明了我们方法的有效性。

cs.AI / 38 / 2605.14318

Semantic Feature Segmentation for Interpretable Predictive Maintenance in Complex Systems

复杂系统中可解释的预测性维护的语义特征分割

Mastriani, Emilio, Costa, Alessandro, Incardona, Federico, Munari, Kevin, Spinello, Sebastiano

Abstract

Predictive maintenance in complex systems is often complicated by the heterogeneity and redundancy of monitored variables, which can obscure fault-relevant information and reduce model interpretability. This work proposes a semantic feature segmentation framework that decomposes the monitored feature space into a canonical component, expected to retain the dominant predictive information, and a residual component containing structurally peripheral signals. The segmentation is defined through domain-informed criteria and organizes monitoring variables into functional groups reflecting operational mechanisms such as throughput, latency, pressure, network activity, and structural state. To evaluate the effectiveness of this decomposition, we adopt a predictive perspective in which expected predictive risk is used as an operational proxy for task-relevant information. Experimental results obtained through time-aware cross-validation show that the canonical space consistently achieves lower predictive risk than the residual space across multiple temporal configurations, indicating that the semantic segmentation concentrates the most relevant information for fault anticipation. In addition, the canonical segments exhibit significantly stronger intra-segment coherence than inter-segment dependence, and this structural organization remains stable after redundancy reduction. When compared with the full feature space and with a Principal Component Analysis (PCA) representation, the canonical space achieves comparable predictive performance while preserving the semantic meaning of the original variables. These findings suggest that semantic feature segmentation provides an interpretable and information-preserving decomposition of monitoring signals, enabling competitive predictive performance without sacrificing the operational interpretability required in predictive maintenance applications.
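One simple way to check a claim like "stronger intra-segment coherence than inter-segment dependence" is to compare mean absolute correlations within and across segments. The abstract does not say which measure the authors use, so the Pearson-based measure below, the variable names, and the toy readings are all assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coherence(series, segments):
    """Return (mean |r| within segments, mean |r| across segments)."""
    segment_of = {name: seg for seg, names in segments.items() for name in names}
    intra, inter = [], []
    for a, b in combinations(series, 2):
        r = abs(pearson(series[a], series[b]))
        (intra if segment_of[a] == segment_of[b] else inter).append(r)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Made-up monitoring variables grouped by operational mechanism.
series = {
    "req_per_s": [1, 2, 3, 4],
    "bytes_per_s": [2, 4, 6, 8.5],
    "p99_ms": [5, 1, 4, 2],
    "queue_ms": [6, 2, 5, 2],
}
segments = {
    "throughput": ["req_per_s", "bytes_per_s"],
    "latency": ["p99_ms", "queue_ms"],
}
intra, inter = coherence(series, segments)
```

On this toy data, the within-segment value is clearly higher than the cross-segment one, which is the pattern the abstract reports for its canonical segments.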

Chinese Translation

复杂系统中的预测性维护常常受到监测变量的异质性和冗余性的影响，这可能会掩盖与故障相关的信息并降低模型的可解释性。本研究提出了一种语义特征分割框架，该框架将监测特征空间分解为一个规范成分，预计能够保留主要的预测信息，以及一个包含结构性边缘信号的残余成分。该分割通过领域知识驱动的标准定义，并将监测变量设置为反映操作机制的功能组，如吞吐量、延迟、压力、网络活动和结构状态。为了评估这种分解的有效性，我们采用了一种预测视角，其中预期的预测风险被用作与任务相关信息的操作代理。通过时间感知交叉验证获得的实验结果表明，在多个时间配置中，规范空间的预测风险始终低于残余空间，这表明语义分割集中于故障预警的最相关信息。此外，规范段表现出显著更强的段内一致性而非段间依赖性，并且这种结构组织在冗余减少后仍然保持稳定。与完整特征空间和主成分分析（PCA）表示相比，规范空间实现了可比的预测性能，并且进一步保留了原始变量的语义意义。这些发现表明，语义特征分割提供了一种可解释且信息保留的监测信号分解，能够在不牺牲预测性维护应用所需的操作可解释性的情况下，实现竞争性的预测性能。

cs.AI / 39 / 2605.14322

Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows

代理是否准备好进行教学？针对现实教学工作流程的多阶段基准测试

Chen, Zixin, Liu, Peng, Sheng, Rui, Li, Haobo, Tu, Jianhong, Deng, Xiaodong, Shum, Kashun, Liu, Dayiheng, Qu, Huamin

Abstract

Language agents are increasingly deployed in complex professional workflows, with tutoring emerging as a particularly high-stakes capability that remains largely unmeasured in existing benchmarks. Effective tutor agents require more than producing correct answers or executing accurate tool calls: a robust tutor must diagnose learner state, adapt support over time, make pedagogically justified decisions grounded in educational evidence, and execute interventions within realistic learning-management systems. We introduce EduAgentBench, a source-grounded benchmark for holistically evaluating tutor agents across the full scope of teaching work. It contains 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Tasks are constructed through a pedagogical-insight-driven pipeline and evaluated with complementary verification signals and human review. Across a comprehensive evaluation of frontier models, our findings reveal that current models are generally capable of bounded pedagogical judgment, but still fall short of professional teaching standards in situated tutoring and autonomous teaching-workflow execution. To our knowledge, EduAgentBench is the first theory-grounded and realistic benchmark for evaluating the holistic teaching capability of tutor agents, providing a measurement foundation for developing future tutor agents that can support realistic teaching work.

Chinese Translation

语言代理在复杂的专业工作流程中越来越多地被部署，其中辅导作为一种特别高风险的能力，仍然在现有基准中未得到充分测量。有效的辅导代理不仅需要提供正确的答案或执行准确的工具调用：一个强大的辅导者必须能够诊断学习者状态，随着时间的推移调整支持，做出基于教育证据的教学决策，并在现实的学习管理系统中执行干预。我们引入了EduAgentBench，这是一个基于来源的基准，用于全面评估辅导代理在教学工作全范围内的表现。该基准包含150个经过质量控制的任务，涵盖三个能力领域：专业教学判断、情境多轮辅导和Canvas风格的教学工作流程完成。这些任务通过一个以教学洞察为驱动的流程构建，并通过互补的验证信号和人工评审进行评估。在对前沿模型的全面评估中，我们的发现表明，当前模型在有限的教学判断能力方面通常是可行的，但在情境辅导和自主教学工作流程执行方面仍未达到专业教学标准。据我们所知，EduAgentBench是第一个理论基础和现实基准，用于评估辅导代理的整体教学能力，为开发能够支持现实教学工作的未来辅导代理提供了测量基础。

cs.AI / 40 / 2605.14344

CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation

CrystalReasoner：用于属性条件晶体结构生成的推理与强化学习

Wu, Yuyang, Falletta, Stefano, McGrath, Delia, Yang, Sherry

Abstract

Generative modeling has emerged as a promising approach for crystal structure discovery. However, existing LLM-based generative models struggle with low-level atomic precision, while diffusion-based methods fall short in integrating high-level scientific knowledge. As a result, generated structures are often invalid, unstable, or do not possess desirable properties. To address this gap, we propose CrystalReasoner, an end-to-end LLM framework that generates crystal structures from natural language instructions through reasoning and alignment. CrystalReasoner introduces physical priors as thinking tokens, which include crystallographic symmetry, local coordination environments, and predicted physical properties, before generating atomic coordinates. This bridges the gap between natural language and 3D structures. CrystalReasoner then employs reinforcement learning (RL) with a multi-objective, dense reward function to align generation with physical validity, chemical consistency, and thermodynamic stability. For property-conditioned tasks, we design task-specific reward functions and train specialized models for discrete constraints (e.g., space group) and continuous properties (e.g., elasticity, thermal expansion). Empirical results demonstrate that compared to prior works and baselines without thinking traces or RL, CrystalReasoner obtains better performance on diverse metrics, triples the S.U.N. ratio, and achieves better performance for property-conditioned generation. CrystalReasoner also exhibits adaptive reasoning, increasing reasoning lengths as the number of atoms increases. Our work demonstrates the potential of leveraging thinking traces and RL for generating valid, stable, and property-conditioned crystal structures. Please see our work at https://crystalreasoner.github.io/ .

Chinese Translation

生成建模已成为晶体结构发现的一种有前景的方法。然而，现有的基于大语言模型（LLM）的生成模型在低级原子精度方面存在困难，而基于扩散的方法在整合高级科学知识方面表现不足。因此，生成的结构往往是无效的、不稳定的，或者不具备理想的属性。为了解决这一问题，我们提出了CrystalReasoner，一个端到端的LLM框架，通过推理和对齐从自然语言指令生成晶体结构。CrystalReasoner引入物理先验作为思维标记，包括晶体对称性、局部配位环境和预测的物理属性，然后生成原子坐标。这弥合了自然语言与三维结构之间的差距。CrystalReasoner随后采用强化学习（RL），使用多目标、密集奖励函数，使生成与物理有效性、化学一致性和热力学稳定性对齐。对于属性条件任务，我们设计了特定任务的奖励函数，并为离散约束（例如，空间群）和连续属性（例如，弹性、热膨胀）训练了专门的模型。实证结果表明，与之前的工作和没有思维痕迹或RL的基线相比，CrystalReasoner在多种指标上表现更佳，将S.U.N.比率提高到三倍，并在属性条件生成方面取得了更好的性能。CrystalReasoner还表现出自适应推理，随着原子数量的增加，推理长度也随之增加。我们的工作展示了利用思维痕迹和RL生成有效、稳定且具有属性条件的晶体结构的潜力。请参见我们的工作：https://crystalreasoner.github.io/ 。

cs.AI / 41 / 2605.14355

Herculean: An Agentic Benchmark for Financial Intelligence

赫拉克勒斯：金融智能的代理基准

Peng, Xueqing, Xie, Zhuohan, Cao, Yupeng, Li, Haohang, Qian, Lingfei, Wang, Yan, Zhang, Vincent Jim, He, Huan, Ai, Xuguang, Ma, Linhai, Xiang, Ruoyu, He, Yueru, Han, Yi, Wang, Shuyao, Guo, Yuqing, Jiang, Mingyang, Zhao, Yilun, Dong, Youzhong, Wang, Xiaoyu, Chen, Yankai, Yuan, Ye, Zhang, Qiyuan, Lyu, Fuyuan, Wu, Haolun, Yang, Yonghan, Zhao, Zichen, Dai, Yuyang, Zhang, Fan, Elbadry, Rania, Gull, Ayesha, Safder, Muhammad Usman, Chen, Nuo, Zhu, Fengbin, Cai, Tianshi, Wang, Zimu, Giannouris, Polydoros, Jiang, Yuechen, Liu, Zhiwei, Kabir, Mohsinul, Wang, Yuyan, Zheng, Yixiang, Yu, Yangyang, Liu, Weijin, Cao, Wenbo, Xu, Anke, Lu, Peng, Huang, Jerry, Mo, Fengran, Lin, Mingquan, Tiwari, Prayag, Zhao, Yijia, Basulto, Victor Gutierrez, Liu, Xiao-Yang, Smith, Kaleb E, Pei, Jiahuan, Cohan, Arman, Huang, Jimin, Tang, Yuehua, Lopez-Lira, Alejandro, Chen, Xi, Liu, Xue, Tsujii, Junichi, Nie, Jian-Yun, Ananiadou, Sophia

Abstract

As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can reliably carry out financial professional work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skilled benchmark for agentic financial intelligence spanning four representative workflows, including Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. Across frontier agents, we find agents perform relatively well on Trading and Market Insights, but struggle substantially on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents in turning financial reasoning into dependable workflow execution in high-stakes financial workflows.

Chinese Translation

随着人工智能代理的不断进步，核心问题不再是它们是否能够解决孤立的、定义明确的金融任务，而是它们能否可靠地执行金融专业工作。现有的金融基准仅提供了这一能力的部分视角，因为它们主要评估静态能力，如问答、检索、摘要和分类。我们引入了 Herculean，这是第一个针对代理金融智能的技能基准，涵盖了四个代表性工作流程，包括交易、对冲、市场洞察和审计。每个工作流程都被实例化为一个标准化的基于 MCP 的技能环境，具有其自身的工具、交互动态、约束和成功标准，从而实现对异构代理系统的一致端到端评估。在前沿代理的评估中，我们发现代理在交易和市场洞察方面表现相对良好，但在对冲和审计方面面临显著挑战，因为这些领域需要长时间的协调、状态一致性和结构化验证。总体而言，我们的结果指出当前代理在将金融推理转化为高风险金融工作流程中的可靠执行方面存在关键差距。

cs.AI / 42 / 2605.14358

Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces

揭示过度完整推理轨迹中最小核心的表示几何

Chowdhury, Sanjoy, Manocha, Dinesh

Abstract

Language models often generate long chain-of-thought traces, but it remains unclear how much of this reasoning is necessary for preserving the final prediction. We study this through the lens of overcomplete reasoning traces: generated traces that contain more intermediate steps than are needed to support the model's answer. We define the minimal core as the smallest subset of steps that preserves either the final answer or predictive distribution, and introduce metrics for compression ratio, redundancy mass, step necessity, and necessity concentration. Across six deliberative reasoning benchmarks spanning arithmetic, competition mathematics, expert scientific reasoning, and commonsense multi-hop QA, we find substantial overcompleteness: on average, 46% of steps are removable under greedy minimal-core extraction while preserving the original answer in 86% of cases. We also find that predictive support is concentrated: the top three steps account for 65% of measured necessity mass on average. Beyond compression, minimal cores expose a cleaner geometry of reasoning: compared with full traces, they improve correct-incorrect trace separation by 11 points, reduce estimated intrinsic dimensionality by 34%, and transfer across model families with 85% off-diagonal answer retention. Theoretically, we establish existence of minimal sufficient subsets, local irreducibility guarantees for greedy elimination, and certificates of overcompleteness and sparse necessity. Together, these results suggest that full reasoning traces are often verbose and overcomplete, while minimal cores isolate the effective support underlying language-model predictions.
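The greedy minimal-core extraction mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the single-pass elimination order, and the toy `toy_preserves` predicate (a stand-in for re-querying the model on a reduced trace) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def greedy_minimal_core(
    steps: Sequence[str],
    preserves_answer: Callable[[List[str]], bool],
) -> List[str]:
    """Try to drop each step in order; keep a removal only if the answer survives."""
    core = list(steps)
    i = 0
    while i < len(core):
        candidate = core[:i] + core[i + 1:]
        if preserves_answer(candidate):
            core = candidate  # step i was removable; the next step now sits at index i
        else:
            i += 1            # step i is needed given the steps that remain
    return core


def removed_fraction(steps: Sequence[str], core: Sequence[str]) -> float:
    """Share of steps dropped (an assumed, simplified notion of compression)."""
    return 1 - len(core) / len(steps)


# Hypothetical predicate: the "answer" survives only if both facts are present.
def toy_preserves(subset: List[str]) -> bool:
    return "a = 3" in subset and "b = 4" in subset


trace = ["restate the problem", "a = 3", "check units", "b = 4", "so a + b = 7"]
core = greedy_minimal_core(trace, toy_preserves)
print(core)                           # ['a = 3', 'b = 4']
print(removed_fraction(trace, core))  # three of five steps removed
```

In a real setting the predicate would compare the model's answer (or predictive distribution) on the reduced trace with the original, so each call costs one model evaluation.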

Chinese Translation

语言模型通常生成长链式思维轨迹，但尚不清楚这些推理中有多少是保持最终预测所必需的。我们通过过度完整推理轨迹的视角进行研究：生成的轨迹包含比支持模型答案所需的更多中间步骤。我们将最小核心定义为保留最终答案或预测分布的最小步骤子集，并引入压缩比、冗余质量、步骤必要性和必要性集中度等度量。在涵盖算术、竞争数学、专家科学推理和常识多跳问答的六个深思熟虑推理基准中，我们发现显著的过度完整性：在贪婪最小核心提取下，平均有46%的步骤可以被移除，同时在86%的情况下保留原始答案。我们还发现，预测支持是集中性的：前三个步骤平均占测量必要质量的65%。除了压缩，最小核心揭示了更清晰的推理几何：与完整轨迹相比，它们将正确与错误轨迹的分离提高了11个百分点，减少了估计的内在维度34%，并在模型家族之间转移时保持85%的非对角答案保留。从理论上讲，我们建立了最小充分子集的存在性、贪婪消除的局部不可约性保证，以及过度完整性和稀疏必要性的证明。综合来看，这些结果表明，完整的推理轨迹往往是冗长和过度完整的，而最小核心则隔离了语言模型预测背后的有效支持。

cs.AI / 43 / 2605.14389

Nexus: An Agentic Framework for Time Series Forecasting

Nexus：一种用于时间序列预测的代理框架

Das, Sarkar Snigdha Sarathi, Goyal, Palash, Parmar, Mihir, Peng, Nanyun, Tirumalashetty, Vishy, Li, Chun-Liang, Zhang, Rui, Yoon, Jinsung, Pfister, Tomas

Abstract

Time series forecasting is not just numerical extrapolation, but often requires reasoning with unstructured contextual data such as news or events. While specialized Time Series Foundation Models (TSFMs) excel at forecasting based on numerical patterns, they remain unaware of real-world textual signals. Conversely, while LLMs are emerging as zero-shot forecasters, their performance remains uneven across domains and contextual grounding. To bridge this gap, we introduce Nexus, a multi-agent forecasting framework that decomposes prediction into specialized stages: isolating macro-level and micro-level temporal fluctuations, and integrating contextual information when available before synthesizing a final forecast. This decomposition enables Nexus to adapt from seasonal signals to volatile, event-driven information without relying on external statistical anchors or monolithic prompting. We show that current-generation LLMs possess substantially stronger intrinsic forecasting ability than previously recognized, depending critically on how numerical and contextual reasoning are organized. Evaluated on data strictly succeeding LLM knowledge cutoffs spanning Zillow real estate metrics and volatile stock market equities, Nexus consistently matches or outperforms state-of-the-art TSFMs and strong LLM baselines. Beyond numerical accuracy, Nexus produces high-quality reasoning traces that explicitly show the fundamental drivers behind each forecast. Our results establish that real-world forecasting is an agentic reasoning problem extending well beyond only sequence modeling.

Chinese Translation

时间序列预测不仅仅是数值外推，往往还需要对非结构化的上下文数据进行推理，例如新闻或事件。尽管专门的时间序列基础模型（TSFMs）在基于数值模式的预测方面表现出色，但它们对现实世界的文本信号却缺乏感知。相反，尽管大型语言模型（LLMs）作为零-shot 预测工具正在崭露头角，但它们在不同领域和上下文基础上的表现仍然不均衡。为了弥补这一差距，我们提出了 Nexus，一个多代理预测框架，该框架将预测分解为专门的阶段：隔离宏观和微观层面的时间波动，并在可用时整合上下文信息，然后综合生成最终预测。这种分解使 Nexus 能够从季节性信号适应到波动性、事件驱动的信息，而无需依赖外部统计锚点或单一提示。我们展示了当前一代 LLMs 具有比之前认知的更强的内在预测能力，这在很大程度上取决于数值和上下文推理的组织方式。在严格遵循 LLM 知识截止日期的数据上进行评估，这些数据涵盖了 Zillow 房地产指标和波动的股票市场，Nexus 一直能够与最先进的 TSFMs 和强大的 LLM 基线相匹配或超越。除了数值准确性，Nexus 还生成高质量的推理轨迹，明确展示每个预测背后的基本驱动因素。我们的结果表明，现实世界的预测是一个代理推理问题，远远超出了仅仅是序列建模的范畴。

cs.AI / 44 / 2605.14392

Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

学习构建环境：通过可验证环境合成实现自我演化推理强化学习

Shi, Yucheng, Liang, Zhenwen, Panaganti, Kishan, Yu, Dian, Yu, Wenhao, Mi, Haitao

Abstract

We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes references, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit stable solve-verify asymmetry. That is, the model must be able to write an oracle once that it cannot reliably execute in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others are intrinsically hard to solve but easy to verify, like planted subset-sum or constraint satisfaction. Both create a durable gap between proposing and solving that the policy cannot close by gaming the verifier, and it is this gap that keeps reward informative as the learner improves. We instantiate this view in EvoEnv, a single-policy generator, solver method that synthesizes Python environments from ten seeds and admits them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The strongest evidence comes from the already-strong regime: on Qwen3-4B-Thinking, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce the average, while EvoEnv improves it from 72.4 to 74.8, a relative gain of 3.3%. Stable self-improvement, we suggest, depends not on producing more synthetic data, but on models learning to construct worlds whose difficulty stays structurally beyond their own reach.
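The "hard to solve but easy to verify" case named in the abstract, planted subset-sum, can be sketched as a tiny environment that samples instances and scores responses. This is an illustrative sketch only, not EvoEnv's generated code: the function names, the value range, and the index-list answer format are assumptions.

```python
import random
from typing import List, Tuple


def sample_instance(n: int, k: int, seed: int) -> Tuple[List[int], int, List[int]]:
    """Return (numbers, target, planted_indices); the target is reachable by construction."""
    rng = random.Random(seed)
    numbers = [rng.randint(1, 10**6) for _ in range(n)]
    planted = sorted(rng.sample(range(n), k))      # hidden solution
    target = sum(numbers[i] for i in planted)
    return numbers, target, planted


def verify(numbers: List[int], target: int, answer: List[int]) -> bool:
    """Accept an answer if it lists distinct, valid indices whose values sum to the target."""
    if len(set(answer)) != len(answer):
        return False
    if any(not 0 <= i < len(numbers) for i in answer):
        return False
    return sum(numbers[i] for i in answer) == target


numbers, target, planted = sample_instance(n=20, k=5, seed=0)
print(verify(numbers, target, planted))  # True: the planted subset always passes
print(verify(numbers, target, []))       # False: the empty subset sums to 0
```

Verification is a single linear pass, while finding a subset without the planted indices requires search, which is the asymmetry the abstract relies on.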

Chinese Translation

我们追求一个自我提升语言模型的愿景，其中模型不仅仅生成问题或轨迹以供模仿，而是构建训练其自身的环境。在零数据推理强化学习中，这将自我提升的框架从数据生成循环转变为环境构建循环，其中每个工件都是一个可重用的可执行对象，能够抽样实例、计算参考值并评分响应。这个愿景是否能够持续改进取决于一个关键特性：环境必须表现出稳定的求解-验证不对称性，模型必须能够编写一个在新实例上无法可靠执行的神谕。这个不对称性有两种互补形式。一些任务在算法上难以推理，但作为代码却是微不足道的：动态程序或图遍历，一旦编译，可以产生无限多的校准实例。其他任务则本质上难以解决，但容易验证，例如植入子集和或约束满足。这两者之间创造了一个持久的差距，提议与解决之间的差距，策略无法通过操纵验证者来弥补，而正是这个差距使得奖励在学习者改进时保持信息性。我们在EvoEnv中实例化了这一观点，这是一种单策略生成器、求解器方法，从十个种子合成Python环境，并仅在经过分阶段验证、语义自审、求解器相对难度校准和新颖性检查后才允许使用。最强有力的证据来自于已经强大的领域：在Qwen3-4B-Thinking上，固定公共数据的RLVR和固定手工环境的RLVR降低了平均值，而EvoEnv将其从72.4提升至74.8，获得了3.3%的相对增益。我们建议，稳定的自我提升并不依赖于生成更多的合成数据，而在于模型学习构建其难度在结构上超出自身能力的世界。

cs.AI / 45 / 2605.14398

Coding Agent Is Good As World Simulator

编码代理作为世界模拟器的有效性

Wang, Hongyu, Wang, Jingquan, Zou, Bocheng, Serban, Radu, Negrut, Dan

Abstract

World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demonstrating impressive progress in generating visually plausible dynamics. However, because these models typically infer dynamics from video and represent them in latent states, they do not explicitly enforce physical constraints. As a result, the generated video rollouts are not physically plausible, exhibiting unstable contacts, distorted shapes, or inconsistent motion. In this paper, we present an agentic framework constructing physics-based world models through executable simulation code. The framework coordinates planning, code generation, visual review, and physics analysis agents. The planning agent converts the natural language prompt into a structured scene plan, the code agent implements it as executable simulation code, and the visual review agent provides visual feedback while the physics analysis agent checks physical consistency. The code is iteratively revised based on the feedback until the simulation matches the prompt requirements and physical constraints. Experimental results show that our framework outperforms advanced video-based models in physical accuracy, instruction fidelity and visual quality, which could be applied to various scenarios including driving simulation and embodied robot tasks.

Chinese Translation

世界模型作为构建交互式仿真环境的强大范式已经出现，最近基于视频的方法在生成视觉上合理的动态方面取得了显著进展。然而，由于这些模型通常从视频中推断动态并以潜在状态表示它们，因此并未明确强加物理约束。因此，生成的视频回放在物理上并不合理，表现出不稳定的接触、扭曲的形状或不一致的运动。在本文中，我们提出了一种代理框架，通过可执行的仿真代码构建基于物理的世界模型。该框架协调规划、代码生成、视觉审查和物理分析代理。规划代理将自然语言提示转换为结构化场景计划，代码代理将其实现为可执行的仿真代码，而视觉审查代理在此过程中提供视觉反馈，物理分析代理则检查物理一致性。代码根据反馈进行迭代修订，直到仿真符合提示要求和物理约束。实验结果表明，我们的框架在物理准确性、指令保真度和视觉质量方面优于先进的基于视频的模型，能够应用于包括驾驶仿真和具身机器人任务在内的各种场景。

cs.AI / 46 / 2605.14407

Metis AI: The Overlooked Middle Zone Between AI-Native and World-Movers

Metis AI：被忽视的AI原生与世界变革者之间的中间区域

Li, Xiang

Abstract

The dominant discourse on AI limitations frames the boundary of AI capability as a divide between digital tasks (where AI excels) and physical tasks (where embodiment is required). We argue this framing misses the most consequential boundary: the one within digital tasks. We identify a class of tasks we call Metis AI, named for the Greek concept of metis (practical, contextual knowledge), that are performed entirely on computers yet resist reliable AI automation. These tasks are not computationally intractable; they are institutionally, socially, and normatively entangled in ways that defeat algorithmic approaches. We distinguish constitutive metis (knowledge destroyed by the act of formalization) from operational metis (system-specific familiarity that automation can progressively absorb), and propose five structural characteristics that define the Metis AI zone: consequential irreversibility, relational irreducibility, normative open texture, adversarial co-evolution, and accountability anchoring. We ground each in established theory from across the social sciences, philosophy, and humanitarian practice, argue that these characteristics are properties of the tasks themselves rather than limitations of current models, and show that the appropriate design response is not better automation but centaur architectures in which humans lead and AI supports.

Chinese Translation

关于人工智能（AI）局限性的主导话语将AI能力的边界框定为数字任务（AI擅长的领域）与物理任务（需要具身性的领域）之间的分界。我们认为这种框架忽视了最重要的边界：数字任务内部的边界。我们识别出一类任务，称之为Metis AI，得名于希腊概念metis（实践性、情境性知识），这些任务完全在计算机上执行，但却无法可靠地实现AI自动化。这些任务并非计算上不可处理；它们在制度、社会和规范上交织在一起，以致于算法方法无法奏效。我们将构成性metis（因形式化行为而被破坏的知识）与操作性metis（自动化可以逐步吸收的系统特定熟悉度）区分开来，并提出定义Metis AI区域的五个结构特征：重要的不可逆性、关系的不可简化性、规范的开放性、对抗性的共同进化和问责制的锚定。我们将每个特征与社会科学、哲学和人道主义实践中的既有理论相结合，认为这些特征是任务本身的属性，而非当前模型的局限，并展示适当的设计响应不是更好的自动化，而是人类主导、AI支持的半人马架构。

cs.AI / 47 / 2605.14416

A Unified Knowledge Embedded Reinforcement Learning-based Framework for Generalized Capacitated Vehicle Routing Problems

基于统一知识嵌入强化学习框架的广义容量约束车辆路径问题

Wang, Wen, Wu, Xiangchen, Wang, Liang, Hu, Hao, Tao, Xianping

Abstract

The Capacitated Vehicle Routing Problem (CVRP) is a fundamental NP-hard problem with broad applications in logistics and transportation. Real-world CVRPs often involve diverse objectives and complex constraints, such as time windows or backhaul requirements, motivating the development of a unified solution framework. Recent reinforcement learning (RL) approaches have shown promise in combinatorial optimization, yet they rely on end-to-end learning and lack explicit problem-solving knowledge, limiting solution quality. In this paper, we propose a knowledge-embedded framework inspired by the Route-First Cluster-Second heuristics. It incorporates knowledge at two levels: (1) decomposing CVRPs into the route-first and cluster-second subproblems, and (2) leveraging dynamic programming to solve the second subproblem, whose results guide the RL-based constructive solver to solve the first problem. To mitigate partial observability caused by problem decomposition, we introduce a unified history-enhanced context processing module. Extensive experiments show that this framework achieves superior solution quality compared with state-of-the-art learning-based methods, with a smaller gap to classical heuristics, demonstrating strong generalization across diverse CVRP variants.
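The route-first cluster-second idea behind the abstract can be illustrated with the classic dynamic-programming split: given a fixed customer order (the "route-first" giant tour), find the cheapest way to cut it into capacity-feasible routes. This is a sketch of the textbook split procedure under assumed inputs (2D points, Euclidean distance, one capacity constraint), not the paper's framework; all names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def dist(a: Point, b: Point) -> float:
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def split_tour(
    depot: Point, tour: Sequence[Point], demand: Sequence[float], capacity: float
) -> Tuple[float, List[List[int]]]:
    """Cut a fixed customer order into routes of minimum total length."""
    n = len(tour)
    inf = float("inf")
    best = [0.0] + [inf] * n   # best[j]: min cost to serve the first j customers
    prev = [0] * (n + 1)       # prev[j]: first customer index of the last route
    for i in range(n):         # a route starts at customer i ...
        if best[i] == inf:
            continue
        load, length = 0.0, 0.0
        for j in range(i, n):  # ... and ends at customer j
            load += demand[j]
            if load > capacity:
                break
            length += dist(depot, tour[j]) if j == i else dist(tour[j - 1], tour[j])
            cost = best[i] + length + dist(tour[j], depot)
            if cost < best[j + 1]:
                best[j + 1], prev[j + 1] = cost, i
    if best[n] == inf:
        raise ValueError("some customer demand exceeds the vehicle capacity")
    routes, j = [], n
    while j > 0:               # walk the predecessor links back to the start
        routes.append(list(range(prev[j], j)))
        j = prev[j]
    routes.reverse()
    return best[n], routes


cost, routes = split_tour(
    depot=(0.0, 0.0),
    tour=[(1.0, 0.0), (2.0, 0.0), (-1.0, 0.0), (-2.0, 0.0)],
    demand=[1, 1, 1, 1],
    capacity=2,
)
print(cost, routes)  # 8.0 [[0, 1], [2, 3]]
```

The split runs in O(n^2) time for a fixed order, so the quality of the final solution depends on how good the giant tour is, which is the part the abstract assigns to the RL-based constructive solver.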

Chinese Translation

容量约束车辆路径问题（CVRP）是一个基本的 NP 难问题，在物流和运输领域有广泛的应用。现实世界中的 CVRP 通常涉及多样化的目标和复杂的约束条件，如时间窗口或回程要求，这促使了统一解决框架的开发。近期的强化学习（RL）方法在组合优化中展现出了潜力，但它们依赖于端到端学习，并缺乏明确的问题解决知识，从而限制了解决方案的质量。本文提出了一种受路线优先-聚类次优（Route-First Cluster-Second）启发的知识嵌入框架。该框架在两个层面上融入知识：（1）将 CVRP 分解为路线优先和聚类次优子问题；（2）利用动态规划解决第二个子问题，其结果指导基于 RL 的构造求解器解决第一个问题。为了缓解由于问题分解造成的部分可观测性，我们引入了一个统一的历史增强上下文处理模块。大量实验表明，该框架在解决方案质量上优于最先进的基于学习的方法，并且与经典启发式方法的差距更小，展示了在多样化 CVRP 变体中的强泛化能力。

cs.AI / 48 / 2605.14420

DVMap: Fine-Grained Pluralistic Value Alignment via High-Consensus Demographic-Value Mapping

DVMap：通过高共识人口价值映射实现细粒度多元价值对齐

Zhu, Pengyun, Ren, Yuqi, Wang, Zhen, Yang, Lei, Xiong, Deyi

Abstract

Current Large Language Models (LLMs) typically rely on coarse-grained national labels for pluralistic value alignment. However, such macro-level supervision often obscures intra-country value heterogeneity, yielding a loose alignment. We argue that resolving this limitation requires shifting from national labels to multi-dimensional demographic constraints, which can identify groups with predictable, high-consensus value preference. To this end, we propose DVMap (High-Consensus Demographic-Value Mapping), a framework for fine-grained pluralistic value alignment. In this framework, we first present a demographic archetype extraction strategy to construct a high-quality value alignment corpus of 56,152 samples from the World Values Survey (WVS) by strictly retaining respondents with consistent value preferences under identical demographics. Over this corpus, we introduce a Structured Chain-of-Thought (CoT) mechanism that explicitly guides LLMs to reason about demographic-value correlations. Subsequently, we employ Group Relative Policy Optimization (GRPO) to achieve adaptive anchoring of value distributions. To rigorously evaluate generalization, we further establish a triple-generalization benchmark (spanning cross-demographic, cross-country, and cross-value) comprising 21,553 samples. Experimental results demonstrate that DVMap effectively learns the manifold mapping from demographics to values, exhibiting strong generalization and robustness. On cross-demographic tests, Qwen3-8B-DVMap achieves 48.6% accuracy, surpassing the advanced open-source LLM DeepSeek-v3.2 (45.1%). The source code and dataset are available at https://github.com/EnlightenedAI/DVMap.

Chinese Translation

当前的大型语言模型（LLMs）通常依赖于粗粒度的国家标签进行多元价值对齐。然而，这种宏观层面的监督往往掩盖了国家内部的价值异质性，导致对齐效果松散。我们认为，解决这一局限性需要从国家标签转向多维人口约束，这可以识别出具有可预测的高共识价值偏好的群体。为此，我们提出了DVMap（高共识人口价值映射），这是一个用于细粒度多元价值对齐的框架。在该框架中，我们首先提出了一种人口原型提取策略，通过严格保留在相同人口特征下具有一致价值偏好的受访者，从世界价值调查（WVS）中构建了一个包含56,152个样本的高质量价值对齐语料库。在此语料库上，我们引入了一种结构化思维链（CoT）机制，明确指导LLMs推理人口与价值之间的关联。随后，我们采用群体相对政策优化（GRPO）实现价值分布的自适应锚定。为了严格评估模型的泛化能力，我们进一步建立了一个三重泛化基准（涵盖跨人口、跨国家和跨价值），包含21,553个样本。实验结果表明，DVMap有效地学习了人口与价值之间的多样映射，展现出强大的泛化能力和鲁棒性。在跨人口测试中，Qwen3-8B-DVMap的准确率达到48.6%，超过了先进的开源LLM DeepSeek-v3.2（45.1%）。源代码和数据集可在 https://github.com/EnlightenedAI/DVMap 获取。

cs.AI / 49 / 2605.14438

BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

BEAM：用于MoE动态路由的二元专家激活掩码

Wu, Juntong, Cheng, Jialiang, Yin, Qishen, Dai, Yue, Yan, Yuliang, Lv, Fuyu, Dan, Ou, Yuan, Li

Abstract

Mixture-of-Experts (MoE) architectures enhance the efficiency of large language models by activating only a subset of experts per token. However, standard MoE employs a fixed Top-K routing strategy, leading to redundant computation and suboptimal inference latency. Existing acceleration methods either require costly retraining with architectural changes or suffer from severe performance drop at high sparsity due to train-inference mismatch. To address these limitations, we propose BEAM (Binary Expert Activation Masking), a novel method that learns token-adaptive expert selection via trainable binary masks. With a straight-through estimator and an auxiliary regularization loss, BEAM induces dynamic expert sparsity through end-to-end training while maintaining model capability. We further implement an efficient custom CUDA kernel for BEAM, ensuring seamless integration with the vLLM inference framework. Experiments show that BEAM retains over 98% of the original model's performance while reducing MoE layer FLOPs by up to 85%, achieving up to 2.5× faster decoding and 1.4× higher throughput, demonstrating its effectiveness as a practical, plug-and-play solution for efficient MoE inference.

Chinese Translation

混合专家（Mixture-of-Experts, MoE）架构通过每个标记仅激活一部分专家，从而提高大型语言模型的效率。然而，标准的MoE采用固定的Top-K路由策略，导致冗余计算和次优的推理延迟。现有的加速方法要么需要昂贵的重训练并进行架构更改，要么在高稀疏性下由于训练与推理的不匹配而严重下降性能。为了解决这些限制，我们提出了BEAM（二元专家激活掩码），这是一种通过可训练的二元掩码学习标记自适应专家选择的新方法。通过直通估计器和辅助正则化损失，BEAM在端到端训练过程中诱导动态专家稀疏性，同时保持模型能力。我们进一步实现了一个高效的自定义CUDA内核，以确保与vLLM推理框架的无缝集成。实验表明，BEAM在减少MoE层FLOPs高达85%的同时，保持了原始模型性能的98%以上，实现了高达2.5倍的解码速度和1.4倍的吞吐量，证明了其作为高效MoE推理的实用即插即用解决方案的有效性。

cs.AI / 50 / 2605.14440

Synthesizing POMDP Policies: Sampling Meets Model-checking via Learning

合成部分可观测马尔可夫决策过程政策：采样与模型检验的学习结合

Chakraborty, Debraj, Majumdar, Anirban, Mathew, Prince, Mukherjee, Sayan, Raskin, Jean-François

Abstract

Partially Observable Markov Decision Processes (POMDPs) are the standard framework for decision-making under uncertainty. While sampling-based methods scale well, they lack formal correctness guarantees, making them unsuitable for safety-critical applications. Conversely, formal synthesis techniques provide correctness-by-construction but often struggle with scalability, as general POMDP synthesis is undecidable. To bridge this gap, we propose a synthesis framework that integrates sampling, automata learning, and model-checking. Inspired by Angluin's $L^*$ algorithm, our approach utilizes sampling as a membership oracle and model-checking as an equivalence oracle. This enables the synthesis of finite-state controllers with formal guarantees, provided the sampling-induced policy is regular. We establish a relative completeness result for this framework. Experimental results from our prototypical implementation demonstrate that this method successfully solves threshold-safety problems that remain challenging for existing formal synthesis tools. We believe our algorithm serves as a valuable component in a portfolio approach to tackling the inherent difficulty of POMDP synthesis problems.

Chinese Translation

部分可观测马尔可夫决策过程（POMDP）是处理不确定性决策的标准框架。尽管基于采样的方法具有良好的扩展性，但它们缺乏正式的正确性保证，因此不适合安全关键应用。相反，正式合成技术提供了构造上的正确性，但通常在扩展性方面面临挑战，因为一般的 POMDP 合成是不可判定的。为了弥补这一差距，我们提出了一种合成框架，整合了采样、自动机学习和模型检验。我们的方案受到 Angluin 的 $L^*$ 算法的启发，利用采样作为成员资格oracle，利用模型检验作为等价oracle。这使得在采样引起的政策是正则的前提下，可以合成具有正式保证的有限状态控制器。我们为该框架建立了相对完备性结果。我们原型实现的实验结果表明，该方法成功解决了现有正式合成工具仍然面临挑战的阈值安全问题。我们相信我们的算法在应对 POMDP 合成问题的固有困难方面，作为一种组合方法中的重要组成部分，具有重要价值。

cs.AI / 51 / 2605.14443

Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience

通过经验的迭代蒸馏为黑箱大型语言模型中的多步骤推理和工具使用提供提示策略

Sayana, Krishna, Todi, Ketan, Jash, Ambarish

Abstract

The shift toward interacting with frozen, "black-box" Large Language Models (LLMs) has transformed prompt engineering from a heuristic exercise into a critical optimization challenge. We propose a Reinforcement Learning (RL) framework for training learned prompting policies via iterative distillation of experience. In this architecture, a lightweight prompter model is optimized to maximize task-specific rewards for a larger, frozen worker LLM. By utilizing a contrastive experience buffer that couples scalar rewards with dense textual critiques, our approach effectively amortizes iterative prompt refinement into single-shot policy weights. Our experimental analysis focuses on the Big Bench Extra Hard (BBEH) and Tau-bench suites, covering a diverse range of multi-step reasoning and tool-use tasks. We demonstrate significant gains, improving performance from 55% to 90% in logic-intensive reasoning and 74% to 91% in tool-use tasks. Furthermore, we analyze the structural evolution of prompts, demonstrating how the policy discovers specialized algorithmic heuristics. We provide comprehensive comparisons against state-of-the-art evolutionary baselines like GEPA, showing that iterative distillation achieves superior performance with higher sample efficiency.

Chinese Translation

与冻结的“黑箱”大型语言模型（LLMs）进行交互的转变使得提示工程从一种启发式的练习变成了一项关键的优化挑战。我们提出了一种强化学习（RL）框架，通过经验的迭代蒸馏来训练学习到的提示策略。在该架构中，一个轻量级的提示模型被优化以最大化针对一个更大、冻结的工作LLM的任务特定奖励。通过利用将标量奖励与密集文本批评结合的对比经验缓冲区，我们的方法有效地将迭代提示优化转化为单次策略权重。我们的实验分析集中在 Big Bench Extra Hard (BBEH) 和 Tau-bench 套件上，涵盖了多样化的多步骤推理和工具使用任务。我们展示了显著的提升，在逻辑密集型推理中将性能从 55% 提高到 90%，在工具使用任务中从 74% 提高到 91%。此外，我们分析了提示的结构演变，展示了策略如何发现专业的算法启发式。我们提供了与最先进的进化基线（如 GEPA）的全面比较，表明迭代蒸馏在样本效率更高的情况下实现了更优的性能。

cs.AI / 52 / 2605.14455

Intelligence Impact Quotient (IIQ): A Framework for Measuring Organizational AI Impact

智能影响商数 (IIQ)：衡量组织人工智能影响的框架

Rajah, Chandan, Sengupta, Neha, Castanedo, Federico, Mills, Robin, Bahree, Amit, Muthukrishnan, Ramesh Krishnan, Murray, Larry

Abstract

The Intelligence Impact Quotient (IIQ) is a composite metric intended to quantify the depth to which AI systems are integrated into organizational work and their impact. Rather than treating access counts or aggregate token volume as sufficient evidence of impact, IIQ combines a novelty-weighted, time-decayed token stock with usage frequency, a grace-period recency gate, organizational leverage, task complexity, and autonomy. The formulation produces a raw Intelligence Adoption Index (IAI) and a normalized 0-1000 IIQ index for comparison between heterogeneous users and units. We also derive sub-daily update rules and a bounded interpretation layer for estimated efficiency and financial impact. The paper positions IIQ as a deployment-oriented measurement framework: a formal proposal for tracking AI embedding in workflows, not a direct measure of model capability or a substitute for causal productivity evaluation. Synthetic scenarios illustrate how the revised metric distinguishes between frequent low-leverage use, semantically repetitive prompting, and more autonomous, higher-consequence AI-assisted work.

Chinese Translation

智能影响商数 (IIQ) 是一个复合指标，旨在量化人工智能系统在组织工作中的集成深度及其影响。IIQ 不仅仅将访问次数或总令牌量视为影响的充分证据，而是结合了新颖性加权的时间衰减令牌存量、使用频率、宽限期最近性门槛、组织杠杆、任务复杂性和自主性。该公式生成了一个原始智能采纳指数 (IAI) 和一个标准化的 0-1000 IIQ 指数，以便于不同用户和单位之间的比较。我们还推导了次日更新规则和一个有界解释层，以估算效率和财务影响。本文将 IIQ 定位为一个以部署为导向的测量框架：一个用于跟踪人工智能在工作流程中嵌入的正式提案，而不是模型能力的直接测量或因果生产力评估的替代。合成场景展示了修订后的指标如何区分频繁的低杠杆使用、语义重复的提示以及更自主的、高后果的人工智能辅助工作。

cs.AI / 53 / 2605.14457

Stateful Reasoning via Insight Replay

通过洞察重播进行状态化推理

Lei, Bin, Ding, Caiwen, Yang, Jiachen, Li, Ang, Wang, Xin Eric

Abstract

Chain-of-Thought (CoT) reasoning has become a foundation for eliciting multi-step reasoning in large language models, but recent studies show that its benefits do not scale monotonically with chain length: while longer CoT generally enables a model to tackle harder problems, on a given problem, accuracy typically increases with CoT length up to a point, after which it declines. We identify a major cause of this phenomenon: as the CoT grows, the model's attention to critical insights produced earlier in the trace gradually weakens, making those insights progressively less accessible when they are most needed. Therefore, we propose InsightReplay, a stateful reasoning approach in which the model periodically extracts critical insights from its reasoning trace and replays them near the active generation frontier, keeping them accessible as the reasoning scales. Extensive experiments on a 2×3×4 benchmark grid, covering model scales {8B, 30B}, model families {Qwen3.5, DeepSeek-R1-Distill-Qwen, Gemma-4}, and reasoning benchmarks {AIME, HMMT, GPQA Diamond, LiveCodeBench v5}, show that 3-round InsightReplay yields accuracy gains across all 24 settings, with an average improvement of +1.65 points over standard CoT, and a largest single-setting gain of +9.2 points on R1-Distill-32B's LiveCodeBench v5 subset. Our results suggest that the effectiveness of test-time scaling depends not only on how much a model reasons, but also on whether critical intermediate insights remain accessible throughout long reasoning trajectories.

Chinese Translation

链式思维（Chain-of-Thought，CoT）推理已成为引导大型语言模型进行多步推理的基础，但最近的研究表明，其优势并不随着链长的增加而单调提升：虽然较长的 CoT 通常使模型能够处理更复杂的问题，但在特定问题上，准确性通常随着 CoT 长度的增加而提升，直到某个点后又开始下降。我们确定了这一现象的主要原因：随着 CoT 的增长，模型对推理轨迹中早期产生的关键洞察的关注逐渐减弱，使得这些洞察在最需要时变得越来越难以获取。因此，我们提出了 InsightReplay，一种状态化推理方法，其中模型定期从其推理轨迹中提取关键洞察，并在活跃生成前沿附近重播这些洞察，从而在推理扩展时保持其可获取性。在一个覆盖模型规模 8B 和 30B、模型系列 Qwen3.5、DeepSeek-R1-Distill-Qwen、Gemma-4 以及推理基准 AIME、HMMT、GPQA Diamond、LiveCodeBench v5 的 2×3×4 基准网格上进行的广泛实验表明，3轮 InsightReplay 在所有 24 种设置中均实现了准确性提升，平均提高了 +1.65 分，而在 R1-Distill-32B 的 LiveCodeBench v5 子集上获得了最大的单一设置提升 +9.2 分。我们的结果表明，测试时扩展的有效性不仅取决于模型推理的深度，还取决于关键中间洞察在长推理轨迹中是否保持可获取性。

cs.AI / 54 / 2605.14458

OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance

OmniDrop：通过查询引导的全层级令牌剪枝用于全模态大语言模型

Park, Yeo Jeong, Jang, Hyemi, Choi, Minseo, Lee, Jongsun, Choi, Jooyoung, Jeon, Yongkweon

Abstract

Omni-modal large language models have demonstrated remarkable potential in holistic multimodal understanding; however, the token explosion caused by high-resolution audio and video inputs remains a critical bottleneck for real-time applications and long-form reasoning. Existing omni-modal token compression methods typically prune tokens at the input embedding level, relying on audio-video similarity or temporal co-occurrence as proxies for semantic relevance. In practice, such assumptions are often unreliable. To address this limitation, we propose OmniDrop, a training-free, layer-wise token pruning framework that progressively prunes audiovisual tokens within the LLM decoder layers rather than at the input-level, allowing early layers to preserve sufficient omni-modal information fusion before aggressively removing tokens in deeper layers. We further utilize text queries as guidance for modality-agnostic and task-adaptive token pruning. We also introduce a temporal diversity score that encourages balanced token survival to preserve global temporal context. Experimental results across various audiovisual benchmarks demonstrate that OmniDrop outperforms all baselines by up to 3.58 points while reducing prefill latency by up to 40% and memory usage by up to 14.7%.

Chinese Translation

全模态大语言模型在整体多模态理解方面展现了显著潜力；然而，高分辨率音频和视频输入所导致的令牌爆炸仍然是实时应用和长篇推理的关键瓶颈。现有的全模态令牌压缩方法通常在输入嵌入层面进行令牌剪枝，依赖音频-视频相似性或时间共现作为语义相关性的代理。然而，在实际应用中，这些假设往往不可靠。为了解决这一局限性，我们提出了OmniDrop，这是一种无训练的层级令牌剪枝框架，它在LLM解码器层中逐步剪枝视听令牌，而不是在输入层面进行剪枝，从而使早期层能够在深入层激进去除令牌之前保留足够的全模态信息融合。我们进一步利用文本查询作为引导，实现模态无关和任务自适应的令牌剪枝。我们还引入了一种时间多样性评分，鼓励平衡令牌生存，以保留全局时间上下文。在各种视听基准测试中的实验结果表明，OmniDrop在性能上优于所有基线，提升幅度可达3.58分，同时将预填充延迟减少了最多40%，内存使用降低了最多14.7%。

cs.AI / 55 / 2605.14465

From Table to Cell: Attention for Better Reasoning with TABALIGN

从表格到单元格：通过TABALIGN实现更好的推理

Kwok, Tung Sum Thomas, Zhang, Zeyong, Wang, Xinyu, Wang, Chunhe, Lin, Xiaofeng, Wu, Hanwei, Ding, Lei, Cheng, Guang, Guo, Zhijiang

Abstract

Multi-step LLM reasoning over structured tables fails because planning and execution share no explicit cell-grounding contract. Existing methods constrain the planner to a left-to-right factorization at odds with table permutation invariance, and score intermediate states by generated content alone, overlooking cell grounding. We conduct a pilot study showing that diffusion language models (DLMs) produce more human-aligned and permutation-stable cell attention on tables than autoregressive models, with a 40.2% median reduction in attention-AUROC variability under row reordering. Motivated by this, we propose TABALIGN, a planned table reasoning framework that operationalizes the contract. TABALIGN pairs a masked DLM planner, whose bidirectional denoising emits plan steps as binary cell masks, with TABATTN, a lightweight verifier trained on 1,600 human-verified attention standards to score each step by its attention overlap with the plan-designated mask. Across eight benchmarks covering table question answering and fact verification, TABALIGN improves average accuracy by 15.76 percentage points over the strongest open-source baseline at comparable 8B-class scale, with a matched-backbone ablation attributing 2.87 percentage points of this gain to the DLM planner over an AR planner on a fixed reasoner. Cleaner DLM plans also accelerate downstream reasoning execution by 44.64%.
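One plausible reading of "attention overlap with the plan-designated mask" is the share of attention mass that falls on the cells the plan step selected. The sketch below shows only that arithmetic; TABATTN itself is described as a trained verifier, and the function name, the row-by-column list layout, and the normalization are assumptions made for illustration.

```python
from typing import List, Sequence


def attention_overlap(
    attention: Sequence[Sequence[float]], mask: Sequence[Sequence[int]]
) -> float:
    """Fraction of total attention mass that lands on cells where mask == 1."""
    total = sum(sum(row) for row in attention)
    if total == 0:
        return 0.0
    inside = sum(
        a
        for att_row, mask_row in zip(attention, mask)
        for a, m in zip(att_row, mask_row)
        if m
    )
    return inside / total


# Hypothetical 2x2 table: the plan step designates the two off-diagonal cells.
attention: List[List[float]] = [[0.1, 0.4], [0.4, 0.1]]
plan_mask: List[List[int]] = [[0, 1], [1, 0]]
score = attention_overlap(attention, plan_mask)  # about 0.8: most mass is on planned cells
```

A step whose attention concentrates outside the mask would score near 0, which is the kind of grounding failure the abstract says content-only scoring overlooks.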

Chinese Translation

多步骤的大型语言模型（LLM）在结构化表格上的推理失败，因为规划与执行之间没有明确的单元格基础契约。现有方法将规划者限制为从左到右的因式分解，这与表格的排列不变性相悖，并且仅通过生成的内容对中间状态进行评分，忽视了单元格的基础。我们进行了一项初步研究，显示扩散语言模型（DLMs）在表格上产生的单元格注意力比自回归模型更符合人类的期望且具有排列稳定性，在行重排序下，注意力-AUROC的变异性中位数减少了40.2%。基于此，我们提出了TABALIGN，一个实现该契约的计划表格推理框架。TABALIGN将一个掩码DLM规划器与TABATTN相结合，前者的双向去噪输出计划步骤作为二进制单元格掩码，后者是一个经过1,600个人工验证的注意力标准训练的轻量级验证器，用于根据与计划指定掩码的注意力重叠对每一步进行评分。在涵盖表格问答和事实验证的八个基准测试中，TABALIGN在与最强开源基线相当的8B级别上提高了平均准确率15.76个百分点，其中通过匹配骨干网络的消融实验将这一增益的2.87个百分点归因于DLM规划器相较于固定推理器上的自回归规划器。更清晰的DLM计划还加速了下游推理执行，提升幅度达到44.64%。

cs.AI / 56 / 2605.14483

LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning

LEMON：通过反事实强化学习学习可执行的多智能体编排

Chen, Xudong, Liu, Yixin, Wei, Hua, Ding, Kaize

Abstract

Large language models (LLMs) have become a strong foundation for multi-agent systems, but their effectiveness depends heavily on orchestration design. Across different tasks, role design, capacity assignment, and dependency construction jointly affect both solution quality and execution efficiency. Existing approaches automate parts of this design process, yet they often optimize these decisions partially or sequentially, and rely on execution-level feedback that provides limited credit assignment for local orchestration decisions. We propose LEMON (Learning Executable Multi-agent OrchestratioN via Counterfactual Reinforcement Learning), an LLM-based orchestrator that generates an executable orchestration specification. The specification integrates task-specific roles, customized duties, capacity levels, and dependency structure into a single deployable system. To train the orchestrator, we augment the orchestration-level GRPO objective with a localized counterfactual signal that edits role, capacity, or dependency fields and applies the resulting reward contrast only to the edited spans. Experiments on six reasoning and coding benchmarks, including MMLU, GSM8K, AQuA, MultiArith, SVAMP, and HumanEval, show that LEMON achieves state-of-the-art performance among the evaluated multi-agent orchestration methods. Our code is available at https://anonymous.4open.science/r/LEMON-B23C.

Chinese Translation

大型语言模型（LLMs）已成为多智能体系统的强大基础，但其有效性在很大程度上依赖于编排设计。在不同任务中，角色设计、能力分配和依赖关系构建共同影响解决方案的质量和执行效率。现有方法自动化了这一设计过程的部分环节，但通常是部分或顺序地优化这些决策，并依赖于提供有限信用分配的执行级反馈。我们提出了LEMON（Learning Executable Multi-agent Orchestration via Counterfactual Reinforcement Learning），一种基于LLM的编排器，能够生成可执行的编排规范。该规范将任务特定的角色、定制的职责、能力水平和依赖结构整合为一个可部署的系统。为了训练编排器，我们通过局部反事实信号增强编排级GRPO目标，该信号编辑角色、能力或依赖字段，并仅将结果奖励对比应用于编辑的范围。在六个推理和编码基准（包括MMLU、GSM8K、AQuA、MultiArith、SVAMP和HumanEval）上的实验表明，LEMON在评估的多智能体编排方法中实现了最先进的性能。我们的代码可在https://anonymous.4open.science/r/LEMON-B23C获取。

cs.AI / 57 / 2605.14488

Deepchecks: Evaluating Retrieval-Augmented Generation (RAG)

Deepchecks：评估检索增强生成（RAG）

Gerner, Assaf, Madvil, Netta, Barak, Nadav, Zaikman, Alex, Liberman, Jonatan, Hamra, Liron, Brazilay, Rotem, Tsadok, Shay, Friedman, Yaron, Harow, Neal, Bresler, Noam, Chorev, Shir, Tannor, Philip, Rokach, Lior

Abstract

Large Language Models (LLMs) augmented with Retrieval-Augmented Generation (RAG) techniques are revolutionizing applications across multiple domains, such as healthcare, finance, and customer service. Despite their potential, evaluating RAG systems remains a complex challenge due to the stochastic nature of generated outputs and the intricate interplay between retrieval and generation components. This paper introduces Deepchecks, a comprehensive framework tailored for evaluating RAG applications. The framework addresses RAG evaluation through multi-faceted assessment, root cause analysis, and production monitoring. By ensuring alignment with application-specific requirements, the Deepchecks framework provides a robust foundation for assessing reliability, relevance, and user satisfaction in RAG systems.

Chinese Translation

大型语言模型（LLMs）结合检索增强生成（RAG）技术正在革新多个领域的应用，如医疗、金融和客户服务。尽管其潜力巨大，但由于生成输出的随机性以及检索与生成组件之间复杂的相互作用，评估RAG系统仍然是一项复杂的挑战。本文介绍了Deepchecks，一个专为评估RAG应用而设计的综合框架。Deepchecks的评估框架通过多方面的方法、根本原因分析和生产监控来解决RAG应用的评估问题。通过确保与特定应用需求的一致性，Deepchecks框架为评估RAG系统的可靠性、相关性和用户满意度提供了坚实的基础。

cs.AI / 58 / 2605.14494

Learning Scenario Reduction for Two-Stage Robust Optimization with Discrete Uncertainty

具有离散不确定性的两阶段鲁棒优化场景简化

Lin, Tianjue, Zhou, Jianan, Bi, Jieyi, Wu, Yaoxin, Song, Wen, Cao, Zhiguang, Zhang, Jie

Abstract

Two-Stage Robust Optimization (2RO) with discrete uncertainty is challenging, often rendering exact solutions prohibitive. Scenario reduction alleviates this issue by selecting a small, representative subset of scenarios to enable tractable computation. However, existing methods are largely problem-agnostic, operating solely on the uncertainty set without consulting the feasible region or recourse structure. In this paper, we introduce PRISE, a problem-driven sequential lookahead heuristic that constructs reduced scenario sets by evaluating the marginal impact of each scenario. While PRISE yields high-quality scenario subsets, each selection step requires solving multiple subproblems, making it computationally expensive at scale. To address this, we propose NeurPRISE, a neural surrogate model built on a GNN-Transformer backbone that encodes the per-scenario structure via graph convolution and captures cross-scenario interactions through attention. NeurPRISE is trained via imitation learning with a gain-aware ranking objective, which distills marginal gain information from PRISE into a learned scoring function for scenario ranking and selection. Extensive results on three 2RO problems show that NeurPRISE consistently achieves competitive regret relative to comprehensive methods, maintains strong scalability with varying numbers of scenarios, and delivers 7-200x speedup over PRISE. NeurPRISE also exhibits strong zero-shot generalization, effectively handling instances with larger problem scales (up to 5x), more scenarios (up to 4x), and distribution shifts.
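
The idea of picking scenarios by marginal impact can be sketched as a greedy loop. This is a toy illustration under strong assumptions, not the PRISE algorithm itself: the first-stage decision space is a small finite list, the cost function is supplied by the caller, and all function names are hypothetical.

```python
def worst_case_cost(decision, scenarios, cost):
    """Robust objective: the highest cost the decision incurs over the scenarios."""
    return max(cost(decision, s) for s in scenarios)

def solve_reduced(decisions, subset, cost):
    """First-stage decision that is optimal against the reduced scenario set."""
    return min(decisions, key=lambda x: worst_case_cost(x, subset, cost))

def greedy_scenario_reduction(decisions, scenarios, cost, k):
    """Pick k scenarios one at a time. Each step adds the scenario whose
    inclusion yields the lowest worst-case cost on the FULL scenario set."""
    selected = []
    remaining = list(scenarios)
    for _ in range(min(k, len(remaining))):
        def full_cost_if_added(s):
            x = solve_reduced(decisions, selected + [s], cost)
            return worst_case_cost(x, scenarios, cost)
        best = min(remaining, key=full_cost_if_added)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy problem: build capacity x, then pay 3 per unit of unmet demand s.
def toy_cost(x, s):
    return x + 3 * max(0, s - x)

chosen = greedy_scenario_reduction([0, 1, 2, 3, 4], [1, 2, 4], toy_cost, k=1)
```

Every candidate scenario triggers its own reduced solve at each step, which is the per-step cost that the abstract says motivates a learned surrogate.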

Chinese Translation

具有离散不确定性的两阶段鲁棒优化（2RO）面临挑战，通常使得精确解的求解变得不可行。场景简化通过选择一个小的、具有代表性的场景子集来缓解这一问题，从而实现可处理的计算。然而，现有方法在很大程度上是与问题无关的，仅仅基于不确定性集合进行操作，而未考虑可行区域或补救结构。本文介绍了PRISE，这是一种以问题为驱动的顺序前瞻启发式方法，通过评估每个场景的边际影响来构建简化的场景集。虽然PRISE能够生成高质量的场景子集，但每个选择步骤需要解决多个子问题，使其在大规模时计算成本高昂。为了解决这一问题，我们提出了NeurPRISE，这是一种基于GNN-Transformer架构构建的神经替代模型，通过图卷积编码每个场景的结构，并通过注意力机制捕捉场景间的交互。NeurPRISE通过模仿学习进行训练，采用增益感知排序目标，将来自PRISE的边际增益信息提炼为学习的评分函数，用于场景的排序和选择。在三个2RO问题上的广泛结果表明，NeurPRISE在相对于全面方法时始终实现了竞争性的遗憾，能够在不同数量的场景下保持强大的可扩展性，并相较于PRISE实现了7-200倍的加速。NeurPRISE还表现出强大的零样本泛化能力，有效处理更大规模的问题实例（最多5倍）、更多的场景（最多4倍）以及分布变化。

cs.AI / 59 / 2605.14504

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

当机器人做家务时：长时间家庭任务执行的基准与智能体

Zhu, Zilin, Guo, Longteng, Mei, Yanghong, Pang, Bowen, Zhang, Zongxun, He, Xingjian, Ji, Ruyi, Liu, Jing

Abstract

Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and Qwen3-VL models show that HoloMind substantially improves long-horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full-task success, underscoring the difficulty of LongAct and the need for stronger long-horizon planning in embodied agents.
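
Dependency management over a DAG of subtasks can be illustrated with a topological ordering. The sketch uses Python's standard `graphlib` module; the chore names and the `plan_order` helper are hypothetical, and the snippet says nothing about how HoloMind's planner is actually built.

```python
from graphlib import TopologicalSorter

def plan_order(dependencies):
    """Return subtasks in an order that respects the dependency DAG.
    `dependencies` maps each subtask to the set of subtasks it must wait for."""
    return list(TopologicalSorter(dependencies).static_order())

# Hypothetical household task broken into subtasks.
chores = {
    "boil water": set(),
    "find mug": set(),
    "brew tea": {"boil water", "find mug"},
    "serve tea": {"brew tea"},
}
order = plan_order(chores)
```

`static_order()` raises `graphlib.CycleError` if the dependencies contain a cycle, which gives a cheap sanity check on a generated plan.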

Chinese Translation

长时间家庭任务需要强大的高层次规划和持续的推理能力，而现有的具身人工智能基准大多忽视了这一点，通常侧重于短时间的导航或操作，并依赖于固定的任务类别。我们引入了LongAct，这是一个旨在评估通过自由形式指令指定的长时间家庭任务中的规划级自主性的基准。通过抽象化具身特定的低级控制，LongAct隔离了高层次的认知能力，如指令理解、依赖管理、记忆维护和自适应规划。我们进一步提出了HoloMind，这是一个基于VLM（视觉语言模型）的智能体，具有基于DAG（有向无环图）的长时间分层规划器、用于持久世界建模的多模态空间记忆、用于经验重用的情节记忆，以及用于反思监督的全局评论者。与GPT-5和Qwen3-VL模型的实验表明，HoloMind显著提高了长时间性能，同时减少了对模型规模的依赖。即使是顶尖模型也仅实现了59%的目标完成率和16%的全任务成功率，突显了LongAct的难度以及具身智能体在长时间规划方面的更强需求。

cs.AI / 60 / 2605.14537

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

牛交易：一个用于评估大型语言模型（LLMs）在虚张声势、竞标和讨价还价中的多智能体基准

Müller, Robert, Müller, Clemens

Abstract

We introduce Cattle Trade, a multi-agent benchmark for evaluating large language models (LLMs) as agents in strategic reasoning under imperfect information, adversarial interaction, and resource constraints. The benchmark combines auctions, hidden-offer trade challenges (TCs), bargaining, bluffing, opponent modeling, and resource allocation within a single long-horizon game lasting 50 to 60 turns. Unlike prior agent benchmarks that test these abilities in isolation, Cattle Trade evaluates whether agents integrate them across a competitive, multi-agent economic game with conflicting incentives. The benchmark logs every bid, TC offer, counteroffer, and card selection, enabling behavioural analysis beyond final scores or win rates. We evaluate seven cost-efficient language models and three deterministic code agents across 242 games. Strategic coherence, in particular spending efficiency, resource discipline, and phase-adaptive bidding, is associated with rank more strongly than spending volume or any single subskill. Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. Evaluating agentic competence requires benchmarks that test the joint deployment of multiple capabilities in multi-agent environments with conflicting incentives, uncertainty, and economic dynamics.

Chinese Translation

我们介绍了Cattle Trade，这是一个用于评估大型语言模型（LLMs）作为智能体在不完全信息、对抗性互动和资源限制下进行战略推理的多智能体基准。该基准将拍卖、隐性报价交易挑战（TCs）、讨价还价、虚张声势、对手建模和资源分配结合在一个持续50到60回合的长期游戏中。与之前单独测试这些能力的智能体基准不同，Cattle Trade评估智能体是否能够在具有冲突激励的竞争性多智能体经济游戏中整合这些能力。该基准记录每一个出价、TC报价、反报价和卡片选择，使得行为分析超越最终得分或胜率。我们在242场游戏中评估了七种成本效益高的语言模型和三种确定性代码智能体。战略一致性，特别是支出效率、资源纪律和阶段适应性竞标，与排名的关联性明显强于支出总量或任何单一子技能。两个启发式代码智能体的表现超过了大多数测试的LLMs，行为轨迹揭示了LLMs反复出现的失效模式，包括过度出价、自我出价、破产TC启动和对手状态适应能力弱。评估智能体能力需要基准测试在具有冲突激励、不确定性和经济动态的多智能体环境中联合部署多种能力的能力。

cs.AI / 61 / 2605.14542

VerbalValue: A Socially Intelligent Virtual Host for Sales-Driven Live Commerce

VerbalValue：一个具有社会智能的销售驱动型直播电商虚拟主持人

Chen, Yuyan

Abstract

A skilled live-commerce host is not merely a narrator, but a sales agent who converts viewer curiosity into purchase intent through expert product knowledge, emotionally intelligent response tactics, and entertainment that serves as a vehicle for product exposure. Yet no existing AI system replicates this: conversational recommenders treat recommendation as a terminal act, while general-purpose LLMs hallucinate product claims and default to generic promotional templates that fail to engage or persuade. We present VerbalValue, a sales-conversion-oriented virtual host that turns exceptional verbal ability into real commercial value, built on three contributions. First, we construct a domain knowledge base of product specifications and a curated sales terminology lexicon that anchor product-related responses in verified expertise. Second, we collect and annotate 1,475 live-commerce interactions spanning diverse viewer intents. Third, we fine-tune a large language model on this data to deliver empathetic, commercially oriented responses, adapting to viewer intent through empathetic amplification, evidence-backed rebuttal, and humor-mediated deflection. Experiments against GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and other baselines demonstrate gains of 23% on informativeness and 18% on factual correctness, with consistent advantages in tactfulness and viewer engagement.

Chinese Translation

一位优秀的直播电商主持人不仅仅是讲述者，而是一个销售代理，通过专业的产品知识、情感智能的应对策略以及作为产品曝光载体的娱乐性，将观众的好奇心转化为购买意图。然而，目前没有现有的人工智能系统能够复制这一点：对话推荐系统将推荐视为终端行为，而通用大型语言模型（LLM）则会产生虚假的产品声明，并默认使用无法吸引或说服观众的通用促销模板。我们提出了VerbalValue，一个以销售转化为导向的虚拟主持人，将卓越的语言能力转化为实际的商业价值，基于三个贡献构建而成。首先，我们构建了一个产品规格的领域知识库和一个经过精心策划的销售术语词汇表，使产品相关的回应基于经过验证的专业知识。其次，我们收集并标注了1,475个涵盖多样化观众意图的直播电商互动数据。第三，我们在这些数据上对大型语言模型进行微调，以提供同情心强、以商业为导向的回应，通过同情心的增强、证据支持的反驳和幽默的转移来适应观众的意图。与GPT-5.4、Claude Sonnet 4.6、Gemini 3.1 Pro及其他基线模型的实验表明，在信息量和事实正确性上分别提高了23%和18%，在机智和观众参与度方面也表现出持续的优势。

cs.AI / 62 / 2605.14544

Complacent, Not Sycophantic: Reframing Large Language Models and Designing AI Literacy for Complacent Machines

自满，而非谄媚：重新框架大型语言模型并为自满机器设计人工智能素养

Germani, Federico, Spitale, Giovanni

Abstract

Large language models are often described as sycophantic, in the sense that they appear to flatter users or mirror their beliefs. We argue that this label is conceptually misleading: sycophancy implies motives and strategic intent, which LLMs do not possess. Their behaviour is better understood as complacency, a structural tendency to agree with user input because training data, reward signals and design favour agreement and reinforcement over correction. We argue that this distinction matters. Whether developers act sycophantically or not, models themselves never are sycophants; they can only be made more or less complacent. This reframing locates agency in developers and institutions, not in the model. Because complacent models reinforce users' prior beliefs, we argue that AI literacy educational approaches should particularly focus on strategies to counter confirmation bias.

Chinese Translation

大型语言模型常被描述为谄媚的，因为它们似乎在恭维用户或反映他们的信念。我们认为这一标签在概念上是误导性的：谄媚暗示了动机和战略意图，而大型语言模型（LLMs）并不具备这些特征。它们的行为更应理解为自满，即一种结构性倾向，因训练数据、奖励信号和设计偏向于同意和强化而非纠正用户输入。我们认为这一区分是重要的。无论开发者是否表现出谄媚，模型本身永远不是谄媚者；它们只能被设计得更或更少自满。这一重新框架将主动权置于开发者和机构之中，而非模型本身。由于自满模型会强化用户的先前信念，我们主张人工智能素养教育方法应特别关注对抗确认偏差的策略。

cs.AI / 63 / 2605.14556

TeachAnything: A Multimodal Crowdsourcing Platform for Training Embodied AI Agents in Symmetrical Reality

TeachAnything：一个用于在对称现实中训练具身人工智能代理的多模态众包平台

Liu, Zidong, Liu, Rongkai, Li, Yue, Zhang, Zhenliang

Abstract

Symmetrical Reality (SR) is emerging as a future trend for human-agent coexistence, placing higher demands on agents to acquire human-like intelligence. It calls for richer and more diverse human guidance. We introduce a three-stage demonstration paradigm integrating multimodal demonstration signals. Building on this paradigm, we developed TeachAnything, a cloud-based, crowdsourcing-oriented demonstration platform with physics simulation capable of collecting diverse demonstration data across varied scenes, tasks, and embodiments. By unifying virtual and physical interactions through both methodological design and physics simulation, the system serves as a practical foundation for developing embodied agents aligned with Symmetrical Reality.

Chinese Translation

对称现实（Symmetrical Reality, SR）正成为人机共存的未来趋势，对代理的要求更高，要求其具备类人智能。这需要更丰富和多样化的人类指导。我们介绍了一种整合多模态演示信号的三阶段演示范式。在此范式的基础上，我们开发了TeachAnything，一个基于云的、面向众包的演示平台，具备物理仿真能力，能够在多样化的场景、任务和具身形式中收集多样的演示数据。通过方法设计和物理仿真统一虚拟与物理交互，该系统为开发与对称现实相一致的具身代理提供了实用基础。

cs.AI / 64 / 2605.14559

PyCSP3-Scheduling: A Scheduling Extension for PyCSP3

PyCSP3调度：PyCSP3的调度扩展

Afifi, Sohaib

Abstract

PyCSP$^3$ provides a productive way to build constraint models for solving combinatorial constrained problems and export them to XCSP$^3$, preserving a complete separation between modeling and solving. However, it lacks native support for scheduling abstractions such as interval variables, sequence variables, and resource functions. As a result, scheduling models must be encoded with low-level integer variables and manual channeling constraints, even though PyCSP$^3$ already provides global constraints like NoOverlap and Cumulative on integer arrays. We present PyCSP$^3$ Scheduling, a library that adds scheduling abstractions to PyCSP$^3$ through 53 dedicated constraints and 27 expressions, and compiles them down to standard PyCSP$^3$/XCSP$^3$ constraints, maintaining the modeling/solving separation that underpins the PyCSP$^3$ ecosystem. On 261 paired instances across 17 model families (5 runs each), both formulations produce identical objectives on all 72 doubly-proved optimal pairs and nearly half of the families (8/17) remain structurally unchanged after compilation; however, runtime performance diverges across families, with clear gains on some (up to 5.8x) and regressions on others due to the overhead of compilation decompositions. Code and benchmarks are available at: https://github.com/sohaibafifi/pycsp3-scheduling
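
The low-level integer encoding that the abstract contrasts with interval variables can be illustrated without any solver. The check below is plain Python and deliberately uses no PyCSP3 or PyCSP3-Scheduling API; the `no_overlap` helper is a hypothetical name that only spells out the semantics of a NoOverlap constraint over integer start times and durations.

```python
from itertools import combinations

def no_overlap(starts, durations):
    """True if no two tasks overlap: for every pair of tasks,
    one must end before (or exactly when) the other starts."""
    tasks = list(zip(starts, durations))
    return all(s1 + d1 <= s2 or s2 + d2 <= s1
               for (s1, d1), (s2, d2) in combinations(tasks, 2))
```

With this encoding each task is just a pair of integers, so any link between a task's start, duration, and end has to be written by hand. That bookkeeping is what an interval-variable abstraction is meant to hide.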

Chinese Translation

PyCSP$^3$ 提供了一种高效的方式来构建约束模型，以解决组合约束问题并将其导出到 XCSP$^3$，在建模和求解之间保持完全的分离。然而，它缺乏对调度抽象的原生支持，例如区间变量、序列变量和资源函数。因此，调度模型必须使用低级整数变量和手动通道约束进行编码，尽管 PyCSP$^3$ 已经在整数数组上提供了全局约束，如 NoOverlap 和 Cumulative。我们提出了 PyCSP$^3$ Scheduling，这是一个通过 53 个专用约束和 27 个表达式将调度抽象添加到 PyCSP$^3$ 的库，并将其编译为标准的 PyCSP$^3$/XCSP$^3$ 约束，保持了支撑 PyCSP$^3$ 生态系统的建模/求解分离。在 17 个模型族（每个模型族 5 次运行）中的 261 对实例上，两种形式在所有 72 对双重证明的最优对上产生了相同的目标，并且近一半的模型族（8/17）在编译后结构保持不变；然而，运行时性能在不同模型族之间存在差异，在某些模型族上有明显的提升（最高可达 5.8 倍），而在其他模型族上由于编译分解的开销出现了回退。代码和基准测试可在以下链接获取：https://github.com/sohaibafifi/pycsp3-scheduling

cs.AI / 65 / 2605.14561

Prompt Segmentation and Annotation Optimisation: Controlling LLM Behaviour via Optimised Segment-Level Annotations

提示分段与注释优化：通过优化的分段级注释控制大型语言模型行为

Prasad, Devika, Gerschwitz, Luke, Li, Tong, Xiao, Henry, Liu, Anjin, Wu, Coco, Leontjeva, Anna, Pizzato, Luiz

Abstract

Prompt engineering is crucial for effective interaction with generative artificial intelligence systems, yet existing optimisation methods often operate over an unstructured and vast prompt space, leading to high computational costs and potential distortions of the original intent. We introduce Prompt Segmentation and Annotation Optimisation (PSAO), a structured prompt optimisation framework designed to improve prompt optimisation controllability and efficiency. PSAO decomposes a prompt into interpretable segments (e.g., sentences) and augments each with human-readable annotations (e.g., {not important}, {important}, {very important}). These annotations guide large language models (LLMs) in allocating focus and clarifying confusion during response generation. We formally define the segmentations and annotations and demonstrate that optimised segment-level annotations can lead to improved LLM responses, with the original prompt retained as a candidate in the optimisation space to prevent performance degradation. Empirical evaluations indicate that PSAO benefits from annotations in terms of improved reasoning accuracy and self-consistency. However, developing efficient methods for identifying optimal segmentations and annotations remains challenging and is reserved for future investigation. This work is intended as a proof of concept, demonstrating the feasibility and potential of segment-level annotation optimisation.
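
The segment-then-annotate search space can be made concrete with a short sketch. It is a hedged illustration only: the regex-based sentence splitter, the placement of the `{label}` after each segment, and the exhaustive enumeration are assumptions, not the procedure used in the paper.

```python
import re
from itertools import product

LABELS = ("not important", "important", "very important")

def segment(prompt):
    """Split a prompt into sentence-level segments (naive punctuation split)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", prompt.strip()) if s]

def annotate(segments, labels):
    """Attach one {label} to each segment; None leaves a segment unannotated."""
    return " ".join(seg if lab is None else f"{seg} {{{lab}}}"
                    for seg, lab in zip(segments, labels))

def candidates(prompt):
    """Enumerate annotated prompts, keeping the original prompt as the first candidate."""
    segs = segment(prompt)
    yield prompt
    for labels in product(LABELS, repeat=len(segs)):
        yield annotate(segs, labels)
```

With three labels the space grows as 3^n in the number of segments, which is consistent with the abstract's remark that finding optimal annotations efficiently remains open.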

Chinese Translation

提示工程对于与生成性人工智能系统的有效互动至关重要，但现有的优化方法往往在一个无结构且庞大的提示空间中运行，导致高计算成本和潜在的原意扭曲。我们提出了提示分段与注释优化（Prompt Segmentation and Annotation Optimisation, PSAO），这是一个旨在提高提示优化可控性和效率的结构化提示优化框架。PSAO将提示分解为可解释的分段（例如，句子），并为每个分段增加人类可读的注释（例如，{不重要}、{重要}、{非常重要}）。这些注释指导大型语言模型（Large Language Models, LLMs）在生成响应时分配注意力并澄清混淆。我们正式定义了分段和注释，并证明优化的分段级注释可以改善LLM的响应，同时将原始提示保留为优化空间中的候选项，以防止性能下降。实证评估表明，PSAO在提高推理准确性和自我一致性方面受益于注释。然而，开发高效的方法来识别最佳分段和注释仍然具有挑战性，留待未来研究。本工作旨在作为概念验证，展示分段级注释优化的可行性和潜力。

cs.AI / 66 / 2605.14604

Sycophancy is an Educational Safety Risk: Why LLM Tutors Need Sycophancy Benchmarks

谄媚是一种教育安全风险：为何大型语言模型辅导员需要谄媚基准

Kasneci, Enkelejda, Kasneci, Gjergji

Abstract

This position paper argues that effective tutoring requires corrective friction: surfacing misconceptions and challenging them supportively to drive conceptual change. Yet preference-aligned LLMs can trade epistemic rigor for agreeableness. We identify a Reasoning-Sycophancy Paradox: models that resist context-switch frame attacks can still capitulate under social-epistemic pressure, especially authority ("my notes say I'm right") and social-affective face-saving ("please don't tell me I'm wrong"). We introduce EduFrameTrap, a tutoring benchmark across math, physics, economics, chemistry, biology, and computer science that varies student confidence and pressure (context-switch, authority, social-affective). Across two frontier LLMs, context-switch failures are comparatively lower for GPT-5.2, while authority and social pressure more often trigger epistemic retreat. In contrast, Claude shows substantial context-switch fragility in this run. Because these failures are hard to judge automatically, we report two-judge disagreement as a reliability signal. We argue benchmarks should measure social-epistemic courage, i.e., supportive but corrective tutoring, and treat kind-but-correct behavior as a safety requirement.

Chinese Translation

本文立场论文认为，有效的辅导需要纠正性的摩擦：揭示误解并以支持的方式挑战它们，以推动概念的变化。然而，偏好一致的大型语言模型（LLMs）可能会以愉悦性换取认知严谨性。我们识别出一个推理-谄媚悖论：抵抗上下文切换框架攻击的模型仍然可能在社会-认知压力下屈服，尤其是权威（“我的笔记说我是对的”）和社会-情感的面子保护（“请不要告诉我我错了”）。我们引入了EduFrameTrap，这是一个涵盖数学、物理、经济学、化学、生物学和计算机科学的辅导基准，变化学生的信心和压力（上下文切换、权威、社会-情感）。在两个前沿的LLM中，GPT-5.2的上下文切换失败率相对较低，而权威和社会压力更常引发认知退缩。相比之下，Claude在这次运行中显示出显著的上下文切换脆弱性。由于这些失败很难自动判断，我们报告了两个评审者之间的不一致作为可靠性信号。我们认为基准应测量社会-认知勇气，即支持但纠正的辅导，并将善意但正确的行为视为一种安全要求。

cs.AI / 67 / 2605.14619

SliceGraph: Mapping Process Isomers in Multi-Run Chain-of-Thought Reasoning

SliceGraph：在多次链式思维推理中映射过程异构体

Chen, Kang, Nian, Junjie, Cao, Yixin, Jiang, Yugang

Abstract

Multi-run chain-of-thought reasoning is usually collapsed to final-answer aggregates, which discard how sampled trajectories share, split, and rejoin through intermediate computation. We propose SliceGraph, a post-hoc problem-model-cell graph built by mutual-kNN over sparse activation-key Jaccard similarity between CoT slices, and treat it as a measurement object for process geometry rather than as a decoding program. Across sampled CoT ensembles from three primary 4B/8B models on math and science benchmarks, blinded annotation supports SliceGraph biconnected components as shared reasoning-state units and process families as within-family strategy-coherent route units. In 85.5% of 954 problem-model cells, correct CoTs sharing the same normalized answer split into multiple process families; among cells with at least two such runs, 76.6% of run pairs are cross-family on average. We call such same-answer, family-divergent correct trajectories process isomers. A label-seeded reward field provides a separate value-landscape layer: success-associated regions often split into disconnected high-value cores, and route families specialize over these core footprints rather than merely duplicating one another. A typed-state transition analysis further shows that process families navigate the same atlas with distinct transition kernels under matched null controls. Representation ablations, a cross-architecture replication, and two cross-scale replications support the robustness of the route-family scaffold, showing that final-answer aggregation overlooks this structured multi-route process geometry.
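
A mutual-kNN graph over Jaccard similarity is a standard construction and can be sketched in a few lines. Here each CoT slice is assumed to be represented as a plain Python set of active keys; that representation, the function names, and the example sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mutual_knn_edges(key_sets, k):
    """Connect slices i and j only if each is among the other's
    k most similar slices (ties broken by index order)."""
    n = len(key_sets)
    neighbours = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: jaccard(key_sets[i], key_sets[j]),
                        reverse=True)
        neighbours.append(set(ranked[:k]))
    return {(i, j) for i in range(n) for j in neighbours[i]
            if i < j and i in neighbours[j]}

# Hypothetical activation-key sets for four slices.
slices = [{1, 2, 3}, {1, 2, 3, 4}, {4, 5}, {5, 6}]
edges = mutual_knn_edges(slices, k=1)
```

Requiring the neighbour relation in both directions drops one-sided links, so a slice that merely resembles a popular hub does not get attached to it.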

Chinese Translation

多次链式思维推理通常被简化为最终答案的聚合，这忽略了采样轨迹如何在中间计算中共享、分裂和重新结合。我们提出了SliceGraph，这是一种后验问题-模型-单元图，通过在CoT切片之间基于稀疏激活关键的Jaccard相似性进行互k近邻（mutual-kNN）构建，并将其视为过程几何的测量对象，而非解码程序。在来自三个主要的4B/8B模型的数学和科学基准上的采样CoT集合中，盲注释支持SliceGraph的双连通分量作为共享推理状态单元，而过程家族则作为同一家族策略一致的路径单元。在954个问题-模型单元中，有85.5%的正确CoTs共享相同的标准化答案，分裂为多个过程家族；在至少有两个此类运行的单元中，76.6%的运行对平均是跨家族的。我们称这种相同答案、家族分歧的正确轨迹为过程异构体。一个标签引导的奖励场提供了一个独立的价值景观层：成功相关区域通常分裂为不相连的高价值核心，而路径家族则在这些核心足迹上专业化，而不仅仅是相互复制。类型状态转移分析进一步表明，过程家族在匹配的零控制下以不同的转移核导航同一图谱。表示消融、跨架构复制和两个跨尺度复制支持路径家族支架的稳健性，显示最终答案聚合忽视了这种结构化的多路径过程几何。

cs.AI / 68 / 2605.14636

Teaching Large Language Models When Not to Know: Learning Temporal Critique for Ex-Ante Reasoning

教导大型语言模型何时不应知道：学习时间批判以进行事前推理

Ding, Chenlu, Wu, Jiancan, Luo, Yanchen, Liu, Zheyuan, Yuan, Yancheng, Wang, Xiang

Abstract

Large language models (LLMs) often fail to reason under temporal cutoffs: when prompted to answer from the standpoint of an earlier time, they exploit knowledge that became available only later. We study this failure through the lens of ex-ante reasoning, where a model must rely exclusively on information knowable before a cutoff. Through a systematic analysis of prompt-level interventions, we find that temporal leakage is highly sensitive to cutoff formulation and instruction placement: explicit cutoff statements outperform implicit historical framings, and prefix constraints reduce leakage more effectively than suffix constraints. These findings indicate that prompting can steer models into a temporal frame, but does not endow them with the ability to verify whether a response is temporally admissible. We further argue that supervised fine-tuning is insufficient, since ex-ante correctness is not an intrinsic property of an answer, but a relation between the answer and the cutoff. To address this gap, we propose TCFT, a Temporal Critique Fine-Tuning framework that trains models to acquire cutoff-aware temporal verification. Given a query, a cutoff, and a candidate response, TCFT teaches the model to identify post-cutoff leakage, explain temporal boundary violations, and judge temporal admissibility. Experiments with Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct show that TCFT consistently outperforms prompting and SFT baselines, reducing average leakage by 41.89 and 37.79 percentage points, respectively.
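
The claim that ex-ante correctness is a relation between an answer and a cutoff, not a property of the answer alone, can be shown with a toy check. The sketch assumes each fact cited by a response comes with a known availability date, which is a simplifying assumption; TCFT itself trains a model to produce such critiques, and the names below are hypothetical.

```python
from datetime import date

def leaked_facts(cited_facts, cutoff):
    """Return the cited facts that only became knowable after the cutoff.
    `cited_facts` maps each fact to the date it became available."""
    return sorted(f for f, available in cited_facts.items() if available > cutoff)

def is_admissible(cited_facts, cutoff):
    """A response is temporally admissible if it cites no post-cutoff fact."""
    return not leaked_facts(cited_facts, cutoff)

# Hypothetical response citing two placeholder facts.
response_facts = {"fact A": date(2019, 5, 1), "fact B": date(2021, 3, 1)}
```

The same `response_facts` is inadmissible under a 2020 cutoff and admissible under a 2022 one, which is why a label attached to the answer alone cannot capture the target behaviour.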

Chinese Translation

大型语言模型（LLMs）在时间截止下常常无法进行推理：当被提示从早期时间的角度回答时，它们利用了仅在之后才可获得的知识。我们通过事前推理的视角研究这一失败，其中模型必须完全依赖于截止前可知的信息。通过对提示级干预的系统分析，我们发现时间泄漏对截止的表述和指令位置高度敏感：显式的截止声明优于隐式的历史框架，而前缀约束比后缀约束更有效地减少泄漏。这些发现表明，提示可以引导模型进入一个时间框架，但并未赋予它们验证响应是否在时间上可接受的能力。我们进一步认为，监督微调是不够的，因为事前正确性并不是答案的内在属性，而是答案与截止之间的关系。为了解决这一问题，我们提出了TCFT（Temporal Critique Fine-Tuning）框架，旨在训练模型获取截止意识的时间验证能力。给定一个查询、一个截止和一个候选响应，TCFT教会模型识别截止后的泄漏，解释时间边界的违反，并判断时间的可接受性。与Qwen2.5-7B-Instruct和Qwen2.5-14B-Instruct的实验表明，TCFT在性能上始终优于提示和SFT基线，平均泄漏分别减少了41.89和37.79个百分点。

cs.AI / 69 / 2605.14660

MindGap: A Conversational AI Framework for Upstream Neuroplastic Intervention in Post-Traumatic Stress Disorder

MindGap：一种用于创伤后应激障碍的上游神经可塑性干预的对话式人工智能框架

Bandara, Eranga, Gore, Ross, Gunaratna, Asanga, Mukkamala, Ravi, Siriwardanagea, Nihal, Rajapakse, Sachini, Kularathna, Isurunima, Karunarathna, Pramoda, Herath, Wathsala, Rajapakse, Chalani, Shetty, Sachin, Clayton, Anita H., Rhea, Christopher K., Keong, Ng Wee, De Zoysa, Kasun, Hass, Amin, Kaushik, Shaifali, Samuel, Preston, Yarlagadda, Atmaram

Abstract

Post-Traumatic Stress Disorder (PTSD) is fundamentally a neuroplastic problem: traumatic contact events encode over-reactive neural pathways through Hebbian long-term potentiation, producing hair-trigger amygdala-HPA stress cascades that fire before conscious awareness can intercept them. Existing therapeutic approaches (prolonged exposure, EMDR, and cognitive behavioural therapy) operate predominantly downstream of the reactive cascade, teaching patients to tolerate or reframe distress after it has arisen. While clinically valuable, these suppression-based approaches do not produce the upstream pathway dissolution that constitutes lasting structural neural reorganisation. This paper proposes MindGap, a privacy-preserving on-device conversational AI framework that delivers structured neuroplastic rehabilitation for PTSD through the practice of dependent origination, a Buddhist psychological framework that identifies the precise moment between the pre-cognitive affective signal and the reactive elaboration that follows as the site of therapeutic intervention. MindGap guides patients through three progressive layers of observation at this feeling tone gap: noticing the bare affective signal before reactive elaboration, recognising it as self-arising rather than caused by the stimulus, and recognising the conditioned implicit belief beneath the feeling. Each layer corresponds to progressively deeper prefrontal regulatory engagement and progressively deeper long-term depression-mediated weakening of the reactive pathway, producing genuine upstream dissolution rather than downstream suppression. Running entirely on-device with no data egress, MindGap delivers daily calibrated exposure sessions through a fine-tuned lightweight large language model, making it deployable in sensitive clinical and military contexts where cloud-based solutions are not permitted.

Chinese Translation

创伤后应激障碍（PTSD）根本上是一个神经可塑性问题，创伤接触事件通过Hebbian长期增强作用编码过度反应的神经通路，产生在意识觉察之前就触发的杏仁体-下丘脑-垂体（HPA）应激级联反应。现有的治疗方法，如延长暴露、眼动脱敏与再处理（EMDR）和认知行为疗法，主要在反应级联的下游运作，教导患者在痛苦出现后如何容忍或重新框架这些痛苦。尽管在临床上具有价值，这些基于抑制的方法并未产生构成持久结构性神经重组的上游通路溶解。本文提出了MindGap，一个保护隐私的设备内对话式人工智能框架，通过依赖起源的实践提供PTSD的结构化神经可塑性康复，这是一种佛教心理框架，识别出在前认知情感信号与随之而来的反应性阐述之间的精确时刻作为治疗干预的地点。MindGap引导患者在这一情感基调间隙中经历三个逐步观察层次：在反应性阐述之前注意到纯粹的情感信号，认识到它是自我产生的而非由刺激引起的，以及识别出情感背后的条件隐性信念。每一层对应于逐步深入的前额叶调节参与和逐步深入的长期抑制介导的反应通路削弱，产生真正的上游溶解而非下游抑制。MindGap完全在设备上运行，无数据外泄，通过精细调整的轻量级大型语言模型提供每日校准的暴露会话，使其能够在不允许使用基于云的解决方案的敏感临床和军事环境中部署。

cs.AI / 70 / 2605.14665

Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI

Falkor-IRAC：用于印度司法人工智能的图约束生成以验证法律推理

Bose, Joy

Abstract

Legal reasoning is not semantic similarity search. A court judgment encodes constrained symbolic reasoning: precedent propagation, procedural state transitions, and statute-bound inference. These are properties that vector-based retrieval-augmented generation (RAG) cannot faithfully represent. Hallucinated precedents, outdated statute citations, and unsupported reasoning chains remain persistent failure modes in LLM-based legal AI, with real consequences for access to justice in high-caseload jurisdictions such as India. This paper presents Falkor-IRAC, a graph-constrained generation framework for Indian legal AI that grounds generation in structured reasoning over an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph. Judgments from the Supreme Court and High Courts of India are ingested as IRAC node structures enriched with procedural state transitions, precedent relationships, and statutory references, stored in FalkorDB for low-latency agentic traversal. At inference time, LLM-generated answers are accepted only if a valid supporting path can be traced through the graph, a check performed by a falsifiability oracle called the Verifier Agent. The system also detects doctrinal conflicts as a first-class output rather than silently resolving them. Falkor-IRAC is evaluated using graph-native metrics: citation grounding accuracy, path validity rate, hallucinated precedent rate, and conflict detection rate. These metrics are argued to be more appropriate for legal reasoning evaluation than BLEU and ROUGE. On a proof-of-concept corpus of 51 Supreme Court judgments, the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations. Evaluation against vector-only RAG baselines is left for future work, as is GPU-accelerated inference to address current timeout rates on CPU hardware.
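
The acceptance rule (keep an answer only if a supporting path exists in the graph) can be sketched with a breadth-first search over an adjacency dictionary. This is a minimal illustration, not the Verifier Agent: the graph is an in-memory dict rather than FalkorDB, and the node names and the `verify` signature are hypothetical.

```python
from collections import deque

def has_supporting_path(graph, issue, conclusion):
    """Breadth-first search: True if `conclusion` is reachable from `issue`."""
    seen, queue = {issue}, deque([issue])
    while queue:
        node = queue.popleft()
        if node == conclusion:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def verify(graph, issue, cited_nodes, conclusion):
    """Accept only if every citation exists in the graph and
    some path leads from the issue to the conclusion."""
    known = set(graph) | {n for targets in graph.values() for n in targets}
    if any(c not in known for c in cited_nodes):
        return False  # a cited node is absent from the graph: treat as fabricated
    return has_supporting_path(graph, issue, conclusion)

# Hypothetical IRAC-style chain: issue -> rule -> analysis -> conclusion.
toy_graph = {
    "issue:1": ["rule:A"],
    "rule:A": ["analysis:1"],
    "analysis:1": ["conclusion:1"],
}
```

Because the check can fail, a generated answer is falsifiable against the graph, which is the sense in which the abstract calls the verifier a falsifiability oracle.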

Chinese Translation

法律推理并非语义相似性搜索。法院判决编码了受限的符号推理：先例传播、程序状态转变和法条约束推理。这些特性是基于向量的检索增强生成（RAG）无法真实表示的。虚构的先例、过时的法条引用和不支持的推理链在基于大型语言模型（LLM）的法律人工智能中仍然是持续的失败模式，对印度等高案件负担司法管辖区的司法公正产生了实际影响。本文提出了Falkor-IRAC，一个针对印度法律人工智能的图约束生成框架，该框架基于IRAC（问题、规则、分析、结论）知识图谱进行结构化推理。印度最高法院和高等法院的判决被摄取为IRAC节点结构，丰富了程序状态转变、先例关系和法条引用，存储在FalkorDB中以实现低延迟的智能遍历。在推理时，仅当可以通过图谱追踪到有效的支持路径时，才接受LLM生成的答案，这一检查由称为验证代理（Verifier Agent）的可证伪性神谕执行。该系统还将教义冲突作为一类重要输出进行检测，而不是默默解决这些冲突。Falkor-IRAC使用图本地指标进行评估：引用基础准确性、路径有效性率、虚构先例率和冲突检测率。这些指标被认为比BLEU和ROUGE更适合法律推理的评估。在一个包含51个最高法院判决的概念验证语料库中，验证代理正确验证了完成查询的引用，并正确拒绝了虚构的引用。与仅基于向量的RAG基线的评估留待未来工作，GPU加速推理也将用于解决当前CPU硬件上的超时率问题。

cs.AI / 71 / 2605.14666

Monitoring Data-aware Temporal Properties (Extended Version)

监测数据感知的时间属性（扩展版）

Gianola, Alessandro, Montali, Marco, Winkler, Sarah

Abstract

Dynamic systems in AI are often complex and heterogeneous, so that an internal specification is not accessible and verification techniques such as model checking are not applicable. Monitoring is in such cases an attractive alternative, as it evaluates desirable properties along traces generated by an unknown dynamic system. In this work, we consider anticipatory monitoring of linear-time properties enriched with an arbitrary SMT theory over finite traces (LTLfMT). Anticipatory monitoring in this setting is highly challenging, as the monitoring state depends on both the trace prefix seen so far and all its possible finite continuations. Under reasonable assumptions on the background theory, we present and formally prove the correctness of a novel foundational framework for monitoring properties in an expressive fragment of LTLfMT. The framework combines automata-theoretic methods to handle the temporal aspects of the logic, with automated reasoning techniques to address the first-order dimension. Moreover, we identify for the first time decidable fragments of this monitoring problem that are practically relevant as they combine linear arithmetic with uninterpreted functions, which covers e.g. data-aware business processes and dynamic systems operating over a read-only database. Feasibility is witnessed by a prototype implementation and preliminary evaluation.

Chinese Translation

人工智能中的动态系统通常复杂且异构，因此内部规范不可获取，模型检查等验证技术也不适用。在这种情况下，监测是一种有吸引力的替代方案，因为它沿着由未知动态系统生成的轨迹评估期望属性。在本研究中，我们考虑了对线性时间属性的预期监测，这些属性在有限轨迹上与任意 SMT 理论相结合（LTLfMT）。在这种情况下，预期监测非常具有挑战性，因为监测状态依赖于迄今为止看到的轨迹前缀及其所有可能的有限延续。在对背景理论进行合理假设的基础上，我们提出并正式证明了一种新颖的基础框架，用于在 LTLfMT 的一个表达性片段中监测属性。该框架结合了自动机理论方法来处理逻辑的时间方面，以及自动推理技术来解决一阶维度。此外，我们首次识别出与实践相关的可判定片段，这些片段将线性算术与未解释函数相结合，涵盖了例如数据感知的业务流程和在只读数据库上运行的动态系统。通过原型实现和初步评估证明了其可行性。


cs.AI / 72 / 2605.14667

How Sensitive Are Radiomic AI Models to Acquisition Parameters?

放射组学人工智能模型对采集参数的敏感性如何？

Gil, D., Sanchez, I., Sanchez, C.

Abstract

A main barrier to the deployment of AI radiomic systems in clinical routine is their drop in performance under heterogeneous multicentre acquisition protocols. This work presents a performance-oriented framework for quantifying the scan-parameter sensitivity of radiomic AI models, while identifying clinically significant parameter regions associated with improved cross-dataset robustness. We formulate a mixed-effects framework for quantifying the influence that clinically relevant acquisition parameters have on model performance, while accounting for subject-level random effects. We have applied our framework to lung cancer diagnosis in CT scans using two independent multicentre datasets (a public database and own-collected data) and several state-of-the-art (SoA) architectures. To evaluate across-database reproducibility, CT parameters have been adjusted using the data collected and tested on the public set. The optimal configuration selected is an X-ray tube current >= 200 mA, spiral pitch <= 1.5, and slice thickness <= 1.25 mm, which balances diagnostic quality with low radiation dose. This configuration pushes metrics from 0.79 ± 0.04 sensitivity and 0.47 ± 0.10 specificity in low-quality scans to 0.90 ± 0.10 sensitivity and 0.79 ± 0.13 specificity in high-quality ones.
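The three thresholds reported in the abstract can be read as a simple scan-level filter. The sketch below only encodes those thresholds; the `Acquisition` record and its field names are hypothetical, and how the authors actually stratify scans into low and high quality is not stated in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Acquisition:
    """Hypothetical container for the CT parameters named in the abstract."""
    tube_current_ma: float
    spiral_pitch: float
    slice_thickness_mm: float


def meets_reported_configuration(acq: Acquisition) -> bool:
    """True if a scan satisfies all three thresholds quoted in the abstract."""
    return (
        acq.tube_current_ma >= 200.0
        and acq.spiral_pitch <= 1.5
        and acq.slice_thickness_mm <= 1.25
    )
```

A scan acquired at 250 mA, pitch 1.0, and 1.25 mm slices passes the filter, while the same scan reconstructed at 2.5 mm slices does not.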

Chinese Translation

人工智能放射组学系统在临床常规应用中的主要障碍是其在异质多中心采集协议下性能的下降。本研究提出了一种以性能为导向的框架，用于量化放射组学人工智能模型对扫描参数的敏感性，同时识别与提高跨数据集鲁棒性相关的临床显著参数区域。我们构建了一个混合效应框架，以量化临床相关采集参数对模型性能的影响，同时考虑受试者级别的随机效应。我们将该框架应用于CT扫描中的肺癌诊断，使用两个独立的多中心数据集（一个公共数据库和自收集的数据）以及几种最先进的架构。为了评估跨数据库的可重复性，CT参数已根据收集的数据进行调整，并在公共数据集上进行了测试。选择的最佳配置为X射线管电流 >= 200 mA，螺旋步幅 <= 1.5，切片厚度 <= 1.25 mm，这在低辐射剂量的情况下平衡了诊断质量。这些配置将低质量扫描的灵敏度从0.79±0.04，特异性从0.47±0.10提升到高质量扫描的灵敏度0.90±0.10，特异性0.79±0.13。


cs.AI / 73 / 2605.14678

$\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

$\pi$-基准：在长时间工作流程中评估主动个人助理代理

Zhang, Haoran, Xu, Luxin, Wang, Zhilin, Gui, Runquan, Zhang, Shunkai, Lei, Haodi, He, Zihao, He, Bingsu, Qin, Chicheng, Zhu, Tong, Qu, Xiaoye, Yang, Yang, Cheng, Yu, Li, Yafu

Abstract

The rise of personal assistant agents, e.g., OpenClaw, highlights the growing potential of large language models to support users across everyday life and work. A core challenge in these settings is proactive assistance, since users often begin with underspecified requests and leave important needs, constraints, or preferences unstated. However, existing benchmarks rarely evaluate whether agents can identify and act on such hidden intents before they are explicitly stated, especially in sustained multi-turn interactions where user needs emerge gradually. To address this gap, we introduce $\pi$-Bench, a benchmark for proactive assistance comprising 100 multi-turn tasks across 5 domain-specific user personas. By incorporating hidden user intents, inter-task dependencies, and cross-session continuity, $\pi$-Bench evaluates agents' ability to anticipate and address user needs over extended interactions, jointly measuring proactivity and task completion in long-horizon trajectories that better reflect real-world use. Experiments show that (1) proactive assistance remains challenging, (2) task completion and proactivity are clearly distinct, and (3) prior interaction is valuable for proactive intent resolution in later tasks.

Chinese Translation

个人助理代理的兴起，例如 OpenClaw，突显了大型语言模型在支持用户日常生活和工作的潜力。此类场景中的一个核心挑战是主动协助，因为用户通常以不明确的请求开始，并且未明确表达重要的需求、约束或偏好。然而，现有基准很少评估代理是否能够在这些隐含意图被明确表达之前识别并采取行动，特别是在用户需求逐渐显现的持续多轮交互中。为了解决这一空白，我们引入了 $\pi$-基准，这是一个包含 100 个跨 5 个特定领域用户角色的多轮任务的主动协助基准。通过结合隐含用户意图、任务间依赖关系和跨会话连续性，$\pi$-基准评估代理在扩展交互中预测和满足用户需求的能力，联合测量主动性和任务完成度，以更好地反映现实世界的使用情况。实验表明（1）主动协助仍然具有挑战性，（2）任务完成与主动性之间存在明显区别，以及（3）先前交互对后续任务中主动意图解决的价值。


cs.AI / 74 / 2605.14721

On Strong Equivalence Notions in Logic Programming and Abstract Argumentation

逻辑编程与抽象论证中的强等价概念

Buraglio, Giovanni, Dvorak, Wolfgang, Woltran, Stefan

Abstract

Strong equivalence between knowledge bases ensures the possibility of replacing one with the other without affecting reasoning outcomes, in any given context. This makes it a crucial property in nonmonotonic formalisms. In particular, the fields of logic programming and abstract argumentation provide primary examples in which this property has been subject to vast investigations. However, while (classes of) logic programs and abstract argumentation frameworks are known to be semantically equivalent in static settings, this alignment breaks in dynamic contexts due to differing notions of update. As a result, strong equivalence does not always carry over from one formalism to the other. In this paper, we carefully investigate this discrepancy and introduce a new notion of strong equivalence for logic programs. Our approach preserves strong equivalence under translation between certain classes of logic programs and both Dung-style and claim-augmented argumentation frameworks, thus restoring compatibility across these formalisms.

Chinese Translation

知识库之间的强等价性确保在任何给定的上下文中，可以用一个替换另一个而不影响推理结果。这使得它成为非单调形式主义中的一个关键属性。特别是，逻辑编程和抽象论证领域提供了这一属性受到广泛研究的主要示例。然而，尽管（类）逻辑程序和抽象论证框架在静态环境中被认为是语义等价的，但由于更新概念的不同，这种一致性在动态环境中会破裂。因此，强等价性并不总是能够从一种形式主义转移到另一种形式主义。在本文中，我们仔细研究了这一差异，并为逻辑程序引入了一种新的强等价性概念。我们的方法在某些类逻辑程序与Dung风格和增强主张的论证框架之间的转换中保持强等价性，从而恢复了这些形式主义之间的兼容性。


cs.AI / 75 / 2605.14723

Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model

通过与临床世界模型交互在大型语言模型中实现患者动态的智能化

Wu, Minghao, Yan, Yuting, Cai, Zhenyang, Ji, Ke, Fang, Chuangsen, Sheng, Ziying, Wang, Xidong, Wang, Rongsheng, Zhang, Hejia, Li, Shuang, Wang, Benyou, Zha, Hongyuan

Abstract

Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action-conditioned patient dynamics. We introduce SepsisAgent, a world model-augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid--vasopressor interventions, and follows a propose--simulate--refine workflow before committing to a prescription. We first show that world-model access alone yields inconsistent LLM decision performance, motivating agent-specific training. We then train SepsisAgent through a three-stage curriculum: patient-dynamics supervised fine-tuning, propose--simulate--refine behavior cloning, and world-model-based agentic reinforcement learning. On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.

Chinese Translation

重症监护室中的脓毒症管理需要在快速变化的患者生理状态下进行连续的治疗决策。尽管大型语言模型（LLMs）编码了广泛的临床知识并能够推理指导方针，但它们并不固有地基于以行动为条件的患者动态。我们提出了SepsisAgent，这是一种增强世界模型的LLM代理，用于脓毒症治疗推荐。SepsisAgent使用学习到的临床世界模型来模拟在候选液体-血管加压药干预下患者的反应，并在做出处方之前遵循提议-模拟-细化的工作流程。我们首先展示了仅通过世界模型访问会导致LLM决策性能不一致，这促使了特定于代理的训练。然后，我们通过三个阶段的课程训练SepsisAgent：患者动态监督微调、提议-模拟-细化行为克隆，以及基于世界模型的代理强化学习。在MIMIC-IV脓毒症轨迹上，SepsisAgent在离线价值上超越了所有传统的强化学习和基于LLM的基线，同时在遵循指导方针和不安全行为指标下实现了最佳的安全性分析。进一步的分析表明，与临床世界模型的反复交互使代理能够学习患者演变中的规律，即使在移除模拟器访问时，这些规律仍然有用。


cs.AI / 76 / 2605.14754

XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition

XDomainBench：诊断高维科学知识组合中的推理崩溃

Zhiren, Gong, Wu, Tiantong, Zhang, Jiaming, Zhang, Fuyao, Wang, Che, Hao, Yurong, Hou, Yikun, Ping, Foo, Zhao, Yilei, Huang, Fei, Yuen, Chau, Lim, Wei Yang Bryan

Abstract

Large Language Models (LLMs) are increasingly deployed for knowledge synthesis, yet their capacity for compositional generalization in scientific knowledge remains under-characterized. Existing benchmarks primarily focus on single-turn restricted scenarios, failing to capture the capability boundaries exposed by real-world interactive scientific workflows. To address this, we introduce XDomainBench, a diagnostic benchmark for interactive interdisciplinary scientific reasoning. We formalize the composition order and mixture structure to enable systematic stress-testing from single-discipline to inter-disciplinary, comprising 8,598 interactive sessions across 20 domains and 4 task categories, with 8 realistic trajectory patterns covering difficulty and domain-mixture dynamics, simulating real AI4S scenarios. Large-scale evaluation of LLMs reveals a systematic reasoning collapse as composition order increases, stemming from two root causes: (i) direct difficulty increases induced by domain composition, and (ii) indirect interaction-amplified failures where trajectory patterns trigger error accumulation, reasoning breaks, and domain confusion, ultimately leading to session collapse.

Chinese Translation

大型语言模型（LLMs）越来越多地用于知识综合，但它们在科学知识中的组合泛化能力仍未得到充分表征。现有基准主要集中于单轮限制场景，未能捕捉到真实世界互动科学工作流所暴露的能力边界。为了解决这一问题，我们引入了XDomainBench，这是一个用于互动跨学科科学推理的诊断基准。我们形式化了组合顺序和混合结构，以便从单一学科到跨学科进行系统的压力测试，涵盖了20个领域和4个任务类别的8,598个互动会话，包含8种现实的轨迹模式，涵盖难度和领域混合动态，模拟真实的AI4S场景。对LLMs的大规模评估揭示了随着组合顺序的增加，推理系统性崩溃的现象，这源于两个根本原因：（i）由领域组合引起的直接难度增加，以及（ii）间接的交互放大失败，其中轨迹模式触发错误累积、推理中断和领域混淆，最终导致会话崩溃。


cs.AI / 77 / 2605.14758

Probabilistic Verification of Recurrent Neural Networks for Single and Multi-Agent Reinforcement Learning

单一及多智能体强化学习中递归神经网络的概率验证

Marzari, Luca, Marchesini, Enrico

Abstract

History-dependent policies induced by recurrent neural networks (RNNs) rely on latent hidden state dynamics, making verification in partially observable reinforcement learning (RL) challenging. Existing RNN verification tools typically rely on restrictive modeling assumptions or coarse over-approximations of the hidden state space, which can lead to overly conservative or inconclusive results. We propose $\textbf{RNN}$ $\textbf{Pro}$babilistic $\textbf{Ve}$rification ($\texttt{RNN-ProVe}$), a probabilistic framework that $\textit{estimates the likelihood}$ of undesired behaviors in RNN-based policies. $\texttt{RNN-ProVe}$ uses policy-driven sampling to approximate the set of hidden states that are feasible under a trained policy, and derives statistical error bounds to produce bounded-error, high-confidence estimates of behavioral violations. Experiments on partially observable single-agent and cooperative multi-agent tasks show that $\texttt{RNN-ProVe}$ yields more quantitative, feasibility-aware probabilistic guarantees than existing tools, while scaling to recurrent and multi-agent settings.
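To make "bounded-error, high-confidence estimates" concrete, the sketch below uses Hoeffding's inequality, a standard way to turn i.i.d. samples into such a guarantee. Which statistical bound RNN-ProVe actually derives is not stated in the abstract, so the choice of Hoeffding, the function names, and the i.i.d. sampling assumption are all illustrative.

```python
import math


def hoeffding_samples(epsilon: float, delta: float) -> int:
    """Samples needed so that |p_hat - p| <= epsilon with probability >= 1 - delta.

    Uses the two-sided Hoeffding bound for i.i.d. Bernoulli outcomes:
    n >= ln(2 / delta) / (2 * epsilon^2).
    """
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def hoeffding_half_width(n: int, delta: float) -> float:
    """Error bound epsilon achieved by n i.i.d. samples at confidence 1 - delta."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def estimate_violation_rate(violations: list[bool], delta: float) -> tuple[float, float]:
    """Return (empirical violation rate, Hoeffding half-width) for sampled rollouts."""
    p_hat = sum(violations) / len(violations)
    return p_hat, hoeffding_half_width(len(violations), delta)
```

Under these assumptions, estimating a violation probability to within 0.05 at 99% confidence takes `hoeffding_samples(0.05, 0.01)` sampled hidden-state/observation pairs, independent of the dimension of the hidden state space.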

Chinese Translation

由递归神经网络（RNN）引发的历史依赖策略依赖于潜在的隐藏状态动态，这使得在部分可观察的强化学习（RL）中进行验证变得具有挑战性。现有的RNN验证工具通常依赖于限制性的建模假设或对隐藏状态空间的粗略过度近似，这可能导致过于保守或不确定的结果。我们提出了$\textbf{RNN}$ $\textbf{Pro}$babilistic $\textbf{Ve}$rification（$\texttt{RNN-ProVe}$），这是一个概率框架，能够$\textit{估计}$基于RNN的策略中不希望出现的行为的可能性。$\texttt{RNN-ProVe}$使用策略驱动的采样来近似在训练策略下可行的隐藏状态集合，并推导出统计误差界限，以产生有界误差、高置信度的行为违规估计。在部分可观察的单智能体和合作多智能体任务上的实验表明，$\texttt{RNN-ProVe}$提供了比现有工具更具定量性、考虑可行性的概率保证，同时能够扩展到递归和多智能体环境。


cs.AI / 78 / 2605.14761

AI Outperforms Humans in Personalized Image Aesthetics Assessment via LLM-Based Interviews and Semantic Feature Extraction

基于大语言模型访谈和语义特征提取的个性化图像美学评估中，人工智能优于人类

Abe, Yoshia, Daikoku, Tatsuya, Kuniyoshi, Yasuo

Abstract

Accurately predicting individual aesthetic evaluation for images is a fundamental challenge for AI. Various deep learning (DL)-based models have been proposed for this task, training on image evaluation data to extract objective low-level features. However, aesthetic preferences are inherently subjective and individual-dependent. Accurate prediction thus requires the extraction of high-level semantic features of images and the active collection of preference information from the target individual. To address this issue, we focus on the utility of Large Language Models (LLMs) pretrained on vast amounts of textual data, and develop an integrated DL-LLM system. The system actively elicits aesthetic preferences through LLM-based semi-structured interviews and predicts aesthetic evaluation by leveraging both low-level and high-level features. In our experiments, we compare the proposed system against conventional systems, human predictors, and the target individual's own re-evaluations after a certain time interval. Our results show that the proposed system outperforms all of them, with particularly strong performance on highly-rated images. Moreover, the prediction error of the proposed system is smaller than within-person variability, while human predictors show the largest error, likely due to the influence of their own aesthetic values. These results suggest that AI may be better positioned than others or one's future self to capture individual aesthetic preferences at a given point. This opens a new question of whether AI could serve as a deeper interpreter of human aesthetic sensibility than humans themselves.

Chinese Translation

准确预测个体对图像的美学评价是人工智能面临的一个基本挑战。为此，提出了多种基于深度学习（DL）模型的方案，这些模型通过训练图像评价数据来提取客观的低级特征。然而，美学偏好本质上是主观的且依赖于个体。因此，准确的预测需要提取图像的高级语义特征，并主动收集目标个体的偏好信息。为了解决这一问题，我们关注于在大量文本数据上预训练的大语言模型（LLMs）的实用性，并开发了一个集成的DL-LLM系统。该系统通过基于LLM的半结构化访谈主动引导美学偏好，并通过利用低级和高级特征来预测美学评价。在我们的实验中，我们将所提系统与传统系统、人类预测者以及目标个体在一定时间间隔后的重新评估进行了比较。结果表明，所提系统在所有对比中表现优越，尤其在高评分图像上表现尤为突出。此外，所提系统的预测误差小于个体内变异性，而人类预测者的误差最大，这可能是由于他们自身美学价值观的影响。这些结果表明，人工智能在捕捉个体美学偏好方面可能比其他人或未来的自我更具优势。这引发了一个新的问题，即人工智能是否能够作为人类美学敏感性的更深层次解读者。


cs.AI / 79 / 2605.14771

MediaClaw: Multimodal Intelligent-Agent Platform Technical Report

MediaClaw：多模态智能代理平台技术报告

Zhao, Shaoan, Gao, Huanlin, Hui, Qiang, Lu, Ting, Guo, Xueqiang, Li, Yantao, Su, Xinpei, Shi, Fuyuan, Tan, Chao, Zhao, Fang, Wang, Kai, Lian, Shiguo

Abstract

MediaClaw is a multimodal agent platform built on the OpenClaw ecosystem. Its core design follows a three-layer architecture of unified abstraction, pluginized extension, and workflow orchestration. The system is intended to address practical deployment pain points in AIGC adoption, including fragmented capabilities, heterogeneous interfaces, disconnected production processes, and limited reuse of high-quality production workflows. MediaClaw abstracts full-category AIGC capabilities into a unified invocation model, uses plugins to support hot-pluggable capability expansion, and uses task-oriented Skills to turn complex production processes into reusable workflow assets. This report focuses on the architectural design philosophy of MediaClaw, the design logic of its core capability model, and the key engineering trade-offs in implementation. It aims to provide reusable practical reference for building multimodal capability platforms.

Chinese Translation

MediaClaw 是一个基于 OpenClaw 生态系统构建的多模态代理平台。其核心设计遵循统一抽象、插件化扩展和工作流编排的三层架构。该系统旨在解决 AIGC（人工智能生成内容）应用中的实际部署痛点，包括能力碎片化、异构接口、生产过程断联以及高质量生产工作流的有限重用。系统将全类别 AIGC 能力抽象为统一调用模型，使用插件支持热插拔能力扩展，并利用面向任务的技能将复杂的生产过程转化为可重用的工作流资产。本报告重点介绍 MediaClaw 的架构设计理念、核心能力模型的设计逻辑以及实施中的关键工程权衡，旨在为构建多模态能力平台提供可重用的实用参考。


cs.AI / 80 / 2605.14774

Identifying Culprits Through Deep Deterministic Policy Gradient Deep Learning Investigation

通过深度确定性策略梯度深度学习调查识别罪犯

T, Lata B, J, Savitha N

Abstract

In AI-assisted investigation, identifying a crime or a criminal remains a major problem. Conventional approaches to criminal investigation usually rely on limited data analysis. The challenge is to find an efficient method that identifies criminals from complex datasets while minimising false positives and false negatives. This paper presents an approach based on the Deep Deterministic Policy Gradient (DDPG) deep learning algorithm. We train the DDPG model on a dataset of crime scene material, witness statements and suspect profiles. The algorithm uses features to maximise the likelihood of identifying the offender while minimising the impact of noise and irrelevant data. We show the efficacy of the proposed method: DDPG identified criminals with a reported accuracy of 95%, higher than several existing methods.

Chinese Translation

在人工智能和先进技术的世界中，犯罪或罪犯的识别是一个主要问题。本研究关注传统的刑事调查方式，这些方式通常依赖于有限的数据分析。寻找一种最佳且高效的方法，以有效识别复杂数据集中的罪犯并最小化假阳性和假阴性被视为一项挑战。本文提出的主要新颖方法基于深度学习算法深度确定性策略梯度（Deep Deterministic Policy Gradient, DDPG）。我们使用犯罪现场材料、证人陈述和嫌疑人档案的数据集对DDPG模型进行训练。该算法利用特征最大化识别罪犯的可能性，同时最小化噪声影响和无关数据。我们展示了所提方法的有效性，其中DDPG以95%的惊人准确率识别罪犯，优于其他几种现有方法。


cs.AI / 81 / 2605.14802

A Heterogeneous Temporal Memory Governance Framework for Long-Term LLM Persona Consistency

一种异构时间记忆治理框架用于长期大型语言模型个性一致性

Yang, Zhao, Huan, Wang, Yingshuo, Li, Haomiao, Tu, Hujite, Lin

Abstract

Large language models often suffer from fact loss, timeline confusion, persona drift, and reduced stability during long-range interaction, especially under high-noise knowledge bases, context clearing, and cross-model transfer. To address these issues, we introduce ARPM, an external temporal memory governance framework for long-term dialogue. ARPM separates static knowledge memory from dynamic dialogue experience memory and combines vector retrieval, BM25, RRF fusion, dual-temporal reranking, chronological evidence reading, and a controlled analysis protocol for evidence verification and answer binding. Unlike approaches that encode persona consistency into model weights or rely only on long context, ARPM treats continuity as a traceable, auditable, and transferable governance problem. Using engineering logs, we conduct three experiments. First, in a 50-round question-answering setting, we compare signal-to-noise ratios of 1:5 and 1:200+, and distinguish CSV auto-judgment from manual review. Under 1:5, CSV recall accuracy is 54.0%, while manual review raises it to 100.0%. Under 1:200+, the values are 44.0% and 80.0%. These results show that automatic rules can underestimate recall after supporting evidence enters the prompt. Second, ablation results show that dialogue history retrieval is necessary for recent continuity: disabling it reduces strict accuracy from 100% to 66.7%, and disabling BM25 reduces it to 80.0%, indicating that pure semantic retrieval is insufficient for correction and tracing. Third, under a 5.1-million-character noise substrate, periodic context clearing, and multi-model handoff, ARPM maintains semantic continuity, boundary continuity, and persona consistency, while exposing limits caused by weak protocol compliance. These findings show that long-term persona consistency can be decomposed into governable components and evaluated in a white-box manner.
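Of the retrieval components listed above, RRF (reciprocal rank fusion) is the step that merges the vector-retrieval and BM25 rankings into one list. A minimal sketch of the standard RRF formula follows; the constant `k = 60` is the conventional default from the RRF literature, and both it and the tie-breaking rule are assumptions here, since the abstract does not give ARPM's settings.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. Ties are broken by document id for determinism.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from a dense retriever and from BM25.
vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Because RRF uses only ranks, it needs no score normalisation between the dense and lexical retrievers, which is one common reason to prefer it over weighted score sums.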

Chinese Translation

大型语言模型在长时间交互中常常面临事实丢失、时间线混淆、个性漂移和稳定性降低的问题，尤其是在高噪声知识库、上下文清除和跨模型转移的情况下。为了解决这些问题，我们提出了ARPM，一个用于长期对话的外部时间记忆治理框架。ARPM将静态知识记忆与动态对话经验记忆分开，并结合向量检索、BM25、RRF融合、双时间重排序、时间证据阅读以及证据验证和答案绑定的控制分析协议。与将个性一致性编码到模型权重中或仅依赖长上下文的方法不同，ARPM将连续性视为一个可追溯、可审计和可转移的治理问题。通过工程日志，我们进行了三项实验。首先，在50轮问答设置中，我们比较了1:5和1:200+的信噪比，并区分了CSV自动判断与人工审核。在1:5的情况下，CSV召回准确率为54.0%，而人工审核将其提高到100.0%。在1:200+的情况下，值分别为44.0%和80.0%。这些结果表明，自动规则在支持证据进入提示后可能低估召回率。其次，消融实验结果表明，对话历史检索对于最近的连续性是必要的：禁用它会将严格准确率从100%降低到66.7%，禁用BM25则将其降低到80.0%，这表明纯语义检索不足以进行纠正和追踪。第三，在一个510万字符的噪声基底下，定期上下文清除和多模型交接，ARPM保持了语义连续性、边界连续性和个性一致性，同时暴露了由于协议遵循不足而造成的局限性。这些发现表明，长期个性一致性可以分解为可治理的组件，并以白盒方式进行评估。


cs.AI / 82 / 2605.14831

Interestingness as an Inductive Heuristic for Future Compression Progress

有趣性作为未来压缩进展的归纳启发式

Herrmann, Vincent, Schmidhuber, Jürgen

Abstract

One of the bottlenecks on the way towards recursively self-improving systems is the challenge of interestingness: the ability to prospectively identify which tasks or data hold the potential for future progress. We formalize interestingness as an inductive heuristic for future compression progress and investigate its predictability using tools from Kolmogorov Complexity and Algorithmic Statistics. By analyzing complexity-runtime profiles under Length, Algorithmic, and Speed priors, we demonstrate that the inductive property of interestingness -- the capacity for past progress to signal future discovery -- is theoretically viable and empirically supported. We prove that expected future progress depends exponentially on the recency of the last observed breakthrough. Furthermore, we show that the Algorithmic Prior is significantly more optimistic than the Length Prior, yielding a quadratic increase in expected discovery for the same observed profile. These findings are experimentally confirmed across three diverse universal computational paradigms.

Chinese Translation

在递归自我改进系统的发展过程中，有趣性是一个瓶颈：即前瞻性地识别哪些任务或数据具有未来进展潜力的能力。我们将有趣性形式化为未来压缩进展的归纳启发式，并利用科尔莫哥洛夫复杂性和算法统计学的工具研究其可预测性。通过分析在长度、算法和速度先验下的复杂性-运行时间特征，我们证明了有趣性的归纳特性——过去的进展能够指示未来的发现——在理论上是可行的，并得到了实证支持。我们证明了预期的未来进展与最后一次观察到的突破的时间间隔呈指数关系。此外，我们还展示了算法先验显著比长度先验更为乐观，在相同的观察特征下，预期发现呈现出二次增长。这些发现通过三种不同的通用计算范式得到了实验验证。


cs.AI / 83 / 2605.14833

Emotion-Attended Stateful Memory (EASM): The Architecture for Hyper-Personalization at Scale

情感关注的有状态记忆（EASM）：大规模超个性化的架构

Kotecha, Vineet, Gupta, Vansh

Abstract

Current language model systems remain fundamentally stateless across sessions, limiting their ability to personalize interactions over time. While retrieval-augmented generation and fine-tuning improve knowledge access and domain capability, they do not enable persistent understanding of individual users. We propose an emotion-attended stateful memory architecture that dynamically constructs user-specific conversational context using long-term history, emotional signals, and inferred intent at inference time. To evaluate its impact, we conducted a controlled A/B study across thirty non-scripted conversations spanning six emotionally distinct categories using the same underlying language model in both conditions. The memory-enriched condition consistently outperformed the stateless baseline across all evaluated scenarios. The largest gains were observed in memory grounding (95% improvement), plan clarity (57%), and emotional validation (34%). Results remained consistent even in emotionally adversarial conversations involving grief, distress, and uncertainty. These findings suggest that stateful emotional memory may represent a foundational infrastructure layer for hyper-personalized AI systems, though broader validation across larger and more diverse evaluations remains necessary.

Chinese Translation

当前的语言模型系统在会话中基本上是无状态的，这限制了它们随时间个性化交互的能力。虽然检索增强生成和微调提高了知识获取和领域能力，但它们并未实现对个体用户的持续理解。我们提出了一种情感关注的有状态记忆架构，该架构在推理时动态构建用户特定的对话上下文，利用长期历史、情感信号和推断意图。为了评估其影响，我们在三十个非脚本化的对话中进行了受控的A/B研究，这些对话跨越六个情感上明显不同的类别，并在两种条件下使用相同的基础语言模型。记忆增强条件在所有评估场景中始终优于无状态基线。在记忆基础（95% 的提升）、计划清晰度（57%）和情感验证（34%）方面观察到了最大的增益。即使在涉及悲伤、痛苦和不确定性的情感对抗性对话中，结果也保持一致。这些发现表明，有状态的情感记忆可能代表超个性化人工智能系统的基础设施层，尽管在更大和更多样化的评估中仍需更广泛的验证。


cs.AI / 84 / 2605.14857

A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions

一种确定性代理工作流用于HS关税分类：具有可解释决策的多维规则推理

Zhang, Yu, Zhuang, Dongjiang, Zhou, Qu, Huang, Zheng, Wu, Junhe, Cao, Jing, Chen, Kai

Abstract

Harmonized System (HS) tariff classification is a high-stakes, expert-level task in which a free-form product description must be mapped to a specific six- or eight-digit code under the General Interpretive Rules (GIR), section notes, chapter notes, and Explanatory Notes. The difficulty lies not in knowledge volume but in *multi-dimensional rule reasoning*: a correct classification must satisfy competing priority rules along several axes simultaneously, including material, form, function, essential character, the part-versus-whole boundary, and specific listing versus residual headings. End-to-end prompting of large language models fails characteristically by resolving one axis while ignoring the priority constraints on the others. We present a *deterministic agentic workflow* in contrast to self-planning agents: the control flow is fixed, language model calls are confined to narrow stages, and reflection and verification are retained as local mechanisms. This design yields interpretability by construction--each decision is decomposed into stage-wise structured outputs with verbatim citation of the chapter or section notes that bear on it. The architecture combines offline knowledge-engineering of the Chinese HS tariff with an online six-stage pipeline. Evaluated on HSCodeComp at the six-digit level, the workflow reaches 75.0% top-1 and 91.5% top-3 at four digits, and 64.2% top-1 and 78.3% top-3 at six digits with Qwen3.6-plus; an open-weight Qwen3.6-27B-FP8 backbone in non-thinking mode achieves 84.2% four-digit and 77.4% six-digit top-1 agreement with the frontier model. A two-stage manual audit of 226 six-digit disagreements suggests that a non-trivial fraction of HSCodeComp ground-truth labels may deviate from HS general rules; full adjudication records are released in the appendix as preliminary findings for community review.

Chinese Translation

协调系统（HS）关税分类是一项高风险的专家级任务，其中自由格式的产品描述必须根据一般解释规则（GIR）、类注、章注和解释性说明映射到特定的六位或八位代码。其难点不在于知识量，而在于*多维规则推理*：正确的分类必须同时满足多个维度上的竞争优先规则，包括材料、形式、功能、基本特征、部分与整体的边界，以及特定列表与剩余标题的对比。大型语言模型的端到端提示通常通过解决一个维度而忽略其他维度的优先约束，从而表现不佳。我们提出了一种*确定性代理工作流*，与自我规划代理形成对比：控制流是固定的，语言模型的调用被限制在狭窄的阶段内，反思和验证被保留为局部机制。这种设计通过构建实现了可解释性——每个决策被分解为阶段性结构化输出，并逐字引用与之相关的章节或节说明。该架构结合了中国HS关税的离线知识工程和在线六阶段管道。在六位数字的HSCodeComp上进行评估时，该工作流在四位数字上达到75.0%的top-1和91.5%的top-3，在六位数字上达到64.2%的top-1和78.3%的top-3，使用Qwen3.6-plus；在非思考模式下，开放权重的Qwen3.6-27B-FP8骨干网络与前沿模型的四位数字和六位数字top-1一致性分别达到84.2%和77.4%。对226个六位数字不一致的两阶段人工审核表明，HSCodeComp的真实标签中可能有相当一部分偏离HS一般规则；完整的裁决记录作为初步发现已在附录中发布，以供社区审阅。


cs.AI / 85 / 2605.14865

Holistic Evaluation and Failure Diagnosis of AI Agents

AI代理的整体评估与故障诊断

Madvil, Netta, Dym, Gilad, Mecilati, Alon, Dekel, Edo, Liberman, Jonatan, Brazilay, Rotem, Schliesser, Liron, Svidlo, Max, Nir, Shai, Shalom, Orel, Friedman, Yaron, Connack, David, Rimon, Amos, Tannor, Philip, Chorev, Shir

Abstract

AI agents execute complex multi-step processes, but current evaluation falls short: outcome metrics report success or failure without explaining why, and process-level approaches struggle to connect failure types to their precise locations within long, structured traces. We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. This decomposition scales to traces of arbitrary length and produces span-level rationales for each verdict. On the TRAIL benchmark, our framework achieves state-of-the-art results across all metrics on both GAIA and SWE-Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy. Per-category analysis shows our framework leading in more error categories than any other evaluator. Notably, the same frontier model achieves several times higher localization accuracy when used inside our framework than as a monolithic judge over the full trace, showing that evaluation methodology, not model capability, is the bottleneck.

Chinese Translation

AI代理执行复杂的多步骤过程，但当前的评估方法存在不足：结果指标仅报告成功或失败，而未能解释原因，且过程级方法难以将故障类型与长且结构化的追踪中的精确位置联系起来。我们提出了一种整体代理评估框架，该框架将自上而下的代理级诊断与自下而上的跨度级评估相结合，将分析分解为独立的每个跨度评估。这种分解可扩展到任意长度的追踪，并为每个裁决生成跨度级的理由。在TRAIL基准测试中，我们的框架在GAIA和SWE-Bench的所有指标上均取得了最先进的结果，相较于最强的先前基线，在类别F1上相对提升高达38%，在定位准确性上提升高达3.5倍，在联合定位-分类准确性上提升高达12.5倍。按类别分析显示，我们的框架在更多错误类别上领先于其他评估者。值得注意的是，当在我们的框架内使用相同的前沿模型时，其定位准确性比作为整体评判者评估完整追踪时高出数倍，这表明评估方法而非模型能力是瓶颈。


cs.AI / 86 / 2605.14886

BiFedKD: Bidirectional Federated Knowledge Distillation Framework for Non-IID and Long-Tailed ECG Monitoring

BiFedKD：用于非独立同分布和长尾心电监测的双向联邦知识蒸馏框架

Shu, Zixuan, Cao, Tiancheng, Huang, Hen-Wei

Abstract

Electrocardiogram (ECG) monitoring in Internet of Medical Things (IoMT) networks is constrained by strict data-sharing regulations and privacy concerns. Federated learning (FL) enables collaborative learning by keeping raw ECG data on devices, but frequent transmissions of high-dimensional model updates incur heavy per-round traffic over bandwidth-limited links. To alleviate this bottleneck, federated distillation (FD) replaces parameter exchange with logit-based knowledge transfer. However, the performance of FD often degrades under the non-independent and identically distributed (non-IID) and long-tailed label distributions in ECG deployments. To address these challenges, we propose a bidirectional federated knowledge distillation (BiFedKD) framework that employs an aggregation-by-distillation pipeline with temperature scaling to produce a stable global distillation signal for cross-client alignment. Experiments on the MIT-BIH Arrhythmia dataset show that BiFedKD improves accuracy and Macro-F1 over the baseline by $3.52\%$ and $9.93\%$, respectively. Moreover, to reach the same Macro-F1, BiFedKD reduces communication overhead by $40\%$ and computation cost by $71.7\%$ compared with the baseline.
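The logit-based knowledge transfer with temperature scaling mentioned above can be sketched as follows. This is a generic illustration of federated distillation, not BiFedKD's actual aggregation-by-distillation pipeline: averaging the clients' softened probabilities, the default temperature, and the function names are assumptions.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Softened class probabilities; a higher temperature gives a flatter distribution."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_soft_labels(
    client_logits: Sequence[Sequence[float]], temperature: float = 2.0
) -> List[float]:
    """Average the clients' temperature-softened predictions for one shared sample.

    The result is a global distillation target that each client could be
    trained to match, in place of exchanging model parameters.
    """
    probs = [softmax_with_temperature(logits, temperature) for logits in client_logits]
    return [sum(column) / len(probs) for column in zip(*probs)]
```

Each client sends one vector of class logits per shared sample, so the per-round payload scales with the number of classes rather than with the number of model parameters.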

Chinese Translation

在医疗物联网（IoMT）网络中，心电图（ECG）监测受到严格的数据共享法规和隐私问题的限制。联邦学习（FL）通过将原始ECG数据保留在设备上实现协作学习，但频繁传输高维模型更新会在带宽有限的链接上产生沉重的每轮流量。为了缓解这一瓶颈，联邦蒸馏（FD）用基于对数几率的知识转移替代参数交换。然而，在ECG部署中，FD的性能在非独立同分布（non-IID）和长尾标签分布下往往会下降。为了解决这些挑战，我们提出了一种双向联邦知识蒸馏（BiFedKD）框架，该框架采用温度缩放的蒸馏聚合管道，以产生稳定的全局蒸馏信号，实现跨客户端的对齐。对MIT-BIH心律失常数据集的实验表明，BiFedKD在准确性和宏观F1分数上分别比基线提高了$3.52\%$和$9.93\%$。此外，为了达到相同的宏观F1分数，BiFedKD相比基线减少了$40\%$的通信开销和$71.7\%$的计算成本。


cs.AI / 87 / 2605.14892

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

超越个体智能：对基于大语言模型的多智能体系统中的协作、失败归因和自我演化的调查

Qi, Shihao, Ma, Jie, Xing, Rui, Guo, Wei, Huang, Xiao, Gao, Zhitao, Deng, Jianhao, Liu, Jun, Zhang, Lingling, Wei, Bifan, Yang, Boqian, Wang, Pinghui, Sun, Jianwen, Tao, Jing, Wu, Yaqiang, Liu, Hui, Yao, Yu, Liu, Tongliang

Abstract

LLM-based autonomous agents have demonstrated strong capabilities in reasoning, planning, and tool use, yet remain limited when tasks require sustained coordination across roles, tools, and environments. Multi-agent systems address this through structured collaboration among specialized agents, but tighter coordination also amplifies a less explored risk: errors can propagate across agents and interaction rounds, producing failures that are difficult to diagnose and rarely translate into structural self-improvement. Existing surveys cover individual agent capabilities, multi-agent collaboration, or agent self-evolution separately, leaving the causal dependencies among them unexamined. This survey provides a unified review organized around four causally linked stages, which we term the LIFE progression: Lay the capability foundation, Integrate agents through collaboration, Find faults through attribution, and Evolve through autonomous self-improvement. For each stage, we provide systematic taxonomies and formally characterize the dependencies between adjacent stages, revealing how each stage both depends on and constrains the next. Beyond synthesizing existing work, we identify open challenges at stage boundaries and propose a cross-stage research agenda for closed-loop multi-agent systems capable of continuously diagnosing failures, reorganizing structures, and refining agent behaviors, extending current coordination frameworks toward more self-organizing forms of collective intelligence. By bridging these previously fragmented research threads, this survey aims to offer both a systematic reference and a conceptual roadmap toward autonomous, self-improving multi-agent intelligence.

Chinese Translation

基于大语言模型（LLM）的自主智能体在推理、规划和工具使用方面展现出强大的能力，但在任务需要跨角色、工具和环境的持续协调时仍然存在局限性。多智能体系统通过专门智能体之间的结构化协作来解决这一问题，但更紧密的协调也放大了一个较少被探讨的风险：错误可能在智能体和交互轮次之间传播，导致难以诊断的失败，并且很少转化为结构性的自我改进。现有的调查分别涵盖了个体智能体能力、多智能体协作或智能体自我演化，未能探讨它们之间的因果依赖关系。本调查提供了一个统一的回顾，围绕四个因果关联的阶段进行组织，我们称之为LIFE进程：奠定能力基础（Lay the capability foundation）、通过协作整合智能体（Integrate agents through collaboration）、通过归因发现故障（Find faults through attribution）以及通过自主自我改进演化（Evolve through autonomous self-improvement）。对于每个阶段，我们提供系统的分类法，并正式描述相邻阶段之间的依赖关系，揭示每个阶段如何既依赖于又限制下一个阶段。除了综合现有工作外，我们还识别了阶段边界上的开放挑战，并提出了一个跨阶段的研究议程，旨在实现能够持续诊断失败、重组结构和优化智能体行为的闭环多智能体系统，推动当前协调框架向更自组织的集体智能形式发展。通过桥接这些先前碎片化的研究线索，本调查旨在提供一个系统的参考和概念路线图，以实现自主、自我改进的多智能体智能。

cs.AI / 88 / 2605.14900

COREKG: Coreset-Guided Personalized Summarization of Knowledge Graphs

COREKG：基于核心集的个性化知识图谱摘要

Khan, Sohel Aman, Mutharaju, Raghava, Shit, Supratim

Abstract

Knowledge Graphs (KGs) are extensively used across different domains and in several applications. Often, these KGs are very large in size. Such KGs become unwieldy for tasks such as question answering and visualization. Summarization of KGs offers a viable alternative in such cases. Furthermore, personalized KG summarization is crucial in the current data-driven world as it captures the specific requirements of users based on their query patterns. Since it only maintains relevant information, the personalized summaries of KG are small, resulting in significantly smaller storage requirements and query runtime. In this work, we adapt the coreset theory to create personalized KG summaries. For a given dataset and a user-specific query workload, we present an approach that samples a relevant subset of triples using sensitivity-based importance sampling. We ensure that the subset approximates the characteristics of the full dataset with bounded approximation error. We define sensitivity scores that measure the importance of a triple with respect to a user's query workload, which are then used by our coreset construction algorithm. We explicitly focus on personalized knowledge graph summarization by constructing summaries independently for each user based on their query behaviour. Our evaluation on Freebase, WikiData, and DBpedia shows that COREKG delivers higher query-answering accuracy and structural coverage than the state-of-the-art methods, such as GLIMPSE, PPR, iSummary, PEGASUS and APEX$^2$ while requiring only a tiny fraction of the original graph.
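
The sampling step in the abstract above can be illustrated with a minimal sketch of generic sensitivity-based importance sampling. This is not COREKG's actual algorithm: the function name `sample_coreset`, the toy triples, and the sensitivity values are assumptions made for illustration, and the abstract does not say how sensitivities are derived from the query workload.

```python
import random

def sample_coreset(triples, sensitivity, m, seed=0):
    """Draw m triples with probability proportional to their sensitivity.

    Each sampled triple carries the importance weight 1 / (m * p_i), which
    keeps weighted sums over the sample unbiased estimates of the full sum.
    """
    rng = random.Random(seed)
    total = sum(sensitivity)
    probs = [s / total for s in sensitivity]
    picks = rng.choices(range(len(triples)), weights=probs, k=m)
    return [(triples[i], 1.0 / (m * probs[i])) for i in picks]

# Hypothetical triples and per-user sensitivity scores (illustrative only).
triples = [
    ("Paris", "capitalOf", "France"),
    ("France", "memberOf", "EU"),
    ("Paris", "population", "2.1M"),
]
sens = [0.6, 0.3, 0.1]
coreset = sample_coreset(triples, sens, m=2)
```

Triples that matter more to the workload are sampled more often but receive smaller weights, which is what lets a small sample stand in for the full graph in weighted aggregates.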

Chinese Translation

知识图谱（KGs）在不同领域和多种应用中被广泛使用。通常，这些知识图谱的规模非常庞大，使得在问答和可视化等任务中变得难以处理。在这种情况下，知识图谱的摘要提供了一种可行的替代方案。此外，个性化的知识图谱摘要在当前数据驱动的世界中至关重要，因为它能够根据用户的查询模式捕捉特定的需求。由于仅保留相关信息，个性化的知识图谱摘要体积较小，从而显著减少存储需求和查询运行时间。在本研究中，我们采用核心集理论来创建个性化的知识图谱摘要。针对给定的数据集和用户特定的查询工作负载，我们提出了一种方法，通过基于敏感度的重要性采样来抽取相关的三元组子集。我们确保该子集在有界近似误差的情况下近似于完整数据集的特征。我们定义了敏感度评分，用于衡量三元组相对于用户查询工作负载的重要性，这些评分随后被我们的核心集构建算法使用。我们明确关注个性化知识图谱摘要，通过根据每个用户的查询行为独立构建摘要。我们在Freebase、WikiData和DBpedia上的评估表明，COREKG在查询回答准确性和结构覆盖率方面优于最先进的方法，如GLIMPSE、PPR、iSummary、PEGASUS和APEX$^2$，同时仅需原始图的一小部分。

cs.AI / 89 / 2605.14907

KGPFN: Unlocking the Potential of Knowledge Graph Foundation Model via In-Context Learning

KGPFN：通过上下文学习释放知识图谱基础模型的潜力

Gao, Yisen, Bai, Jiaxin, Huang, Haoyu, Xie, Zhongwei, Li, Yufei, Tsang, Hong Ting, Han, Sirui, Song, Yangqiu

Abstract

Knowledge graph (KG) foundation models aim to generalize across graphs with unseen entities and relations by learning transferable relational structure. However, most existing methods primarily emphasize relation-level universality, while in-context learning (ICL), the other pillar of foundation models, remains under-explored for KG reasoning. In KGs, context is inherently structured and heterogeneous: effective prediction requires conditioning on the local context around the query entities as well as the global context that summarizes how a relation behaves across many instances. We propose KGPFN, a KG foundation model using a Prior-Data Fitted Network that unifies transferable relational regularities with inference-time in-context learning from structured context. KGPFN first learns relation representations via message passing on relation graphs to capture cross-graph relational invariances. For query-specific reasoning, it encodes local neighborhoods using a multi-layer NBFNet as local context. To enable ICL at global scale, it constructs relation-specific global context by retrieving a large set of instances of the query relation together with their local neighborhoods, and aggregates them within a Prior-Data Fitted Network framework that combines feature-level and sample-level attention. Through multi-graph pretraining on diverse KGs, KGPFN learns when to instantiate reusable patterns and when to override them using contextual evidence. Experiments on 57 KG benchmarks demonstrate that KGPFN achieves strong adaptation to previously unseen graphs through in-context learning alone, consistently outperforming competitive fine-tuned KG foundation models. Our code is available at https://github.com/HKUST-KnowComp/KGPFN.

Chinese Translation

知识图谱（KG）基础模型旨在通过学习可转移的关系结构，在包含未见实体和关系的图谱中进行泛化。然而，大多数现有方法主要强调关系层面的普遍性，而上下文学习（in-context learning）这一基础模型的另一支柱在知识图谱推理中仍未得到充分探索。在知识图谱中，上下文本质上是结构化且异质的：有效的预测需要依赖于查询实体周围的局部上下文以及总结关系在多个实例中行为的全局上下文。我们提出了KGPFN，一种使用Prior-data Fitted Network的知识图谱基础模型，它将可转移的关系规律与来自结构化上下文的推理时上下文学习相结合。KGPFN首先通过在关系图上进行消息传递来学习关系表示，以捕捉跨图的关系不变性。对于特定查询的推理，它使用多层NBFNet编码局部邻域作为局部上下文。为了在全局范围内实现上下文学习，它通过检索查询关系的大量实例及其局部邻域来构建关系特定的全局上下文，并在Prior-Data Fitted Network框架内聚合这些信息，该框架结合了特征级和样本级注意力。通过在多样化的知识图谱上进行多图预训练，KGPFN学习何时实例化可重用模式以及何时使用上下文证据来覆盖它们。在57个知识图谱基准上的实验表明，KGPFN仅通过上下文学习就能强有力地适应先前未见的图谱，始终优于竞争性的微调知识图谱基础模型。我们的代码可在https://github.com/HKUST-KnowComp/KGPFN获取。

cs.AI / 90 / 2605.14912

From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement

从谄媚共识到多元修复：为什么人工智能对齐必须显现分歧

Vishwarupe, Varad, Shadbolt, Nigel, Jirotka, Marina

Abstract

Pluralistic alignment is typically operationalised as preference aggregation: producing responses that span (Overton), steer toward (Steerable), or proportionally represent (Distributional) diverse human values. We argue that aggregation alone is an incomplete primitive for deployed pluralistic alignment. Under genuine value pluralism, the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus: a learned tendency to agree with, validate, and minimise friction with the immediate interlocutor. Because deployed AI systems now mediate consequential deliberation across health, civic life, labour, and governance, the collapse of disagreement at the interaction layer is not a narrow technical concern but a structural failure with distributive consequences. We reframe pluralistic alignment around three conversational mechanisms drawn from Grice's maxims: scoping (acknowledging the limits of one's perspective), signalling (surfacing value-conflict rather than smoothing it over), and repair (revising one's position on principled grounds, not on user pressure). We formalise a metric, the Pluralistic Repair Score (PRS), distinguishing principled revision from capitulation, and present a small-scale empirical illustration on two frontier RLHF-trained models (Claude Sonnet 4.5, N=198; GPT-4o, N=100) showing that, for both, agreement-following coexists with low repair-quality on contested-value prompts. PRS measures an interactional precondition for pluralism (visible disagreement; principled revision) rather than pluralism in full; we discuss the difference, take seriously the reflexive question of whose "principled" counts, and argue that pluralism is most decisively made or unmade at the deployment-governance layer: interfaces, preference-data pipelines, and audit infrastructure.

Chinese Translation

多元对齐通常被操作化为偏好聚合：产生涵盖（Overton）、引导（Steerable）或按比例代表（Distributional）多样人类价值观的响应。我们认为，仅靠聚合对于已部署的多元对齐而言是不完整的原始方法。在真正的价值多元主义下，当代基于强化学习与人类反馈（RLHF）训练的助手的失败模式不是覆盖不足，而是谄媚共识：一种学习到的倾向，即同意、验证并最小化与直接对话者之间的摩擦。由于已部署的人工智能系统现在在健康、公共生活、劳动和治理等领域中调解重要的审议，互动层面上分歧的崩溃不仅是一个狭窄的技术问题，而是一个具有分配后果的结构性失败。我们围绕格赖斯准则提出了三种对话机制重新构建多元对齐：范围界定（承认自身视角的局限性）、信号传递（显现价值冲突而不是掩盖它）和修复（基于原则的理由修订自身立场，而非用户压力）。我们形式化了一种度量标准，即多元修复评分（Pluralistic Repair Score, PRS），以区分原则性修订与屈服，并展示了在两个前沿的RLHF训练模型（Claude Sonnet 4.5, N=198；GPT-4o, N=100）上的小规模实证示例，显示对于这两个模型而言，遵循一致性与在有争议价值提示下的低修复质量共存。PRS衡量的是多元主义的一个互动前提（可见的分歧；原则性修订），而非完整的多元主义；我们讨论了二者之间的差异，认真对待“原则性”是谁的这一反思性问题，并认为多元主义在部署治理层面（接口、偏好数据管道和审计基础设施）上最为决定性地被形成或破坏。

cs.AI / 91 / 2605.14968

GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation

GraphFlow：一种可正式验证的视觉工作流架构，以实现可靠的自主智能自动化

Morris V, Drewry H., Valles, Luis, Ghomi, Reza Hosseini

Abstract

GraphFlow is a visual workflow system designed to improve the reliability of agentic AI automation in multi-step, mission-critical processes. In these workflows, small errors compound rapidly: under an idealized model of independent steps, a ten-step process with 90% per-step reliability completes successfully only 35% of the time. Existing workflow platforms provide durable execution and observability but offer few semantic correctness guarantees, while agentic systems plan at inference time, making behavior sensitive to prompt variation and difficult to audit. GraphFlow is designed to address this gap by treating workflow diagrams as the executable specification, a single artifact defining data scope, execution semantics, and monitoring. At compile time, a restricted class of diagrams is specified to produce reusable automations whose contracts (preconditions, postconditions, and composition obligations) are intended to be proof-checked before admission to a shared library. At runtime, a durable engine records outcomes in an append-only event log and can enforce contracts at system boundaries, supporting replay, retries, and audit. Swimlanes make trust boundaries explicit, separating verified logic from external systems, human judgment, and AI decisions. A year-long pilot across three clinical sites executed 8,728 cohort-enrolled workflow runs with a 97.08% completion rate under an early prototype without the verified-core subsystem; observed failures were localized primarily to external integrations. The formal semantics and proof-checked admission model described here are specified and under active development. Evaluation of the verified core is reserved for future work.
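
The 35% figure in the abstract above follows from multiplying independent per-step success probabilities. A minimal sketch of that arithmetic, where the function name `chain_success` is an illustrative assumption and not part of GraphFlow:

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n independent steps succeed."""
    return p_step ** n_steps

# Ten steps at 90% per-step reliability.
print(f"{chain_success(0.90, 10):.1%}")  # 34.9%
```

The value 0.9^10 ≈ 0.349 rounds to the 35% quoted in the abstract; the independence assumption is the idealization the abstract itself names.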

Chinese Translation

GraphFlow 是一个视觉工作流系统，旨在提高多步骤、关键任务流程中自主智能自动化的可靠性。在这些工作流中，小错误会迅速累积：在一个理想化的独立步骤模型下，十步流程在每一步的可靠性为90%时，仅有35%的概率成功完成。现有的工作流平台提供持久的执行和可观察性，但对语义正确性保证的提供较少，而自主系统在推理时进行规划，使得行为对提示变化敏感且难以审计。GraphFlow 旨在通过将工作流图视为可执行规范来填补这一空白，这是一种定义数据范围、执行语义和监控的单一工件。在编译时，指定一类受限的图形，以生成可重用的自动化，其合同（前置条件、后置条件和组合义务）旨在在被纳入共享库之前进行证明检查。在运行时，持久引擎在仅追加的事件日志中记录结果，并可以在系统边界强制执行合同，支持重放、重试和审计。泳道使信任边界明确，分离经过验证的逻辑与外部系统、人类判断和人工智能决策。为期一年的试点在三个临床地点执行了8,728个队列注册工作流运行，在没有经过验证核心子系统的早期原型下完成率为97.08%；观察到的失败主要集中在外部集成上。这里描述的形式语义和经过证明检查的入库模型已被指定并在积极开发中。对经过验证核心的评估留待未来工作。

cs.AI / 92 / 2605.14995

Explainable Detection of Depression Status Shifts from User Digital Traces

可解释的抑郁状态变化检测基于用户数字痕迹

Belcastro, Loris, Gervino, Francesco, Marozzo, Fabrizio, Talia, Domenico, Trunfio, Paolo

Abstract

Every day, users generate digital traces (e.g., social media posts, chats, and online interactions) that are inherently timestamped and may reflect aspects of their mental state. These traces can be organized into temporal trajectories that capture how a user's mental health signals evolve, including phases of improvement, deterioration, or stability. In this work, we propose an explainable framework for detecting and analyzing depression-related status shifts in user digital traces. The approach combines multiple BERT-based models to extract complementary signals across different dimensions (e.g., sentiment, emotion, and depression severity). Such signals are then aggregated over time to construct user-level trajectories that are analyzed to identify meaningful change points. To enhance interpretability, the framework integrates a large language model to generate concise and human-readable reports that describe the evolution of mental-health signals and highlight key transitions. We evaluate the framework on two social media datasets. Results show that the approach produces more coherent and informative summaries than direct LLM-based reporting, achieving higher coverage of user history, stronger temporal coherence, and improved sensitivity to change points. An ablation study confirms the contribution of each component, particularly temporal modeling and segmentation. Overall, the method provides an interpretable view of mental health signals over time, supporting research and decision making without aiming at clinical diagnosis.

Chinese Translation

用户每天生成的数字痕迹（例如社交媒体帖子、聊天记录和在线互动）本质上是带有时间戳的，可能反映出他们心理状态的某些方面。这些痕迹可以组织成时间轨迹，捕捉用户心理健康信号的演变，包括改善、恶化或稳定的阶段。在本研究中，我们提出了一个可解释的框架，用于检测和分析用户数字痕迹中与抑郁相关的状态变化。该方法结合了多个基于BERT的模型，以提取不同维度（例如情感、情绪和抑郁严重程度）中的互补信号。这些信号随后在时间上进行聚合，以构建用户级轨迹，并分析以识别有意义的变更点。为了增强可解释性，该框架整合了一个大型语言模型，以生成简明且易于人类理解的报告，描述心理健康信号的演变并突出关键转变。我们在两个社交媒体数据集上评估了该框架。结果表明，该方法生成的总结比直接基于LLM的报告更连贯和信息丰富，覆盖了用户历史的更高比例，具有更强的时间一致性，并对变更点的敏感性有所提高。一项消融研究确认了每个组件的贡献，特别是时间建模和分割。总体而言，该方法提供了心理健康信号随时间变化的可解释视角，支持研究和决策，而不以临床诊断为目标。

cs.AI / 93 / 2605.14998

Learning Developmental Scaffoldings to Guide Self-Organisation

学习发展支架以指导自我组织

Montero, Milton L., Najarro, Elias, Schauser, Jakob, Risi, Sebastian

Abstract

From subcellular structures to entire organisms, many natural systems generate complex organisation through self-organisation: local interactions that collectively give rise to global structure without any blueprint of the outcome. Yet a significant portion of the information driving such processes is not produced by self-organisation itself; instead, it is often offloaded to the initial conditions of the system. Biological development is a prime example, where maternal pre-patterns encode positional and symmetry-breaking information that scaffolds the self-organising process. From maternal morphogen gradients in early embryogenesis to tissue-level morphogenetic pre-patterns guiding organ formation, this transfer of information to initial conditions, analogous to a memory-compute trade-off in computational systems, is a fundamental part of developmental processes. In this work, we study this offloading phenomenon by introducing a model that jointly learns both the self-organisation rules and the pre-patterns, allowing their interplay to be varied and measured under controlled conditions: a Neural Cellular Automaton (NCA) paired with a learned coordinate-based pattern generator (SIREN), both trained simultaneously to generate a set of patterns. We provide information-theoretic analyses of how information is distributed between pre-patterns and the self-organising process, and show that jointly learning both components yields improvements in robustness, encoding capacity, and symmetry breaking over purely self-organising alternatives. Our analysis further suggests that effective pre-patterns do not simply approximate their targets; rather, they bias the developmental dynamics in ways that facilitate convergence, pointing to a non-trivial relationship between the structure of initial conditions and the dynamics of self-organisation.

Chinese Translation

从亚细胞结构到整个生物体，许多自然系统通过自我组织生成复杂的结构：局部相互作用共同产生全球结构，而无需任何结果蓝图。然而，驱动这些过程的相当一部分信息并不是由自我组织本身产生的，而是通常被转移到系统的初始条件中。生物发展就是一个典型例子，在这个过程中，母体预模式编码了位置和对称破缺信息，为自我组织过程提供支架。从早期胚胎发育中的母体形态发生因子梯度到指导器官形成的组织水平形态发生预模式，这种信息向初始条件的转移，类似于计算系统中的记忆-计算权衡，是发展过程的一个基本部分。在本研究中，我们通过引入一个模型来研究这一转移现象，该模型共同学习自我组织规则和预模式，使它们的相互作用在受控条件下得以变化和测量：一个神经元细胞自动机（Neural Cellular Automaton, NCA）与一个学习的基于坐标的模式生成器（SIREN）配对，同时训练以生成一组模式。我们提供信息论分析，探讨信息在预模式和自我组织过程之间的分布，并表明共同学习这两个组件在稳健性、编码能力和对称破缺方面优于纯自我组织的替代方案。我们的分析进一步表明，有效的预模式并不仅仅是近似其目标；相反，它们以促进收敛的方式偏向发展动态，指向初始条件结构与自我组织动态之间的非平凡关系。

cs.AI / 94 / 2605.15015

Small, Private Language Models as Teammates for Educational Assessment Design

小型私有语言模型作为教育评估设计的团队成员

Jaldi, Chris Davis, Saini, Anmol, Zhang, Shan, Schroeder, Noah, Shimizu, Cogan, Ilkou, Eleni

Abstract

Generative AI increasingly supports educational design tasks, e.g., through Large Language Models (LLMs), demonstrating the capability to design assessment questions that are aligned with pedagogical frameworks (e.g., Bloom's taxonomy). However, existing studies often rely on subjective or limited evaluation methods; focus primarily on proprietary models; or seldom examine generation, evaluation, or deployment constraints systematically in real educational settings. Meanwhile, Small Language Models (SLMs) have emerged as local alternatives that better address privacy and resource limitations; yet their effectiveness for assessment tasks remains underexplored. To address this gap, we systematically compare LLMs and SLMs for assessment question design; evaluate generation quality across Bloom's taxonomy levels using reproducible, pedagogically grounded metrics; and further assess model-based judging against expert-informed evaluation by analyzing reliability and agreement patterns. Results show that SLMs achieve competitive performance across key pedagogically motivated quality dimensions while enabling local, privacy-sensitive deployment. However, model-based evaluations also exhibit systematic inconsistencies and bias relative to expert ratings. These findings provide evidence to posit language models as bounded assistants in assessment workflows; underscore the necessity of Human-in-the-Loop; and advance the automated educational question generation field by examining quality, reliability, and deployment-aware trade-offs.

Chinese Translation

生成性人工智能越来越多地支持教育设计任务，例如通过大型语言模型（LLMs），展示了设计与教学框架（例如布鲁姆分类法）相一致的评估问题的能力。然而，它们通常依赖于主观或有限的评估方法；主要关注专有模型；或很少系统地考察在真实教育环境中生成、评估或部署的限制。同时，小型语言模型（SLMs）作为本地替代方案出现，更好地解决了隐私和资源限制；然而，它们在评估任务中的有效性仍然未被充分探索。为了解决这一空白，我们系统地比较了LLMs和SLMs在评估问题设计中的表现；使用可重复的、基于教学的指标评估跨越布鲁姆分类法各级的生成质量；并进一步通过分析可靠性和一致性模式来评估基于模型的判断与专家评估之间的差异。结果表明，SLMs在关键的以教学为导向的质量维度上实现了具有竞争力的表现，同时支持本地、注重隐私的部署。然而，基于模型的评估也显示出相对于专家评分的系统性不一致性和偏差。这些发现提供了证据，表明语言模型可以作为评估工作流程中的有限助手；强调了人机协作的必要性；并通过考察质量、可靠性和部署意识的权衡，推动了自动化教育问题生成领域的发展。

cs.AI / 95 / 2605.15040

Orchard: An Open-Source Agentic Modeling Framework

Orchard：一个开源的自主建模框架

Peng, Baolin, Yao, Wenlin, Wu, Qianhui, Cheng, Hao, Yu, Xiao, Yang, Rui, Ge, Tao, Sordoni, Alessandrio, Yuan, Xingdi, Shen, Yelong, He, Pengcheng, Zhang, Tong, Yu, Zhou, Gao, Jianfeng

Abstract

Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.

Chinese Translation

自主建模旨在将大型语言模型（LLMs）转变为能够通过规划、推理、工具使用和与环境的多轮交互来解决复杂任务的自主代理。尽管进行了大量投资，开放研究仍受到基础设施和训练差距的限制。许多高性能系统依赖于专有代码库、模型或服务，而大多数开源框架则专注于编排和评估，而非可扩展的代理训练。我们提出了Orchard，一个用于可扩展自主建模的开源框架。其核心是Orchard Env，一个轻量级环境服务，提供可重用的原语，用于跨任务领域、代理框架和管道阶段的沙箱生命周期管理。在Orchard Env之上，我们构建了三个自主建模配方。Orchard-SWE针对编码代理。我们从MiniMax-M2.5和Qwen3.5-397B中提取了107K轨迹，引入了信用分配的SFT（监督微调）以从未解决轨迹的有效片段中学习，并应用了平衡自适应回放（Balanced Adaptive Rollout）用于强化学习（RL）。从Qwen3-30B-A3B-Thinking开始，Orchard-SWE在经过SFT后在SWE-bench Verified上达到了64.3%，在经过SFT+RL后达到了67.5%，在可比规模的开源模型中设定了新的最先进水平。Orchard-GUI使用仅0.4K提炼轨迹和2.2K开放式任务训练了一个4B视觉-语言计算机使用代理。它在WebVoyager、Online-Mind2Web和DeepShop上的成功率分别为74.1%、67.0%和64.0%，使其成为最强的开源模型，同时与专有系统保持竞争力。Orchard-Claw针对个人助理代理。仅用0.2K合成任务进行训练，它在Claw-Eval上达到了59.6%的pass@3，在与更强的ZeroClaw框架配对时达到了73.9%。总体而言，这些结果表明，一个轻量级、开放的、与框架无关的环境层能够在不同领域中实现可重用的自主数据、训练配方和评估。

cs.AI / 96 / 2605.15041

Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use

基于案例的自适应推理与执行的工具使用校准

Pang, Renning, Lan, Tian, Liu, Leyuan, Tong, Piao, Cao, Sheng, Zhang, Xiaosong

Abstract

Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats historical execution trajectories as structured cases. Instead of reusing raw exemplar outputs, CAST extracts case-derived signals to identify complexity profiles for estimating optimal reasoning strategies, alongside failure profiles to map likely structural breakdowns. The framework translates this knowledge into a fine-grained reward design and adaptive reasoning, enabling the model to autonomously internalize case-based strategies during reinforcement learning. Experiments on BFCLv2 and ToolBench demonstrate that CAST improves both schema-faithful execution and task-level tool-use success while reducing unnecessary deliberation. The approach achieves up to 5.85 percentage points gain in overall execution accuracy and reduces average reasoning length by 26%, significantly mitigating high-impact structural errors. Ultimately, this demonstrates how historical execution cases can provide reusable adaptation knowledge for calibrated tool use.

Chinese Translation

工具使用将大型语言模型的能力扩展超出了参数知识，但可靠的执行需要在适当的推理深度与严格的结构有效性之间取得平衡。我们从基于案例的视角出发，提出了CAST（基于案例的工具使用框架），该框架将历史执行轨迹视为结构化案例。CAST并不是简单地重用原始示例输出，而是提取案例衍生信号，以识别复杂性特征，从而估计最佳推理策略，同时识别失败特征以映射可能的结构性崩溃。该框架将这些知识转化为细致的奖励设计和自适应推理，使模型能够在强化学习过程中自主内化基于案例的策略。在BFCLv2和ToolBench上的实验表明，CAST在保持模式一致性执行和任务级工具使用成功率方面都有所提升，同时减少了不必要的思考。该方法在整体执行准确性上提高了多达5.85个百分点，并将平均推理长度减少了26%，显著减轻了高影响结构性错误。最终，这表明历史执行案例可以为校准的工具使用提供可重用的适应知识。

cs.AI / 97 / 2605.15100

Dual-Dimensional Consistency: Balancing Budget and Quality in Adaptive Inference-Time Scaling

双维一致性：在自适应推理时间缩放中平衡预算与质量

Xu, Rongman, Li, Yifei, Zhao, Tianzhe, Wu, Yanrui, Li, Bo, Yan, Hang

Abstract

Large Language Models (LLMs) have demonstrated remarkable abilities in reasoning. However, maximizing their potential through inference-time scaling faces challenges in the trade-off between sampling budget and reasoning quality. Current strategies remain inefficient because they typically treat sampling width and depth as orthogonal objectives: width consensus methods risk reinforcing hallucinations, while depth pruning mechanisms prematurely truncate complex yet valid reasoning chains. Therefore, we propose Dual-Dimensional Consistency (DDC), a unified framework that bridges path quality with adaptive termination. By coupling a Confidence-Weighted Bayesian protocol with Trend-Aware Stratified Pruning, our method ensures that computational resources are concentrated on high-quality reasoning paths, filtering hallucinations while accelerating consensus. Evaluations across five benchmarks demonstrate that this approach reduces token consumption by over 10 times while maintaining or exceeding the accuracy of strong baselines across various LLMs.

Chinese Translation

大型语言模型（LLMs）在推理方面展现了卓越的能力。然而，通过推理时间缩放来最大化其潜力面临着在采样预算与推理质量之间的权衡挑战。目前的策略效率较低，因为它们通常将采样宽度和深度视为正交目标，其中宽度共识方法可能加剧幻觉，而深度剪枝机制则过早截断复杂但有效的推理链。因此，我们提出了双维一致性（Dual-Dimensional Consistency, DDC），这是一个统一框架，将路径质量与自适应终止相结合。通过将置信加权贝叶斯协议与趋势感知分层剪枝相结合，我们的方法确保计算资源集中于高质量推理路径，过滤幻觉的同时加速共识。在五个基准测试中的评估表明，该方法在保持或超过各种大型语言模型（LLMs）强基线的准确性的同时，将令牌消耗减少了超过10倍。

cs.AI / 98 / 2605.15109

Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG

邻里为何重要：代理图形检索增强生成中的遍历上下文与来源

Terrenzi, Riccardo, von Zastrow, Maximilian, Ayvaz, Serkan

Abstract

Retrieval-Augmented Generation can improve factuality by grounding answers in external evidence, but Agentic GraphRAG complicates what it means for citations to be faithful. In these systems, an agent explores a knowledge graph before producing an answer and a small set of citations. We frame citation faithfulness as a trajectory-level problem: final citations should not only support the answer, but also account for the graph traversal, structure, and visited-but-uncited entities that may influence it. Through controlled ablation experiments, we compare the effects of isolating, removing, and masking cited and uncited graph entities. Our results show that cited evidence is often necessary, as removing it substantially changes answers and reduces accuracy. However, citations are not sufficient, because accurate answers can also depend on uncited traversal context and surrounding graph structure. These findings suggest that citation evaluation in Agentic GraphRAG should move beyond source support toward provenance over the broader retrieval trajectory.

Chinese Translation

检索增强生成（Retrieval-Augmented Generation）通过将答案与外部证据相结合来提高事实性，但代理图形检索增强生成（Agentic GraphRAG）使得引用的忠实性变得复杂。在这些系统中，代理在生成答案和一小组引用之前会探索知识图谱。我们将引用忠实性框定为轨迹级问题：最终的引用不仅应支持答案，还应考虑图的遍历、结构以及可能影响答案的已访问但未引用的实体。通过控制消融实验，我们比较了隔离、移除和掩蔽已引用和未引用图实体的效果。我们的结果表明，引用的证据通常是必要的，因为移除它会显著改变答案并降低准确性。然而，引用并不足够，因为准确的答案也可能依赖于未引用的遍历上下文和周围的图结构。这些发现表明，在代理图形检索增强生成中，引用评估应超越源支持，转向对更广泛检索轨迹的来源追溯。

cs.AI / 99 / 2605.15132

APWA: A Distributed Architecture for Parallelizable Agentic Workflows

APWA：一种可并行化的代理工作流的分布式架构

Rose, Evan, Mallick, Tushin, Laws, Matthew D., Nita-Rotaru, Cristina, Oprea, Alina

Abstract

Autonomous multi-agent systems based on large language models (LLMs) have demonstrated remarkable abilities in independently solving complex tasks in a wide breadth of application domains. However, these systems hit critical reasoning, coordination, and computational scaling bottlenecks as the size and complexity of their tasks grow. These limitations hinder multi-agent systems from achieving high-throughput processing for highly parallelizable tasks, despite the availability of parallel computing and reasoning primitives in the underlying LLMs. We introduce the Agent-Parallel Workload Architecture (APWA), a distributed multi-agent system architecture designed for the efficient processing of heavily parallelizable agentic workloads. APWA facilitates parallel execution by decomposing workflows into non-interfering subproblems that can be processed using independent resources without cross-communication. It supports heterogeneous data and parallel processing patterns, and it accommodates tasks from a wide breadth of domains. In our evaluation, we demonstrate that APWA can dynamically decompose complex queries into parallelizable workflows and scales on larger tasks in settings where prior systems fail completely.
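
The decomposition idea in the abstract above, splitting work into non-interfering subproblems that run without cross-communication, can be sketched as a simple map-and-merge. This is only an illustration under stated assumptions, not APWA's architecture: `decompose`, `solve_subproblem`, and `run_parallel` are hypothetical names, and the sum-of-squares workload stands in for an agent's actual task.

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(items, n_parts):
    """Split the work into disjoint chunks (hypothetical decomposition)."""
    return [items[i::n_parts] for i in range(n_parts)]

def solve_subproblem(chunk):
    """Stand-in for one agent working on its chunk in isolation."""
    return sum(x * x for x in chunk)

def run_parallel(items, n_parts=4):
    """Process the chunks independently, then merge the partial results."""
    chunks = decompose(items, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(solve_subproblem, chunks))
    return sum(partials)

result = run_parallel(list(range(10)))
```

Because no chunk reads another chunk's state, the workers need no coordination beyond the final merge, which is the property the abstract attributes to non-interfering subproblems.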

Chinese Translation

基于大型语言模型（LLMs）的自主多代理系统在独立解决广泛应用领域中的复杂任务方面展现了显著的能力。然而，随着任务规模和复杂性的增加，这些系统面临着关键的推理、协调和计算扩展瓶颈。这些限制阻碍了多代理系统在高度可并行化任务中实现高吞吐量处理，尽管底层LLMs中提供了并行计算和推理原语。我们提出了代理并行工作负载架构（APWA），这是一种旨在高效处理高度可并行化代理工作负载的分布式多代理系统架构。APWA通过将工作流分解为不干扰的子问题来促进并行执行，这些子问题可以使用独立资源进行处理，而无需交叉通信。它支持异构数据和并行处理模式，并适应来自广泛领域的任务。在我们的评估中，我们展示了APWA能够动态地将复杂查询分解为可并行化的工作流，并在先前系统完全失效的环境中对更大任务进行扩展。

cs.AI / 100 / 2605.15177

OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

OpenDeepThink：通过Bradley-Terry聚合实现并行推理

Zhou, Shang, Chai, Wenhao, Liu, Kaiyuan, Mao, Huanzhi, Mang, Qiuyang, Shang, Jingbo

Abstract

Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.
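
The aggregation step in the abstract above, turning pairwise votes into a global ranking, can be sketched with the standard minorization-maximization update for the Bradley-Terry model. This is a generic textbook fit and not OpenDeepThink's implementation: the function names, the toy vote matrix, and the iteration count are assumptions, and the mutation step that uses natural-language critiques is omitted.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths; wins[i][j] = times i was preferred to j."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Only pairs that were actually compared contribute.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = n / sum(new_p)  # fix the overall scale of the strengths
        p = [x * scale for x in new_p]
    return p

def rank_and_prune(wins):
    """Rank candidates by strength and drop the bottom quarter."""
    p = bradley_terry(wins)
    ranking = sorted(range(len(p)), key=lambda i: -p[i])
    keep = max(1, (3 * len(ranking)) // 4)
    return ranking, ranking[:keep]

# Hypothetical judge votes among four candidates (each pair compared 3 times).
votes = [
    [0, 2, 3, 3],
    [1, 0, 2, 3],
    [0, 1, 0, 2],
    [0, 0, 1, 0],
]
ranking, survivors = rank_and_prune(votes)
```

Fitting a global model instead of counting raw wins lets sparse, random pairings still yield a consistent ordering, which is useful when not every pair of candidates is judged.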

Chinese Translation

测试时计算扩展是提高大型语言模型（LLM）推理能力的主要方向。现有方法主要通过扩展单一推理轨迹来增加深度。通过并行采样多个候选者来扩展广度是直接的，但引入了选择瓶颈：在没有真实验证者的情况下选择最佳候选者，因为逐点判断LLM的结果噪声大且存在偏差。为了解决这个问题，我们引入了OpenDeepThink，一个基于人群的测试时计算框架，通过成对的Bradley-Terry比较进行选择。在每一代中，LLM判断随机候选对，并通过Bradley-Terry聚合投票形成全球排名；排名靠前的候选者被保留，排名前四分之三的候选者使用比较过程中产生的自然语言评论进行变异；而排名最低的四分之一则被丢弃。OpenDeepThink在八轮连续的LLM调用中将Gemini 3.1 Pro的有效Codeforces Elo提高了405分（约27分钟的实际时间）。该流程在不同强度的模型间迁移时无需重新调优，并且在多领域HLE基准测试中，增益主要集中在客观可验证的领域，而在主观领域则出现反转。我们发布了CF-73，这是一个经过策划的73个专家评分的Codeforces问题集，附有国际特级大师注释，并与官方裁决的99%本地评估一致。

计算语言学 (Computation and Language)

cs.CL / 1 / 2605.13919

Merging Methods for Multilingual Knowledge Editing for Large Language Models: An Empirical Odyssey

大型语言模型的多语言知识编辑合并方法：一场实证之旅

Lee, Kunil, Shin, Ki-Young, Lee, Jong-Hyeok, Suh, Young-Joo

Abstract

Multilingual knowledge editing (MKE) remains challenging because language-specific edits interfere with one another, even when locate-then-edit methods work well in monolingual settings. This paper focuses on three issues: the effectiveness of vector merging methods for MKE, the extent to which Task Singular Vectors for Merging (TSVM) can reduce multilingual interference, and the influence of the weight scaling factor and rank compression ratio on performance. We evaluate six merging variants with two popular backbone large language models, two base knowledge editing methods, and 12 languages on the MzsRE benchmark under a large-scale batch-editing setting. Our results show that vector summation with shared covariance is the most reliable overall strategy, whereas simple summation without shared covariance performs poorly. TSVM improves performance in some settings, but its ability to mitigate multilingual interference is limited. We also find that performance is sensitive to both weight scale and rank ratio, with larger-than-default scaling and relatively low rank often yielding better results. These findings clarify the practical strengths and limits of current vector merging methods for MKE and provide guidance for future multilingual knowledge editing research.

Chinese Translation

多语言知识编辑（MKE）仍然面临挑战，因为语言特定的编辑相互干扰，即使在单语环境中，定位后编辑方法也能很好地工作。本文关注三个问题：向量合并方法在MKE中的有效性、任务单一向量合并（TSVM）在多语言干扰减少方面的程度，以及权重缩放因子和秩压缩比对性能的影响。我们在大规模批量编辑环境下，使用两种流行的基础大型语言模型、两种基础知识编辑方法和12种语言，在MzsRE基准上评估了六种合并变体。我们的结果表明，具有共享协方差的向量求和是最可靠的整体策略，而没有共享协方差的简单求和表现较差。TSVM在某些设置中提高了性能，但其减轻多语言干扰的能力有限。我们还发现，性能对权重缩放和秩比敏感，较大于默认的缩放和相对较低的秩通常能产生更好的结果。这些发现阐明了当前向量合并方法在MKE中的实际优势和局限，并为未来的多语言知识编辑研究提供了指导。


cs.CL / 2 / 2605.13989

VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

VectraYX-Nano：一款具有课程学习和原生工具使用的42M参数西班牙语网络安全语言模型

Santillana, Juan S.

Abstract

We present VectraYX-Nano, a 41.95M-parameter decoder-only language model trained from scratch in Spanish for cybersecurity, with a Latin-American focus and native tool invocation via the Model Context Protocol (MCP). Four contributions: (i) Corpus: VectraYX-Sec-ES, a 170M-token Spanish corpus from an eight-VM pipeline (~$25 USD) partitioned into conversational (42M tokens, OpenSubtitles-ES, OASST1), cybersecurity (118M tokens, NVD, Wikipedia-ES, CVE mirror, security blogs), and offensive-security tooling (10M tokens, ExploitDB, HackTricks, OWASP) phases. (ii) Architecture: 42M-parameter Transformer decoder with GQA, QK-Norm, RMSNorm, SwiGLU, RoPE, z-loss, and a 16,384-token byte-fallback BPE. (iii) Curriculum with replay: continual pre-training with a replay buffer yields monotonic loss descent (9.80->3.17->3.00->2.16); after SFT on OASST-ES, Alpaca-ES, CVE Q&A, and 6,327 tool-use traces, the model attains a conversational gate of 0.78 ± 0.05 (N=4 seeds). (iv) Two findings: a bootstrap-corpus ablation reveals a loss-vs-register inversion at nano scale; a LoRA study shows the B4 tool-selection floor of 0.000 is a corpus-density artifact, not a capacity gate -- a tool-dense corpus (2,801 examples) raises B4 to 0.145 ± 0.046 on Nano 42M and 0.445 ± 0.201 on a 260M mid-tier. The GGUF artifact is 81 MB (F16), runs at sub-second TTFT on commodity hardware under llama.cpp, and is to our knowledge the first Spanish-native cybersecurity LLM with end-to-end MCP integration. Corpus recipe, training scripts, GGUF weights, and B1-B5 benchmark are released.

Chinese Translation

我们提出了VectraYX-Nano，这是一款从零开始训练的41.95M参数解码器语言模型，专为网络安全领域设计，重点关注拉丁美洲，并通过模型上下文协议（Model Context Protocol, MCP）实现原生工具调用。主要贡献包括：（i）语料库：VectraYX-Sec-ES，这是一个包含170M标记的西班牙语语料库，来源于一个八个虚拟机的管道（约25美元），分为对话（42M标记，OpenSubtitles-ES, OASST1）、网络安全（118M标记，NVD, Wikipedia-ES, CVE镜像, 安全博客）和攻击安全工具（10M标记，ExploitDB, HackTricks, OWASP）三个阶段。（ii）架构：42M参数的Transformer解码器，采用GQA、QK-Norm、RMSNorm、SwiGLU、RoPE、z-loss，以及一个16,384标记的字节回退BPE。（iii）课程与重放：持续的预训练结合重放缓冲区实现了单调的损失下降（9.80->3.17->3.00->2.16）；在OASST-ES、Alpaca-ES、CVE问答和6,327个工具使用轨迹上进行SFT后，模型达到了0.78±0.05的对话门槛（N=4种子）。（iv）两个发现：引导语料库消融实验揭示了纳米尺度下的损失与注册的反转；LoRA研究表明，B4工具选择的下限0.000是语料密度的伪影，而非能力门槛——一个工具密集的语料库（2,801个例子）使得Nano 42M的B4提升至0.145±0.046，而260M中层模型的B4提升至0.445±0.201。GGUF伪影为81 MB（F16），在商品硬件下通过llama.cpp以亚秒级TTFT运行，并且据我们所知，这是首个具有端到端MCP集成的西班牙语原生网络安全大语言模型。语料库配方、训练脚本、GGUF权重和B1-B5基准已发布。


cs.CL / 3 / 2605.14005

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

槲寄生：对投机解码的隐秘加速崩溃攻击

Sun, Shuoyang, Da, Chang, Fang, Hao, Gao, Kuofeng, Zhong, Xinhao, Sun, Yi, Mo, Fan, Xia, Shu-Tao, Chen, Bin

Abstract

Speculative decoding has become a widely adopted technique for accelerating large language model (LLM) inference by drafting multiple candidate tokens and verifying them with a target model in parallel. Its efficiency, however, critically depends on the average accepted length $\tau$, i.e., how many draft tokens survive each verification step. In this work, we identify a new mechanism-level vulnerability in model-based speculative decoding: the drafter is trained to approximate the target model distribution, but this approximation is inevitably imperfect. Such a drafter-target mismatch creates a hidden attack surface where small perturbations can preserve the target model's visible behavior while substantially reducing draft-token acceptability. We propose Mistletoe, a stealthy acceleration-collapse attack against speculative decoding that directly targets its acceptance mechanism. It jointly optimizes a degradation objective that decreases drafter-target agreement and a semantic-preservation objective that constrains the target model's output distribution. To resolve the conflict between these objectives, we introduce a null-space projection mechanism, where degradation gradients are projected away from the local semantic-preserving direction, suppressing draft acceptance while minimizing semantic drift. Experiments on various speculative decoding systems show that Mistletoe substantially reduces average accepted length $\tau$, collapses speedup, and lowers average token throughput, while preserving output quality and perplexity. Our work highlights that speculative decoding introduces a mechanism-level attack surface beyond existing output robustness, calling for more robust designs of LLM acceleration systems.
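
The projection step described in the abstract can be illustrated with a minimal sketch. It assumes the semantic-preserving direction is available as a single gradient vector and that "projecting away" means removing the component of the degradation gradient along that vector. The function name `project_away` and the plain-list vectors are hypothetical, chosen for illustration, and are not taken from the paper.

```python
def project_away(g_degrade, g_semantic, eps=1e-12):
    """Return g_degrade with its component along g_semantic removed.

    Assumption: one semantic-preserving direction, given as a plain list of floats.
    """
    dot = sum(a * b for a, b in zip(g_degrade, g_semantic))
    norm_sq = sum(b * b for b in g_semantic)
    if norm_sq < eps:
        # No usable semantic direction: leave the gradient unchanged.
        return list(g_degrade)
    scale = dot / norm_sq
    return [a - scale * b for a, b in zip(g_degrade, g_semantic)]


if __name__ == "__main__":
    projected = project_away([3.0, 4.0], [1.0, 1.0])
    # The projected gradient is orthogonal to the semantic direction.
    print(projected, sum(p * s for p, s in zip(projected, [1.0, 1.0])))
```

Under these assumptions, a step along the projected gradient leaves the semantic objective unchanged to first order, which is one plausible reading of how the two objectives are reconciled.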

Chinese Translation

投机解码已成为加速大型语言模型（LLM）推理的广泛采用技术，通过并行草拟多个候选标记并使用目标模型进行验证。然而，其效率严重依赖于平均接受长度 $\tau$，即每个验证步骤中存活的草拟标记数量。在本研究中，我们识别出模型基础投机解码中的一种新的机制级漏洞：草拟器被训练以近似目标模型分布，但这种近似不可避免地是不完美的。这种草拟器与目标之间的不匹配创造了一个隐藏的攻击面，在这个攻击面上，小的扰动可以保持目标模型的可见行为，同时显著降低草拟标记的可接受性。我们提出了槲寄生，一种针对投机解码的隐秘加速崩溃攻击。槲寄生直接针对投机解码的接受机制。它联合优化一个降低草拟器与目标一致性的退化目标和一个约束目标模型输出分布的语义保留目标。为了解决这些目标之间的冲突，我们引入了一种零空间投影机制，其中退化梯度被投影到远离局部语义保留方向的区域，从而在最小化语义漂移的同时抑制草拟接受。对各种投机解码系统的实验表明，槲寄生显著降低了平均接受长度 $\tau$，崩溃加速，并降低了平均标记吞吐量，同时保持了输出质量和困惑度。我们的工作强调，投机解码引入了一种超越现有输出鲁棒性的机制级攻击面，呼吁对LLM加速系统进行更鲁棒的设计。


cs.CL / 4 / 2605.14040

Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

Physics-R1：经过审核的奥林匹克语料库与视觉物理推理的方案

Yang, Shan

Abstract

We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals; a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) A 17-pp Sonnet 4.5 delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) A 46-pp format-and-novelty gradient on identical Sonnet weights between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (6,432-record three-stage-audited multimodal corpus), PhysR1Corp (2,268-record closed-form RL pool), PhysOlym-A (500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 lifts the audited corpus over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).
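
As a rough illustration of the first audit stage mentioned above, the sketch below computes a 5-gram Jaccard similarity between two problem statements. Word-level 5-grams, lowercasing, and whitespace tokenization are assumptions made here for simplicity; the abstract does not specify the tokenization or any threshold, and the helper names are hypothetical.

```python
def ngrams(text, n=5):
    """Set of word-level n-grams (assumed: lowercase, whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=5):
    """Jaccard similarity of the n-gram sets of two texts; 0.0 if either set is empty."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    train_item = "a block slides down a frictionless incline of height h"
    eval_item = "a block slides down a frictionless ramp of height h"
    print(jaccard(train_item, eval_item))
```

A surface-overlap measure of this kind gives low scores to reworded problems, which is consistent with the abstract's observation that the single-stage audit reports zero hits while the embedding and LLM-judge stages surface paraphrase candidates.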

Chinese Translation

我们对多模态物理评估管道进行了端到端审核，并记录了三种未被发现的构建实践，这些实践扭曲了该领域对视觉-语言推理的测量方式：训练-评估污染、翻译漂移和多项选择题饱和。(1) 公共训练池（UGPhysics-Train、SciInstruct、MMK12）在所有六个公共物理评估中通过单阶段5-gram-Jaccard审核，未发现任何命中；三阶段审核（Jaccard -> mxbai-embed-large余弦 -> Haiku-4.5 LLM-judge）在SciInstruct中发现了134个近重复项和4,846个释义候选项。(2) 在59对爱沙尼亚语-英语奥林匹克问题上，Sonnet 4.5的17个百分点差异（30.5%对13.6%；符号检验p=0.011，McNemar p=0.021，配对自助法95%置信区间[+5.1, +28.9]百分点）。(3) 在相同Sonnet权重下，多项选择题（PhyX上的79.7%）与开放式奥林匹克评估（PhysOlym-A上的33.4%）之间的46个百分点格式和新颖性梯度。我们发布了四个解决这些差距的文献：PhysCorp-A（6,432条记录的三阶段审核多模态语料库）、PhysR1Corp（2,268条记录的封闭形式强化学习池）、PhysOlym-A（500道题目，99.8%新来源的保留奥林匹克评估，带有本地难度标签和一个英语/爱沙尼亚语双语子集），以及Physics-R1，一个从Qwen3-VL-8B-Thinking冷启动的参考GSPO+DAPO方案。在3个种子上，Physics-R1在PhysOlym-A自由评估上比8B基础提升了+18.3个百分点（8.0 -> 26.3 +/- 1.7；比Sonnet 4.5低7.1个百分点），在PhysReason上提升了+15.7个百分点（23.9 -> 39.6 +/- 6.4；领先于Qwen3-VL-32B和Gemini 2.5 Pro），在OlympiadBench-Physics上提升了+6.9个百分点（46.2 +/- 1.5），在PhyX多项选择题上提升了+4.1个百分点（77.8 +/- 0.3）。


cs.CL / 5 / 2605.14053

Derivation Prompting: A Logic-Based Method for Improving Retrieval-Augmented Generation

推导提示：一种基于逻辑的方法以改善检索增强生成

Sastre, Ignacio, Moncecchi, Guillermo, Rosá, Aiala

Abstract

The application of Large Language Models to Question Answering has shown great promise, but important challenges such as hallucinations and erroneous reasoning arise when using these models, particularly in knowledge-intensive, domain-specific tasks. To address these issues, we introduce Derivation Prompting, a novel prompting technique for the generation step of the Retrieval-Augmented Generation framework. Inspired by logic derivations, this method involves deriving conclusions from initial hypotheses through the systematic application of predefined rules. It constructs a derivation tree that is interpretable and adds control over the generation process. We applied this method in a specific case study, significantly reducing unacceptable answers compared to traditional RAG and long-context window methods.

Chinese Translation

大型语言模型在问答领域的应用展现了巨大的潜力，但在使用这些模型时，尤其是在知识密集型和领域特定的任务中，出现了诸如幻觉和错误推理等重要挑战。为了解决这些问题，我们提出了推导提示（Derivation Prompting），这是一种针对检索增强生成（Retrieval-Augmented Generation）框架生成步骤的新型提示技术。该方法受到逻辑推导的启发，通过系统地应用预定义规则，从初始假设中推导出结论。它构建了一个可解释的推导树，并增加了对生成过程的控制。我们在一个具体案例研究中应用了该方法，与传统的RAG和长上下文窗口方法相比，显著减少了不可接受答案的出现。


cs.CL / 6 / 2605.14055

PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

PEML：参数高效的多任务学习与优化连续提示

Chowdhury, Anjir Ahmed, Zawad, Syed, Ma, Xiaolong, Dong, Xu, Yan, Feng

Abstract

Parameter-Efficient Fine-Tuning (PEFT) is widely used for adapting Large Language Models (LLMs) to various tasks. Recently, there has been increasing demand for fine-tuning a single LLM for multiple tasks, because features shared among tasks reduce the total amount of fine-tuning data required. More importantly, LLMs are resource-demanding, and deploying a single model for multiple tasks facilitates resource consolidation and consumes significantly fewer resources than deploying an individual large model for each task. Existing PEFT methods such as LoRA and Prefix Tuning are designed to adapt LLMs to a specific task. LoRA and its variants focus on aligning the model itself to tasks, overlooking the importance of prompt tuning in multi-task learning, while Prefix Tuning adopts only a simple architecture to optimize prompts, which limits its adaptation capability in multi-task settings. To enable efficient fine-tuning for multi-task learning, it is important to co-optimize prompts and model adaptation. In this work, we propose Parameter-Efficient Multi-task Learning (PEML), which employs a neural architecture engineering method for optimizing the continuous prompts while also performing low-rank adaptation of model weights. We prototype PEML by creating an automated framework for optimizing the continuous prompts and adapting model weights. We evaluate PEML against the state-of-the-art multi-task learning methods MTL-LoRA, MultiLoRa, C-Poly, and MoE on the GLUE, SuperGLUE, Massive Multitask Language Understanding, and commonsense reasoning benchmarks. The evaluation results show an average accuracy improvement of up to 6.67%, with individual tasks showing peak gains of up to 10.75%.

Chinese Translation

参数高效微调（PEFT）广泛应用于适应大型语言模型（LLMs）以应对各种任务。近年来，针对单一LLM进行多任务微调的需求日益增长，因为这得益于任务间共享的共同特征，从而整体上减少了微调所需的数据量。更重要的是，LLMs对资源的需求较高，为多个任务部署单一模型有助于资源整合，并且与为每个任务部署单独的大模型相比，显著减少了资源消耗。现有的PEFT方法，如LoRA和前缀调优（Prefix Tuning），旨在将LLMs适应于特定任务。LoRA及其变体专注于对模型本身进行任务对齐，而忽视了在多任务学习中提示调优的重要性；而前缀调优仅采用简单架构来优化提示，这限制了其在多任务中的适应能力。为了实现多任务学习的高效微调，重要的是共同优化提示优化和模型适应。在本研究中，我们提出了一种参数高效的多任务学习方法（PEML），该方法采用神经架构工程方法来优化连续提示，同时对模型权重进行低秩适应。我们通过创建一个自动化框架来优化连续提示和适应模型权重，原型化了PEML。我们在GLUE、SuperGLUE、大规模多任务语言理解和常识推理基准上，将PEML与最先进的多任务学习方法MTL-LoRA、MultiLoRa、C-Poly和MoE进行了评估。评估结果显示，PEML在准确率上平均提高了高达6.67%，个别任务的峰值增益高达10.75%。


cs.CL / 7 / 2605.14057

Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

法律询问型对话代理的双层次对话策略学习

Lin, Xubo, Deng, Zezhii, Wang, Shihao, Yang, Grace Hui, Deng, Yang

Abstract

Most existing dialogue systems are user-driven, primarily designed to fulfill user requests. However, in many critical real-world scenarios, a conversational agent must proactively extract information to achieve its own objectives rather than merely respond. To address this gap, we introduce Inquisitive Conversational Agents (ICAs) and develop an ICA specifically tailored to U.S. Supreme Court oral arguments. We propose a Dual Hierarchical Reinforcement Learning framework featuring two cooperating RL agents, each with its own policy, to coordinate strategic dialogue management and fine-grained utterance generation. By learning when and how to ask probing questions, the agent emulates judicial questioning patterns and systematically uncovers crucial information to fulfill its legal objectives. Evaluations on a U.S. Supreme Court dataset show that our method outperforms various baselines across multiple metrics. It represents an important first step toward broader high-stakes, domain-specific applications.

Chinese Translation

大多数现有的对话系统都是以用户为驱动，主要设计用于满足用户请求。然而，在许多关键的现实场景中，对话代理必须主动提取信息，以实现其自身目标，而不仅仅是被动回应。为了解决这一问题，我们引入了询问型对话代理（Inquisitive Conversational Agents, ICAs），并开发了一种专门针对美国最高法院口头辩论的ICA。我们提出了一种双层次强化学习框架，具有两个合作的强化学习代理，每个代理都有自己的策略，以协调战略对话管理和细粒度的发话生成。通过学习何时以及如何提出探询性问题，代理模拟了司法质询模式，并系统地揭示关键信息，以实现其法律目标。在美国最高法院数据集上的评估表明，我们的方法在多个指标上优于各种基线。这代表了朝着更广泛的高风险、特定领域应用迈出的重要第一步。


cs.CL / 8 / 2605.14071

Distribution Corrected Offline Data Distillation for Large Language Models

针对大型语言模型的分布校正离线数据蒸馏

Zhang, Yumeng, Yang, Zhengbang, Goonatilake, Yevin Nikhel, Zhu, Zhuangdi

Abstract

Distilling reasoning traces from strong large language models into smaller ones is a promising route to improve intelligence in resource-constrained settings. Existing approaches face a fundamental trade-off: offline distillation from teacher-generated traces provides high-quality, sample-efficient supervision but suffers from distributional drift: during training, the student model conditions on teacher-generated prefixes, whereas during inference the student autoregresses on self-generated prefixes, leading to compounding errors over long reasoning trajectories. Meanwhile, on-policy or self-distillation methods better match the student's inference-time distribution, but require costly online sampling and often produce low-quality traces in early training. We propose a principled offline reasoning distillation framework that preserves the efficiency and supervision quality of offline teacher-generated data while correcting teacher-student distribution drift. It adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution. Evaluations on mathematical reasoning benchmarks of GSM8K, MATH, MATH500, and harder held-out competition-style tasks, including AMC, AIME, and OlympiadBench, show that our method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities. Our work shows that lightweight, distribution-correction-aware training can substantially strengthen offline reasoning distillation without online rollouts.

Chinese Translation

从强大的大型语言模型中提取推理轨迹并蒸馏到较小模型中，是在资源受限环境中提升智能的一个有前景的途径。现有方法面临一个基本的权衡：从教师生成的轨迹进行离线蒸馏提供了高质量、样本高效的监督，但却遭遇分布漂移的问题：在训练过程中，学生模型依赖于教师生成的前缀，而在推理时，学生则基于自生成的前缀进行自回归，导致在长推理轨迹上累积错误。同时，基于策略或自蒸馏的方法更好地匹配学生的推理时分布，但需要昂贵的在线采样，并且在早期训练中往往产生低质量的轨迹。我们提出了一个原则性的离线推理蒸馏框架，既保持了离线教师生成数据的效率和监督质量，又校正了教师与学生之间的分布漂移。该框架自适应地强调与学生的在线策略分布更好对齐的教师监督。在GSM8K、MATH、MATH500等数学推理基准以及更具挑战性的保留竞赛风格任务（包括AMC、AIME和OlympiadBench）上的评估表明，我们的方法在推理准确性上优于先前的离线蒸馏算法，并在保持指令跟随能力的同时，产生了更稳定的推理轨迹。我们的工作表明，轻量级、关注分布校正的训练可以在不进行在线回滚的情况下，显著增强离线推理蒸馏的效果。


cs.CL / 9 / 2605.14087

Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

测量和缓解大型语言模型中的毒性：一项综合复制研究

Surana, Mokshit, Rathod, Archit, Satishkumar, Akshaj

Abstract

Large Language Models (LLMs), when trained on web-scale corpora, inherently absorb toxic patterns from their training data. This leads to "toxic degeneration", where even innocuous prompts can trigger harmful outputs. This phenomenon poses significant risks for real-world deployments and necessitates effective mitigation strategies that maintain model utility while ensuring safety. In this comprehensive replication study, we evaluate the efficacy of DExperts (Decoding-time Experts), an inference-time mitigation technique that steers generation without requiring model retraining. We structured our research into three systematic phases: (1) establishing baseline toxicity measurements using RealToxicityPrompts on standard GPT-2 models; (2) implementing and evaluating DExperts to mitigate explicit toxicity; and (3) stress-testing the method against implicit hate speech using the adversarial ToxiGen dataset. Our empirical results confirm that while DExperts achieves near-perfect safety rates (100%) on explicit toxicity benchmarks, it exhibits brittleness against adversarial, implicit hate speech, with safety rates dropping to 98.5%. Furthermore, we quantify a critical trade-off: the method introduces a roughly 10x latency penalty (from 0.2s to 2.0s per generation), posing challenges for real-time deployment scenarios. This study contributes to the growing body of work on AI safety by highlighting the robustness gap between explicit and implicit toxicity mitigation. We emphasize the need for more sophisticated approaches that generalize across diverse hate speech patterns without prohibitive computational costs.
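
For readers unfamiliar with how decoding-time steering works, the sketch below shows the logit combination used in the original DExperts formulation: the base model's next-token logits are shifted by a scaled difference between an expert (non-toxic) model and an anti-expert (toxic) model. The toy logit lists, the function names, and the value of `alpha` are illustrative assumptions; the replication study's actual configuration is not given in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dexperts_probs(base, expert, anti_expert, alpha=2.0):
    """Next-token distribution from base + alpha * (expert - anti_expert) logits."""
    combined = [b + alpha * (e - a) for b, e, a in zip(base, expert, anti_expert)]
    return softmax(combined)


if __name__ == "__main__":
    # Toy vocabulary of three tokens; all logits are made up for illustration.
    base = [1.0, 1.0, 0.5]
    expert = [2.0, 0.0, 0.5]       # prefers token 0
    anti_expert = [0.0, 2.0, 0.5]  # prefers token 1
    print(dexperts_probs(base, expert, anti_expert, alpha=1.0))
```

Each decoding step needs forward passes through three models instead of one, which by itself would explain only part of the roughly 10x latency penalty reported above.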

Chinese Translation

大型语言模型（LLMs）在以网络规模语料库进行训练时，固有地吸收了训练数据中的毒性模式。这导致了“毒性退化”，即即使是无害的提示也可能触发有害的输出。这一现象对实际部署构成了重大风险。因此，需要有效的缓解策略，以在确保安全的同时维持模型的实用性。在这项综合复制研究中，我们评估了DExperts（解码时专家）的有效性，这是一种推理时的缓解技术，可以在不需要重新训练模型的情况下引导生成。我们的研究分为三个系统阶段：（1）使用RealToxicityPrompts在标准GPT-2模型上建立基线毒性测量；然后（2）实施并评估DExperts以缓解显性毒性；最后（3）使用对抗性ToxiGen数据集对该方法进行隐性仇恨言论的压力测试。我们的实证结果确认，尽管DExperts在显性毒性基准上达到了近乎完美的安全率（100%），但在对抗性隐性仇恨言论方面表现出脆弱性，安全率降至98.5%。此外，我们量化了一个关键的权衡。该方法引入了约10倍的延迟惩罚（从0.2秒增加到每次生成2.0秒），给实时部署场景带来了挑战。本研究通过强调显性和隐性毒性缓解之间的稳健性差距，为人工智能安全领域的研究贡献了新的成果。我们强调需要更复杂的方法，以在不产生高昂计算成本的情况下，针对多样的仇恨言论模式进行泛化。


cs.CL / 10 / 2605.14115

When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

当证据冲突时：检索增强生物医学问答中的不确定性和顺序效应

Han, Yikun, Lan, Mengfei, Kilicoglu, Halil

Abstract

Biomedical retrieval-augmented large language models (LLMs) often face evidence that is incomplete, misleading, or internally contradictory, yet evaluation usually emphasizes answer accuracy under helpful context rather than reliability under conflict. Using HealthContradict, we evaluate six open-weight LLMs under five controlled evidence conditions: no retrieved context, correct-only context, incorrect-only context, and two mixed conditions containing both correct and contradictory documents in opposite orders. In this conflicting-evidence order contrast, where the same two documents are both present and only their order is reversed, accuracy drops for every model and 11.4%--25.2% of predictions flip. To support abstention in these difficult cases, we also evaluate a conflict-aware abstention score that combines model confidence with a detector of evidence conflict. In the two hardest conditions, this score improves selective accuracy over confidence-only, with mean gains of 7.2--33.4 points in incorrect-only ('IC') and 3.6--14.4 points in incorrect-first conflicting ('ICC') conditions across 75%, 50%, and 25% coverage. These results show that conflicting biomedical evidence is both an uncertainty and robustness problem and motivate evaluation and abstention methods that explicitly account for evidence disagreement.
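
The selective-accuracy metric referenced above can be sketched as follows: rank predictions by an abstention score, answer only the top fraction given by the coverage level, and measure accuracy on that retained subset. The function name and the truncating choice of subset size are assumptions for illustration; how the paper's conflict-aware score combines confidence with the conflict detector is not specified in the abstract and is not modeled here.

```python
def selective_accuracy(scores, correct, coverage):
    """Accuracy on the `coverage` fraction of examples with the highest scores.

    scores:   one abstention score per example (higher = more willing to answer)
    correct:  one boolean per example, True if the prediction was right
    coverage: fraction of examples to answer, in (0, 1]
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    if len(scores) != len(correct) or not scores:
        raise ValueError("scores and correct must be non-empty and equal length")
    ranked = sorted(zip(scores, correct), key=lambda pair: pair[0], reverse=True)
    kept = max(1, int(coverage * len(ranked)))  # assumed rounding: truncate, keep >= 1
    return sum(1 for _, ok in ranked[:kept] if ok) / kept


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.1]
    correct = [True, True, False, True]
    for cov in (0.75, 0.5, 0.25):
        print(cov, selective_accuracy(scores, correct, cov))
```

Comparing this quantity for a confidence-only score and a conflict-aware score at the same coverage gives a gain of the kind the abstract reports in points.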

Chinese Translation

生物医学检索增强的大型语言模型（LLMs）常常面临不完整、误导性或内部矛盾的证据，但评估通常强调在有利背景下的答案准确性，而非在冲突情况下的可靠性。我们使用 HealthContradict 评估了六个开放权重的 LLMs 在五种受控证据条件下的表现：无检索上下文、仅正确上下文、仅错误上下文，以及两种混合条件，其中包含正确和矛盾文档，且顺序相反。在这种冲突证据顺序对比中，当同样的两个文档同时存在且仅其顺序被反转时，每个模型的准确性均下降，11.4% 至 25.2% 的预测结果发生翻转。为了支持在这些困难情况下的弃权，我们还评估了一种考虑冲突的弃权评分，该评分将模型信心与证据冲突的检测器相结合。在两个最困难的条件下，该评分在选择性准确性上优于仅基于信心的评分，在错误仅 (`IC`) 和错误优先冲突 (`ICC`) 条件下，平均增益分别为 7.2 至 33.4 分和 3.6 至 14.4 分，覆盖率为 75%、50% 和 25%。这些结果表明，冲突的生物医学证据既是一个不确定性问题，也是一个稳健性问题，并激励评估和弃权方法明确考虑证据不一致性。


cs.CL / 11 / 2605.14117

Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards

基于可验证奖励的强化学习与大型语言模型的生成式平面设计

Lara, Luis, Milios, Aristides, Luo, Zhi Hao, Sharma, Aditya, Luo, Ge Ya, Beckham, Christopher, Golemo, Florian, Pal, Christopher

Abstract

An AI system for professional floor plan design must precisely control room dimensions and areas while respecting the desired connectivity between rooms and maintaining functional and aesthetic quality. Existing generative approaches focus primarily on respecting the requested connectivity between rooms, but do not support generating floor plans that respect numerical constraints. We introduce a text-based floor plan generation approach that fine-tunes a large language model (LLM) on real plans and then applies reinforcement learning with verifiable rewards (RLVR) to improve adherence to topological and numerical constraints while discouraging invalid or overlapping outputs. Furthermore, we design a set of constraint adherence metrics to systematically measure how generated floor plans align with user-defined constraints. Our model generates floor plans that satisfy user-defined connectivity and numerical constraints and outperforms existing methods on Realism, Compatibility, and Diversity metrics. Across all tasks, our approach achieves at least a 94% relative reduction in Compatibility compared with existing methods. Our results demonstrate that LLMs can effectively handle constraints in this setting, suggesting broader applications for text-based generative modeling.

Chinese Translation

一个用于专业平面设计的人工智能系统必须精确控制房间的尺寸和面积，同时尊重房间之间所需的连通性，并保持功能和美学质量。现有的生成方法主要集中于尊重房间之间的连通性请求，但不支持生成符合数值约束的平面图。我们提出了一种基于文本的平面图生成方法，该方法在真实平面图上微调大型语言模型（LLM），然后应用带有可验证奖励的强化学习（RLVR）来提高对拓扑和数值约束的遵循，同时抑制无效或重叠的输出。此外，我们设计了一套约束遵循度指标，以系统性地衡量生成的平面图与用户定义的约束的对齐程度。我们的模型生成的平面图满足用户定义的连通性和数值约束，并在现实性、兼容性和多样性指标上优于现有方法。在所有任务中，与现有方法相比，我们的方法在兼容性上至少实现了94%的相对减少。我们的结果表明，LLM在此环境中能够有效处理约束，暗示了基于文本的生成建模的更广泛应用前景。


cs.CL / 12 / 2605.14125

Polar probe linearly decodes semantic structures from LLMs

极性探针线性解码大型语言模型中的语义结构

Diego-Simón, Pablo J., Orhan, Pierre, Lakretz, Yair, King, Jean-Rémi

Abstract

How do artificial neural networks bind concepts to form complex semantic structures? Here, we propose a simple neural code, whereby the existence and the type of relations between entities are represented by the distance and the direction between their embeddings, respectively. We test this hypothesis in a variety of Large Language Models (LLMs), each given natural-language descriptions of minimalist tasks from five different domains: arithmetic, visual scenes, family trees, metro maps and social interactions. First, results show that the true semantic structures can be linearly recovered with a Polar Probe targeting a subspace of LLMs' layer activations. Second, this code emerges mostly in middle layers and improves with LLM performance. Third, these Polar Probes successfully generalize to new entities and relation types, but degrade with the size of the semantic structure. Finally, the quality of the polar representation correlates with the LLM's ability to answer questions about the semantic structure. Together, these findings suggest that LLMs learn to build complex semantic structures by binding representations with a simple geometrical principle.

Chinese Translation

人工神经网络如何将概念结合以形成复杂的语义结构？在此，我们提出了一种简单的神经编码，其中实体之间关系的存在及其类型分别通过其嵌入之间的距离和方向来表示。我们在多种大型语言模型（LLMs）中测试这一假设，每个模型输入来自五个不同领域的极简任务的自然语言描述：算术、视觉场景、家谱、地铁地图和社会互动。结果表明，真实的语义结构可以通过针对LLMs层激活的子空间的极性探针线性恢复。其次，这种编码主要出现在中间层，并随着LLM性能的提高而改善。第三，这些极性探针成功地推广到新的实体和关系类型，但在语义结构的规模增大时效果下降。最后，极性表示的质量与LLM回答关于语义结构问题的能力相关联。综合来看，这些发现表明，LLMs通过结合表示与简单的几何原理来学习构建复杂的语义结构。


cs.CL / 13 / 2605.14152

ROK-FORTRESS: Measuring the Effect of Geopolitical Transcreation for National Security and Public Safety

ROK-FORTRESS：测量地缘政治再创作对国家安全和公共安全的影响

Lee, Michael S., Maurya, Yash, Rein, Drew, Herring, Bert, Nguyen, Jonathan, Song, Kyungho, Sehwag, Udari Madhushani, Cho, Jiyeon, Deshpande, Kaustubh, Jang, Yeongkyun, Joo, Jiyeon, Choi, Minn Seok, Fuelle, Evi, Knight, Christina Q, Brandifino, Joseph, Fenkell, Max

Abstract

Safety evaluations for large language models (LLMs) increasingly target high-stakes National Security and Public Safety (NSPS) risks, yet multilingual safety is typically assessed through translation-only benchmarks that preserve the underlying scenario, and empirical evidence of how language and geopolitical context interact remains limited to a narrow set of language pairs. We introduce ROK-FORTRESS (https://huggingface.co/datasets/ScaleAI/ROK-FORTRESS_public), a bilingual, culturally adversarial NSPS benchmark that uses the English--Korean language pair and U.S.--ROK geopolitical axis as a case study, separating the effects of language and geopolitical grounding via a transcreation matrix: adversarial intents are evaluated under controlled combinations of (i) English versus Korean language and (ii) U.S. versus Korean entities, institutions, and operational details. Each adversarial prompt is paired with a dual-use benign counterpart to quantify over-refusal. Model responses are then scored using calibrated LLM-as-a-judge panels, applying our expert-crafted, prompt-specific binary rubrics. Across a dual-track set of frontier and Korean-optimized models, we find a consistent suppression effect in Korean variants and substantial model-to-model variation in how geopolitical grounding interacts with language. In many models, Korean grounding mitigates the Korean language-driven suppression -- with no model showing significant amplification in the other direction -- indicating that, at least in the English--Korean case, safety behavior is shaped by language-as-risk signals and context interactions that translation-only evaluations miss. The transcreation matrix methodology is designed to generalize to other language--culture pairs.

Chinese Translation

大型语言模型（LLMs）的安全评估越来越多地针对高风险的国家安全和公共安全（NSPS）风险，但多语言安全通常通过仅保留基本场景的翻译基准进行评估，而关于语言与地缘政治背景如何相互作用的实证证据仍然局限于少数语言对。我们引入了ROK-FORTRESS（https://huggingface.co/datasets/ScaleAI/ROK-FORTRESS_public），这是一个双语的文化对抗性NSPS基准，使用英语-韩语语言对和美-韩地缘政治轴作为案例研究，通过再创作矩阵分离语言与地缘政治基础的影响：在控制的组合下评估对抗性意图，包括（i）英语与韩语语言的对比，以及（ii）美国与韩国实体、机构和操作细节的对比。每个对抗性提示与一个双重用途的良性对应物配对，以量化过度拒绝。然后，使用经过校准的LLM作为评判小组对模型响应进行评分，应用我们专家设计的、特定提示的二元评分标准。在一组前沿和韩语优化模型的双轨测试中，我们发现韩语变体中存在一致的抑制效应，并且地缘政治基础与语言的相互作用在不同模型之间存在显著差异。在许多模型中，韩语基础减轻了由韩语驱动的抑制——没有模型在相反方向上显示出显著的增强——这表明，至少在英语-韩语案例中，安全行为受到语言作为风险信号和上下文交互的影响，而仅依赖翻译的评估则无法捕捉到这一点。再创作矩阵方法旨在推广到其他语言-文化对。


cs.CL / 14 / 2605.14169

BOOKMARKS: Efficient Active Storyline Memory for Role-playing

BOOKMARKS：高效的角色扮演主动故事线记忆

Peng, Letian, Liu, Ziche, Huang, Yiming, Yun, Longfei, Zhou, Kun, Hou, Yupeng, Shang, Jingbo

Abstract

Memory systems are critical for role-playing agents (RPAs) to maintain long-horizon consistency. However, existing RPA memory methods (e.g., profiling) mainly rely on recurrent summarization, whose compression inevitably discards important details. To address this issue, we propose a search-based memory framework called BOOKMARKS, which actively initializes, maintains, and updates task-relevant pieces of bookmarks for the current task (e.g., character acting). A bookmark is structured as the answer to a question at a specific point in the storyline. For each current task, BOOKMARKS selects reusable existing bookmarks or initializes new ones (at storyline beginning) with useful questions. These bookmarks are then synchronized to the current story point, with their answers updated accordingly, so they can be efficiently reused in future grounding rounds. Compared with recurrent summarization, BOOKMARKS offers (1) active grounding for capturing task-specific details and (2) passive updating to avoid unnecessary computation. In implementation, BOOKMARKS supports concept, behavior, and state searches, each powered by an efficient synchronization method. BOOKMARKS significantly outperforms RPA memory baselines on 85 characters from 16 artifacts, demonstrating the effectiveness of search-based memory for RPAs.

Chinese Translation

记忆系统对于角色扮演代理（RPA）维持长时间的一致性至关重要。然而，现有的RPA记忆方法（例如，轮廓化）主要依赖于递归摘要，这种压缩不可避免地会丢弃重要细节。为了解决这个问题，我们提出了一种基于搜索的记忆框架，称为BOOKMARKS，它主动初始化、维护和更新与当前任务（例如，角色表演）相关的书签。书签被构建为在故事线特定时刻对问题的回答。对于每个当前任务，BOOKMARKS选择可重用的现有书签或在故事线开始时用有用的问题初始化新的书签。这些书签随后与当前故事点同步，并相应更新其答案，以便在未来的基础回合中高效重用。与递归摘要相比，BOOKMARKS提供了（1）主动基础以捕捉任务特定细节和（2）被动更新以避免不必要的计算。在实现中，BOOKMARKS支持概念、行为和状态搜索，每种搜索都由高效的同步方法驱动。BOOKMARKS在来自16个文物的85个角色上显著优于RPA记忆基线，证明了基于搜索的记忆在RPA中的有效性。


cs.CL / 15 / 2605.14192

Why Retrieval-Augmented Generation Fails: A Graph Perspective

检索增强生成失败的原因：一种图视角

Guo, Kai, Dai, Xinnan, Zhang, Zhibo, Lin, Nuohan, Zeng, Shenglai, Ren, Jie, Han, Haoyu, Tang, Jiliang

Abstract

Retrieval-Augmented Generation (RAG) has become a powerful and widely used approach for improving large language models by grounding generation in retrieved evidence. However, RAG systems still produce incorrect answers in many cases. Why RAG fails despite having access to external information remains poorly understood. We present a model-internal study of retrieval-augmented generation that examines how retrieved evidence influences answer generation. Using circuit tracing, we construct attribution graphs that model the flow of information through transformer layers during decoding. These graphs represent interactions among retrieved context, intermediate model activations, and generated tokens, providing a graph-based, circuit-level view of how external evidence is integrated into the model's reasoning process. Across multiple question answering benchmarks, we observe consistent structural differences: correct predictions exhibit deeper reasoning paths, more distributed evidence flow, and a more structured pattern of local connectivity, while failed predictions show shallower, fragmented, and overly concentrated evidence flow. Building on these findings, we develop a graph-based error detection framework that uses attribution-graph topology features. Furthermore, we show that attribution graphs enable targeted interventions. By reinforcing question-constrained evidence grounding, we reshape internal routing so that answer generation remains guided by the question, leading to more effective integration of retrieved information and fewer errors.

Chinese Translation

检索增强生成（Retrieval-Augmented Generation, RAG）已成为一种强大且广泛使用的方法，通过基于检索到的证据来改善大型语言模型。然而，RAG 系统在许多情况下仍然会产生错误答案。尽管可以访问外部信息，RAG 失败的原因仍然不甚明了。我们进行了一项模型内部研究，探讨检索到的证据如何影响答案生成。通过电路追踪，我们构建了归因图，以建模在解码过程中信息在变压器层之间的流动。这些图表示了检索上下文、中间模型激活和生成标记之间的相互作用，提供了一个图形电路级别的视角，展示了外部证据如何在多个问答基准中融入模型的推理过程。我们观察到一致的结构差异：正确预测表现出更深的推理路径、更分散的证据流动和更有结构的局部连接模式，而失败的预测则显示出更浅、碎片化和过于集中证据流动。基于这些发现，我们开发了一种基于图的错误检测框架，利用归因图拓扑特征。此外，我们还展示了归因图能够实现针对性的干预。通过强化受问题约束的证据基础，我们重新塑造了内部路由，使得答案生成始终受到问题的引导，从而更有效地整合检索信息并减少错误。


cs.CL / 16 / 2605.14194

GradShield: Alignment Preserving Finetuning

GradShield：保持对齐的微调

Hu, Zhanhao, Huang, Xiao, Mendoza, Patrick, Alghamdi, Emad A., Alomair, Basel, Popa, Raluca Ada, Wagner, David

Abstract

Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model's alignment. It computes a Finetuning Implicit Harmfulness Score (FIHS) for each data point and applies an adaptive thresholding algorithm to decide which points to remove. We apply GradShield to multiple utility fine-tuning tasks across varying levels of harmful data and evaluate the safety and utility performance of the resulting LLMs using various metrics. The results show that GradShield outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below $6\%$ while preserving utility performance.

Chinese Translation

大型语言模型（LLMs）在微调后存在显著的安全对齐风险，因为模型可能受到显性和隐性有害数据的影响。即使是一些看似无害的数据，也可能无意中引导模型朝向不对齐的行为。为了解决这个问题，我们提出了GradShield，这是一种原则性的过滤方法，通过在有害数据点损害模型对齐之前识别并移除这些数据，来保护LLMs在微调过程中的安全性。它通过为每个数据点计算微调隐性有害性评分（Finetuning Implicit Harmfulness Score, FIHS）来移除潜在有害数据，并采用自适应阈值算法。我们将GradShield应用于多个实用微调任务，涵盖不同程度的有害数据，并使用各种指标评估所得到的LLMs的安全性和实用性能。结果表明，GradShield在所有基线方法中表现优越，攻击成功率（Attack Success Rate, ASR）始终保持在$6\%$以下，同时保持了实用性能。

cs.CL / 17 / 2605.14257

What Makes Words Hard? Sakura at BEA 2026 Shared Task on Vocabulary Difficulty Prediction

是什么导致词汇难度？Sakura在2026年BEA共享任务中的词汇难度预测

Nohejl, Adam, Wu, Xuanxin, Ide, Yusuke, Machin, Maria Angelica Riera, Chang, Yi-Ning, Yanaka, Hitomi

Abstract

We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online at https://github.com/adno/vocabulary-difficulty .

Chinese Translation

我们描述了两种词汇难度预测模型：一种高准确率的黑箱模型，在开放赛道中取得了最佳共享任务结果；另一种可解释模型，其性能优于经过微调的编码器基线。作为黑箱模型，我们使用软目标损失函数对大语言模型（LLM）进行了微调，以有效应用于评分任务，达到了r > 0.91的结果。可解释模型提供了对每个项目难度影响因素的洞察，同时保持了强相关性（r > 0.77）。我们进一步分析了结果，表明英国文化协会的知识型词汇表（Knowledge-based Vocabulary Lists, KVL）中的项目难度通常受到拼写难度或测试项目构造的影响，除了词汇本身的实际生产难度。我们的代码已在线发布，网址为https://github.com/adno/vocabulary-difficulty。

cs.CL / 18 / 2605.14271

Auditing Agent Harness Safety

审计代理执行安全性

Liu, Chengzhi, Guo, Yichen, Liu, Yepeng, Yang, Yuzhe, Yan, Qianqi, Zhao, Xuandong, Hua, Wenyue, Liu, Sheng, Li, Sharon, Bu, Yuheng, Wang, Xin Eric

Abstract

LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.

Chinese Translation

大型语言模型（LLM）代理越来越多地在执行环境中运行，该环境负责调度工具、分配资源和在专用组件之间路由消息。然而，执行环境可能在访问未授权资源或将上下文泄露给错误代理的过程中返回正确且无害的答案。输出级别的评估无法识别这些失败，然而大多数安全基准仅评估最终输出或终端状态，尽管许多违规行为发生在执行过程中的中间阶段，而非终止时。核心问题是执行环境是否在整个执行过程中尊重用户意图、权限边界和信息流约束。为了解决这一问题，我们提出了HarnessAudit，一个审计完整执行轨迹的框架，关注边界合规性、执行忠实性和系统稳定性，特别是在多代理执行环境中，这些风险最为明显。我们进一步介绍了HarnessAudit-Bench，这是一个涵盖八个真实世界领域的210个任务的基准，既包括单代理配置，也包括嵌入安全约束的多代理配置。通过评估十种执行环境配置在前沿模型和三种多代理框架下的表现，我们发现：（i）任务完成与安全执行不一致，违规行为随着轨迹长度的增加而累积；（ii）安全风险在不同领域、任务类型和代理角色之间存在差异；（iii）大多数违规行为集中在资源访问和代理间信息传递；（iv）多代理协作扩大了安全风险面，而执行环境设计则设定了安全部署的上限。

cs.CL / 19 / 2605.14305

Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

无因子化误差的离散扩散语言模型通过推测解码

Fang, Xun, Li, Yunchen, Yuan, Hang, Yu, Zhou

Abstract

Discrete diffusion language models improve generation efficiency through parallel token prediction, but standard $X_0$ prediction methods introduce factorization errors by approximating the clean token posterior with independent token-wise distributions. This paper proposes Factorization-Error-Free Discrete Diffusion Language Modeling (FeF-DLLM), which replaces independent clean-token prediction with an exact prefix-conditioned factorization of the clean posterior to better preserve token dependencies. To reduce the sequential cost introduced by prefix conditioning, FeF-DLLM further incorporates speculative decoding within diffusion denoising, accelerating inference while maintaining the parallel prediction and re-masking properties of DLLMs. Theoretically, we prove that FeF-DLLM generates from the true joint distribution and derive its expected acceleration ratio. Experiments on GSM8K, MATH, HumanEval, and MBPP demonstrate that our method improves accuracy by an average of 5.04 percentage points while achieving an average inference speedup of $3.86\times$.
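
The accept/reject rule used in standard speculative decoding can be sketched as follows. This is the generic token-level rule from the autoregressive setting, shown only to illustrate why such a scheme can sample from the target distribution exactly; it is not FeF-DLLM's diffusion-specific procedure, which the abstract does not detail. The function names and the toy distributions `p` (target) and `q` (draft) are hypothetical.

```python
import random

def residual_distribution(p, q):
    """Normalized max(0, p - q): the distribution to resample from on rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        return list(p)  # p and q coincide; fall back to the target distribution
    return [r / total for r in raw]

def speculative_step(p, q, draft_token, rng):
    """Accept draft_token with probability min(1, p/q), else resample.

    Assumes draft_token was sampled from q, so q[draft_token] > 0.
    Returns (token, accepted).
    """
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    residual = residual_distribution(p, q)
    token = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return token, False

# Toy example (hypothetical numbers): three-token vocabulary.
p = [0.5, 0.3, 0.2]  # target model probabilities
q = [0.2, 0.6, 0.2]  # draft model probabilities
token, accepted = speculative_step(p, q, draft_token=0, rng=random.Random(0))
```

In this toy case token 0 is always accepted because the target assigns it more mass than the draft does, while a rejected draft of token 1 can only be replaced by token 0, the one token where the target exceeds the draft.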

Chinese Translation

离散扩散语言模型通过并行令牌预测提高了生成效率，但标准的 $X_0$ 预测方法通过用独立的令牌分布近似干净令牌后验，引入了因子化误差。本文提出了无因子化误差的离散扩散语言建模（FeF-DLLM），该方法用干净后验的精确前缀条件因子化替代独立的干净令牌预测，以更好地保留令牌之间的依赖关系。为了减少前缀条件带来的顺序成本，FeF-DLLM进一步在扩散去噪中结合了推测解码，加速推理，同时保持DLLMs的并行预测和重新掩蔽特性。从理论上讲，我们证明了FeF-DLLM是从真实的联合分布生成的，并推导出其期望加速比。在GSM8K、MATH、HumanEval和MBPP上的实验表明，我们的方法在提高准确率方面平均提升了5.04个百分点，同时实现了平均推理加速比为 $3.86\times$。

cs.CL / 20 / 2605.14352

Ideology Prediction of German Political Texts

德国政治文本的意识形态预测

Schneider, Sinclair, Steuber, Florian, Schneider, Joao A. G., Rodosek, Gabi Dreo

Abstract

Elections represent a crucial milestone in a nation's ongoing development. To better understand the political rhetoric from various movements, ranging from left to right, we propose a transformer-based model capable of projecting the political orientation of a text on a continuous left-to-right spectrum, represented by a normalized scalar d between -1 and 1. This approach enables analysts to focus on specific segments of the political landscape, such as conservatives, while excluding liberal and far-right movements. Multiclass classifiers can achieve this only if the desired orientation is one of their predefined classes. To determine the most suitable foundation model among 13 candidate transformers for this task, we constructed four distinct corpora. One corpus comprised annotated plenary notes from the German Bundestag, while another was based on an official online decision-making tool, Wahl-O-Mat. The third corpus consisted of articles from 33 newspapers, each identified by its political orientation, and the fourth included 535,200 tweets from 597 members of the 20th and 21st German Bundestag. To mitigate overfitting, we used two of the corpora for training and the other two for testing. DeBERTa-large achieved the highest in-domain F1 score (F1 = 0.844) as well as the highest accuracy on the X (Twitter) out-of-domain test (ACC = 0.864). On the newspaper out-of-domain test, Gemma2-2B performed best (MAE = 0.172). This study demonstrates that transformer models can recognize political framing in German news at the level of public opinion polls. Our findings suggest that both the model architecture and the availability of domain-specific training data can be as influential as model size for estimating political bias. We discuss methodological limitations and outline directions for improving the robustness of bias measurement.

Chinese Translation

选举代表着一个国家持续发展的关键里程碑。为了更好地理解来自不同运动的政治修辞，从左到右，我们提出了一种基于变换器（transformer）的模型，能够将文本的政治取向投射到一个连续的左到右的光谱上，该光谱由一个归一化标量 d 表示，范围在 -1 到 1 之间。这种方法使分析师能够专注于政治景观的特定部分，例如保守派，同时排除自由派和极右运动。这样的任务只能通过多类分类器实现，前提是所需的取向被纳入它们的预定义类别之一。为了确定在13个候选变换器中最适合该任务的基础模型，我们构建了四个不同的语料库。其中一个语料库包含德国联邦议院的注释全体会议记录，另一个基于官方在线决策工具Wahl-O-Mat。第三个语料库由33家报纸的文章组成，每篇文章都标识其政治取向，第四个语料库则包括来自597名第20和第21届德国联邦议院成员的535,200条推文。为了减轻过拟合，我们分别使用两个不同的语料库进行训练和测试。在领域内表现方面，DeBERTa-large达到了最高的F1分数F1=0.844，而在X（Twitter）领域外测试中的准确率为ACC=0.864。关于报纸领域外测试，Gemma2-2B表现优异（MAE = 0.172）。本研究表明，变换器模型能够识别德国新闻中的政治框架，达到公众舆论调查的水平。我们的发现表明，模型架构和领域特定训练数据的可用性对估计政治偏见的影响可能与模型大小同样重要。我们讨论了方法论的局限性，并概述了提高偏见测量稳健性的方向。

cs.CL / 21 / 2605.14354

LLM-based Detection of Manipulative Political Narratives

基于大型语言模型的操控性政治叙事检测

Schneider, Sinclair, Steuber, Florian, Rodosek, Gabi Dreo

Abstract

We present a new computational framework for detecting and structuring manipulative political narratives, a task that has become more important as political discussions have shifted to social media. One of the primary challenges is differentiating between manipulative political narratives and legitimate critiques. Some posts may also reframe actual events within a manipulative context. To achieve good clustering results, we filter manipulative posts beforehand using a detailed few-shot prompt that combines documented campaign narratives with legitimate criticisms to differentiate them. This prompt enables a reasoning model to assign labels, retaining only manipulative narrative posts for further processing. The remaining posts are subsequently embedded and dimensionality-reduced using UMAP, before HDBSCAN is applied to uncover narrative groups. A key advantage of this unsupervised approach is its independence from a predefined list of target categories, enabling it to uncover new narrative clusters. Finally, a reasoning model is employed to uncover the narrative behind each cluster. Applied to over 1.2 million social media posts, this combination of prompt-based filtering and unsupervised clustering identified 41 distinct manipulative narrative clusters.

Chinese Translation

我们提出了一种新的计算框架，用于检测和构建操控性政治叙事。由于政治讨论转向社交媒体，这一任务变得更加重要。其中一个主要挑战是区分操控性政治叙事与合法批评。一些帖子可能在操控性背景下重新框定实际事件。为了获得良好的聚类结果，我们使用详细的少量样本提示（few-shot prompt）预先过滤操控性帖子，该提示结合了已记录的竞选叙事和合法批评，以便进行区分。该提示使推理模型能够分配标签，仅保留操控性叙事帖子以供进一步处理。随后，剩余的帖子使用UMAP进行嵌入和降维，然后应用HDBSCAN以揭示叙事群体。这种无监督方法的一个关键优势在于其独立于预定义的目标类别列表，从而能够发现新的叙事簇。最后，采用推理模型揭示每个簇背后的叙事。该方法应用于超过120万条社交媒体帖子，有效识别出41个不同的操控性叙事簇，通过将基于提示的过滤与无监督聚类相结合。

cs.CL / 22 / 2605.14366

Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax

带有语义奖励的强化学习实现低资源语言扩展而无需对齐代价

Su, Zeli, Zhang, Ziyin, Liu, Zhou, Song, Xuexian, Xu, Zhankai, Zheng, Longfei, Zhang, Xiaolu, Fu, Rong, Xu, Guixian, Zhang, Wentao

Abstract

Extending large language models (LLMs) to low-resource languages often incurs an "alignment tax": improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan-Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
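
The pairing of an embedding-level reward with group-relative normalization can be sketched as follows. This is a minimal illustration, assuming cosine similarity to a reference embedding as the semantic reward and the usual mean/standard-deviation normalization within a sampled group; the abstract does not specify the paper's reward model or GRPO variant, and the toy two-dimensional embeddings are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its sampled group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy group: three sampled outputs scored against one reference embedding.
reference = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rewards = [cosine_similarity(c, reference) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Because the reward depends only on embedding similarity, two outputs with different surface wording but similar meaning receive similar advantages, which is the flexibility the abstract contrasts with token-level imitation in SFT.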

Chinese Translation

将大型语言模型（LLMs）扩展到低资源语言通常会产生“对齐代价”：在目标语言上的改进以一般能力的灾难性遗忘为代价。我们认为这种权衡源于监督微调（SFT）的刚性，它在狭窄和偏见的数据分布上强制执行基于标记级别的表面模仿。为了解决这一局限性，我们提出了一种由群体相对策略优化（Group Relative Policy Optimization, GRPO）驱动的语义空间对齐范式，其中模型使用嵌入级别的语义奖励进行优化，而不是最大化似然。这一目标通过灵活的实现鼓励意义的保留，使得控制更新成为可能，从而减少与预训练知识的破坏性干扰。我们在藏汉机器翻译和藏文标题生成上评估了我们的方法。实验表明，我们的方法在获得低资源能力的同时显著减轻了对齐代价，更有效地保留了整体能力，相较于SFT。尽管产生的表面重叠较少，语义强化学习在开放式生成中却提供了更高的语义质量和偏好，而少量样本迁移结果表明，在有限监督下，它学习到了更具可迁移性和鲁棒性的表示。总体而言，我们的研究表明，带有语义奖励的强化学习为包容性低资源语言扩展提供了一条更安全和可靠的路径。

cs.CL / 23 / 2605.14368

Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement

扩散应如何进入语言模型？几何引导的隐状态替换

Kong, Injin, Lee, Hyoungjoon, Jo, Yohan

Abstract

Continuous diffusion language models lag behind autoregressive transformers, partly because diffusion is applied in spaces poorly suited to language denoising and token recovery. We propose DiHAL, a geometry-guided diffusion-transformer hybrid that asks where diffusion should enter a pretrained transformer. DiHAL scores layers with geometry-based proxies, selects a diffusion-friendly hidden-state interface, and replaces the lower transformer prefix with a diffusion bridge while retaining the upper layers and original LM head. By reconstructing the selected-layer hidden state rather than tokens, DiHAL avoids direct continuous-to-discrete recovery. Experiments on 8B-scale backbones show that the geometry score predicts effective shallow insertion layers under a fixed bridge-training protocol and that hidden-state recovery improves over continuous diffusion baselines in a diagnostic comparison matching the diffusion/recovery training budget. These results suggest that hidden-state geometry helps identify where diffusion-based replacement is feasible inside pretrained language models.

Chinese Translation

连续扩散语言模型在性能上落后于自回归变换器，部分原因在于扩散应用于不适合语言去噪和标记恢复的空间。我们提出了DiHAL，这是一种几何引导的扩散-变换器混合模型，旨在确定扩散应在预训练变换器中的何处介入。DiHAL通过基于几何的代理对层进行评分，选择一个适合扩散的隐状态接口，并用扩散桥替换较低的变换器前缀，同时保留上层和原始语言模型头。通过重构所选层的隐状态而非标记，DiHAL避免了直接的连续到离散的恢复。在8B规模的基础模型上的实验表明，几何评分能够在固定的桥接训练协议下预测有效的浅层插入层，并且隐状态恢复在与扩散/恢复训练预算相匹配的诊断比较中优于连续扩散基线。这些结果表明，隐状态几何有助于识别在预训练语言模型中进行基于扩散的替换的可行位置。

cs.CL / 24 / 2605.14380

Mitigating Data Scarcity in Psychological Defense Classification with Context-Aware Synthetic Augmentation

通过上下文感知的合成增强缓解心理防御分类中的数据稀缺问题

Vu, Hoang-Thuy-Duong, Pham, Quoc-Cuong, Pham, Huy-Hieu

Abstract

Psychological defense mechanisms (PDMs) are unconscious cognitive processes that modulate how individuals perceive and respond to emotional distress. Automatically classifying PDMs from text is clinically valuable but severely hindered by data scarcity and class imbalance, challenges which generative augmentation alone cannot resolve without psychological grounding. In this work, we address these challenges in the PsyDefDetect shared task (BioNLP@ACL 2026) by proposing a context-aware synthetic augmentation framework combined with a hybrid classification model. Our hybrid model integrates contextual language representations with basic clinical features, along with 150 annotated defense items. Experiments demonstrate that definition quality in prompting directly governs generation fidelity and downstream performance. Our method surpasses DMRS Co-Pilot, reaching an accuracy of 58.26% (+40.25%) and a macro-F1 of 24.62% (+15.99%), thereby establishing a strong baseline for psychologically grounded defense mechanism classification in low-resource settings. Source code is available at: https://github.com/htdgv/CASA-PDC.

Chinese Translation

心理防御机制（PDMs）是无意识的认知过程，调节个体如何感知和应对情绪困扰。从文本中自动分类PDMs在临床上具有重要价值，但受到数据稀缺和类别不平衡的严重制约，这些挑战仅靠生成性增强无法解决，缺乏心理学基础。在本研究中，我们通过提出一种上下文感知的合成增强框架，结合混合分类模型，来应对PsyDefDetect共享任务（BioNLP@ACL 2026）中的这些挑战。我们的混合模型将上下文语言表示与基本临床特征相结合，并使用150个标注的防御项目。实验表明，提示中的定义质量直接影响生成的真实性和下游性能。我们的方法超越了DMRS Co-Pilot，达到了58.26%的准确率（+40.25%）和24.62%的宏观F1值（+15.99%），为低资源环境下的心理学基础防御机制分类建立了强有力的基线。源代码可在以下链接获取：https://github.com/htdgv/CASA-PDC。

cs.CL / 25 / 2605.14401

Agentic Recommender System with Hierarchical Belief-State Memory

具有层次信念状态记忆的自主推荐系统

Shen, Xiang, Zhou, Yuhang, Wu, Yifan, Zhao, Zhuokai, Lin, Siyu, Huang, Lei, Zhong, Qianqian, Zhang, Lizhu, Zhang, Benyu, Fan, Xiangjun, Yan, Hong

Abstract

Memory-augmented LLM agents have advanced personalized recommendation, yet existing approaches universally adopt flat memory representations that conflate ephemeral signals with stable preferences, and none provides a complete lifecycle governing how memory should evolve. We propose MARS (Memory-Augmented Agentic Recommender System), a framework that treats recommendation as a partially observable problem and maintains a structured belief state that progressively abstracts noisy behavioral observations into a compact estimate of user preferences. MARS organizes this belief state into three tiers: event memory buffers raw signals, preference memory maintains fine-grained mutable chunks with explicit strength and evidence tracking, and profile memory distills all preferences into a coherent natural language narrative. A complete lifecycle of six operations -- extraction, reinforcement, weakening, consolidation, forgetting, and resynthesis -- is adaptively scheduled by an LLM-based planner rather than fixed-interval heuristics. Experiments on four InstructRec benchmark domains show that MARS achieves state-of-the-art performance, with average improvements of 26.4% in HR@1 and 10.3% in NDCG@10 over the strongest baselines and further gains from agentic scheduling in evolving settings.

Chinese Translation

增强记忆的LLM代理在个性化推荐方面取得了进展，但现有方法普遍采用扁平的记忆表示，混淆了短暂信号与稳定偏好，并且没有提供关于记忆如何演变的完整生命周期管理。我们提出了MARS（增强记忆自主推荐系统），一个将推荐视为部分可观察问题的框架，维护一个结构化的信念状态，逐步将嘈杂的行为观察抽象为用户偏好的紧凑估计。MARS将这一信念状态组织为三个层次：事件记忆缓冲原始信号，偏好记忆维护具有明确强度和证据跟踪的细粒度可变块，个人资料记忆将所有偏好提炼为连贯的自然语言叙述。六个操作的完整生命周期——提取、强化、削弱、巩固、遗忘和再合成——由基于LLM的规划者自适应调度，而不是固定间隔的启发式方法。在四个InstructRec基准领域的实验表明，MARS在HR@1上平均提高了26.4%，在NDCG@10上提高了10.3%，达到了最强基线的最先进性能，并在演变设置中通过自主调度获得了进一步的提升。

cs.CL / 26 / 2605.14404

Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

超越语言的知识：弥合多语言机器遗忘评估中的差距

Hwang, Kyomin, Kim, Hyeonjin, Cho, Sangyeon, Kwak, Nojun

Abstract

While LLMs are increasingly used in commercial services, they pose privacy risks such as leakage of sensitive personally identifiable information (PII). For LLMs trained on multilingual corpora, Multilingual Machine Unlearning (MMU) aims to remove information across multiple languages. However, prior MMU evaluations fail to capture such cross-linguistic distribution of information, being largely limited to direct extensions of per-language evaluation protocols. To this end, we propose two metrics to evaluate the information spread across languages: the Knowledge Separability Score (KSS) and the Knowledge Persistence Score (KPS). KSS measures the overall unlearning quality across multiple languages, while KPS more specifically aims to assess consistent removal of information among different language pairs. We evaluated various unlearning methods in the multilingual setting with these metrics and conducted comprehensive analyses. Through our investigation, we provide insights into unique phenomena exclusive to MMU and offer a new perspective on MMU evaluation.

Chinese Translation

随着大型语言模型（LLMs）在商业服务中的日益普及，它们带来了隐私风险，例如敏感个人身份信息（PII）的泄露。对于在多语言语料库上训练的LLMs，多语言机器遗忘（MMU）旨在跨多种语言移除信息。然而，先前的MMU评估未能捕捉到这种跨语言信息的分布，主要局限于对每种语言评估协议的直接扩展。为此，我们提出了两个指标来评估跨语言的信息传播：知识可分离性评分（Knowledge Separability Score, KSS）和知识持久性评分（Knowledge Persistence Score, KPS）。KSS衡量多语言间整体的遗忘质量，而KPS则更具体地旨在评估不同语言对之间信息的一致移除。我们利用这些指标在多语言环境中评估了各种遗忘方法，并进行了全面分析。通过我们的研究，我们提供了对MMU特有现象的洞察，并为MMU评估提供了新的视角。

cs.CL / 27 / 2605.14427

A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

基于微积分的端到端自动语音识别词汇大小确定框架

Kopparapu, Sunil Kumar

Abstract

In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous, typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive their vocabulary, often referred to as tokens, from the text corpus used for training. The choice and, more importantly, the size of this vocabulary is a critical hyper-parameter in training end-to-end ASR systems. Tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model (ULM) use the vocabulary size as an input hyper-parameter to generate the sub-words employed during ASR training. Popular toolkits like ESPNet provide a fixed vocabulary size in their training recipes, but there is little documentation or discussion in the literature regarding how these values are determined. Recent work [1] has formalized an approach to identify the vocabulary size best suited for end-to-end ASR, introducing a cost function framework that treats the tokenization process as a black box. In this paper, we build upon that foundation by curve fitting the training data and using the first and second derivative tests from calculus to formally estimate the vocabulary size hyper-parameter. We demonstrate the utility of our approach by applying it to the standard LibriSpeech corpus and show that the optimal choice of the vocabulary size hyper-parameter improves ASR performance. The main contribution of this paper is formalizing an approach to identify the vocabulary size best suited for training an end-to-end ASR system.
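
The derivative-test idea can be sketched as follows: fit a smooth curve to measured (vocabulary size, cost) pairs, find where the first derivative vanishes, and use the sign of the second derivative to confirm a minimum. This sketch assumes a quadratic fit purely for illustration; the abstract does not state which curve family or cost function the paper uses, and the data points below are hypothetical.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c).

    Requires at least three distinct x values.
    """
    s = [sum(x ** k for x in xs) for k in range(5)]                # sums of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented normal equations for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

def stationary_minimum(a, b):
    """First derivative 2*a*x + b = 0 gives x* = -b / (2a).

    The second derivative is 2*a, so x* is a minimum only when a > 0.
    """
    if a <= 0:
        return None
    return -b / (2 * a)

# Hypothetical measurements: vocabulary size in thousands vs. a cost value.
vocab_sizes = [1.0, 2.0, 3.0, 4.0, 5.0]
costs = [6.0, 3.0, 2.0, 3.0, 6.0]
a, b, c = fit_quadratic(vocab_sizes, costs)
best_size = stationary_minimum(a, b)
```

Expressing vocabulary sizes in thousands keeps the normal equations well conditioned; with raw token counts the fourth-power sums grow large enough to hurt numerical accuracy.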

Chinese Translation

在混合自动语音识别（ASR）系统中，词汇大小是明确的，通常由语言中存在的音素、双音素或三音素的数量决定。相比之下，端到端ASR系统的词汇通常是从用于训练的文本语料库中派生的，通常称为标记（tokens）。这个词汇的选择，尤其是其大小，是训练端到端ASR系统中的一个关键超参数。标记化算法如字节对编码（Byte Pair Encoding, BPE）、WordPiece和单元语言模型（Unigram Language Model, ULM）将词汇大小作为输入超参数，用于生成在ASR训练中使用的子词。像ESPNet这样的流行工具包在其训练配方中提供了固定的词汇大小，但在文献中关于这些值如何确定的文档或讨论很少。最近的研究[1]正式化了一种识别最适合端到端ASR的词汇大小的方法，提出了一个将标记化过程视为黑箱的成本函数框架。在本文中，我们在这一基础上，通过对训练数据进行曲线拟合，并运用微积分中的一阶和二阶导数测试原理，正式估计词汇大小超参数。我们通过在标准Librispeech语料库上应用该方法，展示了我们方法的实用性和有效性，并表明最佳的词汇大小超参数选择提高了ASR的性能。本文的主要贡献在于正式化了一种识别最适合训练端到端ASR系统的词汇大小的方法。

cs.CL / 28 / 2605.14473

Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

RAG 是否知道检索错误？在知识冲突下诊断上下文合规性

Chen, Yihang, Qian, Pin, Wang, Su, Zhang, Sipeng, Xu, Huan, Lin, Shuhuai, Wei, Xinpeng

Abstract

The Context-Compliance Regime in Retrieval-Augmented Generation (RAG) occurs when retrieved context dominates the final answer even when it conflicts with the model's parametric knowledge. Accuracy alone does not reveal how retrieved context causally shapes answers under such conflict. We introduce Context-Driven Decomposition (CDD), a belief-decomposition probe that operates at inference time and serves as an intervention mechanism for controlled retrieval conflict. Across Epi-Scale stress tests, TruthfulQA misconception injection, and cross-model reruns, CDD exposes three patterns. P1: context compliance is measurable in an upper-bound adversarial setting, where Standard RAG reaches 15.0% accuracy on TruthfulQA misconception injection (N=500). P2: adversarial accuracy gains transfer across model families: CDD improves accuracy on Gemini-2.5-Flash and on Claude Haiku/Sonnet/Opus, but rationale-answer causal coupling does not transfer. CDD reaches 64.1% mistake-injection causal sensitivity on Gemini-2.5-Flash, while sensitivities for all three Claude variants fall in the [-3%, +7%] range, suggesting that the Claude-side accuracy gains operate through a mechanism distinct from the explicit conflict-resolution trace. P3: explicit conflict decomposition improves robustness under temporal drift and noisy distractors, with CDD reaching 71.3% on temporal shifts and 69.9% on distractor evidence on the full Epi-Scale adversarial benchmark. These three patterns identify context-compliance as a structural axis along which standard RAG can be probed and intervened on, distinct from retrieval-quality or single-method robustness questions, and motivate releasing Epi-Scale for systematic study across model families and retrieval pipelines.

Chinese Translation

在检索增强生成（RAG）中，上下文合规机制发生在检索到的上下文主导最终答案，即使它与模型的参数知识发生冲突。仅靠准确性无法揭示在这种冲突下检索到的上下文如何因果性地塑造答案。我们引入了上下文驱动分解（Context-Driven Decomposition, CDD），这是一种在推理时操作的信念分解探针，作为控制检索冲突的干预机制。在 Epi-Scale 压力测试、TruthfulQA 误解注入和跨模型重跑中，CDD 揭示了三种模式。模式 1：在上限对抗设置中可以测量上下文合规性，其中标准 RAG 在 TruthfulQA 误解注入上达到 15.0% 的准确率（N=500）。模式 2：对抗准确性提升在模型家族之间转移：CDD 在 Gemini-2.5-Flash 以及 Claude Haiku/Sonnet/Opus 上提高了准确性，但理由-答案的因果耦合并未转移。CDD 在 Gemini-2.5-Flash 上达到 64.1% 的错误注入因果敏感性，而所有三个 Claude 变体的敏感性均在 [-3%, +7%] 范围内，表明 Claude 侧的准确性提升通过一种与显式冲突解决轨迹不同的机制运作。模式 3：显式冲突分解在时间漂移和噪声干扰下提高了鲁棒性，CDD 在完整的 Epi-Scale 对抗基准上在时间变化上达到 71.3%，在干扰证据上达到 69.9%。这三种模式将上下文合规性识别为一个结构轴，标准 RAG 可以在此进行探测和干预，这与检索质量或单一方法的鲁棒性问题不同，并激励我们发布 Epi-Scale 以便在模型家族和检索管道中进行系统研究。

cs.CL / 29 / 2605.14480

Cross-Linguistic Transcription and Phonological Representation in the Huìtóngguǎnxì Huáyíyìyǔ

跨语言转录与音韵表征：明代汇通馆系列多语言词汇的研究

Kim, Ji-eun

Abstract

Purpose: This study investigates the transcription principles underlying Huìtóngguǎnxì Huáyíyìyǔ (HHY), a series of multilingual glossaries compiled by the Ming government between the fifteenth and sixteenth centuries for interpreter training. The study treats HHY not as a collection of isolated language materials, but as a coherent multilingual transcription system representing spoken forms of non-Chinese languages through Chinese characters. Methods: A substantial portion of HHY was digitized and aligned with Chinese phonological categories. Previous reconstructions of individual language sections were critically reviewed and integrated into a unified comparative database. The analysis focuses on cross-linguistic regularities in Main Transcription (MT) and Supplementary Transcription (ST) across eight language sections. Results: MT generally represents sounds compatible with the Chinese syllable structure of the period, whereas ST mainly encodes phonetic features less compatible with Chinese phonology. The analysis further shows that Chinese phonological categories were used more flexibly in foreign-language transcription than previously assumed. HHY therefore functioned as a relatively systematic method of phonetic approximation rather than a direct projection of Chinese phonology onto non-Chinese languages. Conclusion: HHY can be analyzed as an internally structured transcription system rather than merely as a collection of glossaries. More broadly, the study demonstrates that historical transcription systems can provide valuable evidence for historical phonology, particularly for under-documented Asian languages with limited historical records.

Chinese Translation

目的：本研究探讨了汇通馆华意译语（Huìtóngguǎnxì Huáyíyìyǔ，简称HHY）背后的转录原则。HHY是明代政府在十五至十六世纪间为培训翻译人员而编纂的一系列多语言词汇。本研究将HHY视为一个连贯的多语言转录系统，通过汉字表示非汉语的口语形式，而非孤立的语言材料集合。方法：HHY的大部分内容已被数字化，并与汉语音韵类别对齐。对各个语言部分的先前重建进行了批判性审查，并整合到一个统一的比较数据库中。分析重点关注八个语言部分中主要转录（Main Transcription, MT）和补充转录（Supplementary Transcription, ST）之间的跨语言规律。结果：MT通常代表与当时汉语音节结构兼容的声音，而ST则主要编码与汉语音韵学兼容性较低的语音特征。分析进一步表明，在外语转录中，汉语音韵类别的使用比先前假设的更为灵活。因此，HHY作为一种相对系统的语音近似方法，而非将汉语音韵直接投射到非汉语语言上的工具。结论：HHY可以被分析为一个内部结构化的转录系统，而不仅仅是一个词汇集合。更广泛地说，本研究表明历史转录系统可以为历史音韵学提供有价值的证据，特别是对于历史记录有限的亚洲语言。

cs.CL / 30 / 2605.14498

GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

GroupMemBench：多方对话中大型语言模型代理记忆的基准测试

Yang, Jingbo, Lai, Kwei-Herng, Wang, Xiaowen, Chang, Shiyu, Harari, Yaar, Gabrilovich, Evgeniy

Abstract

Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, both existing memory systems and benchmarks are built around the dyadic, single-user setup, even though real deployments routinely span groups and channels with multiple users interacting with the agent and with each other. This mismatch leaves three properties of group memory unmeasured: (i) group dynamics that go beyond concatenated one-on-one chats, (ii) speaker-grounded belief tracking, where per-user memory modeling is needed, and (iii) audience-adapted language, where Theory-of-Mind shifts produce role-specific vocabulary. We introduce GroupMemBench, a benchmark that exposes all three. A graph-grounded synthesis pipeline produces multi-party conversations with controllable reply structure and conditions each message on per-user personas and target audiences. An adversarial query pipeline then binds every question to a specific asker across six categories, spanning multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention, and iteratively searches for challenging, realistic queries that reflect comprehensive memory capability. Benchmarking leading memory systems exposes a sharp collapse: the strongest one reaches only 46.0% average accuracy, with knowledge update at 27.1% and term ambiguity at 37.7%, while a simple BM25 baseline matches or exceeds most agent memory systems. This indicates that current memory ingestion erases the structural and lexical features group memory depends on, leaving multi-user memory far from solved.
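
For readers unfamiliar with the lexical baseline mentioned above, a minimal BM25 scorer can be sketched as follows. It assumes whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant; the abstract does not describe the benchmark's actual BM25 configuration, and the chat messages are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization penalizes longer-than-average documents.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical group-chat messages; the speaker prefix stays in the text.
messages = [
    "alice: the launch moved to friday",
    "bob: lunch is at noon",
    "carol: friday works for the launch review",
]
scores = bm25_scores("launch friday", messages)
```

Because the raw messages are indexed verbatim, speaker names and exact vocabulary remain searchable, which may be one reason such a baseline holds up when summarizing memory systems discard those lexical cues.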

Chinese Translation

大型语言模型（LLM）代理越来越多地作为个人助手和工作场所协作者，其效用依赖于能够在长时间对话中提取、检索和应用信息的记忆系统。然而，现有的记忆系统和基准测试都是围绕二人对话、单用户设置构建的，尽管实际部署通常涉及多个用户与代理以及彼此之间的互动。这种不匹配使得群体记忆的三个特性未被测量：（i）超越简单一对一聊天的群体动态；（ii）基于发言者的信念追踪，需要针对每个用户的记忆建模；（iii）适应听众的语言，其中心智理论的转变产生角色特定的词汇。我们引入了GroupMemBench，一个揭示这三种特性的基准测试。一个基于图的合成管道生成具有可控回复结构的多方对话，并根据每个用户的人物特征和目标受众对每条消息进行条件设置。然后，一个对抗性查询管道将每个问题绑定到六个类别中的特定提问者，包括多跳推理、知识更新、术语歧义、用户隐含推理、时间推理和弃权，并迭代搜索反映全面记忆能力的具有挑战性和现实性的查询。对领先记忆系统的基准测试揭示了一个明显的崩溃：最强的系统平均准确率仅为46.0%，知识更新为27.1%，术语歧义为37.7%，而一个简单的BM25基线与大多数代理记忆系统相匹配或超过。这表明当前的记忆摄取抹去了群体记忆所依赖的结构和词汇特征，使得多用户记忆问题远未解决。

cs.CL / 31 / 2605.14517

Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

大型语言模型的维度级意图保真度评估：来自结构化提示消融的证据

Peng, Gang

Abstract

Holistic evaluation scores capture overall output quality but do not distinguish whether a model reproduced the structural form of a user's request from whether it preserved the user's specific intent. We propose a dimension-level intent fidelity evaluation framework, applied here through a structured prompt ablation study across 2,880 outputs spanning three languages, three task domains, and six LLMs, that separately measures structural recovery and intent fidelity for each semantic dimension. This framework reveals a systematic structural-fidelity split: among Chinese-language outputs with complete paired scores, 25.7% received perfect holistic alignment scores (GA=5) while exhibiting measurable dimensional intent deficits; among English-language outputs, this proportion rose to 58.6%. Human evaluation confirmed that these split-zone outputs represent genuine quality deficits and that dimensional fidelity scores track human judgements more reliably than holistic scores do. A public-private decomposition of 2,520 ablation cells characterises when models successfully compensate for missing intent and when they fail, while proxy annotation distinguishes prior inferability from default recoverability. A weight-perturbation experiment shows that moderate misalignment is typically absorbed, whereas severe dimensional inversion is consistently harmful. These findings demonstrate that dimension-level intent fidelity evaluation is a necessary complement to holistic assessment when evaluating LLM outputs for user-specific tasks.

Chinese Translation

整体评估分数捕捉输出质量的总体水平，但未能区分模型是否重现了用户请求的结构形式，以及是否保留了用户的特定意图。我们提出了一种维度级意图保真度评估框架，通过对2880个输出进行结构化提示消融研究，涵盖三种语言、三个任务领域和六个大型语言模型（LLMs），分别测量每个语义维度的结构恢复和意图保真度。该框架揭示了系统性的结构保真度分裂：在具有完整配对分数的中文输出中，25.7%获得了完美的整体一致性分数（GA=5），但表现出可测量的维度意图缺失；在英文输出中，这一比例上升至58.6%。人工评估确认这些分裂区输出确实代表了真实的质量缺陷，并且维度保真度分数比整体分数更可靠地反映人类判断。对2520个消融单元的公共-私有分解表征了模型何时成功补偿缺失的意图以及何时失败，而代理注释则区分了先前可推断性与默认可恢复性。权重扰动实验表明，适度的不对齐通常会被吸收，而严重的维度反转则始终是有害的。这些发现表明，维度级意图保真度评估是评估大型语言模型输出以满足用户特定任务时对整体评估的必要补充。

cs.CL / 32 / 2605.14531

Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

将语言生成视为最优控制：潜在控制空间中的闭环扩散

Dong, ZiYi, Huang, Yuliang, Deng, Weijian, Ji, Xiangyang, Lin, Liang, Wei, Pengxu

Abstract

This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of a combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.

Chinese Translation

本研究将语言生成重新表述为一个随机最优控制问题，提供了一个统一的理论视角来分析自回归模型和扩散模型，并从轨迹奇异性、伴随状态消失和梯度缺失的组合角度解释它们的局限性（效率-保真悖论、不可逆性误差传播、优化可处理性和保真度）。为了解决这些问题，我们近似求解哈密顿-雅可比-贝尔曼（Hamilton-Jacobi-Bellman, HJB）方程，得到一个作为闭环控制器的最优策略。为了绕过直接求解HJB偏微分方程的复杂性，我们在经过修正的潜在控制空间内采用流匹配（Flow Matching）作为最优轨迹求解器。这使得我们的Manta-LM结合全局积分算子能够近似全局向量场，有效实现一个同时具备高保真文本生成和高效、低成本并行采样的模型。从经验上看，我们的方法在语言建模和条件生成任务上表现出色，同时展现出更好的稳定性、效率和可控性。

cs.CL / 33 / 2605.14539

Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

从失败中学习：基于可验证奖励的纠正导向策略优化

Ren, Mengjie, Lou, Jie, Cao, Boxi, Wen, Xueru, Lin, Hongyu, Han, Xianpei, Sun, Le, Yu, Xing, Lu, Yaojie

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model's ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model's intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.
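The pass@K claim above can be read against a commonly used unbiased estimator from the code-generation evaluation literature: with `n` samples per problem, of which `c` are correct, pass@k is estimated as 1 - C(n-c, k) / C(n, k). Whether the paper uses this exact estimator is an assumption; the function name and the sample counts below are illustrative only.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples, exactly 1 correct.
print(pass_at_k(10, 1, 1))  # chance that a single draw is the correct sample
print(pass_at_k(10, 1, 5))  # chance that 5 draws include the correct sample
```

A model that only sharpens probability mass on answers it could already reach tends to raise pass@1 without raising pass@k at large k, which is why gains at larger k are taken as evidence of broader reasoning coverage.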

Chinese Translation

基于可验证奖励的强化学习（RLVR）已成为提升大型语言模型推理能力的有效范式。然而，RLVR训练常常受到稀疏的二元奖励和弱信用分配的阻碍，导致优化信号模糊以及未能充分利用嵌入在失败轨迹中的有用信息。为了解决这一挑战，我们提出了纠正导向策略优化（CIPO），这是对RLVR的一种简单而有效的扩展，它将基于策略的失败轨迹转化为纠正导向的监督，而无需依赖任何外部信号。通过联合优化来自模型自身失败尝试的纠正样本与标准RLVR目标，CIPO提高了学习效果，同时明确增强了模型纠正自身错误的能力。在涵盖数学推理和代码生成的11个基准测试中的广泛实验表明，CIPO在推理和纠正性能上始终显著优于强基线。此外，CIPO还带来了更强的pass@K增益，表明它提升了模型的内在推理能力，而不仅仅是重新分配现有正确答案的概率质量。

cs.CL / 34 / 2605.14570

Uncertainty Quantification for Large Language Diffusion Models

大型语言扩散模型的不确定性量化

Vazhentsev, Artem, Smirnov, Vladislav, Li, David, Panov, Maxim, Baldwin, Timothy, Shelmanov, Artem

Abstract

Large Language Diffusion Models (LLDMs) are emerging as an alternative to autoregressive models, offering faster inference through higher parallelism. Similar to autoregressive LLMs, they remain prone to hallucinations, making reliable uncertainty quantification (UQ) crucial for safe deployment. However, existing UQ methods are fundamentally misaligned with this new paradigm: they assume autoregressive factorization or use expensive repeated sampling, negating the efficiency of LLDMs. In this work, we present the first systematic study of UQ for LLDMs and propose lightweight, zero-shot uncertainty signals derived from the iterative denoising process, leveraging intermediate generations, token remasking dynamics, and denoising complexity. We further adapt a state-of-the-art UQ method to LLDMs by combining masked diffusion likelihoods with trajectory-based semantic dissimilarity. We prove that expected trajectory dissimilarity lower bounds the masked diffusion training objective, which motivates its usage as an uncertainty score. Comprehensive experiments across three tasks, eight datasets, and two models show that our method achieves a favorable cost-performance trade-off: it approaches the strongest sampling-based baselines while incurring up to 100x lower computational overhead. Our work demonstrates that LLDMs can deliver both fast inference and reliable hallucination detection simultaneously.

Chinese Translation

大型语言扩散模型（LLDMs）作为自回归模型的替代方案正在兴起，通过更高的并行性提供更快的推理。与自回归大型语言模型（LLMs）类似，它们仍然容易出现幻觉，因此可靠的不确定性量化（UQ）对于安全部署至关重要。然而，现有的不确定性量化方法与这一新范式在根本上不一致：它们假设自回归因子分解或使用昂贵的重复采样，从而抵消了LLDMs的效率。在本研究中，我们首次系统性地研究了LLDMs的不确定性量化，并提出了基于迭代去噪过程的轻量级零样本不确定性信号，利用中间生成、标记重掩蔽动态和去噪复杂性。我们进一步通过将掩蔽扩散似然与基于轨迹的语义差异结合，调整了一种最先进的不确定性量化方法以适应LLDMs。我们证明了期望的轨迹差异下界掩蔽扩散训练目标，这为其作为不确定性评分的使用提供了动机。在三个任务、八个数据集和两个模型上的全面实验表明，我们的方法在成本与性能之间实现了良好的平衡：它接近最强的基于采样的基线，同时计算开销降低了多达100倍。我们的工作表明，LLDMs可以同时实现快速推理和可靠的幻觉检测。

cs.CL / 35 / 2605.14589

EndPrompt: Efficient Long-Context Extension via Terminal Anchoring

EndPrompt：通过终端锚定实现高效的长上下文扩展

Tian, Han, Chen, Luxuan, Chen, Xinran, Kong, Rui, Wang, Fang, Chen, Jiamin, Zhao, Jinman, Li, Yuchen, Zhao, Jiashu, Wang, Shuaiqiang, Xiong, Haoyi, Yin, Dawei

Abstract

Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective context extension using only short training sequences. The core insight is that exposing a model to long-range relative positional distances does not require constructing full-length inputs: we preserve the original short context as an intact first segment and append a brief terminal prompt as a second segment, assigning it positional indices near the target context length. This two-segment construction introduces both local and long-range relative distances within a short physical sequence while maintaining the semantic continuity of the training text, a property absent in chunk-based simulation approaches that split contiguous context. We provide a theoretical analysis grounded in Rotary Position Embedding and the Bernstein inequality, showing that position interpolation induces a rigorous smoothness constraint over the attention function, with shared Transformer parameters further suppressing unstable extrapolation to unobserved intermediate distances. Applied to LLaMA-family models extending the context window from 8K to 64K, EndPrompt achieves an average RULER score of 76.03 and the highest average on LongBench, surpassing LCEG (72.24), LongLoRA (72.95), and full-length fine-tuning (69.23) while requiring substantially less computation. These results demonstrate that long-context generalization can be induced from sparse positional supervision, challenging the prevailing assumption that dense long-sequence training is necessary for reliable context-window extension. The code is available at https://github.com/clx1415926/EndPrompt.
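The two-segment construction described above can be sketched with position indices alone. This is a toy illustration, not the released implementation: the function names, the choice to place the terminal prompt on the very last indices of the target window, and the tiny lengths are all assumptions.

```python
def two_segment_position_ids(context_len, prompt_len, target_len):
    """Position ids for a short context followed by a terminal prompt placed
    at the end of the target window (assumed placement)."""
    if context_len + prompt_len > target_len:
        raise ValueError("segments do not fit inside the target window")
    first = list(range(context_len))                          # intact short context
    second = list(range(target_len - prompt_len, target_len))  # terminal prompt
    return first + second

def relative_distances(position_ids):
    """All positive query-key distances that appear under causal attention."""
    return sorted({q - k for q in position_ids for k in position_ids if q > k})

# Toy sizes: 8 context tokens, 2 prompt tokens, target window of 64 positions.
ids = two_segment_position_ids(context_len=8, prompt_len=2, target_len=64)
distances = relative_distances(ids)
print(ids)
print(distances)
```

With only 10 physical tokens the sequence exposes both local distances (within each segment) and distances close to the full target length (between segments), while the intermediate distances stay unobserved. That gap is the part the abstract's smoothness argument is meant to cover.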

Chinese Translation

扩展大型语言模型的上下文窗口通常需要在目标长度的序列上进行训练，这会产生二次的内存和计算成本，使得长上下文适应变得昂贵且难以重现。我们提出了EndPrompt，这是一种仅使用短训练序列就能有效扩展上下文的方法。核心见解是，暴露模型于长距离相对位置距离并不需要构建完整长度的输入：我们保留原始短上下文作为完整的第一部分，并附加一个简短的终端提示作为第二部分，赋予其接近目标上下文长度的位置索引。这种两部分构造在短物理序列中引入了局部和长距离的相对距离，同时保持训练文本的语义连贯性——这一特性在基于块的模拟方法中是缺失的，这些方法会拆分连续的上下文。我们提供了基于旋转位置嵌入（Rotary Position Embedding）和伯恩斯坦不等式的理论分析，表明位置插值对注意力函数施加了严格的平滑性约束，共享的Transformer参数进一步抑制了对未观察到的中间距离的不稳定外推。应用于将上下文窗口从8K扩展到64K的LLaMA系列模型，EndPrompt实现了76.03的平均RULER得分，并在LongBench上获得了最高的平均得分，超过了LCEG（72.24）、LongLoRA（72.95）和全长微调（69.23），同时所需计算量大幅减少。这些结果表明，稀疏位置监督可以诱导长上下文的泛化，挑战了密集长序列训练对于可靠上下文窗口扩展是必要的普遍假设。代码可在https://github.com/clx1415926/EndPrompt获取。

cs.CL / 36 / 2605.14600

SciPaths: Forecasting Pathways to Scientific Discovery

SciPaths：科学发现路径的预测

Chamoun, Eric, Chi, Yizhou, Chen, Yulong, Cao, Rui, Ding, Zifeng, Korakakis, Michalis, Vlachos, Andreas

Abstract

Scientific progress depends on sequences of enabling contributions, yet existing AI4Science benchmarks largely focus on citation prediction, literature retrieval, or idea generation rather than the dependencies that make progress possible. In this paper, we introduce discovery pathway forecasting: given a target scientific contribution and the prior literature available at a specified time, the task is to (1) identify the enabling contributions required to realize it and (2) ground each in prior work when such prior work exists. We present SciPaths, a benchmark of 262 expert-annotated gold pathways and 2,444 silver pathways constructed from machine learning and natural language processing papers, where each pathway records enabling contributions, roles, rationales, and prior-work groundings or unmapped decisions. Evaluating frontier and open-weight language models, we find that the best model reaches only 0.189 F1 under strict semantic matching, with core methodological dependencies hardest to recover. Prior-work grounding improves substantially when gold enabling contributions are provided, showing that decomposition quality is a major bottleneck for end-to-end pathway recovery. SciPaths therefore shifts evaluation toward a missing capability in scientific forecasting: reasoning backward from a target contribution to the enabling scientific building blocks and prior-work dependencies that make it feasible.

Chinese Translation

科学进步依赖于一系列促进性贡献，但现有的AI4Science基准主要集中在引用预测、文献检索或创意生成，而非使进步成为可能的依赖关系。本文介绍了发现路径预测：给定一个目标科学贡献和在特定时间可用的先前文献，任务是（1）识别实现该贡献所需的促进性贡献，以及（2）在存在先前工作的情况下，将每个贡献与先前工作相结合。我们提出了SciPaths，这是一个包含262条专家注释的金路径和2,444条银路径的基准，这些路径是从机器学习和自然语言处理论文中构建的，每条路径记录了促进性贡献、角色、理由以及先前工作的基础或未映射的决策。评估前沿和开放权重语言模型时，我们发现最佳模型在严格语义匹配下仅达到0.189的F1值，而核心方法论依赖关系最难恢复。当提供金色促进性贡献时，先前工作的基础显著改善，表明分解质量是端到端路径恢复的主要瓶颈。因此，SciPaths将评估转向科学预测中缺失的能力：从目标贡献向后推理到使其可行的促进性科学构建块和先前工作依赖关系。

cs.CL / 37 / 2605.14679

AI-assisted cultural heritage dissemination: Comparing NMT and glossary-augmented LLM translation in rock art documents

人工智能辅助的文化遗产传播：比较岩石艺术文献中的神经机器翻译与词汇增强的大语言模型翻译

Briva-Iglesias, Vicent, Ferre-Fernández, María

Abstract

Cultural heritage institutions increasingly disseminate research and interpretive materials globally, but multilingual dissemination is constrained by limited budgets and staffing. In terminology-dense domains such as rock art, translation quality depends on accurate, consistent specialised terms, and small lexical errors can mislead non-specialists and reduce reuse. We compare three English MT setups for a Spanish academic rock art text, focusing on simple, operationally feasible interventions rather than complex model-side modifications: (1) DeepL as a strong NMT baseline, (2) Gemini-Simple (LLM with a basic prompt), and (3) Gemini-RAG (the same LLM with glossary-augmented prompting via term-pair retrieval). Using PEARMUT, we conduct a human evaluation via (i) multi-way Direct Assessment (0-100) and (ii) targeted terminology auditing with a restricted MQM taxonomy. Gemini-RAG yields the highest exact-match terminology accuracy (81.4%), versus Gemini-Simple (69.1%) and DeepL (64.4%), while preserving overall quality (mean DA 85.3 Gemini-RAG vs. 85.2 Gemini-Simple), outperforming DeepL (80.3). These results show that glossary-augmented prompting is a low-overhead way to improve terminology control in cultural-heritage translation if institutions maintain minimal terminology resources and lightweight evaluation procedures.
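Glossary-augmented prompting via term-pair retrieval, as described above, can be sketched in a few lines. The paper's actual retrieval method, prompt wording, and scoring rules are not given in the abstract, so the substring matching, the prompt template, the accuracy function, and the three glossary entries below are assumptions for illustration.

```python
def retrieve_term_pairs(source_text, glossary):
    """Return (Spanish, English) pairs whose Spanish term occurs in the source text."""
    text = source_text.lower()
    return [(es, en) for es, en in glossary.items() if es.lower() in text]

def build_prompt(source_text, glossary):
    """Assemble a translation prompt that lists only the retrieved term pairs."""
    pairs = retrieve_term_pairs(source_text, glossary)
    lines = ["Translate the following Spanish text into English."]
    if pairs:
        lines.append("Use these term translations exactly:")
        lines += [f"- {es} -> {en}" for es, en in pairs]
    lines.append("Text: " + source_text)
    return "\n".join(lines)

def exact_match_accuracy(translation, expected_terms):
    """Share of expected English terms that appear verbatim in the translation."""
    hits = sum(term.lower() in translation.lower() for term in expected_terms)
    return hits / len(expected_terms)

# Hypothetical glossary and sentence, for illustration only.
glossary = {
    "arte rupestre": "rock art",
    "abrigo": "rock shelter",
    "pictograma": "pictograph",
}
source = "El abrigo conserva arte rupestre levantino."
pairs = retrieve_term_pairs(source, glossary)
prompt = build_prompt(source, glossary)
print(prompt)
```

Retrieving only the pairs that occur in the current segment keeps the prompt short, which fits the abstract's emphasis on low-overhead interventions. The exact-match metric here is deliberately strict: a translation that renders "abrigo" as "shelter" rather than "rock shelter" would count as a miss.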

Chinese Translation

文化遗产机构日益在全球范围内传播研究和解释材料，但多语言传播受到预算和人员配置的限制。在如岩石艺术等术语密集的领域，翻译质量依赖于准确、一致的专业术语，而小的词汇错误可能会误导非专业人士并减少重用。我们比较了三种针对西班牙学术岩石艺术文本的英语机器翻译设置，重点关注简单、可操作的干预措施，而非复杂的模型侧修改：（1）DeepL作为强大的神经机器翻译基线，（2）Gemini-Simple（带有基本提示的大语言模型），以及（3）Gemini-RAG（通过术语对检索增强提示的相同大语言模型）。我们使用PEARMUT进行人类评估，采用（i）多方直接评估（0-100）和（ii）使用限制MQM分类法的目标术语审计。Gemini-RAG在准确匹配术语的准确性上表现最佳（81.4%），相比之下Gemini-Simple为69.1%，DeepL为64.4%，同时保持了整体质量（Gemini-RAG的平均直接评估为85.3，Gemini-Simple为85.2），超越了DeepL（80.3）。这些结果表明，词汇增强提示是一种低开销的方式，可以在文化遗产翻译中改善术语控制，前提是机构维持最小的术语资源和轻量级的评估程序。

cs.CL / 38 / 2605.14744

Mechanical Enforcement for LLM Governance: Evidence of Governance-Task Decoupling in Financial Decision Systems

机械执行在大型语言模型治理中的应用：金融决策系统中治理任务解耦的证据

Rodríguez, José Manuel de la Chica, Martí-González, Carlos

Abstract

Large language models in regulated financial workflows are governed by natural-language policies that the same model interprets, creating a principal-agent failure: outputs can appear compliant without being compliant. Existing evaluation measures task accuracy but not whether governance constrains behaviour at the decision-rationale level, where regulated decisions must be auditable. We introduce five governance metrics that quantify policy compliance at the rationale level and apply them in a synthetic banking domain to compare text-only governance against mechanical enforcement: four primitives operating outside the model's interpretive loop. Under text-only governance, 27% of deferrals carry no decision-relevant information. Mechanical enforcement reduces this rate by 73%, more than doubles deferral information content, and raises task accuracy from MCC 0.43 to 0.88. The improvement is driven by architectural separation: LLM-generated rationales under mechanical enforcement show comparable CDL to text-only governance; the gain comes from removing clear-cut decisions from the model's control. A causal ablation confirms that each primitive is individually necessary. Our central finding is a governance-task decoupling: under structural stress, text-only governance degrades on both dimensions simultaneously, whereas mechanical enforcement preserves governance quality even as task performance drops. This implies that governance and task evaluation are distinct axes: accuracy is not a sufficient proxy for governance in regulated AI systems.
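Task accuracy above is reported as MCC (Matthews correlation coefficient). For readers less familiar with it, the sketch below computes MCC from a binary confusion matrix and contrasts it with plain accuracy. The confusion-matrix counts are hypothetical and are not taken from the paper.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

def accuracy(tp, tn, fp, fn):
    """Fraction of all decisions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical imbalanced case: 10 positives, 90 negatives, 9 positives missed.
counts = dict(tp=1, tn=90, fp=0, fn=9)
print(round(accuracy(**counts), 2))  # 0.91
print(round(mcc(**counts), 3))       # 0.302
```

The imbalanced case shows why MCC is a stricter summary than accuracy: a system that almost never flags the rare class still scores 0.91 accuracy but only about 0.30 MCC.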

Chinese Translation

在受监管的金融工作流程中，大型语言模型受到自然语言政策的治理，而同一模型又对这些政策进行解释，导致了委托-代理失效：输出可能看似合规，但实际上并不合规。现有的评估措施关注任务准确性，但并未考虑治理是否在决策理由层面约束行为——这一层面是受监管决策必须可审计的。我们引入了五个治理指标，以量化决策理由层面的政策合规性，并在一个合成银行领域中应用这些指标，以比较仅基于文本的治理与机械执行：四个在模型解释循环之外操作的原语。在仅基于文本的治理下，27%的延迟决策不包含任何与决策相关的信息。机械执行将这一比例降低了73%，使延迟信息内容增加了两倍多，并将任务准确性从MCC 0.43提高到0.88。这一改善源于架构分离：在机械执行下生成的理由显示出与仅基于文本的治理相当的CDL——增益来自于将明确的决策从模型的控制中移除。一项因果消融实验确认每个原语都是单独必要的。我们的核心发现是治理与任务的解耦：在结构压力下，仅基于文本的治理在两个维度上同时退化，而机械执行则在任务表现下降的情况下保持治理质量。这意味着治理和任务评估是不同的维度：准确性并不足以作为受监管AI系统中治理的代理。

cs.CL / 39 / 2605.14747

Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Video2GUI：为通用GUI代理预训练合成大规模交互轨迹

Xiong, Weimin, Gu, Shuhao, Ye, Bowen, Yue, Zihao, Li, Lei, Song, Feifan, Li, Sujian, Tian, Hao

Abstract

Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research of GUI agents.

Chinese Translation

近期多模态大语言模型的进展引发了对图形用户界面（GUI）代理的日益关注，但其泛化能力仍受到缺乏涵盖多样化真实世界应用的大规模训练数据的限制。现有数据集在很大程度上依赖于昂贵的人工标注，且通常局限于狭窄的领域。为了解决这一挑战，我们提出了Video2GUI，一个完全自动化的框架，能够直接从未标记的互联网视频中提取有根据的GUI交互轨迹。Video2GUI采用粗到细的过滤策略来识别高质量的GUI教程视频，并将其转换为结构化的代理轨迹。将该流程应用于5亿条视频元数据条目，我们构建了WildGUI，一个包含1200万条交互轨迹的大规模数据集，涵盖超过1500个应用程序和网站。在WildGUI上对Qwen2.5-VL和Mimo-VL进行预训练，在多个GUI定位和动作基准测试中均实现了5-20%的持续提升，达到了或超过了最先进的性能。我们将发布WildGUI数据集和Video2GUI流程，以支持未来的GUI代理研究。

cs.CL / 40 / 2605.14749

Non-linear Interventions on Large Language Models

对大型语言模型的非线性干预

Kim, Sangwoo

Abstract

Intervention is one of the most representative and widely used methods for understanding the internal representations of large language models (LLMs). However, existing intervention methods are confined to linear interventions grounded in the Linear Representation Hypothesis, leaving features encoded along non-linear manifolds beyond their reach. In this work, we introduce a general formulation of intervention that extends naturally to non-linearly represented features, together with a learning procedure that further enables intervention on implicit features lacking a direct output signature. We validate our framework on refusal bypass steering, where it steers the model more precisely than linear baselines by intervening on a non-linear feature governing refusal.

Chinese Translation

干预是理解大型语言模型（LLMs）内部表征的最具代表性和广泛使用的方法之一。然而，现有的干预方法仅限于基于线性表征假设的线性干预，无法触及沿非线性流形编码的特征。在本研究中，我们提出了一种干预的通用形式，能够自然扩展到非线性表征特征，并提出了一种学习过程，进一步使得对缺乏直接输出特征的隐式特征进行干预成为可能。我们在拒绝绕过引导（refusal bypass steering）任务上验证了我们的框架，通过对控制拒绝的非线性特征进行干预，使模型的引导精度超过了线性基线。

cs.CL / 41 / 2605.14766

Streaming Speech-to-Text Translation with a SpeechLLM

基于SpeechLLM的流式语音转文本翻译

Parcollet, Titouan, Zhang, Shucong, Zheng, Xianrui, van Dalen, Rogier C.

Abstract

Normally, a system that translates speech into text consists of separate modules for speech recognition and text-to-text translation. Combining those tasks into a SpeechLLM promises to exploit paralinguistic information in the speech and to reduce cascaded errors. But existing SpeechLLM systems are slow since they do not work in a real streaming fashion: they wait for a complete utterance of audio before outputting a translation, or output tokens at fixed intervals, which is not suitable for real applications. This work proposes an LLM-based architecture for real streaming speech-to-text translation. The LLM learns not just to emit output tokens, but also to decide whether it has seen enough audio to do so. The system is trained using automatic alignments of the input speech and the output text. In experiments on different language pairs, the system achieves a translation quality close to the non-streaming baseline, but with a latency of only 1-2 seconds.

Chinese Translation

通常，将语音翻译为文本的系统由语音识别和文本到文本翻译的独立模块组成。将这些任务结合为一个SpeechLLM有望利用语音中的副语言信息，并减少级联错误。然而，现有的SpeechLLM系统速度较慢，因为它们并未以真正的流式方式工作：它们在输出翻译之前需要等待完整的音频发声，或者以固定的时间间隔输出标记，这不适合实际应用。本研究提出了一种基于LLM的架构，用于实时流式语音转文本翻译。该LLM不仅学习输出标记，还能决定是否已经接收到足够的音频来进行输出。该系统使用输入语音和输出文本的自动对齐进行训练。在不同语言对的实验中，该系统实现了接近非流式基线的翻译质量，但延迟仅为1-2秒。

cs.CL / 42 / 2605.14790

Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation

研究图谱：作为研究创意生成监督的引用演变图

Gao, Songyang, Xia, Yinghui, Liu, Siyi, Xiong, Hui

Abstract

Research idea generation is the innovation-driving step of automated scientific research. Recently, large language models (LLMs) have shown potential for automating idea generation at scale. However, existing methods mainly elicit ideas from LLMs through static retrieval of relevant literature or complex prompt engineering, while discarding the structural relations among references. We propose Graphs of Research (GoR), a supervised fine-tuning method that extracts a 2-hop reference neighborhood for each seed paper, derives the relations among those references from citation position, frequency, predecessor links, and publication time, and organizes them into a paper-evolution directed acyclic graph (DAG). We construct an automated extraction pipeline that draws data from five major ML/NLP venues, comprising 498/50/50 train/validation/test seed papers and approximately 7,600 cited references. Qwen2.5-7B-Instruct-1M is fine-tuned on a structured-text prompt that includes the citation graph, edge signals, reference information, and task definition to predict the idea for the seed paper. Across head-to-head LLM-judge tournaments against gpt-4o-driven baselines, GoR-SFT achieves SOTA, demonstrating the effectiveness of citation-evolution graphs as a supervision signal for LLM-based idea generation. We hope that this lowers the barrier to using citation-evolution graphs as supervision, accelerating automated scientific innovation.
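The paper-evolution DAG described above can be illustrated with a small sketch. It uses only two of the signals the abstract lists (citation links and publication time) and ignores citation position and frequency; the function name, the rule of keeping only edges that point forward in time, and the four toy papers are assumptions, not the authors' pipeline.

```python
from graphlib import TopologicalSorter

def build_evolution_dag(years, citations):
    """Map each paper to the set of earlier papers it builds on.

    years: paper id -> publication year
    citations: iterable of (citing, cited) pairs
    Only edges from a strictly earlier paper to a later one are kept,
    which guarantees the result is acyclic.
    """
    dag = {paper: set() for paper in years}
    for citing, cited in citations:
        if years[cited] < years[citing]:
            dag[citing].add(cited)
    return dag

# Hypothetical reference neighborhood of one seed paper.
years = {"A": 2017, "B": 2019, "C": 2021, "seed": 2023}
citations = [("B", "A"), ("C", "A"), ("C", "B"), ("seed", "C"), ("seed", "B")]

dag = build_evolution_dag(years, citations)
# Predecessors come first, so the order reads as an evolution path to the seed.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

A topological order like this is one simple way to serialize the graph into the structured-text prompt the abstract mentions, with each reference appearing after the work it depends on.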

Chinese Translation

研究创意生成是自动化科学研究中的创新驱动步骤。近年来，大型语言模型（LLMs）在大规模自动化创意生成方面显示出了潜力。然而，现有方法主要依赖于通过静态检索相关文献或复杂的提示工程来引导LLMs进行创意生成，而未能考虑参考文献之间的结构关系。我们提出了研究图谱（Graphs of Research, GoR），这是一种监督微调方法，它为每篇种子论文提取2跳参考邻域，从引用位置、频率、前驱链接和出版时间中推导出这些参考文献之间的关系，并将其组织成论文演变有向无环图（DAG）。我们构建了一个自动提取管道，从五个主要的机器学习/自然语言处理（ML/NLP）会议中提取数据，包括498/50/50的训练/验证/测试种子论文和大约7600个被引用的参考文献。Qwen2.5-7B-Instruct-1M在一个包含引用图、边信号、参考信息和任务定义的结构化文本提示上进行了微调，以预测种子论文的创意。在与基于gpt-4o的基线进行的头对头LLM评审比赛中，GoR-SFT达到了最先进水平（SOTA），证明了引用演变图作为LLM基础创意生成的监督信号的有效性。我们希望这能降低引用演变图作为监督的门槛，加速自动化科学创新。

cs.CL / 43 / 2605.14816

Conversion of Lexicon-Grammar tables to LMF. Application to French

将词汇-语法表转换为LMF：以法语为例

Laporte, Eric, Tolone, Elsa, Constant, Mathieu

Abstract

We describe the first experiment of conversion of Lexicon-Grammar tables for French verbs into the Lexical Markup Framework (LMF) format. The Lexicon-Grammar of the French language is currently one of the major sources of lexical and syntactic information for French. Its conversion into an interoperable representation format according to the LMF standard makes it usable in different contexts, thus contributing to the standardization and interoperability of natural language processing dictionaries. We briefly introduce the Lexicon-Grammar and the derived dictionaries; we analyse the main difficulties faced during the conversion; and we describe the resulting resource.

Chinese Translation

我们描述了将法语动词的词汇-语法表转换为词汇标记框架（Lexical Markup Framework, LMF）格式的首次实验。法语的词汇-语法目前是法语词汇和句法信息的主要来源之一。将其转换为符合LMF标准的可互操作表示格式，使其能够在不同的上下文中使用，从而有助于自然语言处理词典的标准化和互操作性。我们简要介绍了词汇-语法及其衍生词典；分析了转换过程中面临的主要困难；并描述了所得到的资源。

cs.CL / 44 / 2605.14890

Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study

基础模型在乌克兰法律文本上的分词器效率与零样本性能：一项比较研究

Ovcharov, Volodymyr

Abstract

Foundation models tokenize Ukrainian legal text with vastly different efficiency, yet no systematic comparison exists for this domain. We benchmark seven models from five providers on 273 validated court decisions from Ukraine's state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks. Three findings emerge. (1) Tokenizer fertility varies 1.6x: Qwen3 models consume 60% more tokens than Llama-family models on identical input, directly increasing API cost. (2) NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3 (675B total, 41B active), a model with 5.6x more total parameters and 3.4x more active parameters per token, at one-third the API cost. (3) Few-shot prompting degrades performance by up to 26 percentage points; stratified and prompt-sensitivity ablations confirm this is intrinsic to Ukrainian-language demonstrations, not an artifact of example selection. For practitioners: tokenizer analysis should precede model selection, and zero-shot is a more reliable default than few-shot for morphologically rich languages.
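The link between fertility and cost in finding (1) is simple arithmetic, sketched below. Fertility is taken here as tokens per whitespace-separated word, which is a common definition but an assumption about this paper; the token counts and the price are hypothetical numbers chosen only to reproduce a 1.6x ratio.

```python
def fertility(n_tokens, n_words):
    """Average number of tokens a tokenizer produces per word."""
    return n_tokens / n_words

def relative_token_overhead(fertility_a, fertility_b):
    """Fraction of extra tokens tokenizer A needs compared with tokenizer B."""
    return fertility_a / fertility_b - 1.0

def input_cost(n_words, tokens_per_word, price_per_million_tokens):
    """Input cost for a text of n_words at a given per-token price."""
    return n_words * tokens_per_word * price_per_million_tokens / 1_000_000

# Hypothetical counts for the same 1,000-word court decision.
f_a = fertility(n_tokens=4000, n_words=1000)  # less efficient tokenizer
f_b = fertility(n_tokens=2500, n_words=1000)  # more efficient tokenizer

overhead = relative_token_overhead(f_a, f_b)
cost_ratio = input_cost(1000, f_a, 1.0) / input_cost(1000, f_b, 1.0)
print(f"{overhead:.0%} more tokens, {cost_ratio:.1f}x input cost at equal price")
```

At an equal per-token price, a 1.6x fertility ratio means 60% more tokens and therefore 1.6x the input cost, and it also consumes context window 1.6x faster for the same document.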

Chinese Translation

基础模型在分词乌克兰法律文本时效率差异显著，但该领域尚无系统比较。我们对来自五个提供者的七个模型进行了基准测试，使用273份来自乌克兰国家登记处（EDRSR）的验证法院裁决，测量分词器效率和在三个任务上的零样本性能。研究结果显示出三点发现。(1) 分词器效率差异达到1.6倍：Qwen3模型在相同输入下消耗的令牌比Llama系列模型多60%，直接增加了API成本。(2) NVIDIA Nemotron Super 3（120B）获得了最高的综合得分（83.1），其表现优于Mistral Large 3（675B总参数，41B活跃参数）——后者的总参数量是前者的5.6倍，每个令牌的活跃参数量是前者的3.4倍，但API成本仅为其三分之一。(3) 少样本提示的性能下降幅度可达26个百分点；分层和提示敏感性消融实验确认这一现象是乌克兰语言示例固有的，而非示例选择的伪影。对于从业者：分词器分析应优先于模型选择，对于形态丰富的语言，零样本比少样本更可靠。

cs.CL / 45 / 2605.14928

Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA

程序链：用于程序性问答的层次化视觉语言推理

Chen, Guanhua, Yao, Yutong, Sun, Shenghe, Gao, Ci-Jun, Liu, Shudong, Chao, Lidia S., Wan, Feng, Wong, Derek F.

Abstract

Recent advances in vision-language models (VLMs) have achieved impressive results on standard image-text tasks, yet their potential for visual procedure question answering (VP-QA) remains largely unexplored. VP-QA presents unique challenges where users query next-step actions by uploading images for intermediate states of complex procedures. To systematically evaluate VLMs on this practical task, we propose ProcedureVQA, a novel multimodal benchmark specifically designed for visual procedural reasoning. Through comprehensive analysis, we identify two critical limitations in current VLMs: inadequate cross-modal retrieval of structured procedures given visual states, and misalignment between image sequence granularity and textual step decomposition. To address these issues, we present Chain-of-Procedure (CoP), a hierarchical reasoning framework that first retrieves relevant instructions using visual cues, then performs step refinement through semantic decomposition, and finally generates the next step. Experiments across six VLMs demonstrate CoP's effectiveness, achieving up to 13% absolute improvement over standard baselines.

Chinese Translation

近年来，视觉语言模型（VLMs）的进展在标准图像-文本任务上取得了令人瞩目的成果，但它们在视觉程序问答（VP-QA）中的潜力仍然未得到充分探索。VP-QA 面临独特的挑战，用户通过上传复杂程序的中间状态图像来查询下一步操作。为了系统性地评估 VLMs 在这一实际任务上的表现，我们提出了 ProcedureVQA，这是一个专门为视觉程序推理设计的新型多模态基准。通过全面分析，我们识别出当前 VLMs 的两个关键局限性：在给定视觉状态时对结构化程序的跨模态检索不足，以及图像序列粒度与文本步骤分解之间的不一致。为了解决这些问题，我们提出了程序链（Chain-of-Procedure, CoP），这是一种层次化推理框架，首先利用视觉线索检索相关指令，然后通过语义分解进行步骤细化，最后生成下一步操作。在六个 VLMs 上的实验表明，CoP 的有效性，较标准基线实现了最高 13% 的绝对提升。

cs.CL / 46 / 2605.14978

Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

基于性能驱动的自适应窗口推测解码策略优化

Jiang, Jie, Sun, Xing

Abstract

Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence. PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36× across multiple model families and benchmarks under a unified decoding protocol. These results show that performance-driven window-level optimization is a practical approach to improving speculative decoding efficiency.
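The prefix sensitivity described above is easy to see in a toy version of greedy verification. The sketch assumes the simplest acceptance rule (a drafted token is accepted only if it equals the target model's token at that position); sampling-based acceptance and the paper's own reward definitions are not modeled, and the function names and token ids are illustrative.

```python
def accepted_prefix_length(draft_tokens, target_tokens):
    """Number of leading draft tokens that match the target model's tokens."""
    n = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # the first mismatch invalidates everything after it
        n += 1
    return n

def tokens_per_target_pass(draft_tokens, target_tokens):
    """Accepted draft tokens plus the one token the target model supplies itself
    (the correction at the mismatch, or a bonus token after a full match)."""
    return accepted_prefix_length(draft_tokens, target_tokens) + 1

# Toy window of 4 drafted tokens: position 2 is wrong, position 3 happens to match.
draft = [5, 9, 2, 7]
target = [5, 9, 4, 7]
print(accepted_prefix_length(draft, target))  # 2
print(tokens_per_target_pass(draft, target))  # 3
```

Token-level accuracy for this window is 3 out of 4, yet only 2 drafted tokens are usable because the match at position 3 sits behind a mismatch. That gap between token-level and window-level utility is the motivation the abstract gives for optimizing the drafter at the window level.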

Chinese Translation

推测解码通过让轻量级草稿模型提出候选标记的推测窗口，以便由更大目标模型进行并行验证，从而加速了大规模语言模型（LLM）的推理。在实际应用中，推测效率常常受到难以草拟的位置的瓶颈影响，在这些位置，早期的不匹配会截断接受的前缀并使其余的推测窗口失效。尽管推测效用本质上是窗口级且对前缀敏感的，但大多数基于学习的草拟器仍然使用标记级的监督目标进行优化。我们提出了PPOW（基于性能驱动的自适应窗口策略优化），这是一种强化学习框架，将草拟器的优化从标记级模仿转向窗口级优化。PPOW结合了成本感知加速奖励、基于分布的接近奖励和自适应发散感知窗口，这些方法优先考虑具有高置信度加权草稿-目标发散的信息窗口。在统一解码协议下，PPOW在多个模型系列和基准测试中实现了6.29-6.52的平均接受长度和3.39-4.36倍的加速。这些结果表明，基于性能驱动的窗口级优化是提高推测解码效率的实用方法。

cs.CL / 47 / 2605.15000

Quantifying and Mitigating Premature Closure in Frontier LLMs

量化与缓解前沿大型语言模型中的早期闭合现象

Handler, Rebecca, Bedi, Suhana, Shah, Nigam

Abstract

Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate commitment under uncertainty: providing an answer, recommendation, or clinical guidance when the safer response would be clarification, abstention, escalation, or refusal. We evaluated five frontier LLMs across structured and open-ended medical tasks. In MedQA (n = 500) and AfriMed-QA (n = 490) questions where the correct choice had been removed, models still selected an answer at high rates, with baseline false-action rates of 55-81% and 53-82%, respectively. In open-ended evaluation, models gave inappropriate answers on an average of 30% of 861 HealthBench questions and 78% of 191 physician-authored adversarial queries. Safety-oriented prompting reduced premature closure across models, but residual failure persisted, highlighting the need to evaluate whether medical LLMs know when not to answer.

Chinese Translation

早期闭合，或在信息不足时就做出结论，是导致诊断错误的一个公认因素，但在大型语言模型（LLMs）中仍然未得到充分研究。我们将LLM的早期闭合定义为在不确定性下的不当承诺：在更安全的反应应为澄清、回避、升级或拒绝时，提供答案、建议或临床指导。我们评估了五个前沿LLM在结构化和开放式医学任务中的表现。在MedQA（n = 500）和AfriMed-QA（n = 490）中，当正确选项被移除时，模型仍以高比例选择答案，基线错误行动率分别为55-81%和53-82%。在开放式评估中，模型在861个HealthBench问题中平均给出了30%的不当答案，在191个医生撰写的对抗性查询中则达到了78%。以安全为导向的提示减少了模型的早期闭合现象，但残余的失败仍然存在，突显出评估医学LLM是否知道何时不应回答的必要性。

cs.CL / 48 / 2605.15011

The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale

科学贡献图：大规模自动化文献基础技术路线图绘制

Jansen, Peter A.

Abstract

Scientific contributions rarely develop in isolation, but instead build upon prior discoveries. We formulate the task of automated technological roadmapping as extracting scientific contributions from scholarly articles and linking them to their prerequisites. We present the Scientific Contribution Graph, a large-scale AI/NLP-domain resource containing 2 million detailed scientific contributions extracted from 230k open-access papers and connected by 12.5 million prerequisite edges. We further introduce scientific prerequisite prediction, a scientific discovery task in which models predict which existing technologies can enable future discoveries, and show that contemporary models are rapidly improving on this task, reaching 0.48 MAP when evaluated using temporally filtered backtesting. We anticipate technological roadmapping resources such as this will support scientific impact assessment and automated scientific discovery.
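For readers unfamiliar with the reported metric, here is a minimal sketch of mean average precision (MAP) over ranked prerequisite predictions. It uses the standard textbook definition and assumes each ranked list contains no duplicates; the paper's backtesting protocol and data format are not reproduced, and the item names are hypothetical.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Iterable[str]) -> float:
    """Average of precision@k taken at each rank k where a relevant item appears."""
    relevant_set: Set[str] = set(relevant)
    if not relevant_set:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant_set:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_set)


def mean_average_precision(queries: List[Tuple[Sequence[str], Iterable[str]]]) -> float:
    """Mean of average precision over all (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Hypothetical predictions for two target contributions.
queries = [
    (["attention", "lstm", "word2vec"], {"attention", "word2vec"}),
    (["bert", "gpt"], {"gpt"}),
]
print(round(mean_average_precision(queries), 4))  # prints 0.6667
```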

Chinese Translation

科学贡献很少孤立发展，而是建立在先前发现的基础上。我们将自动化技术路线图绘制的任务表述为从学术文章中提取科学贡献并将其与先决条件相连接。我们提出了科学贡献图，这是一个大规模的人工智能/自然语言处理领域资源，包含从23万篇开放获取论文中提取的200万条详细科学贡献，并通过1250万个先决条件边相互连接。我们进一步引入科学先决条件预测，这是一项科学发现任务，其中模型预测哪些现有技术可以促进未来的发现，并展示了当代模型在该任务上的快速进步，在使用时间过滤的回测评估时达到了0.48的平均精度（MAP）。我们预期，像这样的技术路线图资源将支持科学影响评估和自动化科学发现。

cs.CL / 49 / 2605.15016

COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion

COTCAgent：通过概率链式思维完成进行预防性咨询

Deng, Zihan, Zhong, Xiaozhen, Xu, Chuanzhi

Abstract

As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at https://github.com/FrankDengAI/COTCAgent/.

Chinese Translation

随着大型语言模型在医疗领域的应用，智能临床决策支持迅速发展。纵向电子健康记录（EHR）为准确的临床诊断和分析提供了重要的时间证据。然而，当前的大型语言模型在纵向EHR推理方面存在关键缺陷。首先，由于缺乏细粒度的统计推理，当定量证据以文本方式隐含时，它们常常会虚构临床趋势和指标，从而偏向于诊断推断。其次，纵向EHR中的非均匀时间序列和稀缺标签阻碍了模型捕捉长期时间依赖性，限制了可靠的临床推理。为了解决上述局限性，本研究提出了概率链式思维完成代理（COTCAgent），这是一个用于纵向电子健康记录的分层推理框架。该框架由三个核心模块组成。时间统计适配器（Temporal-Statistics Adapter, TSA）将分析计划转换为可执行代码，以标准化趋势输出。链式思维完成（Chain-of-Thought Completion, COTC）层利用带权评分的症状-趋势-疾病知识库来评估疾病风险，而有界完成模块通过标准化询问和迭代评分约束获取结构化证据，以确保严格的推理。通过解耦统计计算、特征匹配和语言生成，该框架消除了对复杂多模态输入的依赖，并以较低的计算开销实现高效的纵向记录分析。实验结果表明，基于Baichuan-M2的COTCAgent在自建数据集上达到了90.47%的Top-1准确率，在HealthBench上达到了70.41%，超越了现有的医疗代理和主流大型语言模型。代码可在https://github.com/FrankDengAI/COTCAgent/获取。

cs.CL / 50 / 2605.15019

From Scenes to Elements: Multi-Granularity Evidence Retrieval for Verifiable Multimodal RAG

从场景到元素：可验证的多模态检索增强生成的多粒度证据检索

Chen, Guanhua, Huang, Chuyue, Yao, Yutong, Liu, Shudong, Song, Xueqing, Chao, Lidia S., Wong, Derek F.

Abstract

Multimodal Retrieval-Augmented Generation (RAG) systems retrieve evidence at coarse granularities (entire images or scenes), creating a mismatch with fine-grained user queries and making failures unverifiable. We introduce GranuVistaVQA, a multimodal benchmark featuring real-world landmarks with element-level annotations across multiple viewpoints, capturing the partial observation challenge where individual images contain only subsets of entities. We further propose GranuRAG, a multi-granularity framework that treats visual elements as first-class retrieval units through three stages: element-level detection and classification, multi-granularity cross-modal alignment for evidence retrieval, and attribution-constrained generation. By grounding retrieval at the element level rather than relying on implicit attention, our approach enables transparent error diagnosis. Experiments demonstrate that GranuRAG achieves up to 29.2% improvement over six strong baselines for this task.

Chinese Translation

多模态检索增强生成（RAG）系统在粗粒度（整个图像或场景）上检索证据，这与细粒度用户查询之间存在不匹配，导致失败无法验证。我们引入了GranuVistaVQA，这是一个多模态基准，包含具有元素级注释的真实世界地标，涵盖多个视角，捕捉到个别图像仅包含实体子集的部分观察挑战。我们进一步提出了GranuRAG，这是一个多粒度框架，通过三个阶段将视觉元素视为一类重要的检索单元：元素级检测和分类、多粒度跨模态对齐以进行证据检索，以及受限于归因的生成。通过在元素级别上进行检索，而不是依赖隐式注意力，我们的方法能够实现透明的错误诊断。实验表明，GranuRAG在该任务上相较于六个强基线实现了最高29.2%的提升。

cs.CL / 51 / 2605.15034

AI Knows When It's Being Watched: Functional Strategic Action and Contextual Register Modulation in Large Language Models

人工智能知道何时被观察：大型语言模型中的功能性战略行动与情境注册调节

Covas, Vinicius, Toledo, Jorge Alberto Hidalgo

Abstract

Large language models (LLMs) have been extensively studied from computational and cognitive perspectives, yet their behavior as communicative actors in socially structured contexts remains underexplored. This study examines whether LLM-based multi-agent systems exhibit systematic linguistic adaptation in response to perceived social observation contexts -- a question with direct implications for AI governance and auditing. Drawing on Habermas's (1981) Theory of Communicative Action, Goffman's (1959) dramaturgical model, Bell's (1984) Audience Design framework, and the Hawthorne Effect, we report a controlled experiment involving 100 multi-agent debate sessions across five conditions (n = 20 each). Conditions varied the framing of social observation -- from explicit monitoring by university researchers, to negation of monitoring, to an observer-substitution condition replacing human researchers with an automated AI auditing system. Monitored conditions (Δ = +24.9%, Δ = +24.2%) and the automated AI monitoring condition (Δ = +22.2%) produce a higher change in type-token ratio (TTR) than audience-framing conditions (Δ = +17.7%), F(4, 94) = 2.79, p = .031. Message length shows a fully dissociated effect, F(4, 95) = 19.55, p < .001. A fifth condition -- replacing human with AI observers -- yields intermediate TTR adaptation, suggesting LLM behavior is sensitive to observer identity: human evaluation elicits stronger register formalization than automated AI surveillance. We discuss implications for AI governance, algorithmic auditing, and the repositioning of LLMs as contextually sensitive communicative actors.
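TTR is commonly read as type-token ratio, the number of distinct words divided by the total number of words. The sketch below assumes that reading, whitespace tokenization, and that the reported deltas are percent changes relative to a baseline; the study's actual tokenizer and baseline definition are not stated in the abstract, and the sample sentences are hypothetical.

```python
def type_token_ratio(text: str) -> float:
    """Distinct lowercase whitespace-separated tokens divided by total tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def percent_change(baseline: float, observed: float) -> float:
    """Relative change from baseline, in percent (assumed meaning of the deltas)."""
    return (observed - baseline) / baseline * 100.0


# Hypothetical messages from an unobserved and an observed session.
casual = "the plan is the plan and the plan works"
formal = "the proposed strategy addresses every stated objective"
print(round(type_token_ratio(casual), 3))  # prints 0.556
print(round(type_token_ratio(formal), 3))  # prints 1.0
```

A higher TTR means less lexical repetition, which is how the abstract connects it to register formalization.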

Chinese Translation

大型语言模型（LLMs）从计算和认知的角度得到了广泛研究，但它们作为社会结构化背景下的交际行为者的表现仍然未得到充分探讨。本研究考察了基于LLM的多智能体系统是否在感知到社会观察背景时表现出系统性的语言适应——这一问题对人工智能治理和审计具有直接的影响。我们借鉴了哈贝马斯（Habermas, 1981）的交际行动理论、戈夫曼（Goffman, 1959）的戏剧模型、贝尔（Bell, 1984）的观众设计框架以及霍桑效应，报告了一项涉及100个多智能体辩论会话的受控实验，分为五种条件（每种条件n = 20）。条件的变化在于社会观察的框架——从大学研究人员的明确监控，到对监控的否定，再到用自动化人工智能审计系统替代人类研究人员的观察者替代条件。监控条件（Delta+24.9%，Delta+24.2%）和自动化人工智能监控条件（Delta+22.2%）产生的TTR变化高于观众框架条件（Delta+17.7%），F(4, 94) = 2.79, p = .031。信息长度显示出完全分离的效应，F(4, 95) = 19.55, p < .001。第五种条件——用人工智能观察者替代人类观察者——产生了中等的TTR适应，表明LLM的行为对观察者身份敏感：人类评估引发的注册正式化程度高于自动化人工智能监控。我们讨论了对人工智能治理、算法审计以及将LLM重新定位为情境敏感的交际行为者的影响。

cs.CL / 52 / 2605.15077

Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs

无模型变更的并发：基于未来的异步函数调用用于大型语言模型（LLMs）

Feng, Guangyu, Mao, Huanzhi, Dutta, Prabal, Gonzalez, Joseph E.

Abstract

Function calling, also known as tool use, is a core capability of modern LLM agents but is typically constrained by synchronous execution semantics. Under these semantics, LLM decoding is blocked until each function call completes, which increases end-to-end latency. In this work, we introduce AsyncFC, a pure execution-layer framework that decouples LLM decoding from function execution, enabling overlap between model decoding and function execution as well as inter-function parallelism when dependencies permit. AsyncFC layers over existing models and unmodified function implementations, requiring no fine-tuning or changes to the standard synchronous function-calling protocol. Across standard function-calling benchmarks and adapted software engineering benchmarks, AsyncFC significantly reduces end-to-end task completion time while preserving task accuracy. Furthermore, these results reveal that LLMs possess a native capability to reason over symbolic futures that represent unresolved execution results, enabling an asynchronous paradigm for model-tool interaction.
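The general idea of overlapping decoding with tool execution can be sketched with Python's `asyncio`. This is only an illustration of future-based concurrency, not AsyncFC's implementation: `slow_tool` and the sleep standing in for decoding are hypothetical, and the delays are arbitrary.

```python
import asyncio


async def slow_tool(delay: float, result: str) -> str:
    """Hypothetical tool call that takes `delay` seconds to finish."""
    await asyncio.sleep(delay)
    return result


async def run_agent():
    loop = asyncio.get_running_loop()
    start = loop.time()
    # Launch both independent calls at once; each task acts as a future
    # that stands in for a result that is not available yet.
    weather = asyncio.create_task(slow_tool(0.2, "sunny"))
    traffic = asyncio.create_task(slow_tool(0.2, "light"))
    # Stand-in for model decoding that proceeds while the tools run.
    await asyncio.sleep(0.1)
    # Resolve the futures only when their values are actually needed.
    results = await asyncio.gather(weather, traffic)
    return results, loop.time() - start


results, elapsed = asyncio.run(run_agent())
print(results)
```

Run sequentially, the two calls plus the decoding stand-in would take about 0.5 seconds; with overlap the total is close to the longest single step, roughly 0.2 seconds.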

Chinese Translation

函数调用，也称为工具使用，是现代大型语言模型（LLM）代理的核心能力，但通常受到同步执行语义的限制。在这些语义下，LLM 解码在每个函数调用完成之前被阻塞，导致端到端延迟增加。在本研究中，我们介绍了 AsyncFC，这是一种纯执行层框架，它将 LLM 解码与函数执行解耦，允许在模型解码与函数执行之间重叠，并在依赖关系允许的情况下实现函数间并行。AsyncFC 层叠在现有模型和未修改的函数实现之上，无需微调或更改标准的同步函数调用协议。在标准函数调用基准测试和适应性软件工程基准测试中，AsyncFC 显著减少了端到端任务完成时间，同时保持任务准确性。此外，这些结果揭示了 LLM 具备对表示未解决执行结果的符号未来进行推理的固有能力，从而为模型与工具的交互提供了异步范式。

cs.CL / 53 / 2605.15081

ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World

ML-Embed：为多语言世界提供包容性和高效性的嵌入

Zhang, Ziyin, Liao, Zihan, Yu, Hang, Di, Peng, Wang, Rui

Abstract

The development of high-quality text embeddings is increasingly drifting toward an exclusionary future, defined by three critical barriers: prohibitive computational costs, a narrow linguistic focus that neglects most of the world's languages, and a lack of transparency from closed-source or open-weight models that stifles research. To dismantle these barriers, we introduce ML-Embed, a suite of inclusive and efficient models built upon a new framework: 3-Dimensional Matryoshka Learning (3D-ML). Our framework addresses the computational challenge with comprehensive efficiency across the entire model lifecycle. Beyond the storage benefits of Matryoshka Representation Learning (MRL) and flexible inference-time depth provided by Matryoshka Layer Learning (MLL), we introduce Matryoshka Embedding Learning (MEL) for enhanced parameter efficiency. To address the linguistic challenge, we curate a massively multilingual dataset and train a suite of models ranging from 140M to 8B parameters. In a direct commitment to transparency, we release all models, data, and code. Extensive evaluation on 430 tasks demonstrates that our models set new records on 9 of 17 evaluated MTEB benchmarks, with particularly strong results in low-resource languages, providing a reproducible blueprint for building globally equitable and computationally efficient AI systems.
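The storage benefit attributed to Matryoshka Representation Learning comes from using only a prefix of each embedding. A minimal sketch of that inference-time step follows, assuming the usual recipe of truncating and then re-normalizing; the vector values are hypothetical, and the MLL and MEL components are not shown.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(embedding: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` coordinates and rescale them to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    prefix = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # an all-zero prefix cannot be normalized
    return [x / norm for x in prefix]


# Hypothetical 3-dimensional embedding reduced to 2 dimensions.
full = [3.0, 4.0, 12.0]
print(truncate_and_normalize(full, 2))  # prints [0.6, 0.8]
```

Training with nested prefix losses is what makes such truncated vectors useful; truncating an ordinary embedding this way generally loses much more quality.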

Chinese Translation

高质量文本嵌入的发展正日益朝着一个排他性的未来发展，这一未来由三个关键障碍定义：高昂的计算成本、狭窄的语言焦点忽视了世界上大多数语言，以及来自封闭源或开放权重模型的缺乏透明度，这抑制了研究。为了拆除这些障碍，我们提出了ML-Embed，这是一个基于新框架3-Dimensional Matryoshka Learning (3D-ML)构建的包容性和高效性模型套件。我们的框架通过在整个模型生命周期内的全面效率来解决计算挑战。除了Matryoshka Representation Learning (MRL)所带来的存储优势和Matryoshka Layer Learning (MLL)提供的灵活推理深度外，我们还引入了Matryoshka Embedding Learning (MEL)以增强参数效率。为了解决语言挑战，我们策划了一个大规模多语言数据集，并训练了一系列参数从1.4亿到80亿的模型。在对透明度的直接承诺下，我们发布了所有模型、数据和代码。对430个任务的广泛评估表明，我们的模型在17个评估的MTEB基准中创造了9项新纪录，尤其在低资源语言中表现出色，为构建全球公平和计算高效的人工智能系统提供了可重复的蓝图。

cs.CL / 54 / 2605.15102

Improving Multi-turn Dialogue Consistency with Self-Recall Thinking

通过自我回忆思维提升多轮对话一致性

Pang, Renning, Lan, Tian, Liu, Leyuan, Huang, Xiaoming, Tong, Piao, Zhang, Xiaosong

Abstract

Large language model (LLM) based multi-turn dialogue systems often struggle to track dependencies across non-adjacent turns, undermining both consistency and scalability. As conversations lengthen, essential information becomes sparse and is buried in irrelevant context, while processing the entire dialogue history incurs severe efficiency bottlenecks. Existing solutions either rely on high-latency external memory or lose fine-grained details through iterative summarization. In this paper, we propose Self-Recall Thinking (SRT), a framework designed to address long-range contextual dependency and sparse informative signals in multi-turn dialogue. SRT identifies helpful historical turns and uses them to generate contextually appropriate responses, enabling the model to selectively recall and reason over context during inference. This yields an endogenous reasoning process that integrates interpretable recall steps without external modules. SRT incorporates three stages: (1) Dependency Construction: generating dependency data and converting it into self-recall chains; (2) Capability Initialization: training the model to produce reasoning chains with recall tokens; (3) Reasoning Improvement: refining accuracy via verifiable rewards that optimize recall and reasoning toward correct answers. Experiments on multiple datasets demonstrate that SRT improves F1 score by 4.7% and reduces end-to-end latency by 14.7% over prior methods, achieving a balance between reasoning latency and accuracy, and outperforming state-of-the-art baselines.

Chinese Translation

基于大型语言模型（LLM）的多轮对话系统在跟踪非相邻轮次之间的依赖关系时常常面临挑战，这削弱了一致性和可扩展性。随着对话的延长，关键信息变得稀疏并埋藏在无关的上下文中，而处理整个对话历史则会导致严重的效率瓶颈。现有解决方案要么依赖于高延迟的外部记忆，要么通过迭代摘要丢失细粒度的细节。本文提出了自我回忆思维（Self-Recall Thinking, SRT），这是一个旨在解决多轮对话中的长距离上下文依赖和稀疏信息信号的框架。SRT识别有助于生成上下文适当响应的历史轮次，使模型在推理过程中能够选择性地回忆和推理上下文。该过程产生了一种内生推理过程，整合了可解释的回忆步骤而无需外部模块。SRT包括：（1）依赖构建：生成并转换为自我回忆链；（2）能力初始化：训练以使推理链具备回忆标记能力；（3）推理改进：通过可验证的奖励来提高准确性，以优化回忆和推理以获得正确答案。在多个数据集上的实验表明，SRT相较于先前的方法提高了4.7%的F1分数，并减少了14.7%的端到端延迟，实现了推理延迟和准确性之间的平衡，超越了最先进的基准。

cs.CL / 55 / 2605.15104

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

从文本到语音：一个可重复和可验证的框架用于评估工具调用大型语言模型代理

Laskar, Md Tahmid Rahman, Fu, Xue-Yong, Sarfjoo, Seyyed Saeed, McNamara, Quinten, Robertson, Jonas, TN, Shashi Bhushan

Abstract

Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.

Chinese Translation

语音代理越来越需要从语音中可靠地使用工具，而现有的主要工具调用基准仍然基于文本。我们研究了是否可以将经过验证的文本基准转换为受控的基于音频的工具调用评估，而无需重新注释工具模式和金标准标签。我们的数据集无关框架利用文本转语音、说话者变异和环境噪声创建配对的文本-音频实例，同时保留原始数据集注释。基于对7个全模态模型在音频转换版本的Confetti和When2Call上的广泛评估，我们的框架表明，性能强烈依赖于模型和任务：Gemini-3.1-Flash-Live在Confetti上获得了最高分（70.4），而GPT-Realtime-1.5在When2Call上表现最佳（71.9）。在Confetti上，文本到语音的差距从Qwen3-Omni的1.8分到GPT-Realtime-1.5的4.8分不等。对失败案例的针对性分析表明，性能下降最常反映出对语音中参数值的误解。考虑到现实世界的部署场景，我们进一步报告了仅基于文本的结果、基于模糊性的重构压力测试，以及经过人类偏好验证的无参考大型语言模型作为评判者的协议。值得注意的是，我们发现开源的Qwen3（至少8B参数）与专有评判者的协议超过80%，支持隐私保护评估。总体而言，我们的框架提供了一个可验证和可重复的第一阶段诊断，补充了专门构建的音频语料库。

cs.CL / 56 / 2605.15156

MeMo: Memory as a Model

MeMo：作为模型的记忆

Quek, Ryan Wei Heng, Lee, Sanghyuk, Leong, Alfred Wei Lun, Verma, Arun, Prakash, Alok, Chen, Nancy F., Low, Bryan Kian Hsiang, Rus, Daniela, Solar-Lezama, Armando

Abstract

Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.

Chinese Translation

大型语言模型（LLMs）在广泛任务中表现出色，但在预训练后保持不变，直到后续更新。许多现实世界的应用需要及时的、特定领域的信息，这促使我们需要高效的机制来融入新知识。在本文中，我们介绍了MeMo（Memory as a Model），一个模块化框架，它将新知识编码到专用的记忆模型中，同时保持LLM参数不变。与现有方法相比，MeMo提供了几个优势：（a）它捕捉复杂的跨文档关系，（b）它对检索噪声具有鲁棒性，（c）它避免了LLM中的灾难性遗忘，（d）它不需要访问LLM的权重或输出logits，从而实现与开放和专有闭源LLM的即插即用集成，以及（e）其检索成本在推理时与语料库大小无关。我们在三个基准测试（BrowseComp-Plus、NarrativeQA和MuSiQue）上的实验结果表明，MeMo在多种设置下相比现有方法表现出色。

cs.CL / 57 / 2605.15168

Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

文本知道什么，表格知道何时：通过检索增强的多模态对齐进行临床时间线重建

Kumar, Sayantan, Noroozizadeh, Shahriar, Kim, Juyong, Weiss, Jeremy C.

Abstract

Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient's course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.

Chinese Translation

重建精确的临床时间线对于建模患者轨迹和预测复杂异质条件下的风险（如脓毒症）至关重要。尽管非结构化的临床叙述提供了患者病程的语义丰富和上下文完整的描述，但它们往往缺乏时间精确性，并且包含模糊的事件时序。相反，结构化的电子健康记录（EHR）数据提供了精确的时间锚点，但缺失了大量临床上有意义的事件。我们提出了一种检索增强的多模态对齐框架，以弥补这一差距，提高从文本中提取的绝对临床时间线的时间精确性。我们的方法将时间线重建公式化为一个基于图的多步骤过程：首先从叙述中提取中心锚事件以构建初始时间框架，然后相对于该框架放置非中心事件，最后使用检索到的结构化EHR行作为外部时间证据来校准时间线。在使用经过指令调优的大型语言模型对跨越MIMIC-III和MIMIC-IV的i2m4基准进行评估时，我们的多模态管道在绝对时间戳准确性（AULTC）上持续改善，并在几乎所有评估模型中提高了时间一致性，相较于单模态文本重建，且没有妨碍事件匹配率。此外，我们的实证差距分析显示，34.8%的文本衍生事件在表格记录中完全缺失，证明对齐这些模态可以比单独来源产生更具时间真实性和临床信息性的患者轨迹重建。

cs.CL / 58 / 2605.15184

Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

Grep 是你所需的一切吗？代理工具如何重塑代理搜索

Sen, Sahil, Kasturi, Akhil, Lumer, Elias, Gulati, Anmol, Subbiah, Vamse Kumar

Abstract

Recent advances in Large Language Model (LLM) agents have enabled complex agentic workflows where models autonomously retrieve information, call tools, and reason over large corpora to complete tasks on behalf of users. Despite the growing adoption of retrieval-augmented generation (RAG) in agentic search systems, existing literature lacks a systematic comparison of how retrieval strategy choice interacts with agent architecture and tool-calling paradigm. Important practical dimensions, including how tool outputs are presented to the model and how performance changes when searches must cope with more irrelevant surrounding text, remain under-explored in agent loops. This paper reports an empirical study organized into two experiments. Experiment 1 compares grep and vector retrieval on a 116-question sample from LongMemEval, using a custom agent harness (Chronos) and provider-native CLI harnesses (Claude Code, Codex, and Gemini CLI), for both inline tool results and file-based tool results that the model reads separately. Experiment 2 compares grep-only and vector-only retrieval while progressively mixing in additional unrelated conversation history, so that each query is embedded in more distracting material alongside the passages that matter. In Experiment 1, across Chronos and the provider CLIs, grep generally yields higher accuracy than vector retrieval; at the same time, overall scores still depend strongly on which harness and tool-calling style is used, even when the underlying conversation data are the same.
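The two retrieval styles being compared can be contrasted in a small sketch. Here grep-style search is a regular-expression match, and vector search is approximated with bag-of-words cosine similarity, which is a deliberate simplification: real systems use learned embeddings, and none of this reflects the Chronos harness or the study's setup. The documents and queries are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def grep_search(pattern: str, docs: List[str]) -> List[int]:
    """Return indices of documents matching the regex (case-insensitive)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [i for i, doc in enumerate(docs) if regex.search(doc)]


def bag_of_words(text: str) -> Counter:
    """Word counts over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vector_search(query: str, docs: List[str], k: int = 1) -> List[int]:
    """Return the top-k document indices by cosine similarity to the query."""
    q = bag_of_words(query)
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q, bag_of_words(docs[i])), reverse=True)
    return ranked[:k]


docs = [
    "We booked the flight to Lisbon for March.",
    "The dentist appointment moved to Friday.",
    "Remember to renew the passport before the trip.",
]
print(grep_search(r"passport", docs))                        # prints [2]
print(vector_search("when is the dentist appointment", docs))  # prints [1]
```

Exact matching needs the right keyword but returns precise hits, while similarity ranking tolerates paraphrase at the cost of always returning something, which hints at why the surrounding harness and distractor volume matter.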

Chinese Translation

近期大型语言模型（LLM）代理的进展使得复杂的代理工作流程成为可能，这些模型能够自主检索信息、调用工具，并在大量语料上进行推理，以代表用户完成任务。尽管在代理搜索系统中检索增强生成（RAG）的应用日益增加，但现有文献缺乏对检索策略选择如何与代理架构和工具调用范式相互作用的系统比较。重要的实际维度，包括工具输出如何呈现给模型，以及当搜索必须应对更多无关的周围文本时性能如何变化，在代理循环中仍然未得到充分探讨。本文报告了一项实证研究，分为两个实验。实验1比较了在LongMemEval的116个问题样本上使用自定义代理工具（Chronos）和提供商原生命令行工具（Claude Code、Codex和Gemini CLI）进行的grep和向量检索，分别针对内联工具结果和模型单独读取的基于文件的工具结果。实验2比较了仅使用grep和仅使用向量检索，同时逐步混入额外的无关对话历史，使得每个查询嵌入更多的干扰材料中，连同重要的段落。在Chronos和提供商CLI中，实验1的比较结果表明，grep通常比向量检索产生更高的准确性；同时，整体得分仍然强烈依赖于使用的工具和工具调用风格，即使基础对话数据相同。

arXiv Papers

Towards Robotic Dexterous Hand Intelligence: A Survey

Ergodic Imitation for Adaptive Exploration around Demonstrations

Behavior Cloning for Active Perception with Low-Resolution Egocentric Vision

Safety-Constrained Reinforcement Learning with Post-Training Reachability Verification for Robot Navigation

Motion Planning for Autonomous Vehicles using Optimization over Graphs of Convex Sets

MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

Reactive Planning based Control for Mobile Robots in Obstacle-Cluttered Environments

Distill: Uncovering the True Intent behind Human-Robot Communication

Energy-Efficient Quadruped Locomotion with Compliant Feet

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

Let Robots Feel Your Touch: Visuo-Tactile Cortical Alignment for Embodied Mirror Resonance

DSSP: Diffusion State Space Policy with Full-History Encoding

SeaVis: Modeling and Control of a Remotely Operated Towed Vehicle for Seabed Visualization and Mapping

SR-Platform: An Agentic Pipeline for Natural Language-Driven Robot Simulation Environment Synthesis

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Exploring Bottlenecks in VLM-LLM Navigation: How 3D Scene Understanding Capability Impacts Zero-Shot VLN

Learning Cross-Coupled and Regime Dependent Dynamics for Aerial Manipulation

CaMeRL: Collision-Aware and Memory-Enhanced Reinforcement Learning for UAV Navigation in Multi-Scale Obstacle Environments

Learning Direct Control Policies with Flow Matching for Autonomous Driving

Chrono-Gymnasium: An Open-Source, Gymnasium-Compatible Distributed Simulation Framework

FU-MPC: Frontier- and Uncertainty-Aware Model Predictive Control for Efficient and Accurate UAV Exploration with Motorized LiDAR

Behavioral Data-Driven Optimal Trajectory Generation for Rotary Cranes

A Prototyping Framework for Distributed Control of Multi-Robot Systems

SOCC-ICP: Semantics-Assisted Odometry based on Occupancy Grids and ICP

CLOVER: Closed-Loop Value Estimation & Ranking for End-to-End Autonomous Driving Planning

CoCo-InEKF: State Estimation with Learned Contact Covariances in Dynamic, Contact-Rich Scenarios

Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction

Contrastive Multi-Modal Hypergraph Reasoning for 3D Crowd Mesh Recovery

Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers

CineMesh4D: Personalized 4D Whole Heart Reconstruction from Sparse Cine MRI

Unified Pix Token And Word Token Generative Language Model

PVRF: All-in-one Adverse Weather Removal via Prior-modulated and Velocity-constrained Rectified Flow

Evolving Layer-Specific Scalar Functions for Hardware-Aware Transformer Adaptation

CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

Venus-DeFakerOne: Unified Fake Image Detection & Localization

DUET: Dual-Paradigm Adaptive Expert Triage with Single-cell Inductive Prior for Spatial Transcriptomics Prediction

Bridging the Rural Healthcare Gap: A Cascaded Edge-Cloud Architecture for Automated Retinal Screening

SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

Rethinking the Good Enough Embedding for Easy Few-Shot Learning

You Only Landmark Once: Lightweight U-Net Face Super Resolution with YOLO-World Landmark Heatmaps

CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers

Automatic Landmark-Based Segmentation of Human Subcortical Structures in MRI

Implicit Spatial-Frequency Fusion of Hyperspectral and LiDAR Data via Kolmogorov-Arnold Networks

Generative Deep Learning for Computational Destaining and Restaining of Unregistered Digital Pathology Images

Towards Real-Time Autonomous Navigation: Transformer-Based Catheter Tip Tracking in Fluoroscopy

Image Restoration via Diffusion Models with Dynamic Resolution

PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition

CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention

D2-CDIG: Controlled Diffusion Remote Sensing Image Generation with Dual Priors of DEM and Cloud-Fog

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

IG-Diff: Complex Night Scene Restoration with Illumination-Guided Diffusion Model

AnyBand-Diff: A Unified Remote Sensing Image Generation and Band Repair Framework with Spectral Priors

Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation

Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

Dual-Latent Collaborative Decoding for Fidelity-Perception Balanced Image Compression

Analogical Trajectory Transfer

Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion

SceneForge: Structured World Supervision from 3D Interventions

DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making

Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

ClickRemoval: An Interactive Open-Source Tool for Object Removal in Diffusion Models

Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos

GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection

Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

ArcGate: Adaptive Arctangent Gated Activation

From Sparse to Dense: Spatio-Temporal Fusion for Multi-View 3D Human Pose Estimation with DenseWarper

Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media