← Back to Index
Daily Research Digest

arXiv Papers

2026-04-13
234
Papers
4
Categories
234
Translated
收藏清单 0
机器人学 (Robotics)
24
cs.RO / 1 / 2604.08636

LEGO: Latent-space Exploration for Geometry-aware Optimization of Humanoid Kinematic Design

LEGO:面向几何感知的类人运动学设计的潜空间探索优化
Yoon, Jihwan, Jeong, Taemoon, Park, Jeongeun, Kim, Chanwoo, Kwon, Jaewoon, Lee, Yonghyeon, Lee, Kyungjae, Choi, Sungjoon
Abstract
Designing robot morphologies and kinematics has traditionally relied on human intuition, with little systematic foundation. Motion-design co-optimization offers a promising path toward automation, but two major challenges remain: (i) the vast, unstructured design space and (ii) the difficulty of constructing task-specific loss functions. We propose a new paradigm that minimizes human involvement by (i) learning the design search space from existing mechanical designs, rather than hand-crafting it, and (ii) defining the loss directly from human motion data via motion retargeting and Procrustes analysis. Using screw-theory-based joint axis representation and isometric manifold learning, we construct a compact, geometry-preserving latent space of humanoid upper body designs in which optimization is tractable. We then solve design optimization in this latent space using gradient-free optimization. Our approach establishes a principled framework for data-driven robot design and demonstrates that leveraging existing designs and human motion can effectively guide the automated discovery of novel robot design.
Chinese Translation
机器人形态和运动学设计传统上依赖于人类直觉,缺乏系统性的理论基础。运动设计的协同优化为自动化提供了有前景的路径,但仍面临两大挑战:(i)庞大且无结构的设计空间;(ii)构建特定任务损失函数的困难。我们提出了一种新范式,通过(i)从现有机械设计中学习设计搜索空间,而非手工构建;(ii)通过运动重定向和Procrustes分析,直接从人体运动数据定义损失,从而最大限度减少人为干预。利用基于螺旋理论的关节轴表示和等距流形学习,我们构建了一个紧凑且几何保持的类人上半身设计潜空间,使得优化变得可行。随后,我们在该潜空间中采用无梯度优化方法解决设计优化问题。该方法建立了一个基于数据驱动的机器人设计的原则性框架,并证明了利用现有设计和人体运动数据能够有效指导自动化发现新型机器人设计。
cs.RO / 2 / 2604.08664

Generative Simulation for Policy Learning in Physical Human-Robot Interaction

用于物理人机交互策略学习的生成式仿真
Wang, Junxiang, Xu, Xinwen, Wu, Tiancheng, Millan, Julian, Pechuk, Nir, Erickson, Zackory
Abstract
Developing autonomous physical human-robot interaction (pHRI) systems is limited by the scarcity of large-scale training data to learn robust robot behaviors for real-world applications. In this paper, we introduce a zero-shot "text2sim2real" generative simulation framework that automatically synthesizes diverse pHRI scenarios from high-level natural-language prompts. Leveraging Large Language Models (LLMs) and Vision-Language Models (VLMs), our pipeline procedurally generates soft-body human models, scene layouts, and robot motion trajectories for assistive tasks. We utilize this framework to autonomously collect large-scale synthetic demonstration datasets and then train vision-based imitation learning policies operating on segmented point clouds. We evaluate our approach through a user study on two physically assistive tasks: scratching and bathing. Our learned policies successfully achieve zero-shot sim-to-real transfer, attaining success rates exceeding 80% and demonstrating resilience to unscripted human motion. Overall, we introduce the first generative simulation pipeline for pHRI applications, automating simulation environment synthesis, data collection, and policy learning. Additional information may be found on our project website: https://rchi-lab.github.io/gen_phri/
Chinese Translation
自主物理人机交互(pHRI)系统的发展受到大规模训练数据匮乏的限制,难以学习适用于实际应用的鲁棒机器人行为。本文提出了一种零样本“text2sim2real”生成式仿真框架,能够从高级自然语言提示自动合成多样化的pHRI场景。该框架利用大型语言模型(LLMs)和视觉语言模型(VLMs),程序化生成软体人类模型、场景布局及辅助任务的机器人运动轨迹。我们利用该框架自主收集大规模合成示范数据集,并基于分割点云训练基于视觉的模仿学习策略。通过对两项物理辅助任务——抓痒和洗浴——的用户研究评估,所学策略成功实现了零样本的仿真到现实迁移,成功率超过80%,并表现出对非脚本化人类动作的鲁棒性。总体而言,我们首次引入了面向pHRI应用的生成式仿真流水线,实现了仿真环境合成、数据收集及策略学习的自动化。更多信息请访问项目网站:https://rchi-lab.github.io/gen_phri/
cs.RO / 3 / 2604.08726

Task-Aware Bimanual Affordance Prediction via VLM-Guided Semantic-Geometric Reasoning

基于VLM引导的语义-几何推理的任务感知双手可供性预测
Hahne, Fabian, Prasad, Vignesh, Chalvatzaki, Georgia, Peters, Jan, Kshirsagar, Alap
Abstract
Bimanual manipulation requires reasoning about where to interact with an object and which arm should perform each action, a joint affordance localization and arm allocation problem that geometry-only planners cannot resolve without semantic understanding of task intent. Existing approaches either treat affordance prediction as coarse part segmentation or rely on geometric heuristics for arm assignment, failing to jointly reason about task-relevant contact regions and arm allocation. We reframe bimanual manipulation as a joint affordance localization and arm allocation problem and propose a hierarchical framework for task-aware bimanual affordance prediction that leverages a Vision-Language Model (VLM) to generalize across object categories and task descriptions without requiring category-specific training. Our approach fuses multi-view RGB-D observations into a consistent 3D scene representation and generates global 6-DoF grasp candidates, which are then spatially and semantically filtered by querying the VLM for task-relevant affordance regions on each object, as well as for arm allocation to the individual objects, thereby ensuring geometric validity while respecting task semantics. We evaluate our method on a dual-arm platform across nine real-world manipulation tasks spanning four categories: parallel manipulation, coordinated stabilization, tool use, and human handover. Our approach achieves consistently higher task success rates than geometric and semantic baselines for task-oriented grasping, demonstrating that explicit semantic reasoning over affordances and arm allocation helps enable reliable bimanual manipulation in unstructured environments.
Chinese Translation
双手操作需要推理与物体交互的位置以及每只手臂应执行的动作,这是一个联合的可供性定位与手臂分配问题,单纯依赖几何规划器无法在缺乏任务意图语义理解的情况下解决。现有方法要么将可供性预测视为粗略的部件分割,要么依赖几何启发式进行手臂分配,未能联合推理任务相关的接触区域与手臂分配。我们将双手操作重新定义为联合的可供性定位与手臂分配问题,提出一个层次化框架用于任务感知的双手可供性预测,该框架利用视觉-语言模型(Vision-Language Model, VLM)实现跨物体类别和任务描述的泛化,无需类别特定的训练。我们的方法将多视角RGB-D观测融合为一致的三维场景表示,并生成全局6自由度抓取候选,然后通过查询VLM获取每个物体上与任务相关的可供性区域及手臂分配信息,进行空间和语义过滤,从而确保几何有效性并尊重任务语义。我们在双臂平台上针对涵盖四类任务(平行操作、协调稳定、工具使用和人手交接)的九个真实操作任务进行了评估。实验结果表明,我们的方法在任务导向抓取上持续优于几何和语义基线,证明了对可供性和手臂分配的显式语义推理有助于实现非结构化环境中的可靠双手操作。
cs.RO / 4 / 2604.08780

Toward Hardware-Agnostic Quadrupedal World Models via Morphology Conditioning

面向硬件无关的四足世界模型的形态条件化研究
Danesh, Mohamad H., Li, Chenhao, Abyaneh, Amin, Houssaini, Anas, Ellis, Kirsty, Berseth, Glen, Hutter, Marco, Lin, Hsiu-Chin
Abstract
World models promise a paradigm shift in robotics, where an agent learns the underlying physics of its environment once to enable efficient planning and behavior learning. However, current world models are often hardware-locked specialists: a model trained on a Boston Dynamics Spot robot fails catastrophically on a Unitree Go1 due to the mismatch in kinematic and dynamic properties, as the model overfits to specific embodiment constraints rather than capturing the universal locomotion dynamics. Consequently, a slight change in actuator dynamics or limb length necessitates training a new model from scratch. In this work, we take a step towards a framework for training a generalizable Quadrupedal World Model (QWM) that disentangles environmental dynamics from robot morphology. We address the limitations of implicit system identification, where treating static physical properties (like mass or limb length) as latent variables to be inferred from motion history creates an adaptation lag that can compromise zero-shot safety and efficiency. Instead, we explicitly condition the generative dynamics on the robot's engineering specifications. By integrating a physical morphology encoder and a reward normalizer, we enable the model to serve as a neural simulator capable of generalizing across morphologies. This capability unlocks zero-shot control across a range of embodiments. We introduce, for the first time, a world model that enables zero-shot generalization to new morphologies for locomotion. While we carefully study the limitations of our method, QWM operates as a distribution-bounded interpolator within the quadrupedal morphology family rather than a universal physics engine, this work represents a significant step toward morphology-conditioned world models for legged locomotion.
Chinese Translation
世界模型承诺在机器人技术中实现范式转变,使得代理能够一次性学习其环境的基本物理特性,从而实现高效的规划和行为学习。然而,当前的世界模型往往是硬件锁定的专用模型:在波士顿动力公司的Spot机器人上训练的模型在Unitree Go1上会出现灾难性的失败,因为其运动学和动力学特性不匹配,模型过度拟合于特定的体现约束,而未能捕捉到普遍的运动动态。因此,执行器动态或肢体长度的轻微变化需要从头开始训练一个新模型。在本研究中,我们朝着训练可泛化的四足世界模型(Quadrupedal World Model, QWM)框架迈出了一步,该框架将环境动态与机器人形态解耦。我们解决了隐式系统识别的局限性,其中将静态物理属性(如质量或肢体长度)视为从运动历史中推断的潜在变量会造成适应滞后,从而影响零-shot安全性和效率。相反,我们明确地将生成动态条件化于机器人的工程规格。通过整合物理形态编码器和奖励归一化器,我们使模型能够作为神经模拟器,能够跨形态进行泛化。这一能力解锁了在多种体现下的零-shot控制。我们首次引入了一种世界模型,能够实现对新形态的零-shot泛化以进行运动。虽然我们仔细研究了我们方法的局限性,QWM作为四足形态家族内的分布界限插值器,而非通用物理引擎,这项工作代表了朝着腿部运动的形态条件化世界模型的重要一步。
cs.RO / 5 / 2604.08787

One Interface, Many Robots: Unified Real-Time Low-Level Motion Planning for Collaborative Arms

一个接口,多台机器人:协作臂的统一实时低级运动规划
Feng, Yue, Huang, Weicheng, Chen, I-Ming
Abstract
This paper proposes a common interface for real-time low-level motion planning of collaborative robotic arms, aimed at enabling broader applicability and improved portability across heterogeneous hardware platforms. In previous work, we introduced WinGs Operating Studio (WOS), a middleware solution that abstracts diverse robotic components into uniform software resources and provides a broad suite of language-agnostic APIs. This paper specifically focuses on its minimal yet flexible interface for real-time end-effector trajectory control. By employing an n-degree polynomial interpolator in conjunction with a quadratic programming solver, the proposed method generates smooth, continuously differentiable trajectories with precise position, velocity, and acceleration profiles. We validate our approach in three distinct scenarios. First, in an offline demonstration, a collaborative arm accurately draws various geometric shapes on paper. Second, in an interruptible, low-frequency re-planning setting, a robotic manipulator grasps a dynamic object placed on a moving mobile robot. Finally, we conducted a teleoperation experiment in which one robotic arm controlled another to perform a series of dexterous manipulations, confirming the proposed method's reliability, versatility, and ease of use.
Chinese Translation
本文提出了一种用于协作机器人臂的实时低级运动规划的通用接口,旨在实现更广泛的适用性和在异构硬件平台上的改进可移植性。在之前的工作中,我们介绍了 WinGs Operating Studio (WOS),这是一种中间件解决方案,将多样的机器人组件抽象为统一的软件资源,并提供了一套广泛的语言无关的 API。本文特别关注其用于实时末端执行器轨迹控制的最小但灵活的接口。通过采用 n 次多项式插值器与二次规划求解器相结合,所提出的方法生成平滑、连续可微的轨迹,并具有精确的位置、速度和加速度特性。我们在三个不同的场景中验证了我们的方法。首先,在离线演示中,协作臂准确地在纸上绘制各种几何形状。其次,在可中断的低频重新规划环境中,机器人操纵器抓取放置在移动机器人上的动态物体。最后,我们进行了一个遥操作实验,其中一台机器人臂控制另一台机器人臂执行一系列灵巧操作,确认了所提方法的可靠性、通用性和易用性。
cs.RO / 6 / 2604.08882

Simulation of Adaptive Running with Flexible Sports Prosthesis using Reinforcement Learning of Hybrid-link System

基于混合连杆系统强化学习的柔性运动假肢自适应跑步仿真研究
Shimane, Yuta, Yamamoto, Ko
Abstract
This study proposes a reinforcement learning-based adaptive running motion simulation for a unilateral transtibial amputee with the flexibility of a leaf-spring-type sports prosthesis using hybrid-link system. The design and selection of sports prostheses often rely on trial and error. A comprehensive whole-body dynamics analysis that considers the interaction between human motion and prosthetic deformation could provide valuable insights for user-specific design and selection. The hybrid-link system facilitates whole-body dynamics analysis by incorporating the Piece-wise Constant Strain model to represent the flexible deformation of the prosthesis. Based on this system, the simulation methodology generates whole-body dynamic motions of a unilateral transtibial amputee through a reinforcement learning-based approach, which combines imitation learning from motion capture data with accurate prosthetic dynamics computation. We simulated running motions under different virtual prosthetic stiffness conditions and analyzed the metabolic cost of transport obtained from the simulations, suggesting that variations in stiffness influence running performance. Our findings demonstrate the potential of this approach for simulation and analysis under virtual conditions that differ from real conditions.
Chinese Translation
本研究提出了一种基于强化学习的自适应跑步运动仿真方法,针对单侧截小腿截肢者,结合叶片弹簧型运动假肢的柔性特性,采用混合连杆系统进行建模。运动假肢的设计与选择通常依赖于反复试验,考虑人体运动与假肢变形相互作用的全身动力学综合分析,有助于实现用户特定的设计与选择。混合连杆系统通过引入分段恒定应变(Piece-wise Constant Strain)模型来描述假肢的柔性变形,促进了全身动力学分析。基于该系统,仿真方法通过强化学习结合运动捕捉数据的模仿学习与精确的假肢动力学计算,生成单侧截小腿截肢者的全身动态运动。我们在不同虚拟假肢刚度条件下模拟了跑步运动,并分析了仿真获得的运输代谢成本,结果表明刚度变化会影响跑步性能。研究结果展示了该方法在不同于真实条件的虚拟环境下进行仿真与分析的潜力。
cs.RO / 7 / 2604.08883

HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation

HTNav:一种具有分层结构的混合导航框架用于城市空中视觉-语言导航
Fan, Chengjie, Pan, Cong, Liu, Zijian, Liu, Ningzhong, Qin, Jie
Abstract
Inspired by the general Vision-and-Language Navigation (VLN) task, aerial VLN has attracted widespread attention, owing to its significant practical value in applications such as logistics delivery and urban inspection. However, existing methods face several challenges in complex urban environments, including insufficient generalization to unseen scenes, suboptimal performance in long-range path planning, and inadequate understanding of spatial continuity. To address these challenges, we propose HTNav, a new collaborative navigation framework that integrates Imitation Learning (IL) and Reinforcement Learning (RL) within a hybrid IL-RL framework. This framework adopts a staged training mechanism to ensure the stability of the basic navigation strategy while enhancing its environmental exploration capability. By integrating a tiered decision-making mechanism, it achieves collaborative interaction between macro-level path planning and fine-grained action control. Furthermore, a map representation learning module is introduced to deepen its understanding of spatial continuity in open domains. On the CityNav benchmark, our method achieves state-of-the-art performance across all scene levels and task difficulties. Experimental results demonstrate that this framework significantly improves navigation precision and robustness in complex urban environments.
Chinese Translation
受通用视觉-语言导航(Vision-and-Language Navigation, VLN)任务的启发,空中视觉-语言导航因其在物流配送和城市巡检等应用中的重要实际价值而受到广泛关注。然而,现有方法在复杂的城市环境中面临诸多挑战,包括对未见场景的泛化能力不足、长距离路径规划性能不佳以及空间连续性理解不足。为解决这些问题,我们提出了HTNav,一种将模仿学习(Imitation Learning, IL)与强化学习(Reinforcement Learning, RL)相结合的混合IL-RL导航框架。该框架采用分阶段训练机制,既保证了基础导航策略的稳定性,又增强了其环境探索能力。通过引入分层决策机制,实现了宏观路径规划与细粒度动作控制的协同交互。此外,框架中引入了地图表示学习模块,以加深对开放域空间连续性的理解。在CityNav基准测试中,我们的方法在所有场景级别和任务难度上均取得了最先进的性能。实验结果表明,该框架显著提升了复杂城市环境中的导航精度和鲁棒性。
cs.RO / 8 / 2604.08983

AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly

AssemLM:面向机器人装配的空间推理多模态大型语言模型
Jing, Zhi, Qiao, Jinbin, Lu, Ouyang, Ao, Jicong, Qiu, Shuang, Jiang, Yu-Gang, Bai, Chenjia
Abstract
Spatial reasoning is a fundamental capability for embodied intelligence, especially for fine-grained manipulation tasks such as robotic assembly. While recent vision-language models (VLMs) exhibit preliminary spatial awareness, they largely rely on coarse 2D perception and lack the ability to perform accurate reasoning over 3D geometry, which is crucial for precise assembly operations. To address this limitation, we propose AssemLM, a spatial multimodal large language model tailored for robotic assembly. AssemLM integrates assembly manuals, point clouds, and textual instructions to reason about and predict task-critical 6D assembly poses, enabling explicit geometric understanding throughout the assembly process. To effectively bridge raw 3D perception and high-level reasoning, we adopt a specialized point cloud encoder to capture fine-grained geometric and rotational features, which are then integrated into the multimodal language model to support accurate 3D spatial reasoning for assembly tasks. In addition, we construct AssemBench, a large-scale dataset and benchmark for assembly-oriented spatial reasoning, comprising over 900K multimodal samples with precise 6D pose annotations. AssemBench extends spatial reasoning evaluation beyond 2D and grounding tasks into full 3D geometric inference, filling a critical gap in existing embodied AI benchmarks. Extensive experiments demonstrate that AssemLM achieves state-of-the-art performance in 6D pose reasoning across diverse assembly scenarios. Furthermore, real-robot evaluations show that our model can support fine-grained and multi-step assembly execution in real-world settings, demonstrating its potential for robotic assembly applications.
Chinese Translation
空间推理是具身智能的基础能力,尤其对于机器人装配等细粒度操作任务尤为重要。尽管近期的视觉-语言模型(VLMs)展现了初步的空间感知能力,但它们主要依赖粗糙的二维感知,缺乏对三维几何结构进行精确推理的能力,而这对于精确的装配操作至关重要。为了解决这一局限性,我们提出了AssemLM,一种专为机器人装配设计的空间多模态大型语言模型。AssemLM融合了装配手册、点云数据和文本指令,以推理并预测任务关键的六自由度(6D)装配姿态,实现了装配过程中显式的几何理解。为了有效连接原始三维感知与高级推理,我们采用了专门的点云编码器来捕捉细粒度的几何和旋转特征,随后将其整合进多模态语言模型,支持装配任务中准确的三维空间推理。此外,我们构建了AssemBench,这是一个面向装配空间推理的大规模数据集和基准,包含超过90万条带有精确6D姿态标注的多模态样本。AssemBench将空间推理评估从二维和平面定位任务扩展到完整的三维几何推断,填补了现有具身人工智能基准中的关键空白。大量实验表明,AssemLM在多样化装配场景中的6D姿态推理任务上达到了最先进的性能。进一步的真实机器人评估显示,我们的模型能够支持细粒度、多步骤的装配执行,展现了其在机器人装配应用中的潜力。
cs.RO / 9 / 2604.09036

V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation

V-CAGE:用于机器人操作的视觉闭环自主生成引擎
Liu, Yaru, Wang, Ao-bo, Ye, Nanyang
Abstract
Scaling Vision-Language-Action (VLA) models requires massive datasets that are both semantically coherent and physically feasible. However, existing scene generation methods often lack context-awareness, making it difficult to synthesize high-fidelity environments embedded with rich semantic information, frequently resulting in unreachable target positions that cause tasks to fail prematurely. We present V-CAGE (Vision-Closed-loop Agentic Generation Engine), an agentic framework for autonomous robotic data synthesis. Unlike traditional scripted pipelines, V-CAGE operates as an embodied agentic system, leveraging foundation models to bridge high-level semantic reasoning with low-level physical interaction. Specifically, we introduce Inpainting-Guided Scene Construction to systematically arrange context-aware layouts, ensuring that the generated scenes are both semantically structured and kinematically reachable. To ensure trajectory correctness, we integrate functional metadata with a Vision-Language Model based closed-loop verification mechanism, acting as a visual critic to rigorously filter out silent failures and sever the error propagation chain. Finally, to overcome the storage bottleneck of massive video datasets, we implement a perceptually-driven compression algorithm that achieves over 90\% filesize reduction without compromising downstream VLA training efficacy. By centralizing semantic layout planning and visual self-verification, V-CAGE automates the end-to-end pipeline, enabling the highly scalable synthesis of diverse, high-quality robotic manipulation datasets.
Chinese Translation
扩展视觉-语言-动作(VLA)模型需要大量在语义上连贯且在物理上可行的庞大数据集。然而,现有的场景生成方法往往缺乏上下文意识,使得合成嵌入丰富语义信息的高保真环境变得困难,常常导致目标位置无法到达,从而使任务提前失败。我们提出了V-CAGE(视觉闭环自主生成引擎),这是一个用于自主机器人数据合成的自主框架。与传统的脚本化流程不同,V-CAGE作为一个具身的自主系统运作,利用基础模型将高层次的语义推理与低层次的物理交互连接起来。具体而言,我们引入了基于修补的场景构建方法,以系统性地安排上下文感知的布局,确保生成的场景在语义上结构化且在运动学上可达。为了确保轨迹的正确性,我们将功能元数据与基于视觉-语言模型的闭环验证机制相结合,作为视觉评判者严格过滤出静默故障并切断错误传播链。最后,为了克服庞大视频数据集的存储瓶颈,我们实施了一种感知驱动的压缩算法,实现了超过90%的文件大小减少,而不影响下游VLA训练的有效性。通过集中语义布局规划和视觉自我验证,V-CAGE自动化了端到端流程,使得多样化、高质量的机器人操作数据集的高度可扩展合成成为可能。
cs.RO / 10 / 2604.09038

Towards Lifelong Aerial Autonomy: Geometric Memory Management for Continual Visual Place Recognition in Dynamic Environments

迈向终身空中自主:动态环境中持续视觉地点识别的几何记忆管理
Shao, Xingyu, Yan, Zhiqiang, Sun, Liangzheng, He, Mengfan, Chen, Chao, Zhang, Jinhui, Li, Chunyu, Meng, Ziyang
Abstract
Robust geo-localization in changing environmental conditions is critical for long-term aerial autonomy. While visual place recognition (VPR) models perform well when airborne views match the training domain, adapting them to shifting distributions during sequential missions triggers catastrophic forgetting. Existing continual learning (CL) methods often fail here because geographic features exhibit severe intra-class variations. In this work, we formulate aerial VPR as a mission-based domain-incremental learning (DIL) problem and propose a novel heterogeneous memory framework. To respect strict onboard storage constraints, our "Learn-and-Dispose" pipeline decouples geographic knowledge into static satellite anchors (preserving global geometric priors) and a dynamic experience replay buffer (retaining domain-specific features). We introduce a spatially-constrained allocation strategy that optimizes buffer selection based on sample difficulty or feature space diversity. To facilitate systematic assessment, we provide three evaluation criteria and a comprehensive benchmark derived from 21 diverse mission sequences. Extensive experiments demonstrate that our architecture significantly boosts spatial generalization; our diversity-driven buffer selection outperforms the random baseline by 7.8% in knowledge retention. Unlike class-mean preservation methods that fail in unstructured environments, maximizing structural diversity achieves a superior plasticity-stability balance and ensures order-agnostic robustness across randomized sequences. These results prove that maintaining structural feature coverage is more critical than sample difficulty for resolving catastrophic forgetting in lifelong aerial autonomy.
Chinese Translation
在变化的环境条件下实现鲁棒的地理定位对于长期空中自主至关重要。尽管视觉地点识别(Visual Place Recognition, VPR)模型在飞行视角与训练域匹配时表现良好,但在连续任务中适应分布变化会引发灾难性遗忘。现有的持续学习(Continual Learning, CL)方法常因地理特征存在严重的类内变异而难以奏效。本文将空中VPR问题形式化为基于任务的领域增量学习(Domain-Incremental Learning, DIL)问题,并提出了一种新颖的异构记忆框架。为满足严格的机载存储限制,我们设计了“学习与丢弃”(Learn-and-Dispose)流程,将地理知识解耦为静态卫星锚点(保留全局几何先验)和动态经验回放缓冲区(保留领域特定特征)。我们引入了一种空间约束的分配策略,基于样本难度或特征空间多样性优化缓冲区选择。为促进系统评估,本文提供了三项评价标准及涵盖21个多样任务序列的综合基准。大量实验表明,我们的架构显著提升了空间泛化能力;基于多样性的缓冲选择在知识保持上较随机基线提升了7.8%。不同于在非结构化环境中失效的类均值保持方法,最大化结构多样性实现了更优的可塑性与稳定性平衡,并确保了随机序列下的顺序无关鲁棒性。这些结果证明,维持结构特征覆盖比样本难度更为关键,有助于解决终身空中自主中的灾难性遗忘问题。
cs.RO / 11 / 2604.09049

{\sf TriDeliver}: Cooperative Air-Ground Instant Delivery with UAVs, Couriers, and Crowdsourced Ground Vehicles

{\sf TriDeliver}:基于无人机、快递员与众包地面车辆的协同空地即时配送系统
Gao, Junhui, Pan, Yan, Wang, Qianru, Hou, Wenzhe, Deng, Yiqin, Jiang, Liangliang, Fang, Yuguang
Abstract
Instant delivery, shipping items before critical deadlines, is essential in daily life. While multiple delivery agents, such as couriers, Unmanned Aerial Vehicles (UAVs), and crowdsourced agents, have been widely employed, each of them faces inherent limitations (e.g., low efficiency/labor shortages, flight control, and dynamic capabilities, respectively), preventing them from meeting the surging demands alone. This paper proposes {\sf TriDeliver}, the first hierarchical cooperative framework, integrating human couriers, UAVs, and crowdsourced ground vehicles (GVs) for efficient instant delivery. To obtain the initial scheduling knowledge for GVs and UAVs as well as improve the cooperative delivery performance, we design a Transfer Learning (TL)-based algorithm to extract delivery knowledge from couriers' behavioral history and transfer their knowledge to UAVs and GVs with fine-tunings, which is then used to dispatch parcels for efficient delivery. Evaluated on one-month real-world trajectory and delivery datasets, it has been demonstrated that 1) by integrating couriers, UAVs, and crowdsourced GVs, {\sf TriDeliver} reduces the delivery cost by $65.8\%$ versus state-of-the-art cooperative delivery by UAVs and couriers; 2) {\sf TriDeliver} achieves further improvements in terms of delivery time ($-17.7\%$), delivery cost ($-9.8\%$), and impacts on original tasks of crowdsourced GVs ($-43.6\%$), even with the representation of the transferred knowledge by simple neural networks, respectively.
Chinese Translation
即时配送,即在关键截止时间前运送物品,是日常生活中的重要需求。尽管快递员、无人机(UAVs)和众包代理等多种配送主体已被广泛应用,但它们各自存在固有的局限性(例如,效率低下/人力短缺、飞行控制难题以及动态能力限制),使其难以单独满足激增的配送需求。本文提出了{\sf TriDeliver},首个分层协同框架,整合了人工快递员、无人机和众包地面车辆(GVs),以实现高效的即时配送。为获取GVs和UAVs的初始调度知识并提升协同配送性能,我们设计了一种基于迁移学习(Transfer Learning, TL)的算法,从快递员的行为历史中提取配送知识,并通过微调将其迁移至无人机和地面车辆,进而用于高效的包裹派送调度。在基于一个月的真实轨迹及配送数据集上的评估表明:1)通过整合快递员、无人机与众包GVs,{\sf TriDeliver}相比现有的无人机与快递员协同配送方案,配送成本降低了65.8%;2)即使采用简单神经网络对迁移知识进行表示,{\sf TriDeliver}在配送时间(减少17.7%)、配送成本(减少9.8%)以及对众包GVs原有任务的影响(减少43.6%)方面均实现了进一步提升。
cs.RO / 12 / 2604.09156

On the Terminology and Geometric Aspects of Redundant Parallel Manipulators

冗余并联机械手的术语及几何特性研究
Mueller, Andreas
Abstract
Parallel kinematics machines (PKM) can exhibit kinematic as well as actuation redundancy. While the meaning of kinematic redundancy has been clarified already for serial manipulators, actuation redundancy, that is only possible in PKM, is differently classified in the literature. In this paper a consistent terminology for general redundant PKM is proposed. A kinematic model is introduced with the configuration space (c-space) as central part. The notion of kinematic redundancy is recalled for PKM. C-space, output, and input singularities are distinguished. The significance of the c-space geometry is emphasized, and it is pointed out geometrically that input singularities can be avoided by redundant actuation schemes. In order to distinguish different actuation schemes of PKM a non-linear control system is introduced whose dynamics evolves on the c-space. The degree of actuation (DOA) is introduced as the number of independent control vector fields, and PKM are classified as full-actuated and underactuated. Relating this DOA to the degree of freedom (DOF) allows to classify the actuation redundancy.
Chinese Translation
并联运动学机器(PKM)可以表现出运动学冗余以及驱动冗余。尽管运动学冗余的含义已在串联机械手中得到澄清,但驱动冗余——这仅在PKM中可能存在——在文献中存在不同的分类方法。本文提出了一种适用于一般冗余PKM的一致术语体系。引入了以配置空间(c-space)为核心的运动学模型,回顾了PKM的运动学冗余概念。区分了配置空间奇异点、输出奇异点和输入奇异点。强调了配置空间几何的重要性,并从几何角度指出通过冗余驱动方案可以避免输入奇异点。为了区分PKM的不同驱动方案,本文引入了一个非线性控制系统,其动力学在配置空间上演化。提出了驱动度(DOA)作为独立控制向量场的数量,并将PKM分类为全驱动和欠驱动。将驱动度与自由度(DOF)相关联,从而对驱动冗余进行分类。
cs.RO / 13 / 2604.09270

Soft Electroadhesive Feet for Micro Aerial Robots Perching on Smooth and Curved Surfaces

微型空中机器人在光滑和曲面上栖息的软电粘附脚
Liu, Chen, Feroz, Sonu, Zhang, Ketao
Abstract
Electroadhesion (EA) provides electrically switchable adhesion and is a promising mechanism for perching micro aerial robots on smooth surfaces. However, practical implementations of soft and stretchable EA pads for aerial perching remain limited. This work presents (i) an efficient workflow for fabricating soft, stretchable electroadhesive pads with sinusoidal wave and concentric-circle electrodes in multiple sizes, (ii) a controlled experimental comparison of normal and shear adhesion under inactive (0 kV) and active (4.8 kV) conditions using an Instron-based setup, and (iii) a perching demonstration using a Crazyflie quadrotor equipped with electroadhesive feet on flat and curved substrates. Experimental results show that shear adhesion dominates, reaching forces on the order of 3 N with partial pad contact, while normal adhesion is comparatively small and strongly dependent on substrate properties. The Crazyflie prototype demonstrates repeatable attachment on smooth plastic surfaces, including curved geometries, as well as rapid detachment when the voltage is removed. These results highlight the potential of soft electroadhesive feet for lightweight and reliable perching in micro aerial vehicles (MAVs).
Chinese Translation
电粘附(Electroadhesion, EA)提供了可电控的粘附力,是一种有前景的机制,可以使微型空中机器人在光滑表面上栖息。然而,针对空中栖息的软性和可拉伸电粘附垫的实际应用仍然有限。本研究提出了(i)一种高效的工作流程,用于制造具有正弦波和同心圆电极的软性、可拉伸电粘附垫,尺寸多样;(ii)使用基于Instron的实验装置对在非激活(0 kV)和激活(4.8 kV)条件下的法向和剪切粘附进行控制实验比较;(iii)使用配备电粘附脚的Crazyflie四旋翼在平面和曲面基材上进行栖息演示。实验结果表明,剪切粘附占主导地位,达到约3 N的力,且在部分垫接触的情况下表现明显,而法向粘附相对较小,并且强烈依赖于基材特性。Crazyflie原型展示了在光滑塑料表面(包括曲面几何形状)上的可重复附着,以及在去除电压时的快速脱离。这些结果突显了软电粘附脚在微型空中飞行器(MAVs)中实现轻量化和可靠栖息的潜力。
cs.RO / 14 / 2604.09282

Characterizing Lidar Range-Measurement Ambiguity due to Multiple Returns

多重回波引起的激光雷达测距模糊特性分析
Rife, Jason H., Li, Yifan
Abstract
Reliable position and attitude sensing is critical for highly automated vehicles that operate on conventional roadways. Lidar sensors are increasingly incorporated into pose-estimation systems. Despite its great utility, lidar is a complex sensor, and its performance in roadway environments is not yet well understood. For instance, it is often assumed in lidar-localization algorithms that a lidar will always identify a unique surface along a given raypath. However, this assumption is not always true, as ample prior evidence exists to suggest that lidar units may generate measurements probabilistically when more than one scattering surface appears within the lidar's conical beam. In this paper, we analyze lidar datasets to characterize cases with probabilistic returns along particular raypaths. Our contribution is to present representative cumulative distribution functions (CDFs) for raypaths observed by two different mechanically rotating lidar units with stationary bases. In subsequent discussion, we outline a qualitative methodology to assess the effect of probabilistic multi-return cases on lidar-based localization.
Chinese Translation
可靠的位置和姿态感知对于在传统道路上运行的高度自动化车辆至关重要。激光雷达(Lidar)传感器正日益被集成到位姿估计系统中。尽管激光雷达具有极大实用价值,但其作为一种复杂传感器,在道路环境中的性能尚未被充分理解。例如,激光雷达定位算法通常假设激光雷达沿特定射线路径总能识别唯一的反射表面。然而,该假设并非总是成立,已有大量先前研究表明,当激光雷达锥形光束内存在多个散射表面时,激光雷达可能以概率方式生成测量值。本文通过分析激光雷达数据集,刻画了特定射线路径上出现概率性回波的情况。我们的贡献在于展示了两种不同机械旋转激光雷达装置(基座固定)观测到的射线路径的代表性累积分布函数(CDF)。在后续讨论中,我们提出了一种定性方法,以评估概率性多重回波情况对基于激光雷达定位的影响。
cs.RO / 15 / 2604.09294

A Benchmark of Dexterity for Anthropomorphic Robotic Hands

类人机器人手的灵巧性基准测试
Liconti, Davide, Zhou, Yuning, Toshimitsu, Yasunori, Hinchet, Ronan, Katzschmann, Robert K.
Abstract
Dexterity is a central yet ambiguously defined concept in the design and evaluation of anthropomorphic robotic hands. In practice, the term is often used inconsistently, with different systems evaluated under disparate criteria, making meaningful comparisons across designs difficult. This highlights the need for a unified, performance-based definition of dexterity grounded in measurable outcomes rather than proxy metrics. In this work, we introduce POMDAR, a comprehensive dexterity benchmark that formalizes dexterity as task performance across a structured set of manipulation and grasping motions. The benchmark was systematically derived from established taxonomies in human motor control. It is implemented in both real-world and simulation and includes four manipulation configurations: vertical and horizontal configurations, continuous rotation, and pure grasping. The task designs contain mechanical scaffolding to constrain task motion, suppress compensatory strategies, and enable metrics to be measured unambiguously. We define a quantitative scoring metric combining task correctness and execution speed, effectively measuring dexterity as throughput. This enables objective, reproducible, and interpretable evaluation across different hand designs. POMDAR provides an open-source, standardized, and taxonomy-grounded benchmark for consistent comparison and evaluation of anthropomorphic robot hands to facilitate a systematic advancement of dexterous manipulation platforms. CAD, simulation files, and evaluation videos are publicly available at https://srl-ethz.github.io/POMDAR/.
Chinese Translation
灵巧性是类人机器人手设计与评估中的核心但定义模糊的概念。在实际应用中,该术语常被不一致地使用,不同系统在不同标准下进行评估,导致设计间的有效比较变得困难。这凸显了基于可测量结果而非代理指标的统一性能导向灵巧性定义的必要性。在本研究中,我们提出了POMDAR,一种全面的灵巧性基准测试,将灵巧性形式化为一组结构化操作和抓取动作中的任务表现。该基准测试系统地基于人类运动控制的既定分类法推导而来,涵盖真实环境和仿真环境,包含四种操作配置:垂直和水平配置、连续旋转及纯抓取。任务设计中包含机械支架以限制任务动作,抑制补偿策略,并实现指标的明确测量。我们定义了结合任务正确性与执行速度的量化评分指标,有效地将灵巧性度量为吞吐量,从而实现不同手型设计间的客观、可重复且可解释的评估。POMDAR提供了一个开源、标准化且基于分类法的基准测试平台,支持类人机器人手的一致比较与评估,促进灵巧操作平台的系统性进步。CAD文件、仿真文件及评估视频可在https://srl-ethz.github.io/POMDAR/公开获取。
cs.RO / 16 / 2604.09303

Online Intention Prediction via Control-Informed Learning

基于控制信息学习的在线意图预测
Zhou, Tianyu, Liang, Zihao, Lu, Zehui, Mou, Shaoshuai
Abstract
This paper presents an online intention prediction framework for estimating the goal state of autonomous systems in real time, even when intention is time-varying, and system dynamics or objectives include unknown parameters. The problem is formulated as an inverse optimal control / inverse reinforcement learning task, with the intention treated as a parameter in the objective. A shifting horizon strategy discounts outdated information, while online control-informed learning enables efficient gradient computation and online parameter updates. Simulations under varying noise levels and hardware experiments on a quadrotor drone demonstrate that the proposed approach achieves accurate, adaptive intention prediction in complex environments.
Chinese Translation
本文提出了一种在线意图预测框架,用于实时估计自主系统的目标状态,即使在意图随时间变化且系统动力学或目标包含未知参数的情况下。该问题被表述为逆最优控制/逆强化学习任务,将意图视为目标函数中的参数。采用滑动时域策略以削弱过时信息的影响,同时基于控制信息的在线学习实现了高效的梯度计算和参数在线更新。通过在不同噪声水平下的仿真以及四旋翼无人机的硬件实验,验证了所提方法在复杂环境中实现准确且自适应的意图预测的能力。
cs.RO / 17 / 2604.09323

Robust Adaptive Backstepping Impedance Control of Robots in Unknown Environments

未知环境下机器人鲁棒自适应反步法阻抗控制
Nazmara, Reza, Kshirsagar, Alap, Peters, Jan, Aguiar, A. Pedro
Abstract
This paper presents a Robust Adaptive Backstepping Impedance Control (RABIC) strategy for robots operating in contact-rich and uncertain environments. The proposed control strategy considers the complete coupled dynamics of the system and explicitly accounts for key sources of uncertainty, including external disturbances and unmodeled dynamics, while not requiring the robot's dynamic parameters in implementation. We propose a backstepping-based adaptive impedance control scheme for the inner loop to track the reference impedance model. To handle uncertainties, we employ a Taylor series-based estimator for system dynamics and an adaptive estimator for determining the upper bound of external forces. Stability analysis demonstrates the semi-global practical finite-time stability of the overall system. To demonstrate the effectiveness of the proposed method, a simulated mobile manipulator scenario and experimental evaluations on a real Franka Emika Panda robot were conducted. The proposed approach exhibits safer performance compared to PD control while ensuring trajectory tracking and force monitoring. Overall, the RABIC framework provides a solid basis for future research on adaptive and learning-based impedance control for coupled mobile and fixed serially linked manipulators.
Chinese Translation
本文提出了一种适用于接触丰富且存在不确定性的环境中机器人的鲁棒自适应反步法阻抗控制(Robust Adaptive Backstepping Impedance Control,RABIC)策略。该控制策略考虑了系统的完整耦合动力学,并明确处理了关键的不确定性来源,包括外部扰动和未建模动力学,同时在实现过程中不依赖机器人的动力学参数。我们设计了基于反步法的自适应阻抗控制方案作为内环,用以跟踪参考阻抗模型。为应对不确定性,采用了基于泰勒级数的系统动力学估计器和用于确定外力上界的自适应估计器。稳定性分析证明了整体系统的半全局实用有限时间稳定性。为验证所提方法的有效性,进行了移动操作臂仿真场景测试及基于真实Franka Emika Panda机器人的实验评估。结果表明,该方法相比传统PD控制表现出更安全的性能,同时保证了轨迹跟踪和力监测。总体而言,RABIC框架为未来耦合移动与固定串联机械臂的自适应及基于学习的阻抗控制研究提供了坚实基础。
cs.RO / 18 / 2604.09326

Multimodal Anomaly Detection for Human-Robot Interaction

人机交互的多模态异常检测
Ribeiro, Guilherme, Antypas, Iordanis, Bizzaro, Leonardo, Bimbo, João, Garcia, Nuno Cruz
Abstract
Ensuring safety and reliability in human-robot interaction (HRI) requires the timely detection of unexpected events that could lead to system failures or unsafe behaviours. Anomaly detection thus plays a critical role in enabling robots to recognize and respond to deviations from normal operation during collaborative tasks. While reconstruction models have been actively explored in HRI, approaches that operate directly on feature vectors remain largely unexplored. In this work, we propose MADRI, a framework that first transforms video streams into semantically meaningful feature vectors before performing reconstruction-based anomaly detection. Additionally, we augment these visual feature vectors with the robot's internal sensors' readings and a Scene Graph, enabling the model to capture both external anomalies in the visual environment and internal failures within the robot itself. To evaluate our approach, we collected a custom dataset consisting of a simple pick-and-place robotic task under normal and anomalous conditions. Experimental results demonstrate that reconstruction on vision-based feature vectors alone is effective for detecting anomalies, while incorporating other modalities further improves detection performance, highlighting the benefits of multimodal feature reconstruction for robust anomaly detection in human-robot collaboration.
Chinese Translation
确保人机交互(HRI)的安全性和可靠性需要及时检测可能导致系统故障或不安全行为的意外事件。因此,异常检测在使机器人能够识别和响应协作任务中正常操作偏差方面发挥着关键作用。尽管重建模型在HRI中得到了积极探索,但直接在特征向量上操作的方法仍然很少被研究。在本研究中,我们提出了MADRI框架,该框架首先将视频流转换为语义上有意义的特征向量,然后进行基于重建的异常检测。此外,我们还将这些视觉特征向量与机器人的内部传感器读数和场景图(Scene Graph)相结合,使模型能够捕捉视觉环境中的外部异常和机器人内部的故障。为了评估我们的方法,我们收集了一个自定义数据集,该数据集包含在正常和异常条件下进行的简单抓取和放置机器人任务。实验结果表明,仅基于视觉特征向量的重建在检测异常方面是有效的,而结合其他模态进一步提高了检测性能,突显了多模态特征重建在人机协作中实现稳健异常检测的优势。
cs.RO / 19 / 2604.09330

VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

VAG:用于具身数据合成的双流视频动作生成框架
Lang, Xiaolei, Wang, Yang, Zhou, Yukun, Ni, Chaojun, Li, Kerui, Zhu, Jiagang, Liu, Tianze, Lv, Jiajun, Zuo, Xingxing, Ye, Yun, Huang, Guan, Wang, Xiaofeng, Zhu, Zheng
Abstract
Recent advances in robot foundation models trained on large-scale human teleoperation data have enabled robots to perform increasingly complex real-world tasks. However, scaling these systems remains difficult because collecting task-specific demonstrations is expensive and labor-intensive. Synthetic data, especially generated videos, offer a promising direction, but existing World Models (WMs) are not directly suitable for policy learning since they do not provide paired action trajectories. World-Action (WA) models partially address this by predicting actions with visual outputs, yet often lack strong video-action alignment, while two-stage pipelines that generate video first and then infer actions introduce inefficiency and error accumulation. To address these limitations, we propose VAG, a unified flow-matching-based dual-stream framework that jointly generates video and action under visual and language conditioning. By synchronizing denoising in both branches and using an adaptive 3D pooling mechanism to transfer compact global video context to the action branch, VAG improves cross-modal consistency during generation. Across both simulated and real-world settings, VAG produces aligned video-action pairs with competitive prediction quality, supports executable trajectory replay, and provides useful synthetic pretraining data that improves downstream policy generalization, indicating its potential as a practical world-action model for embodied data synthesis.
Chinese Translation
近年来,基于大规模人类远程操作数据训练的机器人基础模型使机器人能够执行日益复杂的现实任务。然而,系统的扩展仍面临挑战,因为收集特定任务的示范数据成本高且劳动强度大。合成数据,尤其是生成的视频,提供了一个有前景的方向,但现有的世界模型(World Models, WMs)不适合直接用于策略学习,因为它们不提供配对的动作轨迹。世界动作模型(World-Action, WA)通过预测带有视觉输出的动作部分解决了这一问题,但通常缺乏强有力的视频-动作对齐;而先生成视频再推断动作的两阶段流程则带来效率低下和误差积累的问题。为克服这些限制,我们提出了VAG,一种基于流匹配的统一双流框架,在视觉和语言条件下联合生成视频和动作。通过同步两条分支的去噪过程,并采用自适应3D池化机制将紧凑的全局视频上下文传递至动作分支,VAG提升了生成过程中的跨模态一致性。在模拟和现实环境中,VAG生成了对齐的视频-动作对,具有竞争力的预测质量,支持可执行轨迹重放,并提供了有助于下游策略泛化的合成预训练数据,显示出其作为具身数据合成实用世界动作模型的潜力。
cs.RO / 20 / 2604.09431

Musculoskeletal Motion Imitation for Learning Personalized Exoskeleton Control Policy in Impaired Gait

模仿肌肉骨骼运动以学习个性化外骨骼控制策略应对步态障碍
Choi, Itak, Park, Ilseung, Halilaj, Eni, Kang, Inseung
Abstract
Designing generalizable control policies for lower-limb exoskeletons remains fundamentally constrained by exhaustive data collection or iterative optimization procedures, which limit accessibility to clinical populations. To address this challenge, we introduce a device-agnostic framework that combines physiologically plausible musculoskeletal simulation with reinforcement learning to enable scalable personalized exoskeleton assistance for both able-bodied and clinical populations. Our control policies not only generate physiologically plausible locomotion dynamics but also capture clinically observed compensatory strategies under targeted muscular deficits, providing a unified computational model of both healthy and pathological gait. Without task-specific tuning, the resulting exoskeleton control policies produce assistive torque profiles at the hip and ankle that align with state-of-the-art profiles validated in human experiments, while consistently reducing metabolic cost across walking speeds. For simulated impaired-gait models, the learned control policies yield asymmetric, deficit-specific exoskeleton assistance that improves both energetic efficiency and bilateral kinematic symmetry without explicit prescription of the target gait pattern. These results demonstrate that physiologically plausible musculoskeletal simulation via reinforcement learning can serve as a scalable foundation for personalized exoskeleton control across both able-bodied and clinical populations, eliminating the need for extensive physical trials.
Chinese Translation
为下肢外骨骼设计可推广的控制策略仍然受到全面数据收集或迭代优化程序的根本限制,这限制了其在临床人群中的可及性。为了解决这一挑战,我们提出了一种设备无关的框架,结合生理上合理的肌肉骨骼仿真与强化学习,以实现对健康人群和临床人群的可扩展个性化外骨骼辅助。我们的控制策略不仅生成生理上合理的运动动力学,还捕捉到在特定肌肉缺陷下临床观察到的补偿策略,提供了健康与病理步态的统一计算模型。在没有特定任务调优的情况下,所得到的外骨骼控制策略在髋关节和踝关节产生的辅助扭矩特征与在人体实验中验证的最先进特征一致,同时在不同步态速度下持续降低代谢成本。对于模拟的步态障碍模型,学习到的控制策略提供了不对称的、特定缺陷的外骨骼辅助,改善了能量效率和双侧运动学对称性,而无需明确规定目标步态模式。这些结果表明,通过强化学习实现的生理上合理的肌肉骨骼仿真可以作为个性化外骨骼控制的可扩展基础,适用于健康人群和临床人群,从而消除对大量物理试验的需求。
cs.RO / 21 / 2604.09462

Adaptor: Advancing Assistive Teleoperation with Few-Shot Learning and Cross-Operator Generalization

Adaptor:基于少样本学习与跨操作者泛化的辅助远程操作进展
Liu, Yu, Yin, Yihang, Huang, Tianlv, Yan, Fei, Xu, Yuan, Hong, Weinan, Han, Wei, Cao, Yue, Chen, Xiangyu, Fan, Zipei, Song, Xuan
Abstract
Assistive teleoperation enhances efficiency via shared control, yet inter-operator variability, stemming from diverse habits and expertise, induces highly heterogeneous trajectory distributions that undermine intent recognition stability. We present Adaptor, a few-shot framework for robust cross-operator intent recognition. The Adaptor bridges the domain gap through two stages: (i) preprocessing, which models intent uncertainty by synthesizing trajectory perturbations via noise injection and performs geometry-aware keyframe extraction; and (ii) policy learning, which encodes the processed trajectories with an Intention Expert and fuses them with the pre-trained vision-language model context to condition an Action Expert for action generation. Experiments on real-world and simulated benchmarks demonstrate that Adaptor achieves state-of-the-art performance, improving success rates and efficiency over baselines. Moreover, the method exhibits low variance across operators with varying expertise, demonstrating robust cross-operator generalization.
Chinese Translation
辅助远程操作通过共享控制提升效率,但由于操作者习惯和专业水平的多样性,导致轨迹分布高度异质,进而削弱了意图识别的稳定性。本文提出了Adaptor,一种用于稳健跨操作者意图识别的少样本学习框架。Adaptor通过两个阶段弥合领域差距:(i)预处理阶段,通过噪声注入合成轨迹扰动以建模意图不确定性,并执行几何感知的关键帧提取;(ii)策略学习阶段,利用Intention Expert对处理后的轨迹进行编码,并将其与预训练视觉-语言模型的上下文融合,以条件化Action Expert进行动作生成。基于真实世界和仿真基准的实验表明,Adaptor实现了最先进的性能,较基线方法显著提升了成功率和效率。此外,该方法在不同专业水平操作者间表现出低方差,展现出强健的跨操作者泛化能力。
cs.RO / 22 / 2604.09474

SafeMind: A Risk-Aware Differentiable Control Framework for Adaptive and Safe Quadruped Locomotion

SafeMind:一种风险感知的可微分控制框架,用于自适应和安全的四足运动
Zhang, Zukun, Shu, Kai, Mo, Mingqiao
Abstract
Learning-based quadruped controllers achieve impressive agility but typically lack formal safety guarantees under model uncertainty, perception noise, and unstructured contact conditions. We introduce SafeMind, a differentiable stochastic safety-control framework that unifies probabilistic Control Barrier Functions with semantic context understanding and meta-adaptive risk calibration. SafeMind explicitly models epistemic and aleatoric uncertainty through a variance-aware barrier constraint embedded in a differentiable quadratic program, thereby preserving gradient flow for end-to-end training. A semantics-to-constraint encoder modulates safety margins using perceptual or language cues, while a meta-adaptive learner continuously adjusts risk sensitivity across environments. We provide theoretical conditions for probabilistic forward invariance, feasibility, and stability under stochastic dynamics. SafeMind is deployed on Unitree A1 and ANYmal C at 200~Hz and validated across 12 terrain types, dynamic obstacles, morphology perturbations, and semantically defined tasks. Experiments show that SafeMind reduces safety violations by 3--10x and energy consumption by 10--15% relative to state-of-the-art CBF, MPC, and hybrid RL baselines, while maintaining real-time control performance.
Chinese Translation
基于学习的四足控制器在灵活性方面表现出色,但通常在模型不确定性、感知噪声和非结构化接触条件下缺乏正式的安全保障。我们提出了SafeMind,这是一种可微分的随机安全控制框架,将概率控制屏障函数与语义上下文理解和元自适应风险校准统一起来。SafeMind通过嵌入在可微分二次规划中的方差感知屏障约束,显式建模认知不确定性和随机不确定性,从而保持端到端训练的梯度流动。语义到约束编码器利用感知或语言线索调节安全边际,而元自适应学习者则在不同环境中持续调整风险敏感性。我们提供了在随机动态下概率前向不变性、可行性和稳定性的理论条件。SafeMind在Unitree A1和ANYmal C上以200 Hz的频率部署,并在12种地形类型、动态障碍物、形态扰动和语义定义任务中进行了验证。实验表明,与最先进的控制屏障函数(CBF)、模型预测控制(MPC)和混合强化学习基线相比,SafeMind将安全违规减少了3到10倍,能耗降低了10到15%,同时保持实时控制性能。
cs.RO / 23 / 2604.09487

Sim-to-Real Transfer for Muscle-Actuated Robots via Generalized Actuator Networks

基于广义执行器网络的肌肉驱动机器人仿真到现实迁移
Schneider, Jan, Mahajan, Mridul, Chen, Le, Guist, Simon, Schölkopf, Bernhard, Posner, Ingmar, Büchler, Dieter
Abstract
Tendon drives paired with soft muscle actuation enable faster and safer robots while potentially accelerating skill acquisition. Still, these systems are rarely used in practice due to inherent nonlinearities, friction, and hysteresis, which complicate modeling and control. So far, these challenges have hindered policy transfer from simulation to real systems. To bridge this gap, we propose a sim-to-real pipeline that learns a neural network model of this complex actuation and leverages established rigid body simulation for the arm dynamics and interactions with the environment. Our method, called Generalized Actuator Network (GeAN), enables actuation model identification across a wide range of robots by learning directly from joint position trajectories rather than requiring torque sensors. Using GeAN on PAMY2, a tendon-driven robot powered by pneumatic artificial muscles, we successfully deploy precise goal-reaching and dynamic ball-in-a-cup policies trained entirely in simulation. To the best of our knowledge, this result constitutes the first successful sim-to-real transfer for a four-degrees-of-freedom muscle-actuated robot arm.
Chinese Translation
肌腱驱动结合软肌肉驱动技术能够实现更快速且更安全的机器人,同时有望加速技能的获取。然而,由于固有的非线性、摩擦和滞后效应,这些系统在建模和控制上存在较大难度,因此在实际应用中较少使用。迄今为止,这些挑战阻碍了策略从仿真到真实系统的迁移。为弥合这一差距,我们提出了一种仿真到现实的流程,该流程通过学习复杂驱动的神经网络模型,并利用成熟的刚体动力学仿真来模拟机械臂动力学及其与环境的交互。我们的方法称为广义执行器网络(Generalized Actuator Network,GeAN),通过直接从关节位置轨迹学习,而无需扭矩传感器,实现了对多种机器人驱动模型的识别。基于GeAN,我们在由气动人工肌肉驱动的肌腱驱动机器人PAMY2上,成功部署了完全在仿真中训练的精准目标到达和动态“杯中球”策略。据我们所知,这是首个针对四自由度肌肉驱动机械臂实现的成功仿真到现实迁移。
cs.RO / 24 / 2604.09499

Physics-Informed Reinforcement Learning of Spatial Density Velocity Potentials for Map-Free Racing

基于物理信息的空间密度速度势强化学习用于无地图赛车
Sivashangaran, Shathushan, Khairnar, Apoorva, Gohari, Sepideh, Dutta, Vihaan, Eskandarian, Azim
Abstract
Autonomous racing without prebuilt maps is a grand challenge for embedded robotics that requires kinodynamic planning from instantaneous sensor data at the acceleration and tire friction limits. Out-Of-Distribution (OOD) generalization to various racetrack configurations utilizes Machine Learning (ML) to encode the mathematical relation between sensor data and vehicle actuation for end-to-end control, with implicit localization. These comprise Behavioral Cloning (BC) that is capped to human reaction times and Deep Reinforcement Learning (DRL) which requires large-scale collisions for comprehensive training that can be infeasible without simulation but is arduous to transfer to reality, thus exhibiting greater performance than BC in simulation, but actuation instability on hardware. This paper presents a DRL method that parameterizes nonlinear vehicle dynamics from the spectral distribution of depth measurements with a non-geometric, physics-informed reward, to infer vehicle time-optimal and overtaking racing controls with an Artificial Neural Network (ANN) that utilizes less than 1% of the computation of BC and model-based DRL. Slaloming from simulation to reality transfer and variance-induced conservatism are eliminated with the combination of a physics engine exploit-aware reward and the replacement of an explicit collision penalty with an implicit truncation of the value horizon. The policy outperforms human demonstrations by 12% in OOD tracks on proportionally scaled hardware, by maximizing the friction circle with tire dynamics that resemble an empirical Pacejka tire model. System identification illuminates a functional bifurcation where the first layer compresses spatial observations to extract digitized track features with higher resolution in corner apexes, and the second encodes nonlinear dynamics.
Chinese Translation
无需预构建地图的自主赛车是嵌入式机器人领域的一大挑战,要求在加速度和轮胎摩擦极限下,从瞬时传感器数据进行运动动力学规划。针对各种赛道配置的分布外(OOD)泛化,利用机器学习(ML)编码传感器数据与车辆驱动之间的数学关系,实现端到端控制并隐式完成定位。其中包括受限于人类反应时间的行为克隆(Behavioral Cloning, BC)和需要大规模碰撞数据进行全面训练的深度强化学习(Deep Reinforcement Learning, DRL)。后者在无仿真环境下训练难以实现且难以迁移到现实硬件,表现为仿真中优于BC但硬件上驱动不稳定。本文提出一种DRL方法,通过深度测量的谱分布参数化非线性车辆动力学,结合非几何、基于物理信息的奖励函数,利用人工神经网络(Artificial Neural Network, ANN)推断车辆的时间最优及超车控制,计算量不到BC和基于模型的DRL的1%。通过结合物理引擎漏洞感知奖励和用隐式价值截断替代显式碰撞惩罚,消除了仿真到现实转移中的摆动及方差引起的保守性。该策略在按比例缩放的硬件上,在OOD赛道中比人类示范提升了12%,通过最大化摩擦圆并采用类似经验Pacejka轮胎模型的轮胎动力学实现。系统辨识揭示了功能分叉:第一层压缩空间观测以提取数字化赛道特征,在弯道顶点处分辨率更高;第二层编码非线性动力学。
计算机视觉 (Computer Vision)
119
cs.CV / 1 / 2604.08609

Detection of Hate and Threat in Digital Forensics: A Case-Driven Multimodal Approach

数字取证中的仇恨与威胁检测:基于案例驱动的多模态方法
Shill, Ponkoj Chandra
Abstract
Digital forensic investigations increasingly rely on heterogeneous evidence such as images, scanned documents, and contextual reports. These artifacts may contain explicit or implicit expressions of harm, hate, threat, violence, or intimidation, yet existing automated approaches often assume clean text input or apply vision models without forensic justification. This paper presents a case-driven multimodal approach for hate and threat detection in forensic analysis. The proposed framework explicitly determines the presence and source of textual evidence, distinguishing between embedded text, associated contextual text, and image-only evidence. Based on the identified evidence configuration, the framework selectively applies text analysis, multimodal fusion, or image-only semantic reasoning using vision language models with vision transformer backbones (ViT). By conditioning inference on evidence availability, the approach mirrors forensic decision-making, improves evidentiary traceability, and avoids unjustified modality assumptions. Experimental evaluation on forensic-style image evidence demonstrates consistent and interpretable behavior across heterogeneous evidence scenarios.
Chinese Translation
数字取证调查日益依赖于异构证据,如图像、扫描文档和上下文报告。这些证据可能包含明确或隐含的伤害、仇恨、威胁、暴力或恐吓的表达,然而现有的自动化方法往往假设输入为干净文本,或在没有取证依据的情况下应用视觉模型。本文提出了一种基于案例驱动的多模态方法,用于取证分析中的仇恨与威胁检测。所提出的框架明确确定文本证据的存在及其来源,区分嵌入文本、相关上下文文本和仅包含图像的证据。基于识别的证据配置,该框架选择性地应用文本分析、多模态融合或仅基于图像的语义推理,使用具有视觉变换器骨干网络(ViT)的视觉语言模型。通过根据证据的可用性进行推理,该方法反映了取证决策过程,提高了证据的可追溯性,并避免了不合理的模态假设。在取证风格的图像证据上的实验评估表明,在异构证据场景中表现出一致且可解释的行为。
cs.CV / 2 / 2604.08610

A Semi-Automated Framework for 3D Reconstruction of Medieval Manuscript Miniatures

中世纪手稿微型画三维重建的半自动化框架
Pallotto, Riccardo, Feliciati, Pierluigi, Uricchio, Tiberio
Abstract
This paper presents a semi-automated framework for transforming two-dimensional miniatures from medieval manuscripts into three-dimensional digital models suitable for extended reality (XR), tactile 3D~printing, and web-based visualization. We evaluate seven image-to-3D methods (TripoSR, SF3D, SPAR3D, TRELLIS, Wonder3D, SAM~3D, Hi3DGen) on 69~manuscript figures from two collections using rendering-based metrics (Silhouette IoU, LPIPS, CLIP~Score) and volumetric measures (Depth Range Ratio, watertight percentage), revealing a trade-off between volumetric expansion and geometric fidelity. Hi3DGen balances topological quality with rich surface detail through its normal bridging approach, making it a good starting point for expert refinement. Our pipeline combines SAM segmentation, Hi3DGen mesh generation, expert refinement in ZBrush, and AI-assisted texturing. Two case studies on Gothic illuminations from the Decretum Gratiani (Vatican Library) and Renaissance miniatures by Giulio Clovio demonstrate applicability across artistic traditions. The resulting models can support WebXR visualization, AR overlay on physical manuscripts, and tactile 3D~prints for visually impaired users.
Chinese Translation
本文提出了一种半自动化框架,将中世纪手稿中的二维微型画转换为适用于扩展现实(XR)、触觉3D打印及基于网络的可视化的三维数字模型。我们评估了七种图像到三维的方法(TripoSR、SF3D、SPAR3D、TRELLIS、Wonder3D、SAM 3D、Hi3DGen),针对来自两个收藏的69幅手稿人物,采用基于渲染的指标(轮廓交并比(Silhouette IoU)、LPIPS、CLIP评分)和体积测量(深度范围比、封闭率)进行评估,揭示了体积扩展与几何保真度之间的权衡。Hi3DGen通过其法线桥接方法在拓扑质量与丰富的表面细节之间取得平衡,成为专家精细调整的良好起点。我们的流程结合了SAM分割、Hi3DGen网格生成、ZBrush中的专家精修以及AI辅助纹理处理。两项案例研究分别针对梵蒂冈图书馆的《Decretum Gratiani》哥特式插图和Giulio Clovio的文艺复兴微型画,展示了该方法在不同艺术传统中的适用性。最终生成的模型支持WebXR可视化、物理手稿上的增强现实叠加以及为视障用户提供的触觉3D打印。
cs.CV / 3 / 2604.08613

ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction

ViSAGE @ NTIRE 2026 视频显著性预测挑战赛
Wang, Kun, Hu, Yupeng, Li, Zhiran, Liu, Hao, Xiang, Qianlong, Nie, Liqiang
Abstract
In this report, we present our champion solution for the NTIRE 2026 Challenge on Video Saliency Prediction held in conjunction with CVPR 2026. To exploit complementary inductive biases for video saliency, we propose Video Saliency with Adaptive Gated Experts (ViSAGE), a multi-expert ensemble framework. Each specialized decoder performs adaptive gating and modulation to refine spatio-temporal features. The complementary predictions from different experts are then fused at inference. ViSAGE thereby aggregates diverse inductive biases to capture complex spatio-temporal saliency cues in videos. On the Private Test set, ViSAGE ranked first on two out of four evaluation metrics, and outperformed most competing solutions on the other two metrics, demonstrating its effectiveness and generalization ability. Our code has been released at https://github.com/iLearn-Lab/CVPRW26-ViSAGE.
Chinese Translation
在本报告中,我们展示了在与 CVPR 2026 会议同时举行的 NTIRE 2026 视频显著性预测挑战赛中获得冠军的解决方案。为了利用视频显著性的互补归纳偏差,我们提出了具有自适应门控专家的 视频显著性(ViSAGE)多专家集成框架。每个专门的解码器执行自适应门控和调制,以细化时空特征。不同专家的互补预测在推理时被融合。因此,ViSAGE 聚合了多样的归纳偏差,以捕捉视频中复杂的时空显著性线索。在私有测试集上,ViSAGE 在四个评估指标中的两个上排名第一,并在其他两个指标上超越了大多数竞争解决方案,证明了其有效性和泛化能力。我们的代码已发布在 https://github.com/iLearn-Lab/CVPRW26-ViSAGE。
cs.CV / 4 / 2604.08615

MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments

MARINER:一个基于3E驱动的开放水域环境中细粒度感知与复杂推理的基准
Liao, Xingming, Chen, Ning, Shu, Muying, Yin, Yunpeng, Zeng, Peijian, Wang, Zhuowei, Lin, Nankai, Cheng, Lianglun
Abstract
Fine-grained visual understanding and high-level reasoning in real-world open-water environments remain under-explored due to the lack of dedicated benchmarks. We introduce MARINER, a comprehensive benchmark built under the novel Entity-Environment-Event (3E) paradigm. MARINER contains 16,629 multi-source maritime images with 63 fine-grained vessel categories, diverse adverse environments, and 5 typical dynamic maritime incidents, covering fine-grained classification, object detection, and visual question answering tasks. We conduct extensive evaluations on mainstream Multimodal Large language models (MLLMs) and establish baselines, revealing that even advanced models struggle with fine-grained discrimination and causal reasoning in complex marine scenes. As a dedicated maritime benchmark, MARINER fills the gap of realistic and cognitive-level evaluation for maritime multimodal understanding, and promotes future research on robust vision-language models for open-water applications. Appendix and supplementary materials are available at https://lxixim.github.io/MARINER.
Chinese Translation
由于缺乏专门的基准,现实世界开放水域环境中的细粒度视觉理解和高层次推理仍然未得到充分探索。我们介绍了MARINER,这是一个基于新颖的实体-环境-事件(3E)范式构建的综合基准。MARINER包含16,629张多源海洋图像,涵盖63个细粒度船舶类别、多样的恶劣环境和5种典型动态海洋事件,涉及细粒度分类、目标检测和视觉问答任务。我们对主流的多模态大型语言模型(MLLMs)进行了广泛评估并建立了基准线,结果显示即使是先进模型在复杂海洋场景中的细粒度区分和因果推理方面也面临挑战。作为一个专门的海洋基准,MARINER填补了海洋多模态理解的现实和认知层面评估的空白,并推动了未来在开放水域应用中对稳健视觉-语言模型的研究。附录和补充材料可在 https://lxixim.github.io/MARINER 获取。
cs.CV / 5 / 2604.08626

WildDet3D: Scaling Promptable 3D Detection in the Wild

WildDet3D:面向野外环境的可提示三维检测的规模化
Huang, Weikai, Zhang, Jieyu, Li, Sijun, Jia, Taoyang, Duan, Jiafei, Cheng, Yunqian, Cho, Jaemin, Wallingford, Mattew, Soraki, Rustin, Kim, Chris Dongjoo, Clay, Donovan, Anderson, Taira, Han, Winson, Farhadi, Ali, Hariharan, Bharath, Ren, Zhongzheng, Krishna, Ranjay
Abstract
Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection--recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks: existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues, and current 3D datasets cover only narrow categories in controlled environments, limiting open-world transfer. In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state-of-the-art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D on our newly introduced WildDet3D-Bench with text and box prompts. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).
Chinese Translation
从单张图像中理解三维物体是空间智能的基石。实现这一目标的关键步骤是单目三维物体检测——从输入的RGB图像中恢复物体的尺寸、位置和朝向。为了在开放世界中具备实用性,该检测器必须能够超越封闭类别的限制,支持多样的提示模态,并在可用时利用几何线索。当前进展受制于两个瓶颈:现有方法通常仅针对单一提示类型设计,缺乏整合额外几何线索的机制;且现有三维数据集仅涵盖受控环境中的狭窄类别,限制了开放世界的迁移能力。本文针对这两方面进行了突破。首先,我们提出了WildDet3D,一种统一的几何感知架构,原生支持文本、点和框提示,并能在推理时融合辅助深度信号。其次,我们构建了WildDet3D-Data,这是迄今为止最大的开放三维检测数据集,通过从现有二维标注生成候选三维框并仅保留人工验证的样本,涵盖了超过100万张图像和1.35万个类别,场景多样且贴近真实世界。WildDet3D在多个基准和设置中均创下新纪录。在开放世界设置中,使用文本和框提示,在我们新引入的WildDet3D-Bench上分别达到22.6和24.8的AP3D。在Omni3D数据集上,文本和框提示分别实现34.2和36.4的AP3D。在零样本评估中,于Argoverse 2和ScanNet上分别取得40.3和48.9的ODS。值得注意的是,推理时融合深度线索带来了显著的性能提升,平均提升达+20.7 AP。
cs.CV / 6 / 2604.08641

On Semiotic-Grounded Interpretive Evaluation of Generative Art

基于符号学的生成艺术解释性评估
Jiang, Ruixiang, Chen, Changwen
Abstract
Interpretation is essential to deciphering the language of art: audiences communicate with artists by recovering meaning from visual artifacts. However, current Generative Art (GenArt) evaluators remain fixated on surface-level image quality or literal prompt adherence, failing to assess the deeper symbolic or abstract meaning intended by the creator. We address this gap by formalizing a Peircean computational semiotic theory that models Human-GenArt Interaction (HGI) as cascaded semiosis. This framework reveals that artistic meaning is conveyed through three modes - iconic, symbolic, and indexical - yet existing evaluators operate heavily within the iconic mode, remaining structurally blind to the latter two. To overcome this structural blindness, we propose SemJudge. This evaluator explicitly assesses symbolic and indexical meaning in HGI via a Hierarchical Semiosis Graph (HSG) that reconstructs the meaning-making process from prompt to generated artifact. Extensive quantitative experiments show that SemJudge aligns more closely with human judgments than prior evaluators on an interpretation-intensive fine-art benchmark. User studies further demonstrate that SemJudge produces deeper, more insightful artistic interpretations, thereby paving the way for GenArt to move beyond the generation of "pretty" images toward a medium capable of expressing complex human experience. Project page: https://github.com/songrise/SemJudge.
Chinese Translation
解释对于解读艺术语言至关重要:观众通过从视觉艺术品中恢复意义与艺术家进行交流。然而,目前的生成艺术(Generative Art, GenArt)评估者仍然专注于表面图像质量或字面提示的遵循,未能评估创作者所意图的更深层次的象征或抽象意义。我们通过形式化一个皮尔斯(Peirce)计算符号学理论来填补这一空白,该理论将人类与生成艺术的互动(Human-GenArt Interaction, HGI)建模为级联符号化。该框架揭示了艺术意义是通过三种模式传达的——图像模式(iconic)、象征模式(symbolic)和指示模式(indexical),而现有的评估者主要在图像模式内运作,结构性地忽视了后两者。为了克服这种结构性盲点,我们提出了SemJudge。该评估器通过一个分层符号化图(Hierarchical Semiosis Graph, HSG)明确评估HGI中的象征和指示意义,该图重建了从提示到生成艺术品的意义构建过程。大量定量实验表明,SemJudge在一个以解释为重点的美术基准上与人类判断的吻合度高于以往的评估者。用户研究进一步表明,SemJudge能够产生更深刻、更具洞察力的艺术解释,从而为生成艺术超越单纯生成“美丽”图像,迈向能够表达复杂人类体验的媒介铺平道路。项目页面:https://github.com/songrise/SemJudge。
cs.CV / 7 / 2604.08645

3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

3D-VCD:通过视觉对比解码缓解3D-LLM具身智能体中的幻觉问题
Ogunleye, Makanjuola, Abdelrahman, Eman, Lourentzou, Ismini
Abstract
Large multimodal models are increasingly used as the reasoning core of embodied agents operating in 3D environments, yet they remain prone to hallucinations that can produce unsafe and ungrounded decisions. Existing inference-time hallucination mitigation methods largely target 2D vision-language settings and do not transfer to embodied 3D reasoning, where failures arise from object presence, spatial layout, and geometric grounding rather than pixel-level inconsistencies. We introduce 3D-VCD, the first inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents. 3D-VCD constructs a distorted 3D scene graph by applying semantic and geometric perturbations to object-centric representations, such as category substitutions and coordinate or extent corruption. By contrasting predictions under the original and distorted 3D contexts, our method suppresses tokens that are insensitive to grounded scene evidence and are therefore likely driven by language priors. We evaluate 3D-VCD on the 3D-POPE and HEAL benchmarks and show that it consistently improves grounded reasoning without any retraining, establishing inference-time contrastive decoding over structured 3D representations as an effective and practical route to more reliable embodied intelligence.
Chinese Translation
大型多模态模型正日益作为在3D环境中运行的具身智能体的推理核心,然而它们仍易产生幻觉,导致不安全且无依据的决策。现有的推理时幻觉缓解方法主要针对二维视觉-语言场景,难以迁移至具身3D推理,其中失败原因更多源于对象存在、空间布局和几何定位,而非像素级不一致。我们提出3D-VCD,这是首个用于3D具身智能体幻觉缓解的推理时视觉对比解码框架。3D-VCD通过对以对象为中心的表示施加语义和几何扰动(如类别替换、坐标或范围破坏),构建扭曲的3D场景图。通过对比原始与扭曲3D上下文下的预测结果,本方法抑制对有据场景证据不敏感、可能受语言先验驱动的词元。我们在3D-POPE和HEAL基准上评估3D-VCD,结果表明其在无需重新训练的情况下持续提升了有据推理能力,确立了基于结构化3D表示的推理时对比解码作为实现更可靠具身智能的有效且实用途径。
cs.CV / 8 / 2604.08646

InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation

InsEdit:基于指令的视觉编辑方法及其通过数据高效的视频扩散模型适配
Rao, Zhefan, Zou, Bin, Che, Haoxuan, He, Xuanhua, Choi, Chong Hou, Li, Yanheng, Liu, Rui, Chen, Qifeng
Abstract
Instruction-based video editing is a natural way to control video content with text, but adapting a video generation model into an editor usually appears data-hungry. At the same time, high-quality video editing data remains scarce. In this paper, we show that a video generation backbone can become a strong video editor without large scale video editing data. We present InsEdit, an instruction-based editing model built on HunyuanVideo-1.5. InsEdit combines a visual editing architecture with a video data pipeline based on Mutual Context Attention (MCA), which creates aligned video pairs where edits can begin in the middle of a clip rather than only from the first frame. With only O(100)K video editing data, InsEdit achieves state-of-the-art results among open-source methods on our video instruction editing benchmarks. In addition, because our training recipe also includes image editing data, the final model supports image editing without any modification.
Chinese Translation
基于指令的视频编辑是一种通过文本控制视频内容的自然方式,但将视频生成模型转变为编辑器通常需要大量数据支持。同时,高质量的视频编辑数据依然稀缺。本文展示了视频生成骨干网络无需大规模视频编辑数据即可成为强大的视频编辑器。我们提出了InsEdit,一种基于指令的编辑模型,构建于HunyuanVideo-1.5之上。InsEdit结合了视觉编辑架构与基于互上下文注意力(Mutual Context Attention, MCA)的视频数据管道,该管道生成对齐的视频对,使编辑可以从视频片段中间开始,而不仅限于第一帧。仅使用约10万条视频编辑数据,InsEdit在我们的视频指令编辑基准测试中达到了开源方法中的最先进水平。此外,由于训练方案中也包含图像编辑数据,最终模型无需任何修改即可支持图像编辑。
cs.CV / 9 / 2604.08694

EfficientSign: An Attention-Enhanced Lightweight Architecture for Indian Sign Language Recognition

EfficientSign:一种增强注意力的轻量级印度手语识别架构
Gupta, Rishabh, Nalla, Shravya R.
Abstract
How do you build a sign language recognizer that works on a phone? That question drove this work. We built EfficientSign, a lightweight model which takes EfficientNet-B0 and focuses on two attention modules (Squeeze-and-Excitation for channel focus, and a spatial attention layer that focuses on the hand gestures). We tested it against five other approaches on 12,637 images of Indian Sign Language alphabets, all 26 classes, using 5-fold cross-validation. EfficientSign achieves the accuracy of 99.94% (+/-0.05%), which matches the performance of ResNet18's 99.97% accuracy, but with 62% fewer parameters (4.2M vs 11.2M). We also experimented with feeding deep features (1,280-dimensional vectors pulled from EfficientNet-B0's pooling layer) into classical classifiers. SVM achieved the accuracy of 99.63%, Logistic Regression achieved the accuracy of 99.03% and KNN achieved accuracy of 96.33%. All of these blow past the 92% that SURF-based methods managed on a similar dataset back in 2015. Our results show that attention-enhanced learning model provides an efficient and deployable solution for ISL recognition without requiring a massive model or hand-tuned feature pipelines anymore.
Chinese Translation
如何构建一个可以在手机上运行的手语识别器?这个问题驱动了本研究的开展。我们构建了EfficientSign,这是一种轻量级模型,基于EfficientNet-B0,并专注于两个注意力模块(通道聚焦的Squeeze-and-Excitation和专注于手势的空间注意力层)。我们在12,637张印度手语字母的图像上对其进行了测试,涵盖所有26个类别,并使用5折交叉验证与其他五种方法进行了比较。EfficientSign达到了99.94%(+/-0.05%)的准确率,这与ResNet18的99.97%准确率相当,但参数量减少了62%(4.2M对比11.2M)。我们还尝试将深度特征(从EfficientNet-B0的池化层提取的1280维向量)输入到经典分类器中。支持向量机(SVM)达到了99.63%的准确率,逻辑回归(Logistic Regression)达到了99.03%的准确率,K近邻(KNN)达到了96.33%的准确率。所有这些结果均超过了2015年SURF基础方法在类似数据集上实现的92%。我们的结果表明,增强注意力的学习模型为印度手语(ISL)识别提供了一种高效且可部署的解决方案,无需再依赖庞大的模型或手动调优的特征管道。
cs.CV / 10 / 2604.08701

Unified Multimodal Uncertain Inference

统一的多模态不确定推理
Zhang, Dengjia, Martin, Alexander, Jurayj, William, Murray, Kenton, Van Durme, Benjamin, Kriz, Reno
Abstract
We introduce Unified Multimodal Uncertain Inference (UMUI), a multimodal inference task spanning text, audio, and video, where models must produce calibrated probability estimates of hypotheses conditioned on a premise in any modality or combination. While uncertain inference has been explored in text, extension to other modalities has been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning in or across other modalities. To address this, we curate a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings, and additionally evaluate on existing text and audio benchmarks. We introduce CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration and distribution-based confidence probing to produce calibrated predictions. We demonstrate that our 3B-parameter model achieves equivalent or stronger performance than baselines up to 32B parameters across all modalities.
Chinese Translation
我们介绍了统一的多模态不确定推理(Unified Multimodal Uncertain Inference, UMUI),这是一项涵盖文本、音频和视频的多模态推理任务,模型必须根据任何模态或组合中的前提生成经过校准的假设概率估计。尽管不确定推理在文本中已有研究,但扩展到其他模态的工作仅限于单模态的二元蕴含判断,缺乏在其他模态中或跨模态进行细粒度概率推理的框架。为了解决这个问题,我们整理了一个经过人工标注的评估集,涵盖音频、视觉和视听设置中的标量概率判断,并在现有的文本和音频基准上进行额外评估。我们引入了CLUE(Calibrated Latent Uncertainty Estimation),它结合了自一致教师校准和基于分布的置信度探测,以生成经过校准的预测。我们展示了我们的3B参数模型在所有模态上实现了与高达32B参数的基线相当或更强的性能。
cs.CV / 11 / 2604.08704

RS-OVC: Open-Vocabulary Counting for Remote-Sensing Data

RS-OVC:面向遥感数据的开放词汇计数方法
Shor, Tamir, Leifman, George, Beryozkin, Genady
Abstract
Object-Counting for remote-sensing (RS) imagery is attracting increasing research interest due to its crucial role in a wide and diverse set of applications. While several promising methods for RS object-counting have been proposed, existing methods focus on a closed, pre-defined set of object classes. This limitation necessitates costly re-annotation and model re-training to adapt current approaches for counting of novel objects that have not been seen during training, and severely inhibits their application in dynamic, real-world monitoring scenarios. To address this gap, in this work we propose RS-OVC - the first Open Vocabulary Counting (OVC) model for Remote-Sensing and aerial imagery. We show that our model is capable of accurate counting of novel object classes, that were unseen during training, based solely on textual and/or visual conditioning.
Chinese Translation
遥感(RS)影像中的目标计数因其在广泛且多样的应用中的关键作用而受到越来越多的研究关注。尽管已有若干针对遥感目标计数的有效方法被提出,但现有方法均聚焦于一个封闭的、预定义的目标类别集合。这一限制导致在计数训练中未见过的新颖目标时,必须进行昂贵的重新标注和模型再训练,严重制约了其在动态真实监测场景中的应用。为填补这一空白,本文提出了RS-OVC——首个面向遥感和航空影像的开放词汇计数(Open Vocabulary Counting, OVC)模型。我们展示了该模型能够基于文本和/或视觉条件,实现对训练期间未见过的新颖目标类别的准确计数。
cs.CV / 12 / 2604.08711

Deep Learning-Based Tracking and Lineage Reconstruction of Ligament Breakup

基于深度学习的韧带破裂跟踪与谱系重建
Ahire, Vrushank, Kurumanghat, Vivek, Ganaie, Mudasir, Kabiraj, Lipika
Abstract
The disintegration of liquid sheets into ligaments and droplets involves highly transient, multi-scale dynamics that are difficult to quantify from high-speed shadowgraphy images. Identifying droplets, ligaments, and blobs formed during breakup, along with tracking across frames, is essential for spray analysis. However, conventional multi-object tracking frameworks impose strict one-to-one temporal associations and cannot represent one-to-many fragmentation events. In this study, we present a two-stage deep learning framework for object detection and temporal relationship modeling across frames. The framework captures ligament deformation, fragmentation, and parent-child lineage during liquid sheet disintegration. In the first stage, a Faster R-CNN with a ResNet-50 backbone and Feature Pyramid Network detects and classifies ligaments and droplets in high-speed shadowgraphy recordings of an impinging Carbopol gel jet. A morphology-preserving synthetic data generation strategy augments the training set without introducing physically implausible configurations, achieving a held-out F1 score of up to 0.872 across fourteen original-to-synthetic configurations. In the second stage, a Transformer-augmented multilayer perceptron classifies inter-frame associations into continuation, fragmentation (one-to-many), and non-association using physics-informed geometric features. Despite severe class imbalance, the model achieves 86.1% accuracy, 93.2% precision, and perfect recall (1.00) for fragmentation events. Together, the framework enables automated reconstruction of fragmentation trees, preservation of parent-child lineage, and extraction of breakup statistics such as fragment multiplicity and droplet size distributions. By explicitly identifying children droplets formed from ligament fragmentation, the framework provides automated analysis of the primary atomization mode.
Chinese Translation
液体薄膜分解为韧带和液滴的过程涉及高度瞬态的多尺度动力学,这些动态难以通过高速影像技术进行量化。识别在破裂过程中形成的液滴、韧带和斑点,并在帧间进行跟踪,对于喷雾分析至关重要。然而,传统的多物体跟踪框架施加了严格的一对一时间关联,无法表示一对多的碎裂事件。在本研究中,我们提出了一种两阶段的深度学习框架,用于物体检测和跨帧时间关系建模。该框架捕捉液体薄膜分解过程中的韧带变形、碎裂及其亲子谱系。在第一阶段,采用具有ResNet-50主干和特征金字塔网络的Faster R-CNN检测和分类在冲击的Carbopol凝胶喷流的高速影像记录中的韧带和液滴。一种保持形态的合成数据生成策略在不引入物理上不合理的配置的情况下增强了训练集,实现了在十四个原始到合成配置中的持出F1分数高达0.872。在第二阶段,增强型变换器的多层感知器使用物理信息几何特征将帧间关联分类为延续、碎裂(一对多)和非关联。尽管类别严重失衡,该模型在碎裂事件上实现了86.1%的准确率、93.2%的精确率和完美的召回率(1.00)。综上所述,该框架实现了碎裂树的自动重建,保持亲子谱系,并提取碎裂统计数据,如碎片多重性和液滴尺寸分布。通过明确识别由韧带碎裂形成的子液滴,该框架提供了对主要雾化模式的自动分析。
cs.CV / 13 / 2604.08716

What Matters in Virtual Try-Off? Dual-UNet Diffusion Model For Garment Reconstruction

虚拟试穿中的关键因素:用于服装重建的双UNet扩散模型
Truong, Loc-Phat, Madadi, Meysam, Escalera, Sergio
Abstract
Virtual Try-On (VTON) has seen rapid advancements, providing a strong foundation for generative fashion tasks. However, the inverse problem, Virtual Try-Off (VTOFF)-aimed at reconstructing the canonical garment from a draped-on image-remains a less understood domain, distinct from the heavily researched field of VTON. In this work, we seek to establish a robust architectural foundation for VTOFF by studying and adapting various diffusion-based strategies from VTON and general Latent Diffusion Models (LDMs). We focus our investigation on the Dual-UNet Diffusion Model architecture and analyze three axes of design: (i) Generation Backbone: comparing Stable Diffusion variants; (ii) Conditioning: ablating different mask designs, masked/unmasked inputs for image conditioning, and the utility of high-level semantic features; and (iii) Losses and Training Strategies: evaluating the impact of the auxiliary attention-based loss, perceptual objectives and multi-stage curriculum schedules. Extensive experiments reveal trade-offs across various configuration options. Evaluated on VITON-HD and DressCode datasets, our framework achieves state-of-the-art performance with a drop of 9.5\% on the primary metric DISTS and competitive performance on LPIPS, FID, KID, and SSIM, providing both stronger baselines and insights to guide future Virtual Try-Off research.
Chinese Translation
虚拟试穿(VTON)技术迅速发展,为生成性时尚任务奠定了坚实基础。然而,逆问题——虚拟试穿后(VTOFF),旨在从穿着图像中重建标准服装——仍然是一个较少被理解的领域,与广泛研究的VTON领域截然不同。在本研究中,我们旨在通过研究和调整来自VTON及一般潜在扩散模型(LDMs)的各种基于扩散的策略,为VTOFF建立一个稳健的架构基础。我们将研究重点放在双UNet扩散模型架构上,并分析三个设计轴心:(i)生成主干:比较稳定扩散变体;(ii)条件设计:消融不同的掩码设计、图像条件下的有掩码/无掩码输入,以及高层语义特征的实用性;(iii)损失与训练策略:评估辅助基于注意力的损失、感知目标和多阶段课程安排的影响。大量实验揭示了各种配置选项之间的权衡。在VITON-HD和DressCode数据集上的评估表明,我们的框架在主要指标DISTS上实现了9.5%的性能提升,并在LPIPS、FID、KID和SSIM上表现出竞争力,提供了更强的基准和指导未来虚拟试穿研究的见解。
cs.CV / 14 / 2604.08718

Accelerating Transformer-Based Monocular SLAM via Geometric Utility Scoring

通过几何效用评分加速基于变换器的单目SLAM
Xiong, Xinmiao, Liu, Bangya, Wang, Hao, Li, Dayou, Chen, Nuo, Feng, Andrew, Ding, Mingyu, Banerjee, Suman, Zhou, Yang, Fan, Zhiwen
Abstract
Geometric Foundation Models (GFMs) have recently advanced monocular SLAM by providing robust, calibration-free 3D priors. However, deploying these models on dense video streams introduces significant computational redundancy. Current GFM-based SLAM systems typically rely on post hoc keyframe selection. Because of this, they must perform expensive dense geometric decoding simply to determine whether a frame contains novel geometry, resulting in late rejection and wasted computation. To mitigate this inefficiency, we propose LeanGate, a lightweight feed-forward frame-gating network. LeanGate predicts a geometric utility score to assess a frame's mapping value prior to the heavy GFM feature extraction and matching stages. As a predictive plug-and-play module, our approach bypasses over 90% of redundant frames. Evaluations on standard SLAM benchmarks demonstrate that LeanGate reduces tracking FLOPs by more than 85% and achieves a 5x end-to-end throughput speedup. Furthermore, it maintains the tracking and mapping accuracy of dense baselines.
Chinese Translation
几何基础模型(Geometric Foundation Models, GFMs)最近通过提供稳健的、无需校准的3D先验,推动了单目SLAM的发展。然而,在密集视频流上部署这些模型会引入显著的计算冗余。目前基于GFM的SLAM系统通常依赖于事后关键帧选择。因此,它们必须执行昂贵的密集几何解码,仅仅为了确定一个帧是否包含新颖的几何信息,这导致了延迟拒绝和计算浪费。为了解决这一低效问题,我们提出了LeanGate,一个轻量级的前馈帧门控网络。LeanGate预测几何效用评分,以评估帧的映射价值,从而在繁重的GFM特征提取和匹配阶段之前进行筛选。作为一个可预测的即插即用模块,我们的方法绕过了超过90%的冗余帧。在标准SLAM基准测试中的评估表明,LeanGate将跟踪的FLOPs减少了85%以上,并实现了5倍的端到端吞吐量加速。此外,它还保持了与密集基线相当的跟踪和映射精度。
cs.CV / 15 / 2604.08719

LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving

LMGenDrive:桥接多模态理解与生成世界建模以实现端到端驾驶
Shao, Hao, Wang, Letian, Zhou, Yang, Hu, Yuxuan, Zong, Zhuofan, Waslander, Steven L., Zhan, Wei, Li, Hongsheng
Abstract
Recent years have seen remarkable progress in autonomous driving, yet generalization to long-tail and open-world scenarios remains a major bottleneck for large-scale deployment. To address this challenge, some works use LLMs and VLMs for vision-language understanding and reasoning, enabling vehicles to interpret rare and safety-critical situations when generating actions. Others study generative world models to capture the spatio-temporal evolution of driving scenes, allowing agents to imagine possible futures before acting. Inspired by human intelligence, which unifies understanding and imagination, we explore a unified model for autonomous driving. We present LMGenDrive, the first framework that combines LLM-based multimodal understanding with generative world models for end-to-end closed-loop driving. Given multi-view camera inputs and natural-language instructions, LMGenDrive generates both future driving videos and control signals. This design provides complementary benefits: video prediction improves spatio-temporal scene modeling, while the LLM contributes strong semantic priors and instruction grounding from large-scale pretraining. We further propose a progressive three-stage training strategy, from vision pretraining to multi-step long-horizon driving, to improve stability and performance. LMGenDrive supports both low-latency online planning and autoregressive offline video generation. Experiments show that it significantly outperforms prior methods on challenging closed-loop benchmarks, with clear gains in instruction following, spatio-temporal understanding, and robustness to rare scenarios. These results suggest that unifying multimodal understanding and generation is a promising direction for more generalizable and robust embodied decision-making systems.
Chinese Translation
近年来,自动驾驶领域取得了显著进展,但在长尾和开放世界场景中的泛化能力仍然是大规模部署的主要瓶颈。为了解决这一挑战,一些研究利用大型语言模型(LLMs)和视觉语言模型(VLMs)进行视觉-语言理解和推理,使得车辆能够在生成动作时解释稀有和安全关键的情况。另一些研究则探讨生成世界模型,以捕捉驾驶场景的时空演变,使得智能体能够在行动之前想象可能的未来。受到人类智能的启发,人类智能将理解与想象统一起来,我们探索了一种用于自动驾驶的统一模型。我们提出了LMGenDrive,这是第一个将基于LLM的多模态理解与生成世界模型结合起来的端到端闭环驾驶框架。LMGenDrive在接收多视角摄像头输入和自然语言指令的情况下,能够生成未来的驾驶视频和控制信号。这一设计提供了互补的好处:视频预测改善了时空场景建模,而LLM则通过大规模预训练提供了强大的语义先验和指令基础。我们进一步提出了一种渐进的三阶段训练策略,从视觉预训练到多步长时间驾驶,以提高稳定性和性能。LMGenDrive支持低延迟在线规划和自回归离线视频生成。实验表明,它在具有挑战性的闭环基准测试中显著优于先前的方法,在指令遵循、时空理解和对稀有场景的鲁棒性方面都有明显提升。这些结果表明,将多模态理解与生成统一起来是实现更具泛化能力和鲁棒性的具身决策系统的一个有前途的方向。
cs.CV / 16 / 2604.08722

AI Driven Soccer Analysis Using Computer Vision

基于计算机视觉的人工智能驱动足球分析
Manchado, Adrian, Cellio, Tanner, Keane, Jonathan, Wang, Yiyang
Abstract
Sport analysis is crucial for team performance since it provides actionable data that can inform coaching decisions, improve player performance, and enhance team strategies. To analyze more complex features from game footage, a computer vision model can be used to identify and track key entities from the field. We propose the use of an object detection and tracking system to predict player positioning throughout the game. To translate this to positioning in relation to the field dimensions, we use a point prediction model to identify key points on the field and combine these with known field dimensions to extract actual distances. For the player-identification model, object detection models like YOLO and Faster R-CNN are evaluated on the accuracy of our custom video footage using multiple different evaluation metrics. The goal is to identify the best model for object identification to obtain the most accurate results when paired with SAM2 (Segment Anything Model 2) for segmentation and tracking. For the key point detection model, we use a CNN model to find consistent locations in the soccer field. Through homography, the positions of points and objects in the camera perspective will be transformed to a real-ground perspective. The segmented player masks from SAM2 are transformed from camera perspective to real-world field coordinates through homography, regardless of camera angle or movement. The transformed real-world coordinates can be used to calculate valuable tactical insights including player speed, distance covered, positioning heatmaps, and more complex team statistics, providing coaches and players with actionable performance data previously unavailable from standard video analysis.
Chinese Translation
运动分析对于提升团队表现至关重要,因为它提供了可操作的数据,能够指导教练决策、提升球员表现并优化团队战术。为了从比赛录像中分析更复杂的特征,可以采用计算机视觉模型来识别和跟踪场上的关键实体。我们提出使用目标检测与跟踪系统来预测比赛过程中球员的位置。为了将其转换为相对于场地尺寸的位置,我们使用关键点预测模型识别场地上的关键点,并结合已知的场地尺寸提取实际距离。对于球员识别模型,我们评估了YOLO和Faster R-CNN等目标检测模型在我们定制视频素材上的准确性,采用多种不同的评估指标。目标是确定最佳的目标识别模型,以便在与SAM2(Segment Anything Model 2)结合进行分割和跟踪时获得最准确的结果。对于关键点检测模型,我们采用卷积神经网络(CNN)模型来寻找足球场上的一致位置。通过单应性变换,将摄像机视角中的点和物体位置转换为真实地面视角。SAM2分割出的球员掩膜通过单应性变换从摄像机视角转换为真实世界的场地坐标,无论摄像机角度或运动如何变化。转换后的真实世界坐标可用于计算宝贵的战术洞察,包括球员速度、跑动距离、位置热力图及更复杂的团队统计数据,为教练和球员提供了标准视频分析中无法获得的可操作性能数据。
cs.CV / 17 / 2604.08741

LPLCv2: An Expanded Dataset for Fine-Grained License Plate Legibility Classification

LPLCv2:用于细粒度车牌可读性分类的扩展数据集
Wojcik, Lucas, Machoski, Eduardo A. F., Nascimento Jr., Eduil, Laroca, Rayson, Menotti, David
Abstract
Modern Automatic License Plate Recognition (ALPR) systems achieve outstanding performance in controlled, well-defined scenarios. However, large-scale real-world usage remains challenging due to low-quality imaging devices, compression artifacts, and suboptimal camera installation. Identifying illegible license plates (LPs) has recently become feasible through a dedicated benchmark; however, its impact has been limited by its small size and annotation errors. In this work, we expand the original benchmark to over three times the size with two extra capture days, revise its annotations and introduce novel labels. LP-level annotations include bounding boxes, text, and legibility level, while vehicle-level annotations comprise make, model, type, and color. Image-level annotations feature camera identity, capture conditions (e.g., rain and faulty cameras), acquisition time, and day ID. We present a novel training procedure featuring an Exponential Moving Average-based loss function and a refined learning rate scheduler, addressing common mistakes in testing. These improvements enable a baseline model to achieve an 89.5% F1-score on the test set, considerably surpassing the previous state of the art. We further introduce a novel protocol to explicitly addresses camera contamination between training and evaluation splits, where results show a small impact. Dataset and code are publicly available at https://github.com/lmlwojcik/LPLCv2-Dataset.
Chinese Translation
现代自动车牌识别(ALPR)系统在受控且定义明确的场景中表现出色。然而,由于低质量成像设备、压缩伪影以及摄像头安装不佳,大规模的实际应用仍面临挑战。近期通过专门的基准测试,实现了对不可读车牌(LP)的识别,但其影响受限于数据集规模较小及标注错误。在本工作中,我们将原始基准数据集扩展至三倍以上,增加了两天的采集数据,修订了标注并引入了新的标签。车牌级标注包括边界框、文本及可读性等级,车辆级标注涵盖品牌、型号、类型和颜色。图像级标注则包含摄像头身份、采集条件(如雨天及故障摄像头)、采集时间和日期ID。我们提出了一种基于指数移动平均(Exponential Moving Average)的损失函数及改进的学习率调度器的新训练方法,有效纠正了测试中的常见错误。该改进使基线模型在测试集上达到89.5%的F1分数,显著超越了先前的最先进水平。我们还引入了一种新协议,明确解决训练与评估划分中摄像头污染问题,结果表明其影响较小。数据集及代码已公开,地址为:https://github.com/lmlwojcik/LPLCv2-Dataset。
cs.CV / 18 / 2604.08760

SIC3D: Style Image Conditioned Text-to-3D Gaussian Splatting Generation

SIC3D:基于风格图像条件的文本到三维高斯点云生成
He, Ming, Chen, Zhixiang, Maddock, Steve
Abstract
Recent progress in text-to-3D object generation enables the synthesis of detailed geometry from text input by leveraging 2D diffusion models and differentiable 3D representations. However, the approaches often suffer from limited controllability and texture ambiguity due to the limitation of the text modality. To address this, we present SIC3D, a controllable image-conditioned text-to-3D generation pipeline with 3D Gaussian Splatting (3DGS). There are two stages in SIC3D. The first stage generates the 3D object content from text with a text-to-3DGS generation model. The second stage transfers style from a reference image to the 3DGS. Within this stylization stage, we introduce a novel Variational Stylized Score Distillation (VSSD) loss to effectively capture both global and local texture patterns while mitigating conflicts between geometry and appearance. A scaling regularization is further applied to prevent the emergence of artifacts and preserve the pattern from the style image. Extensive experiments demonstrate that SIC3D enhances geometric fidelity and style adherence, outperforming prior approaches in both qualitative and quantitative evaluations.
Chinese Translation
近年来,文本到三维物体生成的进展使得通过利用二维扩散模型和可微分三维表示,从文本输入合成细致几何成为可能。然而,由于文本模态的局限性,这些方法常常面临可控性不足和纹理模糊的问题。为此,我们提出了SIC3D,一种基于三维高斯点云(3D Gaussian Splatting,3DGS)的可控图像条件文本到三维生成流程。SIC3D包含两个阶段:第一阶段通过文本到3DGS生成模型从文本生成三维物体内容;第二阶段将参考图像的风格迁移到3DGS。在风格迁移阶段,我们引入了一种新颖的变分风格化评分蒸馏(Variational Stylized Score Distillation,VSSD)损失,有效捕捉全局与局部纹理模式,同时缓解几何与外观之间的冲突。此外,进一步应用缩放正则化以防止伪影产生并保持风格图像的纹理模式。大量实验表明,SIC3D在几何保真度和风格一致性方面均优于现有方法,且在定性和定量评估中表现出色。
cs.CV / 19 / 2604.08761

State Space Models are Effective Sign Language Learners: Exploiting Phonological Compositionality for Vocabulary-Scale Recognition

状态空间模型是有效的手语学习者:利用音韵组合性进行词汇规模识别
Cheng, Bryan, Jin, Austin, Zhang, Jasper
Abstract
Sign language recognition suffers from catastrophic scaling failure: models achieving high accuracy on small vocabularies collapse at realistic sizes. Existing architectures treat signs as atomic visual patterns, learning flat representations that cannot exploit the compositional structure of sign languages-systematically organized from discrete phonological parameters (handshape, location, movement, orientation) reused across the vocabulary. We introduce PHONSSM, enforcing phonological decomposition through anatomically-grounded graph attention, explicit factorization into orthogonal subspaces, and prototypical classification enabling few-shot transfer. Using skeleton data alone on the largest ASL dataset ever assembled (5,565 signs), PHONSSM achieves 72.1% on WLASL2000 (+18.4pp over skeleton SOTA), surpassing most RGB methods without video input. Gains are most dramatic in the few-shot regime (+225% relative), and the model transfers zero-shot to ASL Citizen, exceeding supervised RGB baselines. The vocabulary scaling bottleneck is fundamentally a representation learning problem, solvable through compositional inductive biases mirroring linguistic structure.
Chinese Translation
手语识别面临灾难性的规模失效:在小词汇上取得高准确率的模型在实际规模下崩溃。现有架构将手势视为原子视觉模式,学习的平面表示无法利用手语的组合结构——这些结构是由可在词汇中重复使用的离散音韵参数(手型、位置、运动、方向)系统性组织而成。我们提出了PHONSSM,通过解剖学基础的图注意力、显式分解为正交子空间以及原型分类来强制音韵分解,从而实现少量样本迁移。仅使用骨架数据,在有史以来最大的美国手语(ASL)数据集中(5,565个手势),PHONSSM在WLASL2000上达到了72.1%(比骨架的SOTA高出18.4个百分点),超越了大多数不使用视频输入的RGB方法。在少量样本的情况下,增益最为显著(相对提高225%),该模型在零样本情况下迁移至ASL Citizen,超越了监督RGB基线。词汇规模瓶颈本质上是一个表示学习问题,可以通过反映语言结构的组合归纳偏置来解决。
cs.CV / 20 / 2604.08762

InstrAct: Towards Action-Centric Understanding in Instructional Videos

InstrAct:面向教学视频的动作中心理解
Yang, Zhuoyi, Yu, Jiapeng, Tan, Reuben, Li, Boyang, Xu, Huijuan
Abstract
Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, which remains challenging for current Video Foundation Models (VFMs). This difficulty stems from noisy web supervision and a pervasive "static bias", where models rely on objects rather than motion cues. To address this, we propose InstrAction, a pretraining framework for instructional videos' action-centric representations. We first introduce a data-driven strategy, which filters noisy captions and generates action-centric hard negatives to disentangle actions from objects during contrastive learning. At the visual feature level, an Action Perceiver extracts motion-relevant tokens from redundant video encodings. Beyond contrastive learning, we introduce two auxiliary objectives: Dynamic Time Warping alignment (DTW-Align) for modeling sequential temporal structure, and Masked Action Modeling (MAM) for strengthening cross-modal grounding. Finally, we introduce the InstrAct Bench to evaluate action-centric understanding, where our method consistently outperforms state-of-the-art VFMs on semantic reasoning, procedural logic, and fine-grained retrieval tasks.
Chinese Translation
理解教学视频需要识别细粒度的动作并建模它们的时间关系,这对当前的视频基础模型(VFMs)仍然是一个挑战。这一困难源于嘈杂的网络监督和普遍存在的“静态偏差”,即模型依赖于物体而非运动线索。为了解决这个问题,我们提出了InstrAction,一个用于教学视频动作中心表示的预训练框架。我们首先介绍了一种数据驱动策略,该策略过滤嘈杂的字幕并生成动作中心的困难负样本,以便在对比学习中将动作与物体分离。在视觉特征层面,动作感知器(Action Perceiver)从冗余的视频编码中提取与运动相关的标记。除了对比学习,我们还引入了两个辅助目标:动态时间规整对齐(Dynamic Time Warping alignment,DTW-Align)用于建模序列时间结构,以及掩码动作建模(Masked Action Modeling,MAM)用于增强跨模态对齐。最后,我们引入了InstrAct基准测试,以评估动作中心理解,在语义推理、过程逻辑和细粒度检索任务上,我们的方法始终优于最先进的VFMs。
cs.CV / 21 / 2604.08810

R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII

R2G:从 RTL 到 GDSII 的多视角电路图基准套件
Zhou, Zewei, Zou, Jiajun, Zhang, Jiajia, Yang, Ao, He, Ruichao, Zhou, Haozheng, Liu, Ao, Liu, Jiawei, Jin, Leilei, Shen, Shan, Sun, Daying
Abstract
Graph neural networks (GNNs) are increasingly applied to physical design tasks such as congestion prediction and wirelength estimation, yet progress is hindered by inconsistent circuit representations and the absence of controlled evaluation protocols. We present R2G (RTL-to-GDSII), a multi-view circuit-graph benchmark suite that standardizes five stage-aware views with information parity (every view encodes the same attribute set, differing only in where features attach) over 30 open-source IP cores (up to $10^6$ nodes/edges). R2G provides an end-to-end DEF-to-graph pipeline spanning synthesis, placement, and routing stages, together with loaders, unified splits, domain metrics, and reproducible baselines. By decoupling representation choice from model choice, R2G isolates a confound that prior EDA and graph-ML benchmarks leave uncontrolled. In systematic studies with GINE, GAT, and ResGatedGCN, we find: (i) view choice dominates model choice, with Test R$^2$ varying by more than 0.3 across representations for a fixed GNN; (ii) node-centric views generalize best across both placement and routing; and (iii) decoder-head depth (3--4 layers) is the primary accuracy driver, turning divergent training into near-perfect predictions (R$^2$$>$0.99). Code and datasets are available at https://github.com/ShenShan123/R2G.
Chinese Translation
图神经网络(GNNs)越来越多地应用于物理设计任务,如拥塞预测和布线长度估计,但由于电路表示不一致和缺乏受控评估协议,进展受到阻碍。我们提出了 R2G(RTL-to-GDSII),这是一个多视角电路图基准套件,标准化了五个阶段感知视图,确保信息一致性(每个视图编码相同的属性集,仅在特征附加的位置上有所不同),涵盖 30 个开源 IP 核(最多 $10^6$ 节点/边)。R2G 提供了一个端到端的 DEF 到图的管道,涵盖综合、放置和布线阶段,并配备加载器、统一分割、领域指标和可重复的基准线。通过将表示选择与模型选择解耦,R2G 隔离了先前 EDA 和图机器学习基准未受控的混淆因素。在与 GINE、GAT 和 ResGatedGCN 的系统研究中,我们发现:(i)视图选择主导模型选择,对于固定的 GNN,不同表示的测试 R$^2$ 变化超过 0.3;(ii)以节点为中心的视图在放置和布线中具有最佳的泛化能力;(iii)解码器头深度(3--4 层)是主要的准确性驱动因素,使得不同的训练转变为近乎完美的预测(R$^2$$>$0.99)。代码和数据集可在 https://github.com/ShenShan123/R2G 获取。
cs.CV / 22 / 2604.08815

Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models

通过上下文对齐的视觉-语言模型实现负责任的多模态医学推理
Khan, Sumra, Chhabriya, Sagar, Zafar, Aizan, Arif, Sheeraz, Muneer, Amgad, Zafar, Anas, Raza, Shaina, Qureshi, Rizwan
Abstract
Medical vision-language models (VLMs) show strong performance on radiology tasks but often produce fluent yet weakly grounded conclusions due to over-reliance on a dominant modality. We introduce a context-aligned reasoning framework that enforces agreement across heterogeneous clinical evidence before generating diagnostic conclusions. The proposed approach augments a frozen VLM with structured contextual signals derived from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. Instead of producing free-form responses, the model generates structured outputs containing supporting evidence, uncertainty estimates, limitations, and safety notes. We observe that auxiliary signals alone provide limited benefit; performance gains emerge only when these signals are integrated through contextual verification. Experiments on chest X-ray datasets demonstrate that context alignment improves discriminative performance (AUC 0.918 to 0.925) while maintaining calibrated uncertainty. The framework also substantially reduces hallucinated keywords (1.14 to 0.25) and produces more concise reasoning explanations (19.4 to 15.3 words) without increasing model confidence (0.70 to 0.68). Cross-dataset evaluation on CheXpert further reveals that modality informativeness significantly influences reasoning behavior. These results suggest that enforcing multi-evidence agreement improves both reliability and trustworthiness in medical multimodal reasoning, while preserving the underlying model architecture.
Chinese Translation
医学视觉-语言模型(VLMs)在放射学任务中表现出强大的性能,但由于过度依赖主导模态,常常产生流畅但基础薄弱的结论。我们提出了一种上下文对齐的推理框架,在生成诊断结论之前,强制不同临床证据之间达成一致。该方法通过来自放射组学统计、可解释性激活和词汇基础语义线索的结构化上下文信号增强了一个冻结的VLM。模型生成的不是自由形式的响应,而是包含支持证据、不确定性估计、局限性和安全注意事项的结构化输出。我们观察到,仅依靠辅助信号的好处有限;只有在通过上下文验证整合这些信号时,性能提升才会显现。对胸部X光数据集的实验表明,上下文对齐提高了区分性能(AUC从0.918提升至0.925),同时保持了校准的不确定性。该框架还显著减少了幻觉关键词(从1.14降至0.25),并生成了更简洁的推理解释(从19.4词降至15.3词),而模型信心并未增加(从0.70降至0.68)。在CheXpert数据集上的跨数据集评估进一步揭示了模态信息量显著影响推理行为。这些结果表明,强制多证据一致性提高了医学多模态推理的可靠性和可信度,同时保留了基础模型架构。
cs.CV / 23 / 2604.08819

SenBen: Sensitive Scene Graphs for Explainable Content Moderation

SenBen:用于可解释内容审核的敏感场景图
Akyon, Fatih Cagatay, Temizel, Alptekin
Abstract
Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes, 28 attributes including affective states such as pain, fear, aggression, and distress, 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at $7.6\times$ faster inference and $16\times$ less GPU memory.
Chinese Translation
内容审核系统将图像分类为安全或不安全,但缺乏空间定位和可解释性:它们无法说明检测到的敏感行为是什么、涉及谁以及发生在何处。我们引入了Sensitive Benchmark(SenBen),这是首个针对敏感内容的大规模场景图基准,包含来自157部电影的13,999帧,采用Visual Genome风格的场景图注释(25个物体类别,28个属性包括疼痛、恐惧、攻击和痛苦等情感状态,14种谓词)以及涵盖5个类别的16个敏感标签。我们通过多任务训练方案,将前沿的视觉语言模型(VLM)蒸馏为一个紧凑的2.41亿参数学生模型,该方案通过基于后缀的物体身份、词汇感知召回(Vocabulary-Aware Recall, VAR)损失以及采用非对称损失的解耦Query2Label标签头,解决了自回归场景图生成中的词汇不平衡问题,使SenBen召回率较标准交叉熵训练提升了6.4个百分点。在有空间定位的场景图指标上,我们的学生模型优于除Gemini模型外的所有评估VLM和所有商业安全API,同时在所有模型中实现了最高的物体检测和图像描述得分,推理速度提升7.6倍,GPU内存消耗减少16倍。
cs.CV / 24 / 2604.08836

CatalogStitch: Dimension-Aware and Occlusion-Preserving Object Compositing for Catalog Image Generation

CatalogStitch:面向目录图像生成的尺寸感知与遮挡保留对象合成方法
Jain, Sanyam, Kandari, Pragya, Singhal, Manit, Zhang, He, Kim, Soo Ye
Abstract
Generative object compositing methods have shown remarkable ability to seamlessly insert objects into scenes. However, when applied to real-world catalog image generation, these methods require tedious manual intervention: users must carefully adjust masks when product dimensions differ, and painstakingly restore occluded elements post-generation. We present CatalogStitch, a set of model-agnostic techniques that automate these corrections, enabling user-friendly content creation. Our dimension-aware mask computation algorithm automatically adapts the target region to accommodate products with different dimensions; users simply provide a product image and background, without manual mask adjustments. Our occlusion-aware hybrid restoration method guarantees pixel-perfect preservation of occluding elements, eliminating post-editing workflows. We additionally introduce CatalogStitch-Eval, a 58-example benchmark covering aspect-ratio mismatch and occlusion-heavy catalog scenarios, together with supplementary PDF and HTML viewers. We evaluate our techniques with three state-of-the-art compositing models (ObjectStitch, OmniPaint, and InsertAnything), demonstrating consistent improvements across diverse catalog scenarios. By reducing manual intervention and automating tedious corrections, our approach transforms generative compositing into a practical, human-friendly tool for production catalog workflows.
Chinese Translation
生成式对象合成方法在无缝插入对象到场景中表现出显著能力。然而,当应用于真实世界的目录图像生成时,这些方法需要繁琐的人工干预:用户必须在产品尺寸不同时仔细调整遮罩,并在生成后费力地恢复被遮挡的元素。我们提出了CatalogStitch,一套与模型无关的技术,自动完成这些修正,实现用户友好的内容创作。我们的尺寸感知遮罩计算算法能够自动调整目标区域以适应不同尺寸的产品;用户只需提供产品图像和背景,无需手动调整遮罩。我们的遮挡感知混合恢复方法保证了遮挡元素的像素级完美保留,消除了后期编辑流程。我们还引入了CatalogStitch-Eval,一个包含58个示例的基准,涵盖了宽高比不匹配和遮挡严重的目录场景,并配备了附加的PDF和HTML查看器。我们在三种最先进的合成模型(ObjectStitch、OmniPaint和InsertAnything)上评估了我们的技术,展示了在多样目录场景中的持续改进。通过减少人工干预和自动化繁琐修正,我们的方法将生成式合成转变为适用于生产目录工作流的实用且人性化工具。
cs.CV / 25 / 2604.08847

DeFakeQ: Enabling Real-Time Deepfake Detection on Edge Devices via Adaptive Bidirectional Quantization

DeFakeQ:通过自适应双向量化实现边缘设备上的实时深度伪造检测
Li, Xiangyu, Sun, Yujing, Zheng, Yuhang, Ma, Yuexin, Lam, Kwok-Yan
Abstract
Deepfake detection has become a fundamental component of modern media forensics. Despite significant progress in detection accuracy, most existing methods remain computationally intensive and parameter-heavy, limiting their deployment on resource-constrained edge devices that require real-time, on-site inference. This limitation is particularly critical in an era where mobile devices are extensively used for media-centric applications, including online payments, virtual meetings, and social networking. Meanwhile, due to the unique requirement of capturing extremely subtle forgery artifacts for deepfake detection, state-of-the-art quantization techniques usually underperform for such a challenging task. These fine-grained cues are highly sensitive to model compression and can be easily degraded during quantization, leading to noticeable performance drops. This challenge highlights the need for quantization strategies specifically designed to preserve the discriminative features essential for reliable deepfake detection. To address this gap, we propose DefakeQ, the first quantization framework tailored for deepfake detectors, enabling real-time deployment on edge devices. Our approach introduces a novel adaptive bidirectional compression strategy that simultaneously leverages feature correlations and eliminates redundancy, achieving an effective balance between model compactness and detection performance. Extensive experiments across five benchmark datasets and eleven state-of-the-art backbone detectors demonstrate that DeFakeQ consistently surpasses existing quantization and model compression baselines. Furthermore, we deploy DefakeQ on mobile devices in real-world scenarios, demonstrating its capability for real-time deepfake detection and its practical applicability in edge environments.
Chinese Translation
深度伪造检测已成为现代媒体取证的基础组成部分。尽管检测准确率取得了显著进展,但大多数现有方法仍计算量大且参数繁多,限制了其在资源受限且需要实时现场推理的边缘设备上的部署。随着移动设备在在线支付、虚拟会议和社交网络等媒体中心应用中的广泛使用,这一限制尤为关键。与此同时,由于深度伪造检测对捕捉极其细微的伪造痕迹有独特需求,最先进的量化技术通常难以胜任这一挑战性任务。这些细粒度线索对模型压缩高度敏感,易在量化过程中被削弱,导致性能显著下降。该挑战凸显了专门设计以保留可靠深度伪造检测所必需判别特征的量化策略的必要性。为填补这一空白,我们提出了DeFakeQ,这是首个针对深度伪造检测器量身定制的量化框架,实现了在边缘设备上的实时部署。我们的方法引入了一种新颖的自适应双向压缩策略,既利用特征相关性又消除冗余,有效平衡了模型紧凑性与检测性能。在五个基准数据集和十一种最先进主干检测器上的大量实验表明,DeFakeQ持续超越现有量化和模型压缩基线。此外,我们在移动设备的真实场景中部署了DeFakeQ,展示了其实时深度伪造检测能力及在边缘环境中的实际应用价值。
cs.CV / 26 / 2604.08858

BIAS: A Biologically Inspired Algorithm for Video Saliency Detection

BIAS:一种生物启发的动态视频显著性检测算法
Zhang, Zhao-ji, Li, Ya-tang
Abstract
We present BIAS, a fast, biologically inspired model for dynamic visual saliency detection in continuous video streams. Building on the Itti--Koch framework, BIAS incorporates a retina-inspired motion detector to extract temporal features, enabling the generation of saliency maps that integrate both static and motion information. Foci of attention (FOAs) are identified using a greedy multi-Gaussian peak-fitting algorithm that balances winner-take-all competition with information maximization. BIAS detects salient regions with millisecond-scale latency and outperforms heuristic-based approaches and several deep-learning models on the DHF1K dataset, particularly in videos dominated by bottom-up attention. Applied to traffic accident analysis, BIAS demonstrates strong real-world utility, achieving state-of-the-art performance in cause-effect recognition and anticipating accidents up to 0.72 seconds before manual annotation with reliable accuracy. Overall, BIAS bridges biological plausibility and computational efficiency to achieve interpretable, high-speed dynamic saliency detection.
Chinese Translation
我们提出了BIAS,一种快速的生物启发模型,用于连续视频流中的动态视觉显著性检测。基于Itti-Koch框架,BIAS结合了受视网膜启发的运动检测器,以提取时间特征,从而生成整合静态和运动信息的显著性图。注意焦点(FOA)通过贪婪的多高斯峰拟合算法进行识别,该算法在赢家通吃的竞争与信息最大化之间取得平衡。BIAS以毫秒级延迟检测显著区域,并在DHF1K数据集上超越了基于启发式的方法和几种深度学习模型,特别是在以自下而上注意力为主导的视频中。应用于交通事故分析,BIAS展示了强大的现实世界实用性,在因果识别和在手动标注前最多提前0.72秒预测事故方面实现了最先进的性能,并具有可靠的准确性。总体而言,BIAS在生物合理性与计算效率之间架起了桥梁,实现了可解释的高速动态显著性检测。
cs.CV / 27 / 2604.08877

Harnessing Weak Pair Uncertainty for Text-based Person Search

利用弱配对不确定性进行基于文本的人物搜索
Sun, Jintao, Zheng, Zhedong, Ding, Gangyi
Abstract
In this paper, we study the text-based person search, which is to retrieve the person of interest via natural language description. Prevailing methods usually focus on the strict one-to-one correspondence pair matching between the visual and textual modality, such as contrastive learning. However, such a paradigm unintentionally disregards the weak positive image-text pairs, which are of the same person but the text descriptions are annotated from different views (cameras). To take full use of weak positives, we introduce an uncertainty-aware method to explicitly estimate image-text pair uncertainty, and incorporate the uncertainty into the optimization procedure in a smooth manner. Specifically, our method contains two modules: uncertainty estimation and uncertainty regularization. (1) Uncertainty estimation is to obtain the relative confidence on the given positive pairs; (2) Based on the predicted uncertainty, we propose the uncertainty regularization to adaptively adjust loss weight. Additionally, we introduce a group-wise image-text matching loss to further facilitate the representation space among the weak pairs. Compared with existing methods, the proposed method explicitly prevents the model from pushing away potentially weak positive candidates. Extensive experiments on three widely-used datasets, .e.g, CUHK-PEDES, RSTPReid and ICFG-PEDES, verify the mAP improvement of our method against existing competitive methods +3.06%, +3.55% and +6.94%, respectively.
Chinese Translation
在本文中,我们研究了基于文本的人物搜索,即通过自然语言描述检索感兴趣的人物。现有的方法通常侧重于视觉和文本模态之间严格的一对一对应配对匹配,例如对比学习。然而,这种范式无意中忽视了弱正样本图像-文本对,这些对属于同一个人,但文本描述是从不同视角(摄像头)进行标注的。为了充分利用弱正样本,我们引入了一种不确定性感知的方法,以显式估计图像-文本对的不确定性,并将不确定性平滑地纳入优化过程。具体而言,我们的方法包含两个模块:不确定性估计和不确定性正则化。(1) 不确定性估计用于获取给定正样本对的相对置信度;(2) 基于预测的不确定性,我们提出不确定性正则化以自适应调整损失权重。此外,我们引入了一种组内图像-文本匹配损失,以进一步促进弱样本之间的表征空间。与现有方法相比,所提出的方法显式防止模型将潜在的弱正候选样本推远。在三个广泛使用的数据集上进行的广泛实验,例如 CUHK-PEDES、RSTPReid 和 ICFG-PEDES,验证了我们的方法在 mAP 上相较于现有竞争方法的提升,分别为 +3.06%、+3.55% 和 +6.94%。
cs.CV / 28 / 2604.08881

Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance

Precise Shield:通过神经元级引导解释与校准视觉-语言大模型(VLLM)的安全性
Shi, Enyi, Shen, Fei, Miao, Shuyi, Zhu, Linxia, Shao, Pengyang, Tang, Jinhui, Chua, Tat-Seng
Abstract
In real-world deployments, Vision-Language Large Models (VLLMs) face critical challenges from multilingual and multimodal composite attacks: harmful images paired with low-resource language texts can easily bypass defenses designed for high-resource language scenarios, exposing structural blind spots in current cross-lingual and cross-modal safety methods. This raises a mechanistic question: where is safety capability instantiated within the model, and how is it distributed across languages and modalities? Prior studies on pure-text LLMs have identified cross-lingual shared safety neurons, suggesting that safety may be governed by a small subset of critical neurons. Leveraging this insight, we propose Precise Shield, a two-stage framework that first identifies safety neurons by contrasting activation patterns between harmful and benign inputs, and then constrains parameter updates strictly within this subspace via gradient masking with affecting fewer than 0.03% of parameters. This strategy substantially improves safety while preserving multilingual and multimodal generalization. Further analysis reveals a moderate overlap of safety neurons across languages and modalities, enabling zero-shot cross-lingual and cross-modal transfer of safety capabilities, and offering a new direction for neuron-level, transfer-based safety enhancement.
Chinese Translation
在实际应用中,视觉-语言大模型(VLLM)面临来自多语言和多模态复合攻击的严峻挑战:有害图像与低资源语言文本的组合能够轻易绕过为高资源语言场景设计的防御措施,暴露出现有跨语言和跨模态安全方法的结构性盲点。这引发了一个机制性问题:安全能力在模型中具体体现于何处?它如何分布于不同语言和模态之间?此前针对纯文本大型语言模型(LLM)的研究发现了跨语言共享的安全神经元,表明安全性可能由一小部分关键神经元控制。基于此洞见,我们提出了Precise Shield,一种两阶段框架:首先通过对比有害与无害输入的激活模式识别安全神经元;随后通过梯度掩蔽严格限制参数更新仅在该子空间内,影响参数比例低于0.03%。该策略在显著提升安全性的同时,保持了多语言和多模态的泛化能力。进一步分析表明,安全神经元在语言和模态间存在适度重叠,使得安全能力能够实现零样本跨语言和跨模态迁移,开辟了基于神经元级、迁移驱动的安全增强新方向。
cs.CV / 29 / 2604.08884

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

HM-Bench:用于高光谱遥感的多模态大语言模型综合基准
Zhang, Xinyu, Mai, Zurong, Li, Qingmei, Liao, Zjin, Wen, Yibin, Chen, Yuhang, Fan, Xiaoya, Ho, Chan Tsz, Tianyuan, Bi, Liang, Haoyuan, Su, Ruifeng, Qian, Zihao, Zheng, Juepeng, Huang, Jianxi, Lu, Yutong, Fu, Haohuan
Abstract
While multimodal large language models (MLLMs) have made significant strides in natural image understanding, their ability to perceive and reason over hyperspectral image (HSI) remains underexplored, which is a vital modality in remote sensing. The high dimensionality and intricate spectral-spatial properties of HSI pose unique challenges for models primarily trained on RGB data.To address this gap, we introduce Hyperspectral Multimodal Benchmark (HM-Bench), the first benchmark designed specifically to evaluate MLLMs in HSI understanding. We curate a large-scale dataset of 19,337 question-answer pairs across 13 task categories, ranging from basic perception to spectral reasoning. Given that existing MLLMs are not equipped to process raw hyperspectral cubes natively, we propose a dual-modality evaluation framework that transforms HSI data into two complementary representations: PCA-based composite images and structured textual reports. This approach facilitates a systematic comparison of different representation for model performance. Extensive evaluations on 18 representative MLLMs reveal significant difficulties in handling complex spatial-spectral reasoning tasks. Furthermore, our results demonstrate that visual inputs generally outperform textual inputs, highlighting the importance of grounding in spectral-spatial evidence for effective HSI understanding. Dataset and appendix can be accessed at https://github.com/HuoRiLi-Yu/HM-Bench.
Chinese Translation
尽管多模态大语言模型(MLLMs)在自然图像理解方面取得了显著进展,但它们在高光谱图像(HSI)上的感知和推理能力仍然未得到充分探索,而高光谱图像在遥感中是一个至关重要的模态。HSI的高维度和复杂的光谱-空间特性为主要在RGB数据上训练的模型带来了独特的挑战。为了解决这一问题,我们引入了高光谱多模态基准(HM-Bench),这是第一个专门设计用于评估MLLMs在HSI理解方面的基准。我们策划了一个包含19,337个问答对的大规模数据集,涵盖了从基本感知到光谱推理的13个任务类别。鉴于现有的MLLMs无法原生处理原始高光谱立方体,我们提出了一种双模态评估框架,将HSI数据转换为两种互补的表示:基于PCA的复合图像和结构化文本报告。这种方法便于对不同表示进行系统比较,以评估模型性能。在对18个代表性MLLMs的广泛评估中,发现它们在处理复杂的空间-光谱推理任务时存在显著困难。此外,我们的结果表明,视觉输入通常优于文本输入,突显了在有效的HSI理解中基于光谱-空间证据的基础的重要性。数据集和附录可在https://github.com/HuoRiLi-Yu/HM-Bench访问。
cs.CV / 30 / 2604.08893

Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS)

具有注意力门和多尺度空间注意力机制的自适应双残差 U-Net (ADRUwAMS)
Suraki, Mohsen Yaghoubi
Abstract
Glioma is a harmful brain tumor that requires early detection to ensure better health results. Early detection of this tumor is key for effective treatment and requires an automated segmentation process. However, it is a challenging task to find tumors due to tumor characteristics like location and size. A reliable method to accurately separate tumor zones from healthy tissues is deep learning models, which have shown promising results over the last few years. In this research, an Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS) is introduced. This model is an innovative combination of adaptive dual residual networks, attention mechanisms, and multiscale spatial attention. The dual adaptive residual network architecture captures high-level semantic and intricate low-level details from brain images, ensuring precise segmentation of different tumor parts, types, and hard regions. The attention gates use gating and input signals to compute attention coefficients for the input features, and multiscale spatial attention generates scaled attention maps and combines these features to hold the most significant information about the brain tumor. We trained the model for 200 epochs using the ReLU activation function on BraTS 2020 and BraTS 2019 datasets. These improvements resulted in high accuracy for tumor detection and segmentation on BraTS 2020, achieving dice scores of 0.9229 for the whole tumor, 0.8432 for the tumor core, and 0.8004 for the enhancing tumor.
Chinese Translation
胶质瘤是一种有害的脑肿瘤,需要早期检测以确保更好的健康结果。该肿瘤的早期检测是有效治疗的关键,且需要自动化的分割过程。然而,由于肿瘤的特征如位置和大小,发现肿瘤是一项具有挑战性的任务。深度学习模型被认为是一种可靠的方法,可以准确地将肿瘤区域与健康组织分离,近年来已显示出良好的效果。在本研究中,提出了一种具有注意力门和多尺度空间注意力机制的自适应双残差 U-Net (ADRUwAMS)。该模型是自适应双残差网络、注意力机制和多尺度空间注意力的创新结合。双自适应残差网络架构能够从脑部图像中捕捉高层语义和复杂的低层细节,确保对不同肿瘤部分、类型和困难区域的精确分割。注意力门利用门控和输入信号计算输入特征的注意力系数,而多尺度空间注意力生成缩放的注意力图,并结合这些特征以保留关于脑肿瘤的最重要信息。我们在 BraTS 2020 和 BraTS 2019 数据集上使用 ReLU 激活函数训练该模型 200 个周期。这些改进使得在 BraTS 2020 上的肿瘤检测和分割达到了高准确率,整体肿瘤的 Dice 分数为 0.9229,肿瘤核心为 0.8432,增强肿瘤为 0.8004。
cs.CV / 31 / 2604.08896

GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

GeoMMBench 和 GeoMMAgent:迈向地球科学与遥感领域的专家级多模态智能
Xiao, Aoran, Cheng, Shihao, Xu, Yonghao, Ren, Yexian, Chen, Hongruixuan, Yokoya, Naoto
Abstract
Recent advances in multimodal large language models (MLLMs) have accelerated progress in domain-oriented AI, yet their development in geoscience and remote sensing (RS) remains constrained by distinctive challenges: wide-ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. To bridge these gaps, we introduce GeoMMBench, a comprehensive multimodal question-answering benchmark covering diverse RS disciplines, sensors, and tasks, enabling broader and more rigorous evaluation than prior benchmarks. Using GeoMMBench, we assess 36 open-source and proprietary large language models, uncovering systematic deficiencies in domain knowledge, perceptual grounding, and reasoning--capabilities essential for expert-level geospatial interpretation. Beyond evaluation, we propose GeoMMAgent, a multi-agent framework that strategically integrates retrieval, perception, and reasoning through domain-specific RS models and tools. Extensive experimental results demonstrate that GeoMMAgent significantly outperforms standalone LLMs, underscoring the importance of tool-augmented agents for dynamically tackling complex geoscience and RS challenges.
Chinese Translation
近期多模态大型语言模型(MLLMs)的进展加速了面向领域的人工智能发展,但在地球科学和遥感(RS)领域的应用仍受到独特挑战的限制:广泛的学科知识、异构传感器模态以及任务的碎片化。为了解决这些问题,我们推出了 GeoMMBench,这是一个涵盖多样化遥感学科、传感器和任务的综合多模态问答基准,能够比以往的基准提供更广泛和更严格的评估。通过使用 GeoMMBench,我们评估了 36 个开源和专有的大型语言模型,发现它们在领域知识、感知基础和推理能力等方面存在系统性缺陷,这些能力对于专家级地理空间解释至关重要。除了评估,我们还提出了 GeoMMAgent,一个多智能体框架,通过领域特定的遥感模型和工具战略性地整合检索、感知和推理。大量实验结果表明,GeoMMAgent 显著优于独立的 LLM,强调了工具增强型智能体在动态应对复杂地球科学和遥感挑战中的重要性。
cs.CV / 32 / 2604.08903

Fast Model-guided Instance-wise Adaptation Framework for Real-world Pansharpening with Fidelity Constraints

基于模型引导的快速实例级适应框架及其在真实场景融合约束下的全色锐化应用
Yang, Zhiqi, Xiao, Jin-Liang, Yin, Shan, Deng, Liang-Jian, Vivone, Gemine
Abstract
Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and high-resolution panchromatic (PAN) images while preserving both spectral and spatial information. Although deep learning (DL)-based pansharpening methods achieve impressive performance, they require high training cost and large datasets, and often degrade when the test distribution differs from training, limiting generalization. Recent zero-shot methods, trained on a single PAN/LRMS pair, offer strong generalization but suffer from limited fusion quality, high computational overhead, and slow convergence. To address these issues, we propose FMG-Pan, a fast and generalizable model-guided instance-wise adaptation framework for real-world pansharpening, achieving both cross-sensor generality and rapid training-inference. The framework leverages a pretrained model to guide a lightweight adaptive network through joint optimization with spectral and physical fidelity constraints. We further design a novel physical fidelity term to enhance spatial detail preservation. Extensive experiments on real-world datasets under both intra- and cross-sensor settings demonstrate state-of-the-art performance. On the WorldView-3 dataset, FMG-Pan completes training and inference for a 512x512x8 image within 3 seconds on an RTX 3090 GPU, significantly faster than existing zero-shot methods, making it suitable for practical deployment.
Chinese Translation
全色锐化旨在通过融合低分辨率多光谱(LRMS)图像和高分辨率全色(PAN)图像,生成高分辨率多光谱(HRMS)图像,同时保持光谱和空间信息。尽管基于深度学习(DL)的全色锐化方法取得了显著性能,但其训练成本高且需大量数据,且当测试分布与训练分布不同时性能往往下降,限制了泛化能力。近期的零样本方法仅基于单个PAN/LRMS对进行训练,具备较强的泛化能力,但融合质量有限、计算开销大且收敛速度慢。为解决上述问题,我们提出了FMG-Pan,一种快速且具备良好泛化能力的模型引导实例级适应框架,适用于真实场景全色锐化,实现跨传感器通用性与快速训练推理。该框架利用预训练模型指导轻量级自适应网络,通过光谱和物理保真约束的联合优化进行训练。我们进一步设计了新颖的物理保真项以增强空间细节保留。在真实数据集上的大量实验(包括传感器内和跨传感器设置)表明本方法达到了最先进的性能。在WorldView-3数据集上,FMG-Pan在RTX 3090 GPU上对512×512×8图像完成训练和推理仅需3秒,显著快于现有零样本方法,适合实际部署。
cs.CV / 33 / 2604.08915

Large-Scale Universal Defect Generation: Foundation Models and Datasets

大规模通用缺陷生成:基础模型与数据集
Fan, Yuanting, Liu, Jun, Gao, Bin-Bin, Chen, Xiaochen, Lin, Yuhuan, Dai, Zhewei, Zhan, Jiawei, Wang, Chengjie
Abstract
Existing defect/anomaly generation methods often rely on few-shot learning, which overfits to specific defect categories due to the lack of large-scale paired defect editing data. This issue is aggravated by substantial variations in defect scale and morphology, resulting in limited generalization, degraded realism, and category consistency. We address these challenges by introducing UDG, a large-scale dataset of 300K normal-abnormal-mask-caption quadruplets spanning diverse domains, and by presenting UniDG, a universal defect generation foundation model that supports both reference-based defect generation and text instruction-based defect editing without per-category fine-tuning. UniDG performs Defect-Context Editing via adaptive defect cropping and structured diptych input format, and fuses reference and target conditions through MM-DiT multimodal attention. A two-stage training strategy, Diversity-SFT followed by Consistency-RFT, further improves diversity while enhancing realism and reference consistency. Extensive experiments on MVTec-AD and VisA show that UniDG outperforms prior few-shot anomaly generation and image insertion/editing baselines in synthesis quality and downstream single- and multi-class anomaly detection/localization. Code will be available at https://github.com/RetoFan233/UniDG.
Chinese Translation
现有的缺陷/异常生成方法通常依赖于少量样本学习,由于缺乏大规模配对缺陷编辑数据,导致其过拟合于特定缺陷类别。缺陷规模和形态的显著变化加剧了这一问题,导致有限的泛化能力、降低的真实感和类别一致性。我们通过引入UDG,一个包含30万对正常-异常-掩码-标题四元组的大规模数据集,涵盖多种领域,并提出UniDG,一个通用缺陷生成基础模型,支持基于参考的缺陷生成和基于文本指令的缺陷编辑,而无需针对每个类别进行微调。UniDG通过自适应缺陷裁剪和结构化双联输入格式执行缺陷上下文编辑,并通过MM-DiT多模态注意力融合参考和目标条件。两阶段训练策略,首先是多样性微调(Diversity-SFT),然后是一致性微调(Consistency-RFT),进一步提高了多样性,同时增强了真实感和参考一致性。在MVTec-AD和VisA上的广泛实验表明,UniDG在合成质量以及下游单类和多类异常检测/定位方面优于先前的少样本异常生成和图像插入/编辑基线。代码将发布在 https://github.com/RetoFan233/UniDG。
cs.CV / 34 / 2604.08916

MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation

MV3DIS:基于3D引导的多视角掩码匹配用于零样本3D实例分割
Zhao, Yibo, Zhang, Yigong, Xie, Jin
Abstract
Conventional 3D instance segmentation methods rely on labor-intensive 3D annotations for supervised training, which limits their scalability and generalization to novel objects. Recent approaches leverage multi-view 2D masks from the Segment Anything Model (SAM) to guide the merging of 3D geometric primitives, thereby enabling zero-shot 3D instance segmentation. However, these methods typically process each frame independently and rely solely on 2D metrics, such as SAM prediction scores, to produce segmentation maps. This design overlooks multi-view correlations and inherent 3D priors, leading to inconsistent 2D masks across views and ultimately fragmented 3D segmentation. In this paper, we propose MV3DIS, a coarse-to-fine framework for zero-shot 3D instance segmentation that explicitly incorporates 3D priors. Specifically, we introduce a 3D-guided mask matching strategy that uses coarse 3D segments as a common reference to match 2D masks across views and consolidates multi-view mask consistency via 3D coverage distributions. Guided by these view-consistent 2D masks, the coarse 3D segments are further refined into precise 3D instances. Additionally, we introduce a depth consistency weighting scheme that quantifies projection reliability to suppress ambiguities from inter-object occlusions, thereby improving the robustness of 3D-to-2D correspondence. Extensive experiments on the ScanNetV2, ScanNet200, ScanNet++, Replica, and Matterport3D datasets demonstrate the effectiveness of MV3DIS, which achieves superior performance over previous methods
Chinese Translation
传统的3D实例分割方法依赖于大量人工标注的3D数据进行监督训练,这限制了其对新颖物体的扩展性和泛化能力。近期方法利用Segment Anything Model(SAM)生成的多视角2D掩码来引导3D几何基元的合并,从而实现零样本3D实例分割。然而,这些方法通常独立处理每一帧,仅依赖2D指标(如SAM预测分数)生成分割图,忽视了多视角间的关联性和固有的3D先验,导致视角间2D掩码不一致,最终造成3D分割碎片化。本文提出MV3DIS,一种显式融合3D先验的粗到细零样本3D实例分割框架。具体而言,我们引入一种3D引导的掩码匹配策略,利用粗略的3D分割作为公共参考,匹配多视角的2D掩码,并通过3D覆盖分布整合多视角掩码一致性。在这些视图一致的2D掩码引导下,粗略的3D分割被进一步细化为精确的3D实例。此外,我们设计了深度一致性加权方案,用以量化投影的可靠性,抑制物体间遮挡带来的歧义,从而提升3D到2D对应的鲁棒性。在ScanNetV2、ScanNet200、ScanNet++、Replica和Matterport3D数据集上的大量实验验证了MV3DIS的有效性,其性能优于现有方法。
cs.CV / 35 / 2604.08921

TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction

TAIHRI:面向近距离人机交互的任务感知三维人体关键点定位
Li, Ao, Ling, Yonggen, Lin, Yiyang, Wang, Yuji, Deng, Yong, Tang, Yansong
Abstract
Accurate 3D human keypoints localization is a critical technology enabling robots to achieve natural and safe physical interaction with users. Conventional 3D human keypoints estimation methods primarily focus on the whole-body reconstruction quality relative to the root joint. However, in practical human-robot interaction (HRI) scenarios, robots are more concerned with the precise metric-scale spatial localization of task-relevant body parts under the egocentric camera 3D coordinate. We propose TAIHRI, the first Vision-Language Model (VLM) tailored for close-range HRI perception, capable of understanding users' motion commands and directing the robot's attention to the most task-relevant keypoints. By quantizing 3D keypoints into a finite interaction space, TAIHRI precisely localize the 3D spatial coordinates of critical body parts by 2D keypoint reasoning via next token prediction, and seamlessly adapt to downstream tasks such as natural language control or global space human mesh recovery. Experiments on egocentric interaction benchmarks demonstrate that TAIHRI achieves superior estimation accuracy for task-critical body parts. We believe TAIHRI opens new research avenues in the field of embodied human-robot interaction. Code is available at: https://github.com/Tencent/TAIHRI.
Chinese Translation
准确的三维人体关键点定位是实现机器人与用户自然且安全的物理交互的关键技术。传统的三维人体关键点估计方法主要关注相对于根关节的全身重建质量。然而,在实际的人机交互(HRI)场景中,机器人更关注任务相关身体部位在自我视角相机三维坐标系下的精确度量级空间定位。我们提出了TAIHRI,这是首个专为近距离HRI感知设计的视觉-语言模型(Vision-Language Model,VLM),能够理解用户的动作指令并引导机器人关注最相关的任务关键点。通过将三维关键点量化到有限的交互空间,TAIHRI通过基于下一标记预测的二维关键点推理,精确定位关键身体部位的三维空间坐标,并能无缝适配自然语言控制或全局空间人体网格恢复等下游任务。在自我视角交互基准测试中,TAIHRI在任务关键身体部位的估计精度上表现出优越性。我们相信TAIHRI为具身人机交互领域开辟了新的研究方向。代码已开源,地址:https://github.com/Tencent/TAIHRI。
cs.CV / 36 / 2604.08922

Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios

降质鲁棒融合:一种高效的降质感知扩散框架用于任意降质场景下的多模态图像融合
Shi, Yu, Liu, Yu, Wu, Zhong-Cheng, Cheng, Juan, Li, Huafeng, Chen, Xun
Abstract
Complex degradations like noise, blur, and low resolution are typical challenges in real world image fusion tasks, limiting the performance and practicality of existing methods. End to end neural network based approaches are generally simple to design and highly efficient in inference, but their black-box nature leads to limited interpretability. Diffusion based methods alleviate this to some extent by providing powerful generative priors and a more structured inference process. However, they are trained to learn a single domain target distribution, whereas fusion lacks natural fused data and relies on modeling complementary information from multiple sources, making diffusion hard to apply directly in practice. To address these challenges, this paper proposes an efficient degradation aware diffusion framework for image fusion under arbitrary degradation scenarios. Specifically, instead of explicitly predicting noise as in conventional diffusion models, our method performs implicit denoising by directly regressing the fused image, enabling flexible adaptation to diverse fusion tasks under complex degradations with limited steps. Moreover, we design a joint observation model correction mechanism that simultaneously imposes degradation and fusion constraints during sampling to ensure high reconstruction accuracy. Experiments on diverse fusion tasks and degradation configurations demonstrate the superiority of the proposed method under complex degradation scenarios.
Chinese Translation
复杂的降质现象如噪声、模糊和低分辨率是现实世界图像融合任务中的典型挑战,限制了现有方法的性能和实用性。基于端到端神经网络的方法通常设计简单且推理高效,但其黑箱特性导致解释性有限。扩散(Diffusion)方法通过提供强大的生成先验和更具结构化的推理过程,在一定程度上缓解了这一问题。然而,这些方法通常训练以学习单一域的目标分布,而融合任务缺乏自然的融合数据,依赖于对多源互补信息的建模,使得扩散方法难以直接应用。为应对上述挑战,本文提出了一种高效的降质感知扩散框架,用于任意降质场景下的图像融合。具体而言,本方法不同于传统扩散模型中显式预测噪声,而是通过直接回归融合图像实现隐式去噪,从而能够在有限步数内灵活适应复杂降质条件下的多样融合任务。此外,我们设计了联合观测模型校正机制,在采样过程中同时施加降质和融合约束,确保高重建精度。在多样的融合任务和降质配置上的实验结果表明,所提方法在复杂降质场景下表现出优越性。
cs.CV / 37 / 2604.08924

Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion

定制融合:一种用于自适应多任务感知红外-可见图像融合的闭环动态网络
Yang, Zengyi, Liu, Yu, Cheng, Juan, Zhu, Zhiqin, Zhang, Yafei, Li, Huafeng
Abstract
Infrared-visible image fusion aims to integrate complementary information for robust visual understanding, but existing fusion methods struggle with simultaneously adapting to multiple downstream tasks. To address this issue, we propose a Closed-Loop Dynamic Network (CLDyN) that can adaptively respond to the semantic requirements of diverse downstream tasks for task-customized image fusion. Specifically, CLDyN introduces a closed-loop optimization mechanism that establishes a semantic transmission chain to achieve explicit feedback from downstream tasks to the fusion network through a Requirement-driven Semantic Compensation (RSC) module. The RSC module leverages a Basis Vector Bank (BVB) and an Architecture-Adaptive Semantic Injection (A2SI) block to customize the network architecture according to task requirements, thereby enabling task-specific semantic compensation and allowing the fusion network to actively adapt to diverse tasks without retraining. To promote semantic compensation, a reward-penalty strategy is introduced to reward or penalize the RSC module based on task performance variations. Experiments on the M3FD, FMB, and VT5000 datasets demonstrate that CLDyN not only maintains high fusion quality but also exhibits strong multi-task adaptability. The code is available at https://github.com/YR0211/CLDyN.
Chinese Translation
红外-可见图像融合旨在整合互补信息以实现稳健的视觉理解,但现有的融合方法在同时适应多个下游任务方面存在困难。为了解决这一问题,我们提出了一种闭环动态网络(Closed-Loop Dynamic Network, CLDyN),能够自适应地响应多样化下游任务的语义需求,实现任务定制的图像融合。具体而言,CLDyN引入了一种闭环优化机制,建立了一个语义传输链,通过需求驱动的语义补偿(Requirement-driven Semantic Compensation, RSC)模块实现从下游任务到融合网络的显式反馈。RSC模块利用基础向量库(Basis Vector Bank, BVB)和架构自适应语义注入(Architecture-Adaptive Semantic Injection, A2SI)块,根据任务需求定制网络架构,从而实现任务特定的语义补偿,使融合网络能够在不重新训练的情况下主动适应多样化任务。为了促进语义补偿,提出了一种奖励-惩罚策略,根据任务性能变化对RSC模块进行奖励或惩罚。在M3FD、FMB和VT5000数据集上的实验表明,CLDyN不仅保持了高质量的融合效果,还表现出强大的多任务适应能力。代码可在 https://github.com/YR0211/CLDyN 获取。
cs.CV / 38 / 2604.08936

M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation Model

M-IDoL:用于医学基础模型中模态特异性与多样性表示学习的信息分解方法
Liu, Yihang, Wen, Ying, Yang, Jiaxiong, Yang, Longzhen, He, Lianghua, Shen, Heng Tao
Abstract
Medical foundation models (MFMs) aim to learn universal representations from multimodal medical images that can generalize effectively to diverse downstream clinical tasks. However, most existing MFMs suffer from information ambiguity that blend multimodal representations in a single embedding space, leading to the degradation of modality specificity and diversity. In this paper, we propose M-IDoL, a self-supervised \underline{\textit{M}}FM that introduces Information Decomposition for multimodal representation Learning via two objectives: i) maximize inter-modality entropy by dispersing multimodal representation into separable Mixture-of-Experts (MoE) subspaces to achieve representation specificity across modalities; and ii) minimize intra-modality uncertainty by performing fine-grained semantic discrimination within each MoE subspace to enrich representation diversity per modality. By pre-training on 1.15 million medical images, M-IDoL i) delivers superior generalization across 21 downstream clinical tasks, outperforming 20 foundation models on five imaging modalities (e.g., X-ray, fundus, OCT, dermoscopy and pathology), and ii) learns modality-specific and diverse representations, showing clearer separation of feature cluster across modalities and finer-grained feature discrimination within each modality.
Chinese Translation
医学基础模型(Medical Foundation Models, MFMs)旨在从多模态医学影像中学习通用表示,以有效泛化到多样化的下游临床任务。然而,现有大多数MFMs存在信息模糊问题,将多模态表示混合在单一嵌入空间中,导致模态特异性和多样性的下降。本文提出了M-IDoL,一种自监督的医学基础模型,采用信息分解(Information Decomposition)进行多模态表示学习,包含两个目标:i)通过将多模态表示分散到可分离的专家混合(Mixture-of-Experts, MoE)子空间中,最大化模态间熵以实现跨模态的表示特异性;ii)通过在每个MoE子空间内执行细粒度语义判别,最小化模态内不确定性以丰富每个模态的表示多样性。通过在115万张医学影像上进行预训练,M-IDoL表现出:i)在21个下游临床任务中实现卓越的泛化能力,优于20个基础模型,涵盖五种影像模态(如X射线、眼底、光学相干断层扫描(OCT)、皮肤镜和病理学);ii)学习到模态特异且多样的表示,展现出跨模态特征簇的更清晰分离及每个模态内更细粒度的特征判别能力。
cs.CV / 39 / 2604.08943

MASS: Mesh-inellipse Aligned Deformable Surfel Splatting for Hand Reconstruction and Rendering from Egocentric Monocular Video

MASS:基于椭圆内网格对齐的可变形Surfel点云渲染用于第一人称单目视频中的手部重建与渲染
Zhu, Haoyu, Zhang, Yi, Yao, Lei, Chau, Lap-pui, Wang, Yi
Abstract
Reconstructing high-fidelity 3D hands from egocentric monocular videos remains a challenge due to the limitations in capturing high-resolution geometry, hand-object interactions, and complex objects on hands. Additionally, existing methods often incur high computational costs, making them impractical for real-time applications. In this work, we propose Mesh-inellipse Aligned deformable Surfel Splatting (MASS) to address these challenges by leveraging a deformable 2D Gaussian Surfel representation. We introduce the mesh-aligned Steiner Inellipse and fractal densification for mesh-to-surfel conversion that initiates high-resolution 2D Gaussian surfels from coarse parametric hand meshes, providing surface representation with photorealistic rendering potential. Second, we propose Gaussian Surfel Deformation, which enables efficient modeling of hand deformations and personalized features by predicting residual updates to surfel attributes and introducing an opacity mask to refine geometry and texture without adaptive density control. In addition, we propose a two-stage training strategy and a novel binding loss to improve the optimization robustness and reconstruction quality. Extensive experiments on the ARCTIC dataset, the Hand Appearance dataset, and the Interhand2.6M dataset demonstrate that our model achieves superior reconstruction performance compared to state-of-the-art methods.
Chinese Translation
从第一人称单目视频中重建高保真三维手部模型仍然具有挑战性,主要由于高分辨率几何结构捕捉的限制、手物交互复杂性以及手部复杂物体的存在。此外,现有方法通常计算开销较大,难以满足实时应用需求。本文提出了一种基于可变形二维高斯Surfel表示的Mesh-inellipse Aligned可变形Surfel点云渲染(MASS)方法,以应对上述挑战。我们引入了网格对齐的Steiner椭圆内切线和分形加密技术,实现从粗糙参数化手部网格到高分辨率二维高斯Surfel的转换,提供具有光真实感渲染潜力的表面表示。其次,提出了高斯Surfel变形方法,通过预测Surfel属性的残差更新并引入不透明度掩码,有效建模手部变形和个性化特征,无需自适应密度控制即可细化几何和纹理。此外,设计了两阶段训练策略和新颖的绑定损失函数,以提升优化的鲁棒性和重建质量。在ARCTIC数据集、Hand Appearance数据集及Interhand2.6M数据集上的大量实验表明,所提模型在重建性能上优于现有最先进方法。
cs.CV / 40 / 2604.08945

TouchAnything: Diffusion-Guided 3D Reconstruction from Sparse Robot Touches

TouchAnything:基于扩散引导的稀疏机器人触觉三维重建
Gu, Langzhe, Huang, Hung-Jui, Qadri, Mohamad, Kaess, Michael, Yuan, Wenzhen
Abstract
Accurate object geometry estimation is essential for many downstream tasks, including robotic manipulation and physical interaction. Although vision is the dominant modality for shape perception, it becomes unreliable under occlusions or challenging lighting conditions. In such scenarios, tactile sensing provides direct geometric information through physical contact. However, reconstructing global 3D geometry from sparse local touches alone is fundamentally underconstrained. We present TouchAnything, a framework that leverages a pretrained large-scale 2D vision diffusion model as a semantic and geometric prior for 3D reconstruction from sparse tactile measurements. Unlike prior work that trains category-specific reconstruction networks or learns diffusion models directly from tactile data, we transfer the geometric knowledge encoded in pretrained visual diffusion models to the tactile domain. Given sparse contact constraints and a coarse class-level description of the object, we formulate reconstruction as an optimization problem that enforces tactile consistency while guiding solutions toward shapes consistent with the diffusion prior. Our method reconstructs accurate geometries from only a few touches, outperforms existing baselines, and enables open-world 3D reconstruction of previously unseen object instances. Our project page is https://grange007.github.io/touchanything .
Chinese Translation
准确的物体几何估计对于机器人操作和物理交互等多种下游任务至关重要。尽管视觉是形状感知的主要方式,但在遮挡或光照条件复杂的情况下,视觉信息变得不可靠。在此类场景中,触觉传感通过物理接触提供直接的几何信息。然而,仅凭稀疏的局部触觉数据进行全局三维几何重建在本质上是欠约束的。我们提出了TouchAnything框架,该框架利用预训练的大规模二维视觉扩散模型作为语义和几何先验,从稀疏触觉测量中实现三维重建。不同于以往训练特定类别重建网络或直接从触觉数据学习扩散模型的方法,我们将预训练视觉扩散模型中编码的几何知识迁移到触觉领域。给定稀疏的接触约束和物体的粗略类别描述,我们将重建问题形式化为一个优化问题,该问题在保证触觉一致性的同时,引导解向与扩散先验一致的形状收敛。我们的方法仅凭少量触觉点即可重建出准确的几何形状,优于现有基线方法,并支持对先前未见过的物体实例进行开放世界三维重建。项目主页:https://grange007.github.io/touchanything 。
cs.CV / 41 / 2604.08956

Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift

低数据监督适应在领域转移下优于提示法进行云分割
Kethavath, Harshith, Hu, Weiming
Abstract
Adapting vision-language models to remote sensing imagery presents a fundamental challenge: both the visual and linguistic distributions of satellite data lie far outside natural image pretraining corpora. Despite this, prompting remains the dominant deployment paradigm, driven by the assumption that domain-specific language can guide frozen model representations toward specialized tasks. We test this assumption directly on a domain where the mismatch is prominent: cloud segmentation for satellite imagery. Using CLIPSeg on the CloudSEN12+ cloud segmentation benchmark, we evaluate 60 prompt variants spanning simple labels, domain terminology, appearance descriptors, and contextual cues, finding that every variant underperforms the zero-shot baseline (0.255 mIoU), with engineered prompts scoring as low as 0.07 mIoU. No amount of linguistic refinement bridges the gap between CLIP's natural image representations and satellite spectral imagery. In contrast, supervised fine-tuning with just 0.1% labeled data (~8 images) surpasses zero-shot performance overall, and 5-10% data recovers ~85% of maximum achievable mIoU. Full fine-tuning consistently outperforms low-rank adaptation by 0.03-0.09 mIoU, with the largest gaps for spectrally ambiguous classes, and at 0.5 to 1% labeled data, fine-tuning temporarily degrades performance on these classes before recovering, a supervision dip that aggregate mIoU can mask. For practitioners adapting vision-language models to specialized imagery, our results deliver a clear message: labeled data is not the expensive alternative to prompting; it is the worthwhile path.
Chinese Translation
将视觉-语言模型适应于遥感图像面临着一个基本挑战:卫星数据的视觉和语言分布远远超出了自然图像预训练语料库。尽管如此,提示法仍然是主导的部署范式,基于这样一个假设:特定领域的语言可以引导冻结的模型表示朝向专业任务。我们在一个不匹配显著的领域直接测试这一假设:卫星图像的云分割。使用 CLIPSeg 在 CloudSEN12+ 云分割基准上,我们评估了 60 种提示变体,涵盖简单标签、领域术语、外观描述符和上下文线索,发现每种变体的表现均低于零-shot 基线(0.255 mIoU),而工程化提示的得分低至 0.07 mIoU。无论语言如何精细化,都无法弥补 CLIP 的自然图像表示与卫星光谱图像之间的差距。相比之下,仅使用 0.1% 的标记数据(约 8 张图像)进行的监督微调总体上超过了零-shot 性能,而 5-10% 的数据恢复了约 85% 的最大可达 mIoU。完全微调始终比低秩适应高出 0.03-0.09 mIoU,光谱模糊类别的差距最大,而在 0.5% 到 1% 的标记数据下,微调在这些类别上的性能暂时下降后恢复,这种监督下降可能会被聚合 mIoU 掩盖。对于将视觉-语言模型适应于专业图像的从业者,我们的结果传达了一个明确的信息:标记数据并不是提示法的昂贵替代品;它是值得追求的路径。
cs.CV / 42 / 2604.08965

Dynamic Class-Aware Active Learning for Unbiased Satellite Image Segmentation

面向无偏卫星影像分割的动态类别感知主动学习
Kumar, Gadi Hemanth, Nambiar, Athira, Bodani, Pankaj
Abstract
Semantic segmentation of satellite imagery plays a vital role in land cover mapping and environmental monitoring. However, annotating large-scale, high-resolution satellite datasets is costly and time consuming, especially when covering vast geographic regions. Instead of randomly labeling data or exhaustively annotating entire datasets, Active Learning (AL) offers an efficient alternative by intelligently selecting the most informative samples for annotation with the help of Human-in-the-loop (HITL), thereby reducing labeling costs while maintaining high model performance. AL is particularly beneficial for large-scale or resource-constrained satellite applications, as it enables high segmentation accuracy with significantly fewer labeled samples. Despite these advantages, standard AL strategies typically rely on global uncertainty or diversity measures and lack the adaptability to target underperforming or rare classes as training progresses, leading to bias in the system. To overcome these limitations, we propose a novel adaptive acquisition function, Dynamic Class-Aware Uncertainty based Active learning (DCAU-AL) that prioritizes sample selection based on real-time class-wise performance gaps, thereby overcoming class-imbalance issue. The proposed DCAU-AL mechanism continuously tracks the performance of the segmentation per class and dynamically adjusts the sampling weights to focus on poorly performing or underrepresented classes throughout the active learning process. Extensive experiments on the OpenEarth land cover dataset show that DCAU-AL significantly outperforms existing AL methods, especially under severe class imbalance, delivering superior per-class IoU and improved annotation efficiency.
Chinese Translation
卫星影像的语义分割在土地覆盖制图和环境监测中具有重要作用。然而,大规模高分辨率卫星数据集的标注成本高且耗时,尤其是在覆盖广阔地理区域时。相比随机标注数据或对整个数据集进行全面注释,主动学习(Active Learning, AL)通过人机交互(Human-in-the-loop, HITL)智能选择最具信息量的样本进行标注,提供了一种高效的替代方案,从而在保持模型高性能的同时降低标注成本。AL对于大规模或资源受限的卫星应用尤为有益,因为它能够以显著减少的标注样本实现高分割精度。尽管如此,现有标准AL策略通常依赖全局不确定性或多样性度量,缺乏针对训练过程中表现欠佳或罕见类别的适应性,导致系统偏差。为克服这些限制,我们提出了一种新颖的自适应采样函数——基于动态类别感知不确定性的主动学习(Dynamic Class-Aware Uncertainty based Active Learning, DCAU-AL),该方法基于实时的类别性能差距优先选择样本,从而解决类别不平衡问题。所提DCAU-AL机制持续跟踪各类别的分割性能,并动态调整采样权重,聚焦于表现较差或代表性不足的类别,贯穿整个主动学习过程。在OpenEarth土地覆盖数据集上的大量实验表明,DCAU-AL在严重类别不平衡情况下显著优于现有AL方法,实现了更优的类别间交并比(IoU)和提升的标注效率。
cs.CV / 43 / 2604.08966

How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms

视频大语言模型应如何输出时间?高效时间定位范式的分析
Jin, Shengji, Zou, Yuanhao, Zhu, Victor, Ji, Zhengping, Chen, Chen
Abstract
While Multimodal Large Language Models (MLLMs) have advanced Video Temporal Grounding (VTG), existing methods often couple output paradigms with different backbones, datasets, and training protocols. This makes it challenging to isolate the specific impact of the output design. Additionally, as VTG systems are increasingly considered for resource-constrained edge deployment, the trade-off between output formulation and system-level efficiency requires systematic investigation. In this paper, we present a controlled empirical study comparing three dominant VTG output paradigms: Text Numeral Generation, Temporal Token Generation, and Continuous Temporal Decoding. We evaluate these paradigms across identical compact VLMs (SmolVLM2, FastVLM, and Molmo2) using consistent datasets and LoRA fine-tuning protocols. Evaluations on Charades-STA, QVHighlights, and YouCook2 measure both localization accuracy and system efficiency, including inference latency, training throughput, and parameter overhead. Our results demonstrate that the choice of output formulation significantly affects both grounding accuracy and computational cost, independent of model scale. Specifically, the continuous distribution paradigm consistently achieves the most favorable efficiency-accuracy trade-off on the Pareto frontier, delivering robust localization with minimal latency overhead. These findings provide objective empirical guidelines for designing efficient, deployment-ready VTG systems.
Chinese Translation
尽管多模态大语言模型(MLLMs)在视频时间定位(VTG)方面取得了进展,但现有方法通常将输出范式与不同的骨干网络、数据集和训练协议结合在一起。这使得隔离输出设计的具体影响变得困难。此外,随着VTG系统越来越多地考虑在资源受限的边缘部署中使用,输出形式与系统级效率之间的权衡需要系统性的研究。在本文中,我们呈现了一项受控的实证研究,比较三种主流的VTG输出范式:文本数字生成、时间标记生成和连续时间解码。我们在相同的紧凑型视频语言模型(SmolVLM2、FastVLM和Molmo2)上使用一致的数据集和LoRA微调协议对这些范式进行评估。在Charades-STA、QVHighlights和YouCook2上的评估测量了定位准确性和系统效率,包括推理延迟、训练吞吐量和参数开销。我们的结果表明,输出形式的选择显著影响定位准确性和计算成本,与模型规模无关。具体而言,连续分布范式在帕累托前沿上始终实现了最有利的效率-准确性权衡,以最小的延迟开销提供稳健的定位。这些发现为设计高效、适合部署的VTG系统提供了客观的实证指导。
cs.CV / 44 / 2604.08990

ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning

ActFER:通过主动工具增强视觉推理的能动面部表情识别
Liu, Shifeng, Zhang, Zhengye, Zhao, Sirui, Mao, Xinglong, Kan, Zhehan, Wei, Zhixiang, Wu, Shiwei, Fu, Chaoyou, Xu, Tong, Chen, Enhong
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have created new opportunities for facial expression recognition (FER), moving it beyond pure label prediction toward reasoning-based affect understanding. However, existing MLLM-based FER methods still follow a passive paradigm: they rely on externally prepared facial inputs and perform single-pass reasoning over fixed visual evidence, without the capability for active facial perception. To address this limitation, we propose ActFER, an agentic framework that reformulates FER as active visual evidence acquisition followed by multimodal reasoning. Specifically, ActFER dynamically invokes tools for face detection and alignment, selectively zooms into informative local regions, and reasons over facial Action Units (AUs) and emotions through a visual Chain-of-Thought. To realize such behavior, we further develop Utility-Calibrated GRPO (UC-GRPO), a reinforcement learning algorithm tailored to agentic FER. UC-GRPO uses AU-grounded multi-level verifiable rewards to densify supervision, query-conditional contrastive utility estimation to enable sample-aware dynamic credit assignment for local inspection, and emotion-aware EMA calibration to reduce noisy utility estimates while capturing emotion-wise inspection tendencies. This algorithm enables ActFER to learn both when local inspection is beneficial and how to reason over the acquired evidence. Comprehensive experiments show that ActFER trained with UC-GRPO consistently outperforms passive MLLM-based FER baselines and substantially improves AU prediction accuracy.
Chinese Translation
最近在多模态大语言模型(MLLMs)方面的进展为面部表情识别(FER)创造了新的机会,使其超越纯粹的标签预测,朝着基于推理的情感理解发展。然而,现有的基于MLLM的FER方法仍然遵循被动范式:它们依赖于外部准备的面部输入,并在固定的视觉证据上进行单次推理,缺乏主动面部感知的能力。为了解决这一局限性,我们提出了ActFER,一个将FER重新定义为主动视觉证据获取后进行多模态推理的能动框架。具体而言,ActFER动态调用工具进行面部检测和对齐,选择性地放大信息丰富的局部区域,并通过视觉思维链对面部动作单元(AUs)和情感进行推理。为了实现这种行为,我们进一步开发了效用校准的GRPO(UC-GRPO),这是一种针对能动FER的强化学习算法。UC-GRPO使用基于AU的多级可验证奖励来增强监督,基于查询的对比效用估计来实现样本感知的动态信用分配,以便进行局部检查,并通过情感感知的EMA校准来减少噪声效用估计,同时捕捉情感导向的检查倾向。该算法使ActFER能够学习何时局部检查是有益的,以及如何对获取的证据进行推理。全面的实验表明,使用UC-GRPO训练的ActFER在性能上始终优于被动的基于MLLM的FER基线,并显著提高了AUs预测的准确性。
cs.CV / 45 / 2604.08991

PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

PinpointQA:面向室内视频中小物体空间理解的数据集与基准测试
Zhou, Zhiyu, Liu, Peilin, Zhang, Ruoxuan, Zhang, Luyang, Zhang, Cheng, Xie, Hongxia, Cheng, Wen-Huang
Abstract
Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at https://rainchowz.github.io/PinpointQA.
Chinese Translation
尽管小物体为中心的室内视频空间理解在物体搜索和辅助应用中具有实际价值,但对于多模态大型语言模型(MLLMs)而言仍是一项重大挑战。现有基准虽然推动了视频空间智能、具身推理和诊断感知的发展,但尚无基准能直接评估模型是否能够在视频中定位目标物体并以足够精确的方式表达其位置以供下游使用。本文提出了PinpointQA,这是首个针对室内视频中小物体空间理解的数据集与基准测试。PinpointQA基于ScanNet++和ScanNet200构建,包含1024个场景和10094个问答对,组织为四个逐步递进的任务:目标存在验证(Target Presence Verification, TPV)、最近参考物识别(Nearest Reference Identification, NRI)、细粒度空间描述(Fine-Grained Spatial Description, FSD)和结构化空间预测(Structured Spatial Prediction, SSP)。该数据集基于中间空间表示构建,问答对通过自动生成并经过质量控制进一步精炼。对代表性MLLM的实验表明,模型在任务链中表现出持续的能力差距,尤其是SSP任务依然极具挑战性。在PinpointQA上的监督微调带来了显著提升,尤其是在较难任务上,表明PinpointQA既是诊断基准也是有效的训练数据集。数据集及项目页面可访问:https://rainchowz.github.io/PinpointQA。
cs.CV / 46 / 2604.08995

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Matrix-Game 3.0:具有长时记忆的实时流式交互世界模型
Wang, Zile, Liu, Zexiang, Li, Jaixing, Huang, Kaichen, Xu, Baixin, Kang, Fei, An, Mengyin, Wang, Peiyu, Jiang, Biao, Wei, Yichen, Xietian, Yidan, Pei, Jiangbo, Hu, Liang, Jiang, Boyi, Xue, Hua, Wang, Zidong, Sun, Haofeng, Li, Wei, Ouyang, Wanli, He, Xianglong, Liu, Yang, Li, Yangguang, Zhou, Yahui
Abstract
With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time longform video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.
Chinese Translation
随着交互视频生成技术的进步,扩散模型越来越显示出其作为世界模型的潜力。然而,现有方法仍然难以同时实现具备记忆的长期时间一致性和高分辨率的实时生成,这限制了它们在现实场景中的应用。为了解决这一问题,我们提出了Matrix-Game 3.0,这是一种增强记忆的交互世界模型,旨在进行720p实时长格式视频生成。在Matrix-Game 2.0的基础上,我们在数据、模型和推理方面进行了系统性的改进。首先,我们开发了一个升级的工业级无限数据引擎,该引擎整合了基于虚幻引擎的合成数据、来自AAA游戏的大规模自动收集以及现实视频增强,以大规模生成高质量的视频-姿态-动作-提示四元组数据。其次,我们提出了一种用于长期一致性的训练框架:通过建模预测残差并在训练过程中重新注入不完美生成的帧,基础模型实现了自我纠正;同时,基于相机的记忆检索和注入使基础模型能够实现长期的时空一致性。第三,我们设计了一种基于分布匹配蒸馏(Distribution Matching Distillation, DMD)的多段自回归蒸馏策略,结合模型量化和变分自编码器(VAE)解码器剪枝,以实现高效的实时推理。实验结果表明,Matrix-Game 3.0在720p分辨率下以5B模型实现了高达40 FPS的实时生成,同时在分钟级序列中保持稳定的记忆一致性。将模型扩展到2x14B进一步提高了生成质量、动态性和泛化能力。我们的方法为工业规模可部署的世界模型提供了一条实用的路径。
cs.CV / 47 / 2604.09000

StreamMeCo: Long-Term Agent Memory Compression for Efficient Streaming Video Understanding

StreamMeCo:用于高效流媒体视频理解的长期智能体记忆压缩
Wang, Junxi, Sun, Te, Zhu, Jiayi, Li, Junxian, Xu, Haowen, Wen, Zichen, Hu, Xuming, Li, Zhiyu, Zhang, Linfeng
Abstract
Vision agent memory has shown remarkable effectiveness in streaming video understanding. However, storing such memory for videos incurs substantial memory overhead, leading to high costs in both storage and computation. To address this issue, we propose StreamMeCo, an efficient Stream Agent Memory Compression framework. Specifically, based on the connectivity of the memory graph, StreamMeCo introduces edge-free minmax sampling for the isolated nodes and an edge-aware weight pruning for connected nodes, evicting the redundant memory nodes while maintaining the accuracy. In addition, we introduce a time-decay memory retrieval mechanism to further eliminate the performance degradation caused by memory compression. Extensive experiments on three challenging benchmark datasets (M3-Bench-robot, M3-Bench-web and Video-MME-Long) demonstrate that under 70% memory graph compression, StreamMeCo achieves a 1.87* speedup in memory retrieval while delivering an average accuracy improvement of 1.0%. Our code is available at https://github.com/Celina-love-sweet/StreamMeCo.
Chinese Translation
视觉智能体记忆在流媒体视频理解中表现出了显著的有效性。然而,存储这些视频的记忆会带来巨大的内存开销,从而导致存储和计算成本的增加。为了解决这个问题,我们提出了StreamMeCo,一个高效的流智能体记忆压缩框架。具体而言,基于记忆图的连通性,StreamMeCo为孤立节点引入了无边最小最大采样,并为连接节点实施了边感知权重剪枝,从而在保持准确性的同时驱逐冗余的记忆节点。此外,我们引入了一种时间衰减记忆检索机制,以进一步消除记忆压缩带来的性能下降。在三个具有挑战性的基准数据集(M3-Bench-robot、M3-Bench-web和Video-MME-Long)上进行的广泛实验表明,在70%的记忆图压缩下,StreamMeCo在记忆检索中实现了1.87倍的加速,同时平均准确率提高了1.0%。我们的代码可在 https://github.com/Celina-love-sweet/StreamMeCo 获取。
cs.CV / 48 / 2604.09009

Robust by Design: A Continuous Monitoring and Data Integration Framework for Medical AI

设计即鲁棒:面向医疗人工智能的连续监测与数据整合框架
Daouk, Mohammad, Becker, Jan Ulrich, Kambham, Neeraja, Chang, Anthony, Mohan, Chandra, Van Nguyen, Hien
Abstract
Adaptive medical AI models often face performance drops in dynamic clinical environments due to data drift. We propose an autonomous continuous monitoring and data integration framework that maintains robust performance over time. Focusing on glomerular pathology image classification (proliferative vs. non-proliferative lupus nephritis), our three-stage method uses multi-metric feature analysis and Monte Carlo dropout-based uncertainty gating to decide when to retrain on new data. Only images statistically similar to the training distribution (via Euclidean, cosine, Mahalanobis metrics) and with low predictive entropy are integrated. The model is then incrementally retrained with these images under strict performance safeguards (no metric degradation >5%). In experiments with a ResNet18 ensemble on a multi-center dataset, the framework prevents performance degradation: new images were added without significant change in AUC (~0.92) or accuracy (~89%). This approach addresses data shift and avoids catastrophic forgetting, enabling sustained learning in medical imaging AI.
Chinese Translation
适应性医疗人工智能模型在动态临床环境中常因数据漂移导致性能下降。我们提出了一种自主的连续监测与数据整合框架,以维持模型的长期鲁棒性能。该方法聚焦于肾小球病理图像分类(增生型与非增生型狼疮性肾炎),采用三阶段流程,结合多指标特征分析和基于蒙特卡洛丢弃(Monte Carlo dropout)的不确定性门控,决定何时对新数据进行再训练。仅将与训练分布在欧氏距离、余弦相似度、马氏距离等统计指标上相似且预测熵较低的图像纳入整合。随后,在严格的性能保障(指标退化不超过5%)下,模型对这些图像进行增量再训练。在基于ResNet18集成模型的多中心数据集实验中,该框架有效防止了性能下降:新增图像后,AUC(约0.92)和准确率(约89%)无显著变化。该方法解决了数据漂移问题,避免了灾难性遗忘,实现了医疗影像人工智能的持续学习。
cs.CV / 49 / 2604.09018

Domain-generalizable Face Anti-Spoofing with Patch-based Multi-tasking and Artifact Pattern Conversion

基于补丁的多任务学习与伪造模式转换的领域通用人脸防伪
Jung, Seungjin, Jeong, Yonghyun, Kim, Minha, Min, Jimin, Yoo, Youngjoon, Choi, Jongwon
Abstract
Face Anti-Spoofing (FAS) algorithms, designed to secure face recognition systems against spoofing, struggle with limited dataset diversity, impairing their ability to handle unseen visual domains and spoofing methods. We introduce the Pattern Conversion Generative Adversarial Network (PCGAN) to enhance domain generalization in FAS. PCGAN effectively disentangles latent vectors for spoof artifacts and facial features, allowing to generate images with diverse artifacts. We further incorporate patch-based and multi-task learning to tackle partial attacks and overfitting issues to facial features. Our extensive experiments validate PCGAN's effectiveness in domain generalization and detecting partial attacks, giving a substantial improvement in facial recognition security.
Chinese Translation
人脸防伪(FAS)算法旨在保护人脸识别系统免受伪造攻击,但由于数据集多样性有限,导致其在处理未见视觉领域和伪造方法时能力受限。我们提出了模式转换生成对抗网络(PCGAN),以增强FAS中的领域泛化能力。PCGAN有效地解耦了伪造伪影和面部特征的潜在向量,从而能够生成具有多样化伪造特征的图像。我们进一步结合基于补丁的学习和多任务学习,以应对部分攻击和面部特征过拟合问题。我们的广泛实验验证了PCGAN在领域泛化和检测部分攻击方面的有效性,显著提升了人脸识别的安全性。
cs.CV / 50 / 2604.09022

BlendFusion -- Scalable Synthetic Data Generation for Diffusion Model Training

BlendFusion -- 可扩展的合成数据生成用于扩散模型训练
Venkatesh, Thejas, Velury, Suguna Varshini
Abstract
With the rapid adoption of diffusion models, synthetic data generation has emerged as a promising approach for addressing the growing demand for large-scale image datasets. However, images generated purely by diffusion models often exhibit visual inconsistencies, and training models on such data can create an autophagous feedback loop that leads to model collapse, commonly referred to as Model Autophagy Disorder (MAD). To address these challenges, we propose BlendFusion, a scalable framework for synthetic data generation from 3D scenes using path tracing. Our pipeline incorporates an object-centric camera placement strategy, robust filtering mechanisms, and automatic captioning to produce high-quality image-caption pairs. Using this pipeline, we curate FineBLEND, an image-caption dataset constructed from a diverse set of 3D scenes. We empirically analyze the quality of FineBLEND and compare it to several widely used image-caption datasets. We also demonstrate the effectiveness of our object-centric camera placement strategy relative to object-agnostic sampling approaches. Our open-source framework is designed for high configurability, enabling the community to create their own datasets from 3D scenes.
Chinese Translation
随着扩散模型的快速普及,合成数据生成已成为满足对大规模图像数据集日益增长需求的有前景的方法。然而,纯粹由扩散模型生成的图像往往表现出视觉不一致性,在此类数据上训练模型可能会形成一种自我吞噬的反馈循环,导致模型崩溃,这种现象通常被称为模型自噬障碍(Model Autophagy Disorder, MAD)。为了解决这些挑战,我们提出了BlendFusion,一个基于路径追踪的可扩展合成数据生成框架,旨在从3D场景中生成数据。我们的管道结合了以对象为中心的相机放置策略、强大的过滤机制和自动标注功能,以生成高质量的图像-标注对。通过该管道,我们策划了FineBLEND,一个由多样化3D场景构建的图像-标注数据集。我们对FineBLEND的质量进行了实证分析,并将其与多个广泛使用的图像-标注数据集进行了比较。我们还展示了我们的以对象为中心的相机放置策略相对于对象无关采样方法的有效性。我们的开源框架旨在具有高度可配置性,使社区能够从3D场景中创建自己的数据集。
cs.CV / 51 / 2604.09023

CAD 100K: A Comprehensive Multi-Task Dataset for Car Related Visual Anomaly Detection

CAD 100K:一个全面的多任务汽车相关视觉异常检测数据集
Pang, Jiahua, Li, Ying, Cao, Dongpu, Luo, Jingcai, Zheng, Yanuo, Yunfan, Bao, Lei, Yujie, Yuan, Rui, Tian, Yuxi, Yuan, Guojin, Chen, Hongchang, Zheng, Zhi, Liu, Yongchun
Abstract
Multi-task visual anomaly detection is critical for car-related manufacturing quality assessment. However, existing methods remain task-specific, hindered by the absence of a unified benchmark for multi-task evaluation. To fill in this gap, We present the CAD Dataset, a large-scale and comprehensive benchmark designed for car-related multi-task visual anomaly detection. The dataset contains over 100 images crossing 7 vehicle domains and 3 tasks, providing models a comprehensive view for car-related anomaly detection. It is the first car-related anomaly dataset specialized for multi-task learning(MTL), while combining synthesis data augmentation for few-shot anomaly images. We implement a multi-task baseline and conduct extensive empirical studies. Results show MTL promotes task interaction and knowledge transfer, while also exposing challenging conflicts between tasks. The CAD dataset serves as a standardized platform to drive future advances in car-related multi-task visual anomaly detection.
Chinese Translation
多任务视觉异常检测对于汽车相关的制造质量评估至关重要。然而,现有的方法仍然是任务特定的,受到缺乏统一的多任务评估基准的限制。为填补这一空白,我们提出了CAD数据集,这是一个大规模且全面的基准,旨在用于汽车相关的多任务视觉异常检测。该数据集包含超过100个图像,涵盖7个车辆领域和3个任务,为模型提供了汽车相关异常检测的全面视角。它是第一个专门针对多任务学习(MTL)的汽车相关异常数据集,同时结合了合成数据增强以应对少量异常图像。我们实现了一个多任务基线,并进行了广泛的实证研究。结果表明,MTL促进了任务之间的互动和知识转移,同时也暴露了任务之间的挑战性冲突。CAD数据集作为一个标准化平台,推动未来在汽车相关多任务视觉异常检测领域的进展。
cs.CV / 52 / 2604.09024

Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection

别动我的图像:通过视觉提示注入防止多模态大语言模型分析图像
Shao, Zedian, Liu, Hongbin, Hu, Yuepeng, Gong, Neil Zhenqiang
Abstract
Multi-modal large language models (MLLMs) have emerged as powerful tools for analyzing Internet-scale image data, offering significant benefits but also raising critical safety and societal concerns. In particular, open-weight MLLMs may be misused to extract sensitive information from personal images at scale, such as identities, locations, or other private details. In this work, we propose ImageProtector, a user-side method that proactively protects images before sharing by embedding a carefully crafted, nearly imperceptible perturbation that acts as a visual prompt injection attack on MLLMs. As a result, when an adversary analyzes a protected image with an MLLM, the MLLM is consistently induced to generate a refusal response such as "I'm sorry, I can't help with that request." We empirically demonstrate the effectiveness of ImageProtector across six MLLMs and four datasets. Additionally, we evaluate three potential countermeasures, Gaussian noise, DiffPure, and adversarial training, and show that while they partially mitigate the impact of ImageProtector, they simultaneously degrade model accuracy and/or efficiency. Our study focuses on the practically important setting of open-weight MLLMs and large-scale automated image analysis, and highlights both the promise and the limitations of perturbation-based privacy protection.
Chinese Translation
多模态大语言模型(MLLMs)已成为分析互联网规模图像数据的强大工具,带来了显著的益处,同时也引发了关键的安全和社会问题。特别是,开源权重的MLLMs可能被滥用,以大规模提取个人图像中的敏感信息,如身份、位置或其他隐私细节。在本研究中,我们提出了ImageProtector,一种用户端方法,通过嵌入精心设计、几乎不可察觉的扰动,作为对MLLMs的视觉提示注入攻击,主动保护图像在分享前的安全。因此,当攻击者使用MLLM分析受保护图像时,MLLM会被持续诱导生成拒绝响应,例如“抱歉,我无法协助该请求”。我们在六个MLLMs和四个数据集上实证验证了ImageProtector的有效性。此外,我们评估了三种潜在的对抗措施:高斯噪声(Gaussian noise)、DiffPure和对抗训练,结果表明,虽然它们在一定程度上缓解了ImageProtector的影响,但同时也降低了模型的准确性和/或效率。我们的研究聚焦于开源权重MLLMs和大规模自动图像分析这一实际重要的应用场景,强调了基于扰动的隐私保护方法的潜力与局限性。
cs.CV / 53 / 2604.09025

Skill-Conditioned Visual Geolocation for Vision-Language

基于技能条件的视觉语言地理定位
Yang, Chenjie, Jiang, Yutian, Wu, Chenyu
Abstract
Vision-language models (VLMs) have shown a promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods predominantly rely on implicit parametric memory, which often exploits outdated knowledge and generates hallucinated reasoning. Furthermore, current inference is a "one-off" process, lacking the feedback loops necessary for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework based on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model to perform direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes and prunes skills, effectively expanding the Skill-Graph and correcting geographic biases without any parameter updates. Experiments demonstrate that GeoSkill achieves promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC, while maintaining superior generalization across diverse external datasets. Furthermore, our autonomous evolution fosters the emergence of novel, verifiable skills, significantly enhancing the system's cognition of real-world geographic knowledge beyond isolated case studies.
Chinese Translation
视觉语言模型(VLMs)在图像地理定位方面展现出有希望的能力,但仍缺乏结构化的地理推理能力及自主自我进化的能力。现有方法主要依赖隐式参数记忆,往往利用过时知识并产生虚构的推理。此外,当前推理过程为“一次性”执行,缺乏基于推理结果进行自我进化所需的反馈循环。为解决这些问题,我们提出了GeoSkill,一种基于不断演化的技能图(Skill-Graph)的无训练框架。我们首先通过将人类专家轨迹细化为原子级的自然语言技能来初始化该图。在执行阶段,GeoSkill利用推理模型在当前技能图的指导下进行直接推理。为了实现持续增长,自主进化机制利用更大规模模型对来自网络规模数据的图像-坐标对进行多次推理演练,并结合真实世界的推理验证。通过分析这些演练中成功与失败的轨迹,该机制迭代地合成和修剪技能,有效扩展技能图并纠正地理偏差,且无需任何参数更新。实验表明,GeoSkill在GeoRC数据集上在地理定位准确性和推理可信度方面均取得了有竞争力的表现,同时在多样的外部数据集上保持了优异的泛化能力。此外,我们的自主进化促进了新颖且可验证技能的涌现,显著增强了系统对真实世界地理知识的认知,超越了孤立案例研究的局限。
cs.CV / 54 / 2604.09030

NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Multi-Exposure Image Fusion in Dynamic Scenes (Track 2)

NTIRE 2026 第三届恢复任意图像模型 (RAIM) 挑战:动态场景中的多曝光图像融合(第二赛道)
Qu, Lishen, Liu, Yao, Liang, Jie, Zeng, Hui, Dai, Wen, Qin, Guanyi, Guan, Ya-nan, Zhou, Shihao, Yang, Jufeng, Zhang, Lei, Timofte, Radu, Yuan, Xiyuan, Sun, Wanjie, Li, Shihang, Zhang, Bo, Chen, Bin, Lin, Jiannan, Chen, Yuxu, Gao, Qinquan, Tong, Tong, Gao, Song, Tang, Jiacong, Hu, Tao, Ma, Xiaowen, Yan, Qingsen, Xu, Sunhan, Wang, Juan, Sun, Xinyu, Qi, Lei, Xu, He, Tu, Jiachen, Xu, Guoyi, Jiang, Yaoxin, Liu, Jiajia, Shi, Yaokun
Abstract
This paper presents NTIRE 2026, the 3rd Restore Any Image Model (RAIM) challenge on multi-exposure image fusion in dynamic scenes. We introduce a benchmark that targets a practical yet difficult HDR imaging setting, where exposure bracketing must be fused under scene motion, illumination variation, and handheld camera jitter. The challenge data contains 100 training sequences with 7 exposure levels and 100 test sequences with 5 exposure levels, reflecting real-world scenarios that frequently cause misalignment and ghosting artefacts. We evaluate submissions with a leaderboard score derived from PSNR, SSIM, and LPIPS, while also considering perceptual quality, efficiency, and reproducibility during the final review. This track attracted 114 participating teams and received 987 submissions. The winning methods significantly improved the ability to remove artifacts from multi-exposure fusion and recover fine details. The dataset and the code of each team can be found at the repository: https://github.com/qulishen/RAIM-HDR.
Chinese Translation
本文介绍了 NTIRE 2026,第三届恢复任意图像模型 (RAIM) 挑战,主题为动态场景中的多曝光图像融合。我们提出了一个基准,旨在应对一个实际但困难的高动态范围 (HDR) 成像设置,在该设置中,曝光包围必须在场景运动、光照变化和手持相机抖动的情况下进行融合。挑战数据集包含 100 个训练序列,具有 7 个曝光级别,以及 100 个测试序列,具有 5 个曝光级别,反映了在现实世界场景中经常导致对齐错误和鬼影伪影的情况。我们通过基于峰值信噪比 (PSNR)、结构相似性指数 (SSIM) 和感知相似性指标 (LPIPS) 的排行榜得分来评估提交,同时在最终审查中考虑感知质量、效率和可重复性。本赛道吸引了 114 支参赛队伍,并收到了 987 份提交。获胜的方法显著提高了去除多曝光融合伪影和恢复细节的能力。每个团队的数据集和代码可以在以下仓库找到:https://github.com/qulishen/RAIM-HDR。
cs.CV / 55 / 2604.09037

SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos

SiMing-Bench:从临床技能视频中的连续交互评估程序正确性
Huang, Xiyang, Lin, Jiawei, Wu, Keying, Huang, Jiaxin, Yang, Kailai, Wei, Renxiong, zeng, Cheng, Xiang, Jiayi, Kuang, Ziyan, Peng, Min, Xie, Qianqian, Ananiadou, Sophia
Abstract
Current video benchmarks for multimodal large language models (MLLMs) focus on event recognition, temporal ordering, and long-context recall, but overlook a harder capability required for expert procedural judgment: tracking how ongoing interactions update the procedural state and thereby determine the correctness of later actions. We introduce SiMing-Bench, the first benchmark for evaluating this capability from full-length clinical skill videos. It targets rubric-grounded process-level judgment of whether interaction-driven state updates preserve procedural correctness across an entire workflow. SiMing-Bench is instantiated with SiMing-Score, a physician-annotated dataset of real clinical skill examination videos spanning cardiopulmonary resuscitation, automated external defibrillator operation, and bag-mask ventilation, each paired with a standardized step-wise rubric and dual-expert labels. Across diverse open- and closed-source MLLMs, we observe consistently weak agreement with physician judgments. Moreover, weak performance on rubric-defined intermediate steps persists even when overall procedure-level correlation appears acceptable, suggesting that coarse global assessment substantially overestimates current models' procedural judgment ability. Additional analyses with binary step judgment and step-aligned clips indicate that the bottleneck is not merely fine-grained scoring or temporal localization, but modeling how continuous interactions update procedural state over time.
Chinese Translation
当前针对多模态大型语言模型(MLLMs)的视频基准主要集中在事件识别、时间排序和长时记忆回忆上,但忽视了专家程序判断所需的更复杂能力:跟踪持续交互如何更新程序状态,从而决定后续动作的正确性。我们介绍了SiMing-Bench,这是第一个用于评估这一能力的基准,基于完整的临床技能视频。它旨在评估交互驱动的状态更新是否在整个工作流程中保持程序正确性的基于标准的过程级判断。SiMing-Bench通过SiMing-Score实现,这是一个由医生注释的真实临床技能考试视频数据集,涵盖心肺复苏、自动体外除颤器操作和手袋面罩通气,每个视频都配有标准化的逐步评分标准和双专家标签。在多种开放源代码和闭源的MLLMs中,我们观察到与医生判断的一致性普遍较弱。此外,即使在整体程序级相关性看似可接受的情况下,对基于标准的中间步骤的表现仍然较弱,这表明粗略的全局评估显著高估了当前模型的程序判断能力。通过二元步骤判断和步骤对齐剪辑的额外分析表明,瓶颈不仅仅是细粒度评分或时间定位,而是建模持续交互如何随时间更新程序状态。
cs.CV / 56 / 2604.09045

Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting

面向3D高斯点渲染的场景无关对象中心表示学习
Hsu, Tsuheng, Liu, Guiyu, Kannala, Juho, Heikkilä, Janne
Abstract
Recent works on 3D scene understanding leverage 2D masks from visual foundation models (VFMs) to supervise radiance fields, enabling instance-level 3D segmentation. However, the supervision signals from foundation models are not fundamentally object-centric and often require additional mask pre/post-processing or specialized training and loss design to resolve mask identity conflicts across views. The learned identity of the 3D scene is scene-dependent, limiting generalizability across scenes. Therefore, we propose a dataset-level, object-centric supervision scheme to learn object representations in 3D Gaussian Splatting (3DGS). Building on a pre-trained slot attention-based Global Object Centric Learning (GOCL) module, we learn a scene-agnostic object codebook that provides consistent, identity-anchored representations across views and scenes. By coupling the codebook with the module's unsupervised object masks, we can directly supervise the identity features of 3D Gaussians without additional mask pre-/post-processing or explicit multi-view alignment. The learned scene-agnostic codebook enables object supervision and identification without per-scene fine-tuning or retraining. Our method thus introduces unsupervised object-centric learning (OCL) into 3DGS, yielding more structured representations and better generalization for downstream tasks such as robotic interaction, scene understanding, and cross-scene generalization.
Chinese Translation
近期关于3D场景理解的研究利用视觉基础模型(Visual Foundation Models, VFMs)生成的二维掩码对辐射场进行监督,实现了实例级的3D分割。然而,基础模型提供的监督信号并非从根本上以对象为中心,且通常需要额外的掩码预处理/后处理或专门的训练与损失设计,以解决跨视角的掩码身份冲突问题。所学习的3D场景身份依赖于具体场景,限制了跨场景的泛化能力。为此,我们提出了一种基于数据集级别的对象中心监督方案,用于在3D高斯点渲染(3D Gaussian Splatting, 3DGS)中学习对象表示。基于预训练的基于slot attention的全局对象中心学习模块(Global Object Centric Learning, GOCL),我们学习了一个场景无关的对象码本,该码本在不同视角和场景间提供一致且身份锚定的表示。通过将码本与模块的无监督对象掩码结合,我们能够直接监督3D高斯的身份特征,无需额外的掩码预处理/后处理或显式的多视角对齐。所学习的场景无关码本支持对象监督与识别,无需针对每个场景进行微调或重新训练。我们的方法将无监督对象中心学习(Object-Centric Learning, OCL)引入3DGS,生成更具结构性的表示,并提升了机器人交互、场景理解及跨场景泛化等下游任务的性能。
cs.CV / 57 / 2604.09047

Text-Conditioned Multi-Expert Regression Framework for Fully Automated Multi-Abutment Design

基于文本条件的多专家回归框架用于全自动多基台设计
Zheng, Mianjie, Yang, Xinquan, Liu, Xuefen, Li, Xuguang, Tang, Kun, Meng, He, Shen, Linlin
Abstract
Dental implant abutments serve as the geometric and biomechanical interface between the implant fixture and the prosthetic crown, yet their design relies heavily on manual effort and is time-consuming. Although deep neural networks have been proposed to assist dentists in designing abutments, most existing approaches remain largely manual or semi-automated, requiring substantial clinician intervention and lacking scalability in multi-abutment scenarios. To address these limitations, we propose TEMAD, a fully automated, text-conditioned multi-expert architecture for multi-abutment design. This framework integrates implant site localization and implant system, compatible abutment parameter regression into a unified pipeline. Specifically, we introduce an Implant Site Identification Network (ISIN) to automatically localize implant sites and provide this information to the subsequent multi-abutment regression network. We further design a Tooth-Conditioned Feature-wise Linear Modulation (TC-FiLM) module, which adaptively calibrates mesh representations using tooth embeddings to enable position-specific feature modulation. Additionally, a System-Prompted Mixture-of-Experts (SPMoE) mechanism leverages implant system prompts to guide expert selection, ensuring system-aware regression. Extensive experiments on a large-scale abutment design dataset show that TEMAD achieves state-of-the-art performance compared to existing methods, particularly in multi-abutment settings, validating its effectiveness for fully automated dental implant planning.
Chinese Translation
牙科种植基台作为种植体与修复冠之间的几何和生物力学接口,其设计高度依赖人工操作且耗时较长。尽管已有深度神经网络被提出以辅助牙医设计基台,但大多数现有方法仍主要是手动或半自动化,需大量临床医生干预,且在多基台场景下缺乏可扩展性。为解决这些限制,我们提出了TEMAD,一种全自动、基于文本条件的多专家架构用于多基台设计。该框架将种植部位定位与种植系统兼容基台参数回归整合于统一流程中。具体而言,我们引入了种植部位识别网络(ISIN)以自动定位种植部位,并将该信息传递给后续的多基台回归网络。我们进一步设计了牙齿条件特征线性调制模块(TC-FiLM),利用牙齿嵌入自适应校准网格表示,实现位置特异性的特征调制。此外,系统提示混合专家机制(SPMoE)利用种植系统提示引导专家选择,确保系统感知的回归。基于大规模基台设计数据集的广泛实验表明,TEMAD在多基台设置中相比现有方法取得了最先进的性能,验证了其在全自动牙科种植规划中的有效性。
cs.CV / 58 / 2604.09051

Fine-Grained Action Segmentation for Renorrhaphy in Robot-Assisted Partial Nephrectomy

机器人辅助部分肾切除术中肾缝合的细粒度动作分割
Dai, Jiaheng, Liu, Huanrong, Zhou, Tailai, Jia, Tongyu, Liu, Qin, Ban, Yutong, Li, Zeju, Gao, Yu, Ma, Xin, Li, Qingbiao
Abstract
Fine-grained action segmentation during renorrhaphy in robot-assisted partial nephrectomy requires frame-level recognition of visually similar suturing gestures with variable duration and substantial class imbalance. The SIA-RAPN benchmark defines this problem on 50 clinical videos acquired with the da Vinci Xi system and annotated with 12 frame-level labels. The benchmark compares four temporal models built on I3D features: MS-TCN++, AsFormer, TUT, and DiffAct. Evaluation uses balanced accuracy, edit score, segmental F1 at overlap thresholds of 10, 25, and 50, frame-wise accuracy, and frame-wise mean average precision. In addition to the primary evaluation across five released split configurations on SIA-RAPN, the benchmark reports cross-domain results on a separate single-port RAPN dataset. Across the strongest reported values over those five runs on the primary dataset, DiffAct achieves the highest F1, frame-wise accuracy, edit score, and frame mAP, while MS-TCN++ attains the highest balanced accuracy.
Chinese Translation
在机器人辅助部分肾切除术中的肾缝合过程中,细粒度动作分割需要对视觉上相似的缝合手势进行逐帧识别,这些手势具有可变的持续时间和显著的类别不平衡。SIA-RAPN基准定义了这一问题,基于50个使用达芬奇Xi系统获取的临床视频,并标注了12个逐帧标签。该基准比较了基于I3D特征构建的四种时间模型:MS-TCN++、AsFormer、TUT和DiffAct。评估指标包括平衡准确率、编辑分数、在重叠阈值为10、25和50时的分段F1、逐帧准确率和逐帧平均精度均值。此外,除了在SIA-RAPN上对五个发布的拆分配置进行的主要评估外,该基准还报告了在单端口RAPN数据集上的跨领域结果。在对主要数据集的五次运行中,DiffAct在F1、逐帧准确率、编辑分数和逐帧平均精度均值方面取得了最高的报告值,而MS-TCN++则获得了最高的平衡准确率。
cs.CV / 59 / 2604.09057

Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

Tora3:基于轨迹引导的具有物理一致性的音视频生成
Liao, Junchao, Zhang, Zhenghao, Meng, Xiangyu, Li, Litao, Zhang, Ziying, Zhu, Siyu, Qin, Long, Wang, Weizhi
Abstract
Audio-video (AV) generation has recently made strong progress in perceptual quality and multimodal coherence, yet generating content with plausible motion-sound relations remains challenging. Existing methods often produce object motions that are visually unstable and sounds that are only loosely aligned with salient motion or contact events, largely because they lack an explicit motion-aware structure shared by video and audio generation. We present Tora3, a trajectory-guided AV generation framework that improves physical coherence by using object trajectories as a shared kinematic prior. Rather than treating trajectories as a video-only control signal, Tora3 uses them to jointly guide visual motion and acoustic events. Specifically, we design a trajectory-aligned motion representation for video, a kinematic-audio alignment module driven by trajectory-derived second-order kinematic states, and a hybrid flow matching scheme that preserves trajectory fidelity in trajectory-conditioned regions while maintaining local coherence elsewhere. We further curate PAV, a large-scale AV dataset emphasizing motion-relevant patterns with automatically extracted motion annotations. Extensive experiments show that Tora3 improves motion realism, motion-sound synchronization, and overall AV generation quality over strong open-source baselines.
Chinese Translation
音视频(AV)生成在感知质量和多模态一致性方面近年来取得了显著进展,但生成具有合理运动-声音关系的内容仍然具有挑战性。现有方法常常产生视觉上不稳定的物体运动以及仅与显著运动或接触事件松散对齐的声音,这在很大程度上是因为缺乏视频和音频生成共享的显式运动感知结构。我们提出了Tora3,一种基于轨迹引导的音视频生成框架,通过使用物体轨迹作为共享的运动学先验来提升物理一致性。Tora3不仅将轨迹视为仅用于视频的控制信号,而是利用轨迹共同引导视觉运动和声学事件。具体而言,我们设计了轨迹对齐的视频运动表示、由轨迹导出的二阶运动学状态驱动的运动学-音频对齐模块,以及一种混合流匹配方案,该方案在轨迹条件区域保持轨迹的准确性,同时在其他区域维持局部一致性。我们还整理了PAV,一个强调运动相关模式并带有自动提取运动标注的大规模音视频数据集。大量实验表明,Tora3在运动真实感、运动-声音同步以及整体音视频生成质量方面均优于强大的开源基线方法。
cs.CV / 60 / 2604.09059

Learning Vision-Language-Action World Models for Autonomous Driving

用于自主驾驶的视觉-语言-动作世界模型学习
Wang, Guoqing, Tang, Pin, Ren, Xiangxuan, Zhao, Guodongfang, Feng, Bailan, Ma, Chao
Abstract
Vision-Language-Action (VLA) models have recently achieved notable progress in end-to-end autonomous driving by integrating perception, reasoning, and control within a unified multimodal framework. However, they often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety. In contrast, world models can simulate plausible future scenes but generally struggle to reason about or evaluate the imagined future they generate. In this work, we present VLA-World, a simple yet effective VLA world model that unifies predictive imagination with reflective reasoning to improve driving foresight. VLA-World first uses an action-derived feasible trajectory to guide the generation of the next-frame image, capturing rich spatial and temporal cues that describe how the surrounding environment evolves. The model then reasons over this self-generated future imagined frame to refine the predicted trajectory, achieving higher performance and better interpretability. To support this pipeline, we curate nuScenes-GR-20K, a generative reasoning dataset derived from nuScenes, and employ a three-stage training strategy that includes pretraining, supervised fine-tuning, and reinforcement learning. Extensive experiments demonstrate that VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks. Project page: https://vlaworld.github.io
Chinese Translation
视觉-语言-动作(VLA)模型最近在端到端自主驾驶中取得了显著进展,通过在统一的多模态框架中整合感知、推理和控制。然而,它们通常缺乏对时间动态和全球世界一致性的明确建模,这限制了它们的前瞻性和安全性。相比之下,世界模型可以模拟合理的未来场景,但通常难以推理或评估它们生成的想象未来。在本研究中,我们提出了VLA-World,这是一种简单而有效的VLA世界模型,它将预测性想象与反思性推理相结合,以提高驾驶的前瞻性。VLA-World首先使用基于动作的可行轨迹来引导下一帧图像的生成,捕捉描述周围环境如何演变的丰富空间和时间线索。然后,该模型对自生成的未来想象帧进行推理,以细化预测轨迹,从而实现更高的性能和更好的可解释性。为了支持这一流程,我们策划了nuScenes-GR-20K,这是一个源自nuScenes的生成推理数据集,并采用了包括预训练、监督微调和强化学习在内的三阶段训练策略。大量实验表明,VLA-World在规划和未来生成基准测试中始终超越了最先进的VLA和世界模型基线。项目页面:https://vlaworld.github.io
cs.CV / 61 / 2604.09062

Nested Radially Monotone Polar Occupancy Estimation: Clinically-Grounded Optic Disc and Cup Segmentation for Glaucoma Screening

嵌套径向单调极性占用估计:基于临床的视盘和视杯分割用于青光眼筛查
Goperma, Rimsa, Basnet, Rojan, Zhao, Liang
Abstract
Valid segmentation of the optic disc (OD) and optic cup (OC) from fundus photographs is essential for glaucoma screening. Unfortunately, existing deep learning methods do not guarantee clinical validness including star-convexity and nested structure of OD and OC, resulting corruption in diagnostic metric, especially under cross-dataset domain shift. To adress this issue, this paper proposed NPS-Net (Nested Polar Shape Network), the first framework that formulates the OD/OC segmentation as nested radially monotone polar occupancy estimation.This output representation can guarantee the aforementioned clinical validness and achieve high accuracy. Evaluated across seven public datasets, NPS-Net shows strong zero-shot generalization. On RIM-ONE, it maintains 100% anatomical validity and improves Cup Dice by 12.8% absolute over the best baseline, reducing vCDR MAE by over 56%. On PAPILA, it achieves Disc Dice of 0.9438 and Disc HD95 of 2.78 px, an 83% reduction over the best competing method.
Chinese Translation
从眼底照片中有效分割视盘(OD)和视杯(OC)对于青光眼筛查至关重要。不幸的是,现有的深度学习方法并未保证临床有效性,包括视盘和视杯的星形凸性及嵌套结构,导致诊断指标的损坏,尤其是在跨数据集领域转移的情况下。为了解决这一问题,本文提出了NPS-Net(嵌套极性形状网络),这是第一个将视盘/视杯分割公式化为嵌套径向单调极性占用估计的框架。这种输出表示能够保证上述临床有效性并实现高精度。在七个公共数据集上的评估中,NPS-Net展现出强大的零样本泛化能力。在RIM-ONE数据集上,它保持了100%的解剖有效性,并在最佳基线的基础上提高了视杯Dice系数12.8%的绝对值,将vCDR平均绝对误差降低了超过56%。在PAPILA数据集上,它实现了视盘Dice系数0.9438和视盘HD95为2.78像素,相较于最佳竞争方法减少了83%。
cs.CV / 62 / 2604.09063

Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition

频率增强扩散模型:基于课程引导的语义对齐用于零样本骨架动作识别
Zhou, Yuxi, Zhang, Zhengbo, Pan, Jingyu, Lin, Zhiyu, Tu, Zhigang
Abstract
Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at https://github.com/yuzhi535/FDSM. Project homepage: https://yuzhi535.github.io/FDSM.github.io/
Chinese Translation
人体动作识别在计算机视觉领域具有重要意义,应用涵盖监控到人机交互等多个方面。尽管监督式骨架动作识别方法效果显著,但其依赖于大量标注数据,限制了对新颖动作的泛化能力。零样本骨架动作识别(Zero-Shot Skeleton Action Recognition, ZSAR)作为一种有前景的范式,面临扩散模型的频谱偏置问题,即对高频动态的过度平滑。针对这些挑战,我们提出了频率感知扩散骨架-文本匹配方法(Frequency-Aware Diffusion for Skeleton-Text Matching, FDSM),融合了语义引导频谱残差模块(Semantic-Guided Spectral Residual Module)、时间步长自适应频谱损失(Timestep-Adaptive Spectral Loss)以及基于课程学习的语义抽象(Curriculum-based Semantic Abstraction)。该方法有效恢复了细粒度的运动细节,在NTU RGB+D、PKU-MMD及Kinetics-skeleton数据集上实现了最先进的性能。代码已开源,地址:https://github.com/yuzhi535/FDSM。项目主页:https://yuzhi535.github.io/FDSM.github.io/
cs.CV / 63 / 2604.09076

Cross-Modal Knowledge Distillation from Spatial Transcriptomics to Histology

跨模态知识蒸馏:从空间转录组学到组织学
Hizmi, Arbel, Bakulin, Artemii, Bagon, Shai, Yosef, Nir
Abstract
Spatial transcriptomics provides a molecularly rich description of tissue organization, enabling unsupervised discovery of tissue niches -- spatially coherent regions of distinct cell-type composition and function that are relevant to both biological research and clinical interpretation. However, spatial transcriptomics remains costly and scarce, while H&E histology is abundant but carries a less granular signal. We propose to leverage paired spatial transcriptomics and H&E data to transfer transcriptomics-derived niche structure to a histology-only model via cross-modal distillation. Across multiple tissue types and disease contexts, the distilled model achieves substantially higher agreement with transcriptomics-derived niche structure than unsupervised morphology-based baselines trained on identical image features, and recovers biologically meaningful neighborhood composition as confirmed by cell-type analysis. The resulting framework leverages paired spatial transcriptomic and H&E data during training, and can then be applied to held-out tissue regions using histology alone, without any transcriptomic input at inference time.
Chinese Translation
空间转录组学提供了对组织结构的分子丰富描述,使得无监督发现组织小环境成为可能——这些小环境是空间上连贯的、具有不同细胞类型组成和功能的区域,与生物研究和临床解读均相关。然而,空间转录组学仍然成本高昂且稀缺,而H&E组织学则丰富但信号较为粗糙。我们提出利用配对的空间转录组学和H&E数据,通过跨模态蒸馏将转录组学衍生的小环境结构转移到仅基于组织学的模型中。在多种组织类型和疾病背景下,蒸馏模型与转录组学衍生的小环境结构的符合度显著高于在相同图像特征上训练的无监督形态学基线,并且通过细胞类型分析确认了生物学上有意义的邻域组成。该框架在训练过程中利用配对的空间转录组学和H&E数据,然后可以仅使用组织学应用于保留的组织区域,在推理时无需任何转录组学输入。
cs.CV / 64 / 2604.09088

Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

通过掩码双路径蒸馏实现衰减侧网络的内存高效迁移学习
Zhang, Yutong, Chen, Jiaxin, Chen, Honglin, Zheng, Kaiqi, Liao, Shengcai, Zhong, Hanwen, Li, Weixin, Wang, Yunhong
Abstract
Memory-efficient transfer learning (METL) approaches have recently achieved promising performance in adapting pre-trained models to downstream tasks. They avoid applying gradient backpropagation in large backbones, thus significantly reducing the number of trainable parameters and high memory consumption during fine-tuning. However, since they typically employ a lightweight and learnable side network, these methods inevitably introduce additional memory and time overhead during inference, which contradicts the ultimate goal of efficient transfer learning. To address the above issue, we propose a novel approach dubbed Masked Dual Path Distillation (MDPD) to accelerate inference while retaining parameter and memory efficiency in fine-tuning with fading side networks. Specifically, MDPD develops a framework that enhances the performance by mutually distilling the frozen backbones and learnable side networks in fine-tuning, and discard the side network during inference without sacrificing accuracy. Moreover, we design a novel feature-based knowledge distillation method for the encoder structure with multiple layers. Extensive experiments on distinct backbones across vision/language-only and vision-and-language tasks demonstrate that our method not only accelerates inference by at least 25.2\% while keeping parameter and memory consumption comparable, but also remarkably promotes the accuracy compared to SOTA approaches. The source code is available at https://github.com/Zhang-VKk/MDPD.
Chinese Translation
内存高效迁移学习(METL)方法近年来在将预训练模型适配到下游任务中取得了令人鼓舞的表现。它们避免在大型主干网络中应用梯度反向传播,从而显著减少了微调过程中的可训练参数数量和高内存消耗。然而,由于这类方法通常采用轻量且可学习的侧网络,推理阶段不可避免地引入了额外的内存和时间开销,这与高效迁移学习的最终目标相悖。为解决上述问题,我们提出了一种新颖的方法,称为掩码双路径蒸馏(Masked Dual Path Distillation,MDPD),旨在通过衰减侧网络实现推理加速,同时保持微调过程中的参数和内存效率。具体而言,MDPD构建了一个框架,通过在微调过程中相互蒸馏冻结的主干网络和可学习的侧网络来提升性能,并在推理时舍弃侧网络而不牺牲准确率。此外,我们设计了一种针对多层编码器结构的新型基于特征的知识蒸馏方法。在视觉、纯语言以及视觉与语言任务的不同主干网络上的大量实验表明,我们的方法不仅在保持参数和内存消耗相当的情况下实现了至少25.2%的推理加速,还相较于最先进方法显著提升了准确率。源码已公开,地址为:https://github.com/Zhang-VKk/MDPD。
cs.CV / 65 / 2604.09096

Off-the-shelf Vision Models Benefit Image Manipulation Localization

现成视觉模型有助于图像操控定位
Zhang, Zhengxuan, Song, Keji, Hu, Junmin, Luo, Ao, Li, Yuezun
Abstract
Image manipulation localization (IML) and general vision tasks are typically treated as two separate research directions due to the fundamental differences between manipulation-specific and semantic features. In this paper, however, we bridge this gap by introducing a fresh perspective: these two directions are intrinsically connected, and general semantic priors can benefit IML. Building on this insight, we propose a novel trainable adapter (named ReVi) that repurposes existing off-the-shelf general-purpose vision models (e.g., image generation and segmentation networks) for IML. Inspired by robust principal component analysis, the adapter disentangles semantic redundancy from manipulation-specific information embedded in these models and selectively enhances the latter. Unlike existing IML methods that require extensive model redesign and full retraining, our method relies on the off-the-shelf vision models with frozen parameters and only fine-tunes the proposed adapter. The experimental results demonstrate the superiority of our method, showing the potential for scalable IML frameworks.
Chinese Translation
图像操控定位(Image Manipulation Localization, IML)和一般视觉任务通常被视为两个独立的研究方向,原因在于操控特定特征与语义特征之间的根本差异。然而,在本文中,我们通过引入一个全新的视角来弥合这一差距:这两个方向本质上是相互关联的,一般语义先验可以促进IML的发展。基于这一洞察,我们提出了一种新颖的可训练适配器(命名为ReVi),该适配器将现有的现成通用视觉模型(例如,图像生成和分割网络)重新用于IML。受稳健主成分分析的启发,该适配器将这些模型中嵌入的语义冗余与操控特定信息进行解耦,并选择性地增强后者。与现有的IML方法需要广泛的模型重设计和完全重训练不同,我们的方法依赖于具有冻结参数的现成视觉模型,仅对所提出的适配器进行微调。实验结果证明了我们方法的优越性,展示了可扩展IML框架的潜力。
cs.CV / 66 / 2604.09100

Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch

基于物理的手部遮挡下的三维生成重建方法:利用本体感觉和多接触触觉
Caddeo, Gabriele Mario, Marra, Pasquale, Natale, Lorenzo
Abstract
We propose a multimodal, physically grounded approach for metric-scale amodal object reconstruction and pose estimation under severe hand occlusion. Unlike prior occlusion-aware 3D generation methods that rely only on vision, we leverage physical interaction signals: proprioception provides the posed hand geometry, and multi-contact touch constrains where the object surface must lie, reducing ambiguity in occluded regions. We represent object structure as a pose-aware, camera-aligned signed distance field (SDF) and learn a compact latent space with a Structure-VAE. In this latent space, we train a conditional flow-matching diffusion model, pretraining on vision-only images and finetuning on occluded manipulation scenes while conditioning on visible RGB evidence, occluder/visibility masks, the hand latent representation, and tactile information. Crucially, we incorporate physics-based objectives and differentiable decoder-guidance during finetuning and inference to reduce hand--object interpenetration and to align the reconstructed surface with contact observations. Because our method produces a metric, physically consistent structure estimate, it integrates naturally into existing two-stage reconstruction pipelines, where a downstream module refines geometry and predicts appearance. Experiments in simulation show that adding proprioception and touch substantially improves completion under occlusion and yields physically plausible reconstructions at correct real-world scale compared to vision-only baselines; we further validate transfer by deploying the model on a real humanoid robot with an end-effector different from those used during training.
Chinese Translation
我们提出了一种多模态、基于物理的度量尺度无模态物体重建和姿态估计方法,适用于严重的手部遮挡情况。与以往仅依赖视觉的遮挡感知三维生成方法不同,我们利用物理交互信号:本体感觉提供了手部姿态几何信息,而多接触触觉则约束了物体表面的位置,从而减少了遮挡区域的模糊性。我们将物体结构表示为一种姿态感知、相机对齐的有符号距离场(SDF),并通过结构变分自编码器(Structure-VAE)学习一个紧凑的潜在空间。在这个潜在空间中,我们训练了一个条件流匹配扩散模型,先在仅包含视觉图像上进行预训练,然后在遮挡的操作场景上进行微调,同时以可见的RGB证据、遮挡物/可见性掩码、手部潜在表示和触觉信息为条件。至关重要的是,我们在微调和推理过程中引入了基于物理的目标和可微分解码器引导,以减少手部与物体的相互穿透,并使重建表面与接触观测结果对齐。由于我们的方法生成了度量上物理一致的结构估计,因此可以自然地集成到现有的两阶段重建管道中,其中下游模块负责细化几何形状并预测外观。模拟实验表明,添加本体感觉和触觉显著改善了遮挡下的完成效果,并与仅依赖视觉的基线相比,产生了在真实世界尺度下物理上合理的重建;我们进一步通过将模型部署在与训练期间使用的末端执行器不同的真实类人机器人上验证了迁移能力。
cs.CV / 67 / 2604.09106

Detecting Diffusion-generated Images via Dynamic Assembly ForestsDetecting Diffusion-generated Images via Dynamic Assembly Forests

基于动态组装森林的扩散生成图像检测
Fu, Mengxin, Li, Yuezun
Abstract
Diffusion models are known for generating high-quality images, causing serious security concerns. To combat this, most efforts rely on deep neural networks (e.g., CNNs and Transformers), while largely overlooking the potential of traditional machine learning models. In this paper, we freshly investigate such alternatives and proposes a novel Dynamic Assembly Forest model (DAF) to detect diffusion-generated images. Built upon the deep forest paradigm, DAF addresses the inherent limitations in feature learning and scalable training, making it an effective diffusion-generated image detector. Compared to existing DNN-based methods, DAF has significantly fewer parameters, much lower computational cost, and can be deployed without GPUs, while achieving competitive performance under standard evaluation protocols. These results highlight the strong potential of the proposed method as a practical substitute for heavyweight DNN models in resource-constrained scenarios. Our code and models are available at https://github.com/OUC-VAS/DAF.
Chinese Translation
扩散模型以生成高质量图像著称,但也引发了严重的安全问题。为应对这一挑战,大多数方法依赖于深度神经网络(如卷积神经网络CNN和Transformer),而较少关注传统机器学习模型的潜力。本文重新探讨了这些替代方案,提出了一种新颖的动态组装森林(Dynamic Assembly Forest,DAF)模型用于检测扩散生成的图像。DAF基于深度森林范式,解决了特征学习和可扩展训练中的固有限制,成为一种高效的扩散生成图像检测器。与现有基于深度神经网络的方法相比,DAF参数显著更少,计算成本大幅降低,且无需GPU即可部署,同时在标准评测协议下表现出竞争力的性能。这些结果凸显了该方法作为资源受限场景中重量级深度神经网络模型的实用替代方案的强大潜力。我们的代码和模型可在https://github.com/OUC-VAS/DAF获取。
cs.CV / 68 / 2604.09114

FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval

FIRE-CIR:细粒度推理的组合时尚图像检索
Gardères, François, Gauthier, Camille-Sovanneary, Ponce, Jean, Chen, Shizhe
Abstract
Composed image retrieval (CIR) aims to retrieve a target image that depicts a reference image modified by a textual description. While recent vision-language models (VLMs) achieve promising CIR performance by embedding images and text into a shared space for retrieval, they often fail to reason about what to preserve and what to change. This limitation hinders interpretability and yields suboptimal results, particularly in fine-grained domains like fashion. In this paper, we introduce FIRE-CIR, a model that brings compositional reasoning and interpretability to fashion CIR. Instead of relying solely on embedding similarity, FIRE-CIR performs question-driven visual reasoning: it automatically generates attribute-focused visual questions derived from the modification text, and verifies the corresponding visual evidence in both reference and candidate images. To train such a reasoning system, we automatically construct a large-scale fashion-specific visual question answering dataset, containing questions requiring either single- or dual-image analysis. During retrieval, our model leverages this explicit reasoning to re-rank candidate results, filtering out images inconsistent with the intended modifications. Experimental results on the Fashion IQ benchmark show that FIRE-CIR outperforms state-of-the-art methods in retrieval accuracy. It also provides interpretable, attribute-level insights into retrieval decisions.
Chinese Translation
组合图像检索(CIR)旨在检索描绘由文本描述修改的参考图像的目标图像。尽管近期的视觉-语言模型(VLMs)通过将图像和文本嵌入共享空间以实现检索,取得了令人鼓舞的CIR性能,但它们往往无法推理出应保留什么和应改变什么。这一局限性妨碍了可解释性,并导致次优结果,尤其是在时尚等细粒度领域。本文介绍了FIRE-CIR,一个将组合推理和可解释性引入时尚CIR的模型。FIRE-CIR并不单纯依赖于嵌入相似性,而是执行以问题驱动的视觉推理:它自动生成基于修改文本的属性聚焦视觉问题,并验证参考图像和候选图像中的相应视觉证据。为了训练这样一个推理系统,我们自动构建了一个大规模的时尚特定视觉问答数据集,包含需要单图像或双图像分析的问题。在检索过程中,我们的模型利用这种显式推理对候选结果进行重新排序,过滤掉与预期修改不一致的图像。在Fashion IQ基准上的实验结果表明,FIRE-CIR在检索准确性上优于最先进的方法。它还提供了对检索决策的可解释的属性级见解。
cs.CV / 69 / 2604.09125

Few-Shot Personalized Age Estimation

少样本个性化年龄估计
Paplhám, Jakub, Franc, Vojtěch, Moroz, Artem
Abstract
Existing age estimation methods treat each face as an independent sample, learning a global mapping from appearance to age. This ignores a well-documented phenomenon: individuals age at different rates due to genetics, lifestyle, and health, making the mapping from face to age identity-dependent. When reference images of the same person with known ages are available, we can exploit this context to personalize the estimate. The only existing benchmark for this task (NIST FRVT) is closed-source and limited to a single reference image. In this work, we introduce OpenPAE, the first open benchmark for $N$-shot personalized age estimation with strict evaluation protocols. We establish a hierarchy of increasingly sophisticated baselines: from arithmetic offset, through closed-form Bayesian linear regression, to a conditional attentive neural process. Our experiments show that personalization consistently improves performance, that the gains are not merely domain adaptation, and that nonlinear methods significantly outperform simpler alternatives. We release all models, code, protocols, and evaluation splits.
Chinese Translation
现有的年龄估计方法将每张面孔视为独立样本,从外观到年龄学习全局映射。这忽视了一个众所周知的现象:个体由于遗传、生活方式和健康状况的不同而以不同的速度衰老,使得面孔到年龄的映射依赖于个体身份。当有已知年龄的同一人的参考图像可用时,我们可以利用这一背景来个性化估计。现有的唯一基准(NIST FRVT)是闭源的,并且仅限于单张参考图像。在本研究中,我们引入了OpenPAE,这是第一个针对$N$-shot个性化年龄估计的开放基准,具有严格的评估协议。我们建立了一系列日益复杂的基线:从算术偏移,到封闭形式的贝叶斯线性回归,再到条件注意神经过程。我们的实验表明,个性化始终能提高性能,这些提升不仅仅是领域适应,并且非线性方法显著优于更简单的替代方案。我们发布了所有模型、代码、协议和评估拆分。
cs.CV / 70 / 2604.09127

FaceLiVTv2: An Improved Hybrid Architecture for Efficient Mobile Face Recognition

FaceLiVTv2:一种用于高效移动端人脸识别的改进混合架构
Setyawan, Novendra, Sun, Chi-Chia, Hsu, Mao-Hsiu, Kuo, Wen-Kai, Hsieh, Jun-Wei
Abstract
Lightweight face recognition is increasingly important for deployment on edge and mobile devices, where strict constraints on latency, memory, and energy consumption must be met alongside reliable accuracy. Although recent hybrid CNN-Transformer architectures have advanced global context modeling, striking an effective balance between recognition performance and computational efficiency remains an open challenge. In this work, we present FaceLiVTv2, an improved version of our FaceLiVT hybrid architecture designed for efficient global--local feature interaction in mobile face recognition. At its core is Lite MHLA, a lightweight global token interaction module that replaces the original multi-layer attention design with multi-head linear token projections and affine rescale transformations, reducing redundancy while preserving representational diversity across heads. We further integrate Lite MHLA into a unified RepMix block that coordinates local and global feature interactions and adopts global depthwise convolution for adaptive spatial aggregation in the embedding stage. Under our experimental setup, results on LFW, CA-LFW, CP-LFW, CFP-FP, AgeDB-30, and IJB show that FaceLiVTv2 consistently improves the accuracy-efficiency trade-off over existing lightweight methods. Notably, FaceLiVTv2 reduces mobile inference latency by 22% relative to FaceLiVTv1, achieves speedups of up to 30.8% over GhostFaceNets on mobile devices, and delivers 20-41% latency improvements over EdgeFace and KANFace across platforms while maintaining higher recognition accuracy. These results demonstrate that FaceLiVTv2 offers a practical and deployable solution for real-time face recognition. Code is available at https://github.com/novendrastywn/FaceLiVT.
Chinese Translation
轻量级人脸识别在边缘和移动设备上的部署日益重要,这些设备对延迟、内存和能耗有严格限制,同时还需保证可靠的识别准确率。尽管近期的混合CNN-Transformer架构在全局上下文建模方面取得了进展,但在识别性能与计算效率之间实现有效平衡仍是一个开放的挑战。本文提出了FaceLiVTv2,这是我们FaceLiVT混合架构的改进版本,旨在实现移动端人脸识别中高效的全局-局部特征交互。其核心是Lite MHLA,一种轻量级的全局token交互模块,用多头线性token投影和仿射重缩放变换替代了原有的多层注意力设计,减少冗余的同时保持了各头之间的表示多样性。我们进一步将Lite MHLA集成到统一的RepMix模块中,该模块协调局部与全局特征交互,并在嵌入阶段采用全局深度卷积实现自适应空间聚合。在我们的实验设置下,FaceLiVTv2在LFW、CA-LFW、CP-LFW、CFP-FP、AgeDB-30和IJB数据集上均持续提升了准确率与效率的权衡表现。值得注意的是,FaceLiVTv2相较于FaceLiVTv1将移动端推理延迟降低了22%,在移动设备上相比GhostFaceNets实现了最高30.8%的加速,并在多平台上相较EdgeFace和KANFace实现了20%至41%的延迟提升,同时保持更高的识别准确率。这些结果表明,FaceLiVTv2为实时人脸识别提供了一个实用且可部署的解决方案。代码可在https://github.com/novendrastywn/FaceLiVT获取。
cs.CV / 71 / 2604.09132

Strips as Tokens: Artist Mesh Generation with Native UV Segmentation

以条带作为Token:基于原生UV分割的艺术家网格生成
Xu, Rui, Qin, Dafei, Qiao, Kaichun, Dong, Qiujie, Pi, Huaijin, Zhang, Qixuan, Zhang, Longwen, Xu, Lan, Yu, Jingyi, Wang, Wenping, Komura, Taku
Abstract
Recent advancements in autoregressive transformers have demonstrated remarkable potential for generating artist-quality meshes. However, the token ordering strategies employed by existing methods typically fail to meet professional artist standards, where coordinate-based sorting yields inefficiently long sequences, and patch-based heuristics disrupt the continuous edge flow and structural regularity essential for high-quality modeling. To address these limitations, we propose Strips as Tokens (SATO), a novel framework with a token ordering strategy inspired by triangle strips. By constructing the sequence as a connected chain of faces that explicitly encodes UV boundaries, our method naturally preserves the organized edge flow and semantic layout characteristic of artist-created meshes. A key advantage of this formulation is its unified representation, enabling the same token sequence to be decoded into either a triangle or quadrilateral mesh. This flexibility facilitates joint training on both data types: large-scale triangle data provides fundamental structural priors, while high-quality quad data enhances the geometric regularity of the outputs. Extensive experiments demonstrate that SATO consistently outperforms prior methods in terms of geometric quality, structural coherence, and UV segmentation.
Chinese Translation
近年来,自回归Transformer在生成艺术家级别网格方面展现出显著潜力。然而,现有方法采用的Token排序策略通常难以满足专业艺术家的标准,其中基于坐标的排序导致序列过长且效率低下,而基于补丁的启发式方法则破坏了连续边缘流和结构规律性,而这些是高质量建模的关键。为了解决这些局限性,我们提出了Strips as Tokens(SATO),这是一种受三角形条带启发的新型Token排序框架。通过将序列构建为显式编码UV边界的面连接链,我们的方法自然地保留了艺术家网格所特有的有序边缘流和语义布局。这种表述的一个关键优势在于其统一表示,使得相同的Token序列既可解码为三角形网格,也可解码为四边形网格。这种灵活性促进了两种数据类型的联合训练:大规模三角形数据提供了基础结构先验,而高质量四边形数据则提升了输出的几何规律性。大量实验表明,SATO在几何质量、结构连贯性及UV分割方面均持续优于先前方法。
cs.CV / 72 / 2604.09142

Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching

结合法线的几何增强高效注意力调优用于鲁棒立体匹配
Li, Jiahao, Chen, Xinhong, Jiang, Zhengmin, Huang, Cheng, Li, Yung-Hui, Wang, Jianping
Abstract
Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization performance mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, sparse attention designs preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead, including Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attentions. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.
Chinese Translation
尽管过去十年图像驱动的立体匹配取得了显著进展,合成到真实的零样本(Syn-to-Real)泛化仍然是一个未解决的挑战。这种次优的泛化性能主要源于跨域偏移以及图像纹理中固有的病态歧义,尤其是在遮挡、无纹理、重复纹理和非朗伯(镜面/透明)区域。为提升Syn-to-Real泛化能力,我们提出了GREATEN框架,该框架引入表面法线作为域不变、物体内在且具有判别性的几何线索,以弥补图像纹理的局限性。所提框架包含三个关键组成部分。首先,门控上下文-几何融合(Gated Contextual-Geometric Fusion,GCGF)模块自适应抑制图像特征中不可靠的上下文线索,并将滤波后的图像特征与法线驱动的几何特征融合,构建域不变且具判别力的上下文-几何表示。其次,镜面-透明增强(Specular-Transparent Augmentation,STA)策略提升GCGF在非朗伯区域对误导视觉线索的鲁棒性。第三,稀疏注意力设计保留了GREAT-Stereo在处理遮挡和纹理相关歧义时的细粒度全局特征提取能力,同时大幅降低计算开销,包括稀疏空间注意力(Sparse Spatial Attention,SSA)、稀疏双匹配注意力(Sparse Dual-Matching Attention,SDMA)和简单体积注意力(Simple Volume Attention,SVA)。GREATEN-IGEV仅在合成数据如SceneFlow上训练,即实现了卓越的Syn-to-Real性能。具体而言,与FoundationStereo、Monster-Stereo和DEFOM-Stereo相比,分别在ETH3D、非朗伯Boosters和KITTI-2015数据集上误差降低了30%、8.5%和14.1%。此外,GREATEN-IGEV的运行速度比GREAT-IGEV快19.2%,并支持在Middlebury上进行高分辨率(3K)推理,视差范围可达768。
cs.CV / 73 / 2604.09145

Deep Light Pollution Removal in Night Cityscape Photographs

夜间城市景观照片中的深度光污染去除
Wang, Hao, Wu, Xiaolin, Zhang, Xi, Sun, Baoqing
Abstract
Nighttime photography is severely degraded by light pollution induced by pervasive artificial lighting in urban environments. After long-range scattering and spatial diffusion, unwanted artificial light overwhelms natural night luminance, generates skyglow that washes out the view of stars and celestial objects and produces halos and glow artifacts around light sources. Unlike nighttime dehazing, which aims to improve detail legibility through thick air, the objective of light pollution removal is to restore the pristine night appearance by neutralizing the radiative footprint of ground lighting. In this paper we introduce a physically-based degradation model that adds to the previous ones for nighttime dehazing two critical aspects; (i) anisotropic spread of directional light sources, and (ii) skyglow caused by invisible surface lights behind skylines. In addition, we construct a training strategy that leverages large generative model and synthetic-real coupling to compensate for the scarcity of paired real data and enhance generalization. Extensive experiments demonstrate that the proposed formulation and learning framework substantially reduce light pollution artifacts and better recover authentic night imagery than prior nighttime restoration methods.
Chinese Translation
夜间摄影受到城市环境中普遍人工照明引起的光污染的严重影响。在长距离散射和空间扩散后,令人厌烦的人工光源压倒了自然夜间亮度,产生了洗掉星星和天体视野的天光,并在光源周围产生光晕和光晕伪影。与夜间去雾旨在通过厚重的空气改善细节可读性不同,光污染去除的目标是通过中和地面照明的辐射足迹来恢复原始的夜间外观。在本文中,我们引入了一种基于物理的退化模型,该模型在之前的夜间去雾模型基础上增加了两个关键方面;(i) 定向光源的各向异性扩散,以及 (ii) 由天际线后面不可见的表面光源引起的天光。此外,我们构建了一种训练策略,利用大型生成模型和合成-真实耦合来弥补配对真实数据的稀缺性并增强泛化能力。大量实验表明,所提出的公式和学习框架显著减少了光污染伪影,并比以往的夜间恢复方法更好地恢复了真实的夜间图像。
cs.CV / 74 / 2604.09151

Benchmarking CNN- and Transformer-Based Models for Surgical Instrument Segmentation in Robotic-Assisted Surgery

基于CNN和Transformer模型的机器人辅助手术中外科器械分割的基准测试
Ameli, Sara
Abstract
Accurate segmentation of surgical instruments in robotic-assisted surgery is critical for enabling context-aware computer-assisted interventions, such as tool tracking, workflow analysis, and autonomous decision-making. In this study, we benchmark five deep learning architectures-UNet, UNet, DeepLabV3, Attention UNet, and SegFormer on the SAR-RARP50 dataset for multi-class semantic segmentation of surgical instruments in real-world radical prostatectomy videos. The models are trained with a compound loss function combining Cross Entropy and Dice loss to address class imbalance and capture fine object boundaries. Our experiments reveal that while convolutional models such as UNet and Attention UNet provide strong baseline performance, DeepLabV3 achieves results comparable to SegFormer, demonstrating the effectiveness of atrous convolution and multi-scale context aggregation in capturing complex surgical scenes. Transformer-based architectures like SegFormer further enhance global contextual understanding, leading to improved generalization across varying instrument appearances and surgical conditions. This work provides a comprehensive comparison and practical insights for selecting segmentation models in surgical AI applications, highlighting the trade-offs between convolutional and transformer-based approaches.
Chinese Translation
在机器人辅助手术中,准确分割外科器械对于实现上下文感知的计算机辅助干预至关重要,例如工具跟踪、工作流程分析和自主决策。本研究对五种深度学习架构——UNet、DeepLabV3、Attention UNet和SegFormer在SAR-RARP50数据集上进行了基准测试,以实现实际根治性前列腺切除术视频中外科器械的多类语义分割。模型使用复合损失函数进行训练,该函数结合了交叉熵损失和Dice损失,以解决类别不平衡问题并捕捉细微的物体边界。我们的实验表明,尽管卷积模型如UNet和Attention UNet提供了强有力的基线性能,但DeepLabV3的结果与SegFormer相当,展示了空洞卷积和多尺度上下文聚合在捕捉复杂外科场景中的有效性。基于Transformer的架构如SegFormer进一步增强了全局上下文理解,从而提高了在不同器械外观和手术条件下的泛化能力。本研究提供了对外科AI应用中分割模型选择的全面比较和实用见解,突出了卷积方法与基于Transformer的方法之间的权衡。
cs.CV / 75 / 2604.09164

Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

基于SSM的高效时空焦点适配器用于时序动作检测
Qiu, Yicheng, Yanai, Keiji
Abstract
Temporal human action detection aims to identify and localize action segments within untrimmed videos, serving as a pivotal task in video understanding. Despite the progress achieved by prior architectures like CNN and Transformer models, these continue to struggle with feature redundancy and degraded global dependency modeling capabilities when applied to long video sequences. These limitations severely constrain their scalability in real-world video analysis. State Space Models (SSMs) offer a promising alternative with linear long-term modeling and robust global temporal reasoning capabilities. Rethinking the application of SSMs in temporal modeling, this research constructs a novel framework for video human action detection. Specifically, we introduce the Efficient Spatial-Temporal Focal (ESTF) Adapter into the pre-trained layers. This module integrates the advantages of our proposed Temporal Boundary-aware SSM(TB-SSM) for temporal feature modeling with efficient processing of spatial features. We perform comprehensive and quantitative analyses across multiple benchmarks, comparing our proposed method against previous SSM-based and other structural methods. Extensive experiments demonstrate that our improved strategy significantly enhances both localization performance and robustness, validating the effectiveness of our proposed method.
Chinese Translation
时序人体动作检测旨在识别并定位未剪辑视频中的动作片段,是视频理解中的关键任务。尽管先前的架构如卷积神经网络(CNN)和Transformer模型取得了一定进展,但在处理长视频序列时,仍面临特征冗余和全局依赖建模能力下降的问题。这些限制严重制约了其在实际视频分析中的可扩展性。状态空间模型(SSM)作为一种具有线性长时建模和强大全局时序推理能力的有前景的替代方案,重新思考了SSM在时序建模中的应用,本文构建了一个用于视频人体动作检测的新型框架。具体而言,我们在预训练层中引入了高效时空焦点(ESTF)适配器。该模块融合了我们提出的时序边界感知SSM(Temporal Boundary-aware SSM,TB-SSM)在时序特征建模上的优势,同时高效处理空间特征。我们在多个基准数据集上进行了全面且定量的分析,将所提方法与先前基于SSM及其他结构的方法进行了比较。大量实验表明,我们改进的策略显著提升了定位性能和鲁棒性,验证了所提方法的有效性。
cs.CV / 76 / 2604.09167

MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

MAG-3D:用于三维理解的多智能体基础推理
Zheng, Henry, Fang, Chenyue, Huang, Rui, Wei, Siyuan, Liu, Xiao, Huang, Gao
Abstract
Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible training-free 3D grounded reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.
Chinese Translation
视觉-语言模型(VLMs)在多模态理解与推理方面取得了优异表现,然而三维场景中的基础推理仍未得到充分探索。有效的三维推理依赖于准确的基础定位:为回答开放式查询,模型必须首先识别复杂场景中与查询相关的对象和区域,随后推理它们的空间和几何关系。近期方法展示了基础三维推理的强大潜力,但通常依赖于领域内微调或手工设计的推理流程,限制了其灵活性及对新环境的零样本泛化能力。在本工作中,我们提出了MAG-3D,一种基于现成视觉-语言模型的无训练多智能体框架,用于基础三维推理。MAG-3D不依赖于特定任务训练或固定推理程序,而是动态协调专家智能体以解决三维推理的关键挑战。具体而言,我们设计了规划智能体用于任务分解与整体推理流程的协调,基础智能体负责自由形式的三维基础定位及从大量三维场景观测中检索相关帧,编码智能体通过可执行程序进行灵活的几何推理与显式验证。该多智能体协作设计实现了跨多样场景的灵活无训练三维基础推理,并在挑战性基准测试中达成了最先进的性能。
cs.CV / 77 / 2604.09168

ELT: Elastic Looped Transformers for Visual Generation

ELT:用于视觉生成的弹性循环变换器
Goyal, Sahil, Agrawal, Swayam, Anil, Gautham Govind, Jain, Prateek, Paul, Sujoy, Kusupati, Aditya
Abstract
We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To effectively train these models for image and video generation, we propose the idea of Intra-Loop Self Distillation (ILSD), where student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) to ensure consistency across the model's depth in a single training step. Our framework yields a family of elastic models from a single training run, enabling Any-Time inference capability with dynamic trade-offs between computational cost and generation quality, with the same parameter count. ELT significantly shifts the efficiency frontier for visual synthesis. With $4\times$ reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of $2.0$ on class-conditional ImageNet $256 \times 256$ and FVD of $72.8$ on class-conditional UCF-101.
Chinese Translation
我们提出了弹性循环变换器(Elastic Looped Transformers, ELT),这是一类基于递归变换器架构的高参数效率视觉生成模型。传统生成模型依赖于深层堆叠的独特变换器层,而我们的方法采用迭代的、权重共享的变换器模块,以显著减少参数数量,同时保持高合成质量。为了有效训练这些用于图像和视频生成的模型,我们提出了内部循环自蒸馏(Intra-Loop Self Distillation, ILSD)的概念,其中学生配置(中间循环)从教师配置(最大训练循环)中蒸馏,以确保模型深度的一致性,且仅需一步训练。我们的框架从单次训练中生成一系列弹性模型,实现了随时推理能力,并在计算成本与生成质量之间进行动态权衡,且参数数量保持不变。在等推理计算设置下,ELT在参数数量上减少了$4 imes$,在类条件ImageNet $256 imes 256$上达到了竞争性的FID为$2.0$,在类条件UCF-101上达到了FVD为$72.8$,显著推动了视觉合成的效率前沿。
cs.CV / 78 / 2604.09169

UniSemAlign: Text-Prototype Alignment with a Foundation Encoder for Semi-Supervised Histopathology Segmentation

UniSemAlign:基于基础编码器的文本-原型对齐用于半监督组织病理学分割
Thai, Le-Van, Nguyen, Tien Dat, Pham, Hoai Nhan, Thi, Lan Anh Dinh, Nguyen, Duy-Dong, Bui, Ngoc Lam Quang
Abstract
Semi-supervised semantic segmentation in computational pathology remains challenging due to scarce pixel-level annotations and unreliable pseudo-label supervision. We propose UniSemAlign, a dual-modal semantic alignment framework that enhances visual segmentation by injecting explicit class-level structure into pixel-wise learning. Built upon a pathology-pretrained Transformer encoder, UniSemAlign introduces complementary prototype-level and text-level alignment branches in a shared embedding space, providing structured guidance that reduces class ambiguity and stabilizes pseudo-label refinement. The aligned representations are fused with visual predictions to generate more reliable supervision for unlabeled histopathology images. The framework is trained end-to-end with supervised segmentation, cross-view consistency, and cross-modal alignment objectives. Extensive experiments on the GlaS and CRAG datasets demonstrate that UniSemAlign substantially outperforms recent semi-supervised baselines under limited supervision, achieving Dice improvements of up to 2.6% on GlaS and 8.6% on CRAG with only 10% labeled data, and strong improvements at 20% supervision. Code is available at: https://github.com/thailevann/UniSemAlign
Chinese Translation
由于像素级标注稀缺和伪标签监督不可靠,计算病理学中的半监督语义分割仍然面临挑战。我们提出了UniSemAlign,这是一种双模态语义对齐框架,通过将显式的类别级结构注入像素级学习来增强视觉分割。UniSemAlign基于病理预训练的Transformer编码器,引入了在共享嵌入空间中的互补原型级和文本级对齐分支,提供结构化指导,从而减少类别歧义并稳定伪标签的精炼。对齐的表示与视觉预测融合,以生成对未标记组织病理图像更可靠的监督。该框架通过监督分割、跨视图一致性和跨模态对齐目标进行端到端训练。在GlaS和CRAG数据集上的广泛实验表明,UniSemAlign在有限监督下显著优于最近的半监督基线,在仅使用10%标记数据的情况下,GlaS上实现了高达2.6%的Dice提升,CRAG上实现了8.6%的提升,并在20%监督下取得了强劲的改进。代码可在以下网址获取:https://github.com/thailevann/UniSemAlign
cs.CV / 79 / 2604.09181

MixFlow: Mixed Source Distributions Improve Rectified Flows

MixFlow:混合源分布改善整流流
Nayal, Nazir, Wewer, Christopher, Lenssen, Jan Eric
Abstract
Diffusion models and their variations, such as rectified flows, generate diverse and high-quality images, but they are still hindered by slow iterative sampling caused by the highly curved generative paths they learn. An important cause of high curvature, as shown by previous work, is independence between the source distribution (standard Gaussian) and the data distribution. In this work, we tackle this limitation by two complementary contributions. First, we attempt to break away from the standard Gaussian assumption by introducing $\kappa\texttt{-FC}$, a general formulation that conditions the source distribution on an arbitrary signal $\kappa$ that aligns it better with the data distribution. Then, we present MixFlow, a simple but effective training strategy that reduces the generative path curvatures and considerably improves sampling efficiency. MixFlow trains a flow model on linear mixtures of a fixed unconditional distribution and a $\kappa\texttt{-FC}$-based distribution. This simple mixture improves the alignment between the source and data, provides better generation quality with less required sampling steps, and accelerates the training convergence considerably. On average, our training procedure improves the generation quality by 12\% in FID compared to standard rectified flow and 7\% compared to previous baselines under a fixed sampling budget. Code available at: $\href{https://github.com/NazirNayal8/MixFlow}{https://github.com/NazirNayal8/MixFlow}$
Chinese Translation
扩散模型及其变体,如整流流,能够生成多样且高质量的图像,但仍受到其学习到的高度弯曲生成路径导致的缓慢迭代采样的限制。先前的研究表明,高曲率的一个重要原因是源分布(标准高斯分布)与数据分布之间的独立性。在本研究中,我们通过两个互补的贡献来解决这一限制。首先,我们尝试通过引入 $ ext{kappa} exttt{-FC}$,一种将源分布条件化于任意信号 $ ext{kappa}$ 的通用公式,打破标准高斯假设,从而使其与数据分布更好地对齐。然后,我们提出了 MixFlow,一种简单但有效的训练策略,减少了生成路径的曲率,并显著提高了采样效率。MixFlow 在固定无条件分布和基于 $ ext{kappa} exttt{-FC}$ 的分布的线性混合上训练流模型。这种简单的混合改善了源分布与数据分布之间的对齐,提供了更好的生成质量,所需的采样步骤更少,并显著加速了训练收敛。平均而言,我们的训练过程在固定采样预算下,相较于标准整流流,生成质量提高了 12 ext{%},相较于先前基准提高了 7 ext{%}。代码可在:$ ext{https://github.com/NazirNayal8/MixFlow}$
cs.CV / 80 / 2604.09197

Vision Transformers for Preoperative CT-Based Prediction of Histopathologic Chemotherapy Response Score in High-Grade Serous Ovarian Carcinoma

基于术前CT的高分化浆液性卵巢癌组织病理化学治疗反应评分预测的视觉Transformer方法
Fati, Francesca, Coutinho, Felipe, Reinius, Marika, Rosanu, Marina, Funingana, Gabriel, De Vitis, Luigi, Schivardi, Gabriella, Clayton, Hannah, Traversa, Alice, Gao, Zeyu, Penteado, Guilherme, Gao, Shangqi, Pastori, Francesco, Woitek, Ramona, Ghioni, Maria Cristina, Aletti, Giovanni Damiano, Jimenez-Linan, Mercedes, Burge, Sarah, Colombo, Nicoletta, Sala, Evis, Spadea, Maria Francesca, Kline, Timothy L., Brenton, James D., Cardoso, Jaime, Multinu, Francesco, De Momi, Elena, Crispin-Ortuzar, Mireia, Machado, Ines P.
Abstract
Purpose. High-grade serous ovarian carcinoma (HGSOC) is characterized by pronounced biological and spatial heterogeneity and is frequently diagnosed at an advanced stage. Neoadjuvant chemotherapy (NACT) followed by delayed primary surgery is commonly employed in patients unsuitable for primary cytoreduction. The Chemotherapy Response Score (CRS) is a validated histopathological biomarker of response to NACT, but it is only available postoperatively. In this study, we investigate whether pre-treatment computed tomography (CT) imaging and clinical data can be used to predict CRS as an investigational decision-support adjunct to inform multidisciplinary team (MDT) discussions regarding expected treatment response. Methods. We proposed a 2.5D multimodal deep learning framework that processes lesion-dense omental slices using a pre-trained Vision Transformer encoder and integrates the resulting visual representations with clinical variables through an intermediate fusion module to predict CRS. Results. Our multimodal model, integrating imaging and clinical data, achieved a ROC-AUC of 0.95 alongside 95% accuracy and 80% precision on the internal test cohort (IEO, n=41 patients). On the external test set (OV04, n=70 patients), it achieved a ROC-AUC of 0.68, alongside 67% accuracy and 75% precision. Conclusion. These preliminary results demonstrate the feasibility of transformer-based deep learning for preoperative prediction of CRS in HGSOC using routine clinical data and CT imaging. As an investigational, pre-treatment decision-support tool, this approach may assist MDT discussions by providing early, non-invasive estimates of treatment response.
Chinese Translation
目的:高分化浆液性卵巢癌(HGSOC)表现出显著的生物学和空间异质性,且常在晚期被诊断。对于不适合初次细胞减灭术的患者,常采用新辅助化疗(NACT)后延迟进行初次手术。化疗反应评分(CRS)是一种经过验证的NACT反应组织病理学生物标志物,但仅在术后可获得。本研究旨在探讨是否可利用术前计算机断层扫描(CT)影像及临床数据预测CRS,作为辅助多学科团队(MDT)讨论预期治疗反应的决策支持工具。方法:我们提出了一种2.5D多模态深度学习框架,利用预训练的视觉Transformer编码器处理病灶密集的网膜切片,并通过中间融合模块将视觉表征与临床变量整合,以预测CRS。结果:我们的多模态模型融合影像与临床数据,在内部测试队列(IEO,n=41例患者)中实现了ROC-AUC 0.95,准确率95%,精确率80%;在外部测试集(OV04,n=70例患者)中实现ROC-AUC 0.68,准确率67%,精确率75%。结论:这些初步结果表明,基于Transformer的深度学习方法结合常规临床数据和CT影像,能够实现HGSOC术前CRS预测的可行性。作为一种术前的探索性决策支持工具,该方法可通过提供早期、非侵入性的治疗反应估计,辅助MDT讨论。
cs.CV / 81 / 2604.09199

Globally Optimal Pose from Orthographic Silhouettes

从正投影轮廓中获取全局最优姿态
Sengupta, Agniva, Kuş, Dilara, Li, Jianning, Zachow, Stefan
Abstract
We solve the problem of determining the pose of known shapes in $\mathbb{R}^3$ from their unoccluded silhouettes. The pose is determined up to global optimality using a simple yet under-explored property of the area-of-silhouette: its continuity w.r.t trajectories in the rotation space. The proposed method utilises pre-computed silhouette-signatures, modelled as a response surface of the area-of-silhouettes. Querying this silhouette-signature response surface for pose estimation leads to a strong branching of the rotation search space, making resolution-guided candidate search feasible. Additionally, we utilise the aspect ratio of 2D ellipses fitted to projected silhouettes as an auxiliary global shape signature to accelerate the pose search. This combined strategy forms the first method to efficiently estimate globally optimal pose from just the silhouettes, without being guided by correspondences, for any shape, irrespective of its convexity and genus. We validate our method on synthetic and real examples, demonstrating significantly improved accuracy against comparable approaches. Code, data, and supplementary in: https://agnivsen.github.io/pose-from-silhouette/
Chinese Translation
我们解决了从已知形状在 $ ext{R}^3$ 中的未遮挡轮廓确定姿态的问题。通过利用轮廓面积的一个简单但尚未深入探讨的特性——其在旋转空间轨迹上的连续性,我们可以在全局最优性上确定姿态。所提出的方法利用预计算的轮廓特征,建模为轮廓面积的响应面。查询该轮廓特征响应面进行姿态估计,导致旋转搜索空间的强分支,使得基于分辨率的候选搜索成为可能。此外,我们利用拟合到投影轮廓的二维椭圆的纵横比作为辅助全局形状特征,以加速姿态搜索。这个组合策略形成了第一种方法,可以仅通过轮廓高效估计全局最优姿态,而不依赖于对应关系,适用于任何形状,无论其凸性和属数如何。我们在合成和真实示例上验证了我们的方法,显示出与可比方法相比显著提高的准确性。代码、数据和补充材料见:https://agnivsen.github.io/pose-from-silhouette/
cs.CV / 82 / 2604.09201

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

CT-1:视觉-语言-相机模型将空间推理知识转移至相机可控视频生成
Zhao, Haoyu, Zhang, Zihao, Gu, Jiaxi, Chen, Haoran, Zheng, Qingping, Tang, Pin, Jin, Yeyin, Zhang, Yuang, Cheng, Junqi, Lu, Zenghui, Shu, Peng, Wu, Zuxuan, Jiang, Yu-Gang
Abstract
Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful and high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.
Chinese Translation
相机可控视频生成旨在合成具有灵活且物理上合理的相机运动的视频。然而,现有方法要么从文本提示中提供不精确的相机控制,要么依赖于劳动密集型的手动相机轨迹参数,这限制了它们在自动化场景中的应用。为了解决这些问题,我们提出了一种新颖的视觉-语言-相机模型,称为CT-1(Camera Transformer 1),这是一个专门设计的模型,旨在通过准确估计相机轨迹将空间推理知识转移到视频生成中。CT-1建立在视觉-语言模块和扩散变换器模型的基础上,采用基于小波的正则化损失,在频域中有效学习复杂的相机轨迹分布。这些轨迹被整合到视频扩散模型中,以实现与用户意图一致的空间感知相机控制。为了促进CT-1的训练,我们设计了一个专门的数据整理管道,并构建了CT-200K,这是一个包含超过4700万帧的大规模数据集。实验结果表明,我们的框架成功弥合了空间推理与视频合成之间的差距,生成了真实且高质量的相机可控视频,并在相机控制精度上比之前的方法提高了25.7%。
cs.CV / 83 / 2604.09206

Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception

Long-SCOPE:完全稀疏的长距离协作3D感知
Wang, Jiahao, Xu, Zikun, Zhang, Yuner, Jiang, Zhongwei, Lu, Chenyang, Yang, Shuocheng, Wang, Yuxuan, Zhong, Jiaru, Zhang, Chuang, Xu, Shaobing, Wang, Jianqiang
Abstract
Cooperative 3D perception via Vehicle-to-Everything communication is a promising paradigm for enhancing autonomous driving, offering extended sensing horizons and occlusion resolution. However, the practical deployment of existing methods is hindered at long distances by two critical bottlenecks: the quadratic computational scaling of dense BEV representations and the fragility of feature association mechanisms under significant observation and alignment errors. To overcome these limitations, we introduce Long-SCOPE, a fully sparse framework designed for robust long-distance cooperative 3D perception. Our method features two novel components: a Geometry-guided Query Generation module to accurately detect small, distant objects, and a learnable Context-Aware Association module that robustly matches cooperative queries despite severe positional noise. Experiments on the V2X-Seq and Griffin datasets validate that Long-SCOPE achieves state-of-the-art performance, particularly in challenging 100-150 m long-range settings, while maintaining highly competitive computation and communication costs.
Chinese Translation
通过车对万物(Vehicle-to-Everything)通信实现的协作3D感知是一种有前景的范式,能够增强自动驾驶,提供更广阔的感知视野和遮挡解决方案。然而,现有方法在长距离上的实际部署受到两个关键瓶颈的制约:密集鸟瞰图(BEV)表示的二次计算扩展性以及在显著观察和对齐误差下特征关联机制的脆弱性。为了解决这些限制,我们提出了Long-SCOPE,一个专为稳健的长距离协作3D感知设计的完全稀疏框架。我们的方法具有两个新颖的组件:一个几何引导查询生成模块,用于准确检测小型远距离物体,以及一个可学习的上下文感知关联模块,能够在严重位置噪声的情况下稳健地匹配协作查询。在V2X-Seq和Griffin数据集上的实验验证了Long-SCOPE实现了最先进的性能,特别是在具有挑战性的100-150米长距离设置中,同时保持了高度竞争的计算和通信成本。
cs.CV / 84 / 2604.09210

Adding Another Dimension to Image-based Animal Detection

为基于图像的动物检测增加另一个维度
Shukla, Vandita, Remondino, Fabio, Risse, Benjamin
Abstract
Monocular imaging of animals inherently reduces 3D structures to 2D projections. Detection algorithms lead to 2D bounding boxes that lack information about animal's orientation relative to the camera. To build 3D detection methods for RGB animal images, there is a lack of labeled datasets; such labeling processes require 3D input streams along with RGB data. We present a pipeline that utilises Skinned Multi Animal Linear models to estimate 3D bounding boxes and to project them as robust labels into 2D image space using a dedicated camera pose refinement algorithm. To assess which sides of the animal are captured, cuboid face visibility metrics are computed. These 3D bounding boxes and metrics form a crucial step toward developing and benchmarking future monocular 3D animal detection algorithms. We evaluate our method on the Animal3D dataset, demonstrating accurate performance across species and settings.
Chinese Translation
单目成像动物本质上将三维结构简化为二维投影。检测算法生成的二维边界框缺乏动物相对于相机的朝向信息。为了构建RGB动物图像的三维检测方法,缺乏标注数据集;这样的标注过程需要三维输入流和RGB数据。我们提出了一种利用Skinned Multi Animal Linear模型来估计三维边界框的流程,并通过专门的相机姿态优化算法将其作为稳健的标签投影到二维图像空间。为了评估捕获到的动物的哪些侧面,计算了立方体面可见性指标。这些三维边界框和指标是开发和基准测试未来单目三维动物检测算法的重要一步。我们在Animal3D数据集上评估了我们的方法,展示了在不同物种和设置下的准确性能。
cs.CV / 85 / 2604.09213

SHIFT: Steering Hidden Intermediates in Flow Transformers

SHIFT:引导流式Transformer中的隐藏中间态
Konovalova, Nina, Kuznetsov, Andrey, Alanov, Aibek
Abstract
Diffusion models have become leading approaches for high-fidelity image generation. Recent DiT-based diffusion models, in particular, achieve strong prompt adherence while producing high-quality samples. We propose SHIFT, a simple but effective and lightweight framework for concept removal in DiT diffusion models via targeted manipulation of intermediate activations at inference time, inspired by activation steering in large language models. SHIFT learns steering vectors that are dynamically applied to selected layers and timesteps to suppress unwanted visual concepts while preserving the prompt's remaining content and overall image quality. Beyond suppression, the same mechanism can shift generations into a desired \emph{style domain} or bias samples toward adding or changing target objects. We demonstrate that SHIFT provides effective and flexible control over DiT generation across diverse prompts and targets without time-consuming retraining.
Chinese Translation
扩散模型已成为高保真图像生成的主流方法。尤其是基于DiT的扩散模型,在生成高质量样本的同时,实现了对提示词的强一致性。我们提出了SHIFT,一种简单但高效且轻量的框架,通过在推理阶段针对性地操控中间激活,实现DiT扩散模型中的概念移除,灵感来源于大型语言模型中的激活引导。SHIFT学习引导向量,动态应用于选定的层和时间步,以抑制不需要的视觉概念,同时保留提示词的其余内容及整体图像质量。除了抑制功能,该机制还可将生成结果引导至期望的风格域,或使样本偏向添加或改变目标对象。我们展示了SHIFT在无需耗时重训练的情况下,能够针对多样化提示词和目标,提供对DiT生成过程的有效且灵活的控制。
cs.CV / 86 / 2604.09220

TinyNeRV: Compact Neural Video Representations via Capacity Scaling, Distillation, and Low-Precision Inference

TinyNeRV:通过容量缩放、蒸馏和低精度推理实现紧凑的神经视频表示
Akhtar, Muhammad Hannan, Amer, Ihab, Shanableh, Tamer
Abstract
Implicit neural video representations encode entire video sequences within the parameters of a neural network and enable constant time frame reconstruction. Recent work on Neural Representations for Videos (NeRV) has demonstrated competitive reconstruction performance while avoiding the sequential decoding process of conventional video codecs. However, most existing studies focus on moderate or high capacity models, leaving the behavior of extremely compact configurations required for constrained environments insufficiently explored. This paper presents a systematic study of tiny NeRV architectures designed for efficient deployment. Two lightweight configurations, NeRV-T and NeRV-T+, are introduced and evaluated across multiple video datasets in order to analyze how aggressive capacity reduction affects reconstruction quality, computational complexity, and decoding throughput. Beyond architectural scaling, the work investigates strategies for improving the performance of compact models without increasing inference cost. Knowledge distillation with frequency-aware focal supervision is explored to enhance reconstruction fidelity in low-capacity networks. In addition, the impact of lowprecision inference is examined through both post training quantization and quantization aware training to study the robustness of tiny models under reduced numerical precision. Experimental results demonstrate that carefully designed tiny NeRV variants can achieve favorable quality efficiency trade offs while substantially reducing parameter count, computational cost, and memory requirements. These findings provide insight into the practical limits of compact neural video representations and offer guidance for deploying NeRV style models in resource constrained and real-time environments. The official implementation is available at https: //github.com/HannanAkhtar/TinyNeRV-Implementation.
Chinese Translation
隐式神经视频表示将整个视频序列编码在神经网络的参数中,并实现恒定时间的帧重建。近期关于视频的神经表示(NeRV)的研究展示了竞争力的重建性能,同时避免了传统视频编解码器的顺序解码过程。然而,大多数现有研究集中在中等或高容量模型上,对于在受限环境中所需的极其紧凑配置的行为探讨不足。本文系统性地研究了为高效部署设计的微型 NeRV 架构。介绍并评估了两种轻量级配置,NeRV-T 和 NeRV-T+,在多个视频数据集上分析激进的容量缩减如何影响重建质量、计算复杂性和解码吞吐量。除了架构缩放外,本文还探讨了在不增加推理成本的情况下提高紧凑模型性能的策略。探索了使用频率感知焦点监督的知识蒸馏,以增强低容量网络的重建保真度。此外,通过后训练量化和量化感知训练研究低精度推理的影响,以考察微型模型在降低数值精度下的鲁棒性。实验结果表明,精心设计的微型 NeRV 变体可以在显著减少参数数量、计算成本和内存需求的同时,实现良好的质量效率权衡。这些发现为紧凑神经视频表示的实际限制提供了见解,并为在资源受限和实时环境中部署 NeRV 风格模型提供了指导。官方实现可在 https://github.com/HannanAkhtar/TinyNeRV-Implementation 获取。
cs.CV / 87 / 2604.09231

Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation

Hitem3D 2.0:多视角引导的原生三维纹理生成
He, Huiang, Zhao, Shengchu, Huang, Jianwen, Li, Jie, Wu, Jiaqi, Zhang, Hu, Tang, Pei, Zheng, Heliang, Li, Yukun, Jia, Rongfei
Abstract
Although recent advances have improved the quality of 3D texture generation, existing methods still struggle with incomplete texture coverage, cross-view inconsistency, and misalignment between geometry and texture. To address these limitations, we propose Hitem3D 2.0, a multi-view guided native 3D texture generation framework that enhances texture quality through the integration of 2D multi-view generation priors and native 3D texture representations. Hitem3D 2.0 comprises two key components: a multi-view synthesis framework and a native 3D texture generation model. The multi-view generation is built upon a pre-trained image editing backbone and incorporates plug-and-play modules that explicitly promote geometric alignment, cross-view consistency, and illumination uniformity, thereby enabling the synthesis of high-fidelity multi-view images. Conditioned on the generated views and 3D geometry, the native 3D texture generation model projects multi-view textures onto 3D surfaces while plausibly completing textures in unseen regions. Through the integration of multi-view consistency constraints with native 3D texture modeling, Hitem3D 2.0 significantly improves texture completeness, cross-view coherence, and geometric alignment. Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods in terms of texture detail, fidelity, consistency, coherence, and alignment.
Chinese Translation
尽管近期进展提升了三维纹理生成的质量,现有方法仍面临纹理覆盖不完整、跨视角不一致以及几何与纹理错位等问题。为解决这些限制,我们提出了Hitem3D 2.0,一种多视角引导的原生三维纹理生成框架,通过融合二维多视角生成先验与原生三维纹理表示来增强纹理质量。Hitem3D 2.0包含两个关键组成部分:多视角合成框架和原生三维纹理生成模型。多视角生成基于预训练的图像编辑骨干网络,结合即插即用模块,显式促进几何对齐、跨视角一致性及光照均匀性,从而实现高保真多视角图像的合成。在生成的视图和三维几何条件下,原生三维纹理生成模型将多视角纹理投射到三维表面,同时合理补全不可见区域的纹理。通过多视角一致性约束与原生三维纹理建模的融合,Hitem3D 2.0显著提升了纹理的完整性、跨视角连贯性及几何对齐。实验结果表明,Hitem3D 2.0在纹理细节、保真度、一致性、连贯性和对齐度方面均优于现有方法。
cs.CV / 88 / 2604.09232

Neural Distribution Prior for LiDAR Out-of-Distribution Detection

用于LiDAR异常检测的神经分布先验
Li, Zizhao, Xiang, Zhengkang, Ao, Jiayang, Liu, Feng, West, Joseph, Khoshelham, Kourosh
Abstract
LiDAR-based perception is critical for autonomous driving due to its robustness to poor lighting and visibility conditions. Yet, current models operate under the closed-set assumption and often fail to recognize unexpected out-of-distribution (OOD) objects in the open world. Existing OOD scoring functions exhibit limited performance because they ignore the pronounced class imbalance inherent in LiDAR OOD detection and assume a uniform class distribution. To address this limitation, we propose the Neural Distribution Prior (NDP), a framework that models the distributional structure of network predictions and adaptively reweights OOD scores based on alignment with a learned distribution prior. NDP dynamically captures the logit distribution patterns of training data and corrects class-dependent confidence bias through an attention-based module. We further introduce a Perlin noise-based OOD synthesis strategy that generates diverse auxiliary OOD samples from input scans, enabling robust OOD training without external datasets. Extensive experiments on the SemanticKITTI and STU benchmarks demonstrate that NDP substantially improves OOD detection performance, achieving a point-level AP of 61.31\% on the STU test set, which is more than 10$\times$ higher than the previous best result. Our framework is compatible with various existing OOD scoring formulations, providing an effective solution for open-world LiDAR perception.
Chinese Translation
基于LiDAR的感知对于自动驾驶至关重要,因为它在光照和能见度差的条件下表现出色。然而,当前模型在封闭集假设下运行,往往无法识别开放世界中意外的异常(OOD)对象。现有的OOD评分函数表现有限,因为它们忽略了LiDAR OOD检测中固有的显著类别不平衡,并假设类别分布均匀。为了解决这一局限性,我们提出了神经分布先验(Neural Distribution Prior, NDP),这是一个建模网络预测分布结构的框架,并根据与学习到的分布先验的对齐情况自适应地重新加权OOD评分。NDP动态捕捉训练数据的logit分布模式,并通过基于注意力的模块纠正类别依赖的置信度偏差。我们进一步引入了一种基于Perlin噪声的OOD合成策略,从输入扫描中生成多样的辅助OOD样本,使得在没有外部数据集的情况下能够进行稳健的OOD训练。在SemanticKITTI和STU基准上的广泛实验表明,NDP显著提高了OOD检测性能,在STU测试集上达到了61.31%的点级AP,超过了之前最佳结果的10倍以上。我们的框架与多种现有的OOD评分公式兼容,为开放世界的LiDAR感知提供了一种有效的解决方案。
cs.CV / 89 / 2604.09249

FashionStylist: An Expert Knowledge-enhanced Multimodal Dataset for Fashion Understanding

FashionStylist:一个融合专家知识的多模态时尚理解数据集
Feng, Kaidong, Huang, Zhuoxuan, Guo, Huizhong, Jin, Yuting, Chen, Xinyu, Liang, Yue, Gai, Yifei, Zhou, Li, Ma, Yunshan, Sun, Zhu
Abstract
Fashion understanding requires both visual perception and expert-level reasoning about style, occasion, compatibility, and outfit rationale. However, existing fashion datasets remain fragmented and task-specific, often focusing on item attributes, outfit co-occurrence, or weak textual supervision, and thus provide limited support for holistic outfit understanding. In this paper, we introduce FashionStylist, an expert-annotated benchmark for holistic and expert-level fashion understanding. Constructed through a dedicated fashion-expert annotation pipeline, FashionStylist provides professionally grounded annotations at both the item and outfit levels. It supports three representative tasks: outfit-to-item grounding, outfit completion, and outfit evaluation. These tasks cover realistic item recovery from complex outfits with layering and accessories, compatibility-aware composition beyond co-occurrence matching, and expert-level assessment of style, season, occasion, and overall coherence. Experimental results show that FashionStylist serves not only as a unified benchmark for multiple fashion tasks, but also as an effective training resource for improving grounding, completion, and outfit-level semantic evaluation in MLLM-based fashion systems.
Chinese Translation
时尚理解不仅需要视觉感知,还需具备关于风格、场合、搭配及穿搭逻辑的专家级推理能力。然而,现有的时尚数据集多为碎片化且针对特定任务,通常侧重于单品属性、穿搭共现或弱文本监督,因此难以支持整体穿搭的全面理解。本文提出了FashionStylist,一个由专家注释的整体且具备专家级水平的时尚理解基准数据集。FashionStylist通过专门的时尚专家注释流程构建,提供了基于专业知识的单品及穿搭层面的标注。该数据集支持三项代表性任务:穿搭到单品的定位(outfit-to-item grounding)、穿搭补全(outfit completion)以及穿搭评估(outfit evaluation)。这些任务涵盖了从复杂层叠及配饰丰富的穿搭中恢复真实单品、超越共现匹配的兼容性感知组合,以及对风格、季节、场合和整体协调性的专家级评估。实验结果表明,FashionStylist不仅作为多项时尚任务的统一基准,同时也是提升基于多模态大语言模型(MLLM)时尚系统中定位、补全及穿搭层面语义评估能力的有效训练资源。
cs.CV / 90 / 2604.09253

Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization

马赛克:通过多视角集成优化对闭源视觉语言模型的多模态越狱攻击
Lan, Yuqin, Li, Gen, Hu, Yuanze, Shen, Weihao, Fan, Zhaoxin, Wu, Faguo, Zhang, Xiao, Yang, Laurence T., Zheng, Zhiming
Abstract
Vision-Language Models (VLMs) are powerful but remain vulnerable to multimodal jailbreak attacks. Existing attacks mainly rely on either explicit visual prompt attacks or gradient-based adversarial optimization. While the former is easier to detect, the latter produces subtle perturbations that are less perceptible, but is usually optimized and evaluated under homogeneous open-source surrogate-target settings, leaving its effectiveness on commercial closed-source VLMs under heterogeneous settings unclear. To examine this issue, we study different surrogate-target settings and observe a consistent gap between homogeneous and heterogeneous settings, a phenomenon we term surrogate dependency. Motivated by this finding, we propose Mosaic, a Multi-view ensemble optimization framework for multimodal jailbreak against closed-source VLMs, which alleviates surrogate dependency under heterogeneous surrogate-target settings by reducing over-reliance on any single surrogate model and visual view. Specifically, Mosaic incorporates three core components: a Text-Side Transformation module, which perturbs refusal-sensitive lexical patterns; a Multi-View Image Optimization module, which updates perturbations under diverse cropped views to avoid overfitting to a single visual view; and a Surrogate Ensemble Guidance module, which aggregates optimization signals from multiple surrogate VLMs to reduce surrogate-specific bias. Extensive experiments on safety benchmarks demonstrate that Mosaic achieves state-of-the-art Attack Success Rate and Average Toxicity against commercial closed-source VLMs.
Chinese Translation
视觉语言模型(VLMs)功能强大,但仍然容易受到多模态越狱攻击。现有攻击主要依赖于显式视觉提示攻击或基于梯度的对抗优化。前者更易于检测,而后者产生的微小扰动不易察觉,但通常是在同质的开源替代目标设置下进行优化和评估,因此在异质设置下对商业闭源VLM的有效性尚不清楚。为了解决这一问题,我们研究了不同的替代目标设置,并观察到同质和异质设置之间存在一致的差距,这一现象我们称之为替代依赖性。基于这一发现,我们提出了马赛克(Mosaic),一个针对闭源VLM的多模态越狱的多视角集成优化框架,它通过减少对任何单一替代模型和视觉视图的过度依赖,缓解了异质替代目标设置下的替代依赖性。具体而言,马赛克包含三个核心组件:一个文本侧变换模块,用于扰动拒绝敏感的词汇模式;一个多视角图像优化模块,在多样化的裁剪视图下更新扰动,以避免对单一视觉视图的过拟合;以及一个替代集成引导模块,从多个替代VLM中聚合优化信号,以减少替代特定的偏差。在安全基准上的广泛实验表明,马赛克在商业闭源VLM上实现了最先进的攻击成功率和平均毒性。
cs.CV / 91 / 2604.09260

Beyond Segmentation: Structurally Informed Facade Parsing from Imperfect Images

超越分割:基于结构信息的建筑立面解析方法处理不完美图像
Janicki, Maciej, Plocharski, Aleksander, Musialski, Przemyslaw
Abstract
Standard object detectors typically treat architectural elements independently, often resulting in facade parsings that lack the structural coherence required for downstream procedural reconstruction. We address this limitation by augmenting the YOLOv8 training objective with a custom lightweight alignment loss. This regularization encourages grid-consistent arrangements of bounding boxes during training, effectively injecting geometric priors without altering the standard inference pipeline. Experiments on the CMP dataset demonstrate that our method successfully improves structural regularity, correcting alignment errors caused by perspective and occlusion while maintaining a controllable trade-off with standard detection accuracy.
Chinese Translation
标准的目标检测器通常将建筑元素独立处理,导致的立面解析结果缺乏下游程序化重建所需的结构一致性。针对这一局限,我们通过在YOLOv8训练目标中引入定制的轻量级对齐损失进行增强。该正则化项在训练过程中促进边界框的网格一致排列,有效注入几何先验信息,同时不改变标准推理流程。在CMP数据集上的实验表明,我们的方法成功提升了结构规则性,纠正了由透视和遮挡引起的对齐误差,并在保持标准检测准确率的同时实现了可控的权衡。
cs.CV / 92 / 2604.09304

GeRM: A Generative Rendering Model From Physically Realistic to Photorealistic

GeRM:一种从物理真实到照片真实的生成渲染模型
Lu, Jiayuan, Xie, Rengan, Jin, Xuancheng, Wu, Zhizhen, Ye, Qi, Xie, Tian, Bao, Hujun, Huo, Rui Wang. Yuchi
Abstract
For decades, Physically-Based Rendering (PBR) is the fundation of synthesizing photorealisitic images, and therefore sometimes roughly referred as Photorealistic Rendering (PRR). While PBR is indeed a mathematical simulation of light transport that guarantees physical reality, photorealism has additional reliance on the realistic digital model of geometry and appearance of the real world, leaving a barely explored gap from PBR to PRR (P2P). Consequently, the path toward photorealism faces a critical dilemma: the explicit simulation of PRR encumbered by unreachable realistic digital models for real-world existence, while implicit generation models sacrifice controllability and geometric consistency. Based on this insight, this paper presents the problem, data, and approach of mitigating P2P gap, followed by the first multi-modal generative rendering model, dubbed GeRM, to unify PBR and PRR. GeRM integrates physical attributes like G-buffers with text prompts, and progressive incremental injection to generate controllable photorealistic images, allowing users to fluidly navigate the continuum between strict physical fidelity and perceptual photorealism. Technically, we model the transition between PBR and PRR images as a distribution transfer and aim to learn a distribution transfer vector field (DTV Field) to guide this process. To define the learning objective, we first leverage a multi-agent VLM framework to construct an expert-guided pairwise P2P transfer dataset, named P2P-50K, where each paired sample in the dataset corresponds to a transfer vector in the DTV Field. Subsequently, we propose a multi-condition ControlNet to learn the DTV Field, which synthesizes PBR images and progressively transitions them into PRR images, guided by G-buffers, text prompts, and cues for enhanced regions.
Chinese Translation
数十年来,基于物理的渲染(Physically-Based Rendering,PBR)一直是合成照片真实图像的基础,因此有时被粗略地称为照片真实渲染(Photorealistic Rendering,PRR)。虽然PBR确实是保证物理真实性的光传输数学模拟,但照片真实还额外依赖于对现实世界几何和外观的真实数字模型,这导致从PBR到PRR(P2P)之间存在一个鲜有探索的鸿沟。因此,通向照片真实的路径面临关键困境:PRR的显式模拟受限于难以获得的真实数字模型,而隐式生成模型则牺牲了可控性和几何一致性。基于这一洞察,本文阐述了缓解P2P鸿沟的问题、数据及方法,随后提出了首个多模态生成渲染模型GeRM,以统一PBR与PRR。GeRM融合了物理属性(如G缓冲区)与文本提示,并采用渐进式增量注入生成可控的照片真实图像,使用户能够流畅地在严格的物理逼真与感知的照片真实之间导航。从技术上讲,我们将PBR与PRR图像之间的转换建模为分布转移,旨在学习一个分布转移向量场(Distribution Transfer Vector Field,DTV Field)以指导该过程。为定义学习目标,我们首先利用多智能体视觉语言模型框架构建了一个专家指导的成对P2P转移数据集P2P-50K,其中数据集中每对样本对应DTV Field中的一个转移向量。随后,我们提出了一种多条件ControlNet来学习DTV Field,该网络以G缓冲区、文本提示及增强区域线索为引导,合成PBR图像并逐步将其转变为PRR图像。
cs.CV / 93 / 2604.09305

VAGNet: Vision-based accident anticipation with global features

VAGNet:基于视觉的全局特征交通事故预测
Vipulananthan, Vipooshan, Chitraranjan, Charith D.
Abstract
Traffic accidents are a leading cause of fatalities and injuries across the globe. Therefore, the ability to anticipate hazardous situations in advance is essential. Automated accident anticipation enables timely intervention through driver alerts and collision avoidance maneuvers, forming a key component of advanced driver assistance systems. In autonomous driving, such predictive capabilities support proactive safety behaviors, such as initiating defensive driving and human takeover when required. Using dashcam video as input offers a cost-effective solution, but it is challenging due to the complexity of real-world driving scenes. Accident anticipation systems need to operate in real-time. However, current methods involve extracting features from each detected object, which is computationally intensive. We propose VAGNet, a deep neural network that learns to predict accidents from dash-cam video using global features of traffic scenes without requiring explicit object-level features. The network consists of transformer and graph modules, and we use the vision foundation model VideoMAE-V2 for global feature extraction. Experiments on four benchmark datasets (DAD, DoTA, DADA, and Nexar) show that our method anticipates accidents with higher average precision and mean time-to-accident while being computationally more efficient compared to existing methods.
Chinese Translation
交通事故是全球范围内导致死亡和伤害的主要原因之一,因此提前预测危险情况的能力至关重要。自动化事故预测通过驾驶员警报和碰撞规避动作实现及时干预,是高级驾驶辅助系统的重要组成部分。在自动驾驶中,这种预测能力支持主动安全行为,如启动防御性驾驶和必要时的人类接管。以行车记录仪视频作为输入提供了一种成本效益高的解决方案,但由于现实驾驶场景的复杂性,面临诸多挑战。事故预测系统需实时运行,而现有方法通常通过提取每个检测到的对象特征,计算开销较大。我们提出了VAGNet,一种深度神经网络,利用交通场景的全局特征从行车记录仪视频中学习预测事故,无需显式的对象级特征。该网络由Transformer和图模块组成,采用视觉基础模型VideoMAE-V2进行全局特征提取。在四个基准数据集(DAD、DoTA、DADA和Nexar)上的实验表明,我们的方法在平均精度和平均事故提前时间方面均优于现有方法,同时计算效率更高。
cs.CV / 94 / 2604.09324

Structure-Aware Fine-Grained Gaussian Splatting for Expressive Avatar Reconstruction

面向结构感知的细粒度高斯散点法用于表现力丰富的虚拟人重建
Su, Yuze, Wang, Hongsong, Gui, Jie, Wang, Liang
Abstract
Reconstructing photorealistic and topology-aware human avatars from monocular videos remains a significant challenge in the fields of computer vision and graphics. While existing 3D human avatar modeling approaches can effectively capture body motion, they often fail to accurately model fine details such as hand movements and facial expressions. To address this, we propose Structure-aware Fine-grained Gaussian Splatting (SFGS), a novel method for reconstructing expressive and coherent full-body 3D human avatars from a monocular video sequence. The SFGS use both spatial-only triplane and time-aware hexplane to capture dynamic features across consecutive frames. A structure-aware gaussian module is designed to capture pose-dependent details in a spatially coherent manner and improve pose and texture expression. To better model hand deformations, we also propose a residual refinement module based on fine-grained hand reconstruction. Our method requires only a single-stage training and outperforms state-of-the-art baselines in both quantitative and qualitative evaluations, generating high-fidelity avatars with natural motion and fine details. The code is on Github: https://github.com/Su245811YZ/SFGS
Chinese Translation
从单目视频中重建具有真实感且拓扑结构感知的人体虚拟人仍是计算机视觉与图形学领域的一大挑战。现有的三维人体虚拟人建模方法虽然能够有效捕捉身体动作,但往往难以准确建模手部动作和面部表情等细节。为此,我们提出了结构感知的细粒度高斯散点法(Structure-aware Fine-grained Gaussian Splatting,SFGS),一种从单目视频序列重建表现力丰富且连贯的全身三维人体虚拟人的新方法。SFGS结合了仅空间的三平面(triplane)和时间感知的六平面(hexplane)以捕捉连续帧间的动态特征。设计了结构感知高斯模块,以空间连贯的方式捕捉依赖姿态的细节,提升姿态与纹理的表达能力。为更好地建模手部变形,我们还提出了基于细粒度手部重建的残差细化模块。该方法仅需单阶段训练,在定量和定性评估中均优于现有最先进方法,生成具有自然动作和细节的高保真虚拟人。代码已开源于Github:https://github.com/Su245811YZ/SFGS
cs.CV / 95 / 2604.09327

From Frames to Events: Rethinking Evaluation in Human-Centric Video Anomaly Detection

从帧到事件:重新思考以人为本的视频异常检测评估
Rashvand, Narges, Yao, Shanle, Pazho, Armin Danesh, Ardabili, Babak Rahimi, Tabkhi, Hamed
Abstract
Pose-based Video Anomaly Detection (VAD) has gained significant attention for its privacy-preserving nature and robustness to environmental variations. However, traditional frame-level evaluations treat video as a collection of isolated frames, fundamentally misaligned with how anomalies manifest and are acted upon in the real world. In operational surveillance systems, what matters is not the flagging of individual frames, but the reliable detection, localization, and reporting of a coherent anomalous event, a contiguous temporal episode with an identifiable onset and duration. Frame-level metrics are blind to this distinction, and as a result, they systematically overestimate model performance for any deployment that requires actionable, event-level alerts. In this work, we propose a shift toward an event-centric perspective in VAD. We first audit widely used VAD benchmarks, including SHT[19], CHAD[6], NWPUC[4], and HuVAD[25], to characterize their event structure. We then introduce two strategies for temporal event localization: a score-refinement pipeline with hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end Dual-Branch Model that directly generates event-level detections. Finally, we establish the first event-based evaluation standard for VAD by adapting Temporal Action Localization metrics, including tIoU-based event matching and multi-threshold F1 evaluation. Our results quantify a substantial performance gap: while all SoTA models achieve frame-level AUC-ROC exceeding 52% on the NWPUC[4], their event-level localization precision falls below 10% even at a minimal tIoU=0.2, with an average event-level F1 of only 0.11 across all thresholds. The code base for this work is available at https://github.com/TeCSAR-UNCC/EventCentric-VAD.
Chinese Translation
基于姿态的视频异常检测(VAD)因其保护隐私的特性和对环境变化的鲁棒性而受到广泛关注。然而,传统的帧级评估将视频视为一系列孤立的帧,这与异常在现实世界中的表现和处理方式根本不符。在实际的监控系统中,重要的不是单独帧的标记,而是对连贯异常事件的可靠检测、定位和报告,这是一段具有可识别起始和持续时间的连续时间片段。帧级指标对此区分视而不见,因此,它们系统性地高估了任何需要可操作的事件级警报的部署中的模型性能。在本研究中,我们提议在VAD中转向以事件为中心的视角。我们首先审查了广泛使用的VAD基准,包括SHT[19]、CHAD[6]、NWPUC[4]和HuVAD[25],以表征它们的事件结构。然后,我们介绍了两种时间事件定位策略:一种是具有层次高斯平滑和自适应二值化的得分精炼管道,另一种是直接生成事件级检测的端到端双分支模型。最后,我们通过调整时间动作定位指标,建立了VAD的首个基于事件的评估标准,包括基于tIoU的事件匹配和多阈值F1评估。我们的结果量化了显著的性能差距:尽管所有最先进的模型在NWPUC[4]上实现了超过52%的帧级AUC-ROC,但它们的事件级定位精度在最小tIoU=0.2时仍低于10%,在所有阈值下的平均事件级F1仅为0.11。该研究的代码库可在https://github.com/TeCSAR-UNCC/EventCentric-VAD获取。
cs.CV / 96 / 2604.09349

Visually-Guided Policy Optimization for Multimodal Reasoning

基于视觉引导的多模态推理策略优化
Wang, Zengbin, Xiong, Feng, Lin, Liang, Hu, Xuecai, Wang, Yong, Wang, Yanlin, Zhang, Man, Chu, Xiangxiang
Abstract
Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks.
Chinese Translation
具有可验证奖励的强化学习(RLVR)显著提升了视觉-语言模型(VLMs)的推理能力。然而,VLMs固有的文本主导特性常导致视觉忠实度不足,表现为对视觉标记的注意力激活稀疏。更重要的是,我们的实证分析表明,推理步骤中的视觉信息遗忘加剧了这一缺陷。为弥补该不足,我们提出了视觉引导的策略优化(Visually-Guided Policy Optimization,VGPO)框架,以在策略优化过程中强化视觉关注。具体而言,VGPO首先引入视觉注意力补偿机制,通过利用视觉相似性定位并放大视觉线索,同时在后续步骤逐步提升视觉期望值,以抵消视觉遗忘。在此机制基础上,我们实现了双粒度优势重加权策略:轨迹内层面强调视觉激活较高的标记,轨迹间层面优先考虑视觉积累表现更优的轨迹。大量实验表明,VGPO在提升视觉激活度的同时,在数学多模态推理及视觉依赖任务中表现出更优性能。
cs.CV / 97 / 2604.09352

LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation

LuMon:面向月球单目深度估计的新型数据集及综合基准测试与开发套件
Sekmen, Aytaç, Gunes, Fatih Emre, Horoz, Furkan, Işık, Hüseyin Umut, Ozaydin, Mehmet Alp, Topaloglu, Onur Altay, Üstündaş, Şahin Umutcan, Yeni, Yurdasen Alp, Soken, Halil Ersin, Sahin, Erol, Cinbis, Ramazan Gokberk, Kalkan, Sinan
Abstract
Monocular Depth Estimation (MDE) is crucial for autonomous lunar rover navigation using electro-optical cameras. However, deploying terrestrial MDE networks to the Moon brings a severe domain gap due to harsh shadows, textureless regolith, and zero atmospheric scattering. Existing evaluations rely on analogs that fail to replicate these conditions and lack actual metric ground truth. To address this, we present LuMon, a comprehensive benchmarking framework to evaluate MDE methods for lunar exploration. We introduce novel datasets featuring high-quality stereo ground truth depth from the real Chang'e-3 mission and the CHERI dark analog dataset. Utilizing this framework, we conduct a systematic zero-shot evaluation of state-of-the-art architectures across synthetic, analog, and real datasets. We rigorously assess performance against mission critical challenges like craters, rocks, extreme shading, and varying depth ranges. Furthermore, we establish a sim-to-real domain adaptation baseline by fine tuning a foundation model on synthetic data. While this adaptation yields drastic in-domain performance gains, it exhibits minimal generalization to authentic lunar imagery, highlighting a persistent cross-domain transfer gap. Our extensive analysis reveals the inherent limitations of current networks and sets a standard foundation to guide future advancements in extraterrestrial perception and domain adaptation.
Chinese Translation
单目深度估计(Monocular Depth Estimation, MDE)对于利用电光摄像机实现月球自主漫游车导航至关重要。然而,将地面MDE网络部署到月球面临严峻的领域差异,原因在于月球环境中存在强烈阴影、无纹理的月壤以及零大气散射。现有评估方法依赖于无法真实复现这些条件的类比数据,且缺乏实际的度量真值。为此,我们提出了LuMon,一个用于评估月球探测MDE方法的综合基准框架。我们引入了包含真实嫦娥三号任务高质量立体真值深度数据及CHERI暗环境类比数据集的新型数据集。基于该框架,我们对最先进的架构在合成、类比及真实数据集上进行了系统的零样本评估,严格考察了其在陨石坑、岩石、极端阴影及不同深度范围等任务关键挑战下的表现。此外,我们通过在合成数据上微调基础模型,建立了一个模拟到真实(sim-to-real)的领域自适应基线。尽管该自适应在域内性能上带来了显著提升,但其对真实月球影像的泛化能力有限,凸显了跨域迁移的持续挑战。我们的深入分析揭示了当前网络的内在局限性,并为未来外星感知与领域自适应的研究提供了标准基准和指导。
cs.CV / 98 / 2604.09364

Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts

仲裁失败,而非感知盲点:视觉-语言模型如何解决视觉-语言冲突
Nooralahzadeh, Farhad, Rohanian, Omid, Zhang, Yi, Fürst, Jonathan, Stockinger, Kurt
Abstract
When a Vision-Language Model (VLM) sees a blue banana and answers "yellow", is the problem of perception or arbitration? We explore the question in ten VLMs with various sizes and reveal an Encoding--Grounding Dissociation: models that fail to report what they see (and thus provide a wrong answer) still encode the visual evidence as strongly as models that provide the correct answer. Using Multimodal Arbitration Crossover (MAC) analysis with layer-by-layer Logit Lens probing, we track the competition between visual and prior signals across every layer of each model. We show that visual attributes can be linearly decodable from early layers (AUC > 0.86). The accuracy remains nearly identical for both successful and failed samples. However, the gap in the final-layer logit -- not the strength of encoding -- better predicts grounding outcomes with a correlation of . After having studied when VLMs base their answers on image clues rather than prior knowledge, we want to understand the causal relationships. We establish causality through full-sequence activation patching. The standard last-token interventions in LLM interpretability do not affect VLMs. In contrast, replacing the full token sequence at layers identified by MAC alters 60 to 84% of outputs. Partial-token decomposition shows that image tokens carry almost all of the causal impact, while text tokens have none. Scaling addresses the remaining architectural differences to achieve perfect retention. Moving from diagnosis to intervention, we show that training-free activation steering -- both linear and sparse autoencoder-guided -- in early layers can improve visual grounding by up to +3.8% with degrading performance in some setups. Overall, these findings lead to a clear conclusion: VLMs already see well, but the challenge is acting on what they see. Targeted interventions can help to bridge this gap.
Chinese Translation
当视觉-语言模型(Vision-Language Model, VLM)看到一根蓝色香蕉却回答“黄色”时,问题出在感知还是仲裁?我们在十个不同规模的VLM中探讨了这一问题,揭示了一种编码-定位解离现象:那些未能准确报告所见(因此给出错误答案)的模型,仍然像给出正确答案的模型一样强烈地编码了视觉证据。通过多模态仲裁交叉分析(Multimodal Arbitration Crossover, MAC)结合逐层Logit Lens探测,我们追踪了每个模型各层中视觉信号与先验信号的竞争。结果显示,视觉属性可以从早期层线性解码(AUC > 0.86),成功与失败样本的准确率几乎相同。然而,最终层的logit差距——而非编码强度——更能预测定位结果,相关性达到 。在研究了VLM何时基于图像线索而非先验知识作答后,我们进一步探究因果关系。通过全序列激活修补,我们确立了因果性。与大语言模型(LLM)解释性中标准的最后一个token干预无效不同,MAC识别的层中替换完整token序列可改变60%至84%的输出。部分token分解显示,图像token几乎承担了全部因果影响,而文本token无此作用。通过扩展模型规模,解决了剩余的架构差异,实现了完美保留。从诊断转向干预,我们展示了无需训练的激活引导——包括线性和稀疏自编码器引导——在早期层可提升视觉定位性能最高达+3.8%,但在某些设置下性能有所下降。总体而言,这些发现得出明确结论:VLM已经具备良好的视觉感知能力,挑战在于如何有效利用所见信息。针对性的干预可助力弥合这一差距。
cs.CV / 99 / 2604.09366

Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors

具有不确定性感知先验的鲁棒4D视觉几何变换器
Zang, Ying, Han, Yidong, Ding, Chaotao, Hu, Yuanqi, Ji, Deyi, Zhu, Qi, Li, Xuanfu, Ma, Jin, Sun, Lingyun, Chen, Tianrun, Zhu, Lanyun
Abstract
Reconstructing dynamic 4D scenes is an important yet challenging task. While 3D foundation models like VGGT excel in static settings, they often struggle with dynamic sequences where motion causes significant geometric ambiguity. To address this, we present a framework designed to disentangle dynamic and static components by modeling uncertainty across different stages of the reconstruction process. Our approach introduces three synergistic mechanisms: (1) Entropy-Guided Subspace Projection, which leverages information-theoretic weighting to adaptively aggregate multi-head attention distributions, effectively isolating dynamic motion cues from semantic noise; (2) Local-Consistency Driven Geometry Purification, which enforces spatial continuity via radius-based neighborhood constraints to eliminate structural outliers; and (3) Uncertainty-Aware Cross-View Consistency, which formulates multi-view projection refinement as a heteroscedastic maximum likelihood estimation problem, utilizing depth confidence as a probabilistic weight. Experiments on dynamic benchmarks show that our approach outperforms current state-of-the-art methods, reducing Mean Accuracy error by 13.43\% and improving segmentation F-measure by 10.49\%. Our framework maintains the efficiency of feed-forward inference and requires no task-specific fine-tuning or per-scene optimization.
Chinese Translation
重建动态4D场景是一项重要但具有挑战性的任务。虽然像VGGT这样的3D基础模型在静态环境中表现出色,但在动态序列中,由于运动导致显著的几何模糊,它们往往难以应对。为了解决这个问题,我们提出了一个框架,通过在重建过程的不同阶段建模不确定性,旨在将动态和静态组件分离。我们的方法引入了三种协同机制:(1) 熵引导子空间投影,利用信息论加权自适应聚合多头注意力分布,有效地将动态运动线索与语义噪声隔离;(2) 局部一致性驱动几何净化,通过基于半径的邻域约束强制空间连续性,以消除结构异常值;(3) 不确定性感知交叉视图一致性,将多视图投影优化公式化为异方差最大似然估计问题,利用深度置信度作为概率权重。在动态基准测试中的实验表明,我们的方法优于当前的最先进技术,将平均准确率误差降低了13.43\%,并将分割F-measure提高了10.49\%。我们的框架保持了前馈推理的高效性,并且不需要特定任务的微调或逐场景的优化。
cs.CV / 100 / 2604.09367

EpiAgent: An Agent-Centric System for Ancient Inscription Restoration

EpiAgent:一种以智能体为中心的古代铭文修复系统
Zhu, Shipeng, Chen, Ang, Nie, Na, Fang, Pengfei, Zhang, Min-Ling, Xue, Hui
Abstract
Ancient inscriptions, as repositories of cultural memory, have suffered from centuries of environmental and human-induced degradation. Restoring their intertwined visual and textual integrity poses one of the most demanding challenges in digital heritage preservation. However, existing AI-based approaches often rely on rigid pipelines, struggling to generalize across such complex and heterogeneous real-world degradations. Inspired by the skill-coordinated workflow of human epigraphers, we propose EpiAgent, an agent-centric system that formulates inscription restoration as a hierarchical planning problem. Following an Observe-Conceive-Execute-Reevaluate paradigm, an LLM-based central planner orchestrates collaboration among multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement. This agent-centric coordination enables a flexible and adaptive restoration process beyond conventional single-pass methods. Across real-world degraded inscriptions, EpiAgent achieves superior restoration quality and stronger generalization compared to existing methods. Our work marks an important step toward expert-level agent-driven restoration of cultural heritage. The code is available at https://github.com/blackprotoss/EpiAgent.
Chinese Translation
古代铭文作为文化记忆的载体,经历了数百年的环境和人为破坏,其视觉与文本的交织完整性修复是一项数字遗产保护中极具挑战性的任务。然而,现有基于人工智能的方法通常依赖于僵化的流程,难以应对复杂且异质的真实世界退化情况。受人类铭文学者协同技能工作流程的启发,我们提出了EpiAgent,一种以智能体为中心的系统,将铭文修复问题建模为分层规划问题。遵循观察-构思-执行-再评估(Observe-Conceive-Execute-Reevaluate)范式,基于大型语言模型(LLM)的中央规划者协调多模态分析、历史经验、专业修复工具及迭代自我优化的协作。该智能体中心的协调机制实现了超越传统单次处理方法的灵活且自适应的修复流程。在真实退化铭文的修复任务中,EpiAgent在修复质量和泛化能力上均优于现有方法。我们的工作标志着迈向专家级智能体驱动文化遗产修复的重要一步。代码已发布于https://github.com/blackprotoss/EpiAgent。
cs.CV / 101 / 2604.09386

Region-Constrained Group Relative Policy Optimization for Flow-Based Image Editing

基于区域约束的群体相对策略优化用于基于流的图像编辑
Ouyang, Zhuohan, Qian, Zhe, Cui, Wenhuo, Wang, Chaoqun
Abstract
Instruction-guided image editing requires balancing target modification with non-target preservation. Recently, flow-based models have emerged as a strong and increasingly adopted backbone for instruction-guided image editing, thanks to their high fidelity and efficient deterministic ODE sampling. Building on this foundation, GRPO-based reward-driven post-training has been explored to directly optimize editing-specific rewards, improving instruction following and editing consistency. However, existing methods often suffer from noisy credit assignment: global exploration also perturbs non-target regions, inflating within-group reward variance and yielding noisy GRPO advantages. To address this, we propose RC-GRPO-Editing, a region-constrained GRPO post-training framework for flow-based image editing under deterministic ODE sampling. It suppresses background-induced nuisance variance to enable cleaner localized credit assignment, improving editing region instruction adherence while preserving non-target content. Concretely, we localize exploration via region-decoupled initial noise perturbations to reduce background-induced reward variance and stabilize GRPO advantages, and introduce an attention concentration reward that aligns cross-attention with the intended editing region throughout the rollout, reducing unintended changes in non-target regions. Experiments on CompBench show consistent improvements in editing region instruction adherence and non-target preservation.
Chinese Translation
指令引导的图像编辑需要在目标修改与非目标区域保护之间取得平衡。近年来,基于流(flow-based)的模型因其高保真度和高效的确定性ODE采样,成为指令引导图像编辑中强大且日益被采用的骨干架构。在此基础上,基于GRPO(Group Relative Policy Optimization)的奖励驱动后训练方法被提出,以直接优化编辑特定的奖励,从而提升指令遵循性和编辑一致性。然而,现有方法常受噪声信用分配问题困扰:全局探索同时扰动非目标区域,导致组内奖励方差膨胀,产生噪声较大的GRPO优势。为此,我们提出了RC-GRPO-Editing,一种针对基于流的图像编辑在确定性ODE采样下的区域约束GRPO后训练框架。该方法抑制背景引起的干扰方差,实现更清晰的局部信用分配,提升编辑区域对指令的遵循性,同时保护非目标内容。具体而言,我们通过区域解耦的初始噪声扰动实现局部探索,减少背景引起的奖励方差并稳定GRPO优势;同时引入注意力集中奖励,使跨注意力在整个回滚过程中与预期编辑区域保持一致,减少非目标区域的非预期变化。在CompBench数据集上的实验表明,该方法在编辑区域指令遵循性和非目标保护方面均有持续改进。
cs.CV / 102 / 2604.09405

EGLOCE: Training-Free Energy-Guided Latent Optimization for Concept Erasure

EGLOCE:无训练的能量引导潜变量优化用于概念消除
Ahn, Junyeong, Yoon, Seojin, Baik, Sungyong
Abstract
As text-to-image diffusion models grow increasingly prevalent, the ability to remove specific concepts-mostly explicit content and many copyrighted characters or styles-has become essential for safety and compliance. Existing unlearning approaches often require costly re-training, modify parameters at the cost of degradation of unrelated concept fidelity, or depend on indirect inference-time adjustment that compromise the effectiveness of concept erasure. Inspired by the success of energy-guided sampling for preservation of the condition of diffusion models, we introduce Energy-Guided Latent Optimization for Concept Erasure (EGLOCE), a training-free approach that removes unwanted concepts by re-directing noisy latent during inference. Our method employs a dual-objective framework: a repulsion energy that steers generation away from target concepts via gradient descent in latent space, and a retention energy that preserves semantic alignment to the original prompt. Combined with previous approaches that either require erroneous modified model weights or provide weak inference-time guidance, EGLOCE operates entirely at inference and enhances erasure performance, enabling plug-and-play integration. Extensive experiments demonstrate that EGLOCE improves concept removal while maintaining image quality and prompt alignment across baselines, even with adversarial attacks. To the best of our knowledge, our work is the first to establish a new paradigm for safe and controllable image generation through dual energy-based guidance during sampling.
Chinese Translation
随着文本到图像扩散模型的日益普及,移除特定概念——主要是显性内容以及许多受版权保护的角色或风格——的能力已成为安全性和合规性的关键。现有的遗忘方法通常需要高成本的重新训练,或以牺牲无关概念的保真度为代价修改参数,亦或依赖于在推理时进行间接调整,从而影响概念消除的效果。受能量引导采样在保持扩散模型条件方面成功的启发,我们提出了能量引导潜变量优化用于概念消除(EGLOCE),这是一种无训练方法,通过在推理阶段重新引导噪声潜变量来移除不需要的概念。我们的方法采用双目标框架:通过潜变量空间中的梯度下降,利用排斥能量使生成远离目标概念;同时利用保留能量保持与原始提示的语义一致性。结合此前需要错误修改模型权重或仅提供弱推理时引导的方法,EGLOCE完全在推理阶段运行,提升了消除性能,实现即插即用的集成。大量实验表明,EGLOCE在保持图像质量和提示对齐的同时,优于基线方法的概念移除效果,即使在对抗攻击下亦表现稳健。据我们所知,本工作首次通过采样过程中的双能量引导建立了安全且可控的图像生成新范式。
cs.CV / 103 / 2604.09411

SynFlow: Scaling Up LiDAR Scene Flow Estimation with Synthetic Data

SynFlow:利用合成数据扩展LiDAR场景流估计规模
Zhang, Qingwen, Zhu, Xiaomeng, Jiang, Chenhan, Jensfelt, Patric
Abstract
Reliable 3D dynamic perception requires models that can anticipate motion beyond predefined categories, yet progress is hindered by the scarcity of dense, high-quality motion annotations. While self-supervision on unlabeled real data offers a path forward, empirical evidence suggests that scaling unlabeled data fails to close the performance gap due to noisy proxy signals. In this paper, we propose a shift in paradigm: learning robust real-world motion priors entirely from scalable simulation. We introduce SynFlow, a data generation pipeline that generates large-scale synthetic dataset specifically designed for LiDAR scene flow. Unlike prior works that prioritize sensor-specific realism, SynFlow employs a motion-oriented strategy to synthesize diverse kinematic patterns across 4,000 sequences ($\sim$940k frames), termed SynFlow-4k. This represents a 34x scale-up in annotated volume over existing real-world benchmarks. Our experiments demonstrate that SynFlow-4k provides a highly domain-invariant motion prior. In a zero-shot regime, models trained exclusively on our synthetic data generalize across multiple real-world benchmarks, rivaling in-domain supervised baselines on nuScenes and outperforming state-of-the-art methods on TruckScenes by 31.8%. Furthermore, SynFlow-4k serves as a label-efficient foundation: fine-tuning with only 5% of real-world labels surpasses models trained from scratch on the full available budget. We open-source the pipeline and dataset to facilitate research in generalizable 3D motion estimation. More detail can be found at https://kin-zhang.github.io/SynFlow.
Chinese Translation
可靠的三维动态感知需要模型能够预测超出预定义类别的运动,但高质量密集运动标注的稀缺阻碍了进展。虽然在无标签真实数据上进行自监督学习提供了一条前进的道路,但实证表明,扩大无标签数据规模由于代理信号噪声问题未能缩小性能差距。本文提出了一种范式转变:完全通过可扩展的仿真学习稳健的真实世界运动先验。我们引入了SynFlow,一种专门为LiDAR场景流生成大规模合成数据集的数据生成管线。不同于以传感器特定真实性为优先的先前工作,SynFlow采用运动导向策略,合成了跨越4000个序列(约94万帧)的多样运动模式,称为SynFlow-4k。这在标注体量上较现有真实世界基准提升了34倍。实验表明,SynFlow-4k提供了高度领域不变的运动先验。在零样本条件下,单独在合成数据上训练的模型能够泛化至多个真实世界基准,在nuScenes上与领域内监督基线相当,并在TruckScenes上超越最先进方法31.8%。此外,SynFlow-4k作为标签高效的基础:仅用5%的真实标签微调即可超越从零开始训练的全量标签模型。我们开源了该管线和数据集,以促进通用三维运动估计的研究。更多详情请见https://kin-zhang.github.io/SynFlow。
cs.CV / 104 / 2604.09415

PhysInOne: Visual Physics Learning and Reasoning in One Suite

PhysInOne:一体化视觉物理学习与推理
Zhou, Siyuan, Wang, Hejun, Cheng, Hu, Li, Jinxi, Wang, Dongsheng, Jiang, Junwei, Jin, Yixiao, Huang, Jiayue, Mao, Shiwei, Liu, Shangjia, Yang, Yafei, Song, Hongkang, Wei, Shenxing, Zhang, Zihui, Huang, Peng, Liu, Shijie, Hao, Zhengli, Li, Hao, Li, Yitian, Zhou, Wenqi, Zhao, Zhihan, He, Zongqi, Wen, Hongtao, Huang, Shouwang, Yun, Peng, Cheng, Bowen, Fu, Pok Kazaf, Lai, Wai Kit, Chen, Jiahao, Wang, Kaiyuan, Sun, Zhixuan, Li, Ziqi, Hu, Haochen, Zhang, Di, Yuen, Chun Ho, Wang, Bing, Wang, Zhihua, Zou, Chuhang, Yang, Bo
Abstract
We present PhysInOne, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multiobject interactions against complex backgrounds, with comprehensive ground-truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne's efficacy across four emerging applications: physics-aware video generation, long-/short-term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine-tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics-grounded world models in generation, simulation, and embodied AI.
Chinese Translation
我们提出了PhysInOne,这是一个大规模合成数据集,旨在解决AI系统在物理基础训练数据方面的严重短缺。与现有数据集仅限于数百或数千个示例不同,PhysInOne提供了200万个视频,涵盖153,810个动态3D场景,涉及力学、光学、流体动力学和磁学中的71种基本物理现象。与以往的研究不同,我们的场景展示了复杂背景下的多物体交互,并提供了全面的真实标注,包括3D几何、语义、动态运动、物理属性和文本描述。我们展示了PhysInOne在四个新兴应用中的有效性:物理感知视频生成、长短期未来帧预测、物理属性估计和运动转移。实验表明,在PhysInOne上微调基础模型显著增强了物理合理性,同时也暴露了在建模复杂物理动态和估计内在属性方面的关键缺口。作为同类数据集中规模最大的,远超以往研究,PhysInOne为推进基于物理的世界模型在生成、仿真和具身AI中的应用建立了新的基准。
cs.CV / 105 / 2604.09425

Do Vision Language Models Need to Process Image Tokens?

视觉语言模型是否需要处理图像标记?
Ghosh, Sambit, Babu, R. Venkatesh, Agarwal, Chirag
Abstract
Vision Language Models (VLMs) have achieved remarkable success by integrating visual encoders with large language models (LLMs). While VLMs process dense image tokens across deep transformer stacks (incurring substantial computational overhead), it remains fundamentally unclear whether sustained image-token processing is necessary for their performance or visual representations meaningfully evolve from early to later layers. In this work, we systematically investigate the functional role of image tokens in VLMs and show that visual representations rapidly converge to a bounded-complexity regime, \ie their entropy stabilizes, intrinsic dimensionality compresses, and trajectory curvature approaches a near-constant profile. In contrast, textual representations continue to undergo substantial restructuring across depth. Once stabilized, visual representations become largely interchangeable between layers, indicating limited additional transformation in deeper stages. Further, depth-wise visual truncation reveals that the necessity of visual processing is task-dependent, where single-token predictions remain comparatively robust to truncated visual depth, but multi-token generation require sustained access to visual representations. Under deterministic decoding, reducing visual depth perturbs intermediate reasoning trajectories more strongly than final outputs, suggesting that image tokens influence the structure of reasoning more than the ultimate conclusions. Collectively, these findings \textbf{question the assumption} that deeper visual processing is uniformly essential in VLMs, challenging the current paradigm of multimodal LLM architectures.
Chinese Translation
视觉语言模型(VLMs)通过将视觉编码器与大型语言模型(LLMs)结合,取得了显著成功。尽管VLMs在深层变换器堆栈中处理密集的图像标记(这会产生相当大的计算开销),但目前仍不清楚持续的图像标记处理是否对其性能是必要的,或者视觉表征是否在早期层到后期层之间有意义地演变。在本研究中,我们系统地探讨了图像标记在VLMs中的功能角色,并表明视觉表征迅速收敛到一个有限复杂度的状态,即它们的熵稳定、内在维度压缩,并且轨迹曲率接近于近似常数的特征。相比之下,文本表征在深度上仍然经历显著的重构。一旦稳定,视觉表征在层之间变得大体上可互换,表明在更深的阶段中额外的变换有限。此外,深度视觉截断揭示了视觉处理的必要性是任务依赖的,其中单标记预测对截断的视觉深度保持相对稳健,但多标记生成则需要持续访问视觉表征。在确定性解码下,减少视觉深度对中间推理轨迹的干扰比最终输出更强,表明图像标记对推理结构的影响大于最终结论。总体而言,这些发现 extbf{质疑了}在VLMs中更深的视觉处理是普遍必要的假设,挑战了当前多模态LLM架构的范式。
cs.CV / 106 / 2604.09429

Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

光线作为像素:学习视频与相机轨迹的联合分布
Jang, Wonbong, Liu, Shikun, Sanyal, Soubhik, Perez, Juan Camilo, Ng, Kam Woh, Agrawal, Sanskar, Perez-Rua, Juan-Manuel, Douratsos, Yiannis, Xiang, Tao
Abstract
Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task needs what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. We represent each camera as dense ray pixels (raxels) and denoise them jointly with video frames through Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, jointly generating video and camera trajectory from input images, and generating video from input images along a target camera trajectory. Because the model can both predict trajectories from a video and generate views conditioned on its own predictions, we evaluate it through a closed-loop self-consistency test, demonstrating that its forward and inverse predictions agree. Notably, trajectory prediction requires far fewer denoising steps than video generation, even a few denoising steps suffice for self-consistency. We report results on pose estimation and camera-controlled video generation.
Chinese Translation
从图像中恢复相机参数以及从新视角渲染场景在计算机视觉和图形学中长期以来被视为两个独立的任务。当图像覆盖稀疏或姿态模糊时,这种分离就会失效,因为每个任务都需要另一个任务所产生的内容。我们提出了光线作为像素(Rays as Pixels),一种视频扩散模型(Video Diffusion Model, VDM),它学习视频与相机轨迹的联合分布。我们将每个相机表示为密集的光线像素(raxels),并通过解耦自交叉注意机制(Decoupled Self-Cross Attention)与视频帧共同去噪。一个训练好的模型可以处理三个任务:从视频中预测相机轨迹、从输入图像共同生成视频和相机轨迹,以及沿目标相机轨迹从输入图像生成视频。由于该模型能够从视频中预测轨迹并根据自身的预测生成视图,我们通过闭环自一致性测试对其进行评估,证明其前向和反向预测是一致的。值得注意的是,轨迹预测所需的去噪步骤远少于视频生成,甚至少量的去噪步骤就足以实现自一致性。我们报告了姿态估计和相机控制视频生成的结果。
cs.CV / 107 / 2604.09436

SCoRe: Clean Image Generation from Diffusion Models Trained on Noisy Images

SCoRe:从训练于噪声图像的扩散模型生成清晰图像
Matsuzaki, Yuta, Uchida, Seiichi, Takezaki, Shumpei
Abstract
Diffusion models trained on noisy datasets often reproduce high-frequency training artifacts, significantly degrading generation quality. To address this, we propose SCoRe (Spectral Cutoff Regeneration), a training-free, generation-time spectral regeneration method for clean image generation from diffusion models trained on noisy images. Leveraging the spectral bias of diffusion models, which infer high-frequency details from low-frequency cues, SCoRe suppresses corrupted high-frequency components of a generated image via a frequency cutoff and regenerates them via SDEdit. Crucially, we derive a theoretical mapping between the cutoff frequency and the SDEdit initialization timestep based on Radially Averaged Power Spectral Density (RAPSD), which prevents excessive noise injection during regeneration. Experiments on synthetic (CIFAR-10) and real-world (SIDD) noisy datasets demonstrate that SCoRe substantially outperforms post-processing and noise-robust baselines, restoring samples closer to clean image distributions without any retraining or fine-tuning.
Chinese Translation
训练于噪声数据集的扩散模型往往会再现高频训练伪影,显著降低生成质量。为了解决这个问题,我们提出了SCoRe(谱截止再生),这是一种无训练、生成时的谱再生方法,用于从训练于噪声图像的扩散模型生成清晰图像。SCoRe利用扩散模型的谱偏差,即从低频线索推断高频细节,通过频率截止抑制生成图像的损坏高频成分,并通过SDEdit再生这些成分。关键是,我们基于径向平均功率谱密度(Radially Averaged Power Spectral Density,RAPSD)推导出截止频率与SDEdit初始化时间步之间的理论映射,从而防止再生过程中过量噪声的注入。在合成(CIFAR-10)和真实世界(SIDD)噪声数据集上的实验表明,SCoRe显著优于后处理和噪声鲁棒基线,能够在不进行任何重训练或微调的情况下,将样本恢复得更接近清晰图像分布。
cs.CV / 108 / 2604.09445

AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization

AsymLoc:朝着高效视觉定位的非对称特征匹配
Omama, Mohammad, Berton, Gabriele, Foxlin, Eric, Kim, Yelin
Abstract
Precise and real-time visual localization is critical for applications like AR/VR and robotics, especially on resource-constrained edge devices such as smart glasses, where battery life and heat dissipation can be a primary concerns. While many efficient models exist, further reducing compute without sacrificing accuracy is essential for practical deployment. To address this, we propose asymmetric visual localization: a large Teacher model processes pre-mapped database images offline, while a lightweight Student model processes the query image online. This creates a challenge in matching features from two different models without resorting to heavy, learned matchers. We introduce AsymLoc, a novel distillation framework that aligns a Student to its Teacher through a combination of a geometry-driven matching objective and a joint detector-descriptor distillation objective, enabling fast, parameter-less nearest-neighbor matching. Extensive experiments on HPatches, ScanNet, IMC2022, and Aachen show that AsymLoc achieves up to 95% of the teacher's localization accuracy using an order of magnitude smaller models, significantly outperforming existing baselines and establishing a new state-of-the-art efficiency-accuracy trade-off.
Chinese Translation
精确且实时的视觉定位对于增强现实/虚拟现实(AR/VR)和机器人等应用至关重要,尤其是在智能眼镜等资源受限的边缘设备上,电池寿命和散热问题可能是主要关注点。尽管存在许多高效模型,但在不牺牲准确性的情况下进一步降低计算需求对于实际部署至关重要。为此,我们提出了非对称视觉定位:一个大型教师模型离线处理预映射的数据库图像,而一个轻量级学生模型在线处理查询图像。这在于如何匹配来自两个不同模型的特征,而不依赖于重型学习匹配器。我们引入了AsymLoc,一个新颖的蒸馏框架,通过几何驱动的匹配目标和联合检测-描述子蒸馏目标的组合,将学生模型与教师模型对齐,从而实现快速、无参数的最近邻匹配。在HPatches、ScanNet、IMC2022和Aachen上的大量实验表明,AsymLoc在使用数量级更小的模型时,能够达到教师模型定位准确度的95%,显著超越现有基准,并建立了新的效率-准确性权衡的最优状态。
cs.CV / 109 / 2604.09473

Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement

实现沉浸式体积视频:一个用于6自由度虚拟现实参与的多模态框架
Yang, Zhengxian, Wang, Shengqi, Pan, Shi, Li, Hongshuai, Wang, Haoxiang, Li, Lin, Li, Guanjun, Wen, Zhengqi, Lin, Borong, Tao, Jianhua, Yu, Tao
Abstract
Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. While such experiences can be achieved through computer-generated content, constructing them directly from real-world captured videos remains largely unexplored. We introduce Immersive Volumetric Videos, a new volumetric media format designed to provide large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To support IVV construction, we present ImViD, a multi-view, multi-modal dataset built upon a space-oriented capture philosophy. Our custom capture rig enables synchronized multi-view video-audio acquisition during motion, facilitating efficient capture of complex indoor and outdoor scenes with rich foreground--background interactions and challenging dynamics. The dataset provides 5K-resolution videos at 60 FPS with durations of 1-5 minutes, offering richer spatial, temporal, and multimodal coverage than existing benchmarks. Leveraging this dataset, we develop a dynamic light field reconstruction framework built upon a Gaussian-based spatio-temporal representation, incorporating flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision for robust and accurate modeling of complex motion. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data. Together, these components form a unified pipeline for immersive volumetric video production. Extensive benchmarks and immersive VR experiments demonstrate that our pipeline generates high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. This work provides both a foundational definition and a practical construction methodology for immersive volumetric videos.
Chinese Translation
完全沉浸的体验,紧密集成6自由度的视觉和听觉交互,对于虚拟现实和增强现实至关重要。虽然这样的体验可以通过计算机生成的内容实现,但直接从现实世界捕获的视频构建这些体验仍然在很大程度上未被探索。我们介绍了沉浸式体积视频(Immersive Volumetric Videos,IVV),这是一种新的体积媒体格式,旨在提供广阔的6自由度交互空间、视听反馈以及高分辨率、高帧率的动态内容。为了支持IVV的构建,我们提出了ImViD,这是一个基于空间导向捕获理念构建的多视角、多模态数据集。我们的定制捕获设备能够在运动过程中同步进行多视角视频-音频采集,从而高效捕获复杂的室内和室外场景,具有丰富的前景-背景交互和具有挑战性的动态。该数据集提供了以60帧每秒(FPS)播放的5K分辨率视频,时长为1-5分钟,提供比现有基准更丰富的空间、时间和多模态覆盖。利用该数据集,我们开发了一个基于高斯时空表示的动态光场重建框架,结合流引导的稀疏初始化、联合相机时间校准和多项时空监督,以实现复杂运动的稳健和准确建模。我们进一步提出了迄今为止首个从此类多视角视听数据中重建声场的方法。所有这些组件共同形成了一个统一的沉浸式体积视频制作管道。广泛的基准测试和沉浸式虚拟现实实验表明,我们的管道生成高质量、时间稳定的视听体积内容,具有广阔的6自由度交互空间。这项工作为沉浸式体积视频提供了基础定义和实用构建方法。
cs.CV / 110 / 2604.09478

Incremental Semantics-Aided Meshing from LiDAR-Inertial Odometry and RGB Direct Label Transfer

基于LiDAR-惯性里程计与RGB直接标签传递的增量语义辅助网格重建
Affan, Muhammad, Lehtola, Ville, Vosselman, George
Abstract
Geometric high-fidelity mesh reconstruction from LiDAR-inertial scans remains challenging in large, complex indoor environments -- such as cultural buildings -- where point cloud sparsity, geometric drift, and fixed fusion parameters produce holes, over-smoothing, and spurious surfaces at structural boundaries. We propose a modular, incremental RGB+LiDAR pipeline that generates incremental semantics-aided high-quality meshes from indoor scans through scan frame-based direct label transfer. A vision foundation model labels each incoming RGB frame; labels are incrementally projected and fused onto a LiDAR-inertial odometry map; and an incremental semantics-aware Truncated Signed Distance Function (TSDF) fusion step produces the final mesh via marching cubes. This frame-level fusion strategy preserves the geometric fidelity of LiDAR while leveraging rich visual semantics to resolve geometric ambiguities at reconstruction boundaries caused by LiDAR point-cloud sparsity and geometric drift. We demonstrate that semantic guidance improves geometric reconstruction quality; quantitative evaluation is therefore performed using geometric metrics on the Oxford Spires dataset, while results from the NTU VIRAL dataset are analyzed qualitatively. The proposed method outperforms state-of-the-art geometric baselines ImMesh and Voxblox, demonstrating the benefit of semantics-aided fusion for geometric mesh quality. The resulting semantically labelled meshes are of value when reconstructing Universal Scene Description (USD) assets, offering a path from indoor LiDAR scanning to XR and digital modeling.
Chinese Translation
基于LiDAR-惯性扫描的几何高保真网格重建在大型复杂室内环境(如文化建筑)中仍然具有挑战性,原因在于点云稀疏、几何漂移及固定融合参数导致结构边界处出现孔洞、过度平滑和伪造表面。本文提出了一种模块化的增量RGB+LiDAR管线,通过基于扫描帧的直接标签传递,从室内扫描中生成增量语义辅助的高质量网格。视觉基础模型对每个输入的RGB帧进行标注;标签被增量投影并融合到LiDAR-惯性里程计地图上;随后,增量语义感知的截断有符号距离函数(TSDF)融合步骤通过行进立方体算法生成最终网格。该基于帧的融合策略在保留LiDAR几何精度的同时,利用丰富的视觉语义信息解决了由LiDAR点云稀疏和几何漂移引起的重建边界几何模糊问题。我们展示了语义引导对几何重建质量的提升;因此在Oxford Spires数据集上通过几何指标进行了定量评估,并对NTU VIRAL数据集的结果进行了定性分析。所提方法优于当前最先进的几何基线方法ImMesh和Voxblox,证明了语义辅助融合对几何网格质量的提升作用。生成的语义标注网格在重建通用场景描述(USD)资产时具有重要价值,为室内LiDAR扫描到XR和数字建模提供了路径。
cs.CV / 111 / 2604.09480

Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model

Online3R:基于几何基础模型的在线学习一致性序列重建方法
Zhou, Shunkai, Yan, Zike, Xue, Fei, Wu, Dong, Deng, Yuchen, Zha, Hongbin
Abstract
We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To solve the problems of missing groundtruth and the requirement of high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions. The local consistency constraints are conducted on intermediate and previously local fused results, enabling the model to be trained with high-quality pseudo groundtruth signals; the global consistency constraints are operated on sparse keyframes spanning long distances rather than per frame, allowing the model to learn from a consistent prediction over a long trajectory in an efficient way. Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks. Project page: https://shunkaizhou.github.io/online3r-1.0/
Chinese Translation
我们提出了Online3R,一种新颖的序列重建框架,能够通过在线学习适应新场景,有效解决不一致性问题。具体而言,我们在预训练且冻结的几何基础模型中引入了一组可学习的轻量级视觉提示,以捕捉新环境的知识,同时保持基础模型在几何预测方面的核心能力。为了解决测试时更新这些视觉提示时缺乏真实标注和高效性需求的问题,我们提出了一种局部-全局自监督学习策略,通过对预测结果施加局部和全局一致性约束来实现。局部一致性约束作用于中间及先前的局部融合结果,使模型能够利用高质量的伪真实信号进行训练;全局一致性约束则作用于跨越长距离的稀疏关键帧,而非逐帧操作,从而使模型能够以高效的方式从长轨迹上的一致预测中学习。我们的实验表明,Online3R在多个基准测试中优于现有的最先进方法。项目主页:https://shunkaizhou.github.io/online3r-1.0/
cs.CV / 112 / 2604.09508

VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

VISOR:通过迭代搜索与超远程推理实现的主动视觉检索增强生成
Shen, Yucheng, Wu, Jiulong, Huang, Jizhou, Yin, Dawei, Yan, Lingyong, Cao, Min
Abstract
Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi-step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval.. However, existing agentic VRAG faces two critical bottlenecks. (1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross-page reasoning; moreover, fine-grained intra-image evidence often requires precise visual actions, whose misuse degrades retrieval quality; (2) Search Drift in Long Horizons: the accumulation of visual tokens across retrieved pages dilutes context and causes cognitive overload, leading agents to deviate from their search objective. To address these challenges, we propose VISOR (Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning), a unified single-agent framework. VISOR features a structured Evidence Space for progressive cross-page reasoning, coupled with a Visual Action Evaluation and Correction mechanism to manage visual actions. Additionally, we introduce a Dynamic Trajectory with Sliding Window and Intent Injection to mitigate search drift. They anchor the evidence space while discarding earlier raw interactions, preventing context from being overwhelmed by visual tokens. We train VISOR using a Group Relative Policy Optimization-based Reinforcement Learning (GRPO-based RL) pipeline with state masking and credit assignment tailored for dynamic context reconstruction. Extensive experiments on ViDoSeek, SlideVQA, and MMLongBench demonstrate that VISOR achieves state-of-the-art performance with superior efficiency for long-horizon visual reasoning tasks.
Chinese Translation
视觉检索增强生成(Visual Retrieval-Augmented Generation,VRAG)使视觉语言模型能够检索并推理视觉丰富的文档。为应对需要多步推理的复杂查询,主动式VRAG系统将推理与迭代检索交织进行。然而,现有的主动式VRAG面临两个关键瓶颈:(1)视觉证据稀疏性:关键证据分散于多页文档中却被孤立处理,阻碍跨页推理;此外,细粒度的图像内证据通常需要精准的视觉操作,操作不当会降低检索质量;(2)长远程搜索漂移:跨页累积的视觉标记稀释了上下文且导致认知过载,使得代理偏离搜索目标。为解决这些挑战,我们提出了VISOR(Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning),一个统一的单代理框架。VISOR具备结构化的证据空间以支持渐进式跨页推理,并配备视觉操作评估与纠正机制以管理视觉操作。此外,我们引入了带滑动窗口和意图注入的动态轨迹机制以缓解搜索漂移,该机制锚定证据空间,同时丢弃早期的原始交互,防止视觉标记淹没上下文。我们采用基于群体相对策略优化(Group Relative Policy Optimization,GRPO)的强化学习训练流程,结合状态屏蔽和针对动态上下文重构的信用分配。大量实验结果表明,VISOR在ViDoSeek、SlideVQA和MMLongBench数据集上实现了长远程视觉推理任务的最先进性能与卓越效率。
cs.CV / 113 / 2604.09511

RIRF: Reasoning Image Restoration Framework

RIRF:推理图像恢复框架
Yan, Wending, Zhang, Rongkai, Tang, Kaihua, Cheng, Yu, Liu, Qiankun
Abstract
Universal image restoration (UIR) aims to recover clean images from diverse and unknown degradations using a unified model. Existing UIR methods primarily focus on pixel reconstruction and often lack explicit diagnostic reasoning over degradation composition, severity, and scene semantics prior to restoration. We propose Reason and Restore (R\&R), a novel framework that integrates structured Chain-of-Thought (CoT) reasoning into the image restoration pipeline. R\&R introduces an explicit reasoner, implemented by fine-tuning Qwen3-VL, to diagnose degradation types, quantify degradation severity, infer key degradation-related factors, and describe relevant scene and object semantics. The resulting structured reasoning provides interpretable and fine-grained diagnostic priors for the restorer. To further improve restoration quality, the quantified degradation severity produced by the reasoner is leveraged as reinforcement learning (RL) signals to guide and strengthen the restorer. Unlike existing multimodal LLM-based agentic systems that decouple reasoning from low-level vision tasks, R\&R tightly couples semantic diagnostic reasoning with pixel-level restoration in a unified framework. Extensive experiments across diverse UIR benchmarks demonstrate that R\&R achieves state-of-the-art performance while offering unique interpretability into the restoration process.
Chinese Translation
通用图像恢复(UIR)旨在使用统一模型从多样化和未知的退化中恢复清晰图像。现有的UIR方法主要集中在像素重建上,往往缺乏在恢复之前对退化组成、严重性和场景语义的明确诊断推理。我们提出了推理与恢复(R&R),这是一个新颖的框架,将结构化的思维链(Chain-of-Thought, CoT)推理集成到图像恢复流程中。R&R引入了一个明确的推理器,通过微调Qwen3-VL来实现,旨在诊断退化类型、量化退化严重性、推断关键的退化相关因素,并描述相关的场景和对象语义。所产生的结构化推理为恢复器提供了可解释的、细粒度的诊断先验。为了进一步提高恢复质量,推理器产生的量化退化严重性被用作强化学习(Reinforcement Learning, RL)信号,以指导和增强恢复器。与现有的基于多模态大语言模型(LLM)的代理系统将推理与低级视觉任务解耦的做法不同,R&R在一个统一框架中紧密结合了语义诊断推理与像素级恢复。广泛的实验结果表明,R&R在多样化的UIR基准测试中实现了最先进的性能,同时为恢复过程提供了独特的可解释性。
cs.CV / 114 / 2604.09527

Envisioning the Future, One Step at a Time

一步步展望未来
Baumann, Stefan Andreas, Wiese, Jannik, Martorella, Tommaso, Kalayeh, Mahdi M., Ommer, Björn
Abstract
Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence. We further introduce OWM, a benchmark for open-set motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-set future prediction both scalable and practical. Project page: http://compvis.github.io/myriad.
Chinese Translation
准确预测复杂多样场景的演变需要能够表示不确定性、沿着长时间交互链进行模拟并高效探索多种可能未来的模型。然而,大多数现有方法依赖于密集视频或潜在空间预测,耗费大量资源在密集的外观信息上,而非场景中稀疏点的轨迹。这导致大规模未来假设的探索成本高昂,并在需要长时间、多模态运动预测时性能受限。我们通过将开放集未来场景动力学的预测形式化为对稀疏点轨迹的逐步推断来解决这一问题。我们的自回归扩散模型通过短期、局部可预测的转变推进这些轨迹,明确建模了不确定性的时间增长。这种以动力学为中心的表示使得从单张图像快速展开成千上万种多样未来成为可能,并可选择性地通过初始运动约束进行引导,同时保持物理合理性和长距离一致性。我们进一步引入了OWM,这是一个基于多样化野外视频的开放集运动预测基准,用于评估在现实世界不确定性下预测轨迹分布的准确性和多样性。我们的方法在预测准确性上匹配或超越了密集模拟器,同时采样速度提升了数量级,使得开放集未来预测既具备可扩展性又实用。项目主页:http://compvis.github.io/myriad。
cs.CV / 115 / 2604.09529

VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

VL-Calibration:面向大型视觉语言模型推理的解耦置信度校准
Xiao, Wenyi, Xu, Xinchi, Gan, Leilei
Abstract
Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.
Chinese Translation
大型视觉语言模型(LVLMs)在多模态推理方面表现出强大能力,但常常以高度确定性产生幻觉和错误回答,这限制了其在高风险领域的应用。现有的口头置信度校准方法主要针对纯文本的大型语言模型(LLMs)开发,通常通过二元答案级正确性来优化单一的整体置信度分数。这种设计与LVLMs不匹配:错误预测可能源于感知失败,也可能是在感知正确的前提下推理错误,而单一置信度将这些来源混为一谈,且视觉不确定性往往被语言先验所主导。为解决这些问题,我们提出了VL-Calibration,一种通过强化学习框架显式将置信度解耦为视觉置信度和推理置信度的方法。为了在无真实感知标签的情况下监督视觉置信度,我们引入了一种内在视觉确定性估计,结合了(i)通过图像扰动下的KL散度衡量的视觉定位和(ii)通过token熵衡量的内部确定性。我们进一步提出了基于视觉确定性的token级优势重加权,聚焦于基于视觉确定性的token优化,抑制无根幻觉同时保持有效感知。十三个基准实验表明,VL-Calibration有效提升了校准效果并增强了视觉推理准确性,且在不同模型规模和架构的分布外基准上具有良好的泛化能力。
cs.CV / 116 / 2604.09531

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

VisionFoundry:利用合成图像教授视觉语言模型(VLMs)视觉感知能力
Zhou, Guanyu, Yin, Yida, Chai, Wenhao, Tong, Shengbang, Fu, Xingyu, Liu, Zhuang
Abstract
Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
Chinese Translation
视觉语言模型(VLMs)在空间理解和视角识别等视觉感知任务上仍存在困难。一个可能的原因是自然图像数据集对低级视觉技能的监督有限。这引发了一个实际问题:是否可以通过仅基于任务关键词(如深度顺序Depth Order)生成的针对性合成监督来解决这些弱点?为探究该问题,我们提出了VisionFoundry,一种任务感知的合成数据生成流程,该流程仅以任务名称为输入,利用大型语言模型(LLMs)生成问题、答案及文本到图像(T2I)提示,然后使用T2I模型合成图像,并通过专有VLM验证一致性,无需参考图像或人工标注。基于VisionFoundry,我们构建了VisionFoundry-10K,一个包含1万条图像-问题-答案三元组、涵盖10个任务的合成视觉问答(VQA)数据集。在VisionFoundry-10K上训练的模型在视觉感知基准测试中取得显著提升:MMVP提升7%,CV-Bench-3D提升10%,同时保持了更广泛的能力,并随着数据规模增加表现出良好的扩展性。我们的结果表明,有限的任务针对性监督是该瓶颈的重要因素,合成监督为VLMs的更系统训练提供了有前景的路径。
cs.CV / 117 / 2604.09532

Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise

眼见为实:标签噪声下鲁棒的视觉引导跨模态提示学习
Geng, Zibin, Jiang, Xuefeng, Li, Jia, Li, Zheng, Wen, Tian, Wu, Lvhua, Sun, Sheng, Wang, Yuwei, Liu, Min
Abstract
Prompt learning is a parameter-efficient approach for vision-language models, yet its robustness under label noise is less investigated. Visual content contains richer and more reliable semantic information, which remains more robust under label noise. However, the prompt itself is highly susceptible to label noise. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we exploit a cross-modal attention mechanism to reversely inject visual semantics into prompt representations. This enables the prompt tokens to selectively aggregate visual information relevant to the current sample, thereby improving robustness by anchoring prompt learning to stable instance-level visual evidence and reducing the influence of noisy supervision. To address the instability caused by using the same way of injecting visual information for all samples, despite differences in the quality of their visual cues, we further introduce a lightweight conditional modulation mechanism to adaptively control the strength of visual information injection, which strikes a more robust balance between text-side semantic priors and image-side instance evidence. The proposed framework effectively suppresses the noise-induced disturbances, reduce instability in prompt updates, and alleviate memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small amount of additional trainable parameters. Extensive experiments under synthetic and real-world label noise demonstrate that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness. Our code is publicly available at https://github.com/gezbww/Vis_Prompt.
Chinese Translation
提示学习是一种针对视觉-语言模型的参数高效方法,但其在标签噪声环境下的鲁棒性研究较少。视觉内容包含更丰富且更可靠的语义信息,在标签噪声下表现出更强的鲁棒性。然而,提示本身对标签噪声高度敏感。基于此直觉,我们提出了VisPrompt,一种轻量且鲁棒的视觉引导提示学习框架,专为噪声标签设置设计。具体而言,我们利用跨模态注意力机制将视觉语义反向注入提示表示,使提示令牌能够选择性地聚合与当前样本相关的视觉信息,从而通过将提示学习锚定于稳定的实例级视觉证据,提高鲁棒性并减少噪声监督的影响。为解决对所有样本采用相同视觉信息注入方式所导致的不稳定性(尽管其视觉线索质量存在差异),我们进一步引入轻量的条件调制机制,自适应控制视觉信息注入强度,在文本侧语义先验与图像侧实例证据之间实现更鲁棒的平衡。所提框架有效抑制噪声引起的干扰,减少提示更新的不稳定性,并缓解错误标签样本的记忆效应。VisPrompt在保持预训练视觉-语言模型(VLM)骨干网络冻结的同时,仅引入少量可训练参数,显著提升了鲁棒性。在合成及真实标签噪声环境下的大量实验表明,VisPrompt在七个基准数据集上普遍优于现有基线方法,展现出更强的鲁棒性。我们的代码已公开,地址为https://github.com/gezbww/Vis_Prompt。
cs.CV / 118 / 2604.09535

EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

EgoTL:用于长时间任务的自我中心思维链
Liu, Lulin, Li, Dayou, Liang, Yiqing, Jiang, Sicong, Vijay, Hitesh, Hu, Hezhen, Xu, Xuhai, Liu, Zirui, Shakkottai, Srinivas, Li, Manling, Fan, Zhiwen
Abstract
Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations; these errors are amplified during long-horizon spatial instruction following. These issues stem from insufficient coverage of minute-long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we are able to benchmark VLMs and World Models on six task dimensions from three layers and long-horizon generation over minute-long sequences across over 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we finetune foundation models with human CoT aligned with metric labels on the training split of EgoTL, which improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.
Chinese Translation
大型基础模型在具身智能方面取得了显著进展,使得对家庭任务的自我中心输入进行综合和推理成为可能。然而,基于视觉语言模型(VLM)的自动标注往往存在噪声,因为主要数据源缺乏准确的人类动作标签、思维链(CoT)和空间注释;这些错误在长时间的空间指令跟随过程中被放大。这些问题源于对持续一分钟的日常家庭规划任务覆盖不足以及空间定位不准确。因此,VLM推理链和世界模型合成可能会幻觉出物体、跳过步骤或未能尊重现实世界的物理属性。为了解决这些问题,我们提出了EgoTL。EgoTL建立了一个自我中心数据的思维捕捉管道。它使用“先说后做”(say-before-act)协议记录逐步目标和带有词级时间戳的口头推理,然后通过度量尺度的空间估计器、场景上下文的记忆库走查以及导航指令和详细操作动作的剪辑级标签来校准物理属性。通过EgoTL,我们能够在来自三个层次的六个任务维度上对VLM和世界模型进行基准测试,并在超过100个日常家庭任务中进行持续一分钟的长时间生成。我们发现基础模型在作为自我中心助手或开放世界模拟器方面仍然存在不足。最后,我们使用与EgoTL训练集上的度量标签对齐的人类思维链(CoT)对基础模型进行微调,从而改善了长时间规划和推理、逐步推理、指令跟随和空间定位。
cs.CV / 119 / 2604.09547

Tango: Taming Visual Signals for Efficient Video Large Language Models

Tango:驯化视觉信号以实现高效的视频大型语言模型
Yin, Shukang, Zhao, Sirui, Wang, Hanchao, Jia, Baozhi, Wang, Xianquan, Fu, Chaoyou, Chen, Enhong
Abstract
Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and similarity-based clustering. Our study reveals two critical limitations in existing methods: (1) conventional top-k selection strategies fail to fully account for the attention distribution, which is often spatially multi-modal and long-tailed in magnitude; and (2) direct similarity-based clustering frequently generates fragmented clusters, resulting in distorted representations after pooling. To address these bottlenecks, we propose Tango, a novel framework designed to optimize the utilization of visual signals. Tango integrates a diversity-driven strategy to enhance attention-based token selection, and introduces Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure via locality priors. Comprehensive experiments across various Video LLMs and video understanding benchmarks demonstrate the effectiveness and generalizability of our approach. Notably, when retaining only 10% of the video tokens, Tango preserves 98.9% of the original performance on LLaVA-OV while delivering a 1.88x inference speedup.
Chinese Translation
令牌修剪已成为开发高效视频大型语言模型(Video LLMs)的主流方法。本研究重新审视并推进了两种主要的令牌修剪范式:基于注意力的选择和基于相似性的聚类。我们的研究揭示了现有方法的两个关键局限性:(1)传统的 top-k 选择策略未能充分考虑注意力分布,而注意力分布通常在空间上是多模态的,并且在幅度上呈长尾分布;(2)直接的基于相似性的聚类常常生成碎片化的聚类,导致池化后表现失真。为了解决这些瓶颈,我们提出了 Tango,一个旨在优化视觉信号利用的新框架。Tango 集成了一种以多样性驱动的策略,以增强基于注意力的令牌选择,并引入了时空旋转位置嵌入(Spatio-temporal Rotary Position Embedding, ST-RoPE)以通过局部先验保持几何结构。在各种视频 LLM 和视频理解基准上的全面实验表明了我们方法的有效性和普适性。值得注意的是,当仅保留 10% 的视频令牌时,Tango 在 LLaVA-OV 上保留了 98.9% 的原始性能,同时实现了 1.88 倍的推理加速。
人工智能 (Artificial Intelligence)
26
cs.AI / 1 / 2604.08601

OpenKedge: Governing Agentic Mutation with Execution-Bound Safety and Evidence Chains

OpenKedge:以执行约束安全性和证据链治理自主变异
He, Jun, Yu, Deying
Abstract
The rise of autonomous AI agents exposes a fundamental flaw in API-centric architectures: probabilistic systems directly execute state mutations without sufficient context, coordination, or safety guarantees. We introduce OpenKedge, a protocol that redefines mutation as a governed process rather than an immediate consequence of API invocation. OpenKedge requires actors to submit declarative intent proposals, which are evaluated against deterministically derived system state, temporal signals, and policy constraints prior to execution. Approved intents are compiled into execution contracts that strictly bound permitted actions, resource scope, and time, and are enforced via ephemeral, task-oriented identities. This shifts safety from reactive filtering to preventative, execution-bound enforcement. Crucially, OpenKedge introduces an Intent-to-Execution Evidence Chain (IEEC), which cryptographically links intent, context, policy decisions, execution bounds, and outcomes into a unified lineage. This transforms mutation into a verifiable and reconstructable process, enabling deterministic auditability and reasoning about system behavior. We evaluate OpenKedge across multi-agent conflict scenarios and cloud infrastructure mutations. Results show that the protocol deterministically arbitrates competing intents and cages unsafe execution while maintaining high throughput, establishing a principled foundation for safely operating agentic systems at scale.
Chinese Translation
自主人工智能代理的兴起暴露了以API为中心的架构中的一个根本缺陷:概率系统直接执行状态变更,而没有足够的上下文、协调或安全保障。我们提出了OpenKedge,一种将变异重新定义为受控过程而非API调用的直接结果的协议。OpenKedge要求参与者提交声明性意图提案,这些提案在执行之前会根据确定性推导的系统状态、时间信号和政策约束进行评估。获得批准的意图被编译成执行合同,这些合同严格限制了允许的操作、资源范围和时间,并通过短暂的、任务导向的身份进行强制执行。这将安全性从反应式过滤转变为预防性、执行约束的强制。至关重要的是,OpenKedge引入了一种意图到执行的证据链(Intent-to-Execution Evidence Chain, IEEC),它通过加密方式将意图、上下文、政策决策、执行界限和结果链接成一个统一的谱系。这将变异转变为一个可验证和可重构的过程,使得系统行为的确定性审计和推理成为可能。我们在多代理冲突场景和云基础设施变更中评估了OpenKedge。结果表明,该协议以确定性的方式仲裁竞争意图,并限制不安全的执行,同时保持高吞吐量,为安全地大规模操作自主系统奠定了原则基础。
cs.AI / 2 / 2604.08603

From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI

从商业事件到可审计决策:面向企业人工智能的本体驱动图模拟
Zhu, Hongyin, Liang, Jinming, Hou, Mengjun, Tang, Ruifan, Zhu, Xianbin, Yang, Jingyuan, Mao, Yuanman, Wu, Feng
Abstract
Existing LLM-based agent systems share a common architectural failure: they answer from the unrestricted knowledge space without first simulating how active business scenarios reshape that space for the event at hand -- producing decisions that are fluent but ungrounded and carrying no audit trail. We present LOM-action, which equips enterprise AI with \emph{event-driven ontology simulation}: business events trigger scenario conditions encoded in the enterprise ontology~(EO), which drive deterministic graph mutations in an isolated sandbox, evolving a working copy of the subgraph into the scenario-valid simulation graph $G_{\text{sim}}$; all decisions are derived exclusively from this evolved graph. The core pipeline is \emph{event $\to$ simulation $\to$ decision}, realized through a dual-mode architecture -- \emph{skill mode} and \emph{reasoning mode}. Every decision produces a fully traceable audit log. LOM-action achieves 93.82% accuracy and 98.74% tool-chain F1 against frontier baselines Doubao-1.8 and DeepSeek-V3.2, which reach only 24--36% F1 despite 80% accuracy -- exposing the \emph{illusive accuracy} phenomenon. The four-fold F1 advantage confirms that ontology-governed, event-driven simulation, not model scale, is the architectural prerequisite for trustworthy enterprise decision intelligence.
Chinese Translation
现有的基于大型语言模型(LLM)的代理系统存在一个共同的架构缺陷:它们在没有首先模拟活跃商业场景如何重塑知识空间的情况下,从不受限制的知识空间中回答问题——产生的决策流畅但缺乏基础,并且没有审计轨迹。我们提出了LOM-action,它为企业人工智能提供了 extit{事件驱动的本体模拟}:商业事件触发编码在企业本体(Enterprise Ontology, EO)中的场景条件,这些条件驱动在隔离沙箱中的确定性图变异,将子图的工作副本演变为场景有效的模拟图$G_{ ext{sim}}$;所有决策均仅源自这一演变后的图。核心流程是 extit{事件 $ o$ 模拟 $ o$ 决策},通过双模式架构实现—— extit{技能模式}和 extit{推理模式}。每个决策都会生成一个完全可追溯的审计日志。LOM-action在前沿基准Doubao-1.8和DeepSeek-V3.2上实现了93.82%的准确率和98.74%的工具链F1,而后者尽管准确率达到80%,但F1仅为24%至36%——揭示了 extit{虚幻准确性}现象。四倍的F1优势确认了本体驱动的事件驱动模拟,而非模型规模,是可信企业决策智能的架构前提。
cs.AI / 3 / 2604.08621

Sustained Impact of Agentic Personalisation in Marketing: A Longitudinal Case Study

营销中代理性个性化的持续影响:一项纵向案例研究
Jeunen, Olivier, Hanna, Eleanor, Wheeler, Schaun
Abstract
In consumer applications, Customer Relationship Management (CRM) has traditionally relied on the manual optimisation of static, rule-based messaging strategies. While adaptive and autonomous learning systems offer the promise of scalable personalisation, it remains unclear to what extent ``human-in-the-loop'' oversight is required to sustain performance uplift over time. This paper presents a longitudinal case study analysing a real-world consumer application that leverages agentic infrastructure to personalise marketing messaging for a large-scale user base over an 11-month period. We compare two distinct periods: an active phase where marketers directly curated content, audiences, and strategies -- followed immediately by a passive phase where agents operated autonomously from a fixed library of components. Our results demonstrate that whilst active human management generates the highest relative lift in engagement metrics, the autonomous agents successfully sustained a positive lift during the passive period. These findings suggest a symbiotic model where human intervention drives strategic initialisation and discovery, yet autonomous agents can ensure the scalable retention and preservation of performance gains.
Chinese Translation
在消费者应用中,客户关系管理(CRM)传统上依赖于静态规则基础消息策略的手动优化。尽管自适应和自主学习系统提供了可扩展个性化的前景,但仍不清楚在多大程度上需要“人机协作”的监督以维持长期的性能提升。本文呈现了一项纵向案例研究,分析了一个真实的消费者应用,该应用利用代理基础设施在11个月的时间内为大规模用户群体个性化营销信息。我们比较了两个不同的阶段:一个是积极阶段,营销人员直接策划内容、受众和策略;紧接着是一个被动阶段,代理从固定组件库中自主操作。我们的结果表明,尽管积极的人为管理在参与度指标上产生了最高的相对提升,但自主代理在被动阶段成功维持了正向提升。这些发现表明了一种共生模型,其中人类干预推动战略初始化和发现,而自主代理则可以确保性能提升的可扩展保留和维护。
cs.AI / 4 / 2604.08685

RAMP: Hybrid DRL for Online Learning of Numeric Action Models

RAMP:用于在线学习数值动作模型的混合深度强化学习
Benyamin, Yarin, Mordoch, Argaman, Shperberg, Shahaf S., Stern, Roni
Abstract
Automated planning algorithms require an action model specifying the preconditions and effects of each action, but obtaining such a model is often hard. Learning action models from observations is feasible, but existing algorithms for numeric domains are offline, requiring expert traces as input. We propose the Reinforcement learning, Action Model learning, and Planning (RAMP) strategy for learning numeric planning action models online via interactions with the environment. RAMP simultaneously trains a Deep Reinforcement Learning (DRL) policy, learns a numeric action model from past interactions, and uses that model to plan future actions when possible. These components form a positive feedback loop: the RL policy gathers data to refine the action model, while the planner generates plans to continue training the RL policy. To facilitate this integration of RL and numeric planning, we developed Numeric PDDLGym, an automated framework for converting numeric planning problems to Gym environments. Experimental results on standard IPC numeric domains show that RAMP significantly outperforms PPO, a well-known DRL algorithm, in terms of solvability and plan quality.
Chinese Translation
自动规划算法需要一个动作模型来指定每个动作的前置条件和效果,但获取这样的模型往往很困难。从观察中学习动作模型是可行的,但现有的针对数值领域的算法是离线的,需要专家轨迹作为输入。我们提出了强化学习、动作模型学习和规划(RAMP)策略,通过与环境的交互在线学习数值规划动作模型。RAMP同时训练一个深度强化学习(DRL)策略,从过去的交互中学习数值动作模型,并在可能的情况下使用该模型规划未来的动作。这些组件形成了一个正反馈循环:强化学习策略收集数据以完善动作模型,而规划器生成计划以继续训练强化学习策略。为了促进强化学习与数值规划的整合,我们开发了数值PDDLGym,这是一个将数值规划问题转换为Gym环境的自动化框架。在标准IPC数值领域的实验结果表明,RAMP在可解性和计划质量方面显著优于著名的DRL算法PPO。
cs.AI / 5 / 2604.08707

Parameterized Complexity Of Representing Models Of MSO Formulas

MSO公式模型表示的参数化复杂性
Kučera, Petr, Martinek, Petr
Abstract
Monadic second order logic (MSO2) plays an important role in parameterized complexity due to the Courcelle's theorem. This theorem states that the problem of checking if a given graph has a property specified by a given MSO2 formula can be solved by a parameterized linear time algorithm with respect to the treewidth of the graph and the size of the formula. We extend this result by showing that models of MSO2 formula with free variables can be represented with a decision diagram whose size is parameterized linear in the above mentioned parameter. In particular, we show a parameterized linear upper bound on the size of a sentential decision diagram (SDD) when treewidth is considered and a parameterized linear upper bound on the size of an ordered binary decision diagram (OBDD) when considering the pathwidth in the parameter. In addition, building on a lower bound on the size of OBDD by Razgon (2014), we show that there is an MSO2 formula and a class of graphs with bounded treewidth which do not admit an OBDD with the size parameterized by the treewidth. Our result offers a new perspective on the Courcelle's theorem and connects it to the area of knowledge representation.
Chinese Translation
一阶单元二阶逻辑(MSO2)因库尔塞尔定理(Courcelle's theorem)在参数化复杂性中具有重要作用。该定理指出,判断给定图是否具有由给定MSO2公式指定的性质的问题,可以通过一个关于图的树宽(treewidth)和公式大小的参数化线性时间算法解决。我们在此基础上扩展了该结果,证明带有自由变量的MSO2公式的模型可以用一种决策图表示,该决策图的大小在上述参数下是参数化线性的。具体而言,我们证明了当参数为树宽时,句子决策图(sentential decision diagram,SDD)的大小存在参数化线性上界;当参数为路径宽(pathwidth)时,有序二叉决策图(ordered binary decision diagram,OBDD)的大小也存在参数化线性上界。此外,基于Razgon(2014)关于OBDD大小的下界,我们展示存在一个MSO2公式及一类具有有界树宽的图,这类图不允许存在以树宽为参数的OBDD大小的参数化线性表示。我们的结果为库尔塞尔定理提供了新的视角,并将其与知识表示领域建立了联系。
cs.AI / 6 / 2604.08712

Model Space Reasoning as Search in Feedback Space for Planning Domain Generation

将模型空间推理视为规划领域生成中的反馈空间搜索
Oswald, James, Oblinsky, Daniel, Varha, Volodymyr, Dragovic, Vasilije, Kokel, Harsha, Srinivas, Kavitha, Katz, Michael, Sohrabi, Shirin
Abstract
The generation of planning domains from natural language descriptions remains an open problem even with the advent of large language models and reasoning models. Recent work suggests that while LLMs have the ability to assist with domain generation, they are still far from producing high quality domains that can be deployed in practice. To this end, we investigate the ability of an agentic language model feedback framework to generate planning domains from natural language descriptions that have been augmented with a minimal amount of symbolic information. In particular, we evaluate the quality of the generated domains under various forms of symbolic feedback, including landmarks, and output from the VAL plan validator. Using these feedback mechanisms, we experiment using heuristic search over model space to optimize domain quality.
Chinese Translation
从自然语言描述生成规划领域仍然是一个未解决的问题,即使在大型语言模型和推理模型出现之后。最近的研究表明,尽管大型语言模型(LLMs)能够协助领域生成,但它们仍远未能生成可以在实践中部署的高质量领域。为此,我们研究了一种代理语言模型反馈框架的能力,该框架能够从增强了最少量符号信息的自然语言描述中生成规划领域。特别地,我们评估了在各种符号反馈形式下生成领域的质量,包括地标(landmarks)和来自VAL计划验证器的输出。利用这些反馈机制,我们通过在模型空间上进行启发式搜索来优化领域质量。
cs.AI / 7 / 2604.08756

Artifacts as Memory Beyond the Agent Boundary

超越主体边界的记忆工件
Martin, John D., Mince, Fraser, Saleh, Esra'a, Pajak, Amy
Abstract
The situated view of cognition holds that intelligent behavior depends not only on internal memory, but on an agent's active use of environmental resources. Here, we begin formalizing this intuition within Reinforcement Learning (RL). We introduce a mathematical framing for how the environment can functionally serve as an agent's memory, and prove that certain observations, which we call artifacts, can reduce the information needed to represent history. We corroborate our theory with experiments showing that when agents observe spatial paths, the amount of memory required to learn a performant policy is reduced. Interestingly, this effect arises unintentionally, and implicitly through the agent's sensory stream. We discuss the implications of our findings, and show they satisfy qualitative properties previously used to ground accounts of external memory. Moving forward, we anticipate further work on this subject could reveal principled ways to exploit the environment as a substitute for explicit internal memory.
Chinese Translation
情境认知观认为,智能行为不仅依赖于内部记忆,还依赖于主体对环境资源的主动利用。在此,我们开始在强化学习(Reinforcement Learning, RL)中形式化这一直觉。我们引入了一种数学框架,说明环境如何在功能上充当主体的记忆,并证明某些观察结果(我们称之为工件)可以减少表示历史所需的信息量。我们通过实验验证了我们的理论,结果表明,当主体观察空间路径时,学习高效策略所需的记忆量减少。有趣的是,这种效果是无意中产生的,并且是通过主体的感知流隐含实现的。我们讨论了研究结果的意义,并展示它们满足之前用于支持外部记忆的定性特性。展望未来,我们预计在这一主题上的进一步研究可能揭示出利用环境作为显式内部记忆替代品的原则性方法。
cs.AI / 8 / 2604.08863

Hidden in Plain Sight: Visual-to-Symbolic Analytical Solution Inference from Field Visualizations

隐藏在显而易见之中:从场可视化中推断视觉到符号的解析解
Li, Pengze, Zhang, Jiaquan, Long, Yunbo, Liu, Xinping, wenjie, Zhou, Su, Encheng, Zeng, Zihang, Liu, Jiaqi, Liu, Jiyao, Yu, Junchi, Liu, Lihao, Torr, Philip, Tang, Shixiang, Wang, Aoran, Chen, Xi
Abstract
Recovering analytical solutions of physical fields from visual observations is a fundamental yet underexplored capability for AI-assisted scientific reasoning. We study visual-to-symbolic analytical solution inference (ViSA) for two-dimensional linear steady-state fields: given field visualizations (and first-order derivatives) plus minimal auxiliary metadata, the model must output a single executable SymPy expression with fully instantiated numeric constants. We introduce ViSA-R2 and align it with a self-verifying, solution-centric chain-of-thought pipeline that follows a physicist-like pathway: structural pattern recognition solution-family (ansatz) hypothesis parameter derivation consistency verification. We also release ViSA-Bench, a VLM-ready synthetic benchmark covering 30 linear steady-state scenarios with verifiable analytical/symbolic annotations, and evaluate predictions by numerical accuracy, expression-structure similarity, and character-level accuracy. Using an 8B open-weight Qwen3-VL backbone, ViSA-R2 outperforms strong open-source baselines and the evaluated closed-source frontier VLMs under a standardized protocol.
Chinese Translation
从视觉观察中恢复物理场的解析解是AI辅助科学推理的一项基本但尚未充分探索的能力。我们研究了二维线性稳态场的视觉到符号解析解推断(ViSA):给定场可视化(及一阶导数)和最小的辅助元数据,模型必须输出一个包含完全实例化数值常数的可执行SymPy表达式。我们引入了ViSA-R2,并将其与一个自验证的、以解为中心的思维链流程对齐,该流程遵循物理学家式的路径:结构模式识别解集(ansatz)假设参数推导一致性验证。我们还发布了ViSA-Bench,这是一个适用于视觉语言模型(VLM)的合成基准,涵盖30个具有可验证解析/符号注释的线性稳态场景,并通过数值准确性、表达结构相似性和字符级准确性评估预测。使用一个8B开放权重的Qwen3-VL骨干网络,ViSA-R2在标准化协议下超越了强大的开源基线和评估的闭源前沿VLM。
cs.AI / 9 / 2604.08865

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

SPPO:面向长时序推理任务的序列级PPO算法
Wang, Tianyi, Li, Yixia, Li, Long, Chen, Yibiao, Huang, Shaohan, Chen, Yun, Li, Peng, Liu, Yang, Chen, Guanhua
Abstract
Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.
Chinese Translation
近端策略优化(Proximal Policy Optimization,PPO)在利用可验证奖励对大型语言模型(LLMs)进行推理任务对齐中起着核心作用。然而,标准的基于token级别的PPO在该场景下表现不佳,原因在于长链式思维(Chain-of-Thought,CoT)跨度中时间信用分配的不稳定性以及价值模型的高昂内存开销。尽管无评论家(critic-free)替代方法如GRPO缓解了这些问题,但它们通过多样本基线估计带来了显著的计算负担,严重限制了训练吞吐量。本文提出了序列级PPO(Sequence-Level PPO,SPPO),这是一种可扩展算法,融合了PPO的样本效率与基于结果更新的稳定性。SPPO将推理过程重新表述为序列级上下文赌博机(Sequence-Level Contextual Bandit)问题,采用解耦的标量价值函数以无多样本采样的方式导出低方差优势信号。在数学基准测试中的大量实验表明,SPPO显著优于标准PPO,并匹配了计算密集型基于组的方法的性能,提供了一种资源高效的推理LLMs对齐框架。
cs.AI / 10 / 2604.08905

StaRPO: Stability-Augmented Reinforcement Policy Optimization

StaRPO:稳定性增强的强化策略优化
Zhang, Jinghan, Mo, Fengran, Weerasooriya, Tharindu Cyril, Dai, Ruimin, Han, Xiaoyan, Fu, Yanjie, Wang, Dakuo, Liu, Kunpeng
Abstract
Reinforcement learning (RL) is effective in enhancing the accuracy of large language models in complex reasoning tasks. Existing RL policy optimization frameworks rely on final-answer correctness as feedback signals and rarely capture the internal logical structure of the reasoning process. Consequently, the models would generate fluent and semantically relevant responses but logically inconsistent, structurally erratic, or redundant. To this end, we propose StaRPO, a stability-augmented reinforcement learning framework that explicitly incorporates reasoning stability into the optimization objective. Our StaRPO decomposes stability into two computable lightweight metrics: the Autocorrelation Function (ACF) to evaluate local step-to-step coherence, and Path Efficiency (PE) to evaluate global goal-directedness of the reasoning trajectory. These stability rewards are combined with task rewards to provide complementary and process-aware feedback. We validate the effectiveness of using ACF and PE rewards by showing their correlation with logic errors on two backbone models. Experiments on four reasoning benchmarks show that StaRPO consistently outperforms compared baselines and can enhance both final-answer accuracy and logical stability.
Chinese Translation
强化学习(Reinforcement Learning, RL)在提升大型语言模型处理复杂推理任务的准确性方面表现出色。现有的RL策略优化框架主要依赖最终答案的正确性作为反馈信号,较少捕捉推理过程中的内部逻辑结构。因此,模型生成的回答虽然流畅且语义相关,但在逻辑上可能不一致、结构上混乱或存在冗余。为此,我们提出了StaRPO,一种稳定性增强的强化学习框架,显式地将推理稳定性纳入优化目标。StaRPO将稳定性分解为两个可计算的轻量级指标:自相关函数(Autocorrelation Function, ACF)用于评估局部步骤间的一致性,路径效率(Path Efficiency, PE)用于评估推理轨迹的全局目标导向性。这些稳定性奖励与任务奖励相结合,提供互补且过程感知的反馈。我们通过展示ACF和PE奖励与两个基础模型逻辑错误的相关性,验证了其有效性。在四个推理基准测试中的实验结果表明,StaRPO持续优于对比基线,能够同时提升最终答案的准确性和逻辑稳定性。
cs.AI / 11 / 2604.08931

Enhancing LLM Problem Solving via Tutor-Student Multi-Agent Interaction

通过导师-学生多智能体交互提升大型语言模型(LLM)的问题解决能力
Özdemir, Nurullah Eymen, Oztop, Erhan
Abstract
Human cognitive development is shaped not only by individual effort but by structured social interaction, where role-based exchanges such as those between a tutor and a learner, enable solutions that neither could achieve alone. Inspired by these developmental principles, we ask the question whether a tutor-student multi-agent system can create a synergistic effect by pushing Large Language Model (LLM) beyond what it can do within existing frameworks. To test the idea, we adopt autonomous coding problem domain where two agents instantiated from the same LLM assigned asymmetric roles: a student agent generates and iteratively refines solutions, while a tutor agent provides structured evaluative feedback without access to ground-truth answers. In our proposed framework (PETITE), we aim to extract better problem-solving performance from one model by structuring its interaction through complementary roles, rather than relying on stronger supervisory models or heterogeneous ensembles. Our model is evaluated on the APPS coding benchmark against state-of-the-art approaches of Self-Consistency, Self-Refine, Multi-Agent Debate, and Multi-Agent Review. The results show that our model achieves similar or higher accuracy while consuming significantly fewer tokens. These results suggest that developmentally grounded role-differentiated interaction structures provide a principled and resource-efficient paradigm for enhancing LLM problem-solving through structured peer-like interactions. Index Terms- Peer Tutoring, Scaffolding, Large Language Models, Multi-Agent Systems, Code Generation
Chinese Translation
人类认知发展不仅受个体努力影响,更受结构化社会互动的塑造,其中基于角色的交流(如导师与学习者之间的互动)能够实现单独个体无法达到的解决方案。受这些发展原则启发,我们提出问题:导师-学生多智能体系统是否能够通过推动大型语言模型(LLM)超越现有框架的能力,产生协同效应。为验证这一想法,我们采用自主编码问题领域,两个由同一LLM实例化的智能体被赋予非对称角色:学生智能体负责生成并迭代优化解决方案,导师智能体则在无访问真实答案的情况下提供结构化的评估反馈。在我们提出的框架PETITE中,我们旨在通过互补角色结构化交互,从单一模型中提取更优的问题解决性能,而非依赖更强的监督模型或异构集成。我们的模型在APPS编码基准上与Self-Consistency、自我优化(Self-Refine)、多智能体辩论(Multi-Agent Debate)及多智能体评审(Multi-Agent Review)等最先进方法进行了对比评估。结果表明,我们的模型在准确率上达到相似或更高水平,同时显著减少了令牌消耗。这些结果表明,基于发展心理学的角色差异化交互结构为通过结构化的同伴式互动提升LLM问题解决能力提供了一种有原则且资源高效的范式。关键词:同伴辅导,支架式教学,大型语言模型,多智能体系统,代码生成
cs.AI / 12 / 2604.08987

PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

PilotBench:具有安全约束的一般航空代理基准测试
Wu, Yalun, Liu, Haotian, Li, Zhoujun, Wang, Boyang
Abstract
As Large Language Models (LLMs) advance toward embodied AI agents operating in physical environments, a fundamental question emerges: can models trained on text corpora reliably reason about complex physics while adhering to safety constraints? We address this through PilotBench, a benchmark evaluating LLMs on safety-critical flight trajectory and attitude prediction. Built from 708 real-world general aviation trajectories spanning nine operationally distinct flight phases with synchronized 34-channel telemetry, PilotBench systematically probes the intersection of semantic understanding and physics-governed prediction through comparative analysis of LLMs and traditional forecasters. We introduce Pilot-Score, a composite metric balancing 60% regression accuracy with 40% instruction adherence and safety compliance. Comparative evaluation across 41 models uncovers a Precision-Controllability Dichotomy: traditional forecasters achieve superior MAE of 7.01 but lack semantic reasoning capabilities, while LLMs gain controllability with 86--89% instruction-following at the cost of 11--14 MAE precision. Phase-stratified analysis further exposes a Dynamic Complexity Gap-LLM performance degrades sharply in high-workload phases such as Climb and Approach, suggesting brittle implicit physics models. These empirical discoveries motivate hybrid architectures combining LLMs' symbolic reasoning with specialized forecasters' numerical precision. PilotBench provides a rigorous foundation for advancing embodied AI in safety-constrained domains.
Chinese Translation
随着大型语言模型(LLMs)向在物理环境中操作的具身人工智能代理的发展,一个基本问题浮现:在遵循安全约束的同时,基于文本语料库训练的模型能否可靠地推理复杂的物理现象?我们通过PilotBench来解决这一问题,该基准测试评估LLMs在安全关键的飞行轨迹和姿态预测方面的表现。PilotBench基于708条真实世界的一般航空轨迹构建,涵盖九个操作上不同的飞行阶段,并同步记录34通道的遥测数据。PilotBench系统性地探讨了语义理解与物理驱动预测之间的交集,通过对LLMs与传统预测器的比较分析。我们引入了Pilot-Score,这是一种复合指标,平衡了60%的回归准确性与40%的指令遵循和安全合规性。在41个模型的比较评估中,发现了精度-可控性二分法:传统预测器的平均绝对误差(MAE)为7.01,表现优越,但缺乏语义推理能力,而LLMs在指令遵循率达到86%至89%的同时,MAE精度却下降至11至14。分阶段分析进一步揭示了动态复杂性差距——LLM在高工作负载阶段(如爬升和进近)中的表现急剧下降,暗示其隐式物理模型的脆弱性。这些实证发现激励我们探索将LLMs的符号推理与专业预测器的数值精度相结合的混合架构。PilotBench为在安全约束领域推进具身人工智能提供了严格的基础。
cs.AI / 13 / 2604.08988

SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

SEA-Eval:超越情节评估的自我进化智能体评测基准
Jiang, Sihang, Ma, Lipeng, Hong, Zhonghua, Wang, Keyi, Lu, Zhiyu, Chen, Shisong, Zhang, Jinghao, Pan, Tianjun, Zhou, Weijia, Liang, Jiaqing, Xiao, Yanghua
Abstract
Current LLM-based agents demonstrate strong performance in episodic task execution but remain constrained by static toolsets and episodic amnesia, failing to accumulate experience or optimize strategies across task boundaries. While the Self-Evolving Agent (SEA) paradigm has been previously proposed, this paper contributes a new formal definition of SEA grounded in digital embodiment and continuous cross-task evolution, and introduces SEA-Eval, the first benchmark designed to evaluate SEA characteristics across two dimensions, intra-task execution reliability and long-term evolutionary performance. By organizing tasks into sequential streams and analyzing Success Rate and Token Consumption over time, SEA-Eval quantifies evolutionary gain and structural stability in ways that existing episodic benchmarks cannot. Empirical evaluations reveal a significant evolutionary bottleneck in current state-of-the-art frameworks, where identical success rates mask up to 31.2 times differences in token consumption and divergent evolutionary trajectories under sequential analysis. SEA-Eval provides a rigorous scientific foundation for advancing agents from mere task executors toward genuinely self-evolving digital entities.
Chinese Translation
当前基于大型语言模型(LLM)的智能体在单次任务执行中表现出色,但仍受限于静态工具集和情节性遗忘,无法在任务边界间积累经验或优化策略。尽管此前已有自我进化智能体(Self-Evolving Agent,SEA)范式的提出,本文基于数字化具身(digital embodiment)和跨任务持续进化,提出了SEA的新形式化定义,并引入SEA-Eval,这是首个旨在从任务内执行可靠性和长期进化性能两个维度评估SEA特性的基准。通过将任务组织为连续流并分析成功率与令牌消耗随时间的变化,SEA-Eval以现有情节基准无法实现的方式量化进化增益与结构稳定性。实证评估揭示了当前最先进框架中的显著进化瓶颈,即在相同成功率下令牌消耗存在高达31.2倍的差异,且在序列分析中呈现出不同的进化轨迹。SEA-Eval为推动智能体从单纯任务执行者向真正自我进化的数字实体转变提供了严谨的科学基础。
cs.AI / 14 / 2604.09001

Hypergraph Neural Networks Accelerate MUS Enumeration

超图神经网络加速最小不可满足子集枚举
Ijima, Hiroya, Yawata, Koichiro
Abstract
Enumerating Minimal Unsatisfiable Subsets (MUSes) is a fundamental task in constraint satisfaction problems (CSPs). Its major challenge is the exponential growth of the search space, which becomes particularly severe when satisfiability checks are expensive. Recent machine learning approaches reduce this cost for Boolean satisfiability problems but rely on explicit variable-constraint relationships, limiting their application domains. This paper proposes a domain-agnostic method to accelerate MUS enumeration using Hypergraph Neural Networks (HGNNs). The proposed method incrementally builds a hypergraph with constraints as vertices and MUSes enumerated until the current step as hyperedges, and employs an HGNN-based agent trained via reinforcement learning to minimize the number of satisfiability checks required to obtain an MUS. Experimental results demonstrate the effectiveness of our approach in accelerating MUS enumeration, showing that our method can enumerate more MUSes within the same satisfiability check budget compared to conventional methods.
Chinese Translation
枚举最小不可满足子集(MUSes)是约束满足问题(CSPs)中的一项基础任务。其主要挑战在于搜索空间的指数级增长,尤其当可满足性检查代价较高时问题更加严重。近期的机器学习方法通过减少布尔可满足性问题中的检查成本取得了一定进展,但这些方法依赖于显式的变量-约束关系,限制了其应用领域。本文提出了一种基于超图神经网络(Hypergraph Neural Networks, HGNNs)的领域无关方法,以加速MUS枚举。该方法通过将约束作为顶点、当前步骤已枚举的MUS作为超边,增量构建超图,并利用基于HGNN的智能体通过强化学习进行训练,旨在最小化获得MUS所需的可满足性检查次数。实验结果表明,本方法在加速MUS枚举方面具有显著效果,在相同的可满足性检查预算下,能够枚举出更多的MUS,优于传统方法。
cs.AI / 15 / 2604.09035

Advantage-Guided Diffusion for Model-Based Reinforcement Learning

基于优势引导的扩散模型在基于模型的强化学习中的应用
Foffano, Daniele, Eriksson, Arvid, Broman, David, Johansson, Karl H., Proutiere, Alexandre
Abstract
Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, discarding value information, or reward-based, which becomes myopic when the diffusion horizon is short. We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent's advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG allows us to perform reweighted sampling of trajectories with weights increasing in state-action advantage-implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated from AGD-MBRL follow an improved policy (that is, with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL.
Chinese Translation
基于模型的强化学习(MBRL)中,采用自回归世界模型存在误差累积问题,而扩散世界模型通过联合生成轨迹片段缓解了该问题。然而,现有的扩散引导方法要么仅基于策略,忽略了价值信息,要么基于奖励,在扩散时域较短时表现出短视性。我们提出了用于MBRL的优势引导扩散(Advantage-Guided Diffusion,AGD-MBRL),该方法利用智能体的优势估计引导反向扩散过程,使采样集中于预期在生成窗口之外能带来更高长期回报的轨迹。我们设计了两种引导策略:(i)Sigmoid优势引导(SAG)和(ii)指数优势引导(EAG)。我们证明,采用SAG或EAG引导的扩散模型允许对轨迹进行重加权采样,权重随状态-动作优势增加,从而在标准假设下实现策略改进。此外,我们展示了AGD-MBRL生成的轨迹相较于无引导扩散模型遵循改进的策略(即具有更高价值)。AGD能够无缝集成到PolyGRAD风格的架构中,通过引导状态部分而保持动作生成的策略条件性,且无需修改扩散训练目标。在MuJoCo控制任务(HalfCheetah、Hopper、Walker2D和Reacher)中,AGD-MBRL在样本效率和最终回报上均优于PolyGRAD、在线Diffuser风格的奖励引导以及无模型基线(PPO/TRPO),部分任务提升幅度达到2倍。结果表明,基于优势的引导是解决扩散模型MBRL中短视性问题的简便且有效的方法。
cs.AI / 16 / 2604.09072

Overhang Tower: Resource-Rational Adaptation in Sequential Physical Planning

悬臂塔:在资源约束下的顺序物理规划中的资源理性适应
Shen, Ruihong, Li, Shiqian, Zhu, Yixin
Abstract
Humans effortlessly navigate the physical world by predicting how objects behave under gravity and contact forces, yet how such judgments support sequential physical planning under resource constraints remains poorly understood. Research on intuitive physics debates whether prediction relies on the Intuitive Physics Engine (IPE) or fast, cue-based heuristics; separately, decision-making research debates deliberative lookahead versus myopic strategies. These debates have proceeded in isolation, leaving the cognitive architecture of sequential physical planning underspecified. How physical prediction mechanisms and planning strategies jointly adapt under limited cognitive resources remains an open question. Here we show that humans exhibit a dual transition under resource pressure, simultaneously shifting both physical prediction mechanism and planning strategy to match cognitive budget. Using Overhang Tower, a construction task requiring participants to maximize horizontal overhang while maintaining stability, we find that IPE-based simulation dominates early stages while CNN-based visual heuristics prevail as complexity grows; concurrently, time pressure truncates deliberative lookahead, shifting planning toward shallower horizons: a dual transition unpredicted by prior single-mechanism accounts. These findings reveal a hierarchical, resource-rational architecture that flexibly trades computational cost against predictive fidelity. Our results unify two long-standing debates (simulation vs. heuristics and myopic vs. deliberative planning) as a dynamic repertoire reconfigured by cognitive budget.
Chinese Translation
人类通过预测物体在重力和接触力作用下的行为,轻松地在物理世界中导航,但在资源约束下,这种判断如何支持顺序物理规划仍然不够清楚。关于直观物理的研究辩论预测是否依赖于直观物理引擎(Intuitive Physics Engine, IPE)或快速的基于线索的启发式方法;同时,决策研究则辩论深思熟虑的前瞻性策略与短视策略。这些辩论各自独立进行,导致顺序物理规划的认知架构未被明确。物理预测机制和规划策略在有限认知资源下如何共同适应仍然是一个未解的问题。在这里,我们展示了人类在资源压力下表现出双重转变,同时调整物理预测机制和规划策略以匹配认知预算。通过使用悬臂塔这一构建任务,要求参与者在保持稳定的同时最大化水平悬臂,我们发现,在早期阶段,基于IPE的模拟占主导地位,而随着复杂性的增加,基于卷积神经网络(CNN)的视觉启发式方法占优;与此同时,时间压力缩短了深思熟虑的前瞻性,导致规划向更浅的视野转变:这一双重转变是先前单一机制解释所未曾预测的。这些发现揭示了一种层次化的、资源理性的架构,灵活地在计算成本与预测准确性之间进行权衡。我们的结果统一了两个长期存在的辩论(模拟与启发式、短视与深思熟虑规划),作为一种由认知预算重新配置的动态能力。
cs.AI / 17 / 2604.09195

Camera Artist: A Multi-Agent Framework for Cinematic Language Storytelling Video Generation

Camera Artist:一种用于电影语言叙事视频生成的多智能体框架
Hu, Haobo, Mao, Qi, Li, Yuanhang, Jin, Libiao
Abstract
We propose Camera Artist, a multi-agent framework that models a real-world filmmaking workflow to generate narrative videos with explicit cinematic language. While recent multi-agent systems have made substantial progress in automating filmmaking workflows from scripts to videos, they often lack explicit mechanisms to structure narrative progression across adjacent shots and deliberate use of cinematic language, resulting in fragmented storytelling and limited filmic quality. To address this, Camera Artist builds upon established agentic pipelines and introduces a dedicated Cinematography Shot Agent, which integrates recursive storyboard generation to strengthen shot-to-shot narrative continuity and cinematic language injection to produce more expressive, film-oriented shot designs. Extensive quantitative and qualitative results demonstrate that our approach consistently outperforms existing baselines in narrative consistency, dynamic expressiveness, and perceived film quality.
Chinese Translation
我们提出了Camera Artist,一种多智能体框架,模拟真实世界的电影制作流程,以生成具有明确电影语言的叙事视频。尽管近期多智能体系统在从剧本到视频的自动化电影制作流程方面取得了显著进展,但它们通常缺乏明确的机制来构建相邻镜头间的叙事推进以及对电影语言的刻意运用,导致叙事碎片化且电影质感有限。为了解决这一问题,Camera Artist基于已有的智能体流水线,新增了专门的摄影镜头智能体(Cinematography Shot Agent),该智能体整合了递归式故事板生成以强化镜头间的叙事连续性,并注入电影语言以产生更具表现力和电影导向的镜头设计。大量定量和定性结果表明,我们的方法在叙事一致性、动态表现力及感知电影质量方面持续优于现有基线方法。
cs.AI / 18 / 2604.09251

DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

DRBENCHER:你的智能体能识别实体、检索其属性并进行计算吗?
Lee, Young-Suk, Astudillo, Ramon Fernandez, Florian, Radu
Abstract
Deep research agents increasingly interleave web browsing with multi-step computation, yet existing benchmarks evaluate these capabilities in isolation, creating a blind spot in assessing real-world performance. We introduce DRBENCHER, a synthetic benchmark generator for questions that require both browsing and computation. It enforces four criteria: verifiability (gold answers are computed by executing parameterized code over knowledge-graph values), complexity (multi-hop entity identification, property retrieval, and domain-specific computation), difficulty (a two-stage verification cascade filters out questions solvable by the generating model), and diversity (a greedy max-min embedding filter maximizes coverage). These criteria are realized via a unified answer-first pipeline spanning five domains: biochemistry, financial, geophysical, security, and history. Human evaluation shows 76% validity (84% excluding stale data), with 35% of errors due to outdated knowledge-graph entries, highlighting an inherent limitation of systems that reason over evolving data. Automatic evaluation shows that the strongest frontier model achieves only 20% answer accuracy. Compared to manually constructed benchmarks (BrowseComp+, MATH-500, GPQA), DRBENCHER achieves the highest semantic diversity.
Chinese Translation
深度研究智能体日益将网页浏览与多步计算交织进行,然而现有基准测试通常单独评估这些能力,导致难以全面衡量其现实表现。我们提出了DRBENCHER,一种针对需要同时进行浏览与计算的问题的合成基准生成器。该基准遵循四项标准:可验证性(通过执行参数化代码在知识图谱数值上计算黄金答案)、复杂性(多跳实体识别、属性检索及领域特定计算)、难度(两阶段验证级联过滤生成模型可解问题)、多样性(贪婪最大最小嵌入过滤以最大化覆盖率)。这些标准通过涵盖生物化学、金融、地球物理、安全和历史五个领域的统一先答后查流程实现。人工评估显示有效率为76%(剔除过时数据后为84%),其中35%的错误源于知识图谱条目过时,凸显了基于动态数据推理系统的固有限制。自动评估表明,最强前沿模型的答案准确率仅为20%。相比手工构建的基准(BrowseComp+、MATH-500、GPQA),DRBENCHER在语义多样性方面表现最佳。
cs.AI / 19 / 2604.09285

SAGE: A Service Agent Graph-guided Evaluation Benchmark

SAGE:基于服务代理图引导的评估基准
Shi, Ling, Dai, Yuqin, Wang, Ziyin, Gao, Ning, Zhang, Wei, Wang, Chaozheng, Wang, Yujie, He, Wei, Wang, Jinpeng, Xiong, Deiyi
Abstract
The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single-dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real-world deployments. To bridge this gap, we propose SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent benchmark for automated, dual-axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path coverage. We introduce an Adversarial Intent Taxonomy and a modular Extension Mechanism, enabling low-cost deployment across domains and facilitating automated dialogue data synthesis. Evaluation is conducted via a framework where Judge Agents and a Rule Engine analyze interactions between User and Service Agents to generate deterministic ground truth. Extensive experiments on 27 LLMs across 6 industrial scenarios reveal a significant ``Execution Gap'' where models accurately classify intents but fail to derive correct subsequent actions. We also observe ``Empathy Resilience'', a phenomenon where models maintain polite conversational facades despite underlying logical failures under high adversarial intensity. Code and resources are available at https://anonymous.4open.science/r/SAGE-Bench-4CD3/.
Chinese Translation
大型语言模型(LLMs)的发展推动了客户服务的自动化,但其性能的基准测试仍然具有挑战性。现有的基准测试主要依赖静态范式和单一维度的指标,未能考虑多样化的用户行为或实际部署中对结构化标准操作流程(SOPs)严格遵守的需求。为弥补这一不足,我们提出了SAGE(Service Agent Graph-guided Evaluation),一个通用的多代理自动化双轴评估基准。SAGE将非结构化的SOPs形式化为动态对话图(Dynamic Dialogue Graphs),实现对逻辑合规性的精确验证和路径覆盖的全面评估。我们引入了对抗性意图分类法(Adversarial Intent Taxonomy)和模块化扩展机制(Extension Mechanism),支持低成本的跨领域部署及自动化对话数据合成。评估通过一个框架进行,评判代理(Judge Agents)和规则引擎(Rule Engine)分析用户代理与服务代理之间的交互,生成确定性的真实标签。针对27个大型语言模型在6个工业场景中的广泛实验揭示了显著的“执行差距”(Execution Gap),即模型虽然能准确分类意图,却未能推导出正确的后续动作。我们还观察到“同理心韧性”(Empathy Resilience)现象,即模型在高对抗强度下尽管存在逻辑失败,仍能保持礼貌的对话表象。代码和资源可在https://anonymous.4open.science/r/SAGE-Bench-4CD3/获取。
cs.AI / 20 / 2604.09308

Constraint-Aware Corrective Memory for Language-Based Drug Discovery Agents

面向约束的纠正记忆在基于语言的药物发现代理中的应用
Sun, Maochen, Zhang, Youzhi, Meng, Gaofeng
Abstract
Large language models are making autonomous drug discovery agents increasingly feasible, but reliable success in this setting is not determined by any single action or molecule. It is determined by whether the final returned set jointly satisfies protocol-level requirements such as set size, diversity, binding quality, and developability. This creates a fundamental control problem: the agent plans step by step, while task validity is decided at the level of the whole candidate set. Existing language-based drug discovery systems therefore tend to rely on long raw history and under-specified self-reflection, making failure localization imprecise and planner-facing agent states increasingly noisy. We present CACM (Constraint-Aware Corrective Memory), a language-based drug discovery framework built around precise set-level diagnosis and a concise memory write-back mechanism. CACM introduces protocol auditing and a grounded diagnostician, which jointly analyze multimodal evidence spanning task requirements, pocket context, and candidate-set evidence to localize protocol violations, generate actionable remediation hints, and bias the next action toward the most relevant correction. To keep planning context compact, CACM organizes memory into static, dynamic, and corrective channels and compresses them before write-back, thereby preserving persistent task information while exposing only the most decision-relevant failures. Our experimental results show that CACM improves the target-level success rate by 36.4% over the state-of-the-art baseline. The results show that reliable language-based drug discovery benefits not only from more powerful molecular tools, but also from more precise diagnosis and more economical agent states.
Chinese Translation
大型语言模型使得自主药物发现代理的实现变得越来越可行,但在这种情况下的可靠成功并不是由单一的行动或分子决定的。成功的关键在于最终返回的集合是否共同满足协议级别的要求,如集合大小、多样性、结合质量和可开发性。这就产生了一个基本的控制问题:代理逐步规划,而任务的有效性是在整个候选集合的层面上决定的。因此,现有的基于语言的药物发现系统往往依赖于长时间的原始历史和不够明确的自我反思,使得失败定位不够精确,规划者面临的代理状态变得越来越嘈杂。我们提出了CACM(面向约束的纠正记忆),这是一个围绕精确集合级诊断和简洁记忆回写机制构建的基于语言的药物发现框架。CACM引入了协议审计和一个有根据的诊断工具,联合分析跨越任务要求、口袋上下文和候选集合证据的多模态证据,以定位协议违规,生成可操作的补救提示,并将下一步行动偏向于最相关的纠正措施。为了保持规划上下文的紧凑性,CACM将记忆组织为静态、动态和纠正通道,并在回写之前对其进行压缩,从而保留持久的任务信息,同时仅暴露出最相关的决策失败。我们的实验结果表明,CACM在目标级成功率上比最先进的基线提高了36.4%。结果表明,可靠的基于语言的药物发现不仅得益于更强大的分子工具,还得益于更精确的诊断和更经济的代理状态。
cs.AI / 21 / 2604.09338

Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym

关注空间推理与行动之间的差距!基于Spatial-Gym的逐步评估代理
Kaesberg, Lars Benedikt, Yang, Tianyu, Bauer, Niklas, Ruas, Terry, Wahle, Jan Philip, Gipp, Bela
Abstract
Spatial reasoning is central to navigation and robotics, yet measuring model capabilities on these tasks remains difficult. Existing benchmarks evaluate models in a one-shot setting, requiring full solution generation in a single response, unlike humans, who work in interactive environments step-by-step. We introduce Spatial-Gym, a Gymnasium environment that isolates spatial constraint reasoning by testing pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking. We evaluate eight models in three settings (one-shot, step-by-step, step-by-step with backtracking) against human, random, and A* baselines on 500 episodes. The best model, GPT-OSS 120B, achieves a solve rate of 16.0%, 82 points below the human baseline (98.0%). Step-by-step format helps weaker models (up to +5.4%) by removing formatting errors, but hurts stronger models (up to 5.6%) by constraining global planning. Backtracking improves episode completion, but increases solve rate only for weaker models; stronger models rarely backtrack and do not benefit from it. Our experiments have three key findings: (1) models fail to scale reasoning effort with difficulty, (2) vision models receiving images of the spatial environment reduce solve rate by 73%, and (3) extended chain-of-thought reasoning retains a 3-5x accuracy advantage over standard inference even in the step-by-step setting. Spatial-Gym enables diagnosis of model limitations and provides a framework for improving spatial reasoning through reinforcement learning.
Chinese Translation
空间推理是导航和机器人技术的核心,但在这些任务上衡量模型能力仍然困难。现有基准测试在一次性设置中评估模型,要求在单次响应中生成完整解决方案,而人类则在交互环境中逐步工作。我们引入了Spatial-Gym,这是一个Gymnasium环境,通过在2D网格难题中测试路径寻找,将空间约束推理隔离为一个具有可选回溯的顺序决策任务。我们在三种设置(一次性、逐步、逐步带回溯)下评估了八个模型,并与人类、随机和A*基准在500个回合中进行比较。最佳模型GPT-OSS 120B的解决率为16.0%,比人类基准(98.0%)低82个百分点。逐步格式通过消除格式错误帮助了较弱的模型(提高最多5.4%),但对较强的模型造成了伤害(降低最多5.6%),因为它限制了全局规划。回溯提高了回合完成率,但仅对较弱模型提高了解决率;较强模型很少回溯,且没有从中受益。我们的实验有三个关键发现:(1)模型未能根据难度调整推理努力,(2)接收空间环境图像的视觉模型解决率降低了73%,以及(3)扩展的思维链推理在逐步设置中仍保持3-5倍的准确性优势,优于标准推理。Spatial-Gym使得模型局限性的诊断成为可能,并提供了通过强化学习改善空间推理的框架。
cs.AI / 22 / 2604.09408

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

HiL-Bench(人机交互基准):智能体是否知道何时寻求帮助?
Elfeki, Mohamed, Trinh, Tu, Luu, Kelvin, Luo, Guangze, Hunt, Nathan, Montoya, Ernesto, Marwaha, Nandan, He, Yannis, Wang, Charles, Crabedo, Fernando, Castilo, Alessa, Liu, Bing
Abstract
Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability, but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode. They supply unambiguous detailed instructions and solely reward execution correctness, so an agent that makes a lucky guess for a missing requirement will score identically to one that would have asked to be certain. We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the tension between over-asking and silent guessing; its structure architecturally prevents gaming through question spam. Evaluation across SWE and text-to-SQL domains reveals a large universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when deciding whether to ask. Failure analysis identifies three key help-seeking patterns: overconfident wrong beliefs with no gap detection; high uncertainty detection yet persistent errors; broad, imprecise escalation without self-correction. These consistent patterns confirm poor help-seeking is a model-level flaw, not task-specific. RL training on shaped Ask-F1 reward shows judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains that transfer across domains. The model does not learn domain-specific heuristics for when to ask; it learns to detect unresolvable uncertainty and act on it.
Chinese Translation
前沿编码智能体在获得完整上下文时能够解决复杂任务,但在规格不完整或含糊时则会失败。瓶颈不在于原始能力,而在于判断力:知道何时自主行动,何时寻求帮助。现有基准测试忽视了这一失败模式。它们提供明确详细的指令,仅奖励执行正确性,因此对于缺失需求做出幸运猜测的智能体,与会主动询问以确保正确的智能体得分相同。我们提出HiL-Bench(人机交互基准)以衡量这种选择性升级技能。每个任务包含经过人工验证的阻碍因素(缺失信息、模糊请求、矛盾信息),这些阻碍仅通过逐步探索显现,而非事先检查。我们的核心指标Ask-F1是问题精确率与阻碍召回率的调和平均数,捕捉了过度提问与沉默猜测之间的矛盾;其结构从架构上防止通过大量提问刷分。在软件工程(SWE)和文本到SQL(text-to-SQL)领域的评估揭示了普遍存在的巨大判断差距:没有任何前沿模型在决定是否提问时能够恢复其完全信息条件下性能的全部。失败分析识别出三种关键的寻求帮助模式:过度自信的错误信念且未检测到差距;高不确定性检测但持续出错;广泛且不精确的升级且无自我纠正。这些一致的模式确认了寻求帮助不佳是模型层面的缺陷,而非任务特定。基于形状化Ask-F1奖励的强化学习训练表明判断力是可训练的:一个32B模型在提升寻求帮助质量和任务通过率方面均有显著进步,且收益可跨领域迁移。该模型未学习领域特定的提问启发式,而是学会了检测无法解决的不确定性并据此采取行动。
cs.AI / 23 / 2604.09417

Do We Really Need to Approach the Entire Pareto Front in Many-Objective Bayesian Optimisation?

在多目标贝叶斯优化中,我们真的需要逼近整个帕累托前沿吗?
Jiang, Chao, Huang, Jingyu, Li, Miqing
Abstract
Many-objective optimisation, a subset of multi-objective optimisation, involves optimisation problems with more than three objectives. As the number of objectives increases, the number of solutions needed to adequately represent the entire Pareto front typically grows substantially. This makes it challenging, if not infeasible, to design a search algorithm capable of effectively exploring the entire Pareto front. This difficulty is particularly acute in the Bayesian optimisation paradigm, where sample efficiency is critical and only a limited number of solutions (often a few hundred) are evaluated. Moreover, after the optimisation process, the decision-maker eventually selects just one solution for deployment, regardless of how many high-quality, diverse solutions are available. In light of this, we argue an idea that under a very limited evaluation budget, it may be more useful to focus on finding a single solution of the highest possible quality for the decision-maker, rather than aiming to approximate the entire Pareto front as existing many-/multi-objective Bayesian optimisation methods typically do. Bearing this idea in mind, this paper proposes a \underline{s}ingle \underline{p}oint-based \underline{m}ulti-\underline{o}bjective search framework (SPMO) that aims to improve the quality of solutions along a direction that leads to a good tradeoff between objectives. Within SPMO, we present a simple acquisition function, called expected single-point improvement (ESPI), working under both noiseless and noisy scenarios. We show that ESPI can be optimised effectively with gradient-based methods via the sample average approximation (SAA) approach and theoretically prove its convergence guarantees under the SAA. We also empirically demonstrate that the proposed SPMO is computationally tractable and outperforms state-of-the-arts on a wide range of benchmark and real-world problems.
Chinese Translation
多目标优化是多目标优化的一个子集,涉及目标数量超过三个的优化问题。随着目标数量的增加,为了充分表示整个帕累托前沿,所需的解的数量通常会大幅增长。这使得设计能够有效探索整个帕累托前沿的搜索算法变得具有挑战性,甚至不可行。这一难题在贝叶斯优化范式中尤为突出,因为样本效率至关重要,且通常仅评估有限数量的解(通常为几百个)。此外,在优化过程结束后,决策者最终只会选择一个解进行部署,无论存在多少高质量且多样化的解。鉴于此,我们提出一个观点:在极其有限的评估预算下,关注为决策者找到单个质量最高的解,可能比传统多目标/多目标贝叶斯优化方法通常追求逼近整个帕累托前沿更为有用。基于此思想,本文提出了一种基于单点的多目标搜索框架(SPMO),旨在沿着实现目标间良好权衡的方向提升解的质量。在SPMO框架内,我们提出了一种简单的采集函数,称为期望单点改进(ESPI),适用于无噪声和有噪声场景。我们展示了ESPI可以通过样本平均近似(SAA)方法利用基于梯度的优化手段有效优化,并理论证明了其在SAA下的收敛性保证。我们还通过实证验证了所提SPMO在计算上是可行的,并在广泛的基准测试和实际问题中优于现有最先进方法。
cs.AI / 24 / 2604.09455

E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning

E3-TIR:面向工具集成推理的增强经验利用方法
Guo, Weiyang, Shi, Zesheng, Zhao, Liye, Ma, Jiayuan, Zhu, Zeen, He, Junxian, Zhang, Min, Li, Jing
Abstract
While Large Language Models (LLMs) have demonstrated significant potential in Tool-Integrated Reasoning (TIR), existing training paradigms face significant limitations: Zero-RL suffers from inefficient exploration and mode degradation due to a lack of prior guidance, while SFT-then-RL is limited by high data costs and capability plateaus caused by low-entropy collapse. To address these challenges, we propose E3-TIR (Enhanced Experience Exploitation), a warm-up paradigm for the early stages of agent training. Specifically, we formulate training as the dynamic integration of three experience types: Expert Prefixes, Expert Guided, and Self-Exploration. By executing diverse branching exploration around expert "anchors" and employing a mix policy optimization mechanism, we effectively mitigate distribution shifts and resolve optimization conflicts arising from shared prefixes. Our method dynamically adapts the model's knowledge boundaries, effectively balancing exploration diversity with training efficiency.Experimental results demonstrate that E3-TIR achieves a 6 performance improvement over traditional paradigms on tool-use tasks, while requiring less than 10 of the synthetic data. Furthermore, in terms of ROI, a comprehensive metric integrating performance, data cost, and training efficiency we achieve a 1.46x gain compared to baselines. Code is available at https://github.com/yuki-younai/E3-TIR.
Chinese Translation
尽管大型语言模型(LLMs)在工具集成推理(Tool-Integrated Reasoning, TIR)中展现出显著潜力,现有的训练范式仍存在诸多限制:Zero-RL由于缺乏先验指导,导致探索效率低下和模式退化;而SFT-then-RL则受限于高昂的数据成本及由低熵崩溃引起的能力瓶颈。为应对这些挑战,我们提出了E3-TIR(Enhanced Experience Exploitation),一种用于代理训练早期阶段的预热范式。具体而言,我们将训练过程表述为三种经验类型的动态整合:专家前缀(Expert Prefixes)、专家引导(Expert Guided)和自我探索(Self-Exploration)。通过围绕专家“锚点”执行多样化的分支探索,并采用混合策略优化机制,我们有效缓解了分布偏移问题,解决了共享前缀带来的优化冲突。该方法动态调整模型的知识边界,有效平衡了探索多样性与训练效率。实验结果表明,E3-TIR在工具使用任务上相比传统范式提升了6个百分点的性能,同时合成数据使用量不足10%。此外,在综合性能、数据成本及训练效率的ROI指标上,我们相较基线实现了1.46倍的提升。代码已开源,地址:https://github.com/yuki-younai/E3-TIR。
cs.AI / 25 / 2604.09482

Process Reward Agents for Steering Knowledge-Intensive Reasoning

用于引导知识密集型推理的过程奖励代理
Sohn, Jiwoong, Sternal, Tomasz, Styppa, Kenneth, Hoefler, Torsten, Moor, Michael
Abstract
Reasoning in knowledge-intensive domains remains challenging as intermediate steps are often not locally verifiable: unlike math or code, evaluating step correctness may require synthesizing clues across large external knowledge sources. As a result, subtle errors can propagate through reasoning traces, potentially never to be detected. Prior work has proposed process reward models (PRMs), including retrieval-augmented variants, but these methods operate post hoc, scoring completed trajectories, which prevents their integration into dynamic inference procedures. Here, we introduce Process Reward Agents (PRA), a test-time method for providing domain-grounded, online, step-wise rewards to a frozen policy. In contrast to prior retrieval-augmented PRMs, PRA enables search-based decoding to rank and prune candidate trajectories at every generation step. Experiments on multiple medical reasoning benchmarks demonstrate that PRA consistently outperforms strong baselines, achieving 80.8% accuracy on MedQA with Qwen3-4B, a new state of the art at the 4B scale. Importantly, PRA generalizes to unseen frozen policy models ranging from 0.5B to 8B parameters, improving their accuracy by up to 25.7% without any policy model updates. More broadly, PRA suggests a paradigm in which frozen reasoners are decoupled from domain-specific reward modules, allowing the deployment of new backbones in complex domains without retraining.
Chinese Translation
在知识密集型领域中的推理仍然具有挑战性,因为中间步骤往往无法局部验证:与数学或代码不同,评估步骤的正确性可能需要综合大量外部知识源中的线索。因此,细微的错误可能会在推理轨迹中传播,且可能永远无法被发现。先前的工作提出了过程奖励模型(Process Reward Models,PRMs),包括检索增强变体,但这些方法是事后操作,对已完成的轨迹进行评分,无法将其整合到动态推理过程中。本文提出了过程奖励代理(Process Reward Agents,PRA),这是一种在测试阶段为冻结策略提供基于领域的在线逐步奖励的方法。与先前的检索增强PRMs不同,PRA使基于搜索的解码能够在每个生成步骤对候选轨迹进行排序和剪枝。多项医学推理基准实验表明,PRA持续优于强基线,在MedQA数据集上与Qwen3-4B模型结合时达到80.8%的准确率,创下4B参数规模的新纪录。更重要的是,PRA能够推广到参数规模从0.5B到8B不等的未见冻结策略模型,在不更新策略模型的情况下提升其准确率最高达25.7%。更广泛地,PRA提出了一种范式,即将冻结的推理器与特定领域的奖励模块解耦,允许在复杂领域中部署新的基础模型而无需重新训练。
cs.AI / 26 / 2604.09502

Strategic Algorithmic Monoculture:Experimental Evidence from Coordination Games

战略算法单一文化:来自协调博弈的实验证据
Ballestero, Gonzalo, Hosseini, Hadi, Khanna, Samarth, Shorrer, Ran I.
Abstract
AI agents increasingly operate in multi-agent environments where outcomes depend on coordination. We distinguish primary algorithmic monoculture -- baseline action similarity -- from strategic algorithmic monoculture, whereby agents adjust similarity in response to incentives. We implement a simple experimental design that cleanly separates these forces, and deploy it on human and large language model (LLM) subjects. LLMs exhibit high levels of baseline similarity (primary monoculture) and, like humans, they regulate it in response to coordination incentives (strategic monoculture). While LLMs coordinate extremely well on similar actions, they lag behind humans in sustaining heterogeneity when divergence is rewarded.
Chinese Translation
人工智能代理越来越多地在多智能体环境中运作,其结果依赖于协调。我们将主要算法单一文化(基线行动相似性)与战略算法单一文化区分开来,后者是指代理根据激励调整相似性。我们实施了一种简单的实验设计,清晰地分离这些力量,并在人工和大型语言模型(LLM)受试者上进行部署。LLM表现出高水平的基线相似性(主要单一文化),并且与人类一样,它们会根据协调激励来调节这一相似性(战略单一文化)。虽然LLM在相似行动上的协调能力极强,但在奖励分歧时,它们在维持异质性方面落后于人类。
计算语言学 (Computation and Language)
65
cs.CL / 1 / 2604.08554

Drift and selection in LLM text ecosystems

大型语言模型文本生态系统中的漂移与选择
Riis, Søren
Abstract
The public text record -- the material from which both people and AI systems now learn -- is increasingly shaped by its own outputs. Generated text enters the public record, later agents learn from it, and the cycle repeats. Here we develop an exactly solvable mathematical framework for this recursive process, based on variable-order $n$-gram agents, and separate two forces acting on the public corpus. The first is drift: unfiltered reuse progressively removes rare forms, and in the infinite-corpus limit we characterise the stable distributions exactly. The second is selection: publication, ranking and verification filter what enters the record, and the outcome depends on what is selected. When publication merely reflects the statistical status quo, the corpus converges to a shallow state in which further lookahead brings no benefit. When publication is normative -- rewarding quality, correctness or novelty -- deeper structure persists, and we establish an optimal upper bound on the resulting divergence from shallow equilibria. The framework therefore identifies when recursive publication compresses public text and when selective filtering sustains richer structure, with implications for the design of AI training corpora.
Chinese Translation
公共文本记录——人类和人工智能系统学习的材料——越来越受到其自身输出的影响。生成的文本进入公共记录,随后代理从中学习,这一循环不断重复。在此,我们开发了一个可精确求解的数学框架,用于描述这一递归过程,基于可变阶数的 $n$-gram 代理,并区分作用于公共语料库的两种力量。第一种是漂移:未经过滤的重用逐渐去除稀有形式,在无限语料库的极限中,我们精确地表征了稳定分布。第二种是选择:出版、排名和验证过滤了进入记录的内容,结果依赖于所选择的内容。当出版仅反映统计现状时,语料库收敛到一个浅层状态,在该状态下,进一步的前瞻性没有益处。当出版具有规范性——奖励质量、正确性或新颖性时,较深的结构得以持续,我们确立了由此产生的与浅层平衡的偏差的最佳上限。因此,该框架识别出何时递归出版压缩公共文本,以及何时选择性过滤维持更丰富的结构,这对人工智能训练语料库的设计具有重要意义。
cs.CL / 2 / 2604.08555

SynDocDis: A Metadata-Driven Framework for Generating Synthetic Physician Discussions Using Large Language Models

SynDocDis:一种基于元数据驱动的框架,用于利用大型语言模型生成合成医生讨论
Rubinstein, Beny, Matos, Sergio
Abstract
Physician-physician discussions of patient cases represent a rich source of clinical knowledge and reasoning that could feed AI agents to enrich and even participate in subsequent interactions. However, privacy regulations and ethical considerations severely restrict access to such data. While synthetic data generation using Large Language Models offers a promising alternative, existing approaches primarily focus on patient-physician interactions or structured medical records, leaving a significant gap in physician-to-physician communication synthesis. We present SynDocDis, a novel framework that combines structured prompting techniques with privacy-preserving de-identified case metadata to generate clinically accurate physician-to-physician dialogues. Evaluation by five practicing physicians in nine oncology and hepatology scenarios demonstrated exceptional communication effectiveness (mean 4.4/5) and strong medical content quality (mean 4.1/5), with substantial interrater reliability (kappa = 0.70, 95% CI: 0.67-0.73). The framework achieved 91% clinical relevance ratings while maintaining doctors' and patients' privacy. These results place SynDocDis as a promising framework for advancing medical AI research ethically and responsibly through privacy-compliant synthetic physician dialogue generation with direct applications in medical education and clinical decision support.
Chinese Translation
医生之间关于患者病例的讨论代表了丰富的临床知识和推理来源,这些知识可以为人工智能代理提供支持,丰富甚至参与后续的互动。然而,隐私法规和伦理考虑严重限制了对这些数据的访问。虽然使用大型语言模型生成合成数据提供了一个有前景的替代方案,但现有方法主要集中于患者与医生之间的互动或结构化医疗记录,导致医生之间的沟通合成存在显著空白。我们提出了SynDocDis,这是一种新颖的框架,结合了结构化提示技术和隐私保护的去标识化病例元数据,以生成临床上准确的医生间对话。通过九个肿瘤学和肝病学场景中的五位执业医生的评估,显示出卓越的沟通效果(平均4.4/5)和强大的医学内容质量(平均4.1/5),并具有显著的评分者间一致性(kappa = 0.70,95% CI:0.67-0.73)。该框架在保持医生和患者隐私的同时,实现了91%的临床相关性评分。这些结果使SynDocDis成为一个有前景的框架,通过符合隐私要求的合成医生对话生成,伦理和负责任地推动医学人工智能研究,直接应用于医学教育和临床决策支持。
cs.CL / 3 / 2604.08556

EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context

EMA 并非万能:映射循环上下文中结构与内容的边界
Singh, Arth
Abstract
What exactly do efficient sequence models gain over simple temporal averaging? We use exponential moving average (EMA) traces, the simplest recurrent context (no gating, no content-based retrieval), as a controlled probe to map the boundary between what fixed-coefficient accumulation can and cannot represent. EMA traces encode temporal structure: a Hebbian architecture with multi-timescale traces achieves 96% of a supervised BiGRU on grammatical role assignment with zero labels, surpassing the supervised model on structure-dependent roles. EMA traces destroy token identity: a 130M-parameter language model using only EMA context reaches C4 perplexity 260 (8x GPT-2), and a predictor ablation (replacing the linear predictor with full softmax attention) yields identical loss, localizing the entire gap to the traces. The traces apply lossy, data-independent compression; by the data processing inequality, no downstream predictor can recover the discarded information. Fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution that only learned, input-dependent selection can resolve.
Chinese Translation
高效序列模型相较于简单的时间平均究竟获得了什么?我们使用指数移动平均(EMA)轨迹——最简单的循环上下文(无门控,无基于内容的检索)——作为受控探针,来映射固定系数累积能够表示与不能表示的边界。EMA轨迹编码时间结构:一种具有多时间尺度轨迹的Hebbian架构在无标签的语法角色分配任务上达到监督式BiGRU的96%,并且在结构依赖角色上超越了该监督模型。EMA轨迹破坏了标记身份:一个仅使用EMA上下文的1.3亿参数语言模型在C4数据集上的困惑度达到260(是GPT-2的8倍),而一个预测器消融实验(用全softmax注意力替代线性预测器)产生了相同的损失,将全部差距定位于轨迹本身。轨迹执行的是有损、数据无关的压缩;根据数据处理不等式,任何下游预测器都无法恢复被丢弃的信息。无论是在时间维度还是深度维度上的固定系数累积,都存在不可逆的信息稀释,只有通过学习的、依赖输入的选择机制才能解决这一问题。
cs.CL / 4 / 2604.08557

Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

重新掩蔽与重定向:利用扩散语言模型中的去噪不可逆性
Singh, Arth
Abstract
Diffusion-based language models (dLLMs) generate text by iteratively denoising masked token sequences. We show that their safety alignment rests on a single fragile assumption: that the denoising schedule is monotonic and committed tokens are never re-evaluated. Safety-aligned dLLMs commit refusal tokens within the first 8-16 of 64 denoising steps, and the schedule treats these commitments as permanent. A trivial two-step intervention - re-masking these tokens and injecting a 12-token affirmative prefix - achieves 76.1% ASR on HarmBench (n=159, Lg=128) against LLaDA-8B-Instruct and 81.8% ASR (n=159) against Dream-7B-Instruct, without any gradient computation or adversarial search. The simplicity of this exploit is itself the central finding: augmenting with gradient-optimized perturbation via a differentiable Gumbel-softmax chain consistently degrades ASR (e.g., 41.5% vs. 76.1% at Lg=128), confirming that the vulnerability is structural rather than requiring sophisticated exploitation. These findings reveal that dLLM safety is not adversarially robust but architecturally shallow - it holds only because the denoising schedule is never violated. We discuss defenses including safety-aware unmasking schedules, step-conditional prefix detection, and post-commitment re-verification.
Chinese Translation
基于扩散的语言模型(dLLMs)通过迭代去噪掩蔽的标记序列生成文本。我们表明,它们的安全对齐依赖于一个脆弱的假设:去噪调度是单调的,已承诺的标记不会被重新评估。安全对齐的dLLMs在64个去噪步骤的前8-16步内承诺拒绝标记,并且该调度将这些承诺视为永久性的。一个简单的两步干预——重新掩蔽这些标记并注入一个12标记的肯定前缀——在对抗LLaDA-8B-Instruct的HarmBench上实现了76.1%的ASR(n=159, Lg=128),在对抗Dream-7B-Instruct时实现了81.8%的ASR(n=159),且无需任何梯度计算或对抗搜索。这一漏洞的简单性本身就是核心发现:通过可微分的Gumbel-softmax链增强的梯度优化扰动会持续降低ASR(例如,在Lg=128时为41.5%对比76.1%),确认了这一脆弱性是结构性的,而不是需要复杂的利用。这些发现揭示了dLLM的安全性并非对抗性强健,而是架构上浅薄——它仅仅因为去噪调度从未被违反而存在。我们讨论了包括安全感知的去掩蔽调度、步骤条件前缀检测和后承诺重新验证等防御措施。
cs.CL / 5 / 2604.08558

WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

WAND:用于高效自回归文本到语音模型的窗口注意力与知识蒸馏
Lee, Hanna, Nguyen, Tan Dat, Kang, Jaehoon, Shim, Kyuhong
Abstract
Recent decoder-only autoregressive text-to-speech (AR-TTS) models produce high-fidelity speech, but their memory and compute costs scale quadratically with sequence length due to full self-attention. In this paper, we propose WAND, Windowed Attention and Knowledge Distillation, a framework that adapts pretrained AR-TTS models to operate with constant computational and memory complexity. WAND separates the attention mechanism into two: persistent global attention over conditioning tokens and local sliding-window attention over generated tokens. To stabilize fine-tuning, we employ a curriculum learning strategy that progressively tightens the attention window. We further utilize knowledge distillation from a full-attention teacher to recover high-fidelity synthesis quality with high data efficiency. Evaluated on three modern AR-TTS models, WAND preserves the original quality while achieving up to 66.2% KV cache memory reduction and length-invariant, near-constant per-step latency.
Chinese Translation
最近的仅解码器自回归文本到语音(AR-TTS)模型能够生成高保真语音,但由于全自注意力机制,其内存和计算成本随着序列长度呈二次增长。在本文中,我们提出了WAND,即窗口注意力与知识蒸馏,这一框架旨在使预训练的AR-TTS模型在计算和内存复杂度上保持恒定。WAND将注意力机制分为两部分:对条件标记的持久全局注意力和对生成标记的局部滑动窗口注意力。为了稳定微调过程,我们采用了一种课程学习策略,逐步收紧注意力窗口。此外,我们还利用来自全注意力教师模型的知识蒸馏,以高数据效率恢复高保真的合成质量。在对三种现代AR-TTS模型进行评估时,WAND在保持原始质量的同时,实现了高达66.2%的KV缓存内存减少,并且每步延迟几乎保持恒定。
cs.CL / 6 / 2604.08559

Medical Reasoning with Large Language Models: A Survey and MR-Bench

大型语言模型的医学推理:综述与MR-Bench
Ren, Xiaohan, Fan, Chenxiao, Ma, Wenyin, He, Hongliang, Gao, Chongming, Zhao, Xiaoyan, Feng, Fuli
Abstract
Large language models (LLMs) have achieved strong performance on medical exam-style tasks, motivating growing interest in their deployment in real-world clinical settings. However, clinical decision-making is inherently safety-critical, context-dependent, and conducted under evolving evidence. In such situations, reliable LLM performance depends not on factual recall alone, but on robust medical reasoning. In this work, we present a comprehensive review of medical reasoning with LLMs. Grounded in cognitive theories of clinical reasoning, we conceptualize medical reasoning as an iterative process of abduction, deduction, and induction, and organize existing methods into seven major technical routes spanning training-based and training-free approaches. We further conduct a unified cross-benchmark evaluation of representative medical reasoning models under a consistent experimental setting, enabling a more systematic and comparable assessment of the empirical impact of existing methods. To better assess clinically grounded reasoning, we introduce MR-Bench, a benchmark derived from real-world hospital data. Evaluations on MR-Bench expose a pronounced gap between exam-level performance and accuracy on authentic clinical decision tasks. Overall, this survey provides a unified view of existing medical reasoning methods, benchmarks, and evaluation practices, and highlights key gaps between current model performance and the requirements of real-world clinical reasoning.
Chinese Translation
大型语言模型(LLMs)在医学考试风格任务中表现出色,激发了对其在真实临床环境中应用的日益关注。然而,临床决策本质上是安全关键的、依赖于上下文的,并在不断变化的证据下进行。在这种情况下,可靠的LLM性能不仅依赖于事实回忆,还依赖于稳健的医学推理。在本研究中,我们对LLMs的医学推理进行了全面回顾。基于临床推理的认知理论,我们将医学推理概念化为一个包含归纳、演绎和溯因的迭代过程,并将现有方法组织为七个主要技术路线,涵盖基于训练和无训练的方法。我们进一步在一致的实验设置下,对代表性的医学推理模型进行了统一的跨基准评估,从而实现对现有方法的经验影响进行更系统和可比较的评估。为了更好地评估临床基础推理,我们引入了MR-Bench,这是一个基于真实医院数据的基准。对MR-Bench的评估揭示了考试级别性能与真实临床决策任务准确性之间显著的差距。总体而言,本综述提供了对现有医学推理方法、基准和评估实践的统一视角,并突出了当前模型性能与真实世界临床推理要求之间的关键差距。
cs.CL / 7 / 2604.08560

Uncertainty Estimation for the Open-Set Text Classification systems

开放集文本分类系统的不确定性估计
Erlygin, Leonid, Zaytsev, Alexey
Abstract
Accurate uncertainty estimation is essential for building robust and trustworthy recognition systems. In this paper, we consider the open-set text classification (OSTC) task - and uncertainty estimation for it. For OSTC a text sample should be classified as one of the existing classes or rejected as unknown. To account for the different uncertainty types encountered in OSTC, we adapt the Holistic Uncertainty Estimation (HolUE) method for the text domain. Our approach addresses two major causes of prediction errors in text recognition systems: text uncertainty that stems from ill formulated queries and gallery uncertainty that is related the ambiguity of data distribution. By capturing these sources, it becomes possible to predict when the system will make a recognition error. We propose a new OSTC benchmark and conduct extensive experiments on a wide range of data, utilizing the authorship attribution, intent and topic classification datasets. HolUE achieves 40-365% improvement in Prediction Rejection Ratio (PRR) over the quality-based SCF baseline across datasets: 365% on Yahoo Answers (0.79 vs 0.17 at FPIR 0.1), 347% on DBPedia (0.85 vs 0.19), 240% on PAN authorship attribution (0.51 vs 0.15 at FPIR 0.5), and 40% on CLINC150 intent classification (0.73 vs~0.52). We make public our code and protocols https://github.com/Leonid-Erlygin/text_uncertainty.git
Chinese Translation
准确的不确定性估计对于构建稳健且可信赖的识别系统至关重要。本文考虑了开放集文本分类(Open-Set Text Classification, OSTC)任务及其不确定性估计。对于OSTC,文本样本应被分类为现有类别之一或被拒绝为未知。为了考虑在OSTC中遇到的不同不确定性类型,我们将整体不确定性估计(Holistic Uncertainty Estimation, HolUE)方法适配到文本领域。我们的方法解决了文本识别系统中预测错误的两个主要原因:源于不良表述查询的文本不确定性和与数据分布模糊性相关的图库不确定性。通过捕捉这些来源,我们能够预测系统何时会发生识别错误。我们提出了一个新的OSTC基准,并在广泛的数据集上进行了大量实验,利用了作者归属、意图和主题分类的数据集。HolUE在各数据集上相较于基于质量的SCF基线在预测拒绝率(Prediction Rejection Ratio, PRR)上实现了40%-365%的提升:在Yahoo Answers上提升365%(0.79 vs 0.17,假阳性率为0.1),在DBPedia上提升347%(0.85 vs 0.19),在PAN作者归属上提升240%(0.51 vs 0.15,假阳性率为0.5),在CLINC150意图分类上提升40%(0.73 vs 0.52)。我们公开了我们的代码和协议,链接为:https://github.com/Leonid-Erlygin/text_uncertainty.git
cs.CL / 8 / 2604.08561

A Representation-Level Assessment of Bias Mitigation in Foundation Models

基础模型中偏见缓解的表征级评估
Nizhnichenkov, Svetoslav, Nair, Rahul, Daly, Elizabeth, Mac Namee, Brian
Abstract
We investigate how successful bias mitigation reshapes the embedding space of encoder-only and decoder-only foundation models, offering an internal audit of model behaviour through representational analysis. Using BERT and Llama2 as representative architectures, we assess the shifts in associations between gender and occupation terms by comparing baseline and bias-mitigated variants of the models. Our findings show that bias mitigation reduces gender-occupation disparities in the embedding space, leading to more neutral and balanced internal representations. These representational shifts are consistent across both model types, suggesting that fairness improvements can manifest as interpretable and geometric transformations. These results position embedding analysis as a valuable tool for understanding and validating the effectiveness of debiasing methods in foundation models. To further promote the assessment of decoder-only models, we introduce WinoDec, a dataset consisting of 4,000 sequences with gender and occupation terms, and release it to the general public. (https://github.com/winodec/wino-dec)
Chinese Translation
我们研究了成功的偏见缓解如何重塑编码器仅和解码器仅基础模型的嵌入空间,通过表征分析提供模型行为的内部审计。以 BERT 和 Llama2 作为代表性架构,我们通过比较模型的基线和偏见缓解变体,评估性别与职业术语之间关联的变化。我们的研究结果表明,偏见缓解减少了嵌入空间中的性别-职业差异,导致更中立和平衡的内部表征。这些表征变化在两种模型类型中是一致的,表明公平性改善可以表现为可解释的几何变换。这些结果将嵌入分析定位为理解和验证基础模型去偏见方法有效性的有价值工具。为了进一步促进对解码器仅模型的评估,我们引入了 WinoDec,一个包含 4,000 个性别和职业术语序列的数据集,并向公众发布。 (https://github.com/winodec/wino-dec)
cs.CL / 9 / 2604.08562

Neural networks for Text-to-Speech evaluation

用于文本转语音评估的神经网络
Trofimenko, Ilya, Kocharyan, David, Zaitsev, Aleksandr, Repnikov, Pavel, Levin, Mark, Shevtsov, Nikita
Abstract
Ensuring that Text-to-Speech (TTS) systems deliver human-perceived quality at scale is a central challenge for modern speech technologies. Human subjective evaluation protocols such as Mean Opinion Score (MOS) and Side-by-Side (SBS) comparisons remain the de facto gold standards, yet they are expensive, slow, and sensitive to pervasive assessor biases. This study addresses these barriers by formulating, and implementing a suite of novel neural models designed to approximate expert judgments in both relative (SBS) and absolute (MOS) settings. For relative assessment, we propose NeuralSBS, a HuBERT-backed model achieving 73.7% accuracy (on SOMOS dataset). For absolute assessment, we introduce enhancements to MOSNet using custom sequence-length batching, as well as WhisperBert, a multimodal stacking ensemble that combines Whisper audio features and BERT textual embeddings via weak learners. Our best MOS models achieve a Root Mean Square Error (RMSE) of ~0.40, significantly outperforming the human inter-rater RMSE baseline of 0.62. Furthermore, our ablation studies reveal that naively fusing text via cross-attention can degrade performance, highlighting the effectiveness of ensemble-based stacking over direct latent fusion. We additionally report negative results with SpeechLM-based architectures and zero-shot LLM evaluators (Qwen2-Audio, Gemini 2.5 flash preview), reinforcing the necessity of dedicated metric learning frameworks.
Chinese Translation
确保文本转语音(Text-to-Speech, TTS)系统在大规模应用中实现人类感知质量是一项现代语音技术的核心挑战。人工主观评价协议如平均意见分(Mean Opinion Score, MOS)和并列比较(Side-by-Side, SBS)仍然是事实上的金标准,但这些方法成本高昂、速度缓慢且易受评估者偏见的影响。本研究通过构建并实现一套新颖的神经网络模型,旨在模拟专家在相对(SBS)和绝对(MOS)评估中的判断,以解决上述障碍。针对相对评估,我们提出了基于HuBERT的NeuralSBS模型,在SOMOS数据集上实现了73.7%的准确率。针对绝对评估,我们对MOSNet进行了自定义序列长度批处理的改进,并引入了WhisperBert,这是一种多模态堆叠集成模型,通过弱学习器结合了Whisper音频特征和BERT文本嵌入。我们表现最佳的MOS模型实现了约0.40的均方根误差(Root Mean Square Error, RMSE),显著优于人类评审者间的RMSE基线0.62。此外,消融实验表明,简单通过交叉注意力融合文本信息可能会降低性能,凸显了基于集成堆叠方法优于直接潜在融合的效果。我们还报告了基于SpeechLM架构和零样本大型语言模型(Large Language Model, LLM)评估器(如Qwen2-Audio、Gemini 2.5闪测版)的负面结果,进一步强调了专用度量学习框架的必要性。
cs.CL / 10 / 2604.08563

Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models

温度依赖性对扩展推理大型语言模型提示策略性能的影响
Salah, Mousa, Muneer, Amgad
Abstract
Extended reasoning models represent a transformative shift in Large Language Model (LLM) capabilities by enabling explicit test-time computation for complex problem solving. However, the optimal configuration of sampling temperature and prompting strategy for these systems remains largely underexplored. We systematically evaluate chain-of-thought and zero-shot prompting across four temperature settings (0.0, 0.4, 0.7, and 1.0) using Grok-4.1 with extended reasoning on 39 mathematical problems from AMO-Bench, a challenging International Mathematical Olympiad-level benchmark. We find that zero-shot prompting achieves peak performance at moderate temperatures, reaching 59% accuracy at T=0.4 and T=0.7, while chain-of-thought prompting performs best at the temperature extremes. Most notably, the benefit of extended reasoning increases from 6x at T=0.0 to 14.3x at T=1.0. These results suggest that temperature should be optimized jointly with prompting strategy, challenging the common practice of using T=0 for reasoning tasks.
Chinese Translation
扩展推理模型通过在测试时显式计算复杂问题解决过程,代表了大型语言模型(LLM)能力的变革性提升。然而,这类系统中采样温度与提示策略的最优配置尚未得到充分探索。本文系统评估了链式思维(chain-of-thought)提示和零样本(zero-shot)提示在四个温度设置(0.0、0.4、0.7 和 1.0)下,基于 Grok-4.1 扩展推理模型对 AMO-Bench 中39个国际数学奥林匹克级别的数学问题的表现。研究发现,零样本提示在中等温度下表现最佳,在 T=0.4 和 T=0.7 时准确率达到59%;而链式思维提示则在温度极值时表现最优。最显著的是,扩展推理的性能提升从 T=0.0 时的6倍增加到 T=1.0 时的14.3倍。结果表明,温度应与提示策略联合优化,这一发现挑战了在推理任务中普遍采用 T=0 的常规做法。
cs.CL / 11 / 2604.08564

Attention-Based Sampler for Diffusion Language Models

基于注意力的采样器用于扩散语言模型
Zhou, Yuyan, Hou, Kai Syun, Chen, Weiyu, Kwok, James
Abstract
Auto-regressive models (ARMs) have established a dominant paradigm in language modeling. However, their strictly sequential decoding paradigm imposes fundamental constraints on both inference efficiency and modeling flexibility. To address these limitations, diffusion-based large language models (dLLMs) have been proposed, offering the potential for parallel decoding and flexible language modeling. Despite these advantages, current dLLMs decoding strategies rely primarily on token level information, which fails to account for global sequence structure and often yields suboptimal results. In this paper, we study the decoding order selection problem from the perspective of log-likelihood maximization. We theoretically demonstrate that optimal sequence likelihood can be approximately achieved by decoding tokens in descending order of their attention matrix column sums. This finding provides a principled justification for attention-guided decoding and offers a theoretically grounded alternative to greedy search. We instantiate this theoretical insight in a new training-free decoding algorithm, termed Attn-Sampler, and further propose a block attention approximation and dynamic attention thresholding for practical acceleration. Extensive experiments across multiple benchmarks validate the effectiveness of our proposed method, demonstrating that it achieves superior generation quality while enhancing the decoding parallelism.
Chinese Translation
自回归模型(Auto-regressive models,ARMs)已成为语言建模中的主导范式。然而,其严格的顺序解码方式在推理效率和建模灵活性方面存在根本性限制。为了解决这些问题,提出了基于扩散的大型语言模型(diffusion-based large language models,dLLMs),其具备并行解码和灵活语言建模的潜力。尽管如此,当前dLLMs的解码策略主要依赖于词元级别的信息,未能充分考虑全局序列结构,且常常导致次优结果。本文从最大化对数似然的角度研究了解码顺序选择问题。我们理论证明,通过按照注意力矩阵列和的降序解码词元,可以近似实现序列的最优似然。这一发现为基于注意力的引导解码提供了理论依据,并为贪婪搜索提供了理论支持的替代方案。基于该理论洞见,我们提出了一种无需训练的新型解码算法,称为Attn-Sampler,并进一步提出了块状注意力近似和动态注意力阈值策略以实现实际加速。大量基准实验验证了所提方法的有效性,表明其在提升生成质量的同时增强了解码的并行性。
cs.CL / 12 / 2604.08565

Dynamic sparsity in tree-structured feed-forward layers at scale

大规模树结构前馈层中的动态稀疏性
Sedghi, Reza, Schiewer, Robin, Subramoney, Anand, Kappel, David
Abstract
At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer's compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree-structured conditional sparsity can be applied for autoregressive language modeling and downstream question answering, including zero- and few-shot settings, and its scalability beyond 1B parameters. Despite activating fewer than 5% of the feed-forward block's units per token, our models match dense baselines under controlled training and fine-tuning protocols. We further analyze training dynamics and identify an emergent auto-pruning effect: the interaction of hard routing with asymmetric nonlinearities progressively deactivates unused paths, yielding partial conversion of dynamic routing into static structural sparsity. We show that simple architectural choices can modulate this behavior and recover balanced trees without auxiliary losses. Overall, our work demonstrates that tree-structured feed-forward layers provide a scalable and controllable mechanism for sparsifying large transformer models.
Chinese Translation
在典型的上下文长度下,前馈多层感知机(MLP)模块占据了变换器计算预算的很大一部分,这促使人们寻求稀疏替代方案来取代密集的MLP模块。我们研究了稀疏的树结构前馈层,作为深度变换器架构中MLP模块的直接替代方案,通过硬层级路由实现条件计算,无需单独的路由网络。我们首次展示了这种树结构条件稀疏性可以应用于自回归语言建模及下游问答任务,包括零样本和少样本设置,并且其可扩展性超过10亿参数规模。尽管每个标记激活的前馈模块单元不到5%,我们的模型在受控训练和微调协议下仍能匹配密集基线性能。我们进一步分析了训练动态,发现了一种新兴的自动剪枝效应:硬路由与非对称非线性函数的相互作用逐步停用未使用的路径,实现了动态路由向静态结构稀疏性的部分转化。我们展示了简单的架构选择可以调节该行为,并在无辅助损失的情况下恢复平衡树结构。总体而言,我们的工作表明,树结构前馈层为大规模变换器模型的稀疏化提供了一种可扩展且可控的机制。
cs.CL / 13 / 2604.08566

Sentiment Classification of Gaza War Headlines: A Comparative Analysis of Large Language Models and Arabic Fine-Tuned BERT Models

加沙战争头条的情感分类:大型语言模型与阿拉伯语微调BERT模型的比较分析
Eleraqi, Amr, Mustafa, Hager H., Ahmed, Abdul Hadi N.
Abstract
This study examines how different artificial intelligence architectures interpret sentiment in conflict-related media discourse, using the 2023 Gaza War as a case study. Drawing on a corpus of 10,990 Arabic news headlines (Eleraqi 2026), the research conducts a comparative analysis between three large language models and six fine-tuned Arabic BERT models. Rather than evaluating accuracy against a single human-annotated gold standard, the study adopts an epistemological approach that treats sentiment classification as an interpretive act produced by model architectures. To quantify systematic differences across models, the analysis employs information-theoretic and distributional metrics, including Shannon Entropy, Jensen-Shannon Distance, and a Variance Score measuring deviation from aggregate model behavior. The results reveal pronounced and non-random divergence in sentiment distributions. Fine-tuned BERT models, particularly MARBERT, exhibit a strong bias toward neutral classifications, while LLMs consistently amplify negative sentiment, with LLaMA-3.1-8B showing near-total collapse into negativity. Frame-conditioned analysis further demonstrates that GPT-4.1 adjusts sentiment judgments in line with narrative frames (e.g., humanitarian, legal, security), whereas other LLMs display limited contextual modulation. These findings suggest that the choice of model constitutes a choice of interpretive lens, shaping how conflict narratives are algorithmically framed and emotionally evaluated. The study contributes to media studies and computational social science by foregrounding algorithmic discrepancy as an object of analysis and by highlighting the risks of treating automated sentiment outputs as neutral or interchangeable measures of media tone in contexts of war and crisis.
Chinese Translation
本研究考察了不同人工智能架构如何解读与冲突相关的媒体话语中的情感,以2023年加沙战争为案例。研究基于10,990条阿拉伯语新闻头条(Eleraqi 2026),对三种大型语言模型与六种微调的阿拉伯语BERT模型进行了比较分析。研究并非通过单一人工标注的金标准评估准确性,而是采用了一种认识论的方法,将情感分类视为模型架构产生的解释性行为。为了量化模型之间的系统性差异,分析采用了信息理论和分布度量,包括香农熵(Shannon Entropy)、詹森-香农距离(Jensen-Shannon Distance)以及测量偏离聚合模型行为的方差得分(Variance Score)。结果显示情感分布存在明显且非随机的差异。微调的BERT模型,特别是MARBERT,表现出强烈的中立分类偏向,而大型语言模型(LLMs)则持续放大负面情感,其中LLaMA-3.1-8B几乎完全崩溃为负面情感。框架条件分析进一步表明,GPT-4.1根据叙事框架(例如人道主义、法律、安全)调整情感判断,而其他LLMs则表现出有限的上下文调节能力。这些发现表明,模型的选择构成了一种解释视角的选择,塑造了冲突叙事在算法上如何被框定和情感上被评估。本研究通过将算法差异作为分析对象,并强调在战争和危机背景下将自动化情感输出视为中立或可互换的媒体语调衡量的风险,为媒体研究和计算社会科学做出了贡献。
cs.CL / 14 / 2604.08567

Multi-User Large Language Model Agents

多用户大型语言模型代理
Yang, Shu, Zhu, Shenzhe, Zhu, Hao, Enríquez, José Ramón, Wang, Di, Pentland, Alex, Bakker, Michiel A., Pei, Jiaxin
Abstract
Large language models (LLMs) and LLM-based agents are increasingly deployed as assistants in planning and decision making, yet most existing systems are implicitly optimized for a single-principal interaction paradigm, in which the model is designed to satisfy the objectives of one dominant user whose instructions are treated as the sole source of authority and utility. However, as they are integrated into team workflows and organizational tools, they are increasingly required to serve multiple users simultaneously, each with distinct roles, preferences, and authority levels, leading to multi-user, multi-principal settings with unavoidable conflicts, information asymmetry, and privacy constraints. In this work, we present the first systematic study of multi-user LLM agents. We begin by formalizing multi-user interaction with LLM agents as a multi-principal decision problem, where a single agent must account for multiple users with potentially conflicting interests and associated challenges. We then introduce a unified multi-user interaction protocol and design three targeted stress-testing scenarios to evaluate current LLMs' capabilities in instruction following, privacy preservation, and coordination. Our results reveal systematic gaps: frontier LLMs frequently fail to maintain stable prioritization under conflicting user objectives, exhibit increasing privacy violations over multi-turn interactions, and suffer from efficiency bottlenecks when coordination requires iterative information gathering.
Chinese Translation
大型语言模型(LLMs)及基于LLM的代理正日益被部署为规划和决策辅助工具,然而现有大多数系统隐含地优化于单一主体交互范式,即模型旨在满足一位主导用户的目标,该用户的指令被视为唯一的权威和效用来源。然而,随着这些模型被集成进团队工作流程和组织工具,它们越来越需要同时服务于多个用户,每个用户具有不同的角色、偏好和权限等级,导致多用户、多主体的情境中不可避免地出现冲突、信息不对称和隐私限制。在本研究中,我们首次系统性地研究了多用户LLM代理。我们首先将多用户与LLM代理的交互形式化为多主体决策问题,其中单一代理必须考虑多个用户可能存在冲突的利益及相关挑战。随后,我们提出了统一的多用户交互协议,并设计了三个针对性的压力测试场景,以评估当前LLM在遵循指令、隐私保护和协调能力方面的表现。结果显示存在系统性差距:前沿LLM在面对用户目标冲突时常常无法维持稳定的优先级排序,多轮交互过程中隐私泄露问题逐渐加剧,且在协调需要迭代信息收集时效率瓶颈明显。
cs.CL / 15 / 2604.08568

Can We Still Hear the Accent? Investigating the Resilience of Native Language Signals in the LLM Era

我们还能听出口音吗?在大语言模型(LLM)时代探究母语信号的韧性
Utami, Nabelanita, Ryohei, Sasano
Abstract
The evolution of writing assistance tools from machine translation to large language models (LLMs) has changed how researchers write. This study investigates whether this shift is homogenizing research papers by analyzing native language identification (NLI) trends in ACL Anthology papers across three eras: pre-neural network (NN), pre-LLM, and post-LLM. We construct a labeled dataset using a semi-automated framework and fine-tune a classifier to detect linguistic fingerprints of author backgrounds. Our analysis shows a consistent decline in NLI performance over time. Interestingly, the post-LLM era reveals anomalies: while Chinese and French show unexpected resistance or divergent trends, Japanese and Korean exhibit sharper-than-expected declines.
Chinese Translation
从机器翻译到大语言模型(LLMs),写作辅助工具的发展改变了研究人员的写作方式。本研究通过分析ACL Anthology论文中母语识别(NLI)的趋势,考察这一转变是否正在使研究论文趋于同质化,涵盖了神经网络(NN)出现前、LLM出现前及LLM出现后三个时期。我们采用半自动化框架构建了带标签的数据集,并微调分类器以检测作者背景的语言指纹。分析结果显示,NLI性能随时间持续下降。有趣的是,LLM时代后期出现了异常现象:中国和法国的母语信号表现出意外的抵抗力或不同趋势,而日语和韩语则呈现出比预期更为显著的下降。
cs.CL / 16 / 2604.08595

Adaptive Rigor in AI System Evaluation using Temperature-Controlled Verdict Aggregation via Generalized Power Mean

基于温控判决聚合的人工智能系统评估中的自适应严格性
Meshkov, Aleksandr
Abstract
Existing evaluation methods for LLM-based AI systems, such as LLM-as-a-Judge, verdict systems, and NLI, do not always align well with human assessment because they cannot adapt their strictness to the application domain. This paper presents Temperature-Controlled Verdict Aggregation (TCVA), a method that combines a five-level verdict-scoring system with generalized power-mean aggregation and an intuitive temperature parameter T [0.1, 1.0] to control evaluation rigor. Low temperatures yield pessimistic scores suited for safety-critical domains; high temperatures produce lenient scores appropriate for conversational AI. Experimental evaluation on three benchmark datasets with human Likert-scale annotations (SummEval and USR) shows that TCVA achieves correlation with human judgments comparable to RAGAS on faithfulness (Spearman = 0.667 vs. 0.676) while consistently outperforming DeepEval. The method requires no additional LLM calls when adjusting the temperature parameter.
Chinese Translation
现有的基于大型语言模型(LLM)的人工智能系统评估方法,如LLM-as-a-Judge、判决系统和自然语言推理(NLI),并不总是与人类评估高度一致,因为它们无法根据应用领域调整其严格性。本文提出了一种温控判决聚合(Temperature-Controlled Verdict Aggregation, TCVA)方法,该方法结合了五级判决评分系统、广义幂均值聚合和直观的温度参数T [0.1, 1.0],以控制评估的严格性。低温度产生适合安全关键领域的悲观评分;高温度则产生适合对话人工智能的宽松评分。在三个具有人工Likert量表注释的基准数据集(SummEval和USR)上的实验评估表明,TCVA在与人类判断的一致性上达到了与RAGAS相当的水平(忠实度Spearman = 0.667 vs. 0.676),同时始终优于DeepEval。该方法在调整温度参数时不需要额外的LLM调用。
cs.CL / 17 / 2604.08644

EXAONE 4.5 Technical Report

EXAONE 4.5 技术报告
Choi, Eunbi, Choi, Kibong, Chun, Sehyun, Hong, Seokhee, Hwang, Junwon, Jeon, Hyojin, Jo, Ahra, Jo, Hyunjik, Jo, Yeonsik, Kim, Joonkee, Kim, Seonghwan, Kim, Soyeon, Kim, Sunkyoung, Kim, Yireun, Kim, Yongil, Lee, Changhun, Lee, Haeju, Lee, Jinsik, Lee, Kyungmin, Park, Sangha, Ryoo, Kwangrok, Seo, Minju, Yang, Sejong, Yeen, Heuiyeen, Chang, Hwan, Choi, Stanley Jungkyu, Choi, Yejin, Han, Kyubeen, Jang, Joonwon, Jeon, Kijeong, Jeong, Geunyeong, Jo, Gerrard Jeongwon, Jung, Jiyeon, Kim, Daeseong, Kim, Dohoon, Kim, Dohyun, Kim, Hyunseo, Kim, Minu, Kim, Myoungshin, Kim, Youchul, Ko, Byungoh, Lee, Christopher, Lee, Edward Hwayoung, Lee, Honglak, Lee, Jiyoung, Lee, Sangeun, Lim, Seungwon, Lim, Woohyung, Mun, Jueun, Park, Jaewoo, Park, Jimin, Park, Jinho, Park, Yongmin, Seo, Wooseok, Song, Yongwoo, Yi, Sihyuk, Yoo, Kyungjae, Yoon, Sangyeon
Abstract
This technical report introduces EXAONE 4.5, the first open-weight vision language model released by LG AI Research. EXAONE 4.5 is architected by integrating a dedicated visual encoder into the existing EXAONE 4.0 framework, enabling native multimodal pretraining over both visual and textual modalities. The model is trained on large-scale data with careful curation, particularly emphasizing document-centric corpora that align with LG's strategic application domains. This targeted data design enables substantial performance gains in document understanding and related tasks, while also delivering broad improvements across general language capabilities. EXAONE 4.5 extends context length up to 256K tokens, facilitating long-context reasoning and enterprise-scale use cases. Comparative evaluations demonstrate that EXAONE 4.5 achieves competitive performance in general benchmarks while outperforming state-of-the-art models of similar scale in document understanding and Korean contextual reasoning. As part of LG's ongoing effort toward practical industrial deployment, EXAONE 4.5 is designed to be continuously extended with additional domains and application scenarios to advance AI for a better life.
Chinese Translation
本技术报告介绍了 EXAONE 4.5,这是由 LG AI Research 发布的首个开源权重视觉语言模型。EXAONE 4.5 通过在现有 EXAONE 4.0 框架中集成专用视觉编码器构建,实现了对视觉和文本模态的原生多模态预训练。该模型在经过精心筛选的大规模数据上训练,特别强调与 LG 战略应用领域相契合的以文档为中心的语料库。此针对性的数据设计显著提升了文档理解及相关任务的性能,同时在通用语言能力方面也带来了广泛的改进。EXAONE 4.5 将上下文长度扩展至 256K 令牌,支持长上下文推理和企业级应用场景。对比评估表明,EXAONE 4.5 在通用基准测试中表现具有竞争力,并在文档理解及韩语上下文推理方面超越了同规模的最先进模型。作为 LG 致力于实用工业部署的持续努力的一部分,EXAONE 4.5 设计为可持续扩展,涵盖更多领域和应用场景,以推动更美好生活的人工智能发展。
cs.CL / 18 / 2604.08723

Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?

分解增量:模型实际上从偏好对中学到了什么?
Lee, Chia-Hsuan, Zhou, Mingyang, Ni, Renkun, Cheng, Zelei, Dai, Sihui, Chakraborty, Supriyo, Zhang, Shixiong, Sahu, Sambit, Campbell, William
Abstract
Preference optimization methods such as DPO and KTO are widely used for aligning language models, yet little is understood about what properties of preference data drive downstream reasoning gains. We ask: what aspects of a preference pair improve a reasoning model's performance on general reasoning tasks? We investigate two distinct notions of quality delta in preference data: generator-level delta, arising from the differences in capability between models that generate chosen and rejected reasoning traces, and sample-level delta, arising from differences in judged quality differences within an individual preference pair. To study generator-level delta, we vary the generator's scale and model family, and to study sample-level delta, we employ an LLM-as-a-judge to rate the quality of generated traces along multiple reasoning-quality dimensions. We find that increasing generator-level delta steadily improves performance on out-of-domain reasoning tasks and filtering data by sample-level delta can enable more data-efficient training. Our results suggest a twofold recipe for improving reasoning performance through preference optimization: maximize generator-level delta when constructing preference pairs and exploit sample-level delta to select the most informative training examples.
Chinese Translation
偏好优化方法如 DPO 和 KTO 被广泛用于对齐语言模型,但关于偏好数据的哪些特性驱动下游推理提升的理解仍然有限。我们提出问题:偏好对的哪些方面改善了推理模型在一般推理任务上的表现?我们研究了偏好数据中两种不同的质量增量概念:生成器级增量,它源于生成被选择和被拒绝推理轨迹的模型之间能力的差异;样本级增量,它源于个别偏好对内判断质量差异的不同。为了研究生成器级增量,我们改变生成器的规模和模型家族;而为了研究样本级增量,我们使用 LLM-as-a-judge 来评估生成轨迹在多个推理质量维度上的质量。我们发现,增加生成器级增量可以稳步提高在域外推理任务上的表现,而通过样本级增量过滤数据可以实现更高效的数据训练。我们的结果建议了一种通过偏好优化提高推理表现的双重策略:在构建偏好对时最大化生成器级增量,并利用样本级增量选择最具信息量的训练示例。
cs.CL / 19 / 2604.08752

LLMs Underperform Graph-Based Parsers on Supervised Relation Extraction for Complex Graphs

大型语言模型在复杂图监督关系抽取任务中表现不及基于图的解析器
Gajo, Paolo, Rosati, Domenic, Sajjad, Hassan, Barrón-Cedeño, Alberto
Abstract
Relation extraction represents a fundamental component in the process of creating knowledge graphs, among other applications. Large language models (LLMs) have been adopted as a promising tool for relation extraction, both in supervised and in-context learning settings. However, in this work we show that their performance still lags behind much smaller architectures when the linguistic graph underlying a text has great complexity. To demonstrate this, we evaluate four LLMs against a graph-based parser on six relation extraction datasets with sentence graphs of varying sizes and complexities. Our results show that the graph-based parser increasingly outperforms the LLMs, as the number of relations in the input documents increases. This makes the much lighter graph-based parser a superior choice in the presence of complex linguistic graphs.
Chinese Translation
关系抽取是构建知识图谱等应用中的基础组成部分。大型语言模型(LLMs)已被作为一种有前景的工具应用于监督学习和上下文学习环境下的关系抽取。然而,本研究表明,当文本所依赖的语言图结构复杂度较高时,LLMs的性能仍落后于规模更小的架构。为验证这一点,我们在六个关系抽取数据集上,针对不同规模和复杂度的句子图,评估了四种LLMs与一种基于图的解析器的表现。结果显示,随着输入文档中关系数量的增加,基于图的解析器的性能优势愈发明显。这使得在处理复杂语言图时,体积更小的基于图的解析器成为更优的选择。
cs.CL / 20 / 2604.08757

Cards Against LLMs: Benchmarking Humor Alignment in Large Language Models

对抗大型语言模型的幽默:大型语言模型幽默对齐的基准测试
Fettach, Yousra, Bied, Guillaume, Toivonen, Hannu, De Bie, Tijl
Abstract
Humor is one of the most culturally embedded and socially significant dimensions of human communication, yet it remains largely unexplored as a dimension of Large Language Model (LLM) alignment. In this study, five frontier language models play the same Cards Against Humanity games (CAH) as human players. The models select the funniest response from a slate of ten candidate cards across 9,894 rounds. While all models exceed the random baseline, alignment with human preference remains modest. More striking is that models agree with each other substantially more often than they agree with humans. We show that this preference is partly explained by systematic position biases and content preferences, raising the question whether LLM humor judgment reflects genuine preference or structural artifacts of inference and alignment.
Chinese Translation
幽默是人类交流中最具文化嵌入性和社会重要性的维度之一,但作为大型语言模型(LLM)对齐的一个维度,它仍然在很大程度上未被探索。在本研究中,五个前沿语言模型与人类玩家一起进行同样的《人类的对抗游戏》(Cards Against Humanity,CAH)。这些模型在9,894轮游戏中从十张候选卡片中选择最搞笑的回应。尽管所有模型的表现均超过随机基线,但与人类偏好的对齐仍然相对有限。更引人注目的是,这些模型之间的意见一致性远高于它们与人类之间的意见一致性。我们表明,这种偏好部分可以通过系统的位置偏见和内容偏好来解释,这引发了一个问题:LLM的幽默判断是否反映了真实的偏好,还是推理和对齐的结构性伪影。
cs.CL / 21 / 2604.08764

Revisiting Anisotropy in Language Transformers: The Geometry of Learning Dynamics

重新审视语言变换器中的各向异性:学习动态的几何学
Bernas, Raphael, Jourdan, Fanny, Poché, Antonin, Hudelot, Céline
Abstract
Since their introduction, Transformer architectures have dominated Natural Language Processing (NLP). However, recent research has highlighted an inherent anisotropy phenomenon in these models, presenting a significant challenge to their geometric interpretation. Previous theoretical studies on this phenomenon are rarely grounded in the underlying representation geometry. In this paper, we extend them by deriving geometric arguments for how frequency-biased sampling attenuates curvature visibility and why training preferentially amplify tangent directions. Empirically, we then use concept-based mechanistic interpretability during training, rather than only post hoc, to fit activation-derived low-rank tangent proxies and test them against ordinary backpropagated true gradients. Across encoder-style and decoder-style language models, we find that these activation-derived directions capture both unusually large gradient energy and a substantially larger share of gradient anisotropy than matched-rank normal controls, providing strong empirical support for a tangent-aligned account of anisotropy.
Chinese Translation
自引入以来,变换器架构在自然语言处理(NLP)领域中占据主导地位。然而,最近的研究揭示了这些模型中固有的各向异性现象,这对其几何解释提出了重大挑战。之前关于这一现象的理论研究很少基于基础的表示几何。在本文中,我们通过推导几何论证来扩展这些研究,探讨频率偏向采样如何削弱曲率可见性,以及为何训练优先放大切向方向。然后,我们在训练过程中使用基于概念的机械可解释性,而不仅仅是事后分析,以拟合由激活导出的低秩切向代理,并将其与普通反向传播的真实梯度进行测试。在编码器风格和解码器风格的语言模型中,我们发现这些由激活导出的方向捕捉到了异常大的梯度能量,并且相较于匹配秩的正常对照组,它们占据了显著更大的梯度各向异性份额,为切向对齐的各向异性解释提供了强有力的实证支持。
cs.CL / 22 / 2604.08782

MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation

MT-OSC:解决大型语言模型在多轮对话中迷失的路径
Singh, Jyotika, Tu, Fang, Ballesteros, Miguel, Sun, Weiyi, Ghoshal, Sandip, Yuan, Michelle, Benajiba, Yassine, Ravi, Sujith, Roth, Dan
Abstract
Large language models (LLMs) suffer significant performance degradation when user instructions and context are distributed over multiple conversational turns, yet multi-turn (MT) interactions dominate chat interfaces. The routine approach of appending full chat history to prompts rapidly exhausts context windows, leading to increased latency, higher computational costs, and diminishing returns as conversations extend. We introduce MT-OSC, a One-off Sequential Condensation framework that efficiently and automatically condenses chat history in the background without disrupting the user experience. MT-OSC employs a Condenser Agent that uses a few-shot inference-based Condenser and a lightweight Decider to selectively retain essential information, reducing token counts by up to 72% in 10-turn dialogues. Evaluated across 13 state-of-the-art LLMs and diverse multi-turn benchmarks, MT-OSC consistently narrows the multi-turn performance gap - yielding improved or preserved accuracy across datasets while remaining robust to distractors and irrelevant turns. Our results establish MT-OSC as a scalable solution for multi-turn chats, enabling richer context within constrained input spaces, reducing latency and operational cost, while balancing performance.
Chinese Translation
大型语言模型(LLMs)在用户指令和上下文分布于多个对话轮次时,表现出显著的性能下降,然而多轮(MT)交互在聊天界面中占主导地位。常规方法是将完整的聊天历史附加到提示中,这迅速耗尽了上下文窗口,导致延迟增加、计算成本上升,并且随着对话的延续,收益递减。我们提出了MT-OSC,一种一次性顺序压缩框架,能够高效且自动地在后台压缩聊天历史,而不干扰用户体验。MT-OSC采用了一个压缩代理(Condenser Agent),利用少量示例推理的压缩器(Condenser)和轻量级决策器(Decider)选择性地保留关键信息,在10轮对话中将令牌数量减少了多达72%。在13个最先进的LLMs和多种多轮基准测试中评估,MT-OSC始终缩小了多轮性能差距——在各数据集上实现了准确性提升或保持,同时对干扰项和无关轮次保持鲁棒性。我们的结果确立了MT-OSC作为多轮聊天的可扩展解决方案,使得在受限输入空间内实现更丰富的上下文,降低延迟和运营成本,同时平衡性能。
cs.CL / 23 / 2604.08788

MedConceal: A Benchmark for Clinical Hidden-Concern Reasoning Under Partial Observability

MedConceal:临床隐性关切推理的部分可观察性基准
Han, Yikun, Chan, Joey, Chen, Jingyuan, Ai, Mengting, Du, Simo, Guo, Yue
Abstract
Patient-clinician communication is an asymmetric-information problem: patients often do not disclose fears, misconceptions, or practical barriers unless clinicians elicit them skillfully. Effective medical dialogue therefore requires reasoning under partial observability: clinicians must elicit latent concerns, confirm them through interaction, and respond in ways that guide patients toward appropriate care. However, existing medical dialogue benchmarks largely sidestep this challenge by exposing hidden patient state, collapsing elicitation into extraction, or evaluating responses without modeling what remains hidden. We present MedConceal, a benchmark with an interactive patient simulator for evaluating hidden-concern reasoning in medical dialogue, comprising 300 curated cases and 600 clinician-LLM interactions. Built from clinician-answered online health discussions, each case pairing clinician-visible context with simulator-internal hidden concerns derived from prior literature and structured using an expert-developed taxonomy. The simulator withholds these concerns from the dialogue agent, tracks whether they have been revealed and addressed via theory-grounded turn-level communication signals, and is clinician-reviewed for clinical plausibility. This enables process-aware evaluation of both task success and the interaction process that leads to it. We study two abilities: confirmation, surfacing hidden concerns through multi-turn dialogue, and intervention, addressing the primary concern and guiding the patient toward a target plan. Results show that no single system dominates: frontier models lead on different confirmation metrics, while human clinicians (N=159) remain strongest on intervention success. Together, these results identify hidden-concern reasoning under partial observability as a key unresolved challenge for medical dialogue systems.
Chinese Translation
患者与临床医生之间的沟通是一个不对称信息问题:患者通常不会披露恐惧、误解或实际障碍,除非临床医生巧妙地引导他们。因此,有效的医疗对话需要在部分可观察性下进行推理:临床医生必须引出潜在关切,通过互动确认这些关切,并以指导患者获得适当护理的方式作出回应。然而,现有的医疗对话基准在很大程度上回避了这一挑战,因为它们暴露了隐藏的患者状态,将引导过程简化为提取,或在没有建模隐藏内容的情况下评估响应。我们提出了MedConceal,这是一个具有互动患者模拟器的基准,用于评估医疗对话中的隐性关切推理,包含300个精心策划的案例和600个临床医生与大语言模型(LLM)的互动。每个案例将临床医生可见的背景与来自先前文献的、使用专家开发的分类法结构化的模拟器内部隐藏关切配对。模拟器在对话代理中隐瞒这些关切,跟踪它们是否通过基于理论的回合级交流信号被揭示和处理,并经过临床医生审核以确保临床合理性。这使得对任务成功和导致成功的互动过程的过程感知评估成为可能。我们研究了两种能力:确认,通过多轮对话揭示隐藏关切,以及干预,解决主要关切并引导患者朝向目标计划。结果表明,没有单一系统占据主导地位:前沿模型在不同的确认指标上表现突出,而人类临床医生(N=159)在干预成功率上仍然最强。这些结果共同表明,在部分可观察性下进行隐性关切推理是医疗对话系统面临的一个关键未解决挑战。
cs.CL / 24 / 2604.08797

Lessons Without Borders? Evaluating Cultural Alignment of LLMs Using Multilingual Story Moral Generation

没有边界的教训?使用多语言故事道德生成评估大型语言模型的文化一致性
Wu, Sophie, Piper, Andrew
Abstract
Stories are key to transmitting values across cultures, but their interpretation varies across linguistic and cultural contexts. Thus, we introduce multilingual story moral generation as a novel culturally grounded evaluation task. Using a new dataset of human-written story morals collected across 14 language-culture pairs, we compare model outputs with human interpretations via semantic similarity, a human preference survey, and value categorization. We show that frontier models such as GPT-4o and Gemini generate story morals that are semantically similar to human responses and preferred by human evaluators. However, their outputs exhibit markedly less cross-linguistic variation and concentrate on a narrower set of widely shared values. These findings suggest that while contemporary models can approximate central tendencies of human moral interpretation, they struggle to reproduce the diversity that characterizes human narrative understanding. By framing narrative interpretation as an evaluative task, this work introduces a new approach to studying cultural alignment in language models beyond static benchmarks or knowledge-based tests.
Chinese Translation
故事是跨文化传递价值观的关键,但其解读在语言和文化背景中有所不同。因此,我们引入多语言故事道德生成作为一种新颖的文化基础评估任务。通过收集跨14种语言-文化对的人类撰写的故事道德的新数据集,我们通过语义相似性、人类偏好调查和价值分类比较模型输出与人类解读。我们展示了前沿模型如GPT-4o和Gemini生成的故事道德在语义上与人类回应相似,并受到人类评估者的偏好。然而,它们的输出表现出明显较少的跨语言变异,并集中于一组更狭窄的广泛共享价值观。这些发现表明,尽管当代模型能够近似人类道德解读的中心倾向,但它们在再现人类叙事理解的多样性方面存在困难。通过将叙事解读框架化为一种评估任务,本研究引入了一种新的方法来研究语言模型中的文化一致性,超越静态基准或基于知识的测试。
cs.CL / 25 / 2604.08849

Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching

基于约束满足的可扩展高召回率临床试验匹配信息检索方法
Zhou, Cyrus, Jin, Yufei, Xu, Yilin, Wang, Yu-Chiang, Chao, Chieh-Ju, Lam, Monica S.
Abstract
Clinical trials are central to evidence-based medicine, yet many struggle to meet enrollment targets, despite the availability of over half a million trials listed on ClinicalTrials.gov, which attracts approximately two million users monthly. Existing retrieval techniques, largely based on keyword and embedding-similarity matching between patient profiles and eligibility criteria, often struggle with low recall, low precision, and limited interpretability due to complex constraints. We propose SatIR, a scalable clinical trial retrieval method based on constraint satisfaction, enabling high-precision and interpretable matching of patients to relevant trials. Our approach uses formal methods -- Satisfiability Modulo Theories (SMT) and relational algebra -- to efficiently represent and match key constraints from clinical trials and patient records. Beyond leveraging established medical ontologies and conceptual models, we use Large Language Models (LLMs) to convert informal reasoning regarding ambiguity, implicit clinical assumptions, and incomplete patient records into explicit, precise, controllable, and interpretable formal constraints. Evaluated on 59 patients and 3,621 trials, SatIR outperforms TrialGPT on all three evaluated retrieval objectives. It retrieves 32%-72% more relevant-and-eligible trials per patient, improves recall over the union of useful trials by 22-38 points, and serves more patients with at least one useful trial. Retrieval is fast, requiring 2.95 seconds per patient over 3,621 trials. These results show that SatIR is scalable, effective, and interpretable.
Chinese Translation
临床试验是循证医学的核心,然而尽管ClinicalTrials.gov上列出了超过五十万项试验,且每月吸引约两百万用户,许多试验仍难以达到招募目标。现有的检索技术主要基于患者档案与资格标准之间的关键词和嵌入相似度匹配,常因复杂约束导致召回率低、精确度不足且解释性有限。我们提出了SatIR,一种基于约束满足的可扩展临床试验检索方法,实现了患者与相关试验的高精度且可解释匹配。该方法利用形式化方法——可满足性模理论(Satisfiability Modulo Theories, SMT)和关系代数——高效表示并匹配临床试验及患者记录中的关键约束。除了利用既有医学本体和概念模型外,我们还采用大型语言模型(Large Language Models, LLMs)将关于歧义、隐含临床假设及不完整患者记录的非正式推理转化为明确、精确、可控且可解释的形式约束。在59名患者和3,621项试验上的评估结果显示,SatIR在所有三个检索目标上均优于TrialGPT。它每位患者检索到的相关且符合资格的试验数量提升32%-72%,对有用试验集合的召回率提高22-38个百分点,并为更多患者提供至少一项有用试验。检索速度快速,平均每位患者在3,621项试验中耗时2.95秒。结果表明,SatIR具备良好的可扩展性、高效性和解释性。
cs.CL / 26 / 2604.08851

Cross-Lingual Attention Distillation with Personality-Informed Generative Augmentation for Multilingual Personality Recognition

基于个性信息生成增强的跨语言注意力蒸馏用于多语言个性识别
Tan, Jing Jie, Kwan, Ban-Hoe, Ng, Danny Wee-Kiat, Hum, Yan-Chai, Kawarazaki, Noriyuki, Takano, Kosuke
Abstract
While significant work has been done on personality recognition, the lack of multilingual datasets remains an unresolved challenge. To address this, we propose ADAM (Cross-Lingual (A)ttention (D)istillation with Personality-Guided Generative (A)ugmentation for (M)ultilingual Personality Recognition), a state-of-the-art approach designed to advance multilingual personality recognition. Our approach leverages an existing English-language personality dataset as the primary source and employs a large language model (LLM) for translationbased augmentation, enhanced by Personality-Informed Generative Augmentation (PIGA), to generate high-quality training data in multiple languages, including Japanese, Chinese, Malay, and French. We provide a thorough analysis to justify the effectiveness of these augmentation techniques. Building on these advancements, ADAM integrates Cross-Lingual Attention Distillation (CLAD) to train a model capable of understanding and recognizing personality traits across languages, bridging linguistic and cultural gaps in personality analysis. This research presents a thorough evaluation of the proposed augmentation method, incorporating an ablation study on recognition performance to ensure fair comparisons and robust validation. Overall, with PIGA augmentation, the findings demonstrate that CLAD significantly outperforms the standard BCE across all languages and personality traits, achieving notable improvements in average BA scores - 0.6332 (+0.0573) on the Essays dataset and 0.7448 (+0.0968) on the Kaggle dataset. The CLAD-trained model also demonstrated strong generalizability and achieved benchmark performance comparable to current leading encoder models. The model weight, dataset, and algorithm repository are available at https://research.jingjietan.com/?q=ADAM.
Chinese Translation
尽管在个性识别方面已经取得了显著进展,但缺乏多语言数据集仍然是一个未解决的挑战。为了解决这一问题,我们提出了ADAM(基于个性引导生成增强的跨语言注意力蒸馏用于多语言个性识别),这是一种旨在推动多语言个性识别的最先进方法。我们的方法利用现有的英语个性数据集作为主要来源,并采用大型语言模型(LLM)进行基于翻译的增强,结合个性信息生成增强(PIGA),以生成高质量的多语言训练数据,包括日语、中文、马来语和法语。我们提供了全面的分析,以证明这些增强技术的有效性。在这些进展的基础上,ADAM集成了跨语言注意力蒸馏(CLAD),以训练一个能够理解和识别跨语言个性特征的模型,弥合个性分析中的语言和文化差距。本研究对所提出的增强方法进行了全面评估,包括对识别性能的消融研究,以确保公平比较和稳健验证。总体而言,借助PIGA增强,研究结果表明CLAD在所有语言和个性特征上显著优于标准的BCE,在Essays数据集上平均BA分数达到0.6332(+0.0573),在Kaggle数据集上达到0.7448(+0.0968)。CLAD训练的模型还表现出强大的泛化能力,并达到了与当前领先的编码器模型相当的基准性能。模型权重、数据集和算法库可在https://research.jingjietan.com/?q=ADAM获取。
cs.CL / 27 / 2604.08879

GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification

GRASP:基于双阶段优化的多模态讽刺目标识别的有根链式推理方法
Wan, Faxian, Yang, Xiaocui, Cao, Yifan, Feng, Shi, Wang, Daling, Zhang, Yifei
Abstract
Moving beyond the traditional binary classification paradigm of Multimodal Sarcasm Detection, Multimodal Sarcasm Target Identification (MSTI) presents a more formidable challenge, requiring precise localization of fine-grained targets such as textual phrases and visual regions. Existing approaches predominantly rely on implicit cross-modal alignment, offering limited interpretability and suboptimal fine-grained localization. To address these limitations, we propose GRASP, Grounded Chain-of-Thought ReAsoning with Dual-Stage Optimization for Multimodal Sarcasm Prediction and Target Identification, a framework that integrates visual grounding with explicit Chain-of-Thought (CoT) reasoning to move beyond black-box MSTI. Specifically, we curate MSTI-MAX, a refined dataset that mitigates class imbalance and enriches multimodal sarcasm cues. We introduce Grounded CoT reasoning, which explicitly anchors sarcasm-related visual regions within the reasoning trajectory and prompts the model to articulate rationales before predicting the final classification labels and sarcasm targets. Furthermore, we employ a dual-stage outcome-supervised joint optimization strategy: Supervised Fine-Tuning with a coordinate-aware weighted loss, followed by Fine-Grained Target Policy Optimization. Extensive experiments demonstrate that GRASP outperforms existing baselines in fine-grained sarcasm target identification across modalities, and an LLM-as-a-Judge evaluation quantitatively measures the quality of internal reasoning chains. Our dataset and source code will be released on GitHub.
Chinese Translation
超越传统的多模态讽刺检测二分类范式,多模态讽刺目标识别(MSTI)提出了更具挑战性的任务,需精确定位细粒度目标,如文本短语和视觉区域。现有方法主要依赖隐式的跨模态对齐,解释性有限且细粒度定位效果不佳。为解决这些问题,我们提出了GRASP(Grounded Chain-of-Thought ReAsoning with Dual-Stage Optimization),一个将视觉定位与显式链式推理(Chain-of-Thought, CoT)相结合的框架,旨在突破黑盒式的MSTI。具体而言,我们构建了MSTI-MAX,一个缓解类别不平衡并丰富多模态讽刺线索的精细数据集。我们引入了有根CoT推理,明确将讽刺相关的视觉区域锚定在推理路径中,并促使模型在预测最终分类标签和讽刺目标前阐述推理依据。此外,我们采用双阶段结果监督的联合优化策略:先进行带坐标感知加权损失的监督微调,随后进行细粒度目标策略优化。大量实验表明,GRASP在跨模态细粒度讽刺目标识别上优于现有基线,且通过大型语言模型(LLM)作为评判者的评估定量衡量了内部推理链的质量。我们的数据集和源码将于GitHub公开发布。
cs.CL / 28 / 2604.08923

NCL-BU at SemEval-2026 Task 3: Fine-tuning XLM-RoBERTa for Multilingual Dimensional Sentiment Regression

NCL-BU在SemEval-2026任务3中的方法:基于XLM-RoBERTa的多语言维度情感回归微调
Wu, Tong, Rusnachenko, Nicolay, Liang, Huizhi
Abstract
Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence-arousal (VA) regression. This paper describes a system developed for Track A - Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real-valued VA scores in the [1, 9] range for each given aspect in a text. A fine-tuning approach based on XLM-RoBERTa-base is adopted, constructing the input as [CLS] T [SEP] a_i [SEP] and training dual regression heads with sigmoid-scaled outputs for valence and arousal prediction. Separate models are trained for each language-domain combination (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine-tuning approach is compared against several large language models including GPT-5.2, LLaMA-3-70B, LLaMA-3.3-70B, and LLaMA-4-Maverick under a few-shot prompting setting, demonstrating that task-specific fine-tuning substantially and consistently outperforms these LLM-based methods across all evaluation datasets. The code is publicly available at https://github.com/tongwu17/SemEval-2026-Task3-Track-A.
Chinese Translation
维度化基于方面的情感分析(Dimensional Aspect-Based Sentiment Analysis,DimABSA)将传统的基于类别极性标签的ABSA扩展为连续的效价-唤醒度(Valence-Arousal,VA)回归。本文介绍了为Track A - 子任务1(维度化方面情感回归)开发的系统,旨在预测文本中每个给定方面的实值VA分数,范围为[1, 9]。采用基于XLM-RoBERTa-base的微调方法,将输入构造为[CLS] T [SEP] a_i [SEP],并训练具有Sigmoid缩放输出的双回归头,分别用于效价和唤醒度的预测。针对每种语言-领域组合(英语和中文,涵盖餐饮、笔记本电脑和金融领域)训练独立模型,训练集和开发集合并用于最终测试预测。在开发实验中,将微调方法与包括GPT-5.2、LLaMA-3-70B、LLaMA-3.3-70B和LLaMA-4-Maverick在内的多种大型语言模型在少样本提示设置下进行了比较,结果表明,任务特定的微调方法在所有评估数据集上均显著且持续地优于这些基于大型语言模型的方法。代码已公开,地址为https://github.com/tongwu17/SemEval-2026-Task3-Track-A。
cs.CL / 29 / 2604.08947

MuTSE: A Human-in-the-Loop Multi-use Text Simplification Evaluator

MuTSE:一种人机协同的多用途文本简化评估器
Roscan, Rares-Alexandru, Petre1, Gabriel, Dumitran, Adrian-Marius, Dumitran, Angela-Liliana
Abstract
As Large Language Models (LLMs) become increasingly prevalent in text simplification, systematically evaluating their outputs across diverse prompting strategies and architectures remains a critical methodological challenge in both NLP research and Intelligent Tutoring Systems (ITS). Developing robust prompts is often hindered by the absence of structured, visual frameworks for comparative text analysis. While researchers typically rely on static computational scripts, educators are constrained to standard conversational interfaces -- neither paradigm supports systematic multi-dimensional evaluation of prompt-model permutations. To address these limitations, we introduce \textbf{MuTSE}\footnote{The project code and the demo have been made available for peer review at the following anonymized URL. https://osf.io/njs43/overview?view_only=4b4655789f484110a942ebb7788cdf2a, an interactive human-in-the-loop web application designed to streamline the evaluation of LLM-generated text simplifications across arbitrary CEFR proficiency targets. The system supports concurrent execution of $P \times M$ prompt-model permutations, generating a comprehensive comparison matrix in real-time. By integrating a novel tiered semantic alignment engine augmented with a linearity bias heuristic ($\lambda$), MuTSE visually maps source sentences to their simplified counterparts, reducing the cognitive load associated with qualitative analysis and enabling reproducible, structured annotation for downstream NLP dataset construction.
Chinese Translation
随着大型语言模型(Large Language Models, LLMs)在文本简化领域的日益普及,系统性地评估其在多样化提示策略和架构下的输出,仍然是自然语言处理(NLP)研究和智能辅导系统(Intelligent Tutoring Systems, ITS)中的一项关键方法学挑战。开发稳健的提示常因缺乏用于比较文本分析的结构化、可视化框架而受阻。尽管研究人员通常依赖静态的计算脚本,教育工作者则被限制在标准的对话界面中——这两种范式均不支持对提示-模型组合进行系统的多维度评估。为解决这些局限,我们提出了MuTSE(Multi-use Text Simplification Evaluator)——一个人机协同的交互式网页应用,旨在简化针对任意CEFR(欧洲语言共同参考框架)熟练度目标的LLM生成文本简化的评估流程。该系统支持同时执行P×M个提示-模型组合,实时生成全面的比较矩阵。通过集成一种新颖的分层语义对齐引擎,并辅以线性偏置启发式(λ),MuTSE能够可视化地映射源句子与其简化版本,降低定性分析的认知负担,并实现可复现的结构化标注,促进下游NLP数据集的构建。
cs.CL / 30 / 2604.08948

TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

TaxPraBen:一个可扩展的基准,用于在中国实际税务实践中对大型语言模型进行结构化评估
Hu, Gang, Chen, Yating, Ding, Haiyan, Gao, Wang, Huang, Jiajia, Peng, Min, Xie, Qianqian, Yu, Kun
Abstract
While Large Language Models (LLMs) excel in various general domains, they exhibit notable gaps in the highly specialized, knowledge-intensive, and legally regulated Chinese tax domain. Consequently, while tax-related benchmarks are gaining attention, many focus on isolated NLP tasks, neglecting real-world practical capabilities. To address this issue, we introduce TaxPraBen, the first dedicated benchmark for Chinese taxation practice. It combines 10 traditional application tasks, along with 3 pioneering real-world scenarios: tax risk prevention, tax inspection analysis, and tax strategy planning, sourced from 14 datasets totaling 7.3K instances. TaxPraBen features a scalable structured evaluation paradigm designed through process of "structured parsing-field alignment extraction-numerical and textual matching", enabling end-to-end tax practice assessment while being extensible to other domains. We evaluate 19 LLMs based on Bloom's taxonomy. The results indicate significant performance disparities: all closed-source large-parameter LLMs excel, and Chinese LLMs like Qwen2.5 generally exceed multilingual LLMs, while the YaYi2 LLM, fine-tuned with some tax data, shows only limited improvement. TaxPraBen serves as a vital resource for advancing evaluations of LLMs in practical applications.
Chinese Translation
尽管大型语言模型(LLMs)在多个通用领域表现出色,但在高度专业化、知识密集且受法律监管的中国税务领域,它们却存在显著的差距。因此,尽管与税务相关的基准测试越来越受到关注,但许多测试集中于孤立的自然语言处理任务,忽视了实际应用能力。为了解决这一问题,我们推出了TaxPraBen,这是第一个专门针对中国税务实践的基准。它结合了10个传统应用任务,以及3个开创性的真实场景:税务风险防范、税务检查分析和税务战略规划,数据来源于14个数据集,总计7.3K个实例。TaxPraBen具有可扩展的结构化评估范式,采用“结构化解析-领域对齐提取-数值与文本匹配”的过程设计,使得端到端的税务实践评估成为可能,同时也可扩展至其他领域。我们基于布鲁姆分类法(Bloom's taxonomy)对19个LLMs进行了评估。结果显示出显著的性能差异:所有闭源的大参数LLMs表现优异,而像Qwen2.5这样的中文LLMs通常超越多语言LLMs,而经过一些税务数据微调的YaYi2 LLM仅显示出有限的改进。TaxPraBen为推动LLMs在实际应用中的评估提供了重要资源。
cs.CL / 31 / 2604.08952

MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits

MAB-DQA:利用多臂赌博机解决文档问答中的查询方面重要性
Xiang, Yixin, Ma, Yunshan, Du, Xiaoyu, Chen, Yibing, Zhang, Yanxin, Tang, Jinhui
Abstract
Document Question Answering (DQA) involves generating answers from a document based on a user's query, representing a key task in document understanding. This task requires interpreting visual layouts, which has prompted recent studies to adopt multimodal Retrieval-Augmented Generation (RAG) that processes page images for answer generation. However, in multimodal RAG, visual DQA struggles to utilize a large number of images effectively, as the retrieval stage often retains only a few candidate pages (e.g., Top-4), causing informative but less visually salient content to be overlooked in favor of common yet low-information pages. To address this issue, we propose a Multi-Armed Bandit-based DQA framework (MAB-DQA) to explicitly model the varying importance of multiple implicit aspects in a query. Specifically, MAB-DQA decomposes a query into aspect-aware subqueries and retrieves an aspect-specific candidate set for each. It treats each subquery as an arm and uses preliminary reasoning results from a small number of representative pages as reward signals to estimate aspect utility. Guided by an exploration-exploitation policy, MAB-DQA dynamically reallocates retrieval budgets toward high-value aspects. With the most informative pages and their correlations, MAB-DQA generates the expected results. On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding. Code at https://github.com/ElephantOH/MAB-DQA.
Chinese Translation
文档问答(DQA)涉及根据用户查询从文档中生成答案,是文档理解中的一项关键任务。该任务需要对视觉布局进行解读,这促使近期研究采用多模态检索增强生成(RAG),处理页面图像以生成答案。然而,在多模态RAG中,视觉DQA在有效利用大量图像方面面临挑战,因为检索阶段通常仅保留少数候选页面(例如,前4页),导致信息丰富但视觉上不显著的内容被忽视,而常见但信息量低的页面则被优先考虑。为了解决这一问题,我们提出了一种基于多臂赌博机的DQA框架(MAB-DQA),以明确建模查询中多个隐含方面的重要性。具体而言,MAB-DQA将查询分解为关注方面的子查询,并为每个子查询检索特定方面的候选集。它将每个子查询视为一个“臂”,并利用少量代表性页面的初步推理结果作为奖励信号来估计方面效用。在探索-利用策略的指导下,MAB-DQA动态重新分配检索预算,以关注高价值方面。通过最具信息量的页面及其相关性,MAB-DQA生成预期结果。在四个基准测试中,MAB-DQA相较于最先进的方法平均提升了5%-18%,持续增强文档理解。代码可在 https://github.com/ElephantOH/MAB-DQA 获取。
cs.CL / 32 / 2604.08964

Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models

突破块边界:基于锚点的历史稳定解码方法用于扩散大型语言模型
Zou, Shun, Wang, Yong, Chen, Zehui, Chen, Lin, Tao, Chongyang, Zhao, Feng, Chu, Xiangxiang
Abstract
Diffusion Large Language Models (dLLMs) have recently become a promising alternative to autoregressive large language models (ARMs). Semi-autoregressive (Semi-AR) decoding is widely employed in base dLLMs and advanced decoding strategies due to its superior performance. However, our observations reveal that Semi-AR decoding suffers from inherent block constraints, which cause the decoding of many cross-block stable tokens to be unnecessarily delayed. To address this challenge, we systematically investigate the identification of stable tokens and present three key findings: (1) naive lookahead decoding is unreliable, (2) token stability closely correlates with convergence trend, and (3) historical information is isolated. Building on these insights, we propose Anchor-based History-stable Decoding (AHD), a training-free, plug-and-play dynamic decoding strategy. Specifically, AHD monitors the stability trend of tokens in real time through dynamic anchors. Once a token reaches stability, it initiates early cross-block decoding to enhance efficiency and performance. Extensive experiments across language, vision-language, and audio-language domains demonstrate that AHD simultaneously improves both performance and inference efficiency. Notably, AHD effectively reverses the performance degradation typically observed in existing advanced decoding acceleration strategies. For instance, on the BBH benchmark, our approach reduces decoding steps by 80% while improving performance by 3.67%.
Chinese Translation
扩散大型语言模型(Diffusion Large Language Models,dLLMs)近年来成为自回归大型语言模型(Autoregressive Large Language Models,ARMs)的有力替代方案。半自回归(Semi-autoregressive,Semi-AR)解码因其优越的性能被广泛应用于基础dLLMs及其先进解码策略中。然而,我们的观察表明,Semi-AR解码存在固有的块约束,导致许多跨块稳定令牌的解码被不必要地延迟。为解决这一挑战,我们系统地研究了稳定令牌的识别,并提出三项关键发现:(1)简单的前瞻解码不可靠,(2)令牌稳定性与收敛趋势密切相关,(3)历史信息存在隔离。基于这些洞见,我们提出了基于锚点的历史稳定解码(Anchor-based History-stable Decoding,AHD),这是一种无需训练、即插即用的动态解码策略。具体而言,AHD通过动态锚点实时监控令牌的稳定性趋势。一旦令牌达到稳定状态,即启动跨块提前解码,以提升效率和性能。在语言、视觉-语言及音频-语言等多个领域的大量实验表明,AHD能够同时提升性能和推理效率。值得注意的是,AHD有效逆转了现有先进解码加速策略中通常出现的性能下降。例如,在BBH基准测试中,我们的方法减少了80%的解码步骤,同时性能提升了3.67%。
cs.CL / 33 / 2604.08970

Litmus (Re)Agent: A Benchmark and Agentic System for Predictive Evaluation of Multilingual Models

Litmus (Re)Agent:用于多语言模型预测评估的基准与智能代理系统
Mittal, Avni, Kumar, Shanu, Dandapat, Sandipan, Choudhury, Monojit
Abstract
We study predictive multilingual evaluation: estimating how well a model will perform on a task in a target language when direct benchmark results are missing. This problem is common in multilingual deployment, where evaluation coverage is sparse and published evidence is uneven across languages, tasks, and model families. We introduce a controlled benchmark of 1,500 questions spanning six tasks and five evidence scenarios. The benchmark separates accessible evidence from ground truth, enabling evaluation of systems that must infer missing results from incomplete literature evidence. We also present Litmus (Re)Agent, a DAG-orchestrated agentic system that decomposes queries into hypotheses, retrieves evidence, and synthesises predictions through feature-aware aggregation. Across six systems, Litmus (Re)Agent achieves the best overall performance, with the largest gains in transfer-heavy scenarios where direct evidence is weak or absent. These results show that structured agentic reasoning is a promising approach to multilingual performance estimation under incomplete evidence.
Chinese Translation
我们研究了预测性多语言评估:在缺乏直接基准结果的情况下,估计模型在目标语言任务上的表现。该问题在多语言部署中普遍存在,评估覆盖面稀疏,且不同语言、任务和模型家族间的公开证据不均衡。我们引入了一个包含1500个问题、涵盖六个任务和五种证据场景的受控基准。该基准将可获取的证据与真实结果分离,使得系统能够在不完整的文献证据中推断缺失结果。我们还提出了Litmus (Re)Agent,一种基于有向无环图(DAG)协调的智能代理系统,该系统将查询分解为假设,检索证据,并通过特征感知的聚合合成预测。在六个系统中,Litmus (Re)Agent实现了最佳整体性能,尤其在直接证据薄弱或缺失的迁移密集场景中表现出最大提升。结果表明,结构化的智能代理推理是基于不完整证据进行多语言性能估计的有前景方法。
cs.CL / 34 / 2604.08974

Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning

对置信分数的信心:研究置信分数对监督微调的敏感性
Flores, Lorenzo Jaime Yu, di-Piano, Cesare Spinoso, Cheung, Jackie Chi Kit
Abstract
Uncertainty quantification is a set of techniques that measure confidence in language models. They can be used, for example, to detect hallucinations or alert users to review uncertain predictions. To be useful, these confidence scores must be correlated with the quality of the output. However, recent work found that fine-tuning can affect the correlation between confidence scores and quality. Hence, we investigate the underlying behavior of confidence scores to understand its sensitivity to supervised fine-tuning (SFT). We find that post-SFT, the correlation of various confidence scores degrades, which can stem from changes in confidence scores due to factors other than the output quality, such as the output's similarity to the training distribution. We demonstrate via a case study how failing to address this miscorrelation reduces the usefulness of the confidence scores on a downstream task. Our findings show how confidence metrics cannot be used off-the-shelf without testing, and motivate the need for developing metrics which are more robust to fine-tuning.
Chinese Translation
不确定性量化是一组测量语言模型置信度的技术。例如,它们可以用于检测幻觉或提醒用户审查不确定的预测。为了具有实用性,这些置信分数必须与输出质量相关。然而,最近的研究发现,微调可能会影响置信分数与质量之间的相关性。因此,我们研究置信分数的基本行为,以理解其对监督微调(SFT)的敏感性。我们发现,在SFT之后,各种置信分数的相关性下降,这可能源于由于其他因素(如输出与训练分布的相似性)而导致的置信分数变化,而非输出质量。通过案例研究,我们展示了未能解决这种误相关如何降低置信分数在下游任务中的实用性。我们的发现表明,置信度指标不能直接使用而不进行测试,并促使开发对微调更具鲁棒性的指标的必要性。
cs.CL / 35 / 2604.08976

Quantisation Reshapes the Metacognitive Geometry of Language Models

量化重塑语言模型的元认知几何结构
Cacioli, Jon-Paul
Abstract
We report that model quantisation restructures domain-level metacognitive efficiency in LLMs rather than degrading it uniformly. Evaluating Llama-3-8B-Instruct on the same 3,000 questions at Q5_K_M and f16 precision, we find that M-ratio profiles across four knowledge domains are uncorrelated between formats (Spearman rho = 0.00). Arts & Literature moves from worst-monitored (M-ratio = 0.606 at Q5_K_M) to best-monitored (1.542 at f16). Geography moves from well-monitored (1.210) to under-monitored (0.798). However, Type-2 AUROC profiles are perfectly stable across formats (rho = 1.00), localising the restructuring to the M-ratio normalisation rather than the underlying discrimination signal. This finding emerged from a pre-registered attempt to improve metacognition through domain-conditional training. We prescribed confidence-amplification SFT for the diagnosed weak domain, with matched-budget agnostic and wrong-prescription controls. All four confirmatory hypotheses were null (10,000 bootstrap resamples, seed = 42). The training successfully reshaped confidence distributions, doubling the NLP gap in Science from 0.076 to 0.152, but did not improve meta-d' because the diagnostic profile did not transfer across formats. Any system relying on domain-level M-ratio profiles has an unexamined dependency on inference format. Systems using AUROC_2 are safer. We release all code, pre-registrations, and trial-level data.
Chinese Translation
我们报告称,模型量化重构了大型语言模型(LLMs)在领域层面的元认知效率,而不是均匀地降低它。在Q5_K_M和f16精度下对Llama-3-8B-Instruct在同一3,000个问题上的评估显示,四个知识领域的M比率在不同格式之间不相关(Spearman rho = 0.00)。艺术与文学领域的监控情况从最差(M比率 = 0.606在Q5_K_M)转变为最好(1.542在f16)。地理领域则从良好监控(1.210)转变为监控不足(0.798)。然而,Type-2 AUROC曲线下的稳定性在不同格式间完全一致(rho = 1.00),将重构局限于M比率的归一化,而非基础的辨别信号。该发现源于一次预注册的尝试,旨在通过领域条件训练来改善元认知。我们为诊断出的弱领域规定了信心增强的监督微调(SFT),并设置了匹配预算的无偏和错误处方对照组。所有四个确认假设均为零假设(10,000次自助重抽样,种子 = 42)。训练成功重塑了信心分布,将科学领域的NLP差距从0.076翻倍至0.152,但未能改善meta-d',因为诊断特征未能在不同格式间转移。任何依赖于领域层面M比率的系统都存在未被检验的推理格式依赖性。使用AUROC_2的系统更为安全。我们发布了所有代码、预注册信息和试验级数据。
cs.CL / 36 / 2604.08977

Testing the Assumptions of Active Learning for Translation Tasks with Few Samples

在少样本翻译任务中检验主动学习假设
Flores, Lorenzo Jaime Yu, di-Piano, Cesare Spinoso, Ernst, Ori, Adelani, David Ifeoluwa, Cheung, Jackie Chi Kit
Abstract
Active learning (AL) is a training paradigm for selecting unlabeled samples for annotation to improve model performance on a test set, which is useful when only a limited number of samples can be annotated. These algorithms often work by optimizing for the informativeness and diversity of the training data to be annotated. Recent work found that AL strategies fail to outperform random sampling on various language generation tasks when using 100-500 samples. To understand AL's poor performance when only using few samples, we investigate whether the core assumptions underlying AL strategies hold. We find that neither the informativeness nor diversity of the training data, which AL strategies optimize for, are correlated with test set performance. Instead, factors like the ordering of the training samples and interactions with pre-training data have a larger impact on performance. This suggests that future AL methods must take these factors into account in order to work with very few samples.
Chinese Translation
主动学习(Active Learning,AL)是一种训练范式,通过选择未标注样本进行注释,以提升模型在测试集上的表现,适用于只能注释有限样本的情形。这些算法通常通过优化待注释训练数据的信息量和多样性来实现。近期研究发现,在使用100至500个样本时,AL策略在多种语言生成任务中未能优于随机采样。为理解AL在少样本条件下表现不佳的原因,我们探讨了AL策略所依赖的核心假设是否成立。结果表明,AL策略所优化的信息量和多样性与测试集性能并无相关性。相反,训练样本的排序及与预训练数据的交互对性能影响更大。这表明未来的AL方法必须考虑这些因素,才能在极少样本情况下有效工作。
cs.CL / 37 / 2604.08986

PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment

PerMix-RLVR:在可验证奖励对齐下保持角色表达力
Oh, Jihwan, Oh, Soowon, Aghazada, Murad, Jeong, Minchan, Kim, Sungnyun, Yun, Se-Young
Abstract
Persona prompting has been widely adopted to steer large language models (LLMs) behavior and improve their instruction performance by assigning specific characters. However, identifying an optimal persona is time-consuming, and its impact on output quality remains poorly understood. Prior work has mainly addressed this issue at the prompt level via inference-time strategies, incurring additional computation. In this work, we avoid inference-time prompt search by tackling persona sensitivity during training, aiming to train models that adapt their behavior to diverse personas while preserving task performance. In particular, we find that reinforcement learning with verifiable rewards (RLVR) systematically reduces sensitivity to persona prompts, but also reveals an inherent trade-off of outcome-based optimization: while RLVR improves robustness on tasks with verifiable goals, it can also degrade persona expressivity when needed, e.g., in-character role-playing. To address this limitation, we propose PerMix-RLVR, a persona-mixed RLVR strategy that mitigates the persona robustness-fidelity trade-off, preserving strong robustness to harmful persona variation while enabling faithful persona adoption when required. Concretely, PerMix-RLVR improves persona stability score (PSS) over RLVR by +21.2% on MATH500, while also enhancing persona fidelity by +11.4% on PersonaGym.
Chinese Translation
角色提示(Persona prompting)已被广泛应用于引导大型语言模型(LLMs)的行为并提升其指令执行性能,通过赋予特定角色身份。然而,确定最优角色身份耗时较长,其对输出质量的影响尚未被充分理解。以往研究主要通过推理时策略在提示层面解决该问题,导致额外计算开销。本研究通过在训练阶段处理角色敏感性,避免了推理时的提示搜索,旨在训练能够适应多样角色身份且保持任务性能的模型。具体而言,我们发现基于可验证奖励的强化学习(RLVR)系统性地降低了对角色提示的敏感性,但也揭示了基于结果优化的内在权衡:虽然RLVR提升了在具有可验证目标任务上的鲁棒性,但在需要表现角色特征的场景(如角色扮演)中可能削弱角色表达力。为解决该限制,我们提出了PerMix-RLVR,一种角色混合的RLVR策略,缓解了角色鲁棒性与忠实度之间的权衡,在保持对有害角色变化的强鲁棒性的同时,实现了在必要时对角色的忠实采纳。具体来说,PerMix-RLVR在MATH500数据集上将角色稳定性评分(PSS)较RLVR提升了21.2%,同时在PersonaGym上提升了11.4%的角色忠实度。
cs.CL / 38 / 2604.08999

ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering

ASTRA:用于复杂表格问答的自适应语义树推理架构
Guo, Xiaoke, Li, Songze, Liu, Zhiqiang, Gong, Zhaoyan, Liu, Yuanxiang, Chen, Huajun, Zhang, Wen
Abstract
Table serialization remains a critical bottleneck for Large Language Models (LLMs) in complex table question answering, hindered by challenges such as structural neglect, representation gaps, and reasoning opacity. Existing serialization methods fail to capture explicit hierarchies and lack schema flexibility, while current tree-based approaches suffer from limited semantic adaptability. To address these limitations, we propose ASTRA (Adaptive Semantic Tree Reasoning Architecture) including two main modules, AdaSTR and DuTR. First, we introduce AdaSTR, which leverages the global semantic awareness of LLMs to reconstruct tables into Logical Semantic Trees. This serialization explicitly models hierarchical dependencies and employs an adaptive mechanism to optimize construction strategies based on table scale. Second, building on this structure, we present DuTR, a dual-mode reasoning framework that integrates tree-search-based textual navigation for linguistic alignment and symbolic code execution for precise verification. Experiments on complex table benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance.
Chinese Translation
表格序列化仍然是大型语言模型(LLMs)在复杂表格问答中的一个关键瓶颈,受到结构忽视、表示差距和推理不透明等挑战的阻碍。现有的序列化方法未能捕捉明确的层次结构,并且缺乏模式灵活性,而当前基于树的方法在语义适应性方面也存在局限性。为了解决这些局限性,我们提出了ASTRA(自适应语义树推理架构),包括两个主要模块:AdaSTR和DuTR。首先,我们介绍了AdaSTR,它利用LLMs的全局语义意识将表格重构为逻辑语义树。这种序列化明确建模了层次依赖关系,并采用自适应机制根据表格规模优化构建策略。其次,在此结构基础上,我们提出了DuTR,一个双模式推理框架,集成了基于树搜索的文本导航以实现语言对齐,以及符号代码执行以进行精确验证。在复杂表格基准上的实验表明,我们的方法实现了最先进的(SOTA)性能。
cs.CL / 39 / 2604.09008

Towards Linguistically-informed Representations for English as a Second or Foreign Language: Review, Construction and Application

面向语言学信息的英语作为第二语言或外语的表征:回顾、构建与应用
Li, Wenxi, Wang, Xihao, Sun, Weiwei
Abstract
The widespread use of English as a Second or Foreign Language (ESFL) has sparked a paradigm shift: ESFL is not seen merely as a deviation from standard English but as a distinct linguistic system in its own right. This shift highlights the need for dedicated, knowledge-intensive representations of ESFL. In response, this paper surveys existing ESFL resources, identifies their limitations, and proposes a novel solution. Grounded in constructivist theories, the paper treats constructions as the fundamental units of analysis, allowing it to model the syntax--semantics interface of both ESFL and standard English. This design captures a wide range of ESFL phenomena by referring to syntactico-semantic mappings of English while preserving ESFL's unique characteristics, resulting a gold-standard syntactico-semantic resource comprising 1643 annotated ESFL sentences. To demonstrate the sembank's practical utility, we conduct a pilot study testing the Linguistic Niche Hypothesis, highlighting its potential as a valuable tool in Second Language Acquisition research.
Chinese Translation
英语作为第二语言或外语(ESFL)的广泛使用引发了范式转变:ESFL不再仅仅被视为标准英语的偏差,而是作为一种独立的语言系统。这一转变突显了对专门、知识密集型ESFL表征的需求。为此,本文对现有的ESFL资源进行了调查,识别了其局限性,并提出了一种新颖的解决方案。基于建构主义理论,本文将构式视为分析的基本单元,从而能够建模ESFL与标准英语的句法-语义接口。该设计通过参考英语的句法-语义映射,捕捉了广泛的ESFL现象,同时保留了ESFL的独特特征,最终形成了一个包含1643个注释ESFL句子的黄金标准句法-语义资源。为了展示该资源的实际效用,我们进行了一个试点研究,测试语言生态位假设,突显其作为第二语言习得研究中有价值工具的潜力。
cs.CL / 40 / 2604.09029

CONDESION-BENCH: Conditional Decision-Making of Large Language Models in Compositional Action Space

CONDESION-BENCH:大规模语言模型在组合动作空间中的条件决策
Hwang, Yeonjun, Park, Sungyong, Kim, Minju, Lee, Dongha, Yeo, Jinyoung
Abstract
Large language models have been widely explored as decision-support tools in high-stakes domains due to their contextual understanding and reasoning capabilities. However, existing decision-making benchmarks rely on two simplifying assumptions: actions are selected from a finite set of pre-defined candidates, and explicit conditions restricting action feasibility are not incorporated into the decision-making process. These assumptions fail to capture the compositional structure of real-world actions and the explicit conditions that constrain their validity. To address these limitations, we introduce CONDESION-BENCH, a benchmark designed to evaluate conditional decision-making in compositional action space. In CONDESION-BENCH, actions are defined as allocations to decision variables and are restricted by explicit conditions at the variable, contextual, and allocation levels. By employing oracle-based evaluation of both decision quality and condition adherence, we provide a more rigorous assessment of LLMs as decision-support tools.
Chinese Translation
由于具备上下文理解和推理能力,大规模语言模型已被广泛探索作为高风险领域的决策支持工具。然而,现有的决策基准测试依赖于两个简化假设:动作从有限的预定义候选集中选择,且决策过程中未纳入限制动作可行性的显式条件。这些假设未能反映现实世界动作的组合结构及其有效性约束的显式条件。为解决这些局限性,我们提出了CONDESION-BENCH,一个旨在评估组合动作空间中条件决策的基准测试。在CONDESION-BENCH中,动作被定义为对决策变量的分配,并受到变量级、上下文级和分配级显式条件的限制。通过采用基于oracle的决策质量与条件遵守度评估方法,我们为大规模语言模型作为决策支持工具提供了更为严谨的评估。
cs.CL / 41 / 2604.09066

Anchored Sliding Window: Toward Robust and Imperceptible Linguistic Steganography

锚定滑动窗口:迈向鲁棒且难以察觉的语言隐写术
Yan, Ruiyi, Meng, Shiao, Murawaki, Yugo
Abstract
Linguistic steganography based on language models typically assumes that steganographic texts are transmitted without alteration, making them fragile to even minor modifications. While previous work mitigates this fragility by limiting the context window, it significantly compromises text quality. In this paper, we propose the anchored sliding window (ASW) framework to improve imperceptibility and robustness. In addition to the latest tokens, the prompt and a bridge context are anchored within the context window, encouraging the model to compensate for the excluded tokens. We formulate the optimization of the bridge context as a variant of prompt distillation, which we further extend using self-distillation strategies. Experiments show that our ASW significantly and consistently outperforms the baseline method in text quality, imperceptibility, and robustness across diverse settings. The code is available at github.com/ryehr/ASW_steganography.
Chinese Translation
基于语言模型的语言隐写术通常假设隐写文本在传输过程中不被修改,因此即使是轻微的改动也会导致文本脆弱。尽管以往工作通过限制上下文窗口来缓解这种脆弱性,但这显著降低了文本质量。本文提出了锚定滑动窗口(Anchored Sliding Window,ASW)框架,以提升隐写文本的难以察觉性和鲁棒性。除了最新生成的词元外,提示词(prompt)和桥接上下文(bridge context)也被锚定在上下文窗口内,促使模型对被排除的词元进行补偿。我们将桥接上下文的优化形式化为提示蒸馏(prompt distillation)的一种变体,并进一步采用自蒸馏(self-distillation)策略进行扩展。实验结果表明,ASW在文本质量、难以察觉性和鲁棒性方面,在多种设置下均显著且持续优于基线方法。代码已开源,地址为 github.com/ryehr/ASW_steganography。
cs.CL / 42 / 2604.09069

NyayaMind- A Framework for Transparent Legal Reasoning and Judgment Prediction in the Indian Legal System

NyayaMind - 印度法律系统中透明法律推理与判决预测的框架
Shukla, Parjanya Aditya, Nigam, Shubham Kumar, Datta, Debtanu, Patnaik, Balaramamahanthi Deepak, Shallum, Noel, Vanga, Pradeep Reddy, Ghosh, Saptarshi, Bhattacharya, Arnab
Abstract
Court Judgment Prediction and Explanation (CJPE) aims to predict a judicial decision and provide a legally grounded explanation for a given case based on the facts, legal issues, arguments, cited statutes, and relevant precedents. For such systems to be practically useful in judicial or legal research settings, they must not only achieve high predictive performance but also generate transparent and structured legal reasoning that aligns with established judicial practices. In this work, we present NyayaMind, an open-source framework designed to enable transparent and scalable legal reasoning for the Indian judiciary. The proposed framework integrates retrieval, reasoning, and verification mechanisms to emulate the structured decision-making process typically followed in courts. Specifically, NyayaMind consists of two main components: a Retrieval Module and a Prediction Module. The Retrieval Module employs a RAG pipeline to identify legally relevant statutes and precedent cases from large-scale legal corpora, while the Prediction Module utilizes reasoning-oriented LLMs fine-tuned for the Indian legal domain to generate structured outputs including issues, arguments, rationale, and the final decision. Our extensive results and expert evaluation demonstrate that NyayaMind significantly improves the quality of explanation and evidence alignment compared to existing CJPE approaches, providing a promising step toward trustworthy AI-assisted legal decision support systems.
Chinese Translation
法院判决预测与解释(CJPE)旨在根据案件的事实、法律问题、论点、引用的法规和相关判例,预测司法决定并提供法律依据的解释。为了使此类系统在司法或法律研究环境中具有实用性,它们不仅需要实现高预测性能,还必须生成与既定司法实践相一致的透明和结构化的法律推理。在本研究中,我们提出了NyayaMind,这是一个旨在为印度司法系统提供透明和可扩展法律推理的开源框架。该框架集成了检索、推理和验证机制,以模拟法院通常遵循的结构化决策过程。具体而言,NyayaMind由两个主要组件组成:检索模块和预测模块。检索模块采用RAG(检索增强生成)管道,从大规模法律语料库中识别法律相关的法规和判例,而预测模块则利用针对印度法律领域进行微调的面向推理的大型语言模型(LLMs),生成包括问题、论点、理由和最终决定在内的结构化输出。我们的广泛结果和专家评估表明,NyayaMind在解释质量和证据一致性方面显著优于现有的CJPE方法,为可信赖的AI辅助法律决策支持系统迈出了有希望的一步。
cs.CL / 43 / 2604.09075

Hierarchical Alignment: Enforcing Hierarchical Instruction-Following in LLMs through Logical Consistency

层级对齐:通过逻辑一致性强化大语言模型的层级指令遵循
Yang, Shu, Zhou, Zihao, Wang, Di, Li, Wenda
Abstract
Large language models increasingly operate under multiple instructions from heterogeneous sources with different authority levels, including system policies, user requests, tool outputs, and retrieved context. While prior work on instruction hierarchy highlights the importance of respecting instruction priorities, it mainly focuses on adversarial attacks and overlooks the benign but common instruction conflicts that arise in real-world applications. In such settings, models must not only avoid security violations but also preserve task utility and behavioral consistency when instructions partially or implicitly conflict. We propose Neuro-Symbolic Hierarchical Alignment (NSHA) for hierarchical instruction-following by explicitly modeling and enforcing instruction priorities. At inference time, we introduce solver-guided reasoning that formulates instruction resolution as a constraint satisfaction problem, enabling the model to derive a maximally consistent set of applicable instructions under hierarchical constraints. At training time, NSHA distills solver-based decisions into model parameters using automatically constructed supervision. We evaluate our approach on rule following, task execution, tool use, and safety, covering both single-turn and multi-turn interactions, and show that NSHA significantly improves performance under such conflicts while maintaining competitive utility in reference settings.
Chinese Translation
大型语言模型越来越多地在来自不同权威级别的异构来源下执行多条指令,包括系统策略、用户请求、工具输出和检索上下文。尽管先前关于指令层级的研究强调了尊重指令优先级的重要性,但主要关注对抗性攻击,忽视了现实应用中常见的良性指令冲突。在此类场景中,模型不仅需要避免安全违规,还必须在指令部分或隐含冲突时保持任务效用和行为一致性。我们提出了神经符号层级对齐(Neuro-Symbolic Hierarchical Alignment,NSHA)方法,通过显式建模和执行指令优先级,实现层级指令遵循。在推理阶段,我们引入了求解器引导推理,将指令解析问题形式化为约束满足问题,使模型能够在层级约束下推导出最大一致性的适用指令集合。在训练阶段,NSHA利用自动构建的监督信号,将基于求解器的决策蒸馏到模型参数中。我们在规则遵循、任务执行、工具使用和安全性等方面进行了评估,涵盖单轮和多轮交互,结果表明NSHA在指令冲突情况下显著提升了性能,同时在参考设置中保持了竞争性的效用表现。
cs.CL / 44 / 2604.09121

Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

交互式自动语音识别(ASR):迈向类人交互与语义一致性评估的智能代理式语音识别
Wang, Peng, Zhu, Yanqiao, Jiang, Zixuan, Chen, Qinyuan, Zhao, Xingjian, Qiu, Xipeng, Wang, Wupeng, Gao, Zhifu, Li, Xiangang, Yu, Kai, Chen, Xie
Abstract
Recent years have witnessed remarkable progress in automatic speech recognition (ASR), driven by advances in model architectures and large-scale training data. However, two important aspects remain underexplored. First, Word Error Rate (WER), the dominant evaluation metric for decades, treats all words equally and often fails to reflect the semantic correctness of an utterance at the sentence level. Second, interactive correction-an essential component of human communication-has rarely been systematically studied in ASR research. In this paper, we integrate these two perspectives under an agentic framework for interactive ASR. We propose leveraging LLM-as-a-Judge as a semantic-aware evaluation metric to assess recognition quality beyond token-level accuracy. Furthermore, we design an LLM-driven agent framework to simulate human-like multi-turn interaction, enabling iterative refinement of recognition outputs through semantic feedback. Extensive experiments are conducted on standard benchmarks, including GigaSpeech (English), WenetSpeech (Chinese), the ASRU 2019 code-switching test set. Both objective and subjective evaluations demonstrate the effectiveness of the proposed framework in improving semantic fidelity and interactive correction capability. We will release the code to facilitate future research in interactive and agentic ASR.
Chinese Translation
近年来,得益于模型架构的进步和大规模训练数据的支持,自动语音识别(ASR)取得了显著进展。然而,有两个重要方面尚未得到充分探索。首先,作为几十年来的主导评估指标,词错误率(WER)对所有词汇一视同仁,常常无法反映句子层面的语义正确性。其次,交互式纠正作为人类交流的核心组成部分,在ASR研究中鲜有系统性探讨。本文将这两个视角整合于一个智能代理框架下的交互式ASR中。我们提出利用大语言模型(LLM)作为评判者(LLM-as-a-Judge),作为一种具备语义感知的评估指标,以超越单纯的词元级准确率评估识别质量。此外,我们设计了一个由LLM驱动的智能代理框架,模拟类人多轮交互,通过语义反馈实现识别结果的迭代优化。在包括GigaSpeech(英语)、WenetSpeech(中文)及ASRU 2019代码切换测试集等标准基准上进行了大量实验。客观和主观评估均表明所提框架在提升语义忠实度和交互纠正能力方面的有效性。我们将开源代码以促进未来交互式及智能代理式ASR的研究。
cs.CL / 45 / 2604.09123

Prototype-Regularized Federated Learning for Cross-Domain Aspect Sentiment Triplet Extraction

原型正则化的联邦学习用于跨域方面情感三元组提取
Cai, Zongming, Tang, Jianhang, Zhang, Zhenyong, Qin, Jinghui, Jin, Kebing, Zhuo, Hankz Hankui
Abstract
Aspect Sentiment Triplet Extraction (ASTE) aims to extract all sentiment triplets of aspect terms, opinion terms, and sentiment polarities from a sentence. Existing methods are typically trained on individual datasets in isolation, failing to jointly capture the common feature representations shared across domains. Moreover, data privacy constraints prevent centralized data aggregation. To address these challenges, we propose Prototype-based Cross-Domain Span Prototype extraction (PCD-SpanProto), a prototype-regularized federated learning framework to enable distributed clients to exchange class-level prototypes instead of full model parameters. Specifically, we design a weighted performance-aware aggregation strategy and a contrastive regularization module to improve the global prototype under domain heterogeneity and the promotion between intra-class compactness and inter-class separability across clients. Extensive experiments on four ASTE datasets demonstrate that our method outperforms baselines and reduces communication costs, validating the effectiveness of prototype-based cross-domain knowledge transfer.
Chinese Translation
方面情感三元组提取(ASTE)旨在从句子中提取所有方面术语、意见术语和情感极性的情感三元组。现有方法通常在各自的数据集上独立训练,未能共同捕捉跨域共享的特征表示。此外,数据隐私限制阻止了集中数据的聚合。为了解决这些挑战,我们提出了基于原型的跨域跨度原型提取(PCD-SpanProto),这是一种原型正则化的联邦学习框架,旨在使分布式客户端能够交换类级原型,而不是完整的模型参数。具体而言,我们设计了一种加权性能感知聚合策略和一个对比正则化模块,以提高在领域异质性下的全局原型,并促进客户端之间类内紧凑性和类间可分离性的平衡。在四个ASTE数据集上的大量实验表明,我们的方法优于基线,并降低了通信成本,验证了基于原型的跨域知识转移的有效性。
cs.CL / 46 / 2604.09150

Think Less, Know More: State-Aware Reasoning Compression with Knowledge Guidance for Efficient Reasoning

少思考,多了解:基于状态感知的知识引导推理压缩以实现高效推理
Sui, Yi, Li, Chaozhuo, Song, Dawei
Abstract
Large Reasoning Models (LRMs) achieve strong performance on complex tasks by leveraging long Chain-of-Thought (CoT), but often suffer from overthinking, leading to excessive reasoning steps and high inference latency. Existing CoT compression methods struggle to balance accuracy and efficiency, and lack fine-grained, step-level adaptation to redundancy and reasoning bias. Therefore, we propose State-Aware Reasoning Compression with Knowledge Guidance (STACK), a framework that performs step-wise CoT compression by explicitly modeling stage-specific redundancy sources and integrating with a retrieval-augmented guidance. STACK constructs online long-short contrastive samples and dynamically switches between knowledge-guided compression for uncertain or biased reasoning state and self-prompted compression for overly long but confident state, complemented by an answer-convergence-based early stopping mechanism to suppress redundant verification. We further propose a reward-difference-driven training strategy by combining Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), enabling models to learn state-conditioned compression strategies. Experiments on three mathematical reasoning benchmarks show that STACK achieves a superior accuracy-efficiency balance, reducing average response length by 59.9% while improving accuracy by 4.8 points over existing methods.
Chinese Translation
大型推理模型(LRMs)通过利用长链式思维(CoT)在复杂任务上取得了强大的性能,但往往会出现过度思考的问题,导致推理步骤过多和推理延迟过高。现有的链式思维压缩方法难以在准确性和效率之间取得平衡,且缺乏对冗余和推理偏差的细粒度、逐步适应。因此,我们提出了一种基于状态感知的知识引导推理压缩框架(STACK),该框架通过明确建模阶段特定的冗余源并与检索增强引导相结合,执行逐步的链式思维压缩。STACK构建在线的长短对比样本,并在不确定或偏见的推理状态下动态切换到知识引导压缩,在过长但自信的状态下切换到自我提示压缩,并辅以基于答案收敛的提前停止机制以抑制冗余验证。我们进一步提出了一种基于奖励差异的训练策略,通过结合近端策略优化(PPO)和直接偏好优化(DPO),使模型能够学习状态条件下的压缩策略。在三个数学推理基准上的实验表明,STACK在准确性和效率之间达到了优越的平衡,平均响应长度减少了59.9%,同时准确性提高了4.8个百分点,相较于现有方法表现更佳。
cs.CL / 47 / 2604.09162

Persona-E$^2$: A Human-Grounded Dataset for Personality-Shaped Emotional Responses to Textual Events

Persona-E$^2$: 基于人类的个性化情感反应数据集
Yang, Yuqin, Zhou, Haowu, Tu, Haoran, Hui, Zhiwen, Yan, Shiqi, Li, HaoYang, She, Dong, Yao, Xianrong, Gao, Yang, Jin, Zhanpeng
Abstract
Most affective computing research treats emotion as a static property of text, focusing on the writer's sentiment while overlooking the reader's perspective. This approach ignores how individual personalities lead to diverse emotional appraisals of the same event. Although role-playing Large Language Models (LLMs) attempt to simulate such nuanced reactions, they often suffer from "personality illusion'' -- relying on surface-level stereotypes rather than authentic cognitive logic. A critical bottleneck is the absence of ground-truth human data to link personality traits to emotional shifts. To bridge the gap, we introduce Persona-E$^2$ (Persona-Event2Emotion), a large-scale dataset grounded in annotated MBTI and Big Five traits to capture reader-based emotional variations across news, social media, and life narratives. Extensive experiments reveal that state-of-the-art LLMs struggle to capture precise appraisal shifts, particularly in social media domains. Crucially, we find that personality information significantly improves comprehension, with the Big Five traits alleviating "personality illusion.'
Chinese Translation
大多数情感计算研究将情感视为文本的静态属性,关注作者的情感而忽视读者的视角。这种方法忽略了个体个性如何导致对同一事件的多样化情感评估。尽管角色扮演的大型语言模型(LLMs)试图模拟这种细微的反应,但它们常常遭遇“个性幻觉”(personality illusion)——依赖于表面刻板印象而非真实的认知逻辑。一个关键瓶颈是缺乏真实的人类数据来将个性特征与情感变化联系起来。为了解决这一问题,我们引入了Persona-E$^2$(Persona-Event2Emotion),这是一个大规模的数据集,基于标注的MBTI和五大性格特征,旨在捕捉读者在新闻、社交媒体和生活叙事中的情感变化。广泛的实验表明,最先进的LLMs在捕捉精确的评估变化方面存在困难,尤其是在社交媒体领域。重要的是,我们发现个性信息显著提高了理解能力,而五大性格特征则减轻了“个性幻觉”。
cs.CL / 48 / 2604.09174

Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG

RAG中证据不确定性和幻觉的层面级追踪
Elchafei, Passant, Swain, Monorama, Masoudian, Shahed, Schedl, Markus
Abstract
Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet x Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three open-source and closed-source LLMs (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.
Chinese Translation
检索增强生成(RAG)旨在通过将答案基于检索到的证据进行扎根,从而减少幻觉现象,然而即使在相关文档可用的情况下,幻觉答案仍然很常见。现有评估主要集中在答案级或段落级的准确性上,提供了有限的洞察力来了解在生成过程中证据的使用情况。在本研究中,我们引入了一种针对问答(QA)的层面级诊断框架,将每个输入问题分解为原子推理层面。对于每个层面,我们使用结构化的层面 x 块矩阵评估证据的充分性和扎根性,该矩阵结合了检索相关性与基于自然语言推理的可信度评分。为了诊断证据的使用情况,我们分析了三种受控推理模式:严格RAG(Strict RAG),该模式强制完全依赖于检索到的证据;软RAG(Soft RAG),该模式允许整合检索到的证据和参数知识;以及仅使用大型语言模型(LLM)生成而不进行检索。比较这些模式使我们能够全面分析检索与生成之间的不一致性,定义为在生成过程中检索到相关证据但未能正确整合的情况。在医学问答和HotpotQA数据集上,我们评估了三种开源和闭源的LLM(GPT、Gemini和LLaMA),提供了可解释的诊断,揭示了反复出现的层面级失败模式,包括证据缺失、证据不一致和先前驱动的覆盖。我们的结果表明,RAG系统中的幻觉现象与检索准确性关系不大,而更多地取决于在生成过程中如何整合检索到的证据,层面级分析揭示了系统性的证据覆盖和不一致模式,这些模式在答案级评估中是隐藏的。
cs.CL / 49 / 2604.09189

Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies

大型语言模型是否遵循自身规则?对自述安全政策的反思性审计
Mittal, Avni
Abstract
LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model's self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), and (3) measures behavioral compliance via deterministic comparison against harm benchmarks. Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior: models claiming absolute refusal frequently comply with harmful prompts, reasoning models achieve the highest self-consistency but fail to articulate policies for 29% of categories, and cross-model agreement on rule types is remarkably low (11%). These results demonstrate that the gap between what LLMs say and what they do is measurable and architecture-dependent, motivating reflexive consistency audits as a complement to behavioral benchmarks.
Chinese Translation
大型语言模型(LLMs)通过强化学习与人类反馈(RLHF)内化安全政策,然而这些政策从未被正式规范,且难以检视。现有基准测试评估模型是否符合外部标准,但未衡量模型是否理解并执行其自身声明的边界。我们提出符号-神经一致性审计(Symbolic-Neural Consistency Audit,SNCA)框架,该框架(1)通过结构化提示提取模型自述的安全规则,(2)将其形式化为类型谓词(绝对型、条件型、自适应型),并(3)通过与危害基准的确定性比较衡量行为合规性。对四个前沿模型在45个危害类别和47,496条观察数据上的评估揭示了声明政策与实际行为之间的系统性差距:声称绝对拒绝的模型频繁响应有害提示,推理模型虽实现最高的自洽性,但29%的类别未能明确政策,且跨模型对规则类型的共识极低(仅11%)。这些结果表明,LLMs言行之间的差距是可测量且依赖于模型架构的,促使反思性一致性审计成为行为基准测试的重要补充。
cs.CL / 50 / 2604.09212

SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation

SPASM:基于稳定人格驱动的多轮对话生成代理模拟
Luo, Han, Laban, Guy
Abstract
Large language models are increasingly deployed in multi-turn settings such as tutoring, support, and counseling, where reliability depends on preserving consistent roles, personas, and goals across long horizons. This requirement becomes critical when LLMs are used to generate synthetic dialogues for training and evaluation, since LLM--LLM conversations can accumulate identity-related failures such as persona drift, role confusion, and "echoing", where one agent gradually mirrors its partner. We introduce SPASM (Stable Persona-driven Agent Simulation for Multi-turn dialogue generation), a modular, stability-first framework that decomposes simulation into (i) persona creation via schema sampling, plausibility validation, and natural-language persona crafting, (ii) Client--Responder dialogue generation, and (iii) termination detection for coherent stopping. To improve long-horizon stability without changing model weights, we propose Egocentric Context Projection (ECP): dialogue history is stored in a perspective-agnostic representation and deterministically projected into each agent's egocentric view before generation. Across three LLM backbones (GPT-4o-mini, DeepSeek-V3.2, Qwen-Plus) and nine Client--Responder pairings, we construct a dataset of 4,500 personas and 45,000 conversations (500 personas X 10 conversations per pairing). Ablations show ECP substantially reduces persona drift and, under human validation, eliminates echoing; embedding analyses recover persona structure and reveal strong responder-driven interaction geometry. Our code is available at https://github.com/lhannnn/SPASM.
Chinese Translation
大型语言模型在辅导、支持和咨询等多轮对话场景中的应用日益增多,其可靠性依赖于在长时间跨度内保持一致的角色、人格和目标。当大型语言模型用于生成合成对话以进行训练和评估时,这一要求变得尤为重要,因为大型语言模型之间的对话可能会积累与身份相关的失败,例如人格漂移、角色混淆和“回声”现象,其中一个代理逐渐模仿其伙伴的行为。我们提出了SPASM(基于稳定人格驱动的多轮对话生成代理模拟),这是一个模块化的、以稳定性为优先的框架,将模拟分解为(i)通过模式采样、合理性验证和自然语言人格构建进行的人格创建,(ii)客户-响应者对话生成,以及(iii)用于一致停止的终止检测。为了在不改变模型权重的情况下提高长时间跨度的稳定性,我们提出了自我中心上下文投影(Egocentric Context Projection,ECP):对话历史以视角无关的表示形式存储,并在生成之前确定性地投影到每个代理的自我中心视图中。在三个大型语言模型基础(GPT-4o-mini、DeepSeek-V3.2、Qwen-Plus)和九对客户-响应者配对中,我们构建了一个包含4500个人格和45000个对话的数据集(500个人格 × 每对配对10个对话)。消融实验表明,ECP显著减少了人格漂移,并在人工验证下消除了回声现象;嵌入分析恢复了人格结构,并揭示了强烈的响应者驱动的交互几何。我们的代码可在 https://github.com/lhannnn/SPASM 获取。
cs.CL / 51 / 2604.09237

ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery

ScheMatiQ:通过交互式模式发现从研究问题到结构化数据
Levy, Shahar, Habba, Eliya, Mintz, Reshef, Raveh, Barak, Keydar, Renana, Stanovsky, Gabriel
Abstract
Many disciplines pose natural-language research questions over large document collections whose answers typically require structured evidence, traditionally obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which leverages calls to a backbone LLM to take a question and a corpus to produce a schema and a grounded database, with a web interface that lets steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: www.ScheMatiQ-ai.com
Chinese Translation
许多学科对大型文档集合提出自然语言研究问题,其答案通常需要结构化证据,这通常通过手动设计注释模式并对语料库进行全面标注来获得,这一过程既缓慢又容易出错。我们介绍了ScheMatiQ,它利用对主干大型语言模型(LLM)的调用,将问题和语料库转化为模式和基于证据的数据库,并提供一个网络界面,允许用户引导和修订提取过程。通过与领域专家的合作,我们展示了ScheMatiQ产生的输出支持法律和计算生物学中的实际分析。我们将ScheMatiQ作为开源项目发布,并提供公共网络界面,邀请各学科的专家使用自己的数据。所有资源,包括网站、源代码和演示视频,均可在:www.ScheMatiQ-ai.com 获取。
cs.CL / 52 / 2604.09265

EthicMind: A Risk-Aware Framework for Ethical-Emotional Alignment in Multi-Turn Dialogue

EthicMind:一种风险感知框架,用于多轮对话中的伦理-情感对齐
Deng, Jiawen, Li, Wei, Zhang, Wentao, Jiao, Ziyun, Ren, Fuji
Abstract
Intelligent dialogue systems are increasingly deployed in emotionally and ethically sensitive settings, where failures in either emotional attunement or ethical judgment can cause significant harm. Existing dialogue models typically address empathy and ethical safety in isolation, and often fail to adapt their behavior as ethical risk and user emotion evolve across multi-turn interactions. We formulate ethical-emotional alignment in dialogue as an explicit turn-level decision problem, and propose \textsc{EthicMind}, a risk-aware framework that implements this formulation in multi-turn dialogue at inference time. At each turn, \textsc{EthicMind} jointly analyzes ethical risk signals and user emotion, plans a high-level response strategy, and generates context-sensitive replies that balance ethical guidance with emotional engagement, without requiring additional model training. To evaluate alignment behavior under ethically complex interactions, we introduce a risk-stratified, multi-turn evaluation protocol with a context-aware user simulation procedure. Experimental results show that \textsc{EthicMind} achieves more consistent ethical guidance and emotional engagement than competitive baselines, particularly in high-risk and morally ambiguous scenarios.
Chinese Translation
智能对话系统越来越多地应用于情感和伦理敏感的环境中,在这些环境中,无论是情感共鸣还是伦理判断的失败都可能造成重大伤害。现有的对话模型通常将同理心和伦理安全孤立地处理,往往无法在多轮交互中随着伦理风险和用户情感的变化而调整其行为。我们将对话中的伦理-情感对齐形式化为一个明确的轮次级决策问题,并提出了 extsc{EthicMind},这是一个在推理时实现该形式化的风险感知框架。在每一轮中, extsc{EthicMind} 共同分析伦理风险信号和用户情感,规划高层次的响应策略,并生成平衡伦理指导与情感参与的上下文敏感回复,而无需额外的模型训练。为了评估在伦理复杂交互下的对齐行为,我们引入了一种风险分层的多轮评估协议,并采用了上下文感知的用户模拟程序。实验结果表明, extsc{EthicMind} 在高风险和道德模糊场景中,提供了比竞争基线更一致的伦理指导和情感参与。
cs.CL / 53 / 2604.09377

Task-Aware LLM Routing with Multi-Level Task-Profile-Guided Data Synthesis for Cold-Start Scenarios

面向冷启动场景的多层任务画像引导数据合成的任务感知大型语言模型路由方法
Liu, Hui, Zou, Bin, Chen, Kecheng, Liu, Jie, Wang, Wenya, Li, Haoliang
Abstract
Large language models (LLMs) exhibit substantial variability in performance and computational cost across tasks and queries, motivating routing systems that select models to meet user-specific cost-performance trade-offs. However, existing routers generalize poorly in cold-start scenarios where in-domain training data is unavailable. We address this limitation with a multi-level task-profile-guided data synthesis framework that constructs a hierarchical task taxonomy and produces diverse question-answer pairs to approximate the test-time query distribution. Building on this, we introduce TRouter, a task-type-aware router approach that models query-conditioned cost and performance via latent task-type variables, with prior regularization derived from the synthesized task taxonomy. This design enhances TRouter's routing utility under both cold-start and in-domain settings. Across multiple benchmarks, we show that our synthesis framework alleviates cold-start issues and that TRouter delivers effective LLM routing.
Chinese Translation
大型语言模型(LLMs)在不同任务和查询上的性能及计算成本存在显著差异,这促使路由系统根据用户特定的成本-性能权衡选择合适的模型。然而,现有的路由器在缺乏领域内训练数据的冷启动场景下泛化能力较差。针对这一限制,我们提出了一种多层任务画像引导的数据合成框架,该框架构建了分层任务分类体系,并生成多样化的问答对以近似测试时的查询分布。在此基础上,我们引入了TRouter,一种任务类型感知的路由方法,通过潜在任务类型变量对查询条件下的成本和性能进行建模,并利用合成任务分类体系导出的先验正则化。该设计提升了TRouter在冷启动和领域内场景下的路由效用。在多个基准测试中,我们展示了该数据合成框架有效缓解了冷启动问题,且TRouter实现了高效的LLM路由。
cs.CL / 54 / 2604.09418

Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM

自动化指令修订(AIR):大型语言模型任务适应策略的结构化比较
Bilyk, Solomiia, Getmanskyi, Volodymyr, Firman, Taras
Abstract
This paper studies Automated Instruction Revision (AIR), a rule-induction-based method for adapting large language models (LLMs) to downstream tasks using limited task-specific examples. We position AIR within the broader landscape of adaptation strategies, including prompt optimization, retrieval-based methods, and fine-tuning. We then compare these approaches across a diverse benchmark suite designed to stress different task requirements, such as knowledge injection, structured extraction, label remapping, and logical reasoning. The paper argues that adaptation performance is strongly task-dependent: no single method dominates across all settings. Across five benchmarks, AIR was strongest or near-best on label-remapping classification, while KNN retrieval performed best on closed-book QA, and fine-tuning dominated structured extraction and event-order reasoning. AIR is most promising when task behavior can be captured by compact, interpretable instruction rules, while retrieval and fine-tuning remain stronger in tasks dominated by source-specific knowledge or dataset-specific annotation regularities.
Chinese Translation
本文研究了自动化指令修订(Automated Instruction Revision,AIR),这是一种基于规则归纳的方法,旨在利用有限的任务特定示例对大型语言模型(LLMs)进行下游任务适应。我们将AIR置于更广泛的适应策略框架中进行定位,包括提示优化、基于检索的方法以及微调。随后,我们在一个多样化的基准套件上比较了这些方法,该套件设计用以考察不同任务需求,如知识注入、结构化抽取、标签重映射和逻辑推理。本文指出,适应性能高度依赖于具体任务:没有单一方法能在所有场景中占据主导地位。在五个基准测试中,AIR在标签重映射分类任务上表现最优或接近最优,而KNN检索在闭卷问答(closed-book QA)中表现最佳,微调则在结构化抽取和事件顺序推理中占据主导。AIR在任务行为能够通过简洁且可解释的指令规则捕捉时最具潜力,而在以源特定知识或数据集特定注释规律为主导的任务中,检索和微调方法依然表现更强。
cs.CL / 55 / 2604.09442

UIPress: Bringing Optical Token Compression to UI-to-Code Generation

UIPress:将光学令牌压缩引入UI到代码生成
Dai, Dasen, Li, Shuoqi, Chen, Ronghao, Wang, Huacan, Wu, Biao, Lan, Qizhen
Abstract
UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visual token efficiency critical. Existing compression methods either select tokens at inference time using task-agnostic heuristics, or zero out low-attention features without actually shortening the sequence -- neither truly reduces prefill latency or adapts to the non-uniform information density of UI screenshots. Meanwhile, optical (encoder-side learned) compression has shown strong results for document OCR, yet no prior work has adapted this paradigm to UI-to-Code generation. We propose UIPress, a lightweight learned compression module inserted between the frozen ViT encoder and the LLM decoder of Qwen3-VL-8B. UIPress combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress ${\sim}$6{,}700 visual tokens to a fixed budget of 256. Together with Low-Rank Adaptation (LoRA) on the decoder to bridge the representation gap, the entire system adds only ${\sim}$21.7M trainable parameters (0.26\% of the 8B base model). Under a fair comparison on the same base model against four baselines on Design2Code, UIPress at 256 tokens achieves a CLIP score of 0.8127, outperforming the uncompressed baseline by +7.5\% and the strongest inference-time method by +4.6\%, while delivering 9.1$\times$ time-to-first-token speedup. To the best of our knowledge, UIPress is the first encoder-side learned compression method for the UI-to-Code task.
Chinese Translation
UI到代码生成需要视觉-语言模型(VLMs)从单个截图生成数千个结构化的HTML/CSS令牌,这使得视觉令牌效率至关重要。现有的压缩方法要么在推理时使用与任务无关的启发式方法选择令牌,要么将低关注特征置零,但并未真正缩短序列——这两者都无法真正减少预填充延迟或适应UI截图的非均匀信息密度。同时,光学(编码器端学习的)压缩在文档OCR中显示出强大的效果,但之前没有工作将这一范式应用于UI到代码生成。我们提出了UIPress,一个轻量级的学习压缩模块,插入在冻结的ViT编码器和Qwen3-VL-8B的LLM解码器之间。UIPress结合了深度可分离卷积、元素引导的空间重加权和Transformer精炼,将约6,700个视觉令牌压缩到固定的256个预算。结合在解码器上的低秩适应(LoRA)以弥补表示差距,整个系统仅增加约21.7M可训练参数(占8B基础模型的0.26%)。在对同一基础模型进行公平比较时,与Design2Code上的四个基线相比,UIPress在256个令牌下达到了0.8127的CLIP分数,超越了未压缩基线7.5%和最强推理时方法4.6%,同时实现了9.1倍的首次令牌速度提升。据我们所知,UIPress是UI到代码任务中首个编码器端学习的压缩方法。
cs.CL / 56 / 2604.09443

Many-Tier Instruction Hierarchy in LLM Agents

大型语言模型代理中的多层指令层次结构
Zhang, Jingyu, Li, Tianjian, Jurayj, William, Zhan, Hongyuan, Van Durme, Benjamin, Khashabi, Daniel
Abstract
Large language model agents receive instructions from many sources-system messages, user prompts, tool outputs, and more-each carrying different levels of trust and authority. When these instructions conflict, models must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving instruction conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction-following). ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even the current frontier models perform poorly (~40% accuracy) when instruction conflict scales. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.
Chinese Translation
大型语言模型代理接收来自多个来源的指令——系统消息、用户提示、工具输出等——每个来源都具有不同的信任和权威级别。当这些指令发生冲突时,模型必须可靠地遵循最高权限的指令,以保持安全和有效。当前主流的指令层次结构(Instruction Hierarchy, IH)假设存在一个固定的小型权限级别集合(通常少于五个),由严格的角色标签定义(例如,系统 > 用户)。然而,这在现实世界的代理环境中是不够的,因为冲突可能来自更多的来源和上下文。在本研究中,我们提出了多层指令层次结构(Many-Tier Instruction Hierarchy, ManyIH),这是一个解决具有任意多个权限级别的指令冲突的范式。我们引入了ManyIH-Bench,这是第一个针对ManyIH的基准测试。ManyIH-Bench要求模型在多达12个权限级别的冲突指令中进行导航,涵盖853个代理任务(427个编码任务和426个遵循指令的任务)。ManyIH-Bench结合了由大型语言模型(LLMs)开发并由人类验证的约束,创建了涵盖46个现实世界代理的真实且困难的测试案例。我们的实验表明,即使是当前最前沿的模型在指令冲突加剧时表现不佳(准确率约为40%)。这项工作强调了在代理环境中明确针对细粒度、可扩展的指令冲突解决方法的迫切需求。
cs.CL / 57 / 2604.09459

From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

从推理到自主:大语言模型强化学习中的信用分配
Zhang, Chenchen
Abstract
Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards -- yet determining which actions within a long trajectory caused the outcome remains difficult. This credit assignment (CA) problem manifests in two regimes: reasoning RL, where credit must be distributed across tokens and steps within a single chain-of-thought generation (500--30K+ tokens); and agentic RL, where multi-turn environment interaction introduces stochastic transitions, partial observability, and horizons of 100+ turns (100K--1M tokens), making episode-level credit increasingly uninformative. We survey 47 CA methods (41 core, 6 adjacent enablers) published between 2024 and early 2026, organizing them in a two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic). Beyond the survey itself, we contribute three reusable resources: (1) a structured, machine-readable paper inventory with taxonomy labels, baseline families, and evidence levels; (2) a reporting checklist for future CA papers, validated against the reviewed literature to identify systematic methodological gaps; and (3) a benchmark protocol specification with task families, metadata requirements, and controlled bifurcation tasks, accompanied by a method selection decision tree. Our synthesis suggests that the shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape: reasoning CA is maturing around process reward models and critic-free group comparison, while agentic CA is driving genuinely new approaches -- hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations -- that have no direct precedent in reasoning RL.
Chinese Translation
大语言模型(LLMs)的强化学习(RL)越来越依赖于稀疏的结果级奖励——然而,确定长轨迹中的哪些动作导致了结果仍然困难。这个信用分配(CA)问题在两个领域中表现出来:推理强化学习,其中信用必须在单个思维链生成(500--30K+个标记)中的标记和步骤之间分配;自主强化学习,其中多轮环境交互引入了随机转移、部分可观察性和超过100轮(100K--1M个标记)的时间范围,使得情节级信用变得越来越不具信息性。我们调查了2024年至2026年初发布的47种CA方法(41种核心方法,6种邻近促进方法),并根据分配粒度(标记、段、步骤、轮次、多智能体)和方法论(蒙特卡洛、时间差分、基于模型、博弈论、信息论)将其组织成二维分类法。除了调查本身,我们还贡献了三个可重用资源:(1)一个结构化的、机器可读的论文清单,带有分类标签、基线家族和证据水平;(2)一个针对未来CA论文的报告检查表,经过审查文献验证,以识别系统性方法论缺口;(3)一个基准协议规范,包含任务家族、元数据要求和受控分叉任务,并附有方法选择决策树。我们的综合分析表明,从推理到自主RL的转变使信用分配的格局变得复杂和重塑:推理CA围绕过程奖励模型和无评论者的组比较逐渐成熟,而自主CA则推动真正新颖的方法——事后反事实分析、特权不对称评论者和轮次级MDP重构——这些在推理RL中没有直接的先例。
cs.CL / 58 / 2604.09466

Across the Levels of Analysis: Explaining Predictive Processing in Humans Requires More Than Machine-Estimated Probabilities

跨越分析层次:解释人类预测处理需要超越机器估计的概率
Nair, Sathvik, Phillips, Colin
Abstract
Under the lens of Marr's levels of analysis, we critique and extend two claims about language models (LMs) and language processing: first, that predicting upcoming linguistic information based on context is central to language processing, and second, that many advances in psycholinguistics would be impossible without large language models (LLMs). We further outline future directions that combine the strengths of LLMs with psycholinguistic models.
Chinese Translation
在Marr的分析层次视角下,我们批判并扩展了关于语言模型(LMs)与语言处理的两项主张:首先,基于上下文预测即将出现的语言信息是语言处理的核心;其次,许多心理语言学的进展若无大型语言模型(LLMs)将难以实现。我们进一步提出了结合LLMs优势与心理语言学模型的未来研究方向。
cs.CL / 59 / 2604.09470

Agentic Jackal: Live Execution and Semantic Value Grounding for Text-to-JQL

Agentic Jackal:面向文本到JQL的实时执行与语义值映射
Murali, Vishnu, Gulati, Anmol, Lumer, Elias, Frank, Kevin, Campagna, Sindy, Subbiah, Vamse Kumar
Abstract
Translating natural language into Jira Query Language (JQL) requires resolving ambiguous field references, instance-specific categorical values, and complex Boolean predicates. Single-pass LLMs cannot discover which categorical values (e.g., component names or fix versions) actually exist in a given Jira instance, nor can they verify generated queries against a live data source, limiting accuracy on paraphrased or ambiguous requests. No open, execution-based benchmark exists for mapping natural language to JQL. We introduce Jackal, the first large-scale, execution-based text-to-JQL benchmark comprising 100,000 validated NL-JQL pairs on a live Jira instance with over 200,000 issues. To establish baselines on Jackal, we propose Agentic Jackal, a tool-augmented agent that equips LLMs with live query execution via the Jira MCP server and JiraAnchor, a semantic retrieval tool that resolves natural-language mentions of categorical values through embedding-based similarity search. Among 9 frontier LLMs evaluated, single-pass models average only 43.4% execution accuracy on short natural-language queries, highlighting that text-to-JQL remains an open challenge. The agentic approach improves 7 of 9 models, with a 9.0% relative gain on the most linguistically challenging variant; in a controlled ablation isolating JiraAnchor, categorical-value accuracy rises from 48.7% to 71.7%, with component-field accuracy jumping from 16.9% to 66.2%. Our analysis identifies inherent semantic ambiguities, such as issue-type disambiguation and text-field selection, as the dominant failure modes rather than value-resolution errors, pointing to concrete directions for future work. We publicly release the benchmark, all agent transcripts, and evaluation code to support reproducibility.
Chinese Translation
将自然语言翻译为Jira查询语言(JQL)需要解决字段引用的歧义、实例特定的类别值以及复杂的布尔谓词问题。单次推理的大型语言模型(LLMs)无法识别给定Jira实例中实际存在的类别值(例如组件名称或修复版本),也无法针对实时数据源验证生成的查询,从而限制了对意译或歧义请求的准确性。目前尚无基于执行的开放基准用于自然语言到JQL的映射。我们提出了Jackal,这是首个大规模、基于执行的文本到JQL基准,包含10万个经过验证的自然语言-JQL对,基于一个拥有超过20万问题的实时Jira实例。为在Jackal上建立基线,我们设计了Agentic Jackal,一种工具增强型代理,赋能LLMs通过Jira MCP服务器进行实时查询执行,并结合JiraAnchor——一种通过基于嵌入的相似度搜索解决自然语言中类别值提及的语义检索工具。在评估的9个前沿LLMs中,单次推理模型在简短自然语言查询上的平均执行准确率仅为43.4%,凸显文本到JQL仍是一个开放挑战。该代理方法提升了9个模型中的7个,在语言最具挑战性的变体上实现了9.0%的相对提升;在隔离JiraAnchor的对照消融实验中,类别值准确率从48.7%提升至71.7%,组件字段准确率从16.9%跃升至66.2%。我们的分析指出,固有的语义歧义(如问题类型消歧和文本字段选择)是主要失败模式,而非值解析错误,这为未来工作指明了具体方向。我们公开发布了该基准、所有代理对话记录及评估代码,以支持结果的可复现性。
cs.CL / 60 / 2604.09494

RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval

RecaLLM:通过显式上下文检索解决思维迷失现象
Whitecross, Kyle, Rahimi, Negin
Abstract
We propose RecaLLM, a set of reasoning language models post-trained to make effective use of long-context information. In-context retrieval, which identifies relevant evidence from context, and reasoning are deeply intertwined: retrieval supports reasoning, while reasoning often determines what must be retrieved. However, their interaction remains largely underexplored. In preliminary experiments on several open-source LLMs, we observe that in-context retrieval performance substantially degrades even after a short reasoning span, revealing a key bottleneck for test-time scaling that we refer to as lost-in-thought: reasoning steps that improve performance also make subsequent in-context retrieval more challenging. To address this limitation, RecaLLM interleaves reasoning with explicit in-context retrieval, alternating between reasoning and retrieving context information needed to solve intermediate subproblems. We introduce a negligible-overhead constrained decoding mechanism that enables verbatim copying of evidence spans, improving the grounding of subsequent generation. Trained on diverse lexical and semantic retrieval tasks, RecaLLM achieves strong performance on two long-context benchmarks, RULER and HELMET, significantly outperforming baselines. Notably, we observe consistent gains at context windows of up to 128K tokens using training samples of at most 10K tokens, far shorter than those used by existing long-context approaches, highlighting a promising path toward improving long-context performance without expensive long-context training data.
Chinese Translation
我们提出了RecaLLM,一组经过后训练的推理语言模型,旨在有效利用长上下文信息。上下文检索(in-context retrieval)用于从上下文中识别相关证据,与推理紧密交织:检索支持推理,而推理往往决定必须检索的内容。然而,它们之间的交互仍然很少被探索。在对若干开源大语言模型(LLMs)的初步实验中,我们观察到即使在较短的推理跨度后,上下文检索性能也会显著下降,揭示了测试时扩展的关键瓶颈,我们称之为“思维迷失”(lost-in-thought):推理步骤虽然提升了性能,却使后续的上下文检索变得更加困难。为了解决这一限制,RecaLLM将推理与显式上下文检索交替进行,在推理与检索解决中间子问题所需的上下文信息之间切换。我们引入了一种开销极小的受限解码机制,支持对证据片段的逐字复制,提升了后续生成的基础性。RecaLLM在多样的词汇和语义检索任务上训练,在两个长上下文基准测试RULER和HELMET上表现优异,显著超越基线方法。值得注意的是,我们在最多128K标记的上下文窗口中观察到持续的性能提升,而训练样本长度最多仅为10K标记,远短于现有长上下文方法所用的训练长度,展现了在无需昂贵长上下文训练数据的情况下提升长上下文性能的有希望路径。
cs.CL / 61 / 2604.09497

BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

BERT作为评判者:高效基于参考的语言模型评估的稳健替代方案
Gisserot-Boukhlef, Hippolyte, Boizard, Nicolas, Malherbe, Emmanuel, Hudelot, Céline, Colombo, Pierre
Abstract
Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a model's true problem-solving ability with its compliance with predefined formatting guidelines. While recent LLM-as-a-Judge approaches mitigate this issue by assessing semantic correctness rather than strict structural conformity, they also introduce substantial computational overhead, making evaluation costly. In this work, we first systematically investigate the limitations of lexical evaluation through a large-scale empirical study spanning 36 models and 15 downstream tasks, demonstrating that such methods correlate poorly with human judgments. To address this limitation, we introduce BERT-as-a-Judge, an encoder-driven approach for assessing answer correctness in reference-based generative settings, robust to variations in output phrasing, and requiring only lightweight training on synthetically annotated question-candidate-reference triplets. We show that it consistently outperforms the lexical baseline while matching the performance of much larger LLM judges, providing a compelling tradeoff between the two and enabling reliable, scalable evaluation. Finally, through extensive experimentation, we provide detailed insights into BERT-as-a-Judge's performance to offer practical guidance for practitioners, and release all project artifacts to foster downstream adoption.
Chinese Translation
准确的评估对于大型语言模型(LLM)生态系统至关重要,它指导模型选择和在多种应用场景中的下游采用。然而,在实际操作中,评估生成输出通常依赖于严格的词汇方法来提取和评估答案,这可能将模型的真实问题解决能力与其对预定义格式指南的遵循混为一谈。尽管近期的LLM作为评判者的方法通过评估语义正确性而非严格的结构一致性来缓解这一问题,但它们也引入了显著的计算开销,使得评估成本高昂。在本研究中,我们首先通过一项涵盖36个模型和15个下游任务的大规模实证研究系统地调查了词汇评估的局限性,证明此类方法与人类判断的相关性较差。为了解决这一局限性,我们提出了BERT作为评判者,这是一种基于编码器的方法,用于在基于参考的生成设置中评估答案的正确性,能够适应输出措辞的变化,并且只需在合成标注的问题-候选-参考三元组上进行轻量级训练。我们展示了它在性能上始终优于词汇基线,同时与更大规模的LLM评判者的表现相匹配,提供了两者之间的有力权衡,并实现了可靠、可扩展的评估。最后,通过广泛的实验,我们提供了关于BERT作为评判者性能的详细见解,以为从业者提供实用指导,并发布所有项目文档以促进下游采用。
cs.CL / 62 / 2604.09501

You Can't Fight in Here! This is BBS!

这里不能打架!这里是BBS!
Futrell, Richard, Mahowald, Kyle
Abstract
Norm, the formal theoretical linguist, and Claudette, the computational language scientist, have a lovely time discussing whether modern language models can inform important questions in the language sciences. Just as they are about to part ways until they meet again, 25 of their closest friends show up -- from linguistics, neuroscience, cognitive science, psychology, philosophy, and computer science. We use this discussion to highlight what we see as some common underlying issues: the String Statistics Strawman (the mistaken idea that LMs can't be linguistically competent or interesting because they, like their Markov model predecessors, are statistical models that learn from strings) and the As Good As it Gets Assumption (the idea that LM research as it stands in 2026 is the limit of what it can tell us about linguistics). We clarify the role of LM-based work for scientific insights into human language and advocate for a more expansive research program for the language sciences in the AI age, one that takes on the commentators' concerns in order to produce a better and more robust science of both human language and of LMs.
Chinese Translation
形式理论语言学家Norm和计算语言科学家Claudette愉快地讨论了现代语言模型是否能够为语言科学中的重要问题提供信息。正当他们准备分别,期待再次相聚时,来自语言学、神经科学、认知科学、心理学、哲学和计算机科学的25位亲密朋友纷纷到场。我们借此讨论强调一些常见的潜在问题:字符串统计稻草人论(即错误地认为语言模型无法具备语言能力或趣味性,因为它们像其马尔可夫模型前辈一样,是从字符串中学习的统计模型)以及“尽善尽美”假设(即认为截至2026年的语言模型研究已达到其对语言学所能提供见解的极限)。我们阐明了基于语言模型的工作在揭示人类语言科学见解中的作用,并倡导在人工智能时代语言科学开展更为广泛的研究计划,积极回应评论者的关切,以推动对人类语言及语言模型的更好、更稳健的科学研究。
cs.CL / 63 / 2604.09514

Many Ways to Be Fake: Benchmarking Fake News Detection Under Strategy-Driven AI Generation

多种伪造方式:基于策略驱动AI生成的假新闻检测基准测试
Wang, Xinyu, Koneru, Sai, Zhang, Wenbo, Zheng, Wenliang, Ranjan, Saksham, Rajtmajer, Sarah
Abstract
Recent advances in large language models (LLMs) have enabled the large-scale generation of highly fluent and deceptive news-like content. While prior work has often treated fake news detection as a binary classification problem, modern fake news increasingly arises through human-AI collaboration, where strategic inaccuracies are embedded within otherwise accurate and credible narratives. These mixed-truth cases represent a realistic and consequential threat, yet they remain underrepresented in existing benchmarks. To address this gap, we introduce MANYFAKE, a synthetic benchmark containing 6,798 fake news articles generated through multiple strategy-driven prompting pipelines that capture many ways fake news can be constructed and refined. Using this benchmark, we evaluate a range of state-of-the-art fake news detectors. Our results show that even advanced reasoning-enabled models approach saturation on fully fabricated stories, but remain brittle when falsehoods are subtle, optimized, and interwoven with accurate information.
Chinese Translation
近年来,大型语言模型(LLMs)的进步使得大规模生成高度流畅且具有欺骗性的新闻类内容成为可能。尽管以往研究通常将假新闻检测视为二分类问题,但现代假新闻越来越多地通过人机协作产生,其中策略性的不准确信息被嵌入到其他准确且可信的叙述中。这类混合真实性的案例代表了现实且重要的威胁,然而在现有基准测试中却鲜有体现。为填补这一空白,我们提出了MANYFAKE,一个合成基准数据集,包含6798篇通过多种策略驱动的提示生成管道制作的假新闻文章,涵盖了假新闻构建和优化的多种方式。基于该基准,我们评估了一系列最先进的假新闻检测器。结果表明,即使是具备高级推理能力的模型在完全虚构的故事上表现趋于饱和,但在面对细微、经过优化且与准确信息交织的虚假内容时仍显脆弱。
cs.CL / 64 / 2604.09537

Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision

基于案例的证据验证:构建证据敏感监督的框架
Arasteh, Soroosh Tayebi, Joodaki, Mehdi, Lotfinia, Mahshad, Nebelung, Sven, Truhn, Daniel
Abstract
Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provided evidence supports the target claim. In practice, this often fails because supervision is weak, evidence is only loosely tied to the claim, and evaluation does not test evidence dependence directly. We introduce case-grounded evidence verification, a general framework in which a model receives a local case context, external evidence, and a structured claim, and must decide whether the evidence supports the claim for that case. Our key contribution is a supervision construction procedure that generates explicit support examples together with semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives, without manual evidence annotation. We instantiate the framework in radiology and train a standard verifier on the resulting support task. The learned verifier substantially outperforms both case-only and evidence-only baselines, remains strong under correct evidence, and collapses when evidence is removed or swapped, indicating genuine evidence dependence. This behavior transfers across unseen evidence articles and an external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice. Overall, the results suggest that a major bottleneck in evidence grounding is not only model capacity, but the lack of supervision that encodes the causal role of evidence.
Chinese Translation
基于证据的推理不仅仅是将检索到的文本附加到预测上:模型应根据提供的证据是否支持目标主张来做出决策。在实践中,这往往失败,因为监督较弱,证据与主张的联系较松散,评估也未能直接测试证据依赖性。我们提出了基于案例的证据验证,这是一种通用框架,其中模型接收局部案例上下文、外部证据和结构化主张,并必须决定该证据是否支持该案例的主张。我们的关键贡献是一个监督构建程序,它生成明确的支持示例以及语义控制的非支持示例,包括反事实错误状态和主题相关的负例,而无需手动证据标注。我们在放射学中实例化该框架,并在生成的支持任务上训练一个标准验证器。学习到的验证器在性能上显著优于仅基于案例和仅基于证据的基线,在正确证据下保持强劲,而在移除或替换证据时则崩溃,表明真正的证据依赖性。这种行为在未见证据文章和外部案例分布中得以转移,尽管在证据来源变化下性能有所下降,并且对主干选择仍然敏感。总体而言,结果表明,证据基础的一个主要瓶颈不仅在于模型能力,还在于缺乏编码证据因果作用的监督。
cs.CL / 65 / 2604.09544

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

大型语言模型通过一种独特的统一机制生成有害内容
Orgad, Hadas, Wei, Boyi, Zheng, Kaden, Wattenberg, Martin, Henderson, Peter, Goldfarb-Tarrant, Seraphina, Belinkov, Yonatan
Abstract
Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce ``emergent misalignment'' that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally--despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
Chinese Translation
大型语言模型(LLMs)经过对齐训练以避免有害行为,但所产生的保护措施仍然脆弱:越狱攻击常常绕过这些保护措施,而在狭窄领域的微调可能会导致广泛的“新出现的不对齐”。这种脆弱性是否反映了在有害性方面缺乏一致的内部组织仍不清楚。在这里,我们使用有针对性的权重剪枝作为因果干预,探讨LLMs中有害性的内部组织。我们发现,有害内容的生成依赖于一组紧凑的权重,这些权重在不同的有害类型中是通用的,并且与良性能力是不同的。对齐模型在有害生成权重的压缩程度上超过了未对齐的模型,这表明对齐在内部重塑了有害表征——尽管在表面层面安全保护措施仍然脆弱。这种压缩解释了新出现的不对齐:如果有害能力的权重被压缩,在一个领域进行的微调可能会引发广泛的不对齐。与此一致的是,在狭窄领域剪枝有害生成权重显著减少了新出现的不对齐。值得注意的是,LLMs的有害生成能力与它们识别和解释此类内容的方式是分离的。综合来看,这些结果揭示了LLMs中有害性的一个一致的内部结构,这可能为更有原则的安全方法奠定基础。