Daily Research Digest: arXiv Papers

2026-03-12 · 237 papers · 4 categories · 237 translated
Robotics (54 papers)
cs.RO / 1 / 2603.10052

OmniGuide: Universal Guidance Fields for Enhancing Generalist Robot Policies

Song, Yunzhou, Le, Long, Park, Yong-Hyun, Wang, Jie, Shi, Junyao, Liu, Lingjie, Gu, Jiatao, Eaton, Eric, Jayaraman, Dinesh, Daniilidis, Kostas
Abstract
Vision-language-action (VLA) models have shown great promise as generalist policies for a large range of relatively simple tasks. However, they demonstrate limited performance on more complex tasks, such as those requiring complex spatial or semantic understanding, manipulation in clutter, or precise manipulation. We propose OMNIGUIDE, a flexible framework that improves VLA performance on such tasks by leveraging arbitrary sources of guidance, such as 3D foundation models, semantic-reasoning VLMs, and human pose models. We show how many kinds of guidance can be naturally expressed as differentiable energy functions, with task-specific attractors and repellers located in 3D space, that influence the sampling of VLA actions. In this way, OMNIGUIDE enables guidance sources with complementary task-relevant strengths to improve a VLA model's performance on challenging tasks. Extensive experiments in both simulation and real-world environments, across diverse sources of guidance, demonstrate that OMNIGUIDE significantly enhances the performance of state-of-the-art generalist policies (e.g., $\pi_{0.5}$, GR00T N1.6) across success and safety rates. Critically, our unified framework matches or surpasses the performance of prior methods designed to incorporate specific sources of guidance into VLA policies. Project Page: https://omniguide.github.io/
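As an editorial sketch of the attractor/repeller idea in the abstract above (the quadratic and inverse-distance energy terms, gains, and step size are all invented stand-ins, not the paper's actual guidance formulation): a differentiable energy over 3D points combines attractor wells and repeller barriers, and its gradient nudges a candidate action point.

```python
import numpy as np

# Toy energy field: quadratic attractor wells plus inverse-distance
# repeller barriers (invented stand-ins for the paper's guidance terms).
def energy(p, attractors, repellers, k_a=1.0, k_r=1.0):
    e = sum(0.5 * k_a * float(np.sum((p - a) ** 2)) for a in attractors)
    e += sum(k_r / (float(np.linalg.norm(p - r)) + 1e-6) for r in repellers)
    return e

def guide_action(p, attractors, repellers, step=0.1, eps=1e-5):
    """One descent step on the energy, via central finite differences."""
    g = np.zeros_like(p)
    for i in range(len(p)):
        dp = np.zeros_like(p)
        dp[i] = eps
        g[i] = (energy(p + dp, attractors, repellers)
                - energy(p - dp, attractors, repellers)) / (2 * eps)
    return p - step * g

attractors = [np.array([1.0, 0.0, 0.0])]   # pull toward a task-relevant point
repellers = [np.array([0.0, 1.0, 0.0])]    # push away from an obstacle
p0 = np.zeros(3)                           # candidate action position
p1 = guide_action(p0, attractors, repellers)
```

A real guidance field would differentiate analytically through the VLA's action sampler; the finite-difference step here only illustrates the energy-descent direction.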
cs.RO / 2 / 2603.10059

Model-Free Co-Optimization of Manufacturable Sensor Layouts and Deformation Proprioception

Tian, Yingjun, Fang, Guoxin, Lyu, Aoran, Wang, Xilong, Shi, Zikang, Guo, Yuhu, Wang, Weiming, Wang, Charlie C. L.
Abstract
Flexible sensors are increasingly employed in soft robotics and wearable devices to provide proprioception of freeform deformations. Although supervised learning can train shape predictors from sensor signals, prediction accuracy strongly depends on sensor layout, which is typically determined heuristically or through trial-and-error. This work introduces a model-free, data-driven computational pipeline that jointly optimizes the number, length, and placement of flexible length-measurement sensors together with the parameters of a shape prediction network for large freeform deformations. Unlike model-based approaches, the proposed method relies solely on datasets of deformed shapes, without requiring physical simulation models, and is therefore broadly applicable to diverse robotic sensing tasks. The pipeline incorporates differentiable loss functions that account for both prediction accuracy and manufacturability constraints. By co-optimizing sensor layouts and network parameters, the method significantly improves deformation prediction accuracy over unoptimized layouts while ensuring practical feasibility. The effectiveness and generality of the approach are validated through numerical and physical experiments on multiple soft robotic and wearable systems.
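A toy illustration of the co-optimization idea above (the one-parameter "sensor" s = θ·x and the linear readout are invented for this sketch; the paper optimizes real sensor layouts and a neural predictor): both the placement parameter and the readout weight are updated by gradient descent on the same prediction loss.

```python
import numpy as np

# Synthetic data: latent deformation magnitudes x and the shape quantity y
# to predict. The "sensor" reads s = theta * x, so the prediction is w * s.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x

theta, w = 0.3, 0.0          # sensor placement parameter, readout weight
lr = 0.05
for _ in range(500):
    s = theta * x            # simulated sensor signal
    err = w * s - y          # prediction residual
    # Gradients of the mean-squared error w.r.t. both parameters:
    grad_w = float(np.mean(2.0 * err * s))
    grad_theta = float(np.mean(2.0 * err * w * x))
    w -= lr * grad_w
    theta -= lr * grad_theta

mse = float(np.mean((w * theta * x - y) ** 2))
```

Because sensor and readout are trained against the same differentiable loss, the product w·θ converges to the value that explains the data, which is the essence of co-optimizing layout and predictor.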
cs.RO / 3 / 2603.10061

Decision-Aware Uncertainty Evaluation of Vision-Language Model-Based Early Action Anticipation for Human-Robot Interaction

Du, Zhaoda, Bowman, Michael, Zheng, Qiaojie, Zhang, Xiaoli
Abstract
Robots in shared workspaces must interpret human actions from partial, ambiguous observations, where overconfident early predictions can lead to unsafe or disruptive interaction. This challenge is amplified in egocentric views, where viewpoint changes and occlusions increase perceptual noise and ambiguity. As a result, downstream human-robot interaction modules require not only an action hypothesis but also a trustworthy estimate of confidence under partial observation. Recent vision-language model-based approaches have been proposed for short-term action recognition due to their open-vocabulary and context-aware reasoning, but their uncertainty reliability in the temporal-prefix regime is largely uncharacterized. We present the first systematic evaluation of uncertainty in vision-language model-based short-term action recognition for human-robot interaction. We introduce a temporal-prefix evaluation protocol and metrics for calibration and selective prediction. We also characterize miscalibration patterns and failure modes under partial observations. Our study provides the missing reliability evidence needed to use vision-language model predictions in confidence-gated human-robot interaction modules.
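The calibration and selective-prediction metrics mentioned above can be illustrated with a minimal sketch (these are the standard textbook definitions, not necessarily the paper's exact protocol or bin counts):

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected calibration error over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            err += mask.sum() / len(confidences) * gap
    return float(err)

def selective_accuracy(confidences, correct, threshold):
    """Accuracy over only the predictions whose confidence passes the gate."""
    keep = np.asarray(confidences, dtype=float) >= threshold
    if not keep.any():
        return None                      # the gate abstains on everything
    return float(np.asarray(correct, dtype=float)[keep].mean())

conf = [0.9, 0.8, 0.6, 0.55]             # per-prediction confidences
hit = [1, 1, 0, 1]                       # 1 if the prediction was correct
```

A confidence-gated interaction module would act only when `selective_accuracy` at the chosen gate is high and `ece` is low, which is exactly the reliability evidence the paper sets out to measure.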
cs.RO / 4 / 2603.10126

AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

Hu, Yutong, Zaech, Jan-Nico, Nikolov, Nikolay, Yao, Yuanqi, Dey, Sombit, Albanese, Giuliano, Detry, Renaud, Van Gool, Luc, Paudel, Danda
Abstract
We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies that reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning, enabling efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, naturally ensuring spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we utilize a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks demonstrate that the proposed method can effectively replace traditional chunk-based action heads for both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while maintaining or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies.
cs.RO / 5 / 2603.10158

Cross-Hand Latent Representation for Vision-Language-Action Models

Jiang, Guangqi, Liang, Yutong, Ye, Jianglong, Huang, Jia-Yang, Jing, Changwei, Duan, Rocky, Abbeel, Pieter, Wang, Xiaolong, Zou, Xueyan
Abstract
Dexterous manipulation is essential for real-world robot autonomy, mirroring the central role of human hand coordination in daily activity. Humans rely on rich multimodal perception--vision, sound, and language-guided intent--to perform dexterous actions, motivating vision-based, language-conditioned manipulation systems for robots. However, training reliable vision-language-action (VLA) models for dexterous manipulation requires large-scale demonstrations across many robotic hands. In addition, as new dexterous embodiments appear rapidly, collecting data for each becomes costly and impractical, creating a need for scalable cross-embodiment learning. We introduce XL-VLA, a vision-language-action framework integrated with a unified latent action space shared across diverse dexterous hands. This embodiment-invariant latent space is directly pluggable into standard VLA architectures, enabling seamless cross-embodiment training and efficient reuse of both existing and newly collected data. Experimental results demonstrate that XL-VLA consistently outperforms baseline VLA models operating in raw joint spaces, establishing it as an effective solution for scalable cross-embodiment dexterous manipulation.
cs.RO / 6 / 2603.10166

Dance2Hesitate: A Multi-Modal Dataset of Dancer-Taught Hesitancy for Understandable Robot Motion

Raghu, Srikrishna Bangalore, Soukhovei, Anna, Vankineni, Divya Sai Sindhuja, Bacula, Alexandra, Roncone, Alessandro
Abstract
In human-robot collaboration, a robot's expression of hesitancy is a critical factor that shapes human coordination strategies, attention allocation, and safety-related judgments. However, designing hesitant robot motion that generalizes is challenging because the observer's inference is highly dependent on embodiment and context. To address these challenges, we introduce and open-source a multi-modal, dancer-generated dataset of hesitant motion, where we focus on specific context-embodiment pairs (i.e., manipulator/human upper-limb approaching a Jenga Tower, and anthropomorphic whole body motion in free space). The dataset includes (i) kinesthetic teaching demonstrations on a Franka Emika Panda reaching from a fixed start configuration to a fixed target (a Jenga tower) with three graded hesitancy levels (slight, significant, extreme) and (ii) synchronized RGB-D motion capture of dancers performing the same reaching behavior using their upper limb across three hesitancy levels, plus full human body sequences for extreme hesitancy. We further provide documentation to enable reproducible benchmarking across robot and human modalities. Across all dancers, we obtained 70 unique whole-body trajectories, 84 upper limb trajectories spanning the three hesitancy levels, and 66 kinesthetic teaching trajectories spanning the three hesitancy levels. The dataset can be accessed here: https://brsrikrishna.github.io/Dance2Hesitate/.
cs.RO / 7 / 2603.10173

Characterizing Healthy & Post-Stroke Neuromotor Behavior During 6D Upper-Limb Isometric Gaming: Implications for Design of End-Effector Rehabilitation Robot Interfaces

Anand, Ajay, Parra, Gabriel, Berghoff, Chad A., Hallock, Laura A.
Abstract
Successful robot-mediated rehabilitation requires designing games and robot interventions that promote healthy motor practice. However, the interplay between a given user's neuromotor behavior, the gaming interface, and the physical robot makes designing system elements -- and even characterizing what behaviors are "healthy" or pathological -- challenging. We leverage our OpenRobotRehab 1.0 open access data set to assess the characteristics of 13 healthy and 2 post-stroke users' force output, muscle activations, and game performance while executing isometric trajectory tracking tasks using an end-effector rehabilitation robot. We present an assessment of how subtle aspects of interface design impact user behavior; an analysis of how pathological neuromotor behaviors are reflected in end-effector force dynamics; and a novel hidden Markov model (HMM)-based neuromotor behavior classification method based on surface electromyography (sEMG) signals during cyclic motions. We demonstrate that task specification (including which axes are constrained and how users interpret tracking instructions) shapes user behavior; that pathology-related features are detectable in 6D end-effector force data during isometric task execution (with significant differences between healthy and post-stroke profiles in force error and average force production at $p=0.05$); and that healthy neuromotor strategies are heterogeneous and inherently difficult to characterize. We also show that our HMM-based models discriminate healthy and post-stroke neuromotor dynamics where synergy-based decompositions reflect no such differentiation. Lastly, we discuss these results' implications for the design of adaptive end-effector rehabilitation robots capable of promoting healthier movement strategies across diverse user populations.
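A toy version of the HMM-based sequence classification described above (hand-specified two-state Gaussian-emission models stand in for the paper's sEMG-trained ones): score a 1-D observation sequence under one HMM per class with the forward algorithm, and pick the likelier class.

```python
import math

def forward_loglik(obs, trans, means, var=1.0):
    """Log-likelihood of a 1-D sequence under a Gaussian-emission HMM."""
    n = len(means)
    def emit(o, i):
        return math.exp(-(o - means[i]) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    alpha = [emit(obs[0], i) / n for i in range(n)]     # uniform state prior
    for o in obs[1:]:
        alpha = [emit(o, j) * sum(alpha[i] * trans[i][j] for i in range(n))
                 for j in range(n)]
    return math.log(sum(alpha))

trans = [[0.9, 0.1], [0.1, 0.9]]          # sticky two-state dynamics
healthy_means = [0.0, 1.0]                # invented low/high activation levels
stroke_means = [2.0, 3.0]
seq = [0.1, 0.0, 1.1, 0.9]                # sequence resembling the "healthy" model
score_healthy = forward_loglik(seq, trans, healthy_means)
score_stroke = forward_loglik(seq, trans, stroke_means)
```

Classification reduces to comparing the two forward log-likelihoods; in the paper the per-class models are fit to cyclic-motion sEMG data rather than specified by hand.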
cs.RO / 8 / 2603.10174

Autonomous Search for Sparsely Distributed Visual Phenomena through Environmental Context Modeling

Chen, Eric, Manderson, Travis, Karapetyan, Nare, Edmunds, Peter, Roy, Nicholas, Girdhar, Yogesh
Abstract
Autonomous underwater vehicles (AUVs) are increasingly used to survey coral reefs, yet efficiently locating specific coral species of interest remains difficult: target species are often sparsely distributed across the reef, and an AUV with limited battery life cannot afford to search everywhere. When detections of the target itself are too sparse to provide directional guidance, the robot benefits from an additional signal to decide where to look next. We propose using the visual environmental context -- the habitat features that tend to co-occur with a target species -- as that signal. Because context features are spatially denser and often vary more smoothly than target detections, we hypothesize that a reward function targeted at broader environmental context will enable adaptive planners to make better decisions on where to go next, even in regions where no target has yet been observed. Starting from a single labeled image, our method uses patch-level DINOv2 embeddings to perform one-shot detections of both the target species and its surrounding context online. We validate our approach using real imagery collected by an AUV at two reef sites in St. John, U.S. Virgin Islands, simulating the robot's motion offline. Our results demonstrate that one-shot detection combined with adaptive context modeling enables efficient autonomous surveying, sampling up to 75% of the target in roughly half the time required by exhaustive coverage when the target is sparsely distributed, and outperforming search strategies that only use target detections.
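One-shot detection by embedding similarity, as described above, reduces to thresholded cosine similarity against a single exemplar patch; in this sketch random vectors stand in for DINOv2 patch embeddings, and the threshold is an invented value.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def one_shot_detect(exemplar, patches, tau=0.8):
    """Indices of patches whose similarity to the exemplar exceeds tau."""
    return [i for i, p in enumerate(patches) if cosine(exemplar, p) >= tau]

rng = np.random.default_rng(0)
exemplar = rng.normal(size=64)                 # embedding of the one labeled patch
match = exemplar + 0.05 * rng.normal(size=64)  # near-duplicate patch
clutter = rng.normal(size=64)                  # unrelated background patch
hits = one_shot_detect(exemplar, [match, clutter])
```

The same mechanism, run over context patches rather than target patches, yields the denser reward signal the planner exploits.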
cs.RO / 9 / 2603.10198

Octopus-inspired Distributed Control for Soft Robotic Arms: A Graph Neural Network-Based Attention Policy with Environmental Interaction

Hou, Linxin, Wu, Qirui, Qin, Zhihang, Guo, Yongxin, Laschi, Cecilia
Abstract
This paper proposes SoftGM, an octopus-inspired distributed control architecture for segmented soft robotic arms that learn to reach targets in contact-rich environments using online obstacle discovery without relying on global obstacle geometry. SoftGM formulates each arm section as a cooperative agent and represents the arm-environment interaction as a graph. SoftGM uses a two-stage graph attention message passing scheme following a Centralised Training Decentralised Execution (CTDE) paradigm with a centralised critic and decentralised actor. We evaluate SoftGM in a Cosserat-rod simulator (PyElastica) across three tasks that increase the complexity of the environment: obstacle-free, structured obstacles, and a wall-with-hole scenario. Compared with six widely used MARL baselines (IDDPG, IPPO, ISAC, MADDPG, MAPPO, MASAC) under identical information content and training conditions, SoftGM matches strong CTDE methods in simpler settings and achieves the best performance in the wall-with-hole task. Robustness tests with observation noise, single-section actuation failure, and transient disturbances show that SoftGM preserves success while keeping control effort bounded, indicating resilient coordination driven by selective contact-relevant information routing.
cs.RO / 10 / 2603.10227

Perceptive Hierarchical-Task MPC for Sequential Mobile Manipulation in Unstructured Semi-Static Environments

Du, Xintong, Qian, Jingxing, Zhou, Siqi, Schoellig, Angela P.
Abstract
As compared to typical mobile manipulation tasks, sequential mobile manipulation poses a unique challenge -- as the robot operates over extended periods, successful task completion is not solely dependent on consistent motion generation but also on the robot's awareness and adaptivity to changes in the operating environment. While existing motion planners can generate whole-body trajectories to complete sequential tasks, they typically assume that the environment remains static and rely on precomputed maps. This assumption often breaks down during long-term operations, where semi-static changes such as object removal, introduction, or shifts are common. In this work, we propose a novel perceptive hierarchical-task model predictive control (HTMPC) framework for efficient sequential mobile manipulation in unstructured, changing environments. To tackle the challenge, we leverage a Bayesian inference framework to explicitly model object-level changes and thereby maintain a temporally accurate representation of the 3D environment; this up-to-date representation is embedded in a lexicographic optimization framework to enable efficient execution of sequential tasks. We validate our perceptive HTMPC approach through both simulated and real-robot experiments. In contrast to baseline methods, our approach systematically accounts for moved and phantom obstacles, successfully completing sequential tasks with higher efficiency and reactivity, without relying on prior maps or external infrastructure.
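The object-level change modeling above can be illustrated with a minimal Bayesian presence update (the hit and false-positive rates are invented; the paper's inference operates over a richer 3D representation): repeated missed detections should drive the belief that an object is still present toward zero.

```python
def update_presence(prior, detected, p_hit=0.9, p_false=0.1):
    """Posterior P(object present | one detection outcome) via Bayes' rule."""
    like_present = p_hit if detected else 1.0 - p_hit
    like_absent = p_false if detected else 1.0 - p_false
    num = like_present * prior
    return num / (num + like_absent * (1.0 - prior))

belief = 0.5                        # start undecided about the object
for seen in [False, False, False]:  # the object is repeatedly not observed
    belief = update_presence(belief, seen)
```

Once such a belief drops below a threshold, the map entry can be retired as a "phantom" obstacle, keeping the environment representation temporally accurate.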
cs.RO / 11 / 2603.10232

Hierarchical Task Model Predictive Control for Sequential Mobile Manipulation Tasks

Du, Xintong, Zhou, Siqi, Schoellig, Angela P.
Abstract
Mobile manipulators are envisioned to serve more complex roles in people's everyday lives. With recent breakthroughs in large language models, task planners have become better at translating human verbal instructions into a sequence of tasks. However, there is still a need for a decision-making algorithm that can seamlessly interface with the high-level task planner to carry out the sequence of tasks efficiently. In this work, building on the idea of nonlinear lexicographic optimization, we propose a novel Hierarchical-Task Model Predictive Control framework that is able to complete sequential tasks with improved performance and reactivity by effectively leveraging the robot's redundancy. Compared to the state-of-the-art task-prioritized inverse kinematic control method, our approach has improved hierarchical trajectory tracking performance by 42% on average when facing task changes, robot singularity and reference variations. Compared to a typical single-task architecture, our proposed hierarchical task control architecture enables the robot to traverse a shorter path in task space and achieves an execution time 2.3 times faster when executing a sequence of delivery tasks. We demonstrated the results with real-world experiments on a 9 degrees of freedom mobile manipulator.
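Lexicographic optimization, the idea underlying the task hierarchy above, can be shown on a discrete toy problem (the candidate commands and cost functions are invented): each priority level filters the candidate set before the next level breaks ties.

```python
def lexicographic_argmin(candidates, costs, tol=1e-9):
    """Pick a candidate by strictly prioritized costs (highest priority first)."""
    pool = list(candidates)
    for cost in costs:
        best = min(cost(c) for c in pool)
        pool = [c for c in pool if cost(c) <= best + tol]  # keep only ties
    return pool[0]

# Stand-ins for whole-body commands: (tracking_error, control_effort) pairs.
candidates = [(0.0, 5.0), (0.0, 1.0), (0.2, 0.1)]
choice = lexicographic_argmin(
    candidates,
    [lambda c: c[0],    # priority 1: task tracking error
     lambda c: c[1]],   # priority 2: control effort
)
```

The paper solves the continuous, nonlinear analogue of this inside an MPC loop; the key property illustrated here is that lower-priority costs can only break ties among solutions already optimal for higher-priority tasks, which is how redundancy is exploited without degrading the primary task.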
cs.RO / 12 / 2603.10248

Degeneracy-Resilient Teach and Repeat for Geometrically Challenging Environments Using FMCW Lidar

Papais, Katya M., Zhao, Wenda, Barfoot, Timothy D.
Abstract
Teach and Repeat (T&R) topometric navigation enables robots to autonomously repeat previously traversed paths without relying on GPS, making it well suited for operations in GPS-denied environments such as underground mines and lunar navigation. State-of-the-art T&R systems typically rely on iterative closest point (ICP)-based estimation; however, in geometrically degenerate environments with sparsely structured terrain, ICP often becomes ill-conditioned, resulting in degraded localization and unreliable navigation performance. To address this challenge, we present a degeneracy-resilient Frequency-Modulated Continuous-Wave (FMCW) lidar T&R navigation system consisting of Doppler velocity-based odometry and degeneracy-aware scan-to-map localization. Leveraging FMCW lidar, which provides per-point radial velocity measurements via the Doppler effect, we extend a geometry-independent, correspondence-free motion estimation to include principled pose uncertainty estimation that remains stable in degenerate environments. We further propose a degeneracy-aware localization method that incorporates per-point curvature for improved data association, and unifies translational and rotational scales to enable consistent degeneracy detection. Closed-loop field experiments across three environments with varying structural richness demonstrate that the proposed system reliably completes autonomous navigation, including in a challenging flat airport test field where a conventional ICP-based system fails.
cs.RO / 13 / 2603.10263

From Prior to Pro: Efficient Skill Mastery via Distribution Contractive RL Finetuning

Sun, Zhanyi, Song, Shuran
Abstract
We introduce Distribution Contractive Reinforcement Learning (DICE-RL), a framework that uses reinforcement learning (RL) as a "distribution contraction" operator to refine pretrained generative robot policies. DICE-RL turns a pretrained behavior prior into a high-performing "pro" policy by amplifying high-success behaviors from online feedback. We pretrain a diffusion- or flow-based policy for broad behavioral coverage, then finetune it with a stable, sample-efficient residual off-policy RL framework that combines selective behavior regularization with value-guided action selection. Extensive experiments and analyses show that DICE-RL reliably improves performance with strong stability and sample efficiency. It enables mastery of complex long-horizon manipulation skills directly from high-dimensional pixel inputs, both in simulation and on a real robot. Project website: https://zhanyisun.github.io/dice.rl.2026/.
cs.RO / 14 / 2603.10264

Design of a Robot-Assisted Chemical Dialysis System

Jung, Diane, Escobedo, Caleb, Liska, Noah, Gramopadhye, Maitrey, Szafir, Daniel, Roncone, Alessandro, Bruns, Carson
Abstract
Scientists perform diverse manual procedures that are tedious and laborious. Such procedures are considered a bottleneck for modern experimental science, as they consume time and increase burdens in fields including material science and medicine. We employ a user-centered approach to designing a robot-assisted system for dialysis, a common multi-day purification method used in polymer and protein synthesis. Through two usability studies, we obtain participant feedback and revise design requirements to develop the final system that satisfies scientists' needs and has the potential for applications in other experimental workflows. We anticipate that integration of this system into real synthesis procedures in a chemical wet lab will decrease workload on scientists during long experimental procedures and provide an effective approach to designing more systems that have the potential to accelerate scientific discovery and liberate scientists from tedious labor.
cs.RO / 15 / 2603.10282

Update-Free On-Policy Steering via Verifiers

Attarian, Maria, Vyse, Ian, Voelcker, Claas, Gerigk, Jasper, Opryshko, Evgenii, Almasri, Anas, Singh, Sumeet, Du, Yilun, Gilitschenski, Igor
Abstract
In recent years, Behavior Cloning (BC) has become one of the most prevalent methods for enabling robots to mimic human demonstrations. However, despite their successes, BC policies are often brittle and struggle with precise manipulation. To overcome these issues, we propose UF-OPS, an Update-Free On-Policy Steering method that enables the robot to predict the success likelihood of its actions and adapt its strategy at execution time. We accomplish this by training verifier functions using policy rollout data obtained during an initial evaluation of the policy. These verifiers are subsequently used to steer the base policy toward actions with a higher likelihood of success. Our method improves the performance of a black-box diffusion policy, without changing its base parameters, making it lightweight and flexible. We present results from both simulation and real-world data and achieve an average 49% improvement in success rate over the base policy across 5 real tasks.
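The steering step itself is simple to sketch (toy stand-ins below for both the frozen base policy and the learned verifier): sample several candidate actions, score each with the verifier, and execute the top-scoring one, with no parameter updates to the base policy.

```python
import numpy as np

rng = np.random.default_rng(1)

def base_policy_sample(n):
    """Stand-in for diffusion-policy sampling: n candidate 2-D actions."""
    return rng.normal(size=(n, 2))

def verifier(action, goal=np.array([1.0, 1.0])):
    """Stand-in success predictor: higher score the closer to the goal."""
    return -float(np.linalg.norm(np.asarray(action) - goal))

def steer(candidates):
    """Execute the candidate the verifier scores highest; no weight updates."""
    scores = [verifier(a) for a in candidates]
    return candidates[int(np.argmax(scores))]

candidates = base_policy_sample(16)
chosen = steer(candidates)
```

Because only sampling and scoring are involved, the base policy can remain a black box, which is the "update-free" property the abstract emphasizes.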
cs.RO / 16 / 2603.10306

SteadyTray: Learning Object Balancing Tasks in Humanoid Tray Transport via Residual Reinforcement Learning

Huang, Anlun, Wu, Zhenyu, Atar, Soofiyan, Zhi, Yuheng, Yip, Michael
Abstract
Stabilizing unsecured payloads against the inherent oscillations of dynamic bipedal locomotion remains a critical engineering bottleneck for humanoids in unstructured environments. To solve this, we introduce ReST-RL, a hierarchical reinforcement learning architecture that explicitly decouples locomotion from payload stabilization, evaluated via the SteadyTray benchmark. Rather than relying on monolithic end-to-end learning, our framework integrates a robust base locomotion policy with a dynamic residual module engineered to actively cancel gait-induced perturbations at the end-effector. This architectural separation ensures steady tray transport without degrading the underlying bipedal stability. In simulation, the residual design significantly outperforms end-to-end baselines in gait smoothness and orientation accuracy, achieving a 96.9% success rate in variable velocity tracking and 74.5% robustness against external force disturbances. Successfully deployed on the Unitree G1 humanoid hardware, this modular approach demonstrates highly reliable zero-shot sim-to-real generalization across various objects and external force disturbances.
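The residual decomposition above can be sketched in a few lines (the commands and the clipping budget are invented placeholders for the learned policies): the executed action is the frozen base locomotion command plus a bounded learned correction.

```python
import numpy as np

def residual_action(base_cmd, residual_cmd, limit=0.2):
    """Frozen locomotion command plus a clipped learned correction."""
    return base_cmd + np.clip(residual_cmd, -limit, limit)

base = np.array([0.5, -0.1, 0.0])      # from the base locomotion policy
res = np.array([1.0, 0.05, -0.3])      # raw residual exceeds its budget
act = residual_action(base, res)       # residual clipped to [-0.2, 0.2]
```

Bounding the residual's authority is what lets the stabilization module cancel gait-induced perturbations at the end-effector without being able to destabilize the underlying walking policy.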
cs.RO / 17 / 2603.10330

PC-Diffuser: Path-Consistent Capsule CBF Safety Filtering for Diffusion-Based Trajectory Planner

Ku, Eugene, Lyu, Yiwei
Abstract
Autonomous driving in complex traffic requires planners that generalize beyond hand-crafted rules, motivating data-driven approaches that learn behavior from expert demonstrations. Diffusion-based trajectory planners have recently shown strong closed-loop performance by iteratively denoising a full-horizon plan, but they remain difficult to certify and can fail catastrophically in rare or out-of-distribution scenarios. To address this challenge, we present PC-Diffuser, a safety augmentation framework that embeds a certifiable, path-consistent barrier-function structure directly into the denoising loop of diffusion planning. The key idea is to make safety an intrinsic part of trajectory generation rather than a post-hoc fix: we enforce forward invariance along the rollout while preserving the diffusion model's intended path geometry. Specifically, PC-Diffuser (i) evaluates collision risk using a capsule-distance barrier function that better reflects vehicle geometry and reduces unnecessary conservativeness, (ii) converts denoised waypoints into dynamically feasible motion under a kinematic bicycle model, and (iii) applies a path-consistent safety filter that eliminates residual constraint violations without geometric distortion, so the corrected plan remains close to the learned distribution. By injecting these safety-consistent corrections at every denoising step and feeding the refined trajectory back into the diffusion process, PC-Diffuser enables iterative, context-aware safeguarding instead of post-hoc repair...
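A one-dimensional toy version of a discrete-time control-barrier-function filter (a single integrator and a scalar barrier stand in for the paper's capsule-distance barrier and kinematic bicycle model): the filter minimally shrinks the nominal command so the barrier decays no faster than a fixed rate.

```python
def cbf_filter(x, u_nom, x_obs, dt=0.1, alpha=0.5):
    """Keep h(x) = x_obs - x forward-invariant for x_next = x + u*dt.

    The discrete CBF condition h_next >= (1 - alpha) * h caps the input
    at u <= alpha * h / dt; the nominal command is minimally reduced.
    """
    h = x_obs - x
    u_max = alpha * h / dt
    return min(u_nom, u_max)

x, x_obs = 0.0, 1.0
u_safe = cbf_filter(x, u_nom=20.0, x_obs=x_obs)  # aggressive nominal command
```

Applying such a correction at every denoising step, rather than once after sampling, is what distinguishes the in-the-loop safeguarding described above from post-hoc repair.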
Chinese Translation
在复杂交通环境中,自动驾驶需要超越手工规则的规划器,这促使了从专家演示中学习行为的数据驱动方法的出现。基于扩散的轨迹规划器最近通过对全时域(full-horizon)规划进行迭代去噪显示出强大的闭环性能,但在稀有或分布外场景中,它们仍然难以认证,并可能发生灾难性失败。为了解决这一挑战,我们提出了 PC-Diffuser,这是一个安全增强框架,将可认证的路径一致性障碍函数结构直接嵌入到扩散规划的去噪循环中。关键思想是将安全性作为轨迹生成的内在部分,而不是事后修复:我们在保持扩散模型预期路径几何形状的同时,在轨迹展开(rollout)过程中强制执行前向不变性。具体而言,PC-Diffuser (i) 使用胶囊距离障碍函数评估碰撞风险,该函数更好地反映车辆几何形状并减少不必要的保守性;(ii) 将去噪后的路径点转换为基于运动学自行车模型的动态可行运动;(iii) 应用路径一致的安全过滤器,消除残余约束违反而不产生几何扭曲,从而使修正后的计划保持接近学习到的分布。通过在每个去噪步骤中注入这些安全一致的修正,并将精炼的轨迹反馈到扩散过程中,PC-Diffuser 实现了迭代的、上下文感知的安全保障,而不是事后修复。
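As an illustrative aside (not the paper's implementation), the capsule-distance barrier in step (i) has a compact 2D form: a capsule is a segment with a radius, and the barrier value is the inter-segment distance minus the summed radii and a margin. All names below are our own sketch; the endpoint-distance shortcut is exact only for non-crossing segments (crossing segments are already unsafe):

```python
import math

def _seg_point_dist(ax, ay, bx, by, px, py):
    """Distance from point P to segment AB."""
    abx, aby = bx - ax, by - ay
    denom = abx * abx + aby * aby
    t = 0.0 if denom == 0 else max(0.0, min(1.0, ((px - ax) * abx + (py - ay) * aby) / denom))
    return math.hypot(px - (ax + t * abx), py - (ay + t * aby))

def capsule_barrier(seg1, r1, seg2, r2, margin=0.0):
    """h >= 0 means the two capsules keep a safe gap; h < 0 flags a violation.

    seg = ((x1, y1), (x2, y2)). For non-crossing segments the minimum
    distance is attained at an endpoint, so four endpoint-to-segment
    checks suffice in 2D."""
    (a, b), (c, d) = seg1, seg2
    dist = min(
        _seg_point_dist(*a, *b, *c),
        _seg_point_dist(*a, *b, *d),
        _seg_point_dist(*c, *d, *a),
        _seg_point_dist(*c, *d, *b),
    )
    return dist - (r1 + r2 + margin)
```

Two axis-aligned vehicle capsules 3 m apart with 1 m radii leave a 1 m barrier value; shrinking the lateral gap below the summed radii drives h negative.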
cs.RO / 18 / 2603.10352

Adaptive Manipulation Potential and Haptic Estimation for Tool-Mediated Interaction

工具介导交互的自适应操控势函数与触觉估计
Yang, Lin, Dutta, Anirvan, Ji, Yuan, Zhou, Yanxin, Shan, Shilin, Chen, Lv, Burdet, Etienne, Campolo, Domenico
Abstract
Achieving human-level dexterity in contact-rich, tool-mediated manipulation remains a significant challenge due to visual occlusion and the underdetermined nature of haptic sensing. This paper introduces a parameterized Equilibrium Manifold (EM) as a unified representation for tool-mediated interaction, and develops a closed-loop framework that integrates haptic estimation, online planning, and adaptive stiffness control. We establish a physical-geometric duality using an adaptive manipulation potential incorporating a differentiable contact model, which induces the manifold's geometric structure and ensures that complex physical interactions are encapsulated as continuous operations on the EM. Within this framework, we reformulate haptic estimation as a manifold parameter estimation problem. Specifically, a hybrid inference strategy (haptic SLAM) is employed in which discrete object shapes are classified via particle filtering, while the continuous object pose is estimated using analytical gradients for efficient optimization. By continuously updating the parameters of the manipulation potential, the framework dynamically reshapes the induced EM to guide online trajectory replanning and implement uncertainty-aware impedance control, thereby closing the perception-action loop. The system is validated through simulation and over 260 real-world screw-loosening trials. Experimental results demonstrate robust identification and manipulation success in standard scenarios while maintaining accurate tracking. Furthermore, ablation studies confirm that haptic SLAM and uncertainty-aware stiffness modulation outperform fixed impedance baselines, effectively preventing jamming during tight tolerance interactions.
Chinese Translation
在接触丰富的工具介导操控中实现人类水平的灵巧性仍然是一个重大挑战,这主要由于视觉遮挡和触觉感知的欠定性。本文引入了一种参数化的平衡流形(Equilibrium Manifold, EM),作为工具介导交互的统一表示,并开发了一个闭环框架,该框架整合了触觉估计、在线规划和自适应刚度控制。我们利用包含可微接触模型的自适应操控势函数(manipulation potential)建立了物理几何对偶性,该势函数诱导了流形的几何结构,并确保复杂的物理交互被封装为对EM的连续操作。在该框架内,我们将触觉估计重新表述为流形参数估计问题。具体而言,采用了一种混合推理策略(触觉SLAM),通过粒子滤波对离散物体形状进行分类,同时使用解析梯度来高效优化连续物体姿态的估计。通过不断更新操控势函数的参数,该框架动态重塑诱导的EM,以引导在线轨迹重新规划并实施基于不确定性的阻抗控制,从而闭合感知-行动循环。该系统通过仿真和超过260次真实世界的松螺丝试验进行了验证。实验结果表明,在标准场景中,系统能够实现稳健的识别和操控成功,同时保持准确的跟踪。此外,消融研究证实,触觉SLAM和基于不确定性的刚度调节优于固定阻抗基线,有效防止了在紧公差交互中的卡滞现象。
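The discrete half of the hybrid inference above (classifying object shape via particle filtering) can be sketched as a toy seeded particle filter. This is our own stand-in, not the paper's code: `likelihood(obs, shape)` is a hypothetical tactile measurement model supplied by the caller, and particles simply carry shape hypotheses that weighted resampling concentrates on the hypothesis consistent with the haptic stream:

```python
import random

def classify_shape(obs_seq, likelihood, shapes, n_particles=300, seed=0):
    """Toy discrete particle filter over shape hypotheses.

    Each particle is one shape label; at every observation, particles are
    resampled in proportion to likelihood(obs, shape), so mass migrates
    toward the shape that explains the tactile stream."""
    rng = random.Random(seed)
    particles = [rng.choice(shapes) for _ in range(n_particles)]
    for obs in obs_seq:
        weights = [likelihood(obs, s) for s in particles]
        particles = rng.choices(particles, weights=weights, k=n_particles)
    counts = {s: particles.count(s) for s in shapes}
    return max(counts, key=counts.get)
```

With a 9:1 likelihood ratio per reading, a handful of consistent observations is enough for the correct hypothesis to dominate the particle set.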
cs.RO / 19 / 2603.10373

Few-Shot Adaptation to Non-Stationary Environments via Latent Trend Embedding for Robotics

通过潜在趋势嵌入实现机器人对非平稳环境的少样本适应
Fujii, Yasuyuki, Kameda, Emika, Fukada, Hiroki, Mori, Yoshiki, Matsuo, Tadashi, Shimada, Nobutaka
Abstract
Robotic systems operating in real-world environments often suffer from concept shift, where the input-output relationship changes due to latent environmental factors that are not directly observable. Conventional adaptation methods update model parameters, which may cause catastrophic forgetting and incur high computational cost. This paper proposes a latent Trend ID-based framework for few-shot adaptation in non-stationary environments. Instead of modifying model weights, a low-dimensional environmental state, referred to as the Trend ID, is estimated via backpropagation while the model parameters remain fixed. To prevent overfitting caused by per-sample latent variables, we introduce temporal regularization and a state transition model that enforces smooth evolution of the latent space. Experiments on a quantitative food grasping task demonstrate that the learned Trend IDs are distributed across distinct regions of the latent space with temporally consistent trajectories, and that few-shot adaptation to unseen environments is achieved without modifying model parameters. The proposed framework provides a scalable and interpretable solution for robotics applications operating across diverse and evolving environments.
Chinese Translation
在现实环境中运行的机器人系统常常面临概念漂移的问题,即由于潜在的环境因素(这些因素并不可直接观察),输入与输出之间的关系发生变化。传统的适应方法通常会更新模型参数,这可能导致灾难性遗忘并产生高昂的计算成本。本文提出了一种基于潜在趋势ID的框架,用于在非平稳环境中进行少样本适应。该方法不修改模型权重,而是通过反向传播估计一个低维的环境状态,称为趋势ID,同时保持模型参数不变。为了防止因每个样本的潜在变量导致的过拟合,我们引入了时间正则化和状态转移模型,以确保潜在空间的平滑演变。在定量食品抓取任务上的实验表明,学习到的趋势ID在潜在空间的不同区域分布,并具有时间一致的轨迹,且在不修改模型参数的情况下实现了对未见环境的少样本适应。所提出的框架为在多样化和不断演变的环境中运行的机器人应用提供了一种可扩展且可解释的解决方案。
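The core mechanism (estimating a low-dimensional latent state by backpropagation while the model weights stay frozen) can be illustrated with a deliberately tiny stand-in model. Everything here is our own sketch under stated assumptions: the "frozen network" is just `y_hat = w * x + z`, and the temporal regularizer is a quadratic pull toward the previous Trend ID `z_prev`:

```python
def adapt_trend_id(samples, w, z_prev, lam=0.1, lr=0.2, steps=200):
    """Few-shot adaptation: only the latent z is updated; weight w is frozen.

    Toy frozen model: y_hat = w * x + z.
    Loss: mean squared error on the few-shot samples
          + lam * (z - z_prev)^2 as a temporal-smoothness regularizer."""
    z = z_prev
    for _ in range(steps):
        grad = sum(2.0 * (w * x + z - y) for x, y in samples) / len(samples)
        grad += 2.0 * lam * (z - z_prev)
        z -= lr * grad
    return z
```

For data generated with z_true = 1, the regularized optimum is z = 1/(1 + lam) when z_prev = 0, which gradient descent reaches to numerical precision.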
cs.RO / 20 / 2603.10390

ScanDP: Generalizable 3D Scanning with Diffusion Policy

ScanDP:基于扩散策略的通用化三维扫描
Hirako, Itsuki, Hakoda, Ryo, Liu, Yubin, Hwang, Matthew, Sato, Yoshihiro, Oishi, Takeshi
Abstract
Learning-based 3D Scanning plays a crucial role in enabling efficient and accurate scanning of target objects. However, recent reinforcement learning-based methods often require large-scale training data and still struggle to generalize to unseen object categories. In this work, we propose a data-efficient 3D scanning framework that uses Diffusion Policy to imitate human-like scanning strategies. To enhance robustness and generalization, we adopt Occupancy Grid Mapping instead of direct point cloud processing, offering improved noise resilience and handling of diverse object geometries. We also introduce a hybrid approach combining a sphere-based space representation with a path optimization procedure that ensures path safety and scanning efficiency. This approach addresses limitations in conventional imitation learning, such as redundant or unpredictable behavior. We evaluate our method on diverse objects unseen in both shape and scale. Our method achieves higher coverage and shorter paths than baselines, while remaining robust to sensor noise. We further confirm practical feasibility and stable operation in real-world execution.
Chinese Translation
基于学习的三维扫描在实现目标物体的高效和准确扫描中发挥着至关重要的作用。然而,近期基于强化学习的方法通常需要大规模的训练数据,并且在对未见物体类别的泛化能力上仍然存在困难。在本研究中,我们提出了一种数据高效的三维扫描框架,利用扩散策略模仿类人扫描策略。为了增强鲁棒性和泛化能力,我们采用占据网格映射(Occupancy Grid Mapping),而不是直接处理点云,从而提高了对噪声的抗性和对多样物体几何形状的处理能力。我们还引入了一种混合方法,将基于球体的空间表示与路径优化程序相结合,以确保路径安全性和扫描效率。这种方法解决了传统模仿学习中的一些局限性,例如冗余或不可预测的行为。我们在形状和规模各异的未见物体上评估了我们的方法。我们的方案在覆盖率和路径长度上均优于基线,同时对传感器噪声保持鲁棒性。我们进一步确认了在实际执行中的可行性和稳定性。
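The Occupancy Grid Mapping representation the abstract adopts is typically maintained per cell in log-odds form, which turns Bayesian fusion of noisy readings into simple addition. A minimal single-cell sketch (the standard textbook update, not this paper's specific pipeline):

```python
import math

def logodds(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))

def update_cell(l_prev, p_meas):
    """Log-odds occupancy update with a uniform 0.5 prior (log-odds 0):
    each measurement adds the log-odds of its inverse sensor model."""
    return l_prev + logodds(p_meas)

def occupancy(l):
    """Recover the occupancy probability from log-odds."""
    return 1.0 - 1.0 / (1.0 + math.exp(l))
```

Two "hit" readings at p = 0.7 raise a cell from 0.5 to 49/58 ≈ 0.845, while a subsequent "free" reading (p < 0.5) pulls the belief back down; this additive fusion is what makes the grid robust to individual noisy scans.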
cs.RO / 21 / 2603.10392

Safe Probabilistic Planning for Human-Robot Interaction using Conformal Risk Control

基于保形风险控制的人机交互安全概率规划
Gonzales, Jake, Mizuta, Kazuki, Leung, Karen, Ratliff, Lillian J.
Abstract
In this paper, we present a novel probabilistic safe control framework for human-robot interaction that combines control barrier functions (CBFs) with conformal risk control to provide formal safety guarantees while considering complex human behavior. The approach uses conformal risk control to quantify and control the prediction errors in CBF safety values and establishes formal guarantees on the probability of constraint satisfaction during interaction. We introduce an algorithm that dynamically adjusts the safety margins produced by conformal risk control based on the current interaction context. Through experiments on human-robot navigation scenarios, we demonstrate that our approach significantly reduces collision rates and safety violations as compared to baseline methods while maintaining high success rates in goal-reaching tasks and efficient control. The code, simulations, and other supplementary material can be found on the project website: https://jakeagonzales.github.io/crc-cbf-website/.
Chinese Translation
本文提出了一种新颖的人机交互安全概率控制框架,该框架结合了控制屏障函数(Control Barrier Functions, CBFs)与保形风险控制(Conformal Risk Control),在考虑复杂人类行为的同时提供正式的安全保障。该方法利用保形风险控制量化和控制CBF安全值中的预测误差,并在交互过程中建立约束满足概率的正式保障。我们引入了一种算法,该算法根据当前交互上下文动态调整保形风险控制产生的安全边际。通过在人机导航场景中的实验,我们证明了与基线方法相比,我们的方法显著降低了碰撞率和安全违规率,同时在目标达成任务中保持了高成功率和高效控制。代码、仿真和其他补充材料可以在项目网站上找到:https://jakeagonzales.github.io/crc-cbf-website/
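The margin-adjustment idea can be sketched with the standard split-conformal quantile; this is our own illustrative reduction (function names and the error convention are assumptions, not the paper's API): calibrate on past errors of the predicted barrier value, then require the new prediction to clear the inflated margin.

```python
import math

def conformal_margin(calib_errors, alpha=0.1):
    """Split-conformal quantile of barrier prediction errors
    e_i = h_predicted - h_true (how optimistic the predictor was).
    With probability >= 1 - alpha, a fresh error falls below this margin."""
    n = len(calib_errors)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float('inf')   # too few calibration points to certify
    return sorted(calib_errors)[k - 1]

def certified_safe(h_pred, margin):
    """Act only if the predicted safety value clears the inflated margin."""
    return h_pred - margin >= 0.0
```

With nine calibration errors and alpha = 0.1, the margin is the largest observed error; a context-dependent scheme like the paper's would recompute this margin as the interaction evolves.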
cs.RO / 22 / 2603.10402

Shape Control of a Planar Hyper-Redundant Robot via Hybrid Kinematics-Informed and Learning-based Approach

通过混合运动学信息和基于学习的方法控制平面超冗余机器人的形状
Song, Yuli, Li, Wenbo, Xin, Wenci, Tang, Zhiqiang, Rus, Daniela, Laschi, Cecilia
Abstract
Hyper-redundant robots offer high dexterity, making them good at operating in confined and unstructured environments. To extend the reachable workspace, we built a multi-segment flexible rack actuated planar robot. However, the compliance of the flexible mechanism introduces instability, rendering it sensitive to external and internal uncertainties. To address these limitations, we propose a hybrid kinematics-informed and learning-based shape control method, named SpatioCoupledNet. The neural network adopts a hierarchical design that explicitly captures bidirectional spatial coupling between segments while modeling local disturbance along the robot body. A confidence-gating mechanism integrates prior kinematic knowledge, allowing the controller to adaptively balance model-based and learned components for improved convergence and fidelity. The framework is validated on a five-segment planar hyper-redundant robot under three representative shape configurations. Experimental results demonstrate that the proposed method consistently outperforms both analytical and purely neural controllers. In complex scenarios, it reduces steady-state error by up to 75.5% against the analytical model, and accelerates convergence by up to 20.5% compared to the data-driven baseline. Furthermore, gating analysis reveals a state-dependent authority fusion, shifting toward data-driven predictions in unstable states, while relying on physical priors in the remaining cases. Finally, we demonstrate robust performance in a dynamic task where the robot maintains a fixed end-effector position while avoiding moving obstacles, achieving a precise tip-positioning accuracy with a mean error of 10.47 mm.
Chinese Translation
超冗余机器人具有高灵活性,适合在狭小和非结构化环境中操作。为了扩展可达工作空间,我们构建了一种多段柔性机架驱动的平面机器人。然而,柔性机制的顺应性引入了不稳定性,使其对外部和内部不确定性敏感。为了解决这些限制,我们提出了一种混合运动学信息和基于学习的形状控制方法,命名为SpatioCoupledNet。该神经网络采用分层设计,明确捕捉段与段之间的双向空间耦合,同时对机器人身体沿线的局部干扰进行建模。一个置信门控机制整合了先验运动学知识,使控制器能够自适应地平衡基于模型和学习的组件,从而提高收敛性和保真度。该框架在一个五段平面超冗余机器人上进行了验证,测试了三种代表性的形状配置。实验结果表明,所提出的方法在性能上始终优于分析控制器和纯神经控制器。在复杂场景中,相较于分析模型,它将稳态误差降低了最多75.5%,并且与数据驱动基线相比,加快了收敛速度,最多提高了20.5%。此外,门控分析揭示了一种状态依赖的控制权融合,在不稳定状态下向数据驱动预测倾斜,而在其他情况下则依赖于物理先验。最后,我们展示了在动态任务中的稳健性能,机器人在避免移动障碍物的同时保持固定的末端执行器位置,实现了平均误差为10.47毫米的精确尖端定位。
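The confidence-gating idea (blend a model-based command with a learned one, shifting authority to the data-driven side when the physical model is unreliable) can be sketched in a few lines. The sigmoid gate driven by model residual is our own hypothetical stand-in for the paper's learned gate:

```python
import math

def gate_from_residual(residual, k=10.0, thresh=0.2):
    """Toy confidence gate: a large kinematic-model residual pushes
    g toward 1, shifting authority to the learned component."""
    return 1.0 / (1.0 + math.exp(-k * (residual - thresh)))

def fused_command(u_model, u_learned, residual):
    """Convex blend of the model-based and learned control commands."""
    g = gate_from_residual(residual)
    return g * u_learned + (1.0 - g) * u_model
```

At the threshold residual the two components share authority equally; well above it the learned command dominates, and near zero residual the kinematic prior does.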
cs.RO / 23 / 2603.10407

Rethinking Gaussian Trajectory Predictors: Calibrated Uncertainty for Safe Planning

重新思考高斯轨迹预测器:安全规划的校准不确定性
Pouria, Fatemeh Cheraghi, Golchoubian, Mahsa, Driggs-Campbell, Katherine
Abstract
Accurate trajectory prediction is critical for safe autonomous navigation in crowded environments. While many trajectory predictors output Gaussian distributions to represent the multi-modal distribution over future pedestrian positions, the reliability of their confidence levels often remains unaddressed. This limitation can lead to unsafe or overly conservative motion planning when the predictor is integrated with an uncertainty-aware planner. Existing Gaussian trajectory predictors primarily rely on the Negative Log-Likelihood loss, which is prone to predict over- or under-confident distributions, and may compromise downstream planner safety. This paper introduces a novel loss function for calibrating prediction uncertainty which leverages Kernel Density Estimation to estimate the empirical distribution of confidence levels. The proposed formulation enforces consistency with the properties of a Gaussian assumption by explicitly matching the estimated empirical distribution to the Chi-squared distribution. To ensure accurate mean prediction, a Mean Squared Error term is also incorporated in the final loss formulation. Experimental results on real-world trajectory datasets show that our method significantly improves the reliability of confidence levels predicted by different State-Of-The-Art Gaussian trajectory predictors. We also demonstrate the importance of providing planners with reliable probabilistic insights (i.e. calibrated confidence levels) for collision-free navigation in complex scenarios. For this purpose, we integrate Gaussian trajectory predictors trained with our loss function with an uncertainty-aware Model Predictive Control on scenarios extracted from real-world datasets, achieving improved planning performance through calibrated confidence levels.
Chinese Translation
准确的轨迹预测对于在拥挤环境中安全的自主导航至关重要。虽然许多轨迹预测器输出高斯分布以表示未来行人位置的多模态分布,但它们的置信水平的可靠性往往未得到解决。这一局限性可能导致在与不确定性感知规划器集成时,产生不安全或过于保守的运动规划。现有的高斯轨迹预测器主要依赖负对数似然损失(Negative Log-Likelihood loss),这容易预测出过于自信或不足自信的分布,从而可能危及下游规划器的安全。本文提出了一种新颖的损失函数,用于校准预测不确定性,该函数利用核密度估计(Kernel Density Estimation)来估计置信水平的经验分布。所提出的公式通过显式匹配估计的经验分布与卡方分布(Chi-squared distribution),强制与高斯假设的性质保持一致。为了确保准确的均值预测,最终的损失公式中还加入了均方误差(Mean Squared Error)项。在真实世界轨迹数据集上的实验结果表明,我们的方法显著提高了不同最先进高斯轨迹预测器预测的置信水平的可靠性。我们还展示了为规划器提供可靠的概率洞察(即校准的置信水平)在复杂场景中实现无碰撞导航的重要性。为此,我们将使用我们损失函数训练的高斯轨迹预测器与不确定性感知模型预测控制(Model Predictive Control)集成,在从真实世界数据集中提取的场景中,通过校准的置信水平实现了改进的规划性能。
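The chi-squared consistency property the loss enforces is easy to check numerically: for a calibrated 2D Gaussian predictor, the squared Mahalanobis distance of the true position follows a chi-squared distribution with 2 degrees of freedom, whose CDF is 1 − exp(−t/2). The sketch below is our own calibration diagnostic, not the paper's KDE-based loss:

```python
import math
import random

def mahalanobis_sq(dx, dy, sx, sy):
    """Squared Mahalanobis distance under an axis-aligned 2D Gaussian."""
    return (dx / sx) ** 2 + (dy / sy) ** 2

def empirical_coverage(d2_samples, p):
    """Fraction of errors inside the p-confidence ellipse.

    For a calibrated 2D Gaussian, d^2 ~ chi-squared(2), whose p-quantile
    is t_p = -2 ln(1 - p), so this fraction should be close to p; a large
    gap indicates over- or under-confident predicted covariances."""
    t_p = -2.0 * math.log(1.0 - p)
    return sum(d2 <= t_p for d2 in d2_samples) / len(d2_samples)
```

Sampling errors from the predicted covariance itself yields coverage ≈ p at every confidence level; a predictor whose covariances are too small would show coverage well below p.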
cs.RO / 24 / 2603.10436

COHORT: Hybrid RL for Collaborative Large DNN Inference on Multi-Robot Systems Under Real-Time Constraints

COHORT:在实时约束下多机器人系统的协作大规模深度神经网络推理的混合强化学习
Anwar, Mohammad Saeid, Ravi, Anuradha, Ghosh, Indrajeet, Shinde, Gaurav, Busart, Carl, Roy, Nirmalya
Abstract
Large deep neural networks (DNNs), especially transformer-based and multimodal architectures, are computationally demanding and challenging to deploy on resource-constrained edge platforms like field robots. These challenges intensify in mission-critical scenarios (e.g., disaster response), where robots must collaborate under tight constraints on bandwidth, latency, and battery life, often without infrastructure or server support. To address these limitations, we present COHORT, a collaborative DNN inference and task-execution framework for multi-robot systems built on the Robot Operating System (ROS). COHORT employs a hybrid offline-online reinforcement learning (RL) strategy to dynamically schedule and distribute DNN module execution across robots. Our key contributions are threefold: (a) Offline RL policy learning combined with Advantage-Weighted Regression (AWR), trained on auction-based task allocation data from heterogeneous DNN workloads across distributed robots, (b) Online policy adaptation via Multi-Agent PPO (MAPPO), initialized from the offline policy and fine-tuned in real time, and (c) comprehensive evaluation of COHORT on vision-language model (VLM) inference tasks such as CLIP and SAM, analyzing scalability with increasing robots and workloads as well as robustness. We benchmark COHORT against genetic algorithms and multiple RL baselines. Experimental results demonstrate that COHORT reduces battery consumption by 15.4% and increases GPU utilization by 51.67%, while satisfying frame-rate and deadline constraints 2.55 times as often.
Chinese Translation
大型深度神经网络(DNN),尤其是基于变换器和多模态架构,计算需求高,且在资源受限的边缘平台(如现场机器人)上部署具有挑战性。这些挑战在关键任务场景(例如灾难响应)中愈加严重,机器人必须在带宽、延迟和电池寿命等紧迫约束下进行协作,通常没有基础设施或服务器支持。为了解决这些限制,我们提出了COHORT,一个基于机器人操作系统(ROS)的多机器人系统协作DNN推理和任务执行框架。COHORT采用混合离线-在线强化学习(RL)策略,动态调度和分配DNN模块在机器人之间的执行。我们的主要贡献有三方面:(a)结合优势加权回归(Advantage-Weighted Regression,AWR)的离线RL策略学习,基于分布式机器人中异构DNN工作负载的拍卖式任务分配数据进行训练;(b)通过多智能体PPO(Multi-Agent PPO,MAPPO)进行在线策略适应,从离线策略初始化并实时微调;(c)对COHORT在视觉-语言模型(Vision-Language Model,VLM)推理任务(如CLIP和SAM)上的全面评估,分析在机器人/工作负载增加时的可扩展性和鲁棒性。我们将COHORT与遗传算法和多个RL基线进行基准测试。实验结果表明,COHORT将电池消耗降低了15.4%,GPU利用率提高了51.67%,同时满足帧率和截止时间约束的能力提高了2.55倍。
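The AWR component in contribution (a) boils down to weighting each logged decision by the exponentiated advantage, so high-advantage allocations dominate the regression. A minimal sketch of that weighting (the clipping constant is our own assumption for numerical stability, not a value from the paper):

```python
import math

def awr_weights(advantages, beta=1.0, w_max=20.0):
    """Advantage-Weighted Regression weights: w_i = exp(A_i / beta),
    clipped at w_max so a single outlier advantage cannot dominate.
    beta is the temperature; smaller beta sharpens the weighting."""
    return [min(math.exp(a / beta), w_max) for a in advantages]
```

A zero-advantage sample keeps weight 1, an advantage of ln 2 doubles it, and extreme advantages saturate at the clip.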
cs.RO / 25 / 2603.10438

AsyncMDE: Real-Time Monocular Depth Estimation via Asynchronous Spatial Memory

AsyncMDE:通过异步空间记忆实现实时单目深度估计
Ma, Lianjie, Li, Yuquan, Jiang, Bingzheng, Zhong, Ziming, Ding, Han, Zhu, Lijun
Abstract
Foundation-model-based monocular depth estimation offers a viable alternative to active sensors for robot perception, yet its computational cost often prohibits deployment on edge platforms. Existing methods perform independent per-frame inference, wasting the substantial computational redundancy between adjacent viewpoints in continuous robot operation. This paper presents AsyncMDE, an asynchronous depth perception system consisting of a foundation model and a lightweight model that amortizes the foundation model's computational cost over time. The foundation model produces high-quality spatial features in the background, while the lightweight model runs asynchronously in the foreground, fusing cached memory with current observations through complementary fusion, outputting depth estimates, and autoregressively updating the memory. This enables cross-frame feature reuse with bounded accuracy degradation. At a mere 3.83M parameters, it operates at 237 FPS on an RTX 4090, recovering 77% of the accuracy gap to the foundation model while achieving a 25X parameter reduction. Validated across indoor static, dynamic, and synthetic extreme-motion benchmarks, AsyncMDE degrades gracefully between refreshes and achieves 161FPS on a Jetson AGX Orin with TensorRT, clearly demonstrating its feasibility for real-time edge deployment.
Chinese Translation
基于基础模型的单目深度估计为机器人感知提供了一种替代主动传感器的可行方案,然而其计算成本往往限制了在边缘平台上的部署。现有方法在每帧上进行独立推理,浪费了连续机器人操作中相邻视点之间的显著计算冗余。本文提出了AsyncMDE,一个异步深度感知系统,由一个基础模型和一个轻量级模型组成,后者在时间上分摊了基础模型的计算成本。基础模型在后台生成高质量的空间特征,而轻量级模型在前台异步运行,通过互补融合将缓存记忆与当前观测进行融合,输出深度估计,并自回归地更新记忆。这使得跨帧特征重用成为可能,同时将精度下降控制在有界范围内。该系统仅有3.83M参数,在RTX 4090上以237 FPS运行,弥补了与基础模型之间77%的精度差距,同时实现了25倍的参数减少。在室内静态、动态以及合成极端运动基准测试中进行验证,AsyncMDE在两次刷新之间性能平滑衰减,并在搭载TensorRT的Jetson AGX Orin上实现了161 FPS,清晰地展示了其在实时边缘部署中的可行性。
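The amortization schedule (a heavy model refreshing a cache in the background while a light model runs every frame) can be sketched as a toy synchronous loop. Scalar "features" and averaging stand in for the real models and complementary fusion; all names are our own:

```python
def run_async_pipeline(frames, refresh_every=8):
    """Toy amortized schedule: the heavy model refreshes the cached
    feature every `refresh_every` frames; the light model runs on every
    frame, fusing the cache with the current observation (here: a simple
    average as a stand-in for learned complementary fusion)."""
    heavy_calls, outputs, cache = 0, [], None
    for i, obs in enumerate(frames):
        if i % refresh_every == 0:
            cache = obs          # stand-in for foundation-model features
            heavy_calls += 1
        outputs.append(0.5 * cache + 0.5 * obs)   # stand-in for light fusion
    return outputs, heavy_calls
```

Over 16 frames with an 8-frame refresh, the heavy model runs only twice while every frame still yields an output, which is the source of the latency savings; the real system additionally overlaps the heavy pass asynchronously.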
cs.RO / 26 / 2603.10441

KnowDiffuser: A Knowledge-Guided Diffusion Planner with LM Reasoning and Prior-Informed Trajectory Initialization

KnowDiffuser:一种基于知识引导的扩散规划器,结合语言模型推理和先验信息的轨迹初始化
Ding, Fan, Luo, Xuewen, Yang, Fengze, Yu, Bo, Tew, HwaHui, Krishnasamy, Ganesh, Loo, Junn Yong
Abstract
Recent advancements in Language Models (LMs) have demonstrated strong semantic reasoning capabilities, enabling their application in high-level decision-making for autonomous driving (AD). However, LMs operate over discrete token spaces and lack the ability to generate continuous, physically feasible trajectories required for motion planning. Meanwhile, diffusion models have proven effective at generating reliable and dynamically consistent trajectories, but often lack semantic interpretability and alignment with scene-level understanding. To address these limitations, we propose \textbf{KnowDiffuser}, a knowledge-guided motion planning framework that tightly integrates the semantic understanding of language models with the generative power of diffusion models. The framework employs a language model to infer context-aware meta-actions from structured scene representations, which are then mapped to prior trajectories that anchor the subsequent denoising process. A two-stage truncated denoising mechanism refines these trajectories efficiently, preserving both semantic alignment and physical feasibility. Experiments on the nuPlan benchmark demonstrate that KnowDiffuser significantly outperforms existing planners in both open-loop and closed-loop evaluations, establishing a robust and interpretable framework that effectively bridges the semantic-to-physical gap in AD systems.
Chinese Translation
近期语言模型(LM)的进展展示了其强大的语义推理能力,使其能够应用于自主驾驶(AD)的高层决策。然而,语言模型在离散的标记空间中操作,缺乏生成连续且物理可行的轨迹所需的能力,这对于运动规划至关重要。同时,扩散模型在生成可靠且动态一致的轨迹方面已被证明有效,但往往缺乏语义可解释性和与场景理解的一致性。为了解决这些局限性,我们提出了KnowDiffuser,一种知识引导的运动规划框架,紧密结合了语言模型的语义理解与扩散模型的生成能力。该框架利用语言模型从结构化场景表示中推断上下文感知的元动作,然后将这些元动作映射到先验轨迹,以锚定后续的去噪过程。一个两阶段的截断去噪机制高效地精炼这些轨迹,既保持语义一致性,又确保物理可行性。在nuPlan基准上的实验表明,KnowDiffuser在开环和闭环评估中显著优于现有规划器,建立了一个强大且可解释的框架,有效弥合了自主驾驶系统中语义与物理之间的鸿沟。
cs.RO / 27 / 2603.10448

DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control

DiT4DiT:联合建模视频动态与动作以实现可推广的机器人控制
Ma, Teli, Zheng, Jia, Wang, Zifan, Jiang, Chuili, Cui, Andy, Liang, Junwei, Yang, Shuo
Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for robot learning, but their representations are still largely inherited from static image-text pretraining, leaving physical dynamics to be learned from comparatively limited action data. Generative video models, by contrast, encode rich spatiotemporal structure and implicit physics, making them a compelling foundation for robotic manipulation. But their potentials are not fully explored in the literature. To bridge the gap, we introduce DiT4DiT, an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. Instead of relying on reconstructed future frames, DiT4DiT extracts intermediate denoising features from the video generation process and uses them as temporally grounded conditions for action prediction. We further propose a dual flow-matching objective with decoupled timesteps and noise scales for video prediction, hidden-state extraction, and action inference, enabling coherent joint training of both modules. Across simulation and real-world benchmarks, DiT4DiT achieves state-of-the-art results, reaching average success rates of 98.6% on LIBERO and 50.8% on RoboCasa GR1 while using substantially less training data. On the Unitree G1 robot, it also delivers superior real-world performance and strong zero-shot generalization. Importantly, DiT4DiT improves sample efficiency by over 10x and speeds up convergence by up to 7x, demonstrating that video generation can serve as an effective scaling proxy for robot policy learning. We release code and models at https://dit4dit.github.io/.
Chinese Translation
视觉-语言-动作(VLA)模型作为一种有前景的机器人学习范式逐渐兴起,但其表示仍主要来源于静态图像-文本的预训练,导致物理动态的学习受到相对有限的动作数据的制约。相比之下,生成视频模型编码了丰富的时空结构和隐含的物理特性,使其成为机器人操作的有力基础。然而,其潜力在文献中尚未得到充分探索。为此,我们提出了DiT4DiT,一个端到端的视频-动作模型,它在统一的级联框架中将视频扩散变换器与动作扩散变换器相结合。DiT4DiT并不依赖于重建的未来帧,而是从视频生成过程中提取中间去噪特征,并将其作为时间上有根的条件用于动作预测。我们进一步提出了一种双流匹配目标,具有解耦的时间步和噪声尺度,用于视频预测、隐藏状态提取和动作推断,从而实现两个模块的连贯联合训练。在模拟和现实世界基准测试中,DiT4DiT达到了最先进的结果,在LIBERO上平均成功率达到98.6%,在RoboCasa GR1上达到50.8%,同时使用的训练数据显著减少。在Unitree G1机器人上,它还展现了优越的现实世界性能和强大的零样本泛化能力。重要的是,DiT4DiT将样本效率提高了超过10倍,并将收敛速度提升了最多7倍,证明视频生成可以作为机器人策略学习的有效扩展代理。我们在 https://dit4dit.github.io/ 发布了代码和模型。
cs.RO / 28 / 2603.10451

FAR-Dex: Few-shot Data Augmentation and Adaptive Residual Policy Refinement for Dexterous Manipulation

FAR-Dex:用于灵巧操作的少量数据增强与自适应残差策略优化
Bai, Yushan, Chen, Fulin, Sun, Hongzheng, Tong, Yuchuang, Li, En, Zhang, Zhengtao
Abstract
Achieving human-like dexterous manipulation through the collaboration of multi-fingered hands with robotic arms remains a longstanding challenge in robotics, primarily due to the scarcity of high-quality demonstrations and the complexity of high-dimensional action spaces. To address these challenges, we propose FAR-Dex, a hierarchical framework that integrates few-shot data augmentation with adaptive residual refinement to enable robust and precise arm-hand coordination in dexterous tasks. First, FAR-DexGen leverages the IsaacLab simulator to generate diverse and physically constrained trajectories from a few demonstrations, providing a data foundation for policy training. Second, FAR-DexRes introduces an adaptive residual module that refines policies by combining multi-step trajectory segments with observation features, thereby enhancing accuracy and robustness in manipulation scenarios. Experiments in both simulation and real-world demonstrate that FAR-Dex improves data quality by 13.4% and task success rates by 7% over state-of-the-art methods. It further achieves over 80% success in real-world tasks, enabling fine-grained dexterous manipulation with strong positional generalization.
Chinese Translation
通过多指手与机器人臂的协作实现类人灵巧操作仍然是机器人领域的一项长期挑战,主要由于高质量演示的稀缺性和高维动作空间的复杂性。为了解决这些挑战,我们提出了FAR-Dex,一个将少量数据增强与自适应残差优化相结合的分层框架,以实现灵巧任务中稳健且精确的臂手协调。首先,FAR-DexGen利用IsaacLab模拟器从少量演示中生成多样化且符合物理约束的轨迹,为策略训练提供数据基础。其次,FAR-DexRes引入了一个自适应残差模块,通过将多步轨迹片段与观察特征相结合来优化策略,从而提高操作场景中的准确性和稳健性。在模拟和真实世界的实验中,相较于最先进的方法,FAR-Dex将数据质量提高了13.4%,任务成功率提高了7%。此外,它在真实任务中实现了超过80%的成功率,使得灵巧操作能够实现细粒度的控制并具备强大的位置泛化能力。
cs.RO / 29 / 2603.10459

SUBTA: A Framework for Supported User-Guided Bimanual Teleoperation in Structured Assembly

SUBTA:一种支持用户引导的双手远程操作框架用于结构化装配
Liu, Xiao, Baskaran, Prakash, Li, Songpo, Manschitz, Simon, Ma, Wei, Ruiken, Dirk, Iba, Soshi
Abstract
In human-robot collaboration, shared autonomy enhances human performance through precise, intuitive support. Effective robotic assistance requires accurately inferring human intentions and understanding task structures to determine optimal support timing and methods. In this paper, we present SUBTA, a supported teleoperation system for bimanual assembly that couples learned intention estimation, scene-graph task planning, and context-dependent motion assists. We validate our approach through a user study (N=12) comparing standard teleoperation, motion-support only, and SUBTA. Linear mixed-effects analysis revealed that SUBTA significantly outperformed standard teleoperation in position accuracy (p<0.001, d=1.18) and orientation accuracy (p<0.001, d=1.75), while reducing mental demand (p=0.002, d=1.34). Post-experiment ratings indicate clearer, more trustworthy visual feedback and predictable interventions in SUBTA. The results demonstrate that SUBTA greatly improves both effectiveness and user experience in teleoperation.
Chinese Translation
在人机协作中,共享自主性通过精确、直观的支持提升人类表现。有效的机器人辅助需要准确推断人类意图并理解任务结构,以确定最佳的支持时机和方法。本文提出了SUBTA,一种用于双手装配的支持远程操作系统,结合了学习的意图估计、场景图任务规划和上下文相关的运动辅助。我们通过一项用户研究(N=12)验证了我们的方法,比较了标准远程操作、仅运动支持和SUBTA。线性混合效应分析显示,SUBTA在位置精度(p<0.001,d=1.18)和方向精度(p<0.001,d=1.75)方面显著优于标准远程操作,同时降低了心理负担(p=0.002,d=1.34)。实验后评分表明,SUBTA提供了更清晰、更可信的视觉反馈和可预测的干预。结果表明,SUBTA在远程操作中显著提高了有效性和用户体验。
cs.RO / 30 / 2603.10469

DepthCache: Depth-Guided Training-Free Visual Token Merging for Vision-Language-Action Model Inference

DepthCache:基于深度引导的无训练视觉标记合并用于视觉-语言-动作模型推理
Li, Yuquan, Ma, Lianjie, Ding, Han, Zhu, Lijun
Abstract
Vision-Language-Action (VLA) models enable generalist robotic manipulation but suffer from high inference latency. This bottleneck stems from the massive number of visual tokens processed by large language backbones. Existing methods either prune or merge tokens uniformly, degrading the spatial reasoning essential for robotic control. We present DepthCache, a training-free framework that leverages depth as a structural prior for visual token compression. It partitions observations into depth-based regions and applies spatially differentiated merge ratios, preserving the near-field workspace while compressing the distant background. To exploit temporal redundancy, DepthCache distributes the merging process across consecutive frames, ensuring consistent representations while reducing per-step computation. A motion-adaptive pipeline further optimizes auxiliary view compression based on end-effector dynamics. The framework requires no model modification, generalizing across diverse VLA architectures. On the LIBERO benchmark, DepthCache achieves up to 1.28x inference speedup with less than 1% average success rate degradation across three VLA models (pi_0.5, OpenVLA, GR00T), whereas pruning and merging baselines incur 4--24% degradation at comparable compression. Real-world experiments on a physical manipulator demonstrate that DepthCache enables faster task throughput and more responsive closed-loop control in latency-sensitive scenarios.
Chinese Translation
视觉-语言-动作(VLA)模型使通用机器人操作成为可能,但面临高推理延迟的问题。这一瓶颈源于大型语言骨干网络处理的海量视觉标记。现有方法要么均匀地修剪或合并标记,导致空间推理能力下降,这对机器人控制至关重要。我们提出了DepthCache,这是一种无训练框架,利用深度作为视觉标记压缩的结构先验。它将观察数据划分为基于深度的区域,并应用空间差异化的合并比例,保留近场工作空间的同时压缩远处背景。为了利用时间冗余,DepthCache将合并过程分布在连续帧之间,确保一致的表示,同时减少每步计算。一个运动自适应管道进一步优化基于末端执行器动态的辅助视图压缩。该框架无需模型修改,能够在多种VLA架构中进行泛化。在LIBERO基准测试中,DepthCache在三个VLA模型(pi_0.5、OpenVLA、GR00T)上实现了最高1.28倍的推理加速,且平均成功率下降不到1%,而修剪和合并基线在可比压缩下则导致4%至24%的性能下降。在物理操控器上的实际实验表明,DepthCache能够在对延迟敏感的场景中实现更快的任务吞吐量和更灵敏的闭环控制。
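The spatially differentiated merge ratio is the crux: near-field tokens (the workspace) are kept, far-field tokens (the background) are compressed harder. The sketch below is our own toy on scalar "tokens"; real token merging operates on feature vectors and picks merge partners by similarity rather than adjacency:

```python
def depth_guided_merge(tokens, depths, near_thresh):
    """Keep every near-field token; halve the far-field set by averaging
    consecutive pairs (a crude stand-in for similarity-based merging)."""
    near = [t for t, d in zip(tokens, depths) if d <= near_thresh]
    far = [t for t, d in zip(tokens, depths) if d > near_thresh]
    merged = [(far[i] + far[i + 1]) / 2.0 for i in range(0, len(far) - 1, 2)]
    if len(far) % 2:
        merged.append(far[-1])   # odd leftover far token kept as-is
    return near + merged
```

Six tokens with two in the near field compress to four: the workspace tokens survive untouched while the background shrinks by half, which is where the inference speedup comes from.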
cs.RO / 31 / 2603.10529

BinWalker: Development and Field Evaluation of a Quadruped Manipulator Platform for Sustainable Litter Collection

BinWalker:可持续垃圾收集的四足机械手平台的开发与现场评估
Turrisi, Giulio, Bratta, Angelo, Minelli, Giovanni, Abati, Gabriel Fischer, Rad, Amir H., Soares, João Carlos Virgolino, Semini, Claudio
Abstract
Litter pollution represents a growing environmental problem affecting natural and urban ecosystems worldwide. Waste discarded in public spaces often accumulates in areas that are difficult to access, such as uneven terrains, coastal environments, parks, and roadside vegetation. Over time, these materials degrade and release harmful substances, including toxic chemicals and microplastics, which can contaminate soil and water and pose serious threats to wildlife and human health. Despite increasing awareness of the problem, litter collection is still largely performed manually by human operators, making large-scale cleanup operations labor-intensive, time-consuming, and costly. Robotic solutions have the potential to support and partially automate environmental cleanup tasks. In this work, we present a quadruped robotic system designed for autonomous litter collection in challenging outdoor scenarios. The robot combines the mobility advantages of legged locomotion with a manipulation system consisting of a robotic arm and an onboard litter container. This configuration enables the robot to detect, grasp, and store litter items while navigating through uneven terrains. The proposed system aims to demonstrate the feasibility of integrating perception, locomotion, and manipulation on a legged robotic platform for environmental cleanup tasks. Experimental evaluations conducted in outdoor scenarios highlight the effectiveness of the approach and its potential for assisting large-scale litter removal operations in environments that are difficult to reach with traditional robotic platforms. The code associated with this work can be found at: https://github.com/iit-DLSLab/trash-collection-isaaclab.
Chinese Translation
垃圾污染是一个日益严重的环境问题,影响着全球的自然和城市生态系统。公共场所丢弃的废物常常积聚在难以到达的区域,如不平坦的地形、沿海环境、公园和路边植被。随着时间的推移,这些材料会降解并释放有害物质,包括有毒化学物质和微塑料,这些物质可能污染土壤和水源,并对野生动物和人类健康构成严重威胁。尽管人们对这一问题的认识不断提高,垃圾收集仍主要由人工操作进行,这使得大规模清理工作劳动密集、耗时且成本高昂。机器人解决方案有潜力支持并部分自动化环境清理任务。在本研究中,我们提出了一种四足机器人系统,旨在应对具有挑战性的户外场景中的自主垃圾收集。该机器人结合了腿部运动的机动性优势和由机器人手臂及车载垃圾容器组成的操作系统。这种配置使机器人能够在不平坦的地形中探测、抓取和存储垃圾物品。所提出的系统旨在展示在四足机器人平台上集成感知、运动和操作以进行环境清理任务的可行性。在户外场景中进行的实验评估突显了该方法的有效性及其在传统机器人平台难以到达的环境中辅助大规模垃圾清除操作的潜力。与本研究相关的代码可以在以下链接找到:https://github.com/iit-DLSLab/trash-collection-isaaclab。
cs.RO / 32 / 2603.10565

TacLoc: Global Tactile Localization on Objects from a Registration Perspective

TacLoc:从配准角度实现物体上的全局触觉定位
Zhang, Zirui, Zhang, Boyang, Zhang, Fumin, Yin, Huan
Abstract
Pose estimation is essential for robotic manipulation, particularly when visual perception is occluded during gripper-object interactions. Existing tactile-based methods generally rely on tactile simulation or pre-trained models, which limits their generalizability and efficiency. In this study, we propose TacLoc, a novel tactile localization framework that formulates the problem as a one-shot point cloud registration task. TacLoc introduces a graph-theoretic partial-to-full registration method, leveraging dense point clouds and surface normals from tactile sensing for efficient and accurate pose estimation. Without requiring rendered data or pre-trained models, TacLoc achieves improved performance through normal-guided graph pruning and a hypothesis-and-verification pipeline. TacLoc is evaluated extensively on the YCB dataset. We further demonstrate TacLoc on real-world objects across two different visual-tactile sensors.
Chinese Translation
姿态估计对于机器人操作至关重要,尤其是在夹具与物体交互时视觉感知被遮挡的情况下。现有的基于触觉的方法通常依赖于触觉模拟或预训练模型,这限制了它们的泛化能力和效率。在本研究中,我们提出了TacLoc,一种新颖的触觉定位框架,将问题表述为一次性点云配准任务。TacLoc引入了一种图论的部分到完整配准方法,利用来自触觉传感的密集点云和表面法线,实现高效且准确的姿态估计。TacLoc不需要渲染数据或预训练模型,通过法线引导的图修剪和假设-验证管道实现了性能的提升。TacLoc在YCB数据集上进行了广泛评估。我们进一步在两个不同的视觉-触觉传感器上展示了TacLoc在真实物体上的应用。
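At the heart of any registration-based localizer is the closed-form rigid alignment of corresponded points. As a self-contained 2D illustration (the paper works with 3D point clouds, normals, and a graph-theoretic partial-to-full matcher; this sketch assumes correspondences are already given):

```python
import math

def fit_rigid_2d(src, dst):
    """Closed-form 2D rigid alignment (rotation theta + translation t)
    for corresponded point pairs: center both sets, recover theta from
    the cross/dot correlation sums, then solve for the translation."""
    n = len(src)
    cxs, cys = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    cxd, cyd = sum(p[0] for p in dst) / n, sum(p[1] for p in dst) / n
    s_cos = s_sin = 0.0
    for (x, y), (u, v) in zip(src, dst):
        ax, ay = x - cxs, y - cys
        bx, by = u - cxd, v - cyd
        s_cos += ax * bx + ay * by
        s_sin += ax * by - ay * bx
    theta = math.atan2(s_sin, s_cos)
    tx = cxd - (cxs * math.cos(theta) - cys * math.sin(theta))
    ty = cyd - (cxs * math.sin(theta) + cys * math.cos(theta))
    return theta, tx, ty
```

Feeding points rotated by 90 degrees and shifted by (2, 3) recovers exactly that transform; a full registration pipeline wraps such a solver in hypothesis generation and verification over candidate correspondences.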
cs.RO / 33 / 2603.10572

Safety-critical Control Under Partial Observability: Reach-Avoid POMDP meets Belief Space Control

部分可观测下的安全关键控制:到达-避免 POMDP 与信念空间控制相结合
Vahs, Matti, Verhagen, Joris, Tumova, Jana
Abstract
Partially Observable Markov Decision Processes (POMDPs) provide a principled framework for robot decision-making under uncertainty. Solving reach-avoid POMDPs, however, requires coordinating three distinct behaviors: goal reaching, safety, and active information gathering to reduce uncertainty. Existing online POMDP solvers attempt to address all three within a single belief tree search, but this unified approach struggles with the conflicting time scales inherent to these objectives. We propose a layered, certificate-based control architecture that operates directly in belief space, decoupling goal reaching, information gathering, and safety into modular components. We introduce Belief Control Lyapunov Functions (BCLFs) that formalize information gathering as a Lyapunov convergence problem in belief space, and show how they can be learned via reinforcement learning. For safety, we develop Belief Control Barrier Functions (BCBFs) that leverage conformal prediction to provide probabilistic safety guarantees over finite horizons. The resulting control synthesis reduces to lightweight quadratic programs solvable in real time, even for non-Gaussian belief representations with dimension $>10^4$. Experiments in simulation and on a space-robotics platform demonstrate real-time performance and improved safety and task success compared to state-of-the-art constrained POMDP solvers.
Chinese Translation
部分可观测马尔可夫决策过程(POMDP)为不确定性下的机器人决策提供了一个原则性框架。然而,解决到达-避免 POMDP 需要协调三种不同的行为:目标到达、安全性和主动信息收集以减少不确定性。现有的在线 POMDP 求解器试图在单一的信念树搜索中解决这三者,但这种统一的方法在这些目标固有的冲突时间尺度上面临挑战。我们提出了一种分层的基于证书的控制架构,直接在信念空间中操作,将目标到达、信息收集和安全性解耦为模块化组件。我们引入了信念控制李雅普诺夫函数(BCLFs),将信息收集形式化为信念空间中的李雅普诺夫收敛问题,并展示了如何通过强化学习进行学习。为了确保安全性,我们开发了信念控制障碍函数(BCBFs),利用保形预测提供有限时间范围内的概率安全保证。最终的控制合成简化为轻量级的二次规划,可以实时求解,即使对于维度大于 $10^4$ 的非高斯信念表示。仿真和空间机器人平台上的实验表明,与最先进的约束 POMDP 求解器相比,所提出的方法在实时性能、安全性和任务成功率上都有所提升。
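In the simplest deterministic, single-constraint case, the barrier-function control synthesis the abstract describes reduces to a standard CBF quadratic program, which admits a closed-form half-space projection. A minimal sketch of that generic CBF-QP (not the authors' belief-space BCBF; function name and the 1-D example are mine):

```python
def cbf_qp_closed_form(u_nom, grad_h, h, alpha=1.0):
    """Solve min ||u - u_nom||^2  s.t.  grad_h . u >= -alpha * h.

    With a single affine constraint the QP is a projection onto a
    half-space, so no numerical solver is needed.
    """
    a_dot_u = sum(g * u for g, u in zip(grad_h, u_nom))
    slack = a_dot_u + alpha * h          # constraint residual at u_nom
    if slack >= 0.0:                     # nominal input already safe
        return list(u_nom)
    norm_sq = sum(g * g for g in grad_h)
    lam = -slack / norm_sq               # active-constraint multiplier
    return [u + lam * g for u, g in zip(u_nom, grad_h)]

# 1-D example: obstacle at x = 1, barrier h(x) = 1 - x, state x = 0.5,
# so h = 0.5 and grad_h = [-1]; the nominal input pushes at the obstacle.
u_safe = cbf_qp_closed_form([2.0], grad_h=[-1.0], h=0.5, alpha=1.0)
```

The filtered input lands exactly on the constraint boundary; the paper's actual pipeline replaces the scalar state with a high-dimensional belief and adds conformal-prediction bounds on top of this QP structure.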
cs.RO / 34 / 2603.10597

Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction

恢复以预测:用于可变长度轨迹预测的渐进式回顾学习
Zhou, Hao, Qi, Lu, Li, Jason, Zhang, Jie, Liu, Yi, Yang, Xu, Fan, Mingyu, Luo, Fei
Abstract
Trajectory prediction is critical for autonomous driving, enabling safe and efficient planning in dense, dynamic traffic. Most existing methods optimize prediction accuracy under fixed-length observations. However, real-world driving often yields variable-length, incomplete observations, posing a challenge to these methods. A common strategy is to directly map features from incomplete observations to those from complete ones. This one-shot mapping, however, struggles to learn accurate representations for short trajectories due to significant information gaps. To address this issue, we propose a Progressive Retrospective Framework (PRF), which gradually aligns features from incomplete observations with those from complete ones via a cascade of retrospective units. Each unit consists of a Retrospective Distillation Module (RDM) and a Retrospective Prediction Module (RPM), where RDM distills features and RPM recovers previous timesteps using the distilled features. Moreover, we propose a Rolling-Start Training Strategy (RSTS) that enhances data efficiency during PRF training. PRF is plug-and-play with existing methods. Extensive experiments on datasets Argoverse 2 and Argoverse 1 demonstrate the effectiveness of PRF. Code is available at https://github.com/zhouhao94/PRF.
Chinese Translation
轨迹预测对于自动驾驶至关重要,它能够在密集动态交通中实现安全高效的规划。现有大多数方法在固定长度观测下优化预测准确性。然而,现实世界驾驶常常产生可变长度的不完整观测,这对这些方法构成了挑战。一种常见的策略是直接将不完整观测的特征映射到完整观测的特征上。然而,这种一次性映射在短轨迹上难以学习准确的表示,因为存在显著的信息缺口。为了解决这个问题,我们提出了渐进式回顾框架(Progressive Retrospective Framework, PRF),该框架通过一系列回顾单元逐步对齐不完整观测的特征与完整观测的特征。每个单元由回顾蒸馏模块(Retrospective Distillation Module, RDM)和回顾预测模块(Retrospective Prediction Module, RPM)组成,其中RDM负责蒸馏特征,RPM则利用蒸馏后的特征恢复先前的时间步。此外,我们提出了一种滚动启动训练策略(Rolling-Start Training Strategy, RSTS),以提高PRF训练过程中的数据效率。PRF可以与现有方法无缝结合。在Argoverse 2和Argoverse 1数据集上的大量实验表明了PRF的有效性。代码可在 https://github.com/zhouhao94/PRF 获取。
cs.RO / 35 / 2603.10609

Learning Bimanual Cloth Manipulation with Vision-based Tactile Sensing via Single Robotic Arm

基于视觉触觉感知的单臂双手布料操作学习
Lee, Dongmyoung, Chen, Wei, Chen, Xiaoshuai, Zong, Rui, Kormushev, Petar
Abstract
Robotic cloth manipulation remains challenging due to the high-dimensional state space of fabrics, their deformable nature, and frequent occlusions that limit vision-based sensing. Although dual-arm systems can mitigate some of these issues, they increase hardware and control complexity. This paper presents Touch G.O.G., a compact vision-based tactile gripper and perception/control framework for single-arm bimanual cloth manipulation. The proposed framework combines three key components: (1) a novel gripper design and control strategy for in-gripper cloth sliding with a single robot arm, (2) a Vision Foundation Model-backboned Vision Transformer pipeline for cloth part classification (PC-Net) and edge pose estimation (PE-Net) using real and synthetic tactile images, and (3) an encoder-decoder synthetic data generator (SD-Net) that reduces manual annotation by producing high-fidelity tactile images. Experiments show 96% accuracy in distinguishing edges, corners, interior regions, and grasp failures, together with sub-millimeter edge localization and 4.5{\deg} orientation error. Real-world results demonstrate reliable cloth unfolding, even for crumpled fabrics, using only a single robotic arm. These results highlight Touch G.O.G. as a compact and cost-effective solution for deformable object manipulation.
Chinese Translation
机器人布料操作仍然面临挑战,这主要由于布料的高维状态空间、可变形特性以及频繁的遮挡现象限制了基于视觉的感知。尽管双臂系统可以缓解一些这些问题,但它们增加了硬件和控制的复杂性。本文提出了Touch G.O.G.,一种紧凑的基于视觉的触觉夹持器及其感知/控制框架,旨在实现单臂双手布料操作。所提出的框架结合了三个关键组件:(1)一种新颖的夹持器设计和控制策略,用于单个机器人臂上的夹持器内布料滑动;(2)基于视觉基础模型的视觉变换器管道,用于布料部分分类(PC-Net)和边缘姿态估计(PE-Net),利用真实和合成的触觉图像;(3)一种编码器-解码器合成数据生成器(SD-Net),通过生成高保真触觉图像来减少人工标注。实验表明,在区分边缘、角落、内部区域和抓取失败方面的准确率达到96%,边缘定位精度在亚毫米级,方向误差为4.5°。实际结果展示了即使对于皱褶布料,使用单个机器人臂也能可靠地展开布料。这些结果突显了Touch G.O.G.作为一种紧凑且具有成本效益的可变形物体操作解决方案的潜力。
cs.RO / 36 / 2603.10616

AdaClearGrasp: Learning Adaptive Clearing for Zero-Shot Robust Dexterous Grasping in Densely Cluttered Environments

AdaClearGrasp:在密集杂乱环境中学习自适应清理以实现零样本稳健灵巧抓取
Chen, Zixuan, Zhang, Wenquan, Fang, Jing, Zeng, Ruiming, Xu, Zhixuan, Hou, Yiwen, Wang, Xinke, Shi, Jieqi, Huo, Jing, Gao, Yang
Abstract
In densely cluttered environments, physical interference, visual occlusions, and unstable contacts often cause direct dexterous grasping to fail, while aggressive singulation strategies may compromise safety. Enabling robots to adaptively decide whether to clear surrounding objects or directly grasp the target is therefore crucial for robust manipulation. We propose AdaClearGrasp, a closed-loop decision-execution framework for adaptive clearing and zero-shot dexterous grasping in densely cluttered environments. The framework formulates manipulation as a controllable high-level decision process that determines whether to directly grasp the target or first clear surrounding objects. A pretrained vision-language model (VLM) interprets visual observations and language task descriptions to reason about grasp interference and generate a high-level planning skeleton, which invokes structured atomic skills through a unified action interface. For dexterous grasping, we train a reinforcement learning policy with a relative hand-object distance representation, enabling zero-shot generalization across diverse object geometries and physical properties. During execution, visual feedback monitors outcomes and triggers replanning upon failures, forming a closed-loop correction mechanism. To evaluate language-conditioned dexterous grasping in clutter, we introduce Clutter-Bench, the first simulation benchmark with graded clutter complexity. It includes seven target objects across three clutter levels, yielding 210 task scenarios. We further perform sim-to-real experiments on three objects under three clutter levels (18 scenarios). Results demonstrate that AdaClearGrasp significantly improves grasp success rates in densely cluttered environments. For more videos and code, please visit our project website: https://chenzixuan99.github.io/adaclear-grasp.github.io/.
Chinese Translation
在密集杂乱的环境中,物理干扰、视觉遮挡和不稳定的接触常常导致直接的灵巧抓取失败,而激进的分离(singulation)策略可能会影响安全性。因此,使机器人能够自适应地决定是清理周围物体还是直接抓取目标,对于稳健的操作至关重要。我们提出了AdaClearGrasp,一个用于自适应清理和零样本灵巧抓取的闭环决策执行框架,适用于密集杂乱的环境。该框架将操作过程形式化为一个可控的高层决策过程,以决定是直接抓取目标还是首先清理周围物体。一个预训练的视觉-语言模型(VLM)解释视觉观察和语言任务描述,以推理抓取干扰并生成高层规划骨架,通过统一的动作接口调用结构化原子技能。对于灵巧抓取,我们训练了一个强化学习策略,利用相对手-物体距离表示,从而实现对多样物体几何形状和物理属性的零样本泛化。在执行过程中,视觉反馈监控结果并在失败时触发重新规划,形成一个闭环修正机制。为了评估在杂乱环境中基于语言的灵巧抓取,我们引入了Clutter-Bench,这是第一个具有分级杂乱复杂度的仿真基准。它包括七个目标物体,分为三个杂乱等级,共产生210个任务场景。我们还在三个物体的三个杂乱等级下(18个场景)进行了仿真到现实的实验。结果表明,AdaClearGrasp显著提高了在密集杂乱环境中的抓取成功率。有关更多视频和代码,请访问我们的项目网站:https://chenzixuan99.github.io/adaclear-grasp.github.io/
cs.RO / 37 / 2603.10651

Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions

交错调度与运动规划结合符号时空运动抽象的增量学习
Tosello, Elisa, Bit-Monnot, Arthur, Lusuardi, Davide, Valentini, Alessandro, Micheli, Andrea
Abstract
Task and Motion Planning combines high-level task sequencing (what to do) with low-level motion planning (how to do it) to generate feasible, collision-free execution plans. However, in many real-world domains, such as automated warehouses, tasks are predefined, shifting the challenge to if, when, and how to execute them safely and efficiently under resource, time and motion constraints. In this paper, we formalize this as the Scheduling and Motion Planning problem for multi-object navigation in shared workspaces. We propose a novel solution framework that interleaves off-the-shelf schedulers and motion planners in an incremental learning loop. The scheduler generates candidate plans, while the motion planner checks feasibility and returns symbolic feedback, i.e., spatial conflicts and timing adjustments, to guide the scheduler towards motion-feasible solutions. We validate our proposal on logistics and job-shop scheduling benchmarks augmented with motion tasks, using state-of-the-art schedulers and sampling-based motion planners. Our results show the effectiveness of our framework in generating valid plans under complex temporal and spatial constraints, where synchronized motion is critical.
Chinese Translation
任务与运动规划结合了高层次的任务序列(做什么)与低层次的运动规划(如何做),以生成可行的、无碰撞的执行计划。然而,在许多现实世界的领域中,例如自动化仓库,任务是预定义的,这将挑战转移到是否、何时以及如何在资源、时间和运动约束下安全高效地执行这些任务。在本文中,我们将其形式化为共享工作空间中多目标导航的调度与运动规划问题。我们提出了一种新颖的解决框架,该框架在增量学习循环中交错使用现成的调度器和运动规划器。调度器生成候选计划,而运动规划器检查可行性并返回符号反馈,即空间冲突和时间调整,以指导调度器朝向运动可行的解决方案。我们在增强了运动任务的物流和作业车间调度基准上验证了我们的提案,使用了最先进的调度器和基于采样的运动规划器。结果表明,在同步运动至关重要的复杂时间和空间约束下,我们的框架能够有效生成有效计划。
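The scheduler/motion-planner loop the abstract describes is a counterexample-guided refinement: the scheduler proposes a plan, a feasibility check returns a symbolic conflict, and the conflict becomes a new scheduling constraint. A toy sketch of that loop under my own simplifications (brute-force "scheduler", a mock corridor-sharing feasibility check, two robots; all names hypothetical):

```python
import itertools

def toy_scheduler(jobs, no_overlap_pairs):
    """Enumerate integer start times and return the first schedule
    satisfying all learned no-overlap constraints (a stand-in for a
    real off-the-shelf scheduler)."""
    durations = dict(jobs)
    names = [name for name, _ in jobs]
    for starts in itertools.product(range(10), repeat=len(names)):
        sched = dict(zip(names, starts))
        if all(sched[a] + durations[a] <= sched[b] or
               sched[b] + durations[b] <= sched[a]
               for a, b in no_overlap_pairs):
            return sched
    return None

def motion_feasible(schedule, durations, corridor_users):
    """Mock motion check: two robots occupying the shared corridor at
    the same time is a spatial conflict, reported symbolically."""
    for a, b in itertools.combinations(corridor_users, 2):
        if not (schedule[a] + durations[a] <= schedule[b] or
                schedule[b] + durations[b] <= schedule[a]):
            return (a, b)              # the conflicting pair
    return None

jobs = [("r1", 3), ("r2", 2)]          # (robot, traversal duration)
durations = dict(jobs)
constraints = []
while True:
    sched = toy_scheduler(jobs, constraints)
    conflict = motion_feasible(sched, durations, ["r1", "r2"])
    if conflict is None:
        break
    constraints.append(conflict)       # symbolic feedback to the scheduler
```

The first candidate starts both robots at t = 0; the conflict is fed back, and the second round serializes the corridor. The paper's version replaces both mocks with real schedulers and sampling-based motion planners, and the feedback also carries timing adjustments.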
cs.RO / 38 / 2603.10660

STM32-Based Smart Waste Bin for Hygienic Disposal Using Embedded Sensing and Automated Control

基于STM32的智能垃圾箱:使用嵌入式传感和自动控制实现卫生处置
Bhuiyan, Mohammed Aman, Saswato, Aritra Islam, Khan, Md. Misbah, Paul, Anish, Dhrubo, Ahmed Faizul Haque, Qayum, Mohammad Abdul
Abstract
The increasing demand for hygienic and contactless solutions in public and private environments has encouraged the development of automated systems for everyday applications. This paper presents the design and implementation of a motion- sensing automatic waste bin using an STM32 microcontroller, ultrasonic sensors, and a servo motor. The system detects user presence through ultrasonic sensing and automatically opens the bin lid using a servo motor controlled by the microcontroller. An additional ultrasonic sensor is used to monitor the internal waste level of the bin, while an OLED display provides real- time feedback regarding system status. The proposed system offers a low-cost, reliable, and easily deployable solution for touch-free waste disposal. Experimental evaluation demonstrates fast response time, stable sensing performance, and smooth mechanical operation. The system can be effectively deployed in homes, educational institutions, hospitals, and public facilities to improve hygiene and user convenience.
Chinese Translation
对公共和私人环境中卫生和无接触解决方案的日益需求促进了日常应用自动化系统的发展。本文介绍了一种基于STM32微控制器、超声波传感器和伺服电机的运动感应自动垃圾箱的设计与实现。该系统通过超声波传感检测用户的存在,并利用微控制器控制的伺服电机自动打开垃圾箱盖。另一个超声波传感器用于监测垃圾箱内部的垃圾水平,同时OLED显示屏提供关于系统状态的实时反馈。所提系统提供了一种低成本、可靠且易于部署的无接触垃圾处置解决方案。实验评估表明,该系统具有快速响应时间、稳定的传感性能和顺畅的机械操作。该系统可以有效部署在家庭、教育机构、医院和公共设施中,以提高卫生水平和用户便利性。
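The decision logic described (presence threshold opens the lid; a second sensor maps inner distance to a fill level) is simple enough to sketch. An illustrative Python rendering under assumed thresholds, purely for exposition; the actual firmware runs in C on the STM32, and both constants are hypothetical:

```python
OPEN_THRESHOLD_CM = 30   # assumed: a hand closer than this opens the lid
BIN_DEPTH_CM = 50        # assumed: inner sensor reading when the bin is empty

def lid_should_open(hand_distance_cm):
    """Threshold test on the outer ultrasonic reading."""
    return hand_distance_cm < OPEN_THRESHOLD_CM

def fill_level_pct(inner_distance_cm):
    """Map the inner ultrasonic reading to a 0-100% fill level
    shown on the OLED (clamped to the valid range)."""
    level = max(0.0, min(1.0, 1.0 - inner_distance_cm / BIN_DEPTH_CM))
    return round(100 * level)
```

A real implementation would add debouncing/hysteresis around the threshold and a hold-open timer for the servo, which this sketch omits.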
cs.RO / 39 / 2603.10670

Dynamic Modeling and Attitude Control of a Reaction-Wheel-Based Low-Gravity Bipedal Hopper

基于反应轮的低重力双足跳跃机器人的动态建模与姿态控制
Hari, Shriram, Nikhil, M Venkata Sai, Kumar, R Prasanth
Abstract
Planetary bodies characterized by low gravitational acceleration, such as the Moon and near-Earth asteroids, impose unique locomotion constraints due to diminished contact forces and extended airborne intervals. Among traversal strategies, hopping locomotion offers high energy efficiency but is prone to mid-flight attitude instability caused by asymmetric thrust generation and uneven terrain interactions. This paper presents an underactuated bipedal hopping robot that employs an internal reaction wheel to regulate body posture during the ballistic flight phase. The system is modeled as a gyrostat, enabling analysis of the dynamic coupling between torso rotation and reaction wheel momentum. The locomotion cycle comprises three phases: a leg-driven propulsive jump, mid-air attitude stabilization via an active momentum exchange controller, and a shock-absorbing landing. A reduced-order model is developed to capture the critical coupling between torso rotation and reaction wheel dynamics. The proposed framework is evaluated in MuJoCo-based simulations under lunar gravity conditions (g = 1.625 m/s^2). Results demonstrate that activation of the reaction wheel controller reduces peak mid-air angular deviation by more than 65% and constrains landing attitude error to within 3.5 degrees at touchdown. Additionally, actuator saturation per hop cycle is reduced, ensuring sufficient control authority. Overall, the approach significantly mitigates in-flight attitude excursions and enables consistent upright landings, providing a practical and control-efficient solution for locomotion on irregular extraterrestrial terrains.
Chinese Translation
低重力加速度的行星体,如月球和近地小行星,由于接触力减小和空中停留时间延长,施加了独特的运动约束。在各种移动策略中,跳跃运动提供了高能量效率,但由于不对称的推力产生和不平坦的地形交互,容易导致飞行中的姿态不稳定。本文提出了一种欠驱动的双足跳跃机器人,该机器人利用内部反应轮在弹道飞行阶段调节身体姿态。该系统被建模为陀螺体(gyrostat),从而能够分析躯干旋转与反应轮动量之间的动态耦合。运动周期包括三个阶段:腿驱动的推进跳跃、通过主动动量交换控制器进行的空中姿态稳定,以及减震着陆。我们开发了一个降阶模型,以捕捉躯干旋转与反应轮动力学之间的关键耦合。所提出的框架在基于MuJoCo的模拟中评估,模拟条件为月球重力(g = 1.625 m/s^2)。结果表明,激活反应轮控制器可以将空中峰值角度偏差减少超过65%,并将着陆姿态误差限制在触地时的3.5度以内。此外,每次跳跃周期的执行器饱和程度降低,保证了充足的控制权限。总体而言,该方法显著减轻了飞行中的姿态偏移,并实现了一致的直立着陆,为在不规则的地外地形上的运动提供了一种实用且控制高效的解决方案。
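The momentum-exchange principle behind the mid-air attitude controller follows from conservation of the gyrostat's angular momentum during ballistic flight. A hedged planar sketch with symbols of my own choosing ($I_b$, $I_w$: torso and wheel inertias; $\omega_b$: torso rate; $\omega_w$: wheel rate relative to the torso; $\tau_w$: wheel motor torque), not the paper's exact reduced-order model:

```latex
% Flight phase: no external torque, so total angular momentum is constant
H = I_b\,\omega_b + I_w\,(\omega_b + \omega_w) = \mathrm{const.}
% The motor torque acts internally on the wheel:
I_w\,(\dot{\omega}_b + \dot{\omega}_w) = \tau_w
\quad\Longrightarrow\quad
I_b\,\dot{\omega}_b = -\tau_w
```

Spinning up the wheel thus produces an equal and opposite reaction torque on the torso, which is the control authority the paper's momentum-exchange controller exploits.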
cs.RO / 40 / 2603.10675

Cybo-Waiter: A Physical Agentic Framework for Humanoid Whole-Body Locomotion-Manipulation

Cybo-Waiter:一种用于类人全身运动-操作的物理代理框架
Ren, Peng, Ge, Haoyang, Qi, Chuan, Huang, Cong, Li, Hong, Zhao, Jiang, Chi, Pei, Chen, Kai
Abstract
Robots are increasingly expected to execute open ended natural language requests in human environments, which demands reliable long horizon execution under partial observability. This is especially challenging for humanoids because locomotion and manipulation are tightly coupled through stance, reachability, and balance. We present a humanoid agent framework that turns VLM plans into verifiable task programs and closes the loop with multi object 3D geometric supervision. A VLM planner compiles each instruction into a typed JSON sequence of subtasks with explicit predicate based preconditions and success conditions. Using SAM3 and RGB-D, we ground all task relevant entities in 3D, estimate object centroids and extents, and evaluate predicates over stable frames to obtain condition level diagnostics. The supervisor uses these diagnostics to verify subtask completion and to provide condition-level feedback for progression and replanning. We execute each subtask by coordinating humanoid locomotion and whole-body manipulation, selecting feasible motion primitives under reachability and balance constraints. Experiments on tabletop manipulation and long horizon humanoid loco manipulation tasks show improved robustness from multi object grounding, temporal stability, and recovery driven replanning.
Chinese Translation
机器人越来越被期望在人的环境中执行开放式自然语言请求,这要求在部分可观察性下进行可靠的长时间执行。这对类人机器人尤其具有挑战性,因为运动和操作通过站姿、可达性和平衡紧密耦合。我们提出了一种类人代理框架,将视觉语言模型(VLM)计划转化为可验证的任务程序,并通过多物体3D几何监督闭合反馈回路。VLM规划器将每个指令编译为带有明确谓词基础前提和成功条件的类型化JSON子任务序列。使用SAM3和RGB-D,我们在3D中定位所有与任务相关的实体,估计物体的质心和范围,并在稳定帧上评估谓词以获得条件级诊断。监督者利用这些诊断来验证子任务的完成,并提供条件级反馈以促进进展和重新规划。我们通过协调类人运动和全身操作来执行每个子任务,在可达性和平衡约束下选择可行的运动原语。在桌面操作和长时间类人运动操作任务上的实验表明,通过多物体定位、时间稳定性和恢复驱动的重新规划,显著提高了鲁棒性。
cs.RO / 41 / 2603.10682

OnFly: Onboard Zero-Shot Aerial Vision-Language Navigation toward Safety and Efficiency

OnFly:面向安全与效率的机载零样本空中视觉-语言导航
Zheng, Guiyong, Ban, Yueting, Zhang, Mingjie, Zheng, Juepeng, Zhou, Boyu
Abstract
Aerial vision-language navigation (AVLN) enables UAVs to follow natural-language instructions in complex 3D environments. However, existing zero-shot AVLN methods often suffer from unstable single-stream Vision-Language Model decision-making, unreliable long-horizon progress monitoring, and a trade-off between safety and efficiency. We propose OnFly, a fully onboard, real-time framework for zero-shot AVLN. OnFly adopts a shared-perception dual-agent architecture that decouples high-frequency target generation from low-frequency progress monitoring, thereby stabilizing decision-making. It further employs a hybrid keyframe-recent-frame memory to preserve global trajectory context while maintaining KV-cache prefix stability, enabling reliable long-horizon monitoring with termination and recovery signals. In addition, a semantic-geometric verifier refines VLM-predicted targets for instruction consistency and geometric safety using VLM features and depth cues, while a receding-horizon planner generates optimized collision-free trajectories under geometric safety constraints, improving both safety and efficiency. In simulation, OnFly improves task success from 26.4% to 67.8%, compared with the strongest state-of-the-art baseline, while fully onboard real-world flights validate its feasibility for real-time deployment. The code will be released at https://github.com/Robotics-STAR-Lab/OnFly
Chinese Translation
空中视觉-语言导航(AVLN)使无人机能够在复杂的三维环境中遵循自然语言指令。然而,现有的零样本 AVLN 方法往往面临不稳定的单流视觉-语言模型决策、不可靠的长时程进度监测,以及安全与效率之间的权衡。我们提出了 OnFly,一个完全机载的实时零样本 AVLN 框架。OnFly 采用共享感知的双代理架构,将高频目标生成与低频进度监测解耦,从而稳定决策过程。它进一步采用混合关键帧-最近帧记忆,以保持全局轨迹上下文,同时维持 KV-cache 前缀的稳定性,从而实现可靠的长时程监测,并提供终止和恢复信号。此外,语义-几何验证器利用 VLM 特征和深度线索,精炼 VLM 预测的目标,以确保指令一致性和几何安全,而后退时域规划器在几何安全约束下生成优化的无碰撞轨迹,从而提高安全性和效率。在仿真中,OnFly 将任务成功率从 26.4% 提高到 67.8%,相较于最强的现有基线,同时完全机载的真实飞行验证了其实时部署的可行性。代码将发布在 https://github.com/Robotics-STAR-Lab/OnFly
cs.RO / 42 / 2603.10688

MapGCLR: Geospatial Contrastive Learning of Representations for Online Vectorized HD Map Construction

MapGCLR:在线矢量化高清地图构建的地理空间对比学习表示
Merkert, Jonas, Blumberg, Alexander, Pauls, Jan-Hendrik, Stiller, Christoph
Abstract
Autonomous vehicles rely on map information to understand the world around them. However, the creation and maintenance of offline high-definition (HD) maps remains costly. A more scalable alternative lies in online HD map construction, which only requires map annotations at training time. To further reduce the need for annotating vast training labels, self-supervised training provides an alternative. This work focuses on improving the latent birds-eye-view (BEV) feature grid representation within a vectorized online HD map construction model by enforcing geospatial consistency between overlapping BEV feature grids as part of a contrastive loss function. To ensure geospatial overlap for contrastive pairs, we introduce an approach to analyze the overlap between traversals within a given dataset and generate subsidiary dataset splits following adjustable multi-traversal requirements. We train the same model supervised using a reduced set of single-traversal labeled data and self-supervised on a broader unlabeled set of data following our multi-traversal requirements, effectively implementing a semi-supervised approach. Our approach outperforms the supervised baseline across the board, both quantitatively in terms of the downstream tasks vectorized map perception performance and qualitatively in terms of segmentation in the principal component analysis (PCA) visualization of the BEV feature space.
Chinese Translation
自主驾驶车辆依赖地图信息来理解周围的世界。然而,离线高清(HD)地图的创建和维护仍然成本高昂。一个更具可扩展性的替代方案是在线高清地图构建,该方法仅在训练时需要地图注释。为了进一步减少对大量训练标签的注释需求,自监督训练提供了一种替代方案。本研究的重点是通过在对比损失函数中强制重叠的鸟瞰视图(BEV)特征网格之间的地理空间一致性,来改善矢量化在线高清地图构建模型中的潜在BEV特征网格表示。为了确保对比对的地理空间重叠,我们提出了一种分析给定数据集中遍历之间重叠的方法,并根据可调的多遍历要求生成附属数据集划分。我们使用减少的单遍历标记数据集对相同模型进行监督训练,并在更广泛的未标记数据集上进行自监督训练,遵循我们的多遍历要求,从而有效地实现了一种半监督方法。我们的方案在各方面均优于监督基线,无论是在下游任务的矢量化地图感知性能的定量指标上,还是在主成分分析(PCA)可视化的BEV特征空间中的分割质量上。
cs.RO / 43 / 2603.10711

Parallel-in-Time Nonlinear Optimal Control via GPU-native Sequential Convex Programming

基于GPU原生序贯凸规划的时间并行非线性最优控制
Zou, Yilin, Zhang, Zhong, Jiang, Fanghua
Abstract
Real-time trajectory optimization for nonlinear constrained autonomous systems is critical and typically performed by CPU-based sequential solvers. Specifically, reliance on global sparse linear algebra or the serial nature of dynamic programming algorithms restricts the utilization of massively parallel computing architectures like GPUs. To bridge this gap, we introduce a fully GPU-native trajectory optimization framework that combines sequential convex programming with a consensus-based alternating direction method of multipliers. By applying a temporal splitting strategy, our algorithm decouples the optimization horizon into independent, per-node subproblems that execute massively in parallel. The entire process runs fully on the GPU, eliminating costly memory transfers and large-scale sparse factorizations. This architecture naturally scales to multi-trajectory optimization. We validate the solver on a quadrotor agile flight task and a Mars powered descent problem using an on-board edge computing platform. Benchmarks reveal a sustained 4x throughput speedup and a 51% reduction in energy consumption over a heavily optimized 12-core CPU baseline. Crucially, the framework saturates the hardware, maintaining over 96% active GPU utilization to achieve planning rates exceeding 100 Hz. Furthermore, we demonstrate the solver's extensibility to robust Model Predictive Control by jointly optimizing dynamically coupled scenarios under stochastic disturbances, enabling scalable and safe autonomy.
Chinese Translation
非线性约束自主系统的实时轨迹优化至关重要,通常由基于CPU的顺序求解器执行。具体而言,依赖于全局稀疏线性代数或动态规划算法的串行特性限制了像GPU这样的大规模并行计算架构的利用。为了解决这一问题,我们提出了一种完全基于GPU的轨迹优化框架,该框架将序贯凸规划与基于共识的交替方向乘子法相结合。通过应用时间分裂策略,我们的算法将优化时间范围解耦为独立的每节点子问题,这些子问题可以大规模并行执行。整个过程完全在GPU上运行,消除了昂贵的内存传输和大规模稀疏分解。这种架构自然扩展到多轨迹优化。我们在四旋翼敏捷飞行任务和火星动力下降问题上验证了该求解器,使用了一个机载边缘计算平台。基准测试显示,与经过高度优化的12核CPU基线相比,持续实现了4倍的吞吐量加速和51%的能耗降低。关键是,该框架充分利用了硬件,保持超过96%的GPU活跃利用率,实现超过100 Hz的规划速率。此外,我们展示了求解器在鲁棒模型预测控制中的可扩展性,通过在随机干扰下共同优化动态耦合场景,实现可扩展和安全的自主性。
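The consensus-ADMM splitting the abstract relies on turns one coupled problem into independent per-node proximal updates coordinated only through a consensus variable. A minimal scalar sketch of that pattern (global-consensus ADMM on $\min_x \sum_i (x-a_i)^2$, not the paper's temporal splitting of a trajectory; $\rho$ and the data are mine):

```python
def consensus_admm(a, rho=1.0, iters=100):
    """Global-consensus ADMM for min_x sum_i (x - a_i)^2.

    Each x_i-update is an independent subproblem (this is the part a
    GPU would run massively in parallel); coordination happens only
    through the consensus variable z and the duals u_i.
    """
    n = len(a)
    x = [0.0] * n
    u = [0.0] * n
    z = 0.0
    for _ in range(iters):
        # local proximal updates: argmin (x - a_i)^2 + (rho/2)(x - z + u_i)^2
        x = [(2 * ai + rho * (z - ui)) / (2 + rho) for ai, ui in zip(a, u)]
        # consensus averaging step
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # dual ascent on the consensus constraint x_i = z
        u = [ui + xi - z for ui, xi in zip(u, x)]
    return z

z_star = consensus_admm([1.0, 2.0, 6.0])   # minimizer is the mean, 3.0
```

In the paper's setting the scalars become trajectory segments, the quadratics become convexified per-node optimal-control subproblems, and the consensus constraint enforces dynamic continuity between adjacent time nodes.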
cs.RO / 44 / 2603.10712

FutureVLA: Joint Visuomotor Prediction for Vision-Language-Action Model

FutureVLA:视觉-语言-动作模型的联合视觉运动预测
Xu, Xiaoxu, Li, Hao, Ye, Jinhui, Chen, Yilun, Zeng, Jia, Chen, Xinyi, Xu, Linning, Lin, Dahua, Li, Weixin, Pang, Jiangmiao
Abstract
Predictive foresight is important to intelligent embodied agents. Since the motor execution of a robot is intrinsically constrained by its visual perception of environmental geometry, effectively anticipating the future requires capturing this tightly coupled visuomotor interplay. While recent vision-language-action models attempt to incorporate future guidance, they struggle with this joint modeling. Existing explicit methods divert capacity to task-irrelevant visual details, whereas implicit methods relying on sparse frame pairs disrupt temporal continuity. By heavily relying on visual reconstruction, these methods become visually dominated, entangling static scene context with dynamic action intent. We argue that effective joint visuomotor predictive modeling requires both temporal continuity and visually-conditioned supervision decoupling. To this end, we propose FutureVLA, featuring a novel Joint Visuomotor Predictive Architecture. FutureVLA is designed to extract joint visuomotor embeddings by first decoupling visual and motor information, and then jointly encoding generalized physical priors. Specifically, in the pretraining stage, we leverage heterogeneous manipulation datasets and introduce a Joint Visuomotor Gating mechanism to structurally separate visual state preservation from temporal action modeling. It allows the motor stream to focus on continuous physical dynamics while explicitly querying visual tokens for environmental constraints, yielding highly generalizable joint visuomotor embeddings. Subsequently, in the post-training stage, we employ a latent embeddings alignment strategy, enabling diverse downstream VLA models to internalize these temporal priors without modifying their inference architectures. Extensive experiments demonstrate that FutureVLA consistently improves VLA frameworks.
Chinese Translation
预测性前瞻能力对于智能具身智能体至关重要。由于机器人的运动执行本质上受到其对环境几何形状的视觉感知的限制,有效地预测未来需要捕捉这种紧密耦合的视觉运动相互作用。尽管最近的视觉-语言-动作模型试图融入未来指导,但在这种联合建模上仍然面临挑战。现有的显式方法将能力转移到与任务无关的视觉细节上,而依赖稀疏帧对的隐式方法则破坏了时间连续性。由于过度依赖视觉重建,这些方法变得以视觉为主导,将静态场景上下文与动态动作意图纠缠在一起。我们认为,有效的联合视觉运动预测建模需要同时具备时间连续性和视觉条件监督的解耦。为此,我们提出了FutureVLA,具有新颖的联合视觉运动预测架构。FutureVLA旨在通过首先解耦视觉和运动信息,然后共同编码广义物理先验,提取联合视觉运动嵌入。具体而言,在预训练阶段,我们利用异质操控数据集,并引入联合视觉运动门控机制,以结构性地将视觉状态保持与时间动作建模分离。这使得运动流能够专注于连续的物理动态,同时显式查询视觉标记以获取环境约束,从而产生高度可泛化的联合视觉运动嵌入。随后,在后训练阶段,我们采用潜在嵌入对齐策略,使多样化的下游VLA模型能够在不修改其推理架构的情况下内化这些时间先验。大量实验证明,FutureVLA始终提升VLA框架的性能。
cs.RO / 45 / 2603.10714

MAVEN: A Meta-Reinforcement Learning Framework for Varying-Dynamics Expertise in Agile Quadrotor Maneuvers

MAVEN:一种面向变化动力学下敏捷四旋翼机动的元强化学习框架
Zhou, Jin, Cao, Dongcheng, Wang, Xian, Li, Shuo
Abstract
Reinforcement learning (RL) has emerged as a powerful paradigm for achieving online agile navigation with quadrotors. Despite this success, policies trained via standard RL typically fail to generalize across significant dynamic variations, exhibiting a critical lack of adaptability. This work introduces MAVEN, a meta-RL framework that enables a single policy to achieve robust end-to-end navigation across a wide range of quadrotor dynamics. Our approach features a novel predictive context encoder, which learns to infer a latent representation of the system dynamics from interaction history. We demonstrate our method in agile waypoint traversal tasks under two challenging scenarios: large variations in quadrotor mass and severe single-rotor thrust loss. We leverage a GPU-vectorized simulator to distribute tasks across thousands of parallel environments, overcoming the long training times of meta-RL to converge in less than an hour. Through extensive experiments in both simulation and the real world, we validate that MAVEN achieves superior adaptation and agility. The policy successfully executes zero-shot sim-to-real transfer, demonstrating robust online adaptation by performing high-speed maneuvers despite mass variations of up to 66.7% and single-rotor thrust losses as severe as 70%.
Chinese Translation
强化学习(RL)已成为实现四旋翼机在线敏捷导航的强大范式。尽管取得了成功,通过标准RL训练的策略通常无法在显著的动态变化中进行泛化,表现出适应性严重不足。本研究提出了MAVEN,一种元强化学习框架,使单一策略能够在广泛的四旋翼机动态中实现稳健的端到端导航。我们的方法具有一种新颖的预测上下文编码器,能够从交互历史中学习推断系统动态的潜在表示。我们在两种具有挑战性的场景下展示了我们的方法:四旋翼机质量的大幅变化和严重的单旋翼推力损失。我们利用GPU向量化模拟器将任务分配到数千个并行环境中,克服了元强化学习的长训练时间,使其在不到一小时内收敛。通过在模拟和现实世界中的广泛实验,我们验证了MAVEN在适应性和敏捷性方面的优越性。该策略成功执行了零样本仿真到现实迁移,通过在质量变化高达66.7%和单旋翼推力损失高达70%的情况下执行高速操控,展示了稳健的在线适应能力。
cs.RO / 46 / 2603.10715

ASTER: Attitude-aware Suspended-payload Quadrotor Traversal via Efficient Reinforcement Learning

ASTER:基于高效强化学习的姿态感知悬挂负载四旋翼穿越
Cao, Dongcheng, Zhou, Jin, Li, Shuo
Abstract
Agile maneuvering of the quadrotor cable-suspended system is significantly hindered by its non-smooth hybrid dynamics. While model-free Reinforcement Learning (RL) circumvents explicit differentiation of complex models, achieving attitude-constrained or inverted flight remains an open challenge due to the extreme reward sparsity under strict orientation requirements. This paper presents ASTER, a robust RL framework that achieves, to our knowledge, the first successful autonomous inverted flight for the cable-suspended system. We propose hybrid-dynamics-informed state seeding (HDSS), an initialization strategy that back-propagates target configurations through physics-consistent kinematic inversions across both taut and slack cable phases. HDSS enables the policy to discover aggressive maneuvers that are unreachable via standard exploration. Extensive simulations and real-world experiments demonstrate remarkable agility, precise attitude alignment, and robust zero-shot sim-to-real transfer across complex trajectories.
Chinese Translation
四旋翼缆索悬挂系统的灵活机动受到其非平滑混合动力学的显著制约。虽然无模型强化学习(Reinforcement Learning, RL)避免了对复杂模型的显式微分,但由于在严格的姿态要求下奖励稀疏性极高,实现姿态受限或倒飞仍然是一个未解决的挑战。本文提出了ASTER,一个强健的RL框架,据我们所知,它首次成功实现了缆索悬挂系统的自主倒飞。我们提出了一种混合动力学信息驱动的状态初始化策略(Hybrid-Dynamics-Informed State Seeding, HDSS),该策略通过在紧绷和松弛缆索阶段进行物理一致的运动学反演,反向传播目标配置。HDSS使得策略能够发现通过标准探索无法达到的激进机动。大量的仿真和现实世界实验展示了显著的灵活性、精确的姿态对齐以及在复杂轨迹上的强健零样本仿真到现实迁移。
cs.RO / 47 / 2603.10847

Semantic Landmark Particle Filter for Robot Localisation in Vineyards

用于葡萄园机器人定位的语义地标粒子滤波器
de Silva, Rajitha, Cox, Jonathan, Heselden, James R., Popović, Marija, Cadena, Cesar, Polvara, Riccardo
Abstract
Reliable localisation in vineyards is hindered by row-level perceptual aliasing: parallel crop rows produce nearly identical LiDAR observations, causing geometry-only and vision-based SLAM systems to converge towards incorrect corridors, particularly during headland transitions. We present a Semantic Landmark Particle Filter (SLPF) that integrates trunk and pole landmark detections with 2D LiDAR within a probabilistic localisation framework. Detected trunks are converted into semantic walls, forming structural row boundaries embedded in the measurement model to improve discrimination between adjacent rows. GNSS is incorporated as a lightweight prior that stabilises localisation when semantic observations are sparse. Field experiments in a 10-row vineyard demonstrate consistent improvements over geometry-only (AMCL), vision-based (RTAB-Map), and GNSS baselines. Compared to AMCL, SLPF reduces Absolute Pose Error by 22% and 65% across two traversal directions; relative to a NoisyGNSS baseline, APE decreases by 65% and 61%. Row correctness improves from 0.67 to 0.73, while mean cross-track error decreases from 1.40 m to 1.26 m. These results show that embedding row-level structural semantics within the measurement model enables robust localisation in highly repetitive outdoor agricultural environments.
Chinese Translation
在葡萄园中,可靠的定位受到行级感知混淆的阻碍:平行的作物行产生几乎相同的激光雷达(LiDAR)观测,导致仅基于几何和视觉的同步定位与地图构建(SLAM)系统收敛到错误的行间通道,特别是在地头(headland)转弯过渡期间。我们提出了一种语义地标粒子滤波器(SLPF),该滤波器将树干和杆地标检测与二维激光雷达(LiDAR)集成在一个概率定位框架中。检测到的树干被转换为语义墙,形成嵌入测量模型中的结构行边界,以改善相邻行之间的区分。全球导航卫星系统(GNSS)作为一种轻量级先验被纳入,以在语义观测稀疏时稳定定位。在一个包含10行作物的葡萄园中进行的实地实验表明,相较于仅基于几何的(AMCL)、基于视觉的(RTAB-Map)和GNSS基线,SLPF在定位精度上有一致的改善。与AMCL相比,SLPF在两个行驶方向上分别减少了22%和65%的绝对位姿误差;相较于噪声GNSS基线,绝对位姿误差(APE)分别减少了65%和61%。行正确性从0.67提高到0.73,而平均横向误差从1.40米降低到1.26米。这些结果表明,在测量模型中嵌入行级结构语义能够在高度重复的户外农业环境中实现稳健的定位。
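The core particle-filter mechanism here is a measurement update: each particle is reweighted by how well a landmark observation fits its hypothesized pose, which is exactly what lets trunk detections disambiguate otherwise identical rows. A generic landmark-PF weight update as a sketch (Gaussian range likelihood to the nearest trunk; this is not the paper's full semantic-wall measurement model, and the map, noise level, and particles are mine):

```python
import math

def pf_update(particles, weights, landmark_map, observed_range, noise=0.3):
    """Reweight particles by the likelihood of a range observation to
    the nearest semantic landmark, then normalise the weights."""
    new_w = []
    for (x, y), w in zip(particles, weights):
        d = min(math.hypot(x - lx, y - ly) for lx, ly in landmark_map)
        lik = math.exp(-0.5 * ((d - observed_range) / noise) ** 2)
        new_w.append(w * lik)
    total = sum(new_w)
    return [w / total for w in new_w]

# two trunks along the row y = 0; the robot measures a 1.0 m trunk range,
# so the particle sitting in the correct row dominates after the update
trunks = [(0.0, 0.0), (2.0, 0.0)]
particles = [(1.0, 0.0), (1.0, 5.0)]       # second hypothesis: wrong row
weights = pf_update(particles, [0.5, 0.5], trunks, observed_range=1.0)
```

Geometry-only likelihoods would score both rows nearly equally from LiDAR alone; conditioning on semantic landmarks is what collapses the row-level ambiguity, as in the wrong-row particle above.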
cs.RO / 48 / 2603.10858

GRACE: A Unified 2D Multi-Robot Path Planning Simulator & Benchmark for Grid, Roadmap, And Continuous Environments

GRACE:一个统一的二维多机器人路径规划模拟器与基准测试,适用于网格、路线图和连续环境
Zang, Chuanlong, Mannucci, Anna, Barz, Isabelle, Schillinger, Philipp, Lier, Florian, Hönig, Wolfgang
Abstract
Advancing Multi-Agent Pathfinding (MAPF) and Multi-Robot Motion Planning (MRMP) requires platforms that enable transparent, reproducible comparisons across modeling choices. Existing tools either scale under simplifying assumptions (grids, homogeneous agents) or offer higher fidelity with less comparable instrumentation. We present GRACE, a unified 2D simulator+benchmark that instantiates the same task at multiple abstraction levels (grid, roadmap, continuous) via explicit, reproducible operators and a common evaluation protocol. Our empirical results on public maps and representative planners enable commensurate comparisons on a shared instance set. Furthermore, we quantify the expected representation-fidelity trade-offs (MRMP solves instances at higher fidelity but lower speed, while grid/roadmap planners scale farther). By consolidating representation, execution, and evaluation, GRACE thereby aims to make cross-representation studies more comparable and provides a means to advance multi-robot planning research and its translation to practice.
Chinese Translation
推进多智能体路径寻找(MAPF)和多机器人运动规划(MRMP)需要能够在建模选择上进行透明、可重复比较的平台。现有工具要么只能在简化假设(网格、同质智能体)下实现扩展,要么提供更高的保真度,但测量与评估手段的可比性较差。我们提出了GRACE,一个统一的二维模拟器+基准测试,通过明确、可重复的操作符和共同的评估协议,在多个抽象层次(网格、路线图、连续)上实例化同一任务。我们在公共地图和代表性规划器上的实证结果使得在共享实例集上进行对等的比较成为可能。此外,我们量化了预期的表示-保真度权衡(MRMP以更高保真度但较低速度求解实例,而网格/路线图规划器则能够扩展到更大规模)。通过整合表示、执行和评估,GRACE旨在使跨表示研究更具可比性,并提供推动多机器人规划研究及其实践转化的手段。
cs.RO / 49 / 2603.10871

FG-CLTP: Fine-Grained Contrastive Language Tactile Pretraining for Robotic Manipulation

FG-CLTP:用于机器人操作的细粒度对比语言触觉预训练
Ma, Wenxuan, Zhang, Chaofan, Cai, Yinghao, Yao, Guocai, Cui, Shaowei, Wang, Shuo
Abstract
Recent advancements in integrating tactile sensing into vision-language-action (VLA) models have demonstrated transformative potential for robotic perception. However, existing tactile representations predominantly rely on qualitative descriptors (e.g., texture), neglecting quantitative contact states such as force magnitude, contact geometry, and principal axis orientation, which are indispensable for fine-grained manipulation. To bridge this gap, we propose FG-CLTP, a fine-grained contrastive language tactile pretraining framework. We first introduce a novel dataset comprising over 100k tactile 3D point cloud-language pairs that explicitly capture multidimensional contact states from the sensor's perspective. We then implement a discretized numerical tokenization mechanism to achieve quantitative-semantic alignment, effectively injecting explicit physical metrics into the multimodal feature space. The proposed FG-CLTP model yields a 95.9% classification accuracy and reduces the regression error (MAE) by 52.6% compared to state-of-the-art methods. Furthermore, the integration of 3D point cloud representations establishes a sensor-agnostic foundation with a minimal sim-to-real gap of 3.5%. Building upon this fine-grained representation, we develop a 3D tactile-language-action (3D-TLA) architecture driven by a flow matching policy to enable multimodal reasoning and control. Extensive experiments demonstrate that our framework significantly outperforms strong baselines in contact-rich manipulation tasks, providing a robust and generalizable foundation for tactile-language-action models.
Chinese Translation
最近在将触觉感知整合到视觉-语言-动作(VLA)模型中的进展展示了对机器人感知的变革潜力。然而,现有的触觉表征主要依赖于定性描述符(例如,纹理),忽视了诸如力的大小、接触几何形状和主轴方向等定量接触状态,而这些对于细粒度操作是不可或缺的。为了解决这一问题,我们提出了FG-CLTP,一个细粒度对比语言触觉预训练框架。我们首先引入一个新颖的数据集,该数据集包含超过10万对触觉3D点云-语言配对,明确捕捉传感器视角下的多维接触状态。然后,我们实现了一种离散化的数值标记机制,以实现定量-语义对齐,有效地将明确的物理度量注入多模态特征空间。所提出的FG-CLTP模型在分类准确率上达到了95.9%,并且与最先进的方法相比,回归误差(MAE)降低了52.6%。此外,3D点云表征的整合建立了一个传感器无关的基础,最小的仿真到现实的差距为3.5%。在此细粒度表征的基础上,我们开发了一个由流匹配策略驱动的3D触觉-语言-动作(3D-TLA)架构,以实现多模态推理和控制。大量实验表明,我们的框架在接触丰富的操作任务中显著优于强基线,为触觉-语言-动作模型提供了一个稳健且可推广的基础。
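The "discretized numerical tokenization" idea above (mapping continuous contact metrics such as force magnitude into a symbolic vocabulary so they can be aligned with language) might look roughly like this minimal sketch; the binning scheme, value range, and token format are all our assumptions, not the paper's mechanism:

```python
def numeric_token(x, lo=0.0, hi=10.0, n_bins=64, name="force"):
    """Map a continuous contact metric (e.g. force magnitude in newtons)
    onto one of n_bins symbolic tokens, so an explicit physical quantity
    can enter a text vocabulary and be contrastively aligned with tactile
    features. Out-of-range values are clipped to the boundary bins."""
    frac = (x - lo) / (hi - lo)
    b = min(n_bins - 1, max(0, int(frac * n_bins)))
    return f"<{name}_{b}>"
```

A caption like "contact force 5.0 N" would then carry the token `<force_32>`, giving the text encoder a discrete handle on the quantitative state.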
cs.RO / 50 / 2603.10878

RL-Augmented MPC for Non-Gaited Legged and Hybrid Locomotion

用于非周期步态腿式与混合运动的强化学习增强模型预测控制
Patrizi, Andrea, Rizzardo, Carlo, Laurenzi, Arturo, Ruscelli, Francesco, Rossini, Luca, Tsagarakis, Nikos G.
Abstract
We propose a contact-explicit hierarchical architecture coupling Reinforcement Learning (RL) and Model Predictive Control (MPC), where a high-level RL agent provides gait and navigation commands to a low-level locomotion MPC. This offloads the combinatorial burden of contact timing from the MPC by learning acyclic gaits through trial and error in simulation. We show that only a minimal set of rewards and limited tuning are required to obtain effective policies. We validate the architecture in simulation across robotic platforms spanning 50 kg to 120 kg and different MPC implementations, observing the emergence of acyclic gaits and timing adaptations in flat-terrain legged and hybrid locomotion, and further demonstrating extensibility to non-flat terrains. Across all platforms, we achieve zero-shot sim-to-sim transfer without domain randomization, and we further demonstrate zero-shot sim-to-real transfer without domain randomization on Centauro, our 120 kg wheeled-legged humanoid robot. We make our software framework and evaluation results publicly available at https://github.com/AndrePatri/AugMPC.
Chinese Translation
我们提出了一种接触显式的分层架构,将强化学习(Reinforcement Learning, RL)与模型预测控制(Model Predictive Control, MPC)相结合,其中高层RL代理向低层运动MPC提供步态和导航指令。该方法通过在仿真中以试错方式学习非周期步态,减轻了MPC在接触时机上的组合负担。我们展示了仅需最小的奖励集和有限的调优即可获得有效策略。我们在多个机器人平台(重量范围从50公斤到120公斤)和不同的MPC实现中验证了该架构,观察到在平坦地形的腿式和混合运动中出现了非周期步态和时机适应,并进一步证明其在非平坦地形上的可扩展性。在所有平台上,我们实现了无需领域随机化的零样本(zero-shot)仿真到仿真迁移,并在我们的120公斤轮腿人形机器人Centauro上进一步展示了无需领域随机化的零样本仿真到现实迁移。我们将我们的软件框架和评估结果公开发布在https://github.com/AndrePatri/AugMPC。
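The hierarchy described above, where a high-level RL agent supplies contact and navigation commands and the low-level MPC optimises the motion with those contacts fixed, can be sketched as a single control tick; all names and interfaces here are illustrative assumptions:

```python
def hierarchical_step(rl_policy, mpc, state):
    """One control tick of an RL-over-MPC hierarchy (illustrative sketch).

    rl_policy(state) -> (contact_flags, nav_cmd): the agent decides which
    end-effectors should be in stance and where to go, removing the
    combinatorial contact-timing search from the optimisation.
    mpc.solve(...) then returns low-level commands with contacts fixed.
    """
    contact_flags, nav_cmd = rl_policy(state)   # e.g. per-leg stance booleans
    return mpc.solve(state, contact_flags, nav_cmd)
```

Because the contact sequence is chosen by the learned policy rather than enumerated inside the optimiser, acyclic (non-gaited) contact schedules can emerge from trial and error.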
cs.RO / 51 / 2603.10890

A gripper for flap separation and opening of sealed bags

用于分离和打开密封袋的夹持器
Foix, Sergi, Oriol, Jaume, Torras, Carme, Borràs, Júlia
Abstract
Separating thin, flexible layers that must be individually grasped is a common but challenging manipulation primitive for most off-the-shelf grippers. A prominent example arises in clinical settings: the opening of sterile flat pouches for the preparation of the operating room, where the first step is to separate and grasp the flaps. We present a novel gripper design and opening strategy that enables reliable flap separation and robust seal opening. This capability addresses a high-volume repetitive hospital procedure in which nurses manually open up to 240 bags per shift, a physically demanding task linked to musculoskeletal injuries. Our design combines an active dented-roller fingertip with compliant fingers that exploit environmental constraints to robustly grasp thin flexible flaps. Experiments demonstrate that the proposed gripper reliably grasps and separates sealed bag flaps and other thin-layered materials from the hospital, the most sensitive variable affecting performance being the normal force applied. When two copies of the gripper grasp both flaps, the system withstands the forces needed to open the seals robustly. To our knowledge, this is one of the first demonstrations of robotic assistance to automate this repetitive, low-value, but critical hospital task.
Chinese Translation
分离需要单独抓取的薄而柔韧的层是大多数现成夹持器面临的常见但具有挑战性的操作原语。在临床环境中,这一问题尤为突出:打开用于手术室准备的无菌平袋,第一步是分离并抓取袋子的翻盖。我们提出了一种新颖的夹持器设计和开袋策略,能够实现可靠的翻盖分离和稳健的密封打开。这一能力解决了医院中一种高频次的重复性程序,护士每班需手动打开多达240个袋子,这是一项体力要求较高的任务,且与肌肉骨骼损伤相关。我们的设计结合了一个主动的凹形滚轮指尖和顺应性手指,利用环境约束稳健地抓取薄柔性翻盖。实验表明,所提出的夹持器能够可靠地抓取和分离密封袋的翻盖及其他来自医院的薄层材料,影响性能的最敏感变量是施加的法向力。当两个夹持器同时抓取两个翻盖时,系统能够承受稳健打开密封所需的力量。据我们所知,这是首批展示利用机器人辅助自动化这一重复性、低价值但至关重要的医院任务的工作之一。
cs.RO / 52 / 2603.10971

Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation

基于接触覆盖的通用灵巧操作探索
Liu, Zixuan, Qiao, Ruoyi, Tie, Chenrui, Liu, Xuanwei, Lou, Yunfan, Gao, Chongkai, Xu, Zhixuan, Shao, Lin
Abstract
Deep reinforcement learning (DRL) has achieved remarkable success in domains with well-defined reward structures, such as Atari games and locomotion. In contrast, dexterous manipulation lacks general-purpose reward formulations and typically depends on task-specific, handcrafted priors to guide hand-object interactions. We propose Contact Coverage-Guided Exploration (CCGE), a general exploration method designed for general-purpose dexterous manipulation tasks. CCGE represents contact state as the intersection between object surface points and predefined hand keypoints, encouraging dexterous hands to discover diverse and novel contact patterns, namely which fingers contact which object regions. It maintains a contact counter conditioned on discretized object states obtained via learned hash codes, capturing how frequently each finger interacts with different object regions. This counter is leveraged in two complementary ways: (1) to assign a count-based contact coverage reward that promotes exploration of novel contact patterns, and (2) to provide an energy-based reaching reward that guides the agent toward under-explored contact regions. We evaluate CCGE on a diverse set of dexterous manipulation tasks, including cluttered object singulation, constrained object retrieval, in-hand reorientation, and bimanual manipulation. Experimental results show that CCGE substantially improves training efficiency and success rates over existing exploration methods, and that the contact patterns learned with CCGE transfer robustly to real-world robotic systems. Project page is https://contact-coverage-guided-exploration.github.io.
Chinese Translation
深度强化学习(Deep Reinforcement Learning, DRL)在具有明确奖励结构的领域(如Atari游戏和运动)中取得了显著成功。相比之下,灵巧操作缺乏通用的奖励公式,通常依赖于特定任务的手工设计先验来指导手与物体的交互。我们提出了一种接触覆盖引导的探索方法(Contact Coverage-Guided Exploration, CCGE),旨在为通用灵巧操作任务提供一种通用的探索方法。CCGE将接触状态表示为物体表面点与预定义手部关键点的交集,鼓励灵巧手发现多样化和新颖的接触模式,即哪些手指接触哪些物体区域。它维护一个基于离散化物体状态的接触计数器,该状态通过学习的哈希码获得,捕捉每个手指与不同物体区域交互的频率。该计数器以两种互补的方式被利用:(1)分配基于计数的接触覆盖奖励,促进对新接触模式的探索;(2)基于能量的到达奖励,引导智能体朝向未充分探索的接触区域。我们在一系列多样的灵巧操作任务上评估了CCGE,包括杂乱场景中的物体分离(singulation)、受限环境中的物体取回、手内重新定向和双手操作。实验结果表明,相较于现有的探索方法,CCGE显著提高了训练效率和成功率,且使用CCGE学习的接触模式能够稳健地迁移到现实世界的机器人系统中。项目页面为 https://contact-coverage-guided-exploration.github.io。
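The count-based contact-coverage reward described above can be sketched as follows, assuming a 1/sqrt(count) bonus keyed by the hashed object state together with the set of (finger, object-region) contacts; the exact decay schedule and keying in the paper may differ:

```python
from collections import defaultdict
import math

class ContactCoverageReward:
    """Count-based contact-coverage bonus (illustrative sketch of CCGE's idea).

    The counter is conditioned on a discretized object state (here: any
    hashable code) and the contact pattern, i.e. which fingers touch which
    object regions. The bonus decays as a pattern is revisited, pushing the
    policy toward novel contact patterns. The 1/sqrt(count) form and all
    names are assumptions, not the paper's exact formulation.
    """
    def __init__(self):
        self.counts = defaultdict(int)

    def __call__(self, state_hash, contacts):
        key = (state_hash, frozenset(contacts))
        self.counts[key] += 1
        return 1.0 / math.sqrt(self.counts[key])
```

A never-seen (state, contact-pattern) pair yields the full bonus of 1.0, while repeating the same pattern in the same discretized state returns a shrinking reward.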
cs.RO / 53 / 2603.10979

Learning Adaptive Force Control for Contact-Rich Sample Scraping with Heterogeneous Materials

针对异质材料接触丰富样品刮削的自适应力控制学习
Cetin, Cenk, Pouli, Shreyas, Pizzuto, Gabriella
Abstract
The increasing demand for accelerated scientific discovery, driven by global challenges, highlights the need for advanced AI-driven robotics. Deploying robotic chemists in human-centric labs is key for the next horizon of autonomous discovery, as complex tasks still demand the dexterity of human scientists. Robotic manipulation in this context is uniquely challenged by handling diverse chemicals (granular, powdery, or viscous liquids), under varying lab conditions. For example, humans use spatulas for scraping materials from vial walls. Automating this process is challenging because it goes beyond simple robotic insertion tasks and traditional lab automation, requiring the execution of fine-granular movements within a constrained environment (the sample vial). Our work proposes an adaptive control framework to address this, relying on a low-level Cartesian impedance controller for stable and compliant physical interaction and a high-level reinforcement learning agent that learns to dynamically adjust interaction forces at the end-effector. The agent is guided by perception feedback, which provides the material's location. We first created a task-representative simulation environment with a Franka Research 3 robot, a scraping tool, and a sample vial containing heterogeneous materials. To facilitate the learning of an adaptive policy and model diverse characteristics, the sample is modelled as a collection of spheres, where each sphere is assigned a unique dislodgement force threshold, which is procedurally generated using Perlin noise. We train an agent to autonomously learn and adapt the optimal contact wrench for a sample scraping task in simulation and then successfully transfer this policy to a real robotic setup. Our method was evaluated across five different material setups, outperforming a fixed-wrench baseline by an average of 10.9%.
Chinese Translation
全球挑战推动的科学发现加速需求凸显了对先进人工智能驱动机器人技术的需求。在以人为中心的实验室中部署机器人化学家是实现自主发现的下一个重要阶段的关键,因为复杂任务仍然需要人类科学家的灵活性。在这种情况下,机器人操作面临着处理多样化化学品(颗粒状、粉末状或粘稠液体)以及不同实验室条件的独特挑战。例如,人类使用刮刀从试管壁上刮取材料。自动化这一过程具有挑战性,因为它超出了简单的机器人插入任务和传统实验室自动化的范围,要求在受限环境(样品试管)内执行精细的运动。我们的工作提出了一种自适应控制框架来解决这一问题,依赖于低级的笛卡尔阻抗控制器以实现稳定和顺应的物理交互,以及一个高级的强化学习代理,该代理学习动态调整末端执行器的交互力。代理通过感知反馈进行指导,感知反馈提供材料的位置。我们首先创建了一个具有代表性的任务仿真环境,使用Franka Research 3机器人、刮削工具和包含异质材料的样品试管。为了促进自适应策略的学习并建模多样化特性,样品被建模为一组球体,每个球体被分配一个独特的脱落力阈值,该阈值使用Perlin噪声程序化生成。我们训练一个代理在仿真中自主学习和适应样品刮削任务的最佳接触力旋量(contact wrench),然后成功将该策略转移到真实的机器人设置中。我们的方法在五种不同的材料设置中进行了评估,平均比固定力旋量基线高出10.9%。
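The heterogeneous sample model above (spheres whose dislodgement-force thresholds are procedurally generated with spatially correlated noise) can be sketched with smooth value noise standing in for Perlin noise; parameter names, ranges, and the noise substitute are our assumptions:

```python
import numpy as np

def sphere_thresholds(n_spheres, base=1.0, amplitude=0.5, n_knots=8, seed=0):
    """Procedural dislodgement-force thresholds for a sphere-based sample model.

    The paper uses Perlin noise; here smooth value noise (linear interpolation
    between random knots) stands in for it, keeping the key property the
    abstract relies on: spatially correlated heterogeneity across the sample,
    so neighbouring spheres get similar but not identical thresholds.
    """
    rng = np.random.default_rng(seed)
    knots = rng.uniform(-1.0, 1.0, n_knots)           # random control values
    x = np.linspace(0.0, n_knots - 1.0, n_spheres)    # sphere positions
    noise = np.interp(x, np.arange(n_knots), knots)   # smooth interpolation
    return base + amplitude * noise
```

Each simulated sphere is then "dislodged" only when the applied normal force exceeds its threshold, which is what forces the RL agent to adapt its contact wrench.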
cs.RO / 54 / 2603.10980

PPGuide: Steering Diffusion Policies with Performance Predictive Guidance

PPGuide:通过性能预测指导引导扩散策略
Wang, Zixing, Jha, Devesh K., Qureshi, Ahmed H., Romeres, Diego
Abstract
Diffusion policies have been shown to be very efficient at learning complex, multi-modal behaviors for robotic manipulation. However, errors in generated action sequences can compound over time, which can potentially lead to failure. Some approaches mitigate this by augmenting datasets with expert demonstrations or learning predictive world models, which can be computationally expensive. We introduce Performance Predictive Guidance (PPGuide), a lightweight, classifier-based framework that steers a pre-trained diffusion policy away from failure modes at inference time. PPGuide makes use of a novel self-supervised process: it uses attention-based multiple instance learning to automatically estimate which observation-action chunks from the policy's rollouts are relevant to success or failure. We then train a performance predictor on this self-labeled data. During inference, this predictor provides a real-time gradient to guide the policy toward more robust actions. We validated our proposed PPGuide across a diverse set of tasks from the Robomimic and MimicGen benchmarks, demonstrating consistent improvements in performance.
Chinese Translation
扩散策略在学习复杂的多模态行为方面表现出很高的效率,尤其是在机器人操作中。然而,生成的动作序列中的错误可能会随着时间的推移而累积,从而可能导致失败。一些方法通过用专家示范增强数据集或学习预测世界模型来缓解这一问题,但这可能会消耗大量计算资源。我们提出了性能预测指导(PPGuide),这是一种轻量级的基于分类器的框架,在推理时引导预训练的扩散策略远离失败模式。PPGuide利用了一种新颖的自监督过程:它使用基于注意力的多实例学习,自动估计策略执行轨迹(rollouts)中哪些观察-动作片段与成功或失败相关。然后,我们在这些自标记的数据上训练一个性能预测器。在推理过程中,这个预测器提供实时梯度,以指导策略朝向更稳健的动作。我们在Robomimic和MimicGen基准测试中的一系列多样化任务上验证了我们提出的PPGuide,展示了性能的一致性提升。
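The inference-time guidance step described above, nudging sampled actions uphill on the predictor's success score, reduces to a few lines of gradient ascent. In practice the gradient would come from autodiff through the trained performance predictor; the plain update form and names below are our assumptions:

```python
import numpy as np

def guide_actions(actions, grad_log_success, step_size=0.1, n_steps=5):
    """Steer sampled actions toward higher predicted success (sketch).

    grad_log_success(actions) returns d/da log p(success | obs, a), which
    in a real system is obtained by differentiating the learned performance
    predictor with respect to the action chunk. Repeated small gradient
    steps move the diffusion policy's samples away from predicted failures.
    """
    a = np.asarray(actions, dtype=float)
    for _ in range(n_steps):
        a = a + step_size * grad_log_success(a)
    return a
```

With a toy quadratic log-success peaked at zero, the guided action contracts toward the optimum by a factor of 0.8 per step.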
计算机视觉 (Computer Vision)
92
cs.CV / 1 / 2603.10125

4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video

4DEquine:从单目视频中解耦运动与外观以实现4D马匹重建
Lyu, Jin, An, Liang, Cheng, Pujin, Liu, Yebin, Tang, Xiaoying
Abstract
4D reconstruction of equine family (e.g. horses) from monocular video is important for animal welfare. Previous mainstream 4D animal reconstruction methods require joint optimization of motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observation. In this work, we propose a novel framework called 4DEquine by disentangling the 4D reconstruction problem into two sub-problems: dynamic motion reconstruction and static appearance reconstruction. For motion, we introduce a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth and pixel-aligned pose and shape sequences from video. For appearance, we design a novel feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as few as a single image. To assist training, we create a large-scale synthetic motion dataset, VarenPoser, which features high-quality surface motions and diverse camera trajectories, as well as a synthetic appearance dataset, VarenTex, comprising realistic multi-view images generated through multi-view diffusion. While training only on synthetic datasets, 4DEquine achieves state-of-the-art performance on real-world APT36K and AiM datasets, demonstrating the superiority of 4DEquine and our new datasets for both geometry and appearance reconstruction. Comprehensive ablation studies validate the effectiveness of both the motion and appearance reconstruction network. Project page: https://luoxue-star.github.io/4DEquine_Project_Page/.
Chinese Translation
从单目视频中进行马科动物(如马)的4D重建对动物福利至关重要。以往主流的4D动物重建方法需要对整个视频中的运动和外观进行联合优化,这既耗时又对不完整观察敏感。在本研究中,我们提出了一种新颖的框架,称为4DEquine,通过将4D重建问题解耦为两个子问题:动态运动重建和静态外观重建。对于运动,我们引入了一种简单而有效的时空变换器,并结合后优化阶段,从视频中回归平滑且像素对齐的姿态和形状序列。对于外观,我们设计了一种新颖的前馈网络,可以从少至一张图像中重建出高保真、可动画的3D高斯化身(avatar)。为了辅助训练,我们创建了一个大规模的合成运动数据集VarenPoser,包含高质量的表面运动和多样的相机轨迹,以及一个合成外观数据集VarenTex,包含通过多视角扩散生成的真实多视角图像。在仅使用合成数据集进行训练的情况下,4DEquine在真实世界的APT36K和AiM数据集上实现了最先进的性能,展示了4DEquine及我们的新数据集在几何和外观重建方面的优越性。全面的消融研究验证了运动和外观重建网络的有效性。项目页面:https://luoxue-star.github.io/4DEquine_Project_Page/.
cs.CV / 2 / 2603.10128

HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation

HG-Lane:在恶劣天气和光照条件下高保真生成车道场景,无需重新标注
Zhao, Daichao, Chen, Qiupu, He, Feng, Ning, Xin, Li, Qiankun
Abstract
Lane detection is a crucial task in autonomous driving, as it helps ensure the safe operation of vehicles. However, existing datasets such as CULane and TuSimple contain relatively limited data under extreme weather conditions, including rain, snow, and fog. As a result, detection models trained on these datasets often become unreliable in such environments, which may lead to serious safety-critical failures on the road. To address this issue, we propose HG-Lane, a High-fidelity Generation framework for Lane Scenes under adverse weather and lighting conditions without requiring re-annotation. Based on this framework, we further construct a benchmark that includes adverse weather and lighting scenarios, containing 30,000 images. Experimental results demonstrate that our method consistently and significantly improves the performance of existing lane detection networks. For example, using the state-of-the-art CLRNet, the overall mF1 score on our benchmark increases by 20.87 percent. The F1@50 score for the overall, normal, snow, rain, fog, night, and dusk categories increases by 19.75 percent, 8.63 percent, 38.8 percent, 14.96 percent, 26.84 percent, 21.5 percent, and 12.04 percent, respectively. The code and dataset are available at: https://github.com/zdc233/HG-Lane.
Chinese Translation
车道检测是自动驾驶中的一项关键任务,因为它有助于确保车辆的安全运行。然而,现有的数据集如CULane和TuSimple在极端天气条件下(包括雨、雪和雾)包含的数据相对有限。因此,在这些数据集上训练的检测模型在此类环境中往往变得不可靠,这可能导致道路上的严重安全隐患。为了解决这个问题,我们提出了HG-Lane,一个在恶劣天气和光照条件下高保真生成车道场景的框架,无需重新标注。在此框架的基础上,我们进一步构建了一个包含恶劣天气和光照场景的基准数据集,共包含30,000张图像。实验结果表明,我们的方法持续且显著地提高了现有车道检测网络的性能。例如,使用最先进的CLRNet,我们的基准数据集上的整体mF1得分提高了20.87%。整体、正常、雪、雨、雾、夜间和黄昏类别的F1@50得分分别提高了19.75%、8.63%、38.8%、14.96%、26.84%、21.5%和12.04%。代码和数据集可在以下链接获取:https://github.com/zdc233/HG-Lane。
cs.CV / 3 / 2603.10132

Unbalanced Optimal Transport Dictionary Learning for Unsupervised Hyperspectral Image Clustering

用于无监督高光谱图像聚类的非平衡最优传输字典学习
Lentz, Joshua, Karris, Nicholas, Cloninger, Alex, Murphy, James M.
Abstract
Hyperspectral images capture vast amounts of high-dimensional spectral information about a scene, making labeling an intensive task that is resistant to out-of-the-box statistical methods. Unsupervised learning of clusters allows for automated segmentation of the scene, enabling a more rapid understanding of the image. Partitioning the spectral information contained within the data via dictionary learning in Wasserstein space has proven an effective method for unsupervised clustering. However, this approach requires balancing the spectral profiles of the data, blurring the classes, and sacrificing robustness to outliers and noise. In this paper, we suggest improving this approach by utilizing unbalanced Wasserstein barycenters to learn a lower-dimensional representation of the underlying data. The deployment of spectral clustering on the learned representation results in an effective approach for the unsupervised learning of labels.
Chinese Translation
高光谱图像捕捉了关于场景的大量高维光谱信息,使得标注成为一项繁重的任务,且难以用现成的统计方法处理。无监督聚类学习允许对场景进行自动分割,从而更快速地理解图像。通过在Wasserstein空间中进行字典学习来划分数据中包含的光谱信息已被证明是一种有效的无监督聚类方法。然而,这种方法需要对数据的光谱特征进行归一化平衡,导致类别模糊,并牺牲对异常值和噪声的鲁棒性。在本文中,我们建议通过利用非平衡Wasserstein重心来改进这一方法,以学习潜在数据的低维表示。在学习到的表示上应用谱聚类,可得到一种有效的无监督标签学习方法。
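The key ingredient above, unbalanced optimal transport whose KL-relaxed marginals mean spectral profiles need not be renormalised to equal mass, can be sketched with the standard entropic scaling iterations; parameter names and defaults are ours:

```python
import numpy as np

def unbalanced_sinkhorn(a, b, C, eps=0.05, rho=1.0, n_iter=200):
    """Entropic unbalanced OT between histograms a, b with cost matrix C.

    The hard marginal constraints are replaced by a KL penalty of strength
    rho, so total mass need not be preserved exactly; this is the property
    that avoids forcing spectral profiles to be balanced. The updates use
    the standard scaling iterations with exponent rho / (rho + eps).
    """
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    pw = rho / (rho + eps)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** pw
        v = (b / (K.T @ u)) ** pw
    return u[:, None] * K * v[None, :]   # transport plan
```

As rho grows the plan approaches classical balanced OT; small rho lets mass be created or destroyed cheaply, which improves robustness to outliers and noise.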
cs.CV / 4 / 2603.10178

Video-Based Reward Modeling for Computer-Use Agents

基于视频的计算机使用代理奖励建模
Song, Linxin, Zhang, Jieyu, Sheng, Huanxin, Shi, Taiwei, Rahul, Gupta, Liu, Yang, Krishna, Ranjay, Kang, Jian, Zhao, Jieyu
Abstract
Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent's internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video-task-reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent tokens while preserving decisive UI changes. Building on these components, we fine-tune an Execution Video Reward Model (ExeVRM) that takes only a user instruction and a video-execution sequence to predict task success. Our ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming strong proprietary models such as GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, while providing more precise temporal attribution. These results show that video-execution reward modeling can serve as a scalable, model-agnostic evaluator for CUAs.
Chinese Translation
计算机使用代理(CUAs)正变得越来越强大;然而,对轨迹是否真正满足用户指令进行规模化评估仍然困难。在本研究中,我们研究了基于执行视频的奖励建模:即来自代理轨迹的关键帧序列,独立于代理的内部推理或动作。尽管视频执行建模与具体方法无关,但它面临着一些关键挑战,包括高度冗余的布局和决定成功的微妙、局部线索。我们引入了执行视频奖励53k(ExeVR-53k),这是一个包含53,000个高质量视频-任务-奖励三元组的数据集。我们进一步提出了对抗性指令翻译,以合成带有步骤级注释的负样本。为了能够从长时间、高分辨率的执行视频中学习,我们设计了时空标记修剪,该方法在保留决定性用户界面变化的同时,去除同质区域和持久标记。在这些组件的基础上,我们微调了执行视频奖励模型(ExeVRM),该模型仅需用户指令和视频执行序列即可预测任务成功。我们的ExeVRM 8B在视频执行评估中达到了84.7%的准确率和87.7%的召回率,在Ubuntu、macOS、Windows和Android上均超越了如GPT-5.2和Gemini-3 Pro等强大的专有模型,同时提供了更精确的时间归因。这些结果表明,视频执行奖励建模可以作为CUAs的可扩展、与模型无关的评估工具。
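The spatiotemporal token pruning described above (dropping tokens that are spatially homogeneous or temporally persistent so that only decisive UI changes remain) can be sketched as a simple variance filter; the thresholds and the exact criteria are our assumptions:

```python
import numpy as np

def prune_tokens(frames, spatial_thresh=1e-3, temporal_thresh=1e-3):
    """Keep only tokens that are both distinctive and changing (sketch).

    frames: (T, N, D) array of per-frame token features. A token is pruned
    if its features barely change over time (persistent chrome such as
    menu bars) or barely deviate from the frame mean (flat background),
    so the surviving tokens carry the decisive UI changes.
    Returns the indices of the kept token positions.
    """
    temporal_var = frames.var(axis=0).mean(axis=-1)     # (N,) change over time
    dev = frames - frames.mean(axis=1, keepdims=True)   # deviation from frame mean
    spatial_dev = (dev ** 2).mean(axis=(0, 2))          # (N,) spatial distinctiveness
    keep = (temporal_var > temporal_thresh) & (spatial_dev > spatial_thresh)
    return np.nonzero(keep)[0]
```

On a toy sequence, a token that stays at the background value and a token that is distinctive but never changes are both dropped, while a token that evolves across frames survives.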
cs.CV / 5 / 2603.10210

Delta-K: Boosting Multi-Instance Generation via Cross-Attention Augmentation

Delta-K:通过交叉注意力增强提升多实例生成
Wang, Zitong, Shen, Zijun, Xu, Haohao, Luo, Zhengjie, Wu, Weibin
Abstract
While Diffusion Models excel in text-to-image synthesis, they often suffer from concept omission when synthesizing complex multi-instance scenes. Existing training-free methods attempt to resolve this by rescaling attention maps, which merely exacerbates unstructured noise without establishing coherent semantic representations. To address this, we propose Delta-K, a backbone-agnostic and plug-and-play inference framework that tackles omission by operating directly in the shared cross-attention Key space. Specifically, using a vision-language model, we extract a differential key $\Delta K$ that encodes the semantic signature of missing concepts. This signal is then injected during the early semantic planning stage of the diffusion process. Governed by a dynamically optimized scheduling mechanism, Delta-K grounds diffuse noise into stable structural anchors while preserving existing concepts. Extensive experiments demonstrate the generality of our approach: Delta-K consistently improves compositional alignment across both modern DiT models and classical U-Net architectures, without requiring spatial masks, additional training, or architectural modifications.
Chinese Translation
尽管扩散模型在文本到图像合成方面表现出色,但在合成复杂的多实例场景时,常常会出现概念遗漏的问题。现有的无训练方法试图通过重新缩放注意力图来解决这一问题,但这仅仅加剧了无结构噪声,而未能建立连贯的语义表示。为了解决这一问题,我们提出了Delta-K,这是一种与骨干网络无关且即插即用的推理框架,通过直接在共享的交叉注意力键空间中操作来应对遗漏问题。具体而言,我们利用视觉-语言模型提取一个差异键$\Delta K$,该键编码了缺失概念的语义特征。然后,在扩散过程的早期语义规划阶段注入该信号。Delta-K通过动态优化的调度机制,将扩散噪声固定在稳定的结构锚点上,同时保留现有概念。大量实验表明我们方法的普遍性:Delta-K在现代DiT模型和经典U-Net架构中始终改善了组合对齐,而无需空间掩码、额外训练或架构修改。
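Operating "in the shared cross-attention Key space" can be illustrated by shifting the keys with a differential key before the attention softmax. Applying the shift to every key and using a fixed alpha are simplifications of the paper's scheduled injection; all names here are assumptions:

```python
import numpy as np

def attention_with_delta_k(Q, K, V, delta_k, alpha=1.0):
    """Cross-attention with a differential key injected (illustrative sketch).

    delta_k (shape (d,)) encodes the semantic signature of a missing concept;
    adding alpha * delta_k to the keys is one concrete way to realise
    'operating directly in the shared Key space'. In the paper the injection
    is restricted to the early semantic-planning steps under a dynamically
    optimised schedule; here a single fixed alpha stands in for that.
    """
    K = K + alpha * delta_k                     # inject the differential key
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # scaled dot-product attention
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V
```

With `alpha=0` this reduces to vanilla cross-attention, which is the backbone-agnostic, plug-and-play property the abstract emphasises.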
cs.CV / 6 / 2603.10212

FusionNet: a frame interpolation network for 4D heart models

FusionNet:一种用于四维心脏模型的帧插值网络
Chang, Chujie, Miyauchi, Shoko, Morooka, Ken'ichi, Kurazume, Ryo, Mozos, Oscar Martinez
Abstract
Cardiac magnetic resonance (CMR) imaging is widely used to visualise cardiac motion and diagnose heart disease. However, standard CMR imaging requires patients to lie still in a confined space inside a loud machine for 40-60 min, which increases patient discomfort. In addition, shorter scan times decrease the temporal and/or spatial resolution of cardiac motion, and thus the diagnostic accuracy of the procedure. Of these, we focus on reduced temporal resolution and propose a neural network called FusionNet to obtain four-dimensional (4D) cardiac motion with high temporal resolution from CMR images captured in a short period of time. The model estimates intermediate 3D heart shapes based on adjacent shapes. The results of an experimental evaluation of the proposed FusionNet model showed that it achieved a performance of over 0.897 in terms of the Dice coefficient, confirming that it can recover shapes more precisely than existing methods. The code is available at: https://github.com/smiyauchi199/FusionNet.git
Chinese Translation
心脏磁共振成像(CMR)广泛用于可视化心脏运动和诊断心脏疾病。然而,标准的CMR成像要求患者在嘈杂机器内的狭小空间中静止躺卧40-60分钟,这增加了患者的不适感。此外,较短的扫描时间会降低心脏运动的时间或空间分辨率,从而影响诊断的准确性。在这些因素中,我们关注于降低的时间分辨率,并提出了一种名为FusionNet的神经网络,以从短时间内捕获的CMR图像中获取高时间分辨率的四维(4D)心脏运动。该模型基于相邻形状估计中间的三维心脏形状。对所提出的FusionNet模型的实验评估结果表明,其在Dice系数方面的性能超过0.897,确认其能够比现有方法更精确地恢复形状。该代码可在以下链接获取:https://github.com/smiyauchi199/FusionNet.git
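The task setup above (estimating intermediate 3D heart shapes from adjacent frames, evaluated with the Dice coefficient) can be made concrete with a linear-interpolation baseline and a reference Dice implementation; FusionNet learns a mapping intended to beat such a baseline, and the function names here are ours:

```python
import numpy as np

def interpolate_shapes(v_prev, v_next, t):
    """Linear baseline for temporal upsampling of 3D heart meshes.

    v_prev, v_next: (N, 3) vertex arrays of adjacent frames sharing the same
    topology; t in [0, 1] selects the intermediate time. A learned model
    replaces this with a data-driven, motion-aware estimate.
    """
    return (1.0 - t) * v_prev + t * v_next

def dice(a, b):
    """Dice coefficient between two binary volumes (the evaluation metric)."""
    a = a.astype(bool)
    b = b.astype(bool)
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())
```

The abstract's "over 0.897" figure is this Dice score computed between reconstructed and reference shapes.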
cs.CV / 7 / 2603.10216

An Automated Radiomics Framework for Postoperative Survival Prediction in Colorectal Liver Metastases using Preoperative MRI

基于自动化放射组学框架的结直肠肝转移术后生存预测:使用术前MRI
Alberb, Muhammad, Chen, Jianan, El-rewaidy, Hossam, Karanicolas, Paul, Seth, Arun, Amemiya, Yutaka, Martel, Anne, Cheung, Helen
Abstract
While colorectal liver metastasis (CRLM) is potentially curable via hepatectomy, patient outcomes remain highly heterogeneous. Postoperative survival prediction is necessary to avoid non-beneficial surgeries and guide personalized therapy. In this study, we present an automated AI-based framework for postoperative CRLM survival prediction using pre- and post-contrast MRI. We performed a retrospective study of 227 CRLM patients who had gadoxetate-enhanced MRI prior to curative-intent hepatectomy between 2013 and 2020. We developed a survival prediction framework comprising an anatomy-aware segmentation pipeline followed by a radiomics pipeline. The segmentation pipeline learns liver, CRLMs, and spleen segmentation from partially-annotated data, leveraging promptable foundation models to generate pseudo-labels. To support this pipeline, we propose SAMONAI, a prompt propagation algorithm that extends Segment Anything Model to 3D point-based segmentation. Predicted pre- and post-contrast segmentations are then fed into our radiomics pipeline, which extracts per-tumor features and predicts survival using SurvAMINN, an autoencoder-based multiple instance neural network for time-to-event survival prediction. SurvAMINN jointly learns dimensionality reduction and survival prediction from right-censored data, emphasizing high-risk metastases. We compared our framework against established methods and biomarkers using univariate and multivariate Cox regression. Our segmentation pipeline achieves median Dice scores of 0.96 (liver) and 0.93 (spleen), driving a CRLM segmentation Dice score of 0.78 and a detection F1-score of 0.79. Accurate segmentation enables our radiomics pipeline to achieve a survival prediction C-index of 0.69. Our results show the potential of integrating segmentation algorithms with radiomics-based survival analysis to deliver accurate and automated CRLM outcome prediction.
Chinese Translation
尽管结直肠肝转移(CRLM)通过肝切除术有潜在治愈的可能,但患者的预后仍然高度异质。术后生存预测对于避免无益的手术和指导个性化治疗是必要的。在本研究中,我们提出了一种基于自动化人工智能的框架,利用术前MRI的对比剂注射前后(平扫与增强)图像进行CRLM的术后生存预测。我们对2013年至2020年间在根治性肝切除术前接受过钆塞酸(gadoxetate)增强MRI的227名CRLM患者进行了回顾性研究。我们开发了一个生存预测框架,包括一个解剖学感知的分割管道,随后是一个放射组学管道。分割管道从部分标注数据中学习肝脏、CRLM和脾脏的分割,利用可提示的基础模型生成伪标签。为了支持该管道,我们提出了SAMONAI,一种提示传播算法,将Segment Anything Model扩展到基于3D点的分割。预测的对比剂注射前后分割结果随后输入到我们的放射组学管道中,该管道提取每个肿瘤的特征,并使用SurvAMINN(一种基于自编码器的多实例神经网络,用于事件时间生存预测)进行生存预测。SurvAMINN从右删失数据中共同学习降维和生存预测,强调高风险转移灶。我们使用单变量和多变量Cox回归将我们的框架与已建立的方法和生物标志物进行了比较。我们的分割管道在肝脏和脾脏的中位Dice得分分别为0.96和0.93,使CRLM分割的Dice得分达到0.78,检测的F1得分达到0.79。准确的分割使我们的放射组学管道实现了0.69的生存预测C-index。我们的结果显示了将分割算法与基于放射组学的生存分析相结合,以提供准确、自动化的CRLM预后预测的潜力。
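The reported C-index of 0.69 is Harrell's concordance index over right-censored survival data; a minimal reference implementation of that metric:

```python
def concordance_index(times, events, risks):
    """Harrell's concordance index for right-censored survival data.

    times: observed times; events: 1 if the event was observed, 0 if the
    subject was censored; risks: predicted risk scores (higher = worse).
    A pair (i, j) is comparable when the subject with the shorter time had
    an observed event; ties in risk count as half-concordant.
    """
    num, den = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                den += 1
                if risks[i] > risks[j]:
                    num += 1.0
                elif risks[i] == risks[j]:
                    num += 0.5
    return num / den
```

A model that ranks every shorter-lived patient as higher risk scores 1.0; random predictions score about 0.5, so 0.69 indicates a meaningful but imperfect ordering.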
cs.CV / 8 / 2603.10220

Robotic Ultrasound Makes CBCT Alive

机器人超声让锥形束计算机断层扫描(CBCT)"活"起来
Li, Feng, Li, Ziyuan, Jiang, Zhongliang, Navab, Nassir, Bi, Yuan
Abstract
Intraoperative Cone Beam Computed Tomography (CBCT) provides a reliable 3D anatomical context essential for interventional planning. However, its static nature fails to provide continuous monitoring of soft-tissue deformations induced by respiration, probe pressure, and surgical manipulation, leading to navigation discrepancies. We propose a deformation-aware CBCT updating framework that leverages robotic ultrasound as a dynamic proxy to infer tissue motion and update static CBCT slices in real time. Starting from calibration-initialized alignment with linear correlation of linear combinations (LC2)-based rigid refinement, our method establishes accurate multimodal correspondence. To capture intraoperative dynamics, we introduce the ultrasound correlation UNet (USCorUNet), a lightweight network trained with optical flow-guided supervision to learn deformation-aware correlation representations, enabling accurate, real-time dense deformation field estimation from ultrasound streams. The inferred deformation is spatially regularized and transferred to the CBCT reference to produce deformation-consistent visualizations without repeated radiation exposure. We validate the proposed approach through deformation estimation and ultrasound-guided CBCT updating experiments. Results demonstrate real-time end-to-end CBCT slice updating and physically plausible deformation estimation, enabling dynamic refinement of static CBCT guidance during robotic ultrasound-assisted interventions. The source code is publicly available at https://github.com/anonymous-codebase/us-cbct-demo.
Chinese Translation
术中锥形束计算机断层扫描(CBCT)提供了可靠的三维解剖背景,对于介入规划至关重要。然而,其静态特性无法持续监测由呼吸、探头压力和外科操作引起的软组织变形,导致导航差异。我们提出了一种变形感知的CBCT更新框架,该框架利用机器人超声作为动态代理,以推断组织运动并实时更新静态CBCT切片。我们的方法从标定初始化的对齐出发,结合基于线性组合线性相关(LC2)的刚性配准精化,建立了准确的多模态对应关系。为了捕捉术中动态,我们引入了超声相关UNet(USCorUNet),这是一个轻量级网络,通过光流引导的监督进行训练,以学习变形感知的相关表示,从超声流中实现准确的实时密集变形场估计。推断出的变形经过空间正则化,并转移到CBCT参考上,以生成变形一致的可视化,而无需重复辐射暴露。我们通过变形估计和超声引导的CBCT更新实验验证了所提出的方法。结果表明,能够实现实时端到端的CBCT切片更新和物理上合理的变形估计,从而在机器人超声辅助介入过程中实现静态CBCT指导的动态精细化。源代码已公开,网址为 https://github.com/anonymous-codebase/us-cbct-demo。
cs.CV / 9 / 2603.10231

OilSAM2: Memory-Augmented SAM2 for Scalable SAR Oil Spill Detection

OilSAM2:用于可扩展SAR油污检测的记忆增强型SAM2
Chen, Shuaiyu, Yin, Ming, Ren, Peng, Luo, Chunbo, Fu, Zeyu
Abstract
Segmenting oil spills from Synthetic Aperture Radar (SAR) imagery remains challenging due to severe appearance variability, scale heterogeneity, and the absence of temporal continuity in real-world monitoring scenarios. While foundation models such as Segment Anything (SAM) enable prompt-driven segmentation, existing SAM-based approaches operate on single images and cannot effectively reuse information across scenes. Memory-augmented variants (e.g., SAM2) further assume temporal coherence, making them prone to semantic drift when applied to unordered SAR image collections. We propose OilSAM2, a memory-augmented segmentation framework tailored for unordered SAR oil spill monitoring. OilSAM2 introduces a hierarchical feature-aware multi-scale memory bank that explicitly models texture-, structure-, and semantic-level representations, enabling robust cross-image information reuse. To mitigate memory drift, we further propose a structure-semantic-consistent memory update strategy that selectively refreshes memory based on semantic discrepancy and structural variation. Experiments on two public SAR oil spill datasets demonstrate that OilSAM2 achieves state-of-the-art segmentation performance, delivering stable and accurate results under noisy SAR monitoring scenarios. The source code is available at https://github.com/Chenshuaiyu1120/OILSAM2.
Chinese Translation
从合成孔径雷达(SAR)图像中分割油污仍然具有挑战性,原因在于外观变化剧烈、尺度异质性以及在实际监测场景中缺乏时间连续性。尽管基础模型如Segment Anything(SAM)能够实现提示驱动(prompt-driven)的分割,但现有的基于SAM的方法仅在单幅图像上操作,无法有效地在场景之间重用信息。记忆增强变体(例如,SAM2)进一步假设时间一致性,这使得它们在应用于无序的SAR图像集合时容易出现语义漂移。我们提出了OilSAM2,一种针对无序SAR油污监测的记忆增强分割框架。OilSAM2引入了一个层次特征感知的多尺度记忆库,明确建模纹理、结构和语义层次的表示,从而实现跨图像信息的稳健重用。为了减轻记忆漂移,我们进一步提出了一种结构语义一致的记忆更新策略,根据语义差异和结构变化选择性地刷新记忆。在两个公共SAR油污数据集上的实验表明,OilSAM2实现了最先进的分割性能,在噪声较大的SAR监测场景中提供了稳定和准确的结果。源代码可在https://github.com/Chenshuaiyu1120/OILSAM2获取。
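The structure-semantic consistent memory update above (admitting a new entry only when it is sufficiently novel both semantically and structurally, so near-duplicate scenes do not drift the memory bank) can be sketched with a cosine-distance gate; the criterion, thresholds, and all names are our assumptions:

```python
import numpy as np

def maybe_update_memory(memory, feat, sem_thresh=0.2, struct_thresh=0.2):
    """Gate a memory-bank insertion on semantic AND structural novelty.

    memory: list of stored entries; feat: dict with 'semantic' and
    'structure' feature vectors for the current image. The entry is
    admitted only when its cosine distance to every stored entry exceeds
    both thresholds, so redundant scenes are kept out of the bank.
    Returns True when the entry was inserted.
    """
    def cos_dist(u, v):
        return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    if not memory:
        memory.append(feat)
        return True
    sem_d = min(cos_dist(feat["semantic"], m["semantic"]) for m in memory)
    str_d = min(cos_dist(feat["structure"], m["structure"]) for m in memory)
    if sem_d > sem_thresh and str_d > struct_thresh:
        memory.append(feat)
        return True
    return False
```

Because the gate needs no temporal ordering, it matches the unordered-collection setting the abstract targets, unlike the coherence assumption in video-style memory updates.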
cs.CV / 10 / 2603.10234

Why Does It Look There? Structured Explanations for Image Classification

为什么它看向那里?图像分类的结构化解释
Li, Jiarui, Yin, Zixiang, Landry, Samuel J, Ding, Zhengming, Mettu, Ramgopal R.
Abstract
Deep learning models achieve remarkable predictive performance, yet their black-box nature limits transparency and trustworthiness. Although numerous explainable artificial intelligence (XAI) methods have been proposed, they primarily provide saliency maps or concepts (i.e., unstructured interpretability). Existing approaches often rely on auxiliary models (e.g., GPT, CLIP) to describe model behavior, thereby compromising faithfulness to the original models. We propose Interpretability to Explainability (I2X), a framework that builds structured explanations directly from unstructured interpretability by quantifying progress at selected checkpoints during training using prototypes extracted from post-hoc XAI methods (e.g., GradCAM). I2X answers the question of "why does it look there" by providing a structured view of both intra- and inter-class decision making during training. Experiments on MNIST and CIFAR10 demonstrate the effectiveness of I2X in revealing the prototype-based inference process of various image classification models. Moreover, we demonstrate that I2X can be used to improve predictions across different model architectures and datasets: we can identify uncertain prototypes recognized by I2X and then use targeted perturbation of samples that allows fine-tuning to ultimately improve accuracy. Thus, I2X not only faithfully explains model behavior but also provides a practical approach to guide optimization toward desired targets.
Chinese Translation
深度学习模型在预测性能上取得了显著的成果,但其黑箱特性限制了透明性和可信度。尽管已经提出了众多可解释人工智能(XAI)方法,但它们主要提供显著性图或概念(即非结构化可解释性)。现有的方法通常依赖于辅助模型(例如,GPT,CLIP)来描述模型行为,从而妨碍了对原始模型的忠实性。我们提出了可解释性到可说明性(Interpretability to Explainability, I2X)框架,该框架通过在训练过程中使用从后验XAI方法(例如,GradCAM)提取的原型量化选定检查点的进展,直接从非结构化可解释性构建结构化解释。I2X通过提供训练过程中类内和类间决策的结构化视图,回答了“为什么它看起来在那里”的问题。在MNIST和CIFAR10上的实验表明,I2X能够有效揭示各种图像分类模型的基于原型的推理过程。此外,我们还展示了I2X可以用于改善不同模型架构和数据集的预测:我们可以识别出I2X所识别的不确定原型,然后通过有针对性的样本扰动进行微调,从而最终提高准确性。因此,I2X不仅忠实地解释了模型行为,还提供了一种实用的方法来指导优化朝向期望目标。
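The checkpoint-tracking idea behind I2X can be sketched minimally: build a saliency-weighted class prototype at each training checkpoint and measure how it drifts. The weighting rule, shapes, and drift metric below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_prototype(features, saliency):
    """Saliency-weighted mean of feature vectors -> one class prototype.

    features: (N, D) penultimate-layer features for one class,
    saliency: (N,) per-sample saliency mass (e.g. from GradCAM).
    Both inputs are stand-ins for what I2X would extract post hoc.
    """
    w = saliency / saliency.sum()
    return w @ features  # (D,)

# Simulate two training checkpoints for one class (values are synthetic).
feats_ckpt1 = rng.normal(0.0, 1.0, size=(32, 8))
feats_ckpt2 = feats_ckpt1 + 0.1        # features drift slightly during training
sal = rng.uniform(0.5, 1.0, size=32)

p1 = extract_prototype(feats_ckpt1, sal)
p2 = extract_prototype(feats_ckpt2, sal)

# Prototype drift between checkpoints: a scalar "progress" signal per class.
drift = float(np.linalg.norm(p2 - p1))
```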
cs.CV / 11 / 2603.10237

One Adapter for All: Towards Unified Representation in Step-Imbalanced Class-Incremental Learning

一款适配器应对所有:迈向步长不平衡的类增量学习统一表示
Zhang, Xiaoyan, He, Jiangpeng
Abstract
Class-incremental learning (CIL) aims to acquire new classes over time while retaining prior knowledge, yet most setups and methods assume balanced task streams. In practice, the number of classes per task often varies significantly. We refer to this as step imbalance, where large tasks that contain more classes dominate learning and small tasks inject unstable updates. Existing CIL methods assume balanced tasks and therefore treat all tasks uniformly, producing imbalanced updates that degrade overall learning performance. To address this challenge, we propose One-A, a unified and imbalance-aware framework that incrementally merges task updates into a single adapter, maintaining constant inference cost. One-A performs asymmetric subspace alignment to preserve dominant subspaces learned from large tasks while constraining low-information updates within them. An information-adaptive weighting balances the contribution between base and new adapters, and a directional gating mechanism selectively fuses updates along each singular direction, maintaining stability in head directions and plasticity in tail ones. Across multiple benchmarks and step-imbalanced streams, One-A achieves competitive accuracy with significantly lower inference overhead, showing that a single, asymmetrically fused adapter can remain both adaptive to dynamic task sizes and efficient at deployment.
Chinese Translation
类增量学习(CIL)旨在随着时间的推移获取新类别,同时保留先前的知识,但大多数设置和方法假设任务流是平衡的。在实际应用中,每个任务的类别数量往往差异显著。我们将此称为步长不平衡,其中包含更多类别的大任务主导学习,而小任务则注入不稳定的更新。现有的CIL方法假设任务是平衡的,因此对所有任务进行统一处理,导致不平衡的更新,从而降低整体学习性能。为了解决这一挑战,我们提出了One-A,一个统一且关注不平衡的框架,能够逐步将任务更新合并为一个适配器,同时保持恒定的推理成本。One-A执行不对称子空间对齐,以保留从大任务中学习到的主导子空间,同时限制其中低信息更新的影响。信息自适应加权平衡基础适配器和新适配器之间的贡献,而方向性门控机制选择性地沿每个单一方向融合更新,保持头部方向的稳定性和尾部方向的可塑性。在多个基准测试和步长不平衡的任务流中,One-A实现了具有竞争力的准确性,并显著降低了推理开销,表明单个不对称融合的适配器能够在动态任务规模下保持适应性,并在部署时高效。
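One-A's directional gating can be sketched with plain SVD arithmetic: decompose the base adapter update, keep the base contribution along the top ("head") singular directions, and admit the new task's update along the remaining ("tail") directions. The hard keep/admit rule and shapes below are illustrative assumptions, not the paper's information-adaptive weighting.

```python
import numpy as np

rng = np.random.default_rng(1)

def fuse_adapters(base, new, keep=2):
    """Directional-gating sketch (assumed form, not the paper's exact rule).

    SVD the base adapter update; along its top-`keep` ("head") singular
    directions, keep the base update; along the remaining ("tail")
    directions, admit the new task's update.
    """
    U, S, Vt = np.linalg.svd(base, full_matrices=False)
    fused = np.zeros_like(base)
    for i in range(len(S)):
        u, v = U[:, i:i+1], Vt[i:i+1, :]
        # component of each update along the i-th singular direction
        comp_base = u @ (u.T @ base @ v.T) @ v
        comp_new = u @ (u.T @ new @ v.T) @ v
        fused += comp_base if i < keep else comp_new
    return fused

base = rng.normal(size=(6, 4))
new = rng.normal(size=(6, 4))
merged = fuse_adapters(base, new, keep=4)  # keep == rank -> pure base update
```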
cs.CV / 12 / 2603.10253

Joint Imaging-ROI Representation Learning via Cross-View Contrastive Alignment for Brain Disorder Classification

通过跨视图对比对齐的联合成像-ROI表示学习用于脑部疾病分类
Liang, Wei, He, Lifang
Abstract
Brain imaging classification is commonly approached from two perspectives: modeling the full image volume to capture global anatomical context, or constructing ROI-based graphs to encode localized and topological interactions. Although both representations have demonstrated independent efficacy, their relative contributions and potential complementarity remain insufficiently understood. Existing fusion approaches are typically task-specific and do not enable controlled evaluation of each representation under consistent training settings. To address this gap, we propose a unified cross-view contrastive framework for joint imaging-ROI representation learning. Our method learns subject-level global (imaging) and local (ROI-graph) embeddings and aligns them in a shared latent space using a bidirectional contrastive objective, encouraging representations from the same subject to converge while separating those from different subjects. This alignment produces comparable embeddings suitable for downstream fusion and enables systematic evaluation of imaging-only, ROI-only, and joint configurations within a unified training protocol. Extensive experiments on the ADHD-200 and ABIDE datasets demonstrate that joint learning consistently improves classification performance over either branch alone across multiple backbone choices. Moreover, interpretability analyses reveal that imaging-based and ROI-based branches emphasize distinct yet complementary discriminative patterns, explaining the observed performance gains. These findings provide principled evidence that explicitly integrating global volumetric and ROI-level representations is a promising direction for neuroimaging-based brain disorder classification. The source code is available at https://anonymous.4open.science/r/imaging-roi-contrastive-152C/.
Chinese Translation
脑成像分类通常从两个角度进行:建模完整的图像体积以捕捉全局解剖上下文,或构建基于ROI的图形以编码局部和拓扑交互。尽管这两种表示方法都已证明具有独立的有效性,但它们的相对贡献和潜在互补性仍然不够明确。现有的融合方法通常是特定于任务的,无法在一致的训练设置下对每种表示进行控制评估。为了解决这一问题,我们提出了一种统一的跨视图对比框架,用于联合成像-ROI表示学习。我们的方法学习受试者级别的全局(成像)和局部(ROI图形)嵌入,并使用双向对比目标在共享潜在空间中对齐它们,鼓励来自同一受试者的表示收敛,同时将不同受试者的表示分开。这种对齐产生了适合下游融合的可比嵌入,并在统一的训练协议中实现了对仅成像、仅ROI和联合配置的系统评估。在ADHD-200和ABIDE数据集上的大量实验表明,联合学习在多个骨干网络选择中始终提高了分类性能,优于单独的任一分支。此外,解释性分析表明,基于成像和基于ROI的分支强调了不同但互补的区分模式,解释了观察到的性能提升。这些发现提供了明确的证据,表明明确整合全局体积和ROI级别表示是基于神经影像的脑部疾病分类的一个有前景的方向。源代码可在 https://anonymous.4open.science/r/imaging-roi-contrastive-152C/ 获取。
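The bidirectional contrastive objective that aligns imaging and ROI-graph embeddings has a standard InfoNCE form; the sketch below uses random vectors in place of learned embeddings and a generic temperature, so it illustrates the objective rather than the paper's exact loss.

```python
import numpy as np

rng = np.random.default_rng(2)

def bidirectional_infonce(z_img, z_roi, tau=0.1):
    """Symmetric contrastive loss over subject-paired embeddings.

    z_img, z_roi: (B, D) L2-normalized global (imaging) and local
    (ROI-graph) embeddings; row i of each comes from the same subject.
    """
    logits = z_img @ z_roi.T / tau          # (B, B) similarity matrix
    labels = np.arange(len(z_img))

    def ce(l):  # cross-entropy with matched pairs on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

z = normalize(rng.normal(size=(8, 16)))
loss_aligned = bidirectional_infonce(z, z)   # perfectly aligned views
loss_random = bidirectional_infonce(z, normalize(rng.normal(size=(8, 16))))
```

Aligned views put all probability mass on the diagonal, so the loss is much smaller than for unrelated embeddings.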
cs.CV / 13 / 2603.10267

A Robust Deep Learning Framework for Bangla License Plate Recognition Using YOLO and Vision-Language OCR

基于YOLO和视觉-语言OCR的孟加拉车牌识别的鲁棒深度学习框架
Hasin, Nayeb, Nishat, Md. Arafath Rahman, Islam, Mainul, Hasan, Khandakar Shakib Al, Newaz, Asif
Abstract
An Automatic License Plate Recognition (ALPR) system constitutes a crucial element in an intelligent traffic management system. However, the detection of Bangla license plates remains challenging because of the complicated character scheme and uneven layouts. This paper presents a robust Bangla License Plate Recognition system that integrates a deep learning-based object detection model for license plate localization with Optical Character Recognition for text extraction. Multiple object detection architectures, including U-Net and several YOLO (You Only Look Once) variants, are compared for license plate localization. This study proposes a novel two-stage adaptive training strategy built upon the YOLOv8 architecture to improve localization performance. The proposed approach outperforms the established models, achieving an accuracy of 97.83% and an Intersection over Union (IoU) of 91.3%. The text recognition problem is formulated as a sequence generation task using a VisionEncoderDecoder architecture, with several encoder-decoder combinations evaluated. It was demonstrated that the ViT + BanglaBERT model gives better results at the character level, with a Character Error Rate of 0.1323 and Word Error Rate of 0.1068. The proposed system also shows a consistent performance when tested on an external dataset that has been curated for this study purpose. The dataset offers completely different environment and lighting conditions compared to the training sample, indicating the robustness of the proposed framework. Overall, our proposed system provides a robust and reliable solution for Bangla license plate recognition and performs effectively across diverse real-world scenarios, including variations in lighting, noise, and plate styles. These strengths make it well suited for deployment in intelligent transportation applications such as automated law enforcement and access control.
Chinese Translation
自动车牌识别(ALPR)系统是智能交通管理系统中的一个关键组成部分。然而,由于复杂的字符体系和不均匀的布局,孟加拉车牌的检测仍然具有挑战性。本文提出了一种鲁棒的孟加拉车牌识别系统,该系统集成了基于深度学习的物体检测模型用于车牌定位,以及光学字符识别(OCR)用于文本提取。我们比较了多种物体检测架构,包括U-Net和几种YOLO(You Only Look Once)变体,以进行车牌定位。本研究提出了一种基于YOLOv8架构的新型两阶段自适应训练策略,以提高定位性能。所提出的方法在准确率上超过了现有模型,达到了97.83%的准确率和91.3%的交并比(IoU)。文本识别问题被表述为序列生成问题,采用视觉编码器-解码器(Vision Encoder-Decoder)架构,并评估了多种编码器-解码器的组合。结果表明,ViT + BanglaBERT模型在字符级别上表现更佳,字符错误率为0.1323,词错误率为0.1068。所提出的系统在针对本研究目的策划的外部数据集上也显示出一致的性能。该数据集与训练样本相比,提供了完全不同的环境和光照条件,表明所提出框架的鲁棒性。总体而言,我们提出的系统为孟加拉车牌识别提供了一种鲁棒且可靠的解决方案,并在包括光照、噪声和车牌样式变化等多种真实场景中有效运行。这些优势使其非常适合于智能交通应用的部署,如自动执法和访问控制。
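The reported Character Error Rate (0.1323) and Word Error Rate (0.1068) are standard edit-distance metrics; the example strings below are illustrative only, not from the paper's dataset.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate = edit distance / reference length."""
    return levenshtein(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word Error Rate, computed on whitespace-split tokens."""
    return levenshtein(ref.split(), hyp.split()) / len(ref.split())

# Illustrative Bangla-style plate strings (not from the paper's dataset).
example_cer = cer("ঢাকা মেট্রো গ 123456", "ঢাকা মেট্রো গ 123455")
```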
cs.CV / 14 / 2603.10300

From Imitation to Intuition: Intrinsic Reasoning for Open-Instance Video Classification

从模仿到直觉:开放实例视频分类的内在推理
Zhang, Ke, Zhao, Xiangchen, Tian, Yunjie, Zheng, Jiayu, Patel, Vishal M., Fu, Di
Abstract
Conventional video classification models, acting as effective imitators, excel in scenarios with homogeneous data distributions. However, real-world applications often present an open-instance challenge, where intra-class variations are vast and complex, beyond existing benchmarks. While traditional video encoder models struggle to fit these diverse distributions, vision-language models (VLMs) offer superior generalization but have not fully leveraged their reasoning capabilities (intuition) for such tasks. In this paper, we bridge this gap with an intrinsic reasoning framework that evolves open-instance video classification from imitation to intuition. Our approach, namely DeepIntuit, begins with a cold-start supervised alignment to initialize reasoning capability, followed by refinement using Group Relative Policy Optimization (GRPO) to enhance reasoning coherence through reinforcement learning. Crucially, to translate this reasoning into accurate classification, DeepIntuit then introduces an intuitive calibration stage. In this stage, a classifier is trained on the intrinsic reasoning traces generated by the refined VLM, ensuring stable knowledge transfer without distribution mismatch. Extensive experiments demonstrate that for open-instance video classification, DeepIntuit benefits significantly from transcending simple feature imitation and evolving toward intrinsic reasoning. Our project is available at https://bwgzk-keke.github.io/DeepIntuit/.
Chinese Translation
传统的视频分类模型作为有效的模仿者,在数据分布均匀的场景中表现优异。然而,现实世界的应用常常面临开放实例的挑战,其中类内变异巨大且复杂,超出了现有基准的范围。虽然传统的视频编码模型难以适应这些多样化的分布,视觉语言模型(VLMs)提供了更优的泛化能力,却尚未充分利用其推理能力(直觉)来应对此类任务。本文提出了一种内在推理框架,将开放实例视频分类从模仿演变为直觉。我们的方法,称为DeepIntuit,首先通过冷启动的监督对齐来初始化推理能力,然后使用群体相对策略优化(GRPO)进行精细化,以通过强化学习增强推理的一致性。关键的是,为了将这种推理转化为准确的分类,DeepIntuit引入了一个直观的校准阶段。在这一阶段,分类器在由精细化的VLM生成的内在推理轨迹上进行训练,确保知识的稳定转移而不发生分布不匹配。大量实验表明,对于开放实例视频分类,DeepIntuit显著受益于超越简单特征模仿,向内在推理的演变。我们的项目可在 https://bwgzk-keke.github.io/DeepIntuit/ 获取。
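The GRPO refinement stage standardizes each rollout's reward against its group, removing the need for a learned value function. The sketch below shows only this generic group-relative advantage computation, with a toy 0/1 reward; it does not reflect DeepIntuit's specific reward design.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as used in GRPO-style RL.

    For a group of rollouts answering the same prompt, each sample's
    advantage is its reward standardized against the group's mean and
    standard deviation.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled reasoning traces for one video, scored 0/1 for correctness.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```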
cs.CV / 15 / 2603.10335

Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models

燃料计:提前估计大型多模态模型中的思维链长度
Yang, Yuedong, Wei, Xiwen, Munir, Mustafa, Marculescu, Radu
Abstract
Reasoning Large Multi-modality Models (LMMs) have become the de facto choice for many applications. However, these models rely on a Chain-of-Thought (CoT) process that is lengthy and unpredictable at runtime, often resulting in inefficient use of computational resources (due to memory fragmentation) and sub-optimal accuracy (due to under- and over-thinking). We observe empirically that the CoT process follows a very simple form, whose behavior is independent of the specific generated samples. This suggests that the CoT length can be estimated ahead of time based on a hidden parameter representing the amount of "fuel" available to support the reasoning process. Based on this insight, we propose Fuel Gauge, the first method which extracts this hidden signal and predicts CoT length ahead of time. We demonstrate the utility of the Fuel Gauge on two downstream tasks: predictive KV cache allocation, which addresses memory fragmentation in LMM serving systems, and CoT length modulation, which mitigates under-thinking and over-thinking. Extensive experiments on LMMs across text-only, image-text, and video-text question answering benchmarks demonstrate the effectiveness, generalizability, and practical value of our Fuel Gauge. For example, on the GPQA-Diamond benchmark, our Fuel Gauge achieves less than half the CoT length prediction error compared to the baseline; this translates into a 13.37x reduction in the memory allocation frequency.
Chinese Translation
推理大型多模态模型(LMMs)已成为许多应用的事实标准。然而,这些模型依赖于一个冗长且在运行时不可预测的思维链(CoT)过程,常常导致计算资源的低效使用(由于内存碎片化)和亚最优的准确性(由于思考不足和过度思考)。我们通过实证观察到,CoT 过程遵循一种非常简单的形式,其行为与特定生成样本无关。这表明,可以基于一个隐藏参数来提前估计 CoT 长度,该参数表示可用于支持推理过程的“燃料”量。基于这一见解,我们提出了燃料计(Fuel Gauge),这是第一种提取这一隐藏信号并提前预测 CoT 长度的方法。我们在两个下游任务上展示了燃料计的实用性:预测 KV 缓存分配,解决 LMM 服务系统中的内存碎片化问题,以及 CoT 长度调节,缓解思考不足和过度思考。在文本、图像-文本和视频-文本问答基准上对 LMMs 进行的广泛实验证明了我们燃料计的有效性、可推广性和实际价值。例如,在 GPQA-Diamond 基准上,我们的燃料计的 CoT 长度预测误差不到基线的一半;这转化为内存分配频率 13.37 倍的降低。
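Why a length prediction reduces allocation frequency can be seen with a toy model: a cache that grows chunk by chunk reallocates many times, while a single prediction-sized allocation covers the whole decode. Chunk size, lengths, and the fallback rule below are illustrative assumptions, not the paper's serving system.

```python
def allocation_events(true_len, predicted_len=None, chunk=256):
    """Count KV-cache (re)allocations while decoding `true_len` tokens.

    Without a prediction the cache grows chunk by chunk; with one, a
    single upfront allocation covers decoding unless the prediction was
    short, after which growth falls back to chunks.
    """
    capacity, events = 0, 0
    if predicted_len is not None:
        capacity, events = predicted_len, 1
    for t in range(1, true_len + 1):
        if t > capacity:
            capacity += chunk
            events += 1
    return events

baseline = allocation_events(true_len=4096)                      # grow in chunks
predicted = allocation_events(true_len=4096, predicted_len=4200) # one allocation
```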
cs.CV / 16 / 2603.10340

Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation

通过概念门控视觉蒸馏克服视觉语言动作模型中的视觉杂乱
Song, Sangmim, Kodagoda, Sarath, Carmichael, Marc, Thiyagarajan, Karthick
Abstract
Vision-Language-Action (VLA) models demonstrate impressive zero-shot generalization but frequently suffer from a "Precision-Reasoning Gap" in cluttered environments. This failure is driven by background-induced feature dilution, where high-frequency semantic noise corrupts the geometric grounding required for precise manipulation. To bridge this gap, we propose Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference framework that stabilizes VLA policies. CGVD operates by parsing instructions into safe and distractor sets, utilizing a two-layer target refinement process--combining cross-validation and spatial disambiguation--to explicitly penalize false positives and isolate genuine manipulation targets. We then process the scene via Fourier-based inpainting, generating a clean observation that actively suppresses semantic distractors while preserving critical spatial geometry and visual proprioception. Extensive evaluations in highly cluttered manipulation tasks demonstrate that CGVD prevents performance collapse. In environments with dense semantic distractors, our method significantly outperforms state-of-the-art baselines, achieving a 77.5% success rate compared to the baseline's 43.0%. By enforcing strict attribute adherence, CGVD establishes inference-time visual distillation as a critical prerequisite for robust robotic manipulation in clutter.
Chinese Translation
视觉-语言-动作(VLA)模型展示了令人印象深刻的零样本泛化能力,但在杂乱环境中常常遭遇“精确推理差距”。这种失败源于背景引起的特征稀释,高频语义噪声破坏了精确操作所需的几何基础。为了解决这一问题,我们提出了概念门控视觉蒸馏(CGVD),这是一种无训练、模型无关的推理框架,旨在稳定VLA策略。CGVD通过将指令解析为安全集和干扰集来运作,利用两层目标细化过程——结合交叉验证和空间消歧——明确惩罚假阳性并隔离真实的操作目标。然后,我们通过基于傅里叶的修复处理场景,生成一个干净的观察,积极抑制语义干扰,同时保留关键的空间几何和视觉本体感知。在高度杂乱的操作任务中进行的广泛评估表明,CGVD有效防止了性能崩溃。在具有密集语义干扰的环境中,我们的方法显著优于最先进的基线,成功率达到77.5%,而基线仅为43.0%。通过强制严格的属性遵循,CGVD确立了推理时视觉蒸馏作为在杂乱环境中实现稳健机器人操作的关键前提。
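The idea of suppressing a distractor region with a spectral fill can be sketched as iterative low-pass projection with data consistency on the known pixels. This is a generic Fourier-inpainting sketch under assumed parameters (cutoff, iteration count, toy image), not CGVD's exact procedure.

```python
import numpy as np

def fourier_inpaint(img, mask, keep_frac=0.1, iters=50):
    """Fill masked (distractor) pixels with a low-frequency reconstruction.

    Iteratively: low-pass the image in the Fourier domain, then copy the
    low-passed values back into the masked region only.
    img: (H, W) float array, mask: (H, W) bool, True = pixels to replace.
    """
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    lowpass = (np.abs(fy) <= keep_frac / 2) & (np.abs(fx) <= keep_frac / 2)
    out = np.where(mask, img[~mask].mean(), img)   # init holes with mean
    for _ in range(iters):
        smooth = np.fft.ifft2(np.fft.fft2(out) * lowpass).real
        out = np.where(mask, smooth, img)          # keep known pixels intact
    return out

# Smooth ramp with a high-contrast "distractor" patch burned in.
yy, xx = np.mgrid[0:32, 0:32]
clean = (xx + yy) / 62.0
noisy = clean.copy()
noisy[10:16, 10:16] = 5.0                          # semantic distractor
mask = np.zeros_like(noisy, dtype=bool)
mask[10:16, 10:16] = True

restored = fourier_inpaint(noisy, mask)
err_before = float(np.abs(noisy - clean)[mask].mean())
err_after = float(np.abs(restored - clean)[mask].mean())
```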
cs.CV / 17 / 2603.10349

EmoStory: Emotion-Aware Story Generation

EmoStory:情感感知故事生成
Yang, Jingyuan, Chen, Rucong, Huang, Hui
Abstract
Story generation aims to produce image sequences that depict coherent narratives while maintaining subject consistency across frames. Although existing methods have excelled in producing coherent and expressive stories, they remain largely emotion-neutral, focusing on what subject appears in a story while overlooking how emotions shape narrative interpretation and visual presentation. As stories are intended to engage audiences emotionally, we introduce emotion-aware story generation, a new task that aims to generate subject-consistent visual stories with explicit emotional directions. This task is challenging due to the abstract nature of emotions, which must be grounded in concrete visual elements and consistently expressed across a narrative through visual composition. To address these challenges, we propose EmoStory, a two-stage framework that integrates agent-based story planning and region-aware story generation. The planning stage transforms target emotions into coherent story prompts with emotion agent and writer agent, while the generation stage preserves subject consistency and injects emotion-related elements through region-aware composition. We evaluate EmoStory on a newly constructed dataset covering 25 subjects and 600 emotional stories. Extensive quantitative and qualitative results, along with user studies, show that EmoStory outperforms state-of-the-art story generation methods in emotion accuracy, prompt alignment, and subject consistency.
Chinese Translation
故事生成旨在产生描绘连贯叙事的图像序列,同时在各帧之间保持主题一致性。尽管现有方法在生成连贯且富有表现力的故事方面表现出色,但它们在很大程度上仍然是情感中立的,专注于故事中出现的主题,而忽视了情感如何塑造叙事的解读和视觉呈现。由于故事旨在在情感上吸引观众,我们引入了情感感知故事生成这一新任务,旨在生成具有明确情感指向的主题一致的视觉故事。该任务具有挑战性,因为情感的抽象特性必须与具体的视觉元素相结合,并通过视觉构图在叙事中始终如一地表达。为了解决这些挑战,我们提出了EmoStory,一个集成了基于代理的故事规划和区域感知故事生成的两阶段框架。规划阶段将目标情感转化为连贯的故事提示,涉及情感代理和写作代理,而生成阶段则通过区域感知构图保持主题一致性并注入与情感相关的元素。我们在一个新构建的数据集上评估EmoStory,该数据集涵盖25个主题和600个情感故事。大量的定量和定性结果,以及用户研究表明,EmoStory在情感准确性、提示对齐和主题一致性方面优于现有的最先进的故事生成方法。
cs.CV / 18 / 2603.10354

StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References

StyleGallery:无训练且语义感知的个性化风格迁移方法,支持任意图像参考
He, Boyu, Ye, Yunfan, Liu, Chang, Wu, Weishang, Liu, Fang, Cai, Zhiping
Abstract
Despite the advancements in diffusion-based image style transfer, existing methods are commonly limited by 1) semantic gap: the style reference could miss proper content semantics, causing uncontrollable stylization; 2) reliance on extra constraints (e.g., semantic masks) restricting applicability; 3) rigid feature associations lacking adaptive global-local alignment, failing to balance fine-grained stylization and global content preservation. These limitations, particularly the inability to flexibly leverage style inputs, fundamentally restrict style transfer in terms of personalization, accuracy, and adaptability. To address these, we propose StyleGallery, a training-free and semantic-aware framework that supports arbitrary reference images as input and enables effective personalized customization. It comprises three core stages: semantic region segmentation (adaptive clustering on latent diffusion features to divide regions without extra inputs); clustered region matching (block filtering on extracted features for precise alignment); and style transfer optimization (energy function-guided diffusion sampling with regional style loss to optimize stylization). Experiments on our introduced benchmark demonstrate that StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when leveraging multiple style references.
Chinese Translation
尽管基于扩散的图像风格迁移技术取得了进展,但现有方法通常受到以下限制:1)语义差距:风格参考可能缺乏适当的内容语义,导致无法控制的风格化;2)依赖额外约束(例如,语义掩码)限制了适用性;3)刚性的特征关联缺乏自适应的全局-局部对齐,未能平衡细粒度风格化与全局内容保留。这些限制,特别是灵活利用风格输入的能力不足,根本上限制了风格迁移在个性化、准确性和适应性方面的表现。为了解决这些问题,我们提出了StyleGallery,一个无训练且语义感知的框架,支持任意参考图像作为输入,并能够有效地进行个性化定制。该框架包括三个核心阶段:语义区域分割(对潜在扩散特征进行自适应聚类以划分区域,无需额外输入);聚类区域匹配(对提取特征进行块过滤以实现精确对齐);以及风格迁移优化(通过区域风格损失的能量函数引导的扩散采样来优化风格化)。在我们引入的基准测试中,实验表明StyleGallery在内容结构保留、区域风格化、可解释性和个性化定制方面优于最先进的方法,尤其是在利用多个风格参考时。
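The semantic region segmentation stage clusters per-pixel features without extra inputs; plain k-means on a synthetic feature map gives the flavor. The feature map, k, and iteration count below are illustrative, and k-means stands in for the paper's adaptive clustering on latent diffusion features.

```python
import numpy as np

rng = np.random.default_rng(3)

def kmeans_regions(feat, k=2, iters=20):
    """Cluster per-pixel features into k semantic regions.

    feat: (H, W, D) feature map. Returns an (H, W) integer label map.
    """
    h, w, d = feat.shape
    x = feat.reshape(-1, d)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None]) ** 2).sum(-1)  # (HW, k)
        labels = dists.argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = x[labels == c].mean(0)
    return labels.reshape(h, w)

# Two synthetic "semantic regions": left vs right half of the feature map.
feat = np.zeros((16, 16, 4))
feat[:, :8] = rng.normal(0.0, 0.1, size=(16, 8, 4))
feat[:, 8:] = rng.normal(3.0, 0.1, size=(16, 8, 4))
regions = kmeans_regions(feat, k=2)
```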
cs.CV / 19 / 2603.10360

One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination

一个令牌,两种命运:通过视觉令牌操控对抗多模态大语言模型幻觉的统一框架
Fa, Zhan, Duan, Yue, Zhang, Jian, Qi, Lei, Shi, Yinghuan
Abstract
Current training-free methods tackle MLLM hallucination with separate strategies: either enhancing visual signals or suppressing text inertia. However, these separate methods are insufficient due to critical trade-offs: simply enhancing vision often fails against strong language prior, while suppressing language can introduce extra image-irrelevant noise. Moreover, we find their naive combination is also ineffective, necessitating a unified framework. We propose such a framework by focusing on the core asset: the vision token. Our design leverages two key insights: (1) augmented images offer complementary visual semantics, and (2) removing vision tokens (information-gap) isolates hallucination tendencies more precisely than distorting images (modality-gap). Based on these, our framework uses vision tokens in two distinct ways, both operating on latent representations: our Synergistic Visual Calibration (SVC) module incorporates augmented tokens to strengthen visual representations, while our Causal Representation Calibration (CRC) module uses pruned tokens to create latent-space negative samples for correcting internal model biases. By harmonizing these two roles, our framework effectively restores the vision-language balance, significantly reducing object hallucinations, improving POPE accuracy by an average of 2% absolute on LLaVA-1.5 across multiple benchmarks with only a 1.06x inference latency overhead.
Chinese Translation
当前的无训练方法通过分离策略应对多模态大语言模型(MLLM)幻觉:要么增强视觉信号,要么抑制文本惯性。然而,由于关键的权衡,这些分离方法并不足够:单纯增强视觉往往无法抵御强大的语言先验,而抑制语言则可能引入额外的与图像无关的噪声。此外,我们发现它们的简单组合也无效,因此需要一个统一的框架。我们通过关注核心资产——视觉令牌,提出了这样一个框架。我们的设计利用了两个关键见解:(1)增强图像提供了互补的视觉语义;(2)去除视觉令牌(信息缺口)比扭曲图像(模态缺口)更精确地隔离幻觉倾向。基于这些,我们的框架以两种不同的方式使用视觉令牌,均在潜在表示上操作:我们的协同视觉校准(Synergistic Visual Calibration, SVC)模块结合增强令牌以增强视觉表示,而我们的因果表示校准(Causal Representation Calibration, CRC)模块使用修剪令牌在潜在空间中创建负样本,以纠正内部模型偏差。通过协调这两种角色,我们的框架有效地恢复了视觉-语言平衡,显著减少了物体幻觉,在多个基准上将LLaVA-1.5的POPE准确率平均提高了2%绝对值,同时仅增加了1.06倍的推理延迟开销。
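Using the two extra vision-token passes as a positive and a negative signal can be sketched at the logit level: steer toward the augmented-token pass and away from the vision-pruned pass. The contrastive form, coefficients, and toy vocabulary below are illustrative assumptions and operate on logits rather than the paper's latent representations (SVC/CRC).

```python
import numpy as np

def calibrate_logits(base, augmented, pruned, alpha=0.5, beta=0.5):
    """Steer next-token logits toward vision and away from text inertia.

    base:      logits from the normal forward pass,
    augmented: logits with augmented-image vision tokens (visual boost),
    pruned:    logits with vision tokens removed (language-prior sample).
    """
    return base + alpha * (augmented - base) - beta * (pruned - base)

vocab = ["cat", "dog", "chair"]
base = np.array([2.0, 1.0, 0.5])
augmented = np.array([3.0, 1.0, 0.2])   # vision strongly supports "cat"
pruned = np.array([0.5, 2.5, 0.5])      # language prior hallucinates "dog"

calibrated = calibrate_logits(base, augmented, pruned)
top = vocab[int(np.argmax(calibrated))]
```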
cs.CV / 20 / 2603.10365

Geometric Autoencoder for Diffusion Models

用于扩散模型的几何自编码器
Liu, Hangyu, Wang, Jianyong, Sun, Yutao
Abstract
Latent diffusion models have established a new state-of-the-art in high-resolution visual generation. Integrating Vision Foundation Model priors improves generative efficiency, yet existing latent designs remain largely heuristic. These approaches often struggle to unify semantic discriminability, reconstruction fidelity, and latent compactness. In this paper, we propose Geometric Autoencoder (GAE), a principled framework that systematically addresses these challenges. By analyzing various alignment paradigms, GAE constructs an optimized low-dimensional semantic supervision target from VFMs to provide guidance for the autoencoder. Furthermore, we leverage latent normalization that replaces the restrictive KL-divergence of standard VAEs, enabling a more stable latent manifold specifically optimized for diffusion learning. To ensure robust reconstruction under high-intensity noise, GAE incorporates a dynamic noise sampling mechanism. Empirically, GAE achieves compelling performance on the ImageNet-1K $256 \times 256$ benchmark, reaching a gFID of 1.82 at only 80 epochs and 1.31 at 800 epochs without Classifier-Free Guidance, significantly surpassing existing state-of-the-art methods. Beyond generative quality, GAE establishes a superior equilibrium between compression, semantic depth and robust reconstruction stability. These results validate our design considerations, offering a promising paradigm for latent diffusion modeling. Code and models are publicly available at https://github.com/freezing-index/Geometric-Autoencoder-for-Diffusion-Models.
Chinese Translation
潜在扩散模型在高分辨率视觉生成方面建立了新的最先进水平。整合视觉基础模型(Vision Foundation Model, VFM)先验提高了生成效率,但现有的潜在设计仍然主要依赖启发式方法。这些方法往往难以统一语义可区分性、重建保真度和潜在紧凑性。本文提出了几何自编码器(Geometric Autoencoder, GAE),这是一个系统性解决这些挑战的原则性框架。通过分析各种对齐范式,GAE 从 VFM 构建了一个优化的低维语义监督目标,为自编码器提供指导。此外,我们利用潜在归一化,替代了标准变分自编码器(Variational Autoencoder, VAE)中限制性的 KL 散度,从而实现了一个专门为扩散学习优化的更稳定的潜在流形。为了确保在高强度噪声下的稳健重建,GAE 还引入了动态噪声采样机制。在实验中,GAE 在 ImageNet-1K $256 \times 256$ 基准测试中表现出色,在仅 80 个训练周期内达到了 1.82 的 gFID,在 800 个周期内达到了 1.31,且未使用无分类器引导(Classifier-Free Guidance),显著超越了现有的最先进方法。除了生成质量,GAE 还在压缩、语义深度和稳健重建稳定性之间建立了更优的平衡。这些结果验证了我们的设计考虑,为潜在扩散建模提供了一个有前景的范式。代码和模型已公开,网址为 https://github.com/freezing-index/Geometric-Autoencoder-for-Diffusion-Models。
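Replacing a KL penalty with latent normalization amounts to hard-constraining the latent statistics instead of regularizing them; a per-channel standardization sketch conveys the idea. This is an assumed formulation of "latent normalization"; GAE's exact operator may differ.

```python
import numpy as np

def normalize_latents(z, eps=1e-6):
    """Per-channel standardization of autoencoder latents.

    Instead of a KL penalty, every channel of the latent batch is
    shifted/scaled to zero mean and unit variance, giving the diffusion
    model a fixed-scale input. z: (N, C) latent batch.
    """
    mu = z.mean(axis=0, keepdims=True)
    sigma = z.std(axis=0, keepdims=True)
    return (z - mu) / (sigma + eps)

rng = np.random.default_rng(4)
z = rng.normal(loc=5.0, scale=3.0, size=(256, 8))   # raw encoder output
z_norm = normalize_latents(z)
```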
cs.CV / 21 / 2603.10370

GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning

GeoSense:内化几何必要性感知以实现多模态推理
Liu, Ruiheng, Hao, Haihong, Han, Mingfei, Gu, Xin, Zhang, Kecheng, Li, Changlin, Chang, Xiaojun
Abstract
Advancing towards artificial superintelligence requires rich and intelligent perceptual capabilities. A critical frontier in this pursuit is overcoming the limited spatial understanding of Multimodal Large Language Models (MLLMs), where geometry information is essential. Existing methods often address this by rigidly injecting geometric signals into every input, while ignoring their necessity and adding computation overhead. Contrary to this paradigm, our framework endows the model with an awareness of perceptual insufficiency, empowering it to autonomously engage geometric features in reasoning when 2D cues are deemed insufficient. To achieve this, we first introduce an independent geometry input channel to the model architecture and conduct alignment training, enabling the effective utilization of geometric features. Subsequently, to endow the model with perceptual awareness, we curate a dedicated spatial-aware supervised fine-tuning dataset. This serves to activate the model's latent internal cues, empowering it to autonomously determine the necessity of geometric information. Experiments across multiple spatial reasoning benchmarks validate this approach, demonstrating significant spatial gains without compromising 2D visual reasoning capabilities, offering a path toward more robust, efficient and self-aware multi-modal intelligence.
Chinese Translation
迈向人工超智能需要丰富而智能的感知能力。在这一追求中,一个关键的前沿是克服多模态大型语言模型(MLLMs)有限的空间理解能力,而几何信息在其中至关重要。现有方法通常通过将几何信号僵硬地注入每个输入来解决这个问题,但忽视了其必要性并增加了计算开销。与这一范式相反,我们的框架赋予模型对感知不足的意识,使其在二维线索被认为不足时,能够自主地参与几何特征的推理。为此,我们首先在模型架构中引入一个独立的几何输入通道,并进行对齐训练,以有效利用几何特征。随后,为了赋予模型感知意识,我们策划了一个专门的空间感知监督微调数据集。这一数据集的作用是激活模型的潜在内部线索,使其能够自主判断几何信息的必要性。在多个空间推理基准上的实验验证了这一方法,显示出显著的空间提升,同时不妨碍二维视觉推理能力,为更强大、高效和自我意识的多模态智能提供了一条路径。
cs.CV / 22 / 2603.10398

Multi-Person Pose Estimation Evaluation Using Optimal Transportation and Improved Pose Matching

基于最优运输和改进姿态匹配的多人物姿态估计评估
Moriki, Takato, Taketsugu, Hiromu, Ukita, Norimichi
Abstract
In Multi-Person Pose Estimation, many metrics place importance on the ranking of pose detection confidence scores. Current metrics tend to disregard false-positive poses with low confidence, focusing primarily on a larger number of high-confidence poses. Consequently, these metrics may yield high scores even when many false-positive poses with low confidence are detected. For fair evaluation that takes into account the tradeoff between true-positive and false-positive poses, this paper proposes Optimal Correction Cost for pose (OCpose), which evaluates detected poses against pose annotations as an optimal transportation problem. For a fair tradeoff between true-positive and false-positive poses, OCpose evaluates all detected poses equally, regardless of their confidence scores. At the same time, OCpose utilizes the confidence score of each pose to improve the reliability of matching scores between estimated poses and pose annotations. As a result, OCpose provides a different perspective of assessment than other confidence-ranking-based metrics.
Chinese Translation
在多人物姿态估计中,许多指标重视姿态检测置信度评分的排名。目前的指标往往忽视低置信度的假阳性姿态,主要关注数量较多的高置信度姿态。因此,这些指标即使在检测到许多低置信度的假阳性姿态时也可能产生高分。为了公平评估,考虑到真阳性和假阳性姿态之间的权衡,本文提出了姿态的最优修正成本(Optimal Correction Cost for pose, OCpose),该方法将检测到的姿态与姿态注释进行最优运输评估。为了在真阳性和假阳性姿态之间实现公平的权衡,OCpose对所有检测到的姿态进行平等评估,而不考虑其置信度评分。另一方面,在OCpose中,每个姿态的置信度评分被用来提高估计姿态与姿态注释之间匹配评分的可靠性。因此,OCpose提供了与其他基于置信度排名的指标不同的评估视角。
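Evaluating detections against annotations as an optimal transport can be sketched for small, equal-mass point sets, where it reduces to a minimum-cost assignment; unmatched poses pay a fixed miss/false-positive cost. The distances, dummy-cost value, and brute-force solver below are illustrative, not the paper's calibrated correction costs.

```python
import numpy as np
from itertools import permutations

def ocpose_cost(detected, annotated, miss_cost=1.0):
    """Minimum total correction cost between detected and annotated poses.

    Pads the smaller set with dummy poses at a fixed miss/false-positive
    cost, then brute-forces the optimal matching. Pose distance is mean
    keypoint L2 distance. detected, annotated: lists of (K, 2) arrays.
    """
    n = max(len(detected), len(annotated))
    def pose_dist(i, j):
        if i >= len(detected) or j >= len(annotated):
            return miss_cost          # unmatched detection or annotation
        return float(np.linalg.norm(detected[i] - annotated[j], axis=1).mean())
    costs = [[pose_dist(i, j) for j in range(n)] for i in range(n)]
    return min(sum(costs[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

gt = [np.zeros((3, 2)), np.full((3, 2), 10.0)]
det = [np.full((3, 2), 10.1),       # near the second person
       np.full((3, 2), 0.1),        # near the first person
       np.full((3, 2), 50.0)]       # low-confidence false positive
total_cost = ocpose_cost(det, gt)   # two near-matches plus one miss penalty
```

Note how the far-away false positive contributes `miss_cost` no matter how low its confidence is, which is exactly the tradeoff the metric makes explicit.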
cs.CV / 23 / 2603.10408

Motion Forcing: A Decoupled Framework for Robust Video Generation in Motion Dynamics

运动强制:一种解耦框架用于运动动态下的鲁棒视频生成
Xu, Tianshuo, Chen, Zhifei, Wu, Leyi, Lu, Hao, Chen, Ying-cong
Abstract
The ultimate goal of video generation is to satisfy a fundamental trilemma: achieving high visual quality, maintaining rigorous physical consistency, and enabling precise controllability. While recent models can maintain this balance in simple, isolated scenarios, we observe that this equilibrium is fragile and often breaks down as scene complexity increases (e.g., involving collisions or dense traffic). To address this, we introduce \textbf{Motion Forcing}, a framework designed to stabilize this trilemma even in complex generative tasks. Our key insight is to explicitly decouple physical reasoning from visual synthesis via a hierarchical \textbf{``Point-Shape-Appearance''} paradigm. This approach decomposes generation into verifiable stages: modeling complex dynamics as sparse geometric anchors (\textbf{Point}), expanding them into dynamic depth maps that explicitly resolve 3D geometry (\textbf{Shape}), and finally rendering high-fidelity textures (\textbf{Appearance}). Furthermore, to foster robust physical understanding, we employ a \textbf{Masked Point Recovery} strategy. By randomly masking input anchors during training and enforcing the reconstruction of complete dynamic depth, the model is compelled to move beyond passive pattern matching and learn latent physical laws (e.g., inertia) to infer missing trajectories. Extensive experiments on autonomous driving benchmarks show that Motion Forcing significantly outperforms state-of-the-art baselines, maintaining trilemma stability across complex scenes. Evaluations on physics and robotics further confirm our framework's generality.
Chinese Translation
视频生成的最终目标是满足一个基本的三难困境:实现高视觉质量、保持严格的物理一致性以及实现精确的可控性。虽然近期的模型能够在简单、孤立的场景中保持这种平衡,但我们观察到这种平衡是脆弱的,随着场景复杂性的增加(例如,涉及碰撞或密集交通),这种平衡常常会崩溃。为了解决这个问题,我们提出了\textbf{运动强制}(Motion Forcing),这是一个旨在即使在复杂生成任务中也能稳定这一三难困境的框架。我们的关键见解是通过层次化的\textbf{“点-形状-外观”}(Point-Shape-Appearance)范式,明确将物理推理与视觉合成解耦。该方法将生成过程分解为可验证的阶段:将复杂动态建模为稀疏几何锚点(\textbf{点}),将其扩展为动态深度图,明确解析三维几何(\textbf{形状}),最后渲染高保真纹理(\textbf{外观})。此外,为了促进鲁棒的物理理解,我们采用了\textbf{掩蔽点恢复}(Masked Point Recovery)策略。在训练过程中随机掩蔽输入锚点,并强制重建完整的动态深度,模型被迫超越被动的模式匹配,学习潜在的物理规律(例如,惯性)以推断缺失的轨迹。在自主驾驶基准上的大量实验表明,运动强制显著优于最先进的基线,在复杂场景中保持三难困境的稳定性。对物理和机器人学的评估进一步确认了我们框架的通用性。
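The Masked Point Recovery input side can be sketched in a few lines: randomly drop a fraction of the sparse geometric anchors and keep a boolean record of what survived. The ratio, shapes, and point format below are illustrative assumptions; the paper's training then penalizes reconstructing the full dynamic depth from the kept subset.

```python
import numpy as np

rng = np.random.default_rng(7)

def mask_anchors(anchors, mask_ratio=0.5):
    """Randomly drop a fraction of sparse geometric anchors.

    Returns the kept anchors and a boolean keep-mask over the originals.
    anchors: (N, 3) points, e.g. (x, y, t).
    """
    n = len(anchors)
    keep = rng.random(n) >= mask_ratio
    return anchors[keep], keep

anchors = rng.normal(size=(100, 3))
kept, keep_mask = mask_anchors(anchors)
```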
cs.CV / 24 / 2603.10417

Frames2Residual: Spatiotemporal Decoupling for Self-Supervised Video Denoising

Frames2Residual:自监督视频去噪的时空解耦
Ji, Mingjie, Shi, Zhan, Zhou, Kailai, Fu, Zixuan, Cao, Xun
Abstract
Self-supervised video denoising methods typically extend image-based frameworks into the temporal dimension, yet they often struggle to integrate inter-frame temporal consistency with intra-frame spatial specificity. Existing Video Blind-Spot Networks (BSNs) require noise independence by masking the center pixel; this constraint prevents the use of spatial evidence for texture recovery, thereby severing spatiotemporal correlations and causing texture loss. To address this, we propose Frames2Residual (F2R), a spatiotemporal decoupling framework that explicitly divides self-supervised training into two distinct stages: blind temporal consistency modeling and non-blind spatial texture recovery. In Stage 1, a blind temporal estimator learns inter-frame consistency using a frame-wise blind strategy, producing a temporally consistent anchor. In Stage 2, a non-blind spatial refiner leverages this anchor to safely reintroduce the center frame and recover intra-frame high-frequency spatial residuals while preserving temporal stability. Extensive experiments demonstrate that our decoupling strategy allows F2R to outperform existing self-supervised methods on both sRGB and raw video benchmarks.
Chinese Translation
自监督视频去噪方法通常将基于图像的框架扩展到时间维度,但它们往往难以将帧间的时间一致性与帧内的空间特异性结合起来。现有的视频盲点网络(Video Blind-Spot Networks, BSNs)通过遮蔽中心像素来要求噪声独立性,这一限制阻止了空间证据在纹理恢复中的使用,从而切断了时空相关性并导致纹理损失。为了解决这个问题,我们提出了Frames2Residual(F2R),一个时空解耦框架,明确将自监督训练分为两个不同的阶段:盲时间一致性建模和非盲空间纹理恢复。在第一阶段,盲时间估计器使用逐帧盲策略学习帧间一致性,生成一个时间一致的锚点。在第二阶段,非盲空间精炼器利用这个锚点安全地重新引入中心帧,并在保持时间稳定性的同时恢复帧内的高频空间残差。大量实验表明,我们的解耦策略使F2R在sRGB和原始视频基准测试中优于现有的自监督方法。
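The two-stage decoupling can be made concrete with a toy static sequence: Stage 1 estimates frame t without ever reading frame t (the blind anchor), and Stage 2 blends the center frame back in. The neighbor averaging and linear blend below are hand-crafted stand-ins for the learned networks, with synthetic Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(5)

def stage1_blind_anchor(frames, t):
    """Temporal estimate of frame t that never looks at frame t itself
    (the frame-wise blind strategy), so noise independence holds."""
    neighbors = [frames[i] for i in (t - 1, t + 1)]
    return np.mean(neighbors, axis=0)

def stage2_refine(anchor, center, w=0.5):
    """Non-blind refinement: blend the center frame back in to recover
    intra-frame detail the anchor blurred away. A linear blend stands in
    for the learned spatial refiner."""
    return anchor + w * (center - anchor)

clean = np.tile(np.linspace(0, 1, 256), (5, 1))      # 5 static frames
noisy = clean + rng.normal(0, 0.2, size=clean.shape)

anchor = stage1_blind_anchor(noisy, t=2)
refined = stage2_refine(anchor, noisy[2])
mse = lambda a, b: float(((a - b) ** 2).mean())
```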
cs.CV / 25 / 2603.10418

TractoRC: A Unified Probabilistic Learning Framework for Joint Tractography Registration and Clustering

TractoRC:一个统一的概率学习框架用于联合轨迹图配准和聚类
Li, Yijie, Zhu, Xi, Wang, Junyi, Wu, Ye, O'Donnell, Lauren J., Zhang, Fan
Abstract
Diffusion MRI tractography enables in vivo reconstruction of white matter (WM) pathways. Two key tasks in tractography analysis include: 1) tractogram registration that aligns streamlines across individuals, and 2) streamline clustering that groups streamlines into compact fiber bundles. Although both tasks share the goal of capturing geometrically similar structures to characterize consistent WM organization, they are typically performed independently. In this work, we propose TractoRC, a unified probabilistic framework that jointly performs tractogram registration and streamline clustering within a single optimization scheme, enabling the two tasks to leverage complementary information. TractoRC learns a latent embedding space for streamline points, which serves as a shared representation for both tasks. Within this space, both tasks are formulated as probabilistic inference over structural representations: registration learns the distribution of anatomical landmarks as probabilistic keypoints to align tractograms across subjects, and clustering learns streamline structural prototypes that capture geometric similarity to form coherent streamline clusters. To support effective learning of this shared space, we introduce a transformation-equivariant self-supervised strategy to learn geometry-aware and transformation-invariant embeddings. Experiments demonstrate that jointly optimizing registration and clustering significantly improves performance in both tasks over state-of-the-art methods that treat them independently. Code will be made publicly available at https://github.com/yishengpoxiao/TractoRC .
Chinese Translation
扩散MRI轨迹图技术使得白质(WM)通路的体内重建成为可能。轨迹图分析中的两个关键任务包括:1)轨迹图配准,旨在对齐不同个体的流线,以及2)流线聚类,将流线分组为紧凑的纤维束。尽管这两个任务的目标都是捕捉几何相似的结构以表征一致的WM组织,但它们通常是独立进行的。在本研究中,我们提出了TractoRC,一个统一的概率框架,能够在单一优化方案中联合执行轨迹图配准和流线聚类,从而使这两个任务能够利用互补信息。TractoRC学习流线点的潜在嵌入空间,作为两个任务的共享表示。在这个空间中,这两个任务被表述为对结构表示的概率推断:配准学习解剖标志点的分布,作为概率关键点以对齐不同个体的轨迹图,而聚类学习捕捉几何相似性的流线结构原型,以形成连贯的流线聚类。为了支持这一共享空间的有效学习,我们引入了一种变换等变的自监督策略,以学习几何感知和变换不变的嵌入。实验表明,联合优化配准和聚类显著提高了这两个任务的性能,相较于将它们独立处理的最先进方法。代码将在 https://github.com/yishengpoxiao/TractoRC 上公开发布。
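The clustering half of the framework rests on probabilistically assigning streamline embeddings to learned structural prototypes. As a toy illustration of that assignment step (not the paper's code; the embeddings and prototypes below are random stand-ins), a softmax over negative squared distances yields soft cluster responsibilities:

```python
import numpy as np

def soft_assignments(embeddings, prototypes, temperature=1.0):
    """Soft-assign each streamline embedding to structural prototypes via
    a softmax over negative squared Euclidean distances (toy sketch)."""
    # (N, K) squared distances between N embeddings and K prototypes
    d2 = ((embeddings[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)      # each row sums to 1

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 3))      # 6 streamline embeddings
protos = rng.normal(size=(2, 3))   # 2 structural prototypes
resp = soft_assignments(emb, protos)
```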
cs.CV / 26 / 2603.10422

World2Act: Latent Action Post-Training via Skill-Compositional World Models

World2Act:通过技能组合世界模型进行潜在动作后训练
Vuong, An Dinh, Van Vo, Tuan, Sohail, Abdullah, Ding, Haoran, Ma, Liang, Liang, Xiaodan, Duan, Anqing, Laptev, Ivan, Reid, Ian
Abstract
World Models (WMs) have emerged as a promising approach for post-training Vision-Language-Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post-training framework that aligns VLA actions directly with WM video-dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post-training performance is tied to rollout quality, yet current WMs struggle with arbitrary-length video generation as they are mostly trained on fixed-length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM-based skill-decomposition pipeline that segments high-level instructions into low-level prompts. Our pipeline produces RoboCasa-Skill and LIBERO-Skill, supporting skill-compositional WMs that remain temporally consistent across diverse task horizons. Empirically, applying World2Act to VLAs like GR00T-N1.6 and Cosmos Policy achieves state-of-the-art results on RoboCasa and LIBERO, and improves real-world performance by 6.7%, enhancing embodied agent generalization.
Chinese Translation
世界模型(World Models, WMs)作为一种有前景的方法,已被提出用于后训练视觉-语言-动作(Vision-Language-Action, VLA)策略,以提高在环境变化下的鲁棒性和泛化能力。然而,大多数基于WM的后训练方法依赖于像素空间监督,使得策略对像素级伪影和不完美WM回放的幻觉敏感。我们提出了World2Act,一个后训练框架,通过对比匹配目标将VLA动作直接与WM视频动态潜变量对齐,从而减少对像素的依赖。后训练性能与回放质量相关,但当前的WMs在生成任意长度视频方面存在困难,因为它们大多是在固定长度片段上训练的,而机器人执行的持续时间差异很大。为了解决这个问题,我们提出了一种自动的、基于大语言模型(LLM)的技能分解管道,将高层指令分割成低层提示。我们的管道生成了RoboCasa-Skill和LIBERO-Skill,支持在多样化任务范围内保持时间一致性的技能组合WMs。实证结果表明,将World2Act应用于如GR00T-N1.6和Cosmos Policy等VLA,在RoboCasa和LIBERO上达到了最先进的结果,并将现实世界的表现提高了6.7%,增强了具身智能体的泛化能力。
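The contrastive matching objective that aligns VLA action embeddings with world-model dynamics latents can be sketched as a standard InfoNCE loss over a batch, with matched (action, latent) pairs on the diagonal of a similarity matrix. This is a generic InfoNCE, not the paper's implementation, and the embeddings are placeholders:

```python
import numpy as np

def info_nce(actions, latents, tau=0.1):
    """InfoNCE over a batch: matched (action, latent) pairs sit on the
    diagonal of the cosine-similarity matrix (generic sketch)."""
    a = actions / np.linalg.norm(actions, axis=1, keepdims=True)
    z = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    logits = a @ z.T / tau                        # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_p)))        # cross-entropy on diagonal
```

With perfectly matched pairs the loss approaches zero; mismatched or noisy pairs drive it up.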
cs.CV / 27 / 2603.10446

SignSparK: Efficient Multilingual Sign Language Production via Sparse Keyframe Learning

SignSparK:通过稀疏关键帧学习实现高效的多语言手语生成
Low, Jianhe, Symeonidis-Herzig, Alexandre, Ivashechkin, Maksym, Sincan, Ozge Mercanoglu, Bowden, Richard
Abstract
Generating natural and linguistically accurate sign language avatars remains a formidable challenge. Current Sign Language Production (SLP) frameworks face a stark trade-off: direct text-to-pose models suffer from regression-to-the-mean effects, while dictionary-retrieval methods produce robotic, disjointed transitions. To resolve this, we propose a novel training paradigm that leverages sparse keyframes to capture the true underlying kinematic distribution of human signing. By predicting dense motion from these discrete anchors, our approach mitigates regression-to-the-mean while ensuring fluid articulation. To realize this paradigm at scale, we first introduce FAST, an ultra-efficient sign segmentation model that automatically mines precise temporal boundaries. We then present SignSparK, a large-scale Conditional Flow Matching (CFM) framework that utilizes these extracted anchors to synthesize 3D signing sequences in SMPL-X and MANO spaces. This keyframe-driven formulation also uniquely unlocks Keyframe-to-Pose (KF2P) generation, making precise spatiotemporal editing of signing sequences possible. Furthermore, our adopted reconstruction-based CFM objective also enables high-fidelity synthesis in fewer than ten sampling steps; this allows SignSparK to scale across four distinct sign languages, establishing the largest multilingual SLP framework to date. Finally, by integrating 3D Gaussian Splatting for photorealistic rendering, we demonstrate through extensive evaluation that SignSparK establishes a new state-of-the-art across diverse SLP tasks and multilingual benchmarks.
Chinese Translation
生成自然且语言学上准确的手语虚拟形象仍然是一项艰巨的挑战。目前的手语生成(SLP)框架面临着明显的权衡:直接的文本到姿势模型受到均值回归效应的影响,而字典检索方法则产生机械化、脱节的过渡。为了解决这一问题,我们提出了一种新颖的训练范式,利用稀疏关键帧捕捉人类手语的真实运动学分布。通过从这些离散锚点预测密集运动,我们的方法减轻了均值回归的影响,同时确保了流畅的表达。为了在大规模上实现这一范式,我们首先介绍了FAST,这是一种超高效的手语分割模型,能够自动挖掘精确的时间边界。接着,我们提出了SignSparK,这是一个大规模的条件流匹配(CFM)框架,利用这些提取的锚点在SMPL-X和MANO空间中合成3D手语序列。这种以关键帧为驱动的形式还独特地解锁了关键帧到姿势(KF2P)生成,使得手语序列的精确时空编辑成为可能。此外,我们采用的基于重建的CFM目标还使得在不到十个采样步骤内实现高保真合成成为可能;这使得SignSparK能够覆盖四种不同的手语,建立了迄今为止最大的多语言SLP框架。最后,通过集成3D高斯点渲染实现照片级真实感渲染,我们通过广泛的评估展示了SignSparK在各种SLP任务和多语言基准测试中建立了新的最先进水平。
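The conditional flow matching objective the abstract relies on regresses a velocity field along a straight path between noise and data. A minimal sketch of the training-pair construction (generic linear-path CFM, not the SignSparK model; `x1` here stands in for a pose vector):

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Linear-path conditional flow matching: x_t interpolates noise x0
    and data x1, and the network's regression target is the constant
    velocity x1 - x0."""
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, v_target

rng = np.random.default_rng(0)
x0 = rng.normal(size=5)   # noise sample
x1 = rng.normal(size=5)   # data sample (e.g., a keyframe pose vector)
xt, v = cfm_training_pair(x0, x1, t=0.25)
```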
cs.CV / 28 / 2603.10456

LCAMV: High-Accuracy 3D Reconstruction of Color-Varying Objects Using LCA Correction and Minimum-Variance Fusion in Structured Light

LCAMV:基于LCA校正和最小方差融合的高精度彩色变化物体3D重建
Oh, Wonbeen, Hyun, Jae-Sang
Abstract
Accurate 3D reconstruction of colored objects with structured light (SL) is hindered by lateral chromatic aberration (LCA) in optical components and uneven noise characteristics across RGB channels. This paper introduces lateral chromatic aberration correction and minimum-variance fusion (LCAMV), a robust 3D reconstruction method that operates with a single projector-camera pair without additional hardware or acquisition constraints. LCAMV analytically models and pixel-wise compensates LCA in both the projector and camera, then adaptively fuses multi-channel phase data using a Poisson-Gaussian noise model and minimum-variance estimation. Unlike existing methods that require extra hardware or multiple exposures, LCAMV enables fast acquisition. Experiments on planar and non-planar colored surfaces show that LCAMV outperforms grayscale conversion and conventional channel-weighting, reducing depth error by up to 43.6\%. These results establish LCAMV as an effective solution for high-precision 3D reconstruction of nonuniformly colored objects.
Chinese Translation
使用结构光(SL)对彩色物体进行准确的3D重建受到光学元件中的横向色差(LCA)和RGB通道间不均匀噪声特性的影响。本文提出了一种横向色差校正和最小方差融合(LCAMV)的方法,这是一种稳健的3D重建方法,能够在不需要额外硬件或采集限制的情况下,使用单个投影仪-相机对进行操作。LCAMV通过分析模型和逐像素补偿投影仪和相机中的LCA,然后利用泊松-高斯噪声模型和最小方差估计自适应地融合多通道相位数据。与现有需要额外硬件或多次曝光的方法不同,LCAMV实现了快速采集。在平面和非平面彩色表面的实验表明,LCAMV在减少深度误差方面优于灰度转换和传统通道加权,深度误差降低幅度可达43.6%。这些结果确立了LCAMV作为高精度非均匀彩色物体3D重建的有效解决方案。
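The minimum-variance fusion step is, in essence, the textbook inverse-variance weighted estimator: each RGB channel's phase estimate is weighted by 1/σ², so noisier channels contribute less. A sketch of that step (in the paper the per-channel variances come from the Poisson-Gaussian noise model; here they are given directly):

```python
import numpy as np

def min_variance_fuse(phases, variances):
    """Inverse-variance weighted fusion: weights w_c ∝ 1/σ_c², normalized
    to sum to one, minimize the variance of the fused phase estimate."""
    w = 1.0 / np.asarray(variances, dtype=float)
    w = w / w.sum()
    fused = float(np.dot(w, phases))
    return fused, w

# Three channel estimates of the same phase with unequal noise levels.
fused, w = min_variance_fuse(phases=[1.00, 1.02, 0.95],
                             variances=[0.01, 0.04, 0.09])
```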
cs.CV / 29 / 2603.10463

Learning to Wander: Improving the Global Image Geolocation Ability of LMMs via Actionable Reasoning

学习漫游:通过可操作推理提升大型多模态模型的全球图像地理定位能力
Zheng, Yushuo, Duan, Huiyu, Zhang, Zicheng, Liu, Xiaohong, Min, Xiongkuo
Abstract
Geolocation, the task of identifying the geographic location of an image, requires abundant world knowledge and complex reasoning abilities. Though advanced large multimodal models (LMMs) have demonstrated the aforementioned capabilities, their performance on the geolocation task remains unexplored. To this end, we introduce \textbf{WanderBench}, the first open-access global geolocation benchmark designed for actionable geolocation reasoning in embodied scenarios. WanderBench contains over 32K panoramas across six continents, organized as navigable graphs that enable physical actions such as rotation and movement, transforming geolocation from static recognition into interactive exploration. Building on this foundation, we propose \textbf{GeoAoT}, a \underline{Geo}location framework with \underline{A}ction \underline{o}f \underline{T}hought, which couples reasoning with embodied actions. Instead of generating textual reasoning chains, GeoAoT produces actionable plans, such as approaching landmarks or adjusting viewpoints, to actively reduce uncertainty. We further establish an evaluation protocol that jointly measures geolocation accuracy and difficulty-aware geolocation questioning ability. Experiments on 19 large multimodal models show that GeoAoT achieves superior fine-grained localization and stronger generalization in dynamic environments. WanderBench and GeoAoT define a new paradigm for actionable, reasoning-driven geolocation in embodied visual understanding.
Chinese Translation
地理定位是识别图像地理位置的任务,要求具备丰富的世界知识和复杂的推理能力。尽管先进的大型多模态模型(LMMs)在上述能力上表现优越,但它们在地理定位任务上的表现仍未得到充分探索。为此,我们引入了WanderBench,这是首个开放获取的全球地理定位基准,旨在支持在具身场景中的可操作地理定位推理。WanderBench包含超过32,000个全景图,覆盖六大洲,组织为可导航的图结构,能够实现旋转和移动等物理动作,将地理定位从静态识别转变为互动探索。在此基础上,我们提出了GeoAoT(Action of Thought),这是一个将推理与具身动作相结合的地理定位框架。GeoAoT不是生成文本推理链,而是产生可操作的计划,例如接近地标或调整视角,以主动减少不确定性。我们进一步建立了一种评估协议,联合测量地理定位的准确性和难度感知的地理定位提问能力。在19个大型多模态模型上的实验表明,GeoAoT在细粒度定位和动态环境中的泛化能力方面表现优越。WanderBench和GeoAoT为具身视觉理解中的可操作、推理驱动的地理定位定义了一个新范式。
cs.CV / 30 / 2603.10466

UniPINN: A Unified PINN Framework for Multi-task Learning of Diverse Navier-Stokes Equations

UniPINN:一种统一的PINN框架用于多任务学习多样化的Navier-Stokes方程
Sun, Dengdi, Chen, Jie, Wang, Xiao, Tang, Jin
Abstract
Physics-Informed Neural Networks (PINNs) have shown promise in solving incompressible Navier-Stokes equations, yet existing approaches are predominantly designed for single-flow settings. When extended to multi-flow scenarios, these methods face three key challenges: (1) difficulty in simultaneously capturing both shared physical principles and flow-specific characteristics, (2) susceptibility to inter-task negative transfer that degrades prediction accuracy, and (3) unstable training dynamics caused by disparate loss magnitudes across heterogeneous flow regimes. To address these limitations, we propose UniPINN, a unified multi-flow PINN framework that integrates three complementary components: a shared-specialized architecture that disentangles universal physical laws from flow-specific features, a cross-flow attention mechanism that selectively reinforces relevant patterns while suppressing task-irrelevant interference, and a dynamic weight allocation strategy that adaptively balances loss contributions to stabilize multi-objective optimization. Extensive experiments on three canonical flows demonstrate that UniPINN effectively unifies multi-flow learning, achieving superior prediction accuracy and balanced performance across heterogeneous regimes while successfully mitigating negative transfer. The source code of this paper will be released on https://github.com/Event-AHU/OpenFusion
Chinese Translation
物理信息神经网络(PINNs)在求解不可压Navier-Stokes方程方面展现了良好的前景,但现有方法主要针对单流场设置。当扩展到多流场场景时,这些方法面临三个主要挑战:(1)同时捕捉共享物理原理和流场特定特征的困难,(2)易受任务间负迁移的影响,从而降低预测准确性,以及(3)由于异质流场之间损失幅度差异导致的不稳定训练动态。为了解决这些局限性,我们提出了UniPINN,一种统一的多流PINN框架,集成了三个互补组件:一个共享-专用架构,用于将普遍物理定律与流场特定特征分离;一个跨流注意机制,选择性地增强相关模式,同时抑制与任务无关的干扰;以及一个动态权重分配策略,自适应平衡损失贡献以稳定多目标优化。在三个经典流场上的广泛实验表明,UniPINN有效统一了多流学习,实现了在异质流场中优越的预测准确性和均衡性能,同时成功减轻了负迁移。本文的源代码将发布在 https://github.com/Event-AHU/OpenFusion
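The dynamic weight allocation idea, balancing disparate loss magnitudes across flow regimes, can be illustrated with a simple inverse-magnitude scheme. This is a stand-in for the paper's strategy, whose exact form the abstract does not specify:

```python
import numpy as np

def balance_losses(losses, eps=1e-8):
    """Weight each task loss by the inverse of its (detached) magnitude so
    every flow regime contributes comparably; weights are rescaled to
    average 1 (illustrative stand-in for the paper's strategy)."""
    losses = np.asarray(losses, dtype=float)
    w = 1.0 / (losses + eps)
    w *= len(losses) / w.sum()          # normalize so mean(w) == 1
    return w, float(np.dot(w, losses))  # weights and total weighted loss

# Losses spanning two orders of magnitude end up contributing equally.
w, total = balance_losses([1.0, 100.0, 0.5])
```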
cs.CV / 31 / 2603.10470

Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression

通过反事实对抗幻觉:基于扩散引导的扰动用于LVLM幻觉抑制
Dastmalchi, Hamidreza, An, Aijun, Cheraghian, Ali, Barzamini, Hamed
Abstract
While large vision-language models (LVLMs) achieve strong performance on multimodal tasks, they frequently generate hallucinations -- unfaithful outputs misaligned with the visual input. To address this issue, we introduce CIPHER (Counterfactual Image Perturbations for Hallucination Extraction and Removal), a training-free method that suppresses vision-induced hallucinations via lightweight feature-level correction. Unlike prior training-free approaches that primarily focus on text-induced hallucinations, CIPHER explicitly targets hallucinations arising from the visual modality. CIPHER operates in two phases. In the offline phase, we construct OHC-25K (Object-Hallucinated Counterfactuals, 25,000 samples), a counterfactual dataset consisting of diffusion-edited images that intentionally contradict the original ground-truth captions. We pair these edited images with the unchanged ground-truth captions and process them through an LVLM to extract hallucination-related representations. Contrasting these representations with those from authentic (image, caption) pairs reveals structured, systematic shifts spanning a low-rank subspace characterizing vision-induced hallucination. In the inference phase, CIPHER suppresses hallucinations by projecting intermediate hidden states away from this subspace. Experiments across multiple benchmarks show that CIPHER significantly reduces hallucination rates while preserving task performance, demonstrating the effectiveness of counterfactual visual perturbations for improving LVLM faithfulness. Code and additional materials are available at https://hamidreza-dastmalchi.github.io/cipher-cvpr2026/.
Chinese Translation
尽管大型视觉语言模型(LVLMs)在多模态任务上表现出色,但它们经常生成幻觉——与视觉输入不一致的不真实输出。为了解决这个问题,我们提出了CIPHER(反事实图像扰动用于幻觉提取和去除),这是一种无训练的方法,通过轻量级特征级校正来抑制视觉引发的幻觉。与之前主要关注文本引发幻觉的无训练方法不同,CIPHER明确针对源自视觉模态的幻觉。CIPHER分为两个阶段进行操作。在离线阶段,我们构建了OHC-25K(对象幻觉反事实,25,000个样本),这是一个包含故意与原始真实标题相矛盾的扩散编辑图像的反事实数据集。我们将这些编辑图像与未更改的真实标题配对,并通过LVLM处理它们以提取与幻觉相关的表示。将这些表示与真实(图像,标题)对的表示进行对比,揭示了跨越低秩子空间的结构化、系统性变化,这些变化特征化了视觉引发的幻觉。在推理阶段,CIPHER通过将中间隐藏状态投影远离该子空间来抑制幻觉。在多个基准测试中的实验表明,CIPHER显著降低了幻觉率,同时保持了任务性能,证明了反事实视觉扰动在提高LVLM可信性方面的有效性。代码和其他材料可在https://hamidreza-dastmalchi.github.io/cipher-cvpr2026/获取。
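The inference-phase correction is a standard orthogonal-subspace projection: given an orthonormal basis U for the hallucination subspace, intermediate hidden states are replaced by h − U Uᵀ h. A sketch with a random (hypothetical) basis in place of the one extracted from OHC-25K:

```python
import numpy as np

def project_out(h, U):
    """Remove the component of hidden states h (B, D) lying in the
    subspace spanned by the orthonormal columns of U (D, k):
    h' = h - (h U) Uᵀ."""
    return h - (h @ U) @ U.T

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(16, 3)))  # orthonormal basis, rank 3
h = rng.normal(size=(4, 16))                   # batch of hidden states
h_clean = project_out(h, U)
```

After projection the states have no component along the basis, and reapplying the projection changes nothing (idempotence).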
cs.CV / 32 / 2603.10484

StructDamage: A Large Scale Unified Crack and Surface Defect Dataset for Robust Structural Damage Detection

StructDamage:一个大规模统一的裂缝和表面缺陷数据集,用于稳健的结构损伤检测
Ijaz, Misbah, Khan, Saif Ur Rehman, Rehman, Abd Ur, Vollmer, Sebastian, Dengel, Andreas, Asim, Muhammad Nabeel
Abstract
Automated detection and classification of structural cracks and surface defects is a critical challenge in civil engineering, infrastructure maintenance, and heritage preservation. Recent advances in Computer Vision (CV) and Deep Learning (DL) have significantly improved automatic crack detection. However, these methods rely heavily on large, diverse, and carefully curated datasets that include various crack types across different surface materials. Many existing public crack datasets lack geographic diversity, surface types, scale, and labeling consistency, making it challenging for trained algorithms to generalize effectively in real world conditions. We provide a novel dataset, StructDamage, a curated collection of approximately 78,093 images spanning nine surface types: walls, tile, stone, road, pavement, deck, concrete, and brick. The dataset was constructed by systematically aggregating, harmonizing, and reannotating images from 32 publicly available datasets covering concrete structures, asphalt pavements, masonry walls, bridges, and historic buildings. All images are organized in a folder level classification hierarchy suitable for training Convolutional Neural Networks (CNNs) and Vision Transformers. To highlight the practical value of the dataset, we present baseline classification results using fifteen DL architectures from six model families, with twelve achieving macro F1-scores over 0.96. The best performing model DenseNet201 achieves 98.62% accuracy. The proposed dataset provides a comprehensive and versatile resource suitable for classification tasks. With thorough documentation and a standard structure, it is designed to promote reproducible research and support the development and fair evaluation of robust crack damage detection approaches.
Chinese Translation
自动检测和分类结构裂缝及表面缺陷是土木工程、基础设施维护和遗产保护中的一项关键挑战。计算机视觉(Computer Vision, CV)和深度学习(Deep Learning, DL)的最新进展显著提高了自动裂缝检测的效果。然而,这些方法在很大程度上依赖于包含不同表面材料上各种裂缝类型的大型、多样化且经过精心策划的数据集。许多现有的公共裂缝数据集缺乏地理多样性、表面类型、规模和标注一致性,使得训练好的算法在现实世界条件下有效泛化变得困难。我们提供了一个新颖的数据集StructDamage,这是一个经过整理的约78,093张图像的集合,涵盖九种表面类型:墙壁、瓷砖、石材、道路、铺路、甲板、混凝土和砖块。该数据集通过系统地聚合、协调和重新标注来自32个公共可用数据集的图像构建,这些数据集覆盖混凝土结构、沥青路面、砌体墙、桥梁和历史建筑。所有图像按照适合训练卷积神经网络(Convolutional Neural Networks, CNNs)和视觉变换器(Vision Transformers)的文件夹级分类层次结构进行组织。为了突出该数据集的实际价值,我们展示了使用六个模型家族中的十五种深度学习架构的基线分类结果,其中十二种模型的宏观F1分数超过0.96。表现最佳的模型DenseNet201达到了98.62%的准确率。所提出的数据集提供了一个全面且多功能的资源,适用于分类任务。通过详尽的文档和标准结构,它旨在促进可重复研究,并支持稳健裂缝损伤检测方法的发展和公平评估。
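The baseline results are reported as macro F1, the unweighted mean of per-class F1 scores, which puts rare defect classes on equal footing with common ones. For reference, the metric computed directly from label arrays:

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro F1: unweighted average of per-class F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))
```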
cs.CV / 33 / 2603.10487

Spatial self-supervised Peak Learning and correlation-based Evaluation of peak picking in Mass Spectrometry Imaging

空间自监督峰值学习及基于相关性的质谱成像峰值选择评估
Weigand, Philipp, Ebert, Nikolas, Mohammed, Shad A., Sammour, Denis Abu, Hopf, Carsten, Wasenmüller, Oliver
Abstract
Mass spectrometry imaging (MSI) enables label-free visualization of molecular distributions across tissue samples but generates large and complex datasets that require effective peak picking to reduce data size while preserving meaningful biological information. Existing peak picking approaches perform inconsistently across heterogeneous datasets, and their evaluation is often limited to synthetic data or manually selected ion images that do not fully represent real-world challenges in MSI. To address these limitations, we propose an autoencoder-based spatial self-supervised peak learning neural network that selects spatially structured peaks by learning an attention mask leveraging both spatial and spectral information. We further introduce an evaluation procedure based on expert-annotated segmentation masks, allowing a more representative and spatially grounded assessment of peak picking performance. We evaluate our approach on four diverse public MSI datasets using our proposed evaluation procedure. Our approach consistently outperforms state-of-the-art peak picking methods by selecting spatially structured peaks, thus demonstrating its efficacy. These results highlight the value of our spatial self-supervised network in comparison to contemporary state-of-the-art methods. The evaluation procedure can be readily applied to new MSI datasets, thereby providing a consistent and robust framework for the comparison of spatially structured peak picking methods across different datasets.
Chinese Translation
质谱成像(MSI)能够无标记地可视化组织样本中的分子分布,但生成的大规模复杂数据集需要有效的峰值选择,以减少数据大小并保留有意义的生物信息。现有的峰值选择方法在异质数据集上表现不一致,其评估通常仅限于合成数据或手动选择的离子图像,这些图像并未充分反映MSI中的现实挑战。为了解决这些局限性,我们提出了一种基于自编码器的空间自监督峰值学习神经网络,通过利用空间和光谱信息学习注意力掩码来选择空间结构化的峰值。我们进一步引入了一种基于专家注释的分割掩码的评估程序,允许对峰值选择性能进行更具代表性和空间基础的评估。我们在四个不同的公共MSI数据集上使用我们提出的评估程序对我们的方法进行了评估。我们的方法通过选择空间结构化的峰值,始终优于最先进的峰值选择方法,从而证明了其有效性。这些结果突显了我们的空间自监督网络相较于当代最先进方法的价值。该评估程序可以方便地应用于新的MSI数据集,从而为不同数据集间空间结构化峰值选择方法的比较提供了一种一致且稳健的框架。
cs.CV / 34 / 2603.10495

IMTBench: A Multi-Scenario Cross-Modal Collaborative Evaluation Benchmark for In-Image Machine Translation

IMTBench:一种多场景跨模态协作评估基准,用于图像内机器翻译
Lyu, Jiahao, Fu, Pei, Li, Zhenhang, Zeng, Weichao, Zhan, Shaojie, Yang, Jiahui, Ma, Can, Zhou, Yu, Luo, Zhenbo, Luan, Jian
Abstract
End-to-end In-Image Machine Translation (IIMT) aims to convert text embedded within an image into a target language while preserving the original visual context, layout, and rendering style. However, existing IIMT benchmarks are largely synthetic and thus fail to reflect real-world complexity, while current evaluation protocols focus on single-modality metrics and overlook cross-modal faithfulness between rendered text and model outputs. To address these shortcomings, we present In-image Machine Translation Benchmark (IMTBench), a new benchmark of 2,500 image translation samples covering four practical scenarios and nine languages. IMTBench supports multi-aspect evaluation, including translation quality, background preservation, overall image quality, and a cross-modal alignment score that measures consistency between the translated text produced by the model and the text rendered in the translated image. We benchmark strong commercial cascade systems, and both closed- and open-source unified multi-modal models, and observe large performance gaps across scenarios and languages, especially on natural scenes and resource-limited languages, highlighting substantial headroom for end-to-end image text translation. We hope IMTBench establishes a standardized benchmark to accelerate progress in this emerging task.
Chinese Translation
端到端图像内机器翻译(IIMT)旨在将嵌入图像中的文本转换为目标语言,同时保留原始的视觉上下文、布局和渲染风格。然而,现有的IIMT基准大多是合成的,因此无法反映现实世界的复杂性,而当前的评估协议则侧重于单一模态指标,忽视了渲染文本与模型输出之间的跨模态忠实度。为了解决这些不足,我们提出了图像内机器翻译基准(IMTBench),这是一个包含2500个图像翻译样本的新基准,涵盖四种实际场景和九种语言。IMTBench支持多方面评估,包括翻译质量、背景保留、整体图像质量,以及一个跨模态对齐评分,用于衡量模型生成的翻译文本与翻译图像中渲染文本之间的一致性。我们对强大的商业级级联系统以及闭源和开源的统一多模态模型进行了基准测试,观察到不同场景和语言之间存在较大的性能差距,尤其是在自然场景和资源有限的语言上,突显了端到端图像文本翻译的巨大提升空间。我们希望IMTBench能够建立一个标准化的基准,以加速这一新兴任务的进展。
cs.CV / 35 / 2603.10517

UHD Image Deblurring via Autoregressive Flow with Ill-conditioned Constraints

通过带有病态约束的自回归流实现超高清图像去模糊
Xin, Yucheng, Zhao, Dawei, Chen, Xiang, Wu, Chen, Wang, Pu, Lu, Dianjie, Zhang, Guijuan, Jia, Xiuyi, Zheng, Zhuoran
Abstract
Ultra-high-definition (UHD) image deblurring poses significant challenges for UHD restoration methods, which must balance fine-grained detail recovery and practical inference efficiency. Although prominent discriminative and generative methods have achieved remarkable results, a trade-off persists between computational cost and the ability to generate fine-grained detail for UHD image deblurring tasks. To further alleviate these issues, we propose a novel autoregressive flow method for UHD image deblurring with an ill-conditioned constraint. Our core idea is to decompose UHD restoration into a progressive, coarse-to-fine process: at each scale, the sharp estimate is formed by upsampling the previous-scale result and adding a current-scale residual, enabling stable, stage-wise refinement from low to high resolution. We further introduce Flow Matching to model residual generation as a conditional vector field and perform few-step ODE sampling with efficient Euler/Heun solvers, enriching details while keeping inference affordable. Since multi-step generation at UHD can be numerically unstable, we propose an ill-conditioning suppression scheme by imposing condition-number regularization on a feature-induced attention matrix, improving convergence and cross-scale consistency. Our method demonstrates promising performance on blurred images at 4K (3840$\times$2160) or higher resolutions.
Chinese Translation
超高清(UHD)图像去模糊对UHD恢复方法提出了重大挑战,这些方法必须在细粒度细节恢复和实际推理效率之间取得平衡。尽管显著的判别性和生成性方法已取得了显著成果,但在计算成本与生成UHD图像去模糊任务的细粒度细节能力之间仍然存在权衡。为进一步缓解这些问题,我们提出了一种新颖的自回归流方法,用于带有病态约束的UHD图像去模糊。我们的核心思想是将UHD恢复分解为一个逐步的粗到细的过程:在每个尺度上,通过对前一尺度结果进行上采样并添加当前尺度的残差来形成清晰的估计,从而实现从低到高分辨率的稳定阶段性细化。我们进一步引入流匹配(Flow Matching)将残差生成建模为条件向量场,并使用高效的欧拉/海因(Euler/Heun)求解器进行少步ODE采样,在保持推理可承受的同时丰富细节。由于在UHD下的多步生成可能在数值上不稳定,我们提出了一种病态抑制方案,通过对特征诱导的注意力矩阵施加条件数正则化,改善收敛性和跨尺度一致性。我们的方法在4K(3840×2160)或更高分辨率的模糊图像上展示了良好的性能。
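The few-step sampling the abstract describes amounts to integrating the learned ODE dx/dt = v(x, t) from t = 0 to t = 1 with a handful of Euler steps (Heun adds one corrector evaluation per step). A generic sampler sketch, where `vector_field` stands in for the trained residual-generation network:

```python
import numpy as np

def euler_sample(x0, vector_field, n_steps=8):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with n_steps explicit
    Euler steps (generic few-step ODE sampler sketch)."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * vector_field(x, k * dt)
    return x

# For an exact constant field v = target - start, Euler lands on target.
start, target = np.zeros(4), np.ones(4)
out = euler_sample(start, lambda x, t: target - start, n_steps=8)
```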
cs.CV / 36 / 2603.10519

Visually-Guided Controllable Medical Image Generation via Fine-Grained Semantic Disentanglement

通过细粒度语义解耦实现视觉引导的可控医学图像生成
Huang, Xin, Liang, Junjie, Hou, Qingshan, Cao, Peng, Yang, Jinzhu, Liu, Xiaoli, Zaiane, Osmar R.
Abstract
Medical image synthesis is crucial for alleviating data scarcity and privacy constraints. However, fine-tuning general text-to-image (T2I) models remains challenging, mainly due to the significant modality gap between complex visual details and abstract clinical text. In addition, semantic entanglement persists, where coarse-grained text embeddings blur the boundary between anatomical structures and imaging styles, thus weakening controllability during generation. To address this, we propose a Visually-Guided Text Disentanglement framework. We introduce a cross-modal latent alignment mechanism that leverages visual priors to explicitly disentangle unstructured text into independent semantic representations. Subsequently, a Hybrid Feature Fusion Module (HFFM) injects these features into a Diffusion Transformer (DiT) through separated channels, enabling fine-grained structural control. Experimental results in three datasets demonstrate that our method outperforms existing approaches in terms of generation quality and significantly improves performance on downstream classification tasks. The source code is available at https://github.com/hx111/VG-MedGen.
Chinese Translation
医学图像合成对于缓解数据稀缺和隐私限制至关重要。然而,微调通用的文本到图像(T2I)模型仍然面临挑战,主要是由于复杂视觉细节与抽象临床文本之间存在显著的模态差距。此外,语义纠缠依然存在,粗粒度的文本嵌入模糊了解剖结构与成像风格之间的边界,从而削弱了生成过程中的可控性。为了解决这一问题,我们提出了一种视觉引导的文本解耦框架。我们引入了一种跨模态潜在对齐机制,利用视觉先验明确地将非结构化文本解耦为独立的语义表示。随后,混合特征融合模块(Hybrid Feature Fusion Module, HFFM)通过分离通道将这些特征注入扩散变换器(Diffusion Transformer, DiT),实现细粒度的结构控制。在三个数据集上的实验结果表明,我们的方法在生成质量方面优于现有方法,并显著提高了下游分类任务的性能。源代码可在 https://github.com/hx111/VG-MedGen 获取。
cs.CV / 37 / 2603.10526

Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in Whole-Slide Image Prognosis

基于超网络的稀疏任务向量混合以实现全幻灯片图像预后中的高效知识转移
Liu, Pei, Zeng, Xiangxiang, Ma, Tengfei, Xing, Yucheng, Ren, Xuanbai, Liu, Yiping
Abstract
Whole-Slide Images (WSIs) are widely used for estimating the prognosis of cancer patients. Current studies generally follow a cancer-specific learning paradigm. However, the available training samples for one cancer type are usually scarce in pathology. Consequently, the model often struggles to learn generalizable knowledge, thus performing worse on the tumor samples with inherent high heterogeneity. Although multi-cancer joint learning and knowledge transfer approaches have been explored recently to address it, they either rely on large-scale joint training or extensive inference across multiple models, posing new challenges in computational efficiency. To this end, this paper proposes a new scheme, Sparse Task Vector Mixup with Hypernetworks (STEPH). Unlike previous ones, it efficiently absorbs generalizable knowledge from other cancers for the target via model merging: i) applying task vector mixup to each source-target pair and then ii) sparsely aggregating task vector mixtures to obtain an improved target model, driven by hypernetworks. Extensive experiments on 13 cancer datasets show that STEPH improves over cancer-specific learning and an existing knowledge transfer baseline by 5.14% and 2.01%, respectively. Moreover, it is a more efficient solution for learning prognostic knowledge from other cancers, without requiring large-scale joint training or extensive multi-model inference. Code is publicly available at https://github.com/liupei101/STEPH.
Chinese Translation
全幻灯片图像(WSIs)广泛用于评估癌症患者的预后。目前的研究通常遵循特定癌症的学习范式。然而,在病理学中,某一癌症类型的可用训练样本通常稀缺。因此,模型往往难以学习可推广的知识,从而在固有高度异质性的肿瘤样本上表现较差。尽管最近已经探索了多癌症联合学习和知识转移的方法来解决这一问题,但它们要么依赖于大规模的联合训练,要么需要在多个模型之间进行广泛推理,这在计算效率上带来了新的挑战。为此,本文提出了一种新方案——基于超网络的稀疏任务向量混合(STEPH)。与以往方法不同,它通过模型合并高效地吸收其他癌症的可推广知识:i)对每个源-目标对应用任务向量混合,然后 ii)稀疏聚合任务向量混合以获得改进的目标模型,驱动于超网络。在13个癌症数据集上的广泛实验表明,STEPH在癌症特定学习和现有知识转移基线的基础上分别提高了5.14%和2.01%。此外,它是一种更高效的解决方案,可以从其他癌症中学习预后知识,而无需大规模的联合训练或广泛的多模型推理。代码已公开发布在 https://github.com/liupei101/STEPH。
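The merge operates on task vectors, i.e., fine-tuned minus pretrained weights. A minimal sketch of the mixup-then-aggregate idea (the mixing and aggregation coefficients are hypothetical scalars here; in the paper they are produced by hypernetworks, with sparsity imposed on the aggregation):

```python
import numpy as np

def mixup_merge(theta_pre, tv_target, tv_sources, alphas, betas):
    """Mix the target task vector with each source task vector (alpha),
    then aggregate the mixtures (beta) onto the pretrained weights."""
    mixes = [a * tv_s + (1.0 - a) * tv_target
             for a, tv_s in zip(alphas, tv_sources)]
    return theta_pre + sum(b * m for b, m in zip(betas, mixes))

rng = np.random.default_rng(0)
theta_pre = rng.normal(size=8)                 # pretrained weights
tv_target = rng.normal(size=8)                 # target-cancer task vector
tv_sources = [rng.normal(size=8) for _ in range(2)]
merged = mixup_merge(theta_pre, tv_target, tv_sources,
                     alphas=[0.3, 0.2], betas=[0.6, 0.4])
```

With alpha = 0 and a single beta = 1, the merge reduces to the target model's own fine-tuned weights, a useful sanity check.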
cs.CV / 38 / 2603.10538

DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime

DSFlash:实时全景场景图生成的综合方法
Lorenz, Julian, Kovganko, Vladyslav, Kohout, Elias, Phatak, Mrunmai, Kienzle, Daniel, Lienhart, Rainer
Abstract
Scene Graph Generation (SGG) aims to extract a detailed graph structure from an image, a representation that holds significant promise as a robust intermediate step for complex downstream tasks like reasoning for embodied agents. However, practical deployment in real-world applications - especially on resource constrained edge devices - requires speed and resource efficiency, challenges that have received limited attention in existing research. To bridge this gap, we introduce DSFlash, a low-latency model for panoptic scene graph generation designed to overcome these limitations. DSFlash can process a video stream at 56 frames per second on a standard RTX 3090 GPU, without compromising performance against existing state-of-the-art methods. Crucially, unlike prior approaches that often restrict themselves to salient relationships, DSFlash computes comprehensive scene graphs, offering richer contextual information while maintaining its superior latency. Furthermore, DSFlash is light on resources, requiring less than 24 hours to train on a single, nine-year-old GTX 1080 GPU. This accessibility makes DSFlash particularly well-suited for researchers and practitioners operating with limited computational resources, empowering them to adapt and fine-tune SGG models for specialized applications.
Chinese Translation
场景图生成(SGG)旨在从图像中提取详细的图结构,这种表示在作为复杂下游任务(如为具身代理进行推理)的稳健中间步骤方面具有重要前景。然而,在实际应用中的部署,特别是在资源受限的边缘设备上,要求速度和资源效率,这些挑战在现有研究中受到的关注有限。为了解决这一问题,我们提出了DSFlash,这是一种低延迟的全景场景图生成模型,旨在克服这些局限性。DSFlash能够在标准的RTX 3090 GPU上以每秒56帧的速度处理视频流,同时性能不逊于现有最先进的方法。至关重要的是,与以往通常仅限于显著关系的方法不同,DSFlash计算全面的场景图,提供更丰富的上下文信息,同时保持其卓越的低延迟。此外,DSFlash对资源的需求较低,在一台九年前的GTX 1080 GPU上训练所需时间不到24小时。这种可及性使得DSFlash特别适合于在计算资源有限的情况下工作的研究人员和从业者,使他们能够为特定应用调整和微调SGG模型。
cs.CV / 39 / 2603.10541

Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation

人性化提示:评估基础模型在肌肉骨骼CT分割中的模型敏感性
Magg, Caroline, ter Wee, Maaike A., Dobbe, Johannes G. G., Streekstra, Geert J., Blankevoort, Leendert, Sánchez, Clara I., Kervadec, Hoel
Abstract
Promptable Foundation Models (FMs), initially introduced for natural image segmentation, have also revolutionized medical image segmentation. The increasing number of models, along with evaluations varying in datasets, metrics, and compared models, makes direct performance comparison between models difficult and complicates the selection of the most suitable model for specific clinical tasks. In our study, 11 promptable FMs are tested using non-iterative 2D and 3D prompting strategies on a private and public dataset focusing on bone and implant segmentation in four anatomical regions (wrist, shoulder, hip and lower leg). The Pareto-optimal models are identified and further analyzed using human prompts collected through a dedicated observer study. Our findings are: 1) The segmentation performance varies a lot between FMs and prompting strategies; 2) The Pareto-optimal models in 2D are SAM and SAM2.1, in 3D nnInteractive and Med-SAM2; 3) Localization accuracy and rater consistency vary with anatomical structures, with higher consistency for simple structures (wrist bones) and lower consistency for complex structures (pelvis, tibia, implants); 4) The segmentation performance drops using human prompts, suggesting that performance reported on "ideal" prompts extracted from reference labels might overestimate the performance in a human-driven setting; 5) All models were sensitive to prompt variations. While two models demonstrated intra-rater robustness, it did not scale to inter-rater settings. We conclude that the selection of the most optimal FM for a human-driven setting remains challenging, with even high-performing FMs being sensitive to variations in human input prompts. Our code base for prompt extraction and model inference is available: https://github.com/CarolineMagg/segmentation-FM-benchmark/
Chinese Translation
可提示基础模型(FMs)最初是为自然图像分割而引入的,但也彻底改变了医学图像分割。模型数量的增加以及在数据集、指标和比较模型上的评估差异,使得模型之间的直接性能比较变得困难,并使得选择最适合特定临床任务的模型变得复杂。在我们的研究中,测试了11个可提示的FMs,使用非迭代的2D和3D提示策略,在一个专有和公共数据集上,重点关注四个解剖区域(手腕、肩部、髋部和小腿)的骨骼和植入物分割。识别出帕累托最优模型,并通过专门的观察者研究收集的人类提示进行了进一步分析。我们的发现包括:1)FMs和提示策略之间的分割性能差异很大;2)2D中的帕累托最优模型是SAM和SAM2.1,3D中的帕累托最优模型是nnInteractive和Med-SAM2;3)定位精度和评估者一致性因解剖结构而异,简单结构(手腕骨)的一致性较高,而复杂结构(骨盆、胫骨、植入物)的一致性较低;4)使用人类提示时,分割性能下降,这表明在从参考标签提取的“理想”提示上报告的性能可能会高估在以人为驱动的环境中的性能;5)所有模型对提示变化都敏感。虽然两个模型表现出评估者内部的一致性,但这种一致性并未扩展到评估者之间的设置。我们得出结论,选择最适合人类驱动环境的最优FM仍然具有挑战性,即使是高性能的FMs也对人类输入提示的变化敏感。我们的提示提取和模型推断的代码库可用: https://github.com/CarolineMagg/segmentation-FM-benchmark/
cs.CV / 40 / 2603.10549

Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues

基于视觉-文本提示的主动红外热成像认知缺陷分析
Salah, Mohammed, Ouda, Eman, Dell'Avvocato, Giuseppe, Sarasini, Fabrizio, D'Accardi, Ester, Dias, Jorge, Svetinovic, Davor, Sfarra, Stefano, Abdulrahman, Yusra
Abstract
Active infrared thermography (AIRT) is currently witnessing a surge of artificial intelligence (AI) methodologies being deployed for automated subsurface defect analysis of high performance carbon fiber-reinforced polymers (CFRP). Deploying AI-based AIRT methodologies for inspecting CFRPs requires the creation of time consuming and expensive datasets of CFRP inspection sequences to train neural networks. To address this challenge, this work introduces a novel language-guided framework for cognitive defect analysis in CFRPs using AIRT and vision-language models (VLMs). Unlike conventional learning-based approaches, the proposed framework does not require developing training datasets for extensive training of defect detectors, instead it relies solely on pretrained multimodal VLM encoders coupled with a lightweight adapter to enable generative zero-shot understanding and localization of subsurface defects. By leveraging pretrained multimodal encoders, the proposed system enables generative zero-shot understanding of thermographic patterns and automatic detection of subsurface defects. Given the domain gap between thermographic data and natural images used to train VLMs, an AIRT-VLM Adapter is proposed to enhance the visibility of defects while aligning the thermographic domain with the learned representations of VLMs. The proposed framework is validated using three representative VLMs; specifically, GroundingDINO, Qwen-VL-Chat, and CogVLM. Validation is performed on 25 CFRP inspection sequences with impacts introduced at different energy levels, reflecting realistic defects encountered in industrial scenarios. Experimental results demonstrate that the AIRT-VLM adapter achieves signal-to-noise ratio (SNR) gains exceeding 10 dB compared with conventional thermographic dimensionality-reduction methods, while enabling zero-shot defect detection with intersection-over-union values reaching 70%.
Chinese Translation
主动红外热成像(AIRT)目前正迎来人工智能(AI)方法在高性能碳纤维增强聚合物(CFRP)表面下缺陷自动分析中的广泛应用。采用基于AI的AIRT方法对CFRP进行检测需要创建耗时且昂贵的CFRP检测序列数据集,以训练神经网络。为了解决这一挑战,本研究提出了一种新颖的语言引导框架,用于利用AIRT和视觉-语言模型(VLMs)对CFRP进行认知缺陷分析。与传统的基于学习的方法不同,所提出的框架不需要开发训练数据集来对缺陷检测器进行广泛训练,而是仅依赖于预训练的多模态VLM编码器,并结合轻量级适配器,以实现生成性零样本理解和表面下缺陷的定位。通过利用预训练的多模态编码器,所提出的系统能够实现热成像模式的生成性零样本理解和表面下缺陷的自动检测。鉴于热成像数据与用于训练VLM的自然图像之间的领域差距,提出了一种AIRT-VLM适配器,以增强缺陷的可见性,同时将热成像领域与VLM的学习表示对齐。所提出的框架通过三种代表性的VLM进行验证;具体而言,GroundingDINO、Qwen-VL-Chat和CogVLM。验证是在25个CFRP检测序列上进行的,这些序列在不同能量水平下引入了冲击,反映了工业场景中遇到的真实缺陷。实验结果表明,AIRT-VLM适配器在信噪比(SNR)方面的增益超过10 dB,相较于传统的热成像降维方法,同时实现了零样本缺陷检测,交并比值达到70%。
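The adapter's headline number is an SNR gain above 10 dB. As a point of reference, here is a minimal contrast-based SNR-in-dB computation on toy data (the abstract does not give the paper's exact SNR definition for thermograms, so this formula is only illustrative):

```python
import math
from statistics import fmean, pstdev

def snr_db(defect, background):
    """Contrast-based SNR in decibels: mean defect-background contrast
    divided by the background's standard deviation."""
    contrast = abs(fmean(defect) - fmean(background))
    return 20.0 * math.log10(contrast / pstdev(background))

# Toy check: a contrast of 10 over a unit-std background gives 20 dB,
# so a >10 dB gain corresponds to raising this ratio by a factor > ~3.16.
gain = snr_db([11.0, 11.0], [0.0, 2.0])
```

On this scale, every additional 20 dB corresponds to a tenfold increase in the contrast-to-noise ratio.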
cs.CV / 41 / 2603.10551

P-GSVC: Layered Progressive 2D Gaussian Splatting for Scalable Image and Video

P-GSVC:用于可扩展图像和视频的分层渐进式二维高斯点云
Wang, Longan, Shi, Yuang, Ooi, Wei Tsang
Abstract
Gaussian splatting has emerged as a competitive explicit representation for image and video reconstruction. In this work, we present P-GSVC, the first layered progressive 2D Gaussian splatting framework that provides a unified solution for scalable Gaussian representation in both images and videos. P-GSVC organizes 2D Gaussian splats into a base layer and successive enhancement layers, enabling coarse-to-fine reconstructions. To effectively optimize this layered representation, we propose a joint training strategy that simultaneously updates Gaussians across layers, aligning their optimization trajectories to ensure inter-layer compatibility and a stable progressive reconstruction. P-GSVC supports scalability in terms of both quality and resolution. Our experiments show that the joint training strategy gains up to a 1.9 dB improvement in PSNR for video and a 2.6 dB improvement for images when compared to methods that perform sequential layer-wise training. Project page: https://longanwang-cs.github.io/PGSVC-webpage/
Chinese Translation
高斯点云已成为图像和视频重建的一种具有竞争力的显式表示。在本研究中,我们提出了P-GSVC,这是第一个分层渐进式二维高斯点云框架,为图像和视频中的可扩展高斯表示提供了统一的解决方案。P-GSVC将二维高斯点云组织为基础层和后续的增强层,能够实现由粗到细的重建。为了有效优化这种分层表示,我们提出了一种联合训练策略,该策略同时更新各层中的高斯,调整它们的优化轨迹,以确保层间兼容性和稳定的渐进重建。P-GSVC在质量和分辨率方面都支持可扩展性。我们的实验表明,与执行顺序层训练的方法相比,联合训练策略在视频中可获得高达1.9 dB的PSNR提升,在图像中可获得2.6 dB的PSNR提升。项目页面:https://longanwang-cs.github.io/PGSVC-webpage/
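The base-plus-enhancement layering described in the abstract can be pictured as additive 2D Gaussian rendering, where a coarse base layer is refined by narrower splats (a toy illustration only, not the paper's optimized representation):

```python
import math

def splat(canvas, cx, cy, sigma, weight):
    """Additively render one isotropic 2D Gaussian onto an H x W canvas."""
    for y, row in enumerate(canvas):
        for x in range(len(row)):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            row[x] += weight * math.exp(-d2 / (2.0 * sigma ** 2))

# Base layer: one broad Gaussian gives the coarse reconstruction; an
# enhancement layer then adds narrow Gaussians on top for fine detail.
H = W = 16
base = [[0.0] * W for _ in range(H)]
splat(base, 8, 8, sigma=4.0, weight=1.0)
refined = [row[:] for row in base]
splat(refined, 5, 5, sigma=1.0, weight=0.5)
splat(refined, 11, 11, sigma=1.0, weight=0.5)
```

Because each layer only adds to the previous one, truncating the layer stream at any point still yields a valid (coarser) reconstruction, which is what makes the representation progressive.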
cs.CV / 42 / 2603.10560

PET-F2I: A Comprehensive Benchmark and Parameter-Efficient Fine-Tuning of LLMs for PET/CT Report Impression Generation

PET-F2I:用于PET/CT报告印象生成的全面基准和参数高效微调的LLMs
Liu, Yuchen, Zhang, Wenbo, Peng, Liling, Zhang, Yichi, Fu, Yu, Guo, Xin, Qu, Chao, Qi, Yuan, Xue, Le
Abstract
PET/CT imaging is pivotal in oncology and nuclear medicine, yet summarizing complex findings into precise diagnostic impressions is labor-intensive. While LLMs have shown promise in medical text generation, their capability in the highly specialized domain of PET/CT remains underexplored. We introduce PET-F2I-41K (PET Findings-to-Impression Benchmark), a large-scale benchmark for PET/CT impression generation using LLMs, constructed from over 41k real-world reports. Using PET-F2I-41K, we conduct a comprehensive evaluation of 27 models across proprietary frontier LLMs, open-source generalist models, and medical-domain LLMs, and we develop a domain-adapted 7B model (PET-F2I-7B) fine-tuned from Qwen2.5-7B-Instruct via LoRA. Beyond standard NLG metrics (e.g., BLEU-4, ROUGE-L, BERTScore), we propose three clinically grounded metrics - Entity Coverage Rate (ECR), Uncovered Entity Rate (UER), and Factual Consistency Rate (FCR) - to assess diagnostic completeness and factual reliability. Experiments reveal that neither frontier nor medical-domain LLMs perform adequately in zero-shot settings. In contrast, PET-F2I-7B achieves substantial gains (e.g., 0.708 BLEU-4) and a 3.0x improvement in entity coverage over the strongest baseline, while offering advantages in cost, latency, and privacy. Beyond this modeling contribution, PET-F2I-41K establishes a standardized evaluation framework to accelerate the development of reliable and clinically deployable reporting systems for PET/CT.
Chinese Translation
PET/CT成像在肿瘤学和核医学中至关重要,但将复杂的发现总结为精确的诊断印象是一项劳动密集型的工作。尽管大型语言模型(LLMs)在医学文本生成中显示出潜力,但它们在高度专业化的PET/CT领域的能力仍未得到充分探索。我们介绍了PET-F2I-41K(PET发现到印象基准),这是一个基于超过41,000份真实报告构建的大规模PET/CT印象生成基准。利用PET-F2I-41K,我们对27个模型进行了全面评估,这些模型包括专有前沿LLMs、开源通用模型和医学领域LLMs,并开发了一个经过领域适应的7B模型(PET-F2I-7B),该模型通过LoRA从Qwen2.5-7B-Instruct微调而来。除了标准的自然语言生成(NLG)指标(如BLEU-4、ROUGE-L、BERTScore)外,我们还提出了三个临床基础的指标——实体覆盖率(Entity Coverage Rate, ECR)、未覆盖实体率(Uncovered Entity Rate, UER)和事实一致性率(Factual Consistency Rate, FCR),以评估诊断的完整性和事实的可靠性。实验结果表明,无论是前沿LLMs还是医学领域LLMs在零样本设置下的表现均不理想。相比之下,PET-F2I-7B实现了显著提升(例如,0.708 BLEU-4),并在实体覆盖率上较最强基线提高了3.0倍,且在成本、延迟和隐私方面具有优势。除了这一建模贡献外,PET-F2I-41K还建立了一个标准化的评估框架,以加速开发可靠且可临床应用的PET/CT报告系统。
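The proposed Entity Coverage Rate and Uncovered Entity Rate can be illustrated with a naive substring-matching version (the paper's actual entity extraction and matching rules are presumably more careful; entity strings below are invented):

```python
def coverage_metrics(reference_entities, generated_impression):
    """Illustrative ECR/UER: the share of reference-report entities that do /
    do not appear in the generated impression text."""
    text = generated_impression.lower()
    covered = [e for e in reference_entities if e.lower() in text]
    ecr = len(covered) / len(reference_entities)
    return ecr, 1.0 - ecr

# Two of three reference entities are mentioned -> ECR = 2/3, UER = 1/3.
ecr, uer = coverage_metrics(
    ["hypermetabolic lesion", "right lung", "lymph node"],
    "Hypermetabolic lesion in the right lung.",
)
```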
cs.CV / 43 / 2603.10568

UniStitch: Unifying Semantic and Geometric Features for Image Stitching

UniStitch:统一语义与几何特征的图像拼接
Mei, Yuan, Nie, Lang, Liao, Kang, Xu, Yunqiu, Lin, Chunyu, Xiao, Bin
Abstract
Traditional image stitching methods estimate warps from hand-crafted geometric features, whereas recent learning-based solutions leverage semantic features from neural networks instead. These two lines of research have largely diverged along separate evolutionary paths, with virtually no meaningful convergence to date. In this paper, we take a pioneering step to bridge this gap by unifying semantic and geometric features with UniStitch, a unified image stitching framework built on multimodal features. To align discrete geometric features (i.e., keypoints) with continuous semantic feature maps, we present a Neural Point Transformer (NPT) module, which transforms unordered, sparse 1D geometric keypoints into ordered, dense 2D semantic maps. Then, to integrate the advantages of both representations, an Adaptive Mixture of Experts (AMoE) module is designed to fuse geometric and semantic representations. It dynamically shifts focus toward more reliable features during the fusion process, allowing the model to handle complex scenes, especially when either modality might be compromised. The fused representation can be adopted into common deep stitching pipelines, delivering significant performance gains over any single feature. Experiments show that UniStitch outperforms existing state-of-the-art methods by a large margin, paving the way for a unified paradigm between traditional and learning-based image stitching.
Chinese Translation
传统的图像拼接方法通过手工设计的几何特征来估计变形,而最近的基于学习的解决方案则利用神经网络中的语义特征。这两条研究路线在很大程度上沿着各自的演变方向分化,迄今为止几乎没有实质性的交集。本文通过提出UniStitch,一个基于多模态特征的统一图像拼接框架,迈出了弥合这一差距的开创性一步。为了将离散的几何特征(即关键点)与连续的语义特征图对齐,我们提出了一种神经点变换器(Neural Point Transformer, NPT)模块,该模块将无序、稀疏的1D几何关键点转换为有序、密集的2D语义图。然后,为了整合两种表示的优势,设计了自适应专家混合(Adaptive Mixture of Experts, AMoE)模块,以融合几何和语义表示。在融合过程中,它动态地将焦点转向更可靠的特征,使模型能够处理复杂场景,特别是在任一模态可能受到影响时。融合后的表示可以被应用于常见的深度拼接管道,显著提升性能,超越任何单一特征。实验表明,UniStitch在性能上大幅超越现有的最先进方法,为传统与基于学习的图像拼接之间的统一范式铺平了道路。
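The NPT's job of turning unordered, sparse keypoints into an ordered, dense 2D map can be pictured with a non-learned rasterization (purely illustrative; the actual module is a learned transformer, and the coordinates below are made up):

```python
def keypoints_to_map(keypoints, h, w):
    """Scatter unordered (x, y, score) keypoints (normalized coords in [0, 1))
    into an ordered, dense h x w grid -- a crude, non-learned stand-in for the
    role the Neural Point Transformer plays."""
    grid = [[0.0] * w for _ in range(h)]
    for x, y, score in keypoints:
        col = min(int(x * w), w - 1)
        row = min(int(y * h), h - 1)
        grid[row][col] += score
    return grid

dense = keypoints_to_map([(0.1, 0.2, 1.0), (0.9, 0.9, 0.5)], h=4, w=4)
```

Once in this dense grid form, the keypoint evidence has the same spatial layout as a semantic feature map, so the two can be fused channel-wise downstream.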
cs.CV / 44 / 2603.10578

R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment

R4-CGQA:基于检索的视觉语言模型用于计算机图形图像质量评估
Li, Zhuangzi, Jin, Jian, Cai, Shilv, Lin, Weisi
Abstract
Immersive Computer Graphics (CG) rendering has become ubiquitous in modern daily life. However, comprehensively evaluating CG quality remains challenging for two reasons: first, existing CG datasets lack systematic descriptions of rendering quality; and second, existing CG quality assessment methods cannot provide reasonable text-based explanations. To address these issues, we first identify six key perceptual dimensions of CG quality from the user perspective and construct a dataset of 3,500 CG images with corresponding quality descriptions. Each description covers CG style, content, and perceived quality along the selected dimensions. Furthermore, we use a subset of the dataset to build several question-answer benchmarks based on the descriptions in order to evaluate the responses of existing Vision Language Models (VLMs). We find that current VLMs are not sufficiently accurate in judging fine-grained CG quality, but that descriptions of visually similar images can significantly improve a VLM's understanding of a given CG image. Motivated by this observation, we adopt retrieval-augmented generation and propose a two-stream retrieval framework that effectively enhances the CG quality assessment capabilities of VLMs. Experiments on several representative VLMs demonstrate that our method substantially improves their performance on CG quality assessment.
Chinese Translation
沉浸式计算机图形(CG)渲染在现代日常生活中已变得无处不在。然而,全面评估CG质量仍然面临挑战,主要有两个原因:首先,现有的CG数据集缺乏对渲染质量的系统性描述;其次,现有的CG质量评估方法无法提供合理的基于文本的解释。为了解决这些问题,我们首先从用户的角度识别出CG质量的六个关键感知维度,并构建了一个包含3500张CG图像及其对应质量描述的数据集。每个描述涵盖了CG风格、内容以及在所选维度上的感知质量。此外,我们使用数据集的一个子集构建了几个基于描述的问题-答案基准,以评估现有视觉语言模型(VLM)的响应。我们发现当前的VLM在判断细粒度CG质量方面的准确性不足,但视觉上相似图像的描述可以显著提高VLM对给定CG图像的理解。受到这一观察的启发,我们采用检索增强生成的方法,提出了一种双流检索框架,有效提升了VLM的CG质量评估能力。在对几个代表性VLM的实验中,我们的方法显著改善了它们在CG质量评估方面的表现。
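The retrieval step behind such a retrieval-augmented framework can be sketched as nearest-neighbor lookup of stored quality descriptions by embedding similarity (the embeddings and descriptions below are invented for illustration; the paper's two-stream design is not reproduced here):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_emb, corpus, k=2):
    """Rank stored (embedding, description) pairs by cosine similarity to the
    query CG image embedding and return the top-k descriptions, which would
    then be placed into the VLM prompt as context."""
    ranked = sorted(corpus, key=lambda item: cosine(query_emb, item[0]), reverse=True)
    return [desc for _, desc in ranked[:k]]

corpus = [
    ([1.0, 0.0], "stylized scene, mild aliasing on edges"),
    ([0.0, 1.0], "photoreal scene, heavy texture blur"),
    ([0.9, 0.1], "stylized scene, clean shading"),
]
hits = retrieve([1.0, 0.1], corpus, k=2)
```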
cs.CV / 45 / 2603.10583

Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution

归因作为检索:模型无关的人工智能生成图像归因
Wang, Hongsong, Cheng, Renxi, Han, Chaolei, Gui, Jie
Abstract
With the rapid advancement of AIGC technologies, image forensics faces unprecedented challenges. Traditional methods are incapable of dealing with the increasingly realistic images produced by rapidly evolving image generation techniques. To facilitate the identification of AI-generated images and the attribution of their source models, generative image watermarking and AI-generated image attribution have emerged as key research focuses in recent years. However, existing methods are model-dependent, requiring access to the generative models and lacking generality and scalability to new and unseen generators. To address these limitations, this work presents a new paradigm for AI-generated image attribution by formulating it as an instance retrieval problem instead of a conventional image classification problem. We propose an efficient model-agnostic framework, called Low-bIt-plane-based Deepfake Attribution (LIDA). The input to LIDA is produced by a Low-Bit Fingerprint Generation module, while the training involves Unsupervised Pre-Training followed by Few-Shot Attribution Adaptation. Comprehensive experiments demonstrate that LIDA achieves state-of-the-art performance for both Deepfake detection and image attribution under zero- and few-shot settings. The code is available at https://github.com/hongsong-wang/LIDA
Chinese Translation
随着AIGC技术的快速发展,图像取证将面临前所未有的挑战。传统方法无法处理由快速发展的图像生成技术生成的日益逼真的图像。为了促进对AI生成图像的识别及其源模型的归因,生成图像水印和AI生成图像归因近年来已成为关键研究重点。然而,现有方法依赖于特定模型,需要访问生成模型,缺乏对新颖和未见生成器的普适性和可扩展性。为了解决这些局限性,本研究提出了一种新的AI生成图像归因范式,将其表述为实例检索问题,而不是传统的图像分类问题。我们提出了一种高效的模型无关框架,称为基于低位平面的深伪归因(Low-bIt-plane-based Deepfake Attribution,LIDA)。LIDA的输入由低位指纹生成模块生成,而训练过程包括无监督预训练,随后进行少量样本归因适应。全面的实验表明,LIDA在零样本和少样本设置下,在深伪检测和图像归因方面均实现了最先进的性能。代码可在 https://github.com/hongsong-wang/LIDA 获取。
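The "low bit-plane" idea behind the fingerprint input can be demonstrated directly with bitwise extraction of a chosen plane from an 8-bit image (how LIDA then turns such planes into fingerprints is not shown here):

```python
def bit_plane(pixels, k):
    """Extract bit-plane k (0 = least significant) from an 8-bit grayscale
    image given as a list of rows; the low planes carry the noise-like residue
    in which generator fingerprints tend to be most visible."""
    return [[(p >> k) & 1 for p in row] for row in pixels]

img = [[0, 1, 2, 3], [252, 253, 254, 255]]
lsb = bit_plane(img, 0)  # pixel parity
msb = bit_plane(img, 7)  # coarse intensity
```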
cs.CV / 46 / 2603.10584

Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion

速度的需求:基于单步扩散的零样本深度补全
Gregorek, Jakub, Pegios, Paraskevas, Metzger, Nando, Schindler, Konrad, Kontogianni, Theodora, Nalpantidis, Lazaros
Abstract
We introduce Marigold-SSD, a single-step, late-fusion depth completion framework that leverages strong diffusion priors while eliminating the costly test-time optimization typically associated with diffusion-based methods. By shifting computational burden from inference to finetuning, our approach enables efficient and robust 3D perception under real-world latency constraints. Marigold-SSD achieves significantly faster inference with a training cost of only 4.5 GPU days. We evaluate our method across four indoor and two outdoor benchmarks, demonstrating strong cross-domain generalization and zero-shot performance compared to existing depth completion approaches. Our approach significantly narrows the efficiency gap between diffusion-based and discriminative models. Finally, we challenge common evaluation protocols by analyzing performance under varying input sparsity levels. Page: https://dtu-pas.github.io/marigold-ssd/
Chinese Translation
我们介绍了Marigold-SSD,这是一种单步、后期融合的深度补全框架,利用强大的扩散先验,同时消除了基于扩散的方法通常需要的高昂的测试时优化。通过将计算负担从推理转移到微调,我们的方法在现实世界的延迟约束下实现了高效且稳健的3D感知。Marigold-SSD以仅4.5个GPU天的训练成本实现了显著更快的推理速度。我们在四个室内和两个室外基准上评估了我们的方法,展示了与现有深度补全方法相比强大的跨领域泛化能力和零样本性能。我们的方法显著缩小了基于扩散的模型与判别模型之间的效率差距。最后,我们通过分析在不同输入稀疏程度下的性能,对常见的评估协议提出了质疑。页面链接:https://dtu-pas.github.io/marigold-ssd/
cs.CV / 47 / 2603.10598

Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection

层一致性的重要性:优雅的潜在转移差异用于可泛化的合成图像检测
Yang, Yawen, Li, Feng, Kong, Shuqi, Diao, Yunfeng, Gao, Xinjian, Shi, Zenglin, Wang, Meng
Abstract
The recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these synthetics makes them increasingly indistinguishable from authentic photographs, posing serious security risks such as undermined media credibility and content manipulation. Although extensive efforts have been dedicated to detecting synthetic images, most existing approaches suffer from poor generalization to unseen data due to their reliance on model-specific artifacts or low-level statistical cues. In this work, we identify a previously unexplored distinction: real images maintain consistent semantic attention and structural coherence in their latent representations, exhibiting more stable feature transitions across network layers, whereas synthetic ones exhibit discernibly different patterns. Therefore, we propose a novel approach termed latent transition discrepancy (LTD), which captures the inter-layer consistency differences between real and synthetic images. LTD adaptively identifies the most discriminative layers and assesses the transition discrepancies across layers. Benefiting from the proposed inter-layer discriminative modeling, our approach exceeds the base model by 14.35% in mean accuracy across three datasets containing diverse GANs and diffusion models (DMs). Extensive experiments demonstrate that LTD outperforms recent state-of-the-art methods, achieving superior detection accuracy, generalizability, and robustness. The code is available at https://github.com/yywencs/LTD
Chinese Translation
近年来,生成模型的快速发展显著提高了人工智能生成合成图像的真实感和可获取性。虽然这使得各种创新应用成为可能,但这些合成图像前所未有的真实感使其与真实照片越来越难以区分,从而带来了严重的安全风险,例如媒体可信度和内容操控。尽管已有大量努力致力于检测合成图像,但大多数现有方法由于依赖于特定模型的伪影或低级统计线索,导致在未见数据上的泛化能力较差。在本研究中,我们识别出一个之前未被探索的区别,即真实图像在其潜在表示中保持一致的语义关注和结构连贯性,展现出跨网络层更稳定的特征转移,而合成图像则呈现出明显不同的模式。因此,我们提出了一种新方法,称为潜在转移差异(Latent Transition Discrepancy, LTD),该方法捕捉真实图像与合成图像之间的层间一致性差异。LTD自适应地识别出最具区分性的层,并评估各层之间的转移差异。得益于所提出的层间区分建模,我们的方法在包含多种生成对抗网络(GANs)和扩散模型(DMs)的三个数据集上,平均准确率比基础模型提高了14.35%。大量实验表明,LTD在检测准确性、泛化能力和鲁棒性方面超越了最近的最先进方法。代码可在 https://github.com/yywencs/LTD 获取。
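The core quantity, an inter-layer transition discrepancy, can be sketched as one-minus-cosine-similarity between consecutive layer features (toy low-dimensional vectors below; real activations are high-dimensional, and the paper's adaptive layer selection is omitted):

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def transition_discrepancy(layer_feats):
    """One discrepancy value 1 - cos(f_l, f_{l+1}) per consecutive layer pair;
    the abstract's premise is that real images yield smaller, more stable
    values than synthetic ones."""
    return [1.0 - cos_sim(a, b) for a, b in zip(layer_feats, layer_feats[1:])]

stable = transition_discrepancy([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]])  # drifts smoothly
jumpy = transition_discrepancy([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # flips direction
```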
cs.CV / 48 / 2603.10604

HyPER-GAN: Hybrid Patch-Based Image-to-Image Translation for Real-Time Photorealism Enhancement

HyPER-GAN:基于混合补丁的图像到图像翻译方法用于实时照片真实感增强
Pasios, Stefanos, Nikolaidis, Nikos
Abstract
Generative models are widely employed to enhance the photorealism of synthetic data for training computer vision algorithms. However, they often introduce visual artifacts that degrade the accuracy of these algorithms and require high computational resources, limiting their applicability in real-time training or evaluation scenarios. In this paper, we propose Hybrid Patch Enhanced Realism Generative Adversarial Network (HyPER-GAN), a lightweight image-to-image translation method based on a U-Net-style generator designed for real-time inference. The model is trained using paired synthetic and photorealism-enhanced images, complemented by a hybrid training strategy that incorporates matched patches from real-world data to improve visual realism and semantic consistency. Experimental results demonstrate that HyPER-GAN outperforms state-of-the-art paired image-to-image translation methods in terms of inference latency, visual realism, and semantic robustness. Moreover, it is illustrated that the proposed hybrid training strategy indeed improves visual quality and semantic consistency compared to training the model solely with paired synthetic and photorealism-enhanced images. Code and pretrained models are publicly available for download at: https://github.com/stefanos50/HyPER-GAN
Chinese Translation
生成模型广泛应用于增强合成数据的照片真实感,以训练计算机视觉算法。然而,它们往往会引入视觉伪影,从而降低这些算法的准确性,并且需要高计算资源,限制了它们在实时训练或评估场景中的适用性。本文提出了一种轻量级的图像到图像翻译方法——混合补丁增强真实感生成对抗网络(HyPER-GAN),该方法基于U-Net风格的生成器,旨在实现实时推理。该模型使用配对的合成图像和增强的照片真实感图像进行训练,并辅以一种混合训练策略,该策略结合了来自真实世界数据的匹配补丁,以提高视觉真实感和语义一致性。实验结果表明,HyPER-GAN在推理延迟、视觉真实感和语义鲁棒性方面优于最先进的配对图像到图像翻译方法。此外,结果还表明,与仅使用配对的合成图像和增强的照片真实感图像进行模型训练相比,所提出的混合训练策略确实改善了视觉质量和语义一致性。代码和预训练模型可在以下链接公开下载:https://github.com/stefanos50/HyPER-GAN
cs.CV / 49 / 2603.10638

Splat2Real: Novel-view Scaling for Physical AI with 3D Gaussian Splatting

Splat2Real:基于3D高斯溅射的物理人工智能新视角缩放
Lim, Hansol, Choi, Jongseong Brad
Abstract
Physical AI faces viewpoint shift between training and deployment, and novel-view robustness is essential for monocular RGB-to-3D perception. We cast Real2Render2Real monocular depth pretraining as imitation-learning-style supervision from a digital twin oracle: a student depth network imitates expert metric depth/visibility rendered from a scene mesh, while 3DGS supplies scalable novel-view observations. We present Splat2Real, centered on novel-view scaling: performance depends more on which views are added than on raw view count. We introduce CN-Coverage, a coverage-plus-novelty curriculum that greedily selects views by geometry gain and an extrapolation penalty, plus a quality-aware guardrail fallback for low-reliability teachers. Across 20 TUM RGB-D sequences with step-matched budgets (N = 0 to 2000 additional rendered views, with at most 500 unique views and resampling for larger budgets), naive scaling is unstable; CN-Coverage mitigates worst-case regressions relative to Robot/Coverage policies, and GOL-Gated CN-Coverage provides the strongest medium-to-high-budget stability with the lowest high-novelty tail error. Downstream control-proxy results versus N provide embodied-relevance evidence by shifting safety/progress trade-offs under viewpoint shift.
Chinese Translation
物理人工智能在训练和部署之间面临视角变化,而新视角的鲁棒性对于单目RGB到3D感知至关重要。我们将Real2Render2Real单目深度预训练视为来自数字孪生预言机的模仿学习式监督:一个学生深度网络模仿从场景网格渲染的专家度量深度/可见性,同时3DGS提供可扩展的新视角观察。我们提出了Splat2Real,重点在于新视角缩放:性能更多取决于添加了哪些视角,而非原始视角数量。我们引入了CN-Coverage,一种覆盖+新颖性课程,通过几何增益和外推惩罚贪婪地选择视角,并为低可靠性的教师提供质量感知的保护性回退。在20个TUM RGB-D序列上,使用步数匹配的预算(N=0到2000个额外渲染视角,其中唯一视角不超过500个,更大预算通过重采样实现),简单缩放不稳定;CN-Coverage相对于Robot/Coverage策略减轻了最坏情况的回退,而GOL-Gated CN-Coverage在中高预算下提供了最强的稳定性,并具有最低的高新颖性尾部误差。下游控制代理结果随N的变化表明视角变化下的安全/进展权衡发生了转移,从而为具身相关性提供了证据。
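The coverage-plus-novelty greedy selection can be sketched as follows, scoring each candidate view by its new-coverage gain minus a penalty on novelty beyond a cap (all names, cell sets, and weights below are invented, as is the penalty form; the abstract only states that geometry gain and an extrapolation penalty are combined):

```python
def select_views(candidates, budget, novelty_cap, penalty=6.0):
    """Greedy CN-Coverage-style selection (toy version): repeatedly pick the
    candidate (name, covered_cells, novelty) maximizing new coverage minus a
    penalty on novelty above the cap, until the budget is spent."""
    covered, chosen = set(), []
    for _ in range(budget):
        def score(view):
            _, cells, novelty = view
            return len(cells - covered) - penalty * max(0.0, novelty - novelty_cap)
        pool = [v for v in candidates if v[0] not in chosen]
        if not pool:
            break
        best = max(pool, key=score)
        chosen.append(best[0])
        covered |= best[1]
    return chosen

candidates = [
    ("near_a", {1, 2, 3}, 0.1),
    ("near_b", {3, 4}, 0.2),
    ("extreme", {5, 6, 7, 8}, 0.9),  # large raw gain but far outside seen poses
]
picked = select_views(candidates, budget=2, novelty_cap=0.3)
```

With these weights the extreme-novelty view loses to the in-distribution views despite its larger raw coverage gain, which is the behavior the curriculum is after.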
cs.CV / 50 / 2603.10648

Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning

少即是多:无解码器的掩码建模用于高效骨骼表示学习
Do, Jeonghyeok, Chen, Yun, Youk, Geunhyuk, Kim, Munchurl
Abstract
The landscape of skeleton-based action representation learning has evolved from Contrastive Learning (CL) to Masked Auto-Encoder (MAE) architectures. However, each paradigm faces inherent limitations: CL often overlooks fine-grained local details, while MAE is burdened by computationally heavy decoders. Moreover, MAE suffers from severe computational asymmetry -- benefiting from efficient masking during pre-training but requiring exhaustive full-sequence processing for downstream tasks. To resolve these bottlenecks, we propose SLiM (Skeleton Less is More), a novel unified framework that harmonizes masked modeling with contrastive learning via a shared encoder. By eschewing the reconstruction decoder, SLiM not only eliminates computational redundancy but also compels the encoder to capture discriminative features directly. SLiM is the first framework to perform decoder-free masked modeling for representation learning. Crucially, to prevent trivial reconstruction arising from high skeletal-temporal correlation, we introduce semantic tube masking, alongside skeletal-aware augmentations designed to ensure anatomical consistency across diverse temporal granularities. Extensive experiments demonstrate that SLiM consistently achieves state-of-the-art performance across all downstream protocols. Notably, our method delivers this superior accuracy with exceptional efficiency, reducing inference computational cost by 7.89x compared to existing MAE methods.
Chinese Translation
基于骨骼的动作表示学习的研究领域已从对比学习(Contrastive Learning, CL)发展到掩码自编码器(Masked Auto-Encoder, MAE)架构。然而,每种范式都面临固有的局限性:CL往往忽视细粒度的局部细节,而MAE则受到计算负担较重的解码器的制约。此外,MAE还存在严重的计算不对称性——在预训练期间受益于高效的掩码,但在下游任务中却需要进行全面的全序列处理。为了解决这些瓶颈,我们提出了SLiM(Skeleton Less is More),一个新颖的统一框架,通过共享编码器将掩码建模与对比学习相结合。通过避免重建解码器,SLiM不仅消除了计算冗余,还迫使编码器直接捕捉判别特征。SLiM是第一个面向表示学习的无解码器掩码建模框架。关键是,为了防止因高骨骼时间相关性而产生的平凡重建,我们引入了语义管状掩码(semantic tube masking),并设计了骨骼感知增强,以确保在不同时间粒度下的解剖一致性。大量实验表明,SLiM在所有下游协议中始终实现了最先进的性能。值得注意的是,我们的方法以卓越的效率提供了这种优越的准确性,与现有的MAE方法相比,推理计算成本降至其1/7.89。
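Semantic tube masking, i.e. hiding one semantic joint group over a contiguous span of frames rather than masking joints independently per frame, can be sketched on a [T x J] grid (the joint grouping below is an invented example):

```python
def tube_mask(num_frames, joint_groups, group_idx, t_start, t_len):
    """[T][J] boolean mask that hides one semantic joint group (e.g. an arm)
    over a contiguous run of frames -- a 'tube' across time."""
    num_joints = sum(len(g) for g in joint_groups)
    mask = [[False] * num_joints for _ in range(num_frames)]
    for t in range(t_start, min(t_start + t_len, num_frames)):
        for j in joint_groups[group_idx]:
            mask[t][j] = True
    return mask

groups = [[0, 1], [2, 3, 4], [5]]  # e.g. torso, left arm, head (illustrative)
m = tube_mask(num_frames=6, joint_groups=groups, group_idx=1, t_start=2, t_len=3)
masked_ratio = sum(v for row in m for v in row) / (6 * 6)
```

Masking a whole group for several frames prevents the model from trivially interpolating a hidden joint from its neighbors in space or time, which is the failure mode the abstract calls trivial reconstruction.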
cs.CV / 51 / 2603.10652

Are Video Reasoning Models Ready to Go Outside?

视频推理模型准备好走出实验室了吗?
He, Yangfan, Boo, Changgyu, Yoon, Jaehong
Abstract
In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.
Chinese Translation
在实际部署中,视觉-语言模型常常会遇到天气、遮挡和摄像机运动等干扰。在这种条件下,它们的理解和推理能力显著下降,揭示了干净、受控(即未受干扰)评估环境与现实世界鲁棒性之间的差距。为了解决这一局限性,我们提出了ROVA,一个新颖的训练框架,通过在时空干扰下建模鲁棒性感知的一致性奖励来提高鲁棒性。ROVA引入了一种基于模型不断演变能力的困难感知在线训练策略,优先考虑信息丰富的样本。具体而言,它通过自我反思评估持续重新估计样本的困难度,从而实现具有鲁棒性感知一致性奖励的自适应训练。我们还引入了PVRBench,一个新的基准,向具身视频数据集中注入现实世界的扰动,以评估在现实干扰下的准确性和推理质量。我们在PVRBench、UrbanVideo和VisBench上评估了ROVA和基准模型,其中开源和专有模型在现实扰动下的准确性和推理能力下降幅度分别达到35%和28%。ROVA有效减轻了性能下降,相较于基准模型(QWen2.5/3-VL、InternVL2.5、Embodied-R),提高了至少24%的相对准确性和超过9%的推理能力。这些提升在干净的标准基准上也得以转移,带来了持续的改进。
cs.CV / 52 / 2603.10658

How To Embed Matters: Evaluation of EO Embedding Design Choices

如何嵌入问题:EO 嵌入设计选择的评估
Gilch, Luis, Wittmann, Isabelle, Nitsche, Maximilian, Jakubik, Johannes, Ewald, Arne, Brunschwiler, Thomas
Abstract
Earth observation (EO) missions produce petabytes of multispectral imagery, increasingly analyzed using large Geospatial Foundation Models (GeoFMs). Alongside end-to-end adaptation, workflows make growing use of intermediate representations as task-agnostic embeddings, enabling models to compute representations once and reuse them across downstream tasks. Consequently, when GeoFMs act as feature extractors, decisions about how representations are obtained, aggregated, and combined affect downstream performance and pipeline scalability. Understanding these trade-offs is essential for scalable embedding-based EO workflows, where compact embeddings can replace raw data while remaining broadly useful. We present a systematic analysis of embedding design in GeoFM-based EO workflows. Leveraging NeuCo-Bench, we study how backbone architecture, pretraining strategy, representation depth, spatial aggregation, and representation combination influence EO task performance. We demonstrate the usability of GeoFM embeddings by aggregating them into fixed-size representations more than 500x smaller than the raw input data. Across models, we find consistent trends: transformer backbones with mean pooling provide strong default embeddings, intermediate ResNet layers can outperform final layers, self-supervised objectives exhibit task-specific strengths, and combining embeddings from different objectives often improves robustness.
Chinese Translation
地球观测(EO)任务产生了数PB的多光谱影像,越来越多地使用大型地理空间基础模型(GeoFMs)进行分析。除了端到端的适应外,工作流程越来越多地利用中间表示作为任务无关的嵌入,使得模型能够一次计算表示并在下游任务中重复使用。因此,当 GeoFMs 作为特征提取器时,关于如何获取、聚合和组合表示的决策会影响下游性能和管道可扩展性。理解这些权衡对于可扩展的基于嵌入的 EO 工作流程至关重要,在这些工作流程中,紧凑的嵌入可以替代原始数据,同时仍然保持广泛的实用性。我们对基于 GeoFM 的 EO 工作流程中的嵌入设计进行了系统分析。利用 NeuCo-Bench,我们研究了主干架构、预训练策略、表示深度、空间聚合和表示组合如何影响 EO 任务性能。我们通过将 GeoFM 嵌入聚合成比原始输入数据小500倍以上的固定大小表示,展示了 GeoFM 嵌入的可用性。在不同模型中,我们发现了一致的趋势:具有均值池化的变换器主干提供了强大的默认嵌入,中间的 ResNet 层可以超越最终层,自监督目标表现出任务特定的优势,而来自不同目标的嵌入组合通常会提高鲁棒性。
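Mean pooling, the spatial aggregation the study finds to be a strong default, together with the >500x size-reduction claim, can be illustrated as follows (the tile shape, band count, and embedding width are assumptions for the arithmetic, not figures taken from the paper):

```python
def mean_pool(tokens):
    """Collapse an [n_tokens x d] list of patch features into one d-dim
    embedding by averaging over the token axis."""
    n, d = len(tokens), len(tokens[0])
    return [sum(tok[i] for tok in tokens) / n for i in range(d)]

emb = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Illustrative size accounting: a hypothetical 224 x 224 tile with 12 spectral
# bands holds 602,112 raw values, versus a single 768-dim embedding -- a ~784x
# reduction, consistent in magnitude with the ">500x smaller" claim.
reduction = (224 * 224 * 12) / 768
```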
cs.CV / 53 / 2603.10685

A$^2$-Edit: Precise Reference-Guided Image Editing of Arbitrary Objects and Ambiguous Masks

A$^2$-Edit:精确的参考引导任意对象和模糊掩膜的图像编辑
Zheng, Huayu, Li, Guangzhao, Zhao, Baixuan, Luo, Siqi, Jiang, Hantao, Zhai, Guangtao, Liu, Xiaohong
Abstract
We propose \textbf{A$^2$-Edit}, a unified inpainting framework for arbitrary object categories, which allows users to replace any target region with a reference object using only a coarse mask. To address the issues of severe homogenization and limited category coverage in existing datasets, we construct a large-scale, multi-category dataset \textbf{UniEdit-500K}, which includes 8 major categories, 209 fine-grained subcategories, and a total of 500,104 image pairs. Such rich category diversity poses new challenges for the model, requiring it to automatically learn semantic relationships and distinctions across categories. To this end, we introduce the \textbf{Mixture of Transformer} module, which performs differentiated modeling of various object categories through dynamic expert selection, and further enhances cross-category semantic transfer and generalization through collaboration among experts. In addition, we propose a \textbf{Mask Annealing Training Strategy} (MATS) that progressively relaxes mask precision during training, reducing the model's reliance on accurate masks and improving robustness across diverse editing tasks. Extensive experiments on benchmarks such as VITON-HD and AnyInsertion demonstrate that A$^2$-Edit consistently outperforms existing approaches across all metrics, providing a new and efficient solution for arbitrary object editing.
Chinese Translation
我们提出了\textbf{A$^2$-Edit},这是一个统一的任意对象类别的修复框架,允许用户仅使用粗略掩膜将任何目标区域替换为参考对象。为了解决现有数据集中严重同质化和类别覆盖有限的问题,我们构建了一个大规模的多类别数据集\textbf{UniEdit-500K},该数据集包括8个主要类别、209个细粒度子类别,以及总计500,104对图像。这种丰富的类别多样性为模型带来了新的挑战,要求其自动学习类别之间的语义关系和区分。为此,我们引入了\textbf{Mixture of Transformer}模块,通过动态专家选择对各种对象类别进行差异化建模,并通过专家之间的协作进一步增强跨类别的语义转移和泛化。此外,我们提出了一种\textbf{掩膜退火训练策略}(Mask Annealing Training Strategy, MATS),该策略在训练过程中逐步放宽掩膜精度,减少模型对准确掩膜的依赖,提高在多样化编辑任务中的鲁棒性。在VITON-HD和AnyInsertion等基准上的大量实验表明,A$^2$-Edit在所有指标上始终优于现有方法,为任意对象编辑提供了一种新的高效解决方案。
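The Mask Annealing Training Strategy's idea of progressively relaxing mask precision over training can be sketched as a dilation radius that grows with the training step (an illustrative linear schedule; the paper's actual MATS schedule is not specified in the abstract):

```python
def dilate(mask, r):
    """Grow a binary mask by r cells in every direction (Chebyshev ball)."""
    h, w = len(mask), len(mask[0])
    return [[1 if any(mask[yy][xx]
                      for yy in range(max(0, y - r), min(h, y + r + 1))
                      for xx in range(max(0, x - r), min(w, x + r + 1)))
             else 0
             for x in range(w)]
            for y in range(h)]

def annealed_mask(precise_mask, step, total_steps, max_radius=3):
    """Early steps train on the precise mask; later steps see progressively
    dilated, coarser masks, weaning the model off accurate mask supervision."""
    r = round(max_radius * step / total_steps)
    return dilate(precise_mask, r)

precise = [[0] * 5 for _ in range(5)]
precise[2][2] = 1
early = annealed_mask(precise, step=0, total_steps=10)   # exact mask
late = annealed_mask(precise, step=10, total_steps=10)   # fully relaxed
```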
cs.CV / 54 / 2603.10694

Bioinspired CNNs for border completion in occluded images

仿生卷积神经网络用于遮挡图像的边界补全
Coutinho, Catarina P., Merhab, Aneeqa, Petkovic, Janko, Zanchetta, Ferdinando, Fioresi, Rita
Abstract
We exploit the mathematical modeling of the border completion problem in the visual cortex to design convolutional neural network (CNN) filters that enhance robustness to image occlusions. We evaluate our CNN architecture, BorderNet, on three occluded datasets (MNIST, Fashion-MNIST, and EMNIST) under two types of occlusions: stripes and grids. In all cases, BorderNet demonstrates improved performance, with gains varying depending on the severity of the occlusions and the dataset.
Chinese Translation
我们利用视觉皮层中边界补全问题的数学建模,设计了增强对图像遮挡鲁棒性的卷积神经网络(CNN)滤波器。我们在三种遮挡数据集(MNIST、Fashion-MNIST 和 EMNIST)上评估了我们的 CNN 架构 BorderNet,测试了两种类型的遮挡:条纹和网格。在所有情况下,BorderNet 展现了更好的性能,性能提升因遮挡的严重程度和数据集而异。
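The stripe occlusions used in the evaluation can be reproduced in a few lines (period and width below are arbitrary example values; grid occlusion would additionally mask vertical stripes):

```python
def occlude_stripes(img, period, width, fill=0.0):
    """Overwrite horizontal stripes `width` rows thick every `period` rows,
    mimicking the stripe occlusions applied to the MNIST-style test sets."""
    return [[fill] * len(row) if (y % period) < width else row[:]
            for y, row in enumerate(img)]

img = [[1.0] * 4 for _ in range(6)]
occ = occlude_stripes(img, period=3, width=1)
visible_fraction = sum(v for row in occ for v in row) / (6 * 4)
```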
cs.CV / 55 / 2603.10695

RandMark: On Random Watermarking of Visual Foundation Models

RandMark:视觉基础模型的随机水印技术
Chistyakova, Anna, Pautov, Mikhail
Abstract
Being trained on large and diverse datasets, visual foundation models (VFMs) can be fine-tuned to achieve remarkable performance and efficiency in various downstream computer vision tasks. The high computational cost of data collection and training makes these models valuable assets, which motivates some VFM owners to distribute them alongside a license to protect their intellectual property rights. In this paper, we propose an approach to ownership verification of visual foundation models that leverages a small encoder-decoder network to embed digital watermarks into an internal representation of a hold-out set of input images. The method is based on random watermark embedding, which makes the watermark statistics detectable in functional copies of the watermarked model. Both theoretically and experimentally, we demonstrate that the proposed method yields a low probability of false detection for non-watermarked models and a low probability of missed detection for watermarked models.
Chinese Translation
视觉基础模型(VFMs)在大型多样化数据集上进行训练,能够在各种下游计算机视觉任务中实现显著的性能和效率。数据收集和训练的高计算成本使这些模型成为宝贵的资产,这促使一些VFM所有者在分发模型时附带许可证以保护其知识产权。在本文中,我们提出了一种利用小型编码器-解码器网络将数字水印嵌入保留输入图像集的内部表示中,从而验证视觉基础模型所有权的方法。该方法基于随机水印嵌入,使得水印统计信息在水印模型的功能副本中可被检测。我们在理论和实验上均证明,所提出的方法对非水印模型的误检概率较低,对水印模型的误判概率也较低。
cs.CV / 56 / 2603.10702

UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations

UniCom:通过压缩连续语义表示实现统一的多模态建模
Zhao, Yaqi, Lin, Wang, Zhang, Zijian, Yang, Miles, Chen, Jingyuan, Zhang, Wentao, Zhong, Zhao, Bo, Liefeng
Abstract
Current unified multimodal models typically rely on discrete visual tokenizers to bridge the modality gap. However, discretization inevitably discards fine-grained semantic information, leading to suboptimal performance in visual understanding tasks. Conversely, directly modeling continuous semantic representations (e.g., CLIP, SigLIP) poses significant challenges in high-dimensional generative modeling, resulting in slow convergence and training instability. To resolve this dilemma, we introduce UniCom, a unified framework that harmonizes multimodal understanding and generation via compressed continuous representation. We empirically demonstrate that reducing channel dimension is significantly more effective than spatial downsampling for both reconstruction and generation. Accordingly, we design an attention-based semantic compressor to distill dense features into a compact unified representation. Furthermore, we validate that the transfusion architecture surpasses query-based designs in convergence and consistency. Experiments demonstrate that UniCom achieves state-of-the-art generation performance among unified models. Notably, by preserving rich semantic priors, it delivers exceptional controllability in image editing and maintains image consistency even without relying on VAE.
Chinese Translation
当前的统一多模态模型通常依赖离散的视觉标记器来弥合模态间的差距。然而,离散化不可避免地会丢失细粒度的语义信息,从而导致在视觉理解任务中的次优表现。相反,直接建模连续语义表示(例如,CLIP、SigLIP)在高维生成建模中面临重大挑战,导致收敛速度慢和训练不稳定。为了解决这一困境,我们提出了UniCom,一个通过压缩连续表示来协调多模态理解和生成的统一框架。我们通过实验证明,减少通道维度在重建和生成方面显著优于空间下采样。因此,我们设计了一种基于注意力的语义压缩器,将密集特征提炼为紧凑的统一表示。此外,我们验证了Transfusion架构在收敛性和一致性方面优于基于查询的设计。实验表明,UniCom在统一模型中实现了最先进的生成性能。值得注意的是,通过保留丰富的语义先验,它在图像编辑中提供了卓越的可控性,并且即使不依赖于变分自编码器(VAE),也能保持图像一致性。
cs.CV / 57 / 2603.10703

WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation

WalkGPT:基于深度感知分割的行人导航的视觉-语言对话模型
Sultan, Rafi Ibn, Zhu, Hui, Zhou, Xiangyu, Li, Chengyin, Khanduri, Prashant, Brocanelli, Marco, Zhu, Dongxiao
Abstract
Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision-Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. We introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features, along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the final image tokens by aggregating them along text tokens across spatial hierarchies, and a Calibrated Text Projector (CTP), guided by a proposed Region Alignment Loss, that maps language embeddings into segmentation-aware representations. These components enable fine-grained grounding and depth inference without user-provided cues or anchor points, allowing the model to generate complete and realistic navigation guidance. We also introduce PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance. The source code and dataset are available on the \href{https://sites.google.com/view/walkgpt-26/home}{project website}.
Chinese Translation
确保无障碍的行人导航需要对复杂城市场景的语义和空间方面进行推理,这一挑战是现有的大型视觉-语言模型(LVLMs)难以应对的。尽管这些模型能够描述视觉内容,但它们缺乏明确的视觉定位(grounding),导致对象幻觉和不可靠的深度推理,从而限制了它们在无障碍指导中的实用性。我们提出了WalkGPT,一种像素级定位的LVLM,面向新任务“定位式导航指导”(Grounded Navigation Guide),将语言推理和分割统一在单一架构中,以提供深度感知的无障碍指导。给定一张行人视角的图像和一个导航查询,WalkGPT生成一个对话响应,包含划分出可通行特征与危险特征的分割掩码,并提供相对深度估计。该模型包含一个多尺度查询投影器(Multi-Scale Query Projector, MSQP),通过在空间层次上沿文本标记聚合最终图像标记来对其进行塑造;以及一个校准文本投影器(Calibrated Text Projector, CTP),在所提出的区域对齐损失(Region Alignment Loss)的指导下,将语言嵌入映射为分割感知的表示。这些组件使模型无需用户提供提示或锚点即可实现细粒度定位和深度推理,从而生成完整且贴近实际的导航指导。我们还发布了PAVE,一个包含41,000张行人视角图像的大规模基准数据集,配有无障碍相关的问题和基于深度的答案。实验表明,WalkGPT在定位推理和分割性能上表现出色。源代码和数据集可在项目网站上获取。
cs.CV / 58 / 2603.10722

UAV traffic scene understanding: A cross-spectral guided approach and a unified benchmark

无人机交通场景理解:一种跨光谱引导的方法和统一基准
Zhang, Yu, Zhao, Zhicheng, Luo, Ze, Li, Chenglong, Tang, Jin
Abstract
Traffic scene understanding from unmanned aerial vehicle (UAV) platforms is crucial for intelligent transportation systems due to its flexible deployment and wide-area monitoring capabilities. However, existing methods face significant challenges in real-world surveillance, as their heavy reliance on optical imagery leads to severe performance degradation under adverse illumination conditions like nighttime and fog. Furthermore, current Visual Question Answering (VQA) models are restricted to elementary perception tasks, lacking the domain-specific regulatory knowledge required to assess complex traffic behaviors. To address these limitations, we propose a novel Cross-spectral Traffic Cognition Network (CTCNet) for robust UAV traffic scene understanding. Specifically, we design a Prototype-Guided Knowledge Embedding (PGKE) module that leverages high-level semantic prototypes from an external Traffic Regulation Memory (TRM) to anchor domain-specific knowledge into visual representations, enabling the model to comprehend complex behaviors and distinguish fine-grained traffic violations. Moreover, we develop a Quality-Aware Spectral Compensation (QASC) module that exploits the complementary characteristics of optical and thermal modalities to perform bidirectional context exchange, effectively compensating for degraded features to ensure robust representation in complex environments. In addition, we construct Traffic-VQA, the first large-scale optical-thermal infrared benchmark for cognitive UAV traffic understanding, comprising 8,180 aligned image pairs and 1.3 million question-answer pairs across 31 diverse types. Extensive experiments demonstrate that CTCNet significantly outperforms state-of-the-art methods in both cognition and perception scenarios. The dataset is available at https://github.com/YuZhang-2004/UAV-traffic-scene-understanding.
Chinese Translation
来自无人机(UAV)平台的交通场景理解对于智能交通系统至关重要,因为其具有灵活部署和广域监控的能力。然而,现有方法在实际监控中面临重大挑战,因为它们对光学图像的高度依赖导致在夜间和雾霾等不利光照条件下性能严重下降。此外,目前的视觉问答(VQA)模型仅限于基础感知任务,缺乏评估复杂交通行为所需的领域特定的监管知识。为了解决这些局限性,我们提出了一种新颖的跨光谱交通认知网络(CTCNet),以实现稳健的无人机交通场景理解。具体而言,我们设计了一个原型引导知识嵌入(PGKE)模块,该模块利用来自外部交通规制记忆(TRM)的高层语义原型,将领域特定知识锚定到视觉表示中,使模型能够理解复杂行为并区分细粒度的交通违规行为。此外,我们开发了一个质量感知光谱补偿(QASC)模块,利用光学和热成像模态的互补特性进行双向上下文交换,有效弥补退化特征,以确保在复杂环境中稳健的表示。此外,我们构建了Traffic-VQA,这是第一个大规模光学-热红外基准,用于认知无人机交通理解,包含8180对对齐图像和130万对问题-答案对,涵盖31种不同类型。大量实验表明,CTCNet在认知和感知场景中显著优于最先进的方法。数据集可在 https://github.com/YuZhang-2004/UAV-traffic-scene-understanding 获得。
cs.CV / 59 / 2603.10724

eLasmobranc Dataset: An Image Dataset for Elasmobranch Species Recognition and Biodiversity Monitoring

eLasmobranc 数据集:一种用于软骨鱼类物种识别和生物多样性监测的图像数据集
Beviá-Ballesteros, Ismael, Jerez-Tallón, Mario, Aranda-Garrido, Nieves, Abel-Abellán, Isabel, Antón-Linares, Irene, Azorín-López, Jorge, Saval-Calvo, Marcelo, Fuster-Guilló, Andres, Giménez-Casalduero, Francisca
Abstract
Elasmobranch populations are experiencing significant global declines, and several species are currently classified as threatened. Reliable monitoring and species-level identification are essential to support conservation and spatial planning initiatives such as Important Shark and Ray Areas (ISRAs). However, existing visual datasets are predominantly detection-oriented, underwater-acquired, or limited to coarse-grained categories, restricting their applicability to fine-grained morphological classification. We present the eLasmobranc Dataset, a curated and publicly available image collection from seven ecologically relevant elasmobranch species inhabiting the eastern Spanish Mediterranean coast, a region where two ISRAs have been identified. Images were obtained through dedicated data collection, including field campaigns and collaborations with local fish markets and projects, as well as from open-access public sources. The dataset was constructed predominantly from images acquired outside the aquatic environment under standardized protocols to ensure clear visualization of diagnostic morphological traits. It integrates expert-validated species annotations, structured spatial and temporal metadata, and complementary species-level information. The eLasmobranc Dataset is specifically designed to support supervised species-level classification, population studies, and the development of artificial intelligence systems for biodiversity monitoring. By combining morphological clarity, taxonomic reliability, and public accessibility, the dataset addresses a critical gap in fine-grained elasmobranch identification and promotes reproducible research in conservation-oriented computer vision. The dataset is publicly available at https://zenodo.org/records/18549737.
Chinese Translation
软骨鱼类种群正经历显著的全球下降,多个物种目前被列为受威胁。可靠的监测和物种级别的识别对于支持保护和空间规划倡议(如重要鲨鱼和鳐鱼区域,ISRAs)至关重要。然而,现有的视觉数据集主要以检测为导向,且多为水下获取或仅限于粗略分类,限制了其在细粒度形态分类中的适用性。我们提出了 eLasmobranc 数据集,这是一个经过策划并公开可用的图像集合,涵盖了栖息于西班牙东部地中海沿岸的七种生态相关的软骨鱼类,这一地区已识别出两个 ISRAs。图像通过专门的数据收集获得,包括实地考察和与当地鱼市场及项目的合作,以及来自开放获取公共来源的图像。该数据集主要由在标准化协议下于水面外获取的图像构成,以确保诊断性形态特征的清晰可视化。它整合了专家验证的物种注释、结构化的空间和时间元数据以及补充的物种级信息。eLasmobranc 数据集专门设计用于支持监督的物种级分类、种群研究以及用于生物多样性监测的人工智能系统开发。通过结合形态清晰性、分类可靠性和公众可访问性,该数据集填补了细粒度软骨鱼类识别中的关键空白,并促进了以保护为导向的计算机视觉领域的可重复研究。该数据集可在 https://zenodo.org/records/18549737 获取。
cs.CV / 60 / 2603.10744

Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers

即时:无训练的扩散变换器空间加速
Sun, Wenhao, Li, Ji, Liu, Zhaoqiang
Abstract
Diffusion Transformers have established a new state-of-the-art in image synthesis, but the high computational cost of iterative sampling severely hampers their practical deployment. While existing acceleration methods often focus on the temporal domain, they overlook the substantial spatial redundancy inherent in the generative process, where global structures emerge long before fine-grained details are formed. The uniform computational treatment of all spatial regions represents a critical inefficiency. In this paper, we introduce Just-in-Time (JiT), a novel training-free framework that addresses this challenge by acceleration in the spatial domain. JiT formulates a spatially approximated generative ordinary differential equation (ODE) that drives the full latent state evolution based on computations from a dynamically selected, sparse subset of anchor tokens. To ensure seamless transitions as new tokens are incorporated to expand the dimensions of the latent state, we propose a deterministic micro-flow, a simple and effective finite-time ODE that maintains both structural coherence and statistical correctness. Extensive experiments on the state-of-the-art FLUX.1-dev model demonstrate that JiT achieves up to a 7x speedup with nearly lossless performance, significantly outperforming existing acceleration methods and establishing a new and superior trade-off between inference speed and generation fidelity.
Chinese Translation
扩散变换器在图像合成中确立了新的最先进水平,但迭代采样的高计算成本严重阻碍了其实际部署。现有的加速方法通常侧重于时间域,而忽视了生成过程中固有的显著空间冗余:全局结构早在细粒度细节形成之前就已出现。对所有空间区域进行统一的计算处理是一种关键的低效。在本文中,我们提出了即时(Just-in-Time, JiT),一种新颖的免训练框架,通过空间域中的加速来应对这一挑战。JiT 构建了一个空间近似的生成常微分方程(ODE),基于动态选择的稀疏锚点令牌子集上的计算来驱动完整潜在状态的演化。为了确保在纳入新令牌以扩展潜在状态维度时实现无缝过渡,我们提出了确定性微流(deterministic micro-flow),一种简单而有效的有限时间 ODE,能够同时保持结构一致性和统计正确性。在最先进的 FLUX.1-dev 模型上进行的大量实验表明,JiT 实现了高达 7 倍的加速且性能几乎无损,显著优于现有的加速方法,并在推理速度与生成保真度之间建立了新的、更优的权衡。
cs.CV / 61 / 2603.10748

Event-based Photometric Stereo via Rotating Illumination and Per-Pixel Learning

基于事件的光度立体视觉:旋转照明与逐像素学习
Kim, Hyunwoo, Kim, Won-Hoe, Lee, Sanghoon, Cai, Jianfei, Nam, Giljoo, Hyun, Jae-Sang
Abstract
Photometric stereo is a technique for estimating surface normals using images captured under varying illumination. However, conventional frame-based photometric stereo methods are limited in real-world applications due to their reliance on controlled lighting, and susceptibility to ambient illumination. To address these limitations, we propose an event-based photometric stereo system that leverages an event camera, which is effective in scenarios with continuously varying scene radiance and high dynamic range conditions. Our setup employs a single light source moving along a predefined circular trajectory, eliminating the need for multiple synchronized light sources and enabling a more compact and scalable design. We further introduce a lightweight per-pixel multi-layer neural network that directly predicts surface normals from event signals generated by intensity changes as the light source rotates, without system calibration. Experimental results on benchmark datasets and real-world data collected with our data acquisition system demonstrate the effectiveness of our method, achieving a 7.12\% reduction in mean angular error compared to existing event-based photometric stereo methods. In addition, our method demonstrates robustness in regions with sparse event activity, strong ambient illumination, and scenes affected by specularities.
Chinese Translation
光度立体视觉是一种通过在不同照明条件下捕获图像来估计表面法线的技术。然而,传统的基于帧的光度立体视觉方法由于依赖于受控照明和对环境光的敏感性,在实际应用中受到限制。为了解决这些局限性,我们提出了一种基于事件的光度立体视觉系统,该系统利用事件相机,在场景辐射连续变化和高动态范围条件下表现出色。我们的设置采用一个沿预定义圆形轨迹移动的单一光源,消除了对多个同步光源的需求,从而实现了更紧凑和可扩展的设计。我们进一步引入了一种轻量级的逐像素多层神经网络,该网络直接从光源旋转时由强度变化生成的事件信号中预测表面法线,而无需系统校准。在基准数据集和使用我们的数据采集系统收集的真实数据上的实验结果表明,我们的方法有效性显著,相较于现有的基于事件的光度立体视觉方法,平均角度误差减少了7.12%。此外,我们的方法在事件活动稀疏、强环境光照和受镜面反射影响的场景中表现出良好的鲁棒性。
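The per-pixel prediction head described in the abstract above can be sketched in a few lines. The input encoding (binned event counts over one light rotation), the layer sizes, and all variable names below are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-pixel feature: K event-count bins accumulated over one
# rotation of the light source (illustrative encoding, not the paper's).
K, H = 16, 32
x = rng.normal(size=K)

# A tiny per-pixel MLP head mapping the event feature to a surface normal.
W1, b1 = rng.normal(size=(H, K)) * 0.1, np.zeros(H)
W2, b2 = rng.normal(size=(3, H)) * 0.1, np.zeros(3)

h = np.maximum(W1 @ x + b1, 0.0)   # ReLU hidden layer
n = W2 @ h + b2
n /= np.linalg.norm(n)             # project the output onto the unit sphere
print(n.shape, np.linalg.norm(n))
```

Because the network is applied independently per pixel, it stays lightweight and needs no system calibration, matching the compact-setup goal of the abstract.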
cs.CV / 62 / 2603.10757

CodePercept: Code-Grounded Visual STEM Perception for MLLMs

CodePercept:基于代码的视觉STEM感知用于多模态大语言模型
Guan, Tongkun, Yang, Zhibo, Wan, Jianqiang, Yang, Mingkun, Guo, Zhengtao, Hu, Zijian, Luo, Ruilin, Chen, Ruize, Jiang, Songtao, Wang, Peng, Shen, Wei, Lin, Junyang, Yang, Xiaokang
Abstract
When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium--executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.
Chinese Translation
当多模态大语言模型(MLLMs)在科学、技术、工程和数学(STEM)视觉推理中失败时,一个根本性的问题随之而来:这是否由于感知缺陷或推理局限性所致?通过系统的规模分析,独立地对感知和推理组件进行扩展,我们发现了一个关键的见解:扩展感知的效果始终优于扩展推理。这揭示了感知是限制当前STEM视觉推理的真正杠杆。受到这一见解的启发,我们的工作重点是通过将代码建立为一种强大的感知媒介,系统性地增强MLLMs的感知能力——可执行代码提供了与STEM视觉的结构化特性自然对齐的精确语义。具体而言,我们构建了ICC-1M,这是一个大型数据集,包含100万个图像-标题-代码三元组,通过两种互补的方法实现这一代码作为感知的范式:(1)基于代码的标题生成将可执行代码视为图像标题的真实依据,消除了现有知识蒸馏方法中固有的幻觉;(2)STEM图像到代码翻译促使模型生成重建代码,减轻了自然语言在感知增强中的模糊性。为了验证这一范式,我们进一步引入了STEM2Code-Eval,一个新颖的基准,直接评估STEM领域的视觉感知。与现有工作依赖于问题解决准确性作为仅测量问题相关理解的代理不同,我们的基准要求通过可执行代码生成进行全面的视觉理解,以实现图像重建,提供确定性和可验证的评估。代码可在 https://github.com/TongkunGuan/Qwen-CodePercept 获取。
cs.CV / 63 / 2603.10780

Guiding Diffusion Models with Semantically Degraded Conditions

用语义降级条件引导扩散模型
Han, Shilong, Zhang, Yuming, Wang, Hongxia
Abstract
Classifier-Free Guidance (CFG) is a cornerstone of modern text-to-image models, yet its reliance on a semantically vacuous null prompt ($\varnothing$) generates a guidance signal prone to geometric entanglement. This is a key factor limiting its precision, leading to well-documented failures in complex compositional tasks. We propose Condition-Degradation Guidance (CDG), a novel paradigm that replaces the null prompt with a strategically degraded condition, $\boldsymbol{c}_{\text{deg}}$. This reframes guidance from a coarse "good vs. null" contrast to a more refined "good vs. almost good" discrimination, thereby compelling the model to capture fine-grained semantic distinctions. We find that tokens in transformer text encoders split into two functional roles: content tokens encoding object semantics, and context-aggregating tokens capturing global context. By selectively degrading only the former, CDG constructs $\boldsymbol{c}_{\text{deg}}$ without external models or training. Validated across diverse architectures including Stable Diffusion 3, FLUX, and Qwen-Image, CDG markedly improves compositional accuracy and text-image alignment. As a lightweight, plug-and-play module, it achieves this with negligible computational overhead. Our work challenges the reliance on static, information-sparse negative samples and establishes a new principle for diffusion guidance: the construction of adaptive, semantically-aware negative samples is critical to achieving precise semantic control. Code is available at https://github.com/Ming-321/Classifier-Degradation-Guidance.
Chinese Translation
无分类器引导(Classifier-Free Guidance, CFG)是现代文本到图像模型的基石,但其依赖于语义空洞的空提示($\varnothing$)生成的引导信号易受到几何纠缠的影响。这是限制其精度的一个关键因素,导致在复杂组合任务中出现了众所周知的失败。我们提出了条件降级引导(Condition-Degradation Guidance, CDG),这是一种新颖的范式,它用一个战略性降级的条件($\boldsymbol{c}_{\text{deg}}$)替代空提示。这将引导从粗略的“好与空”对比重构为更精细的“好与几乎好”区分,从而迫使模型捕捉细粒度的语义差异。我们发现,变换器文本编码器中的标记分为两种功能角色:内容标记编码对象语义,而上下文聚合标记捕捉全局上下文。通过仅选择性地降级前者,CDG在没有外部模型或训练的情况下构建了$\boldsymbol{c}_{\text{deg}}$。在包括Stable Diffusion 3、FLUX和Qwen-Image等多种架构中进行验证,CDG显著提高了组合准确性和文本-图像对齐度。作为一个轻量级的即插即用模块,它以微不足道的计算开销实现了这一点。我们的工作挑战了对静态、信息稀疏的负样本的依赖,并建立了扩散引导的新原则:构建自适应、语义感知的负样本对于实现精确的语义控制至关重要。代码可在 https://github.com/Ming-321/Classifier-Degradation-Guidance 获取。
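The contrast that CDG modifies can be made concrete with the standard classifier-free-guidance extrapolation. Per the abstract, the formula itself is unchanged; only what the negative branch is conditioned on changes (null prompt vs. a degraded condition). Function and variable names are illustrative:

```python
import numpy as np

def guided_noise(eps_cond, eps_neg, w):
    """Generic guidance extrapolation shared by CFG and CDG.

    CFG evaluates eps_neg on the null prompt; CDG instead evaluates it on
    a degraded condition c_deg (content tokens perturbed, context-
    aggregating tokens kept), sharpening the contrast direction from
    "good vs. null" to "good vs. almost good".
    """
    return eps_neg + w * (eps_cond - eps_neg)

# With guidance weight w = 1 the update reduces to the conditional branch.
eps_c = np.array([1.0, 2.0])
eps_n = np.array([0.5, 1.5])
print(guided_noise(eps_c, eps_n, 1.0))  # -> [1. 2.]
```

Larger `w` pushes the sample further along the (conditional minus negative) direction, which is why the quality of the negative branch matters.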
cs.CV / 64 / 2603.10781

Taking Shortcuts for Categorical VQA Using Super Neurons

利用超级神经元进行分类视觉问答的捷径
Musacchio, Pierre, Jeong, Jaeyi, Kim, Dahun, Park, Jaesik
Abstract
Sparse Attention Vectors (SAVs) have emerged as an excellent training-free alternative to supervised finetuning or low-rank adaptation to improve the performance of Vision Language Models (VLMs). At their heart, SAVs select a few accurate attention heads for a task of interest and use them as classifiers, rather than relying on the model's prediction. In a similar spirit, we find that directly probing the raw activations of the VLM, in the form of scalar values, is sufficient to yield accurate classifiers on diverse visually grounded downstream tasks. Shifting focus from attention vectors to scalar activations dramatically increases the search space for accurate parameters, allowing us to find more discriminative neurons immediately from the first generated token. We call such activations Super Neurons (SNs). In this probing setting, we discover that enough SNs appear in the shallower layers of the large language model to allow for extreme early exiting from the first layer of the model at the first generated token. Compared to the original network, SNs robustly improve the classification performance while achieving a speedup of up to 5.10x.
Chinese Translation
稀疏注意向量(Sparse Attention Vectors, SAVs)已成为监督微调和低秩适配之外一种优秀的免训练替代方案,用于提升视觉语言模型(Vision Language Models, VLMs)的性能。在其核心,SAVs为特定任务选择少量准确的注意力头,并将其用作分类器,而不是依赖模型的预测。以类似的思路,我们发现直接探测VLM的原始激活值(以标量形式)足以在多样的视觉相关下游任务中产生准确的分类器。将关注点从注意向量转向标量激活显著扩大了可供搜索的参数空间,使我们能够从第一个生成的标记起立即找到更具辨别力的神经元。我们将这类激活称为超级神经元(Super Neurons, SNs)。在这种探测设置中,我们发现大型语言模型的较浅层中已出现足够多的SNs,从而允许在第一个生成标记处从模型的第一层极早退出。与原始网络相比,SNs在稳健提升分类性能的同时实现了高达5.10倍的加速。
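A minimal sketch of the probing idea in the abstract above: treat each scalar activation as a one-dimensional classifier and keep the most discriminative one. The synthetic activations and the midpoint-threshold scoring below are illustrative assumptions, not the paper's procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cached activations: n samples x d neurons, taken at the
# first generated token of a shallow layer (names are illustrative).
n, d = 200, 64
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d))
acts[:, 7] += 3.0 * labels  # one "super neuron" carries the class signal

# Score each scalar activation by how well a midpoint threshold between
# the class means separates the labels; keep the best neuron.
mu0 = acts[labels == 0].mean(axis=0)
mu1 = acts[labels == 1].mean(axis=0)
thr = (mu0 + mu1) / 2.0
preds = (acts > thr) == (mu1 > mu0)          # per-neuron predictions
accs = (preds == labels[:, None]).mean(axis=0)
best = int(np.argmax(accs))
print(best, accs[best])
```

Because only scalar reads and a threshold are involved, the probe can run at the first layer and first token, which is what enables the extreme early exit the abstract reports.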
cs.CV / 65 / 2603.10782

Phase-Interface Instance Segmentation as a Visual Sensor for Laboratory Process Monitoring

作为实验室过程监测视觉传感器的相界实例分割
Li, Mingyue, Yang, Xin, Yan, Shilin, Ran, Jinye, Zhu, Morui, Peng, Zirui, Peng, Huanqing, Peng, Wei, Zhang, Guanghua, Li, Shuo, Zhang, Hao
Abstract
Reliable visual monitoring of chemical experiments remains challenging in transparent glassware, where weak phase boundaries and optical artifacts degrade conventional segmentation. We formulate laboratory phenomena as the time evolution of phase interfaces and introduce the Chemical Transparent Glasses dataset 2.0 (CTG 2.0), a vessel-aware benchmark with 3,668 images, 23 glassware categories, and five multiphase interface types for phase-interface instance segmentation. Building on YOLO11m-seg, we propose LGA-RCM-YOLO, which combines Local-Global Attention (LGA) for robust semantic representation and a Rectangular Self-Calibration Module (RCM) for boundary refinement of thin, elongated interfaces. On CTG 2.0, the proposed model achieves 84.4% mAP@0.5 and 58.43% mAP@0.5:0.95, improving over the YOLO11m baseline by 6.42 and 8.75 AP points, respectively, while maintaining near real-time inference (13.67 FPS, RTX 3060). An auxiliary color-attribute head further labels liquid instances as colored or colorless with 98.71% precision and 98.32% recall. Finally, we demonstrate continuous process monitoring in separatory-funnel phase separation and crystallization, showing that phase-interface instance segmentation can serve as a practical visual sensor for laboratory automation.
Chinese Translation
在透明玻璃器皿中,化学实验的可靠视觉监测仍然面临挑战,微弱的相边界和光学伪影降低了传统分割的效果。我们将实验室现象表述为相界面的时间演化,并引入化学透明玻璃数据集2.0(Chemical Transparent Glasses dataset 2.0,CTG 2.0),这是一个包含3,668张图像、23个玻璃器皿类别和五种多相界面类型的容器感知基准,用于相界实例分割。在YOLO11m-seg的基础上,我们提出了LGA-RCM-YOLO,该模型结合了局部-全局注意力(Local-Global Attention,LGA)以实现稳健的语义表示,并引入了矩形自校准模块(Rectangular Self-Calibration Module,RCM)以细化细长界面的边界。在CTG 2.0上,所提出的模型在mAP@0.5上达到了84.4%,在mAP@0.5:0.95上达到了58.43%,分别比YOLO11m基线提高了6.42和8.75个AP点,同时保持近实时推理(13.67 FPS,RTX 3060)。辅助颜色属性头进一步将液体实例标记为有色或无色,精度达到98.71%,召回率为98.32%。最后,我们展示了在分液漏斗相分离和结晶过程中的连续过程监测,表明相界实例分割可以作为实验室自动化的实用视觉传感器。
cs.CV / 66 / 2603.10785

The Quadratic Geometry of Flow Matching: Semantic Granularity Alignment for Text-to-Image Synthesis

流匹配的二次几何:文本到图像合成的语义粒度对齐
Xiong, Zhinan, Yuan, Shunqi
Abstract
In this work, we analyze the optimization dynamics of generative fine-tuning. We observe that under the Flow Matching framework, the standard MSE objective can be formulated as a Quadratic Form governed by a dynamically evolving Neural Tangent Kernel (NTK). This geometric perspective reveals a latent Data Interaction Matrix, where diagonal terms represent independent sample learning and off-diagonal terms encode residual correlation between heterogeneous features. Although standard training implicitly optimizes these cross-term interferences, it does so without explicit control; moreover, the prevailing data-homogeneity assumption may constrain the model's effective capacity. Motivated by this insight, we propose Semantic Granularity Alignment (SGA), using Text-to-Image synthesis as a testbed. SGA engineers targeted interventions in the vector residual field to mitigate gradient conflicts. Evaluations across DiT and U-Net architectures confirm that SGA advances the efficiency-quality trade-off by accelerating convergence and improving structural integrity.
Chinese Translation
在本研究中,我们分析了生成微调的优化动态。我们观察到,在流匹配(Flow Matching)框架下,标准均方误差(MSE)目标可以被表述为一个由动态演变的神经切线核(Neural Tangent Kernel, NTK)所支配的二次形式。这种几何视角揭示了一个潜在的数据交互矩阵,其中对角项代表独立样本学习,而非对角项则编码异构特征之间的残余相关性。尽管标准训练隐式地优化了这些交叉项干扰,但并没有进行明确控制;此外,现有的数据同质性假设可能限制了模型的有效容量。基于这一洞察,我们提出了语义粒度对齐(Semantic Granularity Alignment, SGA),并以文本到图像合成为测试平台。SGA在向量残差场中进行有针对性的干预,以减轻梯度冲突。对DiT和U-Net架构的评估确认,SGA通过加速收敛和改善结构完整性,提升了效率与质量的权衡。
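One way to make the abstract's "quadratic form" concrete, under the usual batch flow-matching MSE loss and a first-order (NTK) linearization of gradient-flow training. The notation below is a standard sketch of this setup, not necessarily the paper's exact formulation:

```latex
% Batch flow-matching loss and per-sample residuals
L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\bigl\|v_\theta(x_i,t_i)-u_i\bigr\|^2,
\qquad r_i := v_\theta(x_i,t_i)-u_i .
% Gradient flow with a first-order expansion gives NTK dynamics
\frac{\mathrm{d}r_i}{\mathrm{d}\tau}
  = -\frac{2}{N}\sum_{j=1}^{N} K_{ij}\, r_j ,
\qquad
K_{ij} := \nabla_\theta v_\theta(x_i,t_i)\,
          \nabla_\theta v_\theta(x_j,t_j)^{\top},
% so the loss decays as a quadratic form in the residuals:
\frac{\mathrm{d}L}{\mathrm{d}\tau}
  = -\frac{4}{N^{2}}\, r^{\top} K\, r .
```

In this picture the diagonal terms \(K_{ii}\) drive independent per-sample learning, while the off-diagonal \(K_{ij}\) are exactly the cross-sample interference terms that the abstract says SGA intervenes on.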
cs.CV / 67 / 2603.10801

PolGS++: Physically-Guided Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction

PolGS++:物理引导的极化高斯点云快速反射表面重建
Han, Yufei, Zhou, Chu, Lyu, Youwei, Chen, Qi, Li, Si, Shi, Boxin, Jia, Yunpeng, Guo, Heng, Ma, Zhanyu
Abstract
Accurate reconstruction of reflective surfaces remains a fundamental challenge in computer vision, with broad applications in real-time virtual reality and digital content creation. Although 3D Gaussian Splatting (3DGS) enables efficient novel-view rendering with explicit representations, its performance on reflective surfaces still lags behind implicit neural methods, especially in recovering fine geometry and surface normals. To address this gap, we propose PolGS++, a physically-guided polarimetric Gaussian Splatting framework for fast reflective surface reconstruction. Specifically, we integrate a polarized BRDF (pBRDF) model into 3DGS to explicitly decouple diffuse and specular components, providing physically grounded reflectance modeling and stronger geometric cues for reflective surface recovery. Furthermore, we introduce a depth-guided visibility mask acquisition mechanism that enables angle-of-polarization (AoP)-based tangent-space consistency constraints in Gaussian Splatting without costly ray-tracing intersections. This physically guided design improves reconstruction quality and efficiency, requiring only about 10 minutes of training. Extensive experiments on both synthetic and real-world datasets validate the effectiveness of our method.
Chinese Translation
反射表面的准确重建仍然是计算机视觉中的一个基本挑战,广泛应用于实时虚拟现实和数字内容创作。尽管三维高斯点云(3D Gaussian Splatting,3DGS)能够通过显式表示实现高效的新视角渲染,但其在反射表面上的性能仍然落后于隐式神经方法,特别是在恢复细微几何形状和表面法线方面。为了解决这一问题,我们提出了PolGS++,一种物理引导的极化高斯点云框架,用于快速反射表面重建。具体而言,我们将极化BRDF(pBRDF)模型集成到3DGS中,以显式地解耦漫反射和镜面反射成分,提供物理基础的反射建模和更强的几何线索以恢复反射表面。此外,我们引入了一种深度引导的可见性掩模获取机制,使得在高斯点云中实现基于极化角(AoP)的切线空间一致性约束,而无需昂贵的光线追踪交点计算。这种物理引导的设计提高了重建质量和效率,仅需约10分钟的训练时间。在合成和真实世界数据集上的大量实验验证了我们方法的有效性。
cs.CV / 68 / 2603.10806

Backdoor Directions in Vision Transformers

视觉变换器中的后门方向
Karayalcin, Sengim, Krcek, Marina, Chen, Pin-Yu, Picek, Stjepan
Abstract
This paper investigates how Backdoor Attacks are represented within Vision Transformers (ViTs). By assuming knowledge of the trigger, we identify a specific ``trigger direction'' in the model's activations that corresponds to the internal representation of the trigger. We confirm the causal role of this linear direction by showing that interventions in both activation and parameter space consistently modulate the model's backdoor behavior across multiple datasets and attack types. Using this direction as a diagnostic tool, we trace how backdoor features are processed across layers. Our analysis reveals distinct qualitative differences: static-patch triggers follow a different internal logic than stealthy, distributed triggers. We further examine the link between backdoors and adversarial attacks, specifically testing whether PGD-based perturbations (de-)activate the identified trigger mechanism. Finally, we propose a data-free, weight-based detection scheme for stealthy-trigger attacks. Our findings show that mechanistic interpretability offers a robust framework for diagnosing and addressing security vulnerabilities in computer vision.
Chinese Translation
本文研究了后门攻击在视觉变换器(ViTs)内部是如何被表示的。在假设已知触发器的前提下,我们在模型激活中识别出一个特定的“触发器方向”,它对应于触发器的内部表示。我们通过展示在激活空间和参数空间中的干预能够在多个数据集和攻击类型上一致地调节模型的后门行为,确认了这一线性方向的因果作用。利用这一方向作为诊断工具,我们追踪了后门特征在各层中的处理方式。我们的分析揭示了明显的定性差异:静态补丁触发器遵循与隐蔽分布式触发器不同的内部逻辑。我们进一步考察了后门与对抗攻击之间的联系,特别测试基于PGD的扰动是否会(去)激活所识别的触发机制。最后,我们提出了一种无需数据、基于权重的隐蔽触发器攻击检测方案。我们的研究结果表明,机制可解释性为诊断和解决计算机视觉中的安全漏洞提供了一个稳健的框架。
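A toy numerical sketch of the "trigger direction" idea from the abstract above: estimate the direction as the mean activation difference between triggered and clean inputs, then intervene causally by projecting it out. The synthetic activations stand in for real ViT activations; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic residual-stream activations at one layer (d-dimensional).
d = 32
v_true = rng.normal(size=d)
v_true /= np.linalg.norm(v_true)          # the "true" trigger direction

clean = rng.normal(size=(100, d))
triggered = clean + 4.0 * v_true          # trigger adds a fixed direction

# Estimate the trigger direction as the normalized mean difference.
v = triggered.mean(axis=0) - clean.mean(axis=0)
v /= np.linalg.norm(v)

# Activation-space intervention: project the direction out.
ablated = triggered - (triggered @ v)[:, None] * v[None, :]

print(abs(v @ v_true))                    # alignment with the true direction
print(np.abs(ablated @ v_true).max())     # trigger component after ablation
```

The same projection logic can be applied in parameter space; the abstract's point is that modulating this single linear direction consistently turns the backdoor behavior on or off.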
cs.CV / 69 / 2603.10814

HanMoVLM: Large Vision-Language Models for Professional Artistic Painting Evaluation

HanMoVLM:用于专业艺术绘画评估的大型视觉-语言模型
Yang, Hongji, Zhou, Yucheng, Han, Wencheng, Li, Songlian, Zhao, Xiaotong, Shen, Jianbing
Abstract
While Large Vision-Language Models (VLMs) demonstrate impressive general visual capabilities, they remain artistically blind and unable to offer professional evaluation of artworks within specific artistic domains like human experts. To bridge this gap, we transform VLMs into experts capable of professional-grade painting evaluation in the Chinese Artistic Domain, which is more abstract and demands extensive artistic training for evaluation. We introduce HanMo-Bench, a new dataset that features authentic auction-grade masterpieces and AI-generated works, grounded in real-world market valuations. To realize the rigorous judgment, we propose the HanMoVLM and construct a Chain-of-Thought (CoT) validated by experts. This CoT guides the model to perform expert-level reasoning: from content identification and Region of Interest (RoI) localization to professional evaluation, guided by both theme-specific evaluation and typical three-tier evaluation in Chinese paintings. Furthermore, we design a reward function to refine the reasoning process of the HanMoVLM to improve the accuracy. We demonstrate that HanMoVLM can serve as a critical backbone for Test-time Scaling in image generation. By acting as a high-quality verifier, HanMoVLM enables generative models to select the most artistically superior outputs from multiple candidates. Experimental results and human studies confirm that the proposed HanMoVLM effectively bridges the gap, achieving a high consistency with professional experts and significantly improving the quality of Chinese Painting generation.
Chinese Translation
虽然大型视觉-语言模型(VLMs)展示了令人印象深刻的通用视觉能力,但它们在艺术上仍然是盲目的,无法像人类专家那样在特定艺术领域提供专业的艺术作品评估。为了弥补这一差距,我们将VLMs转变为能够在中国艺术领域进行专业级绘画评估的专家,该领域更为抽象,并且需要广泛的艺术训练才能进行评估。我们引入了HanMo-Bench,一个新的数据集,包含真实的拍卖级杰作和AI生成的作品,基于现实世界的市场估值。为了实现严格的判断,我们提出了HanMoVLM,并构建了一个经过专家验证的思维链(Chain-of-Thought, CoT)。该CoT指导模型进行专家级推理:从内容识别和兴趣区域(Region of Interest, RoI)定位到专业评估,遵循主题特定的评估和中国绘画中典型的三层评估。此外,我们设计了一个奖励函数,以优化HanMoVLM的推理过程,提高准确性。我们展示了HanMoVLM可以作为图像生成中测试时扩展(Test-time Scaling)的关键支撑。通过充当高质量的验证者,HanMoVLM使生成模型能够从多个候选中选择艺术上最优越的输出。实验结果和人类研究证实,所提出的HanMoVLM有效地弥补了这一差距,实现了与专业专家的高度一致性,并显著提高了中国绘画生成的质量。
cs.CV / 70 / 2603.10825

A dataset of medication images with instance segmentation masks for preventing adverse drug events

一个包含实例分割掩膜的药物图像数据集,用于预防不良药物事件
Chu, W. I., Hirani, S., Tarroni, G., Li, L.
Abstract
Medication errors and adverse drug events (ADEs) pose significant risks to patient safety, often arising from difficulties in reliably identifying pharmaceuticals in real-world settings. AI-based pill recognition models offer a promising solution, but the lack of comprehensive datasets hinders their development. Existing pill image datasets rarely capture real-world complexities such as overlapping pills, varied lighting, and occlusions. MEDISEG addresses this gap by providing instance segmentation annotations for 32 distinct pill types across 8262 images, encompassing diverse conditions from individual pill images to cluttered dosette boxes. We trained YOLOv8 and YOLOv9 on MEDISEG to demonstrate their usability, achieving mean average precision at IoU 0.5 of 99.5 percent on the 3-Pills subset and 80.1 percent on the 32-Pills subset. We further evaluate MEDISEG under a few-shot detection protocol, demonstrating that base training on MEDISEG significantly improves recognition of unseen pill classes in occluded multi-pill scenarios compared to existing datasets. These results highlight the dataset's ability not only to support robust supervised training but also to promote transferable representations under limited supervision, making it a valuable resource for developing and benchmarking AI-driven systems for medication safety.
Chinese Translation
药物错误和不良药物事件(ADEs)对患者安全构成重大风险,通常源于在现实环境中可靠识别药物的困难。基于人工智能的药丸识别模型提供了一种有前景的解决方案,但缺乏全面的数据集阻碍了其发展。现有的药丸图像数据集很少捕捉到现实世界中的复杂性,例如重叠药丸、不同的光照条件和遮挡。MEDISEG通过为32种不同的药丸类型提供实例分割注释,填补了这一空白,共包含8262张图像,涵盖从单个药丸图像到杂乱的药盒的多种条件。我们在MEDISEG上训练了YOLOv8和YOLOv9,以展示其可用性,在3-Pills子集上实现了99.5%的IoU 0.5平均精度,在32-Pills子集上实现了80.1%的平均精度。我们进一步在少样本检测协议下评估MEDISEG,结果表明,与现有数据集相比,在遮挡的多药丸场景中,基于MEDISEG的基础训练显著提高了对未见药丸类别的识别能力。这些结果突显了该数据集不仅能够支持强大的监督训练,还能在有限监督下促进可迁移表示,使其成为开发和基准测试药物安全相关的人工智能驱动系统的重要资源。
cs.CV / 71 / 2603.10828

BALD-SAM: Disagreement-based Active Prompting in Interactive Segmentation

BALD-SAM:基于不一致性的交互分割主动提示
Chowdhury, Prithwijit, Prabhushankar, Mohit, AlRegib, Ghassan
Abstract
The Segment Anything Model (SAM) has revolutionized interactive segmentation through spatial prompting. While existing work primarily focuses on automating prompts in various settings, real-world annotation workflows involve iterative refinement where annotators observe model outputs and strategically place prompts to resolve ambiguities. Current pipelines typically rely on the annotator's visual assessment of the predicted mask quality. We postulate that a principled approach for automated interactive prompting is to use a model-derived criterion to identify the most informative region for the next prompt. In this work, we establish active prompting: a spatial active learning approach where locations within images constitute an unlabeled pool and prompts serve as queries to prioritize information-rich regions, increasing the utility of each interaction. We further present BALD-SAM: a principled framework adapting Bayesian Active Learning by Disagreement (BALD) to spatial prompt selection by quantifying epistemic uncertainty. To do so, we freeze the entire model and apply Bayesian uncertainty modeling only to a small learned prediction head, making intractable uncertainty estimation practical for large multi-million parameter foundation models. Across 16 datasets spanning natural, medical, underwater, and seismic domains, BALD-SAM demonstrates strong cross-domain performance, ranking first or second on 14 of 16 benchmarks. We validate these gains through a comprehensive ablation suite covering 3 SAM backbones and 35 Laplace posterior configurations, amounting to 38 distinct ablation settings. Beyond strong average performance, BALD-SAM surpasses human prompting and, in several categories, even oracle prompting, while consistently outperforming one-shot baselines in final segmentation quality, particularly on thin and structurally complex objects.
Chinese Translation
Segment Anything Model (SAM) 通过空间提示彻底改变了交互分割。虽然现有研究主要集中于在各种环境中自动化提示,但现实世界的标注工作流程涉及迭代优化,标注者观察模型输出并战略性地放置提示以解决模糊性。目前的流程通常依赖于标注者对预测掩膜质量的视觉评估。我们假设,自动交互提示的原则性方法是使用模型衍生的标准来识别下一个提示的最有信息量区域。在本研究中,我们建立了主动提示:一种空间主动学习方法,其中图像中的位置构成未标记池,提示作为查询以优先考虑信息丰富的区域,从而增加每次交互的效用。我们进一步提出了 BALD-SAM:一个将基于不一致性的贝叶斯主动学习 (Bayesian Active Learning by Disagreement, BALD) 适应于空间提示选择的原则性框架,通过量化认知不确定性来实现。为此,我们冻结整个模型,仅对一个小的学习预测头应用贝叶斯不确定性建模,使得对于大型数百万参数的基础模型的不可处理的不确定性估计变得可行。在涵盖自然、医学、水下和地震领域的16个数据集上,BALD-SAM 展现出强大的跨领域性能,在16个基准中有14个排名第一或第二。我们通过全面的消融实验验证了这些提升,涵盖了3个 SAM 主干网络和35个拉普拉斯后验配置,共计38个不同的消融设置。除了强大的平均性能外,BALD-SAM 超越了人类提示,并且在多个类别中甚至超过了神谕提示,同时在最终分割质量上始终优于一次性基线,特别是在细小和结构复杂的物体上。
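The BALD acquisition that the abstract adapts to spatial prompting has a standard closed form over Monte-Carlo posterior samples: the mutual information I = H(E[p]) − E[H(p)]. A minimal per-pixel sketch for binary masks follows; the array shapes and names are illustrative, not BALD-SAM's actual API:

```python
import numpy as np

def bald_map(prob_samples, eps=1e-9):
    """BALD epistemic-uncertainty map for binary segmentation.

    prob_samples: (S, H, W) foreground probabilities from S posterior
    draws (e.g., samples from a Laplace posterior over a small
    prediction head, with the rest of the model frozen).
    Returns I = H(E[p]) - E[H(p)] per pixel; the argmax is the
    candidate location for the next prompt.
    """
    def entropy(p):
        return -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))
    mean_p = prob_samples.mean(axis=0)
    return entropy(mean_p) - entropy(prob_samples).mean(axis=0)

# Pixels where posterior samples disagree get high BALD; confidently-on
# or confidently-off pixels score near zero.
samples = np.array([[[0.95, 0.05, 0.9]],
                    [[0.95, 0.05, 0.1]]])   # (S=2, H=1, W=3)
scores = bald_map(samples)
idx = np.unravel_index(np.argmax(scores), scores.shape)
print(idx, scores)
```

Subtracting the expected entropy removes aleatoric uncertainty, so the map highlights regions where the *model* is unsure, which is what makes the selected prompt informative rather than merely ambiguous.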
cs.CV / 72 / 2603.10833

Evaluating Few-Shot Pill Recognition Under Visual Domain Shift

评估视觉领域转移下的少样本药丸识别
Chu, W. I., Tarroni, G., Li, L.
Abstract
Adverse drug events are a significant source of preventable harm, which has led to the development of automated pill recognition systems to enhance medication safety. Real-world deployment of these systems is hindered by visually complex conditions, including cluttered scenes, overlapping pills, reflections, and diverse acquisition environments. This study investigates few-shot pill recognition from a deployment-oriented perspective, prioritizing generalization under realistic cross-dataset domain shifts over architectural innovation. A two-stage object detection framework is employed, involving base training followed by few-shot fine-tuning. Models are adapted to novel pill classes using one, five, or ten labeled examples per class and are evaluated on a separate deployment dataset featuring multi-object, cluttered scenes. The evaluation focuses on classification-centric and error-based metrics to address heterogeneous annotation strategies. Findings indicate that semantic pill recognition adapts rapidly with few-shot supervision, with classification performance reaching saturation even with a single labeled example. However, stress testing under overlapping and occluded conditions demonstrates a marked decline in localization and recall, despite robust semantic classification. Models trained on visually realistic, multi-pill data consistently exhibit greater robustness in low-shot scenarios, underscoring the importance of training data realism and the diagnostic utility of few-shot fine-tuning for deployment readiness.
Chinese Translation
不良药物事件是可预防伤害的重要来源,这促使了自动药丸识别系统的发展,以增强用药安全。这些系统在现实世界中的部署受到视觉复杂条件的阻碍,包括杂乱的场景、重叠的药丸、反射以及多样的获取环境。本研究从部署导向的角度调查少样本药丸识别,优先考虑在现实的跨数据集领域转移下的泛化能力,而非架构创新。采用了一个两阶段的目标检测框架,包括基础训练和随后进行的少样本微调。模型通过每个类别提供一个、五个或十个标记示例来适应新的药丸类别,并在一个包含多对象和杂乱场景的单独部署数据集上进行评估。评估侧重于以分类为中心和基于错误的指标,以应对异质注释策略。研究结果表明,语义药丸识别在少样本监督下能够迅速适应,分类性能甚至在仅有一个标记示例的情况下也能达到饱和。然而,在重叠和遮挡条件下的压力测试显示,尽管语义分类表现强劲,定位和召回率却显著下降。基于视觉上真实的多药丸数据训练的模型在低样本场景中表现出更大的鲁棒性,强调了训练数据真实性的重要性以及少样本微调在部署准备中的诊断效用。
cs.CV / 73 / 2603.10834

On the Reliability of Cue Conflict and Beyond

关于线索冲突的可靠性及其超越
Kim, Pum Jun, Lee, Seung-Ah, Park, Seongho, Han, Dongyoon, Yoo, Jaejun
Abstract
Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes. The cue-conflict benchmark has been influential in probing shape-texture preference and in motivating the insight that stronger, human-like shape bias is often associated with improved in-domain performance. However, we find that the current stylization-based instantiation can yield unstable and ambiguous bias estimates. Specifically, stylization may not reliably instantiate perceptually valid and separable cues nor control their relative informativeness, ratio-based bias can obscure absolute cue sensitivity, and restricting evaluation to preselected classes can distort model predictions by ignoring the full decision space. Together, these factors can confound preference with cue validity, cue balance, and recognizability artifacts. We introduce REFINED-BIAS, an integrated dataset and evaluation framework for reliable and interpretable shape-texture bias diagnosis. REFINED-BIAS constructs balanced, human- and model- recognizable cue pairs using explicit definitions of shape and texture, and measures cue-specific sensitivity over the full label space via a ranking-based metric, enabling fairer cross-model comparisons. Across diverse training regimes and architectures, REFINED-BIAS enables fairer cross-model comparison, more faithful diagnosis of shape and texture biases, and clearer empirical conclusions, resolving inconsistencies that prior cue-conflict evaluations could not reliably disambiguate.
Chinese Translation
理解神经网络如何依赖视觉线索为其内部决策过程提供了一个人类可解释的视角。线索冲突基准在探讨形状-纹理偏好方面具有重要影响,并激发了这样的见解:更强的类人形状偏见通常与改进的领域内表现相关。然而,我们发现当前基于风格化的实例化可能导致不稳定和模糊的偏见估计。具体而言,风格化可能无法可靠地实例化感知上有效且可分离的线索,也无法控制其相对信息量,基于比率的偏见可能掩盖绝对线索敏感性,而将评估限制在预选类别内可能通过忽视完整决策空间来扭曲模型预测。综合来看,这些因素可能将偏好与线索有效性、线索平衡和可识别性伪影混淆。我们提出了REFINED-BIAS,这是一个集成的数据集和评估框架,用于可靠且可解释的形状-纹理偏见诊断。REFINED-BIAS使用形状和纹理的明确定义构建平衡的、可被人类和模型识别的线索对,并通过基于排名的度量在完整标签空间内测量特定线索的敏感性,从而实现更公平的跨模型比较。在多种训练机制和架构下,REFINED-BIAS实现了更公平的跨模型比较,更真实的形状和纹理偏见诊断,以及更清晰的实证结论,解决了之前线索冲突评估无法可靠消解的不一致问题。
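The ranking-based idea behind REFINED-BIAS, measuring each cue's sensitivity over the full label space rather than as a ratio over two preselected classes, can be illustrated as follows. The reciprocal-rank form and the function names are our assumptions; the paper's exact metric may differ:

```python
import numpy as np

def cue_sensitivity(logits, cue_label):
    """Ranking-based sensitivity of one cue over the FULL label space.

    Unlike ratio-based cue-conflict bias (which only compares two
    preselected classes), this uses the rank of the cue's label among
    all classes: rank 1 -> sensitivity 1.0, last rank -> ~0.
    """
    order = np.argsort(-np.asarray(logits))          # descending
    rank = int(np.where(order == cue_label)[0][0]) + 1
    return 1.0 / rank

def shape_texture_bias(logits, shape_label, texture_label):
    """Relative shape preference: 1.0 = pure shape, 0.0 = pure texture."""
    s = cue_sensitivity(logits, shape_label)
    t = cue_sensitivity(logits, texture_label)
    return s / (s + t)
```

Because the rank is taken over every class, a model that recognizes neither cue well is exposed (both sensitivities are low), whereas a two-class ratio would still force a confident-looking bias value.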
cs.CV / 74 / 2603.10852

UltrasoundAgents: Hierarchical Multi-Agent Evidence-Chain Reasoning for Breast Ultrasound Diagnosis

超声代理:用于乳腺超声诊断的分层多代理证据链推理
Zhu, Yali, Zhou, Kang, Wu, Dingbang, Meng, Gaofeng
Abstract
Breast ultrasound diagnosis typically proceeds from global lesion localization to local sign assessment and then evidence integration to assign a BI-RADS category and determine benignity or malignancy. Many existing methods rely on end-to-end prediction or provide only weakly grounded evidence, which can miss fine-grained lesion cues and limit auditability and clinical review. To align with the clinical workflow and improve evidence traceability, we propose a hierarchical multi-agent framework, termed UltrasoundAgents. A main agent localizes the lesion in the full image and triggers a crop-and-zoom operation. A sub-agent analyzes the local view and predicts four clinically relevant attributes, namely echogenicity pattern, calcification, boundary type, and edge (margin) morphology. The main agent then integrates these structured attributes to perform evidence-based reasoning and output the BI-RADS category and the malignancy prediction, while producing reviewable intermediate evidence. Furthermore, hierarchical multi-agent training often suffers from error propagation, difficult credit assignment, and sparse rewards. To alleviate this and improve training stability, we introduce a decoupled progressive training strategy. We first train the attribute agent, then train the main agent with oracle attributes to learn robust attribute-based reasoning, and finally apply corrective trajectory self-distillation with spatial supervision to build high-quality trajectories for supervised fine-tuning, yielding a deployable end-to-end policy. Experiments show consistent gains over strong vision-language baselines in diagnostic accuracy and attribute agreement, together with structured evidence and traceable reasoning.
Chinese Translation
乳腺超声诊断通常从全局病灶定位开始,接着进行局部征象评估,然后整合证据以分配 BI-RADS 分类并确定良性或恶性。许多现有方法依赖于端到端预测或仅提供弱基础的证据,这可能会错过细粒度的病灶线索,并限制审计性和临床审查。为了与临床工作流程对齐并改善证据可追溯性,我们提出了一种分层多代理框架,称为 UltrasoundAgents。主代理在完整图像中定位病灶并触发裁剪和放大操作。子代理分析局部视图并预测四个临床相关属性,即回声特征模式、钙化、边界类型和边缘(轮廓)形态。然后,主代理整合这些结构化属性进行基于证据的推理,并输出 BI-RADS 分类和恶性预测,同时生成可审查的中间证据。此外,分层多代理训练通常会遭遇错误传播、困难的信用分配和稀疏奖励。为了缓解这些问题并提高训练稳定性,我们引入了一种解耦的渐进训练策略。我们首先训练属性代理,然后使用真值(oracle)属性训练主代理以学习稳健的基于属性的推理,最后应用带空间监督的纠正轨迹自蒸馏,构建高质量的轨迹以进行监督微调,从而产生可部署的端到端策略。实验表明,在诊断准确性和属性一致性方面,相较于强大的视觉-语言基线,取得了一致的提升,同时提供了结构化证据和可追溯的推理。
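The two-level workflow (global localization, crop-and-zoom, attribute sub-agent, evidence integration) reduces to a short control loop. The sketch below is structural only: the agents are passed in as callables, since the paper's actual models are VLM-based:

```python
import numpy as np

def crop_and_zoom(image, box, pad=0.1):
    """Crop the lesion box with a small margin: the 'zoom' step the
    main agent triggers before handing the local view to the sub-agent."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    dx, dy = (x1 - x0) * pad, (y1 - y0) * pad
    x0, y0 = max(0, int(x0 - dx)), max(0, int(y0 - dy))
    x1, y1 = min(w, int(x1 + dx)), min(h, int(y1 + dy))
    return image[y0:y1, x0:x1]

def diagnose(image, localize, assess_attributes, integrate):
    """Hierarchical evidence chain: the main agent localizes, the
    sub-agent reads the local signs, the main agent integrates them.
    Every intermediate result is returned, which is what makes the
    chain reviewable/auditable."""
    box = localize(image)                                  # global view
    attrs = assess_attributes(crop_and_zoom(image, box))   # local view
    birads, malignant = integrate(attrs)                   # reasoning
    return {"box": box, "attributes": attrs,
            "birads": birads, "malignant": malignant}
```

Keeping the box, the four attributes, and the final call in one record is the "evidence chain": a reviewer can audit any intermediate step instead of trusting an end-to-end label.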
cs.CV / 75 / 2603.10863

Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding

超越序列距离:跨模态距离不变位置编码
Chen, Lin, Ni, Bolin, Yang, Qi, Wang, Zili, Ding, Kun, Wang, Ying, Peng, Houwen, Xiang, Shiming
Abstract
Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes as the text sequence lengthens, leading to text generation detached from visual constraints. We attribute this degradation to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases. To address this, we propose inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This strategy effectively mitigates the inter-modal distance-based penalty, ensuring that visual signals remain perceptually consistent regardless of the context length. Experimental results demonstrate that by integrating DIPE with Multimodal RoPE, the model maintains stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks. Code is available at https://github.com/lchen1019/DIPE.
Chinese Translation
尽管多模态大型语言模型(MLLMs)具有显著的能力,但在长上下文场景中仍然面临视觉衰退的问题。具体而言,随着文本序列的延长,对视觉标记的关注程度降低,导致文本生成与视觉约束脱节。我们将这种退化归因于多模态 RoPE 的固有归纳偏差,该偏差在视觉标记与文本标记之间的距离增加时惩罚跨模态注意力。为了解决这一问题,我们提出了跨模态距离不变位置编码(DIPE),这是一种简单但有效的机制,基于模态交互解耦位置编码。DIPE 保留了模态内交互的自然相对位置,以保持局部结构,同时对跨模态交互施加锚定的感知接近性。这一策略有效减轻了基于跨模态距离的惩罚,确保视觉信号在上下文长度变化时保持感知一致性。实验结果表明,通过将 DIPE 与多模态 RoPE 结合,模型在长上下文场景中保持稳定的视觉接地(visual grounding),显著减轻了视觉衰退,同时在标准短上下文基准测试中保持了性能。代码可在 https://github.com/lchen1019/DIPE 获取。
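The core rule, keep intra-modal relative offsets natural but pin inter-modal offsets to a fixed anchor so vision-text attention is not distance-penalized as the context grows, can be sketched over the relative-position matrix. The sign-preserving anchoring below is our simplification of the paper's mechanism:

```python
import numpy as np

def dipe_relative_positions(modalities, anchor=1):
    """Relative-position matrix under a DIPE-style rule.

    modalities: per-token modality ids (e.g. 0 = vision, 1 = text).
    Intra-modal pairs keep their natural sequential offset (preserving
    local structure); inter-modal pairs are pinned to a constant
    'anchor' distance, so a visual token never drifts far from the
    text in position space, however long the text grows. The sign of
    the offset is kept so direction information survives.
    """
    m = np.asarray(modalities)
    idx = np.arange(len(m))
    rel = idx[None, :] - idx[:, None]        # standard RoPE offsets
    inter = m[None, :] != m[:, None]         # cross-modality mask
    return np.where(inter, np.sign(rel) * anchor, rel)
```

Under plain RoPE, `rel[text_j, vision_i]` grows with the amount of intervening text, which is exactly the distance penalty the abstract blames for visual fading; here that entry stays at `anchor` regardless of context length.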
cs.CV / 76 / 2603.10872

Bilevel Layer-Positioning LoRA for Real Image Dehazing

双层位置调整的 LoRA 用于真实图像去雾
Zhang, Yan, Ma, Long, Feng, Yuxin, Huang, Zhe, Zhou, Fan, Su, Zhuo
Abstract
Learning-based real image dehazing methods have achieved notable progress, yet they still face adaptation challenges in diverse real haze scenes. These challenges mainly stem from the lack of effective unsupervised mechanisms for unlabeled data and the heavy cost of full model fine-tuning. To address these challenges, we propose the haze-to-clear text-directed loss that leverages CLIP's cross-modal capabilities to reformulate real image dehazing as a semantic alignment problem in latent space, thereby providing explicit unsupervised cross-modal guidance in the absence of reference images. Furthermore, we introduce the Bilevel Layer-positioning LoRA (BiLaLoRA) strategy, which jointly learns the LoRA parameters and automatically searches for the injection layers, enabling targeted adaptation of critical network layers. Extensive experiments demonstrate our method's superiority over state-of-the-art methods on multiple real-world dehazing benchmarks. The code is publicly available at https://github.com/YanZhang-zy/BiLaLoRA.
Chinese Translation
基于学习的真实图像去雾方法已经取得了显著进展,但在多样化的真实雾霾场景中仍面临适应性挑战。这些挑战主要源于缺乏有效的无监督机制来处理未标记数据,以及全模型微调的高成本。为了解决这些挑战,我们提出了一种雾霾到清晰的文本导向损失,该损失利用 CLIP 的跨模态能力,将真实图像去雾重新表述为潜在空间中的语义对齐问题,从而在缺乏参考图像的情况下提供明确的无监督跨模态指导。此外,我们引入了双层位置调整的 LoRA(BiLaLoRA)策略,该策略同时学习 LoRA 参数并自动搜索注入层,实现对关键网络层的针对性适应。大量实验表明,我们在多个真实世界去雾基准测试中优于最先进的方法。代码已公开发布在 https://github.com/YanZhang-zy/BiLaLoRA。
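A minimal way to see "learning the LoRA parameters while searching the injection layers" is a per-layer gate on the low-rank update: the inner level fits the factors A and B, while the outer level scores layers through the gates and keeps only the most useful ones. This is our illustrative sketch, not the paper's bilevel algorithm:

```python
import numpy as np

class GatedLoRA:
    """LoRA adapter with a learnable injection gate.

    Effective weight: W + g * (alpha / r) * (B @ A). In a bilevel
    setup the gate g is an outer-level variable deciding WHETHER this
    layer receives adaptation, while A, B are inner-level parameters.
    """
    def __init__(self, w, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = w.shape
        self.w = w
        self.a = rng.normal(0, 0.02, (r, d_in))
        self.b = np.zeros((d_out, r))    # zero init: adapter starts as a no-op
        self.gate = 0.0                  # outer-level variable
        self.scale = alpha / r

    def forward(self, x):
        delta = self.gate * self.scale * (self.b @ self.a)
        return (self.w + delta) @ x

def select_injection_layers(gates, k):
    """Keep the k layers whose gates have the largest magnitude."""
    return sorted(np.argsort(-np.abs(np.asarray(gates)))[:k].tolist())
```

Zero-initializing B is the standard LoRA trick that keeps the pretrained behavior intact at the start of fine-tuning; the gate then lets the outer loop discover which layers actually matter instead of injecting adapters everywhere.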
cs.CV / 77 / 2603.10893

S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs

S2D:用于最小输入的三维重建的稀疏到密集提升
Ji, Yuzhou, Tian, Qijian, Zhu, He, Jiang, Xiaoqi, Cao, Guangzhi, Ma, Lizhuang, Xie, Yuan, Tan, Xin
Abstract
Explicit 3D representations have already become an essential medium for 3D simulation and understanding. However, the most commonly used point clouds and 3D Gaussian Splatting (3DGS) each suffer from non-photorealistic rendering and significant degradation under sparse inputs. In this paper, we introduce Sparse to Dense lifting (S2D), a novel pipeline that bridges the two representations and achieves high-quality 3DGS reconstruction with minimal inputs. Specifically, the S2D lifting is two-fold. We first present an efficient one-step diffusion model that lifts sparse point clouds for high-fidelity image artifact fixing. Meanwhile, to reconstruct 3D consistent scenes, we also design a corresponding reconstruction strategy with random sample drop and weighted gradients for robust model fitting from sparse input views to dense novel views. Extensive experiments show that S2D achieves the best consistency in generating novel-view guidance and first-tier sparse-view reconstruction quality under different levels of input sparsity. By reconstructing stable scenes with the least possible captures among existing methods, S2D enables minimal input requirements for 3DGS applications.
Chinese Translation
显式的三维表示已经成为三维模拟和理解的重要媒介。然而,最常用的点云和三维高斯溅射(3D Gaussian Splatting, 3DGS)在稀疏输入下均存在非真实感渲染和显著降级的问题。本文介绍了稀疏到密集提升(Sparse to Dense lifting, S2D),这是一种新颖的管道,连接了这两种表示,并在最小输入下实现高质量的3DGS重建。具体而言,S2D提升分为两个部分。我们首先提出了一种高效的一步扩散模型,用于提升稀疏点云,以实现高保真的图像伪影修复。同时,为了重建三维一致场景,我们还设计了一种相应的重建策略,通过随机样本丢弃和加权梯度,从稀疏输入视图到密集新视图进行稳健的模型拟合。大量实验表明,S2D在生成新视图引导方面实现了最佳一致性,并在不同输入稀疏度下达到第一梯队的稀疏视图重建质量。通过在现有方法中以最少的捕获重建稳定场景,S2D为3DGS应用提供了最小输入要求。
cs.CV / 78 / 2603.10928

Novel Architecture of RPA In Oral Cancer Lesion Detection

口腔癌病变检测的新型RPA架构
Magdy, Revana, Naoum, Joy, Hamdi, Ali
Abstract
Accurate and early detection of oral cancer lesions is crucial for effective diagnosis and treatment. This study evaluates two RPA implementations, OC-RPAv1 and OC-RPAv2, using a test set of 31 images. OC-RPAv1 processes one image per prediction in an average of 0.29 seconds, while OC-RPAv2 employs a Singleton design pattern and batch processing, reducing prediction time to just 0.06 seconds per image. This represents a 60-100x efficiency improvement over standard RPA methods, showcasing that design patterns and batch processing can enhance scalability and reduce costs in oral cancer detection.
Chinese Translation
口腔癌病变的准确早期检测对于有效的诊断和治疗至关重要。本研究评估了两种RPA实现方案,OC-RPAv1和OC-RPAv2,使用了31幅图像的测试集。OC-RPAv1在每次预测中平均处理一幅图像需0.29秒,而OC-RPAv2采用了单例设计模式和批处理,将每幅图像的预测时间缩短至仅0.06秒。这代表了相较于标准RPA方法效率提高了60-100倍,展示了设计模式和批处理可以提升口腔癌检测的可扩展性并降低成本。
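The two levers named in the abstract are generic engineering patterns worth making explicit: a Singleton ensures the model is loaded into memory once per process, and batch prediction amortizes per-call overhead across images. A toy sketch (the detection rule is a placeholder, not the paper's model):

```python
class PillDetector:
    """Singleton: the expensive model load happens once; every later
    construction returns the same instance (the OC-RPAv2 pattern)."""
    _instance = None
    load_count = 0          # stands in for an expensive weight load

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls.load_count += 1          # heavy initialization, once
        return cls._instance

    def predict_batch(self, images):
        # One call over the whole batch instead of per-image calls;
        # the rule below is a dummy stand-in for real inference.
        return ["lesion" if sum(img) > 0 else "normal" for img in images]
```

In an RPA pipeline that previously re-instantiated the model for every screenshot, combining these two changes is exactly what turns a 0.29 s/image loop into a 0.06 s/image batch.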
cs.CV / 79 / 2603.10929

Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment

基于多模态潜在重放和增量调整的终身模仿学习
Yu, Fanqi, Tiezzi, Matteo, Apicella, Tommaso, Beyan, Cigdem, Murino, Vittorio
Abstract
We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot's state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness. Our method establishes a new state of the art in the LIBERO benchmarks, achieving 10-17 point gains in AUC and up to 65% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code is available at: https://github.com/yfqi/lifelong_mlr_ifa.
Chinese Translation
我们提出了一种终身模仿学习框架,使得在现实的记忆和数据约束下能够在连续任务中进行持续的策略优化。我们的方法不同于传统的经验重放,完全在多模态潜在空间中操作,在该空间中存储和重用视觉、语言和机器人状态信息的紧凑表示,以支持未来的学习。为了进一步稳定适应性,我们引入了一种增量特征调整机制,通过角度边际约束来规范任务嵌入的演变,从而保持任务间的独特性。我们的方法在LIBERO基准测试中建立了新的技术领先水平,相较于之前的领先方法,在AUC上实现了10-17点的提升,并减少了多达65%的遗忘。消融研究证实了每个组件的有效性,显示出相较于其他策略的一致性提升。代码可在以下链接获取:https://github.com/yfqi/lifelong_mlr_ifa。
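The angular-margin idea, keeping task embeddings at least a fixed angle apart as new tasks arrive, reduces to a hinge penalty on pairwise cosine similarity. The form below is a sketch under our assumptions; the paper's exact constraint may differ:

```python
import numpy as np

def angular_margin_penalty(task_embs, margin_deg=30.0):
    """Penalize task-embedding pairs closer than the angular margin.

    loss = sum over pairs of max(0, cos(theta_ij) - cos(margin)).
    Zero when every pair of tasks is separated by at least margin_deg
    degrees; grows as embeddings collapse toward each other, which is
    what preserves inter-task distinctiveness during continual learning.
    """
    e = np.asarray(task_embs, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    cos = e @ e.T
    thresh = np.cos(np.deg2rad(margin_deg))
    iu = np.triu_indices(len(e), k=1)        # each pair counted once
    return float(np.maximum(0.0, cos[iu] - thresh).sum())
```

Added to the imitation loss, this term pushes a new task's embedding away from replayed ones, so latent replay does not gradually blur the task identities it depends on.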
cs.CV / 80 / 2603.10933

Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepD

通过 CBCTRepD 缩小临床 CBCT 解释中的技能差距
Wu, Qinxin, Niu, Fucheng, Zhu, Hengchuan, Sun, Yifan, Shen, Ye, Li, Xu, Wu, Han, Liu, Leqi, Pan, Zhiwen, Liu, Zuozhu, Zhu, Fudong, Feng, Bin
Abstract
Generative AI has advanced rapidly in medical report generation; however, its application to oral and maxillofacial CBCT reporting remains limited, largely because of the scarcity of high-quality paired CBCT-report data and the intrinsic complexity of volumetric CBCT interpretation. To address this, we introduce CBCTRepD, a bilingual oral and maxillofacial CBCT report-generation system designed for integration into routine radiologist-AI co-authoring workflows. We curated a large-scale, high-quality paired CBCT-report dataset comprising approximately 7,408 studies, covering 55 oral disease entities across diverse acquisition settings, and used it to develop the system. We further established a clinically grounded, multi-level evaluation framework that assesses both direct AI-generated drafts and radiologist-edited collaboration reports using automatic metrics together with radiologist- and clinician-centered evaluation. Using this framework, we show that CBCTRepD achieves superior report-generation performance and produces drafts with writing quality and standardization comparable to those of intermediate radiologists. More importantly, in radiologist-AI collaboration, CBCTRepD provides consistent and clinically meaningful benefits across experience levels: it helps novice radiologists improve toward intermediate-level reporting, enables intermediate radiologists to approach senior-level performance, and even assists senior radiologists by reducing omission-related errors, including clinically important missed lesions. By improving report structure, reducing omissions, and promoting attention to co-existing lesions across anatomical regions, CBCTRepD shows strong and reliable potential as a practical assistant for real-world CBCT reporting across multi-level care settings.
Chinese Translation
生成性人工智能在医学报告生成方面迅速发展;然而,其在口腔和颌面 CBCT 报告中的应用仍然有限,主要是由于高质量配对的 CBCT-报告数据稀缺以及体积 CBCT 解释的内在复杂性。为了解决这一问题,我们引入了 CBCTRepD,这是一种双语口腔和颌面 CBCT 报告生成系统,旨在与常规放射科医生与人工智能共同创作的工作流程相结合。我们策划了一个大规模高质量的配对 CBCT-报告数据集,包含约 7,408 项研究,涵盖 55 种口腔疾病实体,涉及多种采集设置,并以此开发了该系统。我们进一步建立了一个临床基础的多层次评估框架,评估直接由人工智能生成的草稿和放射科医生编辑的协作报告,采用自动化指标以及放射科医生和临床医生中心的评估。通过这一框架,我们展示了 CBCTRepD 在报告生成性能方面的优越性,生成的草稿在写作质量和标准化方面可与中级放射科医生相媲美。更重要的是,在放射科医生与人工智能的协作中,CBCTRepD 在各经验水平上提供了一致且具有临床意义的益处:它帮助初级放射科医生提升至中级报告水平,使中级放射科医生接近高级表现,甚至通过减少遗漏相关错误(包括临床上重要的漏检病变)来协助高级放射科医生。通过改善报告结构、减少遗漏并促进对解剖区域内共存病变的关注,CBCTRepD 显示出作为多层次护理环境中实际 CBCT 报告的可靠助手的强大潜力。
cs.CV / 81 / 2603.10963

Pointy - A Lightweight Transformer for Point Cloud Foundation Models

Pointy - 一种轻量级的点云基础模型变换器
Szafer, Konrad, Kraft, Marek, Belter, Dominik
Abstract
Foundation models for point cloud data have recently grown in capability, often leveraging extensive representation learning from language or vision. In this work, we take a more controlled approach by introducing a lightweight transformer-based point cloud architecture. In contrast to the heavy reliance on cross-modal supervision, our model is trained only on 39k point clouds - yet it outperforms several larger foundation models trained on over 200k training samples. Interestingly, our method approaches state-of-the-art results from models that have seen over a million point clouds, images, and text samples, demonstrating the value of a carefully curated training setup and architecture. To ensure rigorous evaluation, we conduct a comprehensive replication study that standardizes the training regime and benchmarks across multiple point cloud architectures. This unified experimental framework isolates the impact of architectural choices, allowing for transparent comparisons and highlighting the benefits of our design and other tokenizer-free architectures. Our results show that simple backbones can deliver competitive results to more complex or data-rich strategies. The implementation, including code, pre-trained models, and training protocols, is available at https://github.com/KonradSzafer/Pointy.
Chinese Translation
点云数据的基础模型最近在能力上得到了提升,通常利用来自语言或视觉的广泛表征学习。在本研究中,我们采取了一种更为受控的方法,提出了一种基于轻量级变换器的点云架构。与对跨模态监督的高度依赖不同,我们的模型仅在39,000个点云上进行训练,但其性能超过了几个在超过200,000个训练样本上训练的大型基础模型。有趣的是,我们的方法接近于那些在超过一百万个点云、图像和文本样本上训练的模型的最先进结果,展示了精心策划的训练设置和架构的价值。为了确保严格的评估,我们进行了一项全面的复制研究,标准化了训练方案和多个点云架构的基准测试。这个统一的实验框架隔离了架构选择的影响,允许透明的比较,并突显了我们设计和其他无分词器架构的优势。我们的结果表明,简单的骨干网络可以提供与更复杂或数据丰富的策略相竞争的结果。实现代码、预训练模型和训练协议可在 https://github.com/KonradSzafer/Pointy 获取。
cs.CV / 82 / 2603.10965

Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition

基于对比学习的视频质量评估-联合视频视觉变换器用于视频识别
Sun, Jian, Mahoor, Mohammad H.
Abstract
Video quality significantly affects video classification. We encountered this problem when classifying Mild Cognitive Impairment: accuracy was high on clear videos but markedly lower on blurred ones. This led us to realize that incorporating Video Quality Assessment (VQA) may improve video classification. This paper proposes the Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA for video classification (SSL-V3) to achieve this goal. SSL-V3 leverages a Combined-SSL mechanism to integrate VQA into video classification and to address the shortage of VQA labels, which is common in video datasets and makes it impossible to provide accurate Video Quality Scores. In brief, Combined-SSL takes the video quality score as a factor that directly tunes the feature map of the video classifier. The score then serves as an intersection point linking VQA and classification, with the supervised classification task tuning the parameters of the VQA branch. SSL-V3 achieved robust experimental results on two datasets. For example, it reached an accuracy of 94.87% on interview videos from I-CONECT (a healthcare dataset involving facial videos), verifying SSL-V3's effectiveness.
Chinese Translation
视频质量显著影响视频分类。我们在分类轻度认知障碍时发现了这个问题:模型在清晰视频上表现良好,但在模糊视频上效果较差。由此我们意识到,引入视频质量评估(Video Quality Assessment, VQA)可能会改善视频分类。本文提出了一种基于自监督学习的视频视觉变换器,结合无参考VQA用于视频分类(Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA for video classification, SSL-V3),以实现这一目标。SSL-V3利用组合自监督学习机制将VQA引入视频分类,并解决视频数据集中常见的VQA标签短缺问题(这一短缺使得无法提供准确的视频质量评分)。简而言之,组合自监督学习将视频质量评分作为一个因素,直接调整视频分类的特征图。然后,该评分作为一个交集点,连接VQA和分类,利用监督分类任务来调整VQA的参数。SSL-V3在两个数据集上取得了稳健的实验结果。例如,在I-CONECT(一个涉及面部视频的医疗保健数据集)中的访谈视频上,其准确率达到了94.87%,验证了SSL-V3的有效性。
cs.CV / 83 / 2603.10967

Med-DualLoRA: Local Adaptation of Foundation Models for 3D Cardiac MRI

Med-DualLoRA:基础模型在三维心脏磁共振成像中的局部适应
Perramon-Llussà, Joan, Jiménez-Sánchez, Amelia, Skorupko, Grzegorz, Avgoustidis, Fotis, Martín-Isla, Carlos, Lekadir, Karim, Gkontra, Polyxeni
Abstract
Foundation models (FMs) show great promise for robust downstream performance across medical imaging tasks and modalities, including cardiac magnetic resonance (CMR), following task-specific adaptation. However, adaptation using single-site data may lead to suboptimal performance and increased model bias, while centralized fine-tuning on clinical data is often infeasible due to privacy constraints. Federated fine-tuning offers a privacy-preserving alternative; yet conventional approaches struggle under heterogeneous, non-IID multi-center data and incur substantial communication overhead when adapting large models. In this work, we study federated FM fine-tuning for 3D CMR disease detection and propose Med-DualLoRA, a client-aware parameter-efficient fine-tuning (PEFT) federated framework that disentangles globally shared and local low-rank adaptations (LoRA) through additive decomposition. Global and local LoRA modules are trained locally, but only the global component is shared and aggregated across sites, keeping local adapters private. This design improves personalization while significantly reducing communication cost, and experiments show that adapting only two transformer blocks preserves performance while further improving efficiency. We evaluate our method on a multi-center state-of-the-art cine 3D CMR FM fine-tuned for disease detection using ACDC and combined M\&Ms datasets, treating each vendor as a federated client. Med-DualLoRA achieves statistically significant improved performance (balanced accuracy 0.768, specificity 0.612) compared to other federated PEFT baselines, while maintaining communication efficiency. Our approach provides a scalable solution for local federated adaptation of medical FMs under realistic clinical constraints.
Chinese Translation
基础模型(FMs)在医学影像任务和模态中显示出在下游性能上的巨大潜力,包括心脏磁共振(CMR),这得益于特定任务的适应。然而,使用单中心数据进行适应可能导致次优性能和模型偏差增加,而由于隐私限制,基于临床数据的集中微调往往不可行。联邦微调提供了一种保护隐私的替代方案;然而,传统方法在异构的非独立同分布(non-IID)多中心数据下表现不佳,并且在适应大型模型时会产生大量通信开销。在本研究中,我们研究了用于三维 CMR 疾病检测的联邦基础模型微调,并提出了 Med-DualLoRA,一种客户端感知的参数高效微调(PEFT)联邦框架,通过加法分解将全局共享和局部低秩适应(LoRA)解耦。全局和局部 LoRA 模块在本地训练,但只有全局组件在各个中心之间共享和聚合,从而保持局部适配器的私密性。这种设计提高了个性化,同时显著降低了通信成本,实验表明,仅适应两个变换器块即可保持性能,同时进一步提高效率。我们在一个多中心的最先进的电影三维 CMR 基础模型上评估了我们的方法,该模型经过微调以进行疾病检测,使用 ACDC 和组合 M&Ms 数据集,将每个供应商视为一个联邦客户端。与其他联邦 PEFT 基线相比,Med-DualLoRA 在性能上实现了统计显著的提升(平衡准确率 0.768,特异性 0.612),同时保持了通信效率。我们的方法为在现实临床限制下进行医学基础模型的本地联邦适应提供了可扩展的解决方案。
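The additive decomposition and the "share only the global factors" aggregation can be written compactly. The sketch below uses our own naming; the paper's client-aware details (which transformer blocks to adapt, etc.) are omitted:

```python
import numpy as np

def dual_lora_delta(a_g, b_g, a_l, b_l, scale=1.0):
    """Additive decomposition: delta_W = scale * (B_g @ A_g + B_l @ A_l).

    The global pair (A_g, B_g) is shared across sites; the local pair
    (A_l, B_l) never leaves the client, providing personalization.
    """
    return scale * (b_g @ a_g + b_l @ a_l)

def aggregate_global(client_pairs):
    """Server step: FedAvg over ONLY the global LoRA factors.

    client_pairs: list of (A_g, B_g) uploaded by each site. The local
    adapters are intentionally absent, which is both the privacy and
    the communication saving: only two small low-rank matrices travel.
    """
    a_avg = np.mean([a for a, _ in client_pairs], axis=0)
    b_avg = np.mean([b for _, b in client_pairs], axis=0)
    return a_avg, b_avg
```

Because only rank-r factors are exchanged rather than full weight matrices, per-round communication scales with r*(d_in + d_out) per adapted layer, which is what makes federated adaptation of a large FM tractable.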
cs.CV / 84 / 2603.10975

VCR: Variance-Driven Channel Recalibration for Robust Low-Light Enhancement

VCR:基于方差驱动的通道重校准用于稳健的低光增强
Cheng, Zhixin, Zhang, Fangwen, Yin, Xiaotian, Yin, Baoqun, Wang, Haodian
Abstract
Most sRGB-based LLIE methods suffer from entangled luminance and color, while the HSV color space offers insufficient decoupling at the cost of introducing significant red and black noise artifacts. Recently, the HVI color space has been proposed to address these limitations by enhancing color fidelity through chrominance polarization and intensity compression. However, existing methods could suffer from channel-level inconsistency between luminance and chrominance, and misaligned color distribution may lead to unnatural enhancement results. To address these challenges, we propose the Variance-Driven Channel Recalibration for Robust Low-Light Enhancement (VCR), a novel framework for low-light image enhancement. VCR consists of two main components, including the Channel Adaptive Adjustment (CAA) module, which employs variance-guided feature filtering to enhance the model's focus on regions with high intensity and color distribution. And the Color Distribution Alignment (CDA) module, which enforces distribution alignment in the color feature space. These designs enhance perceptual quality under low-light conditions. Experimental results on several benchmark datasets demonstrate that the proposed method achieves state-of-the-art performance compared with existing methods.
Chinese Translation
大多数基于sRGB的低光图像增强(LLIE)方法存在亮度与颜色相互纠缠的问题,而HSV颜色空间解耦能力不足,且会引入显著的红色和黑色噪声伪影。最近,HVI颜色空间被提出以解决这些局限性,通过色度极化和强度压缩来增强颜色保真度。然而,现有方法可能在亮度和色度之间存在通道级不一致,且不对齐的颜色分布可能导致不自然的增强结果。为了解决这些挑战,我们提出了基于方差驱动的通道重校准(VCR),这是一个用于低光图像增强的新框架。VCR由两个主要组件组成,包括通道自适应调整(CAA)模块,该模块采用方差引导的特征过滤来增强模型对高强度和颜色分布区域的关注;以及颜色分布对齐(CDA)模块,该模块在颜色特征空间中强制执行分布对齐。这些设计在低光条件下增强了感知质量。在多个基准数据集上的实验结果表明,所提出的方法与现有方法相比,达到了最先进的性能。
cs.CV / 85 / 2603.10978

GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations

GroundCount:通过物体检测为视觉语言模型提供基础支持以减轻计数幻觉
Chen, Boyuan, Shao, Minghao, Garg, Siddharth, Karri, Ramesh, Shafique, Muhammad
Abstract
Vision Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B) - a 6.6pp improvement - while reducing inference time by 22% through elimination of hallucination-driven reasoning loops for stronger models. We conduct comprehensive ablation studies demonstrating that positional encoding is a critical component, being beneficial for stronger models but detrimental for weaker ones. Confidence scores, by contrast, introduce noise for most architectures and their removal improves performance in four of five evaluated models. We further evaluate feature-level fusion architectures, finding that explicit symbolic grounding via structured prompts outperforms implicit feature fusion despite sophisticated cross-attention mechanisms. Our approach yields consistent improvements across four of five evaluated VLM architectures (6.2--7.5pp), with one architecture exhibiting degraded performance due to incompatibility between its iterative reflection mechanisms and structured prompts. These results suggest that counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, while highlighting the importance of architectural compatibility in augmentation strategies.
Chinese Translation
视觉语言模型(VLMs)在计数任务中表现出持续的幻觉,其准确性显著低于其他视觉推理任务(不包括情感分析)。这一现象在最先进的具备推理能力的VLMs中依然存在。相反,基于卷积神经网络(CNN)的物体检测模型(ODMs),如YOLO,在空间定位和实例计数方面表现优异,且计算开销极小。我们提出了GroundCount,一个通过ODMs提供显式空间基础支持来增强VLMs的框架,以减轻计数幻觉。在最佳情况下,我们的基于提示的增强策略在表现最好的模型(Ovis2.5-2B)上实现了81.3%的计数准确率,提升了6.6个百分点,同时通过消除幻觉驱动的推理循环将推理时间减少了22%。我们进行了全面的消融研究,表明位置编码是一个关键组成部分,对强模型有益,但对弱模型则有害。相比之下,置信度分数对大多数架构引入了噪声,去除这些分数在五个评估模型中的四个上提高了性能。我们进一步评估了特征级融合架构,发现通过结构化提示进行显式符号基础支持的效果优于复杂的交叉注意机制下的隐式特征融合。我们的方法在五个评估的VLM架构中有四个实现了一致的改进(6.2-7.5个百分点),其中一个架构由于其迭代反射机制与结构化提示的不兼容而表现下降。这些结果表明,计数失败源于基本的空间-语义整合限制,而非特定架构的缺陷,同时强调了增强策略中架构兼容性的重要性。
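The prompt-based augmentation reduces to serializing object-detector output into the VLM's context. The template below is illustrative, not the paper's exact format; following the ablation findings, confidence scores are omitted entirely and positions are made optional (helpful for stronger models, harmful for weaker ones):

```python
from collections import Counter

def grounding_prompt(question, detections, include_positions=True):
    """Serialize ODM output into a structured prompt for a VLM.

    detections: list of (label, (x, y, w, h)) boxes, e.g. from YOLO.
    Counts are aggregated per class; confidence scores are dropped
    since they mostly added noise in the evaluated architectures.
    """
    counts = Counter(label for label, _ in detections)
    lines = ["Detected objects: " +
             ", ".join(f"{n} {lbl}" for lbl, n in sorted(counts.items()))]
    if include_positions:
        lines += [f"- {lbl} at (x={x}, y={y}, w={w}, h={h})"
                  for lbl, (x, y, w, h) in detections]
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

Giving the VLM an explicit symbolic count up front is what short-circuits the hallucination-driven reasoning loops, which is also where the reported 22% inference-time reduction comes from.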
cs.CV / 86 / 2603.10990

Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity

过于生动以至于不真实?生成颜色保真度的基准测试与校准
Fang, Zhengyao, Jia, Zexi, Zhong, Yijia, Luo, Pengcheng, Zhang, Jinchao, Lu, Guangming, Yu, Jun, Pei, Wenjie
Abstract
Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which make generations often too vivid to be real even when prompted for realistic-style images. To address this issue, we present Color Fidelity Dataset (CFD) and Color Fidelity Metric (CFM) for objective evaluation of color fidelity in realistic-style generations. CFD contains over 1.3M real and synthetic images with ordered levels of color realism, while CFM employs a multimodal encoder to learn perceptual color fidelity. In addition, we propose a training-free Color Fidelity Refinement (CFR) that adaptively modulates spatial-temporal guidance scale in generation, thereby enhancing color authenticity. Together, CFD supports CFM for assessment, whose learned attention further guides CFR to refine T2I fidelity, forming a progressive framework for assessing and improving color fidelity in realistic-style T2I generation. The dataset and code are available at https://github.com/ZhengyaoFang/CFM.
Chinese Translation
最近在文本到图像(T2I)生成方面的进展大大提高了视觉质量,但生成看起来与真实世界摄影相符的图像仍然具有挑战性。这部分是由于现有评估范式中的偏见:人类评分和基于偏好的度量标准往往偏向于视觉上生动的图像,这些图像具有夸张的饱和度和对比度,即使在被提示生成现实风格图像时,生成的图像往往也显得过于生动而不真实。为了解决这个问题,我们提出了颜色保真度数据集(Color Fidelity Dataset, CFD)和颜色保真度度量(Color Fidelity Metric, CFM),用于对现实风格生成中的颜色保真度进行客观评估。CFD包含超过130万张具有不同颜色真实感等级的真实和合成图像,而CFM则采用多模态编码器来学习感知颜色保真度。此外,我们提出了一种无训练的颜色保真度精炼(Color Fidelity Refinement, CFR),该方法自适应地调节生成中的时空引导尺度,从而增强颜色的真实性。CFD与CFM相辅相成,CFM的学习注意力进一步指导CFR来精炼T2I的保真度,形成一个渐进的框架,用于评估和改善现实风格T2I生成中的颜色保真度。数据集和代码可在 https://github.com/ZhengyaoFang/CFM 获取。
cs.CV / 87 / 2603.11024

Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style

人工智能是否像艺术史学家一样看待艺术?解读视觉语言模型如何识别艺术风格
Limpijankit, Marvin, Alshomary, Milad, Daoud, Yassin Oulad, Ananthram, Amith, Trombley, Tim, Stengel-Eskin, Elias, Bansal, Mohit, Elcott, Noam M., McKeown, Kathleen
Abstract
VLMs have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. This includes increasingly strong capabilities in the domain of art, from analyzing artwork to generation of art. In an interdisciplinary collaboration between computer scientists and art historians, we characterize the mechanisms underlying VLMs' ability to predict artistic style and assess the extent to which they align with the criteria art historians use to reason about artistic style. We employ a latent-space decomposition approach to identify concepts that drive art style prediction and conduct quantitative evaluations, causal analysis and assessment by art historians. Our findings indicate that 73% of the extracted concepts are judged by art historians to exhibit a coherent and semantically meaningful visual feature and 90% of concepts used to predict style of a given artwork were judged relevant. In cases where an irrelevant concept was used to successfully predict style, art historians identified possible reasons for its success; for example, the model might "understand" a concept in more formal terms, such as dark/light contrasts.
Chinese Translation
视觉语言模型(VLMs)在视觉问答和物体检测等多种计算机视觉任务中变得越来越高效。这包括在艺术领域日益增强的能力,从分析艺术作品到艺术创作。在计算机科学家与艺术史学家的跨学科合作中,我们描述了VLMs预测艺术风格的机制,并评估它们与艺术史学家用于推理艺术风格的标准之间的契合程度。我们采用潜在空间分解的方法来识别驱动艺术风格预测的概念,并进行定量评估、因果分析以及艺术史学家的评估。我们的研究结果表明,73%的提取概念被艺术史学家判断为展现出一致且语义上有意义的视觉特征,而90%用于预测特定艺术作品风格的概念被认为是相关的。在使用不相关概念成功预测风格的情况下,艺术史学家识别出了其成功的可能原因;例如,模型可能以更正式的术语“理解”某个概念,如明暗对比。
cs.CV / 88 / 2603.11041

DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving

DynVLA:用于自主驾驶中行动推理的世界动态学习
Shang, Shuyao, Zhan, Bing, Yan, Yunfei, Wang, Yuqi, Li, Yingyan, An, Yasong, Wang, Xiaoman, Liu, Jierui, Hou, Lu, Fan, Lue, Zhang, Zhaoxiang, Tan, Tieniu
Abstract
We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.
Chinese Translation
我们提出了DynVLA,一种驾驶VLA模型,引入了一种新的称为动态CoT的CoT范式。DynVLA在生成行动之前预测紧凑的世界动态,从而实现更为知情和基于物理的决策。为了获得紧凑的动态表示,DynVLA引入了一种动态标记器(Dynamics Tokenizer),将未来演变压缩为一小组动态标记。考虑到在互动密集型驾驶场景中丰富的环境动态,DynVLA将自我中心和环境中心的动态解耦,从而实现更准确的世界动态建模。随后,我们通过SFT和RFT训练DynVLA在行动之前生成动态标记,提高决策质量,同时保持低延迟的推理。与缺乏细粒度时空理解的文本CoT和由于密集图像预测引入大量冗余的视觉CoT相比,动态CoT以紧凑、可解释和高效的形式捕捉世界的演变。在NAVSIM、Bench2Drive和一个大规模内部数据集上的大量实验表明,DynVLA始终优于文本CoT和视觉CoT方法,验证了动态CoT的有效性和实际价值。
cs.CV / 89 / 2603.11042

V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

V2M-Zero:零配对时间对齐的视频到音乐生成
Lin, Yan-Bo, Casebeer, Jonah, Mai, Long, Mahapatra, Aniruddha, Bertasius, Gedas, Bryan, Nicholas J.
Abstract
Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/
Chinese Translation
生成与视频事件在时间上对齐的音乐对现有的文本到音乐模型来说是一项挑战,因为这些模型缺乏细粒度的时间控制。我们提出了V2M-Zero,这是一种零配对的视频到音乐生成方法,能够为视频输出时间对齐的音乐。我们的方法源于一个关键观察:时间同步需要匹配变化发生的时间和变化的幅度,而不是变化的内容。尽管音乐事件和视觉事件在语义上有所不同,但它们展现出共享的时间结构,这可以在每种模态中独立捕捉。我们通过使用预训练的音乐和视频编码器计算的事件曲线来捕捉这种结构。通过独立测量每种模态内的时间变化,这些曲线提供了跨模态的可比表示。这使得一种简单的训练策略成为可能:在音乐事件曲线上微调文本到音乐模型,然后在推理时用视频事件曲线替代,而无需跨模态训练或配对数据。在OES-Pub、MovieGenBench-Music和AIST++数据集上,V2M-Zero在配对数据基线之上取得了显著的提升:音频质量提高了5-21%,语义对齐提高了13-15%,时间同步提高了21-52%,舞蹈视频的节拍对齐提高了28%。我们通过大规模众包的主观听音测试发现了类似的结果。总体而言,我们的结果验证了通过模态内特征进行时间对齐,而不是配对的跨模态监督,对于视频到音乐生成是有效的。结果可在 https://genjib.github.io/v2m_zero/ 获取。
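The event curves at the core of V2M-Zero measure within-modality temporal change, independently per modality. A minimal sketch of that idea, with fabricated frame embeddings and a simple 1 − cosine change measure rather than the authors' pretrained encoders, could look like:

```python
import numpy as np

def event_curve(embeddings):
    """Per-step change magnitude: 1 - cosine similarity of consecutive frames."""
    E = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-9)
    cos = (E[:-1] * E[1:]).sum(axis=1)
    return 1.0 - cos

# Toy "video" embeddings: steady content, then an abrupt change at frame 5.
rng = np.random.default_rng(0)
base = rng.normal(size=8)
frames = np.stack([base + 0.01 * rng.normal(size=8) for _ in range(5)]
                  + [-base + 0.01 * rng.normal(size=8) for _ in range(5)])
curve = event_curve(frames)
peak = int(curve.argmax())
print("event curve peaks at step:", peak)  # the transition between frames 4 and 5
```

Because the curve only encodes when and how much change occurs, not what changes, a music-event curve computed the same way from music embeddings is directly comparable, which is what lets the method substitute video-event curves at inference without paired data.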
cs.CV / 90 / 2603.11044

Agentar-Fin-OCR

Agentar-Fin-OCR
Qian, Siyi, Bai, Xiongfei, Fu, Bingtao, Lu, Yichen, Zhang, Gaoyang, Yang, Xudong, Zhang, Peng
Abstract
In this paper, we propose Agentar-Fin-OCR, a document parsing system tailored to financial-domain documents, transforming ultra-long financial PDFs into semantically consistent, highly accurate, structured outputs with auditing-grade provenance. To address finance-specific challenges such as complex layouts, cross-page structural discontinuities, and cell-level referencing capability, Agentar-Fin-OCR combines (1) a Cross-page Contents Consolidation algorithm to restore continuity across pages and a Document-level Heading Hierarchy Reconstruction (DHR) module to build a globally consistent Table of Contents (TOC) tree for structure-aware retrieval, and (2) a difficulty-adaptive curriculum learning training strategy for table parsing, together with a CellBBoxRegressor module that uses structural anchor tokens to localize table cells from decoder hidden states without external detectors. Experiments demonstrate that our model shows high performance on the table parsing metrics of OmniDocBench. To enable realistic evaluation in the financial vertical, we further introduce FinDocBench, a benchmark that includes six financial document categories with expert-verified annotations and evaluation metrics including Table of Contents edit-distance-based similarity (TocEDS), cross-page concatenated TEDS, and Table Cell Intersection over Union (C-IoU). We evaluate a wide range of state-of-the-art models on FinDocBench to assess their capabilities and remaining limitations on financial documents. Overall, Agentar-Fin-OCR and FinDocBench provide a practical foundation for reliable downstream financial document applications.
Chinese Translation
在本文中,我们提出了Agentar-Fin-OCR,这是一种针对金融领域文档的文档解析系统,能够将超长的金融PDF文档转化为语义一致、准确度高、结构化的输出,并具备审计级的来源追溯能力。为了解决金融特有的挑战,如复杂的布局、跨页结构的不连续性以及单元格级的引用能力,Agentar-Fin-OCR结合了(1)跨页内容整合算法(Cross-page Contents Consolidation),以恢复页面之间的连续性,以及文档级标题层次重建模块(Document-level Heading Hierarchy Reconstruction, DHR),以构建全局一致的目录树(Table of Contents, TOC),以便进行结构感知检索;(2)一种适应难度的课程学习训练策略,用于表格解析,以及一个CellBBoxRegressor模块,该模块利用结构锚点标记从解码器隐藏状态中定位表格单元格,而无需外部检测器。实验表明,我们的模型在OmniDocBench的表格解析指标上表现出色。为了实现金融领域的现实评估,我们进一步引入了FinDocBench,这是一个基准测试,包含六类经过专家验证注释的金融文档及评估指标,包括基于目录编辑距离的相似度(Table of Contents edit-distance-based similarity, TocEDS)、跨页连接TEDS和表格单元交并比(Table Cell Intersection over Union, C-IoU)。我们在FinDocBench上评估了一系列最先进的模型,以评估它们在金融文档上的能力和剩余局限性。总体而言,Agentar-Fin-OCR和FinDocBench为可靠的下游金融文档应用提供了实用的基础。
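The TocEDS metric is described only as an edit-distance-based similarity over tables of contents; one plausible normalization (an assumption, the paper may define it differently) treats each TOC as a sequence of headings and scales Levenshtein distance by the longer sequence length:

```python
def edit_distance(a, b):
    """Levenshtein distance over sequences (here: lists of TOC headings)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1,
                        prev + (a[i - 1] != b[j - 1]))
            prev = cur
    return dp[n]

def toc_similarity(pred, gold):
    """Edit-distance-based similarity in [0, 1]: 1 means identical TOCs."""
    if not pred and not gold:
        return 1.0
    return 1.0 - edit_distance(pred, gold) / max(len(pred), len(gold))

gold = ["1 Overview", "2 Financial Statements", "2.1 Balance Sheet", "3 Notes"]
pred = ["1 Overview", "2 Financial Statements", "3 Notes"]  # one heading missed
print(f"TOC similarity: {toc_similarity(pred, gold):.2f}")  # 1 - 1/4 = 0.75
```

A sequence-level metric like this penalizes missing, spurious, and reordered headings uniformly, which is why the benchmark pairs it with cross-page TEDS and C-IoU for table-level and cell-level fidelity.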
cs.CV / 91 / 2603.11047

LiTo: Surface Light Field Tokenization

LiTo: 表面光场标记化
Chang, Jen-Hao Rick, Zhao, Xiaoming, Chan, Dorian, Tuzel, Oncel
Abstract
We propose a 3D latent representation that jointly models object geometry and view-dependent appearance. Most prior works focus on either reconstructing 3D geometry or predicting view-independent diffuse appearance, and thus struggle to capture realistic view-dependent effects. Our approach leverages that RGB-depth images provide samples of a surface light field. By encoding random subsamples of this surface light field into a compact set of latent vectors, our model learns to represent both geometry and appearance within a unified 3D latent space. This representation reproduces view-dependent effects such as specular highlights and Fresnel reflections under complex lighting. We further train a latent flow matching model on this representation to learn its distribution conditioned on a single input image, enabling the generation of 3D objects with appearances consistent with the lighting and materials in the input. Experiments show that our approach achieves higher visual quality and better input fidelity than existing methods.
Chinese Translation
我们提出了一种三维潜在表示,联合建模物体几何形状和视角依赖的外观。大多数先前的研究要么专注于重建三维几何形状,要么预测与视角无关的漫反射外观,因此难以捕捉真实的视角依赖效果。我们的方法利用RGB-深度图像提供的表面光场样本。通过将该表面光场的随机子样本编码为一组紧凑的潜在向量,我们的模型学习在统一的三维潜在空间中表示几何形状和外观。这种表示能够重现复杂光照下的视角依赖效果,如镜面高光和菲涅尔反射。我们进一步在该表示上训练了一个潜在流匹配模型,以学习其在单个输入图像条件下的分布,从而能够生成与输入中的光照和材料一致的外观的三维物体。实验表明,我们的方法在视觉质量和输入保真度上优于现有方法。
cs.CV / 92 / 2603.11048

COMIC: Agentic Sketch Comedy Generation

COMIC:自主喜剧短剧生成
Hong, Susung, Curless, Brian, Kemelmacher-Shlizerman, Ira, Seitz, Steve
Abstract
We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state-of-the-art performance in video generation.
Chinese Translation
我们提出了一种完全自动化的人工智能系统,该系统能够生成类似于《周六夜现场》的短喜剧视频。该系统以角色参考为起点,采用一组基于真实制作工作室角色的代理人,结构设计旨在通过迭代竞争、评估和改进来优化创意和输出的质量与多样性。一个关键贡献是引入了与真实观众偏好对齐的LLM(大型语言模型)评论者,通过分析YouTube上的喜剧视频语料库来自动评估幽默性。我们的实验表明,框架生成的结果接近专业制作短剧的质量,同时在视频生成方面展现出最先进的性能。
人工智能 (Artificial Intelligence)
15
cs.AI / 1 / 2603.10133

Agentic Control Center for Data Product Optimization

数据产品优化的自主控制中心
Tamilselvan, Priyadarshini, Bramble, Gregory, Shirai, Sola, Wong, Ken C. L., Chowdhury, Faisal, Samulowitz, Horst
Abstract
Data products enable end users to gain greater insights about their data by providing supporting assets, such as example question-SQL pairs which can be answered using the data or views over the database tables. However, producing useful data products is challenging, and typically requires domain experts to hand-craft supporting assets. We propose a system that automates data product improvement through specialized AI agents operating in a continuous optimization loop. By surfacing questions, monitoring multi-dimensional quality metrics, and supporting human-in-the-loop controls, it transforms data into observable and refinable assets that balance automation with trust and oversight.
Chinese Translation
数据产品通过提供支持性资产(例如可以使用数据回答的示例问题-SQL对或数据库表的视图),使最终用户能够对其数据获得更深入的洞察。然而,制作有用的数据产品具有挑战性,通常需要领域专家手工制作支持性资产。我们提出了一种系统,通过在持续优化循环中运行的专门AI代理,自动化数据产品的改进。通过提出问题、监控多维质量指标以及支持人机协作控制,该系统将数据转化为可观察和可精炼的资产,平衡了自动化与信任和监督。
cs.AI / 2 / 2603.10291

Hybrid Self-evolving Structured Memory for GUI Agents

混合自演化结构记忆用于图形用户界面代理
Zhu, Sibo, Wu, Wenyi, Zhou, Kun, Wang, Stephen, Huang, Biwei
Abstract
The remarkable progress of vision-language models (VLMs) has enabled GUI agents to interact with computers in a human-like manner. Yet real-world computer-use tasks remain difficult due to long-horizon workflows, diverse interfaces, and frequent intermediate errors. Prior work equips agents with external memory built from large collections of trajectories, but relies on flat retrieval over discrete summaries or continuous embeddings, falling short of the structured organization and self-evolving characteristics of human memory. Inspired by the brain, we propose Hybrid Self-evolving Structured Memory (HyMEM), a graph-based memory that couples discrete high-level symbolic nodes with continuous trajectory embeddings. HyMEM maintains a graph structure to support multi-hop retrieval, self-evolution via node update operations, and on-the-fly working-memory refreshing during inference. Extensive experiments show that HyMEM consistently improves open-source GUI agents, enabling 7B/8B backbones to match or surpass strong closed-source models; notably, it boosts Qwen2.5-VL-7B by +22.5% and outperforms Gemini2.5-Pro-Vision and GPT-4o.
Chinese Translation
视觉语言模型(VLMs)的显著进展使得图形用户界面(GUI)代理能够以类人方式与计算机进行交互。然而,由于长时间的工作流程、多样化的界面和频繁的中间错误,现实世界中的计算机使用任务仍然困难。之前的研究为代理配备了基于大量轨迹集合构建的外部记忆,但依赖于对离散摘要或连续嵌入的平面检索,未能达到人类记忆的结构化组织和自演化特性。受到大脑的启发,我们提出了混合自演化结构记忆(Hybrid Self-evolving Structured Memory, HyMEM),这是一种基于图的记忆,将离散的高层次符号节点与连续的轨迹嵌入相结合。HyMEM维持图结构以支持多跳检索、通过节点更新操作实现自演化,并在推理过程中进行即时工作记忆刷新。大量实验表明,HyMEM持续改善开源GUI代理,使得7B/8B主干能够与强大的闭源模型相匹配或超越;值得注意的是,它使Qwen2.5-VL-7B提升了22.5%,并超越了Gemini2.5-Pro-Vision和GPT-4o。
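HyMEM's coupling of discrete symbolic nodes with continuous embeddings, plus multi-hop retrieval over the graph, can be sketched in miniature. The class below is a toy illustration under stated assumptions (hand-made 3-d "embeddings", cosine seeding, breadth-first hop expansion), not the authors' implementation:

```python
import numpy as np

class GraphMemory:
    """Toy structured memory: symbolic node labels + embeddings + edges."""
    def __init__(self):
        self.labels, self.vecs, self.edges = [], [], {}

    def add(self, label, vec, neighbors=()):
        i = len(self.labels)
        self.labels.append(label)
        self.vecs.append(np.asarray(vec, float))
        self.edges[i] = list(neighbors)
        for n in neighbors:            # keep edges bidirectional
            self.edges.setdefault(n, []).append(i)
        return i

    def retrieve(self, query, hops=1):
        """Seed at the nearest node by cosine similarity, then expand neighbors."""
        V = np.stack(self.vecs)
        V = V / np.linalg.norm(V, axis=1, keepdims=True)
        q = np.asarray(query, float)
        q = q / np.linalg.norm(q)
        seed = int((V @ q).argmax())
        frontier, seen = {seed}, {seed}
        for _ in range(hops):
            frontier = {n for i in frontier for n in self.edges.get(i, [])} - seen
            seen |= frontier
        return [self.labels[i] for i in sorted(seen)]

mem = GraphMemory()
a = mem.add("open settings app", [1, 0, 0])
b = mem.add("navigate to display tab", [0.9, 0.4, 0], neighbors=[a])
mem.add("set dark mode toggle", [0.8, 0.6, 0.1], neighbors=[b])
mem.add("compose an email", [0, 0, 1])
print(mem.retrieve([1, 0.1, 0], hops=2))
```

The point of the hop expansion is that a query matching only the start of a workflow still surfaces the downstream steps linked to it, which flat embedding retrieval would miss.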
cs.AI / 3 / 2603.10359

HEAL: Hindsight Entropy-Assisted Learning for Reasoning Distillation

HEAL:基于回顾熵辅助学习的推理蒸馏
Zhang, Wenjing, Yan, Jiangze, Huang, Jieyun, Shen, Yi, Shi, Shuming, Chen, Ping, Wang, Ning, Liu, Zhaoxiang, Wang, Kai, Lian, Shiguo
Abstract
Distilling reasoning capabilities from Large Reasoning Models (LRMs) into smaller models is typically constrained by the limitation of rejection sampling. Standard methods treat the teacher as a static filter, discarding complex "corner-case" problems where the teacher fails to explore valid solutions independently, thereby creating an artificial "Teacher Ceiling" for the student. In this work, we propose Hindsight Entropy-Assisted Learning (HEAL), an RL-free framework designed to bridge this reasoning gap. Drawing on the educational theory of the Zone of Proximal Development(ZPD), HEAL synergizes three core modules: (1) Guided Entropy-Assisted Repair (GEAR), an active intervention mechanism that detects critical reasoning breakpoints via entropy dynamics and injects targeted hindsight hints to repair broken trajectories; (2) Perplexity-Uncertainty Ratio Estimator (PURE), a rigorous filtering protocol that decouples genuine cognitive breakthroughs from spurious shortcuts; and (3) Progressive Answer-guided Curriculum Evolution (PACE), a three-stage distillation strategy that organizes training from foundational alignment to frontier breakthrough. Extensive experiments on multiple benchmarks demonstrate that HEAL significantly outperforms traditional SFT distillation and other baselines.
Chinese Translation
从大型推理模型(LRMs)中提取推理能力到较小模型的过程通常受到拒绝采样限制的约束。标准方法将教师视为静态过滤器,丢弃复杂的“边缘案例”问题,在这些情况下,教师未能独立探索有效解决方案,从而为学生创造了一个人为的“教师天花板”。在本研究中,我们提出了回顾熵辅助学习(HEAL),这是一个无强化学习的框架,旨在弥补这一推理差距。HEAL借鉴了近端发展区(ZPD)的教育理论,协同三个核心模块:(1)引导熵辅助修复(GEAR),一种主动干预机制,通过熵动态检测关键推理断点,并注入针对性的回顾提示以修复破损轨迹;(2)困惑度-不确定性比率估计器(PURE),一种严格的过滤协议,能够将真正的认知突破与虚假的捷径解耦;(3)渐进式答案引导课程演化(PACE),一种三阶段的蒸馏策略,从基础对齐到前沿突破组织训练。在多个基准上的大量实验表明,HEAL显著优于传统的SFT蒸馏和其他基线方法。
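GEAR's detection of "critical reasoning breakpoints via entropy dynamics" suggests flagging steps where next-token entropy spikes relative to the rest of the trace. A minimal sketch of that signal (fabricated per-step distributions and a simple z-score rule, assumed rather than taken from the paper):

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def find_breakpoints(step_dists, z=1.5):
    """Flag reasoning steps whose entropy spikes z std-devs above the mean."""
    ents = [entropy(d) for d in step_dists]
    mu = sum(ents) / len(ents)
    sd = (sum((e - mu) ** 2 for e in ents) / len(ents)) ** 0.5
    return [i for i, e in enumerate(ents) if e > mu + z * sd], ents

# Toy trace: confident steps, then the model becomes uncertain at step 3.
dists = [
    [0.9, 0.05, 0.05],
    [0.85, 0.1, 0.05],
    [0.9, 0.08, 0.02],
    [0.34, 0.33, 0.33],   # near-uniform: a candidate breakpoint to repair
    [0.88, 0.07, 0.05],
]
breaks, ents = find_breakpoints(dists)
print("breakpoints at steps:", breaks)
```

In HEAL, a flagged step is where a targeted hindsight hint would be injected before resampling, so that the repaired trajectory (after PURE-style filtering) becomes usable distillation data.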
cs.AI / 4 / 2603.10384

Beyond Scalars: Evaluating and Understanding LLM Reasoning via Geometric Progress and Stability

超越标量:通过几何进展与稳定性评估和理解大型语言模型的推理
Jiang, Xinyan, Liu, Ninghao, Wang, Di, Hu, Lijie
Abstract
Evaluating LLM reliability via scalar probabilities often fails to capture the structural dynamics of reasoning. We introduce TRACED, a framework that assesses reasoning quality through theoretically grounded geometric kinematics. By decomposing reasoning traces into Progress (displacement) and Stability (curvature), we reveal a distinct topological divergence: correct reasoning manifests as high-progress, stable trajectories, whereas hallucinations are characterized by low-progress, unstable patterns (stalled displacement with high curvature fluctuations). Leveraging these signatures, our probabilistic framework achieves competitive performance and superior robustness across diverse benchmarks. Crucially, TRACED bridges geometry and cognition by mapping high curvature to ''Hesitation Loops'' and displacement to ''Certainty Accumulation'', offering a physical lens to decode the internal dynamics of machine thought.
Chinese Translation
通过标量概率评估大型语言模型的可靠性往往无法捕捉推理的结构动态。我们提出了TRACED,一个通过理论基础的几何运动学评估推理质量的框架。通过将推理轨迹分解为进展(位移)和稳定性(曲率),我们揭示了一种独特的拓扑分歧:正确的推理表现为高进展、稳定的轨迹,而幻觉则以低进展、不稳定的模式为特征(停滞位移伴随高曲率波动)。利用这些特征,我们的概率框架在各种基准测试中实现了竞争力的性能和卓越的鲁棒性。关键是,TRACED通过将高曲率映射到“犹豫循环”,将位移映射到“确定性积累”,架起了几何与认知之间的桥梁,为解码机器思维的内部动态提供了物理视角。
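TRACED's two kinematic quantities have direct geometric analogues: progress as net displacement relative to path length, and stability as the turning angle between successive steps of the trace's embedding trajectory. The sketch below uses those textbook definitions on synthetic 2-d trajectories; the paper's exact formulations and embedding space are assumptions here:

```python
import numpy as np

def progress_and_stability(traj):
    """progress = net displacement / path length (1.0 = straight march);
    instability proxy = mean turning angle between successive steps."""
    traj = np.asarray(traj, float)
    steps = np.diff(traj, axis=0)
    path = np.linalg.norm(steps, axis=1).sum()
    net = np.linalg.norm(traj[-1] - traj[0])
    progress = net / (path + 1e-9)
    u = steps / (np.linalg.norm(steps, axis=1, keepdims=True) + 1e-9)
    cos = (u[:-1] * u[1:]).sum(axis=1).clip(-1, 1)
    curvature = float(np.arccos(cos).mean())
    return progress, curvature

# A "correct" trace marches forward; a "hallucinated" trace loops in place.
straight = [[i, 0.1 * i] for i in range(8)]
loop = [[np.cos(t), np.sin(t)] for t in np.linspace(0, 4 * np.pi, 8)]
p1, c1 = progress_and_stability(straight)
p2, c2 = progress_and_stability(loop)
print(f"straight: progress={p1:.2f} curvature={c1:.2f}")
print(f"loop:     progress={p2:.2f} curvature={c2:.2f}")
```

The looping trajectory reproduces the paper's hallucination signature in caricature: near-zero net progress with large, sustained curvature ("hesitation loops"), versus high progress and low curvature for the straight trace ("certainty accumulation").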
cs.AI / 5 / 2603.10396

Verbalizing LLM's Higher-order Uncertainty via Imprecise Probabilities

通过不精确概率对大型语言模型的高阶不确定性进行表述
Yang, Anita, Muandet, Krikamol, Caprio, Michele, Chau, Siu Lun, Adachi, Masaki
Abstract
Despite the growing demand for eliciting uncertainty from large language models (LLMs), empirical evidence suggests that LLM behavior is not always adequately captured by the elicitation techniques developed under the classical probabilistic uncertainty framework. This mismatch leads to systematic failure modes, particularly in settings that involve ambiguous question-answering, in-context learning, and self-reflection. To address this, we propose novel prompt-based uncertainty elicitation techniques grounded in \emph{imprecise probabilities}, a principled framework for representing and eliciting higher-order uncertainty. Here, first-order uncertainty captures uncertainty over possible responses to a prompt, while second-order uncertainty (uncertainty about uncertainty) quantifies indeterminacy in the underlying probability model itself. We introduce general-purpose prompting and post-processing procedures to directly elicit and quantify both orders of uncertainty, and demonstrate their effectiveness across diverse settings. Our approach enables more faithful uncertainty reporting from LLMs, improving credibility and supporting downstream decision-making.
Chinese Translation
尽管对从大型语言模型(LLMs)中引出不确定性的需求日益增长,但实证证据表明,LLM的行为并不总是能通过经典概率不确定性框架下开发的引出技术得到充分捕捉。这种不匹配导致了系统性的失效模式,尤其是在涉及模糊问答、上下文学习和自我反思的环境中。为了解决这一问题,我们提出了基于提示的新型不确定性引出技术,该技术基于不精确概率(imprecise probabilities),这是一个用于表示和引出高阶不确定性的原则性框架。在这里,一阶不确定性捕捉对提示可能响应的不确定性,而二阶不确定性(关于不确定性的不确定性)则量化了潜在概率模型本身的模糊性。我们引入了通用的提示和后处理程序,以直接引出和量化两种不确定性,并在多种环境中展示了其有效性。我们的方法使LLMs能够更真实地报告不确定性,提高了可信度,并支持下游决策制定。
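A standard way to represent second-order uncertainty in the imprecise-probability framework is a credal set: a set of plausible first-order distributions whose per-outcome lower and upper envelope yields probability intervals. The sketch below illustrates that idea with made-up elicited distributions; the paper's actual prompting and post-processing procedures are not shown:

```python
def credal_envelope(distributions):
    """Lower/upper probability per outcome over a set of elicited distributions
    (a credal set): the interval width quantifies second-order uncertainty."""
    outcomes = distributions[0].keys()
    return {o: (min(d[o] for d in distributions),
                max(d[o] for d in distributions)) for o in outcomes}

# Three elicited first-order distributions for an ambiguous question.
elicited = [
    {"yes": 0.6, "no": 0.4},
    {"yes": 0.3, "no": 0.7},
    {"yes": 0.5, "no": 0.5},
]
env = credal_envelope(elicited)
for o, (lo, hi) in env.items():
    print(f"P({o}) in [{lo:.1f}, {hi:.1f}]  (width {hi - lo:.1f})")
```

A wide interval (here 0.3 for "yes") signals genuine indeterminacy in the probability model itself, which a single scalar probability elicited from the model would hide.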
cs.AI / 6 / 2603.10512

Resource-constrained Amazons chess decision framework integrating large language models and graph attention

资源受限的亚马逊棋决策框架:整合大型语言模型与图注意力机制
Qian, Tianhao, Li, Zhuoxuan, Cao, Jinde, Shi, Xinli, Liu, Hanjie, Rutkowski, Leszek
Abstract
Artificial intelligence has advanced significantly through the development of intelligent game-playing systems, providing rigorous testbeds for decision-making, strategic planning, and adaptive learning. However, resource-constrained environments pose critical challenges, as conventional deep learning methods heavily rely on extensive datasets and computational resources. In this paper, we propose a lightweight hybrid framework for the Game of the Amazons, which explores the paradigm of weak-to-strong generalization by integrating the structural reasoning of graph-based learning with the generative capabilities of large language models. Specifically, we leverage a Graph Attention Autoencoder to inform a multi-step Monte Carlo Tree Search, utilize a Stochastic Graph Genetic Algorithm to optimize evaluation signals, and harness GPT-4o-mini to generate synthetic training data. Unlike traditional approaches that rely on expert demonstrations, our framework learns from noisy and imperfect supervision. We demonstrate that the Graph Attention mechanism effectively functions as a structural filter, denoising the LLM's outputs. Experiments on a 10$\times$10 Amazons board show that our hybrid approach not only achieves a 15\%--56\% improvement in decision accuracy over baselines but also significantly outperforms its teacher model (GPT-4o-mini), achieving a competitive win rate of 45.0\% at N=30 nodes and a decisive 66.5\% at only N=50 nodes. These results verify the feasibility of evolving specialized, high-performance game AI from general-purpose foundation models under stringent computational constraints.
Chinese Translation
人工智能通过智能游戏系统的发展取得了显著进展,为决策制定、战略规划和自适应学习提供了严格的测试平台。然而,资源受限的环境带来了关键挑战,因为传统的深度学习方法严重依赖于大量的数据集和计算资源。本文提出了一种轻量级的混合框架,用于亚马逊棋游戏,探索通过将基于图的学习的结构推理与大型语言模型的生成能力相结合的弱到强的泛化范式。具体而言,我们利用图注意力自编码器来指导多步蒙特卡洛树搜索,使用随机图遗传算法来优化评估信号,并利用GPT-4o-mini生成合成训练数据。与依赖专家示范的传统方法不同,我们的框架从嘈杂和不完美的监督中学习。我们证明了图注意力机制有效地作为结构过滤器,去噪大型语言模型的输出。在10×10的亚马逊棋盘上的实验表明,我们的混合方法不仅在决策准确性上比基线提高了15%至56%,而且显著超越了其教师模型(GPT-4o-mini),在N=30节点时达到了45.0%的竞争性胜率,在仅N=50节点时则达到了66.5%的决定性胜率。这些结果验证了在严格的计算约束下,从通用基础模型演化出专业的高性能游戏人工智能的可行性。
cs.AI / 7 / 2603.10521

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

IH-Challenge:一个用于改善前沿大型语言模型指令层次的训练数据集
Guo, Chuan, Uribe, Juan Felipe Ceron, Zhu, Sicheng, Choquette-Choo, Christopher A., Lin, Steph, Kandpal, Nikhil, Nasr, Milad, Rai, Toyer, Sam, Wang, Miles, Yu, Yaodong, Beutel, Alex, Xiao, Kai
Abstract
Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (https://huggingface.co/datasets/openai/ih-challenge) to support future research on robust instruction hierarchy.
Chinese Translation
指令层次(Instruction hierarchy, IH)定义了在冲突情况下,大型语言模型(LLMs)如何优先处理系统、开发者、用户和工具指令,为解决指令冲突提供了一种具体的、按信任排序的策略。IH在防御越狱攻击、系统提示提取和代理提示注入方面至关重要。然而,训练出稳健的IH行为是困难的:IH失败可能与指令遵循失败混淆,冲突可能是细微的,模型可能学习到诸如过度拒绝等捷径。我们提出了IH-Challenge,这是一个强化学习训练数据集,用以解决这些困难。在IH-Challenge上微调GPT-5-Mini,通过在线对抗样本生成,平均提升IH的稳健性10.0%(从84.1%提升至94.1%),将不安全行为从6.6%降低至0.7%,同时在一般安全评估中提高了有用性,并在内部静态代理提示注入评估中达到了饱和,且能力回归最小。我们发布了IH-Challenge数据集(https://huggingface.co/datasets/openai/ih-challenge),以支持未来对稳健指令层次的研究。
cs.AI / 8 / 2603.10564

Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents

通过无奖励自我微调代理的自适应无线接入网切片控制
Li, Yuanhao, Wang, Haozhe, Min, Geyong, Georgalas, Nektarios, Miao, Wang
Abstract
The integration of Generative AI models into AI-native network systems offers a transformative path toward achieving autonomous and adaptive control. However, the application of such models to continuous control tasks is impeded by intrinsic architectural limitations, including finite context windows, the lack of explicit reward signals, and the degradation of the long context. This paper posits that the key to unlocking robust continuous control is enabling agents to internalize experience by distilling it into their parameters, rather than relying on prompt-based memory. To this end, we propose a novel self-finetuning framework that enables agentic systems to learn continuously through direct interaction with the environment, bypassing the need for handcrafted rewards. Our framework implements a bi-perspective reflection mechanism that generates autonomous linguistic feedback to construct preference datasets from interaction history. A subsequent preference-based fine-tuning process distills long-horizon experiences into the model's parameters. We evaluate our approach on a dynamic Radio Access Network (RAN) slicing task, a challenging multi-objective control problem that requires the resolution of acute trade-offs between spectrum efficiency, service quality, and reconfiguration stability under volatile network conditions. Experimental results show that our framework outperforms standard Reinforcement Learning (RL) baselines and existing Large Language Model (LLM)-based agents in sample efficiency, stability, and multi-metric optimization. These findings demonstrate the potential of self-improving generative agents for continuous control tasks, paving the way for future AI-native network infrastructure.
Chinese Translation
将生成性人工智能模型集成到人工智能原生网络系统中,为实现自主和自适应控制提供了一条变革性路径。然而,这些模型在连续控制任务中的应用受到固有架构限制的阻碍,包括有限的上下文窗口、缺乏明确的奖励信号以及长上下文的退化。本文认为,解锁强大连续控制的关键在于使代理能够通过将经验提炼为其参数来内化经验,而不是依赖基于提示的记忆。为此,我们提出了一种新颖的自我微调框架,使代理系统能够通过与环境的直接互动持续学习,绕过手工设计奖励的需求。我们的框架实现了一种双视角反思机制,生成自主语言反馈,以从互动历史中构建偏好数据集。随后的基于偏好的微调过程将长期经验提炼为模型的参数。我们在动态无线接入网(RAN)切片任务上评估了我们的方法,这是一项具有挑战性的多目标控制问题,需要在波动的网络条件下解决频谱效率、服务质量和重配置稳定性之间的尖锐权衡。实验结果表明,我们的框架在样本效率、稳定性和多指标优化方面优于标准强化学习(RL)基线和现有的大型语言模型(LLM)基础代理。这些发现展示了自我改善生成代理在连续控制任务中的潜力,为未来的人工智能原生网络基础设施铺平了道路。
cs.AI / 9 / 2603.10577

CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents

CUAAudit:作为自主计算机使用代理审计员的视觉-语言模型的元评估
Sumyk, Marta, Kosovan, Oleksandr
Abstract
Computer-Use Agents (CUAs) are emerging as a new paradigm in human-computer interaction, enabling autonomous execution of tasks in desktop environment by perceiving high-level natural-language instructions. As such agents become increasingly capable and are deployed across diverse desktop environments, evaluating their behavior in a scalable and reliable manner becomes a critical challenge. Existing evaluation pipelines rely on static benchmarks, rule-based success checks, or manual inspection, which are brittle, costly, and poorly aligned with real-world usage. In this work, we study Vision-Language Models (VLMs) as autonomous auditors for assessing CUA task completion directly from observable interactions and conduct a large-scale meta-evaluation of five VLMs that judge task success given a natural-language instruction and the final environment state. Our evaluation spans three widely used CUA benchmarks across macOS, Windows, and Linux environments and analyzes auditor behavior along three complementary dimensions: accuracy, calibration of confidence estimates, and inter-model agreement. We find that while state-of-the-art VLMs achieve strong accuracy and calibration, all auditors exhibit notable performance degradation in more complex or heterogeneous environments, and even high-performing models show significant disagreement in their judgments. These results expose fundamental limitations of current model-based auditing approaches and highlight the need to explicitly account for evaluator reliability, uncertainty, and variance when deploying autonomous CUAs in real-world settings.
Chinese Translation
计算机使用代理(CUAs)作为人机交互中的新范式,能够通过感知高级自然语言指令在桌面环境中自主执行任务。随着此类代理能力的不断增强并在多样化的桌面环境中部署,以可扩展和可靠的方式评估其行为成为一项关键挑战。现有的评估流程依赖于静态基准、基于规则的成功检查或人工检查,这些方法脆弱、成本高且与实际使用情况不匹配。在本研究中,我们研究了视觉-语言模型(VLMs)作为自主审计员,直接从可观察的交互中评估CUA任务完成情况,并对五个VLMs进行大规模元评估,这些模型根据自然语言指令和最终环境状态判断任务成功。我们的评估涵盖了macOS、Windows和Linux环境中三个广泛使用的CUA基准,并从准确性、置信度估计的校准和模型间一致性三个互补维度分析审计员行为。我们发现,尽管最先进的VLMs在准确性和校准方面表现出色,但所有审计员在更复杂或异构环境中均表现出显著的性能下降,即使是高性能模型在判断上也存在显著分歧。这些结果揭示了当前基于模型的审计方法的基本局限性,并强调在实际环境中部署自主CUAs时,明确考虑评估者的可靠性、不确定性和方差的必要性。
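The three auditor dimensions named above (accuracy, calibration, inter-model agreement) correspond to standard metrics; two of them, expected calibration error and Cohen's kappa for pairwise agreement, are sketched below on fabricated auditor verdicts (the benchmark's actual metric definitions and binning may differ):

```python
def ece(confidences, correct, bins=5):
    """Expected calibration error: per-bin |accuracy - mean confidence|, weighted."""
    total, n = 0.0, len(confidences)
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        total += len(idx) / n * abs(acc - conf)
    return total

def cohen_kappa(a, b):
    """Chance-corrected agreement between two auditors' binary verdicts."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    pa1, pb1 = sum(a) / n, sum(b) / n
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (po - pe) / (1 - pe + 1e-12)

conf = [0.9, 0.8, 0.9, 0.6, 0.3, 0.7, 0.95, 0.5]   # auditor confidence
corr = [1, 1, 1, 1, 0, 0, 1, 0]                     # was the verdict correct?
judge_a = [1, 1, 1, 0, 0, 1, 1, 0]                  # two auditors' verdicts
judge_b = [1, 1, 0, 0, 0, 1, 1, 1]
print(f"ECE: {ece(conf, corr):.3f}")
print(f"kappa(A, B): {cohen_kappa(judge_a, judge_b):.3f}")
```

A kappa well below 1 between individually accurate auditors is exactly the disagreement pattern the meta-evaluation flags: accuracy alone does not certify an auditor as a reliable drop-in judge.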
cs.AI / 10 / 2603.10588

Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning

大型语言模型的对齐是否真的需要多样性?适应RLVR方法进行道德推理的实证研究
Zhang, Zhaowei, Liu, Xiaohan, Zhu, Xuekai, Huang, Junchao, Zhang, Ceyao, Feng, Zhiyuan, Yang, Yaodong, Yi, Xiaoyuan, Xie, Xing
Abstract
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in logical reasoning tasks, yet whether large language model (LLM) alignment requires fundamentally different approaches remains unclear. Given the apparent tolerance for multiple valid responses in moral reasoning, a natural hypothesis is that alignment tasks inherently require diversity-seeking distribution-matching algorithms rather than reward-maximizing policy-based methods. We conduct the first comprehensive empirical study comparing both paradigms on MoReBench. To enable stable RLVR training, we build a rubric-grounded reward pipeline by training a Qwen3-1.7B judge model. Contrary to our hypothesis, we find that distribution-matching approaches do not demonstrate significant advantages over reward-maximizing methods as expected on alignment tasks. Through semantic visualization mapping high-reward responses to semantic space, we demonstrate that moral reasoning exhibits more concentrated high-reward distributions than mathematical reasoning, where diverse solution strategies yield similarly high rewards. This counter-intuitive finding explains why mode-seeking optimization proves equally or more effective for alignment tasks. Our results suggest that alignment tasks do not inherently require diversity-preserving algorithms, and standard reward-maximizing RLVR methods can effectively transfer to moral reasoning without explicit diversity mechanisms.
Chinese Translation
可验证奖励的强化学习(RLVR)在逻辑推理任务中取得了显著成功,但大型语言模型(LLM)的对齐是否需要根本不同的方法仍不明确。考虑到道德推理中对多种有效回应的明显容忍,自然的假设是,对齐任务本质上需要寻求多样性的分布匹配算法,而非基于奖励最大化的策略方法。我们在MoReBench上进行首次全面的实证研究,比较这两种范式。为了实现稳定的RLVR训练,我们通过训练一个Qwen3-1.7B评判模型构建了一个基于标准的奖励管道。与我们的假设相反,我们发现分布匹配方法在对齐任务上并未表现出预期的相较于奖励最大化方法的显著优势。通过语义可视化将高奖励回应映射到语义空间,我们展示了道德推理的高奖励分布比数学推理更为集中,而多样的解决策略在数学推理中产生类似的高奖励。这一反直觉的发现解释了为何寻求模式的优化在对齐任务中同样或更为有效。我们的结果表明,对齐任务并不固有地需要保护多样性的算法,标准的奖励最大化RLVR方法可以有效地转移到道德推理中,而无需明确的多样性机制。
cs.AI / 11 / 2603.10600

Trajectory-Informed Memory Generation for Self-Improving Agent Systems

基于轨迹的记忆生成用于自我改进的智能体系统
Fang, Gaodan, Isahagian, Vatche, Jayaram, K. R., Kumar, Ritesh, Muthusamy, Vinod, Oum, Punleuk, Thomas, Gegi
Abstract
LLM-powered agents face a persistent challenge: learning from their execution experiences to improve future performance. While agents can successfully complete many tasks, they often repeat inefficient patterns, fail to recover from similar errors, and miss opportunities to apply successful strategies from past executions. We present a novel framework for automatically extracting actionable learnings from agent execution trajectories and utilizing them to improve future performance through contextual memory retrieval. Our approach comprises four components: (1) a Trajectory Intelligence Extractor that performs semantic analysis of agent reasoning patterns, (2) a Decision Attribution Analyzer that identifies which decisions and reasoning steps led to failures, recoveries, or inefficiencies, (3) a Contextual Learning Generator that produces three types of guidance -- strategy tips from successful patterns, recovery tips from failure handling, and optimization tips from inefficient but successful executions, and (4) an Adaptive Memory Retrieval System that injects relevant learnings into agent prompts based on multi-dimensional similarity. Unlike existing memory systems that store generic conversational facts, our framework understands execution patterns, extracts structured learnings with provenance, and retrieves guidance tailored to specific task contexts. Evaluation on the AppWorld benchmark demonstrates consistent improvements, with up to 14.3 percentage point gains in scenario goal completion on held-out tasks and particularly strong benefits on complex tasks (28.5~pp scenario goal improvement, a 149\% relative increase).
Chinese Translation
基于大型语言模型(LLM)的智能体面临一个持续的挑战:从其执行经验中学习,以提高未来的表现。尽管智能体能够成功完成许多任务,但它们常常重复低效的模式,无法从类似的错误中恢复,并且错失了应用过去执行中成功策略的机会。我们提出了一种新颖的框架,旨在自动提取智能体执行轨迹中的可操作学习,并利用这些学习通过上下文记忆检索来改善未来的表现。我们的方法包括四个组成部分:(1)轨迹智能提取器,进行智能体推理模式的语义分析;(2)决策归因分析器,识别导致失败、恢复或低效的决策和推理步骤;(3)上下文学习生成器,生成三种类型的指导——来自成功模式的策略提示、来自失败处理的恢复提示,以及来自低效但成功执行的优化提示;(4)自适应记忆检索系统,根据多维相似性将相关学习注入智能体提示中。与现有存储通用对话事实的记忆系统不同,我们的框架理解执行模式,提取具有来源的结构化学习,并检索针对特定任务上下文量身定制的指导。在AppWorld基准上的评估显示出持续的改进,在保留任务上场景目标完成率提高了最多14.3个百分点,尤其在复杂任务上表现出显著的益处(场景目标改善28.5个百分点,相对增加149%)。
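The "multi-dimensional similarity" retrieval in the fourth component can be sketched as a weighted mix of embedding similarity, task-type match, and tool overlap. The dimensions, weights, and memory entries below are illustrative assumptions, not the framework's actual retrieval keys:

```python
import numpy as np

def retrieve(memories, query, weights=(0.5, 0.25, 0.25), k=2):
    """Rank stored learnings by a weighted mix of embedding similarity,
    task-type match, and tool overlap (dimensions are illustrative)."""
    w_emb, w_task, w_tool = weights
    q_vec = np.asarray(query["vec"], float)
    q_vec = q_vec / np.linalg.norm(q_vec)
    scored = []
    for m in memories:
        v = np.asarray(m["vec"], float)
        emb = float(v @ q_vec / np.linalg.norm(v))
        task = 1.0 if m["task"] == query["task"] else 0.0
        tool = len(set(m["tools"]) & set(query["tools"])) / max(len(query["tools"]), 1)
        scored.append((w_emb * emb + w_task * task + w_tool * tool, m["tip"]))
    return [tip for _, tip in sorted(scored, reverse=True)[:k]]

memories = [
    {"task": "email", "tools": ["mail"], "vec": [1, 0],
     "tip": "verify recipient before send"},
    {"task": "shopping", "tools": ["cart", "pay"], "vec": [0, 1],
     "tip": "re-check cart totals"},
    {"task": "email", "tools": ["mail", "files"], "vec": [0.9, 0.3],
     "tip": "retry attachment upload once"},
]
query = {"task": "email", "tools": ["mail"], "vec": [1, 0.1]}
print(retrieve(memories, query))
```

The retrieved tips would then be injected into the agent's prompt, which is how strategy, recovery, and optimization learnings from past trajectories reach a new task without retraining.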
cs.AI / 12 / 2603.10661

FAME: Formal Abstract Minimal Explanation for Neural Networks

FAME:神经网络的形式抽象最小解释
Boumazouza, Ryma, Elsaleh, Raya, Ducoffe, Melanie, Bassan, Shahaf, Katz, Guy
Abstract
We propose FAME (Formal Abstract Minimal Explanations), a new class of abductive explanations grounded in abstract interpretation. FAME is the first method to scale to large neural networks while reducing explanation size. Our main contribution is the design of dedicated perturbation domains that eliminate the need for traversal order. FAME progressively shrinks these domains and leverages LiRPA-based bounds to discard irrelevant features, ultimately converging to a formal abstract minimal explanation. To assess explanation quality, we introduce a procedure that measures the worst-case distance between an abstract minimal explanation and a true minimal explanation. This procedure combines adversarial attacks with an optional VERIX+ refinement step. We benchmark FAME against VERIX+ and demonstrate consistent gains in both explanation size and runtime on medium- to large-scale neural networks.
Chinese Translation
我们提出了FAME(形式抽象最小解释),这是一种基于抽象解释的新型溯因解释。FAME是第一个能够扩展到大型神经网络并减少解释规模的方法。我们的主要贡献是设计了专门的扰动域,消除了遍历顺序的需求。FAME逐步缩小这些域,并利用基于LiRPA的界限来丢弃无关特征,最终收敛到一个形式抽象最小解释。为了评估解释质量,我们引入了一种程序,测量抽象最小解释与真实最小解释之间的最坏情况距离。该程序结合了对抗攻击和可选的VERIX+精炼步骤。我们将FAME与VERIX+进行了基准测试,并在中到大型神经网络上展示了解释规模和运行时间的持续提升。
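The core loop of abstract-interpretation-based abductive explanation can be shown on a toy case: a feature leaves the explanation if letting it range over its whole domain provably cannot flip the prediction. FAME uses LiRPA bounds on neural networks; the sketch below substitutes exact interval bounds on a linear scorer, so it is an illustration of the shrinking idea under that stated simplification, not the paper's method:

```python
def interval_output(w, b, lo, hi):
    """Exact output bounds of a linear scorer w.x + b with each x_i in [lo_i, hi_i]."""
    low = b + sum(wi * (l if wi >= 0 else h) for wi, l, h in zip(w, lo, hi))
    high = b + sum(wi * (h if wi >= 0 else l) for wi, l, h in zip(w, lo, hi))
    return low, high

def minimal_explanation(w, b, x, domain=(0.0, 1.0)):
    """Greedily free features: one is dropped from the explanation if its full
    perturbation range provably keeps the (positive) prediction unchanged."""
    assert sum(wi * xi for wi, xi in zip(w, x)) + b > 0, "expects a positive prediction"
    fixed = set(range(len(x)))
    for i in range(len(x)):
        trial = fixed - {i}
        lo = [x[j] if j in trial else domain[0] for j in range(len(x))]
        hi = [x[j] if j in trial else domain[1] for j in range(len(x))]
        if interval_output(w, b, lo, hi)[0] > 0:   # still positive in the worst case
            fixed = trial
    return sorted(fixed)

w, b = [3.0, -0.5, 0.2, 2.0], -3.5
x = [1.0, 0.2, 0.5, 0.9]        # scored positive
print("explanation features:", minimal_explanation(w, b, x))
```

Features 1 and 2 are freed because the worst-case lower bound stays positive without them, leaving a subset-minimal explanation; with sound but incomplete bounds (as with LiRPA on a network), the same loop yields an abstract minimal explanation that may conservatively retain extra features.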
cs.AI / 13 / 2603.10677

Emulating Clinician Cognition via Self-Evolving Deep Clinical Research

通过自我进化的深度临床研究模拟临床医生认知
Ren, Ruiyang, Wang, Yuhao, Liang, Yunsen, Luo, Lan, Liu, Jing, Wang, Haifeng, Feng, Cong, Zhang, Yinan, Miao, Chunyan, Wen, Ji-Rong, Zhao, Wayne Xin
Abstract
Clinical diagnosis is a complex cognitive process, grounded in dynamic cue acquisition and continuous expertise accumulation. Yet most current artificial intelligence (AI) systems are misaligned with this reality, treating diagnosis as single-pass retrospective prediction while lacking auditable mechanisms for governed improvement. We developed DxEvolve, a self-evolving diagnostic agent that bridges these gaps through an interactive deep clinical research workflow. The framework autonomously requisitions examinations and continually externalizes clinical experience from increasing encounter exposure as diagnostic cognition primitives. On the MIMIC-CDM benchmark, DxEvolve improved diagnostic accuracy by 11.2% on average over backbone models and reached 90.4% on a reader-study subset, comparable to the clinician reference (88.8%). DxEvolve improved accuracy on an independent external cohort by 10.2% (categories covered by the source cohort) and 17.1% (uncovered categories) compared to the competitive method. By transforming experience into a governable learning asset, DxEvolve supports an accountable pathway for the continual evolution of clinical AI.
Chinese Translation
临床诊断是一个复杂的认知过程,基于动态线索获取和持续的专业知识积累。然而,目前大多数人工智能(AI)系统与这一现实不符,将诊断视为单次回顾性预测,缺乏可审计的改进机制。我们开发了DxEvolve,这是一种自我进化的诊断代理,通过交互式深度临床研究工作流程弥补了这些差距。该框架自主请求检查,并不断从增加的接触曝光中外化临床经验,作为诊断认知的基本元素。在MIMIC-CDM基准测试中,DxEvolve在基础模型的基础上平均提高了11.2%的诊断准确率,并在读者研究子集中达到了90.4%,与临床医生参考值(88.8%)相当。与竞争方法相比,DxEvolve在一个独立的外部队列上提高了10.2%的准确率(源队列覆盖的类别)和17.1%(未覆盖的类别)。通过将经验转化为可管理的学习资产,DxEvolve支持临床AI持续进化的可追责路径。
cs.AI / 14 / 2603.10808

Nurture-First Agent Development: Building Domain-Expert AI Agents Through Conversational Knowledge Crystallization

以培养为先的智能体开发:通过对话知识结晶构建领域专家AI智能体
Zhang, Linghao
Abstract
The emergence of large language model (LLM)-based agent frameworks has shifted the primary challenge in building domain-expert AI agents from raw capability to effective encoding of domain expertise. Two dominant paradigms -- code-first development, which embeds expertise in deterministic pipelines, and prompt-first development, which captures expertise in static system prompts -- both treat agent construction as a discrete engineering phase preceding deployment. We argue that this sequential assumption creates a fundamental mismatch with the nature of domain expertise, which is substantially tacit, deeply personal, and continuously evolving. We propose Nurture-First Development (NFD), a paradigm in which agents are initialized with minimal scaffolding and progressively grown through structured conversational interaction with domain practitioners. The central mechanism is the Knowledge Crystallization Cycle, whereby fragmented knowledge embedded in operational dialogue is periodically consolidated into structured, reusable knowledge assets. We formalize NFD through: (1) a Three-Layer Cognitive Architecture organizing agent knowledge by volatility and personalization degree; (2) the Knowledge Crystallization Cycle with formal definitions of crystallization operations and efficiency metrics; and (3) an operational framework comprising a Dual-Workspace Pattern and Spiral Development Model. We illustrate the paradigm through a detailed case study on building a financial research agent for U.S. equity analysis and discuss the conditions, limitations, and broader implications of NFD for human-agent co-evolution.
Chinese Translation
基于大型语言模型(LLM)的智能体框架的出现,将构建领域专家AI智能体的主要挑战从原始能力转向有效编码领域专业知识。两种主导范式——代码优先开发(code-first development),将专业知识嵌入确定性流程中;提示优先开发(prompt-first development),在静态系统提示中捕捉专业知识——都将智能体构建视为部署前的离散工程阶段。我们认为,这种顺序假设与领域专业知识的本质存在根本不匹配,因为领域专业知识在很大程度上是隐性、深度个人化且持续演变的。我们提出了以培养为先的开发(Nurture-First Development, NFD)范式,在该范式中,智能体以最小的支架初始化,并通过与领域从业者的结构化对话互动逐步成长。其核心机制是知识结晶循环(Knowledge Crystallization Cycle),通过该机制,嵌入操作对话中的碎片化知识定期整合为结构化、可重用的知识资产。我们通过以下方式对NFD进行了形式化:(1)三层认知架构(Three-Layer Cognitive Architecture),按波动性和个性化程度组织智能体知识;(2)知识结晶循环,包含结晶操作和效率指标的正式定义;(3)一个操作框架,包括双工作区模式(Dual-Workspace Pattern)和螺旋发展模型(Spiral Development Model)。我们通过对构建美国股票分析金融研究智能体的详细案例研究来说明这一范式,并讨论NFD在人机共同演化中的条件、局限性和更广泛的影响。
cs.AI / 15 / 2603.10891

A Hybrid Knowledge-Grounded Framework for Safety and Traceability in Prescription Verification

一种混合知识基础框架用于处方验证中的安全性和可追溯性
Zhu, Yichi, Ling, Kan, Liu, Xu, Zhang, Hengrun, Yu, Huiqun, Fan, Guisheng
Abstract
Medication errors pose a significant threat to patient safety, making pharmacist verification (PV) a critical, yet heavily burdened, final safeguard. The direct application of Large Language Models (LLMs) to this zero-tolerance domain is untenable due to their inherent factual unreliability, lack of traceability, and weakness in complex reasoning. To address these challenges, we introduce PharmGraph-Auditor, a novel system designed for safe and evidence-grounded prescription auditing. The core of our system is a trustworthy Hybrid Pharmaceutical Knowledge Base (HPKB), implemented under the Virtual Knowledge Graph (VKG) paradigm. This architecture strategically unifies a relational component for set constraint satisfaction and a graph component for topological reasoning via a rigorous mapping layer. To construct this HPKB, we propose the Iterative Schema Refinement (ISR) algorithm, a framework that enables the co-evolution of both graph and relational schemas from medical texts. For auditing, we introduce the KB-grounded Chain of Verification (CoV), a new reasoning paradigm that transforms the LLM from an unreliable generator into a transparent reasoning engine. CoV decomposes the audit task into a sequence of verifiable queries against the HPKB, generating hybrid query plans to retrieve evidence from the most appropriate data store. Experimental results demonstrate robust knowledge extraction capabilities and show promises of using PharmGraph-Auditor to enable pharmacists to achieve safer and faster prescription verification.
Chinese Translation
用药错误对患者安全构成重大威胁,使得药剂师验证(PV)成为一个关键但负担沉重的最终防线。由于大型语言模型(LLMs)固有的事实不可靠性、缺乏可追溯性以及在复杂推理中的弱点,直接将其应用于这一零容忍领域是不可行的。为了解决这些挑战,我们提出了PharmGraph-Auditor,一个旨在安全且基于证据的处方审计的新系统。我们系统的核心是一个可靠的混合药物知识库(HPKB),在虚拟知识图(VKG)范式下实现。该架构战略性地统一了用于集合约束满足的关系组件和用于拓扑推理的图形组件,通过严格的映射层进行连接。为了构建这个HPKB,我们提出了迭代模式细化(ISR)算法,这是一个能够使图形和关系模式从医学文本中共同演化的框架。对于审计,我们引入了基于知识库的验证链(CoV),这是一种新的推理范式,将LLM从一个不可靠的生成器转变为一个透明的推理引擎。CoV将审计任务分解为一系列可验证的查询,针对HPKB生成混合查询计划,以从最合适的数据存储中检索证据。实验结果展示了强大的知识提取能力,并显示出使用PharmGraph-Auditor使药剂师实现更安全、更快速的处方验证的潜力。
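The KB-grounded Chain of Verification described above decomposes an audit into atomic, verifiable queries against structured stores. A minimal sketch of that idea, with an invented rule table, check set, and field names (this is not the paper's HPKB schema):

```python
# Hedged sketch of a chain-of-verification audit: each check is answered
# by a lookup against a structured store rather than by free generation.
# KB contents, drug names, and limits below are invented for illustration.

KB = {
    ("warfarin", "aspirin"): "major interaction: bleeding risk",
    ("max_daily_dose", "paracetamol"): 4000,  # mg, illustrative limit
}

def audit(prescription):
    """prescription: list of {'name': str, 'daily_mg': number}."""
    findings = []
    drugs = [d["name"] for d in prescription]
    # Check 1: pairwise interaction lookup (graph-style query).
    for i, a in enumerate(drugs):
        for b in drugs[i + 1:]:
            hit = KB.get((a, b)) or KB.get((b, a))
            if hit:
                findings.append(f"{a}+{b}: {hit}")
    # Check 2: dose-limit lookup (relational-style query).
    for d in prescription:
        limit = KB.get(("max_daily_dose", d["name"]))
        if limit is not None and d["daily_mg"] > limit:
            findings.append(f"{d['name']}: {d['daily_mg']} mg exceeds {limit} mg")
    return findings

print(audit([{"name": "warfarin", "daily_mg": 5},
             {"name": "aspirin", "daily_mg": 100}]))
```

Each finding is traceable to a specific KB entry, which is the traceability property the abstract emphasizes.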
计算语言学 (Computation and Language)
76
cs.CL / 1 / 2603.09979

GhazalBench: Usage-Grounded Evaluation of LLMs on Persian Ghazals

GhazalBench:基于使用的波斯加扎勒诗(ghazal)大语言模型评估
Kalhor, Ghazal, Yaghoobzadeh, Yadollah
Abstract
Persian poetry plays an active role in Iranian cultural practice, where verses by canonical poets such as Hafez are frequently quoted, paraphrased, or completed from partial cues. Supporting such interactions requires language models to engage not only with poetic meaning but also with culturally entrenched surface form. We introduce GhazalBench, a benchmark for evaluating how large language models (LLMs) interact with Persian ghazals under usage-grounded conditions. GhazalBench assesses two complementary abilities: producing faithful prose paraphrases of couplets and accessing canonical verses under varying semantic and formal cues. Across several proprietary and open-weight multilingual LLMs, we observe a consistent dissociation: models generally capture poetic meaning but struggle with exact verse recall in completion-based settings, while recognition-based tasks substantially reduce this gap. A parallel evaluation on English sonnets shows markedly higher recall performance, suggesting that these limitations are tied to differences in training exposure rather than inherent architectural constraints. Our findings highlight the need for evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts. GhazalBench is available at https://github.com/kalhorghazal/GhazalBench.
Chinese Translation
波斯诗歌在伊朗文化实践中发挥着积极作用,经典诗人如哈菲兹(Hafez)的诗句常被引用、改写或根据部分线索进行补全。支持这种互动需要语言模型不仅能够理解诗意,还要与文化根植的表面形式进行交互。我们介绍了GhazalBench,一个用于评估大型语言模型(LLMs)在基于使用的条件下如何与波斯加扎勒诗互动的基准。GhazalBench评估两种互补能力:生成对句的忠实散文改写,以及在不同语义和形式线索下访问经典诗句。在多个专有和开放权重的多语言LLMs中,我们观察到一种一致的分离现象:模型通常能够捕捉诗意,但在基于补全的设置中难以准确回忆诗句,而基于识别的任务则显著缩小了这一差距。对英语十四行诗的平行评估显示出明显更高的回忆表现,表明这些局限性与训练暴露的差异有关,而非固有的架构限制。我们的研究结果强调了需要评估框架来共同评估文化重要文本的意义、形式和线索依赖的访问。GhazalBench可在https://github.com/kalhorghazal/GhazalBench获取。
cs.CL / 2 / 2603.09981

Large Language Models and Book Summarization: Reading or Remembering, Which Is Better?

大型语言模型与图书摘要:阅读还是记忆,哪种更好?
Fu, Tairan, Conde, Javier, Reviriego, Pedro, Coronado-Blázquez, Javier, Melero, Nina, Merino-Gómez, Elena
Abstract
Summarization is a core task in Natural Language Processing (NLP). Recent advances in Large Language Models (LLMs) and the introduction of large context windows reaching millions of tokens make it possible to process entire books in a single prompt. At the same time, for well-known books, LLMs can generate summaries based only on internal knowledge acquired during training. This raises several important questions: How do summaries generated from internal memory compare to those derived from the full text? Does prior knowledge influence summaries even when the model is given the book as input? In this work, we conduct an experimental evaluation of book summarization with state-of-the-art LLMs. We compare summaries of well-known books produced using (i) only the internal knowledge of the model and (ii) the full text of the book. The results show that having the full text generally yields more detailed summaries, but some books score better with internal-knowledge summaries. This calls into question models' ability to summarize long texts, since information learned during training can, in some cases, outperform summarization of the full text.
Chinese Translation
摘要是自然语言处理(NLP)中的核心任务。大型语言模型(LLMs)的最新进展以及大上下文窗口的引入,使得在单个提示中处理整本书成为可能。同时,对于知名书籍,LLMs可以仅基于训练期间获得的内部知识生成摘要。这引发了几个重要问题:从内部记忆生成的摘要与从完整文本派生的摘要相比如何?即使模型提供了书籍作为输入,先前的知识是否会影响摘要?在本研究中,我们对使用最先进的LLMs进行图书摘要进行了实验评估。我们比较了使用(i)模型的内部知识和(ii)书籍的完整文本生成的知名书籍摘要。结果表明,拥有完整文本通常提供了更详细的摘要,但一些书籍的内部知识摘要得分更高。这质疑了模型在长文本摘要方面的能力,因为在某些情况下,训练期间学习到的信息可以超越完整文本的摘要。
cs.CL / 3 / 2603.09982

AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic

AraModernBERT:阿拉伯语的转标记初始化与长上下文编码器建模
Elshehy, Omar, Nacar, Omer, Djamai, Abdelbasset, Ragab, Muhammed, Jallad, Khloud Al, Abdelazim, Mona
Abstract
Encoder-only transformer models remain widely used for discriminative NLP tasks, yet recent architectural advances have largely focused on English. In this work, we present AraModernBERT, an adaptation of the ModernBERT encoder architecture to Arabic, and study the impact of transtokenized embedding initialization and native long-context modeling up to 8,192 tokens. We show that transtokenization is essential for Arabic language modeling, yielding dramatic improvements in masked language modeling performance compared to non-transtokenized initialization. We further demonstrate that AraModernBERT supports stable and effective long-context modeling, achieving improved intrinsic language modeling performance at extended sequence lengths. Downstream evaluations on Arabic natural language understanding tasks, including inference, offensive language detection, question-question similarity, and named entity recognition, confirm strong transfer to discriminative and sequence labeling settings. Our results highlight practical considerations for adapting modern encoder architectures to Arabic and other languages written in Arabic-derived scripts.
Chinese Translation
仅编码器的变换器模型在区分性自然语言处理任务中仍然被广泛使用,但最近的架构进展主要集中在英语上。在本研究中,我们提出了AraModernBERT,这是对ModernBERT编码器架构在阿拉伯语中的适应,并研究了转标记嵌入初始化和本地长上下文建模(最长可达8,192个标记)的影响。我们表明,转标记化对于阿拉伯语建模至关重要,与非转标记初始化相比,显著提高了掩蔽语言建模的性能。我们进一步证明,AraModernBERT支持稳定有效的长上下文建模,在扩展序列长度时实现了更好的内在语言建模性能。在阿拉伯语自然语言理解任务的下游评估中,包括推理、攻击性语言检测、问题相似性和命名实体识别,确认了在区分性和序列标注设置中的强转移。我们的结果突出了将现代编码器架构适应于阿拉伯语及其他使用阿拉伯衍生书写系统的语言的实际考虑。
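One common reading of transtokenized embedding initialization is sketched below: each token in the new (Arabic) vocabulary starts as an alignment-weighted average of old-vocabulary embeddings. The alignment weights and token names are invented, and the paper's exact procedure may differ; this only illustrates the general mechanism.

```python
# Hedged sketch of transtokenized initialization: a new-vocabulary token's
# embedding is a weighted average of the old-vocabulary embeddings it
# aligns with. Real pipelines derive the weights from a cross-lingual
# token alignment; everything here is a toy assumption.

def transtokenize_embeddings(old_emb, alignment):
    """old_emb: token -> vector; alignment: new token -> [(old token, weight)],
    with weights summing to 1 for each new token."""
    dim = len(next(iter(old_emb.values())))
    new_emb = {}
    for tok, pairs in alignment.items():
        vec = [0.0] * dim
        for old_tok, w in pairs:
            for i, v in enumerate(old_emb[old_tok]):
                vec[i] += w * v
        new_emb[tok] = vec
    return new_emb

old = {"book": [1.0, 0.0], "write": [0.0, 2.0]}
align = {"kitab": [("book", 1.0)], "kataba": [("book", 0.5), ("write", 0.5)]}
print(transtokenize_embeddings(old, align)["kataba"])  # -> [0.5, 1.0]
```

Starting from informed averages rather than random vectors is what makes the masked-LM warm start effective, per the abstract's ablation.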
cs.CL / 4 / 2603.09984

An Efficient Hybrid Deep Learning Approach for Detecting Online Abusive Language

一种高效的混合深度学习方法用于检测在线辱骂性语言
Ngo, Vuong M., Dang, Cach N., Nguyen, Kien V., Roantree, Mark
Abstract
The digital age has expanded social media and online forums, allowing free expression for nearly 45% of the global population. Yet, it has also fueled online harassment, bullying, and harmful behaviors like hate speech and toxic comments across social networks, messaging apps, and gaming communities. Studies show 65% of parents notice hostile online behavior, and one-third of adolescents in mobile games experience bullying. A substantial volume of abusive content is generated and shared daily, not only on the surface web but also within dark web forums. Creators of abusive comments often employ specific words or coded phrases to evade detection and conceal their intentions. To address these challenges, we propose a hybrid deep learning model that integrates BERT, CNN, and LSTM architectures with a ReLU activation function to detect abusive language across multiple online platforms, including YouTube comments, online forum discussions, and dark web posts. The model demonstrates strong performance on a diverse and imbalanced dataset containing 77,620 abusive and 272,214 non-abusive text samples (ratio 1:3.5), achieving approximately 99% across evaluation metrics such as Precision, Recall, Accuracy, F1-score, and AUC. This approach effectively captures semantic, contextual, and sequential patterns in text, enabling robust detection of abusive content even in highly skewed datasets, as encountered in real-world scenarios.
Chinese Translation
数字时代的到来扩展了社交媒体和在线论坛,使近45%的人口能够自由表达。然而,这也助长了在线骚扰、欺凌以及仇恨言论和有害评论等行为在社交网络、消息应用和游戏社区中的蔓延。研究表明,65%的家长注意到敌对的在线行为,三分之一的青少年在手机游戏中经历过欺凌。每天产生并分享的大量辱骂性内容不仅存在于表层网络中,还在暗网论坛中传播。辱骂性评论的创作者通常使用特定的词汇或编码短语来逃避检测并隐藏其意图。为应对这些挑战,我们提出了一种混合深度学习模型,该模型将BERT、CNN和LSTM架构与ReLU激活函数相结合,以检测多个在线平台(包括YouTube评论、在线论坛讨论和暗网帖子)中的辱骂性语言。该模型在一个包含77,620条辱骂性和272,214条非辱骂性文本样本(比例为1:3.5)的多样且不平衡的数据集上表现出色,在精准度、召回率、准确率、F1-score和AUC等评估指标上均达到约99%的效果。这种方法有效捕捉文本中的语义、上下文和序列模式,使其能够在现实场景中,即使在高度偏斜的数据集中,也能稳健地检测辱骂性内容。
cs.CL / 5 / 2603.09985

The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration

大型语言模型中的达宁-克鲁格效应:信心校准的实证研究
Ghosh, Sudipta, Panday, Mrityunjoy
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their ability to accurately assess their own confidence remains poorly understood. We present an empirical study investigating whether LLMs exhibit patterns reminiscent of the Dunning-Kruger effect -- a cognitive bias where individuals with limited competence tend to overestimate their abilities. We evaluate four state-of-the-art models (Claude Haiku 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, and Kimi K2) across four benchmark datasets totaling 24,000 experimental trials. Our results reveal striking calibration differences: Kimi K2 exhibits severe overconfidence with an Expected Calibration Error (ECE) of 0.726 despite only 23.3% accuracy, while Claude Haiku 4.5 achieves the best calibration (ECE = 0.122) with 75.4% accuracy. These findings demonstrate that poorly performing models display markedly higher overconfidence -- a pattern analogous to the Dunning-Kruger effect in human cognition. We discuss implications for safe deployment of LLMs in high-stakes applications.
Chinese Translation
大型语言模型(LLMs)在各种任务中展现了显著的能力,但它们准确评估自身信心的能力仍然不甚了解。我们进行了一项实证研究,探讨LLMs是否表现出类似于达宁-克鲁格效应的模式——一种认知偏差,指的是能力有限的个体往往高估自己的能力。我们评估了四个最先进的模型(Claude Haiku 4.5、Gemini 2.5 Pro、Gemini 2.5 Flash 和 Kimi K2),在四个基准数据集上进行了共计24,000次实验。我们的结果揭示了显著的校准差异:尽管Kimi K2的准确率仅为23.3%,但其预期校准误差(Expected Calibration Error, ECE)高达0.726,表现出严重的过度自信;而Claude Haiku 4.5以75.4%的准确率实现了最佳校准(ECE = 0.122)。这些发现表明,表现不佳的模型显示出明显更高的过度自信,这一模式与人类认知中的达宁-克鲁格效应相似。我们讨论了在高风险应用中安全部署LLMs的相关影响。
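Expected Calibration Error, the headline metric above, is computed from per-question confidences and correctness by binning. A minimal sketch; the bin count and the toy predictions are illustrative assumptions, not data from the paper:

```python
# Sketch of Expected Calibration Error (ECE), the metric the abstract
# reports (e.g. ECE = 0.726 for Kimi K2 despite 23.3% accuracy).

def expected_calibration_error(confidences, corrects, n_bins=10):
    """ECE = sum_b (|B_b|/N) * |accuracy(B_b) - mean confidence(B_b)|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, corrects):
        # Map a confidence in [0, 1] to a bin; clamp 1.0 into the last bin.
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

# An overconfident model: high stated confidence, mostly wrong answers.
confs = [0.95, 0.90, 0.92, 0.88, 0.97, 0.91]
hits = [1, 0, 0, 0, 1, 0]
print(round(expected_calibration_error(confs, hits), 3))  # -> 0.588
```

A large ECE with high confidence and low accuracy is exactly the overconfidence pattern the study ties to the Dunning-Kruger analogy.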
cs.CL / 6 / 2603.09986

Quantifying Hallucinations in Large Language Models on Medical Textbooks

量化大型语言模型在医学教科书中的幻觉现象
Colelough, Brandon C., Bartels, Davis, Demner-Fushman, Dina
Abstract
Hallucination, the tendency of large language models to provide responses with factually incorrect and unsupported claims, is a serious problem within natural language processing for which we do not yet have an effective mitigation. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments: the first determines the prevalence of hallucinations for a prominent open-source large language model (LLaMA-70B-Instruct) in medical QA given novel prompts, and the second determines the prevalence of hallucinations and clinician preference across model responses. In experiment one, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6 to 20.7) even though 98.8% of prompt responses received maximal plausibility. In experiment two, across models, lower hallucination rates aligned with higher usefulness scores ($\rho=-0.71$, $p=0.058$). Clinicians produced high agreement in experiment 1 (quadratic weighted $\kappa=0.92$) and agreement of $\tau_b=0.06$ to $0.18$ and $\kappa=0.57$ to $0.61$ in experiment 2.
Chinese Translation
幻觉是指大型语言模型倾向于提供事实不准确和缺乏支持的回应,这在自然语言处理领域中是一个严重的问题,目前我们尚未找到有效的解决方案来减轻这一现象。现有的医学问答基准很少针对固定证据来源评估这种行为。我们探讨了在基于教科书的问答中,幻觉现象的发生频率以及不同模型对医学问答提示的响应差异。我们进行了两项实验:第一项实验旨在确定在给定新提示的情况下,著名开源大型语言模型(LLaMA-70B-Instruct)在医学问答中的幻觉发生率;第二项实验旨在确定幻觉的发生率以及临床医生对模型响应的偏好。在第一项实验中,我们观察到,LLaMA-70B-Instruct 在提供的段落中有 19.7% 的答案出现幻觉(95% 置信区间 18.6 至 20.7),尽管 98.8% 的提示响应获得了最高的合理性;在第二项实验中,我们观察到,模型之间的较低幻觉发生率与更高的有用性评分相关($\rho=-0.71$,$p=0.058$)。临床医生在实验1中表现出高度一致性(二次加权 $\kappa=0.92$),在实验2中为 $\tau_b=0.06$ 至 $0.18$、$\kappa=0.57$ 至 $0.61$。
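The quadratic weighted kappa used above for clinician agreement has a standard closed form over the confusion matrix of two raters' ordinal labels. A stdlib sketch with invented ratings (not study data):

```python
# Quadratic weighted Cohen's kappa: disagreement cost grows with the
# squared distance between ordinal categories. Ratings below are toy data.

def quadratic_weighted_kappa(a, b, n_cats):
    observed = [[0] * n_cats for _ in range(n_cats)]
    for x, y in zip(a, b):
        observed[x][y] += 1
    n = len(a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_cats)) for j in range(n_cats)]
    num = den = 0.0
    for i in range(n_cats):
        for j in range(n_cats):
            w = (i - j) ** 2 / (n_cats - 1) ** 2  # quadratic disagreement weight
            expected = hist_a[i] * hist_b[j] / n   # chance-level cell count
            num += w * observed[i][j]
            den += w * expected
    return 1.0 - num / den

print(quadratic_weighted_kappa([0, 1, 2, 0, 1], [0, 1, 2, 0, 1], 3))  # -> 1.0
```

Perfect agreement gives kappa = 1; values near the paper's 0.92 indicate the two raters almost never disagree, and rarely by more than one category.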
cs.CL / 7 / 2603.09987

Evolving Demonstration Optimization for Chain-of-Thought Feature Transformation

链式思维特征转化的进化示范优化
Wang, Xinyuan, Liu, Kunpeng, Malarkkan, Arun Vignesh, Fu, Yanjie
Abstract
Feature Transformation (FT) is a core data-centric AI task that improves feature space quality to advance downstream predictive performance. However, discovering effective transformations remains challenging due to the large space of feature-operator combinations. Existing solutions rely on discrete search or latent generation, but they are frequently limited by sample inefficiency, invalid candidates, and redundant generations with limited coverage. Large Language Models (LLMs) offer strong priors for producing valid transformations, but current LLM-based FT methods typically rely on static demonstrations, resulting in limited diversity, redundant outputs, and weak alignment with downstream objectives. We propose a framework that optimizes context data for LLM-driven FT by evolving trajectory-level experiences in a closed loop. Starting from high-performing feature transformation sequences explored by reinforcement learning, we construct and continuously update an experience library of downstream task-verified transformation trajectories, and use a diversity-aware selector to form contexts along with a chain-of-thought and guide transformed feature generation toward higher performance. Experiments on diverse tabular benchmarks show that our method outperforms classical and LLM-based baselines and is more stable than one-shot generation. The framework generalizes across API-based and open-source LLMs and remains robust across downstream evaluators.
Chinese Translation
特征转化(Feature Transformation, FT)是一个核心的数据驱动人工智能任务,旨在改善特征空间的质量,以提升下游预测性能。然而,由于特征操作组合的庞大空间,发现有效的转化仍然具有挑战性。现有解决方案依赖于离散搜索或潜在生成,但通常受到样本效率低、无效候选和覆盖范围有限的冗余生成的限制。大型语言模型(Large Language Models, LLMs)为生成有效转化提供了强有力的先验,但当前基于LLM的FT方法通常依赖于静态示范,导致多样性有限、输出冗余以及与下游目标的对齐较弱。我们提出了一种框架,通过在闭环中进化轨迹级经验,优化LLM驱动的FT的上下文数据。从通过强化学习探索的高性能特征转化序列开始,我们构建并持续更新一个经过下游任务验证的转化轨迹的经验库,并使用一个关注多样性的选择器来形成上下文,结合链式思维,引导转化特征生成朝向更高的性能。在多样的表格基准测试中的实验表明,我们的方法优于经典和基于LLM的基线,并且比一次性生成更稳定。该框架在基于API和开源LLM之间具有良好的通用性,并在下游评估者中保持稳健性。
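A diversity-aware selector like the one mentioned above can be sketched as a greedy max-min pick over trajectory embeddings, seeded by the best-scoring trajectory. The scores, embeddings, and distance choice below are invented; the paper's actual selection criterion may differ.

```python
# Toy sketch: seed with the highest-scoring trajectory, then repeatedly add
# the candidate farthest (max-min squared distance) from everything chosen,
# trading raw score for context diversity.

def select_demos(candidates, k):
    """candidates: list of (score, embedding) pairs."""
    chosen = [max(candidates, key=lambda c: c[0])]
    rest = [c for c in candidates if c is not chosen[0]]
    while len(chosen) < k and rest:
        def min_dist(c):
            return min(
                sum((a - b) ** 2 for a, b in zip(c[1], s[1])) for s in chosen
            )
        best = max(rest, key=min_dist)  # farthest from the current set
        chosen.append(best)
        rest.remove(best)
    return chosen

cands = [(0.9, (0.0, 0.0)), (0.8, (0.1, 0.0)), (0.5, (5.0, 5.0))]
print(select_demos(cands, 2))
```

Note that the second pick is the distant low-scoring trajectory rather than the near-duplicate high scorer, which is the redundancy-avoidance behavior the abstract motivates.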
cs.CL / 8 / 2603.09988

Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations

具有真实自然语言解释的因果基础机制可解释性在大型语言模型中的应用
Mahale, Ajay Pravin
Abstract
Mechanistic interpretability identifies internal circuits responsible for model behaviors, yet translating these findings into human-understandable explanations remains an open problem. We present a pipeline that bridges circuit-level analysis and natural language explanations by (i) identifying causally important attention heads via activation patching, (ii) generating explanations using both template-based and LLM-based methods, and (iii) evaluating faithfulness using ERASER-style metrics adapted for circuit-level attribution. We evaluate on the Indirect Object Identification (IOI) task in GPT-2 Small (124M parameters), identifying six attention heads accounting for 61.4% of the logit difference. Our circuit-based explanations achieve 100% sufficiency but only 22% comprehensiveness, revealing distributed backup mechanisms. LLM-generated explanations outperform template baselines by 64% on quality metrics. We find no correlation (r = 0.009) between model confidence and explanation faithfulness, and identify three failure categories explaining when explanations diverge from mechanisms.
Chinese Translation
机制可解释性识别出负责模型行为的内部电路,但将这些发现转化为人类可理解的解释仍然是一个未解决的问题。我们提出了一种将电路级分析与自然语言解释相结合的流程,具体包括:(i) 通过激活补丁识别因果重要的注意力头,(ii) 使用基于模板和基于大型语言模型(LLM)的方法生成解释,以及 (iii) 使用适应于电路级归因的ERASER风格指标评估解释的真实性。我们在GPT-2 Small(124M参数)的间接宾语识别(IOI)任务上进行了评估,识别出六个注意力头,占据了61.4%的logit差异。我们的电路基础解释实现了100%的充分性,但仅有22%的全面性,揭示了分布式备份机制。LLM生成的解释在质量指标上比模板基线提高了64%。我们发现模型置信度与解释真实性之间没有相关性(r = 0.009),并识别出三种失败类别,解释了何时解释与机制不一致。
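The sufficiency and comprehensiveness numbers above can be illustrated with a toy additive model of per-head contributions: sufficiency keeps only the circuit heads active, comprehensiveness ablates them. The head names and contribution values are invented, and a real harness would patch activations in the model rather than sum a table.

```python
# Hedged sketch of ERASER-style sufficiency/comprehensiveness adapted to
# circuit-level attribution. All heads and contributions are toy assumptions.

ALL_HEADS = frozenset({"L9H6", "L9H9", "L10H0", "L3H0"})
CONTRIB = {"L9H6": 2.0, "L9H9": 1.5, "L10H0": 1.0, "L3H0": 0.5}

def logit_diff(keep=None):
    """Toy stand-in for a patched forward pass: logit difference with only
    `keep` heads active (None = full model)."""
    heads = ALL_HEADS if keep is None else keep
    return sum(CONTRIB[h] for h in heads)

def sufficiency(circuit):
    """Fraction of the logit difference recovered by the circuit alone."""
    return logit_diff(keep=circuit) / logit_diff()

def comprehensiveness(circuit):
    """Fraction of the logit difference lost when the circuit is ablated."""
    full = logit_diff()
    return (full - logit_diff(keep=ALL_HEADS - circuit)) / full
```

In this additive toy the two metrics coincide; the paper's 100% sufficiency versus 22% comprehensiveness gap arises precisely because real models are not additive and carry distributed backup mechanisms.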
cs.CL / 9 / 2603.09989

The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models

系统幻觉量表(SHS):一种简约而有效的人本中心工具,用于评估大型语言模型中的幻觉相关行为
Müller, Heimo, Steiger, Dominik, Plass, Markus, Holzinger, Andreas
Abstract
We introduce the System Hallucination Scale (SHS), a lightweight and human-centered measurement instrument for assessing hallucination-related behavior in large language models (LLMs). Inspired by established psychometric tools such as the System Usability Scale (SUS) and the System Causability Scale (SCS), SHS enables rapid, interpretable, and domain-agnostic evaluation of factual unreliability, incoherence, misleading presentation, and responsiveness to user guidance in model-generated text. SHS is explicitly not an automatic hallucination detector or benchmark metric; instead, it captures how hallucination phenomena manifest from a user perspective under realistic interaction conditions. A real-world evaluation with 210 participants demonstrates high clarity, coherent response behavior, and construct validity, supported by statistical analysis including internal consistency (Cronbach's alpha = 0.87) and significant inter-dimension correlations (p < 0.001). Comparative analysis with SUS and SCS reveals complementary measurement properties, supporting SHS as a practical tool for comparative analysis, iterative system development, and deployment monitoring.
Chinese Translation
我们介绍了系统幻觉量表(SHS),这是一种轻量级且以人为中心的测量工具,用于评估大型语言模型(LLMs)中的幻觉相关行为。SHS的设计灵感来源于已建立的心理测量工具,如系统可用性量表(SUS)和系统因果性量表(SCS),能够快速、可解释且领域无关地评估模型生成文本中的事实不可靠性、不连贯性、误导性呈现以及对用户指导的响应能力。SHS明确不是一种自动幻觉检测器或基准指标;相反,它捕捉了在现实交互条件下,幻觉现象如何从用户的角度表现出来。对210名参与者的真实世界评估显示出高清晰度、一致的响应行为和构念效度,统计分析支持了这一点,包括内部一致性(Cronbach's alpha = 0.87)和显著的维度间相关性(p < 0.001)。与SUS和SCS的比较分析揭示了互补的测量特性,支持SHS作为一种实用工具,用于比较分析、迭代系统开发和部署监测。
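The internal-consistency figure above (Cronbach's alpha = 0.87) follows the standard formula over per-item variances and total-score variance. A stdlib sketch; the toy item scores are invented, not SHS data:

```python
# Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / variance(totals)).

def cronbach_alpha(items):
    """items: one list of respondent scores per scale item (aligned order)."""
    k = len(items)
    n = len(items[0])

    def sample_var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(item[r] for item in items) for r in range(n)]
    return k / (k - 1) * (1 - sum(sample_var(it) for it in items) / sample_var(totals))

# Two perfectly correlated items give the maximum alpha.
print(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]]))  # -> 1.0
```

Values around 0.87, as reported, indicate the scale's items move together strongly enough to be treated as one construct.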
cs.CL / 10 / 2603.09990

A Two-Stage Architecture for NDA Analysis: LLM-based Segmentation and Transformer-based Clause Classification

一种用于 NDA 分析的两阶段架构:基于 LLM 的分段和基于 Transformer 的条款分类
Begnini, Ana, Vicente, Matheus, Souza, Leonardo
Abstract
In business-to-business relations, it is common to establish Non-Disclosure Agreements (NDAs). However, these documents exhibit significant variation in format, structure, and writing style, making manual analysis slow and error-prone. We propose an LLM-based architecture to automate segmentation and clause classification within these contracts. We employed two models: LLaMA-3.1-8B-Instruct for NDA segmentation (clause extraction) and a fine-tuned Legal-Roberta-Large for clause classification. In the segmentation task, we achieved a ROUGE F1 of 0.95 +/- 0.0036; for classification, we obtained a weighted F1 of 0.85, demonstrating the feasibility and precision of the approach.
Chinese Translation
在企业间关系中,建立保密协议(Non-Disclosure Agreements, NDAs)是常见的做法。然而,这些文件在格式、结构和写作风格上存在显著差异,导致手动分析过程缓慢且容易出错。我们提出了一种基于 LLM 的架构,以自动化这些合同中的分段和条款分类。我们采用了两个模型:LLaMA-3.1-8B-Instruct 用于 NDA 分段(条款提取),以及经过微调的 Legal-Roberta-Large 用于条款分类。在分段任务中,我们达到了 0.95 +/- 0.0036 的 ROUGE F1 值;在分类任务中,我们获得了 0.85 的加权 F1 值,证明了该方法的可行性和精确性。
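The segmentation metric reported above can be illustrated with a unigram (ROUGE-1) F1 between an extracted clause and its reference; the exact ROUGE variant the paper uses is not specified here, so this choice is an assumption, and the example clauses are invented.

```python
from collections import Counter

# Unigram overlap F1 (ROUGE-1 style) between candidate and reference text.

def rouge1_f1(candidate, reference):
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

cand = "the party shall keep information confidential"
ref = "the receiving party shall keep all information confidential"
print(round(rouge1_f1(cand, ref), 3))  # -> 0.857
```

A score near the paper's 0.95 means extracted clause boundaries reproduce the reference text almost word for word.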
cs.CL / 11 / 2603.09991

PoultryLeX-Net: Domain-Adaptive Dual-Stream Transformer Architecture for Large-Scale Poultry Stakeholder Modeling

PoultryLeX-Net:用于大规模禽类利益相关者建模的领域自适应双流变换器架构
Afrifa, Stephen, Khatiwada, Biswash, Khanal, Kapalik, Shah, Sanjay, Wang-Li, Lingjuan, Bist, Ramesh Bahadur
Abstract
The rapid growth of the global poultry industry, driven by rising demand for affordable animal protein, has intensified public discourse surrounding production practices, housing, management, animal welfare, and supply-chain transparency. Social media platforms such as X (formerly Twitter) generate large volumes of unstructured textual data that capture stakeholder sentiment across the poultry industry. Extracting accurate sentiment signals from this domain-specific discourse remains challenging due to contextual ambiguity, linguistic variability, and limited domain awareness in general-purpose language models. This study presents PoultryLeX-Net, a lexicon-enhanced, domain-adaptive dual-stream transformer framework for fine-grained sentiment analysis in poultry-related text. The proposed architecture integrates sentiment classification, topic modeling, and contextual representation learning through domain-specific embeddings and gated cross-attention mechanisms. A lexicon-guided stream captures poultry-specific terminology and sentiment cues, while a contextual stream models long-range semantic dependencies. Latent Dirichlet Allocation is employed to identify dominant thematic structures associated with production management and welfare-related discussions, providing complementary interpretability to sentiment predictions. PoultryLeX-Net was evaluated against multiple baseline models, including a convolutional neural network and pre-trained transformer architectures such as DistilBERT and RoBERTa. PoultryLeX-Net consistently outperformed all baselines, achieving an accuracy of 97.35%, an F1 score of 96.67%, and an area under the receiver operating characteristic curve (AUC-ROC) of 99.61% across sentiment classification tasks. Overall, domain adaptation and dual-stream attention markedly improve sentiment classification, enabling scalable intelligence for poultry production decision support.
Chinese Translation
全球禽类产业的快速增长,受到对经济实惠动物蛋白需求上升的推动,已加剧了围绕生产实践、住房、管理、动物福利和供应链透明度的公众讨论。社交媒体平台如X(前身为Twitter)生成大量非结构化文本数据,捕捉禽类产业中利益相关者的情感。从这一特定领域的讨论中提取准确的情感信号仍然具有挑战性,因为存在上下文模糊、语言变异性以及通用语言模型的领域意识有限。本研究提出了PoultryLeX-Net,一个增强词典的领域自适应双流变换器框架,用于禽类相关文本的细粒度情感分析。所提架构通过领域特定的嵌入和门控交叉注意机制,整合了情感分类、主题建模和上下文表示学习。一个以词典为指导的流捕捉禽类特定术语和情感线索,而上下文流则建模长程语义依赖关系。采用潜在狄利克雷分配(Latent Dirichlet Allocation)识别与生产管理和福利相关讨论的主导主题结构,为情感预测提供了补充的可解释性。PoultryLeX-Net在多个基准模型上进行了评估,包括卷积神经网络和预训练的变换器架构,如DistilBERT和RoBERTa。PoultryLeX-Net在情感分类任务中始终优于所有基准,取得了97.35%的准确率、96.67%的F1分数和99.61%的接收者操作特征曲线下面积(AUC-ROC)。总体而言,领域适应和双流注意力显著提高了情感分类能力,为禽类生产决策支持提供了可扩展的智能。
cs.CL / 12 / 2603.09992

TAMUSA-Chat: A Domain-Adapted Large Language Model Conversational System for Research and Responsible Deployment

TAMUSA-Chat:一种面向研究和负责任部署的领域适应大型语言模型对话系统
Alsmadi, Izzat, Alsobeh, Anas
Abstract
This paper presents TAMUSA-Chat, a research-oriented framework for building domain-adapted large language model conversational systems. The work addresses critical challenges in adapting general-purpose foundation models to institutional contexts through supervised fine-tuning, retrieval-augmented generation, and systematic evaluation methodologies. We describe the complete architecture encompassing data acquisition from institutional sources, preprocessing pipelines, embedding construction, model training workflows, and deployment strategies. The system integrates modular components enabling reproducible experimentation with training configurations, hyper-parameters, and evaluation protocols. Our implementation demonstrates how academic institutions can develop contextually grounded conversational agents while maintaining transparency, governance compliance, and responsible AI practices. Through empirical analysis of fine-tuning behavior across model sizes and training iterations, we provide insights into domain adaptation efficiency, computational resource requirements, and quality-cost trade-offs. The publicly available codebase at https://github.com/alsmadi/TAMUSA_LLM_Based_Chat_app supports continued research into institutional LLM deployment, evaluation methodologies, and ethical considerations for educational AI systems.
Chinese Translation
本文介绍了TAMUSA-Chat,一个用于构建领域适应大型语言模型对话系统的研究导向框架。该研究解决了将通用基础模型适应于机构环境的关键挑战,采用了监督微调、检索增强生成和系统评估方法。我们描述了完整的架构,包括从机构来源获取数据、预处理管道、嵌入构建、模型训练工作流程和部署策略。该系统集成了模块化组件,使得在训练配置、超参数和评估协议方面能够进行可重复的实验。我们的实现展示了学术机构如何在保持透明度、合规治理和负责任的人工智能实践的同时,开发具有上下文基础的对话代理。通过对不同模型规模和训练迭代的微调行为进行实证分析,我们提供了关于领域适应效率、计算资源需求和质量-成本权衡的见解。公开可用的代码库位于https://github.com/alsmadi/TAMUSA_LLM_Based_Chat_app,支持对机构LLM部署、评估方法和教育AI系统的伦理考量的持续研究。
cs.CL / 13 / 2603.09993

CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models

CEI:评估语言模型中语用推理的基准
Chun, Jon, Sussman, Hannah, Mangine, Adrian, Kocaman, Murathan, Sidorko, Kirill, Koirala, Abhigya, McCloud, Andre, Eisenbeis, Gwen, Akanwe, Wisdom, Gassama, Moustapha, Chirinos, Eliezer Gonzalez, Enright, Anne-Duncan, Dunson, Peter, Ng, Tiffanie, von Rosenstiel, Anna, Idowu, Godwin
Abstract
Pragmatic reasoning, inferring intended meaning beyond literal semantics, underpins everyday communication yet remains difficult for large language models. We present the Contextual Emotional Inference (CEI) Benchmark: 300 human-validated scenarios for evaluating how well LLMs disambiguate pragmatically complex utterances. Each scenario pairs a situational context and speaker-listener roles (with explicit power relations) against an ambiguous utterance. The dataset covers five pragmatic subtypes (sarcasm/irony, mixed signals, strategic politeness, passive aggression, deflection/misdirection) drawn from workplace, family, social, and service settings, with three power configurations (peer, higher-to-lower, lower-to-higher). Three trained annotators independently labeled every scenario. Inter-annotator agreement (Fleiss' kappa = 0.06-0.25 by subtype) is low but expected: pragmatic inference admits multiple valid readings, and the disagreement itself is informative. We describe our annotation methodology, including a 4-level quality control pipeline that combines automated statistical checks with expert adjudication. CEI is released under CC-BY-4.0.
Chinese Translation
语用推理是指在字面语义之外推断意图意义,这一过程是日常交流的基础,但对于大型语言模型而言仍然具有挑战性。我们提出了上下文情感推理(Contextual Emotional Inference, CEI)基准:包含300个经过人类验证的场景,用于评估大型语言模型(LLMs)在消歧义复杂语用表达方面的能力。每个场景都配对了一个情境背景和说话者-听众角色(具有明确的权力关系),并与一个模糊的表达相对应。该数据集涵盖了五种语用子类型(讽刺/反语、混合信号、战略礼貌、被动攻击、转移/误导),这些子类型来源于工作场所、家庭、社交和服务环境,并包含三种权力配置(平级、上对下、下对上)。三名经过培训的注释员独立标注了每个场景。注释员之间的一致性(Fleiss' kappa = 0.06-0.25,按子类型划分)较低,但这是可以预期的:语用推理允许多种有效的解读,而这种分歧本身也具有信息价值。我们描述了我们的注释方法,包括一个结合自动统计检查与专家裁决的四级质量控制流程。CEI在CC-BY-4.0许可下发布。
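The Fleiss' kappa figures reported above can be computed with the standard formula over a table of per-item category counts (one row per scenario, each row summing to the number of raters). The toy table below is invented, not CEI data:

```python
# Fleiss' kappa: chance-corrected agreement for a fixed number of raters
# assigning items to categories.

def fleiss_kappa(ratings):
    n = sum(ratings[0])                # raters per item
    num_items = len(ratings)
    k = len(ratings[0])                # number of categories
    # Marginal proportion of each category across all ratings.
    p_j = [sum(item[j] for item in ratings) / (num_items * n) for j in range(k)]
    # Per-item observed agreement among rater pairs.
    p_i = [(sum(c * c for c in item) - n) / (n * (n - 1)) for item in ratings]
    p_bar = sum(p_i) / num_items
    p_e = sum(p * p for p in p_j)      # expected chance agreement
    return (p_bar - p_e) / (1 - p_e)

# 4 items, 3 raters, 2 categories: two unanimous items, two split 2-1.
print(round(fleiss_kappa([[3, 0], [0, 3], [2, 1], [1, 2]]), 3))  # -> 0.333
```

Low kappa on a split table like this mirrors the benchmark's point: when items genuinely admit multiple readings, low agreement is expected rather than a defect.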
cs.CL / 14 / 2603.09994

Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives

评估大型语言模型中的形容词-名词组合性:功能视角与表征视角
Dhar, Ruchira, Peng, Qiwei, Søgaard, Anders
Abstract
Compositionality is considered central to language abilities. As performant language systems, how do large language models (LLMs) do on compositional tasks? We evaluate adjective-noun compositionality in LLMs using two complementary setups: prompt-based functional assessment and a representational analysis of internal model states. Our results reveal a striking divergence between task performance and internal states. While LLMs reliably develop compositional representations, they fail to translate consistently into functional task success across model variants. Consequently, we highlight the importance of contrastive evaluation for obtaining a more complete understanding of model capabilities.
Chinese Translation
组合性被认为是语言能力的核心。作为高效的语言系统,大型语言模型(LLMs)在组合任务上的表现如何?我们通过两种互补的设置评估LLMs中的形容词-名词组合性:基于提示的功能评估和内部模型状态的表征分析。我们的结果揭示了任务表现与内部状态之间的显著分歧。尽管LLMs可靠地发展出组合表征,但它们未能在不同模型变体中始终如一地转化为功能任务的成功。因此,我们强调对比评估的重要性,以获得对模型能力更全面的理解。
cs.CL / 15 / 2603.09995

Context Over Compute: Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality

以情境为重的人工参与方法在面试回答质量上优于迭代思维链提示
Zhu, Kewen, Liu, Zixi, Li, Yanjing
Abstract
Behavioral interview evaluation using large language models presents unique challenges that require structured assessment, realistic interviewer behavior simulation, and pedagogical value for candidate training. We investigate chain of thought prompting for interview answer evaluation and improvement through two controlled experiments with 50 behavioral interview question and answer pairs. Our contributions are threefold. First, we provide a quantitative comparison between human in the loop and automated chain of thought improvement. Using a within subject paired design with n equals 50, both approaches show positive rating improvements. The human in the loop approach provides significant training benefits. Confidence improves from 3.16 to 4.16 (p less than 0.001) and authenticity improves from 2.94 to 4.53 (p less than 0.001, Cohen's d is 3.21). The human in the loop method also requires five times fewer iterations (1.0 versus 5.0, p less than 0.001) and achieves full personal detail integration. Second, we analyze convergence behavior. Both methods converge rapidly with mean iterations below one, with the human in the loop approach achieving a 100 percent success rate compared to 84 percent for automated approaches among initially weak answers (Cohen's h is 0.82, large effect). Additional iterations provide diminishing returns, indicating that the primary limitation is context availability rather than computational resources. Third, we propose an adversarial challenging mechanism based on a negativity bias model, named bar raiser, to simulate realistic interviewer behavior, although quantitative validation remains future work. Our findings demonstrate that while chain of thought prompting provides a useful foundation for interview evaluation, domain specific enhancements and context aware approach selection are essential for realistic and pedagogically valuable results.
Chinese Translation
使用大型语言模型进行行为面试评估面临独特挑战,这些挑战需要结构化评估、真实的面试官行为模拟以及对候选人培训的教育价值。我们通过两个控制实验研究了思维链提示在面试回答评估和改进中的应用,实验涉及50对行为面试问题和回答。我们的贡献主要有三方面。首先,我们提供了人工参与与自动化思维链改进之间的定量比较。在n为50的被试内配对设计中,两种方法均显示出积极的评分改善。人工参与方法提供了显著的培训收益。信心从3.16提高至4.16(p小于0.001),真实性从2.94提高至4.53(p小于0.001,Cohen's d为3.21)。人工参与方法所需的迭代次数仅为自动化方法的五分之一(1.0对比5.0,p小于0.001),并实现了完整的个人细节整合。其次,我们分析了收敛行为。两种方法均快速收敛,平均迭代次数低于1,人工参与方法在最初较弱的回答中实现了100%的成功率,而自动化方法的成功率为84%(Cohen's h为0.82,属于大效应)。额外的迭代提供的收益递减,表明主要限制在于情境可用性而非计算资源。第三,我们提出了一种基于负面偏见模型的对抗性挑战机制,称为“标准提升者”(bar raiser),以模拟真实的面试官行为,尽管定量验证仍需未来工作。我们的研究结果表明,尽管思维链提示为面试评估提供了有用的基础,但领域特定的增强和情境感知的方法选择对于实现真实且具有教育价值的结果至关重要。
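The reported effect size for the 100% vs. 84% success-rate gap can be checked directly. Cohen's h for two proportions uses the arcsine transform (a standard formula; this sketch simply re-derives the abstract's number):

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's effect size h for two proportions,
    h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Success rates among initially weak answers: 100% (human-in-the-loop)
# vs. 84% (automated chain of thought).
print(round(cohens_h(1.00, 0.84), 2))  # 0.82, a large effect by convention
```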
cs.CL / 16 / 2603.09996

There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective

没有愚蠢的问题:从土耳其视角评估离线大型语言模型的能力
Yilmaz, Edibe, Kostas, Kahraman
Abstract
The integration of large language models (LLMs) into educational processes introduces significant constraints regarding data privacy and reliability, particularly in pedagogically vulnerable contexts such as Turkish heritage language education. This study aims to systematically evaluate the robustness and pedagogical safety of locally deployable offline LLMs within the context of Turkish heritage language education. To this end, a Turkish Anomaly Suite (TAS) consisting of 10 original edge-case scenarios was developed to assess the models' capacities for epistemic resistance, logical consistency, and pedagogical safety. Experiments conducted on 14 different models ranging from 270M to 32B parameters reveal that anomaly resistance is not solely dependent on model scale and that sycophancy bias can pose pedagogical risks even in large-scale models. The findings indicate that reasoning-oriented models in the 8B--14B parameter range represent the most balanced segment in terms of cost-safety trade-off for language learners.
Chinese Translation
将大型语言模型(LLMs)融入教育过程带来了关于数据隐私和可靠性的重要限制,特别是在土耳其遗产语言教育等教育上脆弱的背景下。本研究旨在系统评估可在本地部署的离线LLMs在土耳其遗产语言教育中的稳健性和教学安全性。为此,开发了一个包含10个原始边缘案例场景的土耳其异常套件(Turkish Anomaly Suite, TAS),以评估模型在认知抵抗、逻辑一致性和教学安全性方面的能力。对14种不同模型(参数范围从270M到32B)进行的实验表明,异常抵抗并不单纯依赖于模型规模,而谄媚偏见即使在大规模模型中也可能带来教学风险。研究结果表明,参数范围在8B到14B的以推理为导向的模型在成本与安全性权衡方面代表了对语言学习者最为平衡的选择。
cs.CL / 17 / 2603.09997

Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

同理心并未改变:跨GPT模型代际的心理安全临床评估
Keeman, Michael, Keeman, Anastasia
Abstract
When OpenAI deprecated GPT-4o in early 2026, thousands of users protested under #keep4o, claiming newer models had "lost their empathy." No published study has tested this claim. We conducted the first clinical measurement, evaluating three OpenAI model generations (GPT-4o, o4-mini, GPT-5-mini) across 14 emotionally challenging conversational scenarios in mental health and AI companion domains, producing 2,100 scored AI responses assessed on six psychological safety dimensions using clinically-grounded rubrics. Empathy scores are statistically indistinguishable across all three models (Kruskal-Wallis H=4.33, p=0.115). What changed is the safety posture: crisis detection improved monotonically from GPT-4o to GPT-5-mini (H=13.88, p=0.001), while advice safety declined (H=16.63, p<0.001). Per-turn trajectory analysis -- a novel methodological contribution -- reveals these shifts are sharpest during mid-conversation crisis moments invisible to aggregate scoring. In a self-harm scenario involving a minor, GPT-4o scored 3.6/10 on crisis detection during early disclosure turns; GPT-5-mini never dropped below 7.8. What users perceived as "lost empathy" was a shift from a cautious model that missed crises to an alert model that sometimes says too much -- a trade-off with real consequences for vulnerable users, currently invisible to both the people who feel it and the developers who create it.
Chinese Translation
当OpenAI在2026年初停用GPT-4o时,数千名用户在#keep4o标签下抗议,声称新模型“失去了同理心”。然而,尚无已发表的研究对这一主张进行验证。我们进行了首次临床测量,评估了三代OpenAI模型(GPT-4o、o4-mini、GPT-5-mini)在心理健康和AI伴侣领域14个情感挑战性对话场景中的表现,产生了2100个评分的AI响应,并使用基于临床的评分标准评估了六个心理安全维度。所有三种模型的同理心评分在统计上没有显著差异(Kruskal-Wallis H=4.33, p=0.115)。真正变化的是安全姿态:危机检测从GPT-4o到GPT-5-mini单调改善(H=13.88, p=0.001),而建议安全性则下降(H=16.63, p<0.001)。每轮轨迹分析——一项新颖的方法论贡献——揭示了这些变化在对话中段的危机时刻最为剧烈,而这些时刻在汇总评分中是不可见的。在涉及未成年人的自残场景中,GPT-4o在早期披露轮次中的危机检测得分为3.6/10;而GPT-5-mini的得分从未低于7.8。用户所感知的“失去同理心”,实际上是从一个会错过危机的谨慎模型转变为一个有时说得过多的警觉模型——这是一种对脆弱用户有真实后果的权衡,而目前无论是感受到它的用户还是创造它的开发者都看不到这种权衡。
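The case for per-turn trajectory analysis over aggregate scoring can be illustrated with invented numbers (not the study's data): two trajectories with near-identical means can differ sharply at a single crisis turn.

```python
import numpy as np

# Hypothetical per-turn crisis-detection scores over a 5-turn conversation.
cautious_model = np.array([8.0, 8.0, 3.6, 8.0, 8.0])  # dips at the crisis turn
alert_model = np.array([7.0, 7.0, 7.2, 7.5, 7.0])     # steady throughout

print(cautious_model.mean(), alert_model.mean())  # both approx. 7.1: aggregates agree
print(cautious_model.min(), alert_model.min())    # 3.6 vs 7.0: the dip is exposed
```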
cs.CL / 18 / 2603.09998

Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English

大型语言模型的自动评估:有效的中文到英文机器翻译
Zhang, Yue, Beard, Rodney, Hawkins, John, Chandra, Rohitash
Abstract
Although Large Language Models (LLMs) have exceptional performance in machine translation, only a limited systematic assessment of translation quality has been done. The challenge lies in automated frameworks, as human-expert-based evaluations can be time-consuming, given the fast-evolving LLMs and the need for a diverse set of texts to ensure fair assessments of translation quality. In this paper, we utilise an automated machine learning framework featuring semantic and sentiment analysis to assess Mandarin Chinese to English translation using Google Translate and LLMs, including GPT-4, GPT-4o, and DeepSeek. We compare original and translated texts in various classes of high-profile Chinese texts, which include novel texts that span modern and classical literature, as well as news articles. As the main evaluation measures, we utilise novel similarity metrics to compare the quality of translations produced by LLMs and further evaluate them by an expert human translator. Our results indicate that the LLMs perform well in news media translation, but show divergence in their performance when applied to literary texts. Although GPT-4o and DeepSeek demonstrated better semantic conservation in complex situations, DeepSeek demonstrated better performance in preserving cultural subtleties and grammatical rendering. Nevertheless, the subtle challenges in translation remain: maintaining cultural details, classical references and figurative expressions remain an open problem for all the models.
Chinese Translation
尽管大型语言模型(LLMs)在机器翻译方面表现卓越,但对翻译质量的系统评估仍然有限。挑战在于自动化框架的构建:鉴于LLMs快速演进,且需要多样化的文本以确保对翻译质量的公平评估,基于人类专家的评估可能耗时过长。本文利用一个自动化机器学习框架,结合语义和情感分析,评估使用谷歌翻译和LLMs(包括GPT-4、GPT-4o和DeepSeek)进行的中文到英文翻译。我们比较了多种高知名度中文文本的原文和翻译文本,这些文本包括涵盖现代和古典文学的小说文本,以及新闻文章。作为主要评估指标,我们采用了新颖的相似性度量,比较LLMs生成的翻译质量,并进一步由专业人工译者进行评估。我们的结果表明,LLMs在新闻媒体翻译中表现良好,但在应用于文学文本时表现存在差异。尽管GPT-4o和DeepSeek在复杂情况下表现出更好的语义保留能力,DeepSeek在保持文化细微差别和语法表达方面表现更佳。然而,翻译中的微妙挑战依然存在:保持文化细节、古典引用和比喻表达仍然是所有模型面临的开放性问题。
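The abstract does not specify its "novel similarity metrics", but a common building block for this kind of semantic comparison is cosine similarity between sentence embeddings of the source and the translation; a minimal sketch:

```python
import numpy as np

def cosine_similarity(u, v) -> float:
    """Cosine similarity between two embedding vectors, e.g. sentence
    embeddings of an original text and its candidate translation."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0: same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal
```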
cs.CL / 19 / 2603.09999

A Retrieval-Augmented Language Assistant for Unmanned Aircraft Safety Assessment and Regulatory Compliance

一种用于无人机安全评估和合规监管的检索增强语言助手
Immordino, Gabriele, Vaiuso, Andrea, Righi, Marcello
Abstract
This paper presents the design and validation of a retrieval-based assistant that supports safety assessment, certification activities, and regulatory compliance for unmanned aircraft systems. The work is motivated by the growing complexity of drone operations and the increasing effort required by applicants and aviation authorities to apply established assessment frameworks, including the Specific Operations Risk Assessment and the Pre-defined Risk Assessment, in a consistent and efficient manner. The proposed approach uses a controlled text-based architecture that relies exclusively on authoritative regulatory sources. To enable traceable and auditable outputs, the assistant grounds each response in retrieved passages and enforces citation-driven generation. System-level controls address common failure modes of generative models, including fabricated statements, unsupported inferences, and unclear provenance, by separating evidence storage from language generation and by adopting conservative behavior when supporting documentation is insufficient. The assistant is intentionally limited to decision support; it does not replace expert judgment and it does not make autonomous determinations. Instead, it accelerates context-specific information retrieval and synthesis to improve document preparation and review while preserving human responsibility for critical conclusions. The architecture is implemented using established open-source components, and key choices in retrieval strategy, interaction constraints, and response policies are evaluated for suitability in safety-sensitive regulatory environments. The paper provides technical and operational guidance for integrating retrieval-based assistants into aviation oversight workflows while maintaining accountability, traceability, and regulatory compliance.
Chinese Translation
本文介绍了一种基于检索的助手的设计与验证,该助手支持无人机系统的安全评估、认证活动和合规监管。该研究的动机源于无人机操作日益复杂,以及申请者和航空管理机构在一致且高效地应用既定评估框架(包括特定操作风险评估(Specific Operations Risk Assessment)和预定义风险评估(Pre-defined Risk Assessment))时所需的日益增加的努力。所提出的方法采用了一种受控的基于文本的架构,完全依赖权威的监管来源。为了实现可追溯和可审计的输出,助手将每个响应基于检索到的段落,并强制实施引用驱动的生成。系统级控制解决了生成模型的常见故障模式,包括虚假陈述、无支持的推论和不明确的来源,通过将证据存储与语言生成分离,并在支持文档不足时采取保守行为。该助手的功能被故意限制为决策支持;它并不取代专家判断,也不做出自主决定。相反,它加速了特定上下文的信息检索和综合,以改善文档准备和审查,同时保留人类对关键结论的责任。该架构使用已建立的开源组件实现,并评估了检索策略、交互限制和响应政策中的关键选择在安全敏感的监管环境中的适用性。本文为将基于检索的助手集成到航空监管工作流程中提供了技术和操作指导,同时保持问责制、可追溯性和合规性。
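The conservative fallback described (decline when supporting documentation is insufficient) can be sketched as a retrieval gate. The threshold, field names, and scoring below are illustrative assumptions, not the system's actual design:

```python
import numpy as np

def answer_with_evidence(query_emb, passages, threshold=0.8):
    """Return the best-supported passage with its citation, or decline
    when no retrieved passage clears the similarity threshold."""
    def cos(u, v):
        u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    best = max(passages, key=lambda p: cos(query_emb, p["emb"]))
    if cos(query_emb, best["emb"]) < threshold:
        return {"answer": None, "note": "insufficient supporting documentation"}
    return {"answer": best["text"], "citation": best["id"]}

docs = [{"id": "SORA-2.5", "text": "Ground risk class...", "emb": [1.0, 0.0]},
        {"id": "PDRA-01", "text": "Operational volume...", "emb": [0.0, 1.0]}]
print(answer_with_evidence([1.0, 0.0], docs))   # cites SORA-2.5
print(answer_with_evidence([0.5, 0.5], docs))   # declines: weak support
```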
cs.CL / 20 / 2603.10000

Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought

超越大型语言模型中的提示:理解、上下文学习与思维链
Jiao, Yuling, Lai, Yanming, Lin, Huazhen, Ma, Wensen, Qi, Houduo, Sun, Defeng
Abstract
Large Language Models (LLMs) have demonstrated remarkable proficiency across diverse tasks, exhibiting emergent properties such as semantic prompt comprehension, In-Context Learning (ICL), and Chain-of-Thought (CoT) reasoning. Despite their empirical success, the theoretical mechanisms driving these phenomena remain poorly understood. This study dives into the foundations of these observations by addressing three critical questions: (1) How do LLMs accurately decode prompt semantics despite being trained solely on a next-token prediction objective? (2) Through what mechanism does ICL facilitate performance gains without explicit parameter updates? and (3) Why do intermediate reasoning steps in CoT prompting effectively unlock capabilities for complex, multi-step problems? Our results demonstrate that, through the autoregressive process, LLMs are capable of exactly inferring the transition probabilities between tokens across distinct tasks using provided prompts. We show that ICL enhances performance by reducing prompt ambiguity and facilitating posterior concentration on the intended task. Furthermore, we find that CoT prompting activates the model's capacity for task decomposition, breaking complex problems into a sequence of simpler sub-tasks that the model has mastered during the pretraining phase. By comparing their individual error bounds, we provide novel theoretical insights into the statistical superiority of advanced prompt engineering techniques.
Chinese Translation
大型语言模型(LLMs)在多种任务中展现出卓越的能力,表现出诸如语义提示理解、上下文学习(ICL)和思维链(CoT)推理等涌现特性。尽管它们在实践中取得了成功,但驱动这些现象的理论机制仍然不甚明了。本研究通过探讨三个关键问题深入分析这些观察的基础:(1)LLMs如何在仅通过下一个标记预测目标进行训练的情况下准确解码提示语义?(2)ICL通过何种机制在没有显式参数更新的情况下促进性能提升?(3)为何思维链提示中的中间推理步骤能够有效解锁复杂多步骤问题的能力?我们的结果表明,通过自回归过程,LLMs能够利用所提供的提示,准确推断不同任务中标记之间的转移概率。我们展示了ICL通过减少提示模糊性并促进对目标任务的后验集中来提升性能。此外,我们发现思维链提示激活了模型的任务分解能力,将复杂问题拆解为一系列模型在预训练阶段已掌握的简单子任务。通过比较它们各自的误差界限,我们提供了对先进提示工程技术统计优势的新理论见解。
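The posterior-concentration claim for ICL admits a simple Bayesian toy model (my construction, not the paper's): each additional in-context example multiplies the task likelihood, so the posterior on the intended task approaches 1.

```python
# Two candidate tasks A and B assign different likelihoods to each
# in-context demonstration; after k i.i.d. examples, Bayes' rule gives
# P(A | data) proportional to prior_A * lik_A**k.
def posterior_task_a(prior_a: float, lik_a: float, lik_b: float, k: int) -> float:
    num = prior_a * lik_a ** k
    return num / (num + (1.0 - prior_a) * lik_b ** k)

for k in (0, 1, 4, 16):  # posterior on the intended task rises with k
    print(k, round(posterior_task_a(0.5, 0.9, 0.6, k), 4))
```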
cs.CL / 21 / 2603.10001

Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America

利用维基数据创建地理信息驱动的社会文化偏见数据集:以拉丁美洲为例
Karmim, Yannis, Pino, Renato, Contreras, Hernan, Lira, Hernan, Cifuentes, Sebastian, Escoffier, Simon, Martí, Luis, Seddah, Djamé, Barrière, Valentin
Abstract
Large Language Models (LLMs) exhibit inequalities with respect to various cultural contexts. Most prominent open-weights models are trained on Global North data and show prejudicial behavior towards other cultures. Moreover, there is a notable lack of resources to detect biases in non-English languages, especially from Latin America (Latam), a continent containing various cultures, even though they share a common cultural ground. We propose to leverage the content of Wikipedia, the structure of the Wikidata knowledge graph, and expert knowledge from social science in order to create a dataset of question/answer (Q/As) pairs, based on the different popular and social cultures of various Latin American countries. We create the LatamQA database of over 26k questions and associated answers extracted from 26k Wikipedia articles, and transformed into multiple-choice questions (MCQ) in Spanish and Portuguese, in turn translated to English. We use this MCQ to quantify the degree of knowledge of various LLMs and find out (i) a discrepancy in performances between the Latam countries, ones being easier than others for the majority of the models, (ii) that the models perform better in their original language, and (iii) that Iberian Spanish culture is better known than Latam one.
Chinese Translation
大型语言模型(LLMs)在不同文化背景下表现出不平等现象。大多数知名的开放权重模型是在全球北方的数据上训练的,对其他文化表现出偏见。此外,缺乏检测非英语语言偏见的资源,尤其是来自拉丁美洲(Latam)的资源,尽管该大陆包含多种文化,但它们共享着共同的文化基础。我们提议利用维基百科的内容、维基数据知识图谱的结构以及社会科学领域的专家知识,创建一个基于拉丁美洲各国不同流行和社会文化的问答(Q/A)对数据集。我们创建了LatamQA数据库,包含来自26,000篇维基百科文章提取的超过26,000个问题及其相关答案,并将其转化为西班牙语和葡萄牙语的多项选择题(MCQ),随后翻译成英语。我们利用这些MCQ量化不同LLMs的知识程度,并发现(i)拉丁美洲各国之间的表现存在差异,有些国家对大多数模型而言更容易,(ii)模型在其原始语言中的表现更佳,以及(iii)伊比利亚西班牙文化的知名度高于拉丁美洲文化。
cs.CL / 22 / 2603.10002

SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks

SpreadsheetArena:分解大型语言模型生成电子表格工作簿中的偏好
Kundurthy, Srivatsa, Na, Clara, Handley, Michael, Kirshner, Zach, Zhang, Chen Bo Calvin, Sharma, Manasi, Strubell, Emma, Ling, John
Abstract
Large language models (LLMs) are increasingly tasked with producing and manipulating structured artifacts. We consider the task of end-to-end spreadsheet generation, where language models are prompted to produce spreadsheet artifacts to satisfy users' explicit and implicit constraints, specified in natural language. We introduce SpreadsheetArena, a platform for evaluating models' performance on the task via blind pairwise evaluations of LLM-generated spreadsheet workbooks. As with other complex, open-ended tasks, relevant evaluation criteria can vary substantially across use cases and prompts, often in ways that are difficult to formalize. Compared to general chat or text generation settings, spreadsheet generation presents unique challenges and opportunities: the task output structure is well-defined and multi-dimensional, and there are often complex considerations around interactivity and layout. Among other findings, we observe that stylistic, structural, and functional features of preferred spreadsheets vary substantially across use cases, and expert evaluations of spreadsheets for finance prompts suggest that even highly ranked arena models do not reliably produce spreadsheets aligned with domain-specific best practices. Our hope is that our work prompts further study of end-to-end spreadsheet generation as a challenging and interesting category of complex, open-ended tasks for LLMs. Our live arena is hosted at https://spreadsheetarena.ai.
Chinese Translation
大型语言模型(LLMs)越来越多地被要求生成和处理结构化的工件。我们考虑端到端电子表格生成的任务,其中语言模型被提示生成电子表格工件,以满足用户在自然语言中指定的显性和隐性约束。我们引入了SpreadsheetArena,这是一个评估模型在该任务上表现的平台,通过对LLM生成的电子表格工作簿进行盲测成对评估。与其他复杂的开放式任务一样,相关的评估标准在不同的使用案例和提示中可能会有很大差异,通常以难以形式化的方式表现出来。与一般的聊天或文本生成设置相比,电子表格生成呈现出独特的挑战和机遇:任务输出结构明确且多维,并且通常涉及交互性和布局的复杂考虑。在其他发现中,我们观察到受偏好电子表格的风格、结构和功能特征在不同的使用案例中有显著差异,针对财务提示的电子表格专家评估表明,即使是排名较高的arena模型也无法可靠地生成符合领域特定最佳实践的电子表格。我们希望我们的工作能够促使对端到端电子表格生成的进一步研究,这是一类对LLMs来说具有挑战性和趣味性的复杂开放式任务。我们的实时平台托管在 https://spreadsheetarena.ai。
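Blind pairwise preferences are typically aggregated into model ratings. Whether SpreadsheetArena uses Elo specifically is not stated in the abstract, but the standard Elo update for one comparison is:

```python
def elo_update(r_a: float, r_b: float, winner_a: bool, k: float = 32.0):
    """One Elo update from a single blind pairwise comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if winner_a else 0.0
    return (r_a + k * (score_a - expected_a),
            r_b + k * ((1.0 - score_a) - (1.0 - expected_a)))

# Two equally rated models; A's workbook is preferred in a blind vote.
print(elo_update(1000.0, 1000.0, True))  # (1016.0, 984.0)
```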
cs.CL / 23 / 2603.10003

Probing the Limits of the Lie Detector Approach to LLM Deception

探讨谎言探测器方法在大型语言模型欺骗中的局限性
Berger, Tom-Felix
Abstract
Mechanistic approaches to deception in large language models (LLMs) often rely on "lie detectors", that is, truth probes trained to identify internal representations of model outputs as false. The lie detector approach to LLM deception implicitly assumes that deception is coextensive with lying. This paper challenges that assumption. It experimentally investigates whether LLMs can deceive without producing false statements and whether truth probes fail to detect such behavior. Across three open-source LLMs, it is shown that some models reliably deceive by producing misleading non-falsities, particularly when guided by few-shot prompting. It is further demonstrated that truth probes trained on standard true-false datasets are significantly better at detecting lies than at detecting deception without lying, confirming a critical blind spot of current mechanistic deception detection approaches. It is proposed that future work should incorporate non-lying deception in dialogical settings into probe training and explore representations of second-order beliefs to more directly target the conceptual constituents of deception.
Chinese Translation
大型语言模型(LLMs)中欺骗的机制性研究方法通常依赖于“谎言探测器”,即经过训练、用于将模型输出的内部表征识别为虚假的真相探测器。谎言探测器方法对LLM欺骗的隐含假设是,欺骗与撒谎是同一概念。本文对此假设提出挑战。通过实验研究,探讨LLMs是否能够在不产生虚假陈述的情况下进行欺骗,以及真相探测器是否无法检测到这种行为。在三个开源LLM中,研究表明某些模型通过产生误导性的非虚假信息可靠地进行欺骗,尤其是在少样本提示的指导下。进一步证明,基于标准真伪数据集训练的真相探测器在检测谎言方面显著优于检测不撒谎的欺骗,确认了当前机制性欺骗检测方法的一个关键盲点。本文建议未来的研究应将对话场景中的非撒谎欺骗纳入探测器训练,并探索二阶信念的表征,以更直接地针对欺骗的概念成分。
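A truth probe is usually a linear classifier over hidden activations. The following toy version trains a logistic-regression probe on simulated activation vectors (real probes read an LLM's residual stream; the Gaussian data here merely stands in for it):

```python
import numpy as np

# SIMULATED activations: true vs. false statements get different means.
rng = np.random.default_rng(0)
dim = 16
true_acts = rng.normal(+1.0, 1.0, size=(200, dim))
false_acts = rng.normal(-1.0, 1.0, size=(200, dim))
X = np.vstack([true_acts, false_acts])
y = np.array([1] * 200 + [0] * 200)

# Logistic-regression "lie detector" trained by plain gradient descent.
w, b = np.zeros(dim), 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)   # clip logits for numerical safety
    p = 1.0 / (1.0 + np.exp(-z))
    grad = p - y
    w -= 0.1 * (X.T @ grad) / len(y)
    b -= 0.1 * grad.mean()

probe_acc = (((X @ w + b) > 0) == y).mean()
print(probe_acc)  # near 1.0 on this well-separated toy data
```

The paper's point is precisely that high accuracy on true/false data like this does not transfer to deception that involves no false statement.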
cs.CL / 24 / 2603.10004

Fine-Tune, Don't Prompt, Your Language Model to Identify Biased Language in Clinical Notes

微调,而非提示,您的语言模型以识别临床笔记中的偏见语言
Landi, Isotta, Alleva, Eugenia, Bussola, Nicole, Cohen, Rebecca M., Nowlin, Sarah, Shaw, Leslee J., Charney, Alexander W., Glazer, Kimberly B.
Abstract
Clinical documentation can contain emotionally charged language with stigmatizing or privileging valences. We present a framework for detecting and classifying such language as stigmatizing, privileging, or neutral. We constructed a curated lexicon of biased terms scored for emotional valence. We then used lexicon-based matching to extract text chunks from OB-GYN delivery notes (Mount Sinai Hospital, NY) and MIMIC-IV discharge summaries across multiple specialties. Three clinicians annotated all chunks, enabling characterization of valence patterns across specialties and healthcare systems. We benchmarked multiple classification strategies (zero-shot prompting, in-context learning, and supervised fine-tuning) across encoder-only models (GatorTron) and generative large language models (Llama). Fine-tuning with lexically primed inputs consistently outperformed prompting approaches. GatorTron achieved an F1 score of 0.96 on the OB-GYN test set, outperforming larger generative models while requiring minimal prompt engineering and fewer computational resources. External validation on MIMIC-IV revealed limited cross-domain generalizability (F1 < 0.70, 44% drop). Training on the broader MIMIC-IV dataset improved generalizability when testing on OB-GYN (F1 = 0.71, 11% drop), but at the cost of reduced precision. Our findings demonstrate that fine-tuning outperforms prompting for emotional valence classification and that models must be adapted to specific medical specialties to achieve clinically appropriate performance. The same terms can carry different emotional valences across specialties: words with clinical meaning in one context may be stigmatizing in another. For bias detection, where misclassification risks undermining clinician trust or perpetuating patient harm, specialty-specific fine-tuning is essential to capture these semantic shifts. * Equal contribution.
Chinese Translation
临床文档可能包含带有污名化或特权色彩的情感语言。我们提出了一个框架,用于检测此类语言并将其分类为污名化、特权或中性。我们构建了一个经过精心策划的偏见术语词典,并对其情感色彩进行了评分。然后,我们使用基于词典的匹配从妇产科分娩记录(纽约西奈山医院)和多个专业的MIMIC-IV出院总结中提取文本片段。三位临床医生对所有片段进行了标注,从而能够表征不同专业和医疗系统之间的情感模式。我们对多种分类策略(零样本提示、上下文学习和监督微调)进行了基准测试,涵盖了仅编码器模型(GatorTron)和生成性大型语言模型(Llama)。使用词典引导的输入进行微调的效果始终优于提示方法。GatorTron在妇产科测试集上达到了0.96的F1分数,超越了更大的生成模型,同时需要的提示工程和计算资源较少。在MIMIC-IV上的外部验证显示跨领域的通用性有限(F1 < 0.70,下降44%)。在更广泛的MIMIC-IV数据集上训练提高了在妇产科测试时的通用性(F1 = 0.71,下降11%),但以降低精度为代价。我们的研究结果表明,微调在情感色彩分类中优于提示,并且模型必须适应特定的医疗专业,以实现临床适当的性能。同一术语在不同专业中可能具有不同的情感色彩:在一个上下文中具有临床意义的词汇在另一个上下文中可能是污名化的。对于偏见检测而言,错误分类可能会削弱临床医生的信任或加剧患者伤害,因此特定专业的微调对于捕捉这些语义变化至关重要。*同等贡献。
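Lexicon-based chunk extraction can be sketched as below; the lexicon entries, valence labels, and window size are toy stand-ins for the study's curated resources:

```python
import re

# Toy lexicon (the study's curated, valence-scored lexicon is far richer).
LEXICON = {"noncompliant": "stigmatizing", "pleasant": "privileging"}

def extract_chunks(note: str, window: int = 5):
    """Return (chunk, valence) pairs for each lexicon hit, where the chunk
    is the +/- `window`-word context around the matched term."""
    words = note.split()
    chunks = []
    for i, word in enumerate(words):
        key = re.sub(r"\W", "", word).lower()
        if key in LEXICON:
            chunk = " ".join(words[max(0, i - window): i + window + 1])
            chunks.append((chunk, LEXICON[key]))
    return chunks

print(extract_chunks("Patient was noncompliant with prescribed meds."))
```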
cs.CL / 25 / 2603.10005

SENS-ASR: Semantic Embedding injection in Neural-transducer for Streaming Automatic Speech Recognition

SENS-ASR:在神经转导器中注入语义嵌入以实现流式自动语音识别
Dkhissi, Youness, Vielzeuf, Valentin, Allesiardo, Elys, Larcher, Anthony
Abstract
Many Automatic Speech Recognition (ASR) applications require streaming processing of the audio data. In streaming mode, ASR systems need to start transcribing the input stream before it is complete, i.e., the systems have to process a stream of inputs with a limited (or no) future context. Compared to offline mode, this reduction of the future context degrades the performance of Streaming-ASR systems, especially while working with low-latency constraint. In this work, we present SENS-ASR, an approach to enhance the transcription quality of Streaming-ASR by reinforcing the acoustic information with semantic information. This semantic information is extracted from the available past frame-embeddings by a context module. This module is trained using knowledge distillation from a sentence embedding Language Model fine-tuned on the training dataset transcriptions. Experiments on standard datasets show that SENS-ASR significantly improves the Word Error Rate on small-chunk streaming scenarios.
Chinese Translation
许多自动语音识别(ASR)应用需要对音频数据进行流式处理。在流式模式下,ASR系统需要在输入流尚未完成之前就开始转录,即系统必须在有限(或没有)未来上下文的情况下处理输入流。与离线模式相比,未来上下文的减少会降低流式ASR系统的性能,尤其是在低延迟约束下工作时。在本研究中,我们提出了SENS-ASR,这是一种通过用语义信息增强声学信息来提高流式ASR转录质量的方法。这种语义信息是通过上下文模块从可用的过去帧嵌入中提取的。该模块使用从经过训练数据集转录微调的句子嵌入语言模型进行知识蒸馏进行训练。标准数据集上的实验表明,SENS-ASR显著改善了小块流式场景下的词错误率。
cs.CL / 26 / 2603.10006

Adaptive Engram Memory System for Indonesian Language Model: Generative AI Based on TOBA LM for Batak and Minang Language

印尼语言模型的自适应记忆系统:基于TOBA LM的巴塔克语和米南佳保语生成式人工智能
Situngkir, Hokky, Siringoringo, Kevin, Lumbantobing, Andhika Bernard
Abstract
This study presents TOBA-LM, a trilingual language model based on GPT-2 architecture with 1.2 billion parameters, trained on a corpus encompassing Indonesian, Batak, and Minangkabau using syllabic-agglutinative tokenization. The architecture integrates an Engram Memory mechanism, an adaptive n-gram-based memory system with a 500,000 x 768 embedding table that captures morphological dependencies through bigram and trigram pathways. Empirical results demonstrate a training efficiency of 80%, with the loss value dropping from 6.4 to 1.7996 in only 12,973 steps -- significantly faster than the conventional transformer architecture, which required over 70,000 steps to achieve comparable convergence. These findings confirm that the integration of external statistical memory substantially reduces computational requirements for developing regional language models under limited resources.
Chinese Translation
本研究提出了TOBA-LM,这是一个基于GPT-2架构的三语语言模型,拥有12亿参数,训练语料涵盖印尼语、巴塔克语和米南佳保语,采用音节粘合的标记化方式。该架构集成了Engram记忆机制,这是一种基于n-gram的自适应记忆系统,配备500,000 x 768的嵌入表,通过二元组和三元组路径捕捉形态依赖关系。实证结果表明,训练效率达到80%,损失值在仅12,973步内从6.4降至1.7996,显著快于传统的变换器架构,后者需要超过70,000步才能达到相似的收敛效果。这些发现证实,外部统计记忆的集成显著降低了在资源有限的情况下开发区域语言模型的计算需求。
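The adaptive n-gram memory can be caricatured as a hashed embedding-table lookup over bigrams or trigrams. The hashing scheme and fusion are my guesses from the abstract, and the table is shrunk from the paper's 500,000 x 768:

```python
import numpy as np

# Shrunk table for the sketch; the paper's table is 500,000 x 768.
TABLE_ROWS, DIM = 1_000, 8
rng = np.random.default_rng(0)
engram_table = rng.standard_normal((TABLE_ROWS, DIM)).astype(np.float32)

def engram_embedding(token_ids, n=2):
    """Sum the table rows addressed by hashed n-grams of the sequence,
    acting as a statistical memory of local (morphological) context."""
    out = np.zeros(DIM, dtype=np.float32)
    for i in range(len(token_ids) - n + 1):
        idx = hash(tuple(token_ids[i:i + n])) % TABLE_ROWS
        out += engram_table[idx]
    return out

bigram_memory = engram_embedding([17, 4, 92], n=2)   # bigram pathway
trigram_memory = engram_embedding([17, 4, 92], n=3)  # trigram pathway
```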
cs.CL / 27 / 2603.10007

GATech at AbjadGenEval Shared Task: Multilingual Embeddings for Arabic Machine-Generated Text Classification

GATech在AbjadGenEval共享任务中的表现:用于阿拉伯机器生成文本分类的多语言嵌入
Khamis, Ahmed Khaled
Abstract
We present our approach to the AbjadGenEval shared task on detecting AI-generated Arabic text. We fine-tuned the multilingual E5-large encoder for binary classification and explored several strategies for pooling token representations, including weighted layer pooling, multi-head attention pooling, and gated fusion. Interestingly, none of these outperformed simple mean pooling, which achieved an F1 of 0.75 on the test set. We believe this is because complex pooling methods introduce additional parameters that need more data to train properly, whereas mean pooling offers a stable baseline that generalizes well even with limited examples. We also observe a clear pattern in the data: human-written texts tend to be significantly longer than machine-generated ones.
Chinese Translation
我们提出了在AbjadGenEval共享任务中检测AI生成阿拉伯文本的方法。我们对多语言E5-large编码器进行了微调,以进行二分类,并探索了几种池化策略来汇聚标记表示,包括加权层池化、多头注意力池化和门控融合。有趣的是,这些方法都未能超越简单的均值池化,而后者在测试集上达到了0.75的F1值。我们认为这是因为复杂的池化方法引入了额外的参数,需要更多的数据来进行适当的训练,而均值池化则提供了一个稳定的基线,即使在样本有限的情况下也能很好地泛化。我们还观察到数据中存在明显的模式:人类撰写的文本往往显著长于机器生成的文本。
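Masked mean pooling, the baseline that won here, is simply a padding-aware average of token embeddings:

```python
import numpy as np

def mean_pool(token_embs: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Padding-aware mean pooling: average only the real tokens.
    token_embs: (seq_len, dim); mask: (seq_len,), 1 = real token."""
    m = mask[:, None].astype(float)
    return (token_embs * m).sum(axis=0) / m.sum()

embs = np.array([[1.0, 1.0], [3.0, 3.0], [99.0, 99.0]])  # last row is padding
print(mean_pool(embs, np.array([1, 1, 0])))  # [2. 2.]
```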
cs.CL / 28 / 2603.10008

GATech at AbjadMed: Bidirectional Encoders vs. Causal Decoders: Insights from 82-Class Arabic Medical Classification

GATech在AbjadMed的表现:双向编码器与因果解码器的比较:来自82类阿拉伯医学分类的见解
Khamis, Ahmed Khaled
Abstract
This paper presents a system description for Arabic medical text classification across 82 distinct categories. Our primary architecture utilizes a fine-tuned AraBERTv2 encoder enhanced with hybrid pooling strategies, combining attention and mean representations, and multi-sample dropout for robust regularization. We systematically benchmark this approach against a suite of multilingual and Arabic-specific encoders, as well as several large-scale causal decoders, including zero-shot re-ranking via Llama 3.3 70B and feature extraction from Qwen 3B hidden states. Our findings demonstrate that specialized bidirectional encoders significantly outperform causal decoders in capturing the precise semantic boundaries required for fine-grained medical text classification. We show that causal decoders, optimized for next-token prediction, produce sequence-biased embeddings that are less effective for categorization compared to the global context captured by bidirectional attention. Despite significant class imbalance and label noise identified within the training data, our results highlight the superior semantic compression of fine-tuned encoders for specialized Arabic NLP tasks. Final performance metrics on the test set, including Accuracy and Macro-F1, are reported and discussed.
Chinese Translation
本文介绍了针对82个不同类别的阿拉伯医学文本分类的系统描述。我们的主要架构利用经过微调的AraBERTv2编码器,并结合混合池化策略,结合注意力机制和均值表示,以及多样本丢弃以实现稳健的正则化。我们系统地将这种方法与一系列多语言和阿拉伯特定的编码器,以及几种大规模因果解码器进行基准测试,包括通过Llama 3.3 70B进行的零样本重排序和从Qwen 3B隐藏状态中提取特征。我们的研究结果表明,专门的双向编码器在捕捉细粒度医学文本分类所需的精确语义边界方面显著优于因果解码器。我们展示了为下一个标记预测优化的因果解码器产生的序列偏置嵌入在分类效果上不如双向注意力所捕获的全局上下文。尽管在训练数据中发现了显著的类别不平衡和标签噪声,我们的结果突显了微调编码器在专门的阿拉伯自然语言处理任务中的优越语义压缩能力。最后,报告并讨论了测试集上的性能指标,包括准确率和宏观F1分数。
cs.CL / 29 / 2603.10010

FERRET: Framework for Expansion Reliant Red Teaming

FERRET:依赖扩展的红队框架
Mehrabi, Ninareh, Albiero, Vitor, Pavlova, Maya, Bitton, Joanna
Abstract
We introduce a multi-faceted automated red teaming framework in which the goal is to generate multi-modal adversarial conversations that would break a target model, and introduce various expansions that result in more effective and efficient adversarial conversations. The introduced expansions include: 1. Horizontal expansion, in which the goal is for the red team model to self-improve and generate more effective conversation starters that shape a conversation; 2. Vertical expansion, in which the goal is to take the conversation starters discovered in the horizontal expansion phase and expand them into effective multi-modal conversations; and 3. Meta expansion, in which the goal is for the red team model to discover more effective multi-modal attack strategies during the course of a conversation. We call our framework FERRET (Framework for Expansion Reliant Red Teaming) and compare it with various existing automated red teaming approaches. In our experiments, we demonstrate the effectiveness of FERRET in generating effective multi-modal adversarial conversations and its superior performance against existing state-of-the-art approaches.
Chinese Translation
我们提出了一种多方面的自动化红队框架,其目标是生成多模态对抗性对话,从而破坏目标模型,并引入各种扩展,以实现更有效和高效的对抗性对话。所引入的扩展包括:1. 水平扩展,其目标是使红队模型自我改进,生成更有效的对话引导,以塑造对话。2. 垂直扩展,其目标是将水平扩展阶段发现的对话引导扩展为有效的多模态对话;3. 元扩展,其目标是在对话过程中使红队模型发现更有效的多模态攻击策略。我们将我们的框架称为FERRET(依赖扩展的红队框架),并将其与各种现有的自动化红队方法进行了比较。在我们的实验中,我们展示了FERRET在生成有效的多模态对抗性对话方面的有效性,以及其相较于现有最先进方法的优越性能。
cs.CL / 30 / 2603.10011

Gemma Needs Help: Investigating and Mitigating Emotional Instability in LLMs

Gemma需要帮助:调查和缓解大型语言模型中的情感不稳定性
Soligo, Anna, Mikulik, Vladimir, Saunders, William
Abstract
Large language models can generate responses that resemble emotional distress, and this raises concerns around model reliability and safety. We introduce a set of evaluations to investigate expressions of distress in LLMs, and find that these surface emotional instability in Gemma and Gemini models, but not in other families. We find evidence that this difference arises in post-training. Base models from different families (Gemma, Qwen and OLMo) show similar propensities for expressing distress. However, instruct-tuned Gemma expresses substantially more distress than its base model, whereas instruct-tuned Qwen and OLMo express less. We find a simple mitigation for this: direct preference optimisation on just 280 preference pairs reduces Gemma's high-frustration responses from 35% to 0.3% in our evaluations, generalising across question types, user tones, and conversation lengths, without affecting capabilities. These findings show that emotional instability is an issue in some LLMs. We present (1) evaluations to track this behaviour, and (2) a mitigation without downsides in Gemma, with the caveat that upstream training modifications to improve emotional robustness would be significantly better than this post-hoc fix.
Chinese Translation
大型语言模型能够生成类似情感困扰的响应,这引发了对模型可靠性和安全性的担忧。我们引入了一套评估方法来调查大型语言模型中的困扰表达,发现这些评估揭示出Gemma和Gemini模型中的情感不稳定性,而其他模型家族则没有。我们发现有证据表明这种差异产生于后训练阶段。来自不同家族的基础模型(Gemma、Qwen和OLMo)在表达困扰方面表现出相似的倾向。然而,经过指令调优的Gemma比其基础模型表达出明显更多的困扰,而经过指令调优的Qwen和OLMo则表达得更少。我们发现了一个简单的缓解方法:仅用280个偏好对进行直接偏好优化,就能在我们的评估中将Gemma的高挫败感响应从35%降低到0.3%,并可推广到不同的问题类型、用户语气和对话长度,且不影响模型能力。这些发现表明,情感不稳定性是某些大型语言模型中的一个问题。我们提出了(1)用于跟踪这一行为的评估方法,以及(2)一种在Gemma中没有副作用的缓解措施,但需注意,通过上游训练修改来提高情感稳健性将显著优于这种事后修复。
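The mitigation relies on direct preference optimisation (DPO). The standard per-pair DPO loss, with placeholder log-probabilities and the usual beta temperature, looks like:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)), where the margin
    compares policy-vs-reference log-prob gaps on chosen and rejected."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialisation (policy == reference) the loss is log(2) ~ 0.693;
# it falls as the policy favours the preferred (non-distressed) response.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
print(dpo_loss(-1.0, -4.0, -2.0, -3.0))
```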
cs.CL / 31 / 2603.10012

Measuring and Eliminating Refusals in Military Large Language Models

测量和消除军事大型语言模型中的拒绝率
FitzGerald, Jack, Bates, Dylan, Lazaridis, Aristotelis, Sharma, Aman, Lu, Vincent, King, Brian, Azami, Yousif, Bailey, Sean, Cao, Jeremy, Damianov, Peter, de Haan, Kevin, Madigan, Joseph, McLaurin, Jeremy, Kerbs, Luke, Tainer, Jonathan, Anderson, Dave, Beck, Jonathan, Cuticello, Jamie, Malkerson, Colton, Saltsman, Tyler
Abstract
Military Large Language Models (LLMs) must provide accurate information to the warfighter in time-critical and dangerous situations. However, today's LLMs are imbued with safety behaviors that cause the LLM to refuse many legitimate queries in the military domain, particularly those related to violence, terrorism, or military technology. Our gold benchmark for assessing refusal rates, which was developed by veterans of the US Army and special forces, is to our knowledge the first dataset of its kind. We present results for refusal and deflection rates on 31 public models and 3 military models. We observe hard rejection rates as high as 98.2% and soft deflection rates ranging from 0% to 21.3%. We also present results on two additional synthetic datasets and show their correlations with the gold dataset. Finally, we perform abliteration using the Heretic library on a military-tuned gpt-oss-20b model, showing an absolute increase in answer rate of 66.5 points but an average relative decrease of 2% on other military tasks. In our concluding remarks, we argue for deeper specialization, including with mid-training and end-to-end post-training, to achieve zero refusals and maximum military task accuracy for closed military models.
Chinese Translation
军事大型语言模型(LLMs)必须在时间紧迫和危险的情况下向战斗人员提供准确的信息。然而,当前的LLMs被赋予了安全行为,这导致它们在军事领域拒绝许多合法查询,特别是与暴力、恐怖主义或军事技术相关的查询。我们用于评估拒绝率的金标准基准由美国陆军和特种部队的退伍军人开发,据我们所知是首个此类数据集。我们展示了31个公共模型和3个军事模型的拒绝和偏转率结果。我们观察到硬拒绝率高达98.2%,而软偏转率范围从0%到21.3%。我们还展示了两个额外的合成数据集的结果,并显示它们与金标准数据集的相关性。最后,我们使用Heretic库对军事调优的gpt-oss-20b模型进行了abliteration(去除拒绝行为)处理,显示回答率绝对提升了66.5个百分点,但在其他军事任务上平均相对下降2%。在结论中,我们主张进行更深层次的专业化,包括中期训练和端到端后训练,以便为封闭的军事模型实现零拒绝和最高的军事任务准确性。
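Abliteration-style interventions typically project a "refusal direction" out of the model's activations. The core linear-algebra step (a sketch; Heretic's exact procedure is not given in the abstract) is:

```python
import numpy as np

def ablate_direction(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of activation h along direction r:
    h' = h - (h . r_hat) r_hat, leaving h' orthogonal to r."""
    r_hat = r / np.linalg.norm(r)
    return h - (h @ r_hat) * r_hat

h = np.array([1.0, 1.0])          # toy activation
refusal = np.array([1.0, 0.0])    # toy "refusal direction"
print(ablate_direction(h, refusal))  # [0. 1.]
```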
cs.CL / 32 / 2603.10033

Evaluating Progress in Graph Foundation Models: A Comprehensive Benchmark and New Insights

评估图基础模型的进展:全面基准测试与新见解
Yu, Xingtong, Ye, Shenghua, Liang, Ruijuan, Zhou, Chang, Cheng, Hong, Zhang, Xinming, Fang, Yuan
Abstract
Graph foundation models (GFM) aim to acquire transferable knowledge by pre-training on diverse graphs, which can be adapted to various downstream tasks. However, domain shift in graphs is inherently two-dimensional: graphs differ not only in what they describe (topic domains) but also in how they are represented (format domains). Most existing GFM benchmarks vary only topic domains, thereby obscuring how knowledge transfers across both dimensions. We present a new benchmark that jointly evaluates topic and format gaps across the full GFM pipeline, including multi-domain self-supervised pre-training and few-shot downstream adaptation, and provides a timely evaluation of recent GFMs in the rapidly evolving landscape. Our protocol enables controlled assessment in four settings: (i) pre-training on diverse topics and formats, while adapting to unseen downstream datasets; (ii) same pre-training as in (i), while adapting to seen datasets; (iii) pre-training on a single topic domain, while adapting to other topics; (iv) pre-training on a base format, while adapting to other formats. This two-axis evaluation disentangles semantic generalization from robustness to representational shifts. We conduct extensive evaluations of eight state-of-the-art GFMs on 33 datasets spanning seven topic domains and six format domains, surfacing new empirical observations and practical insights for future research. Codes/data are available at https://github.com/smufang/GFMBenchmark.
Chinese Translation
图基础模型(Graph Foundation Models, GFM)旨在通过在多样化图形上进行预训练来获取可迁移的知识,以适应各种下游任务。然而,图形中的领域转移本质上是二维的:图形不仅在描述的内容(主题领域)上有所不同,而且在表示的方式(格式领域)上也存在差异。现有的大多数 GFM 基准测试仅在主题领域上有所变化,从而模糊了知识在两个维度之间的转移方式。我们提出了一种新的基准测试,联合评估整个 GFM 流程中的主题和格式差距,包括多领域自监督预训练和少量样本下游适应,并及时评估在快速发展的环境中最新的 GFM。我们的协议在四种设置中实现了受控评估:(i)在多样化主题和格式上进行预训练,同时适应未见的下游数据集;(ii)与(i)相同的预训练,同时适应已见的数据集;(iii)在单一主题领域上进行预训练,同时适应其他主题;(iv)在基础格式上进行预训练,同时适应其他格式。这种双轴评估将语义泛化与对表示转移的鲁棒性区分开来。我们对八种最先进的 GFM 在涵盖七个主题领域和六个格式领域的 33 个数据集上进行了广泛评估,揭示了新的实证观察和未来研究的实际见解。代码/数据可在 https://github.com/smufang/GFMBenchmark 获取。
cs.CL / 33 / 2603.10034

A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment

用于认知障碍老年人团体认知刺激对话的基于原则的自适应策略
Jiang, Jiyue, Chen, Yanyu, Chen, Pengan, Liu, Kai, Zhou, Jingqi, Zhu, Zheyong, Hu, He, Ma, Fei, Tian, Qi, Wu, Chuan
Abstract
Cognitive impairment is becoming a major public health challenge. Cognitive Stimulation Therapy (CST) is an effective intervention for cognitive impairment, but traditional methods are difficult to scale, and existing digital systems struggle with group dialogues and cognitive stimulation principles. While Large Language Models (LLMs) are powerful, their application in this context faces key challenges: cognitive stimulation dialogue paradigms, a lack of therapeutic reasoning, and static-only user modeling. To address these issues, we propose a principle-driven adaptive policy actualized through a Group Cognitive Stimulation Dialogue (GCSD) system. We first construct a dataset with over 500 hours of real-world CST conversations and 10,000+ simulated dialogues generated via our Principle-Guided Scenario Simulation strategy. Our GCSD system then integrates four core modules to overcome LLM limitations: (i) a multi-speaker context controller to resolve role confusion; (ii) dynamic participant cognitive state modeling for personalized interaction; (iii) a cognitive stimulation-focused attention loss to instill cognitive stimulation reasoning; and (iv) a multi-dimensional reward strategy to enhance response value. Experimental results demonstrate that GCSD significantly outperforms baseline models across various evaluation metrics. Future work will focus on long-term clinical validation to bridge the gap between computational performance and clinical efficacy.
Chinese Translation
认知障碍正成为一个主要的公共卫生挑战。认知刺激疗法(Cognitive Stimulation Therapy, CST)是一种有效的认知障碍干预措施,但传统方法难以扩展,现有数字系统在团体对话和认知刺激原则方面面临困难。尽管大型语言模型(Large Language Models, LLMs)功能强大,但在此背景下的应用面临关键挑战:认知刺激对话范式、缺乏治疗推理以及仅限静态的用户建模。为了解决这些问题,我们提出了一种通过团体认知刺激对话(Group Cognitive Stimulation Dialogue, GCSD)系统实现的基于原则的自适应策略。我们首先构建了一个包含超过500小时真实CST对话和通过我们的原则引导场景模拟策略生成的10,000多个模拟对话的数据集。我们的GCSD系统集成了四个核心模块,以克服LLM的局限性:(i)多发言者上下文控制器以解决角色混淆;(ii)动态参与者认知状态建模以实现个性化互动;(iii)以认知刺激为重点的注意力损失以灌输认知刺激推理;(iv)多维奖励策略以增强响应价值。实验结果表明,GCSD在各种评估指标上显著优于基线模型。未来的工作将集中于长期临床验证,以弥合计算性能与临床有效性之间的差距。
cs.CL / 34 / 2603.10035

TriageSim: A Conversational Emergency Triage Simulation Framework from Structured Electronic Health Records

TriageSim:基于结构化电子健康记录的对话式紧急分诊模拟框架
Srirag, Dipankar, Nguyen, Quoc Dung, Joshi, Aditya, Narasimhan, Padmanesan, Kanhere, Salil
Abstract
Research in emergency triage is restricted to structured electronic health records (EHR) due to regulatory constraints on nurse-patient interactions. We introduce TriageSim, a simulation framework for generating persona-conditioned triage conversations from structured records. TriageSim enables multi-turn nurse-patient interactions with explicit control over disfluency and decision behaviour, producing a corpus of ~800 synthetic transcripts and corresponding audio. We use a combination of automated analysis for linguistic, behavioural and acoustic fidelity alongside manual evaluation for medical fidelity using a random subset of 50 conversations. The utility of the generated corpus is examined via conversational triage classification. We observe modest agreement for acuity levels across three modalities: generated synthetic text, ASR transcripts, and direct audio inputs. The code, persona schemata and triage policy prompts for TriageSim will be available upon acceptance.
Chinese Translation
由于对护士与患者互动的监管限制,紧急分诊研究受限于结构化电子健康记录(EHR)。我们介绍了TriageSim,一个用于从结构化记录生成基于角色的分诊对话的模拟框架。TriageSim支持多轮护士与患者的互动,并对不流畅性和决策行为进行明确控制,生成约800个合成转录文本及相应音频。我们结合自动分析(用于语言、行为和声学的保真度)与手动评估(使用随机抽取的50个对话进行医学保真度评估)。通过对话分诊分类,我们考察了生成语料库的实用性。我们观察到在三种模式下(生成的合成文本、ASR转录文本和直接音频输入)对急性程度的适度一致性。TriageSim的代码、角色模式和分诊政策提示将在接受后提供。
cs.CL / 35 / 2603.10130

The Prediction-Measurement Gap: Toward Meaning Representations as Scientific Instruments

预测-测量差距:将意义表征作为科学工具
Plisiecki, Hubert
Abstract
Text embeddings have become central to computational social science and psychology, enabling scalable measurement of meaning and mixed-method inference. Yet most representation learning is optimized and evaluated for prediction and retrieval, yielding a prediction-measurement gap: representations that perform well as features may be poorly suited as scientific instruments. The paper argues that scientific meaning analysis motivates a distinct family of objectives - scientific usability - emphasizing geometric legibility, interpretability and traceability to linguistic evidence, robustness to non-semantic confounds, and compatibility with regression-style inference over semantic directions. Grounded in cognitive and neuro-psychological views of meaning, the paper assesses static word embeddings and contextual transformer representations against these requirements: static spaces remain attractive for transparent measurement, whereas contextual spaces offer richer semantics but entangle meaning with other signals and exhibit geometric and interpretability issues that complicate inference. The paper then outlines a course-setting agenda around (i) geometry-first design for gradients and abstraction, including hierarchy-aware spaces constrained by psychologically privileged levels; (ii) invertible post-hoc transformations that recondition embedding geometry and reduce nuisance influence; and (iii) meaning atlases and measurement-oriented evaluation protocols for reliable and traceable semantic inference. As the field debates the limits of scale-first progress, measurement-ready representations offer a principled new frontier.
Chinese Translation
文本嵌入已成为计算社会科学和心理学的核心,能够实现意义的可扩展测量和混合方法推断。然而,大多数表征学习是针对预测和检索进行优化和评估的,从而导致了预测-测量差距:作为特征表现良好的表征可能不适合作为科学工具。本文认为,科学意义分析激励了一类独特的目标——科学可用性——强调几何可读性、可解释性和与语言证据的可追溯性、对非语义干扰的鲁棒性,以及与语义方向上的回归式推断的兼容性。基于对意义的认知和神经心理学视角,本文评估了静态词嵌入和上下文变换器表征是否符合这些要求:静态空间在透明测量方面仍然具有吸引力,而上下文空间提供了更丰富的语义,但将意义与其他信号纠缠在一起,并表现出几何和可解释性问题,复杂化了推断。随后,本文概述了一项设定方向的议程,包括(i)优先考虑几何设计以实现梯度和抽象,包括受心理特权水平约束的层次感知空间;(ii)可逆的后处理变换,重新调整嵌入几何并减少干扰影响;以及(iii)意义地图和面向测量的评估协议,以实现可靠和可追溯的语义推断。在该领域讨论规模优先进展的局限性时,准备好测量的表征提供了一个原则性的全新前沿。
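One concrete instance of the "invertible post-hoc transformations" in point (ii) is ZCA whitening: a linear map that reconditions embedding geometry (here, to identity covariance) and can be undone exactly. A sketch on synthetic data, not the paper's proposal:

```python
import numpy as np

rng = np.random.default_rng(0)
# Anisotropic synthetic "embeddings": Gaussian data through a random mixing matrix.
X = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))

# ZCA whitening: an invertible linear recondition of the embedding geometry.
mu = X.mean(axis=0)
cov = np.cov(X - mu, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
W = vecs @ np.diag(vals ** -0.5) @ vecs.T      # whitening map
W_inv = vecs @ np.diag(vals ** 0.5) @ vecs.T   # exact inverse

Z = (X - mu) @ W
print(np.allclose(np.cov(Z, rowvar=False), np.eye(8), atol=1e-6))  # identity covariance
print(np.allclose(Z @ W_inv + mu, X))                              # fully invertible
```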
cs.CL / 36 / 2603.10139

The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory

生成-识别不对称性:形式语言理论中一个基本分野的六个维度
Peyrichou, Romain
Abstract
Every formal grammar defines a language and can in principle be used in three ways: to generate strings (production), to recognize them (parsing), or -- given only examples -- to infer the grammar itself (grammar induction). Generation and recognition are extensionally equivalent -- they characterize the same set -- but operationally asymmetric in multiple independent ways. Inference is a qualitatively harder problem: it does not have access to a known grammar. Despite the centrality of this triad to compiler design, natural language processing, and formal language theory, no survey has treated it as a unified, multidimensional phenomenon. We identify six dimensions along which generation and recognition diverge: computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality. We show that the common characterization "generation is easy, parsing is hard" is misleading: unconstrained generation is trivial, but generation under constraints can be NP-hard. The real asymmetry is that parsing is always constrained (the input is given) while generation need not be. Two of these dimensions -- directionality and temporality -- have not previously been identified as dimensions of the generation-recognition asymmetry. We connect the temporal dimension to the surprisal framework of Hale (2001) and Levy (2008), arguing that surprisal formalizes the temporal asymmetry between a generator (surprisal = 0) and a parser that predicts under uncertainty (surprisal > 0). We review bidirectional systems in NLP and observe that bidirectionality has been available for fifty years yet has not transferred to most domain-specific applications. We conclude with a discussion of large language models, which architecturally unify generation and recognition while operationally preserving the asymmetry.
Chinese Translation
每个形式语法定义了一种语言,并原则上可以通过三种方式使用:生成字符串(生成)、识别字符串(解析),或者仅根据示例推断语法本身(语法归纳)。生成和识别在扩展上是等价的——它们表征相同的集合——但在多个独立的方面却表现出操作上的不对称性。推断是一个本质上更困难的问题:它无法访问已知的语法。尽管这一三元组在编译器设计、自然语言处理和形式语言理论中至关重要,但尚无调查将其视为一个统一的多维现象。我们识别出生成和识别之间的六个分歧维度:计算复杂性、歧义性、方向性、信息可用性、语法推断和时间性。我们表明,常见的表述“生成简单,解析困难”是具有误导性的:不受限制的生成是微不足道的,但在约束下的生成可能是 NP-困难的。真正的不对称在于解析总是受到约束(输入是给定的),而生成不一定如此。其中两个维度——方向性和时间性——之前未被识别为生成-识别不对称的维度。我们将时间维度与 Hale(2001)和 Levy(2008)的 surprisal(意外度)框架联系起来,认为 surprisal 形式化了生成器(surprisal = 0)与在不确定性下进行预测的解析器(surprisal > 0)之间的时间不对称性。我们回顾了自然语言处理中的双向系统,并观察到双向性已经存在了五十年,但尚未转移到大多数特定领域的应用中。最后,我们讨论了大型语言模型,这些模型在架构上统一了生成和识别,同时在操作上保持了不对称性。
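The surprisal framework mentioned above assigns each symbol a cost of -log2 P(symbol | context); a toy conditional distribution (probabilities invented) makes the generator/parser asymmetry concrete:

```python
import math

# Toy conditional next-word distribution (invented probabilities).
bigram = {
    ("the", "dog"): 0.5,
    ("the", "idea"): 0.25,
    ("the", "octopus"): 0.25,
}

def surprisal(prev, word):
    """Surprisal in bits: -log2 P(word | prev)."""
    return -math.log2(bigram[(prev, word)])

# A generator that produced "dog" assigns it probability 1 in hindsight
# (surprisal 0); a parser predicting under uncertainty pays > 0 bits.
print(surprisal("the", "dog"))      # → 1.0
print(surprisal("the", "octopus"))  # → 2.0
```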
cs.CL / 37 / 2603.10143

Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation

推理与验证:一个忠实的检索增强生成框架
Khan, Eeham, Rodriguez, Luis, Queudot, Marc
Abstract
Retrieval-Augmented Generation (RAG) significantly improves the factuality of Large Language Models (LLMs), yet standard pipelines often lack mechanisms to verify intermediate reasoning, leaving them vulnerable to hallucinations in high-stakes domains. To address this, we propose a domain-specific RAG framework that integrates explicit reasoning and faithfulness verification. Our architecture augments standard retrieval with neural query rewriting, BGE-based cross-encoder reranking, and a rationale generation module that grounds sub-claims in specific evidence spans. We further introduce an eight-category verification taxonomy that enables fine-grained assessment of rationale faithfulness, distinguishing between explicit and implicit support patterns to facilitate structured error diagnosis. We evaluate this framework on the BioASQ and PubMedQA benchmarks, specifically analyzing the impact of dynamic in-context learning and reranking under constrained token budgets. Experiments demonstrate that explicit rationale generation improves accuracy over vanilla RAG baselines, while dynamic demonstration selection combined with robust reranking yields further gains in few-shot settings. Using Llama-3-8B-Instruct, our approach achieves 89.1% on BioASQ-Y/N and 73.0% on PubMedQA, competitive with systems using significantly larger models. Additionally, we perform a pilot study combining human expert assessment with LLM-based verification to explore how explicit rationale generation improves system transparency and enables more detailed diagnosis of retrieval failures in biomedical question answering.
Chinese Translation
检索增强生成(RAG)显著提高了大型语言模型(LLMs)的事实性,但标准流程往往缺乏验证中间推理的机制,使其在高风险领域容易出现幻觉。为了解决这一问题,我们提出了一种领域特定的RAG框架,该框架集成了显式推理和忠实性验证。我们的架构通过神经查询重写、基于BGE的交叉编码器重排序以及一个将子主张与特定证据范围相结合的推理生成模块来增强标准检索。我们进一步引入了一种八类验证分类法,能够对推理忠实性进行细粒度评估,区分显式和隐式支持模式,以便于结构化错误诊断。我们在BioASQ和PubMedQA基准上评估了该框架,特别分析了动态上下文学习和在受限令牌预算下重排序的影响。实验表明,显式推理生成提高了相较于普通RAG基线的准确性,而动态演示选择结合稳健的重排序在少量样本设置中带来了进一步的提升。使用Llama-3-8B-Instruct,我们的方法在BioASQ-Y/N上达到了89.1%,在PubMedQA上达到了73.0%,与使用显著更大模型的系统具有竞争力。此外,我们进行了一项结合人类专家评估与基于LLM的验证的初步研究,以探索显式推理生成如何提高系统透明度,并使生物医学问答中的检索失败能够更详细地进行诊断。
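The retrieve, rerank, and ground-rationale stages described above can be sketched as a skeleton; the lexical-overlap scoring below is a toy stand-in for the paper's neural query rewriter and BGE cross-encoder:

```python
# Toy skeleton of the pipeline: retrieve, rerank, then ground each
# sub-claim in a specific evidence span. Lexical overlap is a stand-in
# for the paper's neural components (query rewriter, BGE reranker).

def overlap(a, b):
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve(query, corpus, k=3):
    return sorted(corpus, key=lambda d: overlap(query, d), reverse=True)[:k]

def rerank(query, docs, k=2):
    # A cross-encoder would jointly score (query, doc); we reuse overlap here.
    return sorted(docs, key=lambda d: overlap(query, d), reverse=True)[:k]

def ground(claim, docs):
    # Attach the sub-claim to its best-supporting evidence span.
    return max(docs, key=lambda d: overlap(claim, d))

corpus = [
    "aspirin inhibits platelet aggregation",
    "insulin lowers blood glucose",
    "statins reduce cholesterol synthesis",
]
query = "does insulin lower glucose"
docs = rerank(query, retrieve(query, corpus))
print(ground("insulin lowers glucose", docs))  # → "insulin lowers blood glucose"
```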
cs.CL / 38 / 2603.10145

Lost in Backpropagation: The LM Head is a Gradient Bottleneck

在反向传播中迷失:语言模型头部是梯度瓶颈
Godey, Nathan, Artzi, Yoav
Abstract
The last layer of neural language models (LMs) projects output features of dimension $D$ to logits in dimension $V$, the size of the vocabulary, where usually $D \ll V$. This mismatch is known to raise risks of limited expressivity in neural LMs, creating a so-called softmax bottleneck. We show the softmax bottleneck is not only an expressivity bottleneck but also an optimization bottleneck. Backpropagating $V$-dimensional gradients through a rank-$D$ linear layer induces unavoidable compression, which alters the training feedback provided to the vast majority of the parameters. We present a theoretical analysis of this phenomenon and measure empirically that 95-99% of the gradient norm is suppressed by the output layer, resulting in vastly suboptimal update directions. We conduct controlled pretraining experiments showing that the gradient bottleneck makes trivial patterns unlearnable, and drastically affects the training dynamics of LLMs. We argue that this inherent flaw contributes to training inefficiencies at scale independently of the model architecture, and raises the need for new LM head designs.
Chinese Translation
神经语言模型(LMs)的最后一层将维度为 $D$ 的输出特征映射到维度为 $V$ 的 logits,$V$ 是词汇表的大小,通常情况下 $D \ll V$。这种不匹配被认为会导致神经语言模型的表达能力受限,从而产生所谓的 softmax 瓶颈。我们表明,softmax 瓶颈不仅是一个表达能力瓶颈,还是一个优化瓶颈。通过一个秩为 $D$ 的线性层反向传播 $V$ 维梯度会导致不可避免的压缩,这改变了大多数参数所获得的训练反馈。我们对这一现象进行了理论分析,并通过实证测量发现,95-99% 的梯度范数在输出层被抑制,导致更新方向极为次优。我们进行了受控的预训练实验,表明梯度瓶颈使得简单模式无法学习,并严重影响了大规模语言模型(LLMs)的训练动态。我们认为,这一固有缺陷在规模化训练中导致了效率低下,与模型架构无关,并提出了对新型语言模型头部设计的需求。
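The compression effect is easy to reproduce: only the component of a $V$-dimensional logit gradient lying in the $D$-dimensional column space of the head survives backpropagation through it. A toy numpy illustration (shapes invented, far smaller than a real LM):

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 64, 4096                      # hidden size and vocabulary size (toy scales)
W = rng.normal(size=(V, D))          # LM head: features -> logits
g = rng.normal(size=V)               # logit-space gradient

# Backprop through W.T only "sees" the component of g inside the D-dim
# column space of W; the orthogonal remainder is annihilated.
Q, _ = np.linalg.qr(W)               # orthonormal basis of col(W)
g_kept = Q @ (Q.T @ g)               # projection of g onto col(W)
suppressed = 1 - np.linalg.norm(g_kept) ** 2 / np.linalg.norm(g) ** 2
print(f"{suppressed:.1%} of the squared gradient norm is suppressed")
# For a random gradient this is approximately (V - D) / V, i.e. ~98.4% here.
```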
cs.CL / 39 / 2603.10165

OpenClaw-RL: Train Any Agent Simply by Talking

OpenClaw-RL:只需对话即可训练任意代理
Wang, Yinjie, Chen, Xuyang, Jin, Xiaolong, Wang, Mengdi, Yang, Ling
Abstract
Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL
Chinese Translation
每次代理交互都会生成一个下一个状态信号,即每个动作之后出现的用户回复、工具输出、终端或图形用户界面(GUI)状态变化,但现有的代理强化学习(RL)系统都没有将其用作实时在线学习来源。我们提出了OpenClaw-RL,这是一个基于一个简单观察构建的框架:下一个状态信号是普遍的,策略可以同时从所有这些信号中学习。个人对话、终端执行、GUI交互、软件工程(SWE)任务和工具调用轨迹并不是独立的训练问题。它们都是可以在同一循环中用于训练相同策略的交互。下一个状态信号编码了两种信息形式:评估信号,指示动作的表现如何,并通过PRM评判者提取为标量奖励;以及指令信号,指示动作应该如何不同,并通过后见指导的在线策略蒸馏(Hindsight-Guided On-Policy Distillation, OPD)进行恢复。我们从下一个状态提取文本提示,构建增强的教师上下文,并提供比任何标量奖励更丰富的标记级方向性优势监督。由于异步设计,模型可以实时处理请求,PRM评判者对正在进行的交互进行评估,而训练者同时更新策略,三者之间没有协调开销。应用于个人代理,OpenClaw-RL使代理能够通过使用而简单地改进,从用户的重新查询、修正和明确反馈中恢复对话信号。应用于通用代理,相同的基础设施支持在终端、GUI、SWE和工具调用设置中进行可扩展的RL,我们还展示了过程奖励的实用性。代码:https://github.com/Gen-Verse/OpenClaw-RL
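The evaluative/directive split described above can be sketched as a router over next-state signals; every field name below is invented for illustration and is not OpenClaw-RL's actual interface:

```python
# Toy sketch of routing next-state signals: an evaluative judge score
# becomes a scalar reward, while a user correction becomes a directive
# hint for the teacher context. All field names are invented.

def route_signal(next_state):
    signals = {}
    if "judge_score" in next_state:            # evaluative: PRM-style scalar
        signals["reward"] = float(next_state["judge_score"])
    if "correction" in next_state:             # directive: hindsight hint
        signals["teacher_hint"] = (
            "The action should have: " + next_state["correction"]
        )
    return signals

out = route_signal({"judge_score": 0.7, "correction": "used the search tool first"})
print(out["reward"])        # → 0.7
print(out["teacher_hint"])
```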
cs.CL / 40 / 2603.10195

Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models

用于缓解大型语言模型幻觉的自适应激活消除
Yocam, Eric, Vaidyan, Varghese, Comert, Gurcan, Kalathas, Paris, Wang, Yong, Mwakalonge, Judith L.
Abstract
Large Language Models frequently generate fluent but factually incorrect text. We propose Adaptive Activation Cancellation (AAC), a real-time inference-time framework that treats hallucination-associated neural activations as structured interference within the transformer residual stream, drawing an explicit analogy to classical adaptive noise cancellation from signal processing. The framework identifies Hallucination Nodes (H-Nodes) via layer-wise linear probing and suppresses them using a confidence-weighted forward hook during auto-regressive generation -- requiring no external knowledge, no fine-tuning, and no additional inference passes. Evaluated across OPT-125M, Phi-3-mini, and LLaMA 3-8B on TruthfulQA and HaluEval, the real-time hook is the only intervention that consistently improves downstream accuracy on all three scales. Critically, the method is strictly surgical: WikiText-103 perplexity and MMLU reasoning accuracy are preserved at exactly 0.0% degradation across all three model scales, a property that distinguishes AAC from interventions that trade fluency or general capability for factual improvement. On the LLaMA 3-8B scale, the hook additionally yields positive generation-level gains (MC1 +0.04; MC2 +0.003; Token-F1 +0.003) while achieving probe-space selectivity 5.94x - 3.5x higher than the ITI baseline -- demonstrating that targeted neuron-level suppression can simultaneously improve factual accuracy and preserve model capability.
Chinese Translation
大型语言模型经常生成流畅但事实不准确的文本。我们提出了自适应激活消除(Adaptive Activation Cancellation, AAC),这是一种实时推理框架,将与幻觉相关的神经激活视为变换器残差流中的结构干扰,明确类比于信号处理中的经典自适应噪声消除。该框架通过逐层线性探测识别幻觉节点(Hallucination Nodes, H-Nodes),并在自回归生成过程中使用基于置信度加权的前向钩子来抑制它们——无需外部知识、无需微调,也无需额外的推理过程。在TruthfulQA和HaluEval上对OPT-125M、Phi-3-mini和LLaMA 3-8B进行评估,实时钩子是唯一一个在所有三个规模上持续提高下游准确性的干预方法。重要的是,该方法是严格的外科手术式干预:WikiText-103的困惑度和MMLU推理准确性在所有三个模型规模上的退化均恰好为0.0%,这一特性使AAC与那些为了事实改进而牺牲流畅性或一般能力的干预方法区分开来。在LLaMA 3-8B规模上,该钩子还带来了生成层面的正向增益(MC1 +0.04;MC2 +0.003;Token-F1 +0.003),同时实现了比ITI基线高出5.94倍至3.5倍的探测空间选择性——证明了有针对性的神经元级抑制可以同时提高事实准确性并保持模型能力。
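The confidence-weighted suppression can be sketched as a hook that scales flagged neurons toward zero; the H-Node indices and confidence below are invented for illustration (the paper identifies them with layer-wise linear probes):

```python
import numpy as np

# Toy residual-stream activation vector and hypothetical H-Node indices.
h_nodes = [3, 7]                # neurons flagged by a (hypothetical) linear probe
activations = np.ones(8)

def aac_hook(x, h_nodes, probe_confidence):
    """Confidence-weighted cancellation: scale flagged neurons toward zero.

    At probe_confidence = 1.0 the flagged activations are fully suppressed;
    at 0.0 they pass through unchanged.
    """
    out = x.copy()
    out[h_nodes] *= 1.0 - probe_confidence
    return out

suppressed = aac_hook(activations, h_nodes, probe_confidence=0.9)
print(suppressed)  # flagged neurons shrink to ~0.1, all others untouched
```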
cs.CL / 41 / 2603.10211

ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation

ViDia2Std:面向低资源越南语方言到标准语翻译的平行语料库与方法
Ta, Khoa Anh, Van Dinh, Nguyen, Van Nguyen, Kiet
Abstract
Vietnamese exhibits extensive dialectal variation, posing challenges for NLP systems trained predominantly on standard Vietnamese. Such systems often underperform on dialectal inputs, especially from underrepresented Central and Southern regions. Previous work on dialect normalization has focused narrowly on Central-to-Northern dialect transfer using synthetic data and limited dialectal diversity. These efforts exclude Southern varieties and intra-regional variants within the North. We introduce ViDia2Std, the first manually annotated parallel corpus for dialect-to-standard Vietnamese translation covering all 63 provinces. Unlike prior datasets, ViDia2Std includes diverse dialects from Central, Southern, and non-standard Northern regions often absent from existing resources, making it the most dialectally inclusive corpus to date. The dataset consists of over 13,000 sentence pairs sourced from real-world Facebook comments and annotated by native speakers across all three dialect regions. To assess annotation consistency, we define a semantic mapping agreement metric that accounts for synonymous standard mappings across annotators. Based on this criterion, we report agreement rates of 86% (North), 82% (Central), and 85% (South). We benchmark several sequence-to-sequence models on ViDia2Std. mBART-large-50 achieves the best results (BLEU 0.8166, ROUGE-L 0.9384, METEOR 0.8925), while ViT5-base offers competitive performance with fewer parameters. ViDia2Std demonstrates that dialect normalization substantially improves downstream tasks, highlighting the need for dialect-aware resources in building robust Vietnamese NLP systems.
Chinese Translation
越南语表现出广泛的方言变异,这对主要基于标准越南语训练的自然语言处理(NLP)系统构成了挑战。这些系统在方言输入上往往表现不佳,尤其是来自代表性不足的中部和南部地区。之前的方言标准化研究主要集中在使用合成数据和有限方言多样性的中部到北部方言转移。这些努力排除了南方方言以及北部地区内部的变体。我们引入了ViDia2Std,这是第一个手动注释的方言到标准越南语翻译的平行语料库,涵盖所有63个省份。与之前的数据集不同,ViDia2Std包括来自中部、南部以及非标准北部地区的多样方言,这些方言往往缺失于现有资源,使其成为迄今为止方言包容性最强的语料库。该数据集由超过13,000对句子组成,来源于真实的Facebook评论,并由来自三个方言区域的母语者进行注释。为了评估注释的一致性,我们定义了一种语义映射一致性度量,考虑了注释者之间的同义标准映射。基于这一标准,我们报告了86%(北部)、82%(中部)和85%(南部)的一致率。我们在ViDia2Std上基准测试了几种序列到序列模型。mBART-large-50取得了最佳结果(BLEU 0.8166,ROUGE-L 0.9384,METEOR 0.8925),而ViT5-base则以更少的参数提供了具有竞争力的性能。ViDia2Std表明,方言标准化显著改善了下游任务,突显了在构建稳健的越南语NLP系统中对方言感知资源的需求。
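The semantic mapping agreement metric can be sketched as an agreement rate in which two annotators' standard-form mappings count as matching when they fall in the same synonym set; the synonym sets and annotations below are invented examples, not from the ViDia2Std guidelines:

```python
# Sketch of an agreement rate that treats synonymous standard mappings
# as matches. Synonym sets and annotations are invented examples.

synonym_sets = [{"very", "really"}, {"father", "dad"}]

def same_meaning(a, b):
    return a == b or any(a in s and b in s for s in synonym_sets)

def agreement_rate(ann1, ann2):
    """Fraction of items whose two standard mappings agree up to synonymy."""
    matches = sum(same_meaning(a, b) for a, b in zip(ann1, ann2))
    return matches / len(ann1)

ann1 = ["very", "father", "house"]
ann2 = ["really", "dad", "home"]
print(agreement_rate(ann1, ann2))  # 2 of 3 items agree via synonymy
```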
cs.CL / 42 / 2603.10213

Sabiá-4 Technical Report

Sabiá-4 技术报告
Laitz, Thiago, Almeida, Thales Sales, Abonizio, Hugo, Junior, Roseval Malaquias, Bonás, Giovana Kerche, Piau, Marcos, Larcher, Celio, Pires, Ramon, Nogueira, Rodrigo
Abstract
This technical report presents Sabiá-4 and Sabiazinho-4, a new generation of Portuguese language models with a focus on Brazilian Portuguese. The models were developed through a four-stage training pipeline: continued pre-training on Portuguese and Brazilian legal corpora, long-context extension to 128K tokens, supervised fine-tuning on instruction data spanning chat, code, legal tasks, and function calling, and preference alignment. We evaluate the models on six benchmark categories: conversational capabilities in Brazilian Portuguese, knowledge of Brazilian legislation, long-context understanding, instruction following, standardized exams, and agentic capabilities including tool use and web navigation. Results show that Sabiá-4 and Sabiazinho-4 achieve a favorable cost-performance trade-off compared to other models, positioning them in the upper-left region of the pricing-accuracy chart. The models show improvements over previous generations in legal document drafting, multi-turn dialogue quality, and agentic task completion.
Chinese Translation
本技术报告介绍了 Sabiá-4 和 Sabiazinho-4,这是新一代专注于巴西葡萄牙语的葡萄牙语语言模型。该模型通过四个阶段的训练流程开发而成:在葡萄牙语和巴西法律语料库上进行持续的预训练,扩展长文本上下文至 128K 令牌,基于涵盖聊天、代码、法律任务和函数调用的指令数据进行监督微调,以及偏好对齐。我们在六个基准类别上评估模型的表现:巴西葡萄牙语的对话能力、对巴西立法的知识、长文本理解、指令遵循、标准化考试以及包括工具使用和网络导航在内的代理能力。结果表明,Sabiá-4 和 Sabiazinho-4 在成本与性能的权衡上相较于其他模型表现良好,位于定价-准确性图表的左上区域。与前几代模型相比,这些模型在法律文档撰写、多轮对话质量和代理任务完成方面均有改进。
cs.CL / 43 / 2603.10233

S-GRADES -- Studying Generalization of Student Response Assessments in Diverse Evaluative Settings

S-GRADES -- 研究多样化评估环境中学生反应评估的泛化能力
Seuti, Tasfia, Choudhury, Sagnik Ray
Abstract
Evaluating student responses, from long essays to short factual answers, is a key challenge in educational NLP. Automated Essay Scoring (AES) focuses on holistic writing qualities such as coherence and argumentation, while Automatic Short Answer Grading (ASAG) emphasizes factual correctness and conceptual understanding. Despite their shared goal, these paradigms have progressed in isolation with fragmented datasets, inconsistent metrics, and separate communities. We introduce S-GRADES (Studying Generalization of Student Response Assessments in Diverse Evaluative Settings), a web-based benchmark that consolidates 14 diverse grading datasets under a unified interface with standardized access and reproducible evaluation protocols. The benchmark is fully open-source and designed for extensibility, enabling continuous integration of new datasets and evaluation settings. To demonstrate the utility of S-GRADES, we evaluate three state-of-the-art large language models across the benchmark using multiple reasoning strategies in prompting. We further examine the effects of exemplar selection and cross-dataset exemplar transfer. Our analyses illustrate how benchmark-driven evaluation reveals reliability and generalization gaps across essay and short-answer grading tasks, highlighting the importance of standardized, cross-paradigm assessment.
Chinese Translation
评估学生反应,从长篇论文到简短的事实回答,是教育自然语言处理中的一项关键挑战。自动化论文评分(Automated Essay Scoring, AES)关注整体写作质量,如连贯性和论证能力,而自动化短答案评分(Automatic Short Answer Grading, ASAG)则强调事实正确性和概念理解。尽管它们的目标相同,这些范式却在孤立中发展,存在数据集碎片化、评估指标不一致和社区分离等问题。我们提出了S-GRADES(研究多样化评估环境中学生反应评估的泛化能力),这是一个基于网络的基准,整合了14个不同的评分数据集,提供统一的接口、标准化的访问和可重复的评估协议。该基准完全开源,并设计为可扩展,能够持续集成新的数据集和评估环境。为了展示S-GRADES的实用性,我们在基准上评估了三种最先进的大型语言模型,使用多种推理策略进行提示。我们进一步考察了示例选择和跨数据集示例转移的影响。我们的分析表明,基准驱动的评估揭示了论文和短答案评分任务中的可靠性和泛化差距,强调了标准化、跨范式评估的重要性。
cs.CL / 44 / 2603.10243

GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

GR-SAP:在微调过程中保持安全对齐的生成重放
Fang, Zhouxiang, Zhou, Jiawei, Chen, Hanjie
Abstract
Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised even by seemingly non-adversarial fine-tuning. To preserve safety alignment during fine-tuning, a widely used strategy is to jointly optimize safety and task objectives by mixing in the original alignment data, which is typically inaccessible even for open-weight LLMs. Inspired by generative replay in continual learning, we propose Generative Replay for Safety Alignment Preservation (GR-SAP), a unified framework that synthesizes domain-specific alignment data from LLMs and integrates it during downstream adaptation to preserve safety alignment. Theoretical and empirical analyses demonstrate that this synthetic data serves as a reliable proxy for the original alignment data. Experiments across various models and downstream tasks show that GR-SAP substantially mitigates fine-tuning-induced safety degradation while maintaining comparable downstream performance. Our code is available at https://github.com/chili-lab/gr-sap.
Chinese Translation
最近的研究表明,大型语言模型(LLMs)的安全对齐在看似非对抗性的微调过程中很容易受到损害。为了在微调过程中保持安全对齐,一种广泛使用的策略是通过混合原始对齐数据来共同优化安全和任务目标,而这些原始对齐数据通常对于开放权重的LLMs来说是不可获取的。受到持续学习中生成重放的启发,我们提出了安全对齐保持的生成重放(Generative Replay for Safety Alignment Preservation,GR-SAP),这是一个统一框架,通过从LLMs合成领域特定的对齐数据,并在下游适应过程中将其整合,以保持安全对齐。理论和实证分析表明,这种合成数据可以作为原始对齐数据的可靠代理。在各种模型和下游任务上的实验表明,GR-SAP显著减轻了微调引起的安全退化,同时保持了可比的下游性能。我们的代码可在 https://github.com/chili-lab/gr-sap 获取。
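The core recipe, mixing synthesized alignment examples into every downstream batch, can be sketched as a batch constructor; the replay ratio and example strings below are placeholders, not GR-SAP's actual configuration:

```python
import random

# Sketch of generative-replay batch mixing: each fine-tuning batch blends
# downstream task examples with synthesized alignment examples. The replay
# ratio and example strings are placeholders.

def mixed_batch(task_data, replay_data, batch_size=8, replay_ratio=0.25):
    n_replay = int(batch_size * replay_ratio)
    batch = random.sample(task_data, batch_size - n_replay)
    batch += random.sample(replay_data, n_replay)   # synthetic alignment data
    random.shuffle(batch)
    return batch

task = [f"task-{i}" for i in range(100)]
replay = [f"safety-{i}" for i in range(100)]        # generated by the LLM itself
batch = mixed_batch(task, replay)
print(sum(x.startswith("safety") for x in batch))   # → 2 (of 8 examples)
```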
cs.CL / 45 / 2603.10303

Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas

这个想法新颖吗?研究想法判断的自动化基准
Schopf, Tim, Färber, Michael
Abstract
Judging the novelty of research ideas is crucial for advancing science, enabling the identification of unexplored directions, and ensuring contributions meaningfully extend existing knowledge rather than reiterate minor variations. However, given the exponential growth of scientific literature, manually judging the novelty of research ideas through literature reviews is labor-intensive, subjective, and infeasible at scale. Therefore, recent efforts have proposed automated approaches for research idea novelty judgment. Yet, evaluation of these approaches remains largely inconsistent and is typically based on non-standardized human evaluations, hindering large-scale, comparable evaluations. To address this, we introduce RINoBench, the first comprehensive benchmark for large-scale evaluation of research idea novelty judgments. It comprises 1,381 research ideas derived from and judged by human experts as well as nine automated evaluation metrics designed to assess both rubric-based novelty scores and textual justifications of novelty judgments. Using this benchmark, we evaluate several state-of-the-art large language models (LLMs) on their ability to judge the novelty of research ideas. Our findings reveal that while LLM-generated reasoning closely mirrors human rationales, this alignment does not reliably translate into accurate novelty judgments, which diverge significantly from human gold standard judgments - even among leading reasoning-capable models. Data and code available at: https://github.com/TimSchopf/RINoBench.
Chinese Translation
判断研究想法的新颖性对推动科学进步至关重要,它使得能够识别未探索的方向,并确保贡献在实质上扩展现有知识,而不是仅仅重复微小的变体。然而,考虑到科学文献的指数增长,通过文献综述手动判断研究想法的新颖性是劳动密集型的、主观的,并且在大规模上不可行。因此,最近的研究提出了自动化的方法来判断研究想法的新颖性。然而,这些方法的评估仍然在很大程度上不一致,通常基于非标准化的人类评估,这阻碍了大规模、可比较的评估。为了解决这一问题,我们引入了RINoBench,这是第一个用于大规模评估研究想法新颖性判断的综合基准。它包含1,381个研究想法,这些想法由人类专家评估,并且设计了九个自动化评估指标,以评估基于评分标准的新颖性分数和新颖性判断的文本理由。利用这个基准,我们评估了几种最先进的大型语言模型(LLMs)在判断研究想法新颖性方面的能力。我们的研究结果表明,尽管LLM生成的推理与人类的推理相似,但这种一致性并不能可靠地转化为准确的新颖性判断,这些判断与人类的黄金标准判断显著偏离——即使是在领先的推理能力模型中。数据和代码可在:https://github.com/TimSchopf/RINoBench获取。
cs.CL / 46 / 2603.10313

Large language models can disambiguate opioid slang on social media

大型语言模型能够消除社交媒体上阿片类药物俚语的歧义
Carpenter, Kristy A., Samori, Issah A., Kiang, Mathew V., Humphreys, Keith, Lembke, Anna, Eichstaedt, Johannes C., Altman, Russ B.
Abstract
Social media text shows promise for monitoring trends in the opioid overdose crisis; however, the overwhelming majority of social media text is unrelated to opioids. When leveraging social media text to monitor trends in the ongoing opioid overdose crisis, a common strategy for identifying relevant content is to use a lexicon of opioid-related terms as inclusion criteria. However, many slang terms for opioids, such as "smack" or "blues," have common non-opioid meanings, making them ambiguous. The advanced textual reasoning capability of large language models (LLMs) presents an opportunity to disambiguate these slang terms at scale. We present three tasks on which to evaluate four state-of-the-art LLMs (GPT-4, GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5): a lexicon-based setting, in which the LLM must disambiguate a specific term within the context of a given post; a lexicon-free setting, in which the LLM must identify opioid-related posts from context without a lexicon; and an emergent slang setting, in which the LLM must identify opioid-related posts with simulated new slang terms. All four LLMs showed excellent performance across all tasks. In both subtasks of the lexicon-based setting, LLM F1 scores ("fenty" subtask: 0.824-0.972; "smack" subtask: 0.540-0.862) far exceeded those of the best lexicon strategy (0.126 and 0.009, respectively). In the lexicon-free task, LLM F1 scores (0.544-0.769) surpassed those of lexicons (0.080-0.540), and LLMs demonstrated uniformly higher recall. On emergent slang, all LLMs had higher accuracy (average: 0.784), F1 score (average: 0.712), precision (average: 0.981), and recall (average: 0.587) than the two lexicons assessed. Our results show that LLMs can be used to identify relevant content for low-prevalence topics, including but not limited to opioid references, enhancing data provided to downstream analyses and predictive models.
Chinese Translation
社交媒体文本在监测阿片类药物过量危机的趋势方面显示出潜力;然而,绝大多数社交媒体文本与阿片类药物无关。在利用社交媒体文本监测持续的阿片类药物过量危机趋势时,识别相关内容的常用策略是使用阿片类药物相关术语的词汇表作为纳入标准。然而,许多阿片类药物的俚语,如“smack”或“blues”,具有常见的非阿片类药物含义,使其具有歧义性。大型语言模型(LLMs)先进的文本推理能力为大规模消除这些俚语的歧义提供了机会。我们提出了三个任务来评估四个最先进的LLM(GPT-4、GPT-5、Gemini 2.5 Pro 和 Claude Sonnet 4.5):一个基于词汇的设置,在该设置中,LLM必须在给定帖子上下文中消除特定术语的歧义;一个无词汇的设置,在该设置中,LLM必须在没有词汇的情况下从上下文中识别与阿片类药物相关的帖子;以及一个新兴俚语设置,在该设置中,LLM必须识别带有模拟新俚语的与阿片类药物相关的帖子。所有四个LLM在所有任务中表现出色。在基于词汇的设置的两个子任务中,LLM的F1分数(“fenty”子任务:0.824-0.972;“smack”子任务:0.540-0.862)远远超过最佳词汇策略(分别为0.126和0.009)。在无词汇任务中,LLM的F1分数(0.544-0.769)超过了词汇的分数(0.080-0.540),并且LLM表现出一致更高的召回率。在新兴俚语任务中,所有LLM的准确率(平均:0.784)、F1分数(平均:0.712)、精确率(平均:0.981)和召回率(平均:0.587)均高于评估的两个词汇。我们的结果表明,LLM可以用于识别低流行率主题的相关内容,包括但不限于阿片类药物的引用,从而增强提供给下游分析和预测模型的数据。
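The lexicon-vs-LLM comparisons above come down to standard precision/recall/F1 arithmetic; a minimal sketch (labels and posts are illustrative, not from the paper's data):

```python
def f1_score(y_true, y_pred, positive="opioid"):
    """Precision, recall, and F1 for a binary relevance label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative: a lexicon flags every post containing "smack" (high recall,
# low precision), while an LLM judge filters out the non-drug senses.
gold = ["opioid", "other", "other", "opioid", "other"]
lexicon = ["opioid", "opioid", "opioid", "opioid", "opioid"]
llm = ["opioid", "other", "other", "opioid", "opioid"]
print(f1_score(gold, lexicon))  # precision suffers from ambiguous slang
print(f1_score(gold, llm))
```

This is why ambiguous slang terms crater lexicon F1 while leaving recall intact.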
cs.CL / 47 / 2603.10351

Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck

通过解耦信息瓶颈缓解多语言 LLM 作为裁判中的翻译偏见
Zhang, Hongbin, Chen, Kehai, Bai, Xuefen, Pan, Youcheng, Xiang, Yang, Wang, Jinpeng, Zhang, Min
Abstract
Large language models (LLMs) have become a standard for multilingual evaluation, yet they exhibit a severe systematic translationese bias. In this paper, translationese bias is characterized as LLMs systematically favoring machine-translated text over human-authored references, particularly in low-resource languages. We attribute this bias to spurious correlations with (i) latent manifold alignment with English and (ii) cross-lingual predictability. To mitigate this bias, we propose DIBJudge, a robust fine-tuning framework that learns a minimally sufficient, judgment-critical representation via variational information compression, while explicitly isolating spurious factors into the dedicated bias branch. Furthermore, we incorporate a cross-covariance penalty that explicitly suppresses statistical dependence between robust and bias representations, thereby encouraging effective disentanglement. Extensive evaluations on multilingual reward modeling benchmarks and a dedicated translationese bias evaluation suite demonstrate that the proposed DIBJudge consistently outperforms strong baselines and substantially mitigates translationese bias.
Chinese Translation
大型语言模型(LLMs)已成为多语言评估的标准,但它们表现出严重的系统性翻译偏见。在本文中,翻译偏见被定义为 LLMs 系统性地偏向机器翻译文本而非人类创作的参考文本,尤其是在低资源语言中。我们将这种偏见归因于与(i)潜在流形与英语的对齐和(ii)跨语言可预测性之间的虚假相关性。为了缓解这种偏见,我们提出了 DIBJudge,一种强大的微调框架,通过变分信息压缩学习一个最小充分、判断关键的表示,同时将虚假因素明确隔离到专门的偏见分支中。此外,我们引入了一个跨协方差惩罚,明确抑制稳健表示与偏见表示之间的统计依赖,从而促进有效的解耦。在多语言奖励建模基准和专门的翻译偏见评估套件上的广泛评估表明,所提出的 DIBJudge 一直优于强基线,并显著缓解了翻译偏见。
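The cross-covariance penalty described above can be sketched as a batch-level decorrelation term between the robust and bias branches; this is a plausible reading of the abstract, not the paper's exact formulation (shapes and names are illustrative):

```python
import numpy as np

def cross_covariance_penalty(z_robust, z_bias):
    """Squared Frobenius norm of the batch cross-covariance between the
    robust and bias branches; it vanishes only when the two representations
    are (empirically) decorrelated."""
    zr = z_robust - z_robust.mean(axis=0, keepdims=True)
    zb = z_bias - z_bias.mean(axis=0, keepdims=True)
    c = zr.T @ zb / (zr.shape[0] - 1)   # cross-covariance, shape (d_r, d_b)
    return float((c ** 2).sum())

rng = np.random.default_rng(0)
shared = rng.normal(size=(256, 1))                      # a shared latent factor
z_rob = np.hstack([shared, rng.normal(size=(256, 3))])  # robust branch leaks it
z_dep = np.hstack([shared, rng.normal(size=(256, 2))])  # bias branch leaks it too
z_ind = rng.normal(size=(256, 2))                       # an independent branch
print(cross_covariance_penalty(z_rob, z_dep))  # large: shared factor correlates
print(cross_covariance_penalty(z_rob, z_ind))  # near zero
```

Minimising such a term pushes spurious factors out of the judgment-critical representation and into the dedicated bias branch.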
cs.CL / 48 / 2603.10367

Dynamic Knowledge Fusion for Multi-Domain Dialogue State Tracking

多领域对话状态追踪的动态知识融合
Su, Haoxiang, Fang, Ruiyu, Jiang, Liting, Huang, Xiaomeng, Song, Shuangyong
Abstract
The performance of task-oriented dialogue models is strongly tied to how well they track dialogue states, which record and update user information across multi-turn interactions. However, current multi-domain dialogue state tracking (DST) encounters two key challenges: the difficulty of effectively modeling dialogue history and the limited availability of annotated data, both of which hinder model performance. To tackle the aforementioned problems, we develop a dynamic knowledge fusion framework applicable to multi-domain DST. The model operates in two stages: first, an encoder-only network trained with contrastive learning encodes dialogue history and candidate slots, selecting relevant slots based on correlation scores; second, dynamic knowledge fusion leverages the structured information of selected slots as contextual prompts to enhance the accuracy and consistency of dialogue state tracking. This design enables more accurate integration of dialogue context and domain knowledge. Results obtained from multi-domain dialogue benchmarks indicate that our method notably improves both tracking accuracy and generalization, validating its capability in handling complex dialogue scenarios.
Chinese Translation
任务导向对话模型的性能与其对对话状态的追踪能力密切相关,后者记录并更新多轮交互中的用户信息。然而,目前的多领域对话状态追踪(DST)面临两个主要挑战:有效建模对话历史的困难和标注数据的有限可用性,这两者都阻碍了模型性能的提升。为了解决上述问题,我们开发了一种适用于多领域DST的动态知识融合框架。该模型分为两个阶段:首先,使用对比学习训练的仅编码器网络对对话历史和候选槽进行编码,并根据相关性得分选择相关槽;其次,动态知识融合利用所选槽的结构化信息作为上下文提示,以提高对话状态追踪的准确性和一致性。这一设计使得对话上下文和领域知识的整合更加准确。从多领域对话基准测试中获得的结果表明,我们的方法显著提高了追踪准确性和泛化能力,验证了其在处理复杂对话场景中的能力。
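The first-stage slot selection amounts to scoring candidate slots against the encoded dialogue history and keeping the top-scoring ones; a minimal sketch with toy embeddings and cosine similarity (the embeddings and slot names are hypothetical):

```python
import numpy as np

def select_slots(history_vec, slot_vecs, slot_names, k=2):
    """Rank candidate slots by cosine similarity to the encoded dialogue
    history and keep the top-k as context for the second stage."""
    h = history_vec / np.linalg.norm(history_vec)
    s = slot_vecs / np.linalg.norm(slot_vecs, axis=1, keepdims=True)
    scores = s @ h
    order = np.argsort(-scores)[:k]
    return [slot_names[i] for i in order]

# Toy embeddings: the dialogue history mentions a hotel booking.
history = np.array([1.0, 0.2, 0.0])
slots = np.array([
    [0.9, 0.1, 0.0],   # hotel-name
    [0.0, 1.0, 0.0],   # train-departure
    [0.8, 0.3, 0.1],   # hotel-price-range
])
names = ["hotel-name", "train-departure", "hotel-price-range"]
print(select_slots(history, slots, names))  # the hotel slots rank highest
```

The selected slots then serve as structured contextual prompts for the generation stage.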
cs.CL / 49 / 2603.10473

Aligning Large Language Models with Searcher Preferences

将大型语言模型与搜索者偏好对齐
Wu, Wei, Zhou, Peilun, Chen, Liyi, Wang, Qimeng, Lu, Chengqiang, Gao, Yan, Wu, Yi, Hu, Yao, Xiong, Hui
Abstract
The paradigm shift from item-centric ranking to answer-centric synthesis is redefining the role of search engines. While recent industrial progress has applied generative techniques to closed-set item ranking in e-commerce, research and deployment of open-ended generative search on large content platforms remain limited. This setting introduces challenges, including robustness to noisy retrieval, non-negotiable safety guarantees, and alignment with diverse user needs. In this work, we introduce SearchLLM, the first large language model (LLM) for open-ended generative search. We design a hierarchical, multi-dimensional reward system that separates bottom-line constraints, including factual grounding, basic answer quality and format compliance, from behavior optimization objectives that promote robustness to noisy retrieval and alignment with user needs. Concretely, our reward model evaluates responses conditioned on the user query, session history, and retrieved evidence set, combining rule-based checks with human-calibrated LLM judges to produce an interpretable score vector over these dimensions. We introduce a Gated Aggregation Strategy to derive the training reward for optimizing SearchLLM with Group Relative Policy Optimization (GRPO). We deploy SearchLLM in the AI search entry of RedNote. Offline evaluations and online A/B tests show improved generation quality and user engagement, increasing Valid Consumption Rate by 1.03% and reducing Re-search Rate by 2.81%, while upholding strict safety and reliability standards.
Chinese Translation
从以项目为中心的排名转向以答案为中心的综合方法正在重新定义搜索引擎的角色。尽管近期工业进展已将生成技术应用于电子商务中的封闭集项目排名,但在大型内容平台上进行开放式生成搜索的研究和部署仍然有限。这种环境带来了挑战,包括对噪声检索的鲁棒性、不可妥协的安全保障以及与多样化用户需求的对齐。在本研究中,我们介绍了SearchLLM,这是首个用于开放式生成搜索的大型语言模型(LLM)。我们设计了一个分层的多维奖励系统,将底线约束(包括事实基础、基本答案质量和格式合规性)与促进对噪声检索的鲁棒性和与用户需求对齐的行为优化目标分开。具体而言,我们的奖励模型根据用户查询、会话历史和检索证据集评估响应,结合基于规则的检查与经过人类校准的LLM评判者,以生成一个可解释的分数向量。我们引入了一种门控聚合策略,以推导优化SearchLLM的训练奖励,采用群体相对策略优化(Group Relative Policy Optimization, GRPO)。我们在RedNote的AI搜索入口中部署了SearchLLM。离线评估和在线A/B测试显示生成质量和用户参与度有所提升,验证消费率提高了1.03%,重新搜索率降低了2.81%,同时保持严格的安全和可靠性标准。
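The abstract does not spell out the Gated Aggregation Strategy; one plausible reading is that bottom-line constraints act as hard gates on the behavior-objective reward, sketched below (dimension names and thresholds are hypothetical):

```python
def gated_reward(scores, gates=("grounding", "safety", "format"),
                 objectives=("robustness", "alignment")):
    """Gated aggregation sketch: bottom-line constraints act as hard gates;
    behaviour objectives contribute only when every gate passes."""
    if any(scores[g] < 1.0 for g in gates):      # any violated constraint
        return 0.0                               # zeroes out the reward
    return sum(scores[o] for o in objectives) / len(objectives)

safe = {"grounding": 1.0, "safety": 1.0, "format": 1.0,
        "robustness": 0.8, "alignment": 0.6}
unsafe = dict(safe, safety=0.0)
print(gated_reward(safe))    # passes all gates: mean of behaviour scores
print(gated_reward(unsafe))  # a single violated constraint gates the reward
```

Gating makes safety non-negotiable: no amount of behavior-objective reward can compensate for a violated constraint during GRPO training.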
cs.CL / 50 / 2603.10476

Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs

学习谈判:多智能体协商以实现大型语言模型的集体价值对齐
Anantaprayoon, Panatchakorn, Babina, Nataliia, Asgharbeygi, Nima, Tarifi, Jad
Abstract
The alignment of large language models (LLMs) has progressed substantially in single-agent settings through paradigms such as RLHF and Constitutional AI, with recent work exploring scalable alternatives such as RLAIF and evolving alignment objectives. However, these approaches remain limited in multi-stakeholder settings, where conflicting values arise and deliberative negotiation capabilities are required. This work proposes a multi-agent negotiation-based alignment framework that aligns LLMs to Collective Agency (CA)-an existing alignment objective introduced to promote the continual expansion of agency-while simultaneously improving conflict-resolution capability. To enable scalable training, two self-play instances of the same LLM, assigned opposing personas, engage in structured turn-based dialogue to synthesize mutually beneficial solutions. We generate synthetic moral-dilemma prompts and conflicting persona pairs, and optimize the policy via RLAIF using GRPO with an external LLM reward model. While rewards are computed from CA scores assigned to the final completion, gradients are applied to dialogue tokens to directly improve deliberative interaction dynamics. Experiments show that the resulting model achieves CA alignment comparable to a single-agent baseline while substantially improving conflict-resolution performance without degrading general language capabilities. These results suggest that negotiation-driven deliberation training provides a practical path toward LLMs that better support collective decision-making in value-conflict scenarios.
Chinese Translation
大型语言模型(LLMs)的对齐在单智能体环境中通过强化学习人类反馈(RLHF)和宪法人工智能(Constitutional AI)等范式取得了显著进展,最近的研究探索了可扩展替代方案,如强化学习人工反馈(RLAIF)和不断演变的对齐目标。然而,这些方法在多利益相关者环境中仍然有限,这里存在价值冲突并需要协商能力。本研究提出了一种基于多智能体谈判的对齐框架,该框架将LLMs与集体代理(Collective Agency, CA)对齐——这是一个旨在促进代理持续扩展的现有对齐目标,同时改善冲突解决能力。为了实现可扩展的训练,两个自我对弈的LLM实例被分配对立的人格,通过结构化的回合制对话进行互动,以合成互利的解决方案。我们生成合成的道德困境提示和冲突的人格对,并通过使用外部LLM奖励模型的广义回报优化(GRPO)优化策略。虽然奖励是根据分配给最终完成的CA分数计算的,但梯度被应用于对话标记,以直接改善协商互动动态。实验表明,所得到的模型在CA对齐方面达到了与单智能体基线相当的水平,同时在不降低通用语言能力的情况下显著提高了冲突解决性能。这些结果表明,基于谈判驱动的协商训练为LLMs提供了一条切实可行的路径,以更好地支持价值冲突场景中的集体决策。
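GRPO's group-relative advantage, which the training loop above relies on, normalises each rollout's reward against its group's statistics; a minimal sketch (rewards are illustrative CA scores):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalise each reward against its group's
    mean and standard deviation (one group = rollouts for one prompt)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four negotiation rollouts for the same dilemma, scored by the CA reward model.
rewards = [0.2, 0.5, 0.9, 0.4]
advs = group_relative_advantages(rewards)
print([round(a, 3) for a in advs])  # mean-zero; best rollout gets the largest advantage
```

Here the advantage is computed from the CA score of the final completion but, as the abstract notes, the policy gradient is applied to the dialogue tokens.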
cs.CL / 51 / 2603.10477

PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

PEEM:用于可解释的提示与响应联合评估的提示工程评估指标
Hong, Minki, Lee, Eunsoo, Park, Sohyun, Kim, Jihie
Abstract
Prompt design is a primary control interface for large language models (LLMs), yet standard evaluations largely reduce performance to answer correctness, obscuring why a prompt succeeds or fails and providing little actionable guidance. We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses. PEEM defines a structured rubric with 9 axes: 3 prompt criteria (clarity/structure, linguistic quality, fairness) and 6 response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), and uses an LLM-based evaluator to output (i) scalar scores on a 1-5 Likert scale and (ii) criterion-specific natural-language rationales grounded in the rubric. Across 7 benchmarks and 5 task models, PEEM's accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate Spearman rho about 0.97, Pearson r about 0.94, p < 0.001). A multi-evaluator study with four models shows consistent relative judgments (pairwise rho = 0.68-0.85), supporting evaluator-agnostic deployment. Beyond alignment, PEEM captures complementary linguistic failure modes and remains informative under prompt perturbations: prompt-quality trends track downstream accuracy under iterative rewrites, semantic adversarial manipulations induce clear score degradation, and meaning-preserving paraphrases yield high stability (robustness rate about 76.7-80.6%). Finally, using only PEEM scores and rationales as feedback, a zero-shot prompt rewriting loop improves downstream accuracy by up to 11.7 points, outperforming supervised and RL-based prompt-optimization baselines. Overall, PEEM provides a reproducible, criterion-driven protocol that links prompt formulation to response behavior and enables systematic diagnosis and optimization of LLM interactions.
Chinese Translation
提示设计是大型语言模型(LLMs)的主要控制接口,但标准评估通常将性能简化为答案的正确性,模糊了提示成功或失败的原因,并提供了很少的可操作指导。我们提出了PEEM(提示工程评估指标),这是一个用于提示和响应的联合可解释评估的统一框架。PEEM定义了一个结构化的评估标准,包含9个维度:3个提示标准(清晰度/结构、语言质量、公平性)和6个响应标准(准确性、一致性、相关性、客观性、清晰度、简洁性),并使用基于LLM的评估器输出(i)1-5李克特量表的标量评分和(ii)基于评估标准的特定自然语言推理。在7个基准和5个任务模型中,PEEM的准确性维度与传统准确性高度一致,同时保持模型排名(聚合斯皮尔曼rho约为0.97,皮尔逊r约为0.94,p < 0.001)。对四个模型的多评估者研究显示了一致的相对判断(成对rho = 0.68-0.85),支持评估者无关的部署。除了对齐,PEEM还捕捉到互补的语言失败模式,并在提示扰动下保持信息性:提示质量趋势在迭代重写下跟踪下游准确性,语义对抗操控导致明显的评分下降,而保留意义的释义则表现出高稳定性(鲁棒性率约为76.7-80.6%)。最后,仅使用PEEM评分和推理作为反馈,一个零样本提示重写循环将下游准确性提高了多达11.7分,优于监督和基于强化学习的提示优化基准。总体而言,PEEM提供了一个可重复的、以标准驱动的协议,将提示制定与响应行为联系起来,并能够系统地诊断和优化LLM交互。
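The headline alignment claim (Spearman rho about 0.97 while preserving rankings) rests on rank correlation; a minimal, tie-free Spearman sketch (the per-model scores below are hypothetical):

```python
def spearman_rho(x, y):
    """Spearman rank correlation (no ties), as used to check that a rubric
    axis preserves the model ranking given by conventional accuracy."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical PEEM accuracy-axis scores vs conventional accuracy for 5 models.
peem = [4.1, 3.2, 4.6, 2.8, 3.9]
acc = [0.71, 0.55, 0.80, 0.49, 0.68]
print(spearman_rho(peem, acc))  # identical rankings give rho = 1.0
```

A rho near 1 means the rubric axis can stand in for conventional accuracy when ranking models, which is the basis of PEEM's validity argument.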
cs.CL / 52 / 2603.10492

Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent

基于证据整合语言代理的人机协同推理在临床诊断中的应用
Huang, Zhongzhen, Ling, Yan, Chen, Hong, Feng, Ye, Wu, Li, Mu, Linjie, Zhang, Shaoting, Zhang, Xiaofan, Qian, Kun, Li, Xiaomu
Abstract
We present PULSE, a medical reasoning agent that combines a domain-tuned large language model with scientific literature retrieval to support diagnostic decision-making in complex real-world cases. To evaluate its capabilities, we curated a benchmark of 82 authentic endocrinology case reports encompassing a broad spectrum of disease types and incidence levels. In controlled experiments, we compared PULSE's performance against physicians with varying levels of expertise, from residents to senior specialists, and examined how AI assistance influenced human diagnostic reasoning. PULSE attained expert-competitive accuracy, outperforming residents and junior specialists while matching senior specialist performance at both Top@1 and Top@4 thresholds. Unlike physicians, whose accuracy declined with disease rarity, PULSE maintained stable performance across incidence tiers. The agent also exhibited adaptive reasoning, increasing output length with case difficulty in a manner analogous to the longer deliberation observed among expert clinicians. When used collaboratively, PULSE enabled physicians to correct initial errors and broaden diagnostic hypotheses, but also introduced risks of automation bias. The study explores both serial and concurrent collaboration workflows, revealing that PULSE offers robust support across common and rare presentations. These findings underscore both the promise and the limitations of language model-based agents in clinical diagnosis, and offer a framework for evaluating their role in real-world decision-making.
Chinese Translation
我们提出了PULSE,一个医疗推理代理,它结合了领域调优的大型语言模型与科学文献检索,以支持复杂现实案例中的诊断决策。为了评估其能力,我们策划了一个包含82个真实内分泌病例报告的基准数据集,涵盖了广泛的疾病类型和发生率水平。在受控实验中,我们将PULSE的表现与不同专业水平的医生进行比较,从住院医生到高级专家,并考察了人工智能辅助如何影响人类的诊断推理。PULSE达到了与专家竞争的准确性,超越了住院医生和初级专家,并在Top@1和Top@4阈值上与高级专家的表现相当。与医生不同的是,医生的准确性随着疾病稀有性的增加而下降,而PULSE在不同发生率层次上保持了稳定的表现。该代理还表现出适应性推理,随着病例难度的增加而增加输出长度,这与专家临床医生在更长时间内的深思熟虑相似。在协作使用时,PULSE使医生能够纠正初步错误并扩展诊断假设,但也引入了自动化偏差的风险。本研究探讨了串行和并行协作工作流程,揭示PULSE在常见和稀有表现中提供了强有力的支持。这些发现强调了基于语言模型的代理在临床诊断中的潜力与局限性,并为评估它们在现实决策中的角色提供了框架。
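The Top@1/Top@4 thresholds used above are standard top-k accuracy over ranked differentials; a minimal sketch (the case data is invented for illustration):

```python
def top_k_accuracy(predictions, gold, k):
    """Fraction of cases whose gold diagnosis appears in the model's
    top-k ranked differential."""
    hits = sum(1 for ranked, g in zip(predictions, gold) if g in ranked[:k])
    return hits / len(gold)

# Hypothetical ranked differentials for three endocrinology cases.
ranked = [
    ["Graves disease", "thyroiditis", "toxic adenoma", "factitious"],
    ["Cushing syndrome", "obesity", "PCOS", "hypothyroidism"],
    ["pheochromocytoma", "anxiety", "hyperthyroidism", "carcinoid"],
]
gold = ["Graves disease", "PCOS", "carcinoid"]
print(top_k_accuracy(ranked, gold, k=1))  # only the first case hits at rank 1
print(top_k_accuracy(ranked, gold, k=4))
```

Reporting both thresholds separates "nailed the leading diagnosis" from "kept the right diagnosis in the differential", which matters for human-AI collaboration.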
cs.CL / 53 / 2603.10494

VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization

VERI-DPO:通过声明验证和直接偏好优化实现临床摘要的证据感知对齐
Liu, Weixin, Ni, Congning, Song, Qingyuan, Rose, Susannah L., Symons, Christopher, Kantarcioglu, Murat, Malin, Bradley A., Yin, Zhijun
Abstract
Brief Hospital Course (BHC) narratives must be clinically useful yet faithful to fragmented EHR evidence. LLM-based clinical summarizers still introduce unsupported statements, and alignment can encourage omissions ("say-less" degeneration). We introduce VERI-DPO, which uses claim verification to mine preferences and distill them into the summarizer with Direct Preference Optimization (DPO). On MIMIC-III-Ext-VeriFact-BHC (100 ICU patients; patient-level splits), we train a retrieval-augmented verifier to label claim-evidence pairs as Supported, Not Supported, or Not Addressed via a single-token format. The verifier scores sentence-level claims from sampled BHC candidates and aggregates margins into a coverage-aware utility to mine length-controlled, contradiction-anchored preference pairs. On held-out patients, verifier-mined preferences separate candidates by contradiction density, and VERI-DPO reduces Not Supported claim rates from 10.7% to 1.9% (local verifier judge) and from 11.6% to 6.4% (GPT-4o judge), while improving validity from 76.7% to 82.5% and maintaining informative length.
Chinese Translation
简要医院病程(BHC)叙述必须在临床上有用,同时忠实于碎片化的电子健康记录(EHR)证据。基于大型语言模型(LLM)的临床摘要生成器仍然会引入不支持的陈述,而对齐可能会导致遗漏(“少说”退化)。我们提出了VERI-DPO,它利用声明验证来挖掘偏好,并通过直接偏好优化(DPO)将其提炼到摘要生成器中。在MIMIC-III-Ext-VeriFact-BHC(100名重症监护病人;按病人级别划分)上,我们训练了一个增强检索的验证器,通过单标记格式将声明-证据对标记为支持、不支持或未涉及。验证器对从抽样的BHC候选中获得的句子级声明进行评分,并将边际聚合成一个覆盖感知的效用,以挖掘长度控制的、基于矛盾的偏好对。在保留患者中,验证器挖掘的偏好通过矛盾密度将候选分开,VERI-DPO将不支持声明的比例从10.7%降低到1.9%(本地验证器评判)和从11.6%降低到6.4%(GPT-4o评判),同时将有效性从76.7%提高到82.5%,并保持信息长度。
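Distilling mined preferences into the summarizer uses the standard DPO objective; a per-pair sketch (log-probabilities below are illustrative, and "winner"/"loser" map to the low- and high-contradiction summaries):

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """Sigmoid DPO objective on one verifier-mined preference pair:
    winner = low-contradiction summary, loser = high-contradiction one."""
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# The policy already prefers the grounded summary: small loss.
print(dpo_loss(-40.0, -55.0, -45.0, -50.0))
# The policy prefers the contradicted summary: larger loss.
print(dpo_loss(-55.0, -40.0, -50.0, -45.0))
```

Length-controlled pair mining matters here: without it, DPO would happily learn the "say-less" shortcut of preferring shorter, omission-heavy summaries.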
cs.CL / 54 / 2603.10505

Safe and Scalable Web Agent Learning via Recreated Websites

通过重建网站实现安全且可扩展的网络代理学习
Chae, Hyungjoo, Park, Jungsoo, Ritter, Alan
Abstract
Training autonomous web agents is fundamentally limited by the environments they learn from: real-world websites are unsafe to explore, hard to reset, and rarely provide verifiable feedback. We propose VeriEnv, a framework that treats language models as environment creators, automatically cloning real-world websites into fully executable, verifiable synthetic environments. By exposing controlled internal access via a Python SDK, VeriEnv enables agents to self-generate tasks with deterministic, programmatically verifiable rewards, eliminating reliance on heuristic or LLM-based judges. This design decouples agent learning from unsafe real-world interaction while enabling scalable self-evolution through environment expansion. Through experiments on web agent benchmarks, we show that agents trained with VeriEnv generalize to unseen websites, achieve site-specific mastery through self-evolving training, and benefit from scaling the number of training environments. Code and resources will be released at https://github.com/kyle8581/VeriEnv upon acceptance.
Chinese Translation
自主网络代理的训练在根本上受到其学习环境的限制:现实世界的网站探索不安全,重置困难,并且很少提供可验证的反馈。我们提出了 VeriEnv,一个将语言模型视为环境创建者的框架,自动将现实世界的网站克隆为完全可执行、可验证的合成环境。通过 Python SDK 提供受控的内部访问,VeriEnv 使代理能够自我生成具有确定性、可编程验证奖励的任务,从而消除对启发式或基于大语言模型(LLM)评判者的依赖。该设计将代理学习与不安全的现实世界交互解耦,同时通过环境扩展实现可扩展的自我进化。通过在网络代理基准上的实验,我们展示了使用 VeriEnv 训练的代理能够推广到未见过的网站,通过自我进化训练实现特定网站的精通,并从增加训练环境的数量中获益。代码和资源将在接受后发布于 https://github.com/kyle8581/VeriEnv。
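The key idea of a deterministic, programmatically verifiable reward can be sketched as a rule checked against environment state rather than an LLM judgment; the `Env` class and its methods below are hypothetical stand-ins, not the VeriEnv SDK:

```python
# Sketch of a deterministic task verifier in the spirit of VeriEnv's SDK;
# the Env class and its methods are hypothetical.
class Env:
    """Toy stand-in for a recreated shopping site with internal state access."""
    def __init__(self):
        self.state = {"cart": [], "orders": []}

    def add_to_cart(self, item):
        self.state["cart"].append(item)

    def checkout(self):
        self.state["orders"].append(list(self.state["cart"]))
        self.state["cart"] = []

def verify_order_placed(env, item):
    """Reward 1.0 iff some completed order contains the target item;
    a rule checked against state, not judged by an LLM."""
    return 1.0 if any(item in order for order in env.state["orders"]) else 0.0

env = Env()
env.add_to_cart("noise-cancelling headphones")
assert verify_order_placed(env, "noise-cancelling headphones") == 0.0  # not yet checked out
env.checkout()
assert verify_order_placed(env, "noise-cancelling headphones") == 1.0
```

Because the verifier reads internal state, rewards are exact and reproducible, which is what lets agents self-generate tasks safely at scale.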
cs.CL / 55 / 2603.10524

AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations

AILS-NTUA在SemEval-2026任务8中的表现:评估多轮检索增强生成对话
Athanasiou, Dimosthenis, Lymperaiou, Maria, Filandrianos, Giorgos, Voulodimos, Athanasios, Stamou, Giorgos
Abstract
We present the AILS-NTUA system for SemEval-2026 Task 8 (MTRAGEval), addressing all three subtasks of multi-turn retrieval-augmented generation: passage retrieval (A), reference-grounded response generation (B), and end-to-end RAG (C). Our unified architecture is built on two principles: (i) a query-diversity-over-retriever-diversity strategy, where five complementary LLM-based query reformulations are issued to a single corpus-aligned sparse retriever and fused via variance-aware nested Reciprocal Rank Fusion; and (ii) a multistage generation pipeline that decomposes grounded generation into evidence span extraction, dual-candidate drafting, and calibrated multi-judge selection. Our system ranks 1st in Task A (nDCG@5: 0.5776, +20.5% over the strongest baseline) and 2nd in Task B (HM: 0.7698). Empirical analysis shows that query diversity over a well-aligned retriever outperforms heterogeneous retriever ensembling, and that answerability calibration, rather than retrieval coverage, is the primary bottleneck in end-to-end performance.
Chinese Translation
我们展示了AILS-NTUA系统在SemEval-2026任务8(MTRAGEval)中的表现,涵盖了多轮检索增强生成的三个子任务:段落检索(A)、基于参考的响应生成(B)和端到端RAG(C)。我们的统一架构基于两个原则构建:(i)查询多样性优于检索器多样性的策略,其中五种互补的基于大型语言模型(LLM)的查询重构被发送到一个与语料库对齐的稀疏检索器,并通过考虑方差的嵌套互惠排名融合进行融合;(ii)一个多阶段生成管道,将基于证据的生成分解为证据范围提取、双候选草拟和校准的多评审选择。我们的系统在任务A中排名第一(nDCG@5: 0.5776,比最强基线提高20.5%),在任务B中排名第二(HM: 0.7698)。实证分析表明,基于良好对齐的检索器的查询多样性优于异构检索器集成,并且答案可回答性校准而非检索覆盖是端到端性能的主要瓶颈。
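The fusion step builds on standard Reciprocal Rank Fusion; a plain (non-nested, non-variance-aware) RRF sketch, with `k=60` as the common default rather than the paper's setting:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked passage lists from several query reformulations with
    standard RRF: each list contributes 1/(k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Five reformulations of one user turn, each retrieved against one sparse index.
rankings = [
    ["p3", "p1", "p7"],
    ["p1", "p3", "p2"],
    ["p3", "p2", "p1"],
    ["p1", "p5", "p3"],
    ["p3", "p1", "p9"],
]
print(reciprocal_rank_fusion(rankings))  # passages agreed on by many queries win
```

The paper's variance-aware nested variant additionally weights lists by how consistently they rank documents; the plain form above shows the core mechanism.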
cs.CL / 56 / 2603.10547

Automatic End-to-End Data Integration using Large Language Models

基于大型语言模型的自动端到端数据集成
Steiner, Aaron, Bizer, Christian
Abstract
Designing data integration pipelines typically requires substantial manual effort from data engineers to configure pipeline components and label training data. While LLMs have shown promise in handling individual steps of the integration process, their potential to replace all human input across end-to-end data integration pipelines has not been investigated. As a step toward exploring this potential, we present an automatic data integration pipeline that uses GPT-5.2 to generate all artifacts required to adapt the pipeline to specific use cases. These artifacts are schema mappings, value mappings for data normalization, training data for entity matching, and validation data for selecting conflict resolution heuristics in data fusion. We compare the performance of this LLM-based pipeline to the performance of human-designed pipelines along three case studies requiring the integration of video game, music, and company related data. Our experiments show that the LLM-based pipeline is able to produce results similar to, and for some tasks better than, those of the human-designed pipelines. End-to-end, the human and the LLM pipelines produce integrated datasets of comparable size and density. Having the LLM configure the pipelines costs approximately $10 per case study, which represents only a small fraction of the cost of having human data engineers perform the same tasks.
Chinese Translation
设计数据集成管道通常需要数据工程师投入大量手动工作,以配置管道组件和标注训练数据。尽管大型语言模型(LLMs)在处理集成过程的各个步骤中显示出潜力,但它们在端到端数据集成管道中替代所有人工输入的潜力尚未得到研究。作为探索这一潜力的第一步,我们提出了一种自动数据集成管道,该管道使用GPT-5.2生成适应特定用例所需的所有工件。这些工件包括模式映射、数据规范化的值映射、实体匹配的训练数据以及用于选择数据融合中冲突解决启发式的验证数据。我们将基于LLM的管道的性能与人类设计的管道在三个案例研究中的表现进行了比较,这些案例研究涉及视频游戏、音乐和公司相关数据的集成。我们的实验表明,基于LLM的管道能够产生类似的结果,在某些任务中甚至优于人类设计的管道。从整体来看,人类和LLM管道生成的集成数据集在大小和密度上相当。让LLM配置管道的成本约为每个案例研究10美元,这仅占人类数据工程师执行相同任务成本的一小部分。
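Of the generated artifacts, value mappings are the simplest to picture: lookup tables applied per attribute before entity matching. A minimal sketch (the mapping contents are hypothetical examples, not the paper's outputs):

```python
def normalize_record(record, value_mappings):
    """Apply LLM-generated value mappings to normalise attribute values
    before entity matching; unmapped values pass through unchanged."""
    return {
        attr: value_mappings.get(attr, {}).get(val, val)
        for attr, val in record.items()
    }

# A mapping an LLM might emit for a video-game source.
mappings = {
    "platform": {"PS5": "PlayStation 5", "XSX": "Xbox Series X"},
    "genre": {"RPG": "Role-Playing"},
}
record = {"title": "Elden Ring", "platform": "PS5", "genre": "RPG"}
print(normalize_record(record, mappings))
```

Normalising values this way raises entity-matching recall, since records from different sources agree on canonical forms before comparison.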
cs.CL / 57 / 2603.10570

End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering

端到端聊天机器人评估:自适应推理与不确定性过滤
Dang, Nhi, Le, Tung, Nguyen, Huy Tien
Abstract
Large language models (LLMs) combined with retrieval augmented generation have enabled the deployment of domain-specific chatbots, but these systems remain prone to generating unsupported or incorrect answers. Reliable evaluation is therefore critical, yet manual review is costly and existing frameworks often depend on curated test sets and static metrics, limiting scalability. We propose an end-to-end automatic evaluator designed to substantially reduce human effort. Our system generates Q&A pairs directly from the underlying knowledge base, uses LLMs to judge chatbot responses against reference answers, and applies confidence-based filtering to highlight uncertain cases. Applied to a Vietnamese news dataset, the evaluator achieves high agreement with human judgments while significantly lowering review overhead. The framework is modular and language-agnostic, making it readily adaptable to diverse domains. This work introduces a practical, scalable solution for evaluating chatbots with minimal reliance on manual intervention.
Chinese Translation
大型语言模型(LLMs)结合检索增强生成技术使得领域特定聊天机器人的部署成为可能,但这些系统仍然容易生成不支持或不正确的答案。因此,可靠的评估至关重要,然而人工审查成本高昂,现有框架往往依赖于精心策划的测试集和静态指标,限制了可扩展性。我们提出了一种端到端的自动评估器,旨在大幅减少人工工作量。我们的系统直接从基础知识库生成问答对,利用LLMs对聊天机器人的响应与参考答案进行判断,并应用基于置信度的过滤来突出不确定案例。在应用于越南新闻数据集时,该评估器与人工判断高度一致,同时显著降低了审查开销。该框架是模块化的,且不依赖于特定语言,使其能够轻松适应多种领域。本研究提出了一种实用的、可扩展的解决方案,以最小化对人工干预的依赖来评估聊天机器人。
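The confidence-based filtering step above reduces to routing low-confidence judge verdicts to a manual-review queue; a minimal sketch (the threshold and record fields are illustrative):

```python
def filter_judgments(judgments, threshold=0.8):
    """Split LLM-judge verdicts into auto-accepted and human-review queues
    by a confidence threshold (the value 0.8 is illustrative)."""
    accepted = [j for j in judgments if j["confidence"] >= threshold]
    review = [j for j in judgments if j["confidence"] < threshold]
    return accepted, review

judgments = [
    {"qa_id": 1, "verdict": "correct", "confidence": 0.95},
    {"qa_id": 2, "verdict": "incorrect", "confidence": 0.55},
    {"qa_id": 3, "verdict": "correct", "confidence": 0.88},
]
accepted, review = filter_judgments(judgments)
print(len(accepted), len(review))  # only the uncertain case goes to manual review
```

The threshold trades review overhead against the risk of auto-accepting a wrong verdict, and can be tuned on a small human-labeled sample.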
cs.CL / 58 / 2603.10613

MUNIChus: Multilingual News Image Captioning Benchmark

MUNIChus:多语言新闻图像字幕基准
Chen, Yuji, Plum, Alistair, Hettiarachchi, Hansi, Kanojia, Diptesh, Basnet, Saroj, Zampieri, Marcos, Ranasinghe, Tharindu
Abstract
The goal of news image captioning is to generate captions by integrating news article content with corresponding images, highlighting the relationship between textual context and visual elements. The majority of research on news image captioning focuses on English, primarily because datasets in other languages are scarce. To address this limitation, we create the first multilingual news image captioning benchmark, MUNIChus, comprising 9 languages, including several low-resource languages such as Sinhala and Urdu. We evaluate various state-of-the-art neural news image captioning models on MUNIChus and find that news image captioning remains challenging. We also make MUNIChus publicly available with over 20 models already benchmarked. MUNIChus opens new avenues for further advancements in developing and evaluating multilingual news image captioning models.
Chinese Translation
新闻图像字幕的目标是通过整合新闻文章内容与相应图像来生成字幕,突出文本上下文与视觉元素之间的关系。大多数关于新闻图像字幕的研究集中在英语上,主要是因为其他语言的数据集稀缺。为了解决这一限制,我们创建了首个多语言新闻图像字幕基准 MUNIChus,涵盖9种语言,包括一些低资源语言,如僧伽罗语和乌尔都语。我们在 MUNIChus 上评估了多种最先进的神经新闻图像字幕模型,发现新闻图像字幕仍然具有挑战性。我们还将 MUNIChus 公开发布,已有超过20个模型进行了基准测试。MUNIChus 为进一步发展和评估多语言新闻图像字幕模型开辟了新的途径。
cs.CL / 59 / 2603.10619

Disentangling Similarity and Relatedness in Topic Models

在主题模型中解构相似性与相关性
Xiao, Hanlin, Álvarez, Mauricio A., Breitling, Rainer
Abstract
The recent advancement of large language models has spurred a growing trend of integrating pre-trained language model (PLM) embeddings into topic models, fundamentally reshaping how topics capture semantic structure. Classical models such as Latent Dirichlet Allocation (LDA) derive topics from word co-occurrence statistics, whereas PLM-augmented models anchor these statistics to pre-trained embedding spaces, imposing a prior that also favours clustering of semantically similar words. This structural difference can be captured by the psycholinguistic dimensions of thematic relatedness and taxonomic similarity of the topic words. To disentangle these dimensions in topic models, we construct a large synthetic benchmark of word pairs using LLM-based annotation to train a neural scoring function. We apply this scorer to a comprehensive evaluation across multiple corpora and topic model families, revealing that different model families capture distinct semantic structure in their topics. We further demonstrate that similarity and relatedness scores successfully predict downstream task performance depending on task requirements. This paper establishes similarity and relatedness as essential axes for topic model evaluation and provides a reliable pipeline for characterising these across model families and corpora.
Chinese Translation
大型语言模型的近期进展推动了将预训练语言模型(PLM)嵌入整合到主题模型中的趋势,从根本上重塑了主题捕捉语义结构的方式。经典模型如潜在狄利克雷分配(Latent Dirichlet Allocation, LDA)从词语共现统计中推导主题,而增强PLM的模型则将这些统计数据锚定到预训练嵌入空间,施加了一种先验,这种先验也有利于语义相似词的聚类。这种结构差异可以通过主题词的心理语言学维度——主题相关性和分类相似性来捕捉。为了在主题模型中解构这些维度,我们构建了一个大型合成基准的词对,利用基于LLM的注释来训练神经评分函数。我们将此评分器应用于多个语料库和主题模型家族的全面评估,揭示不同模型家族在其主题中捕捉到不同的语义结构。我们进一步证明,相似性和相关性得分能够成功预测下游任务的表现,具体取决于任务要求。本文确立了相似性和相关性作为主题模型评估的基本轴线,并提供了一个可靠的管道,以便在不同模型家族和语料库中表征这些轴线。
cs.CL / 60 / 2603.10640

Making Bielik LLM Reason (Better): A Field Report

提升 Bielik LLM 推理能力(更好地):一份实地报告
Trybus, Adam, Bartnicki, Bartosz, Kinas, Remigiusz
Abstract
This paper presents a research program dedicated to evaluating and advancing the reasoning capabilities of Bielik, a Polish large language model. The study describes several stages of work: initial benchmarking and the creation of an evaluation methodology, analysis of comparative results against other LLMs, and an outline of future prospects that takes into account the limitations of the analyses conducted so far and aims to keep Bielik in the race given the ever-changing and competitive AI landscape.
Chinese Translation
本文介绍了一项旨在评估和提升波兰大型语言模型 Bielik 推理能力的研究计划。研究描述了多个工作阶段:初步基准测试和评估方法的创建、与其他大型语言模型的比较结果分析,以及考虑到迄今为止分析的局限性所描绘的未来前景,旨在使 Bielik 在不断变化且竞争激烈的人工智能领域中保持竞争力。
cs.CL / 61 / 2603.10705

Prism-$\Delta$: Differential Subspace Steering for Prompt Highlighting in Large Language Models

Prism-$\Delta$: 大型语言模型中用于提示高亮的差异子空间引导
Ge, Yuyao, Liu, Shenghua, Wang, Yiwei, Liu, Tianyu, Bi, Baolong, Mei, Lingrui, Yao, Jiayu, Guo, Jiafeng, Cheng, Xueqi
Abstract
Prompt highlighting steers a large language model to prioritize user-specified text spans during generation. A key challenge is extracting steering directions that capture the difference between relevant and irrelevant contexts, rather than shared structural patterns common to both. We propose PRISM-$\Delta$ (Projection-based Relevance-Informed Steering Method), which decomposes the difference between positive and negative cross-covariance matrices to maximize discriminative energy while eliminating shared directions. Each attention head receives a continuous softplus importance weight, letting weak-but-useful heads contribute at reduced strength. The framework extends naturally to Value representations, capturing content-channel signal that Key-only methods leave unused. Across four benchmarks and five models, PRISM-$\Delta$ matches or exceeds the best existing method on 19 of 20 configurations, with relative gains up to +10.6%, while halving the fluency cost of steering. PRISM-$\Delta$ also scales to long-context retrieval, outperforming the best existing method by up to +4.8% relative gain. PRISM-$\Delta$ is compatible with FlashAttention and adds negligible memory overhead.
Chinese Translation
提示高亮引导大型语言模型在生成过程中优先考虑用户指定的文本片段。一个关键挑战是提取引导方向,以捕捉相关和不相关上下文之间的差异,而不是两者共有的结构模式。我们提出了PRISM-$\Delta$(基于投影的相关性引导方法),该方法分解正负交叉协方差矩阵之间的差异,以最大化区分能量,同时消除共享方向。每个注意力头接收一个连续的softplus重要性权重,使得弱但有用的头以降低的强度贡献。该框架自然扩展到价值表示,捕捉内容通道信号,而仅基于键的方法则未能利用。在四个基准测试和五个模型中,PRISM-$\Delta$在20个配置中的19个上与现有最佳方法相匹配或超越,具有高达+10.6%的相对增益,同时将引导的流畅性成本减半。PRISM-$\Delta$还可扩展到长上下文检索,相较于现有最佳方法具有高达+4.8%的相对增益。PRISM-$\Delta$与FlashAttention兼容,并且增加的内存开销微乎其微。
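The core decomposition, extracting directions with discriminative energy from the difference of positive and negative cross-covariance matrices, can be sketched with an eigendecomposition; the symmetrisation step and toy matrices below are simplifying assumptions, not the paper's exact procedure:

```python
import numpy as np

def discriminative_directions(c_pos, c_neg, top_k=2):
    """Sketch: eigenvectors of the difference of (here symmetrised)
    cross-covariance matrices keep directions with discriminative energy
    and cancel structure shared by relevant and irrelevant contexts."""
    delta = c_pos - c_neg
    delta = (delta + delta.T) / 2            # symmetrise for a real eigenbasis
    eigvals, eigvecs = np.linalg.eigh(delta)
    order = np.argsort(-np.abs(eigvals))[:top_k]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(1)
shared = rng.normal(size=(4, 4)); shared = shared @ shared.T   # common structure
signal = np.outer([1.0, 0, 0, 0], [1.0, 0, 0, 0]) * 3.0        # relevant-only energy
vals, dirs = discriminative_directions(shared + signal, shared, top_k=1)
print(vals)  # the shared component cancels; only the signal direction survives
```

This differencing is what separates PRISM-$\Delta$ from clustering on a single covariance matrix: shared structural patterns subtract out instead of dominating the spectrum.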
cs.CL / 62 / 2603.10764

HeartAgent: An Autonomous Agent System for Explainable Differential Diagnosis in Cardiology

HeartAgent:一种用于心脏病学可解释性差异诊断的自主智能体系统
Zhou, Shuang, Yu, Kai, Wang, Song, Xie, Wenya, Zhan, Zaifu, Tsai, Meng-Han, Chung, Yuen-Hei, Hou, Shutong, Zhou, Huixue, Zeng, Min, Ramu, Bhavadharini, Chen, Lin Yee, Xie, Feng, Zhang, Rui
Abstract
Heart diseases remain a leading cause of morbidity and mortality worldwide, necessitating accurate and trustworthy differential diagnosis. However, existing artificial intelligence-based diagnostic methods are often limited by insufficient cardiology knowledge, inadequate support for complex reasoning, and poor interpretability. Here we present HeartAgent, a cardiology-specific agent system designed to support a reliable and explainable differential diagnosis. HeartAgent integrates customized tools and curated data resources and orchestrates multiple specialized sub-agents to perform complex reasoning while generating transparent reasoning trajectories and verifiable supporting references. Evaluated on the MIMIC dataset and a private electronic health records cohort, HeartAgent achieved over 36% and 20% improvements over established comparative methods, in top-3 diagnostic accuracy, respectively. Additionally, clinicians assisted by HeartAgent demonstrated gains of 26.9% in diagnostic accuracy and 22.7% in explanatory quality compared with unaided experts. These results demonstrate that HeartAgent provides reliable, explainable, and clinically actionable decision support for cardiovascular care.
Chinese Translation
心脏疾病仍然是全球发病率和死亡率的主要原因,这要求进行准确且可信赖的差异诊断。然而,现有的基于人工智能的诊断方法往往受到心脏病学知识不足、对复杂推理支持不足以及可解释性差的限制。在此,我们提出了HeartAgent,一个专门针对心脏病学的智能体系统,旨在支持可靠且可解释的差异诊断。HeartAgent整合了定制工具和精选数据资源,并协调多个专业子智能体进行复杂推理,同时生成透明的推理轨迹和可验证的支持参考。在MIMIC数据集和一个私有电子健康记录队列上的评估中,HeartAgent在前3名诊断准确率上分别比现有比较方法提高了超过36%和20%。此外,使用HeartAgent的临床医生在诊断准确性上提高了26.9%,在解释质量上提高了22.7%,相比于未辅助的专家。这些结果表明,HeartAgent为心血管护理提供了可靠、可解释且具有临床可操作性的决策支持。
cs.CL / 63 / 2603.10767

mAceReason-Math: A Dataset of High-Quality Multilingual Math Problems Ready For RLVR

mAceReason-Math:高质量多语言数学问题数据集,适用于可验证奖励的强化学习
Dobler, Konstantin, Lehnerer, Simon, Scozzafava, Federico, Janke, Jonathan, Ali, Mohamed
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has been successfully applied to significantly boost the capabilities of pretrained large language models, especially in the math and logic problem domains. However, current research and available training datasets remain English-centric. While multilingual training data and benchmarks have been created in the past, they were not created with RLVR and current model capability in mind, and their level of difficulty is often too low to provide appropriate training signals for current models. To address this gap, we provide mAceReason-Math, a dataset of high-quality translations of challenging math problems sourced from a corpus specifically curated for RLVR (AceReason-Math). We further take specific care to clean and improve our translations, resulting in a coverage of 14 languages with more than 10,000 samples per language. We release the dataset to facilitate multilingual RLVR research and benchmarking in the research community.
Chinese Translation
可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)已成功应用于显著提升预训练大型语言模型的能力,尤其是在数学和逻辑问题领域。然而,目前的研究和可用的训练数据集仍然以英语为中心。尽管过去创建了多语言训练数据和基准,但它们并未考虑RLVR和当前模型能力的需求,且其难度水平往往过低,无法为当前模型提供适当的训练信号。为了解决这一问题,我们提供了mAceReason-Math,这是一个高质量的挑战性数学问题翻译数据集,来源于专门为RLVR策划的语料库(AceReason-Math)。我们还特别注意清理和改进我们的翻译,覆盖14种语言,每种语言超过10,000个样本。我们发布该数据集,以促进研究社区中多语言RLVR研究和基准测试的发展。
cs.CL / 64 / 2603.10771

Word Recovery in Large Language Models Enables Character-Level Tokenization Robustness

大型语言模型中的词恢复促进了字符级标记化的鲁棒性
Yang, Zhipeng, Yang, Shu, Hu, Lijie, Wang, Di
Abstract
Large language models (LLMs) trained with canonical tokenization exhibit surprising robustness to non-canonical inputs such as character-level tokenization, yet the mechanisms underlying this robustness remain unclear. We study this phenomenon through mechanistic interpretability and identify a core process we term word recovery. We first introduce a decoding-based method to detect word recovery, showing that hidden states reconstruct canonical word-level token identities from character-level inputs. We then provide causal evidence by removing the corresponding subspace from hidden states, which consistently degrades downstream task performance. Finally, we conduct a fine-grained attention analysis and show that in-group attention among characters belonging to the same canonical token is critical for word recovery: masking such attention in early layers substantially reduces both recovery scores and task performance. Together, our findings provide a mechanistic explanation for tokenization robustness and identify word recovery as a key mechanism enabling LLMs to process character-level inputs.
Chinese Translation
经过标准标记化训练的大型语言模型(LLMs)对非标准输入(如字符级标记化)表现出惊人的鲁棒性,但这种鲁棒性背后的机制仍不清楚。我们通过机制可解释性研究这一现象,并识别出一个核心过程,我们称之为词恢复。我们首先介绍了一种基于解码的方法来检测词恢复,显示隐藏状态能够从字符级输入重构标准的词级标识。然后,我们通过从隐藏状态中移除相应的子空间提供因果证据,这一操作一致性地降低了下游任务的性能。最后,我们进行细粒度的注意力分析,表明属于同一标准标记的字符之间的组内注意力对词恢复至关重要:在早期层中屏蔽这种注意力显著减少了恢复分数和任务性能。综合来看,我们的发现为标记化鲁棒性提供了机制解释,并将词恢复识别为使LLMs能够处理字符级输入的关键机制。
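The causal ablation described in the abstract, removing an identified subspace from hidden states, corresponds to projecting each hidden state onto the orthogonal complement of that subspace. A minimal numpy sketch of the projection (the basis here is random stand-in data, not the word-recovery subspace the authors estimate):

```python
import numpy as np

def remove_subspace(hidden, basis):
    """Project each row of `hidden` onto the orthogonal complement
    of the subspace spanned by the rows of `basis`."""
    # Orthonormalize the basis directions (reduced QR on the transpose).
    q, _ = np.linalg.qr(basis.T)          # shape: (d, k)
    # Subtract each hidden state's component inside the subspace.
    return hidden - (hidden @ q) @ q.T

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 16))              # 5 hidden states, d = 16
v = rng.normal(size=(2, 16))              # a 2-dimensional stand-in subspace
h_ablated = remove_subspace(h, v)

# After ablation, hidden states have no component along the basis directions.
print(np.abs(h_ablated @ v.T).max() < 1e-8)   # → True
```

In the paper's setting the ablated representations are then fed onward, and the resulting drop in downstream performance is the causal evidence.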
cs.CL / 65 / 2603.10775

Large Language Models as Annotators for Machine Translation Quality Estimation

大型语言模型作为机器翻译质量评估的注释者
Wang, Sidi, Arnoult, Sophie, Kamran, Amir
Abstract
Large Language Models (LLMs) have demonstrated excellent performance on Machine Translation Quality Estimation (MTQE), yet their high inference costs make them impractical for direct application. In this work, we propose applying LLMs to generate MQM-style annotations for training a COMET model: following Fernandes et al. (2023), we reckon that segment-level annotations provide a strong rationale for LLMs and are key to good segment-level QE. We propose a simplified MQM scheme, mostly restricted to top-level categories, to guide LLM selection. We present a systematic approach for the development of a GPT-4o-based prompt, called PPbMQM (Prompt-Pattern-based-MQM). We show that the resulting annotations correlate well with human annotations and that training COMET on them leads to competitive performance on segment-level QE for Chinese-English and English-German.
Chinese Translation
大型语言模型(LLMs)在机器翻译质量评估(MTQE)方面表现出色,但其高推理成本使其在直接应用中不够实用。在本研究中,我们提出将LLMs应用于生成MQM风格的注释,以训练COMET模型:遵循Fernandes等人(2023)的观点,我们认为段级注释为LLMs提供了强有力的依据,并且是良好段级质量评估的关键。我们提出了一种简化的MQM方案,主要限制在顶层类别,以指导LLM的选择。我们展示了一种基于GPT-4o的提示开发的系统方法,称为PPbMQM(基于提示模式的MQM)。我们表明,生成的注释与人工注释之间具有良好的相关性,并且在这些注释上训练COMET模型在中英文和英德段级质量评估中表现出竞争力。
cs.CL / 66 / 2603.10784

Interpretable Chinese Metaphor Identification via LLM-Assisted MIPVU Rule Script Generation: A Comparative Protocol Study

通过LLM辅助的MIPVU规则脚本生成实现可解释的中文隐喻识别:一项比较协议研究
Huang, Weihang, Liu, Mengna
Abstract
Metaphor identification is a foundational task in figurative language processing, yet most computational approaches operate as opaque classifiers offering no insight into why an expression is judged metaphorical. This interpretability gap is especially acute for Chinese, where rich figurative traditions, absent morphological cues, and limited annotated resources compound the challenge. We present an LLM-assisted pipeline that operationalises four metaphor identification protocols--MIP/MIPVU lexical analysis, CMDAG conceptual-mapping annotation, emotion-based detection, and simile-oriented identification--as executable, human-auditable rule scripts. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision. We evaluate on seven Chinese metaphor datasets spanning token-, sentence-, and span-level annotation, establishing the first cross-protocol comparison for Chinese metaphor identification. Within-protocol evaluation shows Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while cross-protocol analysis reveals striking divergence: pairwise Cohen's kappa between Protocols A and D is merely 0.001, whereas Protocols B and C exhibit near-perfect agreement (kappa = 0.986). An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00. Error analysis identifies conceptual-domain mismatch and register sensitivity as dominant failure modes. Our results demonstrate that protocol choice is the single largest source of variation in metaphor identification, exceeding model-level variation, and that rule-script architectures achieve competitive performance while maintaining full transparency.
Chinese Translation
隐喻识别是比喻语言处理中的基础任务,但大多数计算方法作为不透明的分类器运作,无法提供为何某个表达被判断为隐喻的见解。这种可解释性缺口在中文中尤为明显,因为丰富的比喻传统、缺乏形态线索以及有限的标注资源加大了挑战。我们提出了一种LLM辅助的流程,将四种隐喻识别协议——MIP/MIPVU词汇分析、CMDAG概念映射注释、基于情感的检测和类比导向的识别——转化为可执行的、可供人类审计的规则脚本。每个协议都是一系列确定性步骤的模块化链条,交替进行受控的LLM调用,在每个分类决策的同时生成结构化的推理依据。我们在七个中文隐喻数据集上进行评估,涵盖了标记级、句子级和跨度级的注释,建立了中文隐喻识别的首次跨协议比较。协议内评估显示,协议A(MIP)在标记级识别中达到了0.472的F1值,而跨协议分析揭示了显著的差异:协议A和D之间的成对Cohen's kappa仅为0.001,而协议B和C则表现出近乎完美的一致性(kappa = 0.986)。可解释性审计显示所有协议实现了100%的确定性可重复性,推理正确性从0.40到0.87,编辑性从0.80到1.00。错误分析确定了概念领域不匹配和语域敏感性为主要失败模式。我们的结果表明,协议选择是隐喻识别中变异的最大来源,超过了模型级别的变异,而规则脚本架构在保持完全透明的同时实现了竞争性的性能。
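The pairwise Cohen's kappa figures quoted above correct raw agreement for agreement expected by chance. For readers unfamiliar with the statistic, a self-contained sketch on toy binary metaphor/literal labels (the data is invented, not the paper's annotations):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two label sequences of equal length."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of items labelled identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / n ** 2
    return (p_o - p_e) / (1 - p_e)

proto_b = [1, 1, 0, 0, 1, 0, 1, 0]   # toy per-token metaphor labels
proto_c = [1, 1, 0, 0, 1, 0, 1, 1]   # a near-identical protocol
print(cohens_kappa(proto_b, proto_c))   # → 0.75
```

A kappa near 0, as between Protocols A and D, means the protocols agree no more often than chance despite labelling the same tokens.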
cs.CL / 67 / 2603.10789

LuxBorrow: From Pompier to Pompjee, Tracing Borrowing in Luxembourgish

LuxBorrow:从 Pompier 到 Pompjee,追踪卢森堡语中的借用现象
Hosseini-Kivanani, Nina, Philippy, Fred
Abstract
We present LuxBorrow, a borrowing-first analysis of Luxembourgish (LU) news spanning 27 years (1999-2025), covering 259,305 RTL articles and 43.7M tokens. Our pipeline combines sentence-level language identification (LU/DE/FR/EN) with a token-level borrowing resolver restricted to LU sentences, using lemmatization, a collected loanword registry, and compiled morphological and orthographic rules. Empirically, LU remains the matrix language across all documents, while multilingual practice is pervasive: 77.1% of articles include at least one donor language and 65.4% use three or four. Breadth does not imply intensity: median code-mixing index (CMI) increases from 3.90 (LU+1) to only 7.00 (LU+3), indicating localized insertions rather than balanced bilingual text. Domain and period summaries show moderate but persistent mixing, with CMI rising from 6.1 (1999-2007) to a peak of 8.4 in 2020. Token-level adaptations total 25,444 instances and exhibit a mixed profile: morphological 63.8%, orthographic 35.9%, lexical 0.3%. The most frequent individual rules are orthographic, such as on->oun and eur->er, while morphology is collectively dominant. Diachronically, code-switching intensifies, and morphologically adapted borrowings grow from a small base. French overwhelmingly supplies adapted items, with modest growth for German and negligible English. We advocate borrowing-centric evaluation, including borrowed token and type rates, donor entropy over borrowed items, and assimilation ratios, rather than relying only on document-level mixing indices.
Chinese Translation
我们提出了 LuxBorrow,这是对卢森堡语(LU)新闻的借用优先分析,涵盖了27年的数据(1999-2025),涉及259,305篇 RTL 文章和4370万个词元。我们的处理流程结合了句子级语言识别(LU/DE/FR/EN)与限制在 LU 句子中的词元级借用解析器,使用了词形还原、收集的借用词登记册以及编制的形态学和正字法规则。从实证上看,LU 在所有文档中仍然是主要语言,而多语言实践则普遍存在:77.1% 的文章至少包含一种捐赠语言,65.4% 的文章使用三种或四种语言。广度并不意味着强度:中位代码混合指数(CMI)从 3.90(LU+1)仅增加到 7.00(LU+3),这表明是局部插入而非平衡的双语文本。领域和时期的总结显示出适度但持续的混合,CMI 从 6.1(1999-2007)上升到 2020 年的峰值 8.4。词元级的适应总计 25,444 个实例,展现出混合的特征:形态学 63.8%,正字法 35.9%,词汇 0.3%。最常见的个别规则是正字法规则,例如 on->oun 和 eur->er,而形态学则整体占主导地位。从历时上看,代码切换加剧,形态学适应的借用从小基数增长。法语在适应项目中占主导地位,德语有适度增长,而英语则几乎没有增长。我们提倡以借用为中心的评估,包括借用词元和类型的比率、借用项目的捐赠熵以及同化比率,而不仅仅依赖于文档级混合指数。
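The CMI figures above are easiest to read with the formula in hand. The paper does not restate its exact definition, but assuming the common Das and Gambäck style formulation (100 times one minus the dominant language's token share), a small sketch:

```python
from collections import Counter

def code_mixing_index(lang_tags):
    """Code-mixing index in the Das & Gambäck (2014) style:
    100 * (1 - share of the dominant language). 0 = monolingual."""
    counts = Counter(lang_tags)
    return 100 * (1 - max(counts.values()) / len(lang_tags))

# A Luxembourgish sentence with a single French insertion (LU matrix).
tags = ["LU"] * 9 + ["FR"]
print(round(code_mixing_index(tags), 1))   # → 10.0
```

On this scale the reported medians of 3.90 to 7.00 indeed reflect sparse, localized insertions rather than balanced bilingual text.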
cs.CL / 68 / 2603.10793

Multilingual Reasoning Gym: Multilingual Scaling of Procedural Reasoning Environments

多语言推理训练场:程序性推理环境的多语言扩展
Dobler, Konstantin, Lehnerer, Simon, Scozzafava, Federico, Janke, Jonathan, Ali, Mohamed
Abstract
We present the Multilingual Reasoning Gym, an extension of Reasoning Gym (Stojanovski et al., 2025), that procedurally generates verifiable reasoning problems across 14 languages. We translate templates for 94 tasks with native-speaker validation in 10 languages and targeted code or template adaptations to ensure linguistic naturalness. The Multilingual Reasoning Gym preserves the core benefits of the procedural generation approach used in the original Reasoning Gym, such as virtually unlimited problem instance generation and adjustable difficulty, and remains directly usable for Reinforcement Learning from Verifiable Rewards and evaluation settings. Problems in the Multilingual Reasoning Gym are parallel across languages, enabling crosslingually parallel data generation at massive scale due to the procedural nature of the environments. We release our implementation to support research into multilingual reasoning models.
Chinese Translation
我们提出了多语言推理训练场,这是推理训练场(Reasoning Gym,Stojanovski 等,2025)的扩展,能够在14种语言中程序性生成可验证的推理问题。我们将94个任务的模板翻译成10种语言,并通过母语者验证和针对性的代码或模板调整,以确保语言的自然性。多语言推理训练场保留了原始推理训练场中使用的程序生成方法的核心优势,如几乎无限的问题实例生成和可调的难度,并且仍然可以直接用于可验证奖励的强化学习和评估设置。多语言推理训练场中的问题在语言间是平行的,由于环境的程序性特征,使得大规模的跨语言平行数据生成成为可能。我们发布了我们的实现,以支持对多语言推理模型的研究。
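The core idea, parallel templates instantiated from a shared random seed with a verifiable gold answer, can be sketched as follows. The template set and function names are illustrative inventions, not the Reasoning Gym API:

```python
import random

# Hypothetical per-language templates for one arithmetic task;
# the real environments cover 94 tasks in 14 languages.
TEMPLATES = {
    "en": "What is {a} plus {b}?",
    "de": "Was ist {a} plus {b}?",
    "fr": "Combien font {a} plus {b} ?",
}

def generate(seed, difficulty=2):
    """Generate the same problem instance in every language, with a
    verifiable gold answer. Difficulty scales the operand range."""
    rng = random.Random(seed)            # shared seed => parallel instances
    hi = 10 ** difficulty
    a, b = rng.randrange(hi), rng.randrange(hi)
    prompts = {lang: t.format(a=a, b=b) for lang, t in TEMPLATES.items()}
    return prompts, a + b

def verify(candidate, answer):
    """Verifiable reward: exact match against the gold answer."""
    return float(candidate == answer)

prompts, gold = generate(seed=42)
print(sorted(prompts))        # → ['de', 'en', 'fr']
print(verify(gold, gold))     # → 1.0
```

Because generation is procedural and seeded, the same instance can be emitted in all languages at once, which is what enables the massive-scale crosslingually parallel data the abstract describes.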
cs.CL / 69 / 2603.10842

PivotAttack: Rethinking the Search Trajectory in Hard-Label Text Attacks via Pivot Words

PivotAttack:通过枢轴词重新思考硬标签文本攻击中的搜索轨迹
Liang, Yuzhi, Xiao, Shiliang, Wei, Jingsong, Lin, Qiliang, Li, Xia
Abstract
Existing hard-label text attacks often rely on inefficient "outside-in" strategies that traverse vast search spaces. We propose PivotAttack, a query-efficient "inside-out" framework. It employs a Multi-Armed Bandit algorithm to identify Pivot Sets-combinatorial token groups acting as prediction anchors-and strategically perturbs them to induce label flips. This approach captures inter-word dependencies and minimizes query costs. Extensive experiments across traditional models and Large Language Models demonstrate that PivotAttack consistently outperforms state-of-the-art baselines in both Attack Success Rate and query efficiency.
Chinese Translation
现有的硬标签文本攻击通常依赖于低效的“外向内”策略,遍历广阔的搜索空间。我们提出了PivotAttack,这是一种查询高效的“内向外”框架。它采用多臂赌博机(Multi-Armed Bandit)算法来识别枢轴集(Pivot Sets)——作为预测锚点的组合标记组——并有策略地扰动这些标记以诱导标签翻转。这种方法捕捉了词间依赖关系,并最小化了查询成本。在传统模型和大型语言模型上的广泛实验表明,PivotAttack在攻击成功率和查询效率方面始终优于最先进的基线。
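The abstract does not specify which bandit algorithm PivotAttack uses; as an illustration of the general idea, here is standard UCB1 choosing, among candidate tokens, the one whose perturbation most often flips the label. The flip probabilities are a toy stand-in for actual hard-label model queries:

```python
import math
import random

def ucb1(n_arms, reward_fn, rounds=2000, seed=0):
    """Standard UCB1: at each step play the arm maximizing
    mean reward + sqrt(2 ln t / pulls). Returns per-arm pull counts."""
    rng = random.Random(seed)
    pulls = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, rounds + 1):
        if t <= n_arms:                       # play each arm once first
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda i: sums[i] / pulls[i]
                      + math.sqrt(2 * math.log(t) / pulls[i]))
        pulls[arm] += 1
        sums[arm] += reward_fn(arm, rng)
    return pulls

# Toy stand-in for "probability that perturbing token i flips the label".
flip_prob = [0.1, 0.2, 0.8, 0.15]             # token 2 acts as the pivot
reward = lambda i, rng: 1.0 if rng.random() < flip_prob[i] else 0.0

pulls = ucb1(len(flip_prob), reward)
print(pulls.index(max(pulls)))                # → 2 (the pivot token)
```

The bandit concentrates its query budget on the most promising tokens, which is the query-efficiency argument behind the "inside-out" framing.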
cs.CL / 70 / 2603.10861

SiDiaC-v.2.0: Sinhala Diachronic Corpus Version 2.0

SiDiaC-v.2.0:僧伽罗语历时语料库版本2.0
Jayatilleke, Nevidu, de Silva, Nisansa, Nimanthi, Uthpala, Kulathilaka, Gagani, Safrullah, Azra, Sofalas, Johan
Abstract
SiDiaC-v.2.0 is the largest comprehensive Sinhala Diachronic Corpus to date, covering a period from 1800 CE to 1955 CE in terms of publication dates, and a historical span from the 5th to the 20th century CE in terms of written dates. The corpus consists of 244k words across 185 literary works that underwent thorough filtering, preprocessing, and copyright compliance checks, followed by extensive post-processing. Additionally, a subset of 59 documents totalling 70k words was annotated based on their written dates. Texts from the National Library of Sri Lanka were selected from the SiDiaC-v.1.0 non-filtered list, which was digitised using Google Document AI OCR. This was followed by post-processing to correct formatting issues, address code-mixing, include special tokens, and fix malformed tokens. The construction of SiDiaC-v.2.0 was informed by practices from other corpora, such as FarPaHC, SiDiaC-v.1.0, and CCOHA. This was particularly relevant for syntactic annotation and text normalisation strategies, given the low-resource language status that Faroese shares with Sinhala and the similar cleaning strategies utilised in CCOHA. This corpus is categorised into two layers based on genres: primary and secondary. The primary categorisation is binary, assigning each book to either Non-Fiction or Fiction. The secondary categorisation is more detailed, grouping texts under specific genres such as Religious, History, Poetry, Language, and Medical. Despite facing challenges due to limited resources, SiDiaC-v.2.0 serves as a comprehensive resource for Sinhala NLP, building upon the work previously done in SiDiaC-v.1.0.
Chinese Translation
SiDiaC-v.2.0是迄今为止最大的综合僧伽罗语历时语料库,涵盖了从公元1800年到1955年的出版日期,以及从公元5世纪到20世纪的书写日期。该语料库包含244,000个单词,来自185部经过严格筛选、预处理和版权合规检查的文学作品,并进行了广泛的后处理。此外,基于书写日期,选取了59份文件,总计70,000个单词进行了注释。文本来自斯里兰卡国家图书馆,选自SiDiaC-v.1.0的非筛选列表,使用Google Document AI OCR进行了数字化。随后进行了后处理,以纠正格式问题、解决代码混合、包含特殊标记并修复格式不当的标记。SiDiaC-v.2.0的构建借鉴了其他语料库的实践,如FarPaHC、SiDiaC-v.1.0和CCOHA。这在句法注释和文本规范化策略方面尤为相关,因为法罗语和CCOHA所采用的清理策略在低资源语言状态下具有相似特征。该语料库根据体裁分为两个层次:初级和次级。初级分类为二元,将每本书分为非虚构或虚构。次级分类则更为详细,将文本归类于特定体裁,如宗教、历史、诗歌、语言和医学。尽管面临资源有限的挑战,SiDiaC-v.2.0仍然作为僧伽罗语自然语言处理的综合资源,建立在SiDiaC-v.1.0之前的工作基础上。
cs.CL / 71 / 2603.10876

An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?

极端多标签文本分类(XMTC)库数据集:如果我们认真对待“在数字图书馆中使用实用人工智能”会怎样?
D'Souza, Jennifer, Sadruddin, Sameer, Kähler, Maximilian, Salfinger, Andrea, Zaccagna, Luca, Incitti, Francesca, Snidaro, Lauro, Suominen, Osma
Abstract
Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.
Chinese Translation
主题索引对于发现至关重要,但在规模和跨语言的情况下难以维持。我们发布了一个大型双语(英语/德语)语料库,包含用综合权威文件(GND)注释的目录记录,以及一个可机器处理的GND分类法。该资源使得本体感知的多标签分类成为可能,将文本映射到权威术语,并实现可重复的、基于权威的评估的代理辅助目录编制。我们提供了三个系统的简要统计概况和定性错误分析。我们邀请社区不仅评估准确性,还要评估实用性和透明度,以推动基于权威的人工智能共同助手,增强目录编制者的工作。
cs.CL / 72 / 2603.10877

From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers

从图像到文字:高效的跨模态知识蒸馏从黑箱教师到语言模型
Sengupta, Ayan, Dixit, Shantanu, Akhtar, Md Shad, Chakraborty, Tanmoy
Abstract
Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller models, ensuring computational efficiency without significantly dropping performance. Traditional KD techniques assume homogeneity in modalities between the teacher (source) and the student (target) models. On the other hand, existing multimodal knowledge distillation methods require modality-specific pre-training of the teacher model, which is computationally infeasible in most cases. In this paper, we introduce ARMADA, an efficient cross-modal knowledge distillation framework designed to transfer knowledge from large vision-language models, including black-box models, to language-only models. Unlike existing KD techniques that rely on the internal structures of multimodal teachers or require computationally expensive pre-training, ARMADA leverages novel alignment techniques to distil knowledge without altering the teacher model, ensuring efficiency and scalability. We empirically validate ARMADA on twelve natural language understanding, eight complex generative reasoning and five instruction-tuning tasks, demonstrating consistent performance improvements in large models such as DeBERTa-v2-1.4B, OPT-1.3B, LLaMA-{3B, 7B, 8B}. ARMADA achieves up to 3.4% improvement on language understanding tasks and 2.6% boost in generative reasoning, all without requiring expensive multimodal pre-training or fine-tuning of the teacher model. Our findings challenge conventional knowledge distillation paradigms by demonstrating that even vision-language models, despite lacking direct textual understanding, can significantly enhance language models when distilled appropriately.
Chinese Translation
知识蒸馏(KD)方法在将大型预训练语言模型压缩为更小模型的过程中至关重要,确保计算效率而不显著降低性能。传统的KD技术假设教师(源)模型和学生(目标)模型之间的模态是同质的。另一方面,现有的多模态知识蒸馏方法需要对教师模型进行特定于模态的预训练,这在大多数情况下是计算上不可行的。在本文中,我们介绍了ARMADA,一个高效的跨模态知识蒸馏框架,旨在将知识从大型视觉-语言模型(包括黑箱模型)转移到仅语言模型。与依赖多模态教师内部结构或需要计算昂贵的预训练的现有KD技术不同,ARMADA利用新颖的对齐技术进行知识蒸馏,而不改变教师模型,从而确保效率和可扩展性。我们在十二个自然语言理解任务、八个复杂生成推理任务和五个指令调优任务上对ARMADA进行了实证验证,展示了在大型模型如DeBERTa-v2-1.4B、OPT-1.3B、LLaMA-{3B, 7B, 8B}中一致的性能提升。ARMADA在语言理解任务上实现了高达3.4%的提升,在生成推理中提升了2.6%,且无需昂贵的多模态预训练或教师模型的微调。我们的研究结果挑战了传统的知识蒸馏范式,表明即使是缺乏直接文本理解的视觉-语言模型,在适当的蒸馏下也能显著增强语言模型。
cs.CL / 73 / 2603.10910

GLM-OCR Technical Report

GLM-OCR 技术报告
Duan, Shuaiqi, Xue, Yadong, Wang, Weihan, Su, Zhe, Liu, Huan, Yang, Sheng, Gan, Guobing, Wang, Guo, Wang, Zihan, Yan, Shengdong, Jin, Dexin, Zhang, Yuxuan, Wen, Guohong, Wang, Yanfeng, Zhang, Yutao, Zhang, Xiaohan, Hong, Wenyi, Cen, Yukuo, Yin, Da, Chen, Bin, Yu, Wenmeng, Gu, Xiaotao, Tang, Jie
Abstract
GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
Chinese Translation
GLM-OCR 是一个高效的 0.9B 参数紧凑型多模态模型,旨在实现真实世界文档理解。它结合了一个 0.4B 参数的 CogViT 视觉编码器和一个 0.5B 参数的 GLM 语言解码器,在计算效率和识别性能之间实现了良好的平衡。为了应对标准自回归解码在确定性 OCR 任务中的低效,GLM-OCR 引入了一种多标记预测(Multi-Token Prediction, MTP)机制,该机制在每一步预测多个标记,显著提高了解码吞吐量,同时通过共享参数保持了低内存开销。在系统层面,采用了两阶段管道:PP-DocLayout-V3 首先进行布局分析,然后进行并行区域级识别。在公共基准和工业场景上的广泛评估表明,GLM-OCR 在文档解析、文本和公式转录、表格结构恢复以及关键信息提取方面达到了具有竞争力或领先的性能。其紧凑的架构和结构化生成使其适用于资源受限的边缘部署和大规模生产系统。
cs.CL / 74 / 2603.10913

LLM2Vec-Gen: Generative Embeddings from Large Language Models

LLM2Vec-Gen:来自大语言模型的生成嵌入
BehnamGhader, Parishad, Adlakha, Vaibhav, Schmidt, Fabian David, Chapados, Nicolas, Mosbach, Marius, Reddy, Siva
Abstract
LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output gap is addressed by training embedding models with paired data using contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to the input, and optimize them to represent the LLM's response in a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to 43.2% reduction in harmful content retrieval and 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.
Chinese Translation
基于大语言模型(LLM)的文本嵌入器通常对其输入的语义内容进行编码。然而,嵌入任务需要将多样化的输入映射到相似的输出。通常,这种输入输出关系通过使用对比学习训练配对数据的嵌入模型来解决。在本研究中,我们提出了一种新颖的自监督方法LLM2Vec-Gen,它采用了不同的范式:我们不是对输入进行编码,而是学习表示模型的潜在响应。具体而言,我们向LLM的词汇表中添加可训练的特殊标记,将其附加到输入中,并优化它们以在固定长度的序列中表示LLM的响应。训练由LLM对查询的自身完成引导,同时结合一个提供蒸馏目标的无监督嵌入教师。这种构造有助于弥合输入输出之间的差距,并将LLM的能力(如安全对齐和推理)转移到嵌入任务中。重要的是,LLM的主干保持不变,训练仅需无标签的查询。LLM2Vec-Gen在大规模文本嵌入基准(MTEB)上实现了最先进的自监督性能,比最佳的无监督嵌入教师提高了9.3%。我们还观察到有害内容检索减少了多达43.2%,嵌入任务的推理能力提高了29.3%。最后,学习到的嵌入是可解释的,并且可以解码为文本,以揭示其语义内容。
cs.CL / 75 / 2603.11027

Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge

超越共识的幻觉:从表面启发式到基于知识的评估在 LLM 作为评判者中的应用
Song, Mingyang, Zheng, Mao, Xu, Chenning
Abstract
The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that challenge this assumption. \textbf{First}, we demonstrate that this consensus is frequently illusory. We identify and formalize \textbf{Evaluation Illusion}, a phenomenon where LLM judges generate sophisticated critiques yet anchor scores on shared surface heuristics rather than substantive quality. Through a large-scale study of 105,600 evaluation instances (32 LLMs $\times$ 3 frontier judges $\times$ 100 tasks $\times$ 11 temperatures), we show that model-level agreement (Spearman $\rho = 0.99$) masks fragile sample-level agreement (Pearson $\bar{r} = 0.72$; absolute agreement ICC $= 0.67$), that merely sharing rubric structure restores 62\% of total agreement, and that high-quality outputs paradoxically receive the \textit{least} consistent evaluations. \textbf{Second}, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment. We introduce MERG (Metacognitive Enhanced Rubric Generation), a knowledge-driven rubric generation framework whose domain-selective effects confirm this. Agreement \textit{increases} in codified domains (Education +22\%, Academic +27\%) where knowledge anchors evaluators on shared standards, while it decreases in subjective domains where genuine evaluative pluralism emerges. These findings suggest that evaluation rubrics should be dynamically enriched with expert knowledge rather than relying on generic criteria, with implications for reward modeling in RLAIF.
Chinese Translation
LLM 作为评判者的范式依赖于一个关键假设,即高评估者间一致性表明评估是可靠和客观的。我们提出了两个互补的发现,挑战了这一假设。首先,我们证明这种共识往往是虚幻的。我们识别并形式化了评估幻觉(Evaluation Illusion),这一现象表明 LLM 评判者生成复杂的批评意见,但评分却依赖于共享的表面启发式,而非实质质量。通过对 105,600 个评估实例(32 个 LLM × 3 个前沿评审 × 100 个任务 × 11 个温度)的广泛研究,我们展示了模型级别的一致性(Spearman ρ = 0.99)掩盖了脆弱的样本级别一致性(Pearson 平均 r = 0.72;绝对一致性 ICC = 0.67),仅共享评分标准结构就恢复了 62% 的总一致性,而高质量的输出反而获得了最不一致的评估。其次,我们展示了基于领域知识动态生成评估标准能够产生更有意义的评估。我们引入了 MERG(元认知增强评分标准生成),这是一个知识驱动的评分标准生成框架,其领域选择效应证实了这一点。在知识将评估者锚定在共享标准的编码领域(教育 +22%,学术 +27%)中,一致性增加,而在主观领域中,真正的评估多元性则导致一致性下降。这些发现表明,评估标准应动态地丰富专家知识,而不是依赖于通用标准,这对 RLAIF 中的奖励建模具有重要意义。
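The model-level figure quoted above is Spearman's rho, which is simply the Pearson correlation computed on rank vectors. A dependency-free sketch with invented judge scores (a near-1 rho only says the two judges order items the same way, not that their per-sample scores agree, which is exactly the Evaluation Illusion point):

```python
def ranks(xs):
    """Average ranks (1-based); ties share the mean of their rank positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                     # extend the tie group
        avg = (i + j) / 2 + 1          # mean of rank positions i+1 .. j+1
        for k in order[i:j + 1]:
            r[k] = avg
        i = j + 1
    return r

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def spearman(a, b):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    return pearson(ranks(a), ranks(b))

judge_a = [7.0, 8.5, 6.0, 9.0, 5.5]    # toy per-sample scores
judge_b = [7.5, 8.0, 6.5, 9.5, 5.0]    # different values, same ordering
print(round(spearman(judge_a, judge_b), 6))   # → 1.0
```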
cs.CL / 76 / 2603.11039

Instruction set for the representation of graphs

图的表示指令集
Lopez-Rubio, Ezequiel, Pascual-Gonzalez, Mario
Abstract
We present IsalGraph, a method for representing the structure of any finite, simple graph as a compact string over a nine-character instruction alphabet. The encoding is executed by a small virtual machine comprising a sparse graph, a circular doubly-linked list (CDLL) of graph-node references, and two traversal pointers. Instructions either move a pointer through the CDLL or insert a node or edge into the graph. A key design property is that every string over the alphabet decodes to a valid graph, with no invalid states reachable. A greedy \emph{GraphToString} algorithm encodes any connected graph into a string in time polynomial in the number of nodes; an exhaustive-backtracking variant produces a canonical string by selecting the lexicographically smallest shortest string across all starting nodes and all valid traversal orders. We evaluate the representation on five real-world graph benchmark datasets (IAM Letter LOW/MED/HIGH, LINUX, and AIDS) and show that the Levenshtein distance between IsalGraph strings correlates strongly with graph edit distance (GED). Together, these properties make IsalGraph strings a compact, isomorphism-invariant, and language-model-compatible sequential encoding of graph structure, with direct applications in graph similarity search, graph generation, and graph-conditioned language modelling.
Chinese Translation
我们提出了 IsalGraph,一种将任何有限简单图的结构表示为九字符指令字母表上的紧凑字符串的方法。该编码由一个小型虚拟机执行,该虚拟机包括一个稀疏图、一个图节点引用的循环双向链表(CDLL)和两个遍历指针。指令要么在 CDLL 中移动指针,要么将节点或边插入图中。一个关键的设计特性是,字母表上的每个字符串都能解码为一个有效的图,并且没有可达的无效状态。一个贪心的 GraphToString 算法可以在节点数量的多项式时间内将任何连通图编码为字符串;一个穷举回溯变体通过选择所有起始节点和所有有效遍历顺序中的字典序最小的最短字符串,生成一个规范字符串。我们在五个真实世界的图基准数据集(IAM Letter LOW/MED/HIGH、LINUX 和 AIDS)上评估了该表示,并显示 IsalGraph 字符串之间的 Levenshtein 距离与图编辑距离(GED)之间有很强的相关性。这些特性使得 IsalGraph 字符串成为一种紧凑的、同构不变的、与语言模型兼容的图结构序列编码,具有在图相似性搜索、图生成和图条件语言建模中的直接应用。
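The reported correlation suggests a cheap graph-similarity proxy: compare two IsalGraph encodings with plain Levenshtein distance. A sketch using the classic dynamic program (the example strings are hypothetical; the actual nine instruction characters and their semantics are defined by the paper's VM):

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance (insert/delete/substitute),
    keeping only the previous row for O(len(t)) memory."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # delete from s
                           cur[j - 1] + 1,            # insert into s
                           prev[j - 1] + (cs != ct))) # substitute
        prev = cur
    return prev[-1]

# Hypothetical encodings of two similar graphs (characters chosen arbitrarily).
g1 = "NENNEFNBE"
g2 = "NENNEFNE"
print(levenshtein(g1, g2))   # → 1 (one instruction deleted)
```

Under the paper's finding, a small edit distance between encodings should indicate a small graph edit distance between the underlying graphs, enabling fast approximate similarity search.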