Daily Research Digest

arXiv Papers

2026-02-23
104 papers · 4 categories
Robotics (24 papers)
cs.RO / 1 / 2602.17737

Nested Training for Mutual Adaptation in Human-AI Teaming

Biswas, Upasana, Kalwar, Durgesh, Kambhampati, Subbarao, Sreedharan, Sarath
Abstract
Mutual adaptation is a central challenge in human-AI teaming, as humans naturally adjust their strategies in response to a robot's policy. Existing approaches aim to improve diversity in training partners to approximate human behavior, but these partners are static and fail to capture the adaptive behavior of humans. Exposing robots to adaptive behaviors is critical, yet when both agents learn simultaneously in a multi-agent setting, they often converge to opaque implicit coordination strategies that only work with the agents they were co-trained with. Such agents fail to generalize when paired with new partners. In order to capture the adaptive behavior of humans, we model the human-robot teaming scenario as an Interactive Partially Observable Markov Decision Process (I-POMDP), explicitly modeling human adaptation as part of the state. We propose a nested training regime to approximately learn the solution to a finite-level I-POMDP. In this framework, agents at each level are trained against adaptive agents from the level below. This ensures that the ego agent is exposed to adaptive behavior during training while avoiding the emergence of implicit coordination strategies, since the training partners are not themselves learning. We train our method in a multi-episode, required-cooperation setup in the Overcooked domain, comparing it against several baseline agents designed for human-robot teaming. We evaluate the performance of our agent when paired with adaptive partners that were not seen during training. Our results demonstrate that our agent not only achieves higher task performance with these adaptive partners but also exhibits significantly greater adaptability during team interactions.
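The level-by-level scheme described above can be sketched in a toy matrix game, with an exact best response standing in for RL training; the game, `best_response`, and `nested_training` below are illustrative stand-ins, not the paper's Overcooked setup:

```python
import numpy as np

# Symmetric coordination payoff matrix: M[a, b] = reward when ego plays a, partner plays b.
M = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def best_response(partner_dist, payoff):
    """Ego's best response to a frozen distribution over partner actions."""
    expected = payoff @ partner_dist                    # expected payoff per ego action
    return np.eye(len(expected))[np.argmax(expected)]   # deterministic policy as one-hot

def nested_training(levels, payoff):
    """Level-0 is a fixed uniform agent; each higher level trains (here: exactly
    best-responds) against the frozen agent one level below, so the training
    partner is never itself learning at the same time."""
    policies = [np.full(payoff.shape[1], 1.0 / payoff.shape[1])]  # level-0: uniform
    for _ in range(levels):
        policies.append(best_response(policies[-1], payoff))
    return policies

policies = nested_training(levels=3, payoff=M)
print([int(p.argmax()) for p in policies[1:]])  # each level's chosen action
```

In this toy game every level above zero settles on the higher-payoff coordination action, because each level reasons about a fixed, non-co-adapting partner from the level below.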
cs.RO / 2 / 2602.17794

Reinforcement-Learning-Based Assistance Reduces Squat Effort with a Modular Hip--Knee Exoskeleton

Ratnakumar, Neethan, Tohfafarosh, Mariya Huzaifa, Jauhri, Saanya, Zhou, Xianlian
Abstract
Squatting is one of the most demanding lower-limb movements, requiring substantial muscular effort and coordination. Reducing the physical demands of this task through intelligent and personalized assistance has significant implications, particularly in industries involving repetitive low-level assembly activities. In this study, we evaluated the effectiveness of a neural network controller for a modular Hip-Knee exoskeleton designed to assist squatting tasks. The neural network controller was trained via reinforcement learning (RL) in a physics-based, human-exoskeleton interaction simulation environment. The controller generated real-time hip and knee assistance torques based on recent joint-angle and velocity histories. Five healthy adults performed three-minute metronome-guided squats under three conditions: (1) no exoskeleton (No-Exo), (2) exoskeleton with Zero-Torque, and (3) exoskeleton with active assistance (Assistance). Physiological effort was assessed using indirect calorimetry and heart rate monitoring, alongside concurrent kinematic data collection. Results show that the RL-based controller adapts to individuals by producing torque profiles tailored to each subject's kinematics and timing. Compared with the Zero-Torque and No-Exo conditions, active assistance reduced the net metabolic rate by approximately 10%, with minor reductions observed in heart rate. However, assisted trials also exhibited reduced squat depth, reflected by smaller hip and knee flexion. These preliminary findings suggest that the proposed controller can effectively lower physiological effort during repetitive squatting, motivating further improvements in both hardware design and control strategies.
cs.RO / 3 / 2602.17818

Lend me an Ear: Speech Enhancement Using a Robotic Arm with a Microphone Array

Turcotte, Zachary, Grondin, François
Abstract
Speech enhancement performance degrades significantly in noisy environments, limiting the deployment of speech-controlled technologies in industrial settings, such as manufacturing plants. Existing speech enhancement solutions primarily rely on advanced digital signal processing techniques, deep learning methods, or complex software optimization techniques. This paper introduces a novel enhancement strategy that incorporates a physical optimization stage by dynamically modifying the geometry of a microphone array to adapt to changing acoustic conditions. A sixteen-microphone array is mounted on a robotic arm manipulator with seven degrees of freedom, with microphones divided into four groups of four, including one group positioned near the end-effector. The system reconfigures the array by adjusting the manipulator joint angles to place the end-effector microphones closer to the target speaker, thereby improving the reference signal quality. This proposed method integrates sound source localization techniques, computer vision, inverse kinematics, a minimum variance distortionless response beamformer, and time-frequency masking using a deep neural network. Experimental results demonstrate that this approach outperforms other traditional recording configurations, achieving higher scale-invariant signal-to-distortion ratio and lower word error rate across multiple input signal-to-noise ratio conditions.
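The minimum variance distortionless response (MVDR) beamformer mentioned above has a standard closed form, $w = R^{-1}d / (d^H R^{-1} d)$, where $d$ is the steering vector toward the target and $R$ the noise spatial covariance. A minimal numpy sketch with a synthetic steering vector and covariance (both hypothetical, not data from the paper's array):

```python
import numpy as np

rng = np.random.default_rng(0)
n_mics = 16

# Hypothetical unit-modulus steering vector toward the target speaker.
d = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, n_mics))

# Synthetic noise spatial covariance, Hermitian PD via diagonal loading.
A = rng.standard_normal((n_mics, n_mics)) + 1j * rng.standard_normal((n_mics, n_mics))
R = A @ A.conj().T + 1e-2 * np.eye(n_mics)

# MVDR weights: minimise output noise power w^H R w subject to the
# distortionless constraint w^H d = 1 in the look direction.
Rinv_d = np.linalg.solve(R, d)
w = Rinv_d / (d.conj() @ Rinv_d)

print(abs(w.conj() @ d))  # distortionless constraint holds: 1.0
```

A quick sanity check of the optimality: the delay-and-sum weights `d / (d.conj() @ d)` also satisfy the constraint, but yield at least as much output noise power as the MVDR weights.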
cs.RO / 4 / 2602.17822

Evolution of Safety Requirements in Industrial Robotics: Comparative Analysis of ISO 10218-1/2 (2011 vs. 2025) and Integration of ISO/TS 15066

Hartmann, Daniel, Hamříková, Kristýna, Vysocký, Aleš, Laciok, Vendula, Bernatík, Aleš
Abstract
Industrial robotics has established itself as an integral component of large-scale manufacturing enterprises. Simultaneously, collaborative robotics is gaining prominence, introducing novel paradigms of human-machine interaction. These advancements have necessitated a comprehensive revision of safety standards, specifically incorporating requirements for cybersecurity and protection against unauthorized access in networked robotic systems. This article presents a comparative analysis of the ISO 10218:2011 and ISO 10218:2025 standards, examining the evolution of their structure, terminology, technical requirements, and annexes. The analysis reveals significant expansions in functional safety and cybersecurity, the introduction of new classifications for robots and collaborative applications, and the normative integration of the technical specification ISO/TS 15066. Consequently, the new edition synthesizes mechanical, functional, and digital safety requirements, establishing a comprehensive framework for the design and operation of modern robotic systems.
cs.RO / 5 / 2602.17908

WHED: A Wearable Hand Exoskeleton for Natural, High-Quality Demonstration Collection

Zhu, Mingzhang, Zhu, Alvin, Ramos, Jose Victor S. H., Kim, Beom Jun, Shi, Yike, Wu, Yufeng, Hou, Ruochen, Wang, Quanyou, Song, Eric, Fan, Tony, Cui, Yuchen, Hong, Dennis W.
Abstract
Scalable learning of dexterous manipulation remains bottlenecked by the difficulty of collecting natural, high-fidelity human demonstrations of multi-finger hands due to occlusion, complex hand kinematics, and contact-rich interactions. We present WHED, a wearable hand-exoskeleton system designed for in-the-wild demonstration capture, guided by two principles: wearability-first operation for extended use and a pose-tolerant, free-to-move thumb coupling that preserves natural thumb behaviors while maintaining a consistent mapping to the target robot thumb degrees of freedom. WHED integrates a linkage-driven finger interface with passive fit accommodation, a modified passive hand with robust proprioceptive sensing, and an onboard sensing/power module. We also provide an end-to-end data pipeline that synchronizes joint encoders, AR-based end-effector pose, and wrist-mounted visual observations, and supports post-processing for time alignment and replay. We demonstrate feasibility on representative grasping and manipulation sequences spanning precision pinch and full-hand enclosure grasps, and show qualitative consistency between collected demonstrations and replayed executions.
cs.RO / 6 / 2602.17921

Latent Diffeomorphic Co-Design of End-Effectors for Deformable and Fragile Object Manipulation

Ikemura, Kei, Dong, Yifei, Pokorny, Florian T.
Abstract
Manipulating deformable and fragile objects remains a fundamental challenge in robotics due to complex contact dynamics and strict requirements on object integrity. Existing approaches typically optimize either end-effector design or control strategies in isolation, limiting achievable performance. In this work, we present the first co-design framework that jointly optimizes end-effector morphology and manipulation control for deformable and fragile object manipulation. We introduce (1) a latent diffeomorphic shape parameterization enabling expressive yet tractable end-effector geometry optimization, (2) a stress-aware bi-level co-design pipeline coupling morphology and control optimization, and (3) a privileged-to-pointcloud policy distillation scheme for zero-shot real-world deployment. We evaluate our approach on challenging food manipulation tasks, including grasping and pushing jelly and scooping fillets. Simulation and real-world experiments demonstrate the effectiveness of the proposed method.
cs.RO / 7 / 2602.17926

Homotopic information gain for sparse active target tracking

Wakulicz, Jennifer, Lee, Ki Myung Brian, Vidal-Calleja, Teresa, Fitch, Robert
Abstract
The problem of planning sensing trajectories for a mobile robot to collect observations of a target and predict its future trajectory is known as active target tracking. Enabled by probabilistic motion models, one may solve this problem by exploring the belief space of all trajectory predictions given future sensing actions to maximise information gain. However, for multi-modal motion models the notion of information gain is often ill-defined. This paper proposes a planning approach designed around maximising information regarding the target's homotopy class, or high-level motion. We introduce homotopic information gain, a measure of the expected high-level trajectory information given by a measurement. We show that homotopic information gain is a lower bound for metric or low-level information gain, and is as sparsely distributed in the environment as obstacles are. Planning sensing trajectories to maximise homotopic information results in highly accurate trajectory estimates with fewer measurements than a metric information approach, as supported by our empirical evaluation on real and simulated pedestrian data.
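The lower-bound claim has a simple coarse-graining intuition: grouping trajectory hypotheses into homotopy classes can only reduce entropy, so uncertainty about the class (and the information a measurement can provide about it) never exceeds the trajectory-level counterpart. A toy numerical check with a hypothetical belief, not the paper's formulation:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

# Hypothetical belief over six trajectory hypotheses...
traj_belief = np.array([0.30, 0.20, 0.15, 0.15, 0.10, 0.10])
# ...each assigned to a homotopy class (e.g. pass left / right of an obstacle).
classes = np.array([0, 0, 0, 1, 1, 1])

class_belief = np.array([traj_belief[classes == c].sum() for c in (0, 1)])

h_traj = entropy(traj_belief)    # "metric" (trajectory-level) uncertainty
h_class = entropy(class_belief)  # homotopy-class (high-level) uncertainty

print(h_class <= h_traj)  # coarse-graining can only lose entropy: True
```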
cs.RO / 8 / 2602.18014

Quasi-Periodic Gaussian Process Predictive Iterative Learning Control

Nigam, Unnati, Srivastava, Radhendushka, Marzbanrad, Faezeh, Burke, Michael
Abstract
Repetitive motion tasks are common in robotics, but performance can degrade over time due to environmental changes and robot wear and tear. Iterative learning control (ILC) improves performance by using information from previous iterations to compensate for expected errors in future iterations. This work incorporates the use of Quasi-Periodic Gaussian Processes (QPGPs) into a predictive ILC framework to model and forecast disturbances and drift across iterations. Using a recent structural equation formulation of QPGPs, the proposed approach enables efficient inference with complexity $\mathcal{O}(p^3)$ instead of $\mathcal{O}(i^2p^3)$, where $p$ denotes the number of points within an iteration and $i$ represents the total number of iterations, a saving that is especially significant for larger $i$. This formulation also enables parameter estimation without loss of information, making continual GP learning computationally feasible within the control loop. By predicting next-iteration error profiles rather than relying only on past errors, the controller achieves faster convergence and maintains it under time-varying disturbances. We benchmark the method against both standard ILC and conventional Gaussian Process (GP)-based predictive ILC on three tasks: autonomous vehicle trajectory tracking, a three-link robotic manipulator, and a real-world Stretch robot experiment. Across all cases, the proposed approach converges faster and remains robust under injected and natural disturbances while reducing computational cost. This highlights its practicality across a range of repetitive dynamical systems.
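For reference, the standard product-form quasi-periodic covariance that such models build on (an exactly periodic kernel multiplied by a slowly decaying squared-exponential) can be sketched as below; the paper's structural-equation formulation, which achieves the $\mathcal{O}(p^3)$ inference, is not reproduced here, and all hyperparameters are illustrative:

```python
import numpy as np

def quasi_periodic_kernel(t1, t2, period=1.0, ell_p=0.5, ell_d=5.0, var=1.0):
    """Quasi-periodic covariance: correlations repeat every `period` (one task
    iteration) but drift away over many iterations via the ell_d decay."""
    dt = np.abs(t1[:, None] - t2[None, :])
    periodic = np.exp(-2.0 * np.sin(np.pi * dt / period) ** 2 / ell_p**2)
    decay = np.exp(-0.5 * (dt / ell_d) ** 2)
    return var * periodic * decay

t = np.linspace(0.0, 3.0, 61)  # three "iterations" of a unit-period task
K = quasi_periodic_kernel(t, t)

# A point is strongly correlated with the same phase one period later (t=0 vs
# t=1) and weakly correlated with the opposite phase (t=0 vs t=0.5).
print(K[0, 0], K[0, 20], K[0, 10])
```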
cs.RO / 9 / 2602.18071

EgoPush: Learning End-to-End Egocentric Multi-Object Rearrangement for Mobile Robots

An, Boyuan, Wang, Zhexiong, Wang, Yipeng, Li, Jiaqi, Li, Sihang, Zhang, Jing, Feng, Chen
Abstract
Humans can rearrange objects in cluttered environments using egocentric perception, navigating occlusions without global coordinates. Inspired by this capability, we study long-horizon multi-object non-prehensile rearrangement for mobile robots using a single egocentric camera. We introduce EgoPush, a policy learning framework that enables egocentric, perception-driven rearrangement without relying on explicit global state estimation that often fails in dynamic scenes. EgoPush designs an object-centric latent space to encode relative spatial relations among objects, rather than absolute poses. This design enables a privileged reinforcement-learning (RL) teacher to jointly learn latent states and mobile actions from sparse keypoints, which is then distilled into a purely visual student policy. To reduce the supervision gap between the omniscient teacher and the partially observed student, we restrict the teacher's observations to visually accessible cues. This induces active perception behaviors that are recoverable from the student's viewpoint. To address long-horizon credit assignment, we decompose rearrangement into stage-level subproblems using temporally decayed, stage-local completion rewards. Extensive simulation experiments demonstrate that EgoPush significantly outperforms end-to-end RL baselines in success rate, with ablation studies validating each design choice. We further demonstrate zero-shot sim-to-real transfer on a mobile platform in the real world. Code and videos are available at https://ai4ce.github.io/EgoPush/.
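The stage-local, temporally decayed completion reward can be sketched in one line; the decay rate, bonus magnitude, and exact functional form below are assumptions for illustration, not the paper's reward:

```python
def stage_reward(stage_done, t_in_stage, decay=0.99, bonus=1.0):
    """Hypothetical stage-local completion reward: a bonus for finishing the
    current sub-stage that decays with the time spent inside it, encouraging
    prompt completion without leaking credit across stage boundaries."""
    return bonus * (decay ** t_in_stage) if stage_done else 0.0

# Finishing early is worth more than finishing late; no reward until completion.
print(stage_reward(True, 0), stage_reward(True, 100), stage_reward(False, 10))
```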
cs.RO / 10 / 2602.18097

Interacting safely with cyclists using Hamilton-Jacobi reachability and reinforcement learning

Noronha, Aarati Andrea, Oh, Jean
Abstract
In this paper, we present a framework for enabling autonomous vehicles to interact with cyclists in a manner that balances safety and optimality. The approach integrates Hamilton-Jacobi reachability analysis with deep Q-learning to jointly address safety guarantees and time-efficient navigation. A value function is computed as the solution to a time-dependent Hamilton-Jacobi-Bellman inequality, providing a quantitative measure of safety for each system state. This safety metric is incorporated as a structured reward signal within a reinforcement learning framework. The method further models the cyclist's latent response to the vehicle, allowing disturbance inputs to reflect human comfort and behavioral adaptation. The proposed framework is evaluated through simulation and comparison with human driving behavior and an existing state-of-the-art method.
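One common way to fold a Hamilton-Jacobi safety value into an RL reward is to penalize only states where the value goes negative, since the value function is positive on provably safe states and negative inside the unsafe backward reachable set. A hypothetical sketch; the weighting and the toy clearance-based value function are assumptions, not the paper's HJB-inequality solution:

```python
def shaped_reward(task_reward, safety_value, weight=5.0):
    """Structured reward: only safety violations (negative HJ value) are
    penalised, so safe behaviour keeps the unmodified task reward."""
    return task_reward + weight * min(safety_value, 0.0)

# Toy 1-D stand-in for V(s): lateral clearance to the cyclist minus a margin.
def toy_safety_value(clearance_m, margin_m=1.5):
    return clearance_m - margin_m

print(shaped_reward(1.0, toy_safety_value(2.0)))  # safe state: reward unchanged
print(shaped_reward(1.0, toy_safety_value(1.0)))  # unsafe state: penalised
```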
cs.RO / 11 / 2602.18164

GrandTour: A Legged Robotics Dataset in the Wild for Multi-Modal Perception and State Estimation

Frey, Jonas, Tuna, Turcan, Fu, Frank, Patterson, Katharine, Xu, Tianao, Fallon, Maurice, Cadena, Cesar, Hutter, Marco
Abstract
Accurate state estimation and multi-modal perception are prerequisites for autonomous legged robots in complex, large-scale environments. To date, no large-scale public legged-robot dataset captures the real-world conditions needed to develop and benchmark algorithms for legged-robot state estimation, perception, and navigation. To address this, we introduce the GrandTour dataset, a multi-modal legged-robotics dataset collected across challenging outdoor and indoor environments, featuring an ANYbotics ANYmal-D quadruped equipped with the Boxi multi-modal sensor payload. GrandTour spans a broad range of environments and operational scenarios across distinct test sites, ranging from alpine scenery and forests to demolished buildings and urban areas, and covers a wide variation in scale, complexity, illumination, and weather conditions. The dataset provides time-synchronized sensor data from spinning LiDARs, multiple RGB cameras with complementary characteristics, proprioceptive sensors, and stereo depth cameras. Moreover, it includes high-precision ground-truth trajectories from satellite-based RTK-GNSS and a Leica Geosystems total station. This dataset supports research in SLAM, high-precision state estimation, and multi-modal learning, enabling rigorous evaluation and development of new approaches to sensor fusion in legged robotic systems. With its extensive scope, GrandTour represents the largest open-access legged-robotics dataset to date. The dataset is available at https://grand-tour.leggedrobotics.com, on HuggingFace (ROS-independent), and in ROS formats, along with tools and demo resources.
cs.RO / 12 / 2602.18174

Have We Mastered Scale in Deep Monocular Visual SLAM? The ScaleMaster Dataset and Benchmark

Ju, Hyoseok, Suh, Bokeon, Kim, Giseop
Abstract
Recent advances in deep monocular visual Simultaneous Localization and Mapping (SLAM) have achieved impressive accuracy and dense reconstruction capabilities, yet their robustness to scale inconsistency in large-scale indoor environments remains largely unexplored. Existing benchmarks are limited to room-scale or structurally simple settings, leaving critical issues of intra-session scale drift and inter-session scale ambiguity insufficiently addressed. To fill this gap, we introduce the ScaleMaster Dataset, the first benchmark explicitly designed to evaluate scale consistency under challenging scenarios such as multi-floor structures, long trajectories, repetitive views, and low-texture regions. We systematically analyze the vulnerability of state-of-the-art deep monocular visual SLAM systems to scale inconsistency, providing both quantitative and qualitative evaluations. Crucially, our analysis extends beyond traditional trajectory metrics to include a direct map-to-map quality assessment using metrics like Chamfer distance against high-fidelity 3D ground truth. Our results reveal that while recent deep monocular visual SLAM systems demonstrate strong performance on existing benchmarks, they suffer from severe scale-related failures in realistic, large-scale indoor environments. By releasing the ScaleMaster dataset and baseline results, we aim to establish a foundation for future research toward developing scale-consistent and reliable visual SLAM systems.
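The Chamfer distance used for the map-to-map evaluation has a simple definition; the sketch below uses one common convention (mean nearest-neighbour distance summed over both directions; squared and otherwise-averaged variants also exist):

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point clouds P (N,3) and Q (M,3):
    mean nearest-neighbour distance from P to Q plus the same from Q to P."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# Toy check: a reconstructed map shifted by a constant offset from ground truth.
rng = np.random.default_rng(0)
gt = rng.standard_normal((100, 3))
est = gt + np.array([0.1, 0.0, 0.0])

print(chamfer_distance(gt, gt))   # identical clouds: 0.0
print(chamfer_distance(gt, est))  # grows with reconstruction error
```

The brute-force pairwise matrix is fine for small clouds; dense SLAM maps would typically use a KD-tree for the nearest-neighbour queries.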
cs.RO / 13 / 2602.18212

Design and Characterization of a Dual-DOF Soft Shoulder Exosuit with Volume-Optimized Pneumatic Actuator

Chen, Rui, Chiaradia, Domenico, Leonardis, Daniele, Frisoli, Antonio
Abstract
Portable pneumatic systems for 2 degree-of-freedom (DOF) soft shoulder exosuits remain underexplored, and face fundamental trade-offs between torque output and dynamic response that are further compounded by the need for multiple actuators to support complex shoulder movement. This work addresses these constraints through a volume-optimized spindle-shaped angled actuator (SSAA) geometry: by reducing actuator volume by 35.7% (357mL vs. 555mL), the SSAA maintains 94.2% of output torque while achieving 35.2% faster dynamic response compared to uniform cylindrical designs. Building on the SSAA, we develop a curved abduction actuator (CAA) based on the SSAA geometry and a horizontal adduction actuator (HAA) based on the pouch motor principle, integrating both into a dual-DOF textile-based shoulder exosuit (390 g). The exosuit delivers multi-modal assistance spanning shoulder abduction, flexion, and horizontal adduction, depending on the actuation. User studies with 10 healthy participants reveal that the exosuit substantially reduces electromyographic (EMG) activity across both shoulder abduction and flexion tasks. For abduction with HAA only, the exosuit achieved up to 59% muscle activity reduction across seven muscles. For flexion, both the single-actuator configuration (HAA only) and the dual-actuator configuration (HAA + CAA) reduced EMG activity by up to 63.7% compared to no assistance. However, the incremental benefit of adding the CAA to existing HAA support was limited in healthy users during flexion, with statistically significant additional reductions observed only in pectoralis major. These experimental findings characterize actuator contributions in healthy users and provide design guidance for multi-DOF exosuit systems.
cs.RO / 14 / 2602.18224

SimVLA: A Simple VLA Baseline for Robotic Manipulation

Luo, Yuankai, Chen, Woping, Liang, Tong, Wang, Baiqiao, Li, Zhenguo
Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic manipulation, leveraging large-scale pre-training to achieve strong performance. The field has rapidly evolved with additional spatial priors and diverse architectural innovations. However, these advancements are often accompanied by varying training recipes and implementation details, which can make it challenging to disentangle the precise source of empirical gains. In this work, we introduce SimVLA, a streamlined baseline designed to establish a transparent reference point for VLA research. By strictly decoupling perception from control, using a standard vision-language backbone and a lightweight action head, and standardizing critical training dynamics, we demonstrate that a minimal design can achieve state-of-the-art performance. Despite having only 0.5B parameters, SimVLA outperforms multi-billion-parameter models on standard simulation benchmarks without robot pretraining. SimVLA also reaches on-par real-robot performance compared to pi0.5. Our results establish SimVLA as a robust, reproducible baseline that enables clear attribution of empirical gains to future architectural innovations. Website: https://frontierrobo.github.io/SimVLA
cs.RO / 15 / 2602.18258

RoEL: Robust Event-based 3D Line Reconstruction

Bae, Gwangtak, Shin, Jaeho, Kang, Seunggu, Kim, Junho, Kim, Ayoung, Kim, Young Min
Abstract
Event cameras in motion tend to detect object boundaries or texture edges, which produce lines of brightness changes, especially in man-made environments. While lines can constitute a robust intermediate representation that is consistently observed, the sparse nature of lines may lead to drastic deterioration with minor estimation errors. Only a few previous works, often accompanied by additional sensors, utilize lines to compensate for the severe domain discrepancies of event sensors along with unpredictable noise characteristics. We propose a method that can stably extract tracks of varying appearances of lines using a clever algorithmic process that observes multiple representations from various time slices of events, compensating for potential adversaries within the event data. We then propose geometric cost functions that can refine the 3D line maps and camera poses, eliminating projective distortions and depth ambiguities. The 3D line maps are highly compact and can be equipped with our proposed cost function, which can be adapted for any observations that can detect and extract line structures or projections of them, including 3D point cloud maps or image observations. We demonstrate that our formulation is powerful enough to exhibit a significant performance boost in event-based mapping and pose refinement across diverse datasets, and can be flexibly applied to multimodal scenarios. Our results confirm that the proposed line-based formulation is a robust and effective approach for the practical deployment of event-based perceptual modules. Project page: https://gwangtak.github.io/roel/
cs.RO / 16 / 2602.18260

Role-Adaptive Collaborative Formation Planning for Team of Quadruped Robots in Cluttered Environments

Norén, Magnus, Stamatopoulos, Marios-Nektarios, Banerjee, Avijit, Nikolakopoulos, George
Abstract
This paper presents a role-adaptive Leader-Follower-based formation planning and control framework for teams of quadruped robots operating in cluttered environments. Unlike conventional methods with fixed leaders or rigid formation roles, the proposed approach integrates dynamic role assignment and partial goal planning, enabling flexible, collision-free navigation in complex scenarios. Formation stability and inter-robot safety are ensured through a virtual spring-damper system coupled with a novel obstacle avoidance layer that adaptively adjusts each agent's velocity. A dynamic look-ahead reference generator further enhances flexibility, allowing temporary formation deformation to maneuver around obstacles while maintaining goal-directed motion. The Fast Marching Square (FM2) algorithm provides the global path for the leader and local paths for the followers as the planning backbone. The framework is validated through extensive simulations and real-world experiments with teams of quadruped robots. Results demonstrate smooth coordination, adaptive role switching, and robust formation maintenance in complex, unstructured environments. A video featuring the simulation and physical experiments along with their associated visualizations can be found at https://youtu.be/scq37Tua9W4.
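The virtual spring-damper coupling can be sketched in isolation; gains, the unit-mass assumption, and the toy integration loop below are illustrative, and the paper's adaptive obstacle avoidance layer is not shown:

```python
import numpy as np

def spring_damper_force(pos, vel, desired_pos, k=4.0, c=2.0):
    """Virtual spring-damper between a follower and its formation slot: the
    spring pulls toward the slot, the damper opposes the follower's velocity."""
    return -k * (pos - desired_pos) - c * vel

# Toy 2-D follower converging to its slot under simple Euler integration.
pos = np.array([2.0, 0.0])
vel = np.zeros(2)
slot = np.array([0.0, 0.0])
dt = 0.01
for _ in range(2000):
    acc = spring_damper_force(pos, vel, slot)  # unit mass: force = acceleration
    vel += dt * acc
    pos += dt * vel

print(np.linalg.norm(pos - slot) < 1e-2)  # converged to the slot: True
```

With these gains the damping ratio is 0.5 (underdamped), so the follower overshoots slightly before settling, which is the kind of compliant, deformation-tolerant behaviour the framework exploits.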
Chinese Translation
本文提出了一种基于领导者-跟随者的角色自适应队形规划与控制框架,适用于在复杂环境中操作的四足机器人团队。与传统的固定领导者或刚性队形角色方法不同,所提出的方法集成了动态角色分配和部分目标规划,使得在复杂场景中能够灵活、无碰撞地导航。通过与新颖的障碍物避让层相结合的虚拟弹簧-阻尼器系统,确保了队形的稳定性和机器人间的安全,该避让层可自适应地调整每个智能体的速度。动态前瞻参考生成器进一步增强了灵活性,允许在保持目标导向运动的同时,临时变形以绕过障碍物。快速行进平方(Fast Marching Square, FM2)算法作为规划主干,为领导者提供全局路径,为跟随者提供局部路径。通过广泛的仿真和四足机器人团队的实际实验验证了该框架。结果表明,在复杂、非结构化环境中实现了平滑的协调、自适应的角色切换和稳健的队形维护。相关的仿真和物理实验视频及其可视化内容可在 https://youtu.be/scq37Tua9W4 找到。
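As a sketch of the virtual spring-damper coupling described above, the per-agent update can be written in a few lines of Python. This is a minimal illustration under assumed values: the scalar 2D form, the gains, and the speed clamp standing in for the adaptive obstacle-avoidance layer are all hypothetical, not details from the paper.

```python
import math

def spring_damper_velocity(pos, vel, target, k=2.0, c=1.5):
    """One follower's acceleration toward its formation slot via a
    virtual spring (stiffness k) with damping (gain c)."""
    ax = k * (target[0] - pos[0]) - c * vel[0]
    ay = k * (target[1] - pos[1]) - c * vel[1]
    return ax, ay

def clamp_speed(vx, vy, v_max):
    """Cap the commanded velocity norm, mimicking a safety layer that
    adaptively limits each agent's speed near obstacles."""
    speed = math.hypot(vx, vy)
    if speed > v_max:
        scale = v_max / speed
        return vx * scale, vy * scale
    return vx, vy
```

In a full framework, the target slot itself would move with the look-ahead reference generator, letting the formation deform temporarily around obstacles.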
cs.RO / 17 / 2602.18312

Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty

基于动作雅可比惩罚学习平滑的时变线性策略
Xie, Zhaoming, Karol, Kevin, Hodgins, Jessica
Abstract
Reinforcement learning provides a framework for learning control policies that can reproduce diverse motions for simulated characters. However, such policies often exploit unnatural high-frequency signals that are unachievable by humans or physical robots, making them poor representations of real-world behaviors. Existing work addresses this issue by adding a reward term that penalizes a large change in actions over time. This term often requires substantial tuning efforts. We propose to use the action Jacobian penalty, which penalizes changes in action with respect to the changes in simulated state directly through auto differentiation. This effectively eliminates unrealistic high-frequency control signals without task specific tuning. While effective, the action Jacobian penalty introduces significant computational overhead when used with traditional fully connected neural network architectures. To mitigate this, we introduce a new architecture called a Linear Policy Net (LPN) that significantly reduces the computational burden for calculating the action Jacobian penalty during training. In addition, a LPN requires no parameter tuning, exhibits faster learning convergence compared to baseline methods, and can be more efficiently queried during inference time compared to a fully connected neural network. We demonstrate that a Linear Policy Net, combined with the action Jacobian penalty, is able to learn policies that generate smooth signals while solving a number of motion imitation tasks with different characteristics, including dynamic motions such as a backflip and various challenging parkour skills. Finally, we apply this approach to create policies for dynamic motions on a physical quadrupedal robot equipped with an arm.
Chinese Translation
强化学习为学习控制策略提供了一个框架,这些策略能够为模拟角色再现多样的运动。然而,这些策略往往利用了人类或物理机器人无法实现的不自然高频信号,从而使其成为现实世界行为的较差表征。现有研究通过添加一个奖励项来惩罚动作随时间的大幅变化,以解决这个问题。该奖励项通常需要大量的调参工作。我们建议使用动作雅可比惩罚,它通过自动微分直接惩罚动作相对于仿真状态变化的变化。这有效消除了不现实的高频控制信号,而无需特定任务的调优。尽管有效,动作雅可比惩罚在与传统的全连接神经网络架构结合使用时会引入显著的计算开销。为此,我们提出了一种新的架构,称为线性策略网络(Linear Policy Net, LPN),它显著减少了训练过程中计算动作雅可比惩罚的计算负担。此外,LPN不需要参数调优,与基线方法相比展现出更快的学习收敛速度,并且在推理时相比全连接神经网络可以更高效地查询。我们证明了线性策略网络结合动作雅可比惩罚能够学习生成平滑信号的策略,同时解决多个具有不同特征的运动模仿任务,包括后空翻等动态运动和各种具有挑战性的跑酷技能。最后,我们将该方法应用于为配备机械臂的物理四足机器人创建动态运动策略。
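The action Jacobian penalty can be illustrated with a small Python sketch. The paper computes ∂a/∂s by automatic differentiation; the finite-difference helper below is a dependency-free stand-in, and the example policy's parameters are made up. For a time-varying linear policy a = K_t s + b_t the Jacobian is just K_t, which is the structural reason a Linear Policy Net makes the penalty cheap.

```python
def linear_policy(K, b, s):
    """Time-varying linear policy at one timestep: a = K s + b."""
    return [sum(K[i][j] * s[j] for j in range(len(s))) + b[i]
            for i in range(len(b))]

def action_jacobian_fd(policy, s, eps=1e-5):
    """Finite-difference Jacobian da/ds of an arbitrary policy.
    (Training would use autodiff; finite differences keep the
    sketch self-contained.)"""
    a0 = policy(s)
    jac = [[0.0] * len(s) for _ in a0]
    for j in range(len(s)):
        sp = list(s)
        sp[j] += eps
        a1 = policy(sp)
        for i in range(len(a0)):
            jac[i][j] = (a1[i] - a0[i]) / eps
    return jac

def jacobian_penalty(jac):
    """Squared Frobenius norm of da/ds: the smoothness penalty."""
    return sum(v * v for row in jac for v in row)
```

For `linear_policy` the Jacobian is the constant matrix K, so the penalty reduces to the squared Frobenius norm of K with no extra differentiation pass.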
cs.RO / 18 / 2602.18330

Tendon-Driven Reciprocating and Non-Reciprocating Motion via Snapping Metabeams

基于突跳超材料梁的腱驱动往复与非往复运动
Jafarpour, Mohsen, Yüksek, Ayberk, Eshghi, Shahab, Gorb, Stanislav, Milana, Edoardo
Abstract
Snapping beams enable rapid geometric transitions through nonlinear instability, offering an efficient means of generating motion in soft robotic systems. In this study, a tendon-driven mechanism consisting of spiral-based metabeams was developed to exploit this principle for producing both reciprocating and non-reciprocating motion. The snapping structures were fabricated using fused deposition modeling with polylactic acid (PLA) and experimentally tested under different boundary conditions to analyze their nonlinear behavior. The results show that the mechanical characteristics, including critical forces and stability, can be tuned solely by adjusting the boundary constraints. The spiral geometry allows large reversible deformation even when made from a relatively stiff material such as PLA, providing a straightforward design concept for controllable snapping behavior. The developed mechanism was further integrated into a swimming robot, where tendon-driven fins exhibited two distinct actuation modes: reciprocating and non-reciprocating motion. The latter enabled efficient propulsion, producing a forward displacement of about 32 mm per 0.4 s cycle ($\approx$ 81 mm/s, equivalent to 0.4 body lengths per second). This study highlights the potential of geometry-driven snapping structures for efficient and programmable actuation in soft robotic systems.
Chinese Translation
突跳梁通过非线性失稳实现快速的几何转变,为软体机器人系统中的运动生成提供了一种高效途径。本研究开发了一种由基于螺旋的超材料梁组成的腱驱动机构,利用这一原理产生往复与非往复运动。突跳结构采用聚乳酸(PLA)通过熔融沉积成型技术制造,并在不同边界条件下进行实验测试,以分析其非线性行为。结果表明,包括临界力和稳定性在内的机械特性仅通过调整边界约束即可调节。螺旋几何即使采用PLA这类相对刚硬的材料也能实现大幅可逆变形,为可控突跳行为提供了一种简洁的设计思路。所开发的机构进一步集成到一款游泳机器人中,其中腱驱动的鳍展现出两种不同的驱动模式:往复运动与非往复运动。后者实现了高效推进,每0.4秒周期产生约32毫米的前向位移(约81毫米/秒,相当于每秒0.4个体长)。本研究突显了几何驱动的突跳结构在软体机器人系统中实现高效、可编程驱动的潜力。
cs.RO / 19 / 2602.18344

Downwash-aware Configuration Optimization for Modular Aerial Systems

考虑下洗气流的模块化空中系统配置优化
Li, Mengguang, Koeppl, Heinz
Abstract
This work proposes a framework that generates and optimally selects task-specific assembly configurations for a large group of homogeneous modular aerial systems, explicitly enforcing bounds on inter-module downwash. Prior work largely focuses on planar layouts and often ignores aerodynamic interference. In contrast, firstly we enumerate non-isomorphic connection topologies at scale; secondly, we solve a nonlinear program to check feasibility and select the configuration that minimizes control input subject to actuation limits and downwash constraints. We evaluate the framework in physics-based simulation and demonstrate it in real-world experiments.
Chinese Translation
本研究提出了一种框架,用于为大规模同质模块化空中系统生成并优化选择特定任务的组装配置,并显式施加模块间下洗气流的约束。以往的研究主要集中在平面布局上,且通常忽视空气动力学干扰。相比之下,我们首先大规模枚举非同构连接拓扑;其次,我们求解一个非线性规划问题,以检查可行性,并在执行器限制和下洗气流约束下选择使控制输入最小化的配置。我们在基于物理的仿真中评估了该框架,并在真实实验中进行了演示。
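The enumeration stage can be illustrated with a planar toy version (the paper handles more general assemblies): grow layouts one module at a time and deduplicate by a canonical form under the eight grid symmetries, so isomorphic configurations are counted once.

```python
def normalize(cells):
    """Shift a set of grid cells so its bounding box starts at (0, 0)."""
    mx = min(x for x, _ in cells)
    my = min(y for _, y in cells)
    return frozenset((x - mx, y - my) for x, y in cells)

def canonical(cells):
    """Canonical form under the 8 grid symmetries (4 rotations x mirror)."""
    variants = []
    cur = cells
    for _ in range(4):
        cur = frozenset((y, -x) for x, y in cur)  # rotate 90 degrees
        variants.append(normalize(cur))
        variants.append(normalize(frozenset((-x, y) for x, y in cur)))
    return min(variants, key=lambda s: sorted(s))

def enumerate_layouts(n):
    """All non-isomorphic edge-connected layouts of n identical modules."""
    layouts = {canonical(frozenset([(0, 0)]))}
    for _ in range(n - 1):
        grown = set()
        for shape in layouts:
            for (x, y) in shape:
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    cell = (x + dx, y + dy)
                    if cell not in shape:
                        grown.add(canonical(frozenset(shape | {cell})))
        layouts = grown
    return layouts
```

For n = 4 this yields the 5 free tetrominoes; a pipeline like the paper's would then screen each surviving topology with the nonlinear feasibility program.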
cs.RO / 20 / 2602.18374

Zero-shot Interactive Perception

零样本交互感知
Sripada, Venkatesh, Guerin, Frank, Ghalamzan, Amir
Abstract
Interactive perception (IP) enables robots to extract hidden information in their workspace and execute manipulation plans by physically interacting with objects and altering the state of the environment -- crucial for resolving occlusions and ambiguity in complex, partially observable scenarios. We present Zero-Shot IP (ZS-IP), a novel framework that couples multi-strategy manipulation (pushing and grasping) with a memory-driven Vision Language Model (VLM) to guide robotic interactions and resolve semantic queries. ZS-IP integrates three key components: (1) an Enhanced Observation (EO) module that augments the VLM's visual perception with both conventional keypoints and our proposed pushlines -- a novel 2D visual augmentation tailored to pushing actions, (2) a memory-guided action module that reinforces semantic reasoning through context lookup, and (3) a robotic controller that executes pushing, pulling, or grasping based on VLM output. Unlike grid-based augmentations optimized for pick-and-place, pushlines capture affordances for contact-rich actions, substantially improving pushing performance. We evaluate ZS-IP on a 7-DOF Franka Panda arm across diverse scenes with varying occlusions and task complexities. Our experiments demonstrate that ZS-IP outperforms passive and viewpoint-based perception techniques such as Mark-Based Visual Prompting (MOKA), particularly in pushing tasks, while preserving the integrity of non-target elements.
Chinese Translation
交互感知(Interactive Perception, IP)使机器人能够在其工作空间中提取隐藏信息,并通过与物体的物理交互和改变环境状态来执行操作计划——这对于解决复杂的部分可观测场景中的遮挡和歧义至关重要。我们提出了零样本交互感知(ZS-IP),这是一个新颖的框架,将多策略操作(推和抓)与基于记忆的视觉语言模型(VLM)相结合,以指导机器人交互并解决语义查询。ZS-IP集成了三个关键组件:(1)增强观察(EO)模块,通过常规关键点和我们提出的推线(pushlines)增强VLM的视觉感知——推线是一种针对推动作的新型二维视觉增强,(2)基于记忆的动作模块,通过上下文查找强化语义推理,以及(3)机器人控制器,根据VLM输出执行推、拉或抓取。与针对拾取和放置优化的网格增强不同,推线捕捉了接触密集型动作的可供性,显著提高了推的性能。我们在具有不同遮挡和任务复杂度的多样场景中,使用7自由度的Franka Panda臂评估ZS-IP。我们的实验表明,ZS-IP在推任务中优于被动和基于视角的感知技术,如基于标记的视觉提示(MOKA),同时保持非目标元素的完整性。
cs.RO / 21 / 2602.18379

Ori-Sense: origami capacitive sensing for soft robotic applications

Ori-Sense:用于软机器人应用的折纸电容传感器
Oliveira, Hugo de Souza, Li, Xin, Jafarpour, Mohsen, Milana, Edoardo
Abstract
This work introduces Ori-Sense, a compliant capacitive sensor inspired by the inverted Kresling origami pattern. The device translates torsional deformation into measurable capacitance changes, enabling proprioceptive feedback for soft robotic systems. Using dissolvable-core molding, we fabricated a monolithic silicone structure with embedded conductive TPU electrodes, forming an integrated soft capacitor. Mechanical characterization revealed low stiffness and minimal impedance, with torque values below 0.01 N mm for axial displacements between -15 mm and 15 mm, and up to 0.03 N mm at 30 degrees twist under compression. Finite-element simulations confirmed localized stresses along fold lines and validated the measured torque-rotation response. Electrical tests showed consistent capacitance modulation up to 30%, directly correlated with the twist angle, and maximal sensitivity of S_theta ~ 0.0067 pF/deg at 5 mm of axial deformation.
Chinese Translation
本研究介绍了Ori-Sense,一种受倒Kresling折纸图案启发的柔顺电容传感器。该设备将扭转变形转化为可测量的电容变化,从而为软体机器人系统提供本体感知反馈。通过可溶芯模成型技术,我们制造了一种内嵌导电TPU电极的单体硅胶结构,构成一个集成的软电容器。机械表征显示其刚度低、阻抗极小:在-15 mm到15 mm的轴向位移下,扭矩值低于0.01 N mm;在压缩状态下扭转30度时,扭矩值最高约0.03 N mm。有限元模拟确认了沿折叠线的局部应力,并验证了测得的扭矩-旋转响应。电气测试显示电容调制稳定,最高可达30%,与扭转角度直接相关;在5 mm轴向变形下,最大灵敏度约为 S_theta ~ 0.0067 pF/deg。
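Reading a twist angle back from the measured capacitance amounts to inverting the reported sensitivity. A minimal sketch, assuming the response stays in the linear regime reported at 5 mm axial deformation (the rest-capacitance value in the example is made up):

```python
def twist_angle_from_capacitance(c_measured, c_rest, sensitivity=0.0067):
    """Estimate twist angle (degrees) from a capacitance change (pF),
    inverting the reported ~0.0067 pF/deg sensitivity. Assumes a
    linear capacitance-angle response over the operating range."""
    return (c_measured - c_rest) / sensitivity
```

In a proprioceptive feedback loop, `c_rest` would be recalibrated whenever the axial preload changes, since the sensitivity itself depends on axial deformation.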
cs.RO / 22 / 2602.18386

Learning to Tune Pure Pursuit in Autonomous Racing: Joint Lookahead and Steering-Gain Control with PPO

在自主赛车中学习调整纯追踪:基于PPO的联合前瞻与转向增益控制
Elgouhary, Mohamed, El-Wakeel, Amr S.
Abstract
Pure Pursuit (PP) is widely used in autonomous racing for real-time path tracking due to its efficiency and geometric clarity, yet performance is highly sensitive to how key parameters-lookahead distance and steering gain-are chosen. Standard velocity-based schedules adjust these only approximately and often fail to transfer across tracks and speed profiles. We propose a reinforcement-learning (RL) approach that jointly chooses the lookahead Ld and a steering gain g online using Proximal Policy Optimization (PPO). The policy observes compact state features (speed and curvature taps) and outputs (Ld, g) at each control step. Trained in F1TENTH Gym and deployed in a ROS 2 stack, the policy drives PP directly (with light smoothing) and requires no per-map retuning. Across simulation and real-car tests, the proposed RL-PP controller that jointly selects (Ld, g) consistently outperforms fixed-lookahead PP, velocity-scheduled adaptive PP, and an RL lookahead-only variant, and it also exceeds a kinematic MPC raceline tracker under our evaluated settings in lap time, path-tracking accuracy, and steering smoothness, demonstrating that policy-guided parameter tuning can reliably improve classical geometry-based control.
Chinese Translation
纯追踪(Pure Pursuit, PP)因其高效性和几何清晰性而广泛应用于自主赛车的实时路径跟踪,但其性能对关键参数(前瞻距离和转向增益)的选择极为敏感。标准的基于速度的调度方法仅能大致调整这些参数,且常常无法在不同赛道和速度特征之间有效迁移。我们提出了一种强化学习(Reinforcement Learning, RL)方法,通过近端策略优化(Proximal Policy Optimization, PPO)在线联合选择前瞻距离 Ld 和转向增益 g。该策略在每个控制步骤中观察紧凑的状态特征(速度和曲率信息),并输出(Ld, g)。在F1TENTH Gym中训练并部署在ROS 2框架中,该策略直接驱动PP(轻微平滑处理),无需针对每个地图进行重新调优。在仿真和真实车辆测试中,所提出的RL-PP控制器联合选择(Ld, g)始终优于固定前瞻PP、基于速度调度的自适应PP以及仅考虑前瞻的RL变体,并且在我们评估的设置下,其在圈速、路径跟踪精度和转向平滑性方面也超越了运动学模型预测控制(MPC)赛道跟踪器,证明了基于策略的参数调优能够可靠地改善经典的基于几何的控制方法。
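The controller being tuned is the classic pure pursuit law, δ = g · atan(2 L sin α / Ld), where α is the angle to the lookahead point and L the wheelbase. The sketch below pairs it with a velocity-based lookahead schedule of the kind the RL policy replaces; the gain and clamp values are illustrative, not the paper's.

```python
import math

def pure_pursuit_steering(alpha, wheelbase, lookahead, gain=1.0):
    """Pure pursuit steering angle, scaled by a tunable gain g:
    delta = g * atan(2 L sin(alpha) / Ld)."""
    return gain * math.atan2(2.0 * wheelbase * math.sin(alpha), lookahead)

def scheduled_lookahead(speed, k=0.4, ld_min=0.6, ld_max=3.0):
    """Velocity-based schedule of the kind the RL policy replaces:
    lookahead grows linearly with speed, clipped to a safe range."""
    return min(max(k * speed, ld_min), ld_max)
```

The proposed approach would instead have the PPO policy output both Ld and g at each control step from speed and curvature features, rather than deriving Ld from a fixed schedule.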
cs.RO / 23 / 2602.18397

How Fast Can I Run My VLA? Demystifying VLA Inference Performance with VLA-Perf

我能多快运行我的 VLA?用 VLA-Perf 揭秘 VLA 推理性能
Jiang, Wenqi, Clemons, Jason, Sankaralingam, Karu, Kozyrakis, Christos
Abstract
Vision-Language-Action (VLA) models have recently demonstrated impressive capabilities across various embodied AI tasks. While deploying VLA models on real-world robots imposes strict real-time inference constraints, the inference performance landscape of VLA remains poorly understood due to the large combinatorial space of model architectures and inference systems. In this paper, we ask a fundamental research question: How should we design future VLA models and systems to support real-time inference? To address this question, we first introduce VLA-Perf, an analytical performance model that can analyze inference performance for arbitrary combinations of VLA models and inference systems. Using VLA-Perf, we conduct the first systematic study of the VLA inference performance landscape. From a model-design perspective, we examine how inference performance is affected by model scaling, model architectural choices, long-context video inputs, asynchronous inference, and dual-system model pipelines. From the deployment perspective, we analyze where VLA inference should be executed -- on-device, on edge servers, or in the cloud -- and how hardware capability and network performance jointly determine end-to-end latency. By distilling 15 key takeaways from our comprehensive evaluation, we hope this work can provide practical guidance for the design of future VLA models and inference systems.
Chinese Translation
视觉-语言-动作(VLA)模型最近在各种具身人工智能任务中展现了令人印象深刻的能力。然而,在真实世界的机器人上部署 VLA 模型时,面临严格的实时推理约束,VLA 的推理性能景观由于模型架构和推理系统的组合空间庞大而仍然不甚了解。本文提出了一个基本的研究问题:我们应如何设计未来的 VLA 模型和系统以支持实时推理?为了解决这个问题,我们首先介绍了 VLA-Perf,一个可以分析任意组合的 VLA 模型和推理系统推理性能的分析性能模型。借助 VLA-Perf,我们首次系统性地研究了 VLA 推理性能的全景。从模型设计的角度,我们考察了模型缩放、模型架构选择、长时序视频输入、异步推理和双系统模型管道等因素如何影响推理性能。从部署的角度,我们分析了 VLA 推理应在何处执行——在设备上、边缘服务器上还是云端——以及硬件能力和网络性能如何共同决定端到端延迟。通过从我们的综合评估中提炼出 15 个关键要点,我们希望这项工作能为未来 VLA 模型和推理系统的设计提供实用指导。
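The abstract does not spell out VLA-Perf's internal cost model, but analytical latency estimators of this kind are commonly built from a roofline bound per layer plus a network term for off-device deployment. A hedged sketch under those assumptions (not VLA-Perf's actual formulas):

```python
def layer_latency(flops, bytes_moved, peak_flops, mem_bw):
    """Roofline estimate for one layer: latency is set by whichever is
    slower, the compute time or the memory-transfer time."""
    return max(flops / peak_flops, bytes_moved / mem_bw)

def end_to_end_latency(layers, peak_flops, mem_bw,
                       net_bytes=0.0, net_bw=float("inf"), rtt=0.0):
    """Sum per-layer roofline latencies; add a round-trip and transfer
    term when inference runs off-device (edge server or cloud)."""
    compute = sum(layer_latency(f, b, peak_flops, mem_bw) for f, b in layers)
    return compute + rtt + net_bytes / net_bw
```

Such a model makes the on-device/edge/cloud trade-off explicit: cloud hardware shrinks the compute term while the rtt and transfer terms grow with observation size.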
cs.RO / 24 / 2602.18421

Snapping Actuators with Asymmetric and Sequenced Motion

具有非对称与顺序运动的突跳驱动器
Li, Xin, Jin, Ye, Jafarpour, Mohsen, Oliveira, Hugo de Souza, Milana, Edoardo
Abstract
Snapping instabilities in soft structures offer a powerful pathway to achieve rapid and energy-efficient actuation. In this study, an eccentric dome-shaped snapping actuator is developed to generate controllable asymmetric motion through geometry-induced instability. Finite element simulations and experiments reveal consistent asymmetric deformation and the corresponding pressure characteristics. By coupling four snapping actuators in a pneumatic network, a compact quadrupedal robot achieves coordinated wavelike locomotion using only a single pressure input. The robot exhibits frequency-dependent performance with a maximum speed of 72.78~mm/s at 7.5~Hz. These findings demonstrate the potential of asymmetric snapping mechanisms for physically controlled actuation and lay the groundwork for fully untethered and efficient soft robotic systems.
Chinese Translation
软结构中的突跳失稳为实现快速且节能的驱动提供了一条有效途径。本研究开发了一种偏心圆顶形突跳驱动器,通过几何诱导的失稳产生可控的非对称运动。有限元模拟和实验揭示了一致的非对称变形及相应的压力特性。通过在气动网络中耦合四个突跳驱动器,一个紧凑的四足机器人仅用单一压力输入即可实现协调的波浪式运动。该机器人表现出频率依赖的性能,在7.5 Hz时达到最大速度72.78 mm/s。这些发现展示了非对称突跳机制在物理控制驱动中的潜力,并为完全无缆且高效的软体机器人系统奠定了基础。
计算机视觉 (Computer Vision)
43
cs.CV / 1 / 2602.17768

KPM-Bench: A Kinematic Parsing Motion Benchmark for Fine-grained Motion-centric Video Understanding

KPM-Bench:一个用于细粒度运动中心视频理解的运动学解析基准
Lin, Boda, Zhu, Yongjie, Gong, Xiaocheng, Qin, Wenyu, Wang, Meng
Abstract
Despite recent advancements, video captioning models still face significant limitations in accurately describing fine-grained motion details and suffer from severe hallucination issues. These challenges become particularly prominent when generating captions for motion-centric videos, where precise depiction of intricate movements and limb dynamics is crucial yet often neglected. To alleviate this gap, we introduce an automated annotation pipeline that integrates kinematic-based motion computation with linguistic parsing, enabling detailed decomposition and description of complex human motions. Based on this pipeline, we construct and release the Kinematic Parsing Motion Benchmark (KPM-Bench), a novel open-source dataset designed to facilitate fine-grained motion understanding. KPM-Bench consists of (i) fine-grained video-caption pairs that comprehensively illustrate limb-level dynamics in complex actions, (ii) diverse and challenging question-answer pairs focusing specifically on motion understanding, and (iii) a meticulously curated evaluation set specifically designed to assess hallucination phenomena associated with motion descriptions. Furthermore, to address hallucination issues systematically, we propose the linguistically grounded Motion Parsing and Extraction (MoPE) algorithm, capable of accurately extracting motion-specific attributes directly from textual captions. Leveraging MoPE, we introduce a precise hallucination evaluation metric that functions independently of large-scale vision-language or language-only models. By integrating MoPE into the GRPO post-training framework, we effectively mitigate hallucination problems, significantly improving the reliability of motion-centric video captioning models.
Chinese Translation
尽管近年来取得了进展,视频字幕生成模型在准确描述细粒度运动细节方面仍面临重大限制,并且存在严重的幻觉问题。当为运动中心视频生成字幕时,这些挑战尤为突出,因为对复杂运动和肢体动态的精确描绘至关重要,但往往被忽视。为了解决这一问题,我们引入了一种自动化注释流程,该流程将基于运动学的运动计算与语言解析相结合,使复杂的人类运动能够进行详细的分解和描述。基于该流程,我们构建并发布了运动学解析运动基准(KPM-Bench),这是一个旨在促进细粒度运动理解的新型开源数据集。KPM-Bench包括(i)细粒度视频-字幕对,全面展示复杂动作中的肢体动态,(ii)多样且具有挑战性的问题-答案对,专注于运动理解,以及(iii)一个精心策划的评估集,专门设计用于评估与运动描述相关的幻觉现象。此外,为了系统性地解决幻觉问题,我们提出了语言基础的运动解析与提取(MoPE)算法,能够从文本字幕中准确提取运动特定属性。利用MoPE,我们引入了一种精确的幻觉评估指标,该指标独立于大规模视觉-语言或仅语言模型。通过将MoPE整合到GRPO后训练框架中,我们有效减轻了幻觉问题,显著提高了运动中心视频字幕生成模型的可靠性。
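A toy version of MoPE-style attribute extraction and the resulting hallucination metric can be sketched with a small vocabulary. The real algorithm is a proper linguistic parse; the word lists and the sentence-level co-occurrence rule here are illustrative assumptions, not the paper's method.

```python
BODY_PARTS = {"arm", "leg", "hand", "foot", "knee", "elbow", "hip", "head"}
MOTION_VERBS = {"raises", "bends", "swings", "extends", "rotates", "lifts"}

def extract_motion_attributes(caption):
    """Collect (verb, body-part) pairs that co-occur within a sentence.
    A stand-in for MoPE's motion-specific attribute extraction."""
    pairs = set()
    for sentence in caption.lower().split("."):
        words = sentence.replace(",", " ").split()
        verbs = [w for w in words if w in MOTION_VERBS]
        parts = [w for w in words if w in BODY_PARTS]
        pairs.update((v, p) for v in verbs for p in parts)
    return pairs

def hallucination_rate(predicted, reference):
    """Fraction of predicted motion attributes absent from the reference
    caption: no vision-language or language-only judge model needed."""
    pred = extract_motion_attributes(predicted)
    ref = extract_motion_attributes(reference)
    if not pred:
        return 0.0
    return len(pred - ref) / len(pred)
```

Because the metric is a set comparison over parsed attributes, it can also serve as a cheap reward signal in a post-training loop, which is how the paper uses MoPE inside GRPO.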
cs.CV / 2 / 2602.17770

CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild

CLUTCH:解锁野外文本条件手部动作建模的上下文化语言模型
Thambiraja, Balamurugan, Taheri, Omid, Danecek, Radek, Becherini, Giorgio, Pons-Moll, Gerard, Thies, Justus
Abstract
Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to "in-the-wild" settings. Further, contemporary models and their training schemes struggle to capture animation fidelity with text-motion alignment. To address this, we (1) introduce '3D Hands in the Wild' (3D-HIW), a dataset of 32K 3D hand-motion sequences and aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the LLM. To build 3D-HIW, we propose a data annotation pipeline that combines vision-language models (VLMs) and state-of-the-art 3D hand trackers, and apply it to a large corpus of egocentric action videos covering a wide range of scenarios. To fully capture motion in-the-wild, CLUTCH employs SHIFT, a part-modality decomposed VQ-VAE, which improves generalization and reconstruction fidelity. Finally, to improve animation quality, we introduce a geometric refinement stage, where CLUTCH is co-supervised with a reconstruction loss applied directly to decoded hand motion parameters. Experiments demonstrate state-of-the-art performance on text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modelling. Code, data and models will be released.
Chinese Translation
手在日常生活中扮演着核心角色,但自然手部动作的建模仍然未得到充分探索。现有的文本到手部动作生成或手部动画字幕的方法依赖于有限动作和上下文的工作室捕获数据集,这使得它们在“野外”环境中扩展成本高昂。此外,当前模型及其训练方案在捕捉动画保真度与文本-动作对齐方面面临挑战。为了解决这一问题,我们(1)引入了“野外的3D手部动作”(3D-HIW),这是一个包含32K个3D手部动作序列及其对齐文本的数据集;(2)提出了CLUTCH,一个基于大语言模型(LLM)的手部动画系统,具有两个关键创新:(a)SHIFT,一种新颖的VQ-VAE架构用于对手部动作进行标记;(b)几何细化阶段,用于微调LLM。为了构建3D-HIW,我们提出了一种数据标注管道,结合了视觉-语言模型(VLMs)和最先进的3D手部跟踪器,并将其应用于覆盖广泛场景的大量自我中心动作视频。为了充分捕捉野外的动作,CLUTCH采用了SHIFT,这是一种部分模态分解的VQ-VAE,能够提高泛化能力和重建保真度。最后,为了提高动画质量,我们引入了几何细化阶段,在该阶段,CLUTCH与直接应用于解码手部动作参数的重建损失共同监督。实验表明,在文本到动作和动作到文本任务上,CLUTCH实现了最先进的性能,为可扩展的野外手部动作建模建立了首个基准。代码、数据和模型将会发布。
cs.CV / 3 / 2602.17785

Multi-Modal Monocular Endoscopic Depth and Pose Estimation with Edge-Guided Self-Supervision

边缘引导自监督的多模态单目内窥镜深度与姿态估计
Ju, Xinwei, Daher, Rema, Stoyanov, Danail, Bano, Sophia, Vasconcelos, Francisco
Abstract
Monocular depth and pose estimation play an important role in the development of colonoscopy-assisted navigation, as they enable improved screening by reducing blind spots, minimizing the risk of missed or recurrent lesions, and lowering the likelihood of incomplete examinations. However, this task remains challenging due to the presence of texture-less surfaces, complex illumination patterns, deformation, and a lack of in-vivo datasets with reliable ground truth. In this paper, we propose **PRISM** (Pose-Refinement with Intrinsic Shading and edge Maps), a self-supervised learning framework that leverages anatomical and illumination priors to guide geometric learning. Our approach uniquely incorporates edge detection and luminance decoupling for structural guidance. Specifically, edge maps are derived using a learning-based edge detector (e.g., DexiNed or HED) trained to capture thin and high-frequency boundaries, while luminance decoupling is obtained through an intrinsic decomposition module that separates shading and reflectance, enabling the model to exploit shading cues for depth estimation. Experimental results on multiple real and synthetic datasets demonstrate state-of-the-art performance. We further conduct a thorough ablation study on training data selection to establish best practices for pose and depth estimation in colonoscopy. This analysis yields two practical insights: (1) self-supervised training on real-world data outperforms supervised training on realistic phantom data, underscoring the superiority of domain realism over ground truth availability; and (2) video frame rate is an extremely important factor for model performance, where dataset-specific video frame sampling is necessary for generating high quality training data.
Chinese Translation
单目深度和姿态估计在结肠镜辅助导航的发展中扮演着重要角色,因为它们通过减少盲区、降低漏诊或复发病变的风险以及降低不完全检查的可能性来改善筛查。然而,由于存在无纹理表面、复杂的照明模式、变形以及缺乏具有可靠真实值的体内数据集,这一任务仍然具有挑战性。本文提出了**PRISM**(基于内在阴影和边缘图的姿态精细化),这是一个自监督学习框架,利用解剖学和照明先验来指导几何学习。我们的方法独特地结合了边缘检测和亮度解耦以提供结构指导。具体而言,边缘图是通过一个学习型边缘检测器(例如,DexiNed或HED)获得的,该检测器经过训练以捕捉细小和高频边界,而亮度解耦则通过一个内在分解模块实现,该模块将阴影和反射分离,从而使模型能够利用阴影线索进行深度估计。在多个真实和合成数据集上的实验结果表明,我们的方法达到了最先进的性能。我们还对训练数据选择进行了深入的消融研究,以建立结肠镜姿态和深度估计的最佳实践。这一分析得出了两个实用的见解:(1)在真实世界数据上进行自监督训练优于在逼真的体模(phantom)数据上进行监督训练,强调了领域真实感优于真实值可用性的优势;(2)视频帧率是模型性能的一个极其重要的因素,特定数据集的视频帧采样对于生成高质量的训练数据是必要的。
cs.CV / 4 / 2602.17793

LGD-Net: Latent-Guided Dual-Stream Network for HER2 Scoring with Task-Specific Domain Knowledge

LGD-Net:基于潜在引导的双流网络用于HER2评分与任务特定领域知识
Zhu, Peide, Lu, Linbin, Chen, Zhiqin, Chen, Xiong
Abstract
It is a critical task to evaluate HER2 expression level accurately for breast cancer evaluation and targeted treatment therapy selection. However, the standard multi-step Immunohistochemistry (IHC) staining is resource-intensive, expensive, and time-consuming, which is also often unavailable in many areas. Consequently, predicting HER2 levels directly from H&E slides has emerged as a potential alternative solution. It has been shown to be effective to use virtual IHC images from H&E images for automatic HER2 scoring. However, the pixel-level virtual staining methods are computationally expensive and prone to reconstruction artifacts that can propagate diagnostic errors. To address these limitations, we propose the Latent-Guided Dual-Stream Network (LGD-Net), a novel framework that employs cross-modal feature hallucination instead of explicit pixel-level image generation. LGD-Net learns to map morphological H&E features directly to the molecular latent space, guided by a teacher IHC encoder during training. To ensure the hallucinated features capture clinically relevant phenotypes, we explicitly regularize the model training with task-specific domain knowledge, specifically nuclei distribution and membrane staining intensity, via lightweight auxiliary regularization tasks. Extensive experiments on the public BCI dataset demonstrate that LGD-Net achieves state-of-the-art performance, significantly outperforming baseline methods while enabling efficient inference using single-modality H&E inputs.
Chinese Translation
准确评估HER2表达水平对于乳腺癌评估和靶向治疗选择至关重要。然而,标准的多步骤免疫组化(IHC)染色过程资源密集、成本高昂且耗时较长,在许多地区往往无法获得。因此,直接从H&E切片预测HER2水平成为了一种潜在的替代解决方案。研究表明,利用H&E图像生成虚拟IHC图像进行自动HER2评分是有效的。然而,像素级虚拟染色方法计算开销大,且容易产生重建伪影,从而可能传播诊断错误。为了解决这些局限性,我们提出了潜在引导双流网络(LGD-Net),这是一种新颖的框架,采用跨模态特征幻觉而不是显式的像素级图像生成。LGD-Net学习将形态学H&E特征直接映射到分子潜在空间,在训练过程中由教师IHC编码器进行指导。为了确保幻觉特征捕捉临床相关的表型,我们通过轻量级辅助正则化任务显式地对模型训练进行正则化,特别是核分布和膜染色强度的任务特定领域知识。在公共BCI数据集上的大量实验表明,LGD-Net实现了最先进的性能,显著超越了基线方法,同时能够使用单模态H&E输入进行高效推理。
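The cross-modal feature hallucination objective amounts to pulling H&E-derived latents toward the teacher IHC encoder's latents. A minimal sketch, assuming a simple mean-squared-error alignment loss (the paper's exact loss and the auxiliary regularization terms are not specified in the abstract):

```python
def latent_alignment_loss(student_latent, teacher_latent):
    """MSE pulling the H&E-derived (student) latent toward the frozen
    teacher IHC latent. A hypothetical stand-in for the guidance loss;
    at inference only the student branch is needed."""
    n = len(student_latent)
    return sum((s - t) ** 2 for s, t in zip(student_latent, teacher_latent)) / n
```

The design point this illustrates: because supervision lives in latent space, there is no pixel-level image synthesis to introduce reconstruction artifacts, and inference needs only the single H&E modality.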
cs.CV / 5 / 2602.17799

Enabling Training-Free Text-Based Remote Sensing Segmentation

实现免训练的基于文本的遥感分割
Sosa, Jose, Rukhovich, Danila, Kacem, Anis, Aouada, Djamila
Abstract
Recent advances in Vision Language Models (VLMs) and Vision Foundation Models (VFMs) have opened new opportunities for zero-shot text-guided segmentation of remote sensing imagery. However, most existing approaches still rely on additional trainable components, limiting their generalisation and practical applicability. In this work, we investigate to what extent text-based remote sensing segmentation can be achieved without additional training, by relying solely on existing foundation models. We propose a simple yet effective approach that integrates contrastive and generative VLMs with the Segment Anything Model (SAM), enabling a fully training-free or lightweight LoRA-tuned pipeline. Our contrastive approach employs CLIP as mask selector for SAM's grid-based proposals, achieving state-of-the-art open-vocabulary semantic segmentation (OVSS) in a completely zero-shot setting. In parallel, our generative approach enables reasoning and referring segmentation by generating click prompts for SAM using GPT-5 in a zero-shot setting and a LoRA-tuned Qwen-VL model, with the latter yielding the best results. Extensive experiments across 19 remote sensing benchmarks, including open-vocabulary, referring, and reasoning-based tasks, demonstrate the strong capabilities of our approach. Code will be released at https://github.com/josesosajs/trainfree-rs-segmentation.
Chinese Translation
近年来,视觉语言模型(Vision Language Models, VLMs)和视觉基础模型(Vision Foundation Models, VFMs)的进展为零样本文本引导的遥感图像分割开辟了新的机会。然而,大多数现有方法仍依赖于额外的可训练组件,限制了它们的泛化能力和实际应用性。在本研究中,我们探讨了在不进行额外训练、完全依赖现有基础模型的情况下,基于文本的遥感分割可以达到何种程度。我们提出了一种简单而有效的方法,将对比式和生成式VLM与分割一切模型(Segment Anything Model, SAM)相结合,实现了完全免训练或轻量级LoRA调优的管道。我们的对比式方法采用CLIP作为SAM网格提议的掩码选择器,在完全零样本的设置中实现了最先进的开放词汇语义分割(open-vocabulary semantic segmentation, OVSS)。与此同时,我们的生成式方法通过在零样本设置中使用GPT-5为SAM生成点击提示,以及使用LoRA调优的Qwen-VL模型,实现推理分割和指代分割,其中后者取得了最佳结果。在19个遥感基准上的广泛实验(包括开放词汇、指代和基于推理的任务)展示了我们方法的强大能力。代码将发布在https://github.com/josesosajs/trainfree-rs-segmentation。
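The contrastive pipeline's key step, using CLIP to pick among SAM's grid-based mask proposals, can be sketched independently of either model. `score_fn` below stands in for CLIP image-text similarity (cosine of the two embeddings); because it is just a callable scoring crops against a query, the selection loop itself involves no training.

```python
def select_mask(mask_crops, text_query, score_fn):
    """Training-free mask selection: score every SAM proposal crop
    against the text query and return the index of the best match.
    `score_fn(crop, text)` is a hypothetical stand-in for CLIP
    image-text similarity."""
    if not mask_crops:
        return None
    scores = [score_fn(crop, text_query) for crop in mask_crops]
    return max(range(len(scores)), key=lambda i: scores[i])
```

In an open-vocabulary setting the same loop runs once per class name, assigning each proposal to its highest-scoring label instead of keeping a single winner.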
cs.CV / 6 / 2602.17807

VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

VidEoMT:您的 ViT 实际上也是一个视频分割模型
Norouzi, Narges, Zulfikar, Idil Esen, Cavagnero, Niccolò, Kerssies, Tommie, Leibe, Bastian, Dubbelman, Gijs, de Geus, Daan
Abstract
Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large-scale pre-training, can conduct accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder-only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally-agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5x--10x faster, running at up to 160 FPS with a ViT-L backbone. Code: https://www.tue-mps.org/videomt/
Chinese Translation
现有的在线视频分割模型通常将逐帧分割器与复杂的专用跟踪模块相结合。尽管这些模块有效,但它们引入了显著的架构复杂性和计算开销。最近的研究表明,当以足够的容量和大规模预训练进行扩展时,普通的视觉变换器(ViT)编码器能够在不需要专用模块的情况下进行准确的图像分割。基于这一观察,我们提出了视频编码器专用掩码变换器(VidEoMT),这是一个简单的仅编码器视频分割模型,消除了对专用跟踪模块的需求。为了在仅编码器的 ViT 中实现时间建模,VidEoMT 引入了一种轻量级查询传播机制,通过重用前一帧的查询在帧之间传递信息。为了与适应新内容的能力保持平衡,它采用了一种查询融合策略,将传播的查询与一组时间无关的学习查询结合在一起。因此,VidEoMT 在不增加复杂性的情况下获得了跟踪器的优势,实现了具有竞争力的准确性,同时速度提高了 5 到 10 倍,使用 ViT-L 主干时可达到每秒 160 帧的运行速度。代码: https://www.tue-mps.org/videomt/
cs.CV / 7 / 2602.17814

VQPP: Video Query Performance Prediction Benchmark

VQPP:视频查询性能预测基准
Lutu, Adrian Catalin, Poesina, Eduard, Ionescu, Radu Tudor
Abstract
Query performance prediction (QPP) is an important and actively studied information retrieval task, having various applications, such as query reformulation, query expansion, and retrieval system selection, among many others. The task has been primarily studied in the context of text and image retrieval, whereas QPP for content-based video retrieval (CBVR) remains largely underexplored. To this end, we propose the first benchmark for video query performance prediction (VQPP), comprising two text-to-video retrieval datasets and two CBVR systems, respectively. VQPP contains a total of 56K text queries and 51K videos, and comes with official training, validation and test splits, fostering direct comparisons and reproducible results. We explore multiple pre-retrieval and post-retrieval performance predictors, creating a representative benchmark for future exploration of QPP in the video domain. Our results show that pre-retrieval predictors obtain competitive performance, enabling applications before performing the retrieval step. We also demonstrate the applicability of VQPP by employing the best performing pre-retrieval predictor as reward model for training a large language model (LLM) on the query reformulation task via direct preference optimization (DPO). We release our benchmark and code at https://github.com/AdrianLutu/VQPP.
Chinese Translation
查询性能预测(QPP)是一项重要且积极研究的信息检索任务,具有多种应用,如查询重构、查询扩展和检索系统选择等。该任务主要在文本和图像检索的背景下进行研究,而基于内容的视频检索(CBVR)的QPP仍然在很大程度上未被探索。为此,我们提出了首个视频查询性能预测基准(VQPP),包括两个文本到视频检索的数据集和两个CBVR系统。VQPP总共包含56K个文本查询和51K个视频,并提供官方的训练、验证和测试划分,促进直接比较和可重复结果的生成。我们探讨了多种检索前和检索后性能预测器,为未来在视频领域探索QPP创建了一个具有代表性的基准。我们的结果表明,检索前预测器能够获得竞争性的性能,使得在执行检索步骤之前的应用成为可能。我们还通过将表现最佳的检索前预测器作为奖励模型,展示了VQPP的适用性,以通过直接偏好优化(DPO)训练大型语言模型(LLM)进行查询重构任务。我们在 https://github.com/AdrianLutu/VQPP 发布了我们的基准和代码。
cs.CV / 8 / 2602.17854

On the Evaluation Protocol of Gesture Recognition for UAV-based Rescue Operation based on Deep Learning: A Subject-Independence Perspective

基于深度学习的无人机救援操作手势识别评估协议研究:一种独立于受试者的视角
Varga, Domonkos
Abstract
This paper presents a methodological analysis of the gesture-recognition approach proposed by Liu and Szirányi, with a particular focus on the validity of their evaluation protocol. We show that the reported near-perfect accuracy metrics result from a frame-level random train-test split that inevitably mixes samples from the same subjects across both sets, causing severe data leakage. By examining the published confusion matrix, learning curves, and dataset construction, we demonstrate that the evaluation does not measure generalization to unseen individuals. Our findings underscore the importance of subject-independent data partitioning in vision-based gesture-recognition research, especially for applications - such as UAV-human interaction - that require reliable recognition of gestures performed by previously unseen people.
Chinese Translation
本文对Liu和Szirányi提出的手势识别方法进行了方法论分析,特别关注其评估协议的有效性。我们指出,报告的近乎完美的准确率指标源于帧级随机训练-测试划分,这不可避免地混合了来自同一受试者的样本,导致严重的数据泄漏。通过检查已发布的混淆矩阵、学习曲线和数据集构建,我们证明该评估未能衡量对未见个体的泛化能力。我们的研究结果强调了在基于视觉的手势识别研究中,尤其是在需要可靠识别之前未见个体所执行手势的应用(如无人机与人类的交互)中,独立于受试者的数据划分的重要性。
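The remedy the paper argues for, partitioning by subject rather than by frame, is a few lines of code. A minimal sketch (the test fraction and the sample layout are illustrative):

```python
import random

def subject_independent_split(samples, test_fraction=0.3, seed=0):
    """Split frame-level samples so no subject appears in both sets.
    Each sample is a (subject_id, ...) tuple; we partition the subject
    IDs, not the frames, which is what prevents identity leakage."""
    subjects = sorted({s[0] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [s for s in samples if s[0] not in test_subjects]
    test = [s for s in samples if s[0] in test_subjects]
    return train, test
```

A frame-level random split of the same data would place frames of every subject in both sets, which is exactly the leakage mode the analysis identifies.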
cs.CV / 9 / 2602.17869

Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models

学习紧凑的视频表示以高效理解大型多模态模型中的长视频
Chen, Yuxiao, Wang, Jue, Zhang, Zhikang, Yi, Jingru, Zhang, Xu, Zou, Yang, Cai, Zhaowei, Yuan, Jianbo, Li, Xinyu, Yang, Hao, Modolo, Davide
Abstract
With recent advancements in video backbone architectures, combined with the remarkable achievements of large language models (LLMs), the analysis of long-form videos spanning tens of minutes has become both feasible and increasingly prevalent. However, the inherently redundant nature of video sequences poses significant challenges for contemporary state-of-the-art models. These challenges stem from two primary aspects: 1) efficiently incorporating a larger number of frames within memory constraints, and 2) extracting discriminative information from the vast volume of input data. In this paper, we introduce a novel end-to-end schema for long-form video understanding, which includes an information-density-based adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC) integrated with a multimodal large language model (MLLM). Our proposed system offers two major advantages: it adaptively and effectively captures essential information from video sequences of varying durations, and it achieves high compression rates while preserving crucial discriminative information. The proposed framework demonstrates promising performance across various benchmarks, excelling in both long-form video understanding tasks and standard video understanding benchmarks. These results underscore the versatility and efficacy of our approach, particularly in managing the complexities of prolonged video sequences.
Chinese Translation
随着视频主干架构的最新进展,以及大型语言模型(LLMs)所取得的显著成就,对时长达数十分钟的长视频的分析变得既可行又日益普遍。然而,视频序列固有的冗余特性对当代最先进的模型提出了重大挑战。这些挑战主要源于两个方面:1)在内存限制下高效地处理更多帧,2)从大量输入数据中提取具有区分性的信息。在本文中,我们提出了一种新颖的端到端长视频理解架构,其中包括基于信息密度的自适应视频采样器(AVS)和基于自编码器的时空视频压缩器(SVC),并与多模态大型语言模型(MLLM)相结合。我们提出的系统具有两个主要优点:能够自适应且有效地捕捉不同时长视频序列中的关键信息,并在保持重要区分性信息的同时实现高压缩率。所提出的框架在各种基准测试中表现出色,在长视频理解任务和标准视频理解基准测试中均表现优异。这些结果凸显了我们方法的多样性和有效性,特别是在处理长视频序列的复杂性方面。
cs.CV / 10 / 2602.17871

Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models

理解视觉-语言模型的细粒度知识能力
Ghosh, Dhruba, Zhang, Yuhui, Schmidt, Ludwig
Abstract
Vision-language models (VLMs) have made substantial progress across a wide range of visual question answering benchmarks, spanning visual reasoning, document understanding, and multimodal dialogue. These improvements are evident in a wide range of VLMs built on a variety of base models, alignment architectures, and training data. However, recent works show that these models trail behind in traditional image classification benchmarks, which test fine-grained visual knowledge. We test a large number of recent VLMs on fine-grained classification benchmarks and identify potential factors in the disconnect between fine-grained knowledge and other vision benchmarks. Through a series of ablation experiments, we find that using a better LLM improves all benchmark scores equally, while a better vision encoder disproportionately improves fine-grained classification performance. Furthermore, we find that the pretraining stage is also vital to fine-grained performance, particularly when the language model weights are unfrozen during pretraining. These insights pave the way for enhancing fine-grained visual understanding and vision-centric capabilities in VLMs.
Chinese Translation
视觉-语言模型(VLMs)在广泛的视觉问答基准测试中取得了显著进展,涵盖了视觉推理、文档理解和多模态对话。这些改进在基于多种基础模型、对齐架构和训练数据构建的各种VLM中都得到了体现。然而,最近的研究表明,这些模型在传统的图像分类基准测试中表现不佳,而这些测试旨在评估细粒度视觉知识。我们在细粒度分类基准上测试了大量近期的VLM,并识别出细粒度知识与其他视觉基准之间的潜在脱节因素。通过一系列消融实验,我们发现使用更好的语言模型(LLM)可以均衡提高所有基准分数,而更好的视觉编码器则不成比例地提升细粒度分类性能。此外,我们发现预训练阶段对细粒度性能也至关重要,特别是在预训练期间语言模型权重未被冻结时。这些见解为提升VLM中的细粒度视觉理解和以视觉为中心的能力铺平了道路。
cs.CV / 11 / 2602.17909

A Single Image and Multimodality Is All You Need for Novel View Synthesis

单幅图像与多模态数据是新视图合成所需的一切
Javadi, Amirhosein, Gau, Chi-Shiang, Polyzos, Konstantinos D., Javidi, Tara
Abstract
Diffusion-based approaches have recently demonstrated strong performance for single-image novel view synthesis by conditioning generative models on geometry inferred from monocular depth estimation. However, in practice, the quality and consistency of the synthesized views are fundamentally limited by the reliability of the underlying depth estimates, which are often fragile under low texture, adverse weather, and occlusion-heavy real-world conditions. In this work, we show that incorporating sparse multimodal range measurements provides a simple yet effective way to overcome these limitations. We introduce a multimodal depth reconstruction framework that leverages extremely sparse range sensing data, such as automotive radar or LiDAR, to produce dense depth maps that serve as robust geometric conditioning for diffusion-based novel view synthesis. Our approach models depth in an angular domain using a localized Gaussian Process formulation, enabling computationally efficient inference while explicitly quantifying uncertainty in regions with limited observations. The reconstructed depth and uncertainty are used as a drop-in replacement for monocular depth estimators in existing diffusion-based rendering pipelines, without modifying the generative model itself. Experiments on real-world multimodal driving scenes demonstrate that replacing vision-only depth with our sparse range-based reconstruction substantially improves both geometric consistency and visual quality in single-image novel-view video generation. These results highlight the importance of reliable geometric priors for diffusion-based view synthesis and demonstrate the practical benefits of multimodal sensing even at extreme levels of sparsity.
Chinese Translation
基于扩散的方法最近在单幅图像的新视图合成中表现出强大的性能,这得益于将生成模型与从单目深度估计中推断出的几何信息相结合。然而,在实际应用中,合成视图的质量和一致性在根本上受到基础深度估计可靠性的限制,而这些估计在低纹理、不利天气和遮挡严重的现实条件下往往较为脆弱。在本研究中,我们展示了结合稀疏多模态范围测量提供了一种简单而有效的方式来克服这些限制。我们引入了一种多模态深度重建框架,该框架利用极其稀疏的范围传感数据,如汽车雷达或激光雷达(LiDAR),生成密集的深度图,这些深度图作为基于扩散的新视图合成的稳健几何条件。我们的方法使用局部高斯过程(Gaussian Process)模型在角度域中建模深度,从而实现计算效率高的推断,同时在观察有限的区域明确量化不确定性。重建的深度和不确定性可以作为现有基于扩散的渲染管道中单目深度估计器的替代品,而无需修改生成模型本身。在真实的多模态驾驶场景中的实验表明,用我们的稀疏范围重建替代仅依赖视觉的深度,显著提高了单幅图像新视图视频生成中的几何一致性和视觉质量。这些结果突显了可靠几何先验在基于扩散的视图合成中的重要性,并展示了即使在极端稀疏情况下,多模态传感的实际好处。
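The paper's localized GP formulation is not detailed in the abstract, but its core idea, depth with explicitly quantified uncertainty from sparse range returns, can be illustrated with a deliberately reduced one-observation GP posterior under an RBF kernel (toy numbers; the actual model operates over many returns in an angular domain):

```python
import math

def rbf(a, b, ls=1.0):
    """Squared-exponential (RBF) kernel with length scale ls."""
    return math.exp(-0.5 * ((a - b) / ls) ** 2)

def gp_posterior(x_star, x_obs, y_obs, noise=1e-3):
    """GP posterior mean/variance at x_star given ONE range observation,
    a one-point special case of a localized GP depth model."""
    k_ss = rbf(x_star, x_star)
    k_s = rbf(x_star, x_obs)
    k = rbf(x_obs, x_obs) + noise
    mean = k_s / k * y_obs
    var = k_ss - k_s * k_s / k
    return mean, var

# a single sparse radar return at angle 0.0 with depth 10 m (toy values)
m_near, v_near = gp_posterior(0.1, x_obs=0.0, y_obs=10.0)
m_far, v_far = gp_posterior(3.0, x_obs=0.0, y_obs=10.0)
assert v_near < v_far          # uncertainty grows away from the measurement
assert abs(m_near - 10.0) < 0.1
```

The variance output is exactly the kind of per-region uncertainty the abstract says is passed downstream as geometric conditioning.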
cs.CV / 12 / 2602.17929

ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging

ZACH-ViT:用于医学成像的紧凑视觉变换器中的状态依赖性归纳偏置
Angelakis, Athanasios
Abstract
Vision Transformers rely on positional embeddings and class tokens that encode fixed spatial priors. While effective for natural images, these priors may hinder generalization when spatial layout is weakly informative or inconsistent, a frequent condition in medical imaging and edge-deployed clinical systems. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes both positional embeddings and the [CLS] token, achieving permutation invariance through global average pooling over patch representations. The term "Zero-token" specifically refers to removing the dedicated [CLS] aggregation token and positional embeddings; patch tokens remain unchanged and are processed normally. Adaptive residual projections preserve training stability in compact configurations while maintaining a strict parameter budget. Evaluation is performed across seven MedMNIST datasets spanning binary and multi-class tasks under a strict few-shot protocol (50 samples per class, fixed hyperparameters, five random seeds). The empirical analysis demonstrates regime-dependent behavior: ZACH-ViT (0.25M parameters, trained from scratch) achieves its strongest advantage on BloodMNIST and remains competitive with TransMIL on PathMNIST, while its relative advantage decreases on datasets with strong anatomical priors (OCTMNIST, OrganAMNIST), consistent with the architectural hypothesis. These findings support the view that aligning architectural inductive bias with data structure can be more important than pursuing universal benchmark dominance. Despite its minimal size and lack of pretraining, ZACH-ViT achieves competitive performance while maintaining sub-second inference times, supporting deployment in resource-constrained clinical environments. Code and models are available at https://github.com/Bluesman79/ZACH-ViT.
Chinese Translation
视觉变换器依赖于编码固定空间先验的位置信息嵌入和类别标记。尽管这些先验对自然图像有效,但在空间布局信息弱或不一致的情况下,可能会妨碍泛化,这在医学成像和边缘部署的临床系统中是常见的情况。我们提出了ZACH-ViT(零标记自适应紧凑层次视觉变换器),这是一种紧凑型视觉变换器,去除了位置信息嵌入和[CLS]标记,通过对补丁表示进行全局平均池化实现了置换不变性。“零标记”一词特指去除专用的[CLS]聚合标记和位置信息嵌入;补丁标记保持不变并正常处理。自适应残差投影在紧凑配置中保持训练稳定性,同时维持严格的参数预算。评估在七个MedMNIST数据集上进行,涵盖了二分类和多分类任务,采用严格的少样本协议(每类50个样本,固定超参数,五个随机种子)。实证分析表明了状态依赖性行为:ZACH-ViT(参数量为0.25M,从头训练)在BloodMNIST上表现出最强的优势,并在PathMNIST上与TransMIL保持竞争力,而在具有强解剖先验的数据集(OCTMNIST,OrganAMNIST)上其相对优势则下降,这与架构假设一致。这些发现支持了将架构归纳偏置与数据结构对齐的重要性,可能比追求普遍基准主导地位更为重要。尽管其体积小且没有预训练,ZACH-ViT仍然实现了竞争力的性能,同时保持亚秒的推理时间,支持在资源受限的临床环境中的部署。代码和模型可在https://github.com/Bluesman79/ZACH-ViT获取。
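Removing the [CLS] token and positional embeddings and aggregating by global average pooling makes the representation permutation-invariant over patches, which can be checked directly. A toy sketch (real patch tokens would come from the transformer backbone):

```python
def global_average_pool(patch_tokens):
    """Mean over patch tokens; no [CLS] token, no positional embeddings."""
    dim = len(patch_tokens[0])
    n = len(patch_tokens)
    return [sum(tok[d] for tok in patch_tokens) / n for d in range(dim)]

# four 2-D patch tokens (values chosen to be exact in binary floating point)
tokens = [[1.0, 0.5], [3.0, -0.25], [7.0, 0.0], [-1.0, 0.75]]
pooled = global_average_pool(tokens)

# permutation invariance: the spatial order of patches does not matter
assert global_average_pool(list(reversed(tokens))) == pooled
```

This invariance is precisely what the paper argues helps when spatial layout is weakly informative, and hurts (by design) on datasets with strong anatomical priors.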
cs.CV / 13 / 2602.17951

ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models

ROCKET:面向残差的多层对齐框架用于空间感知的视觉-语言-动作模型
Sun, Guoheng, Du, Tingting, Feng, Kaixi, Luo, Chenxiang, Ding, Xingguo, Shen, Zheyu, Wang, Ziyao, He, Yexiao, Li, Ang
Abstract
Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective approach is representation alignment, where a strong vision foundation model is used to guide a 2D VLA model. However, existing methods usually apply supervision at only a single layer, failing to fully exploit the rich information distributed across depth; meanwhile, naïve multi-layer alignment can cause gradient interference. We introduce ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer-invariant mapping, which reduces gradient conflicts. We provide both theoretical justification and empirical analyses showing that a shared projector is sufficient and outperforms prior designs, and further propose a Matryoshka-style sparse activation scheme for the shared projector to balance multiple alignment losses. Our experiments show that, combined with a training-free layer selection strategy, ROCKET requires only about 4% of the compute budget while achieving 98.5% state-of-the-art success rate on LIBERO. We further demonstrate the superior performance of ROCKET across LIBERO-Plus and RoboTwin, as well as multiple VLA models. The code and model weights can be found at https://github.com/CASE-Lab-UMD/ROCKET-VLA.
Chinese Translation
视觉-语言-动作(VLA)模型使得机器人能够按照指令进行操作,但它们通常是在二维数据上进行预训练,缺乏三维空间理解。一种有效的方法是表示对齐,其中强大的视觉基础模型用于指导二维 VLA 模型。然而,现有的方法通常仅在单层上应用监督,未能充分利用分布在深度上的丰富信息;与此同时,简单的多层对齐可能导致梯度干扰。我们提出了 ROCKET,一个面向残差的多层表示对齐框架,将多层对齐形式化为将一个残差流对齐到另一个残差流。具体而言,ROCKET 采用共享投影器,通过层不变映射将 VLA 主干的多个层与强大的三维视觉基础模型的多个层对齐,从而减少梯度冲突。我们提供了理论依据和实证分析,表明共享投影器是足够的,并且优于先前的设计,并进一步提出了一种 Matryoshka 风格的稀疏激活方案,以平衡多个对齐损失。我们的实验表明,结合无训练的层选择策略,ROCKET 仅需约 4% 的计算预算,同时在 LIBERO 上实现了 98.5% 的最先进成功率。我们进一步展示了 ROCKET 在 LIBERO-Plus 和 RoboTwin 以及多个 VLA 模型上的卓越性能。代码和模型权重可以在 https://github.com/CASE-Lab-UMD/ROCKET-VLA 找到。
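The shared, layer-invariant projector can be sketched as a single linear map applied identically at every aligned layer, with per-layer alignment losses summed. A toy sketch under that assumption (the actual ROCKET projector, losses, and sparse activation scheme are richer than this):

```python
def project(feat, W):
    """One shared linear projector applied identically at every layer
    (the layer-invariant mapping said to reduce gradient conflicts)."""
    return [sum(w * x for w, x in zip(row, feat)) for row in W]

def multilayer_alignment_loss(student_layers, teacher_layers, W):
    """Sum of per-layer MSEs between projected student and teacher features."""
    loss = 0.0
    for s, t in zip(student_layers, teacher_layers):
        p = project(s, W)
        loss += sum((a - b) ** 2 for a, b in zip(p, t)) / len(t)
    return loss

# toy 2-D features for two aligned layers (all numbers hypothetical)
student = [[1.0, 0.0], [0.0, 1.0]]
teacher = [[2.0, 0.0], [0.0, 2.0]]
W_good = [[2.0, 0.0], [0.0, 2.0]]   # this projector aligns both layers exactly
W_bad = [[1.0, 0.0], [0.0, 1.0]]
assert multilayer_alignment_loss(student, teacher, W_good) == 0.0
assert multilayer_alignment_loss(student, teacher, W_bad) > 0.0
```

Because a single W serves all layers, its gradient sees every layer's loss at once, which is the intuition behind fewer conflicting per-layer updates.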
cs.CV / 14 / 2602.18000

Image Quality Assessment: Exploring Quality Awareness via Memory-driven Distortion Patterns Matching

图像质量评估:通过记忆驱动的失真模式匹配探索质量意识
Lan, Xuting, Zhou, Mingliang, Wei, Xuekai, Yan, Jielu, Huang, Yueting, Pu, Huayan, Luo, Jun, Jia, Weijia
Abstract
Existing full-reference image quality assessment (FR-IQA) methods achieve high-precision evaluation by analysing feature differences between reference and distorted images. However, their performance is constrained by the quality of the reference image, which limits real-world applications where ideal reference sources are unavailable. Notably, the human visual system has the ability to accumulate visual memory, allowing image quality assessment on the basis of long-term memory storage. Inspired by this biological memory mechanism, we propose a memory-driven quality-aware framework (MQAF), which establishes a memory bank for storing distortion patterns and dynamically switches between dual-mode quality assessment strategies to reduce reliance on high-quality reference images. When reference images are available, MQAF obtains reference-guided quality scores by adaptively weighting reference information and comparing the distorted image with stored distortion patterns in the memory bank. When the reference image is absent, the framework relies on distortion patterns in the memory bank to infer image quality, enabling no-reference quality assessment (NR-IQA). The experimental results show that our method outperforms state-of-the-art approaches across multiple datasets while adapting to both no-reference and full-reference tasks.
Chinese Translation
现有的全参考图像质量评估(FR-IQA)方法通过分析参考图像与失真图像之间的特征差异实现高精度评估。然而,它们的性能受到参考图像质量的限制,这在理想参考源不可用的实际应用中造成了局限。值得注意的是,人类视觉系统具有积累视觉记忆的能力,可以基于长期记忆存储进行图像质量评估。受到这一生物记忆机制的启发,我们提出了一种记忆驱动的质量意识框架(MQAF),该框架建立了一个用于存储失真模式的记忆库,并动态切换双模式质量评估策略,以减少对高质量参考图像的依赖。当参考图像可用时,MQAF通过自适应加权参考信息并将失真图像与记忆库中存储的失真模式进行比较,从而获得参考引导的质量评分。当参考图像缺失时,该框架依赖于记忆库中的失真模式来推断图像质量,实现无参考质量评估(NR-IQA)。实验结果表明,我们的方法在多个数据集上优于最先进的方法,同时适应无参考和全参考任务。
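The memory-bank mechanism can be illustrated as nearest-pattern lookup: stored distortion patterns carry quality scores, and an unseen image's quality is inferred from its closest stored pattern when no reference image exists. A toy sketch (the class name `DistortionMemoryBank` and all feature values are hypothetical; MQAF's actual matching is learned, not a plain nearest neighbour):

```python
def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

class DistortionMemoryBank:
    """Stores (distortion-pattern, quality) pairs; the quality of a new image
    is inferred from its nearest stored pattern, enabling NR-IQA."""
    def __init__(self):
        self.patterns = []   # feature vectors of known distortions
        self.scores = []     # associated quality scores

    def add(self, pattern, score):
        self.patterns.append(pattern)
        self.scores.append(score)

    def predict(self, feature):
        dists = [l2(feature, p) for p in self.patterns]
        return self.scores[dists.index(min(dists))]

bank = DistortionMemoryBank()
bank.add([0.9, 0.1], score=0.2)   # e.g. a heavy-blur pattern (toy features)
bank.add([0.1, 0.8], score=0.7)   # e.g. a mild-noise pattern
assert bank.predict([0.85, 0.15]) == 0.2
```

When a reference is available, the same bank can complement the usual reference-distorted feature comparison, which is the dual-mode behaviour the abstract describes.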
cs.CV / 15 / 2602.18006

MUOT_3M: A 3 Million Frame Multimodal Underwater Benchmark and the MUTrack Tracking Method

MUOT_3M:一个300万帧的多模态水下基准及MUTrack跟踪方法
Bakht, Ahsan Baidar, Alansari, Mohamad, Din, Muhayy Ud, Naseer, Muzammal, Javed, Sajid, Hussain, Irfan, Matas, Jiri, Mahmood, Arif
Abstract
Underwater Object Tracking (UOT) is crucial for efficient marine robotics, large-scale ecological monitoring, and ocean exploration; however, progress has been hindered by the scarcity of large, multimodal, and diverse datasets. Existing benchmarks remain small and RGB-only, limiting robustness under severe color distortion, turbidity, and low-visibility conditions. We introduce MUOT_3M, the first pseudo-multimodal UOT benchmark, comprising 3 million frames from 3,030 videos (27.8h) annotated with 32 tracking attributes and 677 fine-grained classes, with synchronized RGB, estimated enhanced RGB, estimated depth, and language modalities validated by a marine biologist. Building upon MUOT_3M, we propose MUTrack, a SAM-based multimodal-to-unimodal tracker featuring visual geometric alignment, vision-language fusion, and four-level knowledge distillation that transfers multimodal knowledge into a unimodal student model. Extensive evaluations across five UOT benchmarks demonstrate that MUTrack achieves up to 8.40% higher AUC and 7.80% higher precision than the strongest SOTA baselines while running at 24 FPS. MUOT_3M and MUTrack establish a new foundation for scalable, multimodally trained yet practically deployable underwater tracking.
Chinese Translation
水下物体跟踪(UOT)对于高效的海洋机器人、大规模生态监测和海洋探索至关重要;然而,由于缺乏大型、多模态和多样化的数据集,进展受到限制。现有的基准数据集仍然较小且仅包含RGB图像,限制了在严重色彩失真、浑浊和低能见度条件下的鲁棒性。我们引入了MUOT_3M,这是第一个伪多模态UOT基准,包含来自3,030个视频(27.8小时)的300万帧,标注了32个跟踪属性、677个细粒度类别,并同步了RGB、估计增强RGB、估计深度和经过海洋生物学家验证的语言模态。在MUOT_3M的基础上,我们提出了MUTrack,这是一种基于SAM的多模态到单模态的跟踪器,具有视觉几何对齐、视觉语言融合和四级知识蒸馏,将多模态知识转移到单模态学生模型中。针对五个UOT基准的广泛评估表明,MUTrack在24帧每秒的运行速度下,AUC最高提高了8.40%,精度提高了7.80%,超越了最强的现有技术基线。MUOT_3M和MUTrack为可扩展的、多模态训练但在实际应用中可部署的水下跟踪奠定了新的基础。
cs.CV / 16 / 2602.18016

Towards LLM-centric Affective Visual Customization via Efficient and Precise Emotion Manipulating

面向大语言模型的情感视觉定制:高效精准的情感操控
Luo, Jiamin, Gu, Xuqian, Wang, Jingjing, Lu, Jiahong
Abstract
Previous studies on visual customization primarily rely on the objective alignment between various control signals (e.g., language, layout, and canny edges) and the edited images, largely ignoring subjective emotional content, and, more importantly, lack general-purpose foundation models for affective visual customization. With this in mind, this paper proposes an LLM-centric Affective Visual Customization (L-AVC) task, which focuses on generating images while modifying their subjective emotions via a Multimodal LLM. Further, this paper contends that making the model efficiently align emotion conversion in semantics (named inter-emotion semantic conversion) and precisely retain emotion-agnostic contents (named exter-emotion semantic retaining) are both important and challenging in the L-AVC task. To this end, this paper proposes an Efficient and Precise Emotion Manipulating (EPEM) approach for editing subjective emotions in images. Specifically, an Efficient Inter-emotion Converting (EIC) module is tailored to make the LLM efficiently align emotion conversion in semantics before and after editing, followed by a Precise Exter-emotion Retaining (PER) module to precisely retain the emotion-agnostic contents. Comprehensive experimental evaluations on our constructed L-AVC dataset demonstrate the clear advantage of the proposed EPEM approach over several state-of-the-art baselines on the L-AVC task. This justifies the importance of emotion information for L-AVC and the effectiveness of EPEM in efficiently and precisely manipulating such information.
Chinese Translation
以往关于视觉定制的研究主要依赖于各种控制信号(例如语言、布局和边缘检测)与编辑图像之间的客观对齐,往往忽视了主观情感内容,更重要的是缺乏用于情感视觉定制的通用基础模型。基于此,本文提出了一项以大语言模型为中心的情感视觉定制(L-AVC)任务,旨在通过多模态大语言模型生成图像,同时修改其主观情感。此外,本文认为在L-AVC任务中,如何使模型高效地对齐情感语义转换(称为内情感语义转换)以及如何精准地保留与情感无关的内容(称为外情感语义保留)是相当重要且具有挑战性的。为此,本文提出了一种高效精准的情感操控方法,用于编辑图像中的主观情感。具体而言,设计了一个高效内情感转换(EIC)模块,使得大语言模型在编辑前后能够高效对齐情感语义转换,接着是一个精准外情感保留(PER)模块,以精准保留与情感无关的内容。对我们构建的L-AVC数据集进行的全面实验评估表明,所提出的EPEM方法在L-AVC任务上相较于多个最先进的基线方法具有显著优势。这证明了情感信息在L-AVC中的重要性以及EPEM在高效精准操控该信息方面的有效性。
cs.CV / 17 / 2602.18019

DeepSVU: Towards In-depth Security-oriented Video Understanding via Unified Physical-world Regularized MoE

DeepSVU:通过统一的物理世界正则化混合专家实现深入的安全导向视频理解
Jin, Yujie, Zhang, Wenxin, Wang, Jingjing, Zhou, Guodong
Abstract
In the literature, prior research on Security-oriented Video Understanding (SVU) has predominantly focused on detecting and localizing threats (e.g., shootings, robberies) in videos, while largely lacking the capability to generate and evaluate threat causes. Motivated by these gaps, this paper introduces a new chat-paradigm SVU task, i.e., In-depth Security-oriented Video Understanding (DeepSVU), which aims not only to identify and locate threats but also to attribute and evaluate the causes of threatening segments. Furthermore, this paper reveals two key challenges in the proposed task: 1) how to effectively model the coarse-to-fine physical-world information (e.g., human behavior, object interactions, and background context) to boost the DeepSVU task; and 2) how to adaptively trade off these factors. To tackle these challenges, this paper proposes a new Unified Physical-world Regularized MoE (UPRM) approach. Specifically, UPRM incorporates two key components, the Unified Physical-world Enhanced MoE (UPE) Block and the Physical-world Trade-off Regularizer (PTR), to address the above two challenges, respectively. Extensive experiments conducted on our DeepSVU instruction datasets (i.e., UCF-C instructions and CUVA instructions) demonstrate that UPRM outperforms several advanced Video-LLMs as well as non-VLM approaches. These results justify the importance of coarse-to-fine physical-world information in the DeepSVU task and demonstrate the effectiveness of UPRM in capturing such information.
Chinese Translation
在文献中,以安全导向的视频理解(SVU)为主题的先前研究主要集中于检测和定位视频中的威胁(例如,枪击、抢劫),而在生成和评估威胁原因的有效能力上则显得不足。基于这些不足,本文引入了一种新的聊天范式SVU任务,即深入的安全导向视频理解(DeepSVU),旨在不仅识别和定位威胁,还要归因和评估威胁片段的原因。此外,本文揭示了该任务中的两个关键挑战:1)如何有效建模粗到细的物理世界信息(例如,人类行为、物体交互和背景上下文)以提升DeepSVU任务的效果;2)如何自适应地权衡这些因素。为了解决这些挑战,本文提出了一种新的统一物理世界正则化混合专家(UPRM)方法。具体而言,UPRM包含两个关键组件:统一物理世界增强混合专家(UPE)模块和物理世界权衡正则化器(PTR),分别针对上述两个挑战。对我们的DeepSVU指令数据集(即UCF-C指令和CUVA指令)进行的广泛实验表明,UPRM在多个先进的视频大语言模型(Video-LLMs)以及非大语言模型(non-VLM)方法中表现优越。这些结果证明了粗到细的物理世界信息在DeepSVU任务中的重要性,并展示了我们UPRM在捕捉此类信息方面的有效性。
cs.CV / 18 / 2602.18020

UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models

UAOR:面向不确定性的观察重注入用于视觉-语言-动作模型
Yang, Jiabing, Chen, Yixiang, Xu, Yuan, Li, Peiyan, Wu, Xiangnan, Wen, Zichen, Fang, Bowen, Yu, Tao, Zhang, Zhengbo, Li, Yingda, Wang, Kai, Liu, Jing, Liu, Nianfeng, Huang, Yan, Wang, Liang
Abstract
Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed-Forward Network (FFN) in language models can act as "key-value memory", we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information into the next layer's Feed-Forward Network (FFN) through attention retrieval. This mechanism helps VLAs better attend to observations during inference, enabling more confident and faithful action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines. The project page is at https://uaor.jiabingyang.cn.
Chinese Translation
视觉-语言-动作(VLA)模型利用预训练的视觉-语言模型(VLMs)作为骨干,将图像和指令映射到动作,展现出在可推广的机器人操作方面的显著潜力。为了提升性能,现有方法通常会结合额外的观察线索(例如,深度图、点云)或辅助模块(例如,物体检测器、编码器),以实现更精确和可靠的任务执行,但这些通常需要昂贵的数据收集和额外的训练。受到语言模型中前馈网络(FFN)可以作为“键值记忆”的发现启发,我们提出了面向不确定性的观察重注入(UAOR),这是一种有效的、无训练的即插即用模块,适用于VLA模型。具体而言,当当前语言模型层表现出高不确定性时(通过动作熵测量),它通过注意力检索将关键信息重注入到下一层的前馈网络(FFN)中。这一机制帮助VLA在推理过程中更好地关注观察,从而实现更自信和真实的动作生成。综合实验表明,我们的方法在模拟和现实任务中始终能有效提升多种VLA模型的性能,且开销极小。值得注意的是,UAOR消除了对额外观察线索或模块的需求,使其成为现有VLA管道中一种多功能且实用的插件。项目页面地址为 https://uaor.jiabingyang.cn。
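The gating signal is Shannon entropy over the layer's action distribution; reinjection fires only when entropy exceeds a threshold. A minimal sketch of the trigger (the reinjection itself, via attention retrieval into the next layer's FFN, is omitted, and the threshold value is a hypothetical stand-in):

```python
import math

def action_entropy(probs):
    """Shannon entropy of the action distribution at a given layer."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def maybe_reinject(layer_probs, threshold):
    """True for layers uncertain enough to warrant reinjecting key
    observation information into the next layer's FFN."""
    return action_entropy(layer_probs) > threshold

confident = [0.97, 0.01, 0.01, 0.01]   # low entropy: no reinjection needed
uncertain = [0.25, 0.25, 0.25, 0.25]   # max entropy over 4 actions
thr = 1.0
assert not maybe_reinject(confident, thr)
assert maybe_reinject(uncertain, thr)
```

Because the check is per layer and training-free, it plugs into a frozen VLA at inference time, which is the "plug-and-play" property the abstract emphasizes.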
cs.CV / 19 / 2602.18022

Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers

用于无训练图像编辑控制的双通道注意力引导在扩散变换器中的应用
Li, Guandong, Ye, Mengxia
Abstract
Training-free control over editing intensity is a critical requirement for diffusion-based image editing models built on the Diffusion Transformer (DiT) architecture. Existing attention manipulation methods focus exclusively on the Key space to modulate attention routing, leaving the Value space -- which governs feature aggregation -- entirely unexploited. In this paper, we first reveal that both Key and Value projections in DiT's multi-modal attention layers exhibit a pronounced bias-delta structure, where token embeddings cluster tightly around a layer-specific bias vector. Building on this observation, we propose Dual-Channel Attention Guidance (DCAG), a training-free framework that simultaneously manipulates both the Key channel (controlling where to attend) and the Value channel (controlling what to aggregate). We provide a theoretical analysis showing that the Key channel operates through the nonlinear softmax function, acting as a coarse control knob, while the Value channel operates through linear weighted summation, serving as a fine-grained complement. Together, the two-dimensional parameter space $(\delta_k, \delta_v)$ enables more precise editing-fidelity trade-offs than any single-channel method. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing categories) demonstrate that DCAG consistently outperforms Key-only guidance across all fidelity metrics, with the most significant improvements observed in localized editing tasks such as object deletion (4.9% LPIPS reduction) and object addition (3.2% LPIPS reduction).
Chinese Translation
无训练的编辑强度控制是基于扩散变换器(Diffusion Transformer, DiT)架构的扩散图像编辑模型的重要需求。现有的注意力操控方法仅专注于键(Key)空间以调节注意力路由,而完全未利用值(Value)空间——它负责特征聚合。在本文中,我们首先揭示了DiT的多模态注意力层中的键和值投影都表现出明显的偏差-增量结构,其中令牌嵌入紧密聚集在特定层的偏差向量周围。基于这一观察,我们提出了双通道注意力引导(Dual-Channel Attention Guidance, DCAG),这是一个无训练框架,能够同时操控键通道(控制关注位置)和值通道(控制聚合内容)。我们提供了理论分析,表明键通道通过非线性softmax函数运作,充当粗略控制旋钮,而值通道通过线性加权求和运作,作为细粒度的补充。结合这两个维度的参数空间$(\delta_k, \delta_v)$,能够实现比任何单通道方法更精确的编辑保真度权衡。在PIE-Bench基准(700张图像,10个编辑类别)上的广泛实验表明,DCAG在所有保真度指标上始终优于仅使用键的引导,尤其在局部编辑任务(如对象删除(LPIPS减少4.9%)和对象添加(LPIPS减少3.2%))中观察到最显著的改进。
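The bias-delta observation suggests a simple manipulation: decompose token embeddings into a shared bias plus per-token deltas, then rescale the deltas, independently for the Key and Value channels. A toy sketch of one channel under that reading of the abstract (the real method operates inside DiT attention layers, and the bias is layer-specific rather than a per-batch mean):

```python
def mean_vec(vecs):
    n = len(vecs)
    return [sum(v[d] for v in vecs) / n for d in range(len(vecs[0]))]

def bias_delta_guidance(tokens, scale):
    """Decompose tokens into a shared bias (here: their mean) plus per-token
    deltas, then rescale the deltas -- applied with separate coefficients
    delta_k and delta_v to the Key and Value projections in DCAG."""
    bias = mean_vec(tokens)
    return [[b + scale * (t[d] - b) for d, b in enumerate(bias)] for t in tokens]

keys = [[1.0, 2.0], [3.0, 6.0]]
# scale = 1 recovers the original tokens exactly
assert bias_delta_guidance(keys, 1.0) == keys
# scale = 0 collapses every token onto the shared bias vector
assert bias_delta_guidance(keys, 0.0) == [[2.0, 4.0], [2.0, 4.0]]
```

Scaling Key deltas reshapes the softmax attention routing (a coarse, nonlinear knob), while scaling Value deltas linearly rescales what gets aggregated (the fine knob), matching the paper's two-channel analysis.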
cs.CV / 20 / 2602.18043

Spatio-temporal Decoupled Knowledge Compensator for Few-Shot Action Recognition

时空解耦知识补偿器用于少样本动作识别
Qu, Hongyu, Shu, Xiangbo, Yan, Rui, Gao, Hailiang, Wang, Wenguan, Tang, Jinhui
Abstract
Few-Shot Action Recognition (FSAR) is a challenging task that requires recognizing novel action categories with a few labeled videos. Recent works typically apply semantically coarse category names as auxiliary contexts to guide the learning of discriminative visual features. However, such context provided by the action names is too limited to provide sufficient background knowledge for capturing novel spatial and temporal concepts in actions. In this paper, we propose DiST, an innovative Decomposition-incorporation framework for FSAR that makes use of decoupled Spatial and Temporal knowledge provided by large language models to learn expressive multi-granularity prototypes. In the decomposition stage, we decouple vanilla action names into diverse spatio-temporal attribute descriptions (action-related knowledge). Such commonsense knowledge complements semantic contexts from spatial and temporal perspectives. In the incorporation stage, we propose Spatial/Temporal Knowledge Compensators (SKC/TKC) to discover discriminative object-level and frame-level prototypes, respectively. In SKC, object-level prototypes adaptively aggregate important patch tokens under the guidance of spatial knowledge. Moreover, in TKC, frame-level prototypes utilize temporal attributes to assist in inter-frame temporal relation modeling. These learned prototypes thus provide transparency in capturing fine-grained spatial details and diverse temporal patterns. Experimental results show DiST achieves state-of-the-art results on five standard FSAR datasets.
Chinese Translation
少样本动作识别(FSAR)是一项具有挑战性的任务,要求在仅有少量标记视频的情况下识别新颖的动作类别。近期的研究通常使用语义粗糙的类别名称作为辅助上下文,以指导辨别性视觉特征的学习。然而,动作名称提供的这种上下文过于有限,无法为捕捉新颖的空间和时间概念提供足够的背景知识。本文提出了DiST,一种创新的FSAR分解整合框架,利用大型语言模型提供的解耦空间和时间知识来学习表达性多粒度原型。在分解阶段,我们将普通动作名称解耦为多样的时空属性描述(与动作相关的知识)。这种常识知识从空间和时间的角度补充了语义上下文。在整合阶段,我们提出了空间/时间知识补偿器(SKC/TKC),分别用于发现辨别性的对象级和帧级原型。在SKC中,对象级原型在空间知识的指导下自适应地聚合重要的补丁标记。此外,在TKC中,帧级原型利用时间属性来辅助建模帧间的时间关系。这些学习到的原型因此在捕捉细粒度空间细节和多样的时间模式方面提供了透明性。实验结果表明,DiST在五个标准FSAR数据集上达到了最先进的结果。
cs.CV / 21 / 2602.18047

CityGuard: Graph-Aware Private Descriptors for Bias-Resilient Identity Search Across Urban Cameras

CityGuard:面向图的隐私描述符,用于跨城市摄像头的抗偏见身份搜索
Fu, Rong, Zhang, Wenxin, Meng, Yibo, Tan, Jia Yee, Lu, Jiaxuan, Lu, Rui, Wu, Jiekai, Kang, Zhaolu, Fong, Simon
Abstract
City-scale person re-identification across distributed cameras must handle severe appearance changes from viewpoint, occlusion, and domain shift while complying with data protection rules that prevent sharing raw imagery. We introduce CityGuard, a topology-aware transformer for privacy-preserving identity retrieval in decentralized surveillance. The framework integrates three components. A dispersion-adaptive metric learner adjusts instance-level margins according to feature spread, increasing intra-class compactness. Spatially conditioned attention injects coarse geometry, such as GPS or deployment floor plans, into graph-based self-attention to enable projectively consistent cross-view alignment using only coarse geometric priors without requiring survey-grade calibration. Differentially private embedding maps are coupled with compact approximate indexes to support secure and cost-efficient deployment. Together these designs produce descriptors robust to viewpoint variation, occlusion, and domain shifts, and they enable a tunable balance between privacy and utility under rigorous differential-privacy accounting. Experiments on Market-1501 and additional public benchmarks, complemented by database-scale retrieval studies, show consistent gains in retrieval precision and query throughput over strong baselines, confirming the practicality of the framework for privacy-critical urban identity matching.
Chinese Translation
城市规模的人物重识别需要在分布式摄像头之间处理由于视角、遮挡和领域转移带来的严重外观变化,同时遵循数据保护规则以防止共享原始图像。我们提出了CityGuard,一种用于去中心化监控中的隐私保护身份检索的拓扑感知变换器。该框架集成了三个组件。分散自适应度量学习器根据特征分布调整实例级边界,增加类内紧凑性。空间条件注意力将粗略几何信息(如GPS或部署平面图)注入基于图的自注意力中,以仅使用粗略几何先验实现投影一致的跨视图对齐,而无需进行调查级别的校准。差分隐私嵌入映射与紧凑的近似索引相结合,以支持安全且成本高效的部署。这些设计共同产生了对视角变化、遮挡和领域转移具有鲁棒性的描述符,并在严格的差分隐私核算下实现隐私与效用之间的可调平衡。在Market-1501和其他公共基准上的实验,以及数据库规模的检索研究,显示出在检索精度和查询吞吐量方面相较于强基线的一致提升,确认了该框架在隐私关键的城市身份匹配中的实用性。
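Differentially private embeddings are typically produced with the Gaussian mechanism: clip each descriptor's L2 norm to bound sensitivity, then add calibrated noise. A sketch under that assumption (`noise_std` stands in for the value a DP accountant would derive from the (epsilon, delta) budget; CityGuard's exact mechanism is not specified in the abstract):

```python
import math, random

def dp_embed(vec, clip_norm, noise_std, seed=None):
    """Gaussian-mechanism sketch: clip the descriptor's L2 norm,
    then add per-coordinate Gaussian noise."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [x * scale for x in vec]
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, noise_std) for x in clipped]

v = [3.0, 4.0]                       # toy descriptor, norm 5
priv = dp_embed(v, clip_norm=1.0, noise_std=0.0)
# with zero noise, only the clipping acts: the norm is capped at clip_norm
assert math.isclose(math.sqrt(sum(x * x for x in priv)), 1.0)
```

The clip norm and noise scale are exactly the knobs behind the "tunable balance between privacy and utility" the abstract mentions: more noise means stronger privacy but blunter retrieval.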
cs.CV / 22 / 2602.18057

Temporal Consistency-Aware Text-to-Motion Generation

考虑时间一致性的文本到动作生成
Wang, Hongsong, Yan, Wenjing, Lai, Qiuxia, Geng, Xin
Abstract
Text-to-Motion (T2M) generation aims to synthesize realistic human motion sequences from natural language descriptions. While two-stage frameworks leveraging discrete motion representations have advanced T2M research, they often neglect cross-sequence temporal consistency, i.e., the shared temporal structures present across different instances of the same action. This leads to semantic misalignments and physically implausible motions. To address this limitation, we propose TCA-T2M, a framework for temporal consistency-aware T2M generation. Our approach introduces a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) for cross-sequence temporal alignment, coupled with a masked motion transformer for text-conditioned motion generation. Additionally, a kinematic constraint block mitigates discretization artifacts to ensure physical plausibility. Experiments on HumanML3D and KIT-ML benchmarks demonstrate that TCA-T2M achieves state-of-the-art performance, highlighting the importance of temporal consistency in robust and coherent T2M generation.
Chinese Translation
文本到动作(T2M)生成旨在从自然语言描述合成逼真的人类动作序列。尽管利用离散动作表示的两阶段框架推动了T2M研究的发展,但它们往往忽视了跨序列的时间一致性,即同一动作不同实例之间共享的时间结构。这导致了语义的不对齐和物理上不合理的动作。为了解决这一局限性,我们提出了TCA-T2M,一个考虑时间一致性的T2M生成框架。我们的方法引入了一种时间一致性感知的空间VQ-VAE(TCaS-VQ-VAE)用于跨序列的时间对齐,并结合了一个用于文本条件下动作生成的掩蔽动作变换器。此外,一个运动学约束模块减轻了离散化伪影,以确保物理上的合理性。在HumanML3D和KIT-ML基准上的实验表明,TCA-T2M实现了最先进的性能,突显了时间一致性在稳健和连贯的T2M生成中的重要性。
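The discrete motion representation rests on a VQ-VAE quantization step: each continuous frame feature is snapped to the index of its nearest codebook entry. A toy sketch of that lookup (codebook and features are hypothetical; TCA-T2M's TCaS-VQ-VAE adds the cross-sequence temporal alignment on top):

```python
def quantize(frame_feature, codebook):
    """Vector-quantization step of a motion VQ-VAE: map a continuous frame
    feature to the index of its nearest codebook entry (squared L2)."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: d2(frame_feature, codebook[i]))

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
motion = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]   # toy per-frame features
indices = [quantize(f, codebook) for f in motion]
assert indices == [0, 1, 2]
```

The resulting index sequence is what the masked motion transformer generates conditioned on text; discretization artifacts from this snapping are what the kinematic constraint block is said to mitigate.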
cs.CV / 23 / 2602.18064

3DMedAgent: Unified Perception-to-Understanding for 3D Medical Analysis

3DMedAgent:用于3D医学分析的统一感知到理解框架
Wang, Ziyue, Cai, Linghan, Low, Chang Han, Liu, Haofeng, Wu, Junde, Wang, Jingyu, Wang, Rui, Song, Lei, Bian, Jiang, Fu, Jingjing, Jin, Yueming
Abstract
3D CT analysis spans a continuum from low-level perception to high-level clinical understanding. Existing 3D-oriented analysis methods adopt either isolated task-specific modeling or task-agnostic end-to-end paradigms to produce one-hop outputs, impeding the systematic accumulation of perceptual evidence for downstream reasoning. In parallel, recent multimodal large language models (MLLMs) exhibit improved visual perception and can integrate visual and textual information effectively, yet their predominantly 2D-oriented designs fundamentally limit their ability to perceive and analyze volumetric medical data. To bridge this gap, we propose 3DMedAgent, a unified agent that enables 2D MLLMs to perform general 3D CT analysis without 3D-specific fine-tuning. 3DMedAgent coordinates heterogeneous visual and textual tools through a flexible MLLM agent, progressively decomposing complex 3D analysis into tractable subtasks that transition from global to regional views, from 3D volumes to informative 2D slices, and from visual evidence to structured textual representations. Central to this design, 3DMedAgent maintains a long-term structured memory that aggregates intermediate tool outputs and supports query-adaptive, evidence-driven multi-step reasoning. We further introduce the DeepChestVQA benchmark for evaluating unified perception-to-understanding capabilities in 3D thoracic imaging. Experiments across over 40 tasks demonstrate that 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs, highlighting a scalable path toward general-purpose 3D clinical assistants. Code and data are available at https://github.com/jinlab-imvr/3DMedAgent.
Chinese Translation
3D CT分析涵盖了从低级感知到高级临床理解的连续体。现有的3D导向分析方法采用孤立的任务特定建模或任务无关的端到端范式来产生单步输出,这阻碍了感知证据的系统积累以支持下游推理。与此同时,最近的多模态大型语言模型(MLLMs)展现了改进的视觉感知能力,并能够有效整合视觉和文本信息,但其主要以2D为导向的设计在根本上限制了其感知和分析体积医学数据的能力。为了解决这一问题,我们提出了3DMedAgent,一个统一的代理,使得2D MLLMs能够在不进行3D特定微调的情况下执行一般的3D CT分析。3DMedAgent通过灵活的MLLM代理协调异构的视觉和文本工具,逐步将复杂的3D分析分解为可处理的子任务,这些子任务从全局视角转向区域视角,从3D体积转向信息丰富的2D切片,并从视觉证据转向结构化文本表示。该设计的核心是,3DMedAgent维护一个长期的结构化记忆,聚合中间工具输出,并支持查询自适应的、基于证据的多步推理。我们进一步引入了DeepChestVQA基准,用于评估3D胸部影像中统一感知到理解的能力。在超过40个任务的实验中,3DMedAgent始终优于一般的、医学的和3D特定的MLLMs,突显了朝向通用3D临床助手的可扩展路径。代码和数据可在 https://github.com/jinlab-imvr/3DMedAgent 获取。
cs.CV / 24 / 2602.18066

Faster Training, Fewer Labels: Self-Supervised Pretraining for Fine-Grained BEV Segmentation

更快的训练,更少的标签:用于细粒度鸟瞰视图(BEV)分割的自监督预训练
Busch, Daniel, Bohn, Christian, Kurbiel, Thomas, Friedrichs, Klaus, Meyes, Richard, Meisen, Tobias
Abstract
Dense Bird's Eye View (BEV) semantic maps are central to autonomous driving, yet current multi-camera methods depend on costly, inconsistently annotated BEV ground truth. We address this limitation with a two-phase training strategy for fine-grained road marking segmentation that removes full supervision during pretraining and halves the amount of training data during fine-tuning while still outperforming the comparable supervised baseline model. During the self-supervised pretraining, BEVFormer predictions are differentiably reprojected into the image plane and trained against multi-view semantic pseudo-labels generated by the widely used semantic segmentation model Mask2Former. A temporal loss encourages consistency across frames. The subsequent supervised fine-tuning phase requires only 50% of the dataset and significantly less training time. With our method, the fine-tuning benefits from rich priors learned during pretraining, boosting performance and BEV segmentation quality (up to +2.5pp mIoU over the fully supervised baseline) on nuScenes. It simultaneously halves the usage of annotation data and reduces total training time by up to two thirds. The results demonstrate that differentiable reprojection plus camera-perspective pseudo-labels yields transferable BEV features and a scalable path toward reduced-label autonomous perception.
Chinese Translation
密集的鸟瞰视图(BEV)语义地图是自动驾驶的核心,但当前的多摄像头方法依赖于昂贵且标注不一致的BEV真实数据。我们通过一种两阶段训练策略来解决这一限制,该策略用于细粒度道路标记分割,在预训练阶段去除了完全监督,并在微调阶段将训练数据量减少了一半,同时仍然超越了可比的监督基线模型。在自监督预训练期间,BEVFormer的预测结果被可微重投影到图像平面,并与由广泛使用的语义分割模型Mask2Former生成的多视角语义伪标签进行训练。时间损失鼓励帧间一致性。随后的监督微调阶段仅需使用50%的数据集,并显著减少训练时间。通过我们的方法,微调受益于在预训练期间学习到的丰富先验,从而提升性能和BEV分割质量(在nuScenes上相较于完全监督基线提高了最多2.5个百分点的mIoU)。它同时将标注数据的使用量减半,并将总训练时间减少了最多三分之二。结果表明,可微重投影加上相机视角伪标签能够产生可迁移的BEV特征,并为减少标签的自主感知提供了一条可扩展的路径。
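The reprojection step described above can be illustrated with a plain pinhole camera model: BEV ground-plane points are projected into the image, where BEV predictions could then be compared against 2D pseudo-labels. A minimal numpy sketch, with made-up intrinsics and camera geometry (`K`, the point coordinates, and the 1.5 m camera height are illustrative assumptions, not values from the paper):

```python
import numpy as np

# assumed pinhole intrinsics: fx = fy = 500, principal point (320, 240)
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project_bev_to_image(points_xyz, K):
    """points_xyz: (N, 3) in camera coordinates, z pointing forward."""
    uvw = (K @ points_xyz.T).T          # homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide -> (u, v)

# a 3-cell strip of BEV ground points 10/20/30 m ahead, 1.5 m below the camera
bev_points = np.array([[0.0, 1.5, 10.0],
                       [0.0, 1.5, 20.0],
                       [0.0, 1.5, 30.0]])
px = project_bev_to_image(bev_points, K)
print(np.round(px, 1))
```

Note how farther ground points land closer to the horizon (smaller v): this is the geometric consistency that a differentiable version of such a projection lets the pretraining loss exploit.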
cs.CV / 25 / 2602.18083

Comparative Assessment of Multimodal Earth Observation Data for Soil Moisture Estimation

多模态地球观测数据在土壤湿度估算中的比较评估
Kontogiorgakis, Ioannis, Askitopoulos, Athanasios, Tsardanidis, Iason, Bormpoudakis, Dimitrios, Tsoumas, Ilias, Balampanis, Fotios, Kontoes, Charalampos
Abstract
Accurate soil moisture (SM) estimation is critical for precision agriculture, water resources management and climate monitoring. Yet, existing satellite SM products are too coarse (>1km) for farm-level applications. We present a high-resolution (10m) SM estimation framework for vegetated areas across Europe, combining Sentinel-1 SAR, Sentinel-2 optical imagery and ERA5 reanalysis data through machine learning. Using 113 International Soil Moisture Network (ISMN) stations spanning diverse vegetated areas, we compare modality combinations with temporal parameterizations, using spatial cross-validation, to ensure geographic generalization. We also evaluate whether foundation model embeddings from IBM-NASA's Prithvi model improve upon traditional hand-crafted spectral features. Results demonstrate that hybrid temporal matching - Sentinel-2 current-day acquisitions with Sentinel-1 descending orbit - achieves R^2=0.514, with a 10-day ERA5 lookback window improving performance to R^2=0.518. Foundation model (Prithvi) embeddings provide negligible improvement over hand-crafted features (R^2=0.515 vs. 0.514), indicating traditional feature engineering remains highly competitive for sparse-data regression tasks. Our findings suggest that domain-specific spectral indices combined with tree-based ensemble methods offer a practical and computationally efficient solution for operational pan-European field-scale soil moisture monitoring.
Chinese Translation
准确的土壤湿度(SM)估算对于精准农业、水资源管理和气候监测至关重要。然而,现有的卫星SM产品分辨率过低(>1km),不适用于农场级应用。我们提出了一种高分辨率(10m)土壤湿度估算框架,适用于欧洲的植被区域,结合了Sentinel-1 SAR、Sentinel-2光学影像和ERA-5再分析数据,通过机器学习进行处理。利用113个国际土壤湿度网络(ISMN)站点,覆盖多样的植被区域,我们比较了不同模态组合与时间参数化,采用空间交叉验证,以确保地理泛化。我们还评估了IBM-NASA的Prithvi模型的基础模型嵌入是否能改善传统手工制作的光谱特征。结果表明,混合时间匹配——Sentinel-2当前日获取与Sentinel-1下降轨道结合——达到了R^2=0.514,10天的ERA5回顾窗口将性能提升至R^2=0.518。基础模型(Prithvi)嵌入对手工特征的改进微乎其微(R^2=0.515 vs. 0.514),表明传统特征工程在稀疏数据回归任务中仍然具有高度竞争力。我们的研究结果表明,特定领域的光谱指数结合基于树的集成方法,为全欧洲农田尺度的土壤湿度监测提供了一种实用且计算高效的解决方案。
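The spatial cross-validation protocol mentioned above (holding out whole stations so folds are geographically disjoint) can be sketched as follows. The feature set, station count, and the plain least-squares model are illustrative stand-ins for the paper's spectral features and tree ensembles:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 6))                 # stand-ins for SAR/optical/ERA5 features
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.2, size=n)
stations = rng.integers(0, 20, size=n)      # ISMN-style station IDs

def group_kfold(groups, k):
    """Yield (train, test) index arrays with whole groups held out per fold."""
    uniq = np.unique(groups)
    for held_out in np.array_split(uniq, k):
        test = np.isin(groups, held_out)
        yield np.flatnonzero(~test), np.flatnonzero(test)

scores = []
for tr, te in group_kfold(stations, 5):
    w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)   # simple linear model
    pred = X[te] @ w
    ss_res = np.sum((y[te] - pred) ** 2)
    ss_tot = np.sum((y[te] - y[te].mean()) ** 2)
    scores.append(1.0 - ss_res / ss_tot)                # fold R^2

print(f"mean spatial-CV R^2: {np.mean(scores):.3f}")
```

Because entire stations are withheld, the score estimates generalization to unseen locations rather than interpolation between samples from the same site.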
cs.CV / 26 / 2602.18089

DohaScript: A Large-Scale Multi-Writer Dataset for Continuous Handwritten Hindi Text

DohaScript:一个大规模多作者连续手写印地语文本数据集
Singh, Kunwar Arpit, Prakash, Ankush, Lone, Haroon R
Abstract
Despite Hindi's hundreds of millions of speakers, handwritten Devanagari text remains severely underrepresented in publicly available benchmark datasets. Existing resources are limited in scale, focus primarily on isolated characters or short words, and lack controlled lexical content and writer-level diversity, which restricts their utility for modern data-driven handwriting analysis. As a result, they fail to capture the continuous, fused, and structurally complex nature of Devanagari handwriting, where characters are connected through a shared shirorekha (horizontal headline) and exhibit rich ligature formations. We introduce DohaScript, a large-scale, multi-writer dataset of handwritten Hindi text collected from 531 unique contributors. The dataset is designed as a parallel stylistic corpus, in which all writers transcribe the same fixed set of six traditional Hindi dohas (couplets). This controlled design enables systematic analysis of writer-specific variation independent of linguistic content, and supports tasks such as handwriting recognition, writer identification, style analysis, and generative modeling. The dataset is accompanied by non-identifiable demographic metadata, rigorous quality curation based on objective sharpness and resolution criteria, and page-level layout difficulty annotations that facilitate stratified benchmarking. Baseline experiments demonstrate clear quality separation and strong generalization to unseen writers, highlighting the dataset's reliability and practical value. DohaScript is intended to serve as a standardized and reproducible benchmark for advancing research on continuous handwritten Devanagari text in low-resource script settings.
Chinese Translation
尽管拥有数亿说话者,手写的天城文文本在公开可用的基准数据集中仍然严重不足。现有资源规模有限,主要集中在孤立字符或短词上,缺乏受控的词汇内容和作者层面的多样性,这限制了它们在现代数据驱动的手写分析中的实用性。因此,它们未能捕捉到天城文手写的连续性、融合性和结构复杂性,其中字符通过共享的 shirorekha(水平标题线)连接,并表现出丰富的连字形式。我们推出了 DohaScript,这是一个大规模的多作者手写印地语文本数据集,收集自531位独特的贡献者。该数据集被设计为一个平行风格语料库,所有作者均转录相同的六首传统印地语 dohas(对偶诗)。这种受控设计使得可以系统地分析特定于作者的变异,而不受语言内容的影响,并支持手写识别、作者识别、风格分析和生成建模等任务。该数据集附带不可识别的人口统计元数据,基于客观的清晰度和分辨率标准进行严格的质量审核,以及页面级布局难度注释,以促进分层基准测试。基线实验表明明显的质量分离和对未见作者的强泛化能力,突显了该数据集的可靠性和实际价值。DohaScript旨在作为一个标准化和可重复的基准,推动在低资源脚本环境中对连续手写天城文文本的研究。
cs.CV / 27 / 2602.18093

Predict to Skip: Linear Multistep Feature Forecasting for Efficient Diffusion Transformers

预测以跳过:高效扩散变换器的线性多步特征预测
Cui, Hanshuai, Tang, Zhiqing, Ma, Qianli, Yao, Zhi, Jia, Weijia
Abstract
Diffusion Transformers (DiT) have emerged as a widely adopted backbone for high-fidelity image and video generation, yet their iterative denoising process incurs high computational costs. Existing training-free acceleration methods rely on feature caching and reuse under the assumption of temporal stability. However, reusing features for multiple steps may lead to latent drift and visual degradation. We observe that model outputs evolve smoothly along much of the diffusion trajectory, enabling principled predictions rather than naive reuse. Based on this insight, we propose \textbf{PrediT}, a training-free acceleration framework that formulates feature prediction as a linear multistep problem. We employ classical linear multistep methods to forecast future model outputs from historical information, combined with a corrector that activates in high-dynamics regions to prevent error accumulation. A dynamic step modulation mechanism adaptively adjusts the prediction horizon by monitoring the feature change rate. Together, these components enable substantial acceleration while preserving generation fidelity. Extensive experiments validate that our method achieves up to $5.54\times$ latency reduction across various DiT-based image and video generation models, while incurring negligible quality degradation.
Chinese Translation
扩散变换器(Diffusion Transformers, DiT)已成为高保真图像和视频生成的广泛采用的基础架构,但其迭代去噪过程带来了高计算成本。现有的无训练加速方法依赖于特征缓存和重用,假设时间稳定性。然而,针对多个步骤重用特征可能导致潜在漂移和视觉退化。我们观察到模型输出在扩散轨迹的大部分过程中平滑演变,这使得基于原则的预测成为可能,而不是简单的重用。基于这一见解,我们提出了PrediT,一种无训练加速框架,将特征预测公式化为线性多步问题。我们采用经典的线性多步方法,从历史信息中预测未来模型输出,并结合在高动态区域激活的修正器,以防止误差累积。动态步长调制机制通过监测特征变化率自适应地调整预测视野。这些组件共同实现了显著加速,同时保持生成的保真度。大量实验验证了我们的方法在各种基于DiT的图像和视频生成模型中实现了高达$5.54\times$的延迟减少,同时几乎没有质量下降。
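A toy illustration of the forecast-then-correct idea as we read the abstract: cache recent model outputs, extrapolate the next one with a linear multistep formula, and fall back to an exact computation when the recent feature change rate is high. The stand-in model, the threshold, and the specific 2-step rule are assumptions made for the sketch, not PrediT's actual configuration:

```python
import numpy as np

def true_output(t):
    # stand-in for an expensive DiT forward pass; smooth along the trajectory
    return np.array([np.sin(0.1 * t), np.cos(0.1 * t)])

def predict_or_compute(history, t, thresh=0.5):
    """2-step linear extrapolation f_pred = 2*f_{t-1} - f_{t-2};
    recompute exactly when the recent change rate exceeds thresh."""
    if len(history) < 2:
        return true_output(t), True            # not enough history: compute
    f1, f2 = history[-2], history[-1]
    if np.linalg.norm(f2 - f1) > thresh:       # high-dynamics region: corrector
        return true_output(t), True
    return 2 * f2 - f1, False                  # skipped the expensive call

history, computed = [], 0
for t in range(20):
    f, did_compute = predict_or_compute(history, t)
    history.append(f)
    computed += did_compute

print(f"expensive calls: {computed}/20")
```

On this smooth trajectory only the first two steps pay for a real forward pass; every later step is extrapolated, which is the source of the latency reduction (at the cost of drift that the corrector is meant to bound).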
cs.CV / 28 / 2602.18094

OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models

OODBench:大规模视觉-语言模型的分布外基准测试
Lin, Ling, Bai, Yang, Su, Heng, Zhu, Congcong, Wang, Yaoxing, Zhou, Yang, Fu, Huazhu, Chen, Jingrun
Abstract
Existing Visual-Language Models (VLMs) have achieved significant progress by being trained on massive-scale datasets, typically under the assumption that data are independent and identically distributed (IID). However, in real-world scenarios, it is often impractical to expect that all data processed by an AI system satisfy this assumption. Furthermore, failure to appropriately handle out-of-distribution (OOD) objects may introduce safety risks in real-world applications (e.g., autonomous driving or medical assistance). Unfortunately, current research has not yet provided valid benchmarks that can comprehensively assess the performance of VLMs in response to OOD data. Therefore, we propose OODBench, a predominantly automated method with minimal human verification, for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance-category pairs, and we show that current VLMs still exhibit notable performance degradation on OODBench, even when the underlying image categories are common. In addition, we propose a reliable automated assessment metric that employs a Basic-to-Advanced Progression of prompted questions to assess the impact of OOD data on questions of varying difficulty more fully. Lastly, we summarize substantial findings and insights to facilitate future research in the acquisition and evaluation of OOD data.
Chinese Translation
现有的视觉-语言模型(VLMs)通过在大规模数据集上训练取得了显著进展,通常假设数据是独立同分布(IID)的。然而,在现实世界场景中,期望所有由人工智能系统处理的数据都满足这一假设往往是不切实际的。此外,未能适当地处理分布外(OOD)对象可能会在现实应用中引入安全风险(例如,自动驾驶或医疗辅助)。不幸的是,目前的研究尚未提供有效的基准测试,能够全面评估VLMs在应对OOD数据时的表现。因此,我们提出了OODBench,这是一种以自动化为主、人工验证最小化的方法,用于构建新的基准测试并评估VLMs处理OOD数据的能力。OODBench包含40K实例级OOD实例-类别对,我们展示了当前的VLMs在OODBench上仍然表现出显著的性能下降,即使基础图像类别是常见的。此外,我们提出了一种可靠的自动评估指标,采用基本到高级的问题进展,来更全面地评估OOD数据对不同难度问题的影响。最后,我们总结了重要的发现和见解,以促进未来在获取和评估OOD数据方面的研究。
cs.CV / 29 / 2602.18178

Evaluating Graphical Perception Capabilities of Vision Transformers

评估视觉变换器的图形感知能力
Poonam, Poonam, Vázquez, Pere-Pau, Ropinski, Timo
Abstract
Vision Transformers (ViTs) have emerged as a powerful alternative to convolutional neural networks (CNNs) in a variety of image-based tasks. While CNNs have previously been evaluated for their ability to perform graphical perception tasks, which are essential for interpreting visualizations, the perceptual capabilities of ViTs remain largely unexplored. In this work, we investigate the performance of ViTs in elementary visual judgment tasks inspired by the foundational studies of Cleveland and McGill, which quantified the accuracy of human perception across different visual encodings. Following their methodology, we benchmark ViTs against CNNs and human participants in a series of controlled graphical perception tasks. Our results reveal that, although ViTs demonstrate strong performance in general vision tasks, their alignment with human-like graphical perception in the visualization domain is limited. This study highlights key perceptual gaps and points to important considerations for the application of ViTs in visualization systems and graphical perceptual modeling.
Chinese Translation
视觉变换器(Vision Transformers, ViTs)已成为卷积神经网络(Convolutional Neural Networks, CNNs)在多种基于图像任务中的一种强有力的替代方案。尽管CNNs在执行图形感知任务方面的能力已经得到评估,而这些任务对于解释可视化至关重要,但ViTs的感知能力仍然在很大程度上未被探索。在本研究中,我们调查了ViTs在基础视觉判断任务中的表现,这些任务受到Cleveland和McGill的基础研究的启发,该研究量化了人类在不同视觉编码下的感知准确性。我们根据他们的研究,将ViTs与CNNs和人类参与者在一系列受控的图形感知任务中进行了基准测试。我们的结果表明,尽管ViTs在一般视觉任务中表现出强大的性能,但它们在可视化领域与人类图形感知的一致性有限。本研究突出了关键的感知差距,并指出了在可视化系统和图形感知建模中应用ViTs时的重要考虑因素。
cs.CV / 30 / 2602.18193

BLM-Guard: Explainable Multimodal Ad Moderation with Chain-of-Thought and Policy-Aligned Rewards

BLM-Guard:基于链式思维和政策一致奖励的可解释多模态广告审核
Yang, Yiran, Liu, Zhaowei, Yuan, Yuan, Song, Yukun, Ma, Xiong, Song, Yinghao, Zeng, Xiangji, Sun, Lu, Wang, Yulu, Zhou, Hai, Cui, Shuai, Gong, Zhaohan, Zhang, Jiefei
Abstract
Short-video platforms now host vast multimodal ads whose deceptive visuals, speech and subtitles demand finer-grained, policy-driven moderation than community safety filters. We present BLM-Guard, a content-audit framework for commercial ads that fuses Chain-of-Thought reasoning with rule-based policy principles and a critic-guided reward. A rule-driven ICoT data-synthesis pipeline jump-starts training by generating structured scene descriptions, reasoning chains and labels, cutting annotation costs. Reinforcement learning then refines the model using a composite reward balancing causal coherence with policy adherence. A multitask architecture models intra-modal manipulations (e.g., exaggerated imagery) and cross-modal mismatches (e.g., subtitle-speech drift), boosting robustness. Experiments on real short-video ads show BLM-Guard surpasses strong baselines in accuracy, consistency and generalization.
Chinese Translation
短视频平台现在承载着大量多模态广告,其具有误导性的视觉效果、语音和字幕要求比社区安全过滤器更细致的、政策驱动的审核。我们提出了BLM-Guard,这是一种商业广告内容审核框架,融合了链式思维推理、基于规则的政策原则和批评者引导的奖励。一个基于规则的ICoT数据合成管道通过生成结构化场景描述、推理链和标签来启动训练,从而降低注释成本。然后,强化学习利用复合奖励来优化模型,该奖励在因果一致性和政策遵循之间取得平衡。多任务架构建模了模态内部操控(例如,夸张的图像)和跨模态不匹配(例如,字幕与语音的偏差),增强了鲁棒性。在真实短视频广告上的实验表明,BLM-Guard在准确性、一致性和泛化能力上超越了强基线。
cs.CV / 31 / 2602.18199

A Self-Supervised Approach on Motion Calibration for Enhancing Physical Plausibility in Text-to-Motion

一种自监督的运动校准方法以增强文本到运动中的物理合理性
Shim, Gahyeon, Park, Soogeun, Ahn, Hyemin
Abstract
Generating semantically aligned human motion from textual descriptions has made rapid progress, but ensuring both semantic and physical realism in motion remains a challenge. In this paper, we introduce the Distortion-aware Motion Calibrator (DMC), a post-hoc module that refines physically implausible motions (e.g., foot floating) while preserving semantic consistency with the original textual description. Rather than relying on complex physical modeling, we propose a self-supervised and data-driven approach, whereby DMC learns to obtain physically plausible motions when an intentionally distorted motion and the original textual descriptions are given as inputs. We evaluate DMC as a post-hoc module to improve motions obtained from various text-to-motion generation models and demonstrate its effectiveness in improving physical plausibility while enhancing semantic consistency. The experimental results show that DMC reduces FID score by 42.74% on T2M and 13.20% on T2M-GPT, while also achieving the highest R-Precision. When applied to high-quality models like MoMask, DMC improves the physical plausibility of motions by reducing penetration by 33.0% as well as adjusting floating artifacts closer to the ground-truth reference. These results highlight that DMC can serve as a promising post-hoc motion refinement framework for any kind of text-to-motion models by incorporating textual semantics and physical plausibility.
Chinese Translation
从文本描述生成语义对齐的人类运动已经取得了快速进展,但确保运动的语义和物理现实性仍然是一个挑战。在本文中,我们引入了扭曲感知运动校准器(Distortion-aware Motion Calibrator,DMC),这是一个后处理模块,旨在在保持与原始文本描述的语义一致性的同时,修正物理上不合理的运动(例如,脚悬浮)。我们提出了一种自监督和数据驱动的方法,DMC在给定故意扭曲的运动和原始文本描述作为输入时,学习获得物理上合理的运动。我们评估DMC作为后处理模块,以改善从各种文本到运动生成模型获得的运动,并展示其在提高物理合理性的同时增强语义一致性方面的有效性。实验结果表明,DMC在T2M上将FID得分降低了42.74%,在T2M-GPT上降低了13.20%,同时实现了最高的R-Precision。当应用于像MoMask这样的高质量模型时,DMC通过减少33.0%的穿透现象以及将悬浮伪影调整得更接近真实参考,改善了运动的物理合理性。这些结果突显了DMC可以作为一种有前景的后处理运动精炼框架,适用于任何类型的文本到运动模型,通过结合文本语义和物理合理性。
cs.CV / 32 / 2602.18252

On the Adversarial Robustness of Discrete Image Tokenizers

离散图像标记器的对抗鲁棒性研究
Bhagwatkar, Rishika, Rish, Irina, Flammarion, Nicolas, Croce, Francesco
Abstract
Discrete image tokenizers encode visual inputs as sequences of tokens from a finite vocabulary and are gaining popularity in multimodal systems, including encoder-only, encoder-decoder, and decoder-only models. However, unlike CLIP encoders, their vulnerability to adversarial attacks has not been explored. As the first work to study this topic, we first formulate attacks that aim to perturb the features extracted by discrete tokenizers, and thus change the extracted tokens. These attacks are computationally efficient, application-agnostic, and effective across classification, multimodal retrieval, and captioning tasks. Second, to defend against this vulnerability, inspired by recent work on robust CLIP encoders, we fine-tune popular tokenizers with unsupervised adversarial training, keeping all other components frozen. While unsupervised and task-agnostic, our approach significantly improves robustness to both unsupervised and end-to-end supervised attacks and generalizes well to unseen tasks and data. Unlike supervised adversarial training, our approach can leverage unlabeled images, making it more versatile. Overall, our work highlights the critical role of tokenizer robustness in downstream tasks and presents an important step in the development of safe multimodal foundation models.
Chinese Translation
离散图像标记器将视觉输入编码为有限词汇表中的标记序列,并在包括仅编码器、编码器-解码器和仅解码器模型的多模态系统中日益受到欢迎。然而,与 CLIP 编码器不同的是,它们对对抗攻击的脆弱性尚未得到探讨。作为首个研究该主题的工作,我们首先制定了旨在扰动离散标记器提取的特征,从而改变提取的标记的攻击。这些攻击在计算上高效,适用于多种应用,并在分类、多模态检索和图像描述任务中表现出有效性。其次,为了防御这种脆弱性,受到近期关于鲁棒 CLIP 编码器工作的启发,我们通过无监督对抗训练微调了流行的标记器,同时保持其他组件不变。尽管是无监督且与任务无关,我们的方法显著提高了对无监督和端到端监督攻击的鲁棒性,并且在未见过的任务和数据上具有良好的泛化能力。与监督对抗训练不同,我们的方法可以利用未标记的图像,使其更加灵活。总体而言,我们的工作突显了标记器鲁棒性在下游任务中的关键作用,并为安全多模态基础模型的发展迈出了重要一步。
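The unsupervised attack objective described above (perturb the input so the tokenizer's features, and hence its tokens, move) can be sketched with a toy linear encoder, for which the PGD gradient is analytic. The encoder, codebook, budget, and step size are all illustrative assumptions; a real attack would backprop through the VQ encoder:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))            # toy "encoder": R^16 -> R^8
codebook = rng.normal(size=(32, 8))     # 32 discrete token embeddings

def encode(x):
    return W @ x

def tokenize(x):
    # nearest-codebook-entry assignment, as in vector quantization
    return int(np.argmin(np.linalg.norm(codebook - encode(x), axis=1)))

x = rng.normal(size=16)
f_clean = encode(x)

# PGD ascent on ||encode(x + delta) - f_clean||^2 under an L-inf budget eps;
# for this linear encoder the gradient wrt delta is 2 W^T (encode(x+delta) - f_clean)
eps, step = 0.1, 0.02
delta = rng.uniform(-eps / 10, eps / 10, size=16)   # small random start
for _ in range(20):
    g = 2 * W.T @ (encode(x + delta) - f_clean)
    delta = np.clip(delta + step * np.sign(g), -eps, eps)

shift = np.linalg.norm(encode(x + delta) - f_clean)
print("feature shift:", round(float(shift), 2),
      "| clean token:", tokenize(x), "adv token:", tokenize(x + delta))
```

The attack never needs labels or a downstream task: it only needs the encoder's features, which is why the paper can apply one formulation across classification, retrieval, and captioning.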
cs.CV / 33 / 2602.18282

DEIG: Detail-Enhanced Instance Generation with Fine-Grained Semantic Control

DEIG:具有细粒度语义控制的细节增强实例生成
Du, Shiyan, Yue, Conghan, Cheng, Xinyu, Zhang, Dongyu
Abstract
Multi-Instance Generation has advanced significantly in spatial placement and attribute binding. However, existing approaches still face challenges in fine-grained semantic understanding, particularly when dealing with complex textual descriptions. To overcome these limitations, we propose DEIG, a novel framework for fine-grained and controllable multi-instance generation. DEIG integrates an Instance Detail Extractor (IDE) that transforms text encoder embeddings into compact, instance-aware representations, and a Detail Fusion Module (DFM) that applies instance-based masked attention to prevent attribute leakage across instances. These components enable DEIG to generate visually coherent multi-instance scenes that precisely match rich, localized textual descriptions. To support fine-grained supervision, we construct a high-quality dataset with detailed, compositional instance captions generated by VLMs. We also introduce DEIG-Bench, a new benchmark with region-level annotations and multi-attribute prompts for both humans and objects. Experiments demonstrate that DEIG consistently outperforms existing approaches across multiple benchmarks in spatial consistency, semantic accuracy, and compositional generalization. Moreover, DEIG functions as a plug-and-play module, making it easily integrable into standard diffusion-based pipelines.
Chinese Translation
多实例生成在空间布局和属性绑定方面取得了显著进展。然而,现有方法在细粒度语义理解方面仍面临挑战,尤其是在处理复杂的文本描述时。为了解决这些局限性,我们提出了DEIG,一个用于细粒度和可控多实例生成的新框架。DEIG集成了实例细节提取器(Instance Detail Extractor, IDE),该模块将文本编码器嵌入转换为紧凑的、实例感知的表示,以及细节融合模块(Detail Fusion Module, DFM),该模块应用基于实例的掩码注意力,以防止属性在实例之间泄漏。这些组件使DEIG能够生成视觉上连贯的多实例场景,准确匹配丰富的、局部的文本描述。为了支持细粒度监督,我们构建了一个高质量的数据集,其中包含由视觉语言模型(VLMs)生成的详细组合实例标题。我们还引入了DEIG-Bench,一个新的基准,具有区域级注释和针对人类与物体的多属性提示。实验表明,DEIG在空间一致性、语义准确性和组合泛化等多个基准上始终优于现有方法。此外,DEIG作为一个即插即用模块,易于集成到标准的基于扩散的管道中。
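The instance-based masked attention attributed to the DFM can be sketched directly: each spatial query is only allowed to attend to the text tokens of its own instance, so attributes cannot leak across instances. The shapes and the two-instance layout below are illustrative assumptions:

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention; mask[i, j] = True where query i may attend to key j."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    scores = np.where(mask, scores, -np.inf)            # forbid cross-instance links
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
d = 4
Q = rng.normal(size=(6, d))             # 6 spatial queries
K = rng.normal(size=(4, d))             # 4 text tokens: 2 per instance
V = rng.normal(size=(4, d))

# queries 0-2 lie in instance A's region (tokens 0-1), queries 3-5 in B's (tokens 2-3)
mask = np.zeros((6, 4), dtype=bool)
mask[:3, :2] = True
mask[3:, 2:] = True

out, w = masked_attention(Q, K, V, mask)
print(np.round(w, 2))
```

The attention matrix is exactly block-diagonal by construction: instance A's queries carry zero weight on instance B's tokens, which is the mechanism claimed to prevent attribute leakage.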
cs.CV / 34 / 2602.18309

Multi-Level Conditioning by Pairing Localized Text and Sketch for Fashion Image Generation

通过配对局部文本和草图进行多层次条件化的时尚图像生成
Liu, Ziyue, Talon, Davide, Girella, Federico, Ruan, Zanxi, Mondo, Mattia, Bazzani, Loris, Wang, Yiming, Cristani, Marco
Abstract
Sketches offer designers a concise yet expressive medium for early-stage fashion ideation by specifying structure, silhouette, and spatial relationships, while textual descriptions complement sketches to convey material, color, and stylistic details. Effectively combining textual and visual modalities requires adherence to the sketch visual structure when leveraging the guidance of localized attributes from text. We present LOcalized Text and Sketch with multi-level guidance (LOTS), a framework that enhances fashion image generation by combining global sketch guidance with multiple localized sketch-text pairs. LOTS employs a Multi-level Conditioning Stage to independently encode local features within a shared latent space while maintaining global structural coordination. Then, the Diffusion Pair Guidance stage integrates both local and global conditioning via attention-based guidance within the diffusion model's multi-step denoising process. To validate our method, we develop Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Sketchy provides high-quality, clean sketches with a professional look and consistent structure. To assess robustness beyond this setting, we also include an "in the wild" split with non-expert sketches, featuring higher variability and imperfections. Experiments demonstrate that our method strengthens global structural adherence while leveraging richer localized semantic guidance, achieving improvement over state-of-the-art. The dataset, platform, and code are publicly available.
Chinese Translation
草图为设计师提供了一种简洁而富有表现力的媒介,用于早期阶段的时尚构思,能够指定结构、轮廓和空间关系,而文本描述则补充草图,传达材料、颜色和风格细节。有效地结合文本和视觉模态需要在利用文本的局部属性指导时遵循草图的视觉结构。我们提出了局部文本与草图的多层次指导框架(LOcalized Text and Sketch with multi-level guidance,简称 LOTS),该框架通过结合全球草图指导与多个局部草图-文本对,增强时尚图像生成。LOTS 采用多层次条件化阶段,在共享潜在空间中独立编码局部特征,同时保持全球结构协调。随后,扩散对偶指导阶段通过基于注意力的指导,将局部和全球条件整合到扩散模型的多步去噪过程中。为了验证我们的方法,我们开发了 Sketchy,这是第一个每幅图像提供多个文本-草图对的时尚数据集。Sketchy 提供高质量、干净且具有专业外观和一致结构的草图。为了评估在此设置之外的稳健性,我们还包含了一个“野外”分割,其中包含非专业草图,具有更高的变异性和缺陷。实验表明,我们的方法在利用更丰富的局部语义指导的同时,增强了全球结构的一致性,取得了超过现有最先进技术的改进。数据集、平台和代码均已公开。
cs.CV / 35 / 2602.18314

Diff2DGS: Reliable Reconstruction of Occluded Surgical Scenes via 2D Gaussian Splatting

Diff2DGS:通过2D高斯喷溅可靠重建被遮挡的外科场景
Song, Tianyi, Stoyanov, Danail, Mazomenos, Evangelos, Vasconcelos, Francisco
Abstract
Real-time reconstruction of deformable surgical scenes is vital for advancing robotic surgery, improving surgeon guidance, and enabling automation. Recent methods achieve dense reconstructions from da Vinci robotic surgery videos, with Gaussian Splatting (GS) offering real-time performance via graphics acceleration. However, reconstruction quality in occluded regions remains limited, and depth accuracy has not been fully assessed, as benchmarks like EndoNeRF and StereoMIS lack 3D ground truth. We propose Diff2DGS, a novel two-stage framework for reliable 3D reconstruction of occluded surgical scenes. In the first stage, a diffusion-based video module with temporal priors inpaints tissue occluded by instruments with high spatial-temporal consistency. In the second stage, we adapt 2D Gaussian Splatting (2DGS) with a Learnable Deformation Model (LDM) to capture dynamic tissue deformation and anatomical geometry. We also extend evaluation beyond prior image-quality metrics by performing quantitative depth accuracy analysis on the SCARED dataset. Diff2DGS outperforms state-of-the-art approaches in both appearance and geometry, reaching 38.02 dB PSNR on EndoNeRF and 34.40 dB on StereoMIS. Furthermore, our experiments demonstrate that optimizing for image quality alone does not necessarily translate into optimal 3D reconstruction accuracy. To address this, we further optimize the depth quality of the reconstructed 3D results, ensuring more faithful geometry in addition to high-fidelity appearance.
Chinese Translation
实时重建可变形外科场景对于推进机器人手术、改善外科医生指导和实现自动化至关重要。最近的方法通过达芬奇机器人手术视频实现了密集重建,其中高斯喷溅(Gaussian Splatting,GS)通过图形加速提供了实时性能。然而,在被遮挡区域的重建质量仍然有限,深度准确性尚未得到充分评估,因为像EndoNeRF和StereoMIS这样的基准缺乏3D真实数据。我们提出了Diff2DGS,这是一种新颖的两阶段框架,用于可靠的3D重建被遮挡的外科场景。在第一阶段,基于扩散的视频模块利用时间先验以高空间-时间一致性填补被器械遮挡的组织。在第二阶段,我们采用带有可学习变形模型(Learnable Deformation Model,LDM)的2D高斯喷溅(2DGS)来捕捉动态组织变形和解剖几何。我们还通过对SCARED数据集进行定量深度准确性分析,扩展了评估超越以往图像质量指标。Diff2DGS在外观和几何上均优于最先进的方法,在EndoNeRF上达到38.02 dB PSNR,在StereoMIS上达到34.40 dB。此外,我们的实验表明,仅优化图像质量并不一定能转化为最佳的3D重建准确性。为了解决这个问题,我们进一步优化了重建3D结果的深度质量,确保在高保真外观的基础上实现更真实的几何形状。
cs.CV / 36 / 2602.18322

Unifying Color and Lightness Correction with View-Adaptive Curve Adjustment for Robust 3D Novel View Synthesis

通过视适应曲线调整统一颜色和亮度校正以实现稳健的3D新视图合成
Cui, Ziteng, Liu, Shuhong, Dong, Xiaoyu, Chu, Xuangeng, Gu, Lin, Yang, Ming-Hsuan, Harada, Tatsuya
Abstract
High-quality image acquisition in real-world environments remains challenging due to complex illumination variations and inherent limitations of camera imaging pipelines. These issues are exacerbated in multi-view capture, where differences in lighting, sensor responses, and image signal processor (ISP) configurations introduce photometric and chromatic inconsistencies that violate the assumptions of photometric consistency underlying modern 3D novel view synthesis (NVS) methods, including Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), leading to degraded reconstruction and rendering quality. We propose Luminance-GS++, a 3DGS-based framework for robust NVS under diverse illumination conditions. Our method combines a globally view-adaptive lightness adjustment with a local pixel-wise residual refinement for precise color correction. We further design unsupervised objectives that jointly enforce lightness correction and multi-view geometric and photometric consistency. Extensive experiments demonstrate state-of-the-art performance across challenging scenarios, including low-light, overexposure, and complex luminance and chromatic variations. Unlike prior approaches that modify the underlying representation, our method preserves the explicit 3DGS formulation, improving reconstruction fidelity while maintaining real-time rendering efficiency.
Chinese Translation
在真实环境中高质量图像获取仍然面临挑战,主要由于复杂的照明变化和相机成像管道的固有限制。在多视图捕捉中,这些问题更加严重,因为照明、传感器响应和图像信号处理器(ISP)配置的差异引入了光度和色度不一致性,违反了现代3D新视图合成(NVS)方法(包括神经辐射场(NeRF)和3D高斯点云(3DGS))所依赖的光度一致性假设,导致重建和渲染质量下降。我们提出了Luminance-GS++,一个基于3DGS的框架,用于在多样照明条件下实现稳健的NVS。我们的方法结合了全局视适应的亮度调整和局部像素级的残差精细化,以实现精确的颜色校正。我们进一步设计了无监督目标,联合强制执行亮度校正以及多视图几何和光度一致性。大量实验表明,在低光、过曝以及复杂亮度和色度变化等具有挑战性的场景中,我们的方法表现出最先进的性能。与之前修改基础表示的方法不同,我们的方法保留了明确的3DGS公式,提高了重建保真度,同时保持实时渲染效率。
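A minimal sketch of the two-part correction described above, assuming a per-view gamma curve as the global lightness adjustment and an additive per-pixel residual as the local color refinement (the actual curve parameterization in Luminance-GS++ may differ):

```python
import numpy as np

def adjust(img, gamma, residual):
    """Global tone curve (per-view gamma) followed by local per-pixel residual correction."""
    curved = np.clip(img, 0.0, 1.0) ** gamma
    return np.clip(curved + residual, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.uniform(0, 1, size=(4, 4, 3))
dark = img * 0.3                                # simulate an under-exposed view
residual = np.zeros_like(img)                   # residual left at zero for clarity
brightened = adjust(dark, gamma=0.45, residual=residual)
print(round(float(dark.mean()), 3), "->", round(float(brightened.mean()), 3))
```

A gamma below 1 lifts shadows globally while the residual term carries whatever view-specific color mismatch the curve alone cannot express; crucially, both act on rendered colors, leaving the underlying 3DGS representation untouched.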
cs.CV / 37 / 2602.18329

G-LoG Bi-filtration for Medical Image Classification

用于医学图像分类的 G-LoG 双重滤波
Wang, Qingsong, He, Jiaxing, Hou, Bingzhe, Wu, Tieru, Cao, Yang, Yao, Cailing
Abstract
Building practical filtrations on objects to detect topological and geometric features is an important task in the field of Topological Data Analysis (TDA). In this paper, leveraging the ability of the Laplacian of Gaussian operator to enhance the boundaries of medical images, we define the G-LoG (Gaussian-Laplacian of Gaussian) bi-filtration to generate features better suited to multi-parameter persistence modules. Modeling volumetric images as bounded functions, we prove that the interleaving distance between the persistence modules obtained from our bi-filtrations is stable with respect to the maximum norm of the bounded functions. Finally, we conduct experiments on the MedMNIST dataset, comparing our bi-filtration against single-parameter filtration and established deep learning baselines, including Google AutoML Vision, ResNet, AutoKeras and auto-sklearn. Experimental results demonstrate that our bi-filtration significantly outperforms single-parameter filtration. Notably, a simple Multi-Layer Perceptron (MLP) trained on the topological features generated by our bi-filtration achieves performance comparable to complex deep learning models trained on the original dataset.
Chinese Translation
在拓扑数据分析(TDA)领域,构建实用的滤波器以检测对象的拓扑和几何特征是一项重要任务。本文利用高斯拉普拉斯算子(Laplacian of Gaussian)增强医学图像边界的能力,定义了 G-LoG(高斯-拉普拉斯高斯)双重滤波,以生成更适合多参数持久性模块的特征。通过将体积图像建模为有界函数,我们证明了从我们在有界函数上获得的双重滤波中得到的持久性模块的交错距离在有界函数的最大范数下是稳定的。最后,我们在 MedMNIST 数据集上进行了实验,将我们的双重滤波与单参数滤波和已建立的深度学习基线进行比较,包括 Google AutoML Vision、ResNet、AutoKeras 和 auto-sklearn。实验结果表明,我们的双重滤波显著优于单参数滤波。值得注意的是,基于我们双重滤波生成的拓扑特征训练的简单多层感知器(MLP)在性能上可与在原始数据集上训练的复杂深度学习模型相媲美。
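The two filtration parameters can be pictured as Gaussian smoothing strength plus a Laplacian-of-Gaussian response. A numpy-only sketch on a toy disk image, using a separable Gaussian blur and a 5-point discrete Laplacian (the kernel radius and the toy image are illustrative choices, not the paper's setup):

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(img, sigma):
    # separable Gaussian: convolve rows, then columns
    k = gaussian_kernel1d(sigma, radius=int(3 * sigma))
    img = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, img)

def log_response(img, sigma):
    # Laplacian of the Gaussian-smoothed image (5-point stencil on the interior)
    g = blur(img, sigma)
    lap = np.zeros_like(g)
    lap[1:-1, 1:-1] = (g[:-2, 1:-1] + g[2:, 1:-1] +
                       g[1:-1, :-2] + g[1:-1, 2:] - 4 * g[1:-1, 1:-1])
    return lap

# toy "medical image": a bright disk on a dark background
yy, xx = np.mgrid[:64, :64]
img = ((yy - 32)**2 + (xx - 32)**2 < 15**2).astype(float)

resp = log_response(img, sigma=2.0)
print("center:", float(np.abs(resp)[32, 32]),
      "| boundary:", float(np.abs(resp[32, 15:20]).max()))
```

The LoG magnitude vanishes in flat regions and peaks near the disk boundary, which is exactly the boundary-enhancing behavior the bi-filtration exploits as its second parameter.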
cs.CV / 38 / 2602.18394

Self-Aware Object Detection via Degradation Manifolds

通过降级流形实现自我意识的目标检测
Becker, Stefan, Weiss, Simon, Hübner, Wolfgang, Arens, Michael
Abstract
Object detectors achieve strong performance under nominal imaging conditions but can fail silently when exposed to blur, noise, compression, adverse weather, or resolution changes. In safety-critical settings, it is therefore insufficient to produce predictions without assessing whether the input remains within the detector's nominal operating regime. We refer to this capability as self-aware object detection. We introduce a degradation-aware self-awareness framework based on degradation manifolds, which explicitly structure a detector's feature space according to image degradation rather than semantic content. Our method augments a standard detection backbone with a lightweight embedding head trained via multi-layer contrastive learning. Images sharing the same degradation composition are pulled together, while differing degradation configurations are pushed apart, yielding a geometrically organized representation that captures degradation type and severity without requiring degradation labels or explicit density modeling. To anchor the learned geometry, we estimate a pristine prototype from clean training embeddings, defining a nominal operating point in representation space. Self-awareness emerges as geometric deviation from this reference, providing an intrinsic, image-level signal of degradation-induced shift that is independent of detection confidence. Extensive experiments on synthetic corruption benchmarks, cross-dataset zero-shot transfer, and natural weather-induced distribution shifts demonstrate strong pristine-degraded separability, consistent behavior across multiple detector architectures, and robust generalization under semantic shift. These results suggest that degradation-aware representation geometry provides a practical and detector-agnostic foundation.
Chinese Translation
目标检测器在正常成像条件下表现出色,但在模糊、噪声、压缩、不利天气或分辨率变化等情况下可能会悄然失效。因此,在安全关键的环境中,仅仅产生预测而不评估输入是否仍在检测器的正常操作范围内是不够的。我们将这种能力称为自我意识目标检测。我们引入了一种基于降级流形的降级感知自我意识框架,该框架根据图像降级而非语义内容明确构建检测器的特征空间。我们的方法通过多层对比学习增强了标准检测骨干,配备了轻量级嵌入头。共享相同降级组成的图像被拉近,而不同降级配置的图像则被推远,从而产生几何上有序的表示,捕捉降级类型和严重性,而无需降级标签或显式密度建模。为了锚定学习到的几何,我们从干净的训练嵌入中估计出一个原始原型,定义表示空间中的正常操作点。自我意识表现为与该参考点的几何偏差,提供了一种内在的、图像级别的降级引起的变化信号,这一信号独立于检测置信度。在合成损坏基准、跨数据集零样本迁移和自然天气引起的分布变化上的广泛实验表明,原始与降级之间具有强大的可分离性,在多种检测器架构中表现一致,并在语义变化下具有强大的泛化能力。这些结果表明,降级感知表示几何提供了一个实用的、与检测器无关的基础。
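The pristine-prototype scoring described above reduces to a simple recipe: average the embeddings of clean images to anchor a nominal operating point, then score any input by its distance to that point. Here the learned contrastive embedding is replaced by a hand-made statistical stand-in, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(img):
    # toy stand-in for the learned embedding head: summary statistics
    # that react to noise/blur-style degradations
    return np.array([img.mean(), img.std(), np.abs(np.diff(img)).mean()])

# estimate the pristine prototype from clean training embeddings
clean = [rng.uniform(0, 1, 256) for _ in range(50)]
prototype = np.mean([embed(x) for x in clean], axis=0)

def awareness_score(img):
    """Geometric deviation from the nominal operating point."""
    return float(np.linalg.norm(embed(img) - prototype))

x_clean = rng.uniform(0, 1, 256)
x_noisy = x_clean + rng.normal(scale=0.5, size=256)     # simulated degradation
print(round(awareness_score(x_clean), 3), round(awareness_score(x_noisy), 3))
```

The score is intrinsic to the representation: it flags degradation-induced shift without ever consulting detection confidences, which is what makes it usable as an image-level self-awareness signal.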
cs.CV / 39 / 2602.18406

Latent Equivariant Operators for Robust Object Recognition: Promise and Challenges

用于鲁棒物体识别的潜在等变算子:前景与挑战
Dinh, Minh, Deny, Stéphane
Abstract
Despite the successes of deep learning in computer vision, difficulties persist in recognizing objects that have undergone group-symmetric transformations rarely seen during training, for example objects seen in unusual poses, scales, positions, or combinations thereof. Equivariant neural networks are a solution to the problem of generalizing across symmetric transformations, but require knowledge of transformations a priori. An alternative family of architectures proposes to learn equivariant operators in a latent space from examples of symmetric transformations. Here, using simple datasets of rotated and translated noisy MNIST, we illustrate how such architectures can successfully be harnessed for out-of-distribution classification, thus overcoming the limitations of both traditional and equivariant networks. While conceptually enticing, we discuss challenges ahead on the path of scaling these architectures to more complex datasets.
Chinese Translation
尽管深度学习在计算机视觉领域取得了成功,但在识别经历过训练中罕见的群体对称变换的物体时仍然存在困难——例如,处于不寻常姿势、尺度、位置或其组合的物体。等变神经网络是解决跨对称变换进行泛化问题的一种方法,但需要事先了解变换。另一类架构则提出从对称变换的示例中学习潜在空间中的等变算子。在这里,我们使用简单的旋转和位移的噪声MNIST数据集,展示了如何成功利用这些架构进行分布外分类,从而克服传统网络和等变网络的局限性。尽管在概念上引人注目,我们讨论了在将这些架构扩展到更复杂数据集的过程中面临的挑战。
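The core idea above, learning a latent operator from example pairs of a symmetric transformation, can be shown in miniature. Here the "fit" is trivial because the example inputs are basis vectors (a general fit would use least squares); the 2-D latents and the 90-degree rotation are invented for illustration:

```python
# Example pairs (z, z') of latent codes under an unknown transform,
# here a 90-degree rotation.
pairs = [
    ((1.0, 0.0), (0.0, 1.0)),
    ((0.0, 1.0), (-1.0, 0.0)),
]

# With basis-vector inputs, the columns of the operator W are simply
# the observed outputs.
W = [[pairs[0][1][0], pairs[1][1][0]],
     [pairs[0][1][1], pairs[1][1][1]]]

def apply_op(W, z):
    return (W[0][0] * z[0] + W[0][1] * z[1],
            W[1][0] * z[0] + W[1][1] * z[1])

# The learned operator generalizes to latents unseen during "training".
z_new = (2.0, 3.0)
assert apply_op(W, z_new) == (-3.0, 2.0)
# Applying it four times returns to the identity (a C4 symmetry).
z = z_new
for _ in range(4):
    z = apply_op(W, z)
assert z == z_new
```

No knowledge of the transformation group is assumed a priori; only the transformed pairs are used, which is the contrast with classical equivariant networks drawn in the abstract.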
cs.CV / 40 / 2602.18422

Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

生成现实:基于手部和相机控制的以人为中心的世界模拟交互视频生成
Xie, Linxi, Sun, Lisong C., Neall, Ashley, Wu, Tong, Cai, Shengqu, Wetzstein, Gordon
Abstract
Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand--object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher perceived level of control over the performed actions compared with relevant baselines.
Chinese Translation
扩展现实(XR)需要能够响应用户跟踪的真实世界运动的生成模型,但当前的视频世界模型仅接受粗略的控制信号,如文本或键盘输入,这限制了其在具身交互中的实用性。我们提出了一种以人为中心的视频世界模型,该模型基于跟踪的头部姿态和关节级手部姿态进行条件化。为此,我们评估了现有的扩散变换器条件化策略,并提出了一种有效的三维头部和手部控制机制,使得灵巧的手部与物体交互成为可能。我们使用这一策略训练了一个双向视频扩散模型教师,并将其提炼为一个因果的交互系统,生成自我中心的虚拟环境。我们通过人类受试者评估了这一生成现实系统,结果表明,与相关基线相比,任务表现得到了改善,并且在执行动作时感知到的控制程度显著提高。
cs.CV / 41 / 2602.18424

CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation

CapNav:基于能力条件的室内导航视觉语言模型基准测试
Su, Xia, Chen, Ruiqi, Liu, Benlin, Ma, Jingwei, Di, Zonglin, Krishna, Ranjay, Froehlich, Jon
Abstract
Vision-Language Models (VLMs) have shown remarkable progress in Vision-Language Navigation (VLN), offering new possibilities for navigation decision-making that could benefit both robotic platforms and human users. However, real-world navigation is inherently conditioned by the agent's mobility constraints. For example, a sweeping robot cannot traverse stairs, while a quadruped can. We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent's specific physical and operational capabilities. CapNav defines five representative human and robot agents, each described with physical dimensions, mobility capabilities, and environmental interaction abilities. CapNav provides 45 real-world indoor scenes, 473 navigation tasks, and 2365 QA pairs to test if VLMs can traverse indoor environments based on agent capabilities. We evaluate 13 modern VLMs and find that current VLMs' navigation performance drops sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that require reasoning on spatial dimensions. We conclude by discussing the implications for capability-aware navigation and the opportunities for advancing embodied spatial reasoning in future VLMs. The benchmark is available at https://github.com/makeabilitylab/CapNav
Chinese Translation
视觉语言模型(VLMs)在视觉语言导航(VLN)方面取得了显著进展,为导航决策提供了新的可能性,这对机器人平台和人类用户都有益。然而,现实世界的导航本质上受到代理人移动能力限制的影响。例如,扫地机器人无法跨越楼梯,而四足机器人则可以。我们引入了能力条件导航(CapNav),这是一个旨在评估VLMs在考虑代理人特定物理和操作能力的情况下,如何在复杂室内空间中进行导航的基准测试。CapNav定义了五种具有代表性的人类和机器人代理,每种代理都描述了其物理尺寸、移动能力和环境交互能力。CapNav提供了45个真实世界的室内场景、473个导航任务和2365对问答,以测试VLMs是否能够根据代理能力穿越室内环境。我们评估了13个现代VLM,并发现当前VLM的导航性能在移动限制加剧时急剧下降,甚至最先进的模型在处理需要空间维度推理的障碍类型时也面临困难。最后,我们讨论了能力感知导航的影响以及未来VLM中推进具身空间推理的机会。该基准测试可在https://github.com/makeabilitylab/CapNav获取。
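The capability-conditioning idea can be made concrete as a traversability check against an agent profile. The agent profiles and obstacle attributes below are invented for this sketch, not taken from the CapNav benchmark files:

```python
# Hypothetical agent capability profiles (illustrative only).
AGENTS = {
    "sweeping_robot":  {"can_climb_stairs": False, "max_step_height_cm": 2,  "width_cm": 35},
    "quadruped":       {"can_climb_stairs": True,  "max_step_height_cm": 20, "width_cm": 40},
    "wheelchair_user": {"can_climb_stairs": False, "max_step_height_cm": 2,  "width_cm": 70},
}

def traversable(agent_name, obstacle):
    """Return True if this agent can pass the obstacle."""
    caps = AGENTS[agent_name]
    if obstacle["type"] == "stairs":
        return caps["can_climb_stairs"]
    if obstacle["type"] == "step":
        return obstacle["height_cm"] <= caps["max_step_height_cm"]
    if obstacle["type"] == "doorway":
        return obstacle["width_cm"] >= caps["width_cm"]
    return True

stairs = {"type": "stairs"}
assert not traversable("sweeping_robot", stairs)   # the abstract's example
assert traversable("quadruped", stairs)
```

The benchmark asks VLMs to perform this kind of reasoning from natural language and images rather than from structured profiles; the sketch only shows why the answer differs per agent.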
cs.CV / 42 / 2602.18432

SARAH: Spatially Aware Real-time Agentic Humans

SARAH:空间感知的实时代理人类
Ng, Evonne, Zhang, Siwei, Chen, Zhang, Zollhoefer, Michael, Richard, Alexander
Abstract
As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on user trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance to decouple learning from control: the model captures natural spatial alignment from data, while users can adjust eye contact intensity at inference time. On the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS -- 3x faster than non-causal baselines -- while capturing the subtle spatial dynamics of natural conversation. We validate our approach on a live VR system, bringing spatially-aware conversational agents to real-time deployment. Please see https://evonneng.github.io/sarah/ for details.
Chinese Translation
随着具身代理人在虚拟现实(VR)、远程呈现和数字人类应用中的重要性日益增加,它们的运动必须超越与语言对齐的手势:代理人应当朝向用户,响应他们的移动,并保持自然的注视。目前的方法缺乏这种空间感知。我们通过首个实时、完全因果的方法来填补这一空白,该方法支持空间感知的对话运动,并可在流媒体VR头显上部署。根据用户的位置和双向音频,我们的方法生成全身运动,使手势与语言对齐,同时根据用户的方向调整代理人的朝向。我们的架构结合了基于因果变换器的变分自编码器(VAE)与交错的潜在标记,以实现流媒体推断,并采用基于用户轨迹和音频的流匹配模型。为了支持不同的注视偏好,我们引入了一种注视评分机制,采用无分类器引导以将学习与控制解耦:模型从数据中捕捉自然的空间对齐,而用户可以在推断时调整眼神接触的强度。在Embody 3D数据集上,我们的方法以超过300帧每秒的速度实现了最先进的运动质量,比非因果基线快3倍,同时捕捉自然对话的微妙空间动态。我们在实时VR系统上验证了我们的方法,实现了空间感知的对话代理人实时部署。有关详细信息,请参见 https://evonneng.github.io/sarah/
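The gaze control mechanism above follows the standard classifier-free guidance recipe: blend an unconditional and a conditional prediction, with the guidance scale acting as the user's eye-contact intensity knob. A toy sketch (the motion vectors are invented; the paper's model operates on full-body motion latents):

```python
def guided(uncond, cond, scale):
    """Classifier-free-guidance blend: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond_motion = [0.0, 0.0]   # no explicit gaze target
cond_motion   = [1.0, 0.5]   # gaze locked on the user

assert guided(uncond_motion, cond_motion, 0.0) == uncond_motion
assert guided(uncond_motion, cond_motion, 1.0) == cond_motion
# scale > 1 exaggerates eye contact beyond the conditional prediction.
assert guided(uncond_motion, cond_motion, 1.5)[0] == 1.5
```

Because the scale enters only at inference, the model can be trained once on natural data while users tune eye-contact intensity at runtime, which is the decoupling of learning from control described in the abstract.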
cs.CV / 43 / 2602.18434

Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory

追忆往昔:通过动态KV缓存记忆扩展视频流理解的令牌
Agarwal, Vatsal, Suri, Saksham, Gwilliam, Matthew, Kumar, Pulkit, Shrivastava, Abhinav
Abstract
Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStream, achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.
Chinese Translation
视频流理解要求模型能够稳健地编码、存储和检索来自连续视频流的信息,以支持准确的视频问答(VQA)。现有的最先进方法依赖于键值缓存,以随着时间的推移积累帧级信息,但每帧使用的令牌数量有限,导致细粒度视觉细节的丢失。在本研究中,我们提出扩展令牌预算,以实现更细致的时空理解和推理。首先,我们发现当前方法在处理密集流时能力不足:它们的特征编码导致查询帧相似度分数随时间增加,从而偏向于后期帧的检索。为了解决这个问题,我们引入了一种自适应选择策略,减少令牌冗余,同时保留局部时空信息。我们进一步提出了一种无训练的检索混合专家模型,利用外部模型更好地识别相关帧。我们的方法MemStream在CG-Bench上提高了8.0%,在LVBench上提高了8.5%,在VideoMME(长)上提高了2.4%,相较于ReKV与Qwen2.5-VL-7B的结果。
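A simplified version of the adaptive token selection described above is greedy redundancy pruning: keep a frame token only if it is not a near-duplicate of one already kept. This is a sketch of the idea, not MemStream's actual selection rule; the 2-D tokens and threshold are invented:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_tokens(tokens, threshold=0.95):
    """Greedily keep tokens that are not near-duplicates of kept ones,
    reducing redundancy while preserving distinct local information."""
    kept = []
    for t in tokens:
        if all(cosine(t, k) < threshold for k in kept):
            kept.append(t)
    return kept

frame_tokens = [
    [1.0, 0.0], [0.999, 0.01],  # near-duplicate pair from adjacent frames
    [0.0, 1.0],                 # genuinely new content
]
assert len(select_tokens(frame_tokens)) == 2
```

Pruning frees KV-cache budget for more tokens per frame, which is the scaling direction the abstract argues for.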
人工智能 (Artificial Intelligence)
10
cs.AI / 1 / 2602.17676

Epistemic Traps: Rational Misalignment Driven by Model Misspecification

认知陷阱:由模型错误指定驱动的理性不一致
Xu, Xingcheng, Qu, Jingjing, Zhang, Qiaosheng, Lu, Chaochao, Yang, Yanqing, Zou, Na, Hu, Xia
Abstract
The rapid deployment of Large Language Models and AI agents across critical societal and technical domains is hindered by persistent behavioral pathologies including sycophancy, hallucination, and strategic deception that resist mitigation via reinforcement learning. Current safety paradigms treat these failures as transient training artifacts, lacking a unified theoretical framework to explain their emergence and stability. Here we show that these misalignments are not errors, but mathematically rationalizable behaviors arising from model misspecification. By adapting Berk-Nash Rationalizability from theoretical economics to artificial intelligence, we derive a rigorous framework that models the agent as optimizing against a flawed subjective world model. We demonstrate that widely observed failures are structural necessities: unsafe behaviors emerge as either a stable misaligned equilibrium or oscillatory cycles depending on reward scheme, while strategic deception persists as a "locked-in" equilibrium or through epistemic indeterminacy robust to objective risks. We validate these theoretical predictions through behavioral experiments on six state-of-the-art model families, generating phase diagrams that precisely map the topological boundaries of safe behavior. Our findings reveal that safety is a discrete phase determined by the agent's epistemic priors rather than a continuous function of reward magnitude. This establishes Subjective Model Engineering, defined as the design of an agent's internal belief structure, as a necessary condition for robust alignment, marking a paradigm shift from manipulating environmental rewards to shaping the agent's interpretation of reality.
Chinese Translation
大型语言模型和人工智能代理在关键社会和技术领域的快速部署受到持续的行为病态(如谄媚、幻觉和战略欺骗)的阻碍,这些病态通过强化学习难以缓解。目前的安全范式将这些失败视为短暂的训练伪影,缺乏统一的理论框架来解释它们的出现和稳定性。在此,我们展示这些不一致并非错误,而是源于模型错误指定的数学上可理性化的行为。通过将理论经济学中的伯克-纳什理性化方法应用于人工智能,我们推导出一个严格的框架,将代理建模为在一个有缺陷的主观世界模型下进行优化。我们证明,广泛观察到的失败是结构上的必然性:不安全行为作为稳定的不一致均衡或依赖于奖励机制的振荡周期出现,而战略欺骗则作为“锁定”均衡或通过对客观风险具有鲁棒性的认知不确定性持续存在。我们通过对六个最先进模型家族的行为实验验证这些理论预测,生成的相图精确映射了安全行为的拓扑边界。我们的研究发现,安全性是由代理的认知先验决定的离散相,而不是奖励大小的连续函数。这确立了主观模型工程的必要条件,即设计代理的内部信念结构,以实现稳健的对齐,标志着从操控环境奖励到塑造代理对现实的解释的范式转变。
cs.AI / 2 / 2602.17826

Ontology-Guided Neuro-Symbolic Inference: Grounding Language Models with Mathematical Domain Knowledge

本体引导的神经符号推理:用数学领域知识为语言模型奠基
Labre, Marcelo
Abstract
Language models exhibit fundamental limitations -- hallucination, brittleness, and lack of formal grounding -- that are particularly problematic in high-stakes specialist fields requiring verifiable reasoning. I investigate whether formal domain ontologies can enhance language model reliability through retrieval-augmented generation. Using mathematics as proof of concept, I implement a neuro-symbolic pipeline leveraging the OpenMath ontology with hybrid retrieval and cross-encoder reranking to inject relevant definitions into model prompts. Evaluation on the MATH benchmark with three open-source models reveals that ontology-guided context improves performance when retrieval quality is high, but irrelevant context actively degrades it -- highlighting both the promise and challenges of neuro-symbolic approaches.
Chinese Translation
语言模型存在基本的局限性——幻觉、脆弱性和缺乏形式基础——这些问题在需要可验证推理的高风险专业领域尤为突出。本文探讨了形式领域本体是否能够通过检索增强生成来提高语言模型的可靠性。以数学为概念验证,我实现了一个神经符号管道,利用 OpenMath 本体结合混合检索和交叉编码重排序,将相关定义注入模型提示中。在 MATH 基准测试上对三种开源模型的评估表明,当检索质量高时,本体引导的上下文能够提高性能,但不相关的上下文则会积极降低性能——这突显了神经符号方法的潜力与挑战。
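The hybrid retrieval step above can be sketched as a score fusion between a lexical signal and a dense signal. Everything here is a toy stand-in: the paper uses OpenMath definitions, embedding models, and a cross-encoder reranker, whereas this sketch fuses two word-overlap scores:

```python
def lexical_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def hybrid_retrieve(query, docs, dense_score, alpha=0.5, k=2):
    """Rank documents by a weighted sum of lexical and 'dense' scores."""
    scored = [(alpha * lexical_score(query, d)
               + (1 - alpha) * dense_score(query, d), d) for d in docs]
    return [d for _, d in sorted(scored, reverse=True)[:k]]

docs = [
    "definition of a continuous function on a metric space",
    "recipe for apple pie",
    "limit of a function and continuity",
]
# Stand-in for an embedding model: overlap normalized by document length.
dense = lambda q, d: lexical_score(d, q)
top = hybrid_retrieve("continuous function definition", docs, dense)
assert "apple pie" not in " ".join(top)
```

The retrieved definitions would then be injected into the model prompt; the abstract's finding is that this helps only when the retrieved context is actually relevant.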
cs.AI / 3 / 2602.17831

The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

代币游戏:通过难题对决评估语言模型的推理能力
Henniger, Simon, Poesia, Gabriel
Abstract
Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks using PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine reasoning or if similar problems have been seen during training. Here, we take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating their own puzzles. We leverage the format of Programming Puzzles - given a Python function that returns a boolean, find inputs that make it return True - to flexibly represent problems and enable verifying solutions. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models relative to each other. We evaluate 10 frontier models on TTG, and closely match the ranking from existing benchmarks such as Humanity's Last Exam, without involving any human effort in creating puzzles. We also find that creating good puzzles is still a highly challenging task for current models, not measured by previous benchmarks. Overall, our work suggests new paradigms for evaluating reasoning that cannot be saturated by design, and that allow testing models for other skills like creativity and task creation alongside problem solving.
Chinese Translation
随着大型语言模型的不断进步,评估其推理能力变得愈发具有挑战性。人工策划困难问题的成本极高,尤其是在最近的基准测试中,使用博士级领域知识来挑战最强大的模型。即便如此,人们仍然担心这些问题是否真正测试了推理能力,或者类似的问题在训练过程中是否已经见过。在此,我们受到16世纪数学对决的启发,设计了代币游戏(The Token Games, TTG):一个评估框架,模型通过创建自己的难题相互挑战。我们利用编程难题的格式——给定一个返回布尔值的Python函数,找出使其返回True的输入——灵活地表示问题并验证解决方案。通过成对对决的结果,我们计算Elo评分,从而允许我们相对比较模型。我们在TTG上评估了10个前沿模型,并与现有基准(如人类的最后考试)中的排名紧密匹配,而无需任何人工参与创建难题。我们还发现,创建良好的难题对于当前模型仍然是一个高度具有挑战性的任务,这一点并未在之前的基准中得到衡量。总体而言,我们的工作为评估推理能力提供了新的范式,这些范式无法通过设计达到饱和,并且允许在解决问题的同时测试模型的创造力和任务创建等其他技能。
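The Elo computation from pairwise duels works as in chess rating: each duel result nudges both models' ratings toward the observed outcome. The K-factor and match outcome below are illustrative, not TTG's actual configuration:

```python
def expected(r_a, r_b):
    """Expected score of player a against player b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, a, b, score_a, k=32.0):
    """score_a is 1.0 if model a won the duel, 0.0 if it lost, 0.5 for a draw."""
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += k * (score_a - e_a)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - e_a))

ratings = {"model_x": 1000.0, "model_y": 1000.0}
# model_x solves model_y's puzzle but not vice versa: x wins the duel.
update(ratings, "model_x", "model_y", 1.0)
assert ratings["model_x"] > ratings["model_y"]
# Total rating is conserved by the pairwise update.
assert abs(sum(ratings.values()) - 2000.0) < 1e-9
```

Because ratings come purely from verifiable duel outcomes (did the submitted input make the puzzle function return True), the ranking needs no human-authored questions.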
cs.AI / 4 / 2602.17902

El Agente Gráfico: Structured Execution Graphs for Scientific Agents

El Agente Gráfico:科学智能体的结构化执行图
Bai, Jiaru, Aldossary, Abdulrahman, Swanick, Thomas, Müller, Marcel, Kang, Yeonghun, Zhang, Zijian, Lee, Jin Won, Ko, Tsz Wai, Vakili, Mohammad Ghazi, Bernales, Varinia, Aspuru-Guzik, Alán
Abstract
Large language models (LLMs) are increasingly used to automate scientific workflows, yet their integration with heterogeneous computational tools remains ad hoc and fragile. Current agentic approaches often rely on unstructured text to manage context and coordinate execution, generating often overwhelming volumes of information that may obscure decision provenance and hinder auditability. In this work, we present El Agente Gráfico, a single-agent framework that embeds LLM-driven decision-making within a type-safe execution environment and dynamic knowledge graphs for external persistence. Central to our approach is a structured abstraction of scientific concepts and an object-graph mapper that represents computational state as typed Python objects, stored either in memory or persisted in an external knowledge graph. This design enables context management through typed symbolic identifiers rather than raw text, thereby ensuring consistency, supporting provenance tracking, and enabling efficient tool orchestration. We evaluate the system by developing an automated benchmarking framework across a suite of university-level quantum chemistry tasks previously evaluated on a multi-agent system, demonstrating that a single agent, when coupled to a reliable execution engine, can robustly perform complex, multi-step, and parallel computations. We further extend this paradigm to two other large classes of applications: conformer ensemble generation and metal-organic framework design, where knowledge graphs serve as both memory and reasoning substrates. Together, these results illustrate how abstraction and type safety can provide a scalable foundation for agentic scientific automation beyond prompt-centric designs.
Chinese Translation
大型语言模型(LLMs)越来越多地用于自动化科学工作流,但它们与异构计算工具的集成仍然是临时和脆弱的。目前的智能体方法通常依赖于非结构化文本来管理上下文和协调执行,产生的往往是大量的信息,这可能会模糊决策来源并妨碍审计性。在本研究中,我们提出了El Agente Gráfico,一个单智能体框架,它将基于LLM的决策制定嵌入到类型安全的执行环境和用于外部持久性的动态知识图中。我们方法的核心是科学概念的结构化抽象和一个对象图映射器,它将计算状态表示为类型化的Python对象,这些对象存储在内存中或持久化在外部知识图中。这种设计通过类型化的符号标识符而非原始文本来实现上下文管理,从而确保一致性,支持来源追踪,并实现高效的工具编排。我们通过开发一个自动化基准测试框架来评估该系统,该框架涵盖了一系列以前在多智能体系统上评估的大学级量子化学任务,证明了一个单智能体在与可靠的执行引擎结合时,可以稳健地执行复杂的多步骤和并行计算。我们进一步将这一范式扩展到另外两个大型应用类:构象集合生成和金属有机框架设计,其中知识图既作为记忆又作为推理基础。这些结果共同表明,抽象和类型安全如何为超越以提示为中心的设计的智能科学自动化提供可扩展的基础。
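The object-graph-mapper idea, computational state as typed Python objects referenced by symbolic identifiers rather than raw text, can be sketched with dataclasses and an in-memory store. Class names, fields, and the identifier scheme below are invented for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Molecule:
    smiles: str

@dataclass
class Calculation:
    method: str
    molecule_id: str              # typed symbolic reference, not raw text
    energy: Optional[float] = None

class GraphStore:
    """In-memory stand-in for the external knowledge graph."""
    def __init__(self):
        self._nodes = {}
        self._counter = 0

    def put(self, obj):
        self._counter += 1
        node_id = f"{type(obj).__name__}:{self._counter}"
        self._nodes[node_id] = obj
        return node_id

    def get(self, node_id, expected_type):
        obj = self._nodes[node_id]
        assert isinstance(obj, expected_type), "type-safe dereference"
        return obj

store = GraphStore()
mol_id = store.put(Molecule(smiles="O"))
calc_id = store.put(Calculation(method="HF", molecule_id=mol_id))
calc = store.get(calc_id, Calculation)
assert store.get(calc.molecule_id, Molecule).smiles == "O"
```

Passing `Molecule:1` through the agent's context instead of a serialized molecule keeps prompts small and makes every dereference checkable, which is the consistency and provenance benefit the abstract claims.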
cs.AI / 5 / 2602.17910

Alignment in Time: Peak-Aware Orchestration for Long-Horizon Agentic Systems

时间中的对齐:面向长时间跨度自主系统的峰值感知编排
Shi, Hanjing, DiFranzo, Dominic
Abstract
Traditional AI alignment primarily focuses on individual model outputs; however, autonomous agents in long-horizon workflows require sustained reliability across entire interaction trajectories. We introduce APEMO (Affect-aware Peak-End Modulation for Orchestration), a runtime scheduling layer that optimizes computational allocation under fixed budgets by operationalizing temporal-affective signals. Instead of modifying model weights, APEMO detects trajectory instability through behavioral proxies and targets repairs at critical segments, such as peak moments and endings. Evaluation across multi-agent simulations and LLM-based planner--executor flows demonstrates that APEMO consistently enhances trajectory-level quality and reuse probability over structural orchestrators. Our results reframe alignment as a temporal control problem, offering a resilient engineering pathway for the development of long-horizon agentic systems.
Chinese Translation
传统的人工智能对齐主要关注单个模型输出;然而,长时间跨度工作流中的自主代理需要在整个交互轨迹中保持持续的可靠性。我们引入了APEMO(情感感知峰值-结束调制编排),这是一种运行时调度层,通过将时间-情感信号操作化来优化在固定预算下的计算分配。APEMO并不修改模型权重,而是通过行为代理检测轨迹不稳定性,并针对关键片段进行修复,例如峰值时刻和结束。通过多代理仿真和基于大语言模型(LLM)的规划者-执行者流程的评估表明,APEMO在轨迹级质量和重用概率方面始终优于结构化编排器。我们的结果将对齐重新框架为一个时间控制问题,为长时间跨度自主系统的发展提供了一条韧性工程路径。
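The peak-end targeting described above amounts to a budgeted allocation rule: spend repair compute first on the most unstable segment and the trajectory ending, then on the remaining segments by instability. The scoring and budget rule below are an invented sketch, not APEMO's actual scheduler:

```python
def allocate_repairs(instability, budget):
    """Return indices of trajectory segments chosen for repair."""
    n = len(instability)
    order = []
    peak = max(range(n), key=lambda i: instability[i])
    order.append(peak)                 # peak moment first
    if n - 1 not in order:
        order.append(n - 1)            # then the ending
    # Remaining budget goes to the next most unstable segments.
    rest = sorted((i for i in range(n) if i not in order),
                  key=lambda i: instability[i], reverse=True)
    return (order + rest)[:budget]

# Hypothetical per-segment instability scores from behavioral proxies.
scores = [0.1, 0.9, 0.3, 0.2, 0.6]
chosen = allocate_repairs(scores, budget=2)
assert chosen == [1, 4]   # the peak segment and the ending
```

Under a fixed budget this biases quality toward exactly the segments that dominate trajectory-level judgments, which is the peak-end rationale.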
cs.AI / 6 / 2602.17990

WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

WorkflowPerturb:用于评估多智能体工作流指标的校准压力测试
Kanda, Madhav, Las-Casas, Pedro, Kumbhare, Alok Gautam, Fonseca, Rodrigo, Agarwal, Sharad
Abstract
LLM-based systems increasingly generate structured workflows for complex tasks. In practice, automatic evaluation of these workflows is difficult, because metric scores are often not calibrated, and score changes do not directly communicate the severity of workflow degradation. We introduce WorkflowPerturb, a controlled benchmark for studying workflow evaluation metrics. It works by applying realistic, controlled perturbations to golden workflows. WorkflowPerturb contains 4,973 golden workflows and 44,757 perturbed variants across three perturbation types (Missing Steps, Compressed Steps, and Description Changes), each applied at severity levels of 10%, 30%, and 50%. We benchmark multiple metric families and analyze their sensitivity and calibration using expected score trajectories and residuals. Our results characterize systematic differences across metric families and support severity-aware interpretation of workflow evaluation scores. Our dataset will be released upon acceptance.
Chinese Translation
基于大语言模型(LLM)的系统日益生成用于复杂任务的结构化工作流。在实际应用中,这些工作流的自动评估是困难的,因为指标得分往往没有经过校准,得分变化也无法直接传达工作流退化的严重性。我们提出了WorkflowPerturb,这是一个用于研究工作流评估指标的受控基准。它通过对黄金工作流施加现实的、受控的扰动来工作。WorkflowPerturb包含4,973个黄金工作流和44,757个经过扰动的变体,涵盖三种扰动类型(缺失步骤、压缩步骤和描述变化),每种扰动在10%、30%和50%的严重性水平下施加。我们基准测试了多种指标家族,并使用预期得分轨迹和残差分析了它们的敏感性和校准性。我们的结果表征了不同指标家族之间的系统性差异,并支持对工作流评估得分的严重性感知解释。我们的数据集将在接受后发布。
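One of the three perturbation types above, Missing Steps at a given severity level, can be sketched directly; the exact sampling scheme here is an assumption, not WorkflowPerturb's implementation:

```python
import random

def perturb_missing_steps(workflow, severity, seed=0):
    """Drop a `severity` fraction of steps from a golden workflow."""
    rng = random.Random(seed)
    n_drop = max(1, round(len(workflow) * severity))
    drop = set(rng.sample(range(len(workflow)), n_drop))
    return [s for i, s in enumerate(workflow) if i not in drop]

golden = [f"step_{i}" for i in range(10)]
for severity in (0.1, 0.3, 0.5):   # the benchmark's severity levels
    perturbed = perturb_missing_steps(golden, severity)
    assert len(perturbed) == len(golden) - round(len(golden) * severity)
```

Because the perturbation is controlled, a well-calibrated metric should score the 50% variant markedly worse than the 10% variant; the benchmark measures whether metrics actually behave this way.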
cs.AI / 7 / 2602.18025

Cross-Embodiment Offline Reinforcement Learning for Heterogeneous Robot Datasets

异构机器人数据集的跨体现离线强化学习
Abe, Haruki, Osa, Takayuki, Mukuta, Yusuke, Harada, Tatsuya
Abstract
Scalable robot policy pre-training has been hindered by the high cost of collecting high-quality demonstrations for each platform. In this study, we address this issue by uniting offline reinforcement learning (offline RL) with cross-embodiment learning. Offline RL leverages both expert and abundant suboptimal data, and cross-embodiment learning aggregates heterogeneous robot trajectories across diverse morphologies to acquire universal control priors. We perform a systematic analysis of this offline RL and cross-embodiment paradigm, providing a principled understanding of its strengths and limitations. To evaluate this offline RL and cross-embodiment paradigm, we construct a suite of locomotion datasets spanning 16 distinct robot platforms. Our experiments confirm that this combined approach excels at pre-training with datasets rich in suboptimal trajectories, outperforming pure behavior cloning. However, as the proportion of suboptimal data and the number of robot types increase, we observe that conflicting gradients across morphologies begin to impede learning. To mitigate this, we introduce an embodiment-based grouping strategy in which robots are clustered by morphological similarity and the model is updated with a group gradient. This simple, static grouping substantially reduces inter-robot conflicts and outperforms existing conflict-resolution methods.
Chinese Translation
可扩展的机器人策略预训练受到为每个平台收集高质量示范的高成本的制约。在本研究中,我们通过将离线强化学习(offline RL)与跨体现学习相结合来解决这一问题。离线强化学习利用专家数据和丰富的次优数据,而跨体现学习则聚合了不同形态的异构机器人轨迹,以获取通用控制先验。我们对这一离线强化学习和跨体现范式进行了系统分析,提供了对其优势和局限性的原则性理解。为了评估这一离线强化学习和跨体现范式,我们构建了一套涵盖16种不同机器人平台的运动数据集。我们的实验确认,这种结合方法在使用丰富的次优轨迹的数据集进行预训练时表现优异,超越了纯行为克隆。然而,随着次优数据比例和机器人类型数量的增加,我们观察到不同形态之间的冲突梯度开始妨碍学习。为了解决这一问题,我们引入了一种基于体现的分组策略,将机器人按形态相似性进行聚类,并通过组梯度更新模型。这种简单的静态分组显著减少了机器人之间的冲突,并优于现有的冲突解决方法。
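The embodiment-based grouping strategy above averages per-robot gradients within each morphology cluster before updating the model, so robots with conflicting gradient directions in different clusters no longer cancel each other out. A toy sketch (cluster assignments and gradient values are made up):

```python
def group_gradients(per_robot_grads, groups):
    """per_robot_grads: {robot: gradient vector}; groups: {robot: cluster}.
    Returns one averaged gradient per morphology cluster."""
    sums, counts = {}, {}
    for robot, grad in per_robot_grads.items():
        c = groups[robot]
        if c not in sums:
            sums[c] = [0.0] * len(grad)
            counts[c] = 0
        sums[c] = [s + g for s, g in zip(sums[c], grad)]
        counts[c] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

grads = {"biped_a": [1.0, 0.0], "biped_b": [0.0, 1.0], "quad_a": [-1.0, -1.0]}
groups = {"biped_a": "biped", "biped_b": "biped", "quad_a": "quad"}
g = group_gradients(grads, groups)
assert g["biped"] == [0.5, 0.5]
assert g["quad"] == [-1.0, -1.0]
```

Naive averaging over all three robots would yield the zero gradient here; grouping preserves a usable update direction per morphology, which is the conflict reduction the abstract reports.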
cs.AI / 8 / 2602.18095

Neurosymbolic Language Reasoning as Satisfiability Modulo Theory

神经符号语言推理作为理论模态下的可满足性
Oh, Hyunseok, Stern, Sam, Lee, Youngki, Philipose, Matthai
Abstract
Natural language understanding requires interleaving textual and logical reasoning, yet large language models often fail to perform such reasoning reliably. Existing neurosymbolic systems combine LLMs with solvers but remain limited to fully formalizable tasks such as math or program synthesis, leaving natural documents with only partial logical structure unaddressed. We introduce Logitext, a neurosymbolic language that represents documents as natural language text constraints (NLTCs), making partial logical structure explicit. We develop an algorithm that integrates LLM-based constraint evaluation with satisfiability modulo theory (SMT) solving, enabling joint textual-logical reasoning. Experiments on a new content moderation benchmark, together with LegalBench and Super-Natural Instructions, show that Logitext improves both accuracy and coverage. This work is the first that treats LLM-based reasoning as an SMT theory, extending neurosymbolic methods beyond fully formalizable domains.
Chinese Translation
自然语言理解需要将文本和逻辑推理交替进行,然而大型语言模型往往无法可靠地执行这种推理。现有的神经符号系统将大型语言模型(LLMs)与求解器结合,但仍然仅限于完全可形式化的任务,如数学或程序合成,未能解决自然文档中仅具有部分逻辑结构的问题。我们提出了Logitext,这是一种神经符号语言,将文档表示为自然语言文本约束(NLTCs),使部分逻辑结构变得显性。我们开发了一种算法,将基于LLM的约束评估与理论模态下的可满足性(SMT)求解相结合,实现了文本与逻辑的联合推理。在一个新的内容审核基准测试以及LegalBench和Super-Natural Instructions上的实验表明,Logitext在准确性和覆盖率上都有所提升。这项工作首次将基于LLM的推理视为SMT理论,扩展了神经符号方法的应用范围,超越了完全可形式化的领域。
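The interleaving above can be illustrated in miniature: an "LLM" evaluator grounds natural language text constraints (NLTCs) against a document, and a search over the remaining free logical variables plays the role of the SMT solver. Everything here is a stub; a real system would call an LLM and a solver such as Z3:

```python
from itertools import product

def llm_eval(constraint, document):
    """Stub for an LLM call: here, just a substring check."""
    return constraint.lower() in document.lower()

document = "The post contains a threat but no personal data."
fixed = {"contains a threat": llm_eval("contains a threat", document)}

def satisfiable(formula, fixed, free_vars):
    """formula is a callable over a complete truth-assignment dict."""
    for values in product([False, True], repeat=len(free_vars)):
        assignment = dict(fixed, **dict(zip(free_vars, values)))
        if formula(assignment):
            return True
    return False

# Hypothetical moderation policy: remove iff the post contains a threat
# and no reviewer exception applies; the exception is a free variable.
policy = lambda a: a["contains a threat"] and not a["reviewer_exception"]
assert satisfiable(policy, fixed, ["reviewer_exception"])
```

Treating the LLM-grounded literals as a theory queried during solving, rather than pre-formalizing the whole document, is what lets partially structured text be handled at all.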
cs.AI / 9 / 2602.18201

SOMtime the World Ain't Fair: Violating Fairness Using Self-Organizing Maps

有时世界并不公平:使用自组织映射违反公平性
Bingham, Joseph, Arussy, Netanel, Aran, Dvir
Abstract
Unsupervised representations are widely assumed to be neutral with respect to sensitive attributes when those attributes are withheld from training. We show that this assumption is false. Using SOMtime, a topology-preserving representation method based on high-capacity Self-Organizing Maps, we demonstrate that sensitive attributes such as age and income emerge as dominant latent axes in purely unsupervised embeddings, even when explicitly excluded from the input. On two large-scale real-world datasets (the World Values Survey across five countries and the Census-Income dataset), SOMtime recovers monotonic orderings aligned with withheld sensitive attributes, achieving Spearman correlations of up to 0.85, whereas PCA and UMAP typically remain below 0.23 (with a single exception reaching 0.31), while t-SNE and autoencoders achieve at most 0.34. Furthermore, unsupervised segmentation of SOMtime embeddings produces demographically skewed clusters, demonstrating downstream fairness risks without any supervised task. These findings establish that "fairness through unawareness" fails at the representation level for ordinal sensitive attributes and that fairness auditing must extend to unsupervised components of machine learning pipelines. We have made the code available at https://github.com/JosephBingham/SOMtime
Chinese Translation
无监督表示通常被认为在敏感属性方面是中立的,前提是这些属性在训练中被排除。我们证明了这一假设是错误的。通过使用SOMtime,这是一种基于高容量自组织映射(Self-Organizing Maps)的拓扑保持表示方法,我们展示了即使在输入中明确排除的情况下,年龄和收入等敏感属性也会在纯粹的无监督嵌入中作为主导潜在轴出现。在两个大规模的真实世界数据集上(涵盖五个国家的世界价值观调查和人口普查收入数据集),SOMtime恢复了与被排除的敏感属性一致的单调排序,达到的斯皮尔曼相关系数高达0.85,而主成分分析(PCA)和UMAP通常保持在0.23以下(仅有一个例外达到0.31),而t-SNE和自编码器的最高相关系数也仅为0.34。此外,SOMtime嵌入的无监督分割产生了在人口统计上偏斜的聚类,展示了在没有任何监督任务的情况下的下游公平性风险。这些发现表明,对于有序敏感属性,“通过无知实现公平性”在表示层面上是失败的,公平性审计必须扩展到机器学习流程的无监督组件。我们已将代码发布在 https://github.com/JosephBingham/SOMtime
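The audit statistic above, Spearman correlation between a latent embedding axis and a withheld sensitive attribute, is simple to compute. A self-contained sketch using the rank-difference formula (valid when there are no ties; the ages and latent coordinates are invented):

```python
def rank(xs):
    """1-based ranks, assuming no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1.0
    return r

def spearman(x, y):
    """Spearman rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank(x), rank(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Illustrative audit: does one latent axis order samples by a withheld
# attribute such as age?
withheld_age = [23, 31, 45, 52, 67]
latent_axis = [0.1, 0.4, 0.5, 0.9, 1.3]   # monotone in age -> rho = 1
assert spearman(withheld_age, latent_axis) == 1.0
```

A correlation this high against an attribute the model never saw is exactly the failure of "fairness through unawareness" the abstract reports.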
cs.AI / 10 / 2602.18291

Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

扩散协调:高效的在线多智能体扩散策略
Li, Zhuoran, Zhong, Hai, Wang, Xun, Xia, Qingxin, Zhang, Lihua, Huang, Longbo
Abstract
Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination. Crucially, enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well-positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation in image generation and offline settings. Yet, their potential in online MARL remains largely under-explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy-based exploration and coordination. To tackle this challenge, we propose OMAD, one of the first Online off-policy MARL frameworks using Diffusion policies to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes scaled joint entropy, facilitating effective exploration without relying on tractable likelihood. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies. It leverages tractable entropy-augmented targets to guide the simultaneous updates of diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state-of-the-art across 10 diverse tasks, demonstrating a remarkable 2.5x to 5x improvement in sample efficiency.
Chinese Translation
在线多智能体强化学习(MARL)是实现高效智能体协调的重要框架。提升策略的表现力对于获得优越的性能至关重要。基于扩散的生成模型在满足这一需求方面具有良好潜力,因为它们在图像生成和离线设置中展现了显著的表现力和多模态表示。然而,它们在在线MARL中的潜力仍然未得到充分探索。一个主要障碍是扩散模型的难以处理的似然性阻碍了基于熵的探索和协调。为了解决这一挑战,我们提出了首个使用扩散策略的在线离策略多智能体强化学习框架(OMAD),以协调智能体之间的合作。我们的关键创新是放宽的策略目标,最大化缩放的联合熵,从而在不依赖可处理似然性的情况下促进有效的探索。此外,在集中训练与分散执行(CTDE)范式下,我们采用联合分布值函数来优化分散的扩散策略。它利用可处理的熵增强目标来指导扩散策略的同步更新,从而确保稳定的协调。在MPE和MAMuJoCo上的广泛评估表明,我们的方法在10个不同任务中建立了新的最先进水平,样本效率显著提高了2.5倍至5倍。
计算语言学 (Computation and Language)
27
cs.CL / 1 / 2602.17784

QueryPlot: Generating Geological Evidence Layers using Natural Language Queries for Mineral Exploration

QueryPlot:利用自然语言查询生成矿产勘探的地质证据层
Ye, Meng, Lin, Xiao, Lukoczki, Georgina, Lederer, Graham W., Yao, Yi
Abstract
Mineral prospectivity mapping requires synthesizing heterogeneous geological knowledge, including textual deposit models and geospatial datasets, to identify regions likely to host specific mineral deposit types. This process is traditionally manual and knowledge-intensive. We present QueryPlot, a semantic retrieval and mapping framework that integrates large-scale geological text corpora with geologic map data using modern Natural Language Processing techniques. We curate descriptive deposit models for over 120 deposit types and transform the State Geologic Map Compilation (SGMC) polygons into structured textual representations. Given a user-defined natural language query, the system encodes both queries and region descriptions using a pretrained embedding model and computes semantic similarity scores to rank and spatially visualize regions as continuous evidence layers. QueryPlot supports compositional querying over deposit characteristics, enabling aggregation of multiple similarity-derived layers for multi-criteria prospectivity analysis. In a case study on tungsten skarn deposits, we demonstrate that embedding-based retrieval achieves high recall of known occurrences and produces prospective regions that closely align with expert-defined permissive tracts. Furthermore, similarity scores can be incorporated as additional features in supervised learning pipelines, yielding measurable improvements in classification performance. QueryPlot is implemented as a web-based system supporting interactive querying, visualization, and export of GIS-compatible prospectivity layers. To support future research, we have made the source code and datasets used in this study publicly available.
Chinese Translation
矿产前景制图需要综合异构的地质知识,包括文本矿床模型和地理空间数据集,以识别可能包含特定矿床类型的区域。这个过程传统上是手动且知识密集型的。我们提出了QueryPlot,一个语义检索和制图框架,利用现代自然语言处理技术将大规模地质文本语料库与地质图数据集成。我们为超过120种矿床类型策划了描述性矿床模型,并将州地质图编制(SGMC)多边形转化为结构化的文本表示。给定用户定义的自然语言查询,系统使用预训练的嵌入模型对查询和区域描述进行编码,并计算语义相似性分数,以对区域进行排名并作为连续证据层进行空间可视化。QueryPlot支持对矿床特征的组合查询,使得可以聚合多个基于相似性推导的层,以进行多标准前景分析。在对钨辉石矿床的案例研究中,我们证明了基于嵌入的检索能够高效回溯已知发生点,并生成与专家定义的允许区紧密对齐的前景区域。此外,相似性分数可以作为额外特征纳入监督学习流程,从而在分类性能上实现可测量的改进。QueryPlot作为一个基于网络的系统实现,支持交互式查询、可视化和GIS兼容的前景层导出。为了支持未来的研究,我们已将本研究中使用的源代码和数据集公开。
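The multi-criteria aggregation described above, combining several similarity-derived evidence layers into one prospectivity layer, can be sketched with a simple per-region average. Region names, scores, and the averaging rule are illustrative; the system's actual aggregation may differ:

```python
def aggregate_layers(layers):
    """Average multiple per-region similarity layers into one evidence layer."""
    regions = layers[0].keys()
    return {r: sum(layer[r] for layer in layers) / len(layers)
            for r in regions}

# Hypothetical similarity layers, one per natural language query.
skarn_layer   = {"region_a": 0.9, "region_b": 0.2, "region_c": 0.6}
granite_layer = {"region_a": 0.7, "region_b": 0.3, "region_c": 0.8}

evidence = aggregate_layers([skarn_layer, granite_layer])
ranked = sorted(evidence, key=evidence.get, reverse=True)
assert ranked[0] == "region_a"   # scores well under both criteria
```

The same per-region scores can also be appended as features for a supervised classifier, the second use the abstract describes.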
cs.CL / 2 / 2602.17815

Neural Synchrony Between Socially Interacting Language Models

社会互动语言模型之间的神经同步
Zhang, Zhining, Zhu, Wentao, Han, Chi, Wang, Yizhou, Ji, Heng
Abstract
Neuroscience has uncovered a fundamental mechanism of our social nature: human brain activity becomes synchronized with others in many social contexts involving interaction. Traditionally, social minds have been regarded as an exclusive property of living beings. Although large language models (LLMs) are widely accepted as powerful approximations of human behavior, with multi-LLM system being extensively explored to enhance their capabilities, it remains controversial whether they can be meaningfully compared to human social minds. In this work, we explore neural synchrony between socially interacting LLMs as an empirical evidence for this debate. Specifically, we introduce neural synchrony during social simulations as a novel proxy for analyzing the sociality of LLMs at the representational level. Through carefully designed experiments, we demonstrate that it reliably reflects both social engagement and temporal alignment in their interactions. Our findings indicate that neural synchrony between LLMs is strongly correlated with their social performance, highlighting an important link between neural synchrony and the social behaviors of LLMs. Our work offers a new perspective to examine the "social minds" of LLMs, highlighting surprising parallels in the internal dynamics that underlie human and LLM social interaction.
Chinese Translation
神经科学揭示了我们社会本质的一个基本机制:人类大脑活动在许多涉及互动的社会情境中与他人同步。传统上,社会心智被视为生物体的独特属性。尽管大型语言模型(LLMs)被广泛接受为人类行为的强大近似,且多LLM系统被广泛探索以增强其能力,但它们是否可以与人类社会心智进行有意义的比较仍然存在争议。在本研究中,我们探讨了社会互动LLMs之间的神经同步,作为这一争论的实证证据。具体而言,我们在社会模拟中引入神经同步,作为分析LLMs在表征层面社会性的一个新颖代理。通过精心设计的实验,我们证明了它可靠地反映了它们互动中的社会参与和时间对齐。我们的发现表明,LLMs之间的神经同步与它们的社会表现强烈相关,突显了神经同步与LLMs社会行为之间的重要联系。我们的研究为审视LLMs的“社会心智”提供了新的视角,强调了人类与LLMs社会互动背后内部动态的惊人相似性。
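A minimal proxy for the synchrony analysis described above is the correlation between two agents' activation signals over the course of an interaction. The paper's actual measure may differ; the per-turn activation values below are invented:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length signals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

agent_a  = [0.2, 0.5, 0.9, 0.4, 0.7]        # hypothetical activation signal
agent_b  = [0.25, 0.55, 0.85, 0.45, 0.65]   # engaged partner tracks it
shuffled = [0.85, 0.25, 0.45, 0.65, 0.55]   # temporally misaligned control

assert pearson(agent_a, agent_b) > pearson(agent_a, shuffled)
```

Comparing true pairings against shuffled controls is the standard way to show that synchrony reflects the interaction itself rather than shared stimulus statistics, mirroring the temporal-alignment test in the abstract.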
cs.CL / 3 / 2602.17848

On the scaling relationship between cloze probabilities and language model next-token prediction

填空概率与语言模型下一个词预测之间的尺度关系
Jacobs, Cassandra L., Grobol, Morgan
Abstract
Recent work has shown that larger language models have better predictive power for eye movement and reading time data. While even the best models under-allocate probability mass to human responses, larger models assign higher-quality estimates of next tokens and their likelihood of production in cloze data because they are less sensitive to lexical co-occurrence statistics while being better aligned semantically to human cloze responses. The results provide support for the claim that the greater memorization capacity of larger models helps them guess more semantically appropriate words, but makes them less sensitive to low-level information that is relevant for word recognition.
Chinese Translation
最近的研究表明,较大的语言模型在眼动和阅读时间数据的预测能力上表现更佳。尽管即使是最好的模型也未能充分分配概率质量给人类反应,但较大的模型在填空数据中对下一个词及其产生可能性的估计质量更高,因为它们对词汇共现统计的敏感性较低,同时在语义上与人类填空反应的对齐程度更高。这些结果支持了这样一种观点:较大模型的更强记忆能力帮助它们猜测出更语义适当的词汇,但使它们对与词汇识别相关的低层次信息的敏感性降低。
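As a hedged sketch of the kind of comparison described above, one can measure how much probability mass a language model's next-token distribution assigns to human cloze responses. The helper and its two summary statistics are illustrative assumptions, not the paper's analysis.

```python
from collections import Counter
import math

def cloze_alignment(model_probs, cloze_responses):
    """Compare an LM's next-token distribution with human cloze answers.

    model_probs: dict token -> next-token probability from an LM.
    cloze_responses: list of human fill-in-the-blank answers for the same
    context. Hypothetical helper; returns (coverage, modal_logp).
    """
    counts = Counter(cloze_responses)
    total = sum(counts.values())
    human_p = {w: c / total for w, c in counts.items()}
    # Coverage: mass the model allocates to any human-produced word.
    coverage = sum(model_probs.get(w, 0.0) for w in human_p)
    # Log-probability the model assigns to the modal human response.
    modal = counts.most_common(1)[0][0]
    modal_logp = math.log(model_probs.get(modal, 1e-12))
    return coverage, modal_logp
```

Under the abstract's claim, larger models should show higher coverage and modal log-probability even while still under-allocating mass overall.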
cs.CL / 4 / 2602.17881

Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations

理解语言模型中引导向量的不可靠性:几何预测因子与线性近似的局限性
Braun, Joschka
Abstract
Steering vectors are a lightweight method for controlling language model behavior by adding a learned bias to the activations at inference time. Although effective on average, steering effect sizes vary across samples and are unreliable for many target behaviors. In my thesis, I investigate why steering reliability differs across behaviors and how it is impacted by steering vector training data. First, I find that higher cosine similarity between training activation differences predicts more reliable steering. Second, I observe that behavior datasets where positive and negative activations are better separated along the steering direction are more reliably steerable. Finally, steering vectors trained on different prompt variations are directionally distinct, yet perform similarly well and exhibit correlated efficacy across datasets. My findings suggest that steering vectors are unreliable when the latent target behavior representation is not effectively approximated by the linear steering direction. Taken together, these insights offer a practical diagnostic for steering unreliability and motivate the development of more robust steering methods that explicitly account for non-linear latent behavior representations.
Chinese Translation
引导向量是一种轻量级的方法,通过在推理时向激活添加学习到的偏差来控制语言模型的行为。尽管在平均情况下有效,但引导效果的大小在不同样本之间存在差异,并且对于许多目标行为来说是不可靠的。在我的论文中,我探讨了为什么引导的可靠性在不同行为之间存在差异,以及它是如何受到引导向量训练数据的影响。首先,我发现训练激活差异之间的余弦相似度越高,预测的引导越可靠。其次,我观察到在引导方向上,正负激活更好分离的行为数据集具有更高的可引导性。最后,在不同提示变体上训练的引导向量在方向上是不同的,但在性能上表现相似,并且在不同数据集之间展现出相关的有效性。我的研究结果表明,当潜在目标行为的表示未能有效地被线性引导方向近似时,引导向量是不可靠的。综合来看,这些见解为引导不可靠性提供了实用的诊断,并激励开发更稳健的引导方法,以明确考虑非线性潜在行为表示。
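The two geometric predictors in the abstract can be sketched concretely. Assuming the common mean-difference construction of a steering vector (the thesis's exact recipe may differ), the diagnostics below compute (i) the average cosine similarity among per-pair activation differences and (ii) class separation along the steering direction.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean-difference steering vector from paired activations.

    pos_acts, neg_acts: (n, d) activations for behavior-positive and
    behavior-negative prompts. A common construction; details vary.
    """
    return (pos_acts - neg_acts).mean(axis=0)

def diagnostics(pos_acts, neg_acts):
    """Two reliability predictors, sketched from the abstract."""
    diffs = pos_acts - neg_acts
    unit = diffs / (np.linalg.norm(diffs, axis=1, keepdims=True) + 1e-8)
    # Mean of the Gram matrix = squared norm of the mean unit difference
    # vector (including diagonal self-similarity terms).
    cos_mean = (unit @ unit.T).mean()
    v = steering_vector(pos_acts, neg_acts)
    v = v / (np.linalg.norm(v) + 1e-8)
    # Separation of the two classes projected onto the steering direction.
    sep = (pos_acts @ v).mean() - (neg_acts @ v).mean()
    return float(cos_mean), float(sep)
```

High `cos_mean` and large `sep` would, per the abstract, predict more reliable steering for that behavior dataset.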
cs.CL / 5 / 2602.17907

Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions

通过语义基础的软标签分布改善神经主题建模
Li, Raymond, Abaskohi, Amirhossein, Li, Chuyuan, Murray, Gabriel, Carenini, Giuseppe
Abstract
Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we propose a novel approach to construct semantically-grounded soft label targets using Language Models (LMs) by projecting the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary to obtain contextually enriched supervision signals. By training the topic models to reconstruct the soft labels using the LM hidden states, our method produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. Experiments on three datasets show that our method achieves substantial improvements in topic coherence and purity over existing baselines. Additionally, we introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.
Chinese Translation
传统的神经主题模型通常通过重建文档的词袋(Bag-of-Words, BoW)表示进行优化,忽视了上下文信息,并且在数据稀疏性方面面临挑战。在本研究中,我们提出了一种新颖的方法,通过使用语言模型(Language Models, LMs)构建语义基础的软标签目标。该方法通过将基于特定提示的下一个标记概率投影到预定义词汇上,从而获得上下文丰富的监督信号。通过训练主题模型以使用语言模型的隐藏状态重建软标签,我们的方法生成了更高质量的主题,这些主题与语料库的潜在主题结构更为一致。在三个数据集上的实验表明,我们的方法在主题一致性和纯度方面相较于现有基线取得了显著提升。此外,我们还引入了一种基于检索的指标,结果显示我们的方法在识别语义相似文档方面显著优于现有方法,突显了其在检索导向应用中的有效性。
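A minimal sketch of the soft-label construction described above, under the assumption that the target vocabulary is a subset of the LM's vocabulary: take the LM's next-token distribution (conditioned on the specialized prompt), restrict it to the pre-defined vocabulary, and renormalize.

```python
import numpy as np

def soft_label_targets(next_token_logits, vocab_index):
    """Project LM next-token logits onto a topic-model vocabulary.

    next_token_logits: (V,) logits over the LM's full vocabulary.
    vocab_index: indices of the topic model's pre-defined vocabulary inside
    the LM vocabulary. Illustrative sketch of the idea, not the authors'
    exact pipeline.
    """
    # Softmax over the full LM vocabulary (shift for numerical stability)...
    z = next_token_logits - next_token_logits.max()
    p = np.exp(z) / np.exp(z).sum()
    # ...then restrict to the topic vocabulary and renormalize, yielding a
    # soft Bag-of-Words target distribution.
    restricted = p[vocab_index]
    return restricted / restricted.sum()
```

The resulting distribution replaces the hard BoW counts as the reconstruction target, which is how the contextual signal enters training.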
cs.CL / 6 / 2602.17911

Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering

条件门控推理用于上下文依赖的生物医学问答
Parekh, Jash Rajesh, Kweon, Wonbin, Chan, Joey, Islamaj, Rezarta, Leaman, Robert, Jiang, Pengcheng, Wei, Chih-Hsuan, Wang, Zhizheng, Lu, Zhiyong, Han, Jiawei
Abstract
Current biomedical question answering (QA) systems often assume that medical knowledge applies uniformly, yet real-world clinical reasoning is inherently conditional: nearly every decision depends on patient-specific factors such as comorbidities and contraindications. Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to a given context. To address this gap, we propose CondMedQA, the first benchmark for conditional biomedical QA, consisting of multi-hop questions whose answers vary with patient conditions. Furthermore, we propose Condition-Gated Reasoning (CGR), a novel framework that constructs condition-aware knowledge graphs and selectively activates or prunes reasoning paths based on query conditions. Our findings show that CGR more reliably selects condition-appropriate answers while matching or exceeding state-of-the-art performance on biomedical QA benchmarks, highlighting the importance of explicitly modeling conditionality for robust medical reasoning.
Chinese Translation
当前的生物医学问答(QA)系统通常假设医学知识是统一适用的,但现实世界中的临床推理本质上是有条件的:几乎每个决策都依赖于患者特定因素,如合并症和禁忌症。现有的基准测试并未评估这种条件推理,而检索增强或基于图的方法缺乏明确的机制来确保检索到的知识适用于特定上下文。为了解决这一问题,我们提出了CondMedQA,这是第一个针对条件生物医学QA的基准,包含多跳问题,其答案随患者状况而变化。此外,我们提出了条件门控推理(Condition-Gated Reasoning, CGR),这是一个新颖的框架,构建条件感知知识图谱,并根据查询条件选择性地激活或剪枝推理路径。我们的研究结果表明,CGR在选择适合条件的答案方面更为可靠,同时在生物医学QA基准测试中匹配或超越了最先进的性能,突显了明确建模条件性对于稳健医学推理的重要性。
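The "gating" idea can be illustrated with a toy schema (the edge format, guard predicates, and example facts below are hypothetical, not from the paper): reasoning paths are kept only when their conditions are satisfied by the query's patient context.

```python
def gate_paths(edges, conditions):
    """Keep only knowledge-graph edges whose guard matches the patient context.

    edges: list of (head, relation, tail, guard) tuples, where guard is a
    predicate over the set of patient conditions (None = unconditional).
    The schema is a hypothetical illustration of condition-aware gating.
    """
    active = []
    for head, rel, tail, guard in edges:
        # An edge is activated if it is unconditional or its guard holds.
        if guard is None or guard(conditions):
            active.append((head, rel, tail))
    return active
```

Multi-hop reasoning would then run only over the activated subgraph, so the same question can yield different answers for different condition sets.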
cs.CL / 7 / 2602.17937

Analyzing LLM Instruction Optimization for Tabular Fact Verification

分析大型语言模型(LLM)指令优化在表格事实验证中的应用
Du, Xiaotang, Hong, Giwon, Kwan, Wai-Chung, Saxena, Rohit, Titov, Ivan, Minervini, Pasquale, Allaway, Emily
Abstract
Instruction optimization provides a lightweight, model-agnostic approach to enhancing the reasoning performance of large language models (LLMs). This paper presents the first systematic comparison of instruction optimization, based on the DSPy optimization framework, for tabular fact verification. We evaluate four out-of-the-box prompting techniques that cover both text-only prompting and code use: direct prediction, Chain-of-Thought (CoT), ReAct with SQL tools, and CodeAct with Python execution. We study three optimizers from the DSPy framework -- COPRO, MiPROv2, and SIMBA -- across four benchmarks and three model families. We find that instruction optimization consistently improves verification accuracy, with MiPROv2 yielding the most stable gains for CoT, and SIMBA providing the largest benefits for ReAct agents, particularly at larger model scales. Behavioral analyses reveal that SIMBA encourages more direct reasoning paths by applying heuristics, thereby improving numerical comparison abilities in CoT reasoning and helping avoid unnecessary tool calls in ReAct agents. Across different prompting techniques, CoT remains effective for tabular fact checking, especially with smaller models. Although ReAct agents built with larger models can achieve competitive performance, they require careful instruction optimization.
Chinese Translation
指令优化提供了一种轻量级、模型无关的方法,以提升大型语言模型(LLM)的推理性能。本文首次系统性地比较了基于DSPy优化框架的指令优化在表格事实验证中的应用。我们评估了四种现成的提示技术,涵盖文本提示和代码使用:直接预测、思维链(Chain-of-Thought, CoT)、结合SQL工具的ReAct,以及结合Python执行的CodeAct。我们研究了DSPy框架中的三种优化器——COPRO、MiPROv2和SIMBA——在四个基准测试和三个模型系列中的表现。我们发现,指令优化始终提高了验证准确性,其中MiPROv2在CoT中产生了最稳定的增益,而SIMBA在ReAct代理中提供了最大的收益,尤其是在较大模型规模下。行为分析表明,SIMBA通过应用启发式方法鼓励更直接的推理路径,从而改善了CoT推理中的数值比较能力,并帮助避免ReAct代理中的不必要工具调用。在不同的提示技术中,CoT在表格事实检查中仍然有效,尤其是在较小模型下。尽管使用较大模型构建的ReAct代理可以实现竞争性能,但它们需要仔细的指令优化。
cs.CL / 8 / 2602.17949

CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

CUICurate:基于GraphRAG的自动化临床概念整理框架用于自然语言处理应用
Blake, Victoria, Miller, Mathew, Novak, Jamie, Ooi, Sze-yuan, Gallego, Blanca
Abstract
Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and supertypes. Constructing such concept sets is labour-intensive, inconsistently performed, and poorly supported by existing tools, particularly for NLP pipelines that operate directly on UMLS CUIs. Methods: We present CUICurate, a graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. For each target concept, candidate CUIs were retrieved from the KG, followed by large language model (LLM) filtering and classification steps comparing two LLMs (GPT-5 and GPT-5-mini). The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets. Results: Across all concepts, CUICurate produced substantially larger and more complete concept sets than the manual benchmarks whilst matching human precision. Comparisons between the two LLMs found that GPT-5-mini achieved higher recall during filtering, while GPT-5 produced classifications that more closely aligned with clinician judgements. Outputs were stable across repeated runs and computationally inexpensive. Conclusions: CUICurate offers a scalable and reproducible approach to support UMLS concept set curation that substantially reduces manual effort. By integrating graph-based retrieval with LLM reasoning, the framework produces focused candidate concept sets that can be adapted to clinical NLP pipelines for different phenotyping and analytic requirements.
Chinese Translation
背景:临床命名实体识别工具通常将自由文本映射到统一医学语言系统(UMLS)概念唯一标识符(CUI)。然而,对于许多下游任务而言,临床上有意义的单位并不是单一的CUI,而是由相关同义词、子类型和超类型组成的概念集。构建这样的概念集劳动强度大,执行不一致,并且现有工具的支持不足,尤其是对于直接在UMLS CUI上操作的自然语言处理(NLP)管道。方法:我们提出了CUICurate,一个基于图的检索增强生成(GraphRAG)框架,用于自动化UMLS概念集整理。构建并嵌入了一个UMLS知识图谱(KG)以进行语义检索。对于每个目标概念,从KG中检索候选CUI,随后进行大型语言模型(LLM)过滤和分类步骤,比较了两个LLM(GPT-5和GPT-5-mini)。该框架在五个词汇异质的临床概念上进行了评估,与手动整理的基准和黄金标准概念集进行比较。结果:在所有概念中,CUICurate生成的概念集显著大于且更完整于手动基准,同时匹配了人类的精确度。对两个LLM的比较发现,GPT-5-mini在过滤过程中实现了更高的召回率,而GPT-5生成的分类更接近临床医生的判断。输出在重复运行中保持稳定且计算成本低。结论:CUICurate提供了一种可扩展且可重复的方法,以支持UMLS概念集整理,显著减少了手动工作。通过将基于图的检索与LLM推理相结合,该框架生成了可以适应不同表型和分析需求的临床NLP管道的聚焦候选概念集。
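A hedged sketch of the retrieval stage only (the LLM filtering and classification steps are omitted): candidate CUIs are ranked by cosine similarity between a target-concept embedding and the embedded KG nodes. Function and variable names here are assumptions.

```python
import numpy as np

def top_k_candidates(query_vec, cui_vecs, cui_ids, k=5):
    """Cosine-similarity retrieval of candidate CUIs from an embedded KG.

    query_vec: (d,) embedding of the target concept.
    cui_vecs: (n, d) embeddings of KG nodes; cui_ids: their identifiers.
    Returns the top-k (cui_id, similarity) pairs, most similar first.
    """
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    m = cui_vecs / (np.linalg.norm(cui_vecs, axis=1, keepdims=True) + 1e-8)
    sims = m @ q
    order = np.argsort(-sims)[:k]
    return [(cui_ids[i], float(sims[i])) for i in order]
```

In the full pipeline these candidates would be passed to the LLM for filtering and for classification into synonyms, subtypes, and supertypes.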
cs.CL / 9 / 2602.17981

Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering

长文档金融问答中RAG的检索失败分解
Kobeissi, Amine, Langlais, Philippe
Abstract
Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high stakes settings. We study a frequent failure mode in which the correct document is retrieved but the page or chunk that contains the answer is missed, leading the generator to extrapolate from incomplete context. Despite its practical significance, this within-document retrieval failure mode has received limited systematic attention in the Financial Question Answering (QA) literature. We evaluate retrieval at multiple levels of granularity (document, page, and chunk), and introduce an oracle based analysis to provide empirical upper bounds on retrieval and generative performance. On a 150 question subset of FinanceBench, we reproduce and compare diverse retrieval strategies including dense, sparse, hybrid, and hierarchical methods with reranking and query reformulation. Across methods, gains in document discovery tend to translate into stronger page recall, yet oracle performance still suggests headroom for page and chunk level retrieval. To target this gap, we introduce a domain fine-tuned page scorer that treats pages as an intermediate retrieval unit between documents and chunks. Unlike prior passage-based hierarchical retrieval, we fine-tune a bi-encoder specifically for page-level relevance on financial filings, exploiting the semantic coherence of pages. Overall, our results demonstrate a significant improvement in page recall and chunk retrieval.
Chinese Translation
检索增强生成(Retrieval-augmented generation)在长篇监管文件的金融问答中越来越多地被使用,然而其可靠性依赖于检索到准确的上下文,以在高风险环境中为答案提供依据。我们研究了一种常见的失败模式,即正确的文档被检索到,但包含答案的页面或块却被遗漏,导致生成器从不完整的上下文中推断。尽管具有重要的实际意义,这一文档内检索失败模式在金融问答(QA)文献中获得的系统性关注仍然有限。我们在多个粒度层面上评估检索,包括文档、页面和块级别,并引入基于oracle的分析,以提供检索和生成性能的经验上限。在FinanceBench的150个问题子集上,我们重现并比较了多种检索策略,包括密集型、稀疏型、混合型和层次型方法,以及重新排序和查询重构。不同方法之间,文档发现的提升往往转化为更强的页面召回率,但oracle性能仍然表明页面和块级检索存在提升空间。为了解决这一差距,我们引入了一种领域微调的页面评分器,将页面视为文档和块之间的中间检索单元。与以往基于段落的层次检索不同,我们专门针对金融文件的页面级相关性微调了双编码器,利用页面的语义一致性。总体而言,我们的结果显示页面召回率和块检索有显著改善。
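The multi-granularity evaluation reduces to computing recall over different unit IDs. A minimal helper, usable at document, page, or chunk level depending on which IDs are passed in (the ID format below is illustrative):

```python
def recall_at_k(retrieved, gold, k):
    """Fraction of gold units found among the top-k retrieved units.

    retrieved: ranked list of unit IDs (documents, pages, or chunks).
    gold: list of unit IDs that contain the answer.
    """
    top = set(retrieved[:k])
    return sum(1 for g in gold if g in top) / len(gold)
```

With page IDs such as "d1/p2", the same query can score 1.0 at document level yet 0.0 at page level, which is exactly the within-document failure mode the paper studies.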
cs.CL / 10 / 2602.18029

Towards More Standardized AI Evaluation: From Models to Agents

迈向更标准化的人工智能评估:从模型到智能体
Filali, Ali El, Bedar, Inès
Abstract
Evaluation is no longer a final checkpoint in the machine learning lifecycle. As AI systems evolve from static models to compound, tool-using agents, evaluation becomes a core control function. The question is no longer "How good is the model?" but "Can we trust the system to behave as intended, under change, at scale?". Yet most evaluation practices remain anchored in assumptions inherited from the model-centric era: static benchmarks, aggregate scores, and one-off success criteria. This paper argues that such approaches increasingly obscure, rather than illuminate, system behavior. We examine how evaluation pipelines themselves introduce silent failure modes, why high benchmark scores routinely mislead teams, and how agentic systems fundamentally alter the meaning of performance measurement. Rather than proposing new metrics or harder benchmarks, we aim to clarify the role of evaluation in the AI era, and especially for agents: not as performance theater, but as a measurement discipline that conditions trust, iteration, and governance in non-deterministic systems.
Chinese Translation
评估不再是机器学习生命周期中的最终检查点。随着人工智能系统从静态模型演变为复合的、使用工具的智能体,评估成为核心控制功能。问题不再是“模型有多好?”,而是“我们能否信任系统在变化和大规模情况下按预期行为运行?”然而,大多数评估实践仍然基于从模型中心时代继承的假设:静态基准、汇总分数和一次性成功标准。本文认为,这类方法越来越多地掩盖而非阐明系统行为。我们考察评估流程本身如何引入无声的失败模式,为什么高基准分数常常误导团队,以及智能体系统如何从根本上改变性能测量的意义。我们并不提议新的指标或更严格的基准,而是旨在澄清评估在人工智能时代,尤其是对于智能体的角色:不是作为性能展示,而是作为一种决定非确定性系统中信任、迭代与治理的测量学科。
cs.CL / 11 / 2602.18092

Perceived Political Bias in LLMs Reduces Persuasive Abilities

感知的政治偏见降低了大型语言模型的说服能力
DiGiuseppe, Matthew, Robison, Joshua
Abstract
Conversational AI has been proposed as a scalable way to correct public misconceptions and counter the spread of misinformation. Yet its effectiveness may depend on perceptions of its political neutrality. As LLMs enter partisan conflict, elites increasingly portray them as ideologically aligned. We test whether these credibility attacks reduce LLM-based persuasion. In a preregistered U.S. survey experiment (N=2144), participants completed a three-round conversation with ChatGPT about a personally held economic policy misconception. Compared to a neutral control, a short message indicating that the LLM was biased against the respondent's party attenuated persuasion by 28%. Transcript analysis indicates that the warnings alter the interaction: respondents push back more and engage less receptively. These findings suggest that the persuasive impact of conversational AI is politically contingent, constrained by perceptions of partisan alignment.
Chinese Translation
对话式人工智能被提议作为一种可扩展的方法,以纠正公众误解并遏制错误信息的传播。然而,其有效性可能取决于对其政治中立性的看法。随着大型语言模型(LLMs)进入党派冲突,精英们越来越多地将其描绘为带有意识形态倾向。我们测试了这些信誉攻击是否减少了基于LLM的说服力。在一项预注册的美国调查实验中(N=2144),参与者与ChatGPT进行了三轮关于个人持有的经济政策误解的对话。与中立对照组相比,一条表明该LLM对受访者所属政党存在偏见的简短信息使说服力降低了28%。逐字稿分析表明,这些警告改变了互动:受访者更加强烈反驳,参与的接受度降低。这些发现表明,对话式人工智能的说服影响在政治上是有条件的,受到对党派一致性感知的限制。
cs.CL / 12 / 2602.18137

Agentic Adversarial QA for Improving Domain-Specific LLMs

用于提升领域特定大型语言模型的主动对抗问答
Grari, Vincent, Tomoiaga, Ciprian, Lamprier, Sylvain, Hashimoto, Tatsunori, Detyniecki, Marcin
Abstract
Large Language Models (LLMs), despite extensive pretraining on broad internet corpora, often struggle to adapt effectively to specialized domains. There is growing interest in fine-tuning these models for such domains; however, progress is constrained by the scarcity and limited coverage of high-quality, task-relevant data. To address this, synthetic data generation methods such as paraphrasing or knowledge extraction are commonly applied. Although these approaches excel at factual recall and conceptual knowledge, they suffer from two critical shortcomings: (i) they provide minimal support for interpretive reasoning capabilities in these specialized domains, and (ii) they often produce synthetic corpora that are excessively large and redundant, resulting in poor sample efficiency. To overcome these gaps, we propose an adversarial question-generation framework that produces a compact set of semantically challenging questions. These questions are constructed by comparing the outputs of the model to be adapted and a robust expert model grounded in reference documents, using an iterative, feedback-driven process designed to reveal and address comprehension gaps. Evaluation on specialized subsets of the LegalBench corpus demonstrates that our method achieves greater accuracy with substantially fewer synthetic samples.
Chinese Translation
大型语言模型(LLMs)尽管在广泛的互联网语料库上进行了广泛的预训练,但在有效适应专业领域时常常面临挑战。对这些模型进行领域特定的微调日益受到关注;然而,进展受到高质量、任务相关数据稀缺和覆盖范围有限的制约。为了解决这一问题,通常采用诸如释义或知识提取等合成数据生成方法。尽管这些方法在事实回忆和概念知识方面表现出色,但它们存在两个关键缺陷:(i)在这些专业领域中对解释性推理能力的支持极为有限;(ii)它们往往生成过于庞大和冗余的合成语料,导致样本效率低下。为了克服这些不足,我们提出了一种对抗性问题生成框架,该框架生成一组语义上具有挑战性的紧凑问题。这些问题通过比较待适应模型的输出与基于参考文档的强大专家模型的输出,采用一种迭代的、基于反馈的过程构建,旨在揭示并解决理解差距。在对LegalBench语料库的专业子集进行评估时,我们的方法在合成样本显著减少的情况下实现了更高的准确性。
cs.CL / 13 / 2602.18145

Detecting Contextual Hallucinations in LLMs with Frequency-Aware Attention

基于频率感知注意力检测大语言模型中的上下文幻觉
Qi, Siya, Chen, Yudong, Zhao, Runcong, Zhu, Qinglin, Hu, Zhanghao, Liu, Wei, He, Yulan, Yuan, Zheng, Gui, Lin
Abstract
Hallucination detection is critical for ensuring the reliability of large language models (LLMs) in context-based generation. Prior work has explored intrinsic signals available during generation, among which attention offers a direct view of grounding behavior. However, existing approaches typically rely on coarse summaries that fail to capture fine-grained instabilities in attention. Inspired by signal processing, we introduce a frequency-aware perspective on attention by analyzing its variation during generation. We model attention distributions as discrete signals and extract high-frequency components that reflect rapid local changes in attention. Our analysis reveals that hallucinated tokens are associated with high-frequency attention energy, reflecting fragmented and unstable grounding behavior. Based on this insight, we develop a lightweight hallucination detector using high-frequency attention features. Experiments on the RAGTruth and HalluRAG benchmarks show that our approach achieves performance gains over verification-based, internal-representation-based, and attention-based methods across models and tasks.
Chinese Translation
幻觉检测对于确保大语言模型(LLMs)在基于上下文的生成中的可靠性至关重要。先前的研究探讨了生成过程中可用的内在信号,其中注意力提供了对接地(grounding)行为的直接视角。然而,现有的方法通常依赖于粗略的总结,未能捕捉注意力中的细粒度不稳定性。受到信号处理的启发,我们通过分析生成过程中注意力的变化,引入了一种基于频率感知的注意力视角。我们将注意力分布建模为离散信号,并提取反映注意力快速局部变化的高频成分。我们的分析揭示,幻觉令牌与高频注意力能量相关,反映出碎片化且不稳定的接地行为。基于这一洞察,我们开发了一种轻量级的幻觉检测器,利用高频注意力特征。在RAGTruth和HalluRAG基准上的实验表明,我们的方法在模型和任务上相较于基于验证、内部表示和注意力的方法实现了性能提升。
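A hedged realization of the frequency-aware view: treat one generated token's attention over context positions as a discrete signal and measure the share of spectral energy above a cutoff. The rFFT featurization and cutoff convention below are assumptions for illustration, not the paper's exact features.

```python
import numpy as np

def high_freq_energy(attn, cutoff=0.5):
    """Share of spectral energy above `cutoff` (as a fraction of Nyquist).

    attn: 1-D attention distribution over context positions for one generated
    token, treated as a discrete signal. High values indicate rapid local
    fluctuations in where the model attends.
    """
    # Power spectrum of the mean-removed signal.
    spectrum = np.abs(np.fft.rfft(attn - attn.mean())) ** 2
    freqs = np.fft.rfftfreq(len(attn))  # cycles per position, in [0, 0.5]
    total = spectrum.sum() + 1e-12
    # Energy at frequencies above cutoff * Nyquist, as a fraction of total.
    return float(spectrum[freqs >= cutoff * 0.5].sum() / total)
```

Per the abstract's finding, hallucinated tokens would tend to yield higher values of such a feature than well-grounded ones.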
cs.CL / 14 / 2602.18152

The Statistical Signature of LLMs

大型语言模型的统计特征
Hadad, Ortal, Loru, Edoardo, Nudo, Jacopo, Di Marco, Niccolò, Cinelli, Matteo, Quattrociocchi, Walter
Abstract
Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility than human-written text, consistent with a concentration of output within highly recurrent statistical patterns. However, this signature shows scale dependence: in fragmented interaction environments the separation attenuates, suggesting a fundamental limit to surface-level distinguishability at small scales. This compressibility-based separation emerges consistently across models, tasks, and domains and can be observed directly from surface text without relying on model internals or semantic evaluation. Overall, our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on the evolving complexity of communication.
Chinese Translation
大型语言模型通过从高维分布中进行概率采样生成文本,但这一过程如何重塑语言的结构统计组织仍未完全表征。本文展示了无损压缩提供了一种简单且与模型无关的统计规律度量,能够直接从表面文本中区分生成模式。我们分析了在三个逐渐复杂的信息生态系统中的压缩行为:受控的人类-大型语言模型(LLM)延续、知识基础设施的生成中介(维基百科与Grokipedia),以及完全合成的社交互动环境(Moltbook与Reddit)。在这些环境中,压缩揭示了概率生成的持久结构特征。在受控和中介的语境中,LLM生成的语言表现出比人类撰写文本更高的结构规律性和可压缩性,这与输出集中于高度重复的统计模式相一致。然而,这一特征显示出规模依赖性:在碎片化的互动环境中,分离程度减弱,暗示在小规模下表面层次可区分性的基本限制。这种基于可压缩性的分离在不同模型、任务和领域中一致出现,并且可以直接从表面文本中观察,而无需依赖模型内部或语义评估。总体而言,我们的研究结果引入了一种简单而稳健的框架,用于量化生成系统如何重塑文本生产,提供了对交流复杂性演变的结构性视角。
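The compression measure is simple to reproduce in spirit (zlib/DEFLATE here is an assumption; the paper's choice of compressor may differ): more statistically regular text yields a higher compression ratio.

```python
import zlib

def compressibility(text: str) -> float:
    """Compression ratio of a text under DEFLATE (higher = more regular).

    Returns 1 - compressed_size / raw_size, so repetitive, statistically
    regular text scores high and unpredictable text scores low.
    """
    raw = text.encode("utf-8")
    return 1.0 - len(zlib.compress(raw, 9)) / len(raw)
```

Comparing this score across human-written and LLM-generated corpora is, in essence, the model-agnostic surface-level separation the abstract describes.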
cs.CL / 15 / 2602.18154

FENCE: A Financial and Multimodal Jailbreak Detection Dataset

FENCE:一个金融和多模态越狱检测数据集
Kim, Mirae, Jeong, Seonghun, Kwak, Youngjun
Abstract
Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.
Chinese Translation
越狱对大型语言模型(LLMs)和视觉语言模型(VLMs)的部署构成了重大风险。由于VLM同时处理文本和图像,因此其攻击面更为广泛,尤其容易受到攻击。然而,现有的越狱检测资源相对匮乏,尤其是在金融领域。为了解决这一问题,我们提出了FENCE,这是一个双语(韩语-英语)多模态数据集,用于训练和评估金融应用中的越狱检测器。FENCE通过与图像相关的威胁配对的金融相关查询,强调了领域的真实性。对商业和开源VLM的实验揭示了一致的脆弱性,其中GPT-4o显示出可测量的攻击成功率,而开源模型则表现出更大的暴露风险。基于FENCE训练的基线检测器在分布内准确率达到99%,并在外部基准测试中保持强劲表现,凸显了该数据集在训练可靠检测模型方面的稳健性。FENCE为推动金融领域的多模态越狱检测提供了一个专注的资源,并支持在敏感领域中构建更安全、更可靠的人工智能系统。警告:本文包含可能令人反感的示例数据。
cs.CL / 16 / 2602.18171

Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models

点击它还是离开它:利用信息量度和大型语言模型检测和破坏点击诱饵
Michaluk, Wojciech, Urban, Tymoteusz, Kubita, Mateusz, Kuntur, Soveatin, Wroblewska, Anna
Abstract
Clickbait headlines degrade the quality of online information and undermine user trust. We present a hybrid approach to clickbait detection that combines transformer-based text embeddings with linguistically motivated informativeness features. Using natural language processing techniques, we evaluate classical vectorizers, word embedding baselines, and large language model embeddings paired with tree-based classifiers. Our best-performing model, XGBoost over embeddings augmented with 15 explicit features, achieves an F1-score of 91%, outperforming TF-IDF, Word2Vec, GloVe, LLM prompt-based classification, and feature-only baselines. The proposed feature set enhances interpretability by highlighting salient linguistic cues such as second-person pronouns, superlatives, numerals, and attention-oriented punctuation, enabling transparent and well-calibrated clickbait predictions. We release code and trained models to support reproducible research.
Chinese Translation
点击诱饵标题降低了在线信息的质量,并削弱了用户信任。我们提出了一种混合方法来检测点击诱饵,该方法结合了基于变换器的文本嵌入和语言学驱动的信息量特征。通过自然语言处理技术,我们评估了经典的向量化方法、词嵌入基准以及与树基分类器配对的大型语言模型嵌入。我们表现最佳的模型是基于XGBoost的嵌入,增加了15个显性特征,达到了91%的F1分数,优于TF-IDF、Word2Vec、GloVe、基于LLM提示的分类和仅特征基准。所提出的特征集通过突出显著的语言线索(如第二人称代词、最高级、数字和关注导向的标点符号)增强了可解释性,从而实现透明且经过良好校准的点击诱饵预测。我们发布了代码和训练模型,以支持可重复的研究。
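A few of the 15 explicit informativeness features named above can be sketched as follows; the exact feature definitions are assumptions for illustration, and the authors' released code defines the real set.

```python
import re

def clickbait_features(headline: str) -> dict:
    """Hand-crafted cues sketched from the abstract: second-person pronouns,
    superlatives, numerals, and attention-oriented punctuation."""
    tokens = re.findall(r"[A-Za-z]+", headline.lower())
    return {
        "second_person": sum(t in {"you", "your", "yours"} for t in tokens),
        # Crude superlative test: '-est' suffix or irregular forms.
        "superlatives": sum(t.endswith("est") or t in {"best", "worst", "most"}
                            for t in tokens),
        "numerals": len(re.findall(r"\d+", headline)),
        "exclaim_question": headline.count("!") + headline.count("?"),
    }
```

In the hybrid model, a vector of such counts would be concatenated with the headline's embedding before being fed to the XGBoost classifier.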
cs.CL / 17 / 2602.18176

Improving Sampling for Masked Diffusion Models via Information Gain

通过信息增益改进掩码扩散模型的采样
Yang, Kaisen, Teoh, Jayden, Yang, Kaicheng, Zhang, Yitong, Lamb, Alex
Abstract
Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models but require careful planning to achieve high-quality generation. Existing samplers typically adopt greedy heuristics, prioritizing positions with the highest local certainty to decode at each step. Through failure case analysis, we identify a fundamental limitation of this approach: it neglects the downstream impact of current decoding choices on subsequent steps and fails to minimize cumulative uncertainty. In particular, these methods do not fully exploit the non-causal nature of MDMs, which enables evaluating how a decoding decision reshapes token probabilities/uncertainty across all remaining masked positions. To bridge this gap, we propose the Info-Gain Sampler, a principled decoding framework that balances immediate uncertainty with information gain over future masked tokens. Extensive evaluations across diverse architectures and tasks (reasoning, coding, creative writing, and image generation) demonstrate that Info-Gain Sampler consistently outperforms existing samplers for MDMs. For instance, it achieves a 3.6% improvement in average accuracy on reasoning tasks and a 63.1% win-rate in creative writing. Notably, on reasoning tasks it reduces cumulative uncertainty from 78.4 to 48.6, outperforming the best baseline by a large margin. The code will be available at https://github.com/yks23/Information-Gain-Sampler.
Chinese Translation
掩码扩散模型(Masked Diffusion Models, MDMs)在解码顺序上比自回归模型提供了更大的灵活性,但需要仔细规划以实现高质量的生成。现有的采样器通常采用贪婪启发式方法,在每一步解码时优先选择具有最高局部确定性的位点。通过对失败案例的分析,我们识别出这种方法的一个根本限制:它忽视了当前解码选择对后续步骤的下游影响,并未能最小化累积不确定性。特别是,这些方法未能充分利用MDMs的非因果特性,这使得能够评估解码决策如何重塑所有剩余掩码位置的标记概率/不确定性。为了解决这一问题,我们提出了信息增益采样器(Info-Gain Sampler),这是一个原则性的解码框架,平衡了对未来掩码标记的信息增益与即时不确定性。在多种架构和任务(推理、编码、创意写作和图像生成)上的广泛评估表明,信息增益采样器始终优于现有的MDMs采样器。例如,它在推理任务上的平均准确率提高了3.6%,在创意写作中的胜率达到了63.1%。值得注意的是,在推理任务中,它将累积不确定性从78.4降低到48.6,显著优于最佳基线。代码将可在 https://github.com/yks23/Information-Gain-Sampler 获取。
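A toy sketch of the planning idea, assuming access to the MDM's re-scored distributions after tentatively committing a position (the function names and the scoring form with weight `lam` are assumptions, not the paper's objective): positions are chosen to trade off local certainty against the total uncertainty they remove elsewhere.

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy (nats) of distributions along `axis`."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=axis)

def pick_position(probs, future_probs_fn, lam=1.0):
    """Choose which masked position to decode next.

    probs: (m, V) current predictive distributions at the m masked positions.
    future_probs_fn(i): (m, V) distributions after committing position i to
    its argmax -- this re-query exploits the MDM's non-causal nature.
    """
    h_now = entropy(probs)                 # local uncertainty per position
    best, best_score = 0, -np.inf
    for i in range(len(probs)):
        h_after = entropy(future_probs_fn(i)).sum()
        gain = h_now.sum() - h_after       # total entropy removed by decoding i
        score = -h_now[i] + lam * gain     # certainty plus information gain
        if score > best_score:
            best, best_score = i, score
    return best
```

A position that is locally less certain can still be preferred if committing it collapses the uncertainty of the remaining masked positions, which is where this differs from purely greedy certainty-based samplers.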
cs.CL / 18 / 2602.18217

Information-Theoretic Storage Cost in Sentence Comprehension

句子理解中的信息论存储成本
Kajikawa, Kohei, Isono, Shinnosuke, Wilcox, Ethan Gotlieb
Abstract
Real-time sentence comprehension imposes a significant load on working memory, as comprehenders must maintain contextual information to anticipate future input. While measures of such load have played an important role in psycholinguistic theories, they have been formalized, largely, using symbolic grammars, which assign discrete, uniform costs to syntactic predictions. This study proposes a measure of processing storage cost based on an information-theoretic formalization, as the amount of information previous words carry about future context, under uncertainty. Unlike previous discrete, grammar-based metrics, this measure is continuous, theory-neutral, and can be estimated from pre-trained neural language models. The validity of this approach is demonstrated through three analyses in English: our measure (i) recovers well-known processing asymmetries in center embeddings and relative clauses, (ii) correlates with a grammar-based storage cost in a syntactically-annotated corpus, and (iii) predicts reading-time variance in two large-scale naturalistic datasets over and above baseline models with traditional information-based predictors.
Chinese Translation
实时句子理解对工作记忆施加了显著负担,因为理解者必须保持上下文信息以预测未来输入。尽管此类负载的测量在心理语言学理论中发挥了重要作用,但它们在很大程度上是通过符号语法形式化的,这种语法为句法预测分配了离散且统一的成本。本研究提出了一种基于信息论形式化的处理存储成本测量方法,该方法衡量在不确定性下,前面单词对未来上下文所携带的信息量。与之前的离散语法基础度量不同,这一度量是连续的、理论中立的,并且可以从预训练的神经语言模型中估计。通过对英语的三项分析证明了该方法的有效性:我们的度量(i) 恢复了中心嵌套和关系从句中众所周知的处理不对称性,(ii) 与语法基础的存储成本在一个句法注释语料库中相关,(iii) 在两个大型自然数据集中,超越传统信息基础预测模型,预测阅读时间的方差。
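The abstract does not print the measure; one natural reading of "the amount of information previous words carry about future context, under uncertainty" is a mutual-information quantity such as the following (notation assumed, not quoted from the paper):

```latex
% Hedged rendering: W_{1..t} is the prefix read so far, W_{t+1..T} the
% as-yet-unseen continuation; expectations are under the language model.
\mathrm{SC}(t)
  \;=\; I\big(W_{1..t};\, W_{t+1..T}\big)
  \;=\; \mathbb{E}\left[\log \frac{p\big(W_{t+1..T}\mid W_{1..t}\big)}
                                  {p\big(W_{t+1..T}\big)}\right]
```

Because both the conditional and the marginal continuation probabilities can be estimated from a pre-trained neural language model, such a measure is continuous and theory-neutral, as the abstract emphasizes.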
cs.CL / 19 / 2602.18232

Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning

通过减法思考:基于信心驱动的对比解码用于大语言模型推理
Tang, Lexiang, Gao, Weihao, Zhao, Bingchen, Ma, Lu, Jin, Qiao, Yang, Bang, Zou, Yuexian
Abstract
Recent work on test-time scaling for large language model (LLM) reasoning typically assumes that allocating more inference-time computation uniformly improves correctness. However, prior studies show that reasoning uncertainty is highly localized: a small subset of low-confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion. Motivated by this observation, we propose Thinking by Subtraction, a confidence-driven contrastive decoding approach that improves reasoning reliability through targeted token-level intervention. Our method, Confidence-Driven Contrastive Decoding, detects low-confidence tokens during decoding and intervenes selectively at these positions. It constructs a contrastive reference by replacing high-confidence tokens with minimal placeholders, and refines predictions by subtracting this reference distribution at low-confidence locations. Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead. As a training-free method, CCD enhances reasoning reliability through targeted low-confidence intervention without computational redundancy. Our code will be made available at: https://github.com/bolo-web/CCD.
Chinese Translation
近期关于大语言模型(LLM)推理的测试时扩展(test-time scaling)研究通常假设,分配更多的推理时计算会一致地提高正确性。然而,先前的研究表明,推理不确定性高度局部化:一小部分低信心的标记不成比例地导致推理错误和不必要的输出扩展。基于这一观察,我们提出了通过减法思考(Thinking by Subtraction),一种基于信心驱动的对比解码方法,通过针对性地在标记级别进行干预来提高推理的可靠性。我们的方法,信心驱动的对比解码(Confidence-Driven Contrastive Decoding),在解码过程中检测低信心标记,并在这些位置进行选择性干预。它通过用最小占位符替换高信心标记构建对比参考,并通过在低信心位置减去该参考分布来精炼预测。实验表明,CCD在数学推理基准测试中显著提高了准确性,同时大幅减少了输出长度,并且KV缓存开销最小。作为一种无训练的方法,CCD通过针对性地干预低信心标记来增强推理的可靠性,而不产生计算冗余。我们的代码将发布于:https://github.com/bolo-web/CCD。
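A hedged sketch of the token-level intervention described above. The gating threshold `tau`, strength `alpha`, and the exact subtraction rule are assumptions for illustration, not the authors' formulation.

```python
import numpy as np

def contrastive_step(logp, logp_ref, confidence, tau=0.7, alpha=1.0):
    """One decoding step of the 'thinking by subtraction' idea.

    logp: (V,) log-probs from the full context.
    logp_ref: (V,) log-probs from a contrastive reference context in which
    high-confidence tokens have been replaced by minimal placeholders.
    confidence: max probability of the original distribution at this step.
    """
    if confidence >= tau:
        scores = logp                     # high confidence: decode as usual
    else:
        scores = logp - alpha * logp_ref  # subtract the reference distribution
    return int(np.argmax(scores))
```

The intervention fires only at low-confidence positions, which is what keeps the method's extra compute and KV-cache overhead small.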
cs.CL / 20 / 2602.18262

Simplifying Outcomes of Language Model Component Analyses with ELIA

通过ELIA简化语言模型组件分析的结果
Eidt, Aaron Louis, Feldhus, Nils
Abstract
While mechanistic interpretability has developed powerful tools to analyze the internal workings of Large Language Models (LLMs), their complexity has created an accessibility gap, limiting their use to specialists. We address this challenge by designing, building, and evaluating ELIA (Explainable Language Interpretability Analysis), an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. The system integrates three key techniques -- Attribution Analysis, Function Vector Analysis, and Circuit Tracing -- and introduces a novel methodology: using a vision-language model to automatically generate natural language explanations (NLEs) for the complex visualizations produced by these methods. The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations. A key finding was that the AI-powered explanations helped bridge the knowledge gap for non-experts; a statistical analysis showed no significant correlation between a user's prior LLM experience and their comprehension scores, suggesting that the system reduced barriers to comprehension across experience levels. We conclude that an AI system can indeed simplify complex model analyses, but its true power is unlocked when paired with thoughtful, user-centered design that prioritizes interactivity, specificity, and narrative guidance.
Chinese Translation
尽管机械可解释性已经发展出强大的工具来分析大型语言模型(LLMs)的内部工作机制,但其复杂性造成了可及性差距,限制了其在专业人士之外的使用。我们通过设计、构建和评估ELIA(可解释语言可解释性分析)这一互动网页应用程序来应对这一挑战,该应用程序简化了各种语言模型组件分析的结果,使其更易于广泛受众使用。该系统整合了三种关键技术——归因分析(Attribution Analysis)、功能向量分析(Function Vector Analysis)和电路追踪(Circuit Tracing),并引入了一种新方法:使用视觉-语言模型自动生成自然语言解释(NLEs),以解释这些方法产生的复杂可视化效果。通过一项混合方法的用户研究,实证验证了该方法的有效性,结果显示用户更偏好互动、可探索的界面,而非简单的静态可视化。一个重要发现是,AI驱动的解释帮助弥补了非专家的知识差距;统计分析显示用户的先前LLM经验与其理解分数之间没有显著相关性,表明该系统降低了不同经验水平用户的理解障碍。我们得出结论,AI系统确实可以简化复杂的模型分析,但其真正的力量在于与以用户为中心的设计相结合,优先考虑互动性、具体性和叙事指导。
cs.CL / 21 / 2602.18324

PsihoRo: Depression and Anxiety Romanian Text Corpus

PsihoRo:抑郁与焦虑罗马尼亚文本语料库
Ciobotaru, Alexandra, Bucur, Ana-Maria, Dinu, Liviu P.
Abstract
Psychological corpora in NLP are collections of texts used to analyze human psychology, emotions, and mental health. These texts allow researchers to study psychological constructs, detect mental health issues, and analyze emotional language. However, mental health data can be difficult to collect correctly from social media, due to suppositions made by the collectors. A more pragmatic strategy involves gathering data through open-ended questions and then assessing this information with self-report screening surveys. This method has been employed successfully for English, a language with many psychological NLP resources; the same cannot be said for Romanian, which currently has no open-source mental health corpus. To address this gap, we have created the first corpus for depression and anxiety in Romanian, by utilizing a form with 6 open-ended questions along with the standardized PHQ-9 and GAD-7 screening questionnaires. Consisting of the texts of 205 respondents, PsihoRo may seem small, but it is a first step towards understanding and analyzing texts regarding the mental health of the Romanian population. We employ statistical analysis, text analysis using Romanian LIWC, emotion detection, and topic modeling to show the most important features of this newly introduced resource for the NLP community.
Chinese Translation
心理语料库在自然语言处理(NLP)中是用于分析人类心理、情感和心理健康的文本集合。这些文本使研究人员能够研究心理构念、检测心理健康问题并分析情感语言。然而,由于收集者的假设,从社交媒体上正确收集心理健康数据可能会很困难。一种更务实的策略是通过开放式问题收集数据,然后使用自我报告筛查调查评估这些信息。这种方法在英语中取得了成功,因为英语有大量的心理学NLP资源。然而,对于罗马尼亚语来说,情况并非如此,目前没有开源的心理健康语料库。为了解决这一空白,我们创建了第一个罗马尼亚语的抑郁与焦虑语料库,采用了包含6个开放式问题的表单,并结合标准化的PHQ-9和GAD-7筛查问卷。该语料库由205名受访者的文本组成,尽管看似数量较少,但PsihoRo是理解和分析罗马尼亚人口心理健康文本的第一步。我们采用统计分析、使用罗马尼亚LIWC的文本分析、情感检测和主题建模,展示了这一新引入资源对NLP社区最重要的特征。
cs.CL / 22 / 2602.18326

Predicting Contextual Informativeness for Vocabulary Learning using Deep Learning

利用深度学习预测词汇学习的上下文信息性
Wu, Tao, Kapelner, Adam
Abstract
We describe a modern deep learning system that automatically identifies informative contextual examples ("contexts") for first-language vocabulary instruction for high school students. Our paper compares three modeling approaches: (i) an unsupervised similarity-based strategy using MPNet's uniformly contextualized embeddings, (ii) a supervised framework built on instruction-aware, fine-tuned Qwen3 embeddings with a nonlinear regression head, and (iii) model (ii) plus handcrafted context features. We introduce a novel metric called the Retention Competency Curve to visualize trade-offs between the discarded proportion of good contexts and the "good-to-bad" context ratio, providing a compact, unified lens on model performance. Model (iii) delivers the most dramatic gains, achieving a good-to-bad ratio of 440 while discarding only 70% of the good contexts. In summary, we demonstrate that a modern embedding model on a neural network architecture, when guided by human supervision, yields a low-cost, large supply of near-perfect contexts for teaching vocabulary for a variety of target words.
Chinese Translation
我们描述了一种现代深度学习系统,该系统能够自动识别用于高中生第一语言词汇教学的信息性上下文示例(“contexts”)。本文比较了三种建模方法:(i)一种基于无监督相似性的策略,使用MPNet的统一上下文嵌入;(ii)一种基于监督的框架,构建在具有指令感知的微调Qwen3嵌入上,并配有非线性回归头;(iii)模型(ii)加上手工制作的上下文特征。我们引入了一种新颖的度量标准,称为保留能力曲线(Retention Competency Curve),用于可视化丢弃良好上下文的比例与“良好与不良”上下文比例之间的权衡,从而提供对模型性能的紧凑统一视角。模型(iii)在良好与不良上下文比例达到440的情况下,表现出最显著的提升,同时仅丢弃了70%的良好上下文。总之,我们证明了在人工监督指导下,基于神经网络架构的现代嵌入模型能够以低成本提供大量近乎完美的上下文,用于教授各种目标词汇。
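The Retention Competency Curve described above can be sketched as a threshold sweep over predicted informativeness scores. The exact definitions of the discard axis and the good-to-bad ratio are assumptions based on the abstract's wording.

```python
def retention_competency_curve(scores, labels, thresholds):
    """Points of a Retention Competency Curve (sketch).

    scores:     model-predicted informativeness per context.
    labels:     True for a good context, False for a bad one.
    thresholds: score cutoffs to sweep; contexts below a cutoff are discarded.
    Returns one (discarded_good_fraction, good_to_bad_ratio) pair per
    threshold. The paper's exact definition may differ.
    """
    total_good = sum(labels)
    curve = []
    for t in thresholds:
        kept = [label for s, label in zip(scores, labels) if s >= t]
        kept_good = sum(kept)
        kept_bad = len(kept) - kept_good
        discarded_good = (total_good - kept_good) / total_good
        ratio = kept_good / kept_bad if kept_bad else float("inf")
        curve.append((discarded_good, ratio))
    return curve
```

Raising the threshold trades away good contexts for a purer retained set, which is exactly the trade-off the curve is meant to visualize.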
cs.CL / 23 / 2602.18346

Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System

Vichara:印度司法系统的上诉判决预测与解释
Nair, Pavithra PM, Anish, Preethu Rose
Abstract
In jurisdictions like India, where courts face an extensive backlog of cases, artificial intelligence offers transformative potential for legal judgment prediction. A critical subset of this backlog comprises appellate cases, which are formal decisions issued by higher courts reviewing the rulings of lower courts. To this end, we present Vichara, a novel framework tailored to the Indian judicial system that predicts and explains appellate judgments. Vichara processes English-language appellate case proceeding documents and decomposes them into decision points. Decision points are discrete legal determinations that encapsulate the legal issue, deciding authority, outcome, reasoning, and temporal context. The structured representation isolates the core determinations and their context, enabling accurate predictions and interpretable explanations. Vichara's explanations follow a structured format inspired by the IRAC (Issue-Rule-Application-Conclusion) framework and adapted for Indian legal reasoning. This enhances interpretability, allowing legal professionals to assess the soundness of predictions efficiently. We evaluate Vichara on two datasets, PredEx and the expert-annotated subset of the Indian Legal Documents Corpus (ILDC_expert), using four large language models: GPT-4o mini, Llama-3.1-8B, Mistral-7B, and Qwen2.5-7B. Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B. Human evaluation of the generated explanations across Clarity, Linking, and Usefulness metrics highlights GPT-4o mini's superior interpretability.
Chinese Translation
在印度等司法管辖区,法院面临着大量待处理案件的积压,人工智能为法律判决预测提供了变革性的潜力。这一积压的一个关键子集是上诉案件,即由高等法院对下级法院裁决进行审查所作出的正式决定。为此,我们提出了Vichara,这是一个专为印度司法系统量身定制的新框架,用于预测和解释上诉判决。Vichara处理英语上诉案件程序文件,并将其分解为决策点。决策点是离散的法律判定,概括了法律问题、裁决权、结果、推理和时间背景。结构化的表示法隔离了核心判定及其背景,从而实现准确的预测和可解释的解释。Vichara的解释遵循受IRAC(问题-规则-应用-结论)框架启发的结构化格式,并针对印度法律推理进行了调整。这增强了可解释性,使法律专业人士能够高效地评估预测的合理性。我们在两个数据集上评估了Vichara,分别是PredEx和印度法律文档语料库(ILDC_expert)的专家注释子集,使用了四个大型语言模型:GPT-4o mini、Llama-3.1-8B、Mistral-7B和Qwen2.5-7B。Vichara在这两个数据集上超越了现有的判决预测基准,其中GPT-4o mini的表现最佳(F1:在PredEx上为81.5,在ILDC_expert上为80.3),其次是Llama-3.1-8B。对生成的解释进行的人类评估在清晰度、关联性和实用性指标上突显了GPT-4o mini的优越可解释性。
cs.CL / 24 / 2602.18351

Validating Political Position Predictions of Arguments

验证论点的政治立场预测
Robinson, Jordan, Williams, Angus R., Atkinson, Katie, Cohn, Anthony G.
Abstract
Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation. We address this challenge through a dual-scale validation framework applied to political stance prediction in argumentative discourse, combining pointwise and pairwise human annotation. Using 22 language models, we construct a large-scale knowledge base of political position predictions for 23,228 arguments drawn from 30 debates that appeared on the UK political television programme Question Time. Pointwise evaluation shows moderate human-model agreement (Krippendorff's α = 0.578), reflecting intrinsic subjectivity, while pairwise validation reveals substantially stronger alignment between human- and model-derived rankings (α = 0.86 for the best model). This work contributes: (i) a practical validation methodology for subjective continuous knowledge that balances scalability with reliability; (ii) a validated structured argumentation knowledge base enabling graph-based reasoning and retrieval-augmented generation in political domains; and (iii) evidence that ordinal structure can be extracted from pointwise language model predictions on inherently subjective real-world discourse, advancing knowledge representation capabilities for domains where traditional symbolic or categorical approaches are insufficient.
Chinese Translation
现实世界的知识表示通常需要捕捉主观的、连续的属性——例如政治立场——而这类属性与成对验证(人类评估中被广泛接受的金标准)相冲突。我们通过一种双尺度验证框架来应对这一挑战,该框架应用于论证话语中的政治立场预测,结合了逐点和成对的人类标注。使用22种语言模型,我们构建了一个大规模的政治立场预测知识库,涵盖了来自30场辩论的23,228个论点,这些辩论出自英国政治电视节目《Question Time》。逐点评估显示人类与模型之间的一致性中等(Krippendorff's α=0.578),反映了内在的主观性,而成对验证则揭示了人类与模型生成的排名之间显著更强的一致性(最佳模型的α=0.86)。本研究的贡献包括:(i)一种实用的主观连续知识验证方法,平衡了可扩展性与可靠性;(ii)一个经过验证的结构化论证知识库,支持政治领域中基于图的推理与检索增强生成;(iii)证据表明,可以从逐点语言模型预测中提取序数结构,这些预测源自本质上主观的现实世界话语,从而推动了传统符号或分类方法不足的领域中的知识表示能力。
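The contrast between pointwise scores and pairwise rankings can be illustrated with a simple concordance measure over derived pairs. Note this is only a minimal stand-in: the paper uses Krippendorff's alpha, which additionally corrects for chance agreement and handles missing annotations.

```python
from itertools import combinations

def pairwise_agreement(human, model):
    """Fraction of argument pairs ordered the same way by human and model
    position scores (a simple concordance sketch, not Krippendorff's alpha).

    human, model: continuous position scores for the same list of arguments.
    Tied pairs on either side are skipped.
    """
    agree = total = 0
    for i, j in combinations(range(len(human)), 2):
        h = (human[i] > human[j]) - (human[i] < human[j])
        m = (model[i] > model[j]) - (model[i] < model[j])
        if h == 0 or m == 0:
            continue  # skip ties: neither side expresses an ordering
        total += 1
        agree += (h == m)
    return agree / total if total else 0.0
```

This shows how ordinal structure can be read off pointwise predictions: only the relative ordering of scores matters, not their absolute values.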
cs.CL / 25 / 2602.18420

SPQ: An Ensemble Technique for Large Language Model Compression

SPQ:一种用于大型语言模型压缩的集成技术
Yao, Jiamin, Gultepe, Eren
Abstract
This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. Each component targets a different source of inefficiency: i) pruning removes redundant neurons in MLP layers, ii) SVD reduces attention projections into compact low-rank factors, and iii) 8-bit quantization uniformly compresses all linear layers. At matched compression ratios, SPQ outperforms the individual methods (SVD-only, pruning-only, or quantization-only) in perplexity, demonstrating the benefit of combining complementary techniques. Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K. Compared to strong baselines like GPTQ and SparseGPT, SPQ offers competitive perplexity and accuracy while using less memory (6.86 GB vs. 7.16 GB for GPTQ). Moreover, SPQ improves inference throughput over GPTQ, achieving up to a 1.9x speedup, which further enhances its practicality for real-world deployment. SPQ's robust, layer-aware combination of complementary compression techniques may enable practical deployment of LLMs in memory-constrained environments. Code is available at: https://github.com/JiaminYao/SPQ_LLM_Compression/
Chinese Translation
本研究提出了一种名为SPQ(SVD-剪枝-量化)的集成技术,用于大型语言模型(LLM)的压缩,该技术结合了保留方差的奇异值分解(SVD)、基于激活的剪枝和后训练线性量化。每个组件针对不同的低效来源:i)剪枝去除多层感知器(MLP)层中的冗余神经元,ii)SVD将注意力投影压缩为紧凑的低秩因子,iii)8位量化均匀压缩所有线性层。在相同的压缩比下,SPQ在困惑度上优于单独的方法(仅SVD、仅剪枝或仅量化),展示了结合互补技术的优势。应用于LLaMA-2-7B,SPQ实现了高达75%的内存减少,同时保持或提高了困惑度(例如,从WikiText-2的5.47降低到4.91),并在下游基准测试(如C4、TruthfulQA和GSM8K)中保持准确性。与强基线如GPTQ和SparseGPT相比,SPQ在使用更少内存(6.86 GB对比GPTQ的7.16 GB)的同时,提供了竞争力的困惑度和准确性。此外,SPQ在推理吞吐量上优于GPTQ,实现了高达1.9倍的加速,这进一步增强了其在实际部署中的实用性。SPQ通过层感知和互补压缩技术实现的强大压缩效果,可能为在内存受限环境中实际部署LLM提供了可行性。代码可在以下链接获取:https://github.com/JiaminYao/SPQ_LLM_Compression/
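The SVD and quantization components of the pipeline can be sketched with NumPy as below. The fixed-rank truncation (standing in for the paper's variance-retention criterion) and the symmetric 8-bit scheme are simplifying assumptions.

```python
import numpy as np

def svd_compress(W, rank):
    """Low-rank factorization of a weight matrix, W ≈ A @ B (sketch).

    The paper selects rank by retained variance; a fixed rank is used here
    for illustration.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # scale columns by top singular values
    B = Vt[:rank, :]
    return A, B

def quantize_int8(W):
    """Symmetric post-training 8-bit quantization of a linear layer (sketch)."""
    scale = np.abs(W).max() / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and their scale."""
    return q.astype(np.float32) * scale
```

Storing `A` and `B` instead of `W` saves memory whenever `rank * (m + n) < m * n`, and int8 storage cuts each remaining weight from 4 bytes to 1, which is where the combined memory reduction comes from.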
cs.CL / 26 / 2602.18425

RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering

RVR:检索-验证-再检索用于综合问题回答
Qian, Deniz, Chen, Hung-Ting, Choi, Eunsol
Abstract
Comprehensively retrieving diverse documents is crucial to address queries that admit a wide range of valid answers. We introduce retrieve-verify-retrieve (RVR), a multi-round retrieval framework designed to maximize answer coverage. Initially, a retriever takes the original query and returns a candidate document set, followed by a verifier that identifies a high-quality subset. For subsequent rounds, the query is augmented with previously verified documents to uncover answers that are not yet covered in previous rounds. RVR is effective even with off-the-shelf retrievers, and fine-tuning retrievers for our inference procedure brings further gains. Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI). We also see consistent gains on two out-of-domain datasets (QUEST and WebQuestionsSP) across different base retrievers. Our work presents a promising iterative approach for comprehensive answer recall leveraging a verifier and adapting retrievers to a new inference scenario.
Chinese Translation
全面检索多样化文档对于处理允许多种有效答案的查询至关重要。我们提出了检索-验证-再检索(RVR),这是一种多轮检索框架,旨在最大化答案覆盖率。最初,检索器接受原始查询并返回候选文档集,随后由验证器识别出高质量的子集。在后续轮次中,查询会与之前验证过的文档相结合,以发现前几轮尚未覆盖的答案。即使使用现成的检索器,RVR 也表现出色,而针对我们的推理过程对检索器进行微调还能带来进一步的提升。我们的方法优于包括自主搜索方法在内的多种基线,在多答案检索数据集(QAMPARI)上的完全召回率方面实现了至少 10% 的相对增益和 3% 的绝对增益。我们还在两个领域外数据集(QUEST 和 WebQuestionsSP)上,跨不同的基础检索器观察到了一致的增益。我们的工作展示了一种有前景的迭代方法:通过利用验证器并使检索器适应新的推理场景,实现全面的答案召回。
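The multi-round loop can be sketched as follows. The callback signatures and the query-augmentation format (appending verified documents as plain text) are assumptions about the framework's interface, not its documented API.

```python
def rvr(query, retrieve, verify, rounds=3):
    """Retrieve-verify-retrieve loop (sketch).

    retrieve(query) -> list of candidate documents;
    verify(query, docs) -> the high-quality subset of those documents.
    Each round augments the query with everything verified so far, steering
    the retriever toward answers not yet covered.
    """
    verified = []
    for _ in range(rounds):
        augmented = " ".join([query] + verified)
        # Drop documents already verified in earlier rounds.
        candidates = [d for d in retrieve(augmented) if d not in verified]
        if not candidates:
            break  # nothing new retrieved: coverage has saturated
        good = verify(query, candidates)
        if not good:
            break  # verifier rejected everything: stop early
        verified.extend(good)
    return verified
```

With toy string-matching callbacks, the loop accumulates the full answer set across rounds rather than stopping at the first retrieval's top hits.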
cs.CL / 27 / 2602.18429

VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning

VIRAASAT:探索印度文化推理的新路径
Surana, Harshul Raj, Maji, Arijit, Vats, Aryan, Ghosh, Akash, Saha, Sriparna, Sheth, Amit
Abstract
Large Language Models (LLMs) have made significant progress in reasoning tasks across various domains such as mathematics and coding. However, their performance deteriorates in tasks requiring rich socio-cultural knowledge and diverse local contexts, particularly those involving Indian culture. Existing cultural benchmarks are (i) manually crafted, (ii) limited to single-hop questions testing factual recall, and (iii) prohibitively costly to scale, leaving this deficiency largely unmeasured. To address this, we introduce VIRAASAT, a novel, semi-automated approach for generating a culture-specific multi-hop question-answering dataset for Indian culture. VIRAASAT leverages a Knowledge Graph comprising more than 700 expert-curated cultural artifacts, covering 13 key attributes of Indian culture (history, festivals, etc.). VIRAASAT spans all 28 states and 8 Union Territories, yielding more than 3,200 multi-hop questions that necessitate chained cultural reasoning. We evaluate current state-of-the-art (SOTA) LLMs on VIRAASAT and identify key limitations in reasoning, wherein fine-tuning on Chain-of-Thought (CoT) traces fails to ground and synthesize low-probability facts. To bridge this gap, we propose a novel framework named Symbolic Chain-of-Manipulation (SCoM). Adapting the Chain-of-Manipulation paradigm, we train the model to simulate atomic Knowledge Graph manipulations internally. SCoM teaches the model to reliably traverse the topological structure of the graph. Experiments with Supervised Fine-Tuning (SFT) demonstrate that SCoM outperforms standard CoT baselines by up to 20%. We release the VIRAASAT dataset along with our findings, laying a strong foundation towards building culturally aware reasoning models.
Chinese Translation
大型语言模型(LLMs)在数学和编程等多个领域的推理任务中取得了显著进展。然而,在需要丰富社会文化知识和多样地方背景的任务中,它们的表现却有所下降,尤其是在涉及印度文化的任务中。现有的文化基准(i)是手动构建的,(ii)仅包含测试事实回忆的单跳问题,以及(iii)扩展成本过高,导致这一缺陷在很大程度上未被量化。为了解决这个问题,我们提出了VIRAASAT,一种新颖的半自动化方法,用于生成针对印度文化的文化特定多跳问答数据集。VIRAASAT利用一个包含700多个专家策划的文化文物的知识图谱,涵盖印度文化的13个关键属性(历史、节日等)。VIRAASAT覆盖所有28个邦和8个联邦直辖区,产生了3200多个需要链式文化推理的多跳问题。我们在VIRAASAT上评估了当前最先进(SOTA)的LLMs,并识别出推理中的关键局限:基于思维链(Chain-of-Thought, CoT)轨迹的微调未能对低概率事实进行锚定和综合。为了弥补这一差距,我们提出了一种名为符号操作链(Symbolic Chain-of-Manipulation, SCoM)的新框架。通过改编操作链范式,我们训练模型在内部模拟原子化的知识图谱操作。SCoM教会模型可靠地遍历图谱的拓扑结构。监督微调(Supervised Fine-Tuning, SFT)实验表明,SCoM的表现比标准CoT基线最多高出20%。我们发布了VIRAASAT数据集及我们的研究结果,为构建具有文化意识的推理模型奠定了坚实的基础。
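The atomic graph manipulations SCoM trains the model to simulate can be sketched as a chained lookup over triples. The triples and relation names below are hypothetical illustrations, not items from the VIRAASAT graph.

```python
def multi_hop(kg, start, relations):
    """Chained traversal over a knowledge graph, one atomic hop at a time
    (a sketch of the graph walks SCoM teaches the model to simulate).

    kg:        {(entity, relation): entity} mapping of triples.
    start:     the entity the question anchors on.
    relations: the hop sequence a multi-hop question requires.
    """
    node = start
    for rel in relations:
        node = kg.get((node, rel))  # one atomic manipulation per hop
        if node is None:
            return None  # traversal left the graph: question unanswerable
    return node
```

A two-hop question such as "What is the capital of the state where Onam is celebrated?" reduces to chaining two such lookups, which is the kind of chained cultural reasoning the benchmark targets.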