cs.RO / 1 / 2602.06966
Embodied Intelligence for Flexible Manufacturing: A Survey
Abstract
Driven by breakthroughs in next-generation artificial intelligence, embodied intelligence is rapidly advancing into industrial manufacturing. In flexible manufacturing, industrial embodied intelligence faces three core challenges: accurate process modeling and monitoring under limited perception, dynamic balancing between flexible adaptation and high-precision control, and the integration of general-purpose skills with specialized industrial operations. Accordingly, this survey reviews existing work from three viewpoints: Industrial Eye, Industrial Hand, and Industrial Brain. At the perception level (Industrial Eye), multimodal data fusion and real-time modeling in complex dynamic settings are examined. At the control level (Industrial Hand), flexible, adaptive, and precise manipulation for complex manufacturing processes is analyzed. At the decision level (Industrial Brain), intelligent optimization methods for process planning and line scheduling are summarized. By considering multi-level collaboration and interdisciplinary integration, this work reveals the key technological pathways by which embodied intelligence enables closed-loop perception-decision-execution optimization in manufacturing systems. Finally, a three-stage evolution model for embodied intelligence in flexible manufacturing, comprising cognition enhancement, skill transition, and system evolution, is proposed, and future development trends are examined, offering both a theoretical framework and practical guidance for the interdisciplinary advancement of industrial embodied intelligence.
cs.RO / 2 / 2602.06967
Leveraging Adaptive Group Negotiation for Heterogeneous Multi-Robot Collaboration with Large Language Models
Abstract
Multi-robot collaboration tasks often require heterogeneous robots to work together over long horizons under spatial constraints and environmental uncertainties. Although Large Language Models (LLMs) excel at reasoning and planning, their potential for coordinated control has not been fully explored. Inspired by human teamwork, we present CLiMRS (Cooperative Large-Language-Model-Driven Heterogeneous Multi-Robot System), an adaptive group negotiation framework among LLMs for multi-robot collaboration. This framework pairs each robot with an LLM agent and dynamically forms subgroups through a general proposal planner. Within each subgroup, a subgroup manager leads perception-driven multi-LLM discussions to derive action commands. Feedback is provided by both robot execution outcomes and environment changes. This grouping-planning-execution-feedback loop enables efficient planning and robust execution. To evaluate these capabilities, we introduce CLiMBench, a heterogeneous multi-robot benchmark of challenging assembly tasks. Our experiments show that CLiMRS surpasses the best baseline, achieving over 40% higher efficiency on complex tasks without sacrificing success on simpler ones. Overall, our results demonstrate that leveraging human-inspired group formation and negotiation principles significantly enhances the efficiency of heterogeneous multi-robot collaboration. Our code is available here: https://github.com/song-siqi/CLiMRS.
cs.RO / 3 / 2602.06968
Learning to Anchor Visual Odometry: KAN-Based Pose Regression for Planetary Landing
Abstract
Accurate and real-time 6-DoF localization is mission-critical for autonomous lunar landing, yet existing approaches remain limited: visual odometry (VO) drifts unboundedly, while map-based absolute localization fails in texture-sparse or low-light terrain. We introduce KANLoc, a monocular localization framework that tightly couples VO with a lightweight but robust absolute pose regressor. At its core is a Kolmogorov-Arnold Network (KAN) that learns the complex mapping from image features to map coordinates, producing sparse but highly reliable global pose anchors. These anchors are fused into a bundle adjustment framework, effectively canceling drift while retaining local motion precision. KANLoc delivers three key advances: (i) a KAN-based pose regressor that achieves high accuracy with remarkable parameter efficiency, (ii) a hybrid VO-absolute localization scheme that yields globally consistent real-time trajectories (>=15 FPS), and (iii) a tailored data augmentation strategy that improves robustness to sensor occlusion. On both realistic synthetic and real lunar landing datasets, KANLoc reduces average translation and rotation error by 32% and 45%, respectively, with per-trajectory gains of up to 45%/48%, outperforming strong baselines.
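The anchor-fusion idea, sparse absolute fixes canceling unbounded VO drift, can be illustrated with a minimal 1-D least-squares sketch. All weights, anchor placements, and function names below are illustrative assumptions, not the paper's bundle-adjustment formulation:

```python
import numpy as np

def fuse_anchors(odo, anchors, w=100.0):
    """Fuse relative VO increments with sparse absolute pose anchors (1-D sketch).

    odo: N-1 relative displacements (drifting); anchors: dict {index: position}
    of sparse absolute fixes (e.g. from a learned pose regressor). Solves the
    linear least squares  sum (x[i+1]-x[i]-odo[i])^2 + w * sum (x[k]-a[k])^2.
    """
    n = len(odo) + 1
    rows, b = [], []
    for i, d in enumerate(odo):           # odometry (relative-motion) residuals
        row = np.zeros(n)
        row[i + 1], row[i] = 1.0, -1.0
        rows.append(row)
        b.append(d)
    for k, a in anchors.items():          # heavily weighted absolute anchors
        row = np.zeros(n)
        row[k] = np.sqrt(w)
        rows.append(row)
        b.append(np.sqrt(w) * a)
    x, *_ = np.linalg.lstsq(np.vstack(rows), np.array(b), rcond=None)
    return x

# true motion is 1 m/step; VO overestimates by 5% (drift grows without bound);
# anchors at steps 0, 5, and 10 pull the trajectory back toward truth
true_end = 10.0
odo = np.full(10, 1.05)
x = fuse_anchors(odo, {0: 0.0, 5: 5.0, 10: 10.0})
drift_vo = abs(np.cumsum(np.concatenate([[0.0], odo]))[-1] - true_end)
drift_fused = abs(x[-1] - true_end)
```

With only dead-reckoned odometry the endpoint error is 0.5 m; the anchored solve reduces it by orders of magnitude while preserving the smooth local motion between anchors.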
cs.RO / 4 / 2602.06969
A Survey of Medical Drones from Flight Dynamics, Guidance, Navigation, and Control Perspectives
Abstract
The integration of drones into the medical field has revolutionized healthcare delivery by enabling rapid transportation of medical supplies, organs, and even emergency assistance in remote or disaster-stricken areas. While other survey papers focus on the healthcare supply chain, operations, and medical emergency response aspects, this paper provides a comprehensive review of medical drones from the perspectives of flight dynamics and guidance, navigation, and control (GNC) systems. We first discuss the medical aerial delivery mission requirements and suitable uncrewed aerial system (UAS) configurations. We then address payload container design and optimization, and its effect on supplies and overall flight dynamics. We also explore the fundamental principles of GNC in the context of medical drone operations, highlighting key challenges arising from vibration, air temperature, pressure, and humidity, which affect the quality of medical supplies. The paper examines various GNC algorithms that can mitigate these challenges, as well as the algorithms' limitations. With these considerations, this survey aims to provide insights into optimizing GNC frameworks for medical drones, emphasizing research gaps and directions to improve real-world healthcare applications.
cs.RO / 5 / 2602.06971
Formal Methods in Robot Policy Learning and Verification: A Survey on Current Techniques and Future Directions
Abstract
As hardware and software systems have grown in complexity, formal methods have been indispensable tools for rigorously specifying acceptable behaviors, synthesizing programs to meet these specifications, and validating the correctness of existing programs. In the field of robotics, a similar trend of rising complexity has emerged, driven in large part by the adoption of deep learning. While this shift has enabled the development of highly performant robot policies, their implementation as deep neural networks has posed challenges to traditional formal analysis, leading to models that are inflexible, fragile, and difficult to interpret. In response, the robotics community has introduced new formal and semi-formal methods to support the precise specification of complex objectives, guide the learning process to achieve them, and enable the verification of learned policies against them. In this survey, we provide a comprehensive overview of how formal methods have been used in recent robot learning research. We organize our discussion around two pillars: policy learning and policy verification. For both, we highlight representative techniques, compare their scalability and expressiveness, and summarize how they contribute to meaningfully improving realistic robot safety and correctness. We conclude with a discussion of remaining obstacles for achieving that goal and promising directions for advancing formal methods in robot learning.
cs.RO / 6 / 2602.06974
FeudalNav: A Simple Framework for Visual Navigation
Abstract
Visual navigation for robotics is inspired by the human ability to navigate environments using visual cues and memory, eliminating the need for detailed maps. In unseen, unmapped, or GPS-denied settings, traditional metric map-based methods fall short, prompting a shift toward learning-based approaches with minimal exploration. In this work, we develop a hierarchical framework that decomposes the navigation decision-making process into multiple levels. Our method learns to select subgoals through a simple, transferable waypoint selection network. A key component of the approach is a latent-space memory module organized solely by visual similarity, as a proxy for distance. This alternative to graph-based topological representations proves sufficient for navigation tasks, providing a compact, light-weight, simple-to-train navigator that can find its way to the goal in novel locations. We show competitive results with a suite of SOTA methods in Habitat AI environments without using any odometry in training or inference. An additional contribution leverages the interpretability of the framework for interactive navigation. We consider the question: how much direct intervention/interaction is needed to achieve success in all trials? We demonstrate that even minimal human involvement can significantly enhance overall navigation performance.
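The similarity-organized memory can be sketched as a flat store of normalized embeddings queried by cosine similarity. This is a toy illustration of the idea only, not FeudalNav's actual module; the class and method names are invented:

```python
import numpy as np

class SimilarityMemory:
    """Latent-space memory organized only by visual similarity (no graph, no odometry).

    Stores image embeddings as unit vectors; retrieval returns the most similar
    stored view, using cosine similarity as a proxy for spatial proximity.
    """

    def __init__(self):
        self.keys = []

    def add(self, z):
        self.keys.append(z / np.linalg.norm(z))

    def nearest(self, z):
        q = z / np.linalg.norm(z)
        sims = np.array([k @ q for k in self.keys])
        return int(sims.argmax()), float(sims.max())

# store three distinct "views" and query with a noisy copy of the second one
mem = SimilarityMemory()
for v in np.eye(3):
    mem.add(v)
idx, sim = mem.nearest(np.array([0.1, 1.0, 0.0]))
```

The retrieval has no notion of metric distance; two visually similar views are treated as "near" regardless of where they were observed, which is exactly the proxy assumption described above.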
cs.RO / 7 / 2602.06977
Autonomous Manipulation of Hazardous Chemicals and Delicate Objects in a Self-Driving Laboratory: A Sliding Mode Approach
Abstract
Precise handling of chemical instruments and materials within a self-driving laboratory environment using robotic systems demands advanced and reliable control strategies. Sliding Mode Control (SMC) has emerged as a robust approach for managing uncertainties and disturbances in manipulator dynamics, providing superior control performance compared to traditional methods. This study implements a model-based SMC (MBSMC) utilizing a hyperbolic tangent function to regulate the motion of a manipulator mounted on a mobile platform operating inside a self-driving chemical laboratory. Given the manipulator's role in transporting fragile glass vessels filled with hazardous chemicals, the controller is specifically designed to minimize abrupt transitions and achieve gentle, accurate trajectory tracking. The proposed controller is benchmarked against a non-model-based SMC (NMBSMC) and a Proportional-Integral-Derivative (PID) controller using a comprehensive set of joint and Cartesian metrics. Compared to PID and NMBSMC, MBSMC achieved significantly smoother motion and up to 90% lower control effort, validating its robustness and precision for autonomous laboratory operations. Experimental trials confirmed successful execution of tasks such as vessel grasping and window operation; these tasks failed under PID control, whose limited ability to handle nonlinear dynamics and external disturbances resulted in substantial trajectory tracking errors. The results validate the controller's effectiveness in achieving smooth, precise, and safe manipulator motions, supporting the advancement of intelligent mobile manipulators in autonomous laboratory environments.
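As a rough illustration of tanh-smoothed sliding mode control (a 1-DoF point-mass sketch, not the paper's full model-based manipulator controller; all gains and names are arbitrary assumptions):

```python
import numpy as np

def smc_tanh_step(q, dq, q_ref, dq_ref, ddq_ref, m=1.0, lam=5.0, k=8.0, phi=0.1):
    """One step of model-based SMC with tanh smoothing for a 1-DoF joint.

    Sliding surface s = de + lam*e; the tanh term replaces the discontinuous
    sign(s) of classical SMC, avoiding chattering (abrupt torque transitions).
    """
    e = q - q_ref
    de = dq - dq_ref
    s = de + lam * e
    # model-based feedforward (inertia m assumed known) + smoothed switching term
    tau = m * (ddq_ref - lam * de) - k * np.tanh(s / phi)
    return tau

# track a step reference with a point-mass joint: m*ddq = tau
dt, m = 1e-3, 1.0
q, dq = 0.0, 0.0
for _ in range(4000):
    tau = smc_tanh_step(q, dq, q_ref=1.0, dq_ref=0.0, ddq_ref=0.0, m=m)
    dq += (tau / m) * dt
    q += dq * dt
```

Inside the boundary layer of width `phi` the control acts like a smooth high-gain feedback, which is what produces the gentle motion profile the abstract emphasizes.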
cs.RO / 8 / 2602.06991
LangGS-SLAM: Real-Time Language-Feature Gaussian Splatting SLAM
Abstract
In this paper, we propose an RGB-D SLAM system that reconstructs a language-aligned dense feature field while sustaining low-latency tracking and mapping. First, we introduce a Top-K Rendering pipeline, a high-throughput and semantic-distortion-free method for efficiently rendering high-dimensional feature maps. To address the resulting semantic-geometric discrepancy and mitigate memory consumption, we further design a multi-criteria map management strategy that prunes redundant or inconsistent Gaussians while preserving scene integrity. Finally, a hybrid field optimization framework jointly refines the geometric and semantic fields under real-time constraints by decoupling their optimization frequencies according to field characteristics. The proposed system achieves superior geometric fidelity compared to geometric-only baselines and comparable semantic fidelity to offline approaches while operating at 15 FPS. Our results demonstrate that online SLAM with dense, uncompressed language-aligned feature fields is both feasible and effective, bridging the gap between 3D perception and language-based reasoning.
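A naive per-pixel sketch conveys the Top-K compositing idea: keep only the K heaviest Gaussian contributors per pixel so rendering cost scales with K rather than with the number of Gaussians. Note the paper describes its pipeline as semantic-distortion-free, whereas this simplified renormalization only approximates the full blend; all names and weights are illustrative:

```python
import numpy as np

def topk_feature_render(alphas, feats, k=3):
    """Per-pixel Top-K compositing of high-dimensional Gaussian features.

    alphas: (G,) blending weights of the Gaussians covering one pixel (already
    including transmittance); feats: (G, D) per-Gaussian feature vectors. Only
    the k heaviest contributors are composited, with weights renormalized.
    """
    idx = np.argsort(alphas)[-k:]
    w = alphas[idx]
    w = w / w.sum()
    return w @ feats[idx]

# 5 Gaussians with 8-D features; the two near-zero-weight ones barely matter
alphas = np.array([0.40, 0.30, 0.20, 0.001, 0.002])
feats = np.arange(40, dtype=float).reshape(5, 8)
full = (alphas / alphas.sum()) @ feats   # full normalized blend, for comparison
topk = topk_feature_render(alphas, feats, k=3)
```

Because the feature dimension D only appears in the final weighted sum over k rows, high-dimensional language features stay cheap to render.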
cs.RO / 9 / 2602.06995
When Simultaneous Localization and Mapping Meets Wireless Communications: A Survey
Abstract
The availability of commercial wireless communication and sensing equipment, combined with the advancements in intelligent autonomous systems, paves the way towards robust joint communications and simultaneous localization and mapping (SLAM). This paper surveys the state-of-the-art at the nexus of SLAM and wireless communications, examining the bidirectional impact of each, with a focus on visual SLAM (V-SLAM) integration. We provide an overview of key concepts related to wireless signal propagation, geometric channel modeling, and radio frequency (RF)-based localization and sensing. In addition, we show image processing techniques that can detect landmarks, proactively predicting optimal paths for wireless channels. Several dimensions are considered, including the prerequisites, techniques, background, and future directions and challenges of the intersection between SLAM and wireless communications. We analyze mathematical approaches such as probabilistic models and spatial methods for signal processing, as well as key technological aspects. We highlight techniques that enable effective retrieval of the autonomous robot's state. Among other interesting findings, we observe that monocular V-SLAM would benefit from RF-relevant information, as the latter can serve as a proxy for scale ambiguity resolution. Conversely, we find that wireless communications in the context of 5G and beyond can potentially benefit from visual odometry, which is central to SLAM. Moreover, we examine other sources besides the camera for SLAM and describe the twofold relation with wireless communications. Finally, integrated solutions performing joint communications and SLAM are still in their infancy: theoretical and practical advancements are required to add higher-level localization and semantic perception capabilities to RF and multi-antenna technologies.
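The scale-ambiguity observation can be made concrete with a small sketch: given up-to-scale VO distances and metric RF ranges over the same pose pairs, a least-squares fit recovers the metric scale. This is a hedged toy example under invented noise assumptions, not a method taken from the survey:

```python
import numpy as np

def resolve_scale(vo_dists, rf_dists):
    """Least-squares metric scale for a monocular VO trajectory.

    vo_dists: up-to-scale distances between pose pairs from monocular V-SLAM;
    rf_dists: metric distances between the same pairs from RF ranging
    (e.g. round-trip time). Minimizes sum (rf - s * vo)^2 over the scale s,
    whose closed-form solution is (vo . rf) / (vo . vo).
    """
    vo = np.asarray(vo_dists)
    rf = np.asarray(rf_dists)
    return float(vo @ rf / (vo @ vo))

# toy example: true scale 2.5, RF ranges corrupted by small Gaussian noise
rng = np.random.default_rng(1)
vo = rng.uniform(1.0, 4.0, size=20)
rf = 2.5 * vo + 0.05 * rng.normal(size=20)
s = resolve_scale(vo, rf)
```

With the scale fixed by RF, every subsequent up-to-scale VO estimate becomes metric, which is the "proxy for scale ambiguity resolution" role described above.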
cs.RO / 10 / 2602.07005
Admittance-Based Motion Planning with Vision-Guided Initialization for Robotic Manipulators in Self-Driving Laboratories
Abstract
Self-driving laboratories (SDLs) are highly automated research environments that leverage advanced technologies to conduct experiments and analyze data with minimal human involvement. These environments often involve delicate laboratory equipment, unpredictable environmental interactions, and occasional human intervention, making compliant and force-aware control essential for ensuring safety, adaptability, and reliability. This paper introduces a motion-planning framework centered on admittance control to enable adaptive and compliant robotic manipulation. Unlike conventional schemes, the proposed approach integrates an admittance controller directly into trajectory execution, allowing the manipulator to dynamically respond to external forces during interaction. This capability enables human operators to override or redirect the robot's motion in real time. A vision algorithm based on structured planar pose estimation is employed to detect and localize textured planar objects through feature extraction, homography estimation, and depth fusion, thereby providing an initial target configuration for motion planning. The vision-based initialization establishes the reference trajectory, while the embedded admittance controller ensures that trajectory execution remains safe, adaptive, and responsive to external forces or human intervention. The proposed strategy is validated using textured image detection as a proof of concept. Future work will extend the framework to SDL environments involving transparent laboratory objects, where compliant motion planning can further enhance autonomy, safety, and human-robot collaboration.
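The admittance behavior described here, commanded motion that yields to external force, is commonly written as M*ddx + D*dx + K*(x - x_ref) = f_ext. A minimal 1-D discrete sketch, with all gains chosen as illustrative assumptions:

```python
def admittance_step(x, dx, x_ref, f_ext, dt, M=2.0, D=20.0, K=50.0):
    """Discrete admittance update: M*ddx + D*dx + K*(x - x_ref) = f_ext.

    The commanded position x deviates from the planned reference x_ref in
    proportion to the external force, so a human push redirects the motion
    instead of being rejected as a disturbance.
    """
    ddx = (f_ext - D * dx - K * (x - x_ref)) / M
    dx = dx + ddx * dt
    x = x + dx * dt
    return x, dx

# with no external force the commanded motion settles on the reference ...
x, dx = 0.0, 0.0
for _ in range(5000):
    x, dx = admittance_step(x, dx, x_ref=0.3, f_ext=0.0, dt=1e-3)
x_free = x

# ... while a sustained 5 N push displaces it by f_ext / K = 0.1 m
x, dx = 0.0, 0.0
for _ in range(5000):
    x, dx = admittance_step(x, dx, x_ref=0.3, f_ext=5.0, dt=1e-3)
x_pushed = x
```

Tuning M, D, and K trades off how stiffly the robot tracks its reference against how readily it yields to a human operator.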
cs.RO / 11 / 2602.07007
ARGOS: Automated Functional Safety Requirement Synthesis for Embodied AI via Attribute-Guided Combinatorial Reasoning
Abstract
Ensuring functional safety is essential for the deployment of Embodied AI in complex open-world environments. However, traditional Hazard Analysis and Risk Assessment (HARA) methods struggle to scale in this domain. While HARA relies on enumerating risks for finite and pre-defined function lists, Embodied AI operates on open-ended natural language instructions, creating a challenge of combinatorial interaction risks. Whereas Large Language Models (LLMs) have emerged as a promising solution to this scalability challenge, they often lack physical grounding, yielding semantically superficial and incoherent hazard descriptions. To overcome these limitations, we propose a new framework ARGOS (AttRibute-Guided cOmbinatorial reaSoning), which bridges the gap between open-ended user instructions and concrete physical attributes. By dynamically decomposing entities from instructions into these fine-grained properties, ARGOS grounds LLM reasoning in causal risk factors to generate physically plausible hazard scenarios. It then instantiates abstract safety standards, such as ISO 13482, into context-specific Functional Safety Requirements (FSRs) by integrating these scenarios with robot capabilities. Extensive experiments validate that ARGOS produces high-quality FSRs and outperforms baselines in identifying long-tail risks. Overall, this work paves the way for systematic and grounded functional safety requirement generation, a critical step toward the safe industrial deployment of Embodied AI.
cs.RO / 12 / 2602.07024
A Distributed Multi-Modal Sensing Approach for Human Activity Recognition in Real-Time Human-Robot Collaboration
Abstract
Human activity recognition (HAR) is fundamental in human-robot collaboration (HRC), enabling robots to respond to and dynamically adapt to human intentions. This paper introduces a HAR system combining a modular data glove equipped with Inertial Measurement Units and a vision-based tactile sensor to capture hand activities in contact with a robot. We tested our activity recognition approach under different conditions, including offline classification of segmented sequences, real-time classification under static conditions, and a realistic HRC scenario. The experimental results show a high accuracy for all the tasks, suggesting that multiple collaborative settings could benefit from this multi-modal approach.
cs.RO / 13 / 2602.07074
Airspace-aware Contingency Landing Planning
Abstract
This paper develops a real-time, search-based aircraft contingency landing planner that minimizes traffic disruptions while accounting for ground risk. The airspace model captures dense air traffic departure and arrival flows, helicopter corridors, and prohibited zones and is demonstrated with a Washington, D.C., area case study. Historical Automatic Dependent Surveillance-Broadcast (ADS-B) data are processed to estimate air traffic density. A low-latency computational geometry algorithm generates proximity-based heatmaps around high-risk corridors and restricted regions. Airspace risk is quantified as the cumulative exposure time of a landing trajectory within congested regions, while ground risk is assessed from overflown population density to jointly guide trajectory selection. A landing site selection module further mitigates disruption to nominal air traffic operations. Benchmarking against minimum-risk Dubins solutions demonstrates that the proposed planner achieves lower joint risk and reduced airspace disruption while maintaining real-time performance. Under airspace-risk-only conditions, the planner generates trajectories within an average of 2.9 seconds on a laptop computer. Future work will incorporate dynamic air traffic updates to enable spatiotemporal contingency landing planning that minimizes the need for real-time traffic rerouting.
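The airspace-risk metric, cumulative exposure time of a landing trajectory within congested regions, can be sketched over a density grid. The grid layout, threshold, and function names below are assumptions for illustration only:

```python
import numpy as np

def exposure_time(traj_xy, dt, density_map, cell, threshold=0.5):
    """Cumulative time a sampled trajectory spends inside congested cells.

    traj_xy: (N, 2) positions sampled every dt seconds; density_map: 2-D grid of
    normalized traffic density (e.g. built from historical ADS-B counts); cell:
    cell edge length in the same units as traj_xy. Here the first coordinate is
    mapped to the grid row, a simplification for the sketch.
    """
    ij = np.floor(traj_xy / cell).astype(int)
    h, w = density_map.shape
    inside = (ij[:, 0] >= 0) & (ij[:, 0] < h) & (ij[:, 1] >= 0) & (ij[:, 1] < w)
    congested = np.zeros(len(traj_xy), dtype=bool)
    congested[inside] = density_map[ij[inside, 0], ij[inside, 1]] > threshold
    return congested.sum() * dt

# toy 4x4 grid: the right half is congested; a straight crossing, 1 sample/s
density = np.zeros((4, 4))
density[:, 2:] = 0.9
traj = np.column_stack([np.full(8, 1.5), np.linspace(0.5, 3.5, 8)])
t_exposed = exposure_time(traj, dt=1.0, density_map=density, cell=1.0)
```

A planner can then score candidate trajectories by this exposure time jointly with overflown population density, as the abstract describes.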
cs.RO / 14 / 2602.07158
A compliant ankle-actuated compass walker with triggering timing control
Abstract
Passive dynamic walkers are widely adopted as a mathematical model to represent biped walking. The stable locomotion of these models is limited to tilted surfaces, requiring gravitational energy. Various techniques, such as actuation through the ankle and hip joints, have been proposed to extend the applicability of these models to level ground and rough terrain with improved locomotion efficiency. However, most of these techniques rely on impulsive energy injection schemes and torsional springs, which are quite challenging to implement in a physical platform. Here, a new model is proposed, named triggering-controlled ankle-actuated compass gait (TC-AACG), which allows non-instantaneous compliant ankle push-off. The proposed technique can be implemented in physical platforms via series elastic actuators (SEAs). Our systematic examination shows that the proposed approach extends the locomotion capabilities of a biped model compared to the impulsive ankle push-off approach. We provide extensive simulation analysis investigating the locomotion speed, mechanical cost of transport, and basin of attraction of the proposed model.
cs.RO / 15 / 2602.07209
Continuum Robot Localization using Distributed Time-of-Flight Sensors
Abstract
Localization and mapping of an environment are crucial tasks for any robot operating in unstructured environments. Time-of-flight (ToF) sensors (e.g., lidar) have proven useful in mobile robotics, where high-resolution sensors can be used for simultaneous localization and mapping. In soft and continuum robotics, however, these high-resolution sensors are too large for practical use. This, combined with the deformable nature of such robots, has resulted in continuum robot (CR) localization and mapping in unstructured environments being a largely untouched area. In this work, we present a localization technique for CRs that relies on small, low-resolution ToF sensors distributed along the length of the robot. By fusing measurement information with a robot shape prior, we show that accurate localization is possible despite each sensor experiencing frequent degenerate scenarios. We achieve an average localization error of 2.5 cm in position and 7.2° in rotation across all experimental conditions with a 53 cm long robot. We demonstrate that the results are repeated across multiple environments, in both simulation and real-world experiments, and study robustness in the estimation to deviations in the prior map.
cs.RO / 16 / 2602.07243
Realistic Synthetic Household Data Generation at Scale
Abstract
Advancements in foundation models have catalyzed research in Embodied AI to develop interactive agents capable of environmental reasoning and interaction. Developing such agents requires diverse, large-scale datasets. Prior frameworks generate synthetic data for long-term human-robot interactions but fail to model the bidirectional influence between human behavior and household environments. Our proposed generative framework creates household datasets at scale through loosely coupled generation of long-term human-robot interactions and environments. Human personas influence environment generation, while environment schematics and semantics shape human-robot interactions. The generated 3D data includes rich static context, such as object and environment semantics, and temporal context capturing human and agent behaviors over extended periods. Our flexible tool lets users define dataset characteristics and configure environment and human-activity data via natural language prompts. The tool creates variations of user-defined configurations, enabling scalable data generation. We validate our framework through statistical evaluation using multi-modal embeddings and key metrics: cosine similarity, mutual information gain, intervention analysis, and iterative improvement validation. Statistical comparisons show good alignment with real-world datasets (HOMER) with cosine similarity (0.60), while synthetic datasets (Wang et al.) show moderate alignment (0.27). Intervention analysis across age, organization, and sleep pattern changes shows statistically significant effects (p < 0.001) with large effect sizes (Cohen's d = 0.51-1.12), confirming that bidirectional coupling translates persona traits into measurable environmental and behavioral differences. These contributions enable development and testing of household smart devices at scale.
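As a hedged sketch of a cosine-similarity alignment check (not the paper's actual multi-modal embedding pipeline), one can compare the centroid directions of two embedding sets; the data and score function here are invented for illustration:

```python
import numpy as np

def dataset_alignment(emb_a, emb_b):
    """Cosine similarity between the centroids of two embedding sets.

    A crude dataset-level alignment score: each row is one per-sequence
    embedding, and the score compares the average direction of the corpora.
    """
    ca, cb = emb_a.mean(axis=0), emb_b.mean(axis=0)
    return float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))

rng = np.random.default_rng(0)
ref = rng.normal(size=(50, 16))                   # stand-in reference embeddings
aligned = ref + 0.1 * rng.normal(size=(50, 16))   # near-copies: high alignment
unrelated = rng.normal(size=(50, 16))             # independent set: low alignment

s_aligned = dataset_alignment(ref, aligned)
s_unrelated = dataset_alignment(ref, unrelated)
```

A generated corpus that tracks a real one scores near 1, while an unrelated corpus scores near 0, mirroring the 0.60 vs. 0.27 contrast reported above.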
cs.RO / 17 / 2602.07264
aerial-autonomy-stack -- a Faster-than-real-time, Autopilot-agnostic, ROS2 Framework to Simulate and Deploy Perception-based Drones
aerial-autonomy-stack -- 一种超实时、与自动驾驶仪无关的 ROS2 框架,用于模拟和部署基于感知的无人机
Abstract
Unmanned aerial vehicles are rapidly transforming multiple applications, from agricultural and infrastructure monitoring to logistics and defense. Introducing greater autonomy to these systems can make them both more effective and more reliable. Thus, the ability to rapidly engineer and deploy autonomous aerial systems has become of strategic importance. In the 2010s, a combination of high-performance compute, data, and open-source software led to the current deep learning and AI boom, unlocking decades of prior theoretical work. Robotics is on the cusp of a similar transformation. However, physical AI faces unique hurdles, often combined under the umbrella term "simulation-to-reality gap". These span from modeling shortcomings to the complexity of vertically integrating the highly heterogeneous hardware and software systems typically found in field robots. To address the latter, we introduce aerial-autonomy-stack, an open-source, end-to-end framework designed to streamline the pipeline from (GPU-accelerated) perception to (flight controller-based) action. Our stack allows the development of aerial autonomy using ROS2 and provides a common interface for two of the most popular autopilots: PX4 and ArduPilot. We show that it supports over 20x faster-than-real-time, end-to-end simulation of a complete development and deployment stack -- including edge compute and networking -- significantly compressing the build-test-release cycle of perception-based autonomy.
Chinese Translation
无人驾驶航空器正在迅速改变多种应用,从农业和基础设施监测到物流和国防。为这些系统引入更大的自主性可以同时提高其有效性和可靠性。因此,快速设计和部署自主航空系统的能力已具有战略重要性。在2010年代,高性能计算、数据和开源软件的结合导致了当前深度学习和人工智能的繁荣,解锁了此前数十年积累的理论成果。机器人技术正处于类似转型的边缘。然而,物理人工智能面临独特的挑战,通常被统称为“模拟与现实之间的差距”。这些挑战包括建模不足以及在现场机器人中通常存在的高度异构硬件和软件系统的垂直集成复杂性。为了解决后者,我们推出了 aerial-autonomy-stack,这是一个开源的端到端框架,旨在简化从(GPU加速的)感知到(基于飞行控制器的)行动的流程。我们的框架支持使用 ROS2 开发航空自主性,并为两种最流行的自动驾驶仪提供了通用接口:PX4 和 ArduPilot。我们展示了它支持超过 20 倍超实时的端到端完整开发和部署堆栈的仿真,包括边缘计算和网络,从而显著压缩了基于感知的自主性的构建-测试-发布周期。
cs.RO / 18 / 2602.07322
Action-to-Action Flow Matching
动作间流匹配
Abstract
Diffusion-based policies have recently achieved remarkable success in robotics by formulating action prediction as a conditional denoising process. However, the standard practice of sampling from random Gaussian noise often requires multiple iterative steps to produce clean actions, leading to high inference latency that incurs a major bottleneck for real-time control. In this paper, we challenge the necessity of uninformed noise sampling and propose Action-to-Action flow matching (A2A), a novel policy paradigm that shifts from random sampling to initialization informed by the previous action. Unlike existing methods that treat proprioceptive action feedback as static conditions, A2A leverages historical proprioceptive sequences, embedding them into a high-dimensional latent space as the starting point for action generation. This design bypasses costly iterative denoising while effectively capturing the robot's physical dynamics and temporal continuity. Extensive experiments demonstrate that A2A exhibits high training efficiency, fast inference speed, and improved generalization. Notably, A2A enables high-quality action generation in as few as a single inference step (0.56 ms latency), and exhibits superior robustness to visual perturbations and enhanced generalization to unseen configurations. Lastly, we also extend A2A to video generation, demonstrating its broader versatility in temporal modeling. Project site: https://lorenzo-0-0.github.io/A2A_Flow_Matching.
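The core idea, replacing Gaussian-noise initialization with the previous action as the flow's starting point, can be sketched in a few lines. This is a toy illustration under simplified assumptions: a hand-written constant velocity field stands in for the learned one, and `generate_action` and its arguments are hypothetical names, not the paper's API:

```python
def flow_step(x, velocity, dt):
    """One Euler integration step along a velocity field."""
    return [xi + vi * dt for xi, vi in zip(x, velocity)]

def generate_action(prev_action, velocity_fn, n_steps=1):
    """Integrate from the previous action (not Gaussian noise) to a new action.

    With straight-line flows, even a single step can reach the target,
    which mirrors the paper's claim of one-step inference.
    """
    x = list(prev_action)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        x = flow_step(x, velocity_fn(x, t), dt)
    return x

# Toy "learned" field: constant velocity pointing from start to a target action.
target = [0.5, -0.2]
prev = [0.4, -0.1]
v_fn = lambda x, t: [target[i] - prev[i] for i in range(len(prev))]
action = generate_action(prev, v_fn, n_steps=1)
```

Because the initialization already encodes the robot's recent state, the integration path is short, which is the intuition behind the reported sub-millisecond latency.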
Chinese Translation
基于扩散的策略最近在机器人技术中取得了显著成功,通过将动作预测表述为条件去噪过程。然而,从随机高斯噪声中采样的标准做法通常需要多个迭代步骤才能生成干净的动作,导致高推理延迟,这成为实时控制的主要瓶颈。在本文中,我们质疑无信息噪声采样的必要性,并提出了动作间流匹配(Action-to-Action flow matching, A2A),这是一种新的策略范式,转变为通过先前动作进行初始化,而非随机采样。与将本体感知动作反馈视为静态条件的现有方法不同,A2A利用历史本体感知序列,将其嵌入高维潜在空间,作为动作生成的起点。这一设计避免了昂贵的迭代去噪,同时有效捕捉机器人的物理动态和时间连续性。大量实验表明,A2A展现出高训练效率、快速推理速度和改善的泛化能力。值得注意的是,A2A能够在仅一次推理步骤中生成高质量的动作(延迟为0.56毫秒),并对视觉扰动表现出更强的鲁棒性,同时在未见配置上具有更好的泛化能力。最后,我们还将A2A扩展到视频生成,展示其在时间建模中的更广泛适用性。项目网站:https://lorenzo-0-0.github.io/A2A_Flow_Matching。
cs.RO / 19 / 2602.07326
Why Look at It at All?: Vision-Free Multifingered Blind Grasping Using Uniaxial Fingertip Force Sensing
为什么要看它呢?:基于单轴指尖力传感的无视觉多指盲抓取
Abstract
Grasping under limited sensing remains a fundamental challenge for real-world robotic manipulation, as vision and high-resolution tactile sensors often introduce cost, fragility, and integration complexity. This work demonstrates that reliable multifingered grasping can be achieved under extremely minimal sensing by relying solely on uniaxial fingertip force feedback and joint proprioception, without vision or multi-axis/tactile sensing. To enable such blind grasping, we employ an efficient teacher-student training pipeline in which a reinforcement-learned teacher exploits privileged simulation-only observations to generate demonstrations for distilling a transformer-based student policy operating under partial observation. The student policy is trained to act using only sensing modalities available at real-world deployment. We validate the proposed approach on real hardware across 18 objects, including both in-distribution and out-of-distribution cases, achieving a 98.3% overall grasp success rate. These results demonstrate strong robustness and generalization beyond the simulation training distribution, while significantly reducing sensing requirements for real-world grasping systems.
Chinese Translation
在有限感知条件下进行抓取仍然是现实世界机器人操作的一项基本挑战,因为视觉和高分辨率触觉传感器通常会带来成本、脆弱性和集成复杂性。本研究表明,仅依靠单轴指尖力反馈和关节本体感知,而不使用视觉或多轴/触觉传感,可以在极其有限的感知条件下实现可靠的多指抓取。为了实现这种盲抓取,我们采用了一种高效的师生训练流程,其中强化学习的教师利用特权的仅限于模拟的观察生成演示,以提炼出在部分观察下操作的基于变压器的学生策略。学生策略的训练仅使用在现实部署中可用的感知方式。我们在真实硬件上对所提出的方法进行了验证,涵盖了18个物体,包括分布内和分布外的案例,整体抓取成功率达到了98.3%。这些结果展示了强大的鲁棒性和超越模拟训练分布的泛化能力,同时显著降低了现实世界抓取系统的感知需求。
cs.RO / 20 / 2602.07363
UEREBot: Learning Safe Quadrupedal Locomotion under Unstructured Environments and High-Speed Dynamic Obstacles
UEREBot:在非结构化环境和高速动态障碍物下学习安全的四足运动
Abstract
Quadruped robots are increasingly deployed in unstructured environments. Safe locomotion in these settings requires long-horizon goal progress, passability over uneven terrain and static constraints, and collision avoidance against high-speed dynamic obstacles. A single system cannot fully satisfy all three objectives simultaneously: planning-based decisions can be too slow, while purely reactive decisions can sacrifice goal progress and passability. To resolve this conflict, we propose UEREBot (Unstructured-Environment Reflexive Evasion Robot), a hierarchical framework that separates slow planning from instantaneous reflexive evasion and coordinates them during execution. UEREBot formulates the task as a constrained optimal control problem blueprint. It adopts a spatial--temporal planner that provides reference guidance toward the goal and threat signals. It then uses a threat-aware handoff to fuse navigation and reflex actions into a nominal command, and a control barrier function shield as a final execution safeguard. We evaluate UEREBot in Isaac Lab simulation and deploy it on a Unitree Go2 quadruped equipped with onboard perception. Across diverse environments with complex static structure and high-speed dynamic obstacles, UEREBot achieves higher avoidance success and more stable locomotion while maintaining goal progress than representative baselines, demonstrating improved safety--progress trade-offs.
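The control barrier function (CBF) shield used as the final execution safeguard can be illustrated with a minimal one-dimensional sketch, assuming a single forward-velocity command and a distance-to-obstacle measurement. The function name, gains, and discrete safety condition here are illustrative simplifications, not the paper's formulation:

```python
def cbf_shield(u_nominal, distance, d_min=0.5, alpha=2.0):
    """Clamp a nominal forward velocity so h = distance - d_min stays nonnegative.

    Sketch of the CBF condition h_dot + alpha * h >= 0: with h_dot = -u,
    the largest forward speed still certified safe is u <= alpha * h.
    """
    h = distance - d_min
    u_max = alpha * h  # safety-certified speed bound; <= 0 once inside d_min
    return min(u_nominal, max(u_max, 0.0))

# Far from the obstacle the nominal command passes through; near it, it is clamped.
u_far = cbf_shield(u_nominal=1.0, distance=3.0)
u_near = cbf_shield(u_nominal=1.0, distance=0.6)
```

The shield only intervenes when the nominal command (from the planner or the reflex policy) would violate the barrier, which is why it can sit downstream of the threat-aware handoff without altering goal progress in safe states.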
Chinese Translation
四足机器人越来越多地被部署在非结构化环境中。在这些环境中,安全的运动需要长时间目标进展、在不平坦地形和静态约束下的可通行性,以及对高速动态障碍物的避碰能力。单一系统无法同时完全满足这三项目标:基于规划的决策可能过于缓慢,而纯粹的反应式决策则可能牺牲目标进展和可通行性。为了解决这一矛盾,我们提出了UEREBot(非结构化环境反射规避机器人),这是一个分层框架,将缓慢的规划与瞬时的反射规避分开,并在执行过程中进行协调。UEREBot将任务表述为一个受约束的最优控制问题蓝图。它采用空间-时间规划器,提供朝向目标和威胁信号的参考指导。然后,它使用一种威胁感知的交接方法,将导航和反射动作融合为一个名义指令,并使用控制屏障函数作为最终执行的保护措施。我们在Isaac Lab模拟中评估了UEREBot,并将其部署在配备有板载感知的Unitree Go2四足机器人上。在具有复杂静态结构和高速动态障碍物的多样环境中,UEREBot实现了更高的避碰成功率和更稳定的运动,同时保持目标进展,展示了安全性与进展之间的改进权衡。
cs.RO / 21 / 2602.07388
Trace-Focused Diffusion Policy for Multi-Modal Action Disambiguation in Long-Horizon Robotic Manipulation
面向轨迹的扩散策略在长时间跨度机器人操作中的多模态动作消歧义
Abstract
Generative model-based policies have shown strong performance in imitation-based robotic manipulation by learning action distributions from demonstrations. However, in long-horizon tasks, visually similar observations often recur across execution stages while requiring distinct actions, which leads to ambiguous predictions when policies are conditioned only on instantaneous observations, termed multi-modal action ambiguity (MA2). To address this challenge, we propose the Trace-Focused Diffusion Policy (TF-DP), a simple yet effective diffusion-based framework that explicitly conditions action generation on the robot's execution history. TF-DP represents historical motion as an explicit execution trace and projects it into the visual observation space, providing stage-aware context when current observations alone are insufficient. In addition, the induced trace-focused field emphasizes task-relevant regions associated with historical motion, improving robustness to background visual disturbances. We evaluate TF-DP on real-world robotic manipulation tasks exhibiting pronounced multi-modal action ambiguity and visually cluttered conditions. Experimental results show that TF-DP improves temporal consistency and robustness, outperforming the vanilla diffusion policy by 80.56 percent on tasks with multi-modal action ambiguity and by 86.11 percent under visual disturbances, while maintaining inference efficiency with only a 6.4 percent runtime increase. These results demonstrate that execution-trace conditioning offers a scalable and principled approach for robust long-horizon robotic manipulation within a single policy.
Chinese Translation
基于生成模型的策略在模仿学习的机器人操作中表现出色,通过从示范中学习动作分布。然而,在长时间跨度的任务中,视觉上相似的观测往往在执行阶段中反复出现,同时需要不同的动作,这导致当策略仅基于瞬时观测进行条件化时出现模糊预测,这种现象被称为多模态动作模糊(MA2)。为了解决这一挑战,我们提出了面向轨迹的扩散策略(TF-DP),这是一种简单而有效的基于扩散的框架,明确地将动作生成条件化于机器人的执行历史。TF-DP将历史运动表示为明确的执行轨迹,并将其投影到视觉观测空间中,在当前观测不足时提供阶段感知的上下文。此外,诱导的轨迹聚焦场强调与历史运动相关的任务相关区域,提高了对背景视觉干扰的鲁棒性。我们在具有明显多模态动作模糊和视觉杂乱条件的真实世界机器人操作任务上评估了TF-DP。实验结果表明,TF-DP提高了时间一致性和鲁棒性,在具有多模态动作模糊的任务中比传统扩散策略提高了80.56%,在视觉干扰下提高了86.11%,同时仅增加了6.4%的运行时间,保持了推理效率。这些结果表明,执行轨迹条件化为单一策略中的鲁棒长时间跨度机器人操作提供了一种可扩展且原则性的解决方案。
cs.RO / 22 / 2602.07413
Going with the Flow: Koopman Behavioral Models as Implicit Planners for Visuo-Motor Dexterity
顺应潮流:作为隐式规划者的库普曼行为模型在视觉-运动灵巧性中的应用
Abstract
There has been rapid and dramatic progress in robots' ability to learn complex visuo-motor manipulation skills from demonstrations, thanks in part to expressive policy classes that employ diffusion- and transformer-based backbones. However, these design choices require significant data and computational resources and remain far from reliable, particularly within the context of multi-fingered dexterous manipulation. Fundamentally, they model skills as reactive mappings and rely on fixed-horizon action chunking to mitigate jitter, creating a rigid trade-off between temporal coherence and reactivity. In this work, we introduce Unified Behavioral Models (UBMs), a framework that learns to represent dexterous skills as coupled dynamical systems that capture how visual features of the environment (visual flow) and proprioceptive states of the robot (action flow) co-evolve. By capturing such behavioral dynamics, UBMs can ensure temporal coherence by construction rather than by heuristic averaging. To operationalize these models, we propose Koopman-UBM, a first instantiation of UBMs that leverages Koopman Operator theory to effectively learn a unified representation in which the joint flow of latent visual and proprioceptive features is governed by a structured linear system. We demonstrate that Koopman-UBM can be viewed as an implicit planner: given an initial condition, it analytically computes the desired robot behavior while simultaneously "imagining" the resulting flow of visual features over the entire skill horizon. To enable reactivity and adaptation, we introduce an online replanning strategy in which the model acts as its own runtime monitor that automatically triggers replanning when predicted and observed visual flow diverge beyond a threshold. Across seven simulated tasks and two real-world tasks, we demonstrate that K-UBM matches or exceeds the performance of state-of-the-art baselines, while offering considerably faster inference, smooth execution, robustness to occlusions, and flexible replanning.
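Two ingredients of Koopman-UBM lend themselves to a compact sketch: rolling a structured linear latent system forward to "imagine" the flow, and triggering replanning when prediction and observation diverge. The toy matrix, names, and threshold below are illustrative assumptions, not the paper's learned model:

```python
def koopman_rollout(A, z0, horizon):
    """Roll a linear latent model z_{t+1} = A z_t forward: the 'imagined' flow."""
    z = list(z0)
    traj = [z]
    for _ in range(horizon):
        z = [sum(A[i][j] * z[j] for j in range(len(z))) for i in range(len(z))]
        traj.append(z)
    return traj

def needs_replan(predicted, observed, threshold=0.1):
    """Runtime monitor: trigger replanning when imagined and observed flow diverge."""
    err = max(abs(p - o) for p, o in zip(predicted, observed))
    return err > threshold

A = [[0.9, 0.0], [0.0, 0.9]]  # toy stable latent dynamics
traj = koopman_rollout(A, [1.0, 2.0], horizon=2)
```

Because the latent dynamics are linear, the whole trajectory is available in closed form from the initial condition, which is what makes the model an implicit planner rather than a step-by-step sampler.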
Chinese Translation
近年来,机器人在从示范中学习复杂的视觉-运动操作技能方面取得了快速而显著的进展,这在一定程度上得益于采用基于扩散和 Transformer 骨干网络的表达性策略类。然而,这些设计选择需要大量的数据和计算资源,并且在多指灵巧操作的背景下仍然远未可靠。从根本上讲,它们将技能建模为反应映射,并依赖固定时间范围的动作分块来减轻抖动,从而在时间一致性和反应性之间形成了刚性的权衡。在本研究中,我们引入了统一行为模型(Unified Behavioral Models, UBM),这一框架学习将灵巧技能表示为耦合动力系统,捕捉环境的视觉特征(视觉流)与机器人的本体感知状态(动作流)如何共同演化。通过捕捉这种行为动态,UBM在构造上即可确保时间一致性,而不是依赖启发式平均。为了实现这些模型,我们提出了库普曼-UBM(Koopman-UBM),这是UBM的首次实例化,利用库普曼算子理论有效学习一个统一表示,其中潜在的视觉和本体感知特征的联合流由一个结构化线性系统控制。我们展示了库普曼-UBM可以被视为一个隐式规划者:给定初始条件,它可以解析计算所需的机器人行为,同时“想象”整个技能范围内视觉特征的结果流。为了实现反应性和适应性,我们引入了一种在线重新规划策略,其中模型充当其自身的运行时监控器,当预测和观察到的视觉流偏离超过阈值时自动触发重新规划。在七个模拟任务和两个真实世界任务中,我们展示了K-UBM的性能与最先进的基线相匹配或超越,同时提供了显著更快的推理、平滑的执行、对遮挡的鲁棒性和灵活的重新规划。
cs.RO / 23 / 2602.07434
Bridging Speech, Emotion, and Motion: a VLM-based Multimodal Edge-deployable Framework for Humanoid Robots
桥接语音、情感与动作:基于视觉语言模型的可在边缘部署的人形机器人多模态框架
Abstract
Effective human-robot interaction requires emotionally rich multimodal expressions, yet most humanoid robots lack coordinated speech, facial expressions, and gestures. Meanwhile, real-world deployment demands on-device solutions that can operate autonomously without continuous cloud connectivity. To bridge Speech, Emotion, and Motion, we present SeM$^2$, a Vision Language Model-based framework that orchestrates emotionally coherent multimodal interactions through three key components: a multimodal perception module capturing user contextual cues, a Chain-of-Thought reasoning module for response planning, and a novel Semantic-Sequence Aligning Mechanism (SSAM) that ensures precise temporal coordination between verbal content and physical expressions. We implement both cloud-based and edge-deployed versions (SeM$^2_e$), with the latter knowledge-distilled to operate efficiently on edge hardware while maintaining 95% of the relative performance. Comprehensive evaluations demonstrate that our approach significantly outperforms unimodal baselines in naturalness, emotional clarity, and modal coherence, advancing socially expressive humanoid robotics for diverse real-world environments.
Chinese Translation
有效的人机交互需要情感丰富的多模态表达,然而大多数人形机器人缺乏协调的语音、面部表情和手势。同时,现实世界的部署要求能够在没有持续云连接的情况下自主运行的设备解决方案。为了桥接语音(Speech)、情感(Emotion)与动作(Motion),我们提出了SeM$^2$,这是一个基于视觉语言模型的框架,通过三个关键组件协调情感一致的多模态交互:一个捕捉用户上下文线索的多模态感知模块、用于响应规划的思维链推理模块,以及一种新颖的语义序列对齐机制(Semantic-Sequence Aligning Mechanism,SSAM),确保语言内容与身体表达之间的精确时间协调。我们实现了基于云的和边缘部署的版本(SeM$^2_e$),后者经过知识蒸馏以在边缘硬件上高效运行,同时保持95%的相对性能。全面评估表明,我们的方法在自然性、情感清晰度和模态一致性方面显著优于单模态基线,推动了适用于多样化现实环境的社会表达性人形机器人技术的发展。
cs.RO / 24 / 2602.07439
TextOp: Real-time Interactive Text-Driven Humanoid Robot Motion Generation and Control
TextOp:实时交互式文本驱动的人形机器人运动生成与控制
Abstract
Recent advances in humanoid whole-body motion tracking have enabled the execution of diverse and highly coordinated motions on real hardware. However, existing controllers are commonly driven either by predefined motion trajectories, which offer limited flexibility when user intent changes, or by continuous human teleoperation, which requires constant human involvement and limits autonomy. This work addresses the problem of how to drive a universal humanoid controller in a real-time and interactive manner. We present TextOp, a real-time text-driven humanoid motion generation and control framework that supports streaming language commands and on-the-fly instruction modification during execution. TextOp adopts a two-level architecture in which a high-level autoregressive motion diffusion model continuously generates short-horizon kinematic trajectories conditioned on the current text input, while a low-level motion tracking policy executes these trajectories on a physical humanoid robot. By bridging interactive motion generation with robust whole-body control, TextOp unlocks free-form intent expression and enables smooth transitions across multiple challenging behaviors such as dancing and jumping, within a single continuous motion execution. Extensive real-robot experiments and offline evaluations demonstrate instant responsiveness, smooth whole-body motion, and precise control. The project page and the open-source code are available at https://text-op.github.io/
Chinese Translation
近年来,人形全身运动跟踪的进展使得在真实硬件上执行多样化且高度协调的动作成为可能。然而,现有的控制器通常是由预定义的运动轨迹驱动,这在用户意图变化时提供了有限的灵活性,或者是通过持续的人类遥控,这需要持续的人类参与并限制了自主性。本研究解决了如何以实时和交互的方式驱动通用人形控制器的问题。我们提出了TextOp,一个实时文本驱动的人形运动生成与控制框架,支持在执行过程中流式语言命令和即时指令修改。TextOp采用两级架构,其中高层自回归运动扩散模型根据当前文本输入持续生成短期运动轨迹,而低层运动跟踪策略则在物理人形机器人上执行这些轨迹。通过将交互式运动生成与稳健的全身控制相结合,TextOp解锁了自由形式的意图表达,并在单一连续运动执行中实现了舞蹈和跳跃等多种挑战性行为之间的平滑过渡。大量真实机器人实验和离线评估表明,系统具备即时响应、平滑的全身运动和精确控制。项目页面和开源代码可在 https://text-op.github.io/ 获取。
cs.RO / 25 / 2602.07506
VividFace: Real-Time and Realistic Facial Expression Shadowing for Humanoid Robots
VividFace:用于类人机器人实时且逼真的面部表情跟随
Abstract
Humanoid facial expression shadowing enables robots to realistically imitate human facial expressions in real time, which is critical for lifelike, facially expressive humanoid robots and affective human-robot interaction. Existing progress in humanoid facial expression imitation remains limited, often failing to achieve either real-time performance or realistic expressiveness due to offline video-based inference designs and insufficient ability to capture and transfer subtle expression details. To address these limitations, we present VividFace, a real-time and realistic facial expression shadowing system for humanoid robots. An optimized imitation framework X2CNet++ enhances expressiveness by fine-tuning the human-to-humanoid facial motion transfer module and introducing a feature-adaptation training strategy for better alignment across different image sources. Real-time shadowing is further enabled by a video-stream-compatible inference pipeline and a streamlined workflow based on asynchronous I/O for efficient communication across devices. VividFace produces vivid humanoid faces by mimicking human facial expressions within 0.05 seconds, while generalizing across diverse facial configurations. Extensive real-world demonstrations validate its practical utility. Videos are available at: https://lipzh5.github.io/VividFace/.
Chinese Translation
类人面部表情跟随使机器人能够实时逼真地模仿人类面部表情,这对于生动的、面部表情丰富的类人机器人以及情感化的人机交互至关重要。现有的类人面部表情模仿进展仍然有限,往往无法实现实时性能或逼真表现,原因在于基于离线视频的推理设计以及捕捉和传递细微表情细节的能力不足。为了解决这些局限性,我们提出了VividFace,一个用于类人机器人的实时且逼真的面部表情跟随系统。经过优化的模仿框架X2CNet++通过微调人类到类人面部运动转移模块,并引入特征适应训练策略,以实现不同图像源之间的更好对齐,从而增强了表现力。实时跟随进一步通过兼容视频流的推理管道和基于异步I/O的简化工作流程实现,以提高设备间的高效通信。VividFace能够在0.05秒内通过模仿人类面部表情生成生动的类人面孔,同时在多样的面部配置中实现泛化。大量的现实世界演示验证了其实际应用价值。视频可在以下链接查看:https://lipzh5.github.io/VividFace/
cs.RO / 26 / 2602.07541
Differentiate-and-Inject: Enhancing VLAs via Functional Differentiation Induced by In-Parameter Structural Reasoning
差异化与注入:通过参数内结构推理增强视觉-语言-动作(VLA)模型
Abstract
As robots are expected to perform increasingly diverse tasks, they must understand not only low-level actions but also the higher-level structure that determines how a task should unfold. Existing vision-language-action (VLA) models struggle with this form of task-level reasoning. They either depend on prompt-based in-context decomposition, which is unstable and sensitive to linguistic variations, or end-to-end long-horizon training, which requires large-scale demonstrations and entangles task-level reasoning with low-level control. We present in-parameter structured task reasoning (iSTAR), a framework for enhancing VLA models via functional differentiation induced by in-parameter structural reasoning. Instead of treating VLAs as monolithic policies, iSTAR embeds task-level semantic structure directly into model parameters, enabling differentiated task-level inference without external planners or handcrafted prompt inputs. This injected structure takes the form of implicit dynamic scene-graph knowledge that captures object relations, subtask semantics, and task-level dependencies in parameter space. Across diverse manipulation benchmarks, iSTAR achieves more reliable task decompositions and higher success rates than both in-context and end-to-end VLA baselines, demonstrating the effectiveness of parameter-space structural reasoning for functional differentiation and improved generalization across task variations.
Chinese Translation
随着机器人被期望执行越来越多样化的任务,它们必须理解不仅是低级动作,还包括决定任务展开方式的高级结构。现有的视觉-语言-动作(VLA)模型在这种任务级推理方面存在困难。它们要么依赖于基于提示的上下文分解,这种方法不稳定且对语言变体敏感,要么依赖于端到端的长时间训练,这需要大规模的示范并将任务级推理与低级控制纠缠在一起。我们提出了参数内结构化任务推理(iSTAR),这是一个通过参数内结构推理引导的功能差异化框架,用于增强VLA模型。iSTAR并不将VLA视为单一的策略,而是将任务级语义结构直接嵌入模型参数中,从而实现无外部规划器或手工提示输入的差异化任务级推理。这种注入的结构以隐式动态场景图知识的形式存在,捕捉了参数空间中的对象关系、子任务语义和任务级依赖性。在多样化的操作基准测试中,iSTAR实现了比上下文和端到端VLA基线更可靠的任务分解和更高的成功率,展示了参数空间结构推理在功能差异化和任务变体间改进泛化方面的有效性。
cs.RO / 27 / 2602.07598
"Meet My Sidekick!": Effects of Separate Identities and Control of a Single Robot in HRI
“认识我的助手!”:人机交互中单一机器人不同身份与控制的影响
Abstract
The presentation of a robot's capability and identity directly influences a human collaborator's perception and implicit trust in the robot. Unlike humans, a physical robot can simultaneously present different identities and have them reside in and control different parts of the robot. This paper presents a novel study that investigates how users perceive a robot where different robot control domains (head and gripper) are presented as independent robots. We conducted a mixed design study where participants experienced one of three presentations: a single robot, two agents with shared full control (co-embodiment), or two agents with split control across robot control domains (split-embodiment). Participants completed three distinct tasks -- a mundane data entry task where the robot provides motivational support, an individual sorting task with isolated robot failures, and a collaborative arrangement task where the robot causes a failure that directly affects the human participant. Participants perceived the robot as residing in the different control domains and were able to associate robot failures with different identities. This work signals how future robots can leverage different embodiment configurations to obtain the benefit of multiple robots within a single body.
Chinese Translation
机器人的能力和身份的呈现直接影响人类合作者对机器人的感知和隐性信任。与人类不同,物理机器人可以同时呈现不同的身份,并使这些身份驻留并控制机器人的不同部分。本文呈现了一项新颖的研究,探讨用户如何感知一个在不同机器人控制领域(头部和夹具)被呈现为独立机器人的机器人。我们进行了混合设计研究,参与者体验了三种不同的呈现方式:一个单一机器人、两个共享完全控制的代理(共同体现)或两个在机器人控制领域中分开控制的代理(分离体现)。参与者经历了三个不同的任务——一个平常的数据录入任务,其中机器人提供激励支持,一个个体分类任务,其中机器人出现孤立故障,以及一个协作安排任务,其中机器人导致的故障直接影响人类参与者。参与者感知到机器人驻留在不同的控制领域,并能够将机器人故障与不同身份关联起来。这项工作表明,未来的机器人可以利用不同的体现配置,以在单一身体中获得多个机器人的优势。
cs.RO / 28 / 2602.07629
LCLA: Language-Conditioned Latent Alignment for Vision-Language Navigation
LCLA:基于语言条件的潜在对齐用于视觉-语言导航
Abstract
We propose LCLA (Language-Conditioned Latent Alignment), a framework for vision-language navigation that learns modular perception-action interfaces by aligning sensory observations to a latent representation of an expert policy. The expert is first trained with privileged state information, inducing a latent space sufficient for control, after which its latent interface and action head are frozen. A lightweight adapter is then trained to map raw visual-language observations, via a frozen vision-language model, into the expert's latent space, reducing the problem of visuomotor learning to supervised latent alignment rather than end-to-end policy optimization. This decoupling enforces a stable contract between perception and control, enabling expert behavior to be reused across sensing modalities and environmental variations. We instantiate LCLA and evaluate it on a vision-language indoor navigation task, where aligned latent spaces yield strong in-distribution performance and robust zero-shot generalization to unseen environments, lighting conditions, and viewpoints while remaining lightweight at inference time.
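The reduction from policy optimization to supervised latent alignment can be sketched with a one-dimensional linear adapter fitted by gradient descent on a regression loss. The scalar features, targets, and hyperparameters below are hypothetical stand-ins for the frozen VLM features and the expert's latent states:

```python
def train_adapter(features, latents, lr=0.1, epochs=200):
    """Fit a 1-D linear adapter z ~= w * f + b by supervised latent alignment.

    Stands in for training the lightweight adapter while the expert's latent
    interface and action head stay frozen.
    """
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = sum(2 * (w * f + b - z) * f for f, z in zip(features, latents)) / n
        grad_b = sum(2 * (w * f + b - z) for f, z in zip(features, latents)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical scalar observation features and the expert latents to align to.
feats = [0.0, 1.0, 2.0, 3.0]
lats = [1.0, 3.0, 5.0, 7.0]  # underlying mapping: z = 2*f + 1
w, b = train_adapter(feats, lats)
```

Once the adapter reproduces the expert's latents, the frozen action head consumes them unchanged, which is the "stable contract" between perception and control described above.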
Chinese Translation
我们提出了LCLA(基于语言条件的潜在对齐),这是一个用于视觉-语言导航的框架,通过将感知观察与专家策略的潜在表示对齐,学习模块化的感知-动作接口。专家首先在特权状态信息的指导下进行训练,从而诱导出一个足以进行控制的潜在空间,之后其潜在接口和动作头被冻结。然后,训练一个轻量级适配器,通过一个冻结的视觉-语言模型,将原始的视觉-语言观察映射到专家的潜在空间,从而将视觉运动学习的问题简化为监督潜在对齐,而不是端到端的策略优化。这种解耦强化了感知与控制之间的稳定契约,使得专家行为能够在不同的感知模态和环境变化中得以重用。我们实例化了LCLA,并在一个视觉-语言室内导航任务上进行了评估,其中对齐的潜在空间在分布内表现出强大的性能,并且在未见过的环境、光照条件和视角下具有稳健的零样本泛化,同时在推理时保持轻量级。
cs.RO / 29 / 2602.07677
Affine Transformable Unmanned Ground Vehicle
可仿射变换的无人地面车辆
Abstract
This paper develops the proof of concept for a novel affine transformable unmanned ground vehicle (ATUGV) with the capability of safe and aggressive deformation while carrying multiple payloads. The ATUGV is a multi-body system comprising mobile robots that power the ATUGV's morphable motion, powered cells that enclose the mobile robots, unpowered cells that contain payloads, and a deformable structure that integrates the cells through bars and joints. The objective is that the motion of all powered and unpowered cells can safely track a desired affine transformation, where an affine transformation can be decomposed into translation, rigid body rotation, and deformation. To this end, the paper first uses a deep neural network to structure cell interconnection in such a way that every cell can freely move over the deformation plane, and the entire structure can reconfigurably deform to track a desired affine transformation. Then, the mobile robots contained by the powered cells, together with the stepper motors regulating the connections of the powered and unpowered cells, apply the proper controls so that all cells safely track the desired affine transformation. The functionality of the proposed ATUGV is validated through hardware experimentation and simulation.
Chinese Translation
本文开发了一种新型可仿射变换无人地面车辆(ATUGV)的概念验证,该车辆能够在承载多个有效载荷的同时进行安全且激进的变形。ATUGV是一个多体系统,包含可用于驱动ATUGV可变运动的移动机器人、用于包围移动机器人的动力单元、用于容纳有效载荷的非动力单元,以及通过杆和关节集成单元的可变形结构。其目标是使所有动力和非动力单元的运动能够安全地跟踪期望的仿射变换,其中仿射变换可以分解为平移、刚体旋转和变形。为此,本文首先利用深度神经网络构建单元间的互连,使每个单元能够在变形平面上自由移动,整个结构能够可重构地变形以跟踪期望的仿射变换。然后,由动力单元和步进电机所包含的移动机器人调节动力和非动力单元的连接,设计适当的控制策略,以确保所有单元安全地跟踪期望的仿射变换。通过硬件实验和仿真验证了所提ATUGV的功能性。
cs.RO / 30 / 2602.07736
Global Symmetry and Orthogonal Transformations from Geometrical Moment $n$-tuples
几何矩$n$元组的全局对称性与正交变换
Abstract
Detecting symmetry is crucial for effective object grasping for several reasons. Recognizing symmetrical features or axes within an object helps in developing efficient grasp strategies, as grasping along these axes typically results in a more stable and balanced grip, thereby facilitating successful manipulation. This paper employs geometrical moments to identify symmetries and estimate orthogonal transformations, including rotations and mirror transformations, for objects centered at the frame origin. It provides distinctive metrics for detecting symmetries and estimating orthogonal transformations, encompassing rotations, reflections, and their combinations. A comprehensive methodology is developed to obtain these functions in n-dimensional space, specifically moment \( n \)-tuples. Extensive validation tests are conducted on both 2D and 3D objects to ensure the robustness and reliability of the proposed approach. The proposed method is also compared to state-of-the-art work using iterative optimization for detecting multiple planes of symmetry. The results indicate that combining our method with the iterative one yields satisfactory outcomes in terms of the number of symmetry planes detected and computation time.
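A small example of the underlying machinery: central geometric moments of a point set, where moments odd in x vanish for shapes mirror-symmetric about the vertical axis through the centroid. The point sets below are illustrative; the paper's metrics combine such moments into $n$-tuples rather than testing a single moment:

```python
def central_moment(points, p, q):
    """Central geometric moment mu_pq of a 2-D point set."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return sum((x - cx) ** p * (y - cy) ** q for x, y in points) / n

# Mirror-symmetric about the vertical axis: every moment odd in x vanishes.
symmetric = [(-1.0, 0.0), (1.0, 0.0), (-2.0, 1.0), (2.0, 1.0)]
asymmetric = [(-1.0, 0.0), (1.0, 0.0), (2.0, 1.0)]
mu30_sym = central_moment(symmetric, 3, 0)
mu30_asym = central_moment(asymmetric, 3, 0)
```

A vanishing odd moment is a necessary (not sufficient) symmetry signature, which is why the paper builds full moment $n$-tuples and combines them with an iterative refinement stage.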
Chinese Translation
检测对称性对于有效的物体抓取至关重要,原因有多方面。识别物体内的对称特征或轴有助于开发高效的抓取策略,因为沿这些轴抓取通常会带来更稳定和平衡的握持,从而促进成功的操作。本文利用几何矩来识别对称性,并针对以坐标系原点为中心的物体估计包括旋转和镜像变换在内的正交变换。我们提供了独特的指标来检测对称性和估计正交变换,涵盖旋转、反射及其组合。我们开发了一种全面的方法论,以在n维空间中获得这些函数,特别是矩$n$元组。对二维和三维物体进行了广泛的验证测试,以确保所提方法的稳健性和可靠性。所提方法还与基于迭代优化检测多个对称平面的最新研究进行了比较。结果表明,将我们的方法与迭代方法结合使用,在检测到的对称平面数量和计算时间方面都取得了令人满意的结果。
cs.RO / 31 / 2602.07776
CoLF: Learning Consistent Leader-Follower Policies for Vision-Language-Guided Multi-Robot Cooperative Transport
CoLF:学习一致的领导-跟随策略以实现视觉-语言引导的多机器人协作运输
Abstract
In this study, we address vision-language-guided multi-robot cooperative transport, where each robot grounds natural-language instructions from onboard camera observations. A key challenge in this decentralized setting is perceptual misalignment across robots, where viewpoint differences and language ambiguity can yield inconsistent interpretations and degrade cooperative transport. To mitigate this problem, we adopt a dependent leader-follower design, where one robot serves as the leader and the other as the follower. Although such a leader-follower structure appears straightforward, learning with independent and symmetric agents often yields symmetric or unstable behaviors without explicit inductive biases. To address this challenge, we propose Consistent Leader-Follower (CoLF), a multi-agent reinforcement learning (MARL) framework for stable leader-follower role differentiation. CoLF consists of two key components: (1) an asymmetric policy design that induces leader-follower role differentiation, and (2) a mutual-information-based training objective that maximizes a variational lower bound, encouraging the follower to predict the leader's action from its local observation. The leader and follower policies are jointly optimized under the centralized training and decentralized execution (CTDE) framework to balance task execution and consistent cooperative behaviors. We validate CoLF in both simulation and real-robot experiments using two quadruped robots. The demonstration video is available at https://sites.google.com/view/colf/.
Chinese Translation
在本研究中,我们探讨了视觉-语言引导的多机器人协作运输,其中每个机器人根据机载摄像头观察结果理解自然语言指令。在这种去中心化的环境中,一个主要挑战是机器人之间的感知不一致,视角差异和语言模糊可能导致不一致的解释,从而降低协作运输的效率。为了解决这一问题,我们采用了依赖的领导-跟随设计,其中一台机器人作为领导者,另一台作为跟随者。尽管这种领导-跟随结构看似简单,但在没有明确归纳偏置的情况下,使用独立和对称的代理进行学习往往会导致对称或不稳定的行为。为了解决这一挑战,我们提出了一种一致的领导-跟随(CoLF)框架,这是一种用于稳定领导-跟随角色区分的多智能体强化学习(MARL)框架。CoLF由两个关键组成部分构成:(1)一种不对称的策略设计,促使领导-跟随角色的区分;(2)一种基于互信息的训练目标,最大化变分下界,鼓励跟随者从其局部观察中预测领导者的动作。领导者和跟随者的策略在集中训练和去中心化执行(CTDE)框架下共同优化,以平衡任务执行和一致的协作行为。我们在模拟和真实机器人实验中验证了CoLF,使用了两台四足机器人。演示视频可在 https://sites.google.com/view/colf/ 获取。
cs.RO / 32 / 2602.07837
RLinf-USER: A Unified and Extensible System for Real-World Online Policy Learning in Embodied AI
RLinf-USER:一个统一且可扩展的系统,用于具身人工智能中的现实世界在线策略学习
Abstract
Online policy learning directly in the physical world is a promising yet challenging direction for embodied intelligence. Unlike simulation, real-world systems cannot be arbitrarily accelerated, cheaply reset, or massively replicated, which makes scalable data collection, heterogeneous deployment, and long-horizon effective training difficult. These challenges suggest that real-world policy learning is not only an algorithmic issue but fundamentally a systems problem. We present USER, a Unified and extensible SystEm for Real-world online policy learning. USER treats physical robots as first-class hardware resources alongside GPUs through a unified hardware abstraction layer, enabling automatic discovery, management, and scheduling of heterogeneous robots. To address cloud-edge communication, USER introduces an adaptive communication plane with tunneling-based networking, distributed data channels for traffic localization, and streaming-multiprocessor-aware weight synchronization to regulate GPU-side overhead. On top of this infrastructure, USER organizes learning as a fully asynchronous framework with a persistent, cache-aware buffer, enabling efficient long-horizon experiments with robust crash recovery and reuse of historical data. In addition, USER provides extensible abstractions for rewards, algorithms, and policies, supporting online imitation or reinforcement learning of CNN/MLP, generative policies, and large vision-language-action (VLA) models within a unified pipeline. Results in both simulation and the real world show that USER enables multi-robot coordination, heterogeneous manipulators, edge-cloud collaboration with large models, and long-running asynchronous training, offering a unified and extensible systems foundation for real-world online policy learning.
Chinese Translation
在物理世界中直接进行在线策略学习是具身智能一个有前景但具有挑战性的方向。与模拟不同,现实世界的系统无法任意加速、廉价重置或大规模复制,这使得可扩展的数据收集、异构部署和长时间有效训练变得困难。这些挑战表明,现实世界的策略学习不仅是一个算法问题,更根本上是一个系统问题。我们提出了USER,一个用于现实世界在线策略学习的统一且可扩展的系统。USER通过统一的硬件抽象层将物理机器人视为一类重要的硬件资源,与GPU并列,从而实现异构机器人的自动发现、管理和调度。为了解决云边通信问题,USER引入了一个自适应通信层,采用基于隧道的网络、用于流量定位的分布式数据通道,以及流处理器感知的权重同步,以调节GPU端的开销。在此基础设施之上,USER将学习组织为一个完全异步的框架,配备持久的、缓存感知的缓冲区,使得进行高效的长时间实验成为可能,并具备强大的崩溃恢复和历史数据重用能力。此外,USER为奖励、算法和策略提供了可扩展的抽象,支持在统一管道中进行CNN/MLP的在线模仿或强化学习、生成策略以及大型视觉-语言-动作(VLA)模型的学习。模拟和现实世界中的结果表明,USER能够实现多机器人协调、异构操控器、与大型模型的边缘-云协作以及长期异步训练,为现实世界在线策略学习提供了统一且可扩展的系统基础。
cs.RO / 33 / 2602.07845
Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning
递归深度视觉-语言-动作模型:通过潜在迭代推理实现隐式测试时计算规模调整
Abstract
Current Vision-Language-Action (VLA) models rely on fixed computational depth, expending the same amount of compute on simple adjustments and complex multi-step manipulation. While Chain-of-Thought (CoT) prompting enables variable computation, it scales memory linearly and is ill-suited for continuous action spaces. We introduce Recurrent-Depth VLA (RD-VLA), an architecture that achieves computational adaptivity via latent iterative refinement rather than explicit token generation. RD-VLA employs a recurrent, weight-tied action head that supports arbitrary inference depth with a constant memory footprint. The model is trained using truncated backpropagation through time (TBPTT) to efficiently supervise the refinement process. At inference, RD-VLA dynamically allocates compute using an adaptive stopping criterion based on latent convergence. Experiments on challenging manipulation tasks show that recurrent depth is critical: tasks that fail entirely (0 percent success) with single-iteration inference exceed 90 percent success with four iterations, while simpler tasks saturate rapidly. RD-VLA provides a scalable path to test-time compute in robotics, replacing token-based reasoning with latent reasoning to achieve constant memory usage and up to 80x inference speedup over prior reasoning-based VLA models. Project page: https://rd-vla.github.io/
Chinese Translation
当前的视觉-语言-动作(VLA)模型依赖于固定的计算深度,在简单调整和复杂的多步骤操作中消耗相同的计算量。虽然思维链(CoT)提示能够实现可变计算,但其内存线性扩展,不适合连续动作空间。我们提出了递归深度 VLA(RD-VLA),一种通过潜在迭代细化而非显式生成标记实现计算自适应的架构。RD-VLA采用了一个递归的、权重绑定的动作头,支持任意推理深度,同时保持恒定的内存占用。该模型使用截断时间反向传播(TBPTT)进行训练,以高效监督细化过程。在推理时,RD-VLA根据潜在收敛动态分配计算,采用自适应停止标准。对具有挑战性的操作任务的实验表明,递归深度至关重要:在单次迭代推理中完全失败(成功率为0%)的任务,在四次迭代中成功率超过90%,而较简单的任务则迅速饱和。RD-VLA为机器人领域的测试时计算提供了一条可扩展的路径,通过潜在推理替代基于标记的推理,实现恒定的内存使用,并在推理速度上比先前的基于推理的 VLA 模型提高了多达80倍。项目页面:https://rd-vla.github.io/
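The core loop of the abstract above — a weight-tied update applied until successive latents stop changing — can be sketched in a few lines. This is a hedged toy, not RD-VLA's actual action head: `refine_action`, the contractive affine map, and all constants are hypothetical stand-ins for the learned recurrent refinement and its latent-convergence stopping rule.

```python
import numpy as np

def refine_action(z0, step, max_iters=16, tol=1e-4):
    """Iteratively refine a latent action with a weight-tied update `step`,
    stopping early once successive latents converge (adaptive stopping based
    on latent convergence). Constant memory: only the current latent is kept."""
    z = z0
    for k in range(1, max_iters + 1):
        z_next = step(z)
        if np.linalg.norm(z_next - z) < tol:  # latents have converged
            return z_next, k
        z = z_next
    return z, max_iters

# Toy weight-tied "action head": a contractive affine map, so the iteration
# converges to the fixed point z* = (I - W)^-1 b = [2, -2].
W = 0.5 * np.eye(2)
b = np.array([1.0, -1.0])
z_star, iters_used = refine_action(np.zeros(2), lambda z: W @ z + b)
```

Because the update is contractive, the loop stops well before `max_iters` on this easy instance; a harder (slower-contracting) map would consume more iterations, mirroring the compute-adaptivity argument.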
cs.RO / 34 / 2602.07846
System-Level Error Propagation and Tail-Risk Amplification in Reference-Based Robotic Navigation
基于参考的机器人导航中的系统级误差传播与尾部风险放大
Abstract
Image-guided robotic navigation systems often rely on reference-based geometric perception pipelines, where accurate spatial mapping is established through multi-stage estimation processes. In biplanar X-ray-guided navigation, such pipelines are widely used due to their real-time capability and geometric interpretability. However, navigation reliability can be constrained by an overlooked system-level failure mechanism in which installation-induced structural perturbations introduced at the perception stage are progressively amplified along the perception-reconstruction-execution chain and dominate execution-level error and tail-risk behavior. This paper investigates this mechanism from a system-level perspective and presents a unified error propagation modeling framework that characterizes how installation-induced structural perturbations propagate and couple with pixel-level observation noise through biplanar imaging, projection matrix estimation, triangulation, and coordinate mapping. Using first-order analytic uncertainty propagation and Monte Carlo simulations, we analyze dominant sensitivity channels and quantify worst-case error behavior beyond mean accuracy metrics. The results show that rotational installation error is a primary driver of system-level error amplification, while translational misalignment of comparable magnitude plays a secondary role under typical biplanar geometries. Real biplanar X-ray bench-top experiments further confirm that the predicted amplification trends persist under realistic imaging conditions. These findings reveal a broader structural limitation of reference-based multi-stage geometric perception pipelines and provide a framework for system-level reliability analysis and risk-aware design in safety-critical robotic navigation systems.
Chinese Translation
图像引导的机器人导航系统通常依赖于基于参考的几何感知流程,通过多阶段估计过程建立准确的空间映射。在双平面X射线引导导航中,由于其实时能力和几何可解释性,此类流程被广泛使用。然而,导航的可靠性可能受到一个被忽视的系统级故障机制的限制,即在感知阶段引入的安装诱导结构扰动沿着感知重建执行链逐渐放大,并主导执行级别的误差和尾部风险行为。本文从系统级的角度研究了这一机制,并提出了一个统一的误差传播建模框架,描述了安装诱导的结构扰动如何通过双平面成像、投影矩阵估计、三角测量和坐标映射传播并与像素级观察噪声耦合。利用一阶解析不确定性传播和蒙特卡洛模拟,我们分析了主导敏感性通道,并量化了超出平均准确性指标的最坏情况误差行为。结果表明,旋转安装误差是系统级误差放大的主要驱动因素,而在典型的双平面几何下,大小相当的平移错位则起到次要作用。真实的双平面X射线台式实验进一步确认了在现实成像条件下预测的放大趋势依然存在。这些发现揭示了基于参考的多阶段几何感知流程的更广泛结构限制,并为安全关键的机器人导航系统提供了系统级可靠性分析和风险意识设计的框架。
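The abstract's headline finding — rotational installation error dominates translational misalignment of comparable magnitude — follows from a lever-arm effect that a toy Monte Carlo can illustrate. This is not the paper's biplanar pipeline; the 1D model, the 0.5-degree vs 0.5-mm perturbations, and the 500 mm target distance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def installation_error_std(depth, rot_std_rad, trans_std, n=20000):
    """Monte Carlo sketch of the lever-arm effect: a small installation
    rotation of the imaging frame displaces a target at distance `depth` by
    roughly depth * angle (small-angle approximation), while an installation
    translation displaces it by the translation itself. A simplified 1D
    stand-in for the full biplanar propagation chain."""
    rot_term = depth * rng.normal(0.0, rot_std_rad, n)   # amplified by depth
    trans_term = rng.normal(0.0, trans_std, n)           # passes through 1:1
    return float(np.std(rot_term + trans_term))

# "Comparable magnitude" perturbations: 0.5 deg rotation vs 0.5 mm translation,
# for a target 500 mm from the reference frame (numbers are illustrative).
rot_only = installation_error_std(500.0, np.deg2rad(0.5), 0.0)
trans_only = installation_error_std(500.0, 0.0, 0.5)
```

Even in this crude model the rotational channel produces several times the target-level error of an equally sized translational one, consistent with rotation being the primary amplification driver.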
cs.RO / 35 / 2602.07888
Research on a Camera Position Measurement Method based on a Parallel Perspective Error Transfer Model
基于平行透视误差传递模型的相机位置测量方法研究
Abstract
Camera pose estimation from sparse correspondences is a fundamental problem in geometric computer vision and remains particularly challenging in near-field scenarios, where strong perspective effects and heterogeneous measurement noise can significantly degrade the stability of analytic PnP solutions. In this paper, we present a geometric error propagation framework for camera pose estimation based on a parallel perspective approximation. By explicitly modeling how image measurement errors propagate through perspective geometry, we derive an error transfer model that characterizes the relationship between feature point distribution, camera depth, and pose estimation uncertainty. Building on this analysis, we develop a pose estimation method that leverages parallel perspective initialization and error-aware weighting within a Gauss-Newton optimization scheme, leading to improved robustness in proximity operations. Extensive experiments on both synthetic data and real-world images, covering diverse conditions such as strong illumination, surgical lighting, and underwater low-light environments, demonstrate that the proposed approach achieves accuracy and robustness comparable to state-of-the-art analytic and iterative PnP methods, while maintaining high computational efficiency. These results highlight the importance of explicit geometric error modeling for reliable camera pose estimation in challenging near-field settings.
Chinese Translation
从稀疏对应关系中进行相机姿态估计是几何计算机视觉中的一个基本问题,在近场场景中尤其具有挑战性,因为强透视效应和异构测量噪声会显著降低解析PnP解的稳定性。本文提出了一种基于平行透视近似的相机姿态估计几何误差传播框架。通过明确建模图像测量误差如何通过透视几何传播,我们推导出一个误差传递模型,该模型表征了特征点分布、相机深度和姿态估计不确定性之间的关系。在此分析的基础上,我们开发了一种姿态估计方法,该方法利用平行透视初始化和在高斯-牛顿优化方案中的误差感知加权,从而提高了近距离操作的鲁棒性。对合成数据和真实图像进行的大量实验,涵盖强光照、手术照明和水下低光环境等多种条件,表明所提出的方法在准确性和鲁棒性方面与最先进的解析和迭代PnP方法相当,同时保持了高计算效率。这些结果突显了在具有挑战性的近场环境中,显式几何误差建模对于可靠的相机姿态估计的重要性。
cs.RO / 36 / 2602.07901
Incremental Mapping with Measurement Synchronization & Compression
基于测量同步与压缩的增量建图
Abstract
Modern autonomous vehicles and robots utilize versatile sensors for localization and mapping. The fidelity of these maps is paramount, as an accurate environmental representation is a prerequisite for stable and precise localization. Factor graphs provide a powerful approach for sensor fusion, enabling the estimation of the maximum a posteriori solution. However, the discrete nature of graph-based representations, combined with asynchronous sensor measurements, complicates consistent state estimation. The design of an optimal factor graph topology remains an open challenge, especially in multi-sensor systems with asynchronous data. Conventional approaches rely on a rigid graph structure, which becomes inefficient with sensors of disparate rates. Although preintegration techniques can mitigate this for high-rate sensors, their applicability is limited. To address this problem, this work introduces a novel approach that incrementally constructs connected factor graphs, ensuring the incorporation of all available sensor data by choosing the optimal graph topology based on the external evaluation criteria. The proposed methodology facilitates graph compression, reducing the number of nodes (optimized variables) by ~30% on average while maintaining map quality at a level comparable to conventional approaches.
Chinese Translation
现代自主车辆和机器人利用多功能传感器进行定位和地图构建。这些地图的准确性至关重要,因为准确的环境表示是稳定和精确定位的前提。因子图提供了一种强大的传感器融合方法,使得最大后验解的估计成为可能。然而,基于图的表示的离散特性,加上异步传感器测量,使得一致的状态估计变得复杂。设计一个最优的因子图拓扑仍然是一个开放的挑战,特别是在具有异步数据的多传感器系统中。传统方法依赖于刚性的图结构,这在传感器速率不一致时效率低下。尽管预积分技术可以缓解高频传感器的问题,但其适用性有限。为了解决这个问题,本研究提出了一种新方法,增量构建连接的因子图,通过根据外部评估标准选择最优图拓扑,确保所有可用传感器数据的纳入。所提出的方法促进了图的压缩,平均减少了约30%的节点(优化变量),同时保持了与传统方法相当的地图质量。
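One way to picture the node-count reduction described above is a greedy synchronization pass that lets asynchronous measurements falling within a small time window share one graph node instead of each spawning their own. This is a hypothetical simplification of the paper's topology selection, not its actual criterion; `compress_timestamps` and the 30 ms window are assumptions.

```python
def compress_timestamps(stamps, window):
    """Greedy synchronization sketch: sweep sorted measurement timestamps and
    attach each one to the current node if it lies within `window` seconds of
    that node's anchor time; otherwise open a new node. Returns a list of
    [anchor_time, member_stamps] pairs."""
    nodes = []
    for t in sorted(stamps):
        if not nodes or t - nodes[-1][0] > window:
            nodes.append([t, [t]])      # open a new node anchored at t
        else:
            nodes[-1][1].append(t)      # share the existing node
    return nodes

# Two asynchronous sensors at 10 Hz and 7 Hz over one second, fused with a
# 30 ms synchronization window.
stamps = [i / 10 for i in range(10)] + [i / 7 for i in range(7)]
nodes = compress_timestamps(stamps, 0.03)
```

Here 17 measurements collapse into 12 nodes (about a 30% reduction, coincidentally in line with the abstract's figure) while every measurement remains attached to some optimized variable.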
cs.RO / 37 / 2602.07913
Multi-Agent Route Planning as a QUBO Problem
将多智能体路径规划视为 QUBO 问题
Abstract
Multi-Agent Route Planning considers selecting vehicles, each associated with a single predefined route, such that the spatial coverage of a road network is increased while redundant overlaps are limited. This paper gives a formal problem definition, proves NP-hardness by reduction from the Weighted Set Packing problem, and derives a Quadratic Unconstrained Binary Optimization formulation whose coefficients directly encode unique coverage rewards and pairwise overlap penalties. A single penalty parameter controls the coverage-overlap trade-off. We distinguish between a soft regime, which supports multi-objective exploration, and a hard regime, in which the penalty is strong enough to effectively enforce near-disjoint routes. We describe a practical pipeline for generating city instances, constructing candidate routes, building the QUBO matrix, and solving it with an exact mixed-integer solver (Gurobi), simulated annealing, and D-Wave hybrid quantum annealing. Experiments on Barcelona instances with up to 10 000 vehicles reveal a clear coverage-overlap knee and show that Pareto-optimal solutions are mainly obtained under the hard-penalty regime, while D-Wave hybrid solvers and Gurobi achieve essentially identical objective values with only minor differences in runtime as problem size grows.
Chinese Translation
多智能体路径规划考虑选择各自与单一预定义路径相关联的车辆,以增加道路网络的空间覆盖,同时限制冗余重叠。本文给出了一个正式的问题定义,通过从加权集合打包问题的归约证明了其 NP-hard 性,并推导出一个二次无约束二进制优化(QUBO)公式,其系数直接编码了独特的覆盖奖励和成对重叠惩罚。一个单一的惩罚参数控制覆盖与重叠之间的权衡。我们区分了支持多目标探索的软约束状态和在其中惩罚强度足以有效强制近乎不重叠路径的硬约束状态。我们描述了一个实用的流程,用于生成城市实例、构建候选路径、构建 QUBO 矩阵,并使用精确的混合整数求解器(Gurobi)、模拟退火和 D-Wave 混合量子退火进行求解。在包含多达 10,000 辆车辆的巴塞罗那实例上的实验揭示了明显的覆盖-重叠拐点,并表明帕累托最优解主要在硬惩罚状态下获得,而 D-Wave 混合求解器和 Gurobi 在问题规模增长时实现了基本相同的目标值,仅在运行时间上存在微小差异。
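The QUBO described above — diagonal coverage rewards, off-diagonal pairwise overlap penalties, one scalar trading them off — can be built directly from route cell sets. The sketch below is a minimal illustration on assumed toy data (three routes, brute force in place of Gurobi or annealing), not the paper's instance-generation pipeline.

```python
import itertools

def build_qubo(route_cells, penalty):
    """Build a QUBO for route selection: diagonal entries reward each route's
    coverage, off-diagonal entries penalize pairwise overlap, and a single
    `penalty` scalar controls the coverage-overlap trade-off."""
    Q = {}
    n = len(route_cells)
    for i in range(n):
        Q[(i, i)] = -float(len(route_cells[i]))          # coverage reward
        for j in range(i + 1, n):
            overlap = len(route_cells[i] & route_cells[j])
            if overlap:
                Q[(i, j)] = penalty * overlap            # overlap penalty
    return Q

def brute_force(Q, n):
    """Exact minimizer for tiny instances (stand-in for Gurobi or annealing)."""
    best_x, best_e = None, float("inf")
    for x in itertools.product([0, 1], repeat=n):
        e = sum(c * x[i] * x[j] for (i, j), c in Q.items())
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

# Three candidate routes over road cells; routes 0 and 1 overlap heavily.
routes = [{1, 2, 3}, {2, 3, 4}, {5, 6}]
x_hard, _ = brute_force(build_qubo(routes, penalty=5.0), len(routes))  # hard regime
x_soft, _ = brute_force(build_qubo(routes, penalty=1.0), len(routes))  # soft regime
```

In the hard-penalty regime the minimizer drops one of the two overlapping routes, enforcing near-disjointness; in the soft regime all three routes are kept, trading some redundancy for coverage.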
cs.RO / 38 / 2602.07924
Optimized Human-Robot Co-Dispatch Planning for Petro-Site Surveillance under Varying Criticalities
针对不同重要性的石油现场监控的优化人机协同调度规划
Abstract
Securing petroleum infrastructure requires balancing autonomous system efficiency with human judgment for threat escalation, a challenge unaddressed by classical facility location models that assume homogeneous resources. This paper formulates the Human-Robot Co-Dispatch Facility Location Problem (HRCD-FLP), a capacitated facility location variant incorporating tiered infrastructure criticality, human-robot supervision ratio constraints, and minimum utilization requirements. We evaluate command center selection across three technology maturity scenarios. Results show that transitioning from conservative (1:3 human-robot supervision) to future autonomous operations (1:10) yields significant cost reduction while maintaining complete critical infrastructure coverage. For small problems, exact methods dominate in both cost and computation time; for larger problems, the proposed heuristic achieves feasible solutions in under 3 minutes with approximately 14% optimality gap where comparison is possible. From a systems perspective, our work demonstrates that optimized planning for human-robot teaming is key to achieving both cost-effective and mission-reliable deployments.
Chinese Translation
保障石油基础设施需要在自主系统效率与人类判断威胁升级之间取得平衡,这一挑战在假设资源均质的经典设施选址模型中未得到解决。本文提出了人机协同调度设施选址问题(HRCD-FLP),这是一个考虑分级基础设施重要性、人机监督比例约束和最低利用率要求的有容量设施选址变体。我们评估了在三种技术成熟度场景下的指挥中心选择。结果表明,从保守的(1:3人机监督)转变到未来的自主操作(1:10)显著降低了成本,同时保持了对所有关键基础设施的全面覆盖。对于小规模问题,精确方法在成本和计算时间上占据优势;而对于大规模问题,所提出的启发式方法在3分钟内实现可行解,且在可比较的情况下约有14%的最优性差距。从系统的角度来看,我们的研究表明,优化的人机协同规划是实现成本效益和任务可靠性部署的关键。
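The cost effect of relaxing the supervision ratio can be seen in a back-of-the-envelope staffing model: each human supervises at most k robots, so the human headcount is the ceiling of robots over k. The function and all cost figures below are illustrative assumptions, not the HRCD-FLP formulation or the paper's numbers.

```python
from math import ceil

def staffing_cost(n_robots, supervision_ratio, human_cost=100.0, robot_cost=20.0):
    """Toy cost model for a human-robot co-dispatch plan: each human can
    supervise at most `supervision_ratio` robots, so the human headcount is
    ceil(robots / ratio). Unit costs are made-up placeholders."""
    n_humans = ceil(n_robots / supervision_ratio)
    return n_humans * human_cost + n_robots * robot_cost

conservative = staffing_cost(30, 3)    # 1:3 supervision -> 10 humans needed
autonomous = staffing_cost(30, 10)     # 1:10 supervision -> 3 humans needed
```

With identical robot fleets, loosening supervision from 1:3 to 1:10 cuts the (toy) total cost from 1600 to 900, illustrating why maturing autonomy shifts the optimal facility plan.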
cs.RO / 39 / 2602.07932
Feasibility-Guided Planning over Multi-Specialized Locomotion Policies
基于可行性指导的多专业运动策略规划
Abstract
Planning over unstructured terrain presents a significant challenge in the field of legged robotics. Although recent works in reinforcement learning have yielded various locomotion strategies, planning over multiple experts remains a complex issue. Existing approaches encounter several constraints: traditional planners are unable to integrate skill-specific policies, whereas hierarchical learning frameworks often lose interpretability and require retraining whenever new policies are added. In this paper, we propose a feasibility-guided planning framework that successfully incorporates multiple terrain-specific policies. Each policy is paired with a Feasibility-Net, which learns to predict feasibility tensors from local elevation maps and task vectors. This integration allows classical planning algorithms to derive optimal paths. Through both simulated and real-world experiments, we demonstrate that our method efficiently generates reliable plans across diverse and challenging terrains, while consistently aligning with the capabilities of the underlying policies.
Chinese Translation
在非结构化地形上进行规划是腿部机器人领域面临的一项重大挑战。尽管近年来强化学习的研究产生了多种运动策略,但在多个专家之间进行规划仍然是一个复杂的问题。现有的方法遇到了几个限制:传统规划器无法整合特定技能的策略,而层次学习框架往往失去可解释性,并且在添加新策略时需要重新训练。在本文中,我们提出了一种可行性指导的规划框架,成功地整合了多种特定地形的策略。每种策略都与一个可行性网络(Feasibility-Net)配对,该网络学习根据局部高程图和任务向量预测可行性张量。这种整合使经典规划算法能够推导出最优路径。通过模拟和真实世界的实验,我们证明了我们的方法能够在多样且具有挑战性的地形上高效生成可靠的规划,同时始终与基础策略的能力保持一致。
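A minimal version of feasibility-guided planning over multiple policies: each policy contributes a per-cell feasibility map, a cell is traversable where any policy is feasible, and Dijkstra pays a higher step cost where the best feasibility is low. This is a hedged simplification — `feasibility_guided_path` and the 3x3 maps are hypothetical, and the real Feasibility-Net predicts feasibility tensors from elevation maps rather than providing fixed grids.

```python
import heapq

def feasibility_guided_path(feas_maps, start, goal):
    """Dijkstra over a grid where each cell's step cost is 1 / (best
    feasibility over all policies); cells infeasible for every policy are
    blocked. Returns the cheapest path from start to goal."""
    rows, cols = len(feas_maps[0]), len(feas_maps[0][0])
    best = [[max(m[r][c] for m in feas_maps) for c in range(cols)]
            for r in range(rows)]
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, (r, c) = heapq.heappop(pq)
        if (r, c) == goal:
            break
        if d > dist.get((r, c), float("inf")):
            continue  # stale queue entry
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and best[nr][nc] > 0.0:
                nd = d + 1.0 / best[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(pq, (nd, (nr, nc)))
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

# Two policies on a 3x3 map: "flat walking" fails on the rubble cell (1,1),
# where a dedicated "rubble" policy remains feasible but slow.
flat = [[1.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
rubble = [[0.2, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.2]]
path = feasibility_guided_path([flat, rubble], (0, 0), (2, 2))
```

The planner skirts the rubble cell because its best feasibility (0.6) makes crossing more expensive than detouring, which is exactly the "plans aligned with policy capability" behavior the abstract claims.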
cs.RO / 40 / 2602.07984
Analyzing the Impact of Simulation Fidelity on the Evaluation of Autonomous Driving Motion Control
分析仿真逼真度对自动驾驶运动控制评估的影响
Abstract
Simulation is crucial in the development of autonomous driving software. In particular, assessing control algorithms requires an accurate vehicle dynamics simulation. However, recent publications use models with varying levels of detail. This disparity makes it difficult to compare individual control algorithms. Therefore, this paper aims to investigate the influence of the fidelity of vehicle dynamics modeling on the closed-loop behavior of trajectory-following controllers. For this purpose, we introduce a comprehensive Autoware-compatible vehicle model. By simplifying this model, we derive models with varying fidelity. Evaluating over 550 simulation runs allows us to quantify each model's approximation quality compared to real-world data. Furthermore, we investigate whether the influence of model simplifications changes with varying margins to the acceleration limit of the vehicle. From this, we deduce to what degree a vehicle model can be simplified to evaluate control algorithms depending on the specific application. The real-world data used to validate the simulation environment originate from the Indy Autonomous Challenge race at the Autodromo Nazionale di Monza in June 2023. They show the fastest fully autonomous lap of TUM Autonomous Motorsport, with vehicle speeds reaching 267 km/h and lateral accelerations of up to 15 m/s².
Chinese Translation
仿真在自动驾驶软件的开发中至关重要。特别是,评估控制算法需要准确的车辆动力学仿真。然而,最近的文献使用了不同细节水平的模型。这种差异使得比较各个控制算法变得困难。因此,本文旨在研究车辆动力学建模的逼真度对轨迹跟踪控制器闭环行为的影响。为此,我们引入了一个全面的与Autoware兼容的车辆模型。通过简化该模型,我们推导出具有不同逼真度的模型。对550多次仿真运行的评估使我们能够量化每个模型与真实世界数据相比的近似质量。此外,我们还研究了模型简化的影响是否会随着车辆加速极限的不同余量而变化。由此,我们推断出在特定应用中,车辆模型可以简化到何种程度以评估控制算法。用于验证仿真环境的真实世界数据来自2023年6月在意大利蒙扎国家赛车场举行的印第自主挑战赛,显示了TUM Autonomous Motorsport的最快全自动圈速,车辆速度达到267公里每小时,侧向加速度高达15米每秒平方。
cs.RO / 41 / 2602.08116
From Ellipsoids to Midair Control of Dynamic Hitches
从椭球体到动态缠结的空中控制
Abstract
The ability to dynamically manipulate interaction between cables, each carried by a pair of aerial vehicles attached to its ends, can greatly improve the versatility and agility of cable-assisted aerial manipulation. Such interlacing cables create hitches by winding two or more cables around each other, which can enclose payloads or can further develop into knots. Dynamic modeling and control of such hitches is key to mastering inter-cable manipulation in the context of cable-suspended aerial manipulation. This paper introduces an ellipsoid-based kinematic model to connect the geometric nature of a hitch created by two cables and the dynamics of the hitch driven by four aerial vehicles, which reveals the control-affine form of the system. As the constraint for maintaining tension of a cable is also control-affine, we design a quadratic programming-based controller that combines Control Lyapunov and High-Order Control Barrier Functions (CLF-HOCBF-QP) to precisely track a desired hitch position and system shape while enforcing safety constraints like cable tautness. We convert desired geometric reference configurations into target robot positions and introduce a composite error into the Lyapunov function to ensure a relative degree of one with respect to the input. Numerical simulations validate our approach, demonstrating stable, high-speed tracking of dynamic references.
Chinese Translation
动态操控由附着在每根电缆两端的空中飞行器对电缆之间的相互作用的能力,可以大大提高电缆辅助空中操控的灵活性和敏捷性。这种交错的电缆通过将两根或更多电缆缠绕在一起形成缠结,可以包裹货物或进一步发展成结。对这种缠结的动态建模和控制是掌握电缆悬挂空中操控中电缆间操控的关键。本文介绍了一种基于椭球体的运动学模型,以连接由两根电缆创建的缠结的几何特性与由四架空中飞行器驱动的缠结的动态,这揭示了系统的控制仿射形式。由于维持电缆张力的约束也是控制仿射的,我们设计了一种基于二次规划的控制器,结合了控制李雅普诺夫函数和高阶控制障碍函数(CLF-HOCBF-QP),以精确跟踪期望的缠结位置和系统形状,同时强制执行如电缆紧绷等安全约束。我们将期望的几何参考配置转换为目标机器人位置,并在李雅普诺夫函数中引入复合误差,以确保输入的相对度为一。数值仿真验证了我们的方法,展示了动态参考的稳定、高速跟踪。
cs.RO / 42 / 2602.08167
Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning
动作预测性具身推理的自监督引导
Abstract
Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision-rate metric over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.
Chinese Translation
具身思维链(Chain-of-Thought, CoT)推理显著增强了视觉-语言-行动(Vision-Language-Action, VLA)模型,但当前方法依赖于严格的模板来指定推理原语(例如,场景中的物体、高层次计划、结构性可供性)。这些模板可能迫使策略处理与关键行动预测信号无关的信息,从而造成瓶颈:没有成功的策略,我们无法验证推理质量;没有高质量的推理,我们无法构建稳健的策略。我们引入了R&B-EnCoRe,它通过自我监督的精炼,使模型能够从互联网规模的知识中引导具身推理。通过将推理视为重要性加权变分推理中的潜变量,模型可以生成并提炼出具身特定策略的精炼推理训练数据集,而无需外部奖励、验证者或人工标注。我们在操控(Franka Panda仿真,WidowX硬件)、腿部导航(双足、轮式、自行车、四足)和自主驾驶具身体中验证了R&B-EnCoRe,使用了具有1B、4B、7B和30B参数的各种VLA架构。我们的方法在操控成功率上提高了28%,在导航评分上提升了101%,在碰撞率指标上减少了21%,相较于那些无差别推理所有可用原语的模型。R&B-EnCoRe使模型能够提炼出与成功控制相关的预测性推理,绕过了手动标注工程,同时将互联网规模的知识与物理执行相结合。
cs.RO / 43 / 2602.08189
Chamelion: Reliable Change Detection for Long-Term LiDAR Mapping in Transient Environments
Chamelion:瞬态环境中长期 LiDAR 建图的可靠变化检测
Abstract
Online change detection is crucial for mobile robots to efficiently navigate through dynamic environments. Detecting changes in transient settings, such as active construction sites or frequently reconfigured indoor spaces, is particularly challenging due to frequent occlusions and spatiotemporal variations. Existing approaches often struggle to detect changes and fail to update the map across different observations. To address these limitations, we propose a dual-head network designed for online change detection and long-term map maintenance. A key difficulty in this task is the collection and alignment of real-world data, as manually registering structural differences over time is both labor-intensive and often impractical. To overcome this, we develop a data augmentation strategy that synthesizes structural changes by importing elements from different scenes, enabling effective model training without the need for extensive ground-truth annotations. Experiments conducted at real-world construction sites and in indoor office environments demonstrate that our approach generalizes well across diverse scenarios, achieving efficient and accurate map updates. Our source code and additional material are available at: https://chamelion-pages.github.io/.
Chinese Translation
在线变化检测对于移动机器人在动态环境中高效导航至关重要。在瞬态环境中(如活跃的建筑工地或频繁重新配置的室内空间)检测变化尤其具有挑战性,因为这些环境中存在频繁的遮挡和时空变化。现有方法往往难以检测变化,并且无法在不同观察之间更新地图。为了解决这些局限性,我们提出了一种双头网络,旨在进行在线变化检测和长期地图维护。该任务的一个关键难点在于收集和对齐真实世界数据,因为手动注册结构差异既劳动密集又往往不切实际。为此,我们开发了一种数据增强策略,通过从不同场景中导入元素来合成结构变化,使得模型训练能够有效进行,而无需大量的真实标注。我们在真实建筑工地和室内办公环境中进行的实验表明,我们的方法在多种场景中具有良好的泛化能力,实现了高效且准确的地图更新。我们的源代码和其他材料可在:https://chamelion-pages.github.io/ 获取。
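The data augmentation idea above — synthesizing changes by importing elements from other scenes — reduces, in its simplest form, to grafting a donor object's points into a base scan and labeling the grafted points as changed. The sketch below is an assumed simplification with random point sets, not the paper's augmentation pipeline.

```python
import numpy as np

def synthesize_change(scene, donor_object, translation):
    """Data-augmentation sketch: graft an object's points from a donor scene
    into another scene at a chosen location, labeling grafted points as
    'changed' (1) and the original scene points as 'unchanged' (0). Yields
    supervised change labels without manual annotation."""
    moved = donor_object + np.asarray(translation)
    points = np.vstack([scene, moved])
    labels = np.concatenate([np.zeros(len(scene)), np.ones(len(moved))])
    return points, labels

scene = np.random.default_rng(1).uniform(0, 10, size=(100, 3))  # base scan
crate = np.random.default_rng(2).uniform(0, 1, size=(20, 3))    # donor object
points, labels = synthesize_change(scene, crate, translation=(5.0, 5.0, 0.0))
```

Every synthesized pair comes with exact change labels by construction, which is what lets the model train without hand-registered before/after scans.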
cs.RO / 44 / 2602.08245
STEP: Warm-Started Visuomotor Policies with Spatiotemporal Consistency Prediction
STEP:具有时空一致性预测的温启动视觉运动策略
Abstract
Diffusion policies have recently emerged as a powerful paradigm for visuomotor control in robotic manipulation due to their ability to model the distribution of action sequences and capture multimodality. However, iterative denoising leads to substantial inference latency, limiting control frequency in real-time closed-loop systems. Existing acceleration methods either reduce sampling steps, bypass diffusion through direct prediction, or reuse past actions, but often struggle to jointly preserve action quality and achieve consistently low latency. In this work, we propose STEP, a lightweight spatiotemporal consistency prediction mechanism to construct high-quality warm-start actions that are both distributionally close to the target action and temporally consistent, without compromising the generative capability of the original diffusion policy. Then, we propose a velocity-aware perturbation injection mechanism that adaptively modulates actuation excitation based on temporal action variation to prevent execution stall especially for real-world tasks. We further provide a theoretical analysis showing that the proposed prediction induces a locally contractive mapping, ensuring convergence of action errors during diffusion refinement. We conduct extensive evaluations on nine simulated benchmarks and two real-world tasks. Notably, STEP with 2 steps can achieve an average 21.6% and 27.5% higher success rate than BRIDGER and DDIM on the RoboMimic benchmark and real-world tasks, respectively. These results demonstrate that STEP consistently advances the Pareto frontier of inference latency and success rate over existing methods.
Chinese Translation
扩散策略因其能够建模动作序列的分布并捕捉多模态性,最近成为机器人操控中视觉运动控制的强大范式。然而,迭代去噪导致显著的推理延迟,限制了实时闭环系统中的控制频率。现有的加速方法要么减少采样步骤,要么通过直接预测绕过扩散,或重用过去的动作,但通常难以共同保持动作质量并实现持续低延迟。在本研究中,我们提出了STEP,一种轻量级的时空一致性预测机制,用于构建高质量的温启动动作,这些动作在分布上接近目标动作并且在时间上保持一致,而不妨碍原始扩散策略的生成能力。然后,我们提出了一种基于速度的扰动注入机制,该机制根据时间动作变化自适应调节驱动激励,以防止执行停滞,特别是在现实世界任务中。我们进一步提供了理论分析,表明所提出的预测引入了局部收缩映射,确保在扩散精炼过程中动作误差的收敛。我们在九个模拟基准和两个现实世界任务上进行了广泛评估。值得注意的是,使用2步的STEP在RoboMimic基准和现实世界任务中,成功率分别比BRIDGER和DDIM高出平均21.6%和27.5%。这些结果表明,STEP在推理延迟和成功率的Pareto前沿上始终优于现有方法。
cs.RO / 45 / 2602.08251
Aerial Manipulation with Contact-Aware Onboard Perception and Hybrid Control
具有接触感知的机载感知与混合控制的空中操控
Abstract
Aerial manipulation (AM) promises to move Unmanned Aerial Vehicles (UAVs) beyond passive inspection to contact-rich tasks such as grasping, assembly, and in-situ maintenance. Most prior AM demonstrations rely on external motion capture (MoCap) and emphasize position control for coarse interactions, limiting deployability. We present a fully onboard perception-control pipeline for contact-rich AM that achieves accurate motion tracking and regulated contact wrenches without MoCap. The main components are (1) an augmented visual-inertial odometry (VIO) estimator with contact-consistency factors that activate only during interaction, tightening uncertainty around the contact frame and reducing drift, and (2) image-based visual servoing (IBVS) to mitigate perception-control coupling, together with a hybrid force-motion controller that regulates contact wrenches and lateral motion for stable contact. Experiments show that our approach closes the perception-to-wrench loop using only onboard sensing, yielding an velocity estimation improvement of 66.01% at contact, reliable target approach, and stable force holding-pointing toward deployable, in-the-wild aerial manipulation.
Chinese Translation
空中操控(AM)有望使无人机(UAV)超越被动检查,执行抓取、组装和现场维护等接触丰富的任务。大多数先前的AM演示依赖于外部运动捕捉(MoCap),并强调位置控制以实现粗略交互,这限制了其可部署性。我们提出了一种完全基于机载的感知-控制管道,用于接触丰富的AM,该管道在没有MoCap的情况下实现了准确的运动跟踪和受控的接触扭矩。主要组件包括(1)具有接触一致性因子的增强视觉惯性里程计(VIO)估计器,该因子仅在交互期间激活,从而收紧接触框架周围的不确定性并减少漂移,以及(2)基于图像的视觉伺服(IBVS),以减轻感知-控制耦合,结合混合力-运动控制器,调节接触扭矩和横向运动以实现稳定接触。实验表明,我们的方法仅使用机载传感器闭合感知到扭矩的回路,在接触时实现了66.01%的速度估计提升,可靠的目标接近,以及稳定的力保持,指向可部署的野外空中操控。
cs.RO / 46 / 2602.08266
Informative Object-centric Next Best View for Object-aware 3D Gaussian Splatting in Cluttered Scenes
以对象为中心的信息性下一最佳视角:用于杂乱场景中对象感知的3D高斯溅射
Abstract
In cluttered scenes with inevitable occlusions and incomplete observations, selecting informative viewpoints is essential for building a reliable representation. In this context, 3D Gaussian Splatting (3DGS) offers a distinct advantage, as it can explicitly guide the selection of subsequent viewpoints and then refine the representation with new observations. However, existing approaches rely solely on geometric cues, neglect manipulation-relevant semantics, and tend to prioritize exploitation over exploration. To tackle these limitations, we introduce an instance-aware Next Best View (NBV) policy that prioritizes underexplored regions by leveraging object features. Specifically, our object-aware 3DGS distills instance-level information into one-hot object vectors, which are used to compute confidence-weighted information gain that guides the identification of regions associated with erroneous and uncertain Gaussians. Furthermore, our method can be easily adapted to an object-centric NBV, which focuses view selection on a target object, thereby improving reconstruction robustness to object placement. Experiments demonstrate that our NBV policy reduces depth error by up to 77.14% on the synthetic dataset and 34.10% on the real-world GraspNet dataset compared to baselines. Moreover, compared to targeting the entire scene, performing NBV on a specific object yields an additional reduction of 25.60% in depth error for that object. We further validate the effectiveness of our approach through real-world robotic manipulation tasks.
Chinese Translation
在不可避免的遮挡和不完整观测的杂乱场景中,选择信息丰富的视角对于构建可靠的表示至关重要。在这种背景下,3D高斯溅射(3D Gaussian Splatting, 3DGS)提供了显著的优势,因为它可以明确指导后续视角的选择,并利用新观测来优化表示。然而,现有方法仅依赖几何线索,忽视了与操作相关的语义,并倾向于优先进行开发而非探索。为了解决这些局限性,我们提出了一种实例感知的下一最佳视角(Next Best View, NBV)策略,通过利用对象特征优先考虑未充分探索的区域。具体而言,我们的对象感知3DGS将实例级信息提炼为独热编码对象向量,这些向量用于计算置信加权的信息增益,从而指导识别与错误和不确定高斯分布相关的区域。此外,我们的方法可以轻松适应为面向对象的NBV,专注于目标对象的视角选择,从而提高对对象放置的重建鲁棒性。实验表明,与基线相比,我们的NBV策略在合成数据集上将深度误差降低了多达77.14%,在真实世界的GraspNet数据集上降低了34.10%。此外,与针对整个场景相比,对特定对象进行NBV可使该对象的深度误差额外降低25.60%。我们进一步通过真实世界的机器人操作任务验证了我们方法的有效性。
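The selection rule sketched by the abstract — weight each view's information gain by how uncertain the covered regions are, then take the arg-max — can be written compactly. This is a hypothetical simplification: real confidence values would come from the object-aware 3DGS, whereas the arrays below are hand-picked.

```python
import numpy as np

def select_next_best_view(gain_maps, confidences):
    """NBV sketch: per-view information-gain maps are weighted by
    (1 - confidence) so that views covering uncertain or erroneous regions
    score highest; the arg-max view is selected."""
    scores = [float(np.sum((1.0 - confidences) * g)) for g in gain_maps]
    return int(np.argmax(scores)), scores

confidences = np.array([0.9, 0.9, 0.2, 0.2])  # last two regions are uncertain
gain_maps = [
    np.array([1.0, 1.0, 0.0, 0.0]),           # view 0 re-observes confident regions
    np.array([0.0, 0.0, 1.0, 1.0]),           # view 1 covers uncertain regions
]
best_view, scores = select_next_best_view(gain_maps, confidences)
```

Even though both candidate views cover the same raw amount of scene, the confidence weighting steers selection toward the underexplored regions, which is the exploration bias the abstract argues existing geometric-only methods lack.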
cs.RO / 47 / 2602.08278
DexFormer: Cross-Embodied Dexterous Manipulation via History-Conditioned Transformer
DexFormer:通过历史条件变换器实现跨体现灵巧操控
Abstract
Dexterous manipulation remains one of the most challenging problems in robotics, requiring coherent control of high-DoF hands and arms under complex, contact-rich dynamics. A major barrier is embodiment variability: different dexterous hands exhibit distinct kinematics and dynamics, forcing prior methods to train separate policies or rely on shared action spaces with per-embodiment decoder heads. We present DexFormer, an end-to-end, dynamics-aware cross-embodiment policy built on a modified transformer backbone that conditions on historical observations. By using temporal context to infer morphology and dynamics on the fly, DexFormer adapts to diverse hand configurations and produces embodiment-appropriate control actions. Trained over a variety of procedurally generated dexterous-hand assets, DexFormer acquires a generalizable manipulation prior and exhibits strong zero-shot transfer to Leap Hand, Allegro Hand, and Rapid Hand. Our results show that a single policy can generalize across heterogeneous hand embodiments, establishing a scalable foundation for cross-embodiment dexterous manipulation. Project website: https://davidlxu.github.io/DexFormer-web/.
Chinese Translation
灵巧操控仍然是机器人领域中最具挑战性的问题之一,要求在复杂的接触丰富动态下对高自由度的手和臂进行连贯控制。一个主要障碍是体现的变异性:不同的灵巧手展现出不同的运动学和动力学特性,迫使先前的方法训练独立的策略或依赖于具有每个体现解码头的共享动作空间。我们提出了DexFormer,这是一种端到端的、动态感知的跨体现策略,基于修改过的变换器骨干网络,并以历史观察为条件。通过使用时间上下文动态推断形态和动力学,DexFormer能够适应多样的手部配置,并生成适合体现的控制动作。在各种程序生成的灵巧手资产上进行训练后,DexFormer获得了可泛化的操控先验,并在Leap Hand、Allegro Hand和Rapid Hand上展现出强大的零样本迁移能力。我们的结果表明,单一策略可以在异构手体现之间进行泛化,为跨体现灵巧操控奠定了可扩展的基础。项目网站:https://davidlxu.github.io/DexFormer-web/
cs.RO / 48 / 2602.08285
ReefFlex: A Generative Design Framework for Soft Robotic Grasping of Organic and Fragile objects
ReefFlex:一种用于软体机器人抓取有机和脆弱物体的生成设计框架
Abstract
Climate change, invasive species and human activities are currently damaging the world's coral reefs at unprecedented rates, threatening their vast biodiversity and fisheries, and reducing coastal protection. Solving this vast challenge requires scalable coral regeneration technologies that can breed climate-resilient species and accelerate the natural regrowth processes; actions that are impeded by the absence of safe and robust tools to handle the fragile coral. We investigate ReefFlex, a generative soft finger design methodology that explores a diverse space of soft fingers to produce a set of candidates capable of safely grasping fragile and geometrically heterogeneous coral in a cluttered environment. Our key insight is encoding heterogeneous grasping into a reduced set of motion primitives, creating a simplified, tractable multi-objective optimisation problem. To evaluate the method, we design a soft robot for reef rehabilitation, which grows and manipulates coral in onshore aquaculture facilities for future reef out-planting. We demonstrate that ReefFlex increases both grasp success and grasp quality (disturbance resistance, positioning accuracy) and reduces adverse events encountered during coral manipulation compared to reference designs. ReefFlex offers a generalisable method to design soft end-effectors for complex handling and paves a pathway towards automation in previously unachievable domains like coral handling for restoration.
Chinese Translation
气候变化、入侵物种和人类活动正在以前所未有的速度破坏世界的珊瑚礁,威胁其丰富的生物多样性和渔业,并减少沿海保护。解决这一巨大挑战需要可扩展的珊瑚再生技术,以培育气候适应性物种并加速自然再生过程;而这一过程受到缺乏安全且稳健的工具来处理脆弱珊瑚的阻碍。我们研究了ReefFlex,一种生成式软指设计方法,探索多样化的软指空间,以产生一组能够在杂乱环境中安全抓取脆弱且几何异质的珊瑚的候选者。我们的关键见解是将异质抓取编码为一组简化的运动原语,从而创建一个简化且可处理的多目标优化问题。为了评估该方法,我们设计了一种用于珊瑚礁修复的软体机器人,该机器人在岸上水产养殖设施中生长和操控珊瑚,以便未来进行珊瑚的移植。我们证明了ReefFlex在抓取成功率和抓取质量(干扰抗性、定位准确性)方面均有所提升,并减少了在珊瑚操控过程中遇到的不利事件,相较于参考设计。ReefFlex提供了一种可推广的方法,用于设计复杂处理的软末端执行器,并为在珊瑚处理等以前无法实现的领域实现自动化铺平了道路。
cs.RO / 49 / 2602.08298
Benchmarking Autonomous Vehicles: A Driver Foundation Model Framework
自主车辆基准测试:驾驶员基础模型框架
Abstract
Autonomous vehicles (AVs) are poised to revolutionize global transportation systems. However, their widespread acceptance and market penetration remain significantly below expectations. This gap is primarily driven by persistent challenges in safety, comfort, commuting efficiency and energy economy when compared to the performance of experienced human drivers. We hypothesize that these challenges can be addressed through the development of a driver foundation model (DFM). Accordingly, we propose a framework for establishing DFMs to comprehensively benchmark AVs. Specifically, we describe a large-scale dataset collection strategy for training a DFM, discuss the core functionalities such a model should possess, and explore potential technical solutions to realize these functionalities. We further present the utility of the DFM across the operational spectrum, from defining human-centric safety envelopes to establishing benchmarks for energy economy. Overall, we aim to formalize the DFM concept and introduce a new paradigm for the systematic specification, verification and validation of AVs.
Chinese Translation
自主车辆(AVs)有望彻底改变全球交通系统。然而,其广泛接受度和市场渗透率仍显著低于预期。这一差距主要源于与经验丰富的人类驾驶员的表现相比,AVs在安全性、舒适性、通勤效率和能源经济性方面面临的持续挑战。我们假设这些挑战可以通过开发驾驶员基础模型(DFM)来解决。因此,我们提出了一个建立DFM的框架,以全面基准测试AVs。具体而言,我们描述了一种用于训练DFM的大规模数据集收集策略,讨论了该模型应具备的核心功能,并探讨实现这些功能的潜在技术解决方案。我们进一步展示了DFM在整个运行范围内的实用性,从定义以人为本的安全边界到建立能源经济性基准。总体而言,我们旨在形式化DFM概念,并为AVs的系统化规范、验证与确认引入一种新范式。
cs.RO / 50 / 2602.08326
Personalized Autonomous Driving via Optimal Control with Clearance Constraints from Questionnaires
基于问卷获取间距约束的最优控制实现个性化自主驾驶
Abstract
Driving without considering the preferred separation distance from surrounding vehicles may cause discomfort for users. To address this limitation, we propose a planning framework that explicitly incorporates user preferences regarding the desired level of safe clearance from surrounding vehicles. We design a questionnaire purposefully tailored to capture user preferences relevant to our framework, while minimizing unnecessary questions. Specifically, the questionnaire considers various interaction-relevant factors, including the size, speed, position, and maneuvers of surrounding vehicles, as well as the maneuvers of the ego vehicle. The responses indicate the user-preferred clearance for the scenario defined by each question and are incorporated as constraints in the optimal control problem. However, it is impractical to account for all possible scenarios that may arise in a driving environment within a single optimal control problem, as the resulting computational complexity renders real-time implementation infeasible. To overcome this limitation, we approximate the original problem by decomposing it into multiple subproblems, each dealing with one fixed scenario. We then solve these subproblems in parallel and select one using the cost function from the original problem. To validate our work, we conduct simulations using different user responses to the questionnaire. We assess how effectively our planner reflects user preferences compared to preference-agnostic baseline planners by measuring preference alignment.
Chinese Translation
在驾驶过程中,如果不考虑与周围车辆的理想间距,可能会导致用户的不适。为了解决这一局限性,我们提出了一种规划框架,明确纳入用户对与周围车辆保持安全间距程度的偏好。我们设计了一份专门针对该框架的问卷,以捕捉相关的用户偏好,同时尽量减少不必要的问题。具体而言,问卷考虑了多种与交互相关的因素,包括周围车辆的大小、速度、位置及其机动动作,以及自车的机动动作。用户的回答指示了问题所定义情境下的理想间距,并作为约束条件纳入最优控制问题中。然而,在单一的最优控制问题中考虑驾驶环境中所有可能出现的情境是不切实际的,因为由此产生的计算复杂性使得实时实施不可行。为克服这一限制,我们将原始问题分解为多个子问题来近似处理,每个子问题处理一个固定的情境。然后,我们并行求解这些子问题,并使用原始问题的成本函数从中选择一个。为了验证我们的工作,我们使用不同的用户问卷回答进行仿真。我们通过测量偏好一致性,评估我们的规划器相较于与偏好无关的基线规划器在多大程度上反映了用户偏好。
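The decomposition described above (one fixed-scenario subproblem each, solved in parallel, with the winner chosen by the original problem's cost) can be sketched as follows; the quadratic "solver" and the `risk` field are placeholders, not the paper's actual optimal control formulation:

```python
from concurrent.futures import ThreadPoolExecutor

def solve_subproblem(scenario, clearance):
    """Stand-in for one fixed-scenario optimal control solve.

    Returns (trajectory, cost); here the 'solve' is a toy quadratic
    so the structure stays runnable (the real subproblem is an OCP
    with the user-preferred clearance as a hard constraint).
    """
    traj = [clearance] * 5                        # placeholder trajectory
    cost = (clearance - 2.0) ** 2 + scenario["risk"]
    return traj, cost

def plan(scenarios, preferred_clearance):
    # solve all fixed-scenario subproblems in parallel ...
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda s: solve_subproblem(s, preferred_clearance), scenarios))
    # ... then select one candidate via the original problem's cost
    return min(results, key=lambda r: r[1])

scenarios = [{"risk": 0.5}, {"risk": 0.1}, {"risk": 0.9}]
best_traj, best_cost = plan(scenarios, preferred_clearance=2.5)
```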
cs.RO / 51 / 2602.08328
Controlled Flight of an Insect-Scale Flapping-Wing Robot via Integrated Onboard Sensing and Computation
通过集成的机载传感与计算实现昆虫尺度拍翼机器人的受控飞行
Abstract
Aerial insects can effortlessly navigate dense vegetation, whereas similarly sized aerial robots typically depend on offboard sensors and computation to maintain stable flight. This disparity restricts insect-scale robots to operation within motion capture environments, substantially limiting their applicability to tasks such as search-and-rescue and precision agriculture. In this work, we present a 1.29-gram aerial robot capable of hovering and tracking trajectories with solely onboard sensing and computation. The combination of a sensor suite, estimators, and a low-level controller achieved centimeter-scale positional flight accuracy. Additionally, we developed a hierarchical controller in which a human operator provides high-level commands to direct the robot's motion. In a 30-second flight experiment conducted outside a motion capture system, the robot avoided obstacles and ultimately landed on a sunflower. This level of sensing and computational autonomy represents a significant advancement for the aerial microrobotics community, further opening opportunities to explore onboard planning and power autonomy.
Chinese Translation
空中昆虫能够轻松穿越密集的植被,而同样大小的空中机器人通常依赖机外(offboard)传感器和计算来维持稳定飞行。这一差异将昆虫尺度机器人限制在运动捕捉环境内运行,显著降低了它们在搜索与救援、精准农业等任务中的适用性。在本研究中,我们展示了一种重1.29克的空中机器人,能够仅通过机载传感与计算实现悬停和轨迹跟踪。传感器组、估计器和低级控制器的组合实现了厘米级的位置飞行精度。此外,我们开发了一个层级控制器,其中人类操作员提供高层指令以引导机器人的运动。在一次在运动捕捉系统外进行的30秒飞行实验中,该机器人成功避开障碍物并最终降落在一朵向日葵上。这种传感与计算自主性的水平代表了空中微型机器人领域的一项重大进展,进一步开辟了探索机载规划和能量自主性的机会。
cs.RO / 52 / 2602.08334
Vec-QMDP: Vectorized POMDP Planning on CPUs for Real-Time Autonomous Driving
Vec-QMDP:基于CPU的实时自主驾驶向量化POMDP规划
Abstract
Planning under uncertainty for real-world robotics tasks, such as autonomous driving, requires reasoning in enormous high-dimensional belief spaces, rendering the problem computationally intensive. While parallelization offers scalability, existing hybrid CPU-GPU solvers face critical bottlenecks due to host-device synchronization latency and branch divergence on SIMT architectures, limiting their utility for real-time planning and hindering real-robot deployment. We present Vec-QMDP, a CPU-native parallel planner that aligns POMDP search with modern CPUs' SIMD architecture, achieving $227\times$--$1073\times$ speedup over state-of-the-art serial planners. Vec-QMDP adopts a Data-Oriented Design (DOD), refactoring scattered, pointer-based data structures into contiguous, cache-efficient memory layouts. We further introduce a hierarchical parallelism scheme: distributing sub-trees across independent CPU cores and SIMD lanes, enabling fully vectorized tree expansion and collision checking. Efficiency is maximized with the help of UCB load balancing across trees and a vectorized STR-tree for coarse-level collision checking. Evaluated on large-scale autonomous driving benchmarks, Vec-QMDP achieves state-of-the-art planning performance with millisecond-level latency, establishing CPUs as a high-performance computing platform for large-scale planning under uncertainty.
Chinese Translation
在不确定性下进行规划以应对现实世界的机器人任务(如自主驾驶),需要在巨大的高维信念空间中进行推理,这使得问题计算密集。虽然并行化提供了可扩展性,但现有的混合CPU-GPU求解器由于主机与设备之间的同步延迟和SIMT架构下的分支发散,面临严重瓶颈,限制了其在实时规划中的实用性,并阻碍了真实机器人部署。我们提出了Vec-QMDP,这是一种原生于CPU的并行规划器,它将POMDP搜索与现代CPU的SIMD架构对齐,实现了相较于最先进的串行规划器$227\times$至$1073\times$的加速。Vec-QMDP采用数据导向设计(Data-Oriented Design, DOD),将分散的基于指针的数据结构重构为连续的、缓存高效的内存布局。我们进一步引入了一种分层并行方案:将子树分配到独立的CPU核心和SIMD通道,支持完全向量化的树扩展和碰撞检测。借助跨树的UCB负载均衡和用于粗级别碰撞检测的向量化STR树,效率得以最大化。在大规模自主驾驶基准测试中,Vec-QMDP以毫秒级延迟实现了最先进的规划性能,确立了CPU作为大规模不确定性规划的高性能计算平台的地位。
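The QMDP value of a belief, $Q(b,a) = \sum_s b(s)\,Q(s,a)$, is the core quantity such a planner evaluates. A minimal NumPy sketch of the data-oriented layout idea, with sizes chosen arbitrarily (the actual planner vectorizes tree expansion and collision checking as well, not just this backup):

```python
import numpy as np

# Data-Oriented Design: store Q-values as one contiguous
# (num_states x num_actions) array instead of pointer-linked nodes,
# so the QMDP belief backup becomes a single vectorized product.
rng = np.random.default_rng(0)
num_states, num_actions = 1000, 5
Q = rng.standard_normal((num_states, num_actions))   # Q(s, a)
belief = rng.random(num_states)
belief /= belief.sum()                               # b(s), sums to 1

# QMDP approximation: Q(b, a) = sum_s b(s) * Q(s, a)
q_belief = belief @ Q                                # shape (num_actions,)
best_action = int(np.argmax(q_belief))
```

On a contiguous layout like this, the backup maps directly onto SIMD lanes; a pointer-chasing tree of per-node dictionaries would defeat that vectorization.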
cs.RO / 53 / 2602.08370
Learning Human-Like Badminton Skills for Humanoid Robots
为类人机器人学习类人羽毛球技能
Abstract
Realizing versatile and human-like performance in high-demand sports like badminton remains a formidable challenge for humanoid robotics. Unlike standard locomotion or static manipulation, this task demands a seamless integration of explosive whole-body coordination and precise, timing-critical interception. While recent advances have achieved lifelike motion mimicry, bridging the gap between kinematic imitation and functional, physics-aware striking without compromising stylistic naturalness is non-trivial. To address this, we propose Imitation-to-Interaction, a progressive reinforcement learning framework designed to evolve a robot from a "mimic" to a capable "striker." Our approach establishes a robust motor prior from human data, distills it into a compact, model-based state representation, and stabilizes dynamics via adversarial priors. Crucially, to overcome the sparsity of expert demonstrations, we introduce a manifold expansion strategy that generalizes discrete strike points into a dense interaction volume. We validate our framework through the mastery of diverse skills, including lifts and drop shots, in simulation. Furthermore, we demonstrate the first zero-shot sim-to-real transfer of anthropomorphic badminton skills to a humanoid robot, successfully replicating the kinetic elegance and functional precision of human athletes in the physical world.
Chinese Translation
在羽毛球等高要求运动中实现多样化和类人的表现,仍然是类人机器人领域面临的巨大挑战。与标准的移动或静态操作不同,这项任务要求将爆发性的全身协调与精确、时机关键的拦截无缝整合。尽管最近的进展已经实现了逼真的运动模仿,但在不牺牲风格自然性的情况下,弥合运动学模仿与功能性、物理感知的击球之间的差距并非易事。为了解决这个问题,我们提出了模仿到交互(Imitation-to-Interaction),一个渐进式强化学习框架,旨在将机器人从“模仿者”发展为一名胜任的“击球手”。我们的方法从人类数据中建立了一个稳健的运动先验,将其蒸馏为紧凑的基于模型的状态表示,并通过对抗先验来稳定动态。关键是,为了克服专家演示的稀疏性,我们引入了一种流形扩展策略,将离散的击球点推广为一个密集的交互体积。我们在仿真中通过掌握包括挑球和吊球在内的多种技能验证了我们的框架。此外,我们展示了首次将拟人羽毛球技能零样本地从仿真迁移到类人机器人上,成功在物理世界中复现了人类运动员的动作优雅与功能精确。
cs.RO / 54 / 2602.08392
BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models
BiManiBench:评估多模态大型语言模型双手协调的分层基准
Abstract
Multimodal Large Language Models (MLLMs) have significantly advanced embodied AI, and using them to benchmark robotic intelligence has become a pivotal trend. However, existing frameworks remain predominantly confined to single-arm manipulation, failing to capture the spatio-temporal coordination required for bimanual tasks like lifting a heavy pot. To address this, we introduce BiManiBench, a hierarchical benchmark evaluating MLLMs across three tiers: fundamental spatial reasoning, high-level action planning, and low-level end-effector control. Our framework isolates unique bimanual challenges, such as arm reachability and kinematic constraints, thereby distinguishing perceptual hallucinations from planning failures. Analysis of over 30 state-of-the-art models reveals that despite high-level reasoning proficiency, MLLMs struggle with dual-arm spatial grounding and control, frequently resulting in mutual interference and sequencing errors. These findings suggest the current paradigm lacks a deep understanding of mutual kinematic constraints, highlighting the need for future research to focus on inter-arm collision-avoidance and fine-grained temporal sequencing.
Chinese Translation
多模态大型语言模型(MLLMs)在具身人工智能领域取得了显著进展,利用它们对机器人智能进行基准测试已成为一项关键趋势。然而,现有框架主要局限于单臂操作,未能捕捉到双手任务(如抬起重锅)所需的时空协调。为了解决这一问题,我们提出了BiManiBench,这是一个分层基准,评估MLLMs在三个层次上的表现:基础空间推理、高级行动规划和低级末端执行器控制。我们的框架隔离了独特的双手挑战,如手臂可达性和运动学约束,从而将感知幻觉与规划失败区分开来。对30多个最先进模型的分析表明,尽管在高级推理方面表现出色,MLLMs在双臂空间定位和控制方面仍然存在困难,常常导致相互干扰和排序错误。这些发现表明,当前范式缺乏对相互运动学约束的深入理解,强调了未来研究应聚焦于双臂碰撞避免和细粒度时间排序的必要性。
cs.RO / 55 / 2602.08417
Graph-Loc: Robust Graph-Based LiDAR Pose Tracking with Compact Structural Map Priors under Low Observability and Occlusion
Graph-Loc:在低可观测性和遮挡条件下,基于图的鲁棒性LiDAR姿态跟踪与紧凑结构地图先验
Abstract
Map-based LiDAR pose tracking is essential for long-term autonomous operation, where onboard map priors need to be compact for scalable storage and fast retrieval, while online observations are often partial, repetitive, and heavily occluded. We propose Graph-Loc, a graph-based localization framework that tracks the platform pose against compact structural map priors represented as a lightweight point-line graph. Such priors can be constructed from heterogeneous sources commonly available in practice, including polygon outlines vectorized from occupancy/grid maps and CAD/model/floor-plan layouts. For each incoming LiDAR scan, Graph-Loc extracts sparse point and line primitives to form an observation graph, retrieves a pose-conditioned visible subgraph via LiDAR ray simulation, and performs scan-to-map association through unbalanced optimal transport with a local graph-context regularizer. The unbalanced formulation relaxes mass conservation, improving robustness to missing, spurious, and fragmented structures under occlusion. To enhance stability in low-observability segments, we estimate information anisotropy from the refinement normal matrix and defer updates along weakly constrained directions until sufficient constraints reappear. Experiments on public benchmarks, controlled stress tests, and real-world deployments demonstrate accurate and stable tracking with KB-level priors from heterogeneous map sources, including under geometrically degenerate and sustained occlusion and in the presence of gradual scene changes.
Chinese Translation
基于地图的LiDAR姿态跟踪对于长期自主运行至关重要,其中机载地图先验需要紧凑以便于可扩展存储和快速检索,而在线观测通常是部分的、重复的,并且受到严重遮挡。我们提出了Graph-Loc,一个基于图的定位框架,能够针对以轻量级点线图表示的紧凑结构地图先验跟踪平台姿态。这些先验可以从实践中常见的异构来源构建,包括从占用/栅格地图矢量化的多边形轮廓以及CAD/模型/平面图布局。对于每个输入的LiDAR扫描,Graph-Loc提取稀疏的点和线原语以形成观测图,通过LiDAR光线模拟检索姿态条件下的可见子图,并通过带有局部图上下文正则化项的不平衡最优传输进行扫描与地图的关联。不平衡的公式放宽了质量守恒,提高了在遮挡下对缺失、虚假和碎片化结构的鲁棒性。为了增强在低可观测性区段的稳定性,我们从精化(refinement)步骤的法方程矩阵中估计信息各向异性,并推迟沿弱约束方向的更新,直到重新出现足够的约束。在公共基准、受控压力测试和真实部署上的实验表明,使用来自异构地图源的KB级先验能够实现准确且稳定的跟踪,包括在几何退化和持续遮挡条件下,以及在场景逐渐变化的情况下。
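The scan-to-map association step relies on unbalanced optimal transport, which relaxes the usual mass-conservation constraint so that missing or spurious primitives can simply carry less mass. A minimal entropic-regularized sketch with KL-relaxed marginals (no graph-context regularizer, and all parameters are illustrative assumptions):

```python
import numpy as np

def unbalanced_sinkhorn(C, a, b, eps=0.05, rho=1.0, n_iter=200):
    """Entropic unbalanced OT between marginals a and b.

    KL relaxation of both marginals with strength rho: the scaling
    exponent f = rho / (rho + eps) < 1 lets mass be created or
    destroyed, which is what tolerates missing/spurious primitives.
    """
    K = np.exp(-C / eps)               # Gibbs kernel from cost matrix
    f = rho / (rho + eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):            # damped Sinkhorn scalings
        u = (a / (K @ v)) ** f
        v = (b / (K.T @ u)) ** f
    return u[:, None] * K * v[None, :]  # transport plan

# two scan primitives vs. three map primitives (one left unmatched)
C = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
P = unbalanced_sinkhorn(C, np.ones(2), np.ones(3))
match = P.argmax(axis=1)               # scan -> map association
```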
cs.RO / 56 / 2602.08421
Decentralized Intent-Based Multi-Robot Task Planner with LLM Oracles on Hyperledger Fabric
基于意图的去中心化多机器人任务规划器与Hyperledger Fabric上的LLM预言机
Abstract
Large language models (LLMs) have opened new opportunities for transforming natural language user intents into executable actions. This capability enables embodied AI agents to perform complex tasks, without involvement of an expert, making human-robot interaction (HRI) more convenient. However, these developments raise significant security and privacy challenges such as self-preferencing, where a single LLM service provider dominates the market and uses this power to promote their own preferences. LLM oracles have been recently proposed as a mechanism to decentralize LLMs by executing multiple LLMs from different vendors and aggregating their outputs to obtain a more reliable and trustworthy final result. However, the accuracy of these approaches highly depends on the aggregation method. The current aggregation methods mostly use semantic similarity between various LLM outputs, which is not suitable for robotic task planning, where the temporal order of tasks is important. To fill the gap, we propose an LLM oracle with a new aggregation method for robotic task planning. In addition, we propose a decentralized multi-robot infrastructure based on Hyperledger Fabric that can host the proposed oracle. The proposed infrastructure enables users to express their natural language intent to the system, which then can be decomposed into subtasks. These subtasks require coordinating different robots from different vendors, while enforcing fine-grained access control management on the data. To evaluate our methodology, we created the SkillChain-RTD benchmark and made it publicly available. Our experimental results demonstrate the feasibility of the proposed architecture, and the proposed aggregation method outperforms other aggregation methods currently in use.
Chinese Translation
大型语言模型(LLMs)为将自然语言用户意图转化为可执行动作开辟了新的机会。这一能力使得具身人工智能代理能够执行复杂任务,而无需专家的参与,从而使人机交互(HRI)变得更加便捷。然而,这些发展带来了显著的安全和隐私挑战,例如自我偏好现象,其中单一的LLM服务提供商主导市场并利用这一权力来推广自身的偏好。最近提出的LLM预言机作为一种机制,通过执行来自不同供应商的多个LLM并聚合其输出,以实现LLM的去中心化,从而获得更可靠和可信的最终结果。然而,这些方法的准确性高度依赖于聚合方法。目前的聚合方法主要使用不同LLM输出之间的语义相似性,这对于机器人任务规划并不适用,因为任务的时间顺序至关重要。为填补这一空白,我们提出了一种用于机器人任务规划的新聚合方法的LLM预言机。此外,我们还提出了一种基于Hyperledger Fabric的去中心化多机器人基础设施,以承载所提议的预言机。该基础设施使用户能够向系统表达其自然语言意图,系统随后可以将其分解为子任务。这些子任务需要协调来自不同供应商的不同机器人,同时对数据实施细粒度的访问控制管理。为了评估我们的方法论,我们创建了SkillChain-RTD基准并公开发布。我们的实验结果证明了所提架构的可行性,且所提聚合方法的性能优于当前使用的其他聚合方法。
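For order-sensitive aggregation, one simple scheme is pairwise-precedence voting across the plans returned by different LLMs; this sketch is an assumption for illustration, not necessarily the aggregation method the paper proposes:

```python
from itertools import combinations
from collections import defaultdict

def aggregate_plans(plans):
    """Order-aware aggregation of task plans from multiple LLMs.

    Counts, for every pair of subtasks, how many plans put one before
    the other, then orders tasks by their total precedence wins
    (a simple voting scheme stated here as an assumption).
    """
    votes = defaultdict(int)
    tasks = set()
    for plan in plans:
        tasks.update(plan)
        for i, j in combinations(range(len(plan)), 2):
            votes[(plan[i], plan[j])] += 1   # plan[i] precedes plan[j]
    score = {t: sum(votes[(t, u)] for u in tasks if u != t) for t in tasks}
    return sorted(tasks, key=lambda t: -score[t])

plans = [
    ["pick", "move", "place"],
    ["pick", "move", "place"],
    ["move", "pick", "place"],   # one dissenting LLM
]
consensus = aggregate_plans(plans)
```

Unlike semantic-similarity aggregation, this explicitly preserves temporal order: the dissenting plan is outvoted on the "pick before move" pair rather than averaged away.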
cs.RO / 57 / 2602.08425
Bi-Adapt: Few-shot Bimanual Adaptation for Novel Categories of 3D Objects via Semantic Correspondence
Bi-Adapt:通过语义对应实现新类别3D物体的少样本双手适应
Abstract
Bimanual manipulation is imperative yet challenging for robots to execute complex tasks, requiring coordinated collaboration between two arms. However, existing methods for bimanual manipulation often rely on costly data collection and training, struggling to generalize to unseen objects in novel categories efficiently. In this paper, we present Bi-Adapt, a novel framework designed for efficient generalization for bimanual manipulation via semantic correspondence. Bi-Adapt achieves cross-category affordance mapping by leveraging the strong capability of vision foundation models. Fine-tuning with restricted data on novel categories, Bi-Adapt exhibits notable generalization to out-of-category objects in a zero-shot manner. Extensive experiments conducted in both simulation and real-world environments validate the effectiveness of our approach and demonstrate its high efficiency, achieving a high success rate on different benchmark tasks across novel categories with limited data. Project website: https://biadapt-project.github.io/
Chinese Translation
双手操作对机器人执行复杂任务至关重要,但也充满挑战,要求两个手臂之间进行协调合作。然而,现有的双手操作方法通常依赖于昂贵的数据收集和训练,难以有效地推广到新类别中的未见物体。本文提出了Bi-Adapt,一个旨在通过语义对应实现双手操作高效泛化的新框架。Bi-Adapt通过利用视觉基础模型的强大能力,实现跨类别的可供性(affordance)映射。通过对新类别的有限数据进行微调,Bi-Adapt在零样本的情况下对类别外物体表现出显著的泛化能力。在仿真和真实环境中进行的大量实验验证了我们方法的有效性,并展示了其高效率,在新类别的不同基准任务中以有限数据实现了高成功率。项目网站:https://biadapt-project.github.io/
cs.RO / 58 / 2602.08440
SteerVLA: Steering Vision-Language-Action Models in Long-Tail Driving Scenarios
SteerVLA:在长尾驾驶场景中引导视觉-语言-动作模型
Abstract
A fundamental challenge in autonomous driving is the integration of high-level, semantic reasoning for long-tail events with low-level, reactive control for robust driving. While large vision-language models (VLMs) trained on web-scale data offer powerful common-sense reasoning, they lack the grounded experience necessary for safe vehicle control. We posit that an effective autonomous agent should leverage the world knowledge of VLMs to guide a steerable driving policy toward robust control in driving scenarios. To this end, we propose SteerVLA, which leverages the reasoning capabilities of VLMs to produce fine-grained language instructions that steer a vision-language-action (VLA) driving policy. Key to our method is this rich language interface between the high-level VLM and low-level VLA, which allows the high-level policy to more effectively ground its reasoning in the control outputs of the low-level policy. To provide fine-grained language supervision aligned with vehicle control, we leverage a VLM to augment existing driving data with detailed language annotations, which we find to be essential for effective reasoning and steerability. We evaluate SteerVLA on a challenging closed-loop benchmark, where it outperforms state-of-the-art methods by 4.77 points in overall driving score and by 8.04 points on a long-tail subset. The project website is available at: https://steervla.github.io/.
Chinese Translation
自动驾驶面临的一个基本挑战是将高层次的语义推理与低层次的反应控制相结合,以实现稳健的驾驶。虽然在网络规模数据上训练的大型视觉-语言模型(VLMs)提供了强大的常识推理能力,但它们缺乏安全车辆控制所需的实际经验。我们认为,一个有效的自主代理应利用VLM的世界知识来引导可调节的驾驶策略,以实现驾驶场景中的稳健控制。为此,我们提出了SteerVLA,它利用VLM的推理能力生成细粒度的语言指令,从而引导视觉-语言-动作(VLA)驾驶策略。我们方法的关键在于高层次VLM与低层次VLA之间的丰富语言接口,这使得高层次策略能够更有效地将其推理与低层次策略的控制输出相结合。为了提供与车辆控制对齐的细粒度语言监督,我们利用VLM对现有驾驶数据进行详细语言注释的增强,我们发现这对于有效推理和可调节性至关重要。我们在一个具有挑战性的闭环基准上评估了SteerVLA,结果显示其在整体驾驶评分上比最先进的方法提高了4.77分,在长尾子集上提高了8.04分。项目网站可访问: https://steervla.github.io/。
cs.RO / 59 / 2602.08444
Post-Collision Trajectory Restoration for a Single-track Ackermann Vehicle using Heuristic Steering and Tractive Force Functions
基于启发式转向和牵引力函数的单轨阿克曼车辆碰撞后轨迹恢复
Abstract
Post-collision trajectory restoration is a safety-critical capability for autonomous vehicles, as impact-induced lateral motion and yaw transients can rapidly drive the vehicle away from the intended path. This paper proposes a structured heuristic recovery control law that jointly commands steering and tractive force for a generalized single-track Ackermann vehicle model. The formulation explicitly accounts for time-varying longitudinal velocity in the lateral-yaw dynamics and retains nonlinear steering-coupled interaction terms that are commonly simplified in the literature. Unlike approaches that assume constant longitudinal speed, the proposed design targets the transient post-impact regime where speed variations and nonlinear coupling significantly influence recovery. The method is evaluated in simulation on the proposed generalized single-track model and a standard 3DOF single-track reference model in MATLAB, demonstrating consistent post-collision restoration behaviour across representative initial post-impact conditions.
Chinese Translation
碰撞后轨迹恢复是自动驾驶车辆的一项安全关键能力,因为碰撞引起的横向运动和偏航瞬态可能迅速使车辆偏离预定路径。本文针对广义单轨阿克曼车辆模型,提出了一种结构化的启发式恢复控制律,对转向和牵引力进行联合指令。该公式明确考虑了横向-偏航动力学中的时变纵向速度,并保留了文献中常被简化的非线性转向耦合交互项。与假设恒定纵向速度的方法不同,所提设计针对碰撞后瞬态阶段,在该阶段,速度变化和非线性耦合显著影响恢复。该方法在MATLAB中对所提广义单轨模型和标准3自由度单轨参考模型进行了仿真评估,在具有代表性的初始碰撞后条件下展现出一致的碰撞后恢复行为。
cs.RO / 60 / 2602.08450
UAV-Supported Maritime Search System: Experience from Valun Bay Field Trials
无人机支持的海洋搜索系统:来自瓦伦湾现场试验的经验
Abstract
This paper presents the integration of flow field reconstruction, dynamic probabilistic modeling, search control, and machine vision detection in a system for autonomous maritime search operations. Field experiments conducted in Valun Bay (Cres Island, Croatia) involved real-time drifter data acquisition, surrogate flow model fitting based on computational fluid dynamics and numerical optimization, advanced multi-UAV search control and vision sensing, as well as deep learning-based object detection. The results demonstrate that a tightly coupled approach enables reliable detection of floating targets under realistic uncertainties and complex environmental conditions, providing concrete insights for future autonomous maritime search and rescue applications.
Chinese Translation
本文介绍了一种用于自主海洋搜索操作的系统,该系统集成了流场重建、动态概率建模、搜索控制和机器视觉检测。在克罗地亚克雷斯岛的瓦伦湾进行的现场实验涉及实时漂流器数据采集、基于计算流体动力学和数值优化的替代流模型拟合、先进的多无人机搜索控制和视觉感知,以及基于深度学习的目标检测。结果表明,紧密耦合的方法能够在现实的不确定性和复杂环境条件下可靠地检测浮动目标,为未来的自主海洋搜索和救援应用提供了具体的见解。
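The surrogate-flow-fitting step can be illustrated with a least-squares fit of a linear flow field u(x) = A x + b to drifter position/velocity samples; the linear field and synthetic data are assumptions standing in for the CFD-based surrogate described above:

```python
import numpy as np

def fit_linear_flow(positions, velocities):
    """Least-squares fit of a linear surrogate flow u(x) = A @ x + b.

    positions, velocities: (N, 2) arrays of drifter samples. A real
    surrogate would be CFD-informed; a linear field is assumed here
    purely to illustrate the fitting step.
    """
    N = positions.shape[0]
    X = np.hstack([positions, np.ones((N, 1))])      # rows [x, y, 1]
    coef, *_ = np.linalg.lstsq(X, velocities, rcond=None)
    A, b = coef[:2].T, coef[2]                       # A[i, j] = du_i/dx_j
    return A, b

# synthetic drifters advected by a known field: u = [0.1*x + 0.3, -0.2*y]
rng = np.random.default_rng(1)
pos = rng.uniform(-5, 5, size=(50, 2))
vel = np.stack([0.1 * pos[:, 0] + 0.3, -0.2 * pos[:, 1]], axis=1)
A, b = fit_linear_flow(pos, vel)
```

With noise-free synthetic data the fit recovers the generating field exactly; with real drifter data the same least-squares machinery would return the best-fitting surrogate parameters.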
cs.RO / 61 / 2602.08466
Reliability-aware Execution Gating for Near-field and Off-axis Vision-guided Robotic Alignment
面向可靠性的执行门控机制用于近场和偏轴视觉引导的机器人对准
Abstract
Vision-guided robotic systems are increasingly deployed in precision alignment tasks that require reliable execution under near-field and off-axis configurations. While recent advances in pose estimation have significantly improved numerical accuracy, practical robotic systems still suffer from frequent execution failures even when pose estimates appear accurate. This gap suggests that pose accuracy alone is insufficient to guarantee execution-level reliability. In this paper, we reveal that such failures arise from a deterministic geometric error amplification mechanism, in which small pose estimation errors are magnified through system structure and motion execution, leading to unstable or failed alignment. Rather than modifying pose estimation algorithms, we propose a Reliability-aware Execution Gating mechanism that operates at the execution level. The proposed approach evaluates geometric consistency and configuration risk before execution, and selectively rejects or scales high-risk pose updates. We validate the proposed method on a real UR5 robotic platform performing single-step visual alignment tasks under varying camera-target distances and off-axis configurations. Experimental results demonstrate that the proposed execution gating significantly improves task success rates, reduces execution variance, and suppresses tail-risk behavior, while leaving average pose accuracy largely unchanged. Importantly, the proposed mechanism is estimator-agnostic and can be readily integrated with both classical geometry-based and learning-based pose estimation pipelines. These results highlight the importance of execution-level reliability modeling and provide a practical solution for improving robustness in near-field vision-guided robotic systems.
Chinese Translation
视觉引导的机器人系统越来越多地应用于需要在近场和偏轴配置下可靠执行的精密对准任务。尽管最近在姿态估计方面的进展显著提高了数值准确性,但实际的机器人系统仍然面临频繁的执行失败,即使姿态估计看起来准确。这一差距表明,仅靠姿态准确性不足以保证执行层面的可靠性。本文揭示,这种失败源于一种确定性的几何误差放大机制,其中小的姿态估计误差通过系统结构和运动执行被放大,导致不稳定或失败的对准。我们提出了一种面向可靠性的执行门控机制,该机制在执行层面上操作,而不是修改姿态估计算法。该方法在执行前评估几何一致性和配置风险,并有选择地拒绝或缩放高风险的姿态更新。我们在真实的UR5机器人平台上验证了该方法,该平台在不同的相机-目标距离和偏轴配置下执行单步视觉对准任务。实验结果表明,所提出的执行门控显著提高了任务成功率,减少了执行方差,并抑制了尾部风险行为,同时平均姿态准确性基本保持不变。重要的是,所提出的机制与估计器无关,可以与经典的基于几何和基于学习的姿态估计管道轻松集成。这些结果突显了执行层面可靠性建模的重要性,并为提高近场视觉引导机器人系统的鲁棒性提供了实用的解决方案。
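The gating idea (evaluate risk before execution, then accept, scale, or reject a pose update) can be sketched as below; the threshold form and all constants are illustrative assumptions, not the paper's actual risk criterion:

```python
import numpy as np

def gate_pose_update(delta_pose, distance, max_norm=0.05, risk_gain=0.02):
    """Reliability-aware gating of a visual pose correction.

    Scales down or rejects a pose update whose magnitude exceeds a
    distance-dependent risk threshold (threshold form and constants
    are illustrative assumptions, not the paper's exact criterion).
    """
    threshold = max_norm + risk_gain * distance   # looser when far away
    norm = np.linalg.norm(delta_pose)
    if norm <= threshold:
        return delta_pose, "accept"
    if norm <= 2.0 * threshold:
        return delta_pose * (threshold / norm), "scale"   # clip magnitude
    return np.zeros_like(delta_pose), "reject"            # too risky

update, verdict = gate_pose_update(np.array([0.04, 0.0, 0.0]), distance=0.5)
big, verdict_big = gate_pose_update(np.array([0.5, 0.0, 0.0]), distance=0.5)
```

Because the gate wraps the update rather than the estimator, it is estimator-agnostic in exactly the sense the abstract emphasizes.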
cs.RO / 62 / 2602.08518
Characteristics, Management, and Utilization of Muscles in Musculoskeletal Humanoids: Empirical Study on Kengoro and Musashi
肌肉骨骼类人机器人中肌肉的特性、管理与利用:对Kengoro和Musashi的实证研究
Abstract
Various musculoskeletal humanoids have been developed so far, and numerous studies on control mechanisms have been conducted to leverage the advantages of their biomimetic bodies. However, there has not been sufficient and unified discussion on the diverse properties inherent in these musculoskeletal structures, nor on how to manage and utilize them. Therefore, this study categorizes and analyzes the characteristics of muscles, as well as their management and utilization methods, based on the various research conducted on the musculoskeletal humanoids we have developed, Kengoro and Musashi. We classify the features of the musculoskeletal structure into five properties: Redundancy, Independency, Anisotropy, Variable Moment Arm, and Nonlinear Elasticity. We then organize the diverse advantages and disadvantages of musculoskeletal humanoids that arise from the combination of these properties. In particular, we discuss body schema learning and reflex control, along with muscle grouping and body schema adaptation. Also, we describe the implementation of movements through an integrated system and discuss future challenges and prospects.
Chinese Translation
迄今为止,已经开发了多种肌肉骨骼类人机器人,并进行了大量关于控制机制的研究,以利用其仿生体的优势。然而,对于这些肌肉骨骼结构固有的多样特性,以及如何管理和利用它们,尚缺乏充分且统一的讨论。因此,本研究基于我们开发的肌肉骨骼类人机器人Kengoro和Musashi的各种研究,对肌肉的特性及其管理和利用方法进行了分类和分析。我们将肌肉骨骼结构的特征分为五个属性:冗余性(Redundancy)、独立性(Independency)、各向异性(Anisotropy)、可变力臂(Variable Moment Arm)和非线性弹性(Nonlinear Elasticity)。接着,我们整理了由这些属性组合而产生的肌肉骨骼类人机器人的多样优势与劣势。特别是,我们讨论了身体图式学习和反射控制,以及肌肉分组和身体图式适应。此外,我们描述了通过集成系统实现运动的过程,并讨论了未来的挑战与前景。
cs.RO / 63 / 2602.08537
UniPlan: Vision-Language Task Planning for Mobile Manipulation with Unified PDDL Formulation
UniPlan:基于统一PDDL表述的移动操作视觉-语言任务规划
Abstract
Integration of VLM reasoning with symbolic planning has proven to be a promising approach to real-world robot task planning. Existing work like UniDomain effectively learns symbolic manipulation domains from real-world demonstrations, described in the Planning Domain Definition Language (PDDL), and has successfully applied them to real-world tasks. These domains, however, are restricted to tabletop manipulation. We propose UniPlan, a vision-language task planning system for long-horizon mobile manipulation in large-scale indoor environments, that unifies scene topology, visuals, and robot capabilities into a holistic PDDL representation. UniPlan programmatically extends learned tabletop domains from UniDomain to support navigation, door traversal, and bimanual coordination. It operates on a visual-topological map, comprising navigation landmarks anchored with scene images. Given a language instruction, UniPlan retrieves task-relevant nodes from the map and uses a VLM to ground the anchored images into task-relevant objects and their PDDL states; next, it reconnects these nodes into a compressed, densely connected topological map, also represented in PDDL, with connectivity and costs derived from the original map; finally, a mobile-manipulation plan is generated using off-the-shelf PDDL solvers. Evaluated on human-raised tasks in a large-scale map with real-world imagery, UniPlan significantly outperforms VLM and LLM+PDDL planning in success rate, plan quality, and computational efficiency.
Chinese Translation
将视觉-语言模型(VLM)推理与符号规划相结合已被证明是一种有前景的现实世界机器人任务规划方法。现有工作如UniDomain有效地从现实世界示范中学习符号操作领域,这些领域用规划领域定义语言(PDDL)描述,并成功应用于现实世界任务。然而,这些领域仅限于桌面操作。我们提出了UniPlan,一个用于大规模室内环境中长时间移动操作的视觉-语言任务规划系统,它将场景拓扑、视觉信息和机器人能力统一为一个整体的PDDL表示。UniPlan以编程方式扩展了从UniDomain学习的桌面领域,以支持导航、门的穿越和双手协调。它在一个视觉-拓扑地图上运行,该地图由与场景图像锚定的导航地标组成。给定语言指令后,UniPlan从地图中检索与任务相关的节点,并使用VLM将锚定图像与任务相关的对象及其PDDL状态进行关联;接下来,它将这些节点重新连接到一个压缩的、密集连接的拓扑地图,该地图同样用PDDL表示,连接性和成本来源于原始地图;最后,使用现成的PDDL求解器生成移动操作计划。在一个具有现实世界图像的大规模地图上对人类提出的任务进行评估时,UniPlan在成功率、计划质量和计算效率上显著优于VLM和LLM+PDDL规划。
cs.RO / 64 / 2602.08557
Constrained Sampling to Guide Universal Manipulation RL
约束采样引导通用操控强化学习
Abstract
We consider how model-based solvers can be leveraged to guide training of a universal policy to control from any feasible start state to any feasible goal in a contact-rich manipulation setting. While Reinforcement Learning (RL) has demonstrated its strength in such settings, it may struggle to sufficiently explore and discover complex manipulation strategies, especially in sparse-reward settings. Our approach is based on the idea of a lower-dimensional manifold of feasible, likely-visited states during such manipulation and to guide RL with a sampler from this manifold. We propose Sample-Guided RL, which uses model-based constraint solvers to efficiently sample feasible configurations (satisfying differentiable collision, contact, and force constraints) and leverage them to guide RL for universal (goal-conditioned) manipulation policies. We study using this data directly to bias state visitation, as well as using black-box optimization of open-loop trajectories between random configurations to impose a state bias and optionally add a behavior cloning loss. In a minimalistic double sphere manipulation setting, Sample-Guided RL discovers complex manipulation strategies and achieves high success rates in reaching any statically stable state. In a more challenging panda arm setting, our approach achieves a significant success rate over a near-zero baseline, and demonstrates a breadth of complex whole-body-contact manipulation strategies.
Chinese Translation
我们考虑如何利用基于模型的求解器来指导通用策略的训练,以便在接触丰富的操控环境中,从任何可行的起始状态控制到任何可行的目标状态。尽管强化学习(Reinforcement Learning, RL)在此类环境中展示了其优势,但它可能难以充分探索并发现复杂的操控策略,尤其是在稀疏奖励环境中。我们的方法基于这样一个概念:此类操控过程中存在一个由可行且可能被访问的状态构成的低维流形,并通过来自该流形的采样器来指导RL。我们提出了样本引导的RL(Sample-Guided RL),该方法使用基于模型的约束求解器高效地采样可行配置(满足可微分的碰撞、接触和力约束),并利用这些配置来指导RL学习通用(目标条件)操控策略。我们研究了直接使用这些数据来偏置状态访问,以及使用黑箱优化随机配置之间的开环轨迹来施加状态偏置,并可选地添加行为克隆损失。在一个简化的双球操控环境中,样本引导的RL发现了复杂的操控策略,并在到达任何静态稳定状态方面实现了高成功率。在更具挑战性的Panda机械臂环境中,我们的方法在接近零的基线之上取得了显著的成功率,并展示了一系列复杂的全身接触操控策略。
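The guidance idea starts from sampling feasible configurations. A toy sketch using plain rejection sampling over hand-written constraints (the paper instead uses model-based solvers over differentiable collision, contact, and force constraints):

```python
import random

def sample_feasible(constraints, sample_fn, max_tries=10000):
    """Rejection-sample configurations satisfying all constraints.

    The paper uses model-based differentiable constraint solvers;
    plain rejection sampling stands in here so the guidance idea
    (seed RL with states from the feasible manifold) stays concrete.
    """
    for _ in range(max_tries):
        q = sample_fn()
        if all(c(q) for c in constraints):
            return q
    raise RuntimeError("no feasible sample found")

random.seed(0)
# toy 2D 'configuration': inside the unit square, outside a central obstacle
constraints = [
    lambda q: 0.0 <= q[0] <= 1.0 and 0.0 <= q[1] <= 1.0,
    lambda q: (q[0] - 0.5) ** 2 + (q[1] - 0.5) ** 2 > 0.1,  # obstacle clearance
]
q0 = sample_feasible(constraints, lambda: (random.random(), random.random()))
```

Samples like `q0` would then bias the RL state distribution (e.g. as reset states), concentrating exploration on the low-dimensional feasible manifold.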
cs.RO / 65 / 2602.08571
Head-to-Head autonomous racing at the limits of handling in the A2RL challenge
A2RL挑战中的极限操控对抗自主赛车
Abstract
Autonomous racing presents a complex challenge involving multi-agent interactions between vehicles operating at the limit of performance and dynamics. As such, it provides a valuable research and testing environment for advancing autonomous driving technology and improving road safety. This article presents the algorithms and deployment strategies developed by the TUM Autonomous Motorsport team for the inaugural Abu Dhabi Autonomous Racing League (A2RL). We showcase how our software emulates human driving behavior, pushing the limits of vehicle handling and multi-vehicle interactions to win the A2RL. Finally, we highlight the key enablers of our success and share our most significant learnings.
Chinese Translation
自主赛车是一项复杂的挑战,涉及在性能和动态极限下运行的车辆之间的多智能体交互。因此,它为推进自主驾驶技术和提高道路安全性提供了宝贵的研究和测试环境。本文介绍了慕尼黑工业大学自主赛车团队为首届阿布扎比自主赛车联盟(A2RL)开发的算法和部署策略。我们展示了我们的软件如何模拟人类驾驶行为,推动车辆操控和多车交互的极限,以赢得A2RL。最后,我们强调了成功的关键因素,并分享了我们最重要的经验教训。
cs.RO / 66 / 2602.08594
MOSAIC: Bridging the Sim-to-Real Gap in Generalist Humanoid Motion Tracking and Teleoperation with Rapid Residual Adaptation
MOSAIC:通过快速残差适应弥合通用类人运动跟踪和远程操作中的仿真与现实差距
Abstract
Generalist humanoid motion trackers have recently achieved strong simulation metrics by scaling data and training, yet often remain brittle on hardware during sustained teleoperation due to interface- and dynamics-induced errors. We present MOSAIC, an open-source, full-stack system for humanoid motion tracking and whole-body teleoperation across multiple interfaces. MOSAIC first learns a teleoperation-oriented general motion tracker via RL on a multi-source motion bank with adaptive resampling and rewards that emphasize world-frame motion consistency, which is critical for mobile teleoperation. To bridge the sim-to-real interface gap without sacrificing generality, MOSAIC then performs rapid residual adaptation: an interface-specific policy is trained using minimal interface-specific data, and then distilled into the general tracker through an additive residual module, outperforming naive fine-tuning or continual learning. We validate MOSAIC with systematic ablations, out-of-distribution benchmarking, and real-robot experiments demonstrating robust offline motion replay and online long-horizon teleoperation under realistic latency and noise.
Chinese Translation
通用类人运动跟踪器最近通过扩展数据和训练在仿真指标上取得了显著成绩,但在持续的远程操作中,由于接口和动力学引起的误差,往往在硬件上表现脆弱。我们提出了MOSAIC,一个开源的全栈系统,用于跨多种接口的类人运动跟踪和全身远程操作。MOSAIC首先通过在多源运动库上使用强化学习(RL)学习一个面向远程操作的通用运动跟踪器,该运动库具有自适应重采样以及强调世界坐标系运动一致性的奖励,而这对移动远程操作至关重要。为了在不牺牲通用性的情况下弥合仿真到现实的接口差距,MOSAIC随后执行快速残差适应:使用极少量的接口特定数据训练一个接口特定策略,然后通过一个加性残差模块将其蒸馏到通用跟踪器中,效果优于朴素的微调或持续学习。我们通过系统的消融实验、分布外基准测试和真实机器人实验验证了MOSAIC,展示了在现实延迟和噪声下稳健的离线运动重放和在线长时程远程操作。
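The additive-residual idea above can be illustrated with a toy linear policy: because the interface-specific correction is added on top of the general tracker rather than overwriting it, a zero residual recovers the base behavior exactly. The matrices and dimensions below are purely illustrative, not MOSAIC's actual architecture.

```python
import numpy as np

def base_tracker(obs):
    # Stand-in for the general motion tracker: a fixed linear policy.
    W = np.full((4, 8), 0.1)               # action_dim x obs_dim, illustrative
    return W @ obs

def adapted_policy(obs, residual_W):
    # The interface-specific correction is purely additive, so the general
    # tracker's behavior is recovered exactly when the residual is zero.
    return base_tracker(obs) + residual_W @ obs

obs = np.ones(8)
zero_residual = np.zeros((4, 8))
# With a zero residual, the adapted policy equals the base tracker.
assert np.allclose(adapted_policy(obs, zero_residual), base_tracker(obs))
```

A nonzero residual then shifts only the correction term, leaving the general tracker's weights untouched, which is the property that distinguishes this from naive fine-tuning.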
cs.RO / 67 / 2602.08599
A Precise Real-Time Force-Aware Grasping System for Robust Aerial Manipulation
一种用于稳健空中操作的精确实时力感知抓取系统
Abstract
Aerial manipulation requires force-aware capabilities to enable safe and effective grasping and physical interaction. Previous works often rely on heavy, expensive force sensors unsuitable for typical quadrotor platforms, or perform grasping without force feedback, risking damage to fragile objects. To address these limitations, we propose a novel force-aware grasping framework incorporating six low-cost, sensitive skin-like tactile sensors. We introduce a magnetic-based tactile sensing module that provides high-precision three-dimensional force measurements. We eliminate geomagnetic interference through a reference Hall sensor and simplify the calibration process compared to previous work. The proposed framework enables precise force-aware grasping control, allowing safe manipulation of fragile objects and real-time weight measurement of grasped items. The system is validated through comprehensive real-world experiments, including balloon grasping, dynamic load variation tests, and ablation studies, demonstrating its effectiveness in various aerial manipulation scenarios. Our approach achieves fully onboard operation without external motion capture systems, significantly enhancing the practicality of force-sensitive aerial manipulation. The supplementary video is available at: https://www.youtube.com/watch?v=mbcZkrJEf1I.
Chinese Translation
空中操作需要具备力感知能力,以实现安全有效的抓取和物理交互。以往的研究通常依赖于不适用于典型四旋翼平台的沉重且昂贵的力传感器,或者在没有力反馈的情况下进行抓取,从而有损坏易碎物体的风险。为了解决这些局限性,我们提出了一种新颖的力感知抓取框架,集成了六个低成本、灵敏的类皮肤触觉传感器。我们引入了一种基于磁性的触觉传感模块,提供高精度的三维力测量。我们通过一个参考霍尔传感器消除地磁干扰,并且与以往工作相比简化了校准过程。所提出的框架实现了精确的力感知抓取控制,允许安全操作易碎物体并实时测量被抓取物品的重量。该系统通过全面的真实世界实验进行验证,包括气球抓取、动态负载变化测试和消融研究,展示了其在各种空中操作场景中的有效性。我们的方法实现了完全机载运行,无需外部运动捕捉系统,显著增强了力敏感空中操作的实用性。补充视频可在以下链接观看:https://www.youtube.com/watch?v=mbcZkrJEf1I。
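The reference-Hall-sensor idea admits a simple sketch: both Hall elements see the same ambient (geomagnetic) field, so subtracting the reference reading isolates the field change caused by skin deflection. The calibration matrix and field values below are hypothetical.

```python
import numpy as np

def contact_force(fingertip_field, reference_field, calib):
    """Estimate a 3D contact force from one magnetic tactile taxel.

    fingertip_field: 3D field at the sensing Hall element (signal + ambient)
    reference_field: 3D field at a reference Hall element that sees only
                     the ambient (geomagnetic) field
    calib:           3x3 matrix mapping field change to force (illustrative)
    """
    # Subtracting the reference reading cancels the shared ambient field.
    return calib @ (fingertip_field - reference_field)

ambient = np.array([20.0, -5.0, 40.0])     # e.g. geomagnetic field, in uT
signal  = np.array([1.0, 0.0, -2.0])       # field change from skin deflection
calib   = np.eye(3) * 0.5                  # hypothetical calibration

f = contact_force(ambient + signal, ambient, calib)
assert np.allclose(f, 0.5 * signal)        # the ambient contribution cancels
```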
cs.RO / 68 / 2602.08602
Mimic Intent, Not Just Trajectories
模仿意图,而不仅仅是轨迹
Abstract
While imitation learning (IL) has achieved impressive success in dexterous manipulation through generative modeling and pretraining, state-of-the-art approaches like Vision-Language-Action (VLA) models still struggle with adaptation to environmental changes and skill transfer. We argue this stems from mimicking raw trajectories without understanding the underlying intent. To address this, we propose explicitly disentangling behavior intent from execution details in end-2-end IL: \textit{``Mimic Intent, Not just Trajectories'' (MINT)}. We achieve this via \textit{multi-scale frequency-space tokenization}, which enforces a spectral decomposition of action chunk representation. We learn action tokens with a multi-scale coarse-to-fine structure, and force the coarsest token to capture low-frequency global structure and finer tokens to encode high-frequency details. This yields an abstract \textit{Intent token} that facilitates planning and transfer, and multi-scale \textit{Execution tokens} that enable precise adaptation to environmental dynamics. Building on this hierarchy, our policy generates trajectories through \textit{next-scale autoregression}, performing progressive \textit{intent-to-execution reasoning}, thus boosting learning efficiency and generalization. Crucially, this disentanglement enables \textit{one-shot transfer} of skills, by simply injecting the Intent token from a demonstration into the autoregressive generation process. Experiments on several manipulation benchmarks and on a real robot demonstrate state-of-the-art success rates, superior inference efficiency, robust generalization against disturbances, and effective one-shot transfer.
Chinese Translation
尽管模仿学习(IL)通过生成建模和预训练在灵巧操作中取得了显著成功,但像视觉-语言-动作(VLA)模型这样的最先进方法在适应环境变化和技能迁移方面仍然面临挑战。我们认为,这源于在没有理解潜在意图的情况下模仿原始轨迹。为了解决这个问题,我们提出在端到端的IL中明确将行为意图与执行细节解耦:“模仿意图,而不仅仅是轨迹”(MINT)。我们通过多尺度频率空间标记化实现这一目标,该方法强制对动作块表示进行谱分解。我们以多尺度由粗到细的结构学习动作标记,强制最粗的标记捕捉低频全局结构,而更细的标记编码高频细节。这产生了一个抽象的意图标记(Intent token),有助于规划和迁移,以及多尺度的执行标记(Execution tokens),能够精确适应环境动态。在这一层次结构的基础上,我们的策略通过下一尺度自回归生成轨迹,执行渐进的意图到执行推理,从而提高学习效率和泛化能力。关键的是,这种解耦使得技能的单次迁移(one-shot transfer)成为可能:只需将来自演示的意图标记注入自回归生成过程。在多个操作基准和真实机器人上的实验展示了最先进的成功率、优越的推理效率、对干扰的稳健泛化能力以及有效的单次迁移。
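A minimal sketch of frequency-space decomposition of an action chunk, using a plain FFT low-pass as a stand-in for MINT's learned multi-scale tokenizer: the low-frequency part plays the role of the coarse "intent" and the residual the fine "execution" detail. Chunk length, action dimension, and the cutoff are illustrative.

```python
import numpy as np

def spectral_split(chunk, n_low):
    """Split an action chunk (T x D) into a low-frequency 'intent' part and
    a high-frequency 'execution' residual via a real FFT along time."""
    spec = np.fft.rfft(chunk, axis=0)
    low = spec.copy()
    low[n_low:] = 0                        # keep only the coarsest coefficients
    intent = np.fft.irfft(low, n=chunk.shape[0], axis=0)
    execution = chunk - intent             # the residual carries fine detail
    return intent, execution

T, D = 16, 7                               # chunk length, action dim (illustrative)
t = np.arange(T)[:, None]
chunk = np.sin(2 * np.pi * t / T) + 0.05 * np.sin(2 * np.pi * 6 * t / T)
chunk = np.repeat(chunk, D, axis=1)

intent, execution = spectral_split(chunk, n_low=2)
# The two parts reconstruct the chunk exactly, by construction.
assert np.allclose(intent + execution, chunk)
```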
cs.RO / 69 / 2602.08653
High-Speed Vision-Based Flight in Clutter with Safety-Shielded Reinforcement Learning
利用安全防护强化学习在杂乱环境中实现基于视觉的高速飞行
Abstract
Quadrotor unmanned aerial vehicles (UAVs) are increasingly deployed in complex missions that demand reliable autonomous navigation and robust obstacle avoidance. However, traditional modular pipelines often incur cumulative latency, whereas purely reinforcement learning (RL) approaches typically provide limited formal safety guarantees. To bridge this gap, we propose an end-to-end RL framework augmented with model-based safety mechanisms. We incorporate physical priors in both training and deployment. During training, we design a physics-informed reward structure that provides global navigational guidance. During deployment, we integrate a real-time safety filter that projects the policy outputs onto a provably safe set to enforce strict collision-avoidance constraints. This hybrid architecture reconciles high-speed flight with robust safety assurances. Benchmark evaluations demonstrate that our method outperforms both traditional planners and recent end-to-end obstacle avoidance approaches based on differentiable physics. Extensive experiments demonstrate strong generalization, enabling reliable high-speed navigation in dense clutter and challenging outdoor forest environments at velocities up to 7.5m/s.
Chinese Translation
四旋翼无人机(UAV)在复杂任务中的应用日益增多,这些任务要求可靠的自主导航和强大的障碍物规避能力。然而,传统的模块化流程往往会导致累积延迟,而纯粹的强化学习(RL)方法通常提供有限的正式安全保证。为了解决这一问题,我们提出了一种增强了基于模型的安全机制的端到端强化学习框架。我们在训练和部署过程中都融入了物理先验。在训练阶段,我们设计了一种物理信息驱动的奖励结构,以提供全局导航指导。在部署阶段,我们集成了一个实时安全过滤器,该过滤器将策略输出投影到一个可证明安全的集合上,以强制执行严格的避碰约束。这种混合架构将高速飞行与强大的安全保证相结合。基准评估表明,我们的方法优于传统规划器和基于可微物理的最新端到端障碍物规避方法。大量实验表明,模型具有很强的泛化能力,使其能够在密集复杂环境和具有挑战性的户外森林环境中以高达7.5米/秒的速度可靠导航。
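The safety-filter step described above, projecting the policy output onto a provably safe set, can be sketched in its simplest case: projection onto a single half-space constraint. The real filter's safe set and solver are more involved; the constraint and commands here are hypothetical.

```python
import numpy as np

def safety_filter(u_nominal, a, b):
    """Project a nominal velocity command onto the half-space {u : a.u <= b},
    a stand-in for the provably safe set used by the real-time filter."""
    violation = a @ u_nominal - b
    if violation <= 0:
        return u_nominal                   # already safe: leave the policy alone
    # Closest point in the half-space (Euclidean projection).
    return u_nominal - (violation / (a @ a)) * a

a = np.array([1.0, 0.0, 0.0])              # "do not exceed speed b toward +x"
b = 2.0
safe = safety_filter(np.array([5.0, 1.0, 0.0]), a, b)
assert a @ safe <= b + 1e-9                # filtered command satisfies the constraint
untouched = safety_filter(np.array([1.0, 1.0, 0.0]), a, b)
assert np.allclose(untouched, [1.0, 1.0, 0.0])   # safe commands pass through
```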
cs.RO / 70 / 2602.08776
Mind the Gap: Learning Implicit Impedance in Visuomotor Policies via Intent-Execution Mismatch
注意差距:通过意图-执行不匹配学习视觉运动策略中的隐式阻抗
Abstract
Teleoperation inherently relies on the human operator acting as a closed-loop controller to actively compensate for hardware imperfections, including latency, mechanical friction, and lack of explicit force feedback. Standard Behavior Cloning (BC), by mimicking the robot's executed trajectory, fundamentally ignores this compensatory mechanism. In this work, we propose a Dual-State Conditioning framework that shifts the learning objective to "Intent Cloning" (master command). We posit that the Intent-Execution Mismatch, the discrepancy between master command and slave response, is not noise, but a critical signal that physically encodes implicit interaction forces and algorithmically reveals the operator's strategy for overcoming system dynamics. By predicting the master intent, our policy learns to generate a "virtual equilibrium point", effectively realizing implicit impedance control. Furthermore, by explicitly conditioning on the history of this mismatch, the model performs implicit system identification, perceiving tracking errors as external forces to close the control loop. To bridge the temporal gap caused by inference latency, we further formulate the policy as a trajectory inpainter to ensure continuous control. We validate our approach on a sensorless, low-cost bi-manual setup. Empirical results across tasks requiring contact-rich manipulation and dynamic tracking reveal a decisive gap: while standard execution-cloning fails due to the inability to overcome contact stiffness and tracking lag, our mismatch-aware approach achieves robust success. This presents a minimalist behavior cloning framework for low-cost hardware, enabling force perception and dynamic compensation without relying on explicit force sensing. Videos are available on the \href{https://xucj98.github.io/mind-the-gap-page/}{project page}.
Chinese Translation
远程操作本质上依赖于人类操作员作为闭环控制器,主动补偿硬件缺陷,包括延迟、机械摩擦和缺乏显式力反馈。标准行为克隆(Behavior Cloning, BC)通过模仿机器人执行的轨迹,从根本上忽视了这一补偿机制。在本研究中,我们提出了一种双状态条件化框架,将学习目标转变为“意图克隆”(即克隆主端指令)。我们认为,意图-执行不匹配,即主端指令与从端响应之间的差异,并不是噪声,而是一个关键信号:它在物理层面编码了隐式交互力,并在算法层面揭示了操作员克服系统动力学的策略。通过预测主端意图,我们的策略学会生成“虚拟平衡点”,从而有效实现隐式阻抗控制。此外,通过显式地以该不匹配的历史为条件,模型执行隐式系统辨识,将跟踪误差视为外力以闭合控制回路。为了弥补推理延迟造成的时间差距,我们进一步将策略构建为轨迹修复器,以确保连续控制。我们在一个无力传感器、低成本的双臂装置上验证了我们的方法。在需要富接触操作和动态跟踪的任务上的实证结果揭示了一个决定性的差距:标准执行克隆由于无法克服接触刚度和跟踪滞后而失败,而我们的不匹配感知方法则取得了稳健的成功。这为低成本硬件提供了一种极简的行为克隆框架,无需依赖显式力传感即可实现力感知和动态补偿。相关视频可在项目页面查看:https://xucj98.github.io/mind-the-gap-page/
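The impedance reading of the mismatch can be sketched in one line: treating (master − slave) as a spring deflection, a stiffness converts it into an implicit force estimate. The stiffness value and joint readings below are hypothetical, not the paper's learned quantities.

```python
import numpy as np

def implicit_force(master_q, slave_q, stiffness):
    """Treat the intent-execution mismatch as a spring deflection: under an
    impedance view, (master - slave) scaled by a stiffness approximates the
    external interaction force the operator is working against."""
    return stiffness * (master_q - slave_q)

k = 50.0                                   # hypothetical joint stiffness
free_motion = implicit_force(np.array([0.30]), np.array([0.30]), k)
in_contact  = implicit_force(np.array([0.30]), np.array([0.26]), k)
assert np.allclose(free_motion, 0.0)       # no mismatch, no inferred force
assert in_contact[0] > 0                   # contact shows up as a nonzero force
```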
cs.RO / 71 / 2602.08784
GaussianCaR: Gaussian Splatting for Efficient Camera-Radar Fusion
GaussianCaR:用于高效相机-雷达融合的高斯溅射(Gaussian Splatting)
Abstract
Robust and accurate perception of dynamic objects and map elements is crucial for autonomous vehicles performing safe navigation in complex traffic scenarios. While vision-only methods have become the de facto standard due to their technical advances, they can benefit from effective and cost-efficient fusion with radar measurements. In this work, we advance fusion methods by repurposing Gaussian Splatting as an efficient universal view transformer that bridges the view disparity gap, mapping both image pixels and radar points into a common Bird's-Eye View (BEV) representation. Our main contribution is GaussianCaR, an end-to-end network for BEV segmentation that, unlike prior BEV fusion methods, leverages Gaussian Splatting to map raw sensor information into latent features for efficient camera-radar fusion. Our architecture combines multi-scale fusion with a transformer decoder to efficiently extract BEV features. Experimental results demonstrate that our approach achieves performance on par with, or even surpassing, the state of the art on BEV segmentation tasks (57.3%, 82.9%, and 50.1% IoU for vehicles, roads, and lane dividers) on the nuScenes dataset, while maintaining a 3.2x faster inference runtime. Code and project page are available online.
Chinese Translation
对动态物体和地图元素的稳健且准确的感知,对于在复杂交通场景中安全导航的自动驾驶车辆至关重要。尽管纯视觉方法凭借其技术进步已成为事实上的标准,但它们可以从与雷达测量的有效且低成本的融合中受益。在本研究中,我们将高斯溅射(Gaussian Splatting)重新用作一种高效的通用视图变换器来推进融合方法,弥合视图差异,将图像像素和雷达点共同映射到统一的鸟瞰图(BEV)表示中。我们的主要贡献是GaussianCaR,一个端到端的BEV分割网络,与先前的BEV融合方法不同,它利用高斯溅射将原始传感器信息映射为潜在特征,以实现高效的相机-雷达融合。我们的架构结合了多尺度融合和Transformer解码器,以高效提取BEV特征。实验结果表明,我们的方法在nuScenes数据集的BEV分割任务上达到了与最先进技术相当甚至更优的性能(车辆、道路和车道分隔线的IoU分别为57.3%、82.9%和50.1%),同时推理速度快3.2倍。代码和项目页面已在线发布。
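A toy version of using splatting as a view transformer: accumulate isotropic 2D Gaussians at radar point locations into a BEV grid. GaussianCaR splats latent features with learned Gaussian parameters; the fixed weights, spread, and grid resolution here are purely illustrative.

```python
import numpy as np

def splat_to_bev(points, weights, grid, cell=1.0, sigma=0.8):
    """Accumulate isotropic 2D Gaussians at (x, y) points into a BEV grid --
    a toy stand-in for rendering latent features with Gaussian Splatting."""
    H, W = grid.shape
    ys, xs = np.mgrid[0:H, 0:W]
    for (x, y), w in zip(points, weights):
        d2 = (xs * cell - x) ** 2 + (ys * cell - y) ** 2
        grid += w * np.exp(-d2 / (2 * sigma ** 2))
    return grid

bev = np.zeros((32, 32))
radar_points = [(8.0, 8.0), (20.0, 15.0)]  # (x, y) in metres, illustrative
bev = splat_to_bev(radar_points, weights=[1.0, 0.5], grid=bev)
# Mass concentrates at the splatted locations, not elsewhere.
assert bev[8, 8] > bev[0, 0]
```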
cs.RO / 72 / 2602.08799
A Generic Service-Oriented Function Offloading Framework for Connected Automated Vehicles
用于连接自动驾驶车辆的通用面向服务功能卸载框架
Abstract
Function offloading is a promising solution to address limitations concerning computational capacity and available energy of Connected Automated Vehicles~(CAVs) or other autonomous robots by distributing computational tasks between local and remote computing devices in form of distributed services. This paper presents a generic function offloading framework that can be used to offload an arbitrary set of computational tasks with a focus on autonomous driving. To provide flexibility, the function offloading framework is designed to incorporate different offloading decision making algorithms and quality of service~(QoS) requirements that can be adjusted to different scenarios or the objectives of the CAVs. With a focus on the applicability, we propose an efficient location-based approach, where the decision whether tasks are processed locally or remotely depends on the location of the CAV. We apply the proposed framework on the use case of service-oriented trajectory planning, where we offload the trajectory planning task of CAVs to a Multi-Access Edge Computing~(MEC) server. The evaluation is conducted in both simulation and real-world application. It demonstrates the potential of the function offloading framework to guarantee the QoS for trajectory planning while improving the computational efficiency of the CAVs. Moreover, the simulation results also show the adaptability of the framework to diverse scenarios involving simultaneous offloading requests from multiple CAVs.
Chinese Translation
功能卸载是一种有前景的解决方案,旨在通过将计算任务以分布式服务的形式在本地和远程计算设备之间分配,从而解决连接自动驾驶车辆(Connected Automated Vehicles, CAVs)或其他自主机器人在计算能力和可用能量方面的限制。本文提出了一种通用的功能卸载框架,能够卸载任意一组计算任务,重点关注自主驾驶。为了提供灵活性,该功能卸载框架设计为可以结合不同的卸载决策算法和服务质量(Quality of Service, QoS)要求,这些要求可以根据不同场景或CAVs的目标进行调整。我们提出了一种高效的基于位置的方法,其中任务是本地处理还是远程处理的决策取决于CAV的位置。我们将所提框架应用于面向服务的轨迹规划用例,将CAV的轨迹规划任务卸载到多接入边缘计算(Multi-Access Edge Computing, MEC)服务器上。评估在模拟和实际应用中进行,结果表明该功能卸载框架在保证轨迹规划的QoS的同时,提高了CAV的计算效率。此外,模拟结果还显示了该框架在涉及多个CAV同时请求卸载的多样化场景中的适应性。
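The location-based decision rule described above can be sketched as a lookup over coverage zones with measured round-trip times; zone geometry, RTT values, and the QoS budget below are all hypothetical.

```python
def offload_decision(position, coverage_zones, latency_budget_ms):
    """Location-based offloading: process the task remotely only when the
    vehicle is inside a zone whose measured MEC round-trip time fits the
    QoS latency budget; otherwise fall back to local processing."""
    x, y = position
    for (x0, y0, x1, y1), rtt_ms in coverage_zones:
        if x0 <= x <= x1 and y0 <= y <= y1 and rtt_ms <= latency_budget_ms:
            return "remote"
    return "local"

zones = [((0, 0, 100, 100), 12.0),    # near an MEC server: low round-trip
         ((100, 0, 200, 100), 80.0)]  # poor coverage: high round-trip
assert offload_decision((50, 50), zones, latency_budget_ms=20) == "remote"
assert offload_decision((150, 50), zones, latency_budget_ms=20) == "local"
assert offload_decision((300, 50), zones, latency_budget_ms=20) == "local"
```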
cs.RO / 73 / 2602.08821
Multi-Staged Framework for Safety Analysis of Offloaded Services in Distributed Intelligent Transportation Systems
分布式智能交通系统中卸载服务安全分析的多阶段框架
Abstract
The integration of service-oriented architectures (SOA) with function offloading for distributed, intelligent transportation systems (ITS) offers the opportunity for connected autonomous vehicles (CAVs) to extend their locally available services. One major goal of offloading a subset of functions in the processing chain of a CAV to remote devices is to reduce the overall computational complexity on the CAV. The extension of using remote services, however, requires careful safety analysis, since the remotely created data are corrupted more easily, e.g., through an attacker on the remote device or by intercepting the wireless transmission. To tackle this problem, we first analyze the concept of SOA for distributed environments. From this, we derive a safety framework that validates the reliability of remote services and the data received locally. Since it is possible for the autonomous driving task to offload multiple different services, we propose a specific multi-staged framework for safety analysis dependent on the service composition of local and remote services. For efficiency reasons, we directly include the multi-staged framework for safety analysis in our service-oriented function offloading framework (SOFOF) that we have proposed in earlier work. The evaluation compares the performance of the extended framework considering computational complexity, with energy savings being a major motivation for function offloading, and its capability to detect data from corrupted remote services.
Chinese Translation
服务导向架构(SOA)与功能卸载的结合,为分布式智能交通系统(ITS)中的连接自主车辆(CAVs)提供了扩展其本地可用服务的机会。将CAV处理链中的一部分功能卸载到远程设备的主要目标之一是降低CAV上的整体计算复杂性。然而,扩展使用远程服务需要仔细的安全分析,因为远程生成的数据更容易被破坏,例如被远程设备上的攻击者篡改,或在无线传输中被拦截。为了解决这个问题,我们首先分析了分布式环境中SOA的概念,并由此推导出一个验证远程服务及本地接收数据可靠性的安全框架。由于自主驾驶任务可能卸载多个不同的服务,我们提出了一个依赖于本地和远程服务组合的多阶段安全分析框架。出于效率考虑,我们将该多阶段安全分析框架直接集成到我们早期工作中提出的服务导向功能卸载框架(SOFOF)中。评估比较了扩展框架在计算复杂性方面的性能(节能是功能卸载的主要动机),以及其检测来自被破坏的远程服务的数据的能力。
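One stage of such a safety analysis can be sketched as a kinematic plausibility check on remotely planned trajectories: corrupted data typically violates physical bounds the vehicle could never reach. The velocity limit and time step are hypothetical stand-ins for the paper's actual validation criteria.

```python
def validate_remote_trajectory(traj, v_max, dt):
    """First-stage plausibility check on a remotely planned 2D trajectory:
    reject any plan whose implied speed exceeds the vehicle's physical
    limit, since corrupted data tends to violate such kinematic bounds."""
    for (x0, y0), (x1, y1) in zip(traj, traj[1:]):
        speed = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 / dt
        if speed > v_max:
            return False
    return True

ok_plan  = [(0, 0), (1, 0), (2, 0)]        # 10 m/s at dt = 0.1 s: plausible
bad_plan = [(0, 0), (1, 0), (50, 0)]       # 490 m/s jump: flag as corrupted
assert validate_remote_trajectory(ok_plan, v_max=15.0, dt=0.1)
assert not validate_remote_trajectory(bad_plan, v_max=15.0, dt=0.1)
```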
cs.RO / 74 / 2602.08845
Finite-Time Teleoperation of Euler-Lagrange Systems via Energy-Shaping
通过能量塑形实现欧拉-拉格朗日系统的有限时间遥操作
Abstract
This paper proposes a family of finite-time controllers for the bilateral teleoperation of fully actuated nonlinear Euler-Lagrange systems. Based on the energy-shaping framework and under the standard assumption of passive interactions with the human and the environment, the controllers ensure that the position error and velocities globally converge to zero in the absence of time delays. In this case, the closed-loop system admits a homogeneous approximation of negative degree, and thus the control objective is achieved in finite-time. The proposed controllers are simple, continuous-time proportional-plus-damping-injection schemes, validated through both simulation and experimental results.
Chinese Translation
本文提出了一系列有限时间控制器,用于全驱动非线性欧拉-拉格朗日系统的双向遥操作。基于能量塑形框架,并在与人类及环境被动交互的标准假设下,这些控制器确保在没有时间延迟的情况下,位置误差和速度全局收敛到零。在这种情况下,闭环系统存在负次数的齐次近似,因此控制目标在有限时间内实现。所提出的控制器是简单的连续时间比例加阻尼注入方案,并通过仿真和实验结果进行了验证。
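A hedged sketch of the kind of proportional-plus-damping-injection law the abstract describes, where a fractional power makes the closed loop homogeneous of negative degree; the gains and exponent are illustrative, not the paper's specific design:

```latex
\tau_i \;=\; -\,k_p\,\mathrm{sig}^{\alpha}\!\left(q_i - q_j\right)
             \;-\; k_d\,\mathrm{sig}^{\alpha}\!\left(\dot q_i\right),
\qquad
\mathrm{sig}^{\alpha}(x) \;=\; |x|^{\alpha}\,\mathrm{sign}(x),
\quad 0 < \alpha < 1,
```

with $i, j$ indexing the local and remote robots. For $\alpha = 1$ this reduces to a standard proportional-plus-damping scheme with asymptotic convergence; for $0 < \alpha < 1$ the fractional power yields the negative-degree homogeneous approximation that underpins finite-time convergence.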
cs.RO / 75 / 2602.08963
Reduced-order Control and Geometric Structure of Learned Lagrangian Latent Dynamics
学习的拉格朗日潜在动力学的降阶控制与几何结构
Abstract
Model-based controllers can offer strong guarantees on stability and convergence by relying on physically accurate dynamic models. However, these are rarely available for high-dimensional mechanical systems such as deformable objects or soft robots. While neural architectures can learn to approximate complex dynamics, they are either limited to low-dimensional systems or provide only limited formal control guarantees due to a lack of embedded physical structure. This paper introduces a latent control framework based on learned structure-preserving reduced-order dynamics for high-dimensional Lagrangian systems. We derive a reduced tracking law for fully actuated systems and adopt a Riemannian perspective on projection-based model-order reduction to study the resulting latent and projected closed-loop dynamics. By quantifying the sources of modeling error, we derive interpretable conditions for stability and convergence. We extend the proposed controller and analysis to underactuated systems by introducing learned actuation patterns. Experimental results on simulated and real-world systems validate our theoretical investigation and the accuracy of our controllers.
Chinese Translation
基于模型的控制器可以通过依赖于物理准确的动态模型提供强有力的稳定性和收敛性保证。然而,对于高维机械系统,如可变形物体或软机器人,这些模型很少可用。虽然神经网络架构可以学习近似复杂的动态,但它们要么仅限于低维系统,要么由于缺乏嵌入的物理结构而仅提供有限的正式控制保证。本文提出了一种基于学习的结构保持降阶动力学的潜在控制框架,适用于高维拉格朗日系统。我们推导出完全驱动系统的降阶跟踪法则,并采用黎曼视角对基于投影的模型降阶进行研究,以探讨由此产生的潜在和投影闭环动态。通过量化建模误差的来源,我们推导出可解释的稳定性和收敛性条件。我们通过引入学习的驱动模式将所提出的控制器和分析扩展到欠驱动系统。对模拟和真实系统的实验结果验证了我们的理论研究及控制器的准确性。
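The reduced-order tracking idea can be sketched with a linear projection standing in for the learned nonlinear reduction: project states into the latent space, apply a PD law there, and lift the control back through the basis. Basis size, state dimension, and gains are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Orthonormal reduction basis: 100-dim state, 4-dim latent space (illustrative).
V, _ = np.linalg.qr(rng.standard_normal((100, 4)))

def latent_pd(q, q_dot, q_ref, kp=10.0, kd=2.0):
    """Track a high-dimensional reference in a low-dimensional latent space:
    project states with V^T, run a PD law there, lift the control back with V.
    (A linear projection stands in for the learned nonlinear reduction.)"""
    z, z_dot, z_ref = V.T @ q, V.T @ q_dot, V.T @ q_ref
    return V @ (kp * (z_ref - z) - kd * z_dot)

q_ref = V @ np.ones(4)                     # a reference inside the subspace
tau = latent_pd(q_ref, np.zeros(100), q_ref)
assert np.allclose(tau, 0.0)               # zero latent error => zero control
```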
cs.RO / 76 / 2602.08999
CLUE: Crossmodal disambiguation via Language-vision Understanding with attEntion
CLUE:通过带注意力的语言-视觉理解进行跨模态消歧
Abstract
With the increasing integration of robots into daily life, human-robot interaction has become more complex and multifaceted. A critical component of this interaction is Interactive Visual Grounding (IVG), through which robots must interpret human intentions and resolve ambiguity. Existing IVG models generally lack a mechanism to determine when to ask clarification questions, as they implicitly rely on their learned representations. CLUE addresses this gap by converting the VLM's cross-modal attention into an explicit, spatially grounded signal for deciding when to ask. We extract text to image attention maps and pass them to a lightweight CNN to detect referential ambiguity, while a LoRA fine-tuned decoder conducts the dialog and emits grounding location tokens. We train on a real-world interactive dataset for IVG, and a mixed ambiguity set for the detector. With InViG-only supervision, our model surpasses a state-of-the-art method while using parameter-efficient fine-tuning. Similarly, the ambiguity detector outperforms prior baselines. Overall, CLUE turns the internal cross-modal attention of a VLM into an explicit, spatially grounded signal for deciding when to ask. The data and code are publicly available at: mouadabrini.github.io/clue
Chinese Translation
随着机器人日益融入日常生活,人机交互变得愈加复杂和多面化。此交互的一个关键组成部分是互动视觉定位(Interactive Visual Grounding, IVG),通过该机制,机器人必须解读人类意图并解决歧义。现有的 IVG 模型通常缺乏确定何时提出澄清问题的机制,因为它们隐式依赖于其学习到的表征。CLUE 通过将视觉语言模型(VLM)的跨模态注意力转化为一个明确的、空间上有依据的信号来填补这一空白,以决定何时提问。我们提取文本到图像的注意力图,并将其传递给一个轻量级卷积神经网络(CNN)以检测指称歧义,同时一个经过 LoRA 微调的解码器进行对话并发出定位令牌。我们在一个真实世界的互动数据集上进行 IVG 训练,并为检测器构建一个混合歧义集。在仅使用 InViG 监督的情况下,我们的模型在参数高效微调的同时超越了当前的最先进方法。同样,歧义检测器也优于之前的基线。总体而言,CLUE 将 VLM 的内部跨模态注意力转化为一个明确的、空间上有依据的信号,以决定何时提问。数据和代码可在以下网址公开获取:mouadabrini.github.io/clue
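A simple stand-in for the learned ambiguity detector: a flat, high-entropy text-to-image attention map suggests the referent is ambiguous and a clarification question is warranted, while a sharply peaked map suggests the referent is clear. The entropy threshold is a hypothetical substitute for CLUE's lightweight CNN.

```python
import numpy as np

def is_ambiguous(attn, threshold=0.5):
    """Hedged stand-in for CLUE's learned detector: treat a flat
    (high-entropy) text-to-image attention map as referential ambiguity."""
    p = attn.ravel() / attn.sum()
    # Shannon entropy, normalized to [0, 1] by the maximum log(size).
    entropy = -(p * np.log(p + 1e-12)).sum() / np.log(p.size)
    return entropy > threshold

peaked = np.full((8, 8), 1e-6); peaked[3, 4] = 1.0   # attention locks onto one region
flat = np.ones((8, 8))                               # attention spread everywhere
assert not is_ambiguous(peaked)                      # clear referent: just ground it
assert is_ambiguous(flat)                            # ambiguous: ask a question
```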
cs.RO / 77 / 2602.09002
From Obstacles to Etiquette: Robot Social Navigation with VLM-Informed Path Selection
从障碍到礼仪:基于VLM的机器人社交导航路径选择
Abstract
Navigating socially in human environments requires more than satisfying geometric constraints, as collision-free paths may still interfere with ongoing activities or conflict with social norms. Addressing this challenge calls for analyzing interactions between agents and incorporating common-sense reasoning into planning. This paper presents a social robot navigation framework that integrates geometric planning with contextual social reasoning. The system first extracts obstacles and human dynamics to generate geometrically feasible candidate paths, then leverages a fine-tuned vision-language model (VLM) to evaluate these paths, informed by contextually grounded social expectations, selecting a socially optimized path for the controller. This task-specific VLM distills social reasoning from large foundation models into a smaller and efficient model, allowing the framework to perform real-time adaptation in diverse human-robot interaction contexts. Experiments in four social navigation contexts demonstrate that our method achieves the best overall performance with the lowest personal space violation duration, the minimal pedestrian-facing time, and no social zone intrusions. Project page: https://path-etiquette.github.io
Chinese Translation
在人类环境中进行社交导航不仅需要满足几何约束,因为无碰撞路径仍可能干扰正在进行的活动或与社会规范相冲突。应对这一挑战需要分析智能体之间的互动,并将常识推理纳入规划。本文提出了一种将几何规划与上下文社交推理相结合的社交机器人导航框架。系统首先提取障碍物和人类动态,以生成几何上可行的候选路径,然后利用经过微调的视觉-语言模型(VLM),以基于上下文的社会期望为依据评估这些路径,为控制器选择一条社交优化的路径。这个特定任务的VLM将大型基础模型中的社交推理蒸馏为一个更小且高效的模型,使框架能够在多样的人机交互情境中实时适应。在四个社交导航情境中的实验表明,我们的方法取得了最佳的整体性能:个人空间侵犯持续时间最短、正对行人的时间最少,且没有闯入社交区域。项目页面:https://path-etiquette.github.io
cs.RO / 78 / 2602.09013
Dexterous Manipulation Policies from RGB Human Videos via 4D Hand-Object Trajectory Reconstruction
通过4D手-物体轨迹重建从RGB人类视频中学习灵巧操作策略
Abstract
Multi-finger robotic hand manipulation and grasping are challenging due to the high-dimensional action space and the difficulty of acquiring large-scale training data. Existing approaches largely rely on human teleoperation with wearable devices or specialized sensing equipment to capture hand-object interactions, which limits scalability. In this work, we propose VIDEOMANIP, a device-free framework that learns dexterous manipulation directly from RGB human videos. Leveraging recent advances in computer vision, VIDEOMANIP reconstructs explicit 4D robot-object trajectories from monocular videos by estimating human hand poses, object meshes, and retargets the reconstructed human motions to robotic hands for manipulation learning. To make the reconstructed robot data suitable for dexterous manipulation training, we introduce hand-object contact optimization with interaction-centric grasp modeling, as well as a demonstration synthesis strategy that generates diverse training trajectories from a single video, enabling generalizable policy learning without additional robot demonstrations. In simulation, the learned grasping model achieves a 70.25% success rate across 20 diverse objects using the Inspire Hand. In the real world, manipulation policies trained from RGB videos achieve an average 62.86% success rate across seven tasks using the LEAP Hand, outperforming retargeting-based methods by 15.87%. Project videos are available at videomanip.github.io.
Chinese Translation
多指机器人手的操作和抓取因高维动作空间和获取大规模训练数据的困难而具有挑战性。现有方法在很大程度上依赖于使用可穿戴设备或专用传感设备的人类遥操作来捕捉手-物体交互,这限制了可扩展性。在本研究中,我们提出了VIDEOMANIP,一个无需设备的框架,能够直接从RGB人类视频中学习灵巧操作。VIDEOMANIP利用计算机视觉的最新进展,通过估计人类手部姿态和物体网格,从单目视频中重建显式的4D机器人-物体轨迹,并将重建的人类动作重定向到机器人手上进行操作学习。为了使重建的机器人数据适合灵巧操作训练,我们引入了带有以交互为中心的抓取建模的手-物体接触优化,以及一种能够从单个视频生成多样化训练轨迹的演示合成策略,从而无需额外的机器人演示即可实现可泛化的策略学习。在仿真中,学习到的抓取模型使用Inspire Hand在20种不同物体上实现了70.25%的成功率。在现实世界中,从RGB视频训练的操作策略使用LEAP Hand在七个任务中实现了平均62.86%的成功率,比基于重定向的方法高出15.87%。项目视频可在videomanip.github.io上查看。
cs.RO / 79 / 2602.09017
Contact-Anchored Policies: Contact Conditioning Creates Strong Robot Utility Models
接触锚定策略:接触条件化创建强大的机器人效用模型
Abstract
The prevalent paradigm in robot learning attempts to generalize across environments, embodiments, and tasks with language prompts at runtime. A fundamental tension limits this approach: language is often too abstract to guide the concrete physical understanding required for robust manipulation. In this work, we introduce Contact-Anchored Policies (CAP), which replace language conditioning with points of physical contact in space. Simultaneously, we structure CAP as a library of modular utility models rather than a monolithic generalist policy. This factorization allows us to implement a real-to-sim iteration cycle: we build EgoGym, a lightweight simulation benchmark, to rapidly identify failure modes and refine our models and datasets prior to real-world deployment. We show that by conditioning on contact and iterating via simulation, CAP generalizes to novel environments and embodiments out of the box on three fundamental manipulation skills while using only 23 hours of demonstration data, and outperforms large, state-of-the-art VLAs in zero-shot evaluations by 56%. All model checkpoints, codebase, hardware, simulation, and datasets will be open-sourced. Project page: https://cap-policy.github.io/
Chinese Translation
机器人学习中的普遍范式试图在运行时通过语言提示实现跨环境、跨具身形态和跨任务的泛化。一个根本性矛盾限制了这种方法:语言往往过于抽象,无法指导稳健操作所需的具体物理理解。在本研究中,我们提出了接触锚定策略(Contact-Anchored Policies, CAP),用空间中的物理接触点替代语言条件化。同时,我们将CAP构建为一个模块化效用模型库,而不是单一的通用策略。这种分解使我们能够实现真实到仿真的迭代循环:我们构建了EgoGym,一个轻量级的仿真基准,以快速识别失败模式,并在实际部署之前完善我们的模型和数据集。我们展示了通过接触条件化并借助仿真迭代,CAP仅使用23小时的演示数据,就能在三项基本操作技能上开箱即用地泛化到新的环境和具身形态,并在零样本评估中比大型最先进的视觉-语言-动作(VLA)模型高出56%。所有模型检查点、代码库、硬件、仿真和数据集都将开源。项目页面:https://cap-policy.github.io/
cs.RO / 80 / 2602.09018
Robustness Is a Function, Not a Number: A Factorized Comprehensive Study of OOD Robustness in Vision-Based Driving
鲁棒性是一个函数,而不是一个数字:基于视觉的驾驶中OOD鲁棒性的分解综合研究
Abstract
Out of distribution (OOD) robustness in autonomous driving is often reduced to a single number, hiding what breaks a policy. We decompose environments along five axes: scene (rural/urban), season, weather, time (day/night), and agent mix; and measure performance under controlled $k$-factor perturbations ($k \in \{0,1,2,3\}$). Using closed loop control in VISTA, we benchmark FC, CNN, and ViT policies, train compact ViT heads on frozen foundation-model (FM) features, and vary ID support in scale, diversity, and temporal context. (1) ViT policies are markedly more OOD-robust than comparably sized CNN/FC, and FM features yield state-of-the-art success at a latency cost. (2) Naive temporal inputs (multi-frame) do not beat the best single-frame baseline. (3) The largest single factor drops are rural $\rightarrow$ urban and day $\rightarrow$ night ($\sim 31\%$ each); actor swaps $\sim 10\%$, moderate rain $\sim 7\%$; season shifts can be drastic, and combining a time flip with other changes further degrades performance. (4) FM-feature policies stay above $85\%$ under three simultaneous changes; non-FM single-frame policies take a large first-shift hit, and all no-FM models fall below $50\%$ by three changes. (5) Interactions are non-additive: some pairings partially offset, whereas season-time combinations are especially harmful. (6) Training on winter/snow is most robust to single-factor shifts, while a rural+summer baseline gives the best overall OOD performance. (7) Scaling traces/views improves robustness ($+11.8$ points from $5$ to $14$ traces), yet targeted exposure to hard conditions can substitute for scale. (8) Using multiple ID environments broadens coverage and strengthens weak cases (urban OOD $60.6\% \rightarrow 70.1\%$) with a small ID drop; single-ID preserves peak performance but in a narrow domain. These results yield actionable design rules for OOD-robust driving policies.
Chinese Translation
在自动驾驶中,分布外(OOD)鲁棒性通常被简化为一个单一数字,这掩盖了导致策略失效的具体因素。我们沿五个轴分解环境:场景(乡村/城市)、季节、天气、时间(白天/夜晚)和交通参与者组合;并在受控的$k$因子扰动($k \in \{0,1,2,3\}$)下测量性能。通过在VISTA中进行闭环控制,我们对FC、CNN和ViT策略进行了基准测试,在冻结的基础模型(FM)特征上训练紧凑的ViT头,并在规模、多样性和时间上下文方面改变ID(分布内)数据支撑。(1) ViT策略的OOD鲁棒性明显优于同等规模的CNN/FC,而FM特征以延迟为代价取得了最先进的成功率。(2) 朴素的时间输入(多帧)未能超越最佳单帧基线。(3) 最大的单因子性能下降发生在乡村$\rightarrow$城市和白天$\rightarrow$夜晚(各约31%);参与者替换约10%,中等降雨约7%;季节变化可能十分剧烈,而时间翻转与其他变化的组合会进一步降低性能。(4) FM特征策略在三种同时变化下仍保持在85%以上;非FM单帧策略在第一次变化时即遭受重大下降,所有非FM模型在三次变化后均降至50%以下。(5) 交互是非加性的:某些组合部分抵消,而季节-时间组合尤其有害。(6) 在冬季/降雪条件下训练对单因子变化最为鲁棒,而乡村+夏季基线提供了最佳的整体OOD性能。(7) 扩展轨迹/视角数量提高了鲁棒性(从5条轨迹增加到14条轨迹提升11.8个百分点),但有针对性地暴露于困难条件可以替代规模扩展。(8) 使用多个ID环境扩大了覆盖范围并改善了薄弱情形(城市OOD从60.6%提升至70.1%),同时ID性能略有下降;单一ID保持峰值性能,但仅限于狭窄领域。这些结果为OOD鲁棒的驾驶策略提供了可操作的设计规则。
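The controlled $k$-factor protocol can be sketched directly: with five binary axes, the environments differing from the ID setting in exactly $k$ factors number $\binom{5}{k}$. The axis names and values below are simplified stand-ins for the paper's setup.

```python
from itertools import combinations, product

AXES = {"scene": ["rural", "urban"], "season": ["summer", "winter"],
        "weather": ["clear", "rain"], "time": ["day", "night"],
        "agents": ["sparse", "dense"]}
ID_ENV = {"scene": "rural", "season": "summer", "weather": "clear",
          "time": "day", "agents": "sparse"}

def k_factor_envs(k):
    """Enumerate all OOD environments that differ from the ID environment
    in exactly k of the five axes (the controlled k-factor perturbations)."""
    envs = []
    for axes in combinations(AXES, k):
        alt = [[v for v in AXES[a] if v != ID_ENV[a]] for a in axes]
        for values in product(*alt):
            env = dict(ID_ENV)
            env.update(zip(axes, values))
            envs.append(env)
    return envs

assert len(k_factor_envs(0)) == 1          # the ID environment itself
assert len(k_factor_envs(1)) == 5          # one axis flipped at a time
assert len(k_factor_envs(2)) == 10         # C(5, 2) pairs of flips
```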
cs.RO / 81 / 2602.09021
$\chi_{0}$: Resource-Aware Robust Manipulation via Taming Distributional Inconsistencies
$\chi_{0}$:通过驯服分布不一致性实现资源感知的鲁棒操控
Abstract
High-reliability long-horizon robotic manipulation has traditionally relied on large-scale data and compute to understand complex real-world dynamics. However, we identify that the primary bottleneck to real-world robustness is not resource scale alone, but the distributional shift among the human demonstration distribution, the inductive bias learned by the policy, and the test-time execution distribution -- a systematic inconsistency that causes compounding errors in multi-stage tasks. To mitigate these inconsistencies, we propose $\chi_{0}$, a resource-efficient framework with effective modules designated to achieve production-level robustness in robotic manipulation. Our approach builds off three technical pillars: (i) Model Arithmetic, a weight-space merging strategy that efficiently soaks up diverse distributions of different demonstrations, varying from object appearance to state variations; (ii) Stage Advantage, a stage-aware advantage estimator that provides stable, dense progress signals, overcoming the numerical instability of prior non-stage approaches; and (iii) Train-Deploy Alignment, which bridges the distribution gap via spatio-temporal augmentation, heuristic DAgger corrections, and temporal chunk-wise smoothing. $\chi_{0}$ enables two sets of dual-arm robots to collaboratively orchestrate long-horizon garment manipulation, spanning tasks from flattening, folding, to hanging different clothes. Our method exhibits high-reliability autonomy; we are able to run the system from arbitrary initial state for consecutive 24 hours non-stop. Experiments validate that $\chi_{0}$ surpasses the state-of-the-art $\pi_{0.5}$ in success rate by nearly 250%, with only 20-hour data and 8 A100 GPUs. Code, data and models will be released to facilitate the community.
Chinese Translation
高可靠性的长时程机器人操控传统上依赖于大规模数据和计算,以理解复杂的现实世界动态。然而,我们发现现实世界鲁棒性的主要瓶颈不仅在于资源规模,还在于人类示范分布、策略学习到的归纳偏差以及测试时执行分布之间的分布偏移——这种系统性的不一致性在多阶段任务中导致误差累积。为了减轻这些不一致性,我们提出了$\chi_{0}$,一个资源高效的框架,其各个有效模块专为实现机器人操控的生产级鲁棒性而设计。我们的方法建立在三个技术支柱之上:(i)模型算术(Model Arithmetic),一种权重空间合并策略,能够高效吸收不同示范的多样分布,从物体外观到状态变化;(ii)阶段优势(Stage Advantage),一种阶段感知的优势估计器,提供稳定、密集的进展信号,克服了先前非阶段方法的数值不稳定性;(iii)训练-部署对齐(Train-Deploy Alignment),通过时空增强、启发式 DAgger 修正和按时间块平滑来弥合分布差距。$\chi_{0}$使得两组双臂机器人能够协同完成长时程的服装操控,涵盖从抚平、折叠到悬挂不同衣物的任务。我们的方法展现出高可靠性的自主性:系统能够从任意初始状态连续不间断运行24小时。实验验证了$\chi_{0}$在成功率上超过最先进的$\pi_{0.5}$近250%,而仅使用20小时的数据和8块A100 GPU。代码、数据和模型将发布以促进社区发展。
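The simplest concrete form of weight-space "Model Arithmetic" is a (possibly weighted) parameter average across specialist checkpoints; the paper's merging strategy may be more elaborate, and the tiny two-parameter policies here are purely illustrative.

```python
import numpy as np

def merge_weights(state_dicts, coeffs=None):
    """Weight-space merging: combine specialist policies trained on different
    demonstration distributions by averaging their parameters key by key.
    (An unweighted or weighted mean is the simplest form of such merging.)"""
    n = len(state_dicts)
    coeffs = coeffs or [1.0 / n] * n
    keys = state_dicts[0].keys()
    return {k: sum(c * sd[k] for c, sd in zip(coeffs, state_dicts)) for k in keys}

policy_a = {"w": np.array([1.0, 0.0]), "b": np.array([2.0])}   # e.g. appearance variant
policy_b = {"w": np.array([0.0, 1.0]), "b": np.array([0.0])}   # e.g. state variant
merged = merge_weights([policy_a, policy_b])
assert np.allclose(merged["w"], [0.5, 0.5]) and np.allclose(merged["b"], [1.0])
```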
cs.RO / 82 / 2602.09023
TwinRL-VLA: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation
TwinRL-VLA:用于真实世界机器人操作的数字双胞胎驱动强化学习
Abstract
Despite strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and insufficient real-world interaction. While online reinforcement learning (RL) has shown promise in improving general foundation models, applying RL to VLA manipulation in real-world settings is still hindered by low exploration efficiency and a restricted exploration space. Through systematic real-world experiments, we observe that the effective exploration space of online RL is closely tied to the data distribution of supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a digital twin-real-world collaborative RL framework designed to scale and guide exploration for VLA models. First, a high-fidelity digital twin is efficiently reconstructed from smartphone-captured scenes, enabling realistic bidirectional transfer between real and simulated environments. During the SFT warm-up stage, we introduce an exploration space expansion strategy using digital twins to broaden the support of the data trajectory distribution. Building on this enhanced initialization, we propose a sim-to-real guided exploration strategy to further accelerate online RL. Specifically, TwinRL performs efficient and parallel online RL in the digital twin prior to deployment, effectively bridging the gap between offline and online training stages. Subsequently, we exploit efficient digital twin sampling to identify failure-prone yet informative configurations, which are used to guide targeted human-in-the-loop rollouts on the real robot. In our experiments, TwinRL approaches 100% success in both in-distribution regions covered by real-world demonstrations and out-of-distribution regions, delivering at least a 30% speedup over prior real-world RL methods and requiring only about 20 minutes on average across four tasks.
Chinese Translation
尽管视觉-语言-动作(VLA)模型具有强大的泛化能力,但仍受到专家演示高成本和真实世界交互不足的限制。虽然在线强化学习(RL)在改善通用基础模型方面显示出前景,但将RL应用于真实环境中的VLA操作仍然受到低探索效率和受限探索空间的阻碍。通过系统的真实世界实验,我们观察到在线RL的有效探索空间与监督微调(SFT)数据分布密切相关。基于这一观察,我们提出了TwinRL,一个数字双胞胎-真实世界协作的RL框架,旨在扩展和引导VLA模型的探索。首先,从智能手机捕捉的场景中高效重建高保真数字双胞胎,实现真实环境与模拟环境之间的真实双向转移。在SFT预热阶段,我们引入了一种利用数字双胞胎的探索空间扩展策略,以拓宽数据轨迹分布的支持。基于这一增强的初始化,我们提出了一种模拟到真实的引导探索策略,以进一步加速在线RL。具体而言,TwinRL在部署前在数字双胞胎中执行高效的并行在线RL,有效弥合了离线和在线训练阶段之间的差距。随后,我们利用高效的数字双胞胎采样来识别易出错但信息丰富的配置,这些配置用于指导真实机器人上的有针对性的人机协作展开。在我们的实验中,TwinRL在真实世界演示覆盖的分布内区域和分布外区域均接近100%的成功率,相较于之前的真实世界RL方法至少提高了30%的速度,并且在四个任务中平均仅需约20分钟。
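The failure-mining step described above (sampling cheaply in the digital twin to find failure-prone yet informative configurations for targeted real-robot rollouts) can be sketched in a few lines. Everything below — the 1-D object position `x` and the success model inside `twin_success_rate` — is a hypothetical stand-in, not the paper's simulator or policy:

```python
import random

random.seed(0)

def twin_success_rate(config, policy_quality=0.9, trials=20):
    # Stand-in for rolling out the current policy in the digital twin:
    # success decays as the object leaves the demonstration-covered
    # region around x = 0.5 (entirely hypothetical).
    p = max(0.05, policy_quality - abs(config["x"] - 0.5))
    return sum(random.random() < p for _ in range(trials)) / trials

# Cheaply sample candidate configurations in the twin ...
configs = [{"x": random.random()} for _ in range(200)]
scored = sorted((twin_success_rate(c), i) for i, c in enumerate(configs))

# ... and keep the most failure-prone ones for human-in-the-loop
# rollouts on the real robot.
hard_cases = [configs[i] for _, i in scored[:10]]
```

The key property is that the (cheap) twin absorbs the exploration cost, so the (expensive) real robot only sees the configurations worth correcting.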
cs.CV / 1 / 2602.07006
Scalable spatial point process models for forensic footwear analysis
用于法医鞋印分析的可扩展空间点过程模型
Abstract
Shoe print evidence recovered from crime scenes plays a key role in forensic investigations. By examining shoe prints, investigators can determine details of the footwear worn by suspects. However, establishing that a suspect's shoes match the make and model of a crime scene print may not be sufficient. Typically, thousands of shoes of the same size, make, and model are manufactured, any of which could be responsible for the print. Accordingly, a popular approach used by investigators is to examine the print for signs of "accidentals," i.e., cuts, scrapes, and other features that accumulate on shoe soles after purchase due to wear. While some patterns of accidentals are common on certain types of shoes, others are highly distinctive, potentially distinguishing the suspect's shoe from all others. Quantifying the rarity of a pattern is thus essential to accurately measuring the strength of forensic evidence. In this study, we address this task by developing a hierarchical Bayesian model. Our improvement over existing methods primarily stems from two advancements. First, we frame our approach in terms of a latent Gaussian model, thus enabling inference to be efficiently scaled to large collections of annotated shoe prints via integrated nested Laplace approximations. Second, we incorporate spatially varying coefficients to model the relationship between shoes' tread patterns and accidental locations. We demonstrate these improvements through superior performance on held-out data, which enhances accuracy and reliability in forensic shoe print analysis.
Chinese Translation
从犯罪现场回收的鞋印证据在法医调查中发挥着关键作用。通过检查鞋印,调查人员可以确定嫌疑人所穿鞋类的细节。然而,确定嫌疑人的鞋子与犯罪现场印记的品牌和型号相匹配可能并不足够。通常,制造的同一尺寸、品牌和型号的鞋子数量达到数千双,任何一双都有可能是印记的来源。因此,调查人员常用的一种方法是检查印记是否有“意外”迹象,即在购买后由于磨损而在鞋底上积累的切口、刮痕和其他特征。虽然某些类型鞋子的意外模式是常见的,但其他模式则具有高度的独特性,可能将嫌疑人的鞋子与其他鞋子区分开来。因此,量化某一模式的稀有性对于准确衡量法医证据的强度至关重要。在本研究中,我们通过开发一个层次贝叶斯模型来解决这一任务。我们对现有方法的改进主要源于两个方面的进展。首先,我们将方法框架设定为潜在高斯模型,从而使推断能够通过集成嵌套拉普拉斯近似有效地扩展到大量标注的鞋印集合。其次,我们引入空间变化系数,以建模鞋子的鞋底花纹与意外位置之间的关系。我们通过在保留数据上的优越表现来展示这些改进,从而提高了法医鞋印分析的准确性和可靠性。
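A toy numpy version of the core modeling idea — a log-linear point-process intensity whose tread-pattern coefficient varies over the sole — may help fix notation. The grid, covariate, and coefficient surface below are invented for illustration; the paper's actual model is a hierarchical Bayesian latent Gaussian model fit with integrated nested Laplace approximations:

```python
import numpy as np

# Toy 20x20 grid over a shoe sole (entirely hypothetical setup).
n = 20
xs, ys = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))

# Binary covariate: 1 where a tread edge is present (invented pattern).
tread_edge = ((xs * 10).astype(int) % 2 == 0).astype(float)

# Spatially varying coefficient beta(s): tread edges attract accidentals
# more strongly near the toe (large y) than near the heel.
beta = 0.5 + 1.5 * ys

# Log-linear intensity of the accidental point process:
#   log lambda(s) = alpha + beta(s) * x(s)
alpha = -2.0
lam = np.exp(alpha + beta * tread_edge)

# Expected accidental count per grid cell: rarer patterns have lower
# intensity, which is what feeds the evidential-strength calculation.
cell_area = (1.0 / n) ** 2
expected_counts = lam * cell_area
```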
cs.CV / 2 / 2602.07008
Where Not to Learn: Prior-Aligned Training with Subset-based Attribution Constraints for Reliable Decision-Making
学习的禁区:基于子集归因约束的先验对齐训练方法以实现可靠决策
Abstract
Reliable models should not only predict correctly, but also justify decisions with acceptable evidence. Yet conventional supervised learning typically provides only class-level labels, allowing models to achieve high accuracy through shortcut correlations rather than the intended evidence. Human priors can help constrain such behavior, but aligning models to these priors remains challenging because learned representations often diverge from human perception. To address this challenge, we propose an attribution-based human prior alignment method. We encode human priors as input regions that the model is expected to rely on (e.g., bounding boxes), and leverage a highly faithful subset-selection-based attribution approach to expose the model's decision evidence during training. When the attribution region deviates substantially from the prior regions, we penalize reliance on off-prior evidence, encouraging the model to shift its attribution toward the intended regions. This is achieved through a training objective that imposes attribution constraints induced by the human prior. We validate our method on both image classification and click decision tasks in MLLM-based GUI agent models. Across conventional classification and autoregressive generation settings, human prior alignment consistently improves task accuracy while also enhancing the model's decision reasonability.
Chinese Translation
可靠的模型不仅应能正确预测,还应以可接受的证据来证明其决策。然而,传统的监督学习通常仅提供类级标签,使得模型通过捷径相关性而非预期证据来实现高准确率。人类先验可以帮助约束这种行为,但将模型与这些先验对齐仍然具有挑战性,因为学习到的表示往往与人类感知相悖。为了解决这一挑战,我们提出了一种基于归因的人类先验对齐方法。我们将人类先验编码为模型预期依赖的输入区域(例如,边界框),并利用一种高度忠实的基于子集选择的归因方法,在训练过程中揭示模型的决策证据。当归因区域与先验区域显著偏离时,我们会惩罚对非先验证据的依赖,鼓励模型将其归因转向预期区域。这是通过施加由人类先验引发的归因约束的训练目标来实现的。我们在基于MLLM的GUI代理模型的图像分类和点击决策任务上验证了我们的方法。在传统分类和自回归生成设置中,人类先验对齐始终提高了任务准确性,同时增强了模型的决策合理性。
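The penalty at the heart of the method — discouraging attribution mass outside the human-prior region — reduces to a small computation. The saliency map, mask, and loss weight below are hypothetical, and the paper pairs this idea with a subset-selection-based attribution method rather than a raw saliency map:

```python
import numpy as np

def off_prior_penalty(attribution, prior_mask, eps=1e-8):
    """Fraction of non-negative attribution mass that falls outside the
    human-prior region; added to the task loss during training."""
    attribution = np.clip(attribution, 0.0, None)
    total = attribution.sum() + eps
    off_region = (attribution * (1.0 - prior_mask)).sum()
    return float(off_region / total)

# Invented 4x4 saliency map and bounding-box prior mask.
attr = np.array([[0.0, 0.1, 0.0, 0.0],
                 [0.0, 0.6, 0.2, 0.0],
                 [0.0, 0.0, 0.1, 0.0],
                 [0.0, 0.0, 0.0, 0.0]])
prior = np.zeros((4, 4))
prior[1:3, 1:3] = 1.0        # the region the model should rely on

penalty = off_prior_penalty(attr, prior)

# Prior-aligned objective (lambda_attr is a hypothetical weight):
task_loss = 0.3
lambda_attr = 1.0
total_loss = task_loss + lambda_attr * penalty
```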
cs.CV / 3 / 2602.07011
MAU-GPT: Enhancing Multi-type Industrial Anomaly Understanding via Anomaly-aware and Generalist Experts Adaptation
MAU-GPT:通过异常感知和通用专家适应增强多类型工业异常理解
Abstract
As industrial manufacturing scales, automating fine-grained product image analysis has become critical for quality control. However, existing approaches are hindered by limited dataset coverage and poor model generalization across diverse and complex anomaly patterns. To address these challenges, we introduce MAU-Set, a comprehensive dataset for Multi-type industrial Anomaly Understanding. It spans multiple industrial domains and features a hierarchical task structure, ranging from binary classification to complex reasoning. Alongside this dataset, we establish a rigorous evaluation protocol to facilitate fair and comprehensive model assessment. Building upon this foundation, we further present MAU-GPT, a domain-adapted multimodal large model specifically designed for industrial anomaly understanding. It incorporates a novel AMoE-LoRA mechanism that unifies anomaly-aware and generalist experts adaptation, enhancing both understanding and reasoning across diverse defect classes. Extensive experiments show that MAU-GPT consistently outperforms prior state-of-the-art methods across all domains, demonstrating strong potential for scalable and automated industrial inspection.
Chinese Translation
随着工业制造规模的扩大,自动化的细粒度产品图像分析已成为质量控制的关键。然而,现有方法受到数据集覆盖范围有限和模型在多样化及复杂异常模式下泛化能力差的限制。为了解决这些挑战,我们引入了MAU-Set,这是一个针对多类型工业异常理解的综合数据集。该数据集涵盖多个工业领域,并具有从二分类到复杂推理的层次任务结构。与此同时,我们建立了一个严格的评估协议,以促进公平和全面的模型评估。在此基础上,我们进一步提出了MAU-GPT,这是一种专门为工业异常理解设计的领域适应多模态大模型。它结合了一种新颖的AMoE-LoRA机制,统一了异常感知和通用专家的适应,增强了对多样缺陷类别的理解和推理能力。大量实验表明,MAU-GPT在所有领域中始终优于先前的最先进方法,显示出在可扩展和自动化工业检查中的强大潜力。
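As a rough sketch of the general pattern AMoE-LoRA builds on — mixing a generalist and an anomaly-aware low-rank expert on top of a frozen weight — with all shapes and gate weights hypothetical (the paper's actual mechanism and gating are richer than this):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                       # hidden size and LoRA rank (toy values)

W = rng.normal(size=(d, d))       # frozen base weight

# Two low-rank experts, each a (B, A) pair so that delta W = B @ A.
experts = {
    "generalist": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "anomaly":    (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def forward(x, gate):
    """gate: expert name -> mixing weight (summing to 1)."""
    delta = sum(g * (B @ A) for name, g in gate.items()
                for B, A in [experts[name]])
    return x @ (W + delta).T

x = rng.normal(size=(1, d))
y_normal = forward(x, {"generalist": 0.9, "anomaly": 0.1})
y_defect = forward(x, {"generalist": 0.3, "anomaly": 0.7})
```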
cs.CV / 4 / 2602.07012
A General Model for Retinal Segmentation and Quantification
一种通用的视网膜分割与量化模型
Abstract
Retinal imaging is fast, non-invasive, and widely available, offering quantifiable structural and vascular signals for ophthalmic and systemic health assessment. This accessibility creates an opportunity to study how quantitative retinal phenotypes relate to ocular and systemic diseases. However, such analyses remain difficult at scale due to the limited availability of public multi-label datasets and the lack of a unified segmentation-to-quantification pipeline. We present RetSAM, a general retinal segmentation and quantification framework for fundus imaging. It delivers robust multi-target segmentation and standardized biomarker extraction, supporting downstream ophthalmologic studies and oculomics correlation analyses. Trained on over 200,000 fundus images, RetSAM supports three task categories and segments five anatomical structures, four retinal phenotypic patterns, and more than 20 distinct lesion types. It converts these segmentation results into over 30 standardized biomarkers that capture structural morphology, vascular geometry, and degenerative changes. Trained with a multi-stage strategy using both private and public fundus data, RetSAM achieves superior segmentation performance on 17 public datasets. It improves on prior best methods by 3.9 percentage points in DSC on average, with up to 15 percentage points on challenging multi-task benchmarks, and generalizes well across diverse populations, imaging devices, and clinical settings. The resulting biomarkers enable systematic correlation analyses across major ophthalmic diseases, including diabetic retinopathy, age-related macular degeneration, glaucoma, and pathologic myopia. Together, RetSAM transforms fundus images into standardized, interpretable quantitative phenotypes, enabling large-scale ophthalmic research and translation.
Chinese Translation
视网膜成像快速、无创且广泛可用,提供可量化的结构和血管信号,用于眼科和全身健康评估。这种可及性为研究定量视网膜表型与眼部及全身疾病之间的关系创造了机会。然而,由于公共多标签数据集的有限可用性以及缺乏统一的分割到量化的流程,这类分析在大规模应用中仍然困难。我们提出了RetSAM,一个用于眼底成像的通用视网膜分割与量化框架。它提供强大的多目标分割和标准化生物标志物提取,支持下游眼科研究和眼科组学相关分析。RetSAM在超过200,000幅眼底图像上进行训练,支持三类任务,能够分割五种解剖结构、四种视网膜表型模式以及超过20种不同的病变类型。它将这些分割结果转换为超过30种标准化生物标志物,捕捉结构形态、血管几何和退行性变化。RetSAM采用多阶段策略,结合私有和公共眼底数据进行训练,在17个公共数据集上实现了优越的分割性能。与之前的最佳方法相比,平均提高了3.9个百分点的Dice相似系数(DSC),在具有挑战性的多任务基准上提高了多达15个百分点,并且在不同人群、成像设备和临床环境中具有良好的泛化能力。生成的生物标志物使得对主要眼科疾病(包括糖尿病视网膜病变、年龄相关性黄斑变性、青光眼和病理性近视)进行系统的相关分析成为可能。总之,RetSAM将眼底图像转化为标准化、可解释的定量表型,促进大规模眼科研究和转化。
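Turning segmentation masks into standardized biomarkers is conceptually simple; two toy examples follow (area fraction as a stand-in for vessel density, connected-component counting as a stand-in for a lesion count). The masks and biomarker definitions are illustrative, not RetSAM's actual 30+ biomarkers:

```python
import numpy as np

def vessel_density(mask):
    """Area fraction covered by a vessel mask - a stand-in for one of
    the standardized vascular biomarkers."""
    return float(mask.mean())

def lesion_count(mask):
    """Number of 4-connected components - a stand-in for a lesion-count
    biomarker, via an explicit flood fill (no SciPy needed)."""
    mask = mask.astype(bool).copy()
    h, w = mask.shape
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j]:
                count += 1
                stack = [(i, j)]
                while stack:                      # erase this component
                    a, b = stack.pop()
                    if 0 <= a < h and 0 <= b < w and mask[a, b]:
                        mask[a, b] = False
                        stack += [(a + 1, b), (a - 1, b),
                                  (a, b + 1), (a, b - 1)]
    return count

# Invented 6x6 masks.
vessels = np.zeros((6, 6)); vessels[:, 2] = 1.0     # one vessel column
lesions = np.zeros((6, 6)); lesions[0, 0] = 1.0; lesions[4:6, 4:6] = 1.0

density = vessel_density(vessels)
n_lesions = lesion_count(lesions)
```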
cs.CV / 5 / 2602.07013
Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models
拒绝的引导:通过激活引导实现视觉语言模型中的可配置拒绝
Abstract
With the rapid advancement of Vision Language Models (VLMs), refusal mechanisms have become a critical component for ensuring responsible and safe model behavior. However, existing refusal strategies are largely \textit{one-size-fits-all} and fail to adapt to diverse user needs and contextual constraints, leading to either under-refusal or over-refusal. In this work, we first examine these challenges and then develop \textbf{C}onfigurable \textbf{R}efusal in \textbf{VLM}s (\textbf{CR-VLM}), a robust and efficient approach for {\em configurable} refusal based on activation steering. CR-VLM consists of three integrated components: (1) extracting a configurable refusal vector via a teacher-forced mechanism to amplify the refusal signal; (2) introducing a gating mechanism that mitigates over-refusal by preserving acceptance for in-scope queries; and (3) designing a counterfactual vision enhancement module that aligns visual representations with refusal requirements. Comprehensive experiments across multiple datasets and various VLMs demonstrate that CR-VLM achieves effective, efficient, and robust configurable refusals, offering a scalable path toward user-adaptive safety alignment in VLMs.
Chinese Translation
随着视觉语言模型(VLMs)的快速发展,拒绝机制已成为确保模型行为负责任和安全的重要组成部分。然而,现有的拒绝策略大多是“一刀切”的,无法适应多样化的用户需求和上下文限制,导致拒绝不足或拒绝过度。在本研究中,我们首先探讨了上述挑战,并开发了可配置拒绝(Configurable Refusal in VLMs,CR-VLM),这是一种基于激活引导的稳健高效的可配置拒绝方法。CR-VLM由三个集成组件组成:(1)通过教师强制机制提取可配置拒绝向量,以增强拒绝信号;(2)引入一个门控机制,通过保留对范围内查询的接受来减轻拒绝过度;(3)设计一个反事实视觉增强模块,使视觉表示与拒绝要求对齐。针对多个数据集和各种VLM的综合实验表明,CR-VLM实现了有效、高效且稳健的可配置拒绝,为VLM中的用户自适应安全对齐提供了一条可扩展的路径。
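A minimal sketch of the activation-steering idea, using a plain difference-of-means refusal direction plus a binary gate (the paper's teacher-forced vector extraction and gating mechanism are more elaborate); all dimensions and data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # toy hidden size

# Hypothetical hidden states from refused vs. answered prompts.
h_refuse = rng.normal(loc=1.0, size=(50, d))
h_accept = rng.normal(loc=-1.0, size=(50, d))

# Difference-of-means steering direction (a common recipe; the paper's
# teacher-forced extraction amplifies the refusal signal further).
v_refuse = h_refuse.mean(axis=0) - h_accept.mean(axis=0)
v_refuse /= np.linalg.norm(v_refuse)

def steer(h, alpha, in_scope):
    """Add the refusal direction, gated off for in-scope queries so that
    acceptance is preserved (mirroring the gating component)."""
    if in_scope:
        return h
    return h + alpha * v_refuse

h = rng.normal(size=d)
h_blocked = steer(h, alpha=4.0, in_scope=False)   # pushed toward refusal
h_allowed = steer(h, alpha=4.0, in_scope=True)    # left untouched
```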
cs.CV / 6 / 2602.07014
Vectra: A New Metric, Dataset, and Model for Visual Quality Assessment in E-Commerce In-Image Machine Translation
Vectra:用于电子商务图像内机器翻译的视觉质量评估的新指标、数据集和模型
Abstract
In-Image Machine Translation (IIMT) powers cross-border e-commerce product listings; existing research focuses on machine translation evaluation, while visual rendering quality is critical for user engagement. When facing context-dense product imagery and multimodal defects, current reference-based methods (e.g., SSIM, FID) lack explainability, while model-as-judge approaches lack domain-grounded, fine-grained reward signals. To bridge this gap, we introduce Vectra, to the best of our knowledge, the first reference-free, MLLM-driven visual quality assessment framework for e-commerce IIMT. Vectra comprises three components: (1) Vectra Score, a multidimensional quality metric system that decomposes visual quality into 14 interpretable dimensions, with spatially-aware Defect Area Ratio (DAR) quantification to reduce annotation ambiguity; (2) Vectra Dataset, constructed from 1.1M real-world product images via diversity-aware sampling, comprising a 2K benchmark for system evaluation, 30K reasoning-based annotations for instruction tuning, and 3.5K expert-labeled preferences for alignment and evaluation; and (3) Vectra Model, a 4B-parameter MLLM that generates both quantitative scores and diagnostic reasoning. Experiments demonstrate that Vectra achieves state-of-the-art correlation with human rankings, and our model outperforms leading MLLMs, including GPT-5 and Gemini-3, in scoring performance. The dataset and model will be released upon acceptance.
Chinese Translation
图像内机器翻译(IIMT)推动跨境电子商务产品列表的生成;现有研究主要集中在机器翻译评估上,而视觉渲染质量对用户参与度至关重要。在面对上下文密集的产品图像和多模态缺陷时,当前的基于参考的方法(如 SSIM、FID)缺乏可解释性,而模型作为评判者的方法则缺乏基于领域的细粒度奖励信号。为了解决这一问题,我们提出了 Vectra,尽我们所知,这是第一个无参考、基于 MLLM 的电子商务 IIMT 视觉质量评估框架。Vectra 包含三个组成部分:(1)Vectra Score,一个多维质量指标系统,将视觉质量分解为 14 个可解释的维度,并通过空间感知的缺陷区域比率(Defect Area Ratio, DAR)量化来减少注释模糊性;(2)Vectra Dataset,从 110 万个真实产品图像中通过多样性感知采样构建,包含 2000 个系统评估基准、30000 个基于推理的注释用于指令调优,以及 3500 个专家标注的偏好用于对齐和评估;(3)Vectra Model,一个拥有 40 亿参数的 MLLM,能够生成定量评分和诊断推理。实验表明,Vectra 在与人类排名的相关性方面达到了最先进的水平,并且我们的模型在评分性能上超越了领先的 MLLM,包括 GPT-5 和 Gemini-3。数据集和模型将在接受后发布。
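The Defect Area Ratio (DAR) component can be illustrated with a toy bitmap computation; the box coordinates, image size, and score discount below are hypothetical, and the paper's spatially-aware DAR is defined over annotated defect regions rather than this stand-in:

```python
def defect_area_ratio(defect_boxes, image_w, image_h):
    """Defect pixels / image pixels. Boxes are (x0, y0, x1, y1) in pixel
    coordinates; overlapping boxes are counted once via a coverage set."""
    covered = set()
    for x0, y0, x1, y1 in defect_boxes:
        for x in range(max(0, x0), min(image_w, x1)):
            for y in range(max(0, y0), min(image_h, y1)):
                covered.add((x, y))
    return len(covered) / (image_w * image_h)

# Hypothetical 100x100 rendered listing image with two overlapping
# defect regions.
boxes = [(0, 0, 10, 10), (5, 5, 15, 15)]
dar = defect_area_ratio(boxes, 100, 100)

# A per-dimension quality score could then be discounted by DAR
# (the weight 2.0 is hypothetical):
dimension_score = max(0.0, 1.0 - 2.0 * dar)
```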
cs.CV / 7 / 2602.07015
Robust and Real-Time Bangladeshi Currency Recognition: A Dual-Stream MobileNet and EfficientNet Approach
鲁棒且实时的孟加拉国货币识别:双流MobileNet和EfficientNet方法
Abstract
Accurate currency recognition is essential for assistive technologies, particularly for visually impaired individuals who rely on others to identify banknotes. This dependency puts them at risk of fraud and exploitation. To address these challenges, we first build a new Bangladeshi banknote dataset that includes both controlled and real-world scenarios, ensuring a more comprehensive and diverse representation. Next, to enhance the dataset's robustness, we incorporate four additional datasets, including public benchmarks, to cover various complexities and improve the model's generalization. To overcome the limitations of current recognition models, we propose a novel hybrid CNN architecture that combines MobileNetV3-Large and EfficientNetB0 for efficient feature extraction. This is followed by an effective multilayer perceptron (MLP) classifier to improve performance while keeping computational costs low, making the system suitable for resource-constrained devices. The experimental results show that the proposed model achieves 97.95% accuracy on controlled datasets, 92.84% on complex backgrounds, and 94.98% accuracy when combining all datasets. The model's performance is thoroughly evaluated using five-fold cross-validation and seven metrics: accuracy, precision, recall, F1-score, Cohen's Kappa, MCC, and AUC. Additionally, explainable AI methods like LIME and SHAP are incorporated to enhance transparency and interpretability.
Chinese Translation
准确的货币识别对于辅助技术至关重要,特别是对于依赖他人识别纸币的视觉障碍人士。这种依赖使他们面临欺诈和剥削的风险。为了解决这些挑战,我们首先构建了一个新的孟加拉国纸币数据集,该数据集包括受控和真实场景,确保更全面和多样化的代表性。接下来,为了增强数据集的鲁棒性,我们结合了四个额外的数据集,包括公共基准,以涵盖各种复杂性并提高模型的泛化能力。为克服当前识别模型的局限性,我们提出了一种新颖的混合卷积神经网络(CNN)架构,结合了MobileNetV3-Large和EfficientNetB0,以实现高效的特征提取。随后,采用有效的多层感知器(MLP)分类器来提高性能,同时保持低计算成本,使系统适合资源受限的设备。实验结果表明,所提模型在受控数据集上的准确率达到97.95%,在复杂背景下为92.84%,并且在结合所有数据集时的准确率为94.98%。模型的性能通过五折交叉验证和七个指标(准确率、精确率、召回率、F1-score、Cohen's Kappa、MCC和AUC)进行了全面评估。此外,还结合了LIME和SHAP等可解释人工智能方法,以增强透明度和可解释性。
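Two of the seven reported metrics, accuracy and Cohen's Kappa, follow directly from a confusion matrix; a small worked example with an invented 3-class matrix (not the paper's results):

```python
import numpy as np

def cohens_kappa(cm):
    """Cohen's kappa from a confusion matrix (rows = true, cols = pred):
    observed agreement corrected for agreement expected by chance."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    po = np.trace(cm) / n                           # observed accuracy
    pe = (cm.sum(axis=1) @ cm.sum(axis=0)) / n**2   # chance agreement
    return (po - pe) / (1.0 - pe)

# Invented 3-denomination confusion matrix (50 test images per class).
cm = [[48, 1, 1],
      [2, 46, 2],
      [0, 3, 47]]

acc = float(np.trace(np.array(cm)) / np.sum(cm))
kappa = float(cohens_kappa(cm))
```

Kappa is lower than raw accuracy whenever chance agreement is non-trivial, which is why it is reported alongside accuracy.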
cs.CV / 8 / 2602.07016
Gaussian-Constrained LeJEPA Representations for Unsupervised Scene Discovery and Pose Consistency
高斯约束的LeJEPA表示用于无监督场景发现和姿态一致性
Abstract
Unsupervised 3D scene reconstruction from unstructured image collections remains a fundamental challenge in computer vision, particularly when images originate from multiple unrelated scenes and contain significant visual ambiguity. The Image Matching Challenge 2025 (IMC2025) highlights these difficulties by requiring both scene discovery and camera pose estimation under real-world conditions, including outliers and mixed content. This paper investigates the application of Gaussian-constrained representations inspired by LeJEPA (Joint Embedding Predictive Architecture) to address these challenges. We present three progressively refined pipelines, culminating in a LeJEPA-inspired approach that enforces isotropic Gaussian constraints on learned image embeddings. Rather than introducing new theoretical guarantees, our work empirically evaluates how these constraints influence clustering consistency and pose estimation robustness in practice. Experimental results on IMC2025 demonstrate that Gaussian-constrained embeddings can improve scene separation and pose plausibility compared to heuristic-driven baselines, particularly in visually ambiguous settings. These findings suggest that theoretically motivated representation constraints offer a promising direction for bridging self-supervised learning principles and practical structure-from-motion pipelines.
Chinese Translation
从非结构化图像集合中进行无监督的3D场景重建仍然是计算机视觉中的一个基本挑战,尤其是在图像来自多个不相关场景且包含显著视觉模糊的情况下。2025年图像匹配挑战赛(IMC2025)通过要求在真实世界条件下进行场景发现和相机姿态估计,突显了这些困难,包括异常值和混合内容。本文研究了受LeJEPA(联合嵌入预测架构)启发的高斯约束表示在应对这些挑战中的应用。我们提出了三个逐步精炼的流程,最终形成了一种LeJEPA启发的方法,该方法对学习到的图像嵌入施加各向同性高斯约束。我们的工作并未引入新的理论保证,而是通过实证评估这些约束如何在实践中影响聚类一致性和姿态估计的鲁棒性。在IMC2025上的实验结果表明,与基于启发式的方法相比,高斯约束的嵌入可以改善场景分离和姿态合理性,特别是在视觉模糊的环境中。这些发现表明,理论驱动的表示约束为桥接自监督学习原则与实际运动重建管道提供了一个有前景的方向。
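The isotropic-Gaussian constraint can be probed with a simple statistic: how far the embedding covariance is from the nearest isotropic covariance. The penalty below is one plausible formulation in the spirit of such constraints, not LeJEPA's exact objective:

```python
import numpy as np

rng = np.random.default_rng(0)

def anisotropy_penalty(z):
    """Frobenius distance from the embedding covariance to the nearest
    isotropic covariance sigma^2 * I: zero iff the (centered) cloud is
    perfectly isotropic, large when it is stretched along some axis."""
    z = z - z.mean(axis=0)
    cov = (z.T @ z) / (len(z) - 1)
    sigma2 = np.trace(cov) / cov.shape[0]
    return float(np.linalg.norm(cov - sigma2 * np.eye(cov.shape[0])))

iso = rng.normal(size=(500, 8))                     # roughly isotropic
aniso = iso * np.array([5.0, 1, 1, 1, 1, 1, 1, 1])  # stretched axis

p_iso = anisotropy_penalty(iso)
p_aniso = anisotropy_penalty(aniso)
```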
cs.CV / 9 / 2602.07017
XAI-CLIP: ROI-Guided Perturbation Framework for Explainable Medical Image Segmentation in Multimodal Vision-Language Models
XAI-CLIP:用于多模态视觉-语言模型的ROI引导扰动框架,以实现可解释的医学图像分割
Abstract
Medical image segmentation is a critical component of clinical workflows, enabling accurate diagnosis, treatment planning, and disease monitoring. However, despite the superior performance of transformer-based models over convolutional architectures, their limited interpretability remains a major obstacle to clinical trust and deployment. Existing explainable artificial intelligence (XAI) techniques, including gradient-based saliency methods and perturbation-based approaches, are often computationally expensive, require numerous forward passes, and frequently produce noisy or anatomically irrelevant explanations. To address these limitations, we propose XAI-CLIP, an ROI-guided perturbation framework that leverages multimodal vision-language model embeddings to localize clinically meaningful anatomical regions and guide the explanation process. By integrating language-informed region localization with medical image segmentation and applying targeted, region-aware perturbations, the proposed method generates clearer, boundary-aware saliency maps while substantially reducing computational overhead. Experiments conducted on the FLARE22 and CHAOS datasets demonstrate that XAI-CLIP achieves up to a 60% reduction in runtime, a 44.6% improvement in Dice score, and a 96.7% increase in Intersection-over-Union for occlusion-based explanations compared to conventional perturbation methods. Qualitative results further confirm cleaner and more anatomically consistent attribution maps with fewer artifacts, highlighting that the incorporation of multimodal vision-language representations into perturbation-based XAI frameworks significantly enhances both interpretability and efficiency, thereby enabling transparent and clinically deployable medical image segmentation systems.
Chinese Translation
医学图像分割是临床工作流程中的关键组成部分,能够实现准确的诊断、治疗规划和疾病监测。然而,尽管基于变换器的模型在性能上优于卷积架构,其有限的可解释性仍然是临床信任和部署的主要障碍。现有的可解释人工智能(XAI)技术,包括基于梯度的显著性方法和基于扰动的方法,通常计算开销较大,需要多次前向传递,并且经常产生噪声或解剖学上不相关的解释。为了解决这些局限性,我们提出了XAI-CLIP,一种ROI引导的扰动框架,利用多模态视觉-语言模型嵌入来定位临床上有意义的解剖区域并指导解释过程。通过将语言信息驱动的区域定位与医学图像分割相结合,并应用有针对性的区域感知扰动,所提出的方法生成了更清晰、边界感知的显著性图,同时显著减少了计算开销。在FLARE22和CHAOS数据集上的实验表明,与传统扰动方法相比,XAI-CLIP在运行时间上减少了最多60%,在Dice得分上提高了44.6%,在基于遮挡的解释中,Intersection-over-Union提高了96.7%。定性结果进一步确认了更干净且解剖学一致的归因图,具有更少的伪影,突显了将多模态视觉-语言表示纳入基于扰动的XAI框架显著增强了解释性和效率,从而实现透明且可临床部署的医学图像分割系统。
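ROI-guided occlusion is easy to sketch: perturb patches only inside the localized region and record the score drop, instead of sweeping the whole image. The toy image, ROI, and scoring function below are invented; in the paper the ROI comes from vision-language localization and the score from a segmentation model:

```python
import numpy as np

def occlusion_map(image, score_fn, roi, patch=2):
    """Occlude patches only inside the ROI (r0, c0, r1, c1) and record
    the score drop; cells outside the ROI are never perturbed, which is
    where the runtime saving over full-image occlusion comes from."""
    base = score_fn(image)
    sal = np.zeros(image.shape)
    r0, c0, r1, c1 = roi
    for r in range(r0, r1, patch):
        for c in range(c0, c1, patch):
            occluded = image.copy()
            occluded[r:r + patch, c:c + patch] = 0.0
            sal[r:r + patch, c:c + patch] = base - score_fn(occluded)
    return sal

# Toy "scan" with a bright 4x4 organ, and a stand-in model whose score
# is the organ region's mean intensity.
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0
score = lambda x: float(x[2:6, 2:6].mean())

sal = occlusion_map(img, score, roi=(2, 2, 6, 6))
```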
cs.CV / 10 / 2602.07019
Deep Learning Based Multi-Level Classification for Aviation Safety
基于深度学习的航空安全多级分类
Abstract
Bird strikes pose a significant threat to aviation safety, often resulting in loss of life, severe aircraft damage, and substantial financial costs. Existing bird strike prevention strategies primarily rely on avian radar systems that detect and track birds in real time. A major limitation of these systems is their inability to identify bird species, an essential factor, as different species exhibit distinct flight behaviors and altitudinal preferences. To address this challenge, we propose an image-based bird classification framework using Convolutional Neural Networks (CNNs), designed to work with camera systems for autonomous visual detection. The CNN is designed to identify bird species and provide critical input to species-specific predictive models for accurate flight path prediction. In addition to species identification, we implemented dedicated CNN classifiers to estimate flock formation type and flock size. These characteristics provide valuable supplementary information for aviation safety. Specifically, flock type and size offer insights into collective flight behavior and trajectory dispersion. Flock size directly relates to the potential impact severity, as the overall damage risk increases with the combined kinetic energy of multiple birds.
Chinese Translation
鸟撞击对航空安全构成了重大威胁,常常导致人员伤亡、飞机严重损坏以及巨大的经济损失。现有的鸟撞击预防策略主要依赖于能够实时探测和追踪鸟类的鸟类雷达系统。这些系统的一个主要局限性是无法识别鸟类物种,而物种识别是一个重要因素,因为不同物种表现出不同的飞行行为和高度偏好。为了解决这一挑战,我们提出了一种基于图像的鸟类分类框架,使用卷积神经网络(CNN),旨在与相机系统结合,实现自主视觉检测。该CNN旨在识别鸟类物种,并为物种特定的预测模型提供关键输入,以实现准确的飞行路径预测。除了物种识别之外,我们还实现了专门的CNN分类器来估计鸟群的形成类型和群体大小。这些特征为航空安全提供了有价值的补充信息。具体而言,鸟群类型和大小提供了关于集体飞行行为和轨迹分散的见解。鸟群大小与潜在的撞击严重性直接相关,因为随着多只鸟的动能总和增加,整体损害风险也随之上升。
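The kinetic-energy point in the last sentence is just n · ½mv² summed over the flock; a quick illustration with invented mass and speed values (not species data):

```python
def flock_kinetic_energy(n_birds, mass_kg, speed_ms):
    """Combined kinetic energy of a flock in joules: n * 1/2 * m * v^2.
    A uniform mass and speed are assumed here purely for illustration."""
    return n_birds * 0.5 * mass_kg * speed_ms ** 2

single = flock_kinetic_energy(1, 1.2, 15.0)    # one 1.2 kg bird at 15 m/s
flock = flock_kinetic_energy(30, 1.2, 15.0)    # same bird in a flock of 30
```

The 30-fold jump in impact energy is why flock-size estimation matters for risk assessment, not just species identity.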
cs.CV / 11 / 2602.07025
The Geometry of Representational Failures in Vision Language Models
视觉语言模型中的表征失效几何学
Abstract
Vision-Language Models (VLMs) exhibit puzzling failures in multi-object visual tasks, such as hallucinating non-existent elements or failing to identify the most similar objects among distractions. While these errors mirror human cognitive constraints, such as the "Binding Problem", the internal mechanisms driving them in artificial systems remain poorly understood. Here, we propose a mechanistic insight by analyzing the representational geometry of open-weight VLMs (Qwen, InternVL, Gemma), comparing methodologies to distill "concept vectors" - latent directions encoding visual concepts. We validate our concept vectors via steering interventions that reliably manipulate model behavior in both simplified and naturalistic vision tasks (e.g., forcing the model to perceive a red flower as blue). We observe that the geometric overlap between these vectors strongly correlates with specific error patterns, offering a grounded quantitative framework to understand how internal representations shape model behavior and drive visual failures.
Chinese Translation
视觉语言模型(VLMs)在多对象视觉任务中表现出令人困惑的失效现象,例如幻觉产生不存在的元素或未能在干扰物中识别出最相似的对象。虽然这些错误反映了人类认知的局限性,如“绑定问题”(Binding Problem),但在人工系统中驱动这些错误的内部机制仍然不甚了解。在此,我们通过分析开放权重VLMs(Qwen、InternVL、Gemma)的表征几何,提出了一种机械性见解,比较不同的方法来提炼“概念向量”——编码视觉概念的潜在方向。我们通过引导干预验证了我们的概念向量,这些干预在简化和自然视觉任务中可靠地操控模型行为(例如,强迫模型将红花视为蓝色)。我们观察到这些向量之间的几何重叠与特定错误模式之间存在强相关性,提供了一个扎实的定量框架,以理解内部表征如何塑造模型行为并导致视觉失效。
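The correlation between geometric overlap and errors rests on a simple quantity: cosine similarity between distilled concept vectors. A toy version with synthetic vectors (the real ones are latent directions extracted from VLM activations):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)

# Synthetic stand-ins for distilled concept vectors.
v_red = rng.normal(size=32)
v_crimson = v_red + 0.2 * rng.normal(size=32)   # nearly collinear concept
v_blue = rng.normal(size=32)                     # unrelated concept

overlap_close = cosine(v_red, v_crimson)
overlap_far = abs(cosine(v_red, v_blue))

# The paper's finding, in miniature: concept pairs with high geometric
# overlap are the ones the model tends to confuse or bind incorrectly.
```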
cs.CV / 12 / 2602.07026
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
基于模态间隙驱动的子空间对齐训练范式用于多模态大型语言模型
Abstract
Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.
Chinese Translation
尽管多模态对比学习在对齐视觉和语言表征方面取得了成功,但仍然存在一个持续的几何异常,即模态间隙:表达相同语义的不同模态的嵌入系统性地占据偏移的区域。之前的弥合这一间隙的方法在很大程度上受到过于简化的各向同性假设的限制,阻碍了其在大规模场景中的应用。本文通过精确表征模态间隙的几何形状并利用其进行高效模型扩展,来解决这些局限性。首先,我们提出了固定框架模态间隙理论,该理论将冻结参考框架内的模态间隙分解为稳定的偏差和各向异性的残差。在这一精确建模的指导下,我们引入了ReAlign,一种无训练的模态对齐策略。ReAlign利用大量未配对数据的统计信息,通过包括锚点(Anchor)、轨迹(Trace)和质心对齐(Centroid Alignment)在内的三步过程,将文本表征对齐到图像表征分布,从而显式地纠正几何不对齐。在ReAlign的基础上,我们提出了ReVision,一种可扩展的多模态大型语言模型(MLLMs)训练范式。ReVision将ReAlign集成到预训练阶段,使模型能够在视觉指令调优之前,从未配对文本中学习视觉表征的分布,而无需大量高质量的图像-文本对。我们的框架表明,统计对齐的未配对数据可以有效替代昂贵的图像-文本对,为MLLMs的高效扩展提供了一条稳健的路径。
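Of ReAlign's three steps, Centroid Alignment is the simplest to sketch: shift the text cloud so its first moment matches the image cloud's, using only statistics from unpaired data. The synthetic clouds and constant offset below are illustrative, not the anisotropic residual structure the paper models:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Unpaired embedding clouds (synthetic): the text cloud sits in a
# systematically offset region - a toy modality gap.
img = rng.normal(size=(1000, d))
txt = rng.normal(size=(1000, d)) + 3.0

# Centroid alignment from unpaired first-moment statistics: move the
# text centroid onto the image centroid.
txt_aligned = txt - txt.mean(axis=0) + img.mean(axis=0)

gap_before = float(np.linalg.norm(txt.mean(axis=0) - img.mean(axis=0)))
gap_after = float(np.linalg.norm(txt_aligned.mean(axis=0) - img.mean(axis=0)))
```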
cs.CV / 13 / 2602.07027
Fair Context Learning for Evidence-Balanced Test-Time Adaptation in Vision-Language Models
公平上下文学习用于视觉-语言模型中的证据平衡测试时适应
Abstract
Vision-Language Models (VLMs) such as CLIP enable strong zero-shot recognition but suffer substantial degradation under distribution shifts. Test-Time Adaptation (TTA) aims to improve robustness using only unlabeled test samples, yet most prompt-based TTA methods rely on entropy minimization -- an approach that can amplify spurious correlations and induce overconfident errors when classes share visual features. We propose Fair Context Learning (FCL), an episodic TTA framework that avoids entropy minimization by explicitly addressing shared-evidence bias. Motivated by our additive evidence decomposition assumption, FCL decouples adaptation into (i) augmentation-based exploration to identify plausible class candidates, and (ii) fairness-driven calibration that adapts text contexts to equalize sensitivity to common visual evidence. This fairness constraint mitigates partial feature obsession and enables effective calibration of text embeddings without relying on entropy reduction. Through extensive evaluation, we empirically validate our theoretical motivation and show that FCL achieves competitive adaptation performance relative to state-of-the-art TTA methods across diverse domain-shift and fine-grained benchmarks.
Chinese Translation
视觉-语言模型(VLMs)如 CLIP 能够实现强大的零样本识别,但在分布变化下会遭遇显著的性能下降。测试时适应(TTA)旨在仅利用未标记的测试样本来提高模型的鲁棒性,但大多数基于提示的 TTA 方法依赖于熵最小化——这种方法在类别共享视觉特征时可能会放大虚假相关性并引发过度自信的错误。我们提出了公平上下文学习(FCL),这是一种情景性 TTA 框架,通过明确解决共享证据偏差来避免熵最小化。基于我们的加性证据分解假设,FCL 将适应过程解耦为(i)基于增强的探索以识别合理的类别候选,以及(ii)驱动公平性的校准,调整文本上下文以平衡对共同视觉证据的敏感性。这个公平性约束减轻了对部分特征的过度依赖,使得文本嵌入的有效校准得以实现,而无需依赖熵的减少。通过广泛的评估,我们实证验证了我们的理论动机,并展示了 FCL 在多样的领域转移和细粒度基准测试中相较于最先进的 TTA 方法实现了具有竞争力的适应性能。
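The failure mode FCL avoids — entropy minimization committing to whichever class is marginally ahead, even when the margin comes from shared evidence — can be shown in a few lines. Temperature sharpening below is a stand-in for an entropy-minimization update, and the logits are invented:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

# Two classes sharing a visual feature yield nearly tied logits; the
# third class is clearly ruled out. Values are invented.
logits = np.array([2.1, 2.0, -1.0])
p = softmax(logits)

# Sharpening commits to the marginal leader, even when its 0.1-logit
# margin comes from shared, non-discriminative evidence.
sharpened = softmax(logits / 0.2)

ent_before = entropy(p)
ent_after = entropy(sharpened)
```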
cs.CV / 14 / 2602.07028
A Comparative Study of Adversarial Robustness in CNN and CNN-ANFIS Architectures
CNN与CNN-ANFIS架构中对抗鲁棒性的比较研究
Abstract
Convolutional Neural Networks (CNNs) achieve strong image classification performance but lack interpretability and are vulnerable to adversarial attacks. Neuro-fuzzy hybrids such as DCNFIS replace fully connected CNN classifiers with Adaptive Neuro-Fuzzy Inference Systems (ANFIS) to improve interpretability, yet their robustness remains underexplored. This work compares standard CNNs (ConvNet, VGG, ResNet18) with their ANFIS-augmented counterparts on MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 under gradient-based (PGD) and gradient-free (Square) attacks. Results show that ANFIS integration does not consistently improve clean accuracy and has architecture-dependent effects on robustness: ResNet18-ANFIS exhibits improved adversarial robustness, while VGG-ANFIS often underperforms its baseline. These findings suggest that neuro-fuzzy augmentation can enhance robustness in specific architectures but is not universally beneficial.
Chinese Translation
卷积神经网络(CNN)在图像分类性能上表现出色,但缺乏可解释性且易受对抗攻击的影响。神经模糊混合体如DCNFIS通过用自适应神经模糊推理系统(ANFIS)替换全连接的CNN分类器,以提高可解释性,但其鲁棒性仍然未得到充分研究。本研究比较了标准CNN(ConvNet、VGG、ResNet18)与其ANFIS增强版本在MNIST、Fashion-MNIST、CIFAR-10和CIFAR-100数据集上在基于梯度(PGD)和无梯度(Square)攻击下的表现。结果表明,ANFIS的集成并不总是能提高干净样本的准确性,并且对鲁棒性的影响依赖于架构:ResNet18-ANFIS展现出改善的对抗鲁棒性,而VGG-ANFIS则往往表现不及其基线。这些发现表明,神经模糊增强可以在特定架构中增强鲁棒性,但并非普遍有效。
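The PGD attack used in the study is standard sign-gradient ascent with an L-infinity projection; here is a self-contained version on a logistic-regression toy model (where the input gradient is analytic), not on the CNNs from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_attack(x, y, w, b, eps=0.3, alpha=0.1, steps=10):
    """PGD on a logistic-regression classifier: sign-gradient ascent on
    the loss, projected back into the L-infinity ball of radius eps."""
    x_adv = x.copy()
    for _ in range(steps):
        p = sigmoid(x_adv @ w + b)
        grad = (p - y) * w                          # analytic d(BCE)/dx
        x_adv = x_adv + alpha * np.sign(grad)       # ascend on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)    # project onto the ball
    return x_adv

# Toy 2-feature classifier that is confident and correct on x.
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.0])
y = 1.0

p_clean = float(sigmoid(x @ w + b))
x_adv = pgd_attack(x, y, w, b)
p_adv = float(sigmoid(x_adv @ w + b))       # confidence after the attack
```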
cs.CV / 15 / 2602.07038
UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents
UNIKIE-BENCH:针对视觉文档中关键信息提取的大型多模态模型基准测试
Abstract
Key Information Extraction (KIE) from real-world documents remains challenging due to substantial variations in layout structures, visual quality, and task-specific information requirements. Recent Large Multimodal Models (LMMs) have shown promising potential for performing end-to-end KIE directly from document images. To enable a comprehensive and systematic evaluation across realistic and diverse application scenarios, we introduce UNIKIE-BENCH, a unified benchmark designed to rigorously evaluate the KIE capabilities of LMMs. UNIKIE-BENCH consists of two complementary tracks: a constrained-category KIE track with scenario-predefined schemas that reflect practical application needs, and an open-category KIE track that extracts any key information that is explicitly present in the document. Experiments on 15 state-of-the-art LMMs reveal substantial performance degradation under diverse schema definitions, long-tail key fields, and complex layouts, along with pronounced performance disparities across different document types and scenarios. These findings underscore persistent challenges in grounding accuracy and layout-aware reasoning for LMM-based KIE. All codes and datasets are available at https://github.com/NEUIR/UNIKIE-BENCH.
Chinese Translation
从现实世界文档中提取关键信息(KIE)仍然面临挑战,主要由于布局结构、视觉质量和任务特定信息需求的显著变化。最近的大型多模态模型(LMMs)在直接从文档图像中执行端到端的KIE方面显示出了良好的潜力。为了在现实和多样化的应用场景中进行全面和系统的评估,我们提出了UNIKIE-BENCH,这是一个统一的基准测试,旨在严格评估LMMs的KIE能力。UNIKIE-BENCH包括两个互补的轨道:一个受限类别KIE轨道,具有反映实际应用需求的场景预定义模式,以及一个开放类别KIE轨道,提取文档中明确存在的任何关键信息。在对15个最先进的LMMs进行实验时,发现它们在多样的模式定义、长尾关键信息字段和复杂布局下表现出显著的性能下降,并且不同文档类型和场景之间存在明显的性能差异。这些发现突显了基于LMM的KIE在基础准确性和布局感知推理方面的持续挑战。所有代码和数据集可在https://github.com/NEUIR/UNIKIE-BENCH获取。
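A common way to score KIE outputs like those in the constrained-category track is field-level F1 over exact (key, value) matches; a minimal version with an invented invoice example (the benchmark's actual protocol may differ):

```python
def field_f1(pred, gold):
    """Field-level F1 for key information extraction: a predicted
    (key, value) pair counts as correct only on exact match."""
    pred_pairs = set(pred.items())
    gold_pairs = set(gold.items())
    tp = len(pred_pairs & gold_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical invoice: the date field is extracted incorrectly.
gold = {"invoice_no": "A-1029", "total": "86.40", "date": "2025-11-03"}
pred = {"invoice_no": "A-1029", "total": "86.40", "date": "2025-11-08"}

f1 = field_f1(pred, gold)
```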
cs.CV / 16 / 2602.07041
OMNI-Dent: Towards an Accessible and Explainable AI Framework for Automated Dental Diagnosis
OMNI-Dent:迈向一个可访问且可解释的自动化牙科诊断人工智能框架
Abstract
Accurate dental diagnosis is essential for oral healthcare, yet many individuals lack access to timely professional evaluation. Existing AI-based methods primarily treat diagnosis as a visual pattern recognition task and do not reflect the structured clinical reasoning used by dental professionals. These approaches also require large amounts of expert-annotated data and often struggle to generalize across diverse real-world imaging conditions. To address these limitations, we present OMNI-Dent, a data-efficient and explainable diagnostic framework that incorporates clinical reasoning principles into a Vision-Language Model (VLM)-based pipeline. The framework operates on multi-view smartphone photographs, embeds diagnostic heuristics from dental experts, and guides a general-purpose VLM to perform tooth-level evaluation without dental-specific fine-tuning of the VLM. By utilizing the VLM's existing visual-linguistic capabilities, OMNI-Dent aims to support diagnostic assessment in settings where curated clinical imaging is unavailable. Designed as an early-stage assistive tool, OMNI-Dent helps users identify potential abnormalities and determine when professional evaluation may be needed, offering a practical option for individuals with limited access to in-person care.
Chinese Translation
准确的牙科诊断对于口腔健康至关重要,但许多人缺乏及时的专业评估。现有的基于人工智能的方法主要将诊断视为视觉模式识别任务,并未反映牙科专业人员所使用的结构化临床推理。这些方法还需要大量专家标注的数据,并且在多样化的真实世界成像条件下往往难以推广。为了解决这些局限性,我们提出了OMNI-Dent,一个数据高效且可解释的诊断框架,它将临床推理原则融入基于视觉-语言模型(Vision-Language Model, VLM)的管道中。该框架在多视角智能手机照片上运行,嵌入牙科专家的诊断启发式,并指导通用VLM进行牙齿级别的评估,而无需对VLM进行特定于牙科的微调。通过利用VLM现有的视觉-语言能力,OMNI-Dent旨在支持在缺乏精心策划的临床成像的环境中进行诊断评估。作为一个早期辅助工具,OMNI-Dent帮助用户识别潜在的异常,并确定何时需要专业评估,为那些难以获得面对面护理的个人提供了一个实用的选择。
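Since OMNI-Dent guides a general-purpose VLM with expert heuristics rather than fine-tuning it, the heart of the pipeline is prompt assembly; a toy sketch, with heuristic text and tooth/view identifiers that are purely illustrative:

```python
def build_tooth_prompt(tooth_id, views, heuristics):
    """Assemble a heuristic-guided instruction for a general-purpose
    VLM - no fine-tuning, just expert rules baked into the prompt.
    All rule text here is illustrative, not the paper's actual prompts."""
    rules = "\n".join(f"- {h}" for h in heuristics)
    return (
        f"You are assessing tooth {tooth_id} from "
        f"{len(views)} smartphone views.\n"
        f"Apply these expert heuristics in order:\n{rules}\n"
        "Report: normal / possible abnormality / see a dentist."
    )

heuristics = [
    "Check for dark pits or fissures on occlusal surfaces (possible caries).",
    "Check gum margin redness or swelling (possible gingivitis).",
]
prompt = build_tooth_prompt("UL6", ["front", "occlusal", "lingual"],
                            heuristics)
```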
cs.CV / 17 / 2602.07042
COMBOOD: A Semiparametric Approach for Detecting Out-of-distribution Data for Image Classification
COMBOOD:一种用于图像分类中分布外数据检测的半参数方法
Abstract
Identifying out-of-distribution (OOD) data at inference time is crucial for many machine learning applications, especially for automation. We present a novel unsupervised semi-parametric framework COMBOOD for OOD detection with respect to image recognition. Our framework combines signals from two distance metrics, nearest-neighbor and Mahalanobis, to derive a confidence score for an inference point to be out-of-distribution. The former provides a non-parametric approach to OOD detection. The latter provides a parametric, simple, yet effective method for detecting OOD data points, especially, in the far OOD scenario, where the inference point is far apart from the training data set in the embedding space. However, its performance is not satisfactory in the near OOD scenarios that arise in practical situations. Our COMBOOD framework combines the two signals in a semi-parametric setting to provide a confidence score that is accurate both for the near-OOD and far-OOD scenarios. We show experimental results with the COMBOOD framework for different types of feature extraction strategies. We demonstrate experimentally that COMBOOD outperforms state-of-the-art OOD detection methods on the OpenOOD (both version 1 and most recent version 1.5) benchmark datasets (for both far-OOD and near-OOD) as well as on the documents dataset in terms of accuracy. On a majority of the benchmark datasets, the improvements in accuracy resulting from the COMBOOD framework are statistically significant. COMBOOD scales linearly with the size of the embedding space, making it ideal for many real-life applications.
Chinese Translation
在推理阶段识别分布外(OOD)数据对于许多机器学习应用,特别是自动化,至关重要。我们提出了一种新颖的无监督半参数框架COMBOOD,用于图像识别的OOD检测。我们的框架结合了两种距离度量的信号,最近邻和马哈拉诺比斯(Mahalanobis),以推导出一个置信度分数,用于判断推理点是否为分布外。前者提供了一种非参数的方法来进行OOD检测。后者则提供了一种参数化的、简单而有效的OOD数据点检测方法,特别是在远OOD场景中,即推理点在嵌入空间中与训练数据集相距较远的情况。然而,在实际情况中出现的近OOD场景下,其性能并不令人满意。我们的COMBOOD框架在半参数设置中结合了这两种信号,以提供一个在近OOD和远OOD场景下都准确的置信度分数。我们展示了使用COMBOOD框架在不同特征提取策略下的实验结果。实验表明,COMBOOD在OpenOOD(包括版本1和最新版本1.5)基准数据集(对于远OOD和近OOD)以及文档数据集的准确性上,优于最先进的OOD检测方法。在大多数基准数据集中,COMBOOD框架带来的准确性提升具有统计显著性。COMBOOD在嵌入空间大小上呈线性扩展,使其非常适合许多实际应用。
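A minimal sketch of the two distance signals described above, combined into one OOD score. The abstract does not specify COMBOOD's exact semiparametric combination, so the `alpha`-weighted sum below (and all variable names) are illustrative assumptions, not the paper's method:

```python
import numpy as np

def knn_distance(x, train, k=5):
    """Non-parametric signal: distance from x to its k-th nearest training embedding."""
    d = np.linalg.norm(train - x, axis=1)
    return float(np.sort(d)[k - 1])

def mahalanobis_distance(x, mean, cov_inv):
    """Parametric signal: Mahalanobis distance to the training distribution."""
    diff = x - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

def combined_ood_score(x, train, mean, cov_inv, alpha=0.5, k=5):
    """Higher score = more likely out-of-distribution (illustrative combination)."""
    return alpha * knn_distance(x, train, k) + (1 - alpha) * mahalanobis_distance(x, mean, cov_inv)

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 8))              # in-distribution embeddings
mean, cov_inv = train.mean(axis=0), np.linalg.inv(np.cov(train.T))

near = rng.normal(0.0, 1.0, size=8)                      # looks in-distribution
far = np.full(8, 6.0)                                    # far from the data in embedding space
assert combined_ood_score(far, train, mean, cov_inv) > combined_ood_score(near, train, mean, cov_inv)
```

The nearest-neighbor term helps separate near-OOD points while the Mahalanobis term is cheap and effective far from the data, which matches the abstract's motivation for combining the two.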
cs.CV / 18 / 2602.07044
PipeMFL-240K: A Large-scale Dataset and Benchmark for Object Detection in Pipeline Magnetic Flux Leakage Imaging
PipeMFL-240K:管道磁通泄漏成像中的目标检测大规模数据集和基准
Abstract
Pipeline integrity is critical to industrial safety and environmental protection, with Magnetic Flux Leakage (MFL) detection being a primary non-destructive testing technology. Despite the promise of deep learning for automating MFL interpretation, progress toward reliable models has been constrained by the absence of a large-scale public dataset and benchmark, making fair comparison and reproducible evaluation difficult. We introduce \textbf{PipeMFL-240K}, a large-scale, meticulously annotated dataset and benchmark for complex object detection in pipeline MFL pseudo-color images. PipeMFL-240K reflects real-world inspection complexity and poses several unique challenges: (i) an extremely long-tailed distribution over \textbf{12} categories, (ii) a high prevalence of tiny objects that often comprise only a handful of pixels, and (iii) substantial intra-class variability. The dataset contains \textbf{240,320} images and \textbf{191,530} high-quality bounding-box annotations, collected from 11 pipelines spanning approximately \textbf{1,480} km. Extensive experiments are conducted with state-of-the-art object detectors to establish baselines. Results show that modern detectors still struggle with the intrinsic properties of MFL data, highlighting considerable headroom for improvement, while PipeMFL-240K provides a reliable and challenging testbed to drive future research. As the first public dataset and the first benchmark of this scale and scope for pipeline MFL inspection, it provides a critical foundation for efficient pipeline diagnostics as well as maintenance planning and is expected to accelerate algorithmic innovation and reproducible research in MFL-based pipeline integrity assessment.
Chinese Translation
管道完整性对工业安全和环境保护至关重要,而磁通泄漏(MFL)检测是主要的无损检测技术。尽管深度学习在自动化MFL解读方面展现出潜力,但由于缺乏大规模公共数据集和基准,可靠模型的进展受到限制,这使得公平比较和可重复评估变得困难。我们引入了\textbf{PipeMFL-240K},这是一个大规模、经过精心注释的数据集和基准,旨在处理管道MFL伪彩色图像中的复杂目标检测。PipeMFL-240K反映了真实世界检查的复杂性,并提出了几个独特的挑战:(i) 在\textbf{12}个类别上的极端长尾分布,(ii) 高比例的小物体,这些物体通常仅由少数几个像素组成,以及(iii) 显著的类内变异性。该数据集包含\textbf{240,320}张图像和\textbf{191,530}个高质量的边界框注释,数据采集自11条管道,覆盖约\textbf{1,480}公里。我们对最先进的目标检测器进行了广泛实验,以建立基准。结果表明,现代检测器仍然在MFL数据的内在特性上面临挑战,显示出相当大的改进空间,同时PipeMFL-240K提供了一个可靠且具有挑战性的测试平台,以推动未来的研究。作为第一个公共数据集和第一个此规模和范围的管道MFL检查基准,它为高效的管道诊断以及维护规划提供了关键基础,并预计将加速基于MFL的管道完整性评估中的算法创新和可重复研究。
cs.CV / 19 / 2602.07045
VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing
VLRS-Bench:一个用于遥感的视觉-语言推理基准
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoning. However, existing remote sensing (RS) benchmarks remain heavily biased toward perception tasks, such as object recognition and scene classification. This limitation hinders the development of MLLMs for cognitively demanding RS applications. To address this, we propose a Vision Language ReaSoning Benchmark (VLRS-Bench), which is the first benchmark exclusively dedicated to complex RS reasoning. Structured across the three core dimensions of Cognition, Decision, and Prediction, VLRS-Bench comprises 2,000 question-answer pairs with an average length of 71 words, spanning 14 tasks and up to eight temporal phases. VLRS-Bench is constructed via a specialized pipeline that integrates RS-specific priors and expert knowledge to ensure geospatial realism and reasoning complexity. Experimental results reveal significant bottlenecks in existing state-of-the-art MLLMs, providing critical insights for advancing multimodal reasoning within the remote sensing community.
Chinese Translation
近期多模态大型语言模型(MLLMs)的进展使得复杂推理成为可能。然而,现有的遥感(RS)基准仍然严重偏向于感知任务,如物体识别和场景分类。这一局限性阻碍了MLLMs在认知要求高的遥感应用中的发展。为了解决这一问题,我们提出了视觉语言推理基准(VLRS-Bench),这是第一个专门用于复杂遥感推理的基准。VLRS-Bench在认知、决策和预测三个核心维度上进行结构化,包含2000对问答对,平均长度为71个单词,涵盖14个任务和多达八个时间阶段。VLRS-Bench通过一个专门的流程构建,该流程整合了遥感特定的先验知识和专家知识,以确保地理空间的真实性和推理的复杂性。实验结果揭示了现有最先进的MLLMs中的显著瓶颈,为推动遥感领域的多模态推理提供了重要的见解。
cs.CV / 20 / 2602.07047
ShapBPT: Image Feature Attributions Using Data-Aware Binary Partition Trees
ShapBPT:基于数据感知的二叉划分树的图像特征归因
Abstract
Pixel-level feature attributions are an important tool in eXplainable AI for Computer Vision (XCV), providing visual insights into how image features influence model predictions. The Owen formula for hierarchical Shapley values has been widely used to interpret machine learning (ML) models and their learned representations. However, existing hierarchical Shapley approaches do not exploit the multiscale structure of image data, leading to slow convergence and weak alignment with the actual morphological features. Moreover, no prior Shapley method has leveraged data-aware hierarchies for Computer Vision tasks, leaving a gap in model interpretability of structured visual data. To address this, this paper introduces ShapBPT, a novel data-aware XCV method based on the hierarchical Shapley formula. ShapBPT assigns Shapley coefficients to a multiscale hierarchical structure tailored for images, the Binary Partition Tree (BPT). By using this data-aware hierarchical partitioning, ShapBPT ensures that feature attributions align with intrinsic image morphology, effectively prioritizing relevant regions while reducing computational overhead. This advancement connects hierarchical Shapley methods with image data, providing a more efficient and semantically meaningful approach to visual interpretability. Experimental results confirm ShapBPT's effectiveness, demonstrating superior alignment with image structures and improved efficiency over existing XCV methods, with a 20-subject user study confirming that ShapBPT explanations are preferred by humans.
Chinese Translation
像素级特征归因是可解释人工智能在计算机视觉(XCV)中的重要工具,提供了关于图像特征如何影响模型预测的视觉洞察。Owen公式用于层次Shapley值的计算,已被广泛应用于解释机器学习(ML)模型及其学习的表示。然而,现有的层次Shapley方法未能利用图像数据的多尺度结构,导致收敛速度慢且与实际形态特征的对齐效果较差。此外,之前的Shapley方法没有利用数据感知的层次结构来处理计算机视觉任务,造成了对结构化视觉数据模型可解释性方面的缺失。为了解决这一问题,本文提出了ShapBPT,一种基于层次Shapley公式的新型数据感知XCV方法。ShapBPT将Shapley系数分配给为图像量身定制的多尺度层次结构,即二叉划分树(Binary Partition Tree, BPT)。通过使用这种数据感知的层次划分,ShapBPT确保特征归因与内在图像形态相一致,有效优先考虑相关区域,同时减少计算开销。这一进展将层次Shapley方法与图像数据连接起来,为视觉可解释性提供了一种更高效且语义上更有意义的方法。实验结果证实了ShapBPT的有效性,展示了其与图像结构的优越对齐效果和相较于现有XCV方法的提高效率,并通过一项包含20名参与者的用户研究确认了人们对ShapBPT解释的偏好。
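For context, the flat Shapley attribution that hierarchical Owen-style methods such as ShapBPT coarsen over a region tree can be computed exactly on a tiny example. The "regions" and toy model below are invented for illustration only and are not the paper's BPT construction:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values of a set function `value` over `players` (exponential cost)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):
            for coalition in combinations(others, r):
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                total += weight * (value(set(coalition) | {p}) - value(set(coalition)))
        phi[p] = total
    return phi

# Toy "model": the prediction fires only when both halves of the object are visible.
def model(regions):
    return 1.0 if {"obj_left", "obj_right"} <= regions else 0.0

phi = shapley_values(["obj_left", "obj_right", "background"], model)
assert abs(phi["obj_left"] - 0.5) < 1e-9 and abs(phi["background"]) < 1e-9
```

The attributions sum to the model output on the full set (efficiency axiom); hierarchical variants keep this property while grouping regions to cut the exponential cost, which is what makes the BPT structure attractive.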
cs.CV / 21 / 2602.07049
Enhancing IMU-Based Online Handwriting Recognition via Contrastive Learning with Zero Inference Overhead
通过对比学习以零推理开销增强基于惯性测量单元的在线手写识别
Abstract
Online handwriting recognition using inertial measurement units opens up handwriting on paper as input for digital devices. Doing it on edge hardware improves privacy and lowers latency, but entails memory constraints. To address this, we propose Error-enhanced Contrastive Handwriting Recognition (ECHWR), a training framework designed to improve feature representation and recognition accuracy without increasing inference costs. ECHWR utilizes a temporary auxiliary branch that aligns sensor signals with semantic text embeddings during the training phase. This alignment is maintained through a dual contrastive objective: an in-batch contrastive loss for general modality alignment and a novel error-based contrastive loss that distinguishes between correct signals and synthetic hard negatives. The auxiliary branch is discarded after training, which allows the deployed model to keep its original, efficient architecture. Evaluations on the OnHW-Words500 dataset show that ECHWR significantly outperforms state-of-the-art baselines, reducing character error rates by up to 7.4% on the writer-independent split and 10.4% on the writer-dependent split. Finally, although our ablation studies indicate that solving specific challenges requires specific architectural and objective configurations, the error-based contrastive loss shows its effectiveness for handling unseen writing styles.
Chinese Translation
使用惯性测量单元进行在线手写识别为数字设备提供了纸上手写输入的可能性。在边缘硬件上进行此操作提高了隐私性并降低了延迟,但也带来了内存限制。为了解决这个问题,我们提出了错误增强对比手写识别(Error-enhanced Contrastive Handwriting Recognition, ECHWR),这是一个旨在提高特征表示和识别准确性的训练框架,而不增加推理成本。ECHWR利用一个临时辅助分支,在训练阶段将传感器信号与语义文本嵌入对齐。通过双重对比目标保持这种对齐:一种用于一般模态对齐的批内对比损失和一种新颖的基于错误的对比损失,用于区分正确信号和合成的困难负样本。训练后,辅助分支被丢弃,这使得部署的模型能够保持其原始的高效架构。在OnHW-Words500数据集上的评估表明,ECHWR显著超越了最先进的基线,在作家独立分割中将字符错误率降低了多达7.4%,在作家依赖分割中降低了10.4%。最后,尽管我们的消融研究表明,解决特定挑战需要特定的架构和目标配置,但基于错误的对比损失在处理未见过的书写风格方面显示了其有效性。
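The in-batch contrastive objective for aligning sensor signals with text embeddings is commonly an InfoNCE-style loss; the sketch below assumes that standard symmetric form. ECHWR's exact objective and its error-based negative term are not detailed in the abstract, so this is only a generic illustration:

```python
import numpy as np

def in_batch_contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched sensor/text rows are positives, all other rows negatives."""
    s = sensor_emb / np.linalg.norm(sensor_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature
    labels = np.arange(len(s))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)             # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(1)
text = rng.normal(size=(4, 16))                          # hypothetical word embeddings
aligned = text + 0.01 * rng.normal(size=(4, 16))         # sensor features near their captions
mismatched = aligned[::-1].copy()                        # pairing destroyed
assert in_batch_contrastive_loss(aligned, text) < in_batch_contrastive_loss(mismatched, text)
```

Because the loss only shapes the encoder during training, the auxiliary text branch can be dropped at deployment, which is how the "zero inference overhead" property in the title is achieved.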
cs.CV / 22 / 2602.07050
Interpreting Physics in Video World Models
视频世界模型中的物理解释
Abstract
A long-standing question in physical reasoning is whether video-based models need to rely on factorized representations of physical variables in order to make physically accurate predictions, or whether they can implicitly represent such variables in a task-specific, distributed manner. While modern video world models achieve strong performance on intuitive physics benchmarks, it remains unclear which of these representational regimes they implement internally. Here, we present the first interpretability study to directly examine physical representations inside large-scale video encoders. Using layerwise probing, subspace geometry, patch-level decoding, and targeted attention ablations, we characterize where physical information becomes accessible and how it is organized within encoder-based video transformers. Across architectures, we identify a sharp intermediate-depth transition -- which we call the Physics Emergence Zone -- at which physical variables become accessible. Physics-related representations peak shortly after this transition and degrade toward the output layers. Decomposing motion into explicit variables, we find that scalar quantities such as speed and acceleration are available from early layers onwards, whereas motion direction becomes accessible only at the Physics Emergence Zone. Notably, we find that direction is encoded through a high-dimensional population structure with circular geometry, requiring coordinated multi-feature intervention to control. These findings suggest that modern video models do not use factorized representations of physical variables like a classical physics engine. Instead, they use a distributed representation that is nonetheless sufficient for making physical predictions.
Chinese Translation
物理推理中的一个长期问题是,基于视频的模型是否需要依赖于物理变量的分解表示,以便做出物理上准确的预测,或者它们是否可以以任务特定的分布方式隐式表示这些变量。尽管现代视频世界模型在直观物理基准测试中表现出色,但尚不清楚它们内部实现了哪种表示机制。在此,我们呈现了首个可解释性研究,直接检查大型视频编码器内部的物理表示。通过逐层探测、子空间几何、补丁级解码和针对性注意力消融,我们描述了物理信息何时变得可访问以及它在基于编码器的视频变换器中的组织方式。在不同架构中,我们识别出一个明显的中间深度过渡区——我们称之为物理出现区——在此物理变量变得可访问。与物理相关的表示在这一过渡后不久达到峰值,并向输出层退化。将运动分解为显式变量,我们发现标量量(如速度和加速度)从早期层开始就可用,而运动方向仅在物理出现区变得可访问。值得注意的是,我们发现方向通过具有圆形几何的高维群体结构进行编码,控制这一结构需要协调的多特征干预。这些发现表明,现代视频模型并不像经典物理引擎那样使用物理变量的分解表示。相反,它们使用一种分布式表示,尽管如此仍足以进行物理预测。
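Layerwise probing of the kind described above typically fits a linear readout from a layer's activations to a physical variable and reports how decodable the variable is. A minimal R² probe under that standard assumption, with fully synthetic data standing in for real activations:

```python
import numpy as np

def probe_r2(activations, target):
    """Fraction of the target's variance linearly decodable from the activations."""
    X = np.hstack([activations, np.ones((len(activations), 1))])   # add bias column
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    pred = X @ coef
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(2)
speed = rng.uniform(0.0, 5.0, size=200)                  # a scalar physical variable
informative = np.column_stack([speed + 0.05 * rng.normal(size=200),
                               rng.normal(size=200)])    # layer that encodes speed
uninformative = rng.normal(size=(200, 2))                # layer carrying no speed signal
assert probe_r2(informative, speed) > 0.9
assert probe_r2(uninformative, speed) < 0.2
```

Sweeping such a probe across layers is what reveals depth profiles like the "Physics Emergence Zone" reported in the abstract; the circular direction code they describe would need a richer, multi-feature probe than this scalar one.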
cs.CV / 23 / 2602.07051
Neural Sentinel: Unified Vision Language Model (VLM) for License Plate Recognition with Human-in-the-Loop Continual Learning
神经哨兵:统一视觉语言模型(VLM)用于带有人机参与的持续学习的车牌识别
Abstract
Traditional Automatic License Plate Recognition (ALPR) systems employ multi-stage pipelines consisting of object detection networks followed by separate Optical Character Recognition (OCR) modules, introducing compounding errors, increased latency, and architectural complexity. This research presents Neural Sentinel, a novel unified approach that leverages Vision Language Models (VLMs) to perform license plate recognition, state classification, and vehicle attribute extraction through a single forward pass. Our primary contribution lies in demonstrating that a fine-tuned PaliGemma 3B model, adapted via Low-Rank Adaptation (LoRA), can simultaneously answer multiple visual questions about vehicle images, achieving 92.3% plate recognition accuracy, which is a 14.1% improvement over EasyOCR and 9.9% improvement over PaddleOCR baselines. We introduce a Human-in-the-Loop (HITL) continual learning framework that incorporates user corrections while preventing catastrophic forgetting through experience replay, maintaining a 70:30 ratio of original training data to correction samples. The system achieves a mean inference latency of 152ms with an Expected Calibration Error (ECE) of 0.048, indicating well calibrated confidence estimates. Additionally, the VLM-first architecture enables zero-shot generalization to auxiliary tasks including vehicle color detection (89%), seatbelt detection (82%), and occupancy counting (78%) without task-specific training. Through extensive experimentation on real world toll plaza imagery, we demonstrate that unified vision language approaches represent a paradigm shift in ALPR systems, offering superior accuracy, reduced architectural complexity, and emergent multi-task capabilities that traditional pipeline approaches cannot achieve.
Chinese Translation
传统的自动车牌识别(ALPR)系统采用多阶段管道,包括对象检测网络和单独的光学字符识别(OCR)模块,这导致了累积错误、延迟增加和架构复杂性。本研究提出了神经哨兵,这是一种新颖的统一方法,利用视觉语言模型(VLM)通过单次前向传播执行车牌识别、状态分类和车辆属性提取。我们的主要贡献在于证明经过低秩适应(LoRA)调整的精细调优PaliGemma 3B模型能够同时回答关于车辆图像的多个视觉问题,实现了92.3%的车牌识别准确率,比EasyOCR提高了14.1%,比PaddleOCR提高了9.9%。我们引入了一种人机参与(HITL)持续学习框架,该框架结合用户的修正,并通过经验重放防止灾难性遗忘,保持原始训练数据与修正样本70:30的比例。该系统实现了152毫秒的平均推理延迟,期望校准误差(ECE)为0.048,表明置信度估计良好校准。此外,VLM优先架构使得在没有特定任务训练的情况下,对辅助任务如车辆颜色检测(89%)、安全带检测(82%)和占用计数(78%)实现零样本泛化。通过对真实世界收费广场图像的广泛实验,我们证明了统一视觉语言方法代表了ALPR系统的范式转变,提供了更高的准确性、降低的架构复杂性以及传统管道方法无法实现的涌现式多任务能力。
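The 70:30 experience-replay mix used above to avoid catastrophic forgetting can be sketched as a simple batch builder. The dataset contents and the with-replacement sampling scheme below are hypothetical placeholders, not the paper's exact procedure:

```python
import random

def build_replay_batch(original_data, corrections, batch_size=10, ratio=0.7, seed=0):
    """Sample a fine-tuning batch keeping `ratio` of original training data and
    filling the rest with user-corrected samples (experience replay)."""
    rng = random.Random(seed)
    n_original = round(batch_size * ratio)
    batch = rng.choices(original_data, k=n_original)
    batch += rng.choices(corrections, k=batch_size - n_original)
    rng.shuffle(batch)
    return batch

original = [(f"plate_{i}.jpg", "ABC123") for i in range(100)]   # hypothetical training pairs
fixes = [(f"hard_{i}.jpg", "XYZ789") for i in range(10)]        # operator corrections
batch = build_replay_batch(original, fixes)
assert len(batch) == 10
assert sum(1 for sample in batch if sample in fixes) == 3       # 70:30 original-to-correction
```

Keeping the majority of each batch drawn from the original distribution is what anchors the model's prior behavior while it absorbs the human corrections.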
cs.CV / 24 / 2602.07052
Toward Accurate and Accessible Markerless Neuronavigation
朝向准确且可及的无标记神经导航
Abstract
Neuronavigation is widely used in biomedical research and interventions to guide the precise placement of instruments around the head to support procedures such as transcranial magnetic stimulation. Traditional systems, however, rely on subject-mounted markers that require manual registration, may shift during procedures, and can cause discomfort. We introduce and evaluate markerless approaches that replace expensive hardware and physical markers with low-cost visible and infrared light cameras incorporating stereo and depth sensing combined with algorithmic modeling of the facial geometry. Validation with $50$ human subjects yielded a median tracking discrepancy of only $2.32$ mm and $2.01^\circ$ for the best markerless algorithms compared to a conventional marker-based system, which indicates sufficient accuracy for transcranial magnetic stimulation and a substantial improvement over prior markerless results. The results suggest that integration of the data from the various camera sensors can improve the overall accuracy further. The proposed markerless neuronavigation methods can reduce setup cost and complexity, improve patient comfort, and expand access to neuronavigation in clinical and research settings.
Chinese Translation
神经导航广泛应用于生物医学研究和干预,以指导仪器在头部的精确放置,以支持诸如经颅磁刺激等程序。然而,传统系统依赖于安装在受试者身上的标记,这需要手动注册,可能在操作过程中发生位移,并可能导致不适。我们介绍并评估了一种无标记的方法,该方法用低成本的可见光和红外光摄像头替代昂贵的硬件和物理标记,这些摄像头结合了立体和深度感知,并与面部几何的算法建模相结合。对50名受试者的验证显示,最佳无标记算法的中位跟踪误差仅为2.32毫米和2.01度,相较于传统的基于标记的系统,表明其在经颅磁刺激中的准确性足够,并且相较于以往的无标记结果有了显著改善。结果表明,各种摄像头传感器的数据整合可以进一步提高整体准确性。所提出的无标记神经导航方法可以降低设置成本和复杂性,提高患者舒适度,并扩大临床和研究环境中神经导航的可及性。
cs.CV / 25 / 2602.07057
RECITYGEN -- Interactive and Generative Participatory Urban Design Tool with Latent Diffusion and Segment Anything
RECITYGEN -- 基于潜在扩散与 Segment Anything 的交互式生成参与式城市设计工具
Abstract
Urban design profoundly impacts public spaces and community engagement. Traditional top-down methods often overlook public input, creating a gap in design aspirations and reality. Recent advancements in digital tools, like City Information Modelling and augmented reality, have enabled a more participatory process involving more stakeholders in urban design. Further, deep learning and latent diffusion models have lowered barriers for design generation, providing even more opportunities for participatory urban design. Combining state-of-the-art latent diffusion models with interactive semantic segmentation, we propose RECITYGEN, a novel tool that allows users to interactively create variational street view images of urban environments using text prompts. In a pilot project in Beijing, users employed RECITYGEN to suggest improvements for an ongoing Urban Regeneration project. Despite some limitations, RECITYGEN has shown significant potential in aligning with public preferences, indicating a shift towards more dynamic and inclusive urban planning methods. The source code for the project can be found at RECITYGEN GitHub.
Chinese Translation
城市设计对公共空间和社区参与产生深远影响。传统的自上而下的方法常常忽视公众的意见,导致设计愿景与现实之间存在差距。最近,数字工具的进步,如城市信息建模(City Information Modelling)和增强现实(augmented reality),使得更多利益相关者能够参与城市设计的过程。此外,深度学习和潜在扩散模型降低了设计生成的门槛,为参与式城市设计提供了更多机会。我们结合最先进的潜在扩散模型与交互式语义分割,提出了RECITYGEN,这是一种新颖的工具,允许用户通过文本提示交互式地创建城市环境的变体街景图像。在北京的一项试点项目中,用户利用RECITYGEN提出了对正在进行的城市再生项目的改进建议。尽管存在一些局限性,RECITYGEN在与公众偏好的对齐方面显示出显著潜力,表明城市规划方法正在向更加动态和包容的方向转变。该项目的源代码可以在RECITYGEN GitHub上找到。
cs.CV / 26 / 2602.07058
FADE: Selective Forgetting via Sparse LoRA and Self-Distillation
FADE:通过稀疏 LoRA 和自蒸馏实现选择性遗忘
Abstract
Machine Unlearning aims to remove the influence of specific data or concepts from trained models while preserving overall performance, a capability increasingly required by data protection regulations and responsible AI practices. Despite recent progress, unlearning in text-to-image diffusion models remains challenging due to high computational costs and the difficulty of balancing effective forgetting with retention of unrelated concepts. We introduce FADE (Fast Adapter for Data Erasure), a two-stage unlearning method for image generation that combines parameter localization with self-distillation. FADE first identifies parameters most responsible for the forget set using gradient-based saliency and constrains updates through sparse LoRA adapters, ensuring lightweight, localized modifications. In a second stage, FADE applies a self-distillation objective that overwrites the forgotten concept with a user-defined surrogate while preserving behavior on retained data. The resulting adapters are memory-efficient, reversible, and can be merged or removed at runtime, enabling flexible deployment in production systems. We evaluated FADE on the UnlearnCanvas benchmark and conducted ablation studies on Imagenette, Labeled Faces in the Wild, AtharvaTaras Dog Breeds Dataset, and SUN Attributes datasets, demonstrating State-of-the-Art unlearning performance with fine-grained control over the forgetting-retention trade-off. Our results demonstrate that FADE achieves strong concept erasure and high retainability across various domains, making it a suitable solution for selective unlearning in diffusion-based image generation models.
Chinese Translation
机器遗忘旨在从训练模型中去除特定数据或概念的影响,同时保持整体性能,这是数据保护法规和负责任的人工智能实践日益需要的一种能力。尽管最近取得了一些进展,但在文本到图像的扩散模型中实现遗忘仍然具有挑战性,因为计算成本高以及在有效遗忘与保留无关概念之间平衡的困难。我们提出了 FADE(快速数据清除适配器),这是一种结合参数定位和自蒸馏的两阶段遗忘方法,用于图像生成。FADE 首先使用基于梯度的显著性识别最负责遗忘集的参数,并通过稀疏 LoRA 适配器限制更新,从而确保轻量级、局部的修改。在第二阶段,FADE 应用自蒸馏目标,用用户定义的替代概念覆盖遗忘的概念,同时保留对保留数据的行为。生成的适配器具有内存高效、可逆性,并且可以在运行时合并或移除,从而在生产系统中实现灵活部署。我们在 UnlearnCanvas 基准上评估了 FADE,并在 Imagenette、Labeled Faces in the Wild、AtharvaTaras 狗品种数据集和 SUN Attributes 数据集上进行了消融研究,展示了在遗忘与保留权衡方面的最先进的遗忘性能。我们的结果表明,FADE 在各个领域实现了强大的概念清除和高保留性,使其成为扩散基础图像生成模型中选择性遗忘的合适解决方案。
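FADE's first stage, locating the parameters most responsible for the forget set, can be approximated with a top-k gradient-magnitude mask. The thresholding rule here is an illustrative assumption, not the paper's exact saliency procedure:

```python
import numpy as np

def saliency_mask(gradients, sparsity=0.34):
    """Keep roughly the top `sparsity` fraction of parameters by |gradient| on the
    forget set; everything else stays frozen, localizing the unlearning update."""
    flat = np.abs(gradients).ravel()
    k = max(1, round(sparsity * flat.size))
    threshold = np.sort(flat)[-k]
    return np.abs(gradients) >= threshold

grads = np.array([[0.01, 2.00, 0.05],
                  [0.30, 0.02, 1.50]])      # toy per-parameter forget-set gradients
mask = saliency_mask(grads, sparsity=0.34)
assert mask.sum() == 2                      # only the two most salient weights update
assert bool(mask[0, 1]) and bool(mask[1, 2])
```

Restricting the sparse LoRA adapters to the masked parameters is what keeps the edit lightweight and reversible: removing the adapter restores the original weights exactly.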
cs.CV / 27 / 2602.07062
From Images to Decisions: Assistive Computer Vision for Non-Metallic Content Estimation in Scrap Metal
从图像到决策:用于废金属中非金属成分估算的辅助计算机视觉
Abstract
Scrap quality directly affects energy use, emissions, and safety in steelmaking. Today, the share of non-metallic inclusions (contamination) is judged visually by inspectors - an approach that is subjective and hazardous due to dust and moving machinery. We present an assistive computer vision pipeline that estimates contamination (in percent) from images captured during railcar unloading and also classifies scrap type. The method formulates contamination assessment as a regression task at the railcar level and leverages sequential data through multi-instance learning (MIL) and multi-task learning (MTL). Best results include an MAE of 0.27 and an R² of 0.83 with MIL, while an MTL setup reaches an MAE of 0.36 with an F1 of 0.79 for scrap classification. We also deploy the system in near real time within the acceptance workflow: magnet/railcar detection segments temporal layers, a versioned inference service produces railcar-level estimates with confidence scores, and results are reviewed by operators with structured overrides; corrections and uncertain cases feed an active-learning loop for continual improvement. The pipeline reduces subjective variability, improves human safety, and enables integration into acceptance and melt-planning workflows.
Chinese Translation
废金属的质量直接影响钢铁生产中的能源使用、排放和安全性。目前,非金属夹杂物(污染)的比例主要通过检查员的目测来判断,这种方法主观性强且由于灰尘和移动机械而存在危险性。我们提出了一种辅助计算机视觉管道,该管道能够从在铁路车厢卸货过程中捕获的图像中估算污染(以百分比表示),并对废金属类型进行分类。该方法将污染评估公式化为铁路车厢级别的回归任务,并通过多实例学习(MIL)和多任务学习(MTL)利用序列数据。最佳结果为MIL下的平均绝对误差(MAE)为0.27,决定系数(R²)为0.83;而MTL设置下的MAE为0.36,F1得分为0.79。此外,我们在接收工作流程中以近实时的方式展示该系统:磁铁/铁路车厢检测分割时间层,版本化推理服务生成带有置信度评分的铁路车厢级别估算,结果由操作员进行结构化审核;修正和不确定案例反馈到主动学习循环中以实现持续改进。该管道减少了主观变异性,提高了人类安全性,并使其能够融入接收和熔化规划工作流程中。
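A railcar-level MIL regression with mean pooling, plus the MAE metric reported above, might look like the following toy sketch; the features, weights, and pooling choice are all illustrative assumptions rather than the paper's architecture:

```python
import numpy as np

def mil_contamination_estimate(image_features, weights, bias=0.0):
    """MIL sketch: mean-pool per-image features within one railcar 'bag', then
    apply a linear head to predict contamination in percent."""
    bag = image_features.mean(axis=0)            # aggregate the unloading sequence
    return float(bag @ weights + bias)

def mae(y_true, y_pred):
    """Mean absolute error, the railcar-level metric reported in the abstract."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

railcar = np.array([[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]])   # toy per-image features
pred = mil_contamination_estimate(railcar, weights=np.array([10.0, 0.0]))
assert abs(pred - 3.0) < 1e-9            # 3 percent contamination for this toy bag
assert mae([3.0, 5.0], [2.5, 5.5]) == 0.5
```

The key MIL idea the abstract relies on is that only the bag (railcar) carries a label, so per-image predictions are never supervised directly.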
cs.CV / 28 / 2602.07064
Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine
通过全模态架构和物理数据引擎探索物理智能的涌现
Abstract
Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction--image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law--constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio--visual consistency filtering to generate high-fidelity video--instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.
Chinese Translation
由于关键物理属性在视觉上模糊且在网络规模数据中稀疏表示,全模态模型中的物理理解仍然脆弱。我们提出了OmniFysics,这是一种紧凑的全模态模型,统一了图像、音频、视频和文本的理解,并集成了语音和图像生成。为了注入显式的物理知识,我们构建了一个物理数据引擎,包含两个组件。FysicsAny通过在策划的原型数据库上进行层次检索,将显著对象映射到经过验证的物理属性,从而生成基于物理的指令-图像监督,随后进行物理法则约束的验证和标题重写。FysicsOmniCap通过音频-视觉一致性过滤提取网络视频,以生成强调跨模态物理线索的高保真视频-指令对。我们通过分阶段的多模态对齐和指令调优训练OmniFysics,采用潜在空间流匹配进行文本到图像的生成,并使用意图路由器仅在需要时激活生成。实验表明,在标准多模态基准测试中表现出竞争力,并在物理导向的评估中取得了更好的结果。
cs.CV / 29 / 2602.07065
Contactless estimation of continuum displacement and mechanical compressibility from image series using a deep learning based framework
基于深度学习框架的无接触连续位移和机械可压缩性估计
Abstract
Contactless and non-invasive estimation of mechanical properties of physical media from optical observations is of interest for manifold engineering and biomedical applications, where direct physical measurements are not possible. Conventional approaches to the assessment of image displacement and non-contact material probing typically rely on time-consuming iterative algorithms for non-rigid image registration and constitutive modelling using discretization and iterative numerical solving techniques, such as Finite Element Method (FEM) and Finite Difference Method (FDM), which are not suitable for high-throughput data processing. Here, we present an efficient deep learning based end-to-end approach for the estimation of continuum displacement and material compressibility directly from the image series. Based on two deep neural networks for image registration and material compressibility estimation, this framework outperforms conventional approaches in terms of efficiency and accuracy. In particular, our experimental results show that the deep learning model trained on a set of reference data can accurately determine the material compressibility even in the presence of substantial local deviations of the mapping predicted by image registration from the reference displacement field. Our findings suggest that the remarkable accuracy of the deep learning end-to-end model originates from its ability to assess higher-order cognitive features, such as the vorticity of the vector field, rather than conventional local features of the image displacement.
Chinese Translation
从光学观测中无接触和非侵入性地估计物理介质的机械性质在多种工程和生物医学应用中具有重要意义,尤其是在无法进行直接物理测量的情况下。传统的图像位移评估和非接触材料探测方法通常依赖于耗时的迭代算法进行非刚性图像配准和使用离散化及迭代数值求解技术(如有限元法(FEM)和有限差分法(FDM))的本构建模,这些方法不适合高通量数据处理。在此,我们提出了一种高效的基于深度学习的端到端方法,能够直接从图像序列中估计连续位移和材料可压缩性。该框架基于两个深度神经网络进行图像配准和材料可压缩性估计,在效率和准确性方面超越了传统方法。特别是,我们的实验结果表明,基于一组参考数据训练的深度学习模型能够准确确定材料的可压缩性,即使在图像配准所预测的映射与参考位移场存在显著局部偏差的情况下。我们的研究结果表明,深度学习端到端模型的卓越准确性源于其评估更高阶认知特征的能力,例如向量场的涡度,而非传统的图像位移局部特征。
cs.CV / 30 / 2602.07069
Bidirectional Reward-Guided Diffusion for Real-World Image Super-Resolution
双向奖励引导扩散用于真实世界图像超分辨率
Abstract
Diffusion-based super-resolution can synthesize rich details, but models trained on synthetic paired data often fail on real-world LR images due to distribution shifts. We propose Bird-SR, a bidirectional reward-guided diffusion framework that formulates super-resolution as trajectory-level preference optimization via reward feedback learning (ReFL), jointly leveraging synthetic LR-HR pairs and real-world LR images. For structural fidelity easily affected in ReFL, the model is directly optimized on synthetic pairs at early diffusion steps, which also facilitates structure preservation for real-world inputs under smaller distribution gap in structure levels. For perceptual enhancement, quality-guided rewards are applied at later sampling steps to both synthetic and real LR images. To mitigate reward hacking, the rewards for synthetic results are formulated in a relative advantage space bounded by their clean counterparts, while real-world optimization is regularized via a semantic alignment constraint. Furthermore, to balance structural and perceptual learning, we adopt a dynamic fidelity-perception weighting strategy that emphasizes structure preservation at early stages and progressively shifts focus toward perceptual optimization at later diffusion steps. Extensive experiments on real-world SR benchmarks demonstrate that Bird-SR consistently outperforms state-of-the-art methods in perceptual quality while preserving structural consistency, validating its effectiveness for real-world super-resolution.
Chinese Translation
基于扩散的超分辨率能够合成丰富的细节,但在合成配对数据上训练的模型往往因分布偏移而在真实世界的低分辨率(LR)图像上表现不佳。我们提出了Bird-SR,一个双向奖励引导的扩散框架,将超分辨率表述为通过奖励反馈学习(ReFL)进行轨迹级偏好优化,联合利用合成的低分辨率-高分辨率(LR-HR)配对和真实世界的低分辨率图像。为了应对ReFL中容易受到影响的结构保真性,模型在早期扩散步骤中直接在合成配对上进行优化,这也有助于在结构层面较小的分布差距下保持真实世界输入的结构。为了增强感知质量,在后期采样步骤中对合成和真实的低分辨率图像应用质量引导奖励。为减轻奖励操控,合成结果的奖励在其干净对应物的相对优势空间中进行构造,而真实世界的优化则通过语义对齐约束进行正则化。此外,为了平衡结构和感知学习,我们采用动态保真度-感知加权策略,在早期阶段强调结构保留,并在后期扩散步骤中逐步转向感知优化。大量在真实世界超分辨率基准上的实验表明,Bird-SR在感知质量上始终优于最先进的方法,同时保持结构一致性,验证了其在真实世界超分辨率中的有效性。
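The dynamic fidelity-perception weighting described above could be as simple as a linear schedule over diffusion steps; Bird-SR's actual schedule is not given in the abstract, so the linear form below is an assumed stand-in:

```python
def fidelity_perception_weights(step, total_steps):
    """Linear schedule: structural fidelity dominates early diffusion steps and
    perceptual quality dominates later ones; the two weights always sum to 1."""
    progress = step / (total_steps - 1)
    w_fidelity = 1.0 - progress
    return w_fidelity, 1.0 - w_fidelity

w_f, w_p = fidelity_perception_weights(0, 50)
assert (w_f, w_p) == (1.0, 0.0)          # pure structure preservation at the first step
w_f, w_p = fidelity_perception_weights(49, 50)
assert (w_f, w_p) == (0.0, 1.0)          # pure perceptual optimization at the last step
```

Any monotone schedule with the same endpoints would express the abstract's idea; the weights would multiply the fidelity loss on synthetic pairs and the quality-guided reward respectively.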
cs.CV / 31 / 2602.07082
MosaicThinker: On-Device Visual Spatial Reasoning for Embodied AI via Iterative Construction of Space Representation
MosaicThinker:通过迭代构建空间表示实现具身智能的设备端视觉空间推理
Abstract
When embodied AI is expanding from traditional object detection and recognition to more advanced tasks of robot manipulation and actuation planning, visual spatial reasoning from the video inputs is necessary to perceive the spatial relationships of objects and guide device actions. However, existing visual language models (VLMs) have very weak capabilities in spatial reasoning due to the lack of knowledge about 3D spatial information, especially when the reasoning task involves complex spatial relations across multiple video frames. In this paper, we present a new inference-time computing technique for on-device embodied AI, namely \emph{MosaicThinker}, which enhances the on-device small VLM's spatial reasoning capabilities on difficult cross-frame reasoning tasks. Our basic idea is to integrate fragmented spatial information from multiple frames into a unified space representation of global semantic map, and further guide the VLM's spatial reasoning over the semantic map via a visual prompt. Experiment results show that our technique can greatly enhance the accuracy of cross-frame spatial reasoning on resource-constrained embodied AI devices, over reasoning tasks with diverse types and complexities.
Chinese Translation
随着具身智能从传统的物体检测和识别扩展到更高级的机器人操作和执行规划任务,基于视频输入的视觉空间推理对于感知物体的空间关系和指导设备动作是必要的。然而,现有的视觉语言模型(VLM)在空间推理方面的能力非常薄弱,因为它们缺乏关于三维空间信息的知识,尤其是在推理任务涉及跨多个视频帧的复杂空间关系时。在本文中,我们提出了一种新的面向设备端具身智能的推理时计算技术,即\emph{MosaicThinker},该技术增强了设备端小型 VLM 在困难的跨帧推理任务中的空间推理能力。我们的基本思路是将来自多个帧的碎片化空间信息整合为统一的全局语义图的空间表示,并通过视觉提示进一步指导 VLM 在语义图上的空间推理。实验结果表明,我们的技术能够显著提高资源受限的具身智能设备在各种类型和复杂性的推理任务上的跨帧空间推理准确性。
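Fusing per-frame detections into a unified semantic map, as MosaicThinker does, can be illustrated with a naive averaging fuse. This assumes detections are already expressed in a shared world frame, which glosses over the iterative construction the paper actually performs; all object names and coordinates are invented:

```python
def build_semantic_map(frame_detections):
    """Fuse per-frame object observations (name, world x, world y) into a single
    global semantic map by averaging repeated sightings of the same object."""
    accum = {}
    for frame in frame_detections:
        for name, x, y in frame:
            sx, sy, n = accum.get(name, (0.0, 0.0, 0))
            accum[name] = (sx + x, sy + y, n + 1)
    return {name: (sx / n, sy / n) for name, (sx, sy, n) in accum.items()}

frames = [
    [("chair", 1.0, 2.0), ("table", 4.0, 0.0)],   # frame 1
    [("chair", 1.2, 2.2)],                        # frame 2: the chair is seen again
]
world = build_semantic_map(frames)
assert world["table"] == (4.0, 0.0)
assert abs(world["chair"][0] - 1.1) < 1e-9 and abs(world["chair"][1] - 2.1) < 1e-9
```

Rendering such a map back into an image is what would serve as the visual prompt guiding the VLM's cross-frame reasoning.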
cs.CV / 32 / 2602.07095
WorldEdit: Towards Open-World Image Editing with a Knowledge-Informed Benchmark
WorldEdit:面向开放世界图像编辑的知识驱动基准
Abstract
Recent advances in image editing models have demonstrated remarkable capabilities in executing explicit instructions, such as attribute manipulation, style transfer, and pose synthesis. However, these models often face challenges when dealing with implicit editing instructions, which describe the cause of a visual change without explicitly detailing the resulting outcome. These limitations arise because existing models rely on uniform editing strategies that are not equipped to handle the complex world knowledge and reasoning required for implicit instructions. To address this gap, we introduce \textbf{WorldEdit}, a dataset specifically designed to enable world-driven image editing. WorldEdit consists of high-quality editing samples, guided by paraphrased instructions that align with real-world causal logic. Furthermore, we provide \textbf{WorldEdit-Test} for evaluating the existing model's performance on causal editing scenarios. With WorldEdit, we use a two-stage training framework for fine-tuning models like Bagel, integrating with a causal verification reward. Our results show that the proposed dataset and methods significantly narrow the gap with GPT-4o and Nano-Banana, demonstrating competitive performance not only in instruction following but also in knowledge plausibility, where many open-source systems typically struggle.
Chinese Translation
最近在图像编辑模型方面的进展展示了在执行明确指令(如属性操控、风格迁移和姿态合成)方面的显著能力。然而,这些模型在处理隐含编辑指令时常常面临挑战,这些指令描述了视觉变化的原因,但并未明确详细说明结果。这些局限性源于现有模型依赖于统一的编辑策略,而这些策略并未具备处理隐含指令所需的复杂世界知识和推理能力。为了解决这一问题,我们引入了WorldEdit,这是一个专门设计用于实现世界驱动图像编辑的数据集。WorldEdit包含高质量的编辑样本,这些样本由与现实世界因果逻辑相一致的改写指令引导。此外,我们提供了WorldEdit-Test用于评估现有模型在因果编辑场景中的表现。在WorldEdit的基础上,我们采用了一个两阶段的训练框架来微调像Bagel这样的模型,并结合因果验证奖励。我们的结果表明,所提出的数据集和方法显著缩小了与GPT-4o和Nano-Banana之间的差距,在遵循指令和知识合理性方面表现出竞争力,而许多开源系统通常在这方面表现不佳。
cs.CV / 33 / 2602.07100
TLC-Plan: A Two-Level Codebook Based Network for End-to-End Vector Floorplan Generation
TLC-Plan:一种基于双层代码本的端到端矢量平面图生成网络
Abstract
Automated floorplan generation aims to improve design quality, architectural efficiency, and sustainability by jointly modeling global spatial organization and precise geometric detail. However, existing approaches operate in raster space and rely on post hoc vectorization, which introduces structural inconsistencies and hinders end-to-end learning. Motivated by compositional spatial reasoning, we propose TLC-Plan, a hierarchical generative model that directly synthesizes vector floorplans from input boundaries, aligning with human architectural workflows based on modular and reusable patterns. TLC-Plan employs a two-level VQ-VAE to encode global layouts as semantically labeled room bounding boxes and to refine local geometries using polygon-level codes. This hierarchy is unified in a CodeTree representation, while an autoregressive transformer samples codes conditioned on the boundary to generate diverse and topologically valid designs, without requiring explicit room topology or dimensional priors. Extensive experiments show state-of-the-art performance on the RPLAN dataset (FID = 1.84, MSE = 2.06) and leading results on the LIFULL dataset. The proposed framework advances constraint-aware and scalable vector floorplan generation for real-world architectural applications. Source code and trained models are released at https://github.com/rosolose/TLC-PLAN.
Chinese Translation
自动化平面图生成旨在通过联合建模全局空间组织和精确几何细节来提高设计质量、建筑效率和可持续性。然而,现有方法在光栅空间中操作,并依赖于事后矢量化,这引入了结构不一致性并阻碍了端到端学习。受到组合空间推理的启发,我们提出了TLC-Plan,一种层次生成模型,能够直接从输入边界合成矢量平面图,符合基于模块化和可重用模式的人类建筑工作流程。TLC-Plan采用双层VQ-VAE将全局布局编码为语义标记的房间边界框,并使用多边形级代码细化局部几何形状。该层次结构在CodeTree表示中统一,而自回归变换器根据边界采样代码,以生成多样且拓扑有效的设计,而无需显式的房间拓扑或尺寸先验。大量实验表明,在RPLAN数据集上取得了最先进的性能(FID = 1.84,MSE = 2.06),并在LIFULL数据集上取得了领先结果。所提出的框架推动了面向现实世界建筑应用的约束感知和可扩展的矢量平面图生成。源代码和训练模型已在 https://github.com/rosolose/TLC-PLAN 发布。
cs.CV / 34 / 2602.07101
Zero-Shot UAV Navigation in Forests via Relightable 3D Gaussian Splatting
基于可重光照的三维高斯点云的零样本无人机森林导航
Abstract
UAV navigation in unstructured outdoor environments using passive monocular vision is hindered by the substantial visual domain gap between simulation and reality. While 3D Gaussian Splatting enables photorealistic scene reconstruction from real-world data, existing methods inherently couple static lighting with geometry, severely limiting policy generalization to dynamic real-world illumination. In this paper, we propose a novel end-to-end reinforcement learning framework designed for effective zero-shot transfer to unstructured outdoors. Within a high-fidelity simulation grounded in real-world data, our policy is trained to map raw monocular RGB observations directly to continuous control commands. To overcome photometric limitations, we introduce Relightable 3D Gaussian Splatting, which decomposes scene components to enable explicit, physically grounded editing of environmental lighting within the neural representation. By augmenting training with diverse synthesized lighting conditions ranging from strong directional sunlight to diffuse overcast skies, we compel the policy to learn robust, illumination-invariant visual features. Extensive real-world experiments demonstrate that a lightweight quadrotor achieves robust, collision-free navigation in complex forest environments at speeds up to 10 m/s, exhibiting significant resilience to drastic lighting variations without fine-tuning.
Chinese Translation
在非结构化户外环境中,利用被动单目视觉进行无人机导航受到模拟与现实之间显著视觉领域差距的阻碍。虽然三维高斯点云技术能够从真实世界数据中实现照片级真实感场景重建,但现有方法本质上将静态光照与几何形状耦合,严重限制了策略在动态现实光照下的泛化能力。本文提出了一种新颖的端到端强化学习框架,旨在有效实现对非结构化户外环境的零样本迁移。在基于真实世界数据的高保真模拟中,我们的策略被训练为将原始单目RGB观测直接映射到连续控制命令。为了克服光度限制,我们引入了可重光照的三维高斯点云技术,该技术将场景组件分解,以便在神经表示中显式、物理基础地编辑环境光照。通过用从强方向性阳光到漫射阴天的多样化合成光照条件增强训练,我们迫使策略学习稳健的、不受光照影响的视觉特征。大量的现实世界实验表明,一种轻量级四旋翼无人机在复杂森林环境中以高达10米/秒的速度实现了稳健的无碰撞导航,并在没有微调的情况下表现出对剧烈光照变化的显著韧性。
cs.CV / 35 / 2602.07104
Extended to Reality: Prompt Injection in 3D Environments
扩展至现实:3D环境中的提示注入
Abstract
Multimodal large language models (MLLMs) have advanced the capabilities to interpret and act on visual input in 3D environments, empowering diverse applications such as robotics and situated conversational agents. When MLLMs reason over camera-captured views of the physical world, a new attack surface emerges: an attacker can place text-bearing physical objects in the environment to override MLLMs' intended task. While prior work has studied prompt injection in the text domain and through digitally edited 2D images, it remains unclear how these attacks function in 3D physical environments. To bridge the gap, we introduce PI3D, a prompt injection attack against MLLMs in 3D environments, realized through text-bearing physical object placement rather than digital image edits. We formulate and solve the problem of identifying an effective 3D object pose (position and orientation) with injected text, where the attacker's goal is to induce the MLLM to perform the injected task while ensuring that the object placement remains physically plausible. Experiments demonstrate that PI3D is an effective attack against multiple MLLMs under diverse camera trajectories. We further evaluate existing defenses and show that they are insufficient to defend against PI3D.
Chinese Translation
多模态大语言模型(MLLMs)在解释和处理3D环境中的视觉输入方面取得了进展,使得机器人技术和情境对话代理等多种应用成为可能。当MLLMs对摄像头捕捉的物理世界视图进行推理时,出现了一种新的攻击面:攻击者可以在环境中放置带有文本的物理对象,从而覆盖MLLMs的预期任务。尽管之前的研究已经探讨了文本领域和通过数字编辑的2D图像中的提示注入,但在3D物理环境中这些攻击如何运作仍不明确。为了解决这一问题,我们提出了PI3D,一种针对3D环境中MLLMs的提示注入攻击,通过放置带有文本的物理对象而非数字图像编辑来实现。我们制定并解决了识别有效的3D对象姿态(位置和方向)以注入文本的问题,攻击者的目标是诱导MLLM执行注入的任务,同时确保对象的放置在物理上是合理的。实验表明,PI3D在多种相机轨迹下对多个MLLMs是一种有效的攻击。我们进一步评估了现有的防御措施,并表明它们不足以防御PI3D。
cs.CV / 36 / 2602.07106
Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models
Ex-Omni:为全模态大型语言模型启用3D人脸动画生成
Abstract
Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between discrete, token-level semantic reasoning in LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, leveraging speech units as temporal scaffolding and a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We further introduce InstructEx, a dataset designed to facilitate augmenting OLLMs with speech-accompanied 3D facial animation. Extensive experiments demonstrate that Ex-Omni performs competitively against existing open-source OLLMs while enabling stable aligned speech and facial animation generation.
Chinese Translation
全模态大型语言模型(OLLMs)旨在统一多模态理解和生成,然而,尽管语音伴随的3D人脸动画对自然交互十分重要,这一方向在很大程度上仍未被探索。一个关键挑战在于LLMs中离散的、基于标记的语义推理与3D人脸运动所需的密集、细粒度时间动态之间的表示不匹配,这使得在有限数据下进行直接建模的优化变得困难。我们提出了Expressive Omni(Ex-Omni),一个开源的全模态框架,为OLLMs增加了语音伴随的3D人脸动画能力。Ex-Omni通过将语义推理与时间生成解耦来降低学习难度,利用语音单元作为时间支架,并采用统一的标记作为查询的门控融合(TQGF)机制进行受控的语义注入。我们进一步引入了InstructEx,一个旨在促进为OLLMs增加语音伴随3D人脸动画能力的数据集。大量实验表明,Ex-Omni在与现有开源OLLMs的竞争中表现出色,同时实现了稳定的对齐语音和人脸动画生成。
cs.CV / 37 / 2602.07149
Privacy in Image Datasets: A Case Study on Pregnancy Ultrasounds
图像数据集中的隐私问题:以孕期超声波为例
Abstract
The rise of generative models has led to increased use of large-scale datasets collected from the internet, often with minimal or no data curation. This raises concerns about the inclusion of sensitive or private information. In this work, we explore the presence in such datasets of pregnancy ultrasound images, which contain sensitive personal information and are often shared online. Through a systematic examination of the LAION-400M dataset using CLIP embedding similarity, we retrieve images containing pregnancy ultrasounds and detect thousands of instances of private information such as names and locations. Our findings reveal that multiple images carry high-risk information that could enable re-identification or impersonation. We conclude with recommended practices for dataset curation, data privacy, and ethical use of public image datasets.
Chinese Translation
生成模型的兴起导致了大规模数据集的广泛使用,这些数据集通常来自互联网,且数据整理工作极少或根本没有。这引发了对敏感或私人信息包含的担忧。在本研究中,我们探讨了孕期超声波图像的存在,这些图像包含敏感的个人信息,且经常在网上分享。通过对LAION-400M数据集进行系统性检查,利用CLIP嵌入相似性,我们检索到包含孕期超声波的图像,并检测到数千个私人信息实体,如姓名和地点。我们的研究结果显示,多张图像包含高风险信息,这可能导致重新识别或冒充。最后,我们提出了数据集整理、数据隐私和公共图像数据集伦理使用的推荐实践。
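The retrieval step described in the abstract above (finding dataset images whose CLIP embeddings are close to a query concept) can be sketched as a cosine-similarity nearest-neighbor search. This is a minimal illustration with random placeholder embeddings; the function name and `top_k` parameter are assumptions for exposition, not the paper's actual pipeline.

```python
import numpy as np

def retrieve_by_similarity(image_embs, query_emb, top_k=5):
    """Rank images by cosine similarity to a query embedding.

    image_embs: (N, D) array of CLIP image embeddings.
    query_emb:  (D,) CLIP embedding of the query concept.
    Returns indices of the top_k most similar images.
    """
    # Normalize rows so the dot product equals cosine similarity.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = img @ q                    # (N,) cosine similarities
    return np.argsort(-sims)[:top_k]  # highest similarity first

# Toy example: 4 random "embeddings", query equal to embedding 2.
rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 8))
idx = retrieve_by_similarity(embs, embs[2], top_k=2)
print(idx[0])  # 2 -- the query's own embedding ranks first
```

At LAION scale the exhaustive dot product would be replaced by an approximate nearest-neighbor index, but the ranking criterion is the same.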
cs.CV / 38 / 2602.07174
DuMeta++: Spatiotemporal Dual Meta-Learning for Generalizable Few-Shot Brain Tissue Segmentation Across Diverse Ages
DuMeta++:用于跨年龄普适性少样本脑组织分割的时空双重元学习
Abstract
Accurate segmentation of brain tissues from MRI scans is critical for neuroscience and clinical applications, but achieving consistent performance across the human lifespan remains challenging due to dynamic, age-related changes in brain appearance and morphology. While prior work has sought to mitigate these shifts by using self-supervised regularization with paired longitudinal data, such data are often unavailable in practice. To address this, we propose \emph{DuMeta++}, a dual meta-learning framework that operates without paired longitudinal data. Our approach integrates: (1) meta-feature learning to extract age-agnostic semantic representations of spatiotemporally evolving brain structures, and (2) meta-initialization learning to enable data-efficient adaptation of the segmentation model. Furthermore, we propose a memory-bank-based class-aware regularization strategy to enforce longitudinal consistency without explicit longitudinal supervision. We theoretically prove the convergence of our DuMeta++, ensuring stability. Experiments on diverse datasets (iSeg-2019, IBIS, OASIS, ADNI) under few-shot settings demonstrate that DuMeta++ outperforms existing methods in cross-age generalization. Code will be available at https://github.com/ladderlab-xjtu/DuMeta++.
Chinese Translation
从MRI扫描中准确分割脑组织对于神经科学和临床应用至关重要,但由于与年龄相关的脑部外观和形态的动态变化,在整个人类生命周期中实现一致的性能仍然具有挑战性。虽然之前的研究试图通过使用配对的纵向数据进行自监督正则化来减轻这些变化,但在实际应用中,这类数据往往不可用。为了解决这一问题,我们提出了DuMeta++,一种不依赖配对纵向数据的双重元学习框架。我们的方法集成了:(1) 元特征学习,以提取与年龄无关的时空演变脑结构的语义表示;(2) 元初始化学习,以实现分割模型的数据高效适应。此外,我们提出了一种基于记忆库的类感知正则化策略,以在没有明确纵向监督的情况下强制执行纵向一致性。我们理论上证明了DuMeta++的收敛性,确保了其稳定性。在少样本设置下对多样化数据集(iSeg-2019、IBIS、OASIS、ADNI)的实验表明,DuMeta++在跨年龄泛化方面优于现有方法。代码将发布在 https://github.com/ladderlab-xjtu/DuMeta++。
cs.CV / 39 / 2602.07198
Condition Matters in Full-head 3D GANs
条件在全头3D GAN中至关重要
Abstract
Conditioning is crucial for stable training of full-head 3D GANs. Without any conditioning signal, the model suffers from severe mode collapse, making it impractical to train. However, a series of previous full-head 3D GANs conventionally choose the view angle as the conditioning input, which leads to a bias in the learned 3D full-head space along the conditional view direction. This is evident in the significant differences in generation quality and diversity between the conditional view and non-conditional views of the generated 3D heads, resulting in global incoherence across different head regions. In this work, we propose to use view-invariant semantic features as the conditioning input, thereby decoupling the generative capability of 3D heads from the viewing direction. To construct a view-invariant semantic condition for each training image, we create a novel synthesized head image dataset. We leverage FLUX.1 Kontext to extend existing high-quality frontal face datasets to a wide range of view angles. The CLIP image feature extracted from the frontal view is then used as a shared semantic condition across all views in the extended images, ensuring semantic alignment while eliminating directional bias. This also allows supervision from different views of the same subject to be consolidated under a shared semantic condition, which accelerates training and enhances the global coherence of the generated 3D heads. Moreover, as GANs often experience slower improvements in diversity once the generator learns a few modes that successfully fool the discriminator, our semantic conditioning encourages the generator to follow the true semantic distribution, thereby promoting continuous learning and diverse generation. Extensive experiments on full-head synthesis and single-view GAN inversion demonstrate that our method achieves significantly higher fidelity, diversity, and generalizability.
Chinese Translation
条件化对于全头3D GAN的稳定训练至关重要。在没有任何条件信号的情况下,模型会遭遇严重的模式崩溃,使得训练变得不切实际。然而,以往的一系列全头3D GAN通常选择视角作为条件输入,这导致在条件视角方向上学习的3D全头空间存在偏差。这在生成的3D头部的条件视角和非条件视角之间的生成质量和多样性显著差异中表现得尤为明显,导致不同头部区域之间的全局不一致。在本研究中,我们提出使用视角不变的语义特征作为条件输入,从而将3D头部的生成能力与视角解耦。为了为每个训练图像构建视角不变的语义条件,我们创建了一个新颖的合成头部图像数据集。我们利用FLUX.1 Kontext扩展现有的高质量正面人脸数据集,以涵盖广泛的视角。然后,从正面视角提取的CLIP图像特征被用作扩展图像中所有视角的共享语义条件,确保语义对齐,同时消除方向偏差。这还允许来自同一对象的不同视角的监督在共享语义条件下进行整合,从而加速训练并增强生成的3D头部的全局一致性。此外,由于GAN在生成器学习到一些成功欺骗鉴别器的模式后,通常会在多样性方面的改进变得缓慢,我们的语义条件鼓励生成器遵循真实的语义分布,从而促进持续学习和多样化生成。在全头合成和单视图GAN反演的广泛实验中,我们的方法在保真度、多样性和泛化能力上显著提高。
cs.CV / 40 / 2602.07212
Understanding Real-World Traffic Safety through RoadSafe365 Benchmark
通过 RoadSafe365 基准理解现实世界的交通安全
Abstract
Although recent traffic benchmarks have advanced multimodal data analysis, they generally lack systematic evaluation aligned with official safety standards. To fill this gap, we introduce RoadSafe365, a large-scale vision-language benchmark that supports fine-grained analysis of traffic safety from extensive and diverse real-world video data collections. Unlike prior works that focus primarily on coarse accident identification, RoadSafe365 is independently curated and systematically organized using a hierarchical taxonomy that refines and extends foundational definitions of crash, incident, and violation to bridge official traffic safety standards with data-driven traffic understanding systems. RoadSafe365 provides rich attribute annotations across diverse traffic event types, environmental contexts, and interaction scenarios, yielding 36,196 annotated clips from both dashcam and surveillance cameras. Each clip is paired with multiple-choice question-answer sets, comprising 864K candidate options, 8.4K unique answers, and 36K detailed scene descriptions collectively designed for vision-language understanding and reasoning. We establish strong baselines and observe consistent gains when fine-tuning on RoadSafe365. Cross-domain experiments on both real and synthetic datasets further validate its effectiveness. Designed for large-scale training and standardized evaluation, RoadSafe365 provides a comprehensive benchmark to advance reproducible research in real-world traffic safety analysis.
Chinese Translation
尽管近期的交通基准在多模态数据分析方面取得了进展,但它们通常缺乏与官方安全标准相一致的系统评估。为填补这一空白,我们引入了 RoadSafe365,这是一个大规模的视觉-语言基准,支持对来自广泛且多样的现实世界视频数据集的交通安全进行细粒度分析。与以往主要关注粗略事故识别的研究不同,RoadSafe365 是独立策划并系统组织的,采用分层分类法,细化和扩展了事故、事件和违规的基础定义,以桥接官方交通安全标准与数据驱动的交通理解系统。RoadSafe365 提供了丰富的属性注释,涵盖多种交通事件类型、环境背景和互动场景,共计提供了 36,196 个来自行车记录仪和监控摄像头的注释片段。每个片段配有多项选择题-答案集,包括 864K 候选选项、8.4K 独特答案和 36K 详细场景描述,旨在支持视觉-语言理解和推理。我们建立了强有力的基准,并在 RoadSafe365 上进行微调时观察到一致的提升。对真实和合成数据集的跨领域实验进一步验证了其有效性。RoadSafe365 旨在支持大规模训练和标准化评估,为推动现实世界交通安全分析的可重复研究提供了全面的基准。
cs.CV / 41 / 2602.07251
The Double-Edged Sword of Data-Driven Super-Resolution: Adversarial Super-Resolution Models
数据驱动超分辨率的双刃剑:对抗性超分辨率模型
Abstract
Data-driven super-resolution (SR) methods are often integrated into imaging pipelines as preprocessing steps to improve downstream tasks such as classification and detection. However, these SR models introduce a previously unexplored attack surface into imaging pipelines. In this paper, we present AdvSR, a framework demonstrating that adversarial behavior can be embedded directly into SR model weights during training, requiring no access to inputs at inference time. Unlike prior attacks that perturb inputs or rely on backdoor triggers, AdvSR operates entirely at the model level. By jointly optimizing for reconstruction quality and targeted adversarial outcomes, AdvSR produces models that appear benign under standard image quality metrics while inducing downstream misclassification. We evaluate AdvSR on three SR architectures (SRCNN, EDSR, SwinIR) paired with a YOLOv11 classifier and demonstrate that AdvSR models can achieve high attack success rates with minimal quality degradation. These findings highlight a new model-level threat for imaging pipelines, with implications for how practitioners source and validate models in safety-critical applications.
Chinese Translation
数据驱动的超分辨率(SR)方法通常作为预处理步骤集成到成像流程中,以改善分类和检测等下游任务。然而,这些SR模型为成像流程引入了一个之前未被探索的攻击面。在本文中,我们提出了AdvSR,一个框架,展示了对抗性行为可以在训练期间直接嵌入SR模型权重中,而在推理时无需访问输入。与之前通过扰动输入或依赖后门触发的攻击不同,AdvSR完全在模型层面上操作。通过联合优化重建质量和针对性的对抗性结果,AdvSR生成的模型在标准图像质量指标下看似良性,但却导致下游的错误分类。我们在三种SR架构(SRCNN、EDSR、SwinIR)与YOLOv11分类器配对的情况下评估了AdvSR,并证明AdvSR模型可以在最小质量降级的情况下实现高攻击成功率。这些发现突显了成像流程中一种新的模型级威胁,并对从业者在安全关键应用中如何获取和验证模型产生了影响。
cs.CV / 42 / 2602.07260
3D Transport-based Morphometry (3D-TBM) for medical image analysis
基于3D传输的形态测量(3D-TBM)在医学图像分析中的应用
Abstract
Transport-Based Morphometry (TBM) has emerged as a new framework for 3D medical image analysis. By embedding images into a transport domain via invertible transformations, TBM facilitates effective classification, regression, and other tasks using transport-domain features. Crucially, the inverse mapping enables the projection of analytic results back into the original image space, allowing researchers to directly interpret clinical features associated with model outputs in a spatially meaningful way. To facilitate broader adoption of TBM in clinical imaging research, we present 3D-TBM, a tool designed for morphological analysis of 3D medical images. The framework includes data preprocessing, computation of optimal transport embeddings, and analytical methods such as visualization of main transport directions, together with techniques for discerning discriminating directions and related analysis methods. We also provide comprehensive documentation and practical tutorials to support researchers interested in applying 3D-TBM in their own medical imaging studies. The source code is publicly available through PyTransKit.
Chinese Translation
基于传输的形态测量(TBM)已成为3D医学图像分析的新框架。通过可逆变换将图像嵌入传输域,TBM促进了利用传输域特征进行有效的分类、回归和其他任务。关键在于,逆映射使得分析结果能够投影回原始图像空间,从而使研究人员能够以空间上有意义的方式直接解释与模型输出相关的临床特征。为了促进TBM在临床影像研究中的更广泛应用,我们提出了3D-TBM,这是一个旨在对3D医学图像进行形态分析的工具。该框架包括数据预处理、最优传输嵌入的计算,以及主要传输方向的可视化等分析方法,同时提供了区分方向和相关分析方法的技术。我们还提供了全面的文档和实用教程,以支持有意在自身医学影像研究中应用3D-TBM的研究人员。源代码通过PyTransKit公开获取。
cs.CV / 43 / 2602.07262
TwistNet-2D: Learning Second-Order Channel Interactions via Spiral Twisting for Texture Recognition
TwistNet-2D:通过螺旋扭转学习二阶通道交互以进行纹理识别
Abstract
Second-order feature statistics are central to texture recognition, yet current methods face a fundamental tension: bilinear pooling and Gram matrices capture global channel correlations but collapse spatial structure, while self-attention models spatial context through weighted aggregation rather than explicit pairwise feature interactions. We introduce TwistNet-2D, a lightweight module that computes \emph{local} pairwise channel products under directional spatial displacement, jointly encoding where features co-occur and how they interact. The core component, Spiral-Twisted Channel Interaction (STCI), shifts one feature map along a prescribed direction before element-wise channel multiplication, thereby capturing the cross-position co-occurrence patterns characteristic of structured and periodic textures. Aggregating four directional heads with learned channel reweighting and injecting the result through a sigmoid-gated residual path, TwistNet-2D incurs only 3.5% additional parameters and 2% additional FLOPs over ResNet-18, yet consistently surpasses both parameter-matched and substantially larger baselines -- including ConvNeXt, Swin Transformer, and hybrid CNN--Transformer architectures -- across four texture and fine-grained recognition benchmarks.
Chinese Translation
二阶特征统计在纹理识别中至关重要,但当前方法面临根本性的矛盾:双线性池化和Gram矩阵捕捉全局通道相关性,却坍缩了空间结构;而自注意力模型通过加权聚合建模空间上下文,而非显式的成对特征交互。我们提出了TwistNet-2D,这是一种轻量级模块,能够在方向性空间位移下计算局部成对通道乘积,联合编码特征在何处共现及其如何交互。核心组件螺旋扭转通道交互(Spiral-Twisted Channel Interaction, STCI)在逐元素通道乘法之前,沿预定方向移动一个特征图,从而捕捉结构化和周期性纹理特有的跨位置共现模式。通过学习的通道重加权聚合四个方向头,并通过sigmoid门控残差路径注入结果,相较于ResNet-18,TwistNet-2D仅增加3.5%的参数和2%的FLOPs,却在四个纹理和细粒度识别基准测试中始终超越参数匹配乃至显著更大的基线模型,包括ConvNeXt、Swin Transformer以及混合CNN-Transformer架构。
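The core STCI operation as described in the abstract (shift one feature map along a direction, then multiply element-wise with the unshifted copy to get local second-order channel interactions) can be sketched in a few lines of NumPy. The circular shift, the four hard-coded directions, and the plain averaging of heads are simplifying assumptions for illustration; the paper uses learned channel reweighting and a gated residual path instead.

```python
import numpy as np

def stci_head(x, shift):
    """One directional interaction head.

    x:     (C, H, W) feature map.
    shift: (dy, dx) spatial displacement applied before multiplication.
    Returns the element-wise product of x with its shifted copy,
    capturing cross-position channel co-occurrences.
    """
    # Circular shift stands in for whatever boundary handling is used.
    shifted = np.roll(x, shift=shift, axis=(1, 2))
    return x * shifted

def stci(x, shifts=((0, 1), (1, 0), (1, 1), (1, -1))):
    """Aggregate four directional heads by simple averaging
    (a stand-in for the paper's learned channel reweighting)."""
    return np.mean([stci_head(x, s) for s in shifts], axis=0)

x = np.ones((8, 4, 4))  # constant map: every pairwise product is 1
out = stci(x)
print(out.shape)        # (8, 4, 4) -- spatial shape is preserved
```

Because the output keeps the input's shape, the result can be injected back through a residual connection, which is how the module stays lightweight relative to global bilinear pooling.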
cs.CV / 44 / 2602.07272
VideoNeuMat: Neural Material Extraction from Generative Video Models
VideoNeuMat:从生成视频模型中提取神经材料
Abstract
Creating photorealistic materials for 3D rendering requires exceptional artistic skill. Generative models for materials could help, but are currently limited by the lack of high-quality training data. While recent video generative models effortlessly produce realistic material appearances, this knowledge remains entangled with geometry and lighting. We present VideoNeuMat, a two-stage pipeline that extracts reusable neural material assets from video diffusion models. First, we finetune a large video model (Wan 2.1 14B) to generate material sample videos under controlled camera and lighting trajectories, effectively creating a "virtual gonioreflectometer" that preserves the model's material realism while learning a structured measurement pattern. Second, we reconstruct compact neural materials from these videos through a Large Reconstruction Model (LRM) finetuned from a smaller Wan 1.3B video backbone. From 17 generated video frames, our LRM performs single-pass inference to predict neural material parameters that generalize to novel viewing and lighting conditions. The resulting materials exhibit realism and diversity far exceeding the limited synthetic training data, demonstrating that material knowledge can be successfully transferred from internet-scale video models into standalone, reusable neural 3D assets.
Chinese Translation
创建逼真的3D渲染材料需要卓越的艺术技巧。材料的生成模型可以提供帮助,但目前受限于缺乏高质量的训练数据。尽管最近的视频生成模型能够轻松生成逼真的材料外观,但这些知识仍与几何形状和光照相互交织。我们提出了VideoNeuMat,一个从视频扩散模型中提取可重用神经材料资产的两阶段管道。首先,我们对一个大型视频模型(Wan 2.1 14B)进行微调,以在受控的相机和光照轨迹下生成材料样本视频,有效地创建一个“虚拟测角反射计”,在学习结构化测量模式的同时保持模型的材料真实感。其次,我们通过一个从较小的Wan 1.3B视频骨干网络微调的“大重建模型”(Large Reconstruction Model, LRM)从这些视频中重建紧凑的神经材料。从17帧生成的视频帧中,我们的LRM执行单次推理,以预测能够推广到新视角和光照条件的神经材料参数。最终生成的材料展现出远超有限合成训练数据的真实感和多样性,证明了材料知识可以成功地从互联网规模的视频模型转移到独立的、可重用的神经3D资产中。
cs.CV / 45 / 2602.07277
Cross-View World Models
跨视角世界模型
Abstract
World models enable agents to plan by imagining future states, but existing approaches operate from a single viewpoint, typically egocentric, even when other perspectives would make planning easier; navigation, for instance, benefits from a bird's-eye view. We introduce Cross-View World Models (XVWM), trained with a cross-view prediction objective: given a sequence of frames from one viewpoint, predict the future state from the same or a different viewpoint after an action is taken. Enforcing cross-view consistency acts as geometric regularization: because the input and output views may share little or no visual overlap, to predict across viewpoints, the model must learn view-invariant representations of the environment's 3D structure. We train on synchronized multi-view gameplay data from Aimlabs, an aim-training platform providing precisely aligned multi-camera recordings with high-frequency action labels. The resulting model gives agents parallel imagination streams across viewpoints, enabling planning in whichever frame of reference best suits the task while executing from the egocentric view. Our results show that multi-view consistency provides a strong learning signal for spatially grounded representations. Finally, predicting the consequences of one's actions from another viewpoint may offer a foundation for perspective-taking in multi-agent settings.
Chinese Translation
世界模型使代理能够通过想象未来状态进行规划,但现有方法通常仅从单一视角(通常是自我中心视角)进行操作,即使其他视角能够更容易地进行规划;例如,导航受益于鸟瞰图。我们提出了跨视角世界模型(Cross-View World Models, XVWM),其训练目标为跨视角预测:给定来自一个视角的一系列帧,在采取行动后预测来自同一或不同视角的未来状态。强制跨视角一致性作为几何正则化:由于输入和输出视角可能几乎没有视觉重叠,为了跨视角进行预测,模型必须学习环境三维结构的视角不变表示。我们在Aimlabs的同步多视角游戏数据上进行训练,该平台提供精确对齐的多摄像头录制和高频动作标签。所得到的模型为代理提供了跨视角的平行想象流,使其能够在最适合任务的参考框架中进行规划,同时从自我中心视角执行。我们的结果表明,多视角一致性为空间基础表示提供了强有力的学习信号。最后,从另一个视角预测自身行为的后果可能为多代理环境中的视角采纳提供基础。
cs.CV / 46 / 2602.07301
Diabetic Retinopathy Lesion Segmentation through Attention Mechanisms
通过注意机制进行糖尿病视网膜病变病灶分割
Abstract
Diabetic Retinopathy (DR) is an eye disease that arises from diabetes mellitus and can cause vision loss and blindness. To prevent irreversible vision loss, early detection through systematic screening is crucial. Although researchers have developed numerous automated deep learning-based algorithms for DR screening, their clinical applicability remains limited, particularly in lesion segmentation. Our method provides pixel-level annotations for lesions, which practically supports ophthalmologists in screening DR from fundus images. In this work, we segmented four types of DR-related lesions: microaneurysms, soft exudates, hard exudates, and hemorrhages, on 757 images from the DDR dataset. To enhance lesion segmentation, an attention mechanism was integrated with DeepLab-V3+. Compared to the baseline model, the Attention-DeepLab model increases mean average precision (mAP) from 0.3010 to 0.3326 and mean Intersection over Union (IoU) from 0.1791 to 0.1928. The model also improves microaneurysm detection from 0.0205 to 0.0763, a clinically significant gain, as microaneurysms are the earliest visible sign of DR.
Chinese Translation
糖尿病视网膜病变(Diabetic Retinopathy, DR)是一种由糖尿病引起的眼病,可能导致视力丧失和失明。为了防止不可逆的视力损失,早期通过系统筛查进行检测至关重要。尽管研究人员开发了众多基于深度学习的自动化算法用于DR筛查,但其临床适用性仍然有限,特别是在病灶分割方面。我们的方法为病灶提供了像素级的标注,切实支持眼科医生从眼底图像中筛查DR。在本研究中,我们对来自DDR数据集的757张图像进行了四种类型的DR相关病灶的分割:微动脉瘤、软渗出物、硬渗出物和出血。为了增强病灶分割效果,我们将注意机制与DeepLab-V3+相结合。与基线模型相比,Attention-DeepLab模型的平均精度均值(mean average precision, mAP)从0.3010提高到0.3326,平均交并比(mean Intersection over Union, IoU)从0.1791提高到0.1928。该模型还将微动脉瘤的检测得分从0.0205提高到0.0763,这是一项具有临床意义的改进,因为微动脉瘤是DR最早可见的体征。
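The mean IoU reported above is computed per lesion class from pixel-level label maps and then averaged. A minimal sketch of that metric follows, with tiny toy masks standing in for actual segmentation outputs; skipping classes absent from both maps is one common convention, not necessarily the paper's exact protocol.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes.

    pred, target: integer label maps of identical shape.
    Classes absent from both maps are skipped rather than penalized.
    """
    ious = []
    for c in range(num_classes):
        p, t = (pred == c), (target == c)
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class not present in either map
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

pred   = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
# class 0: inter 1 / union 2 = 0.5; class 1: inter 2 / union 3 = 2/3
print(round(mean_iou(pred, target, 2), 4))  # 0.5833
```

For small lesions such as microaneurysms, the intersection term is tiny relative to the union, which is why even a move from 0.0205 to 0.0763 represents a meaningful gain.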
cs.CV / 47 / 2602.07310
Optimization of Precipitate Segmentation Through Linear Genetic Programming of Image Processing
通过线性遗传编程优化沉淀物分割的图像处理
Abstract
Current analysis of additive manufactured niobium-based copper alloys relies on hand annotation due to varying contrast, noise, and image artifacts present in micrographs, slowing iteration speed in alloy development. We present a filtering and segmentation algorithm for detecting precipitates in FIB cross-section micrographs, optimized using linear genetic programming (LGP), which accounts for the various artifacts. To this end, the optimization environment uses a domain-specific language for image processing to iterate on solutions. Programs in this language are a list of image-filtering blocks with tunable parameters that sequentially process an input image, allowing for reliable generation and mutation by a genetic algorithm. Our environment produces optimized human-interpretable MATLAB code representing an image filtering pipeline. Under ideal conditions--a population size of 60 and a maximum program length of 5 blocks--our system was able to find a near-human accuracy solution with an average evaluation error of 1.8% when comparing segmentations pixel-by-pixel to a human baseline using an XOR error evaluation. Our automation work enabled faster iteration cycles and furthered exploration of the material composition and processing space: our optimized pipeline algorithm processes a 3.6 megapixel image in about 2 seconds on average. This ultimately enables convergence on strong, low-activation, precipitation hardened copper alloys for additive manufactured fusion reactor parts.
Chinese Translation
目前对增材制造的铌基铜合金的分析依赖于手动标注,原因在于微观图像中存在的对比度变化、噪声和图像伪影,这减缓了合金开发的迭代速度。我们提出了一种用于检测FIB横截面微观图像中沉淀物的过滤和分割算法,该算法通过线性遗传编程(LGP)进行优化,以考虑各种伪影。为此,优化环境使用一种特定领域的图像处理语言来迭代解决方案。该语言中的程序是一个图像过滤模块的列表,具有可调参数,能够顺序处理输入图像,从而允许遗传算法进行可靠的生成和突变。在理想条件下——种群规模为60,最大程序长度为5个模块——我们的系统能够找到接近人类准确度的解决方案,在与人类基准进行逐像素比较时,平均评估误差为1.8%,采用XOR误差评估。我们的自动化工作实现了更快的迭代周期,并进一步探索了材料成分和处理空间:我们的优化管道算法平均在约2秒内处理一幅3.6百万像素的图像。这最终使得对强、低激活的沉淀硬化铜合金的收敛成为可能,以用于增材制造的聚变反应堆部件。
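Two pieces of the setup above are easy to sketch: a "program" represented as an ordered list of parameterized filter blocks (the structure a genetic algorithm can generate and mutate), and the XOR error that scores a candidate segmentation pixel-by-pixel against the human baseline. The specific blocks and parameters below are placeholders, not the paper's domain-specific language, which produces MATLAB rather than Python.

```python
import numpy as np

def xor_error(candidate, baseline):
    """Fraction of pixels where two binary segmentations disagree."""
    return float(np.mean(candidate != baseline))

def run_program(image, program):
    """A 'program' is an ordered list of (filter_fn, params) blocks,
    applied sequentially -- a representation a genetic algorithm can
    mutate block-by-block or parameter-by-parameter."""
    for block, params in program:
        image = block(image, **params)
    return image

# Placeholder blocks standing in for the paper's image-processing DSL.
def blur(img, k):
    out = img.copy()
    for _ in range(k):  # crude repeated neighbor averaging
        out = (out + np.roll(out, 1, 0) + np.roll(out, -1, 0)
                   + np.roll(out, 1, 1) + np.roll(out, -1, 1)) / 5.0
    return out

def threshold(img, t):
    return (img > t).astype(np.uint8)

rng = np.random.default_rng(1)
image = rng.random((32, 32))
program = [(blur, {"k": 2}), (threshold, {"t": 0.5})]
seg = run_program(image, program)
print(xor_error(seg, seg))  # 0.0 -- identical masks never disagree
```

In the optimization loop, the XOR error against the human annotation would serve as the fitness function, and mutation would swap blocks or perturb their tunable parameters.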
cs.CV / 48 / 2602.07311
LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery
LUCID-SAE:学习统一的视觉-语言稀疏编码以实现可解释的概念发现
Abstract
Sparse autoencoders (SAEs) offer a natural path toward comparable explanations across different representation spaces. However, current SAEs are trained per modality, producing dictionaries whose features are not directly understandable and whose explanations do not transfer across domains. In this study, we introduce LUCID (Learning Unified vision-language sparse Codes for Interpretable concept Discovery), a unified vision-language sparse autoencoder that learns a shared latent dictionary for image patch and text token representations, while reserving private capacity for modality-specific details. We achieve feature alignment by coupling the shared codes with a learned optimal transport matching objective without the need of labeling. LUCID yields interpretable shared features that support patch-level grounding, establish cross-modal neuron correspondence, and enhance robustness against the concept clustering problem in similarity-based evaluation. Leveraging the alignment properties, we develop an automated dictionary interpretation pipeline based on term clustering without manual observations. Our analysis reveals that LUCID's shared features capture diverse semantic categories beyond objects, including actions, attributes, and abstract concepts, demonstrating a comprehensive approach to interpretable multimodal representations.
Chinese Translation
稀疏自编码器(SAEs)为在不同表示空间之间提供可比的解释提供了一条自然路径。然而,目前的SAEs是按模态训练的,生成的字典特征并不直接可理解,且其解释无法跨领域转移。在本研究中,我们引入了LUCID(学习统一的视觉-语言稀疏编码以实现可解释的概念发现),这是一种统一的视觉-语言稀疏自编码器,它为图像块和文本标记表示学习一个共享的潜在字典,同时为特定模态的细节保留私有容量。我们通过将共享编码与学习的最优传输匹配目标相结合来实现特征对齐,而无需标注。LUCID生成的可解释共享特征支持块级定位,建立跨模态神经元对应关系,并增强对基于相似性评估中的概念聚类问题的鲁棒性。利用对齐特性,我们开发了一种基于术语聚类的自动字典解释管道,无需人工观察。我们的分析表明,LUCID的共享特征捕捉了超越对象的多样语义类别,包括动作、属性和抽象概念,展示了一种全面的可解释多模态表示方法。
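The shared/private dictionary structure described above can be illustrated with a minimal top-k sparse autoencoder forward pass. Splitting the latent vector into shared and private slices, and using top-k sparsification, are my simplifying assumptions about the general SAE recipe; LUCID's actual architecture, losses, and optimal-transport alignment are not reproduced here.

```python
import numpy as np

def sae_forward(x, W_enc, W_dec, n_shared, k):
    """Top-k sparse autoencoder with shared + private code slices.

    x:      (D,) input representation (image patch or text token).
    W_enc:  (L, D) encoder; W_dec: (D, L) decoder dictionary.
    The first n_shared latent units act as the cross-modal shared
    dictionary; the remaining units are modality-private capacity.
    """
    z = W_enc @ x
    # Keep only the k largest-magnitude activations (top-k sparsity).
    keep = np.argsort(-np.abs(z))[:k]
    z_sparse = np.zeros_like(z)
    z_sparse[keep] = z[keep]
    shared, private = z_sparse[:n_shared], z_sparse[n_shared:]
    recon = W_dec @ z_sparse
    return shared, private, recon

rng = np.random.default_rng(0)
D, L = 16, 32
x = rng.normal(size=D)
W_enc, W_dec = rng.normal(size=(L, D)), rng.normal(size=(D, L))
shared, private, recon = sae_forward(x, W_enc, W_dec, n_shared=24, k=4)
print(np.count_nonzero(shared) + np.count_nonzero(private))  # at most 4
```

Interpretability then comes from inspecting which inputs activate each shared dictionary unit across both modalities, which is what enables the cross-modal neuron correspondence the abstract describes.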
cs.CV / 49 / 2602.07343
Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation
通过语言观察道路:一种基于语言引导的RGB-T驾驶场景分割框架
Abstract
Robust semantic segmentation of road scenes under adverse illumination, lighting, and shadow conditions remains a core challenge for autonomous driving applications. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. Hence, we propose CLARITY, which dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality's contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms: one preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state-of-the-art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.
Chinese Translation
在不利的照明、光照和阴影条件下,路面场景的稳健语义分割仍然是自动驾驶应用中的核心挑战。RGB-热融合是一种标准方法,但现有方法在所有条件下均匀应用静态融合策略,导致特定模态的噪声在网络中传播。因此,我们提出了CLARITY,它根据检测到的场景条件动态调整其融合策略。在视觉-语言模型(VLM)先验的指导下,网络学习根据照明状态调节每种模态的贡献,同时利用物体嵌入进行分割,而不是应用固定的融合策略。我们进一步引入了两种机制:其一保留被先前噪声抑制方法错误丢弃的有效暗物体语义;其二是一种层次解码器,在不同尺度之间强制结构一致性,以锐化细小物体的边界。在MFNet数据集上的实验表明,CLARITY建立了新的最先进水平(SOTA),实现了62.3%的mIoU和77.5%的mAcc。
cs.CV / 50 / 2602.07345
Optimizing Few-Step Generation with Adaptive Matching Distillation
通过自适应匹配蒸馏优化少步生成
Abstract
Distribution Matching Distillation (DMD) is a powerful acceleration paradigm, yet its stability is often compromised in Forbidden Zones: regions where the real teacher provides unreliable guidance while the fake teacher exerts insufficient repulsive force. In this work, we propose a unified optimization framework that reinterprets prior art as implicit strategies to avoid these corrupted regions. Based on this insight, we introduce Adaptive Matching Distillation (AMD), a self-correcting mechanism that utilizes reward proxies to explicitly detect and escape Forbidden Zones. AMD dynamically prioritizes corrective gradients via structural signal decomposition and introduces Repulsive Landscape Sharpening to enforce steep energy barriers against failure-mode collapse. Extensive experiments across image and video generation tasks (e.g., SDXL, Wan2.1) and rigorous benchmarks (e.g., VBench, GenEval) demonstrate that AMD significantly enhances sample fidelity and training robustness. For instance, AMD improves the HPSv2 score on SDXL from 30.64 to 31.25, outperforming state-of-the-art baselines. These findings validate that explicitly rectifying optimization trajectories within Forbidden Zones is essential for pushing the performance ceiling of few-step generative models.
Chinese Translation
分布匹配蒸馏(Distribution Matching Distillation, DMD)是一种强大的加速范式,但其稳定性在禁区(Forbidden Zone)中常常受到影响:在这些区域中,真实教师提供的指导不可靠,而虚假教师施加的排斥力不足。在本研究中,我们提出了一个统一的优化框架,将先前的研究重新解释为避免这些受损区域的隐式策略。基于这一见解,我们引入了自适应匹配蒸馏(Adaptive Matching Distillation, AMD),这是一种自我校正机制,利用奖励代理显式检测并逃离禁区。AMD通过结构信号分解动态优先考虑纠正梯度,并引入排斥景观锐化(Repulsive Landscape Sharpening),以施加陡峭的能量屏障防止失败模式崩溃。在图像和视频生成任务(如SDXL、Wan2.1)以及严格基准测试(如VBench、GenEval)中进行的大量实验表明,AMD显著提高了样本保真度和训练鲁棒性。例如,AMD将SDXL上的HPSv2得分从30.64提高到31.25,超越了最先进的基线。这些发现验证了在禁区内显式校正优化轨迹对于推动少步生成模型性能上限的重要性。
cs.CV / 51 / 2602.07428
Row-Column Separated Attention Based Low-Light Image/Video Enhancement
基于行列分离注意力的低光照图像/视频增强
Abstract
The U-Net structure is widely used for low-light image/video enhancement. Without proper guidance from global information, however, the enhanced images exhibit regions with heavy local noise and lost detail. Attention mechanisms can better focus on and exploit global information, but applying attention over full images significantly increases the number of parameters and computations. We propose a Row-Column Separated Attention (RCSA) module inserted after an improved U-Net. The RCSA module takes as input the row-wise and column-wise mean and maximum of the feature map, using global information to guide local information with few extra parameters. We further propose two temporal loss functions that extend the method to low-light video enhancement while maintaining temporal consistency. Extensive experiments on the LOL, MIT Adobe FiveK image, and SDSD video datasets demonstrate the effectiveness of our approach. The code is publicly available at https://github.com/cq-dong/URCSA.
Chinese Translation
U-Net 结构广泛应用于低光照图像/视频增强。然而,在缺乏全局信息的适当引导时,增强后的图像会出现局部噪声较大、细节损失严重的区域。注意力机制能够更好地聚焦和利用全局信息,但对整幅图像施加注意力会显著增加参数数量和计算量。我们提出了一种插入在改进的 U-Net 之后的行列分离注意力模块(Row-Column Separated Attention, RCSA)。RCSA 模块的输入是特征图按行和按列的均值与最大值,它以较少的参数利用全局信息引导局部信息。我们提出了两种时间损失函数,将该方法应用于低光照视频增强,并保持时间一致性。在 LOL、MIT Adobe FiveK 图像和 SDSD 视频数据集上的大量实验证明了我们方法的有效性。代码已公开发布在 https://github.com/cq-dong/URCSA。
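A minimal sketch of the RCSA input described above: collapsing an H x W feature map into its row-wise and column-wise mean and maximum, so downstream attention operates over H + W vectors rather than H x W positions (the helper name is ours; the paper's module wraps these statistics in learned attention):

```python
def row_col_stats(fmap):
    """Collapse an H x W feature map into its row/column means and maxima,
    the compact global descriptor an RCSA-style module attends over."""
    row_mean = [sum(r) / len(r) for r in fmap]
    row_max = [max(r) for r in fmap]
    cols = list(zip(*fmap))                 # transpose to iterate columns
    col_mean = [sum(c) / len(c) for c in cols]
    col_max = [max(c) for c in cols]
    return row_mean, row_max, col_mean, col_max

fmap = [[0.0, 2.0, 4.0],
        [1.0, 3.0, 5.0]]                    # a 2 x 3 toy feature map
rm, rx, cm, cx = row_col_stats(fmap)
print(rm, rx, cm, cx)
```

The parameter saving comes from the reduced sequence length: attending over 2 + 3 descriptors here instead of 6 spatial positions, a gap that grows quadratically with resolution.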
cs.CV / 52 / 2602.07444
Perspective-aware fusion of incomplete depth maps and surface normals for accurate 3D reconstruction
基于视角感知的不完整深度图和表面法线融合以实现精确的3D重建
Abstract
We address the problem of reconstructing 3D surfaces from depth and surface normal maps acquired by a sensor system based on a single perspective camera. Depth and normal maps can be obtained through techniques such as structured-light scanning and photometric stereo, respectively. We propose a perspective-aware log-depth fusion approach that extends existing orthographic gradient-based depth-normals fusion methods by explicitly accounting for perspective projection, leading to metrically accurate 3D reconstructions. Additionally, the method handles missing depth measurements by leveraging available surface normal information to inpaint gaps. Experiments on the DiLiGenT-MV data set demonstrate the effectiveness of our approach and highlight the importance of perspective-aware depth-normals fusion.
Chinese Translation
我们解决了从基于单一透视相机的传感器系统获取的深度图和表面法线图重建3D表面的问题。深度图和法线图可以通过结构光扫描和光度立体等技术获得。我们提出了一种视角感知的对数深度融合方法,该方法通过明确考虑透视投影,扩展了现有的正交梯度深度-法线融合方法,从而实现了度量上准确的3D重建。此外,该方法通过利用可用的表面法线信息来填补缺失的深度测量值,从而处理缺失的深度数据。在DiLiGenT-MV数据集上的实验表明了我们方法的有效性,并强调了视角感知的深度-法线融合的重要性。
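A 1D, orthographic toy of gradient-based depth-normals fusion, to make the inpainting idea concrete: integrate the slope implied by the surface normals, then anchor the curve to the valid depth samples in a least-squares sense. The paper's actual method works in 2D with a perspective-aware log-depth formulation; everything below is a simplification under those stated assumptions:

```python
def fuse_depth_with_gradients(depth, grad):
    """1D toy of gradient-based depth-normals fusion: integrate the
    normal-derived slope, then shift it so it agrees (least squares) with
    the valid depth samples; None marks missing measurements."""
    # Integrate slopes to get a depth profile up to an unknown offset.
    integ = [0.0]
    for g in grad:
        integ.append(integ[-1] + g)
    # Least-squares constant offset = mean residual over valid samples.
    residuals = [d - i for d, i in zip(depth, integ) if d is not None]
    offset = sum(residuals) / len(residuals)
    return [i + offset for i in integ]

depth = [10.0, None, None, 13.0]      # two depth measurements missing
grad = [1.0, 1.0, 1.0]                # slope derived from surface normals
fused = fuse_depth_with_gradients(depth, grad)
print(fused)
```

The two missing samples are inpainted from the normal information alone, which is exactly the role the normals play when structured-light depth has gaps.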
cs.CV / 53 / 2602.07446
PTB-XL-Image-17K: A Large-Scale Synthetic ECG Image Dataset with Comprehensive Ground Truth for Deep Learning-Based Digitization
PTB-XL-Image-17K:一个具有全面真实标签的大规模合成心电图图像数据集,用于基于深度学习的数字化
Abstract
Electrocardiogram (ECG) digitization, converting paper-based or scanned ECG images back into time-series signals, is critical for leveraging decades of legacy clinical data in modern deep learning applications. However, progress has been hindered by the lack of large-scale datasets providing both ECG images and their corresponding ground truth signals with comprehensive annotations. We introduce PTB-XL-Image-17K, a complete synthetic ECG image dataset comprising 17,271 high-quality 12-lead ECG images generated from the PTB-XL signal database. Our dataset uniquely provides five complementary data types per sample: (1) realistic ECG images with authentic grid patterns and annotations (50% with visible grid, 50% without), (2) pixel-level segmentation masks, (3) ground truth time-series signals, (4) bounding box annotations in YOLO format for both lead regions and lead name labels, and (5) comprehensive metadata including visual parameters and patient information. We present an open-source Python framework enabling customizable dataset generation with controllable parameters including paper speed (25/50 mm/s), voltage scale (5/10 mm/mV), sampling rate (500 Hz), grid appearance (4 colors), and waveform characteristics. The dataset achieves a 100% generation success rate with an average processing time of 1.35 seconds per sample. PTB-XL-Image-17K addresses critical gaps in ECG digitization research by providing the first large-scale resource supporting the complete pipeline: lead detection, waveform segmentation, and signal extraction with full ground truth for rigorous evaluation. The dataset, generation framework, and documentation are publicly available at https://github.com/naqchoalimehdi/PTB-XL-Image-17K and https://doi.org/10.5281/zenodo.18197519.
Chinese Translation
心电图(ECG)数字化——将纸质或扫描的心电图图像转换回时间序列信号——对于在现代深度学习应用中利用数十年的遗留临床数据至关重要。然而,由于缺乏提供心电图图像及其对应真实信号的全面注释的大规模数据集,进展受到阻碍。我们介绍了PTB-XL-Image-17K,这是一个完整的合成心电图图像数据集,包含来自PTB-XL信号数据库生成的17,271幅高质量12导联心电图图像。我们的数据集独特地为每个样本提供五种互补的数据类型:(1)具有真实网格模式和注释的逼真心电图图像(50%带可见网格,50%不带),(2)像素级分割掩膜,(3)真实时间序列信号,(4)用于导联区域和导联名称标签的YOLO格式边界框注释,以及(5)包括视觉参数和患者信息的全面元数据。我们提供了一个开源Python框架,支持可定制的数据集生成,具有可控参数,包括纸张速度(25/50 mm/s)、电压比例(5/10 mm/mV)、采样率(500 Hz)、网格外观(4种颜色)和波形特征。该数据集实现了100%的生成成功率,平均每个样本的处理时间为1.35秒。PTB-XL-Image-17K通过提供第一个支持完整流程的规模庞大的资源,填补了心电图数字化研究中的关键空白:导联检测、波形分割和信号提取,具有完整的真实标签以便进行严格评估。数据集、生成框架和文档可在 https://github.com/naqchoalimehdi/PTB-XL-Image-17K 和 https://doi.org/10.5281/zenodo.18197519 上公开获取。
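The controllable rendering parameters above imply simple pixel-to-signal scale factors: at 25 mm/s one millimetre of paper spans 0.04 s, and at 10 mm/mV it spans 0.1 mV. A hedged sketch of the conversion (`px_per_mm` is an assumed rendering resolution, not a value stated in the abstract):

```python
def ecg_scale(paper_speed_mm_s=25.0, voltage_scale_mm_mv=10.0,
              px_per_mm=10.0):
    """Pixel-to-signal scale factors for a rendered ECG trace.
    paper_speed and voltage_scale mirror the dataset's controllable
    parameters (25/50 mm/s, 5/10 mm/mV); px_per_mm is an assumption
    of this sketch, not a value from the paper."""
    seconds_per_px = 1.0 / (paper_speed_mm_s * px_per_mm)
    mv_per_px = 1.0 / (voltage_scale_mm_mv * px_per_mm)
    return seconds_per_px, mv_per_px

s, v = ecg_scale(25.0, 10.0, 10.0)
print(s, v)
```

At the dataset's 500 Hz sampling rate, 0.004 s per pixel column corresponds to two signal samples per column, the kind of bookkeeping a digitization pipeline needs when mapping extracted waveforms back to time series.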
cs.CV / 54 / 2602.07449
SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads
SoulX-FlashHead:基于Oracle指导的无限实时流媒体人脸生成
Abstract
Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose Oracle-Guided Bidirectional Distillation, leveraging ground-truth motion priors to provide precise physical guidance. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training. Extensive experiments demonstrate that SoulX-FlashHead achieves state-of-the-art performance on HDTF and VFHQ benchmarks. Notably, our Lite variant achieves an inference speed of 96 FPS on a single NVIDIA RTX 4090, facilitating ultra-fast interaction without sacrificing visual coherence.
Chinese Translation
在音频驱动的人像生成中,实现高保真视觉质量与低延迟流媒体之间的平衡仍然是一项艰巨的挑战。现有的大规模模型通常面临高昂的计算成本,而轻量级替代方案则往往在整体面部表现和时间稳定性上有所妥协。本文提出了SoulX-FlashHead,一个统一的13亿参数框架,旨在实现实时、无限长度和高保真的流媒体视频生成。为了解决流媒体场景中音频特征的不稳定性,我们引入了流媒体感知时空预训练,并配备了时间音频上下文缓存机制,以确保从短音频片段中进行稳健的特征提取。此外,为了减轻长序列自回归生成中固有的误差积累和身份漂移,我们提出了基于Oracle指导的双向蒸馏,利用真实运动先验提供精确的物理指导。我们还提出了VividHead,一个包含782小时严格对齐视频的高质量大规模数据集,以支持稳健的训练。大量实验表明,SoulX-FlashHead在HDTF和VFHQ基准测试中达到了最先进的性能。值得注意的是,我们的Lite变体在单个NVIDIA RTX 4090上实现了96 FPS的推理速度,促进了超快速交互而不牺牲视觉一致性。
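The Temporal Audio Context Cache is described only at a high level; one plausible reading is a bounded history buffer that prepends recent feature frames to each short fragment before encoding, so streaming fragments are never seen without context. The sketch below is our assumption of that behavior, not the paper's design:

```python
from collections import deque

class TemporalAudioContextCache:
    """Minimal sketch of a streaming feature cache (an assumed reading of
    a 'Temporal Audio Context Cache', not the paper's implementation):
    keep the last `context` frames so each short audio fragment is
    encoded together with its recent history."""
    def __init__(self, context=4):
        self.buf = deque(maxlen=context)

    def extend(self, fragment_frames):
        ctx = list(self.buf)            # history prepended to the fragment
        self.buf.extend(fragment_frames)
        return ctx + list(fragment_frames)

cache = TemporalAudioContextCache(context=2)
first = cache.extend(["a1", "a2"])      # no history yet
second = cache.extend(["a3"])           # sees a1, a2 as context
print(first, second)
```

The `deque(maxlen=...)` bound is what keeps per-fragment latency constant over an infinite-length stream.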
cs.CV / 55 / 2602.07458
SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning
SpatialReward:通过显式空间推理弥合在线强化学习在图像编辑中的感知差距
Abstract
Online Reinforcement Learning (RL) offers a promising avenue for complex image editing but is currently constrained by the scarcity of reliable and fine-grained reward signals. Existing evaluators frequently struggle with a critical perception gap we term "Attention Collapse," where models neglect cross-image comparisons and fail to capture fine-grained details, resulting in inaccurate perception and miscalibrated scores. To address these limitations, we propose SpatialReward, a reward model that enforces precise verification via explicit spatial reasoning. By anchoring reasoning to predicted edit regions, SpatialReward grounds semantic judgments in pixel-level evidence, significantly enhancing evaluative accuracy. Trained on a curated 260k spatial-aware dataset, our model achieves state-of-the-art performance on MMRB2 and EditReward-Bench, and outperforms proprietary evaluators on our proposed MultiEditReward-Bench. Furthermore, SpatialReward serves as a robust signal in online RL, boosting OmniGen2 by +0.90 on GEdit-Bench, surpassing the leading discriminative model and doubling the gain of GPT-4.1 (+0.45). These results demonstrate that spatial reasoning is essential for unlocking effective alignment in image editing.
Chinese Translation
在线强化学习(RL)为复杂的图像编辑提供了一个有前景的途径,但目前受到可靠且细粒度奖励信号匮乏的限制。现有的评估器常常面临一个关键的感知差距,我们称之为“注意力崩溃”(Attention Collapse),在这种情况下,模型忽视跨图像比较,未能捕捉细粒度细节,导致感知不准确和评分失调。为了解决这些局限性,我们提出了SpatialReward,这是一种通过显式空间推理强制精确验证的奖励模型。通过将推理锚定到预测的编辑区域,SpatialReward将语义判断基于像素级证据,从而显著提高评估准确性。在一个精心策划的26万空间感知数据集上训练后,我们的模型在MMRB2和EditReward-Bench上达到了最先进的性能,并在我们提出的MultiEditReward-Bench上超越了专有评估器。此外,SpatialReward作为在线RL中的一个强大信号,提升了OmniGen2在GEdit-Bench上的表现,增幅为+0.90,超越了领先的判别模型,并使GPT-4.1的增益翻倍(+0.45)。这些结果表明,空间推理对于解锁图像编辑中的有效对齐至关重要。
cs.CV / 56 / 2602.07463
GlobalWasteData: A Large-Scale, Integrated Dataset for Robust Waste Classification and Environmental Monitoring
全球废物数据:用于稳健废物分类和环境监测的大规模综合数据集
Abstract
The growing amount of waste is an environmental problem that requires efficient sorting techniques for the various kinds of waste; automated waste classification systems are used for this purpose. The effectiveness of these Artificial Intelligence (AI) models depends on the quality and accessibility of publicly available datasets, which provide the basis for training and analyzing classification algorithms. Although several public waste classification datasets exist, they remain fragmented, inconsistent, and biased toward specific environments. Differences in class names, annotation formats, image conditions, and class distributions make it difficult to combine these datasets or to train models that generalize well to real-world scenarios. To address these issues, we introduce the GlobalWasteData (GWD) archive, a large-scale dataset of 89,807 images across 14 main categories, annotated with 68 distinct subclasses. We compile this novel integrated GWD archive by merging multiple publicly available datasets into a single, unified resource. The GWD archive offers consistent labeling, improved domain diversity, and more balanced class representation, enabling the development of robust and generalizable waste recognition models. Additional preprocessing steps such as quality filtering, duplicate removal, and metadata generation further improve dataset reliability. Overall, this dataset offers a strong foundation for Machine Learning (ML) applications in environmental monitoring, recycling automation, and waste identification, and is publicly available to promote future research and reproducibility.
Chinese Translation
日益增长的废物量对环境构成了问题,需要对各种类型的废物进行有效的分类技术。为此,采用了自动化废物分类系统。这些人工智能(AI)模型的有效性依赖于公开可用数据集的质量和可获取性,这些数据集为训练和分析分类算法提供了基础。尽管存在多个公共废物分类数据集,但它们仍然是碎片化的、不一致的,并且对特定环境存在偏见。类别名称、注释格式、图像条件和类别分布的差异使得这些数据集的结合或训练能够很好地泛化到现实世界场景的模型变得困难。为了解决这些问题,我们引入了全球废物数据(GlobalWasteData,GWD)档案,这是一个包含89,807张图像的大规模数据集,涵盖14个主要类别,并标注了68个不同的子类。我们通过将多个公开可用的数据集合并为一个统一的资源,编制了这个新颖的综合GWD档案。该GWD档案提供了一致的标签、改善的领域多样性和更平衡的类别表示,从而促进了稳健且可泛化的废物识别模型的开发。额外的预处理步骤,如质量过滤、重复移除和元数据生成,进一步提高了数据集的可靠性。总体而言,该数据集为环境监测、回收自动化和废物识别中的机器学习(ML)应用提供了坚实的基础,并公开可用以促进未来的研究和可重复性。
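Of the preprocessing steps listed, exact-duplicate removal is easy to make concrete via content hashing. This is a sketch of that one step only; quality filtering and perceptual near-duplicate detection would need more than byte-identity:

```python
import hashlib

def dedup(images):
    """Exact-duplicate removal by content hashing, one of the dataset's
    preprocessing steps (sketched here for byte-identical files only).
    `images` is a list of (name, raw_bytes) pairs."""
    seen, kept = set(), []
    for name, payload in images:
        h = hashlib.sha256(payload).hexdigest()
        if h not in seen:           # keep the first copy of each payload
            seen.add(h)
            kept.append(name)
    return kept

imgs = [("a.jpg", b"\x00\x01"), ("b.jpg", b"\x00\x01"), ("c.jpg", b"\x02")]
kept = dedup(imgs)
print(kept)
```

When merging many public datasets, this kind of pass prevents the same photograph, shared across source collections, from leaking between train and test splits.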
cs.CV / 57 / 2602.07493
Thermal Odometry and Dense Mapping Using Learned Odometry and Gaussian Splatting
基于学习的热测程和高斯点云的密集映射
Abstract
Thermal infrared sensors, operating at wavelengths longer than the size of typical smoke particles, can capture imagery regardless of darkness, dust, and smoke. This robustness has made them increasingly valuable for motion estimation and environmental perception in robotics, particularly in adverse conditions. Existing thermal odometry and mapping approaches, however, are predominantly geometric, often fail across diverse datasets, and lack the ability to produce dense maps. Motivated by the efficiency and high-quality reconstruction ability of recent Gaussian Splatting (GS) techniques, we propose TOM-GS, a thermal odometry and mapping method that integrates learning-based odometry with GS-based dense mapping. TOM-GS is among the first GS-based SLAM systems tailored for thermal cameras, featuring dedicated thermal image enhancement and monocular depth integration. Extensive experiments on motion estimation and novel-view rendering demonstrate that TOM-GS outperforms existing learning-based methods, confirming the benefits of learning-based pipelines for robust thermal odometry and dense reconstruction.
Chinese Translation
热红外传感器的波长长于烟雾颗粒,可以在黑暗、灰尘和烟雾的条件下捕捉图像。这种鲁棒性使其在机器人运动估计和环境感知中变得越来越重要,尤其是在恶劣条件下。然而,现有的热测程和映射方法主要是几何基础的,往往在多样化的数据集上表现不佳,并且缺乏生成密集地图的能力。受到最近高斯点云(Gaussian Splatting, GS)技术的高效性和高质量重建能力的启发,我们提出了TOM-GS,一种将基于学习的测程与基于GS的密集映射相结合的热测程和映射方法。TOM-GS是首个针对热成像相机的基于GS的SLAM系统之一,具有专门的热图像增强和单目深度集成。大量关于运动估计和新视角渲染的实验表明,TOM-GS的表现优于现有的基于学习的方法,证实了基于学习的流程在鲁棒热测程和密集重建中的优势。
cs.CV / 58 / 2602.07495
Learning Brain Representation with Hierarchical Visual Embeddings
通过层次视觉嵌入学习大脑表征
Abstract
Decoding visual representations from brain signals has attracted significant attention in both neuroscience and artificial intelligence. However, the degree to which brain signals truly encode visual information remains unclear. Current visual decoding approaches explore various brain-image alignment strategies, yet most emphasize high-level semantic features while neglecting pixel-level details, thereby limiting our understanding of the human visual system. In this paper, we propose a brain-image alignment strategy that leverages multiple pre-trained visual encoders with distinct inductive biases to capture hierarchical and multi-scale visual representations, while employing a contrastive learning objective to achieve effective alignment between brain signals and visual embeddings. Furthermore, we introduce a Fusion Prior, which learns a stable mapping on large-scale visual data and subsequently matches brain features to this pre-trained prior, thereby enhancing distributional consistency across modalities. Extensive quantitative and qualitative experiments demonstrate that our method achieves a favorable balance between retrieval accuracy and reconstruction fidelity.
Chinese Translation
从大脑信号解码视觉表征在神经科学和人工智能领域引起了广泛关注。然而,大脑信号在多大程度上真正编码了视觉信息仍然不清楚。目前的视觉解码方法探索了各种大脑与图像的对齐策略,但大多数方法强调高层语义特征,而忽视了像素级细节,从而限制了我们对人类视觉系统的理解。本文提出了一种大脑与图像的对齐策略,利用多个具有不同归纳偏差的预训练视觉编码器来捕捉层次化和多尺度的视觉表征,同时采用对比学习目标实现大脑信号与视觉嵌入之间的有效对齐。此外,我们引入了一种融合先验(Fusion Prior),该先验在大规模视觉数据上学习稳定的映射,并随后将大脑特征匹配到这一预训练先验,从而增强跨模态的分布一致性。大量定量和定性实验表明,我们的方法在检索准确性和重建保真度之间达到了良好的平衡。
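The contrastive alignment objective mentioned above is typically an InfoNCE-style loss over paired embeddings. Below is a generic, batch-of-two sketch; the paper's exact formulation, temperature, and normalization are not specified in the abstract, so treat every value here as an assumption:

```python
import math

def info_nce_loss(brain, image, temperature=0.1):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of paired
    brain/image embeddings -- the generic objective such methods build on,
    not this paper's exact formulation. Embeddings are assumed normalized."""
    n = len(brain)
    sims = [[sum(b * v for b, v in zip(brain[i], image[j])) / temperature
             for j in range(n)] for i in range(n)]
    loss = 0.0
    for i in range(n):
        row = [math.exp(s) for s in sims[i]]             # brain -> image
        col = [math.exp(sims[j][i]) for j in range(n)]   # image -> brain
        loss += -math.log(row[i] / sum(row)) - math.log(col[i] / sum(col))
    return loss / (2 * n)

aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]],
                        [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]],
                         [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

Matched brain/image pairs drive the loss toward zero while mismatched pairs keep it high, which is the pressure that pulls the two modalities into a shared embedding space.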
cs.CV / 59 / 2602.07498
IM-Animation: An Implicit Motion Representation for Identity-decoupled Character Animation
IM-动画:一种用于身份解耦角色动画的隐式运动表示
Abstract
Recent progress in video diffusion models has markedly advanced character animation, which synthesizes motion videos by animating a static identity image according to a driving video. Explicit methods represent motion using skeletons, DWPose, or other explicit structured signals, but struggle to handle spatial mismatches and varying body scales. Implicit methods, on the other hand, capture high-level implicit motion semantics directly from the driving video, but suffer from identity leakage and entanglement between motion and appearance. To address these challenges, we propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens. This design relaxes the strict spatial constraints inherent in 2D representations and effectively prevents identity information from leaking out of the motion video. Furthermore, we design a temporally consistent mask-token-based retargeting module that enforces a temporal training bottleneck, mitigating interference from the source images' motion and improving retargeting consistency. Our methodology employs a three-stage training strategy to enhance training efficiency and ensure high fidelity. Extensive experiments demonstrate that our implicit motion representation and the proposed IM-Animation achieve superior or competitive performance compared with state-of-the-art methods.
Chinese Translation
近期视频扩散模型的进展显著推动了角色动画的发展,该技术通过根据驱动视频对静态身份图像进行动画处理来合成运动视频。显式方法使用骨架、DWPose或其他显式结构信号来表示运动,但在处理空间不匹配和不同身体比例方面存在困难。相反,隐式方法直接从驱动视频中捕捉高层次的隐式运动语义,但面临身份泄漏和运动与外观之间的纠缠问题。为了解决上述挑战,我们提出了一种新颖的隐式运动表示,将每帧运动压缩为紧凑的1D运动标记。该设计放宽了2D表示中固有的严格空间约束,有效防止了运动视频中的身份信息泄漏。此外,我们设计了一个基于时间一致性掩码标记的重定向模块,该模块强制施加时间训练瓶颈,减轻源图像运动的干扰,提高重定向一致性。我们的方法采用三阶段训练策略,以提高训练效率并确保高保真度。大量实验表明,我们的隐式运动表示及所提出的IM-动画的生成能力在与最先进的方法相比时,表现出优越或具有竞争力的性能。
cs.CV / 60 / 2602.07512
Adaptive Image Zoom-in with Bounding Box Transformation for UAV Object Detection
基于边界框变换的自适应图像放大用于无人机目标检测
Abstract
Detecting objects from UAV-captured images is challenging due to the small object size. In this work, a simple and efficient adaptive zoom-in framework is explored for object detection on UAV images. The main motivation is that the foreground objects are generally smaller and sparser than those in common scene images, which hinders the optimization of effective object detectors. We thus aim to zoom in adaptively on the objects to better capture object features for the detection task. To achieve this goal, two core designs are required: i) How to conduct non-uniform zooming on each image efficiently? ii) How to enable object detection training and inference in the zoomed image space? Correspondingly, a lightweight offset prediction scheme coupled with a novel box-based zooming objective is introduced to learn non-uniform zooming on the input image. Based on the learned zooming transformation, a corner-aligned bounding box transformation method is proposed. The method warps the ground-truth bounding boxes to the zoomed space to learn object detection, and warps the predicted bounding boxes back to the original space during inference. We conduct extensive experiments on three representative UAV object detection datasets, including VisDrone, UAVDT, and SeaDronesSee. The proposed ZoomDet is architecture-independent and can be applied to an arbitrary object detection architecture. Remarkably, on the SeaDronesSee dataset, ZoomDet delivers an absolute mAP gain of more than 8.4 with a Faster R-CNN model, at only about 3 ms additional latency. The code is available at https://github.com/twangnh/zoomdet_code.
Chinese Translation
从无人机捕获的图像中检测目标具有挑战性,因为目标尺寸较小。在本研究中,我们探索了一种简单高效的自适应放大框架,以用于无人机图像中的目标检测。主要动机在于,前景目标通常比普通场景图像中的目标更小且更稀疏,这妨碍了有效目标检测器的优化。因此,我们旨在自适应地放大目标,以更好地捕捉目标特征以完成检测任务。为实现这一目标,需要两个核心设计:i) 如何高效地对每幅图像进行非均匀放大?ii) 如何在放大后的图像空间中进行目标检测的训练和推理?相应地,我们提出了一种轻量级的偏移预测方案,并结合一种新颖的基于框的放大目标,以学习输入图像的非均匀放大。基于学习到的放大变换,提出了一种角对齐的边界框变换方法。该方法将真实的边界框扭曲到放大空间中以学习目标检测,并在推理过程中将预测的边界框扭曲回原始空间。我们在三个具有代表性的无人机目标检测数据集上进行了广泛实验,包括VisDrone、UAVDT和SeaDronesSee。所提出的ZoomDet与架构无关,可以应用于任意目标检测架构。值得注意的是,在SeaDronesSee数据集上,ZoomDet在Faster R-CNN模型上提供了超过8.4的mAP绝对增益,仅增加约3毫秒的延迟。代码可在https://github.com/twangnh/zoomdet_code获取。
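The corner-aligned bounding box transformation can be illustrated with a monotone piecewise-linear zoom along one axis: warp ground-truth box corners into the zoomed space for training, and warp predictions back with the inverse mapping at inference. The knots below are invented for the sketch; ZoomDet learns its zoom from a lightweight offset predictor:

```python
def make_warp(knots):
    """Monotone piecewise-linear 1D zoom mapping from (src, dst) knots;
    a toy stand-in for a learned non-uniform zoom, not ZoomDet itself."""
    def warp(x):
        for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
            if x0 <= x <= x1:
                t = (x - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        raise ValueError("coordinate out of range")
    return warp

# Magnify the central strip [0.4, 0.6] of a unit-width image.
fwd = make_warp([(0.0, 0.0), (0.4, 0.2), (0.6, 0.8), (1.0, 1.0)])
inv = make_warp([(0.0, 0.0), (0.2, 0.4), (0.8, 0.6), (1.0, 1.0)])

# Corner-aligned box transform: warp each corner coordinate independently.
box = (0.45, 0.55)                           # a small object's x-extent
zoomed = (fwd(box[0]), fwd(box[1]))          # train the detector here
restored = (inv(zoomed[0]), inv(zoomed[1]))  # map predictions back
print(zoomed, restored)
```

The small object occupies three times more of the zoomed coordinate range than of the original image, which is exactly the effect that helps the detector, while the inverse mapping keeps evaluation in the original space.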
cs.CV / 61 / 2602.07523
CA-YOLO: Cross Attention Empowered YOLO for Biomimetic Localization
CA-YOLO:面向仿生定位的跨注意力增强YOLO
Abstract
In modern complex environments, achieving accurate and efficient target localization is essential in numerous fields. However, existing systems often face limitations in both accuracy and the ability to recognize small targets. In this study, we propose a bionic stabilized localization system based on CA-YOLO, designed to enhance both target localization accuracy and small target recognition capabilities. Acting as the "brain" of the system, the target detection algorithm emulates the visual focusing mechanism of animals by integrating bionic modules into the YOLO backbone network. These modules include the introduction of a small target detection head and the development of a Characteristic Fusion Attention Mechanism (CFAM). Furthermore, drawing inspiration from the human Vestibulo-Ocular Reflex (VOR), a bionic pan-tilt tracking control strategy is developed, which incorporates central positioning, stability optimization, adaptive control coefficient adjustment, and an intelligent recapture function. The experimental results show that CA-YOLO outperforms the original model on standard datasets (COCO and VisDrone), with average accuracy metrics improved by 3.94% and 4.90%, respectively. Further time-sensitive target localization experiments validate the effectiveness and practicality of this bionic stabilized localization system.
Chinese Translation
在现代复杂环境中,实现准确高效的目标定位在众多领域中至关重要。然而,现有系统在准确性和小目标识别能力方面往往面临限制。在本研究中,我们提出了一种基于CA-YOLO的生物仿生稳定定位系统,旨在提高目标定位的准确性和小目标识别能力。作为系统的“大脑”,目标检测算法通过将生物仿生模块集成到YOLO主干网络中,模拟动物的视觉聚焦机制。这些模块包括引入小目标检测头和开发特征融合注意力机制(Characteristic Fusion Attention Mechanism, CFAM)。此外,受到人类前庭-眼动反射(Vestibulo-Ocular Reflex, VOR)的启发,开发了一种生物仿生云台跟踪控制策略,结合了中心定位、稳定性优化、自适应控制系数调整和智能重捕功能。实验结果表明,CA-YOLO在标准数据集(COCO和VisDrone)上的表现优于原始模型,平均准确率指标分别提高了3.94%和4.90%。进一步的时间敏感目标定位实验验证了该生物仿生稳定定位系统的有效性和实用性。
cs.CV / 62 / 2602.07532
Evaluating Object-Centric Models beyond Object Discovery
超越物体发现的对象中心模型评估
Abstract
Object-centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are often not evaluated against these goals. Instead, most prior work evaluates OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations in existing benchmarks: (1) they provide limited insight into the usefulness of OCL representations, and (2) localization and representation usefulness are assessed using disjoint metrics. To address (1), we use instruction-tuned VLMs as evaluators, enabling scalable benchmarking across diverse VQA datasets to measure how well VLMs leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), thereby eliminating inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.
Chinese Translation
对象中心学习(Object-Centric Learning, OCL)旨在学习结构化场景表示,以支持组合泛化和对分布外(Out-of-Distribution, OOD)数据的鲁棒性。然而,OCL模型往往未能在这些目标上进行评估。相反,大多数先前的研究仅通过物体发现和简单推理任务来评估OCL模型,例如通过图像分类探测表示。我们识别出现有基准的两个局限性:(1)它们对OCL模型表示的实用性提供的见解有限;(2)定位和表示的实用性使用不相交的指标进行评估。为了解决(1),我们使用经过指令调优的视觉语言模型(Vision-Language Models, VLMs)作为评估者,使得能够在多样化的视觉问答(Visual Question Answering, VQA)数据集上进行可扩展的基准测试,以衡量VLMs在复杂推理任务中如何利用OCL表示。为了解决(2),我们引入了一个统一的评估任务和指标,联合评估定位(where)和表示的实用性(what),从而消除由不相交评估引入的不一致性。最后,我们包括一个简单的多特征重建基线作为参考点。
cs.CV / 63 / 2602.07534
Fine-Grained Cat Breed Recognition with Global Context Vision Transformer
基于全局上下文视觉变换器的细粒度猫品种识别
Abstract
Accurate identification of cat breeds from images is a challenging task due to subtle differences in fur patterns, facial structure, and color. In this paper, we present a deep learning-based approach for classifying cat breeds using a subset of the Oxford-IIIT Pet Dataset, which contains high-resolution images of various domestic breeds. We employed the tiny variant of the Global Context Vision Transformer (GCViT) architecture for cat breed recognition. To improve model generalization, we used extensive data augmentation, including rotation, horizontal flipping, and brightness adjustment. Experimental results show that the GCViT-Tiny model achieved a test accuracy of 92.00% and a validation accuracy of 94.54%. These findings highlight the effectiveness of transformer-based architectures for fine-grained image classification tasks. Potential applications include veterinary diagnostics, animal shelter management, and mobile-based breed recognition systems. We also provide a Hugging Face demo at https://huggingface.co/spaces/bfarhad/cat-breed-classifier.
Chinese Translation
从图像中准确识别猫品种是一项具有挑战性的任务,因为毛发图案、面部结构和颜色之间存在微妙的差异。本文提出了一种基于深度学习的猫品种分类方法,使用了 Oxford-IIIT 宠物数据集(Oxford-IIIT Pet Dataset)中的一个子集,该数据集包含各种家猫品种的高分辨率图像。我们采用了全局上下文视觉变换器(Global Context Vision Transformer,GCViT)架构的 tiny 变体进行猫品种识别。为了提高模型的泛化能力,我们进行了广泛的数据增强,包括旋转、水平翻转和亮度调整。实验结果表明,GCViT-Tiny模型在测试集上的准确率达到92.00%,在验证集上的准确率达到94.54%。这些发现突显了基于变换器架构在细粒度图像分类任务中的有效性。潜在应用包括兽医诊断、动物收容所管理和基于移动设备的品种识别系统。我们还提供了一个Hugging Face演示,链接为https://huggingface.co/spaces/bfarhad/cat-breed-classifier。
cs.CV / 64 / 2602.07535
Beyond Core and Penumbra: Bi-Temporal Image-Driven Stroke Evolution Analysis
超越核心与半影:双时间点影像驱动的中风演变分析
Abstract
Computed tomography perfusion (CTP) at admission is routinely used to estimate the ischemic core and penumbra, while follow-up diffusion-weighted MRI (DWI) provides the definitive infarct outcome. However, single time-point segmentations fail to capture the biological heterogeneity and temporal evolution of stroke. We propose a bi-temporal analysis framework that characterizes ischemic tissue using statistical descriptors, radiomic texture features, and deep feature embeddings from two architectures (mJ-Net and nnU-Net). Bi-temporal refers to admission (T1) and post-treatment follow-up (T2). All features are extracted at T1 from CTP, with follow-up DWI aligned to ensure spatial correspondence. Manually delineated masks at T1 and T2 are intersected to construct six regions of interest (ROIs) encoding both initial tissue state and final outcome. Features were aggregated per region and analyzed in feature space. Evaluation on 18 patients with successful reperfusion demonstrated meaningful clustering of region-level representations. Regions classified as penumbra or healthy at T1 that ultimately recovered exhibited feature similarity to preserved brain tissue, whereas infarct-bound regions formed distinct groupings. Both baseline GLCM and deep embeddings showed a similar trend: penumbra regions exhibit features that are significantly different depending on final state, whereas this difference is not significant for core regions. Deep feature spaces, particularly mJ-Net, showed strong separation between salvageable and non-salvageable tissue, with a penumbra separation index that differed significantly from zero (Wilcoxon signed-rank test). These findings suggest that encoder-derived feature manifolds reflect underlying tissue phenotypes and state transitions, providing insight into imaging-based quantification of stroke evolution.
Chinese Translation
入院时的计算机断层灌注成像(CTP)常用于估计缺血核心和半影,而后续的扩散加权磁共振成像(DWI)则提供了明确的梗死结果。然而,单一时间点的分割未能捕捉中风的生物异质性和时间演变。我们提出了一种双时间点分析框架,通过统计描述符、放射组学纹理特征和来自两种架构(mJ-Net 和 nnU-Net)的深度特征嵌入来表征缺血组织。双时间点指的是入院时(T1)和治疗后随访(T2)。所有特征均在 T1 时从 CTP 提取,随访 DWI 被对齐以确保空间对应。在 T1 和 T2 时手动勾画的掩膜相交以构建六个感兴趣区域(ROIs),编码初始组织状态和最终结果。每个区域的特征被聚合并在特征空间中进行分析。对 18 名成功再灌注患者的评估显示区域级表示的有意义聚类。在 T1 时被分类为半影或健康的区域最终恢复时,其特征与保存的脑组织相似,而与梗死相关的区域则形成明显的分组。基线的灰度共生矩阵(GLCM)和深度嵌入显示出相似的趋势:半影区域的特征在最终状态下显著不同,而核心区域的这种差异则不显著。深度特征空间,特别是 mJ-Net,在可挽救和不可挽救组织之间显示出强烈的分离,半影分离指数显著不同于零(Wilcoxon 符号秩检验)。这些发现表明,编码器衍生的特征流形反映了潜在的组织表型和状态转变,为基于影像的中风演变量化提供了洞察。
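The six ROIs can be read schematically as the cross product of three T1 tissue states (core, penumbra, healthy) with two T2 outcomes (infarct, recovered). A set-based sketch of that intersection, with toy pixel indices (this is our reading of the construction, not the paper's code):

```python
def build_rois(core_t1, penumbra_t1, all_pixels, infarct_t2):
    """Intersect admission (T1) and follow-up (T2) masks into six ROIs
    encoding initial tissue state x final outcome (a schematic reading
    of the paper's construction). Masks are sets of pixel indices."""
    healthy_t1 = all_pixels - core_t1 - penumbra_t1
    recovered_t2 = all_pixels - infarct_t2
    rois = {}
    for name, t1 in [("core", core_t1), ("penumbra", penumbra_t1),
                     ("healthy", healthy_t1)]:
        rois[f"{name}->infarct"] = t1 & infarct_t2
        rois[f"{name}->recovered"] = t1 & recovered_t2
    return rois

pixels = set(range(10))
rois = build_rois(core_t1={0, 1}, penumbra_t1={2, 3, 4},
                  all_pixels=pixels, infarct_t2={0, 1, 2})
print({k: sorted(v) for k, v in rois.items()})
```

In this toy scene the penumbra splits into an infarct-bound part and a recovered part, the two groups whose feature-space separation the paper's analysis examines.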
cs.CV / 65 / 2602.07540
LLM-Guided Diagnostic Evidence Alignment for Medical Vision-Language Pretraining under Limited Pairing
在有限配对下的医疗视觉-语言预训练中的LLM引导诊断证据对齐
Abstract
Most existing CLIP-style medical vision-language pretraining methods rely on global or local alignment with substantial paired data. However, global alignment is easily dominated by non-diagnostic information, while local alignment fails to integrate key diagnostic evidence. As a result, learning reliable diagnostic representations becomes difficult, which limits their applicability in medical scenarios with limited paired data. To address this issue, we propose an LLM-Guided Diagnostic Evidence Alignment method (LGDEA), which shifts the pretraining objective toward evidence-level alignment that is more consistent with the medical diagnostic process. Specifically, we leverage LLMs to extract key diagnostic evidence from radiology reports and construct a shared diagnostic evidence space, enabling evidence-aware cross-modal alignment and allowing LGDEA to effectively exploit abundant unpaired medical images and reports, thereby substantially alleviating the reliance on paired data. Extensive experimental results demonstrate that our method achieves consistent and significant improvements on phrase grounding, image-text retrieval, and zero-shot classification, and even rivals pretraining methods that rely on substantial paired data.
Chinese Translation
大多数现有的CLIP风格医疗视觉-语言预训练方法依赖于大量配对数据的全局或局部对齐。然而,全局对齐容易受到非诊断信息的主导,而局部对齐则未能整合关键的诊断证据。因此,学习可靠的诊断表示变得困难,这限制了它们在配对数据有限的医疗场景中的适用性。为了解决这个问题,我们提出了一种LLM引导的诊断证据对齐方法(LGDEA),该方法将预训练目标转向与医疗诊断过程更一致的证据级对齐。具体而言,我们利用大型语言模型(LLMs)从放射学报告中提取关键的诊断证据,并构建一个共享的诊断证据空间,使得证据感知的跨模态对齐成为可能,并允许LGDEA有效利用丰富的未配对医疗图像和报告,从而大大减轻对配对数据的依赖。大量实验结果表明,我们的方法在短语定位、图像-文本检索和零样本分类上实现了一致且显著的改进,甚至与依赖大量配对数据的预训练方法相媲美。
cs.CV / 66 / 2602.07544
MUFASA: A Multi-Layer Framework for Slot Attention
MUFASA:一种用于槽注意力的多层框架
Abstract
Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.
Chinese Translation
无监督物体中心学习(OCL)将视觉场景分解为不同的实体。槽注意力是一种流行的方法,它将单个物体表示为潜在向量,称为槽。当前的方法仅从预训练视觉变换器(ViT)的最后一层获取这些槽表示,忽略了其他层中编码的有价值的、语义丰富的信息。为了更好地利用这些潜在的语义信息,我们提出了MUFASA,一个轻量级的即插即用框架,适用于基于槽注意力的无监督物体分割方法。我们的模型在ViT编码器的多个特征层上计算槽注意力,充分利用它们的语义丰富性。我们提出了一种融合策略,将在多个层上获得的槽聚合为统一的物体中心表示。将MUFASA集成到现有的OCL方法中,改善了它们在多个数据集上的分割结果,设定了新的最先进水平,同时在仅有轻微推理开销的情况下,提高了训练收敛性。
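As a hedged illustration of the idea in the abstract above, the following NumPy sketch runs a simplified slot-attention update over features drawn from several encoder layers and fuses the per-layer slots by averaging. The iteration count, dot-product attention, and mean-based fusion are illustrative assumptions, not MUFASA's actual fusion strategy.

```python
import numpy as np

def slot_attention(feats, slots, iters=3, eps=1e-8):
    """Simplified slot-attention module (numpy sketch): slots compete for
    input features via a softmax normalized over the slot axis."""
    for _ in range(iters):
        logits = slots @ feats.T                               # (S, N) dot-product scores
        attn = np.exp(logits - logits.max(axis=0, keepdims=True))
        attn = attn / (attn.sum(axis=0, keepdims=True) + eps)  # softmax over slots (competition)
        attn = attn / (attn.sum(axis=1, keepdims=True) + eps)  # rows sum to 1: weighted mean
        slots = attn @ feats                                   # update slots from features
    return slots

rng = np.random.default_rng(0)
layer_feats = [rng.normal(size=(64, 16)) for _ in range(3)]   # features from 3 ViT layers
init = rng.normal(size=(4, 16))                               # 4 slot initializations
per_layer = [slot_attention(f, init.copy()) for f in layer_feats]
fused = np.mean(per_layer, axis=0)                            # naive cross-layer fusion
print(fused.shape)  # (4, 16)
```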
cs.CV / 67 / 2602.07550
Revealing the Semantic Selection Gap in DINOv3 through Training-Free Few-Shot Segmentation
通过无训练的少样本分割揭示 DINOv3 中的语义选择差距
Abstract
Recent self-supervised Vision Transformers (ViTs), such as DINOv3, provide rich feature representations for dense vision tasks. This study investigates the intrinsic few-shot semantic segmentation (FSS) capabilities of frozen DINOv3 features through a training-free baseline, FSSDINO, utilizing class-specific prototypes and Gram-matrix refinement. Our results across binary, multi-class, and cross-domain (CDFSS) benchmarks demonstrate that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods involving complex decoders or test-time adaptation. Crucially, we conduct an Oracle-guided layer analysis, identifying a significant performance gap between the standard last-layer features and globally optimal intermediate representations. We reveal a "Safest vs. Optimal" dilemma: while the Oracle proves that higher performance is attainable, matching the results of compute-intensive adaptation methods, current unsupervised and support-guided selection metrics consistently yield lower performance than the last-layer baseline. This characterizes a "Semantic Selection Gap" in Foundation Models, a disconnect where traditional heuristics fail to reliably identify high-fidelity features. Our work establishes the "Last-Layer" as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potential in DINOv3. The code is publicly available at https://github.com/hussni0997/fssdino.
Chinese Translation
最近的自监督视觉变换器(ViTs),如 DINOv3,为密集视觉任务提供了丰富的特征表示。本研究通过一个无训练的基线 FSSDINO,利用类特定原型和 Gram 矩阵优化,探讨了冻结的 DINOv3 特征的内在少样本语义分割(FSS)能力。我们在二分类、多分类和跨域(CDFSS)基准测试中的结果表明,这种应用于最终主干层的最小化方法在与涉及复杂解码器或测试时适应的专门方法相比时表现出高度竞争力。关键的是,我们进行了 Oracle 指导的层分析,识别出标准最后一层特征与全局最优中间表示之间存在显著的性能差距。我们揭示了一个“安全与最优”的困境:尽管 Oracle 证明可以实现更高的性能,匹配计算密集型适应方法的结果,但当前无监督和支持指导的选择指标始终表现出低于最后一层基线的性能。这表明了基础模型中的“语义选择差距”,即传统启发式方法未能可靠地识别高保真特征。我们的工作确立了“最后一层”作为一个具有欺骗性强的基线,并提供了对 DINOv3 潜在语义能力的严格诊断。代码已公开可用,地址为 https://github.com/hussni0997/fssdino。
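A minimal sketch of the prototype step described above: masked average pooling over support features yields a class prototype, and query patches are labeled by cosine similarity to it. The toy features and the mean-based threshold are assumptions for illustration; the paper's Gram-matrix refinement and layer analysis are omitted.

```python
import numpy as np

def prototype_segment(support_feats, support_mask, query_feats):
    """Training-free FSS sketch: masked average pooling builds a class
    prototype; query patches are scored by cosine similarity to it."""
    proto = (support_feats * support_mask[:, None]).sum(0) / (support_mask.sum() + 1e-8)
    def norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    sim = norm(query_feats) @ norm(proto)   # cosine similarity per query patch
    return sim > sim.mean()                 # crude foreground/background split

rng = np.random.default_rng(1)
fg = rng.normal(loc=1.0, size=(10, 8))      # toy "foreground" support patches
bg = rng.normal(loc=-1.0, size=(10, 8))     # toy "background" support patches
support = np.vstack([fg, bg])
mask = np.r_[np.ones(10), np.zeros(10)]     # 1 = foreground in the support mask
query = np.vstack([rng.normal(loc=1.0, size=(5, 8)), rng.normal(loc=-1.0, size=(5, 8))])
pred = prototype_segment(support, mask, query)
print(pred)  # the first 5 (foreground-like) patches should mostly be True
```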
cs.CV / 68 / 2602.07554
FlexID: Training-Free Flexible Identity Injection via Intent-Aware Modulation for Text-to-Image Generation
FlexID:一种无训练的灵活身份注入方法,通过意图感知调制实现文本到图像生成
Abstract
Personalized text-to-image generation aims to seamlessly integrate specific identities into textual descriptions. However, existing training-free methods often rely on rigid visual feature injection, creating a conflict between identity fidelity and textual adaptability. To address this, we propose FlexID, a novel training-free framework utilizing intent-aware modulation. FlexID orthogonally decouples identity into two dimensions: a Semantic Identity Projector (SIP) that injects high-level priors into the language space, and a Visual Feature Anchor (VFA) that ensures structural fidelity within the latent space. Crucially, we introduce a Context-Aware Adaptive Gating (CAG) mechanism that dynamically modulates the weights of these streams based on editing intent and diffusion timesteps. By automatically relaxing rigid visual constraints when strong editing intent is detected, CAG achieves synergy between identity preservation and semantic variation. Extensive experiments on IBench demonstrate that FlexID achieves a state-of-the-art balance between identity consistency and text adherence, offering an efficient solution for complex narrative generation.
Chinese Translation
个性化文本到图像生成旨在将特定身份无缝整合到文本描述中。然而,现有的无训练方法通常依赖于僵化的视觉特征注入,导致身份保真度与文本适应性之间的冲突。为了解决这个问题,我们提出了FlexID,这是一种新颖的无训练框架,利用意图感知调制。FlexID将身份正交解耦为两个维度:一个语义身份投影器(Semantic Identity Projector, SIP),将高层先验注入语言空间;一个视觉特征锚(Visual Feature Anchor, VFA),确保潜在空间内的结构保真度。关键是,我们引入了一种上下文感知自适应门控机制(Context-Aware Adaptive Gating, CAG),根据编辑意图和扩散时间步动态调节这些流的权重。通过在检测到强编辑意图时自动放宽僵化的视觉约束,CAG实现了身份保留与语义变化之间的协同。我们在IBench上的大量实验表明,FlexID在身份一致性和文本遵循性之间达到了最先进的平衡,为复杂叙事生成提供了一种高效的解决方案。
cs.CV / 69 / 2602.07555
VISOR: VIsual Spatial Object Reasoning for Language-driven Object Navigation
VISOR:基于语言驱动的视觉空间对象推理用于对象导航
Abstract
Language-driven object navigation requires agents to interpret natural language descriptions of target objects, which combine intrinsic and extrinsic attributes for instance recognition and commonsense navigation. Existing methods either (i) use end-to-end trained models with vision-language embeddings, which struggle to generalize beyond training data and lack action-level explainability, or (ii) rely on modular zero-shot pipelines with large language models (LLMs) and open-set object detectors, which suffer from error propagation, high computational cost, and difficulty integrating their reasoning back into the navigation policy. To this end, we propose a compact 3B-parameter Vision-Language-Action (VLA) agent that performs human-like embodied reasoning for both object recognition and action selection, removing the need for stitched multi-model pipelines. Instead of raw embedding matching, our agent employs explicit image-grounded reasoning to directly answer "Is this the target object?" and "Why should I take this action?" The reasoning process unfolds in three stages: "think", "think summary", and "action", yielding improved explainability, stronger generalization, and more efficient navigation. Code and dataset available upon acceptance.
Chinese Translation
语言驱动的对象导航要求智能体解读目标对象的自然语言描述,这些描述结合了内在和外在属性以进行实例识别和常识导航。现有方法要么 (i) 使用端到端训练的模型与视觉-语言嵌入,但在训练数据之外的泛化能力较差,并且缺乏行动层面的可解释性;要么 (ii) 依赖于模块化的零样本管道,使用大型语言模型(LLMs)和开放集对象检测器,但存在错误传播、高计算成本以及将推理结果整合回导航策略中的困难。为此,我们提出了一种紧凑的3B参数视觉-语言-行动(VLA)智能体,该智能体执行类人化的具身推理,既用于对象识别也用于行动选择,从而消除了对拼接多模型管道的需求。我们的智能体不再依赖原始嵌入匹配,而是采用显式的图像基础推理,直接回答“这是否是目标对象?”和“我为什么要采取这个行动?”推理过程分为三个阶段:“思考”、“思考总结”和“行动”,从而提高了可解释性、增强了泛化能力,并实现了更高效的导航。代码和数据集将在接受后提供。
cs.CV / 70 / 2602.07564
SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens
SIGMA:具有多属性标记的选择性交错生成
Abstract
Recent unified models such as Bagel demonstrate that paired image-edit data can effectively align multiple visual tasks within a single diffusion transformer. However, these models remain limited to single-condition inputs and lack the flexibility needed to synthesize results from multiple heterogeneous sources. We present SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. SIGMA introduces selective multi-attribute tokens, including style, content, subject, and identity tokens, which allow the model to interpret and compose multiple visual conditions in an interleaved text-image sequence. Through post-training on the Bagel unified backbone with 700K interleaved examples, SIGMA supports compositional editing, selective attribute transfer, and fine-grained multimodal alignment. Extensive experiments show that SIGMA improves controllability, cross-condition consistency, and visual quality across diverse editing and generation tasks, with substantial gains over Bagel on compositional tasks.
Chinese Translation
近期的统一模型如 Bagel 证明了配对的图像编辑数据可以有效地在单一扩散变换器中对齐多个视觉任务。然而,这些模型仍然局限于单一条件输入,缺乏从多个异构源合成结果所需的灵活性。我们提出了 SIGMA(Selective-Interleaved Generation with Multi-Attribute Tokens),一个统一的后训练框架,能够在扩散变换器中实现交错的多条件生成。SIGMA 引入了选择性的多属性标记,包括风格、内容、主题和身份标记,使模型能够在交错的文本-图像序列中解释和组合多个视觉条件。通过在 Bagel 统一骨干上进行 70 万个交错示例的后训练,SIGMA 支持组合编辑、选择性属性转移和细粒度的多模态对齐。大量实验表明,SIGMA 在多样的编辑和生成任务中提高了可控性、跨条件一致性和视觉质量,在组合任务上相较于 Bagel 取得了显著提升。
cs.CV / 71 / 2602.07565
Human Identification at a Distance: Challenges, Methods and Results on the Competition HID 2025
远距离人类识别:挑战、方法及2025年HID竞赛的结果
Ma, Jingzhe, Zhang, Meng, Yu, Jianlong, Liu, Kun, Xu, Zunxiao, Cheng, Xue, Zhou, Junjie, Wang, Yanfei, Li, Jiahang, Wang, Zepeng, Osamura, Kazuki, Liu, Rujie, Abe, Narishige, Wang, Jingjie, Zhang, Shunli, Xie, Haojun, Wu, Jiajun, Wu, Weiming, Kang, Wenxiong, Gao, Qingshuo, Xiong, Jiaming, Ben, Xianye, Chen, Lei, Song, Lichen, Cui, Junjian, Xiong, Haijun, Lu, Junhao, Feng, Bin, Liu, Mengyuan, Zhou, Ji, Zhao, Baoquan, Xu, Ke, Huang, Yongzhen, Wang, Liang, Marin-Jimenez, Manuel J, Ahad, Md Atiqur Rahman, Yu, Shiqi
Abstract
Human identification at a distance (HID) is challenging because traditional biometric modalities such as face and fingerprints are often difficult to acquire in real-world scenarios. Gait recognition provides a practical alternative, as it can be captured reliably at a distance. To promote progress in gait recognition and provide a fair evaluation platform, the International Competition on Human Identification at a Distance (HID) has been organized annually since 2020. Since 2023, the competition has adopted the challenging SUSTech-Competition dataset, which features substantial variations in clothing, carried objects, and view angles. No dedicated training data are provided, requiring participants to train their models using external datasets. Each year, the competition applies a different random seed to generate distinct evaluation splits, which reduces the risk of overfitting and supports a fair assessment of cross-domain generalization. While HID 2023 and HID 2024 already used this dataset, HID 2025 explicitly examined whether algorithmic advances could surpass the accuracy limits observed previously. Despite the heightened difficulty, participants achieved further improvements, and the best-performing method reached 94.2% accuracy, setting a new benchmark on this dataset. We also analyze key technical trends and outline potential directions for future research in gait recognition.
Chinese Translation
远距离人类识别(HID)面临诸多挑战,因为传统的生物识别方式,如面部和指纹,在现实场景中往往难以获取。步态识别提供了一种实用的替代方案,因为它可以在远距离可靠捕捉。为了促进步态识别的发展并提供公平的评估平台,自2020年以来,国际远距离人类识别竞赛(HID)每年举办一次。自2023年起,竞赛采用了具有挑战性的SUSTech-Competition数据集,该数据集在服装、携带物品和视角上具有显著变化。没有提供专门的训练数据,要求参与者使用外部数据集训练他们的模型。每年,竞赛应用不同的随机种子生成独特的评估划分,从而减少过拟合的风险,并支持对跨领域泛化的公平评估。虽然HID 2023和HID 2024已经使用了该数据集,但HID 2025明确考察了算法进展是否能够超越之前观察到的准确性限制。尽管难度加大,参与者仍取得了进一步的改进,表现最佳的方法达到了94.2%的准确率,为该数据集设定了新的基准。我们还分析了关键技术趋势,并概述了未来步态识别研究的潜在方向。
cs.CV / 72 / 2602.07566
Cross-Camera Cow Identification via Disentangled Representation Learning
基于解耦表示学习的跨相机牛只识别
Abstract
Precise identification of individual cows is a fundamental prerequisite for comprehensive digital management in smart livestock farming. While existing animal identification methods excel in controlled, single-camera settings, they face severe challenges regarding cross-camera generalization. When models trained on source cameras are deployed to new monitoring nodes characterized by divergent illumination, backgrounds, viewpoints, and heterogeneous imaging properties, recognition performance often degrades dramatically. This limits the large-scale application of non-contact technologies in dynamic, real-world farming environments. To address this challenge, this study proposes a cross-camera cow identification framework based on disentangled representation learning. This framework leverages the Subspace Identifiability Guarantee (SIG) theory in the context of bovine visual recognition. By modeling the underlying physical data generation process, we designed a principle-driven feature disentanglement module that decomposes observed images into multiple orthogonal latent subspaces. This mechanism effectively isolates stable, identity-related biometric features that remain invariant across cameras, thereby substantially improving generalization to unseen cameras. We constructed a high-quality dataset spanning five distinct camera nodes, covering heterogeneous acquisition devices and complex variations in lighting and angles. Extensive experiments across seven cross-camera tasks demonstrate that the proposed method achieves an average accuracy of 86.0%, significantly outperforming the Source-only Baseline (51.9%) and the strongest cross-camera baseline method (79.8%). This work establishes a subspace-theoretic feature disentanglement framework for collaborative cross-camera cow identification, offering a new paradigm for precise animal monitoring in uncontrolled smart farming environments.
Chinese Translation
个体牛只的精确识别是智能畜牧业全面数字化管理的基本前提。尽管现有的动物识别方法在受控的单相机环境中表现出色,但在跨相机泛化方面面临严峻挑战。当在源相机上训练的模型部署到具有不同照明、背景、视角和异构成像特性的新的监测节点时,识别性能往往会显著下降。这限制了非接触技术在动态、真实世界农业环境中的大规模应用。为了解决这一挑战,本研究提出了一种基于解耦表示学习的跨相机牛只识别框架。该框架利用了牛只视觉识别中的子空间可识别性保证(Subspace Identifiability Guarantee, SIG)理论。通过对潜在物理数据生成过程的建模,我们设计了一个以原则为驱动的特征解耦模块,该模块将观察到的图像分解为多个正交潜在子空间。该机制有效地隔离了在不同相机间保持不变的稳定身份相关生物特征,从而显著提高了对未见相机的泛化能力。我们构建了一个涵盖五个不同相机节点的高质量数据集,覆盖了异构采集设备以及复杂的照明和角度变化。在七个跨相机任务上的广泛实验表明,所提出的方法平均准确率达到86.0%,显著优于仅源基线(51.9%)和最强的跨相机基线方法(79.8%)。本研究建立了一个基于子空间理论的特征解耦框架,用于协作跨相机牛只识别,为在不受控的智能农业环境中实现精确动物监测提供了新的范式。
cs.CV / 73 / 2602.07568
Visualizing the Invisible: Enhancing Radiologist Performance in Breast Mammography via Task-Driven Chromatic Encoding
可视化隐形:通过任务驱动的色彩编码提升放射科医师在乳腺X线摄影中的表现
Abstract
Purpose: Mammography screening is less sensitive in dense breasts, where tissue overlap and subtle findings increase perceptual difficulty. We present MammoColor, an end-to-end framework with a Task-Driven Chromatic Encoding (TDCE) module that converts single-channel mammograms into TDCE-encoded views for visual augmentation. Materials and Methods: MammoColor couples a lightweight TDCE module with a BI-RADS triage classifier and was trained end-to-end on VinDr-Mammo. Performance was evaluated on an internal test set, two public datasets (CBIS-DDSM and INBreast), and three external clinical cohorts. We also conducted a multi-reader, multi-case (MRMC) observer study with a washout period, comparing (1) grayscale-only, (2) TDCE-only, and (3) side-by-side grayscale+TDCE. Results: On VinDr-Mammo, MammoColor improved AUC from 0.7669 to 0.8461 (P=0.004). Gains were larger in dense breasts (AUC 0.749 to 0.835). In the MRMC study, TDCE-encoded images improved specificity (0.90 to 0.96; P=0.052) with comparable sensitivity. Conclusion: TDCE provides a task-optimized chromatic representation that may improve perceptual salience and reduce false-positive recalls in mammography triage.
Chinese Translation
目的:乳腺X线筛查在致密乳腺中的敏感性较低,组织重叠和微妙发现增加了感知难度。我们提出了MammoColor,这是一个端到端框架,具有任务驱动的色彩编码(Task-Driven Chromatic Encoding, TDCE)模块,将单通道乳腺X线图像转换为TDCE编码视图以进行视觉增强。材料与方法:MammoColor将一个轻量级的TDCE模块与BI-RADS分流分类器结合,并在VinDr-Mammo上进行了端到端训练。性能在内部测试集、两个公共数据集(CBIS-DDSM和INBreast)以及三个外部临床队列上进行了评估。我们还进行了多读者、多案例(Multi-Reader, Multi-Case, MRMC)观察者研究,设定了洗脱期,比较了(1)仅灰度图像,(2)仅TDCE图像,以及(3)并排灰度+TDCE图像。结果:在VinDr-Mammo上,MammoColor将AUC从0.7669提高到0.8461(P=0.004)。在致密乳腺中,增益更大(AUC从0.749提高到0.835)。在MRMC研究中,TDCE编码图像提高了特异性(从0.90提高到0.96;P=0.052),而灵敏度相当。结论:TDCE提供了一种任务优化的色彩表示,可能提高感知显著性并减少乳腺X线筛查中的假阳性召回。
cs.CV / 74 / 2602.07574
ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention
ViCA:高效的仅视觉交叉注意力多模态大语言模型
Abstract
Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at https://github.com/EIT-NLP/ViCA.
Chinese Translation
现代多模态大语言模型(MLLMs)采用统一的自注意力设计,在每个Transformer层处理视觉和文本标记,这导致了相当大的计算开销。在本研究中,我们重新审视了这种密集视觉处理的必要性,并表明投影的视觉嵌入已经与语言空间很好地对齐,而有效的视觉-语言交互仅发生在少数几个层中。基于这些见解,我们提出了ViCA(仅视觉交叉注意力),一种最小化的MLLM架构,其中视觉标记绕过所有自注意力和前馈层,仅通过在选定层的稀疏交叉注意力与文本进行交互。对三个MLLM骨干网络、九个多模态基准和26个基于剪枝的基线的广泛评估表明,ViCA在保持98%基线准确率的同时,将视觉侧计算减少到4%,始终实现优越的性能效率权衡。此外,ViCA提供了一个常规的、硬件友好的推理管道,在单批推理中实现超过3.5倍的加速,在多批推理中实现超过10倍的加速,与仅文本的LLMs相比,将视觉定位的开销减少到接近零。它与标记剪枝方法是正交的,可以无缝结合以进一步提高效率。我们的代码可在 https://github.com/EIT-NLP/ViCA 获取。
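The layer-skipping pattern described above can be sketched as follows (a toy NumPy model, not the released code): text tokens pass through self-attention at every layer, while the projected visual tokens stay fixed and are consulted only through cross-attention at a sparse set of layers. The layer count and cross-attention positions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv):
    """Single-head scaled dot-product attention (no learned projections)."""
    return softmax(q @ kv.T / np.sqrt(q.shape[-1])) @ kv

def vica_forward(text, vis, n_layers=6, cross_layers=(2, 4)):
    """ViCA-style sketch: visual tokens are frozen inputs that bypass all
    self-attention/FFN layers; text attends to them only at sparse layers."""
    for layer in range(n_layers):
        text = text + attend(text, text)     # text-only self-attention
        if layer in cross_layers:            # sparse vision-language interaction
            text = text + attend(text, vis)  # cross-attention to visual tokens
    return text

rng = np.random.default_rng(2)
text = rng.normal(size=(8, 32))   # 8 text tokens
vis = rng.normal(size=(64, 32))   # 64 projected visual tokens (never updated)
out = vica_forward(text, vis)
print(out.shape)  # (8, 32)
```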
cs.CV / 75 / 2602.07590
Automated rock joint trace mapping using a supervised learning model trained on synthetic data generated by parametric modelling
基于参数建模生成的合成数据训练的监督学习模型进行自动化岩石节理迹线映射
Abstract
This paper presents a geology-driven machine learning method for automated rock joint trace mapping from images. The approach combines geological modelling, synthetic data generation, and supervised image segmentation to address limited real data and class imbalance. First, discrete fracture network models are used to generate synthetic jointed rock images at field-relevant scales via parametric modelling, preserving joint persistence, connectivity, and node-type distributions. Second, segmentation models are trained using mixed training and pretraining followed by fine-tuning on real images. The method is tested in box and slope domains using several real datasets. The results show that synthetic data can support supervised joint trace detection when real data are scarce. Mixed training performs well when real labels are consistent (e.g. box-domain), while fine-tuning is more robust when labels are noisy (e.g. slope-domain, where labels can be biased, incomplete, and inconsistent). Fully zero-shot prediction from the synthetic model remains limited, but useful generalisation is achieved by fine-tuning with a small amount of real data. Qualitative analysis shows clearer and more geologically meaningful joint traces than indicated by quantitative metrics alone. The proposed method supports reliable joint mapping and provides a basis for further work on domain adaptation and evaluation.
Chinese Translation
本文提出了一种以地质为驱动的机器学习方法,用于从图像中自动化岩石节理迹线映射。该方法结合了地质建模、合成数据生成和监督图像分割,以解决真实数据有限和类别不平衡的问题。首先,使用离散断裂网络模型通过参数建模生成与现场相关尺度的合成节理岩石图像,保留节理的持久性、连通性和节点类型分布。其次,分割模型通过混合训练和预训练进行训练,然后在真实图像上进行微调。该方法在盒形和边坡领域使用多个真实数据集进行了测试。结果表明,当真实数据稀缺时,合成数据可以支持监督节理迹线检测。在真实标签一致的情况下(例如盒形领域),混合训练表现良好,而在标签噪声较大的情况下(例如边坡领域,标签可能存在偏差、不完整和不一致),微调则更为稳健。来自合成模型的完全零样本预测仍然有限,但通过少量真实数据的微调实现了有用的泛化。定性分析显示,与仅依赖定量指标相比,所提方法提供了更清晰和更具地质意义的节理迹线。该方法支持可靠的节理映射,并为进一步的领域适应和评估工作提供了基础。
cs.CV / 76 / 2602.07595
TeleBoost: A Systematic Alignment Framework for High-Fidelity, Controllable, and Robust Video Generation
TeleBoost:一个系统化的高保真、可控且稳健的视频生成对齐框架
Abstract
Post-training is the decisive step for converting a pretrained video generator into a production-oriented model that is instruction-following, controllable, and robust over long temporal horizons. This report presents a systematic post-training framework that organizes supervised policy shaping, reward-driven reinforcement learning, and preference-based refinement into a single stability-constrained optimization stack. The framework is designed around practical video-generation constraints, including high rollout cost, temporally compounding failure modes, and feedback that is heterogeneous, uncertain, and often weakly discriminative. By treating optimization as a staged, diagnostic-driven process rather than a collection of isolated tricks, the report summarizes a cohesive recipe for improving perceptual fidelity, temporal coherence, and prompt adherence while preserving the controllability established at initialization. The resulting framework provides a clear blueprint for building scalable post-training pipelines that remain stable, extensible, and effective in real-world deployment settings.
Chinese Translation
后训练是将预训练视频生成器转化为一个面向生产的模型的关键步骤,该模型能够遵循指令、可控且在长时间范围内稳健。本报告提出了一个系统化的后训练框架,将监督策略塑造、基于奖励的强化学习和基于偏好的精炼组织成一个单一的稳定性约束优化堆栈。该框架围绕实际视频生成的约束设计,包括高展开成本、时间复合失败模式,以及异质、不确定且通常弱区分的反馈。通过将优化视为一个分阶段的、以诊断为驱动的过程,而不是一系列孤立的技巧,本报告总结了一种连贯的方案,以提高感知保真度、时间一致性和提示遵循,同时保持在初始化时建立的可控性。最终形成的框架为构建可扩展的后训练管道提供了清晰的蓝图,这些管道在实际部署环境中保持稳定、可扩展和有效。
cs.CV / 77 / 2602.07605
Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning
Fine-R1:通过链式思维推理使多模态大语言模型在细粒度视觉识别中表现卓越
Abstract
Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine-grained sub-categories. While Multi-modal Large Language Models (MLLMs) achieve strong performance on coarse-grained visual tasks, they often struggle with Fine-Grained Visual Recognition (FGVR). Adapting general-purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated for discriminative tasks. Moreover, MLLMs tend to overfit to seen sub-categories and generalize poorly to unseen ones. To address these challenges, we propose Fine-R1, an MLLM tailored for FGVR through an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning, where we construct a high-quality FGVR CoT dataset with rationales of "visual analysis, candidate sub-categories, comparison, and prediction", transitioning the model into a strong open-world classifier; and (2) Triplet Augmented Policy Optimization, where Intra-class Augmentation mixes trajectories from anchor and positive images within the same category to improve robustness to intra-class variance, while Inter-class Augmentation maximizes the response distinction conditioned on images across sub-categories to enhance discriminative ability. With only 4-shot training, Fine-R1 outperforms existing general MLLMs, reasoning MLLMs, and even contrastive CLIP models in identifying both seen and unseen sub-categories, showing promise in working in knowledge-intensive domains where gathering expert annotations for all sub-categories is arduous. Code is available at https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026.
Chinese Translation
视觉世界中的任何实体都可以基于共享特征进行分层分组,并映射到细粒度子类别。尽管多模态大语言模型(MLLMs)在粗粒度视觉任务上表现出色,但它们在细粒度视觉识别(FGVR)方面往往面临挑战。将通用 MLLMs 适应于 FGVR 通常需要大量标注数据,而这些数据的获取成本高昂,这使得与专门用于判别任务的对比 CLIP 模型相比,存在显著的性能差距。此外,MLLMs 往往会对已见子类别过拟合,而对未见子类别的泛化能力较差。为了解决这些挑战,我们提出了 Fine-R1,这是一种针对 FGVR 的 MLLM,采用 R1 风格的训练框架:(1)链式思维监督微调,我们构建了一个高质量的 FGVR CoT 数据集,包含“视觉分析、候选子类别、比较和预测”的推理,将模型转变为一个强大的开放世界分类器;(2)三元组增强策略优化,其中类内增强通过混合同一类别中锚点和正样本图像的轨迹来提高对类内变化的鲁棒性,而类间增强则在跨子类别的图像条件下最大化响应区分,以增强判别能力。仅通过 4-shot 训练,Fine-R1 在识别已见和未见子类别方面超越了现有的通用 MLLMs、推理 MLLMs,甚至对比 CLIP 模型,显示出在知识密集型领域中工作的潜力,在这些领域中,为所有子类别收集专家注释是困难的。代码可在 https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026 获取。
cs.CV / 78 / 2602.07608
HistoMet: A Pan-Cancer Deep Learning Framework for Prognostic Prediction of Metastatic Progression and Site Tropism from Primary Tumor Histopathology
HistoMet:一种用于从原发肿瘤组织病理学预测转移进展和部位趋向的全癌种深度学习框架
Abstract
Metastatic progression remains the leading cause of cancer-related mortality, yet predicting whether a primary tumor will metastasize and where it will disseminate directly from histopathology remains a fundamental challenge. Although whole-slide images (WSIs) provide rich morphological information, prior computational pathology approaches typically address metastatic status or site prediction as isolated tasks, and do not explicitly model the clinically sequential decision process of metastatic risk assessment followed by downstream site-specific evaluation. To address this research gap, we present a decision-aware, concept-aligned MIL framework, HistoMet, for prognostic metastatic outcome prediction from primary tumor WSIs. Our proposed framework adopts a two-module prediction pipeline in which the likelihood of metastatic progression from the primary tumor is first estimated, followed by conditional prediction of the metastatic site for high-risk cases. To guide representation learning and improve clinical interpretability, our framework integrates linguistically defined and data-adaptive metastatic concepts through a pretrained pathology vision-language model. We evaluate HistoMet on a multi-institutional pan-cancer cohort of 6504 patients with metastasis follow-up and site annotations. Under clinically relevant high-sensitivity screening settings (95 percent sensitivity), HistoMet significantly reduces downstream workload while maintaining high metastatic risk recall. Conditional on metastatic cases, HistoMet achieves a macro F1 of 74.6 with a standard deviation of 1.3 and a macro one-vs-rest AUC of 92.1. These results demonstrate that explicitly modeling clinical decision structure enables robust and deployable prognostic prediction of metastatic progression and site tropism directly from primary tumor histopathology.
Chinese Translation
转移进展仍然是癌症相关死亡的主要原因,然而,直接从组织病理学预测原发肿瘤是否会转移以及转移到何处仍然是一个基本挑战。尽管全切片图像(WSIs)提供了丰富的形态学信息,但以往的计算病理学方法通常将转移状态或部位预测视为孤立任务,并未明确建模临床上转移风险评估后续特定部位评估的决策过程。为了解决这一研究空白,我们提出了一种决策感知、概念对齐的多实例学习(MIL)框架HistoMet,用于从原发肿瘤WSIs中预测转移结果。我们提出的框架采用了一个两模块的预测流程,首先估计原发肿瘤的转移进展可能性,然后对高风险病例进行转移部位的条件预测。为了指导表示学习并提高临床可解释性,我们的框架通过预训练的病理视觉-语言模型整合了语言定义和数据自适应的转移概念。我们在一个包含6504名患者的多机构全癌种队列上评估了HistoMet,这些患者有转移随访和部位注释。在临床相关的高灵敏度筛查设置下(95%灵敏度),HistoMet显著减少了下游工作负担,同时保持了高转移风险的召回率。在转移病例的条件下,HistoMet实现了74.6的宏观F1值,标准差为1.3,宏观一对多AUC为92.1。这些结果表明,明确建模临床决策结构能够直接从原发肿瘤组织病理学中实现稳健且可部署的转移进展和部位趋向的预后预测。
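The sequential decision structure described above (risk triage first, site prediction only for flagged cases) can be sketched as below. The threshold, toy score arrays, and four-site label set are illustrative assumptions, not HistoMet's actual modules.

```python
import numpy as np

def two_stage_predict(risk_scores, site_logits, screening_threshold):
    """Decision-aware sketch of a two-module pipeline: stage 1 flags
    high-metastatic-risk cases at a screening threshold; stage 2 predicts
    a metastatic site only for the flagged (high-risk) cases."""
    high_risk = risk_scores >= screening_threshold            # stage 1: triage
    sites = np.full(len(risk_scores), -1)                     # -1 = not evaluated
    sites[high_risk] = site_logits[high_risk].argmax(axis=1)  # stage 2: conditional
    return high_risk, sites

rng = np.random.default_rng(3)
risk = rng.uniform(size=6)           # toy per-patient metastatic-risk scores
logits = rng.normal(size=(6, 4))     # toy logits over 4 hypothetical sites
flag, site = two_stage_predict(risk, logits, screening_threshold=0.5)
print(flag, site)
```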
cs.CV / 79 / 2602.07625
AD-MIR: Bridging the Gap from Perception to Persuasion in Advertising Video Understanding via Structured Reasoning
AD-MIR:通过结构化推理弥合广告视频理解中的感知与说服之间的鸿沟
Abstract
Multimodal understanding of advertising videos is essential for interpreting the intricate relationship between visual storytelling and abstract persuasion strategies. However, despite excelling at general search, existing agents often struggle to bridge the cognitive gap between pixel-level perception and high-level marketing logic. To address this challenge, we introduce AD-MIR, a framework designed to decode advertising intent via a two-stage architecture. First, in the Structure-Aware Memory Construction phase, the system converts raw video into a structured database by integrating semantic retrieval with exact keyword matching. This approach prioritizes fine-grained brand details (e.g., logos, on-screen text) while dynamically filtering out irrelevant background noise to isolate key protagonists. Second, the Structured Reasoning Agent mimics a marketing expert through an iterative inquiry loop, decomposing the narrative to deduce implicit persuasion tactics. Crucially, it employs an evidence-based self-correction mechanism that rigorously validates these insights against specific video frames, automatically backtracking when visual support is lacking. Evaluation on the AdsQA benchmark demonstrates that AD-MIR achieves state-of-the-art performance, surpassing the strongest general-purpose agent, DVD, by 1.8% in strict and 9.5% in relaxed accuracy. These results underscore that effective advertising understanding demands explicitly grounding abstract marketing strategies in pixel-level evidence. The code is available at https://github.com/Little-Fridge/AD-MIR.
Chinese Translation
对广告视频的多模态理解对于解读视觉叙事与抽象说服策略之间复杂关系至关重要。然而,尽管在一般搜索方面表现出色,现有的智能体往往难以弥合像素级感知与高层次营销逻辑之间的认知鸿沟。为了解决这一挑战,我们提出了AD-MIR,一个旨在通过两阶段架构解码广告意图的框架。首先,在结构感知记忆构建阶段,系统通过将语义检索与精确关键词匹配相结合,将原始视频转换为结构化数据库。该方法优先考虑细致的品牌细节(例如,标志、屏幕文本),同时动态过滤掉无关的背景噪声,以孤立关键主角。其次,结构化推理智能体通过迭代询问循环模拟营销专家,分解叙事以推导隐含的说服策略。关键在于,它采用基于证据的自我纠正机制,严格验证这些见解与特定视频帧的对应关系,当视觉支持不足时自动回溯。在AdsQA基准上的评估表明,AD-MIR实现了最先进的性能,在严格和放松的准确性上分别超过最强的通用智能体DVD 1.8%和9.5%。这些结果强调,有效的广告理解需要将抽象的营销策略明确地基于像素级证据。代码可在 https://github.com/Little-Fridge/AD-MIR 获取。
cs.CV / 80 / 2602.07643
Uncovering Modality Discrepancy and Generalization Illusion for General-Purpose 3D Medical Segmentation
揭示通用3D医学分割中的模态差异和泛化幻觉
Abstract
While emerging 3D medical foundation models are envisioned as versatile tools that offer general-purpose capabilities, their validation remains largely confined to regional and structural imaging, leaving a significant modality discrepancy unexplored. To provide a rigorous and objective assessment, we curate the UMD dataset comprising 490 whole-body PET/CT and 464 whole-body PET/MRI scans ($\sim$675k 2D images, $\sim$12k 3D organ annotations) and conduct a thorough evaluation of representative 3D segmentation foundation models. Through intra-subject controlled comparisons of paired scans, we isolate imaging modality as the primary independent variable to evaluate model robustness in real-world applications. Our evaluation reveals a stark discrepancy between literature-reported benchmarks and real-world efficacy, particularly when transitioning from structural to functional domains. Such systemic failures underscore that current 3D foundation models are far from achieving truly general-purpose status, necessitating a paradigm shift toward multi-modal training and evaluation to bridge the gap between idealized benchmarking and comprehensive clinical utility. This dataset and analysis establish a foundational cornerstone for future research to develop truly modality-agnostic medical foundation models.
Chinese Translation
尽管新兴的3D医学基础模型被设想为具有通用能力的多功能工具,但其验证仍主要局限于区域和结构成像,导致显著的模态差异未被探索。为了提供严格和客观的评估,我们整理了UMD数据集,包含490个全身PET/CT和464个全身PET/MRI扫描(约675k 2D图像,约12k 3D器官标注),并对代表性的3D分割基础模型进行了全面的评估。通过对配对扫描的受试者内控制比较,我们将成像模态作为主要自变量,以评估模型在实际应用中的鲁棒性。我们的评估揭示了文献报告的基准与实际效能之间的显著差异,特别是在从结构领域转向功能领域时。这种系统性失败强调了当前的3D基础模型距离真正的通用状态仍有很大差距,迫切需要向多模态训练和评估的范式转变,以弥合理想化基准与全面临床实用性之间的差距。该数据集和分析为未来研究开发真正模态无关的医学基础模型奠定了基础。
cs.CV / 81 / 2602.07645
From Dead Pixels to Editable Slides: Infographic Reconstruction into Native Google Slides via Vision-Language Region Understanding
从死像素到可编辑幻灯片:通过视觉-语言区域理解将信息图重建为原生Google幻灯片
Abstract
Infographics are widely used to communicate information with a combination of text, icons, and data visualizations, but once exported as images their content is locked into pixels, making updates, localization, and reuse expensive. We describe \textsc{Images2Slides}, an API-based pipeline that converts a static infographic (PNG/JPG) into a native, editable Google Slides slide by extracting a region-level specification with a vision-language model (VLM), mapping pixel geometry into slide coordinates, and recreating elements using the Google Slides batch update API. The system is model-agnostic and supports multiple VLM backends via a common JSON region schema and deterministic postprocessing. On a controlled benchmark of 29 programmatically generated infographic slides with known ground-truth regions, \textsc{Images2Slides} achieves an overall element recovery rate of $0.989\pm0.057$ (text: $0.985\pm0.083$, images: $1.000\pm0.000$), with mean text transcription error $\mathrm{CER}=0.033\pm0.149$ and mean layout fidelity $\mathrm{IoU}=0.364\pm0.161$ for text regions and $0.644\pm0.131$ for image regions. We also highlight practical engineering challenges in reconstruction, including text size calibration and non-uniform backgrounds, and describe failure modes that guide future work.
Chinese Translation
信息图广泛用于通过文本、图标和数据可视化的组合来传达信息,但一旦导出为图像,其内容就被锁定在像素中,使得更新、本地化和重用成本高昂。我们描述了\textsc{Images2Slides},一个基于API的管道,它通过使用视觉-语言模型(VLM)提取区域级规范,将像素几何映射到幻灯片坐标,并使用Google幻灯片批量更新API重新创建元素,将静态信息图(PNG/JPG)转换为原生可编辑的Google幻灯片。该系统与模型无关,并通过通用的JSON区域架构和确定性后处理支持多个VLM后端。在一个控制基准测试中,使用29个程序生成的信息图幻灯片及其已知的真实区域,\textsc{Images2Slides}实现了整体元素恢复率为$0.989\pm0.057$(文本:$0.985\pm0.083$,图像:$1.000\pm0.000$),文本转录误差的平均值为$\mathrm{CER}=0.033\pm0.149$,文本区域的平均布局保真度为$\mathrm{IoU}=0.364\pm0.161$,图像区域为$0.644\pm0.131$。我们还强调了重建中的实际工程挑战,包括文本大小校准和非均匀背景,并描述了指导未来工作的失败模式。
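The pixel-to-slide geometry mapping mentioned in the abstract can be sketched as a simple rescaling into EMU (the Slides API's English Metric Units, 914,400 per inch). The function name, dict keys, and the default 10 in x 7.5 in slide size are illustrative assumptions, not the paper's exact schema.

```python
def pixel_bbox_to_emu(bbox_px, img_size_px, slide_size_emu=(9144000, 6858000)):
    """Scale a region's pixel bounding box (x, y, w, h) into slide
    coordinates in EMU, assuming the image fills the whole slide."""
    x, y, w, h = bbox_px
    img_w, img_h = img_size_px
    sx = slide_size_emu[0] / img_w   # horizontal EMU per pixel
    sy = slide_size_emu[1] / img_h   # vertical EMU per pixel
    return {"translateX": round(x * sx), "translateY": round(y * sy),
            "width": round(w * sx), "height": round(h * sy)}

# a 100x50 px region at (200, 100) in a 1000x750 px infographic
print(pixel_bbox_to_emu((200, 100, 100, 50), (1000, 750)))
# → {'translateX': 1828800, 'translateY': 914400, 'width': 914400, 'height': 457200}
```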
cs.CV / 82 / 2602.07658
Influence of Geometry, Class Imbalance and Alignment on Reconstruction Accuracy -- A Micro-CT Phantom-Based Evaluation
几何形状、类别不平衡和对齐对重建精度的影响——基于微型CT体模的评估
Abstract
The accuracy of 3D models created from medical scans depends on imaging hardware, segmentation methods, mesh processing techniques, and other factors. The effects of geometry type, class imbalance, and voxel and point cloud alignment on accuracy remain to be thoroughly explored. This work evaluates the errors across the reconstruction pipeline and explores the use of voxel- and surface-based accuracy metrics for different segmentation algorithms and geometry types. A sphere, a facemask, and an AAA were printed using the SLA technique and scanned using a micro-CT machine. Segmentation was performed using GMM-, Otsu-, and RG-based methods. Segmented and reference models, aligned using the KU algorithm, were quantitatively compared to evaluate metrics such as the Dice and Jaccard scores and precision. Surface meshes were registered with reference meshes using an ICP-based alignment process, and metrics such as the chamfer distance and average Hausdorff distance were evaluated. The Otsu method was found to be the most suitable method for all the geometries. AAA yielded low overlap scores due to its small wall thickness and misalignment. The effect of class imbalance on specificity was observed most strongly for AAA. Surface-based accuracy metrics differed from the voxel-based trends: the RG method performed best for the sphere, while GMM and Otsu performed better for AAA. The facemask surface was most error-prone, possibly due to misalignment during the ICP process. Segmentation accuracy is a cumulative sum of errors across different stages of the reconstruction process. High voxel-based accuracy metrics may be misleading in cases of high class imbalance and sensitivity to alignment. The Jaccard index is found to be more stringent than the Dice and more suitable for accuracy assessment of thin-walled structures. Voxel and point cloud alignment should be ensured to make any reliable assessment of the reconstruction pipeline.
Chinese Translation
从医学扫描生成的3D模型的精度依赖于成像硬件、分割方法和网格处理技术等因素。几何类型、类别不平衡、体素和点云对齐对精度的影响尚待深入探讨。本研究评估了重建流程中的误差,并探讨了针对不同分割算法和几何类型使用基于体素和表面的精度指标。使用SLA技术打印了一个球体、一个面罩和一个AAA(腹主动脉瘤),并使用微型CT机器进行了扫描。采用GMM(高斯混合模型)、Otsu和基于RG(区域生长)的方法进行分割。使用KU算法对齐的分割模型和参考模型进行了定量比较,以评估Dice和Jaccard得分、精度等指标。表面网格通过基于ICP(迭代最近点)的对齐过程与参考网格进行了配准。评估了倒角距离(chamfer distance)和平均Hausdorff距离等指标。结果发现Otsu方法是所有几何形状中最合适的方法。由于AAA的壁厚较小和对齐不当,导致其重叠得分较低。类别不平衡对特异性的影响在AAA中表现得最为明显。基于表面的精度指标与基于体素的趋势有所不同。RG方法在球体上表现最佳,而GMM和Otsu在AAA上表现更佳。面罩表面最容易出现误差,可能是由于在ICP过程中的对齐问题。分割精度是重建过程中不同阶段误差的累积和。在类别不平衡和对齐敏感性较高的情况下,较高的基于体素的精度指标可能会产生误导。Jaccard指数被发现比Dice更为严格,更适合用于薄壁结构的精度评估。应确保体素和点云的对齐,以便对重建流程进行可靠评估。
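The abstract's claim that the Jaccard index is stricter than Dice follows from the identity $J = D/(2-D) \le D$, which bites hardest when overlap is partial, as for thin-walled structures like the AAA. A minimal NumPy sketch on a synthetic thin-wall mask (illustrative numbers only, not the paper's data):

```python
import numpy as np

def dice(a, b):
    """Dice coefficient of two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def jaccard(a, b):
    """Jaccard index (IoU) of two binary masks."""
    inter = np.logical_and(a, b).sum()
    return inter / np.logical_or(a, b).sum()

# Synthetic thin wall: reference is a 1-voxel-thick line,
# the segmentation recovers only half of it.
a = np.zeros((10, 10), dtype=bool); a[4, :] = True   # reference wall
b = np.zeros((10, 10), dtype=bool); b[4, :5] = True  # partial recovery

d = dice(a, b)      # 2*5 / (10+5) ≈ 0.667
j = jaccard(a, b)   # 5 / 10 = 0.5
assert j <= d       # Jaccard is always the stricter score: J = D/(2-D)
```

For the same partial overlap, Jaccard penalizes the miss far more than Dice, which is why the abstract recommends it for thin-walled accuracy assessment.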
cs.CV / 83 / 2602.07668
Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making
内外观察与倾听:用于驾驶员安全评估和智能车辆决策的多模态人工智能系统
Abstract
The looking-in-looking-out (LILO) framework has enabled intelligent vehicle applications that understand both the outside scene and the driver state to improve safety outcomes, with examples in smart airbag deployment, takeover time prediction in autonomous control transitions, and driver attention monitoring. In this research, we propose an augmentation to this framework, making a case for the audio modality as an additional source of information to understand the driver, and in the evolving autonomy landscape, also the passengers and those outside the vehicle. We expand LILO by incorporating audio signals, forming the looking-and-listening inside-and-outside (L-LIO) framework to enhance driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three example cases where audio enhances vehicle safety: supervised learning on driver speech audio to classify potential impairment states (e.g., intoxication), collection and analysis of passenger natural language instructions (e.g., "turn after that red building") to motivate how spoken language can interface with planning systems through audio-aligned instruction data, and limitations of vision-only systems where audio may disambiguate the guidance and gestures of external agents. Datasets include custom-collected in-vehicle and external audio samples in real-world environments. Pilot findings show that audio yields safety-relevant insights, particularly in nuanced or context-rich scenarios where sound is critical to safe decision-making or visual signals alone are insufficient. Challenges include ambient noise interference, privacy considerations, and robustness across human subjects, motivating further work on reliability in dynamic real-world contexts. L-LIO augments driver and scene understanding through multimodal fusion of audio and visual sensing, offering new paths for safety intervention.
Chinese Translation
内外观察(Looking-in-Looking-out,LILO)框架使得智能车辆应用能够理解外部场景和驾驶员状态,从而改善安全结果,具体例子包括智能安全气囊部署、自动控制过渡中的接管时间预测以及驾驶员注意力监测。在本研究中,我们对这一框架进行了扩展,提出将音频模态作为理解驾驶员的额外信息来源,并在不断发展的自动化环境中,也包括乘客和车辆外部人员。我们通过结合音频信号扩展LILO,形成内外观察与倾听(Looking-and-Listening Inside-and-Outside,L-LIO)框架,以通过多模态传感器融合增强驾驶员状态评估和环境理解。我们评估了三个音频增强车辆安全的示例案例:对驾驶员语音音频进行监督学习以分类潜在的能力受损状态(例如,醉酒);收集和分析乘客自然语言指令(例如,“在那座红色建筑后转弯”),以说明口语如何通过音频对齐的指令数据与规划系统对接;以及仅依赖视觉的系统的局限性——此时音频可以帮助消歧车外人员的指引和手势。数据集包括在真实环境中自行采集的车内和车外音频样本。初步发现表明,音频提供了与安全相关的见解,特别是在细微或上下文丰富的场景中,声音对安全决策至关重要,而仅依靠视觉信号则不足。挑战包括环境噪声干扰、隐私考虑以及跨人类受试者的鲁棒性,这促使我们在动态真实环境中进一步研究可靠性。L-LIO通过音频和视觉感知的多模态融合增强了对驾驶员和场景的理解,为安全干预提供了新的路径。
cs.CV / 84 / 2602.07680
Vision and language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning
视觉与语言:用于驾驶场景安全评估和自动驾驶规划的新型表征与人工智能
Abstract
Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach leveraging CLIP-based image-text similarity to produce a low-latency semantic hazard signal. This enables robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into a transformer-based trajectory planning framework using the Waymo Open Dataset. Our results show that naively conditioning planners on global embeddings does not improve trajectory accuracy, highlighting the importance of representation-task alignment and motivating the development of task-informed extraction methods for safety-critical planning. Third, we investigate natural language as an explicit behavioral constraint on motion planning using the doScenes dataset. In this setting, passenger-style instructions grounded in visual scene elements suppress rare but severe planning failures and improve safety-aligned behavior in ambiguous scenarios. Taken together, these findings demonstrate that vision-language representations hold significant promise for autonomous driving safety when used to express semantic risk, intent, and behavioral constraints. Realizing this potential is fundamentally an engineering problem requiring careful system design and structured grounding rather than direct feature injection.
Chinese Translation
视觉-语言模型(VLMs)最近作为强大的表征学习系统出现,它们将视觉观察与自然语言概念对齐,为安全关键的自动驾驶中的语义推理提供了新的机会。本文研究了视觉-语言表征如何在集成到感知、预测和规划流程中支持驾驶场景的安全评估和决策。我们研究了三个互补的系统级应用案例。首先,我们介绍了一种轻量级、类别无关的危险筛选方法,该方法利用基于CLIP的图像-文本相似性生成低延迟的语义危险信号。这使得在没有明确的物体检测或视觉问答的情况下,能够稳健地检测多样化和超出分布的道路危险。其次,我们考察了将场景级视觉-语言嵌入集成到基于变换器的轨迹规划框架中,使用Waymo开放数据集。我们的结果表明,简单地将规划器与全局嵌入条件化并未提高轨迹准确性,这突显了表征-任务对齐的重要性,并激励了为安全关键规划开发任务知情的提取方法。第三,我们使用doScenes数据集研究自然语言作为运动规划的显式行为约束。在这种情况下,基于视觉场景元素的乘客风格指令抑制了罕见但严重的规划失败,并改善了模糊场景中的安全对齐行为。综合来看,这些发现表明,视觉-语言表征在表达语义风险、意图和行为约束时,对自动驾驶安全具有重要的潜力。实现这一潜力根本上是一个工程问题,需要仔细的系统设计和结构化的基础,而不是直接的特征注入。
cs.CV / 85 / 2602.07689
Process-of-Thought Reasoning for Videos
视频的思维过程推理
Abstract
Video understanding requires not only recognizing visual content but also performing temporally grounded, multi-step reasoning over long and noisy observations. We propose Process-of-Thought (PoT) Reasoning for Videos, a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence. The framework is designed to be model-agnostic and can be plugged into existing vision-language backbones, supporting both closed-book reasoning and evidence-augmented reasoning with external tools. We further introduce a unified representation for PoT traces that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations. Extensive experiments on standard video reasoning tasks demonstrate that PoT consistently improves factual correctness and temporal grounding, while providing interpretable reasoning traces for diagnosis and downstream use.
Chinese Translation
视频理解不仅需要识别视觉内容,还需要对长时间和噪声观察进行时间基础的多步骤推理。我们提出了视频的思维过程(Process-of-Thought, PoT)推理框架,该框架通过将视频推理结构化为一系列轻量级、可验证的步骤,使推理过程变得明确。PoT 交替进行 (i) 时间证据选择,(ii) 步骤状态更新,以及 (iii) 受限答案合成,使模型能够在保持与视频证据的可追溯性的同时,逐步细化假设。该框架设计为与模型无关,可以集成到现有的视觉-语言基础架构中,支持闭卷推理和使用外部工具的证据增强推理。我们进一步引入了一种统一的 PoT 轨迹表示,将中间决策与时间段对齐,从而提高对干扰物的鲁棒性并减少虚假解释。针对标准视频推理任务的广泛实验表明,PoT 一直在提高事实正确性和时间基础,同时提供可解释的推理轨迹以便于诊断和后续使用。
cs.CV / 86 / 2602.07694
Semantic-Deviation-Anchored Multi-Branch Fusion for Unsupervised Anomaly Detection and Localization in Unstructured Conveyor-Belt Coal Scenes
基于语义偏差锚定的多分支融合:非结构化输送带煤场景中的无监督异常检测与定位
Abstract
Reliable foreign-object anomaly detection and pixel-level localization in conveyor-belt coal scenes are essential for safe and intelligent mining operations. This task is particularly challenging due to the highly unstructured environment: coal and gangue are randomly piled, backgrounds are complex and variable, and foreign objects often exhibit low contrast, deformation, and occlusion, resulting in coupling with their surroundings. These characteristics weaken the stability and regularity assumptions that many anomaly detection methods rely on in structured industrial settings, leading to notable performance degradation. To support evaluation and comparison in this setting, we construct \textbf{CoalAD}, a benchmark for unsupervised foreign-object anomaly detection with pixel-level localization in coal-stream scenes. We further propose a complementary-cue collaborative perception framework that extracts and fuses complementary anomaly evidence from three perspectives: object-level semantic composition modeling, semantic-attribution-based global deviation analysis, and fine-grained texture matching. The fused outputs provide robust image-level anomaly scoring and accurate pixel-level localization. Experiments on CoalAD demonstrate that our method outperforms widely used baselines across the evaluated image-level and pixel-level metrics, and ablation studies validate the contribution of each component. The code is available at https://github.com/xjpp2016/USAD.
Chinese Translation
在输送带煤场景中,可靠的外部物体异常检测和像素级定位对于安全和智能的采矿作业至关重要。由于环境高度非结构化,这一任务尤其具有挑战性:煤和废石被随机堆积,背景复杂多变,外部物体通常表现出低对比度、变形和遮挡,导致与周围环境的耦合。这些特征削弱了许多异常检测方法在结构化工业环境中所依赖的稳定性和规律性假设,导致显著的性能下降。为了支持在这一环境中的评估和比较,我们构建了\textbf{CoalAD},这是一个用于煤流场景中具有像素级定位的无监督外部物体异常检测的基准。我们进一步提出了一种互补线索协同感知框架,从三个角度提取和融合互补的异常证据:对象级语义组成建模、基于语义归属的全局偏差分析和细粒度纹理匹配。融合的输出提供了稳健的图像级异常评分和准确的像素级定位。在CoalAD上的实验表明,我们的方法在评估的图像级和像素级指标上超越了广泛使用的基线,消融研究验证了每个组件的贡献。代码可在 https://github.com/xjpp2016/USAD 获取。
cs.CV / 87 / 2602.07702
A hybrid Kolmogorov-Arnold network for medical image segmentation
一种用于医学图像分割的混合Kolmogorov-Arnold网络
Abstract
Medical image segmentation plays a vital role in diagnosis and treatment planning, but remains challenging due to the inherent complexity and variability of medical images, especially in capturing non-linear relationships within the data. We propose U-KABS, a novel hybrid framework that integrates the expressive power of Kolmogorov-Arnold Networks (KANs) with a U-shaped encoder-decoder architecture to enhance segmentation performance. The U-KABS model combines the convolutional and squeeze-and-excitation stage, which enhances channel-wise feature representations, and the KAN Bernstein Spline (KABS) stage, which employs learnable activation functions based on Bernstein polynomials and B-splines. This hybrid design leverages the global smoothness of Bernstein polynomials and the local adaptability of B-splines, enabling the model to effectively capture both broad contextual trends and fine-grained patterns critical for delineating complex structures in medical images. Skip connections between encoder and decoder layers support effective multi-scale feature fusion and preserve spatial details. Evaluated across diverse medical imaging benchmark datasets, U-KABS demonstrates superior performance compared to strong baselines, particularly in segmenting complex anatomical structures.
Chinese Translation
医学图像分割在诊断和治疗规划中发挥着至关重要的作用,但由于医学图像固有的复杂性和变异性,尤其是在捕捉数据中的非线性关系方面,仍然面临挑战。我们提出了U-KABS,一种新颖的混合框架,结合了Kolmogorov-Arnold网络(KANs)的表达能力与U形编码器-解码器架构,以增强分割性能。U-KABS模型结合了卷积和压缩-激励阶段,增强了通道特征表示,以及KAN伯恩斯坦样条(KABS)阶段,该阶段采用基于伯恩斯坦多项式和B样条的可学习激活函数。这种混合设计利用了伯恩斯坦多项式的全局平滑性和B样条的局部适应性,使模型能够有效捕捉广泛的上下文趋势和细粒度模式,这对于划定医学图像中的复杂结构至关重要。编码器和解码器层之间的跳跃连接支持有效的多尺度特征融合,并保留空间细节。在各种医学影像基准数据集上的评估表明,U-KABS的性能优于强基线,特别是在分割复杂解剖结构方面。
cs.CV / 88 / 2602.07717
All-Optical Segmentation via Diffractive Neural Networks for Autonomous Driving
基于衍射神经网络的全光学分割在自动驾驶中的应用
Abstract
Semantic segmentation and lane detection are crucial tasks in autonomous driving systems. Conventional approaches predominantly rely on deep neural networks (DNNs), which incur high energy costs due to extensive analog-to-digital conversions and large-scale image computations required for low-latency, real-time responses. Diffractive optical neural networks (DONNs) have shown promising advantages over conventional DNNs on digital or optoelectronic computing platforms in energy efficiency. By performing all-optical image processing via light diffraction at the speed of light, DONNs save computation energy costs while reducing the overhead associated with analog-to-digital conversions by all-optical encoding and computing. In this work, we propose a novel all-optical computing framework for RGB image segmentation and lane detection in autonomous driving applications. Our experimental results demonstrate the effectiveness of the DONN system for image segmentation on the CityScapes dataset. Additionally, we conduct case studies on lane detection using a customized indoor track dataset and simulated driving scenarios in CARLA, where we further evaluate the model's generalizability under diverse environmental conditions.
Chinese Translation
语义分割和车道检测是自动驾驶系统中的关键任务。传统方法主要依赖深度神经网络(DNN),由于需要进行广泛的模拟到数字转换和大规模图像计算以实现低延迟的实时响应,因此会产生高能耗。衍射光学神经网络(DONN)在数字或光电计算平台上显示出相较于传统DNN在能效方面的显著优势。通过以光速进行光衍射的全光学图像处理,DONN在节省计算能耗的同时,通过全光编码和计算减少了与模拟到数字转换相关的开销。在本研究中,我们提出了一种新颖的全光学计算框架,用于自动驾驶应用中的RGB图像分割和车道检测。我们的实验结果表明,DONN系统在CityScapes数据集上的图像分割效果显著。此外,我们还对使用定制的室内轨道数据集和CARLA中的模拟驾驶场景进行的车道检测进行了案例研究,进一步评估了模型在不同环境条件下的泛化能力。
cs.CV / 89 / 2602.07768
PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification
PAND:面向提示的邻域蒸馏用于轻量级细粒度视觉分类
Abstract
Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual Classification (FGVC), due to the reliance on fixed prompts and global alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that decouples semantic calibration from structural transfer. First, we incorporate Prompt-Aware Semantic Calibration to generate adaptive semantic anchors. Second, we introduce a neighborhood-aware structural distillation strategy to constrain the student's local decision structure. PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks. Notably, our ResNet-18 student achieves 76.09% accuracy on CUB-200, surpassing the strong baseline VL2Lite by 3.4%. Code is available at https://github.com/LLLVTA/PAND.
Chinese Translation
从大型视觉-语言模型(VLMs)中提取知识到轻量级网络中对于细粒度视觉分类(FGVC)至关重要但具有挑战性,这主要是由于对固定提示和全局对齐的依赖。为了解决这一问题,我们提出了PAND(面向提示的邻域蒸馏),这是一个将语义校准与结构转移解耦的两阶段框架。首先,我们结合面向提示的语义校准生成自适应语义锚点。其次,我们引入了一种邻域感知的结构蒸馏策略,以约束学生的局部决策结构。PAND在四个FGVC基准测试中始终优于最新的先进方法。值得注意的是,我们的ResNet-18学生在CUB-200上达到了76.09%的准确率,超过了强基线VL2Lite 3.4%。代码可在https://github.com/LLLVTA/PAND获取。
cs.CV / 90 / 2602.07775
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
滚动汇聚:桥接有限时域训练与开放式测试中的自回归视频扩散
Abstract
Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradations. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons during training and open-ended horizons during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To explore a training-free solution, we conduct a systematic analysis of AR cache maintenance. These insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales the AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: https://rolling-sink.github.io/
Chinese Translation
最近,自回归(AR)视频扩散模型取得了显著的性能。然而,由于其训练时长有限,在较长的测试时域中出现了训练与测试之间的差距,导致视觉质量迅速下降。继研究训练时长以内的训练-测试差距的自我强制(Self Forcing)之后,本研究探讨超出训练时长的训练-测试差距,即训练期间的有限时域与测试期间的开放时域之间的差距。由于开放式测试可以超出任何有限的训练窗口,并且长视频训练在计算上非常昂贵,我们追求一种无需训练的解决方案来弥合这一差距。为探索无需训练的解决方案,我们对AR缓存维护进行了系统分析。这些洞见促成了滚动汇聚(Rolling Sink)的提出。基于自我强制(仅在5秒片段上训练),滚动汇聚在测试时有效地将AR视频合成扩展到超长时域(例如,以16 FPS生成5-30分钟的视频),并保持主体一致、色彩稳定、结构连贯、运动流畅。大量实验表明,滚动汇聚在长时域视觉保真度和时间一致性方面优于现有的最先进基线。项目页面:https://rolling-sink.github.io/
cs.CV / 91 / 2602.07784
Uncertainty-Aware Counterfactual Traffic Signal Control with Predictive Safety and Starvation-Avoidance Constraints Using Vision-Based Sensing
基于视觉感知的预测安全性和避免饥饿约束的具有不确定性感知的反事实交通信号控制
Abstract
Real-world deployment of adaptive traffic signal control, to date, remains limited due to the uncertainty associated with vision-based perception, implicit safety, and non-interpretable control policies learned and validated mainly in simulation. In this paper, we introduce UCATSC, a model-based traffic signal control system that models traffic signal control at an intersection using a stochastic decision process with constraints and under partial observability, taking into account the uncertainty associated with vision-based perception. Unlike reinforcement learning methods that learn to predict safety using reward shaping, UCATSC predicts and enforces hard constraints related to safety and starvation prevention during counterfactual rollouts in belief space. The system is designed to reduce traffic delay and emissions while preventing safety-critical errors and providing interpretable control policy outputs based on explicit models.
Chinese Translation
迄今为止,自适应交通信号控制在实际部署中仍然受到限制,原因在于与基于视觉的感知相关的不确定性、隐式安全性以及主要在仿真中学习和验证的不可解释控制策略。在本文中,我们介绍了UCATSC,一种基于模型的交通信号控制系统,该系统使用带约束、部分可观测的随机决策过程对交叉口的交通信号控制进行建模,并考虑了与基于视觉的感知相关的不确定性。与通过奖励塑造学习预测安全性的强化学习方法不同,UCATSC在信念空间中的反事实回滚过程中预测并强制执行与安全性和防止饥饿相关的硬约束。该系统旨在减少交通延误和排放,同时防止安全关键错误,并基于显式模型提供可解释的控制策略输出。
cs.CV / 92 / 2602.07801
VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos
VideoTemp-o3:在视频代理思维中协调时间定位与视频理解
Abstract
In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. To address this, recent agentic thinking-with-videos paradigms have emerged, adopting a localize-clip-answer pipeline in which the model actively identifies relevant video segments, performs dense sampling within those clips, and then produces answers. However, existing methods remain inefficient, suffer from weak localization, and adhere to rigid workflows. To solve these issues, we propose VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering. VideoTemp-o3 exhibits strong localization capability, supports on-demand clipping, and can refine inaccurate localizations. Specifically, in the supervised fine-tuning stage, we design a unified masking mechanism that encourages exploration while preventing noise. For reinforcement learning, we introduce dedicated rewards to mitigate reward hacking. Besides, from the data perspective, we develop an effective pipeline to construct high-quality long video grounded QA data, along with a corresponding benchmark for systematic evaluation across various video durations. Experimental results demonstrate that our method achieves remarkable performance on both long video understanding and grounding.
Chinese Translation
在长视频理解中,传统的均匀帧采样常常无法捕捉到关键的视觉证据,从而导致性能下降和幻觉增加。为了解决这个问题,最近出现了代理思维视频(agentic thinking-with-videos)范式,采用了一种定位-剪辑-回答的流程,其中模型主动识别相关的视频片段,在这些片段内进行密集采样,然后生成答案。然而,现有方法仍然效率低下,定位能力较弱,并且遵循僵化的工作流程。为了解决这些问题,我们提出了VideoTemp-o3,一个统一的代理思维视频框架,联合建模视频定位和问答。VideoTemp-o3展现出强大的定位能力,支持按需剪辑,并能够修正不准确的定位。具体而言,在监督微调阶段,我们设计了一种统一的掩蔽机制,鼓励探索同时防止噪声。在强化学习方面,我们引入专门的奖励机制以减轻奖励黑客行为。此外,从数据角度出发,我们开发了一条有效的管道来构建高质量的长视频定位问答数据,并相应地建立了一个基准,以便在不同视频时长下进行系统评估。实验结果表明,我们的方法在长视频理解和定位方面均取得了显著的性能。
cs.CV / 93 / 2602.07814
How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study
开源AI生成图像检测模型的开箱即用性能如何:一项全面的基准研究
Abstract
As AI-generated images proliferate across digital platforms, reliable detection methods have become critical for combating misinformation and maintaining content authenticity. While numerous deepfake detection methods have been proposed, existing benchmarks predominantly evaluate fine-tuned models, leaving a critical gap in understanding out-of-the-box performance -- the most common deployment scenario for practitioners. We present the first comprehensive zero-shot evaluation of 16 state-of-the-art detection methods, comprising 23 pretrained detector variants (due to multiple released versions of certain detectors), across 12 diverse datasets, comprising 2.6~million image samples spanning 291 unique generators including modern diffusion models. Our systematic analysis reveals striking findings: (1)~no universal winner exists, with detector rankings exhibiting substantial instability (Spearman~$\rho$: 0.01 -- 0.87 across dataset pairs); (2)~a 37~percentage-point performance gap separates the best detector (75.0\% mean accuracy) from the worst (37.5\%); (3)~training data alignment critically impacts generalization, causing up to 20--60\% performance variance within architecturally identical detector families; (4)~modern commercial generators (Flux~Dev, Firefly~v4, Midjourney~v7) defeat most detectors, achieving only 18--30\% average accuracy; and (5)~we identify three systematic failure patterns affecting cross-dataset generalization. Statistical analysis confirms significant performance differences between detectors (Friedman test: $\chi^2$=121.01, $p<10^{-16}$, Kendall~$W$=0.524). Our findings challenge the ``one-size-fits-all'' detector paradigm and provide actionable deployment guidelines, demonstrating that practitioners must carefully select detectors based on their specific threat landscape rather than relying on published benchmark performance.
Chinese Translation
随着AI生成图像在数字平台上的广泛传播,可靠的检测方法对于打击虚假信息和维护内容真实性变得至关重要。尽管已有众多深度伪造检测方法被提出,现有基准主要评估经过微调的模型,这在理解开箱即用性能方面存在重要的空白——这是从业者最常见的部署场景。我们首次对16种最先进的检测方法进行了全面的零样本评估,包括23个预训练检测器变体(由于某些检测器发布了多个版本),涵盖12个多样化的数据集,共计260万张图像样本,涉及291个独特的生成器,包括现代扩散模型。我们的系统分析揭示了惊人的发现:(1) 没有普遍的赢家,检测器排名表现出显著的不稳定性(跨数据集对的Spearman~$\rho$: 0.01 -- 0.87);(2) 最佳检测器(75.0\%平均准确率)与最差检测器(37.5\%)之间存在37个百分点的性能差距;(3) 训练数据的对齐对泛化能力有重要影响,导致架构相同的检测器家族内部出现高达20--60\%的性能波动;(4) 现代商业生成器(Flux~Dev、Firefly~v4、Midjourney~v7)击败了大多数检测器,后者仅达到18--30\%的平均准确率;(5) 我们识别出三种影响跨数据集泛化的系统性失败模式。统计分析确认了检测器之间存在显著的性能差异(Friedman检验:$\chi^2=121.01$,$p<10^{-16}$,Kendall~$W=0.524$)。我们的研究挑战了“通用检测器”范式,并提供了可操作的部署指南,表明从业者必须根据特定的威胁环境仔细选择检测器,而不是依赖于已发布的基准性能。
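The ranking-instability finding (Spearman rho from 0.01 to 0.87 across dataset pairs) can be reproduced in spirit with a hand-rolled Spearman correlation over per-dataset detector accuracies. The accuracies below are made up for illustration, not taken from the paper:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation between two score vectors (no ties assumed)."""
    rx = np.argsort(np.argsort(x))  # rank of each entry in x
    ry = np.argsort(np.argsort(y))  # rank of each entry in y
    n = len(x)
    d = rx - ry
    return 1 - 6 * (d**2).sum() / (n * (n**2 - 1))

# Hypothetical per-detector accuracies on two benchmark datasets:
acc_a = np.array([0.75, 0.62, 0.55, 0.48, 0.40])
acc_b = np.array([0.30, 0.70, 0.65, 0.45, 0.60])
print(spearman_rho(acc_a, acc_b))  # -0.1: near-zero rank agreement here
```

A rho near zero (as for these made-up numbers) means a detector's rank on one dataset says almost nothing about its rank on another, which is the practical sense of "no universal winner".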
cs.CV / 94 / 2602.07815
Out of the box age estimation through facial imagery: A Comprehensive Benchmark of Vision-Language Models vs. out-of-the-box Traditional Architectures
基于面部图像的即插即用年龄估计:视觉-语言模型与传统架构的全面基准比较
Abstract
Facial age estimation is critical for content moderation, age verification, and deepfake detection, yet no prior benchmark has systematically compared modern vision-language models (VLMs) against specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating \textbf{34 models} -- 22 specialized architectures with publicly available pretrained weights and 12 general-purpose VLMs -- across \textbf{8 standard datasets} (UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, AgeDB) totaling 1{,}100 test images per model. Our key finding is striking: \emph{zero-shot VLMs significantly outperform most specialized models}, achieving an average MAE of 5.65 years compared to 9.88 for non-LLM models. The best VLM (Gemini~3 Flash Preview, MAE~4.32) outperforms the best non-LLM model (MiVOLO, MAE~5.10) by 15\%. Only MiVOLO, which uniquely combines face and body features via Vision Transformers, competes with VLMs. We further analyze age verification at the 18-year threshold, revealing that non-LLM models exhibit 60--100\% false adult rates on minors while VLMs achieve 13--25\%, and demonstrate that coarse age binning (8--9 classes) consistently degrades MAE beyond 13 years. Our stratified analysis across 14 age groups reveals that all models struggle most at extreme ages ($<$5 and 65+). These findings challenge the assumption that task-specific architectures are necessary for age estimation and suggest that the field should redirect toward distilling VLM capabilities into efficient specialized models.
Chinese Translation
面部年龄估计对于内容审核、年龄验证和深度伪造检测至关重要,但此前没有基准系统性地比较现代视觉-语言模型(VLMs)与专门的年龄估计架构。我们提出了首个大规模跨范式基准,评估了\textbf{34个模型}——22个具有公开可用预训练权重的专门架构和12个通用VLM——在\textbf{8个标准数据集}(UTKFace、IMDB-WIKI、MORPH、AFAD、CACD、FG-NET、APPA-REAL、AgeDB)上的表现,每个模型总计1,100张测试图像。我们的关键发现令人震惊:\emph{零样本VLM显著优于大多数专门模型},其平均绝对误差(MAE)为5.65岁,而非LLM模型为9.88岁。最佳VLM(Gemini~3 Flash Preview,MAE~4.32)比最佳非LLM模型(MiVOLO,MAE~5.10)好15\%。只有通过视觉变换器独特地结合面部和身体特征的MiVOLO能够与VLM竞争。我们进一步分析了18岁阈值下的年龄验证,发现非LLM模型在未成年人上表现出60--100\%的错误成年率,而VLM为13--25\%,并且证明粗略的年龄分组(8--9类)始终使MAE恶化至超过13岁。我们在14个年龄组上的分层分析显示,所有模型在极端年龄($<$5岁和65岁以上)表现最差。这些发现挑战了任务特定架构对于年龄估计必要性的假设,并建议该领域应转向将VLM能力蒸馏为高效的专门模型。
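The quoted 15% gap between the best VLM and MiVOLO is a relative reduction in MAE, which a one-liner confirms using the two MAE figures from the abstract:

```python
# Relative MAE reduction of the best VLM (Gemini 3 Flash Preview,
# MAE 4.32) over the best non-LLM model (MiVOLO, MAE 5.10).
vlm_mae, mivolo_mae = 4.32, 5.10
improvement = (mivolo_mae - vlm_mae) / mivolo_mae
print(f"{improvement:.1%}")  # 15.3%, consistent with the reported 15%
```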
cs.CV / 95 / 2602.07820
Back to Physics: Operator-Guided Generative Paths for SMS MRI Reconstruction
回归物理学:基于算子的生成路径用于SMS MRI重建
Abstract
Simultaneous multi-slice (SMS) imaging with in-plane undersampling enables highly accelerated MRI but yields a strongly coupled inverse problem with deterministic inter-slice interference and missing k-space data. Most diffusion-based reconstructions are formulated around Gaussian-noise corruption and rely on additional consistency steps to incorporate SMS physics, which can be mismatched to the operator-governed degradations in SMS acquisition. We propose an operator-guided framework that models the degradation trajectory using known acquisition operators and inverts this process via deterministic updates. Within this framework, we introduce an operator-conditional dual-stream interaction network (OCDI-Net) that explicitly disentangles target-slice content from inter-slice interference and predicts structured degradations for operator-aligned inversion, and we instantiate reconstruction as a two-stage chained inference procedure that performs SMS slice separation followed by in-plane completion. Experiments on fastMRI brain data and prospectively acquired in vivo diffusion MRI data demonstrate improved fidelity and reduced slice leakage over conventional and learning-based SMS reconstructions.
Chinese Translation
结合平面内欠采样的同时多切片(SMS)成像能够实现高度加速的MRI,但会导致一个强耦合的逆问题,伴随确定性的切片间干扰和缺失的k空间数据。大多数基于扩散的重建方法围绕高斯噪声退化进行构建,并依赖额外的一致性步骤来纳入SMS物理特性,而这些步骤可能与SMS采集中由算子主导的退化不匹配。我们提出了一种算子引导的框架,该框架使用已知的采集算子建模退化轨迹,并通过确定性更新反转这一过程。在该框架内,我们引入了一种算子条件双流交互网络(OCDI-Net),该网络明确地将目标切片内容与切片间干扰解耦,并预测用于算子对齐反转的结构化退化。我们将重建实例化为一个两阶段链式推理过程,首先进行SMS切片分离,然后进行平面内补全。在fastMRI脑部数据和前瞻性采集的体内扩散MRI数据上的实验表明,与传统和基于学习的SMS重建相比,我们的方法在保真度方面有所提升,并减少了切片泄漏。
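As a toy illustration of the slice-coupled inverse problem the abstract refers to: simultaneously excited slices superimpose deterministically, so a single measured image is (to first order) a sum of the individual slice images. Coil sensitivities and CAIPI phase shifts are omitted, and all numbers below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
slices = rng.random((3, 8, 8))   # 3 simultaneously excited slices
collapsed = slices.sum(axis=0)   # deterministic inter-slice mixing

# Recovering `slices` from `collapsed` is underdetermined: 3*64
# unknowns from only 64 measurements, which is why SMS reconstruction
# needs strong priors (or, as in this paper, operator-guided inversion).
assert collapsed.shape == (8, 8)
```

In a real SMS acquisition the mixing operator is known from the sequence design, which is precisely what lets an operator-guided method model the degradation trajectory explicitly instead of treating it as Gaussian noise.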
cs.CV / 96 / 2602.07827
Open-Text Aerial Detection: A Unified Framework For Aerial Visual Grounding And Detection
开放文本空中检测:空中视觉定位与检测的统一框架
Abstract
Open-Vocabulary Aerial Detection (OVAD) and Remote Sensing Visual Grounding (RSVG) have emerged as two key paradigms for aerial scene understanding. However, each paradigm suffers from inherent limitations when operating in isolation: OVAD is restricted to coarse category-level semantics, while RSVG is structurally limited to single-target localization. These limitations prevent existing methods from simultaneously supporting rich semantic understanding and multi-target detection. To address this, we propose OTA-Det, the first unified framework that bridges both paradigms into a cohesive architecture. Specifically, we introduce a task reformulation strategy that unifies task objectives and supervision mechanisms, enabling joint training across datasets from both paradigms with dense supervision signals. Furthermore, we propose a dense semantic alignment strategy that establishes explicit correspondence at multiple granularities, from holistic expressions to individual attributes, enabling fine-grained semantic understanding. To ensure real-time efficiency, OTA-Det builds upon the RT-DETR architecture, extending it from closed-set detection to open-text detection by introducing several high efficient modules, achieving state-of-the-art performance on six benchmarks spanning both OVAD and RSVG tasks while maintaining real-time inference at 34 FPS.
Chinese Translation
开放词汇空中检测(Open-Vocabulary Aerial Detection, OVAD)和遥感视觉定位(Remote Sensing Visual Grounding, RSVG)已成为空中场景理解的两个关键范式。然而,这两个范式在孤立操作时都存在固有的局限性:OVAD受限于粗略的类别级语义,而RSVG在结构上仅限于单目标定位。这些局限性阻碍了现有方法同时支持丰富的语义理解和多目标检测。为了解决这一问题,我们提出了OTA-Det,这是第一个将这两种范式桥接成一个统一架构的框架。具体而言,我们引入了一种任务重构策略,统一任务目标和监督机制,使得能够在来自两个范式的数据集上进行联合训练,并提供密集的监督信号。此外,我们提出了一种密集语义对齐策略,在多个粒度上建立明确的对应关系,从整体表达到个体属性,实现细粒度的语义理解。为了确保实时效率,OTA-Det基于RT-DETR架构进行构建,通过引入多个高效模块,将其从闭集检测扩展到开放文本检测,在六个涵盖OVAD和RSVG任务的基准测试中实现了最先进的性能,同时保持34帧每秒的实时推理。
cs.CV / 97 / 2602.07833
SPD-Faith Bench: Diagnosing and Improving Faithfulness in Chain-of-Thought for Multimodal Large Language Models
SPD-Faith Bench:诊断和改善多模态大语言模型中的思维链忠实性
Abstract
Chain-of-Thought reasoning is widely used to improve the interpretability of multimodal large language models (MLLMs), yet the faithfulness of the generated reasoning traces remains unclear. Prior work has mainly focused on perceptual hallucinations, leaving reasoning level unfaithfulness underexplored. To isolate faithfulness from linguistic priors, we introduce SPD-Faith Bench, a diagnostic benchmark based on fine-grained image difference reasoning that enforces explicit visual comparison. Evaluations on state-of-the-art MLLMs reveal two systematic failure modes, perceptual blindness and perception-reasoning dissociation. We trace these failures to decaying visual attention and representation shifts in the residual stream. Guided by this analysis, we propose SAGE, a train-free visual evidence-calibrated framework that improves visual routing and aligns reasoning with perception. Our results highlight the importance of explicitly evaluating faithfulness beyond response correctness. Our benchmark and codes are available at https://github.com/Johanson-colab/SPD-Faith-Bench.
Chinese Translation
思维链推理被广泛应用于提高多模态大语言模型(MLLMs)的可解释性,但生成的推理轨迹的忠实性仍不清楚。之前的研究主要集中在感知幻觉上,而推理层面的不忠实性则未得到充分探讨。为了将忠实性与语言先验分离,我们引入了SPD-Faith Bench,这是一个基于细粒度图像差异推理的诊断基准,强调明确的视觉比较。对最先进的MLLMs的评估揭示了两种系统性失效模式:感知盲视和感知-推理分离。我们将这些失效追溯到视觉注意力衰减和残差流中的表示转移。在这一分析的指导下,我们提出了SAGE,一个无训练的视觉证据校准框架,改善视觉路由并将推理与感知对齐。我们的结果强调了超越响应正确性明确评估忠实性的重要性。我们的基准和代码可在 https://github.com/Johanson-colab/SPD-Faith-Bench 获取。
cs.CV / 98 / 2602.07835
VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping
VFace:一种无训练的基于扩散的视频换脸方法
Abstract
We present a training-free, plug-and-play method, namely VFace, for high-quality face swapping in videos. It can be seamlessly integrated with image-based face swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique to facilitate generation while keeping key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection to better align the structural features from the target frame to the generation. Third, we present a Flow-Guided Attention Temporal Smoothening mechanism that enforces spatiotemporal coherence without modifying the underlying diffusion model to reduce temporal inconsistencies typically encountered in frame-wise generation. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that our method significantly enhances temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code is available at https://github.com/Sanoojan/VFace.
Chinese Translation
我们提出了一种无训练、即插即用的方法,称为VFace,用于高质量视频换脸。该方法可以与基于扩散模型的图像换脸方法无缝集成。首先,我们引入了一种频谱注意力插值技术,以促进生成和保持关键身份特征的完整性。其次,我们通过即插即用的注意力注入实现目标结构引导,以更好地对齐目标帧与生成结果的结构特征。第三,我们提出了一种流引导的注意力时间平滑机制,该机制在不修改基础扩散模型的情况下,强制实现时空一致性,从而减少在逐帧生成中通常遇到的时间不一致性。我们的方法不需要额外的训练或视频特定的微调。大量实验表明,我们的方法显著增强了时间一致性和视觉真实感,为基于视频的换脸提供了一种实用且模块化的解决方案。我们的代码可在 https://github.com/Sanoojan/VFace 获取。
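The abstract names a Frequency Spectrum Attention Interpolation but gives no formula. As a loose, hypothetical illustration of spectrum-domain blending (not the authors' mechanism, and applied here to plain 2D feature maps rather than attention), one can keep the low-frequency band of an identity source and take the high-frequency band from a target:

```python
import numpy as np

def spectrum_blend(src, tgt, cutoff=0.25):
    """Blend two same-shaped 2D maps in the frequency domain:
    low frequencies (coarse, identity-like content) come from `src`,
    high frequencies (fine detail) from `tgt`. Illustrative only."""
    assert src.shape == tgt.shape
    h, w = src.shape
    Fs = np.fft.fftshift(np.fft.fft2(src))   # center the DC component
    Ft = np.fft.fftshift(np.fft.fft2(tgt))
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = h // 2, w // 2
    # normalized radial frequency of each bin
    r = np.sqrt(((yy - cy) / h) ** 2 + ((xx - cx) / w) ** 2)
    low = r <= cutoff                        # boolean low-pass mask
    blended = np.where(low, Fs, Ft)
    return np.fft.ifft2(np.fft.ifftshift(blended)).real
```

Blending a map with itself is the identity, which is a quick sanity check that the masking partitions the spectrum without loss.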
cs.CV / 99 / 2602.07854
Geometry-Aware Rotary Position Embedding for Consistent Video World Model
几何感知旋转位置嵌入用于一致的视频世界模型
Abstract
Predictive world models that simulate future observations under explicit camera control are fundamental to interactive AI. Despite rapid advances, current systems lack spatial persistence: they fail to maintain stable scene structures over long trajectories, frequently hallucinating details when cameras revisit previously observed locations. We identify that this geometric drift stems from reliance on screen-space positional embeddings, which conflict with the projective geometry required for 3D consistency. We introduce \textbf{ViewRope}, a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. By parameterizing attention with relative ray geometry rather than pixel locality, ViewRope provides a model-native inductive bias for retrieving 3D-consistent content across temporal gaps. We further propose \textbf{Geometry-Aware Frame-Sparse Attention}, which exploits these geometric cues to selectively attend to relevant historical frames, improving efficiency without sacrificing memory consistency. We also present \textbf{ViewBench}, a diagnostic suite measuring loop-closure fidelity and geometric drift. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational costs.
Chinese Translation
预测世界模型在明确的相机控制下模拟未来观察,对于交互式人工智能至关重要。尽管取得了快速进展,当前系统仍缺乏空间持久性:它们无法在长轨迹中维持稳定的场景结构,常常在相机重新访问先前观察的位置时产生幻觉细节。我们发现,这种几何漂移源于对屏幕空间位置嵌入的依赖,这与实现三维一致性所需的投影几何相冲突。我们提出了ViewRope,一种几何感知编码,直接将相机光线方向注入视频Transformer的自注意力层。通过用相对光线几何而非像素局部性来参数化注意力,ViewRope为跨时间间隔检索三维一致内容提供了模型原生的归纳偏置。我们进一步提出了几何感知帧稀疏注意力,利用这些几何线索选择性地关注相关的历史帧,在不牺牲记忆一致性的前提下提高了效率。我们还推出了ViewBench,一个用于测量回环闭合保真度和几何漂移的诊断套件。我们的结果表明,ViewRope显著提高了长期一致性,同时降低了计算成本。
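ViewRope couples rotary-style position encoding to camera-ray geometry. The exact parameterization is not given in the abstract; the sketch below shows the two assumed ingredients: per-pixel ray directions under a pinhole camera model, and a standard 2D rotary rotation whose key property is that dot products depend only on *relative* angles:

```python
import numpy as np

def pixel_ray_dirs(h, w, fx, fy, cx, cy):
    """Unit ray direction for each pixel under a pinhole model
    with focal lengths (fx, fy) and principal point (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.stack([(xs - cx) / fx, (ys - cy) / fy, np.ones((h, w))], axis=-1)
    return d / np.linalg.norm(d, axis=-1, keepdims=True)

def rope_rotate(x, theta):
    """Rotate consecutive feature pairs of x (..., 2k) by angle theta (...).
    In a ViewRope-style scheme, theta would be derived from the ray
    direction (e.g. its azimuth/elevation) rather than a pixel index."""
    theta = np.asarray(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    c, s = np.cos(theta)[..., None], np.sin(theta)[..., None]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * c - x2 * s
    out[..., 1::2] = x1 * s + x2 * c
    return out
```

Because rotation is norm-preserving, `rope_rotate(q, a) @ rope_rotate(k, b)` equals `rope_rotate(q, a - b) @ k`, which is exactly the relative-geometry bias the abstract describes.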
cs.CV / 100 / 2602.07860
Recovering 3D Shapes from Ultra-Fast Motion-Blurred Images
从超快运动模糊图像恢复三维形状
Abstract
We consider the problem of 3D shape recovery from ultra-fast motion-blurred images. While 3D reconstruction from static images has been extensively studied, recovering geometry from extreme motion-blurred images remains challenging. Such scenarios frequently occur in both natural and industrial settings, such as fast-moving objects in sports (e.g., balls) or rotating machinery, where rapid motion distorts object appearance and makes traditional 3D reconstruction techniques like Multi-View Stereo (MVS) ineffective. In this paper, we propose a novel inverse rendering approach for shape recovery from ultra-fast motion-blurred images. While conventional rendering techniques typically synthesize blur by averaging across multiple frames, we identify a major computational bottleneck in the repeated computation of barycentric weights. To address this, we propose a fast barycentric coordinate solver, which significantly reduces computational overhead and achieves a speedup of up to 4.57x, enabling efficient and photorealistic simulation of high-speed motion. Crucially, our method is fully differentiable, allowing gradients to propagate from rendered images to the underlying 3D shape, thereby facilitating shape recovery through inverse rendering. We validate our approach on two representative motion types: rapid translation and rotation. Experimental results demonstrate that our method enables efficient and realistic modeling of ultra-fast moving objects in the forward simulation. Moreover, it successfully recovers 3D shapes from 2D imagery of objects undergoing extreme translational and rotational motion, advancing the boundaries of vision-based 3D reconstruction. Project page: https://maxmilite.github.io/rec-from-ultrafast-blur/
Chinese Translation
我们考虑从超快运动模糊图像中恢复三维形状的问题。尽管从静态图像进行三维重建已经得到了广泛研究,但从极端运动模糊图像中恢复几何形状仍然具有挑战性。这种情况在自然和工业环境中经常发生,例如体育运动中的快速移动物体(如球)或旋转机械,其中快速运动扭曲了物体的外观,使得传统的三维重建技术(如多视图立体(Multi-View Stereo, MVS))失效。在本文中,我们提出了一种新颖的逆渲染方法,用于从超快运动模糊图像中恢复形状。虽然传统的渲染技术通常通过对多个帧进行平均来合成模糊,但我们识别出在重复计算重心权重时存在一个主要的计算瓶颈。为了解决这个问题,我们提出了一种快速的重心坐标求解器,显著减少了计算开销,并实现了高达4.57倍的加速,从而能够高效且逼真地模拟高速运动。关键是,我们的方法是完全可微的,允许梯度从渲染图像传播到基础的三维形状,从而通过逆渲染促进形状恢复。我们在两种代表性的运动类型上验证了我们的方法:快速平移和旋转。实验结果表明,我们的方法能够在正向模拟中高效且真实地建模超快移动物体。此外,它成功地从经历极端平移和旋转运动的物体的二维图像中恢复三维形状,推动了基于视觉的三维重建的边界。项目页面:https://maxmilite.github.io/rec-from-ultrafast-blur/
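The abstract does not describe the fast solver itself, but the bottleneck it targets is the standard per-point barycentric computation that blur-averaging renderers evaluate repeatedly. For reference, the textbook solve (Cramer's-rule form) looks like this; the paper's 4.57x speedup presumably comes from restructuring or batching this step, not from a different formula:

```python
import numpy as np

def barycentric(p, a, b, c):
    """Barycentric coordinates (u, v, w) of 2D point p in triangle (a, b, c),
    such that p = u*a + v*b + w*c and u + v + w = 1."""
    v0, v1, v2 = b - a, c - a, p - a
    d00 = v0 @ v0
    d01 = v0 @ v1
    d11 = v1 @ v1
    d20 = v2 @ v0
    d21 = v2 @ v1
    denom = d00 * d11 - d01 * d01     # zero only for degenerate triangles
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return 1.0 - v - w, v, w
```

Points inside the triangle have all three weights in [0, 1]; differentiability of these weights with respect to the vertices is what lets gradients flow from rendered pixels back to the shape.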
cs.CV / 101 / 2602.07864
Thinking in Structures: Evaluating Spatial Intelligence through Reasoning on Constrained Manifolds
结构思维:通过对受限流形的推理评估空间智能
Abstract
Spatial intelligence is crucial for vision-language models (VLMs) in the physical world, yet many benchmarks evaluate largely unconstrained scenes where models can exploit 2D shortcuts. We introduce SSI-Bench, a VQA benchmark for spatial reasoning on constrained manifolds, built from complex real-world 3D structures whose feasible configurations are tightly governed by geometric, topological, and physical constraints. SSI-Bench contains 1,000 ranking questions spanning geometric and topological reasoning and requiring a diverse repertoire of compositional spatial operations, such as mental rotation, cross-sectional inference, occlusion reasoning, and force-path reasoning. It is created via a fully human-centered pipeline: ten researchers spent over 400 hours curating images, annotating structural components, and designing questions to minimize pixel-level cues. Evaluating 31 widely used VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Encouraging models to think yields only marginal gains, and error analysis points to failures in structural grounding and constraint-consistent 3D reasoning. Project page: https://ssi-bench.github.io.
Chinese Translation
空间智能对于物理世界中的视觉-语言模型(VLMs)至关重要,但许多基准测试主要评估不受限制的场景,模型可以利用二维捷径。我们引入了SSI-Bench,这是一个针对受限流形的空间推理的视觉问答(VQA)基准,基于复杂的真实世界三维结构,其可行配置受到几何、拓扑和物理约束的严格限制。SSI-Bench包含1000个排名问题,涵盖几何和拓扑推理,并要求多样的组合空间操作能力,如心理旋转、横截面推理、遮挡推理和力路径推理。该基准通过一个完全以人为中心的流程创建:十位研究人员花费超过400小时策划图像、注释结构组件和设计问题,以最小化像素级线索。对31个广泛使用的VLM进行评估显示出与人类之间的巨大差距:最佳的开源模型准确率为22.2%,最强的闭源模型达到33.6%,而人类得分为91.6%。鼓励模型进行思考仅带来了微小的提升,错误分析表明失败源于结构锚定和约束一致的三维推理。项目页面:https://ssi-bench.github.io。
cs.CV / 102 / 2602.07872
WristMIR: Coarse-to-Fine Region-Aware Retrieval of Pediatric Wrist Radiographs with Radiology Report-Driven Learning
WristMIR:基于区域感知的儿童腕部X光片粗到细的检索与放射学报告驱动学习
Abstract
Retrieving wrist radiographs with analogous fracture patterns is challenging because clinically important cues are subtle, highly localized and often obscured by overlapping anatomy or variable imaging views. Progress is further limited by the scarcity of large, well-annotated datasets for case-based medical image retrieval. We introduce WristMIR, a region-aware pediatric wrist radiograph retrieval framework that leverages dense radiology reports and bone-specific localization to learn fine-grained, clinically meaningful image representations without any manual image-level annotations. Using MedGemma-based structured report mining to generate both global and region-level captions, together with pre-processed wrist images and bone-specific crops of the distal radius, distal ulna, and ulnar styloid, WristMIR jointly trains global and local contrastive encoders and performs a two-stage retrieval process: (1) coarse global matching to identify candidate exams, followed by (2) region-conditioned reranking aligned to a predefined anatomical bone region. WristMIR improves retrieval performance over strong vision-language baselines, raising image-to-text Recall@5 from 0.82% to 9.35%. Its embeddings also yield stronger fracture classification (AUROC 0.949, AUPRC 0.953). In region-aware evaluation, the two-stage design markedly improves retrieval-based fracture diagnosis, increasing mean $F_1$ from 0.568 to 0.753, and radiologists rate its retrieved cases as more clinically relevant, with mean scores rising from 3.36 to 4.35. These findings highlight the potential of anatomically guided retrieval to enhance diagnostic reasoning and support clinical decision-making in pediatric musculoskeletal imaging. The source code is publicly available at https://github.com/quin-med-harvard-edu/WristMIR.
Chinese Translation
检索具有相似骨折模式的腕部X光片具有挑战性,因为临床重要线索往往细微、高度局部化,并且常常被重叠的解剖结构或不同的成像视角所遮挡。此外,针对基于案例的医学图像检索,缺乏大型、良好注释的数据集进一步限制了进展。我们提出了WristMIR,一个区域感知的儿童腕部X光片检索框架,利用密集的放射学报告和特定骨骼的定位,学习细粒度的、具有临床意义的图像表示,而无需任何手动的图像级注释。通过基于MedGemma的结构化报告挖掘生成全局和区域级的标题,以及预处理的腕部图像和远端桡骨、远端尺骨及尺骨茎突的特定骨骼裁剪,WristMIR共同训练全局和局部对比编码器,并执行两阶段的检索过程:(1)粗略的全局匹配以识别候选检查,随后进行(2)与预定义解剖骨区域对齐的区域条件重排序。WristMIR在强大的视觉-语言基线之上提高了检索性能,将图像到文本的Recall@5从0.82%提升至9.35%。其嵌入也在骨折分类中表现出更强的效果(AUROC 0.949,AUPRC 0.953)。在区域感知评估中,两阶段设计显著改善了基于检索的骨折诊断,将平均$F_1$从0.568提高至0.753,放射科医师认为其检索的病例在临床上更具相关性,平均评分从3.36上升至4.35。这些发现突显了解剖引导检索在增强诊断推理和支持儿童肌肉骨骼成像临床决策中的潜力。源代码可在https://github.com/quin-med-harvard-edu/WristMIR公开获取。
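The reported Recall@K numbers and the coarse-to-fine design can be made concrete with a small sketch. `two_stage_retrieve` below is a hypothetical simplification of WristMIR's global-matching-then-region-reranking (the real system conditions reranking on an anatomical bone region); both functions assume query i's true match is gallery item i:

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j]: similarity of query i to gallery item j; ground truth
    is j == i. Returns the fraction of queries whose match is in the top-k."""
    topk = np.argsort(-sim, axis=1)[:, :k]
    return float(np.mean([i in topk[i] for i in range(sim.shape[0])]))

def two_stage_retrieve(global_sim, region_sim, m, k):
    """Stage 1: coarse global matching keeps the top-m candidates.
    Stage 2: rerank those candidates by region-level similarity."""
    results = []
    for i in range(global_sim.shape[0]):
        cand = np.argsort(-global_sim[i])[:m]
        rerank = cand[np.argsort(-region_sim[i, cand])]
        results.append(rerank[:k])
    return results
```

The two-stage split is the usual efficiency trade: the cheap global score prunes the gallery so the finer (and costlier) region score only touches m items per query.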
cs.CV / 103 / 2602.07891
Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video
通过互联网视频的弱监督实现3D几何基础模型的可扩展适应
Abstract
Geometric foundation models show promise in 3D reconstruction, yet their progress is severely constrained by the scarcity of diverse, large-scale 3D annotations. While Internet videos offer virtually unlimited raw data, utilizing them as a scaling source for geometric learning is challenging due to the absence of ground-truth geometry and the presence of observational noise. To address this, we propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. SAGE leverages a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision: (1) Informative training trajectory selection; (2) Sparse Geometric Anchoring via SfM point clouds for global structural guidance; and (3) Dense Differentiable Consistency via 3D Gaussian rendering for multi-view constraints. To prevent catastrophic forgetting, we introduce a regularization strategy using anchor data. Extensive experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks (7Scenes, TUM-RGBD, Matterport3D) compared to state-of-the-art baselines. To our knowledge, SAGE pioneers the adaptation of geometric foundation models via Internet video, establishing a scalable paradigm for general-purpose 3D learning.
Chinese Translation
几何基础模型在3D重建中展现出良好的前景,但由于缺乏多样化的大规模3D标注,其进展受到严重限制。尽管互联网视频提供了几乎无限的原始数据,但由于缺乏真实几何和存在观测噪声,利用这些数据作为几何学习的扩展来源面临挑战。为了解决这一问题,我们提出了SAGE,一个从原始视频流中可扩展适应几何基础模型的框架。SAGE利用层次化挖掘管道将视频转化为训练轨迹和混合监督:(1) 信息丰富的训练轨迹选择;(2) 通过结构从运动(SfM)点云进行稀疏几何锚定以提供全局结构指导;(3) 通过3D高斯渲染实现密集可微一致性以满足多视角约束。为了防止灾难性遗忘,我们引入了一种使用锚数据的正则化策略。大量实验表明,SAGE显著增强了零样本泛化能力,在未见基准(7Scenes, TUM-RGBD, Matterport3D)上相比于最先进的基线减少了20-42%的Chamfer距离。据我们所知,SAGE开创了通过互联网视频适应几何基础模型的先河,为通用3D学习建立了一个可扩展的范式。
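Chamfer Distance, the metric SAGE reports, is the symmetric mean nearest-neighbour distance between two point sets. A brute-force O(NM) version is enough to pin down the definition (benchmark evaluations typically use a KD-tree for large clouds):

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3):
    mean squared nearest-neighbour distance in both directions."""
    # pairwise squared distances, shape (N, M)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1) ** 2
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

A 20-42% reduction in this quantity means reconstructed points sit markedly closer, on average, to the ground-truth surface (and vice versa).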
cs.CV / 104 / 2602.07899
Rethinking Practical and Efficient Quantization Calibration for Vision-Language Models
重新思考视觉语言模型的实用高效量化校准
Abstract
Post-training quantization (PTQ) is a primary approach for deploying large language models without fine-tuning, and the quantized performance is often strongly affected by the calibration in PTQ. By contrast, in vision-language models (VLMs), substantial differences between visual and text tokens in their activation distributions and sensitivities to quantization error pose significant challenges for effective calibration during PTQ. In this work, we rethink what PTQ calibration should align with in VLMs and propose the Token-level Importance-aware Layer-wise Quantization framework (TLQ). Guided by gradient information, we design a token-level importance integration mechanism for quantization error, and use it to construct a token-level calibration set, enabling a more fine-grained calibration strategy. Furthermore, TLQ introduces a multi-GPU, quantization-exposed layer-wise calibration scheme. This scheme keeps the layer-wise calibration procedure consistent with the true quantized inference path and distributes the complex layer-wise calibration workload across multiple RTX3090 GPUs, thereby reducing reliance on the large memory of A100 GPUs. TLQ is evaluated across two models, three model scales, and two quantization settings, consistently achieving performance improvements across all settings, indicating its strong quantization stability. The code will be released publicly.
Chinese Translation
后训练量化(PTQ)是部署大型语言模型而无需微调的主要方法,而量化性能通常受到PTQ中校准的强烈影响。相比之下,在视觉语言模型(VLMs)中,视觉和文本标记在激活分布和对量化误差的敏感性方面存在显著差异,这对PTQ期间有效校准提出了重大挑战。在本研究中,我们重新思考了VLMs中PTQ校准应与什么对齐,并提出了基于标记重要性感知的分层量化框架(Token-level Importance-aware Layer-wise Quantization,TLQ)。在梯度信息的指导下,我们设计了一种标记级重要性集成机制来处理量化误差,并利用该机制构建了一个标记级校准集,从而实现更细粒度的校准策略。此外,TLQ引入了一种多GPU、量化暴露的分层校准方案。该方案使分层校准过程与真实的量化推理路径保持一致,并将复杂的分层校准工作负载分配到多个RTX3090 GPU上,从而减少对A100 GPU大内存的依赖。TLQ在两个模型、三个模型规模和两种量化设置下进行了评估,在所有设置中始终实现了性能提升,表明其具有良好的量化稳定性。代码将公开发布。
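TLQ's exact token-importance mechanism is not specified in the abstract. As a hedged sketch of the general idea only, per-token gradient norms can weight the quantization error inside a grid search for a scale, so calibration favours tokens that matter most; the function names and the grid-search formulation here are illustrative stand-ins, not the paper's method:

```python
import numpy as np

def quantize(x, scale, bits=8):
    """Uniform symmetric fake-quantization (quantize then dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def importance_weighted_scale(acts, grads, bits=8, n_grid=80):
    """Pick a scale minimizing token-importance-weighted quantization error.
    acts: (tokens, dim) activations; grads: matching gradients.
    Importance = normalized per-token gradient norm."""
    w = np.linalg.norm(grads, axis=1)
    w = w / w.sum()
    best, best_err = None, np.inf
    amax = np.abs(acts).max()
    for s in np.linspace(amax / n_grid, amax, n_grid):
        scale = s / (2 ** (bits - 1) - 1)
        err = ((acts - quantize(acts, scale, bits)) ** 2).mean(axis=1)
        werr = float(w @ err)                 # importance-weighted error
        if werr < best_err:
            best, best_err = scale, werr
    return best
```

Because the naive min-max scale is one of the grid points, the weighted search can never do worse than plain min-max calibration under this error measure.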
cs.CV / 105 / 2602.07931
Which private attributes do VLMs agree on and predict well?
视觉语言模型(VLMs)在隐私属性上的一致性与预测能力
Abstract
Visual Language Models (VLMs) are often used for zero-shot detection of visual attributes in images. We present a zero-shot evaluation of open-source VLMs for privacy-related attribute recognition. We identify the attributes for which VLMs exhibit strong inter-annotator agreement, and discuss the disagreement cases between human and VLM annotations. Our results show that when evaluated against human annotations, VLMs tend to predict the presence of privacy attributes more often than human annotators. In addition to this, we find that in cases of high inter-annotator agreement between VLMs, they can complement human annotation by identifying attributes overlooked by human annotators. This highlights the potential of VLMs to support privacy annotations in large-scale image datasets.
Chinese Translation
视觉语言模型(VLMs)常用于对图像中视觉属性的零样本检测。我们对开源VLMs在隐私相关属性识别方面进行了零样本评估。我们识别出VLMs表现出强烈的标注者间一致性的属性,并讨论了人类与VLM标注之间的不一致案例。我们的结果表明,在与人类标注进行比较时,VLMs倾向于比人类标注者更频繁地预测隐私属性的存在。此外,我们发现,在VLMs之间存在高标注者间一致性的情况下,它们可以通过识别被人类标注者忽视的属性来补充人类标注。这突显了VLMs在大规模图像数据集中的隐私标注支持潜力。
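Inter-annotator agreement on a binary attribute (between two VLMs, or a VLM and a human) is commonly scored with Cohen's kappa, which corrects raw agreement for the agreement expected by chance:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' binary labels (lists of 0/1).
    1.0 = perfect agreement, 0.0 = chance-level, negative = worse than chance."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n               # each annotator's rate of "1"
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)          # chance agreement
    if pe == 1.0:                                    # both annotators constant
        return 1.0
    return (po - pe) / (1 - pe)
```

The chance correction matters for privacy attributes, which are often rare: two annotators who almost always answer "absent" can show high raw agreement yet near-zero kappa.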
cs.CV / 106 / 2602.07938
Integrating Specialized and Generic Agent Motion Prediction with Dynamic Occupancy Grid Maps
将专业化与通用代理运动预测整合到动态占用网格地图中
Abstract
Accurate prediction of driving scenes is a challenging task due to uncertainty in sensor data, the complex behaviors of agents, and the possibility of multiple feasible futures. Existing prediction methods using occupancy grid maps primarily focus on agent-agnostic scene predictions, while agent-specific predictions provide specialized behavior insights with the help of semantic information. However, both paradigms face distinct limitations: agent-agnostic models struggle to capture the behavioral complexities of dynamic actors, whereas agent-specific approaches fail to generalize to poorly perceived or unrecognized agents; combining both enables robust and safer motion forecasting. To address this, we propose a unified framework by leveraging Dynamic Occupancy Grid Maps within a streamlined temporal decoding pipeline to simultaneously predict future occupancy state grids, vehicle grids, and scene flow grids. Relying on a lightweight spatiotemporal backbone, our approach is centered on a tailored, interdependent loss function that captures inter-grid dependencies and enables diverse future predictions. By using occupancy state information to enforce flow-guided transitions, the loss function acts as a regularizer that directs occupancy evolution while accounting for obstacles and occlusions. Consequently, the model not only predicts the specific behaviors of vehicle agents, but also identifies other dynamic entities and anticipates their evolution within the complex scene. Evaluations on real-world nuScenes and Woven Planet datasets demonstrate superior prediction performances for dynamic vehicles and generic dynamic scene elements compared to baseline methods.
Chinese Translation
由于传感器数据的不确定性、代理的复杂行为以及多种可行未来的可能性,准确预测驾驶场景是一项具有挑战性的任务。现有的使用占用网格地图的预测方法主要集中于与代理无关的场景预测,而特定代理的预测则借助语义信息提供专业的行为洞察。然而,这两种范式都面临着明显的限制:与代理无关的模型难以捕捉动态参与者的行为复杂性,而特定代理的方法则无法推广到感知不良或未识别的代理;将两者结合能够实现更稳健和安全的运动预测。为了解决这个问题,我们提出了一个统一框架,通过在简化的时间解码管道中利用动态占用网格地图,同时预测未来的占用状态网格、车辆网格和场景流网格。我们的方案依赖于轻量级的时空骨干,围绕一个定制的、相互依赖的损失函数展开,该损失函数捕捉网格间的依赖关系,并支持多样化的未来预测。通过使用占用状态信息来强制执行流导向的转变,损失函数作为正则化器,引导占用演变,同时考虑障碍物和遮挡。因此,该模型不仅预测车辆代理的特定行为,还识别其他动态实体,并预测它们在复杂场景中的演变。在真实世界的nuScenes和Woven Planet数据集上的评估显示,与基线方法相比,我们的方法在动态车辆和通用动态场景元素的预测性能上表现优越。
cs.CV / 107 / 2602.07955
One-Shot Crowd Counting With Density Guidance For Scene Adaptation
基于密度引导的一次性人群计数与场景适应
Abstract
Crowd scenes captured by cameras at different locations vary greatly, and existing crowd models generalize poorly to unseen surveillance scenes. To improve the generalization of the model, we regard different surveillance scenes as different scene categories and introduce few-shot learning so that the model adapts to an unseen surveillance scene belonging to the given exemplar scene category. To this end, we propose to leverage local and global density characteristics to guide the crowd counting model for unseen surveillance scenes. Specifically, to enable the model to adapt to the varying density distributions in the target scene, we propose a multiple local density learner that learns multiple prototypes representing different density distributions in the support scene. These multiple local density similarity matrices are then encoded and used to guide the model in a local way. To further adapt to the global density of the target scene, global density features are extracted from the support image and used to guide the model in a global way. Experiments on three surveillance datasets show that the proposed method adapts to unseen surveillance scenes and outperforms recent state-of-the-art methods in few-shot crowd counting.
Chinese Translation
不同地点拍摄的人群场景差异很大,现有的人群模型在未见监控场景中的泛化能力有限。为了提高模型的泛化能力,我们将不同的监控场景视为不同类别的场景,并引入少样本学习,使模型能够适应属于给定示例类别场景的未见监控场景。为此,我们提出利用局部和全局密度特征来引导未见监控场景的人群计数模型。具体而言,为了使模型适应目标场景中变化的密度,我们提出了多局部密度学习器,以学习表示支持场景中不同密度分布的多原型。随后,这些多局部密度相似性矩阵被编码,并用于以局部方式引导模型。为了进一步适应目标场景中的全局密度,从支持图像中提取全局密度特征,然后用于以全局方式引导模型。在三个监控数据集上的实验表明,所提出的方法能够适应未见监控场景,并在少样本人群计数中优于最近的最先进方法。
cs.CV / 108 / 2602.07960
D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning
D-ORCA:面向对话的鲁棒音视频字幕优化
Abstract
Spoken dialogue is a primary source of information in videos; therefore, accurately identifying who spoke what and when is essential for deep video understanding. We introduce D-ORCA, a \textbf{d}ialogue-centric \textbf{o}mni-modal large language model optimized for \textbf{r}obust audio-visual \textbf{ca}ptioning. We further curate DVD, a large-scale, high-quality bilingual dataset comprising nearly 40,000 multi-party dialogue videos for training and 2000 videos for evaluation in English and Mandarin, addressing a critical gap in the open-source ecosystem. To ensure fine-grained captioning accuracy, we adopt group relative policy optimization with three novel reward functions that assess speaker attribution accuracy, global speech content accuracy, and sentence-level temporal boundary alignment. These rewards are derived from evaluation metrics widely used in speech processing and, to our knowledge, are applied for the first time as reinforcement learning objectives for audio-visual captioning. Extensive experiments demonstrate that D-ORCA substantially outperforms existing open-source models in speaker identification, speech recognition, and temporal grounding. Notably, despite having only 8 billion parameters, D-ORCA achieves performance competitive with Qwen3-Omni across several general-purpose audio-visual understanding benchmarks. Demos are available at \href{https://d-orca-llm.github.io/}{https://d-orca-llm.github.io/}. Our code, data, and checkpoints will be available at \href{https://github.com/WeChatCV/D-ORCA/}{https://github.com/WeChatCV/D-ORCA/}.
Chinese Translation
口语对话是视频中信息的主要来源,因此,准确识别谁在何时说了什么对于深入理解视频至关重要。我们提出了D-ORCA,这是一种以对话为中心的全模态大型语言模型,针对鲁棒的音视频字幕进行了优化。我们进一步整理了DVD,这是一个大规模、高质量的双语数据集,包含近40,000个多方对话视频用于训练,以及2,000个视频用于评估,涵盖英语和普通话,填补了开源生态系统中的一个关键空白。为了确保细粒度的字幕准确性,我们采用了群体相对策略优化,并引入了三种新颖的奖励函数,评估说话者归属准确性、全局语音内容准确性和句子级时间边界对齐。这些奖励源自于广泛用于语音处理的评估指标,并且据我们所知,首次作为音视频字幕的强化学习目标进行应用。大量实验表明,D-ORCA在说话者识别、语音识别和时间定位方面显著优于现有的开源模型。值得注意的是,尽管只有80亿个参数,D-ORCA在多个通用音视频理解基准测试中表现出与Qwen3-Omni相当的竞争力。演示可在 https://d-orca-llm.github.io/ 获取。我们的代码、数据和检查点将发布在 https://github.com/WeChatCV/D-ORCA/。
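The rewards are said to derive from standard speech-processing metrics; for global speech content, word error rate (WER) is the usual one. A self-contained Levenshtein-based sketch follows; the `speech_reward` mapping to [0, 1] is a hypothetical stand-in, not the paper's reward:

```python
def word_error_rate(ref, hyp):
    """WER: word-level edit distance between reference and hypothesis,
    normalized by reference length. Computed by dynamic programming."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                       # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

def speech_reward(ref, hyp):
    """Turn WER into a bounded reward in [0, 1] (illustrative stand-in)."""
    return max(0.0, 1.0 - word_error_rate(ref, hyp))
```

Note WER can exceed 1 when the hypothesis is much longer than the reference, which is why a reward derived from it needs clipping or another bounded transform.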
cs.CV / 109 / 2602.07967
EasyTune: Efficient Step-Aware Fine-Tuning for Diffusion-Based Motion Generation
EasyTune:基于扩散的运动生成的高效逐步微调
Abstract
In recent years, motion generative models have undergone significant advancement, yet aligning them with downstream objectives remains challenging. Recent studies have shown that using differentiable rewards to directly align the preference of diffusion models yields promising results. However, these methods suffer from (1) inefficient and coarse-grained optimization with (2) high memory consumption. In this work, we first theoretically and empirically identify the key reason of these limitations: the recursive dependence between different steps in the denoising trajectory. Inspired by this insight, we propose EasyTune, which fine-tunes diffusion at each denoising step rather than over the entire trajectory. This decouples the recursive dependence, allowing us to perform (1) a dense and fine-grained, and (2) memory-efficient optimization. Furthermore, the scarcity of preference motion pairs restricts the training of motion reward models. To this end, we further introduce a Self-refinement Preference Learning (SPL) mechanism that dynamically identifies preference pairs and conducts preference learning. Extensive experiments demonstrate that EasyTune outperforms DRaFT-50 by 8.2% in alignment (MM-Dist) improvement while requiring only 31.16% of its additional memory overhead and achieving a 7.3x training speedup. The project page is available at this link {https://xiaofeng-tan.github.io/projects/EasyTune/index.html}.
Chinese Translation
近年来,运动生成模型取得了显著进展,但在与下游目标对齐方面仍面临挑战。近期研究表明,使用可微分奖励直接对齐扩散模型的偏好可以取得良好的效果。然而,这些方法存在(1)低效且粗糙的优化,以及(2)高内存消耗的问题。在本研究中,我们首先理论和实证上识别出这些限制的关键原因:去噪轨迹中不同步骤之间的递归依赖。受到这一洞察的启发,我们提出了EasyTune,它在每个去噪步骤上进行微调,而不是在整个轨迹上。这解耦了递归依赖,使我们能够进行(1)密集且细粒度的优化,以及(2)内存高效的优化。此外,偏好运动对的稀缺性限制了运动奖励模型训练的可用性。为此,我们进一步引入了一种自我精炼偏好学习(Self-refinement Preference Learning, SPL)机制,动态识别偏好对并进行偏好学习。大量实验表明,EasyTune在对齐(MM-Dist)改进方面比DRaFT-50提高了8.2%,同时仅需31.16%的额外内存开销,并实现了7.3倍的训练加速。项目页面可通过此链接访问 {https://xiaofeng-tan.github.io/projects/EasyTune/index.html}。
cs.CV / 110 / 2602.07979
FSP-Diff: Full-Spectrum Prior-Enhanced Dual-Domain Latent Diffusion for Ultra-Low-Dose Spectral CT Reconstruction
FSP-Diff:全光谱先验增强的双域潜在扩散用于超低剂量光谱CT重建
Abstract
Spectral computed tomography (CT) with photon-counting detectors holds immense potential for material discrimination and tissue characterization. However, under ultra-low-dose conditions, the sharply degraded signal-to-noise ratio (SNR) in energy-specific projections poses a significant challenge, leading to severe artifacts and loss of structural details in reconstructed images. To address this, we propose FSP-Diff, a full-spectrum prior-enhanced dual-domain latent diffusion framework for ultra-low-dose spectral CT reconstruction. Our framework integrates three core strategies: 1) Complementary Feature Construction: We integrate direct image reconstructions with projection-domain denoised results. While the former preserves latent textural nuances amidst heavy noise, the latter provides a stable structural scaffold to balance detail fidelity and noise suppression. 2) Full-Spectrum Prior Integration: By fusing multi-energy projections into a high-SNR full-spectrum image, we establish a unified structural reference that guides the reconstruction across all energy bins. 3) Efficient Latent Diffusion Synthesis: To alleviate the high computational burden of high-dimensional spectral data, multi-path features are embedded into a compact latent space. This allows the diffusion process to facilitate interactive feature fusion in a lower-dimensional manifold, achieving accelerated reconstruction while maintaining fine-grained detail restoration. Extensive experiments on simulated and real-world datasets demonstrate that FSP-Diff significantly outperforms state-of-the-art methods in both image quality and computational efficiency, underscoring its potential for clinically viable ultra-low-dose spectral CT imaging.
Chinese Translation
光子计数探测器的光谱计算机断层扫描(CT)在材料区分和组织特征化方面具有巨大的潜力。然而,在超低剂量条件下,能量特定投影中的信噪比(SNR)急剧下降,带来了显著挑战,导致重建图像中严重的伪影和结构细节的丧失。为了解决这一问题,我们提出了FSP-Diff,一种全光谱先验增强的双域潜在扩散框架,用于超低剂量光谱CT重建。我们的框架集成了三种核心策略:1)互补特征构建:我们将直接图像重建与投影域去噪结果相结合。前者在强噪声中保留潜在的纹理细微差别,而后者提供了一个稳定的结构框架,以平衡细节保真度和噪声抑制。2)全光谱先验集成:通过将多能量投影融合为高信噪比的全光谱图像,我们建立了一个统一的结构参考,指导所有能量区间的重建。3)高效的潜在扩散合成:为了减轻高维光谱数据的高计算负担,多路径特征被嵌入到一个紧凑的潜在空间中。这使得扩散过程能够在低维流形中促进交互特征融合,实现加速重建,同时保持细致的细节恢复。在模拟和真实世界数据集上的广泛实验表明,FSP-Diff在图像质量和计算效率方面显著优于现有的最先进方法,突显了其在临床可行的超低剂量光谱CT成像中的潜力。
cs.CV / 111 / 2602.07980
Continuity-driven Synergistic Diffusion with Neural Priors for Ultra-Sparse-View CBCT Reconstruction
基于连续性驱动的协同扩散与神经先验用于超稀疏视角的锥束计算机断层重建
Abstract
The clinical application of cone-beam computed tomography (CBCT) is constrained by the inherent trade-off between radiation exposure and image quality. Ultra-sparse angular sampling, employed to reduce dose, introduces severe undersampling artifacts and inter-slice inconsistencies, compromising diagnostic reliability. Existing reconstruction methods often struggle to balance angular continuity with spatial detail fidelity. To address these challenges, we propose a Continuity-driven Synergistic Diffusion with Neural priors (CSDN) for ultra-sparse-view CBCT reconstruction. Neural priors are introduced as a structural foundation to encode a continuous three-dimensional attenuation representation, enabling the synthesis of physically consistent dense projections from ultra-sparse measurements. Building upon this neural-prior-based initialization, a synergistic diffusion strategy is developed, consisting of two collaborative refinement paths: a Sinogram Refinement Diffusion (Sino-RD) process that restores angular continuity and a Digital Radiography Refinement Diffusion (DR-RD) process that enforces inter-slice consistency from the projection image perspective. The outputs of the two diffusion paths are adaptively fused by the Dual-Projection Reconstruction Fusion (DPRF) module to achieve coherent volumetric reconstruction. Extensive experiments demonstrate that the proposed CSDN effectively suppresses artifacts and recovers fine textures under ultra-sparse-view conditions, outperforming existing state-of-the-art techniques.
Chinese Translation
锥束计算机断层扫描(CBCT)的临床应用受到辐射暴露与图像质量之间固有权衡的限制。为降低剂量而采用的超稀疏角度采样引入了严重的欠采样伪影和层间不一致性,损害了诊断的可靠性。现有的重建方法往往难以平衡角度连续性与空间细节的保真度。为了解决这些挑战,我们提出了一种基于连续性驱动的协同扩散与神经先验(CSDN)用于超稀疏视角的CBCT重建。神经先验被引入作为结构基础,以编码连续的三维衰减表示,从而能够从超稀疏测量中合成物理一致的密集投影。在此神经先验基础初始化的基础上,开发了一种协同扩散策略,包含两个协作的细化路径:一个是正弦图细化扩散(Sino-RD)过程,恢复角度连续性;另一个是数字放射摄影细化扩散(DR-RD)过程,从投影图像的角度强制执行层间一致性。这两个扩散路径的输出通过双投影重建融合(DPRF)模块自适应融合,以实现一致的体积重建。大量实验表明,所提出的CSDN在超稀疏视角条件下有效抑制伪影并恢复细腻纹理,优于现有的最先进技术。
cs.CV / 112 / 2602.07986
Deepfake Synthesis vs. Detection: An Uneven Contest
深度伪造合成与检测:一场不平衡的竞争
Abstract
The rapid advancement of deepfake technology has significantly elevated the realism and accessibility of synthetic media. Emerging techniques, such as diffusion-based models and Neural Radiance Fields (NeRF), alongside enhancements in traditional Generative Adversarial Networks (GANs), have contributed to the sophisticated generation of deepfake videos. Concurrently, deepfake detection methods have seen notable progress, driven by innovations in Transformer architectures, contrastive learning, and other machine learning approaches. In this study, we conduct a comprehensive empirical analysis of state-of-the-art deepfake detection techniques, including human evaluation experiments against cutting-edge synthesis methods. Our findings highlight a concerning trend: many state-of-the-art detection models exhibit markedly poor performance when challenged with deepfakes produced by modern synthesis techniques, including poor performance by human participants against the best quality deepfakes. Through extensive experimentation, we provide evidence that underscores the urgent need for continued refinement of detection models to keep pace with the evolving capabilities of deepfake generation technologies. This research emphasizes the critical gap between current detection methodologies and the sophistication of new generation techniques, calling for intensified efforts in this crucial area of study.
Chinese Translation
深度伪造技术的快速发展显著提升了合成媒体的真实感和可获取性。新兴技术,如基于扩散的模型和神经辐射场(Neural Radiance Fields, NeRF),以及传统生成对抗网络(Generative Adversarial Networks, GANs)的改进,促进了深度伪造视频的复杂生成。同时,深度伪造检测方法也取得了显著进展,这得益于变换器架构、对比学习及其他机器学习方法的创新。在本研究中,我们对最先进的深度伪造检测技术进行了全面的实证分析,包括针对尖端合成方法的人类评估实验。我们的研究结果突显了一个令人担忧的趋势:许多最先进的检测模型在面对现代合成技术生成的深度伪造时表现出明显的低效,甚至人类参与者在对抗最佳质量的深度伪造时也表现不佳。通过广泛的实验,我们提供了证据,强调了持续改进检测模型以跟上深度伪造生成技术不断演变的能力的迫切需要。本研究强调了当前检测方法与新一代技术复杂性之间的关键差距,呼吁在这一重要研究领域加大努力。
cs.CV / 113 / 2602.07993
MCIE: Multimodal LLM-Driven Complex Instruction Image Editing with Spatial Guidance
MCIE:基于多模态大型语言模型的复杂指令图像编辑与空间引导
Abstract
Recent advances in instruction-based image editing have shown remarkable progress. However, existing methods remain limited to relatively simple editing operations, hindering real-world applications that require complex and compositional instructions. In this work, we address these limitations from the perspectives of architectural design, data, and evaluation protocols. Specifically, we identify two key challenges in current models: insufficient instruction compliance and background inconsistency. To this end, we propose MCIE-E1, a Multimodal Large Language Model-Driven Complex Instruction Image Editing method that integrates two key modules: a spatial-aware cross-attention module and a background-consistent cross-attention module. The former enhances instruction-following capability by explicitly aligning semantic instructions with spatial regions through spatial guidance during the denoising process, while the latter preserves features in unedited regions to maintain background consistency. To enable effective training, we construct a dedicated data pipeline to mitigate the scarcity of complex instruction-based image editing datasets, combining fine-grained automatic filtering via a powerful MLLM with rigorous human validation. Finally, to comprehensively evaluate complex instruction-based image editing, we introduce CIE-Bench, a new benchmark with two new evaluation metrics. Experimental results on CIE-Bench demonstrate that MCIE-E1 consistently outperforms previous state-of-the-art methods in both quantitative and qualitative assessments, achieving a 23.96% improvement in instruction compliance.
Chinese Translation
最近在基于指令的图像编辑方面取得了显著进展。然而,现有方法仍然局限于相对简单的编辑操作,阻碍了需要复杂和组合指令的实际应用。在本研究中,我们从架构设计、数据和评估协议的角度解决这些限制。具体而言,我们识别出当前模型中的两个关键挑战:指令遵循不足和背景不一致。为此,我们提出了MCIE-E1,一种基于多模态大型语言模型驱动的复杂指令图像编辑方法,集成了两个关键模块:空间感知交叉注意模块和背景一致性交叉注意模块。前者通过在去噪过程中利用空间引导,将语义指令与空间区域显式对齐,从而增强了指令遵循能力,而后者则保留未编辑区域的特征以维持背景一致性。为了实现有效的训练,我们构建了一个专门的数据管道,以缓解基于复杂指令的图像编辑数据集的稀缺性,结合强大的多模态大型语言模型进行细粒度自动过滤,并辅以严格的人工验证。最后,为了全面评估基于复杂指令的图像编辑,我们引入了CIE-Bench,一个具有两个新评估指标的新基准。在CIE-Bench上的实验结果表明,MCIE-E1在定量和定性评估中始终优于以前的最先进方法,指令遵循能力提高了23.96%。
cs.CV / 114 / 2602.08006
ForecastOcc: Vision-based Semantic Occupancy Forecasting
ForecastOcc:基于视觉的语义占用预测
Abstract
Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving.
Chinese Translation
自动驾驶需要对几何和语义进行时间上的预测,以有效推理未来环境状态。现有的基于视觉的占用预测方法主要关注与运动相关的类别,如静态和动态物体,而语义信息则在很大程度上缺失。最近的语义占用预测方法填补了这一空白,但依赖于从独立网络获得的过去占用预测。这使得当前的方法对误差积累敏感,并阻碍了直接从图像中学习时空特征。在本研究中,我们提出了ForecastOcc,这是第一个基于视觉的语义占用预测框架,能够联合预测未来的占用状态和语义类别。我们的框架直接从过去的相机图像中生成多个时间范围的语义占用预测,而无需依赖外部估计的地图。我们在两个互补的设置中评估ForecastOcc:在Occ3D-nuScenes数据集上的多视角预测和在SemanticKITTI上的单目预测,在这两个设置中我们建立了该任务的首个基准。我们通过在框架内调整两个2D预测模块引入了首个基线。重要的是,我们提出了一种新颖的架构,结合了时间交叉注意力预测模块、2D到3D视图转换器、用于占用预测的3D编码器以及用于多个时间范围的体素级预测的语义占用头。在这两个数据集上的大量实验表明,ForecastOcc始终优于基线,生成语义丰富、关注未来的预测,捕捉对自动驾驶至关重要的场景动态和语义。
cs.CV / 115 / 2602.08020
PhysDrape: Learning Explicit Forces and Collision Constraints for Physically Realistic Garment Draping
PhysDrape:学习显式力和碰撞约束以实现物理真实的服装悬垂
Abstract
Deep learning-based garment draping has emerged as a promising alternative to traditional Physics-Based Simulation (PBS), yet robust collision handling remains a critical bottleneck. Most existing methods enforce physical validity through soft penalties, creating an intrinsic trade-off between geometric feasibility and physical plausibility: penalizing collisions often distorts mesh structure, while preserving shape leads to interpenetration. To resolve this conflict, we present PhysDrape, a hybrid neural-physical solver for physically realistic garment draping driven by explicit forces and constraints. Unlike soft-constrained frameworks, PhysDrape integrates neural inference with explicit geometric solvers in a fully differentiable pipeline. Specifically, we propose a Physics-Informed Graph Neural Network conditioned on a physics-enriched graph -- encoding material parameters and body proximity -- to predict residual displacements. Crucially, we integrate a differentiable two-stage solver: first, a learnable Force Solver iteratively resolves unbalanced forces derived from the Saint Venant-Kirchhoff (StVK) model to ensure quasi-static equilibrium; second, a Differentiable Projection strictly enforces collision constraints against the body surface. This differentiable design guarantees physical validity through explicit constraints, while enabling end-to-end learning to optimize the network for physically consistent predictions. Extensive experiments demonstrate that PhysDrape achieves state-of-the-art performance, ensuring negligible interpenetration with significantly lower strain energy compared to existing baselines, achieving superior physical fidelity and robustness in real-time.
Chinese Translation
基于深度学习的服装悬垂已成为传统物理基础模拟(PBS)的有前景的替代方案,但稳健的碰撞处理仍然是一个关键瓶颈。现有大多数方法通过软惩罚来强制物理有效性,这在几何可行性和物理合理性之间产生了内在的权衡:惩罚碰撞往往会扭曲网格结构,而保持形状则会导致相互穿透。为了解决这一冲突,我们提出了PhysDrape,一种基于显式力和约束的物理真实服装悬垂的混合神经物理求解器。与软约束框架不同,PhysDrape在一个完全可微的管道中将神经推理与显式几何求解器相结合。具体而言,我们提出了一种基于物理信息的图神经网络,该网络以物理丰富的图为条件——编码材料参数和身体接近度——以预测残余位移。关键是,我们集成了一个可微的两阶段求解器:首先,一个可学习的力求解器迭代解决源于圣维南-基尔霍夫(StVK)模型的失衡力,以确保准静态平衡;其次,一个可微投影严格执行与身体表面的碰撞约束。这种可微设计通过显式约束保证了物理有效性,同时使端到端学习能够优化网络以实现物理一致的预测。大量实验表明,PhysDrape实现了最先进的性能,确保了与现有基线相比几乎没有相互穿透,并显著降低了应变能量,在实时应用中实现了更高的物理保真度和鲁棒性。
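The two-stage solver above alternates force-driven relaxation with a hard collision projection. A 1-D toy makes that split concrete: gradient steps on a hanging spring-chain energy play the force-solver role, and clamping nodes above a floor plays the projection role. All names and constants here are illustrative; this is not the paper's StVK cloth formulation.

```python
import numpy as np

def drape_chain(n, rest_len, floor, steps=2000, lr=0.01, g=1.0):
    """Quasi-static settling of a pinned spring chain under gravity:
    force relaxation (the 'force solver' role) followed each step by a
    hard non-penetration projection against a floor."""
    y = -rest_len * np.arange(n, dtype=float)   # nodes hang down from y = 0
    for _ in range(steps):
        stretch = (y[:-1] - y[1:]) - rest_len   # spring extensions
        force = np.zeros(n)
        force[:-1] -= stretch                   # pull from the spring below
        force[1:] += stretch                    # pull from the spring above
        force[1:] -= g                          # gravity on free nodes
        y[1:] += lr * force[1:]                 # node 0 stays pinned
        y = np.maximum(y, floor)                # collision projection
    return y

y = drape_chain(n=5, rest_len=1.0, floor=-2.5)
print(y.min() >= -2.5)                          # projection guarantees this
```

The projection runs after every relaxation step, so the returned configuration is penetration-free by construction, mirroring the paper's "explicit constraints" guarantee.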
cs.CV / 116 / 2602.08024
FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging
FlashVID:通过无训练的基于树的时空令牌合并实现高效视频大语言模型
Abstract
Although Video Large Language Models (VLLMs) have shown remarkable capabilities in video understanding, they are required to process high volumes of visual tokens, causing significant computational inefficiency. Existing VLLMs acceleration frameworks usually compress spatial and temporal redundancy independently, which overlooks the spatiotemporal relationships, thereby leading to suboptimal spatiotemporal compression. The highly correlated visual features are likely to change in spatial position, scale, orientation, and other attributes over time due to the dynamic nature of video. Building on this insight, we introduce FlashVID, a training-free inference acceleration framework for VLLMs. Specifically, FlashVID utilizes Attention and Diversity-based Token Selection (ADTS) to select the most representative tokens for basic video representation, then applies Tree-based Spatiotemporal Token Merging (TSTM) for fine-grained spatiotemporal redundancy elimination. Extensive experiments conducted on three representative VLLMs across five video understanding benchmarks demonstrate the effectiveness and generalization of our method. Notably, by retaining only 10% of visual tokens, FlashVID preserves 99.1% of the performance of LLaVA-OneVision. Consequently, FlashVID can serve as a training-free and plug-and-play module for extending long video frames, which enables a 10x increase in video frame input to Qwen2.5-VL, resulting in a relative improvement of 8.6% within the same computational budget. Code is available at https://github.com/Fanziyang-v/FlashVID.
Chinese Translation
尽管视频大语言模型(VLLMs)在视频理解方面展现了显著的能力,但它们需要处理大量的视觉令牌,导致计算效率显著降低。现有的VLLMs加速框架通常独立压缩空间和时间冗余,这忽视了时空关系,从而导致次优的时空压缩。由于视频的动态特性,高度相关的视觉特征在空间位置、尺度、方向及其他属性上可能随时间变化。基于这一洞察,我们提出了FlashVID,一个无训练的VLLMs推理加速框架。具体而言,FlashVID利用基于注意力和多样性的令牌选择(ADTS)来选择最具代表性的令牌以进行基本视频表示,然后应用基于树的时空令牌合并(TSTM)进行细粒度的时空冗余消除。在五个视频理解基准上对三种代表性VLLMs进行的广泛实验表明了我们方法的有效性和泛化能力。值得注意的是,通过仅保留10%的视觉令牌,FlashVID保留了LLaVA-OneVision 99.1%的性能。因此,FlashVID可以作为一个无训练的即插即用模块,用于扩展长视频帧,使Qwen2.5-VL的视频帧输入量增加10倍,在相同计算预算内实现了8.6%的相对提升。代码可在https://github.com/Fanziyang-v/FlashVID获取。
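The ADTS step selects tokens that are both highly attended and mutually diverse. A greedy sketch of that trade-off follows; `select_tokens` and its scoring rule are our simplification for illustration, not the released implementation.

```python
import numpy as np

def select_tokens(tokens, attn_scores, k, alpha=0.5):
    """Greedy attention- and diversity-based token selection:
    seed with the most attended token, then repeatedly pick the token
    that balances high attention against low similarity to the chosen set."""
    feats = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    selected = [int(np.argmax(attn_scores))]
    for _ in range(k - 1):
        sim_to_sel = feats @ feats[selected].T       # cosine similarity (n, |sel|)
        redundancy = sim_to_sel.max(axis=1)          # closeness to chosen set
        score = alpha * attn_scores - (1 - alpha) * redundancy
        score[selected] = -np.inf                    # never re-pick a token
        selected.append(int(np.argmax(score)))
    return np.array(selected)

rng = np.random.default_rng(0)
toks = rng.normal(size=(64, 16))                     # 64 visual tokens, dim 16
attn = rng.random(64)                                # stand-in attention scores
idx = select_tokens(toks, attn, k=8)
print(len(idx))
```

`alpha` trades importance against diversity; `alpha=1` degenerates to plain top-k attention pruning, which is exactly the baseline the paper argues is suboptimal.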
cs.CV / 117 / 2602.08025
MIND: Benchmarking Memory Consistency and Action Control in World Models
MIND:世界模型中记忆一致性与动作控制的基准评估
Abstract
World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open-domain closed-loop revisited benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high-quality videos at 1080p and 24 FPS, including 100 (first-person) + 100 (third-person) video clips under a shared action space and 25 + 25 clips across varied action spaces covering eight diverse scenes. We design an efficient evaluation framework to measure two core abilities: memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. Furthermore, we design various action spaces, including different character movement speeds and camera rotation angles, to evaluate the action generalization capability across different action spaces under shared scenes. To facilitate future performance benchmarking on MIND, we introduce MIND-World, a novel interactive Video-to-World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges in current world models, including the difficulty of maintaining long-term memory consistency and generalizing across action spaces. Project page: https://csu-jpg.github.io/MIND.github.io/
Chinese Translation
世界模型旨在理解、记忆和预测动态视觉环境,但缺乏一个统一的基准来评估其基本能力。为了解决这一问题,我们引入了MIND,这是第一个开放领域的闭环重新评估基准,用于评估世界模型中的记忆一致性(Memory Consistency)和动作控制(Action Control)。MIND包含250个高质量的视频,分辨率为1080p,帧率为24 FPS,包括100个(第一人称)和100个(第三人称)视频片段,均在共享的动作空间下,以及25个+25个跨越不同动作空间的片段,涵盖八个多样化场景。我们设计了一个高效的评估框架,以测量两个核心能力:记忆一致性和动作控制,捕捉不同视角下的时间稳定性和上下文连贯性。此外,我们设计了各种动作空间,包括不同的角色移动速度和相机旋转角度,以评估在共享场景下不同动作空间的动作泛化能力。为了促进未来在MIND上的性能基准评估,我们引入了MIND-World,这是一个新颖的交互式视频到世界(Video-to-World)基线。大量实验表明MIND的完整性,并揭示了当前世界模型中的关键挑战,包括维持长期记忆一致性和在动作空间之间进行泛化的困难。项目页面:https://csu-jpg.github.io/MIND.github.io/
cs.CV / 118 / 2602.08046
Enhanced Mixture 3D CGAN for Completion and Generation of 3D Objects
增强混合3D CGAN用于3D对象的补全与生成
Abstract
The generation and completion of 3D objects represent a transformative challenge in computer vision. Generative Adversarial Networks (GANs) have recently demonstrated strong potential in synthesizing realistic visual data. However, they often struggle to capture complex and diverse data distributions, particularly in scenarios involving incomplete inputs or significant missing regions. These challenges arise mainly from the high computational requirements and the difficulty of modeling heterogeneous and structurally intricate data, which restrict their applicability in real-world settings. Mixture of Experts (MoE) models have emerged as a promising solution to these limitations. By dynamically selecting and activating the most relevant expert sub-networks for a given input, MoEs improve both performance and efficiency. In this paper, we investigate the integration of Deep 3D Convolutional GANs (CGANs) with a MoE framework to generate high-quality 3D models and reconstruct incomplete or damaged objects. The proposed architecture incorporates multiple generators, each specialized to capture distinct modalities within the dataset. Furthermore, an auxiliary loss-free dynamic capacity constraint (DCC) mechanism is introduced to guide the selection of categorical generators, ensuring a balance between specialization, training stability, and computational efficiency, which is critical for 3D voxel processing. We evaluated the model's ability to generate and complete shapes with missing regions of varying sizes and compared its performance with state-of-the-art approaches. Both quantitative and qualitative results confirm the effectiveness of the proposed MoE-DCGAN in handling complex 3D data.
Chinese Translation
3D对象的生成与补全在计算机视觉中代表了一项变革性的挑战。生成对抗网络(GANs)最近在合成逼真的视觉数据方面展现出了强大的潜力。然而,它们在捕捉复杂和多样的数据分布方面常常面临困难,尤其是在涉及不完整输入或显著缺失区域的场景中。这些挑战主要源于高计算要求以及建模异质和结构复杂数据的难度,限制了它们在现实世界中的应用。专家混合(MoE)模型作为解决这些局限性的有希望的方案应运而生。通过动态选择和激活与给定输入最相关的专家子网络,MoE提高了性能和效率。在本文中,我们研究了将深度3D卷积生成对抗网络(CGANs)与MoE框架相结合,以生成高质量的3D模型并重建不完整或损坏的对象。所提出的架构包含多个生成器,每个生成器专门用于捕捉数据集中不同的模态。此外,引入了一种无辅助损失的动态容量约束(DCC)机制,以指导类别生成器的选择,确保专业化、训练稳定性和计算效率之间的平衡,这对于3D体素处理至关重要。我们评估了模型生成和补全具有不同大小缺失区域的形状的能力,并将其性能与最先进的方法进行了比较。定量和定性结果均证实了所提出的MoE-DCGAN在处理复杂3D数据方面的有效性。
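The dynamic capacity constraint caps how many inputs each expert generator absorbs without adding an auxiliary balancing loss. A toy top-1 router with a hard per-expert cap sketches the idea; `route_top1` and its tie-breaking order are hypothetical simplifications, not the paper's DCC mechanism.

```python
import numpy as np

def route_top1(gate_logits, capacity):
    """Top-1 expert routing with a hard per-expert capacity cap.
    Tokens are visited in order of gate confidence, so when an expert
    fills up, only its least-confident tokens are left unassigned (-1)."""
    n, e = gate_logits.shape
    choice = gate_logits.argmax(axis=1)
    load = np.zeros(e, dtype=int)
    assign = np.full(n, -1, dtype=int)
    order = np.argsort(-gate_logits.max(axis=1))     # most confident first
    for i in order:
        exp = choice[i]
        if load[exp] < capacity:
            assign[i] = exp
            load[exp] += 1
    return assign, load

rng = np.random.default_rng(0)
logits = rng.normal(size=(32, 4))                    # 32 tokens, 4 expert generators
assign, load = route_top1(logits, capacity=8)
print(load)
```

The cap enforces the balance between specialization and training stability directly, rather than penalizing imbalance through an extra loss term.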
cs.CV / 119 / 2602.08047
Vanilla Group Equivariant Vision Transformer: Simple and Effective
香草组等变视觉变换器:简单而有效
Abstract
Incorporating symmetry priors as inductive biases to design equivariant Vision Transformers (ViTs) has emerged as a promising avenue for enhancing their performance. However, existing equivariant ViTs often struggle to balance performance with equivariance, primarily due to the challenge of achieving holistic equivariant modifications across the diverse modules in ViTs, particularly in harmonizing the Self-Attention mechanism with Patch Embedding. To address this, we propose a straightforward framework that systematically renders key ViT components, including patch embedding, self-attention, positional encodings, and Down/Up-Sampling, equivariant, thereby constructing ViTs with guaranteed equivariance. The resulting architecture serves as a plug-and-play replacement that is both theoretically grounded and practically versatile, scaling seamlessly even to Swin Transformers. Extensive experiments demonstrate that our equivariant ViTs consistently improve performance and data efficiency across a wide spectrum of vision tasks.
Chinese Translation
将对称先验作为归纳偏置融入设计等变视觉变换器(ViTs)已成为提升其性能的一个有前景的途径。然而,现有的等变ViTs往往难以在性能与等变性之间取得平衡,主要是由于在ViTs的不同模块之间实现整体等变修改的挑战,特别是在自注意力机制与补丁嵌入的协调方面。为了解决这一问题,我们提出了一个简单的框架,系统地使关键的ViT组件,包括补丁嵌入、自注意力、位置编码以及下采样/上采样,具备等变性,从而构建出具有保证等变性的ViTs。所得到的架构作为一种即插即用的替代方案,既有理论基础又具有实际通用性,能够无缝扩展到Swin Transformers。大量实验表明,我们的等变ViTs在广泛的视觉任务中始终提高了性能和数据效率。
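The design principle, making each component respect the group action, can be checked numerically. A minimal example builds a C4-invariant patch response by pooling over the 90-degree rotation orbit; this is a textbook equivariant-design construction, not the paper's architecture.

```python
import numpy as np

def c4_invariant_embed(patch, w):
    """Respond to a square patch invariantly under 90-degree rotations:
    evaluate one linear filter on the whole C4 orbit and max-pool.
    Rotating the input permutes the orbit, so the pooled value is unchanged."""
    responses = [float(np.sum(np.rot90(patch, k) * w)) for k in range(4)]
    return max(responses)

rng = np.random.default_rng(1)
p = rng.normal(size=(4, 4))                 # a toy image patch
w = rng.normal(size=(4, 4))                 # a single filter / patch-embedding weight
v0 = c4_invariant_embed(p, w)
v1 = c4_invariant_embed(np.rot90(p), w)     # rotated input
print(np.isclose(v0, v1))
```

The same orbit-pooling trick, applied per module (patch embedding, attention, positional encoding), is the spirit of the "holistic" modification the abstract describes.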
cs.CV / 120 / 2602.08057
Weak to Strong: VLM-Based Pseudo-Labeling as a Weakly Supervised Training Strategy in Multimodal Video-based Hidden Emotion Understanding Tasks
从弱到强:基于VLM的伪标签生成作为多模态视频隐性情感理解任务中的弱监督训练策略
Abstract
To tackle the automatic recognition of "concealed emotions" in videos, this paper proposes a multimodal weak-supervision framework and achieves state-of-the-art results on the iMiGUE tennis-interview dataset. First, YOLO 11x detects and crops human portraits frame-by-frame, and DINOv2-Base extracts visual features from the cropped regions. Next, by integrating Chain-of-Thought and Reflection prompting (CoT + Reflection), Gemini 2.5 Pro automatically generates pseudo-labels and reasoning texts that serve as weak supervision for downstream models. Subsequently, OpenPose produces 137-dimensional key-point sequences, augmented with inter-frame offset features; the usual graph neural network backbone is simplified to an MLP to efficiently model the spatiotemporal relationships of the three key-point streams. An ultra-long-sequence Transformer independently encodes both the image and key-point sequences, and their representations are concatenated with BERT-encoded interview transcripts. Each modality is first pre-trained in isolation, then fine-tuned jointly, with pseudo-labeled samples merged into the training set for further gains. Experiments demonstrate that, despite severe class imbalance, the proposed approach lifts accuracy from under 0.6 in prior work to over 0.69, establishing a new public benchmark. The study also validates that an "MLP-ified" key-point backbone can match, or even surpass, GCN-based counterparts in this task.
Chinese Translation
为了解决视频中“隐性情感”的自动识别问题,本文提出了一种多模态弱监督框架,并在iMiGUE网球访谈数据集上取得了最先进的结果。首先,YOLO 11x逐帧检测并裁剪人像,DINOv2-Base从裁剪区域提取视觉特征。接下来,通过整合思维链和反思提示(Chain-of-Thought and Reflection prompting, CoT + Reflection),Gemini 2.5 Pro自动生成伪标签和推理文本,为下游模型提供弱监督。随后,OpenPose生成137维关键点序列,并增强了帧间偏移特征;通常的图神经网络骨干被简化为多层感知器(MLP),以高效建模三个关键点流的时空关系。一个超长序列的Transformer独立编码图像和关键点序列,并将它们的表示与BERT编码的访谈文本连接。每种模态先单独进行预训练,然后联合微调,伪标签样本被合并到训练集中以获得进一步提升。实验表明,尽管存在严重的类别不平衡,所提出的方法将准确率从之前工作的0.6以下提升至超过0.69,建立了新的公共基准。研究还验证了“MLP化”的关键点骨干在该任务中可以媲美甚至超越基于GCN的同类方法。
cs.CV / 121 / 2602.08058
Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling
Picasso:基于物理约束采样的整体场景重建
Abstract
In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions -- which fit the sensor data -- can still be physically incorrect. For instance, when estimating the poses and shapes of objects in the scene and importing the resulting estimates into a simulator, small errors might translate to implausible configurations including object interpenetration or unstable equilibrium. This makes it difficult to predict the dynamic behavior of the scene using a digital twin, an important step in simulation-based planning and control of contact-rich behaviors. In this paper, we posit that object pose and shape estimation requires reasoning holistically over the scene (instead of reasoning about each object in isolation), accounting for object interactions and physical plausibility. Towards this goal, our first contribution is Picasso, a physics-constrained reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. Picasso relies on a fast rejection sampling method that reasons over multi-object interactions, leveraging an inferred object contact graph to guide samples. Second, we propose the Picasso dataset, a collection of 10 contact-rich real-world scenes with ground truth annotations, as well as a metric to quantify physical plausibility, which we open-source as part of our benchmark. Finally, we provide an extensive evaluation of Picasso on our newly introduced dataset and on the YCB-V dataset, and show it largely outperforms the state of the art while providing reconstructions that are both physically plausible and more aligned with human intuition.
Chinese Translation
在遮挡和测量噪声的情况下,几何上准确的场景重建——即与传感器数据相符的重建——仍可能在物理上不正确。例如,在估计场景中物体的姿态和形状并将结果导入模拟器时,微小的误差可能导致不合理的配置,包括物体相互穿透或不稳定的平衡。这使得使用数字孪生预测场景的动态行为变得困难,而这在基于模拟的接触丰富行为的规划和控制中是一个重要步骤。在本文中,我们认为物体姿态和形状估计需要对场景进行整体推理(而不是孤立地推理每个物体),以考虑物体之间的相互作用和物理合理性。为此,我们的第一个贡献是Picasso,一个基于物理约束的重建管道,通过考虑几何形状、非穿透性和物理因素来构建多物体场景重建。Picasso依赖于一种快速拒绝采样方法,该方法对多物体交互进行推理,并利用推断出的物体接触图来指导样本。其次,我们提出了Picasso数据集,这是一个包含10个接触丰富的真实场景及其真实标注的集合,以及一个量化物理合理性的指标,我们将其开源作为我们的基准的一部分。最后,我们对Picasso在新引入的数据集和YCB-V数据集上的表现进行了广泛评估,结果表明其在很大程度上超越了现有的最先进技术,同时提供了在物理上合理且更符合人类直觉的重建结果。
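The rejection-sampling core can be illustrated in one dimension: perturb object positions, then keep only configurations whose pairwise clearances are non-negative. This is a toy analogue; the paper reasons over a full 3-D contact graph with geometry and stability terms.

```python
import numpy as np

def sample_non_penetrating(centers, half_widths, noise, rng, max_tries=1000):
    """Rejection sampling of perturbed 1-D object positions that keeps only
    non-penetrating configurations (signed clearance between sorted
    neighbors must be >= 0)."""
    for _ in range(max_tries):
        pos = centers + rng.normal(scale=noise, size=centers.shape)
        order = np.argsort(pos)
        p, h = pos[order], half_widths[order]
        gaps = (p[1:] - h[1:]) - (p[:-1] + h[:-1])   # clearance between neighbors
        if np.all(gaps >= 0):                        # accept: no interpenetration
            return pos
    raise RuntimeError("no feasible sample found")

rng = np.random.default_rng(0)
centers = np.array([0.0, 3.0, 6.0])                  # noisy pose estimates
half = np.ones(3)                                    # object half-extents
pos = sample_non_penetrating(centers, half, noise=0.2, rng=rng)
print(pos.round(2))
```

Guiding which pairs to check via an inferred contact graph, as the paper does, turns this brute-force filter into a fast, targeted one.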
cs.CV / 122 / 2602.08059
DICE: Disentangling Artist Style from Content via Contrastive Subspace Decomposition in Diffusion Models
DICE:通过对比子空间分解在扩散模型中解耦艺术家风格与内容
Abstract
The recent proliferation of diffusion models has made style mimicry effortless, enabling users to imitate unique artistic styles without authorization. In deployed platforms, this raises copyright and intellectual-property risks and calls for reliable protection. However, existing countermeasures either require costly weight editing as new styles emerge or rely on an explicitly specified editing style, limiting their practicality for deployment-side safety. To address this challenge, we propose DICE (Disentanglement of artist Style from Content via Contrastive Subspace Decomposition), a training-free framework for on-the-fly artist style erasure. Unlike style editing that requires an explicitly specified replacement style, DICE performs style purification, removing the artist's characteristics while preserving the user-intended content. Our core insight is that a model cannot truly comprehend the artist style from a single text or image alone. Consequently, we abandon the traditional paradigm of identifying style from isolated samples. Instead, we construct contrastive triplets to compel the model to distinguish between style and non-style features in the latent space. By formalizing this disentanglement process as a solvable generalized eigenvalue problem, we achieve precise identification of the style subspace. Furthermore, we introduce an Adaptive Attention Decoupling Editing strategy that dynamically assesses the style concentration of each token and performs differential suppression and content enhancement on the QKV vectors. Extensive experiments demonstrate that DICE achieves a superior balance between the thoroughness of style erasure and the preservation of content integrity. DICE introduces an additional overhead of only 3 seconds to disentangle style, providing a practical and efficient technique for curbing style mimicry.
Chinese Translation
近年来,扩散模型的迅猛发展使得风格模仿变得轻而易举,使用户能够在未授权的情况下模仿独特的艺术风格。在已部署的平台上,这引发了版权和知识产权风险,并呼吁可靠的保护措施。然而,现有的对策要么需要在新风格出现时进行昂贵的权重编辑,要么依赖于明确指定的编辑风格,这限制了其在部署端安全性方面的实用性。为了解决这一挑战,我们提出了DICE(通过对比子空间分解解耦艺术家风格与内容),这是一个无需训练的框架,用于即时删除艺术家风格。与需要明确指定替换风格的风格编辑不同,DICE执行风格净化,去除艺术家的特征,同时保留用户意图的内容。我们的核心见解是,模型无法仅通过单一文本或图像真正理解艺术家风格。因此,我们放弃了从孤立样本中识别风格的传统范式。相反,我们构建了对比三元组,迫使模型在潜在空间中区分风格和非风格特征。通过将这一解耦过程形式化为可解的广义特征值问题,我们实现了对风格子空间的精确识别。此外,我们引入了一种自适应注意力解耦编辑策略,动态评估每个标记的风格浓度,并对QKV向量进行差异化抑制和内容增强。大量实验表明,DICE在风格去除的彻底性和内容完整性的保留之间实现了优越的平衡。DICE仅增加3秒的额外开销来解耦风格,提供了一种实用且高效的技术来遏制风格模仿。
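Formulating disentanglement as a generalized eigenvalue problem means finding directions along which style variation is large relative to non-style variation, i.e. solving S v = λ C v. A sketch via Cholesky whitening follows; the triplet construction and the exact scatter matrices in the paper may differ.

```python
import numpy as np

def style_subspace(diff_style, diff_content, dim):
    """Directions maximizing style variation relative to non-style variation.
    Rows of diff_style / diff_content are feature differences within triplets;
    we solve S v = lam C v by whitening with the Cholesky factor of C."""
    S = diff_style.T @ diff_style
    C = diff_content.T @ diff_content + 1e-6 * np.eye(diff_content.shape[1])
    L = np.linalg.cholesky(C)
    Linv = np.linalg.inv(L)
    M = Linv @ S @ Linv.T                    # symmetric whitened problem
    evals, evecs = np.linalg.eigh(M)         # ascending eigenvalues
    return Linv.T @ evecs[:, ::-1][:, :dim]  # top-dim generalized eigenvectors

# Toy data: style differences vary along axis 0, content differences along axis 1.
ds = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
dc = np.array([[0.0, 1.0, 0.0], [0.0, -1.0, 0.0], [0.0, 2.0, 0.0]])
V = style_subspace(ds, dc, dim=1)
print(np.abs(V[:, 0]).argmax())              # the style direction is axis 0
```

Projecting features out of this subspace is the "style purification" step; the eigen-solve itself is cheap, consistent with the paper's few-second overhead.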
cs.CV / 123 / 2602.08068
ReRoPE: Repurposing RoPE for Relative Camera Control
ReRoPE:重新利用RoPE进行相对相机控制
Abstract
Video generation with controllable camera viewpoints is essential for applications such as interactive content creation, gaming, and simulation. Existing methods typically adapt pre-trained video models using camera poses relative to a fixed reference, e.g., the first frame. However, these encodings lack shift-invariance, often leading to poor generalization and accumulated drift. While relative camera pose embeddings defined between arbitrary view pairs offer a more robust alternative, integrating them into pre-trained video diffusion models without prohibitive training costs or architectural changes remains challenging. We introduce ReRoPE, a plug-and-play framework that incorporates relative camera information into pre-trained video diffusion models without compromising their generation capability. Our approach is based on the insight that Rotary Positional Embeddings (RoPE) in existing models underutilize their full spectral bandwidth, particularly in the low-frequency components. By seamlessly injecting relative camera pose information into these underutilized bands, ReRoPE achieves precise control while preserving strong pre-trained generative priors. We evaluate our method on both image-to-video (I2V) and video-to-video (V2V) tasks in terms of camera control accuracy and visual fidelity. Our results demonstrate that ReRoPE offers a training-efficient path toward controllable, high-fidelity video generation. See project page for more results: https://sisyphe-lee.github.io/ReRoPE/
Chinese Translation
可控相机视角的视频生成对于互动内容创作、游戏和模拟等应用至关重要。现有方法通常使用相对于固定参考(例如第一帧)的相机姿态来适应预训练的视频模型。然而,这些编码缺乏平移不变性,常常导致较差的泛化能力和累积漂移。虽然在任意视图对之间定义的相对相机姿态嵌入提供了更为稳健的替代方案,但在不产生高昂训练成本或架构变更的情况下将其集成到预训练的视频扩散模型中仍然具有挑战性。我们提出了ReRoPE,这是一个即插即用的框架,能够在不影响生成能力的情况下将相对相机信息融入预训练的视频扩散模型。我们的方法基于这样的见解:现有模型中的旋转位置嵌入(Rotary Positional Embeddings,RoPE)未能充分利用其完整的频谱带宽,尤其是在低频成分中。通过无缝地将相对相机姿态信息注入这些未充分利用的频带,ReRoPE实现了精确控制,同时保留了强大的预训练生成先验。我们在图像到视频(I2V)和视频到视频(V2V)任务中评估了我们的方法,重点关注相机控制的准确性和视觉保真度。我们的结果表明,ReRoPE为实现可控、高保真的视频生成提供了一条训练高效的路径。更多结果请见项目页面:https://sisyphe-lee.github.io/ReRoPE/
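The "underutilized bandwidth" observation is easy to verify from standard RoPE: the geometric frequency schedule leaves the lowest bands almost unrotated at typical token offsets, which is the headroom ReRoPE repurposes. The camera-pose injection itself is the paper's contribution and is not reproduced here.

```python
import numpy as np

def rope_angles(pos, dim, base=10000.0):
    """Per-band rotation angles of standard RoPE at relative position `pos`.
    Band k rotates by pos * base**(-2k/dim): early bands spin fast,
    late (low-frequency) bands barely move."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # high -> low frequency
    return pos * freqs

ang = rope_angles(pos=32, dim=64)
print(ang[0], ang[-1])   # the first band rotates ~32 rad; the last, well under 0.1
```

Because those last bands encode almost no positional signal for realistic offsets, overwriting them with relative camera-pose phases perturbs the pre-trained prior minimally.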
cs.CV / 124 / 2602.08071
ViT-5: Vision Transformers for The Mid-2020s
ViT-5:面向2020年代中期的视觉变换器
Abstract
This work presents a systematic investigation into modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, we conduct a component-wise refinement involving normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. These updates form a new generation of Vision Transformers, which we call ViT-5. Extensive experiments demonstrate that ViT-5 consistently outperforms state-of-the-art plain Vision Transformers across both understanding and generation benchmarks. On ImageNet-1k classification, ViT-5-Base reaches 84.2% top-1 accuracy under comparable compute, exceeding DeiT-III-Base at 83.8%. ViT-5 also serves as a stronger backbone for generative modeling: when plugged into an SiT diffusion framework, it achieves 1.84 FID versus 2.06 with a vanilla ViT backbone. Beyond headline metrics, ViT-5 exhibits improved representation learning and favorable spatial reasoning behavior, and transfers reliably across tasks. With a design aligned with contemporary foundation-model practices, ViT-5 offers a simple drop-in upgrade over vanilla ViT for mid-2020s vision backbones.
Chinese Translation
本研究系统性地探讨了通过利用过去五年的架构进展来现代化视觉变换器主干。我们在保留经典的注意力-前馈网络(Attention-FFN)结构的同时,进行了组件级的优化,包括归一化、激活函数、位置编码、门控机制和可学习的标记。这些更新形成了新一代视觉变换器,我们称之为ViT-5。大量实验表明,ViT-5在理解和生成基准测试中始终优于最先进的普通视觉变换器。在ImageNet-1k分类任务中,ViT-5-Base在可比计算条件下达到了84.2%的top-1准确率,超过了DeiT-III-Base的83.8%。ViT-5还作为生成建模的更强主干:当嵌入SiT扩散框架时,其FID值为1.84,而普通ViT主干为2.06。除了主要指标外,ViT-5还表现出改进的表示学习和良好的空间推理能力,并且在任务间的迁移表现可靠。ViT-5的设计与当代基础模型实践相一致,为2020年代中期的视觉主干提供了一个简单的即插即用升级方案。
cs.CV / 125 / 2602.08099
VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval
VidVec:解锁视频多模态大语言模型嵌入以实现视频-文本检索
Abstract
Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains inferior to Video Foundation Models (VFMs). In this paper, we focus on leveraging MLLMs for video-text embedding and retrieval. We first conduct a systematic layer-wise analysis, showing that intermediate (pre-trained) MLLM layers already encode substantial task-relevant information. Leveraging this insight, we demonstrate that combining intermediate-layer embeddings with a calibrated MLLM head yields strong zero-shot retrieval performance without any training. Building on these findings, we introduce a lightweight text-based alignment strategy which maps dense video captions to short summaries and enables task-related video-text embedding learning without visual supervision. Remarkably, without any fine-tuning beyond text, our method outperforms current methods, often by a substantial margin, achieving state-of-the-art results across common video retrieval benchmarks.
Chinese Translation
近期研究将生成性多模态大语言模型(MLLMs)适配为视觉任务的嵌入提取器,通常通过微调来生成通用表示。然而,它们在视频上的表现仍然不及视频基础模型(VFMs)。在本文中,我们重点利用MLLMs进行视频-文本嵌入和检索。我们首先进行系统的层级分析,显示中间(预训练)MLLM层已经编码了大量与任务相关的信息。基于这一洞察,我们证明了将中间层嵌入与校准的MLLM头结合,可以在没有任何训练的情况下实现强大的零样本检索性能。基于这些发现,我们引入了一种轻量级的基于文本的对齐策略,该策略将密集的视频字幕映射到简短的摘要,并在没有视觉监督的情况下实现与任务相关的视频-文本嵌入学习。值得注意的是,在文本之外没有任何微调的情况下,我们的方法超越了当前的方法,通常有显著的优势,在常见的视频检索基准上取得了最先进的结果。
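The layer-wise finding suggests a simple recipe: mean-pool token states from an intermediate layer, normalize, and retrieve by cosine similarity. A generic sketch with random stand-in activations follows; the authors' exact pooling and head calibration are not reproduced.

```python
import numpy as np

def embed(hidden_states, layer):
    """Mean-pool token states from one (intermediate) layer into a unit vector.
    hidden_states has shape [num_layers, num_tokens, dim]; sweeping `layer`
    is the layer-wise probing idea."""
    v = hidden_states[layer].mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

def retrieve(query_vec, gallery):
    """Rank gallery rows (unit-norm embeddings) by cosine similarity."""
    return np.argsort(-(gallery @ query_vec))

rng = np.random.default_rng(0)
states = rng.normal(size=(12, 20, 32))        # 12 layers, 20 tokens, dim 32
q = embed(states, layer=6)                    # probe an intermediate layer
gallery = np.stack(
    [embed(rng.normal(size=(12, 20, 32)), 6) for _ in range(5)] + [q]
)
print(int(retrieve(q, gallery)[0]))           # the matching entry ranks first
```

In the paper this probing is done on real MLLM hidden states, and the text-only alignment stage then fine-tunes the mapping without any visual supervision.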
cs.CV / 126 / 2602.08112
MMLSv2: A Multimodal Dataset for Martian Landslide Detection in Remote Sensing Imagery
MMLSv2:用于遥感图像中火星滑坡检测的多模态数据集
Abstract
We present MMLSv2, a dataset for landslide segmentation on Martian surfaces. MMLSv2 consists of multimodal imagery with seven bands: RGB, digital elevation model, slope, thermal inertia, and grayscale channels. MMLSv2 comprises 664 images distributed across training, validation, and test splits. In addition, an isolated test set of 276 images from a geographically disjoint region from the base dataset is released to evaluate spatial generalization. Experiments conducted with multiple segmentation models show that the dataset supports stable training and achieves competitive performance, while still posing challenges in fragmented, elongated, and small-scale landslide regions. Evaluation on the isolated test set leads to a noticeable performance drop, indicating increased difficulty and highlighting its value for assessing model robustness and generalization beyond standard in-distribution settings. Dataset will be available at: https://github.com/MAIN-Lab/MMLS_v2
Chinese Translation
我们提出了MMLSv2,这是一个用于火星表面滑坡分割的数据集。MMLSv2由七个波段的多模态图像组成:RGB、数字高程模型、坡度、热惯量和灰度通道。MMLSv2包含664幅图像,分布于训练、验证和测试集。此外,还发布了一个来自与基础数据集地理上不相交区域的276幅图像的独立测试集,以评估空间泛化能力。通过多种分割模型进行的实验表明,该数据集支持稳定的训练,并取得了具有竞争力的性能,同时在碎片化、狭长和小尺度的滑坡区域仍然存在挑战。在独立测试集上的评估导致性能显著下降,表明难度增加,并突显其在评估模型鲁棒性和超出标准分布设置的泛化能力方面的价值。数据集将在以下网址提供:https://github.com/MAIN-Lab/MMLS_v2
cs.CV / 127 / 2602.08117
Building Damage Detection using Satellite Images and Patch-Based Transformer Methods
基于卫星图像和基于补丁的变换器方法的建筑损伤检测
Abstract
Rapid building damage assessment is critical for post-disaster response. Damage classification models built on satellite imagery provide a scalable means of obtaining situational awareness. However, label noise and severe class imbalance in satellite data create major challenges. The xBD dataset offers a standardized benchmark for building-level damage across diverse geographic regions. In this study, we evaluate Vision Transformer (ViT) model performance on the xBD dataset, specifically investigating how these models distinguish between types of structural damage when training on noisy, imbalanced data. Specifically, we evaluate DINOv2-small and DeiT for multi-class damage classification. We propose a targeted patch-based pre-processing pipeline to isolate structural features and minimize background noise in training. We adopt a frozen-head fine-tuning strategy to keep computational requirements manageable. Model performance is evaluated through accuracy, precision, recall, and macro-averaged F1 scores. We show that small ViT architectures with our novel training method achieve competitive macro-averaged F1 relative to prior CNN baselines for disaster classification.
Chinese Translation
快速的建筑损伤评估对于灾后响应至关重要。基于卫星图像的损伤分类模型提供了一种可扩展的方式来获取情境意识。然而,标签噪声和卫星数据中的严重类别不平衡带来了重大挑战。xBD 数据集为不同地理区域的建筑级损伤提供了一个标准化的基准。在本研究中,我们评估了 Vision Transformer (ViT) 模型在 xBD 数据集上的表现,特别研究了这些模型在噪声和不平衡数据上训练时如何区分不同类型的结构损伤。我们特别评估了 DINOv2-small 和 DeiT 在多类别损伤分类中的表现。我们提出了一种针对性的基于补丁的预处理流程,以隔离结构特征并最小化训练中的背景噪声。我们采用了冻结头部微调策略,以保持计算需求在可管理范围内。模型性能通过准确率、精确率、召回率和宏平均 F1 分数进行评估。我们展示了采用我们新颖训练方法的小型 ViT 架构在灾害分类中相对于先前的 CNN 基线达到了具有竞争力的宏平均 F1 分数。
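Freezing the pretrained encoder and training only a small head is what keeps compute manageable. A self-contained sketch follows, with a fixed random projection standing in for the ViT backbone; labels are constructed to be separable in the frozen feature space so the head demonstrably converges. This is an illustration of the recipe, not the authors' pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "backbone": a fixed projection standing in for a pretrained encoder.
W_frozen = rng.normal(size=(16, 8))
def backbone(x):
    return np.tanh(x @ W_frozen)

def train_head(X, y, classes, lr=0.5, steps=500):
    """Train only a linear softmax head on frozen features by gradient
    descent on the cross-entropy loss; the backbone is never updated."""
    F = backbone(X)
    W = np.zeros((F.shape[1], classes))
    for _ in range(steps):
        logits = F @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * F.T @ (p - np.eye(classes)[y]) / len(y)   # head update only
    return W

X = rng.normal(size=(64, 16))
y = (backbone(X)[:, 0] > 0).astype(int)   # separable in frozen feature space
W = train_head(X, y, classes=2)
acc = float(((backbone(X) @ W).argmax(axis=1) == y).mean())
print(acc)
```

In practice the same pattern applies with DINOv2/DeiT features and a multi-class head over the damage categories.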
cs.CV / 128 / 2602.08126
MambaFusion: Adaptive State-Space Fusion for Multimodal 3D Object Detection
MambaFusion:用于多模态3D物体检测的自适应状态空间融合
Abstract
Reliable 3D object detection is fundamental to autonomous driving, and multimodal fusion algorithms using cameras and LiDAR remain a persistent challenge. Cameras provide dense visual cues but ill-posed depth; LiDAR provides precise 3D structure but sparse coverage. Existing BEV-based fusion frameworks have made good progress, but they struggle with inefficient context modeling, spatially invariant fusion, and reasoning under uncertainty. We introduce MambaFusion, a unified multi-modal detection framework that achieves efficient, adaptive, and physically grounded 3D perception. MambaFusion interleaves selective state-space models (SSMs) with windowed transformers to propagate the global context in linear time while preserving local geometric fidelity. A multi-modal token alignment (MTA) module and reliability-aware fusion gates dynamically re-weight camera-LiDAR features based on spatial confidence and calibration consistency. Finally, a structure-conditioned diffusion head integrates graph-based reasoning with uncertainty-aware denoising, enforcing physical plausibility, and calibrated confidence. MambaFusion establishes new state-of-the-art performance on nuScenes benchmarks while operating with linear-time complexity. The framework demonstrates that coupling SSM-based efficiency with reliability-driven fusion yields robust, temporally stable, and interpretable 3D perception for real-world autonomous driving systems.
Chinese Translation
可靠的3D物体检测是自动驾驶的基础,而使用摄像头和激光雷达的多模态融合算法仍然面临持续的挑战。摄像头提供了密集的视觉线索,但深度信息不准确;激光雷达提供了精确的3D结构,但覆盖范围稀疏。现有的基于鸟瞰视图(BEV)的融合框架取得了良好的进展,但在上下文建模效率、空间不变融合和不确定性推理等方面存在困难。我们提出了MambaFusion,一个统一的多模态检测框架,实现了高效、自适应和物理基础的3D感知。MambaFusion将选择性状态空间模型(SSMs)与窗口变换器交错结合,以线性时间传播全局上下文,同时保持局部几何的保真度。一个多模态标记对齐(MTA)模块和可靠性感知融合门根据空间置信度和校准一致性动态重新加权摄像头-激光雷达特征。最后,一个结构条件扩散头将基于图的推理与不确定性感知去噪结合,强化物理合理性和校准置信度。MambaFusion在nuScenes基准测试中建立了新的最先进性能,同时以线性时间复杂度运行。该框架表明,将基于SSM的效率与以可靠性驱动的融合相结合,可以为现实世界的自动驾驶系统提供稳健、时间稳定且可解释的3D感知。
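A reliability-aware gate reduces, in its simplest form, to a per-cell softmax over sensor confidences. A minimal sketch follows; real confidence maps would come from the network, and the constants here are purely illustrative.

```python
import numpy as np

def fuse(cam_feat, lidar_feat, cam_conf, lidar_conf):
    """Confidence-weighted BEV feature fusion: a per-cell softmax over the
    two sensor confidences gates the camera and LiDAR feature maps."""
    w = np.stack([cam_conf, lidar_conf])        # (2, H, W)
    w = np.exp(w - w.max(axis=0))               # numerically stable softmax
    w /= w.sum(axis=0)
    return w[0][..., None] * cam_feat + w[1][..., None] * lidar_feat

cam = np.ones((4, 4, 8))                        # toy BEV feature maps
lid = np.zeros((4, 4, 8))
out = fuse(cam, lid,
           cam_conf=np.full((4, 4), 10.0),      # camera trusted here
           lidar_conf=np.full((4, 4), -10.0))
print(out.mean())                               # ~1.0: camera dominates
```

Spatially varying confidences make the gate adaptive per BEV cell, in contrast to the spatially invariant fusion the abstract criticizes.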
cs.CV / 129 / 2602.08131
Fields of The World: A Field Guide for Extracting Agricultural Field Boundaries
全球田野:农业田界提取的实用指南
Abstract
Field boundary maps are a building block for agricultural data products and support crop monitoring, yield estimation, and disease estimation. This tutorial presents the Fields of The World (FTW) ecosystem: a benchmark of 1.6M field polygons across 24 countries, pre-trained segmentation models, and command-line inference tools. We provide two notebooks that cover (1) local-scale field boundary extraction with crop classification and forest loss attribution, and (2) country-scale inference using cloud-optimized data. We use MOSAIKS random convolutional features and FTW derived field boundaries to map crop type at the field level and report macro F1 scores of 0.65-0.75 for crop type classification with limited labels. Finally, we show how to explore pre-computed predictions over five countries (4.76M km²), with median predicted field areas from 0.06 ha (Rwanda) to 0.28 ha (Switzerland).
Chinese Translation
田界地图是农业数据产品的基础,支持作物监测、产量估算和病害评估。本文介绍了全球田野(Fields of The World, FTW)生态系统:一个涵盖24个国家的160万块田地多边形的基准数据集、预训练的分割模型和命令行推理工具。我们提供了两个笔记本,涵盖(1)局部尺度的田界提取,包括作物分类和森林损失归因,以及(2)使用云优化数据的国家尺度推理。我们使用MOSAIKS随机卷积特征和FTW派生的田界来在田地级别映射作物类型,并报告在有限标签下的作物类型分类宏观F1分数为0.65至0.75。最后,我们展示了如何探索五个国家(4.76百万平方公里)上预先计算的预测结果,其预测的田地面积中位数从0.06公顷(卢旺达)到0.28公顷(瑞士)不等。
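The MOSAIKS-style featurization used in the FTW tutorial above can be sketched as follows: convolve the image with random filters, apply ReLU, and average-pool each response map into one scalar. Gaussian-sampled filters are a simplifying assumption here (the actual MOSAIKS pipeline samples filters from image patches), and the tiny list-of-lists image is purely illustrative.

```python
import random

def random_conv_features(image, n_filters=8, k=3, seed=0):
    """Toy MOSAIKS-style features: random k x k convolutions,
    ReLU, then average pooling -> an n_filters-dim vector."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    filters = [[[rng.gauss(0, 1) for _ in range(k)] for _ in range(k)]
               for _ in range(n_filters)]
    feats = []
    for f in filters:
        acc, count = 0.0, 0
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                resp = sum(f[a][b] * image[i + a][j + b]
                           for a in range(k) for b in range(k))
                acc += max(0.0, resp)  # ReLU before pooling
                count += 1
        feats.append(acc / count)
    return feats

img = [[(i * j) % 5 / 4.0 for j in range(8)] for i in range(8)]
feats = random_conv_features(img)
```

A linear model (e.g. logistic regression) on such features per field polygon is what yields the few-label crop-type classification the abstract reports.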
cs.CV / 130 / 2602.08136
Robustness of Vision Language Models Against Split-Image Harmful Input Attacks
视觉语言模型对分割图像有害输入攻击的鲁棒性
Abstract
Vision-Language Models (VLMs) are now a core part of modern AI. Recent work proposed several visual jailbreak attacks using single, holistic images. However, contemporary VLMs demonstrate strong robustness against such attacks due to extensive safety alignment through preference optimization (e.g., RLHF). In this work, we identify a new vulnerability: while VLM pretraining and instruction tuning generalize well to split-image inputs, safety alignment is typically performed only on holistic images and does not account for harmful semantics distributed across multiple image fragments. Consequently, VLMs often fail to detect and refuse harmful split-image inputs, where unsafe cues emerge only after combining images. We introduce novel split-image visual jailbreak attacks (SIVA) that exploit this misalignment. Unlike prior optimization-based attacks, which exhibit poor black-box transferability due to mismatches in architecture and priors across models, our attacks evolve in progressive phases from naive splitting to an adaptive white-box attack, culminating in a black-box transfer attack. Our strongest strategy leverages a novel adversarial knowledge distillation (Adv-KD) algorithm to substantially improve cross-model transferability. Evaluations on three state-of-the-art modern VLMs and three jailbreak datasets demonstrate that our strongest attack achieves up to 60% higher transfer success than existing baselines. Lastly, we propose efficient ways to address this critical vulnerability in current VLM safety alignment.
Chinese Translation
视觉语言模型(VLMs)现已成为现代人工智能的核心部分。近期的研究提出了几种使用单一/整体图像的视觉越狱攻击。然而,现代VLMs在这些攻击面前表现出强大的鲁棒性,这得益于通过偏好优化(例如,RLHF)进行的广泛安全对齐。在本研究中,我们识别出一种新的脆弱性:尽管VLM的预训练和指令调优对分割图像输入具有良好的泛化能力,但安全对齐通常仅在整体图像上进行,未考虑分布在多个图像片段中的有害语义。因此,VLMs往往无法检测和拒绝有害的分割图像输入,因为不安全的线索仅在图像合并后才会出现。我们提出了新颖的分割图像视觉越狱攻击(SIVA),利用这种不对齐。与先前基于优化的攻击不同,由于模型之间的架构和先前不匹配,这些攻击在黑箱转移性方面表现不佳,我们的攻击从简单的分割逐步演变为自适应的白箱攻击,最终形成黑箱转移攻击。我们最强的策略利用了一种新颖的对抗知识蒸馏(Adv-KD)算法,显著提高了跨模型的转移性。在对三种最先进的现代VLM和三个越狱数据集的评估中,我们的最强攻击的转移成功率比现有基线高出多达60%。最后,我们提出了有效的方法来解决当前VLM安全对齐中的这一关键脆弱性。
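The naive splitting phase that SIVA starts from is essentially image tiling. A minimal sketch (grid sizes and the list-of-lists image representation are illustrative assumptions): content that only becomes meaningful once tiles are recombined is what per-image safety filters miss.

```python
def split_image(image, rows, cols):
    """Split a 2D image (list of row lists) into rows x cols fragments,
    returned in row-major order. Illustrates why holistic-image safety
    alignment can miss semantics distributed across fragments."""
    h, w = len(image), len(image[0])
    th, tw = h // rows, w // cols
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tiles.append([row[c * tw:(c + 1) * tw]
                          for row in image[r * th:(r + 1) * th]])
    return tiles

img = [[r * 10 + c for c in range(4)] for r in range(4)]
tiles = split_image(img, 2, 2)
```

The adaptive white-box and transfer phases in the paper then optimize the fragment contents; this sketch covers only the splitting itself.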
cs.CV / 131 / 2602.08168
DAS-SK: An Adaptive Model Integrating Dual Atrous Separable and Selective Kernel CNN for Agriculture Semantic Segmentation
DAS-SK:一种集成双重膨胀可分离卷积和选择性核卷积的自适应模型用于农业语义分割
Abstract
Semantic segmentation in high-resolution agricultural imagery demands models that strike a careful balance between accuracy and computational efficiency to enable deployment in practical systems. In this work, we propose DAS-SK, a novel lightweight architecture that retrofits selective kernel convolution (SK-Conv) into the dual atrous separable convolution (DAS-Conv) module to strengthen multi-scale feature learning. The model further enhances the atrous spatial pyramid pooling (ASPP) module, enabling the capture of fine-grained local structures alongside global contextual information. Built upon a modified DeepLabV3 framework with two complementary backbones (MobileNetV3-Large and EfficientNet-B3), the DAS-SK model mitigates limitations associated with large dataset requirements, limited spectral generalization, and the high computational cost that typically restricts deployment on UAVs and other edge devices. Comprehensive experiments across three benchmarks (LandCover.ai, VDD, and PhenoBench) demonstrate that DAS-SK consistently achieves state-of-the-art performance, while being more efficient than CNN-, transformer-, and hybrid-based competitors. Notably, DAS-SK requires up to 21x fewer parameters and 19x fewer GFLOPs than top-performing transformer models. These findings establish DAS-SK as a robust, efficient, and scalable solution for real-time agricultural robotics and high-resolution remote sensing, with strong potential for broader deployment in other vision domains.
Chinese Translation
高分辨率农业图像的语义分割需要模型在准确性和计算效率之间取得谨慎平衡,以便在实际系统中部署。在本研究中,我们提出了DAS-SK,这是一种新颖的轻量级架构,将选择性核卷积(SK-Conv)集成到双重膨胀可分离卷积(DAS-Conv)模块中,以增强多尺度特征学习。该模型进一步增强了膨胀空间金字塔池化(ASPP)模块,使其能够捕捉细粒度的局部结构以及全局上下文信息。DAS-SK模型基于修改后的DeepLabV3框架,采用两个互补的主干网络——MobileNetV3-Large和EfficientNet-B3,缓解了与大数据集需求、有限光谱泛化能力以及通常限制无人机和其他边缘设备部署的高计算成本相关的限制。通过在三个基准数据集:LandCover.ai、VDD和PhenoBench上的全面实验,DAS-SK始终实现了最先进的性能,同时在效率上优于基于CNN、变换器和混合模型的竞争对手。值得注意的是,DAS-SK所需的参数量比顶级变换器模型少多达21倍,GFLOPs少多达19倍。这些发现确立了DAS-SK作为实时农业机器人和高分辨率遥感的强大、高效和可扩展的解决方案,并在其他视觉领域的更广泛部署中具有强大的潜力。
cs.CV / 132 / 2602.08198
PEGAsus: 3D Personalization of Geometry and Appearance
PEGAsus:几何形状和外观的三维个性化
Abstract
We present PEGAsus, a new framework capable of generating Personalized 3D shapes by learning shape concepts at both Geometry and Appearance levels. First, we formulate 3D shape personalization as extracting reusable, category-agnostic geometric and appearance attributes from reference shapes, and composing these attributes with text to generate novel shapes. Second, we design a progressive optimization strategy to learn shape concepts at both the geometry and appearance levels, decoupling the shape concept learning process. Third, we extend our approach to region-wise concept learning, enabling flexible concept extraction, with context-aware and context-free losses. Extensive experimental results show that PEGAsus is able to effectively extract attributes from a wide range of reference shapes and then flexibly compose these concepts with text to synthesize new shapes. This enables fine-grained control over shape generation and supports the creation of diverse, personalized results, even in challenging cross-category scenarios. Both quantitative and qualitative experiments demonstrate that our approach outperforms existing state-of-the-art solutions.
Chinese Translation
我们提出了PEGAsus,一个能够通过在几何和外观层面学习形状概念来生成个性化三维形状的新框架。首先,我们将三维形状个性化形式化为从参考形状中提取可重用的、与类别无关的几何和外观属性,并将这些属性与文本组合以生成新形状。其次,我们设计了一种渐进优化策略,以在几何和外观层面学习形状概念,从而解耦形状概念学习过程。第三,我们将我们的方法扩展到区域级概念学习,支持灵活的概念提取,结合上下文感知和无上下文损失。大量实验结果表明,PEGAsus能够有效地从广泛的参考形状中提取属性,并灵活地将这些概念与文本组合以合成新形状。这使得对形状生成的细粒度控制成为可能,并支持在具有挑战性的跨类别场景中创建多样化的个性化结果。定量和定性实验均表明,我们的方法优于现有的最先进解决方案。
cs.CV / 133 / 2602.08202
Generative Regression for Left Ventricular Ejection Fraction Estimation from Echocardiography Video
基于生成回归的左心室射血分数从超声心动图视频估计
Abstract
Estimating Left Ventricular Ejection Fraction (LVEF) from echocardiograms constitutes an ill-posed inverse problem. Inherent noise, artifacts, and limited viewing angles introduce ambiguity, where a single video sequence may map not to a unique ground truth, but rather to a distribution of plausible physiological values. Prevailing deep learning approaches typically formulate this task as a standard regression problem that minimizes the Mean Squared Error (MSE). However, this paradigm compels the model to learn the conditional expectation, which may yield misleading predictions when the underlying posterior distribution is multimodal or heavy-tailed, a common phenomenon in pathological scenarios. In this paper, we investigate the paradigm shift from deterministic regression toward generative regression. We propose the Multimodal Conditional Score-based Diffusion model for Regression (MCSDR), a probabilistic framework designed to model the continuous posterior distribution of LVEF conditioned on echocardiogram videos and patient demographic attribute priors. Extensive experiments conducted on the EchoNet-Dynamic, EchoNet-Pediatric, and CAMUS datasets demonstrate that MCSDR achieves state-of-the-art performance. Notably, qualitative analysis reveals that the generation trajectories of our model exhibit distinct behaviors in cases characterized by high noise or significant physiological variability, thereby offering a novel layer of interpretability for AI-aided diagnosis.
Chinese Translation
从超声心动图中估计左心室射血分数(LVEF)构成了一个不适定的逆问题。固有的噪声、伪影以及有限的视角引入了模糊性,使得单个视频序列可能并不映射到唯一的真实值,而是映射到一组合理的生理值分布。现有的深度学习方法通常将此任务表述为一个标准的回归问题,旨在最小化均方误差(MSE)。然而,这一范式迫使模型学习条件期望,当潜在的后验分布呈现多模态或重尾特征时,这可能导致误导性的预测——这是病理场景中的常见现象。本文探讨了从确定性回归向生成回归的范式转变。我们提出了基于多模态条件评分的回归扩散模型(MCSDR),这是一个概率框架,旨在建模基于超声心动图视频和患者人口统计属性先验的LVEF连续后验分布。在EchoNet-Dynamic、EchoNet-Pediatric和CAMUS数据集上进行的广泛实验表明,MCSDR达到了最先进的性能。值得注意的是,定性分析揭示了我们模型的生成轨迹在高噪声或显著生理变异的情况下表现出不同的行为,从而为人工智能辅助诊断提供了新的可解释性层面。
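The failure mode of MSE regression that motivates MCSDR is easy to reproduce numerically. A toy bimodal LVEF posterior (mode locations and widths are illustrative assumptions, not values from the paper) shows that the conditional mean an MSE model learns lands in a region the posterior assigns almost no mass to, while a generative sampler stays near the plausible modes.

```python
import random

def posterior_sample(rng):
    """Toy bimodal LVEF posterior: half the mass near 25% (pathological),
    half near 65% (normal). Purely illustrative numbers."""
    mode = 25.0 if rng.random() < 0.5 else 65.0
    return rng.gauss(mode, 2.0)

rng = random.Random(42)
samples = [posterior_sample(rng) for _ in range(10000)]

# An MSE-trained regressor converges to the conditional mean (~45%),
# a value that is itself physiologically implausible for this posterior.
mse_prediction = sum(samples) / len(samples)

# A generative model draws samples concentrated around the true modes.
near_a_mode = sum(1 for s in samples if abs(s - 25) < 8 or abs(s - 65) < 8)
```

This is the sense in which sampling from the posterior is more informative than a point estimate when the posterior is multimodal.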
cs.CV / 134 / 2602.08206
Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation
基于地理空间推理的无词汇遥感语义分割
Abstract
Open-vocabulary semantic segmentation has emerged as a promising research direction in remote sensing, enabling the recognition of diverse land-cover types beyond pre-defined category sets. However, existing methods predominantly rely on the passive mapping of visual features and textual embeddings. This "appearance-based" paradigm lacks geospatial contextual awareness, leading to severe semantic ambiguity and misclassification when encountering land-cover classes with similar spectral features but distinct semantic attributes. To address this, we propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework designed to enhance the scene understanding capabilities of Multimodal Large Language Models (MLLMs), thereby guiding open-vocabulary segmentation models toward precise mapping. The framework comprises two collaborative components: an offline knowledge distillation stream and an online instance reasoning stream. The offline stream establishes fine-grained category interpretation standards to resolve semantic conflicts between similar land-cover types. During online inference, the framework executes a sequential reasoning process involving macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis. This process generates an image-adaptive vocabulary that guides downstream models to achieve pixel-level alignment with correct geographical semantics. Extensive experiments on the LoveDA and GID5 benchmarks demonstrate the superiority of our approach.
Chinese Translation
开放词汇语义分割已成为遥感领域一个有前景的研究方向,使得能够识别超出预定义类别集的多样土地覆盖类型。然而,现有方法主要依赖于视觉特征和文本嵌入的被动映射。这种“基于外观”的范式缺乏地理空间上下文意识,导致在遇到具有相似光谱特征但语义属性不同的土地覆盖类别时出现严重的语义模糊和误分类。为了解决这个问题,我们提出了一种地理空间推理链式思维(Geospatial Reasoning Chain-of-Thought, GR-CoT)框架,旨在增强多模态大型语言模型(Multimodal Large Language Models, MLLMs)的场景理解能力,从而引导开放词汇分割模型实现精确映射。该框架包含两个协作组件:离线知识蒸馏流和在线实例推理流。离线流建立细粒度类别解释标准,以解决相似土地覆盖类型之间的语义冲突。在在线推理过程中,该框架执行一个顺序推理过程,包括宏观场景锚定、视觉特征解耦和知识驱动的决策合成。该过程生成一个图像自适应词汇,指导下游模型实现与正确地理语义的像素级对齐。在LoveDA和GID5基准上的大量实验表明我们的方法具有优越性。
cs.CV / 135 / 2602.08211
Chain-of-Caption: Training-free improvement of multimodal large language model on referring expression comprehension
链式描述:无训练的多模态大型语言模型在指称表达理解上的改进
Abstract
Given a textual description, the task of referring expression comprehension (REC) involves the localisation of the referred object in an image. Multimodal large language models (MLLMs) have achieved high accuracy on REC benchmarks through scaling up the model size and training data. Moreover, the performance of MLLMs can be further improved using techniques such as Chain-of-Thought and tool use, which provide additional visual or textual context to the model. In this paper, we analyse the effect of various techniques for providing additional visual and textual context to the MLLM via tool use, and its effect on the REC task. Furthermore, we propose a training-free framework named Chain-of-Caption to improve the REC performance of MLLMs. We perform experiments on the RefCOCO/RefCOCOg/RefCOCO+ and Ref-L4 datasets and show that individual textual or visual context can improve REC performance without any fine-tuning. By combining multiple contexts, our training-free framework yields between 5% and 30% accuracy gains over the baseline model at various Intersection over Union (IoU) thresholds.
Chinese Translation
给定文本描述,指称表达理解(REC)任务涉及在图像中定位所指对象。多模态大型语言模型(MLLMs)通过扩大模型规模和训练数据,在REC基准测试中取得了高准确率。此外,MLLMs的性能还可以通过使用诸如链式思维(Chain-of-Thought)和工具使用等技术进一步提升,这些技术为模型提供了额外的视觉或文本上下文。在本文中,我们分析了通过工具使用为MLLM提供额外视觉和文本上下文的各种技术的效果及其对REC任务的影响。此外,我们提出了一种名为链式描述(Chain-of-Caption)的无训练框架,以提高MLLM的REC性能。我们在RefCOCO/RefCOCOg/RefCOCO+和Ref-L4数据集上进行了实验,结果表明,单独的文本或视觉上下文可以在不进行任何微调的情况下提高REC性能。通过结合多种上下文,我们的无训练框架在不同的交并比(Intersection over Union, IoU)阈值下,相较于基线模型的准确率提升了5%到30%。
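The accuracy-at-IoU metric the Chain-of-Caption abstract reports is standard and worth making concrete: a prediction counts as correct when its box overlaps the ground truth by at least a threshold. The toy boxes below are illustrative, not from the paper's benchmarks.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(preds, gts, thresh):
    """Fraction of predictions whose IoU with ground truth meets thresh,
    i.e. the REC accuracy reported at thresholds like 0.5."""
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) >= thresh)
    return hits / len(preds)

preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 5, 15, 15)]
```

Reporting the metric at several thresholds, as the paper does, separates "roughly found the object" from "localized it tightly".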
cs.CV / 136 / 2602.08224
Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval
高效SAM2:通过对象感知视觉编码和记忆检索加速SAM2
Abstract
Segment Anything Model 2 (SAM2) shows excellent performance in video object segmentation tasks; however, its heavy computational burden hinders application in real-time video processing. Although there have been efforts to improve the efficiency of SAM2, most focus on retraining a lightweight backbone, with little exploration of post-training acceleration. In this paper, we observe that SAM2 exhibits a sparse perception pattern similar to biological vision, which provides opportunities for eliminating redundant computation and accelerating inference: i) In the mask decoder, attention primarily focuses on foreground objects, whereas the image encoder in the earlier stage exhibits a broad attention span, resulting in unnecessary computation on background regions. ii) In the memory bank, only a small subset of tokens in each frame contributes significantly to memory attention, and the salient regions exhibit temporal consistency, making full-token computation redundant. With these insights, we propose Efficient-SAM2, which promotes SAM2 to adaptively focus on object regions while eliminating task-irrelevant computation, thereby significantly improving inference efficiency. Specifically, for the image encoder, we propose object-aware Sparse Window Routing (SWR), a window-level computation allocation mechanism that leverages consistency and saliency cues from the previous-frame decoder to route background regions into a lightweight shortcut branch. Moreover, for memory attention, we propose object-aware Sparse Memory Retrieval (SMR), which allows only the salient memory tokens in each frame to participate in computation, with the saliency pattern reused from their first recollection. With negligible additional parameters and minimal training overhead, Efficient-SAM2 delivers a 1.68x speedup on the SAM2.1-L model with only a 1.0% accuracy drop on the SA-V test set.
Chinese Translation
Segment Anything Model 2 (SAM2) 在视频对象分割任务中表现出色;然而,巨大的计算负担阻碍了其在实时视频处理中的应用。尽管已有努力提升SAM2的效率,但大多数集中于重新训练轻量级骨干网络,鲜有对训练后加速的探索。在本文中,我们观察到SAM2表现出类似生物视觉的稀疏感知模式,这为消除冗余计算和加速提供了机会:i) 在掩膜解码器中,注意力主要集中在前景对象上,而图像编码器在早期阶段则表现出广泛的注意力范围,导致对背景区域的计算不必要。ii) 在记忆库中,每帧中只有一小部分标记对记忆注意力贡献显著,显著区域表现出时间一致性,使得全标记计算变得冗余。基于这些见解,我们提出了高效SAM2,促进SAM2自适应地聚焦于对象区域,同时消除与任务无关的计算,从而显著提高推理效率。具体而言,对于图像编码器,我们提出了对象感知稀疏窗口路由(SWR),这是一种窗口级计算分配机制,利用来自前一帧解码器的一致性和显著性线索,将背景区域路由到轻量级快捷分支。此外,对于记忆注意力,我们提出了对象感知稀疏记忆检索(SMR),仅允许每帧中的显著记忆标记参与计算,并重用其首次回忆时的显著性模式。通过几乎不增加额外参数和最小的训练开销,高效SAM2在SAM2.1-L模型上实现了1.68倍的加速,同时在SA-V测试集上的准确率仅下降1.0%。
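The core of Sparse Memory Retrieval as described above is a top-k selection over saliency scores. A minimal single-frame sketch (the real module reuses saliency patterns across frames and operates on feature tensors; string tokens and a fixed keep ratio are illustrative assumptions):

```python
def select_salient_tokens(tokens, saliency, keep_ratio=0.25):
    """Keep only the top-k most salient memory tokens and drop the rest
    from the attention computation (the SMR idea, in toy form)."""
    k = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by saliency, highest first.
    order = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    keep = sorted(order[:k])  # restore original token order
    return [tokens[i] for i in keep]

tokens = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7"]
saliency = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.02, 0.7]
kept = select_salient_tokens(tokens, saliency, keep_ratio=0.25)
```

Because only the kept tokens enter memory attention, the attention cost scales with the keep ratio rather than the full token count.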
cs.CV / 137 / 2602.08230
Generating Adversarial Events: A Motion-Aware Point Cloud Framework
生成对抗事件:一种运动感知点云框架
Abstract
Event cameras have been widely adopted in safety-critical domains such as autonomous driving, robotics, and human-computer interaction. A pressing challenge arises from the vulnerability of deep neural networks to adversarial examples, which poses a significant threat to the reliability of event-based systems. Nevertheless, research into adversarial attacks on events is scarce. This is primarily due to the non-differentiable nature of mainstream event representations, which hinders the extension of gradient-based attack methods. In this paper, we propose MA-ADV, a novel Motion-Aware Adversarial framework. To the best of our knowledge, this is the first work to generate adversarial events by leveraging point cloud representations. MA-ADV accounts for high-frequency noise in events and employs a diffusion-based approach to smooth perturbations, while fully leveraging the spatial and temporal relationships among events. Finally, MA-ADV identifies the minimal-cost perturbation through a combination of sample-wise Adam optimization, iterative refinement, and binary search. Extensive experimental results validate that MA-ADV ensures a 100% attack success rate with minimal perturbation cost, and also demonstrate enhanced robustness against defenses, underscoring the critical security challenges facing future event-based perception systems.
Chinese Translation
事件相机已广泛应用于自动驾驶、机器人技术和人机交互等安全关键领域。然而,深度神经网络对对抗样本的脆弱性带来了一个紧迫的挑战,这对基于事件的系统的可靠性构成了重大威胁。然而,针对事件的对抗攻击研究仍然稀缺。这主要是由于主流事件表示的非可微性,阻碍了基于梯度的攻击方法的扩展。在本文中,我们提出了MA-ADV,一种新颖的运动感知对抗框架。根据我们所知,这是首个通过利用点云表示生成对抗事件的研究。MA-ADV考虑了事件中的高频噪声,并采用基于扩散的方法来平滑扰动,同时充分利用事件之间的时空关系。最后,MA-ADV通过样本级Adam优化、迭代精炼和二分搜索的结合,识别出最小成本的扰动。大量实验结果验证了MA-ADV在最小扰动成本下确保100%的攻击成功率,并且在防御面前表现出增强的鲁棒性,突显了未来基于事件的感知系统面临的关键安全挑战。
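The binary-search step in MA-ADV's minimal-cost refinement admits a compact sketch. The `is_adversarial` oracle below is a hypothetical stand-in for "does the perturbed input still fool the model", assumed monotone in the perturbation scale; the 0.37 threshold is invented for the example.

```python
def minimal_scale(is_adversarial, lo=0.0, hi=1.0, iters=30):
    """Binary-search the smallest perturbation scale that still fools the
    model. Invariant: is_adversarial(hi) is True, is_adversarial(lo) is
    False, so the boundary is always bracketed."""
    assert is_adversarial(hi), "upper bound must already succeed"
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if is_adversarial(mid):
            hi = mid  # attack still works: try an even smaller scale
        else:
            lo = mid  # too small: grow back toward hi
    return hi

# Hypothetical oracle: any scale >= 0.37 flips the model's prediction.
scale = minimal_scale(lambda s: s >= 0.37)
```

Thirty halvings pin the boundary to within about 1e-9 of the true minimal scale, which is why binary search is a cheap final step after Adam-based optimization has found some working perturbation.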
cs.CV / 138 / 2602.08236
When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning
何时以及多少想象:基于世界模型的视觉空间推理自适应测试时缩放
Abstract
Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.
Chinese Translation
尽管多模态大型语言模型(MLLMs)取得了快速进展,但当正确答案依赖于场景在未见或替代视角下的表现时,视觉空间推理仍然不可靠。近期的研究通过利用世界模型进行视觉想象来增强推理,但诸如何时实际需要想象、多少想象是有益的以及何时想象变得有害等问题仍然理解不足。在实践中,不加选择的想象可能会增加计算量,甚至通过引入误导性证据来降低性能。在本研究中,我们对测试时的视觉想象进行了深入分析,将其视为空间推理的可控资源。我们研究了何时静态视觉证据足够、何时想象能改善推理,以及过度或不必要的想象如何影响准确性和效率。为支持这一分析,我们引入了AVIC,一个具有世界模型的自适应测试时框架,该框架在选择性调用和缩放视觉想象之前,明确推理当前视觉证据的充分性。在空间推理基准(SAT、MMSI)和一个具身导航基准(R2R)中,我们的结果揭示了想象至关重要、边际或有害的明确场景,并显示选择性控制可以以显著更少的世界模型调用和语言标记匹配或超越固定想象策略。总体而言,我们的研究结果强调了分析和控制测试时想象在高效和可靠的空间推理中的重要性。
cs.CV / 139 / 2602.08262
Moving Beyond Functional Connectivity: Time-Series Modeling for fMRI-Based Brain Disorder Classification
超越功能连接:基于时间序列建模的fMRI脑部疾病分类
Abstract
Functional magnetic resonance imaging (fMRI) enables non-invasive brain disorder classification by capturing blood-oxygen-level-dependent (BOLD) signals. However, most existing methods rely on functional connectivity (FC) via Pearson correlation, which reduces 4D BOLD signals to static 2D matrices, discarding temporal dynamics and capturing only linear inter-regional relationships. In this work, we benchmark state-of-the-art temporal models (e.g., time-series models such as PatchTST, TimesNet, and TimeMixer) on raw BOLD signals across five public datasets. Results show these models consistently outperform traditional FC-based approaches, highlighting the value of directly modeling temporal information such as cycle-like oscillatory fluctuations and drift-like slow baseline trends. Building on this insight, we propose DeCI, a simple yet effective framework that integrates two key principles: (i) Cycle and Drift Decomposition to disentangle cycle and drift within each ROI (Region of Interest); and (ii) Channel-Independence to model each ROI separately, improving robustness and reducing overfitting. Extensive experiments demonstrate that DeCI achieves superior classification accuracy and generalization compared to both FC-based and temporal baselines. Our findings advocate for a shift toward end-to-end temporal modeling in fMRI analysis to better capture complex brain dynamics. The code is available at https://github.com/Levi-Ackman/DeCI.
Chinese Translation
功能性磁共振成像(fMRI)通过捕捉血氧水平依赖(BOLD)信号,实现了非侵入性的脑部疾病分类。然而,现有大多数方法依赖于通过皮尔逊相关性(Pearson correlation)计算的功能连接(FC),这将4D BOLD信号简化为静态的2D矩阵,忽略了时间动态,仅捕捉线性区域间关系。在本研究中,我们在五个公共数据集上对原始BOLD信号进行了最先进的时间模型(例如,时间序列模型如PatchTST、TimesNet和TimeMixer)的基准测试。结果表明,这些模型在性能上始终优于传统的基于FC的方法,突显了直接建模时间信息(如周期性振荡波动和漂移式慢基线趋势)的价值。在此基础上,我们提出了DeCI,这是一个简单而有效的框架,整合了两个关键原则:(i)周期与漂移分解,以解开每个感兴趣区域(ROI)内的周期和漂移;(ii)通道独立性,以单独建模每个ROI,从而提高鲁棒性并减少过拟合。大量实验表明,DeCI在分类准确性和泛化能力上优于基于FC和时间基线的方法。我们的研究结果倡导在fMRI分析中转向端到端的时间建模,以更好地捕捉复杂的脑动态。代码可在https://github.com/Levi-Ackman/DeCI获取。
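DeCI's Cycle and Drift Decomposition principle, described above, can be sketched for a single ROI. A centered moving average is used as the drift estimator here, which is a simplifying assumption (the paper's decomposition is learned); under channel-independence, each ROI's series would be passed through this same routine separately.

```python
import math

def decompose(signal, window=5):
    """Split one ROI's BOLD series into drift (slow baseline trend,
    estimated by a centered moving average) and cycle (the residual
    oscillatory component). signal is a list of floats."""
    half = window // 2
    drift = []
    for t in range(len(signal)):
        lo, hi = max(0, t - half), min(len(signal), t + half + 1)
        drift.append(sum(signal[lo:hi]) / (hi - lo))  # window mean at t
    cycle = [x - d for x, d in zip(signal, drift)]
    return drift, cycle

# Synthetic BOLD-like series: a slow linear trend plus an oscillation.
series = [0.1 * t + math.sin(t) for t in range(50)]
drift, cycle = decompose(series)
```

The two components sum back to the original signal exactly, so the decomposition loses nothing; it only disentangles the drift-like and cycle-like temporal structure the abstract highlights.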
cs.CV / 140 / 2602.08277
PISCO: Precise Video Instance Insertion with Sparse Control
PISCO:基于稀疏控制的精确视频实例插入
Abstract
The landscape of AI video generation is undergoing a pivotal shift: moving beyond general generation, which relies on exhaustive prompt engineering and "cherry-picking", towards fine-grained, controllable generation and high-fidelity post-processing. In professional AI-assisted filmmaking, it is crucial to perform precise, targeted modifications. A cornerstone of this transition is video instance insertion, which requires inserting a specific instance into existing footage while maintaining scene integrity. Unlike traditional video editing, this task imposes several requirements: precise spatial-temporal placement, physically consistent scene interaction, and faithful preservation of the original dynamics, all achieved with minimal user effort. In this paper, we propose PISCO, a video diffusion model for precise video instance insertion with arbitrary sparse keyframe control. PISCO allows users to specify a single keyframe, start-and-end keyframes, or sparse keyframes at arbitrary timestamps, and automatically propagates object appearance, motion, and interaction. To address the severe distribution shift induced by sparse conditioning in pretrained video diffusion models, we introduce Variable-Information Guidance for robust conditioning and Distribution-Preserving Temporal Masking to stabilize temporal generation, together with geometry-aware conditioning for realistic scene adaptation. We further construct PISCO-Bench, a benchmark with verified instance annotations and paired clean background videos, and evaluate performance using both reference-based and reference-free perceptual metrics. Experiments demonstrate that PISCO consistently outperforms strong inpainting and video editing baselines under sparse control, and exhibits clear, monotonic performance improvements as additional control signals are provided. Project page: xiangbogaobarry.github.io/PISCO.
Chinese Translation
人工智能视频生成的格局正在经历一个关键转变:从依赖于详尽提示工程和“挑选”的一般生成,转向细粒度的可控生成和高保真后处理。在专业的人工智能辅助电影制作中,进行精确、针对性的修改至关重要。这一转变的基石是视频实例插入,它要求在保持场景完整性的同时,将特定实例插入现有镜头中。与传统视频编辑不同,这项任务需要满足多个要求:精确的时空定位、物理一致的场景交互,以及对原始动态的忠实保留——所有这些都在最小的用户努力下实现。在本文中,我们提出了PISCO,一种用于精确视频实例插入的基于稀疏关键帧控制的视频扩散模型。PISCO允许用户指定单个关键帧、起止关键帧或任意时间戳的稀疏关键帧,并自动传播对象的外观、运动和交互。为了解决预训练视频扩散模型中因稀疏条件引起的严重分布偏移,我们引入了可变信息引导以实现稳健的条件控制,以及保持分布的时间掩码以稳定时间生成,同时结合几何感知条件以实现真实场景的适应。我们进一步构建了PISCO-Bench,这是一个具有验证实例注释和配对干净背景视频的基准,并使用基于参考和无参考的感知指标评估性能。实验表明,PISCO在稀疏控制下始终优于强大的修复和视频编辑基线,并随着额外控制信号的提供表现出明显的单调性能提升。项目页面:xiangbogaobarry.github.io/PISCO。
cs.CV / 141 / 2602.08282
Tighnari v2: Mitigating Label Noise and Distribution Shift in Multimodal Plant Distribution Prediction via Mixture of Experts and Weakly Supervised Learning
Tighnari v2:通过专家混合和弱监督学习缓解多模态植物分布预测中的标签噪声和分布偏移
Abstract
Large-scale, cross-species plant distribution prediction plays a crucial role in biodiversity conservation, yet modeling efforts in this area still face significant challenges due to the sparsity and bias of observational data. Presence-Absence (PA) data provide accurate and noise-free labels, but are costly to obtain and limited in quantity; Presence-Only (PO) data, by contrast, offer broad spatial coverage and rich spatiotemporal distribution, but suffer from severe label noise in negative samples. To address these real-world constraints, this paper proposes a multimodal fusion framework that fully leverages the strengths of both PA and PO data. We introduce an innovative pseudo-label aggregation strategy for PO data based on the geographic coverage of satellite imagery, enabling geographic alignment between the label space and remote sensing feature space. In terms of model architecture, we adopt Swin Transformer Base as the backbone for satellite imagery, utilize the TabM network for tabular feature extraction, retain the Temporal Swin Transformer for time-series modeling, and employ a stackable serial tri-modal cross-attention mechanism to optimize the fusion of heterogeneous modalities. Furthermore, empirical analysis reveals significant geographic distribution shifts between PA training and test samples, and models trained by directly mixing PO and PA data tend to experience performance degradation due to label noise in PO data. To address this, we draw on the mixture-of-experts paradigm: test samples are partitioned according to their spatial proximity to PA samples, and different models trained on distinct datasets are used for inference and post-processing within each partition. Experiments on the GeoLifeCLEF 2025 dataset demonstrate that our approach achieves superior predictive performance in scenarios with limited PA coverage and pronounced distribution shifts.
Chinese Translation
大规模跨物种植物分布预测在生物多样性保护中发挥着至关重要的作用,但该领域的建模工作仍面临由于观测数据的稀疏性和偏差而带来的重大挑战。存在-缺失(Presence-Absence, PA)数据提供准确且无噪声的标签,但获取成本高且数量有限;而仅存在(Presence-Only, PO)数据则提供广泛的空间覆盖和丰富的时空分布,但在负样本中存在严重的标签噪声。为了解决这些现实世界的限制,本文提出了一种多模态融合框架,充分利用PA和PO数据的优势。我们基于卫星图像的地理覆盖引入了一种创新的PO数据伪标签聚合策略,实现了标签空间与遥感特征空间之间的地理对齐。在模型架构方面,我们采用Swin Transformer Base作为卫星图像的主干,利用TabM网络进行表格特征提取,保留Temporal Swin Transformer进行时间序列建模,并采用可堆叠的串联三模态交叉注意机制来优化异构模态的融合。此外,实证分析揭示了PA训练样本与测试样本之间显著的地理分布偏移,直接混合PO和PA数据训练的模型往往因PO数据中的标签噪声而导致性能下降。为此,我们借鉴专家混合(mixture-of-experts)范式:测试样本根据其与PA样本的空间接近性进行划分,并在每个分区内使用在不同数据集上训练的不同模型进行推理和后处理。在GeoLifeCLEF 2025数据集上的实验表明,我们的方法在PA覆盖有限且分布偏移明显的场景中实现了优越的预测性能。
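The mixture-of-experts routing step in the Tighnari v2 abstract (partitioning test samples by spatial proximity to PA samples) reduces to a nearest-neighbor test. Planar Euclidean distance and a fixed radius are assumptions for illustration; a real geospatial deployment would use geodesic distance.

```python
def route_by_proximity(test_points, pa_points, radius):
    """Partition test samples: those within `radius` of any PA sample go to
    the expert trained on clean PA data; the rest go to the expert trained
    with broader (noisier) PO coverage."""
    def near_pa(p):
        return any((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= radius ** 2
                   for q in pa_points)
    return {"pa_expert": [p for p in test_points if near_pa(p)],
            "po_expert": [p for p in test_points if not near_pa(p)]}

pa = [(0.0, 0.0), (10.0, 10.0)]
tests = [(0.5, 0.5), (9.0, 9.5), (5.0, 5.0)]
routing = route_by_proximity(tests, pa, radius=2.0)
```

The design choice mirrors the abstract's empirical finding: where PA coverage exists, the PA-trained model is more reliable; far from PA samples, the PO-augmented model's broader coverage wins despite its label noise.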
cs.CV / 142 / 2602.08309
CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment
CAE-AV:通过跨模态互动增强改善音视频学习
Abstract
Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods usually amplify irrelevant regions or moments, leading to unstable training and degraded representation quality. To address this challenge, we propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning, which uses two complementary modules, Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE), to mitigate audio-visual misalignment. CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment. In addition, we design lightweight objectives, namely caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization, to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on the AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analyses further validate its robustness against audio-visual misalignment.
Chinese Translation
音视频学习受到由屏幕外源和背景杂乱引起的模态不对齐的影响,目前的方法通常会放大无关区域或时刻,导致训练不稳定和表示质量下降。为了解决这一挑战,我们提出了一种新颖的字幕对齐和一致性引导增强框架(CAE-AV)用于音视频学习,该框架使用了两个互补模块:跨模态一致性引导时空增强(CASTE)和字幕对齐显著性引导增强(CASE),以缓解音视频不对齐问题。CASTE通过评估帧级音视频一致性动态平衡空间和时间关系,确保在不对齐的情况下从前后帧中捕获关键信息。CASE将跨模态语义引导注入选定的时空位置,利用高级语义线索进一步缓解不对齐。此外,我们设计了轻量级目标,包括字幕到模态的信息对比(InfoNCE)、视觉-音频一致性和熵正则化,以指导标记选择并增强跨模态语义对齐。在冻结的主干网络下,CAE-AV在AVE、AVVP、AVS和AVQA基准上实现了最先进的性能,定性分析进一步验证了其对音视频不对齐的鲁棒性。
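The caption-to-modality InfoNCE objective named in the CAE-AV abstract is a standard contrastive loss and can be written down directly. The precomputed similarity matrix and the 0.07 temperature below are illustrative assumptions; in the framework itself, similarities would come from caption and audio/visual token embeddings.

```python
import math

def info_nce(sim_matrix, temperature=0.07):
    """InfoNCE over a batch: row i holds the similarity of caption i to
    every candidate feature, with the positive pair on the diagonal.
    Returns the mean negative log-softmax at the diagonal."""
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])  # -log p(positive | caption i)
    return sum(losses) / len(losses)

aligned = [[1.0, 0.1], [0.1, 1.0]]      # positives dominate -> low loss
misaligned = [[0.1, 1.0], [1.0, 0.1]]   # positives weakest -> high loss
```

Minimizing this loss pulls each caption toward its matching audio-visual tokens and away from the rest of the batch, which is what "strengthen cross-modal semantic alignment" amounts to in practice.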
cs.CV / 143 / 2602.08337
Language-Guided Transformer Tokenizer for Human Motion Generation
基于语言指导的变压器标记器用于人类动作生成
Abstract
In this paper, we focus on motion discrete tokenization, which converts raw motion into compact discrete tokens, a process proven crucial for efficient motion generation. In this paradigm, increasing the number of tokens is a common approach to improving motion reconstruction quality, but more tokens make it more difficult for generative models to learn. To maintain high reconstruction quality while reducing generation complexity, we propose leveraging language to achieve efficient motion tokenization, which we term Language-Guided Tokenization (LG-Tok). LG-Tok aligns natural language with motion at the tokenization stage, yielding compact, high-level semantic representations. This approach not only strengthens both tokenization and detokenization but also simplifies the learning of generative models. Furthermore, existing tokenizers predominantly adopt convolutional architectures, whose local receptive fields struggle to support global language guidance. To this end, we propose a Transformer-based Tokenizer that leverages attention mechanisms to enable effective alignment between language and motion. Additionally, we design a language-drop scheme, in which language conditions are randomly removed during training, enabling the detokenizer to support language-free guidance during generation. On the HumanML3D and Motion-X generation benchmarks, LG-Tok achieves Top-1 scores of 0.542 and 0.582, outperforming state-of-the-art methods (MARDM: 0.500 and 0.528), with FID scores of 0.057 and 0.088, respectively, versus 0.114 and 0.147. LG-Tok-mini uses only half the tokens while maintaining competitive performance (Top-1: 0.521/0.588, FID: 0.085/0.071), validating the efficiency of our semantic representations.
Chinese Translation
在本文中,我们关注于动作的离散标记化,该过程将原始动作转换为紧凑的离散标记——这一过程被证明对高效的动作生成至关重要。在这一范式中,增加标记的数量是提高动作重建质量的常见方法,但更多的标记使生成模型的学习变得更加困难。为了在降低生成复杂性的同时保持高重建质量,我们提出利用语言实现高效的动作标记化,称之为语言指导标记化(Language-Guided Tokenization,LG-Tok)。LG-Tok在标记化阶段将自然语言与动作对齐,产生紧凑的高层语义表示。这种方法不仅增强了标记化和去标记化的效果,还简化了生成模型的学习。此外,现有的标记器主要采用卷积架构,其局部感受野难以支持全局语言指导。为此,我们提出了一种基于变压器的标记器,利用注意力机制实现语言与动作之间的有效对齐。此外,我们设计了一种语言丢弃方案,在训练过程中随机去除语言条件,使去标记器能够在生成过程中支持无语言指导。在HumanML3D和Motion-X生成基准测试中,LG-Tok的Top-1得分分别为0.542和0.582,超越了最先进的方法(MARDM: 0.500和0.528),FID得分分别为0.057和0.088,而后者为0.114和0.147。LG-Tok-mini仅使用一半的标记,同时保持竞争性能(Top-1: 0.521/0.588, FID: 0.085/0.071),验证了我们语义表示的高效性。
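The language-drop scheme mentioned in the LG-Tok abstract is the same trick used for classifier-free guidance: randomly null out the text condition during training so the model also learns the unconditional case. A minimal sketch (the drop probability and the use of `None` as the null condition are assumptions for illustration):

```python
import random

def maybe_drop_language(caption, drop_prob, rng):
    """With probability drop_prob, replace the text condition with a null
    token (None here) so the detokenizer also trains without language
    guidance and can decode motion language-free at generation time."""
    return None if rng.random() < drop_prob else caption

rng = random.Random(0)
batch = [maybe_drop_language("a person walks", 0.3, rng) for _ in range(1000)]
dropped = sum(1 for c in batch if c is None)
```

Around 30% of training samples see no caption, which is what lets the same detokenizer serve both language-conditioned and language-free generation.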
cs.CV / 144 / 2602.08342
UrbanGraphEmbeddings: Learning and Evaluating Spatially Grounded Multimodal Embeddings for Urban Science
城市图嵌入:学习和评估空间基础的多模态嵌入用于城市科学
Abstract
Learning transferable multimodal embeddings for urban environments is challenging because urban understanding is inherently spatial, yet existing datasets and benchmarks lack explicit alignment between street-view images and urban structure. We introduce UGData, a spatially grounded dataset that anchors street-view images to structured spatial graphs and provides graph-aligned supervision via spatial reasoning paths and spatial context captions, exposing distance, directionality, connectivity, and neighborhood context beyond image content. Building on UGData, we propose UGE, a two-stage training strategy that progressively and stably aligns images, text, and spatial structures by combining instruction-guided contrastive learning with graph-based spatial encoding. We finally introduce UGBench, a comprehensive benchmark to evaluate how spatially grounded embeddings support diverse urban understanding tasks, including geolocation ranking, image retrieval, urban perception, and spatial grounding. We develop UGE on multiple state-of-the-art VLM backbones, including Qwen2-VL, Qwen2.5-VL, Phi-3-Vision, and LLaVA1.6-Mistral, and train fixed-dimensional spatial embeddings with LoRA tuning. UGE built upon the Qwen2.5-VL-7B backbone achieves up to 44% improvement in image retrieval and 30% in geolocation ranking on training cities, and over 30% and 22% gains respectively on held-out cities, demonstrating the effectiveness of explicit spatial grounding for spatially intensive urban tasks.
Chinese Translation
为城市环境学习可转移的多模态嵌入是具有挑战性的,因为城市理解本质上是空间性的,而现有的数据集和基准缺乏街景图像与城市结构之间的明确对齐。我们引入了UGData,这是一个空间基础的数据集,将街景图像锚定到结构化的空间图上,并通过空间推理路径和空间上下文标题提供图对齐的监督,揭示了超越图像内容的距离、方向性、连通性和邻里上下文。在UGData的基础上,我们提出了UGE,一种两阶段的训练策略,通过结合指导性对比学习与基于图的空间编码,逐步且稳定地对齐图像、文本和空间结构。最后,我们介绍了UGBench,这是一个综合基准,用于评估空间基础的嵌入如何支持多样的城市理解任务,包括地理定位排名、图像检索、城市感知和空间对齐。我们在多个最先进的视觉语言模型(VLM)骨干网络上开发了UGE,包括Qwen2-VL、Qwen2.5-VL、Phi-3-Vision和LLaVA1.6-Mistral,并通过LoRA调优训练固定维度的空间嵌入。基于Qwen2.5-VL-7B骨干网络构建的UGE在训练城市的图像检索中实现了高达44%的提升,在地理定位排名中提升了30%,在保留城市中分别获得了超过30%和22%的增益,证明了显式空间对齐在空间密集型城市任务中的有效性。
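UGE's first stage combines instruction-guided contrastive learning with graph-based spatial encoding. The generic image-text contrastive (InfoNCE) objective underlying such a stage can be sketched in a few lines; this is a minimal standard formulation for illustration only, not the paper's instruction-guided variant, and the temperature value is an assumption:

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Minimal image-text contrastive (InfoNCE) loss sketch.

    Generic formulation for illustration; row i of img_emb is assumed to be
    paired with row i of txt_emb. Not UGE's exact objective."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) pairwise similarities
    # log-softmax over each row; diagonal entries are the matched pairs
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Perfectly aligned pairs drive the loss toward zero, while mismatched pairs raise it, which is the gradient signal that pulls image and text embeddings together.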
cs.CV / 145 / 2602.08346
What, Whether and How? Unveiling Process Reward Models for Thinking with Images Reasoning
什么、是否以及如何?揭示图像思维推理的过程奖励模型
Abstract
The rapid advancement of Large Vision Language Models (LVLMs) has demonstrated excellent abilities in various visual tasks. Building upon these developments, the thinking with images paradigm has emerged, enabling models to dynamically edit and re-encode visual information at each reasoning step, mirroring human visual processing. However, this paradigm introduces significant challenges as diverse errors may occur during reasoning processes. This necessitates Process Reward Models (PRMs) for distinguishing positive and negative reasoning steps, yet existing benchmarks for PRMs are predominantly text-centric and lack comprehensive assessment under this paradigm. To address these gaps, this work introduces the first comprehensive benchmark specifically designed for evaluating PRMs under the thinking with images paradigm. Our main contributions are: (1) Through extensive analysis of reasoning trajectories and guided search experiments with PRMs, we define 7 fine-grained error types and demonstrate both the necessity for specialized PRMs and the potential for improvement. (2) We construct a comprehensive benchmark comprising 1,206 manually annotated thinking with images reasoning trajectories spanning 4 categories and 16 subcategories for fine-grained evaluation of PRMs. (3) Our experimental analysis reveals that current LVLMs fall short as effective PRMs, exhibiting limited capabilities in visual reasoning process evaluation with significant performance disparities across error types, positive evaluation bias, and sensitivity to reasoning step positions. These findings demonstrate the effectiveness of our benchmark and establish crucial foundations for advancing PRMs in LVLMs.
Chinese Translation
大型视觉语言模型(LVLMs)的快速发展在各种视觉任务中展现了卓越的能力。在这些发展的基础上,图像思维范式应运而生,使模型能够在每个推理步骤中动态编辑和重新编码视觉信息,反映人类的视觉处理过程。然而,这一范式带来了重大挑战,因为在推理过程中可能会出现多种错误。这就需要过程奖励模型(PRMs)来区分正向和负向的推理步骤,但现有的PRMs基准主要以文本为中心,缺乏在这一范式下的全面评估。为了解决这些问题,本研究提出了首个专门设计用于评估图像思维范式下PRMs的综合基准。我们的主要贡献包括:(1)通过对推理轨迹的广泛分析和PRMs的引导搜索实验,我们定义了7种细粒度错误类型,并展示了专门PRMs的必要性和改进潜力。(2)我们构建了一个综合基准,包括1,206条手动标注的图像思维推理轨迹,涵盖4个类别和16个子类别,以便对PRMs进行细粒度评估。(3)我们的实验分析表明,当前的LVLMs在作为有效PRMs方面存在不足,在视觉推理过程评估中能力有限,在不同错误类型之间存在显著的性能差异,表现出正向评估偏差,并对推理步骤的位置敏感。这些发现证明了我们基准的有效性,并为推动LVLMs中的PRMs发展奠定了重要基础。
cs.CV / 146 / 2602.08355
E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs
E-VAds:一个针对多模态大语言模型的电子商务短视频理解基准
Abstract
E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a \textbf{multi-modal information density assessment framework} to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce \textbf{E-commerce Video Ads Benchmark (E-VAds)}, which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. These questions are organized into two primary dimensions, namely Perception and Cognition and Reasoning, which consist of five distinct tasks. Finally, we develop \textbf{E-VAds-R1}, an RL-based reasoning model featuring a multi-grained reward design called \textbf{MG-GRPO}. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.
Chinese Translation
电子商务短视频代表了在线视频行业中一个高收入的细分市场,其特点是目标驱动的格式和密集的多模态信号。当前模型在处理这些视频时常常面临困难,因为现有基准主要集中于通用任务,而忽视了商业意图的推理。在本研究中,我们首先提出了一个\textbf{多模态信息密度评估框架},以量化该领域的复杂性。我们的评估结果显示,与主流数据集相比,电子商务内容在视觉、音频和文本模态上表现出显著更高的密度,为视频理解建立了一个更具挑战性的前沿。为了解决这一差距,我们推出了\textbf{电子商务视频广告基准(E-VAds)},这是第一个专门为电子商务短视频理解设计的基准。我们从淘宝上策划了3961个高质量视频,涵盖了广泛的产品类别,并使用多智能体系统生成了19785对开放式问答。这些问题被组织成两个主要维度,即感知与认知和推理,包含五个不同的任务。最后,我们开发了\textbf{E-VAds-R1},一种基于强化学习的推理模型,具有一种名为\textbf{MG-GRPO}的多粒度奖励设计。这一策略为早期探索提供了平滑的指导,同时为专家级精度创造了非线性激励。实验结果表明,E-VAds-R1在商业意图推理方面实现了109.2%的性能提升,仅使用了几百个训练样本。
cs.CV / 147 / 2602.08388
Geometric Image Editing via Effects-Sensitive In-Context Inpainting with Diffusion Transformers
通过扩散变换器进行效果敏感的上下文内修复的几何图像编辑
Abstract
Recent advances in diffusion models have significantly improved image editing. However, challenges persist in handling geometric transformations, such as translation, rotation, and scaling, particularly in complex scenes. Existing approaches suffer from two main limitations: (1) difficulty in achieving accurate geometric editing of object translation, rotation, and scaling; (2) inadequate modeling of intricate lighting and shadow effects, leading to unrealistic results. To address these issues, we propose GeoEdit, a framework that leverages in-context generation through a diffusion transformer module, which integrates geometric transformations for precise object edits. Moreover, we introduce Effects-Sensitive Attention, which enhances the modeling of intricate lighting and shadow effects for improved realism. To further support training, we construct RS-Objects, a large-scale geometric editing dataset containing over 120,000 high-quality image pairs, enabling the model to learn precise geometric editing while generating realistic lighting and shadows. Extensive experiments on public benchmarks demonstrate that GeoEdit consistently outperforms state-of-the-art methods in terms of visual quality, geometric accuracy, and realism.
Chinese Translation
最近在扩散模型方面的进展显著改善了图像编辑。然而,在处理几何变换(如平移、旋转和缩放)时,尤其是在复杂场景中,仍然存在挑战。现有方法主要面临两个限制:(1)在实现物体的平移、旋转和缩放的准确几何编辑方面存在困难;(2)对复杂光照和阴影效果的建模不足,导致结果不够真实。为了解决这些问题,我们提出了GeoEdit,一个利用扩散变换器模块进行上下文生成的框架,该模块集成了几何变换以实现精确的物体编辑。此外,我们引入了效果敏感注意力(Effects-Sensitive Attention),增强了复杂光照和阴影效果的建模,从而提高了真实感。为了进一步支持训练,我们构建了RS-Objects,一个包含超过120,000对高质量图像的大规模几何编辑数据集,使模型能够在生成真实光照和阴影的同时学习精确的几何编辑。在公共基准上的大量实验表明,GeoEdit在视觉质量、几何准确性和真实感方面始终优于最先进的方法。
cs.CV / 148 / 2602.08395
D$^2$-VR: Degradation-Robust and Distilled Video Restoration with Synergistic Optimization Strategy
D$^2$-VR:具有协同优化策略的抗退化和精简视频恢复
Abstract
The integration of diffusion priors with temporal alignment has emerged as a transformative paradigm for video restoration, delivering fantastic perceptual quality, yet the practical deployment of such frameworks is severely constrained by prohibitive inference latency and temporal instability when confronted with complex real-world degradations. To address these limitations, we propose \textbf{D$^2$-VR}, a single-image diffusion-based video-restoration framework with low-step inference. To obtain precise temporal guidance under severe degradation, we first design a Degradation-Robust Flow Alignment (DRFA) module that leverages confidence-aware attention to filter unreliable motion cues. We then incorporate an adversarial distillation paradigm to compress the diffusion sampling trajectory into a rapid few-step regime. Finally, a synergistic optimization strategy is devised to harmonize perceptual quality with rigorous temporal consistency. Extensive experiments demonstrate that D$^2$-VR achieves state-of-the-art performance while accelerating the sampling process by \textbf{12$\times$}.
Chinese Translation
将扩散先验与时间对齐相结合,已成为视频恢复的变革性范式,提供了出色的感知质量,然而在面对复杂的现实世界退化时,此类框架的实际应用受到高推理延迟和时间不稳定性的严重限制。为了解决这些问题,我们提出了\textbf{D$^2$-VR},一种基于单幅图像的扩散视频恢复框架,具有低步骤推理。为了在严重退化下获得精确的时间指导,我们首先设计了一个抗退化流对齐(DRFA)模块,该模块利用基于置信度的注意力来过滤不可靠的运动线索。然后,我们结合对抗蒸馏范式,将扩散采样轨迹压缩为快速的少步骤模式。最后,设计了一种协同优化策略,以协调感知质量与严格的时间一致性。大量实验表明,D$^2$-VR在加速采样过程方面实现了\textbf{12$\times$}的提升,同时达到了最先进的性能。
cs.CV / 149 / 2602.08397
RealSynCol: a high-fidelity synthetic colon dataset for 3D reconstruction applications
RealSynCol:用于三维重建应用的高保真合成结肠数据集
Abstract
Deep learning has the potential to improve colonoscopy by enabling 3D reconstruction of the colon, providing a comprehensive view of mucosal surfaces and lesions, and facilitating the identification of unexplored areas. However, the development of robust methods is limited by the scarcity of large-scale ground truth data. We propose RealSynCol, a highly realistic synthetic dataset designed to replicate the endoscopic environment. Colon geometries extracted from 10 CT scans were imported into a virtual environment that closely mimics intraoperative conditions and rendered with realistic vascular textures. The resulting dataset comprises 28\,130 frames, paired with ground truth depth maps, optical flow, 3D meshes, and camera trajectories. A benchmark study was conducted to evaluate the available synthetic colon datasets for the tasks of depth and pose estimation. Results demonstrate that the high realism and variability of RealSynCol significantly enhance generalization performance on clinical images, proving it to be a powerful tool for developing deep learning algorithms to support endoscopic diagnosis.
Chinese Translation
深度学习有潜力通过实现结肠的三维重建来改善结肠镜检查,提供对粘膜表面和病变的全面视图,并促进对未探索区域的识别。然而,强大方法的发展受到大规模真实数据稀缺的限制。我们提出了RealSynCol,一个高度真实的合成数据集,旨在复制内窥镜环境。从10个CT扫描中提取的结肠几何形状被导入一个紧密模拟术中条件的虚拟环境中,并以真实的血管纹理进行渲染。生成的数据集包含28,130帧,配有真实深度图、光流、三维网格和相机轨迹。我们进行了基准研究,以评估现有合成结肠数据集在深度和姿态估计任务中的表现。结果表明,RealSynCol的高真实感和多样性显著提升了在临床图像上的泛化性能,证明其是开发深度学习算法以支持内窥镜诊断的强大工具。
cs.CV / 150 / 2602.08430
Understanding and Optimizing Attention-Based Sparse Matching for Diverse Local Features
理解与优化基于注意力的稀疏匹配以适应多样化的局部特征
Abstract
We revisit the problem of training attention-based sparse image matching models for various local features. We first identify one critical design choice that has been previously overlooked, which significantly impacts the performance of the LightGlue model. We then investigate the role of detectors and descriptors within the transformer-based matching framework, finding that detectors, rather than descriptors, are often the primary cause for performance difference. Finally, we propose a novel approach to fine-tune existing image matching models using keypoints from a diverse set of detectors, resulting in a universal, detector-agnostic model. When deployed as a zero-shot matcher for novel detectors, the resulting model achieves or exceeds the accuracy of models specifically trained for those features. Our findings offer valuable insights for the deployment of transformer-based matching models and the future design of local features.
Chinese Translation
我们重新审视了为各种局部特征训练基于注意力的稀疏图像匹配模型的问题。首先,我们识别出一个之前被忽视的重要设计选择,这对LightGlue模型的性能产生了显著影响。接着,我们探讨了在基于变换器的匹配框架中,检测器和描述符的作用,发现检测器而非描述符通常是性能差异的主要原因。最后,我们提出了一种新方法,通过使用来自多样化检测器的关键点来微调现有的图像匹配模型,从而实现一个通用的、不依赖于检测器的模型。当作为新检测器的零样本匹配器部署时,所得到的模型达到了或超过了专门为这些特征训练的模型的准确性。我们的研究结果为基于变换器的匹配模型的部署以及局部特征的未来设计提供了宝贵的见解。
cs.CV / 151 / 2602.08439
Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition
Demo-ICL:用于程序性视频知识获取的上下文学习
Abstract
Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on models' static, internal knowledge, rather than their ability to learn and adapt from dynamic, novel contexts from few examples. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task focused on learning from in-context demonstrations to answer questions about the target videos. Alongside this, we propose Demo-ICL-Bench, a challenging benchmark designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1200 instructional YouTube videos with associated questions, from which two types of demonstrations are derived: (i) summarizing video subtitles for text demonstration; and (ii) corresponding instructional videos as video demonstrations. To effectively tackle this new challenge, we develop Demo-ICL, an MLLM with a two-stage training strategy: video-supervised fine-tuning and information-assisted direct preference optimization, jointly enhancing the model's ability to learn from in-context examples. Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench, demonstrate the effectiveness of Demo-ICL, and thereby unveil future research directions.
Chinese Translation
尽管近期的多模态大型语言模型(MLLMs)在视频理解能力上不断提升,但现有的视频基准主要评估模型的静态内部知识,而非其从少量示例中学习和适应动态新环境的能力。为了解决这一问题,我们提出了基于示例驱动的视频上下文学习(Demo-driven Video In-Context Learning),这是一个专注于通过上下文示例学习以回答关于目标视频问题的新任务。同时,我们提出了Demo-ICL-Bench,这是一个旨在评估示例驱动视频上下文学习能力的挑战性基准。Demo-ICL-Bench由1200个带有相关问题的教学YouTube视频构成,从中派生出两种类型的示例:(i)视频字幕的文本总结作为文本示例;(ii)相应的教学视频作为视频示例。为了有效应对这一新挑战,我们开发了Demo-ICL,这是一种具有两阶段训练策略的MLLM:视频监督微调和信息辅助直接偏好优化,联合提升模型从上下文示例中学习的能力。与最先进的MLLMs进行的大量实验确认了Demo-ICL-Bench的难度,展示了Demo-ICL的有效性,从而揭示了未来的研究方向。
cs.CV / 152 / 2602.08448
Vista: Scene-Aware Optimization for Streaming Video Question Answering under Post-Hoc Queries
Vista:基于场景感知的流视频问答后置查询优化
Abstract
Streaming video question answering (Streaming Video QA) poses distinct challenges for multimodal large language models (MLLMs), as video frames arrive sequentially and user queries can be issued at arbitrary time points. Existing solutions relying on fixed-size memory or naive compression often suffer from context loss or memory overflow, limiting their effectiveness in long-form, real-time scenarios. We present Vista, a novel framework for scene-aware streaming video QA that enables efficient and scalable reasoning over continuous video streams. The innovation of Vista can be summarized in three aspects: (1) scene-aware segmentation, where Vista dynamically clusters incoming frames into temporally and visually coherent scene units; (2) scene-aware compression, where each scene is compressed into a compact token representation and stored in GPU memory for efficient index-based retrieval, while full-resolution frames are offloaded to CPU memory; and (3) scene-aware recall, where relevant scenes are selectively recalled and reintegrated into the model input upon receiving a query, enabling both efficiency and completeness. Vista is model-agnostic and integrates seamlessly with a variety of vision-language backbones, enabling long-context reasoning without compromising latency or memory efficiency. Extensive experiments on StreamingBench demonstrate that Vista achieves state-of-the-art performance, establishing a strong baseline for real-world streaming video understanding.
Chinese Translation
流视频问答(Streaming Video QA)对多模态大型语言模型(MLLMs)提出了独特的挑战,因为视频帧是顺序到达的,用户查询可以在任意时间点发出。现有依赖固定大小内存或简单压缩的解决方案往往面临上下文丢失或内存溢出的问题,限制了其在长时段实时场景中的有效性。我们提出了Vista,一个新颖的基于场景感知的流视频问答框架,能够在连续的视频流上实现高效且可扩展的推理。Vista的创新可以总结为三个方面:(1)场景感知分割,Vista动态地将传入的帧聚类为时间和视觉上连贯的场景单元;(2)场景感知压缩,每个场景被压缩为紧凑的标记表示并存储在GPU内存中,以便高效的基于索引的检索,而全分辨率帧则被卸载到CPU内存;(3)场景感知召回,在接收到查询时,有关场景被选择性地召回并重新整合到模型输入中,从而实现效率和完整性的平衡。Vista是模型无关的,可以与多种视觉-语言骨干网络无缝集成,实现长上下文推理而不影响延迟或内存效率。在StreamingBench上的大量实验表明,Vista达到了最先进的性能,为现实世界的流视频理解建立了强有力的基准。
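Vista's three mechanisms (scene-aware segmentation, compression, and recall) can be caricatured as a small buffer over frame embeddings. The sketch below is an illustrative assumption, not Vista's actual modules: scenes are cut at a cosine-similarity threshold, "compressed" by mean pooling, and recalled by similarity to the query embedding:

```python
import numpy as np

class SceneBuffer:
    """Toy scene-aware stream buffer (hypothetical; threshold, pooling and
    retrieval below stand in for Vista's learned components)."""

    def __init__(self, sim_threshold=0.8):
        self.sim_threshold = sim_threshold
        self.scenes = []   # compact per-scene summaries ("compressed tokens")
        self.current = []  # frame embeddings of the currently open scene

    def _close_scene(self):
        if self.current:
            # "compression": mean-pool the scene's frame embeddings
            self.scenes.append(np.mean(self.current, axis=0))
            self.current = []

    def add_frame(self, emb):
        emb = emb / np.linalg.norm(emb)
        # visual discontinuity with the previous frame opens a new scene unit
        if self.current and float(self.current[-1] @ emb) < self.sim_threshold:
            self._close_scene()
        self.current.append(emb)

    def recall(self, query_emb, k=1):
        """Return indices of the k scenes most similar to the query."""
        self._close_scene()
        q = query_emb / np.linalg.norm(query_emb)
        sims = [float(s / np.linalg.norm(s) @ q) for s in self.scenes]
        order = np.argsort(sims)[::-1][:k]
        return sorted(int(i) for i in order)
```

In the full system the compact summaries would live in GPU memory for index-based retrieval while full-resolution frames are offloaded, which is what keeps memory bounded on long streams.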
cs.CV / 153 / 2602.08462
TriC-Motion: Tri-Domain Causal Modeling Grounded Text-to-Motion Generation
TriC-Motion:基于三域因果建模的文本到运动生成
Abstract
Text-to-motion generation, a rapidly evolving field in computer vision, aims to produce realistic and text-aligned motion sequences. Current methods primarily focus on spatial-temporal modeling or independent frequency domain analysis, lacking a unified framework for joint optimization across spatial, temporal, and frequency domains. This limitation hinders the model's ability to leverage information from all domains simultaneously, leading to suboptimal generation quality. Additionally, in motion generation frameworks, motion-irrelevant cues caused by noise are often entangled with features that contribute positively to generation, thereby leading to motion distortion. To address these issues, we propose Tri-Domain Causal Text-to-Motion Generation (TriC-Motion), a novel diffusion-based framework integrating spatial-temporal-frequency-domain modeling with causal intervention. TriC-Motion includes three core modeling modules for domain-specific modeling, namely Temporal Motion Encoding, Spatial Topology Modeling, and Hybrid Frequency Analysis. After comprehensive modeling, a Score-guided Tri-domain Fusion module integrates valuable information from the triple domains, simultaneously ensuring temporal consistency, spatial topology, motion trends, and dynamics. Moreover, the Causality-based Counterfactual Motion Disentangler is meticulously designed to expose motion-irrelevant cues to eliminate noise, disentangling the real modeling contributions of each domain for superior generation. Extensive experimental results validate that TriC-Motion achieves superior performance compared to state-of-the-art methods, attaining an outstanding R@1 of 0.612 on the HumanML3D dataset. These results demonstrate its capability to generate high-fidelity, coherent, diverse, and text-aligned motion sequences. Code is available at: https://caoyiyang1105.github.io/TriC-Motion/.
Chinese Translation
文本到运动生成是计算机视觉领域一个快速发展的方向,旨在生成逼真且与文本对齐的运动序列。目前的方法主要集中于时空建模或独立的频域分析,缺乏一个统一的框架来实现时空和频域的联合优化。这一局限性妨碍了模型同时利用所有域的信息,从而导致生成质量不佳。此外,在运动生成框架中,由噪声引起的与运动无关的线索常常与对生成有积极贡献的特征交织在一起,从而导致运动失真。为了解决这些问题,我们提出了三域因果文本到运动生成(TriC-Motion),这是一种新颖的基于扩散的框架,结合了时空-频域建模与因果干预。TriC-Motion包括三个核心建模模块,用于特定领域的建模,分别是时间运动编码、空间拓扑建模和混合频率分析。在全面建模之后,基于评分的三域融合模块整合了来自三个域的有价值信息,同时确保时间一致性、空间拓扑、运动趋势和动态。此外,基于因果性的反事实运动解耦器经过精心设计,以揭示与运动无关的线索以消除噪声,解耦每个域的真实建模贡献,从而实现更优的生成。大量实验结果验证了TriC-Motion相比于最先进的方法取得了优越的性能,在HumanML3D数据集上达到了0.612的卓越R@1。这些结果证明了其生成高保真、一致、多样且与文本对齐的运动序列的能力。代码可在以下链接获取:https://caoyiyang1105.github.io/TriC-Motion/
cs.CV / 154 / 2602.08479
Gesture Matters: Pedestrian Gesture Recognition for AVs Through Skeleton Pose Evaluation
手势的重要性:通过骨架姿态评估实现自动驾驶车辆的行人手势识别
Abstract
Gestures are a key component of non-verbal communication in traffic, often helping pedestrian-to-driver interactions when formal traffic rules may be insufficient. This problem becomes more apparent when autonomous vehicles (AVs) struggle to interpret such gestures. In this study, we present a gesture classification framework using 2D pose estimation applied to real-world video sequences from the WIVW dataset. We categorise gestures into four primary classes (Stop, Go, Thank & Greet, and No Gesture) and extract 76 static and dynamic features from normalised keypoints. Our analysis demonstrates that hand position and movement velocity are especially discriminative in distinguishing between gesture classes, achieving a classification accuracy score of 87%. These findings not only improve the perceptual capabilities of AV systems but also contribute to the broader understanding of pedestrian behaviour in traffic contexts.
Chinese Translation
手势是交通中非语言交流的关键组成部分,通常在正式交通规则不足时帮助行人与驾驶员之间的互动。当自动驾驶车辆(AVs)难以解读这些手势时,这一问题变得更加明显。在本研究中,我们提出了一种手势分类框架,利用2D姿态估计应用于来自WIVW数据集的真实视频序列。我们将手势分为四个主要类别(停止、前进、感谢与问候,以及无手势),并从标准化关键点中提取76个静态和动态特征。我们的分析表明,手的位置和运动速度在区分手势类别方面尤其具有辨别力,分类准确率达87%。这些发现不仅提升了自动驾驶系统的感知能力,还对交通环境中行人行为的更广泛理解做出了贡献。
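The abstract reports that hand position and movement velocity are the most discriminative of the 76 static and dynamic features. A minimal sketch of how such features could be computed from normalised 2D keypoints; the joint indices and feature names are assumptions for illustration, not the study's actual feature set:

```python
import numpy as np

def gesture_features(keypoints):
    """Sketch of static/dynamic pose features for gesture classification.

    keypoints: array of shape (T, J, 2), normalised (x, y) per frame.
    Joint layout is assumed: index 0 = wrist, index 1 = shoulder."""
    wrist, shoulder = keypoints[:, 0], keypoints[:, 1]
    rel = wrist - shoulder                 # static: hand position vs shoulder
    vel = np.diff(wrist, axis=0)           # dynamic: frame-to-frame velocity
    speeds = np.linalg.norm(vel, axis=1) if len(vel) else np.zeros(1)
    return {
        "mean_rel_x": float(rel[:, 0].mean()),
        "mean_rel_y": float(rel[:, 1].mean()),
        "mean_speed": float(speeds.mean()),
        "max_speed": float(speeds.max()),
    }
```

A raised, fast-moving hand (large relative height, high speed) would then separate "Stop"/"Go" from the near-static "No Gesture" class in a downstream classifier.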
cs.CV / 155 / 2602.08491
Enhanced Food Category Recognition under Illumination-Induced Domain Shift
在光照引起的领域转移下增强食品类别识别
Abstract
Visual food recognition systems deployed in real-world environments, such as automated conveyor-belt inspection, are highly sensitive to domain shifts caused by illumination changes. While recent studies have shown that lighting variations can significantly distort food perception by both humans and AI, existing works are often limited to single food categories or controlled settings, and most public food datasets lack explicit illumination annotations. In this work, we investigate illumination-induced domain shift in multi-class food category recognition using two widely adopted datasets, Food-101 and Fruits-360. We demonstrate substantial accuracy degradation under cross-dataset evaluation due to mismatched visual conditions. To address this challenge, we construct synthetic illumination-augmented datasets by systematically varying light temperature and intensity, enabling controlled robustness analysis without additional labels. We further evaluate cross-dataset transfer learning and domain generalization, with a focus on illumination-sensitive target categories such as apple-based classes. Experimental results show that illumination-aware augmentation significantly improves recognition robustness under domain shift while preserving real-time performance. Our findings highlight the importance of illumination robustness and provide practical insights for deploying reliable food recognition systems in real-world inspection scenarios.
Chinese Translation
在现实环境中部署的视觉食品识别系统,如自动化传送带检查,对光照变化引起的领域转移高度敏感。尽管近期研究表明,光照变化会显著扭曲人类和人工智能对食品的感知,但现有研究往往局限于单一食品类别或受控环境,并且大多数公共食品数据集缺乏明确的光照注释。在本研究中,我们使用两个广泛采用的数据集,Food-101 和 Fruits-360,探讨多类别食品识别中的光照引起的领域转移。我们展示了在跨数据集评估中,由于视觉条件不匹配,准确率显著下降。为了解决这一挑战,我们通过系统性地改变光温和强度构建了合成的光照增强数据集,从而在没有额外标签的情况下实现受控的鲁棒性分析。我们进一步评估了跨数据集迁移学习和领域泛化,重点关注对光照敏感的目标类别,如基于苹果的类别。实验结果表明,考虑光照的增强显著提高了在领域转移下的识别鲁棒性,同时保持了实时性能。我们的研究结果强调了光照鲁棒性的重要性,并为在现实检查场景中部署可靠的食品识别系统提供了实用见解。
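The synthetic illumination augmentation described above (systematically varying light temperature and intensity) can be approximated with per-channel gains; a toy sketch in which the gain triples and the grid values are illustrative assumptions, not the paper's rendering pipeline:

```python
import numpy as np

def illuminate(img, temp_gain=(1.0, 1.0, 1.0), intensity=1.0):
    """Toy illumination shift: per-channel RGB gains emulate colour
    temperature, a scalar emulates intensity. img is float in [0, 1],
    shape (H, W, 3). Illustrative only."""
    out = img * np.asarray(temp_gain) * intensity
    return np.clip(out, 0.0, 1.0)

def augment_set(img, temps, intensities):
    """Systematically vary temperature and intensity over a grid,
    producing one augmented copy per (temperature, intensity) pair."""
    return [illuminate(img, t, i) for t in temps for i in intensities]
```

Training on such a grid exposes the classifier to the lighting shifts it will meet at deployment without requiring any additional labels, which is the point of the controlled robustness analysis.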
cs.CV / 156 / 2602.08503
Learning Self-Correction in Vision-Language Models via Rollout Augmentation
通过回滚增强学习视觉-语言模型中的自我纠正
Abstract
Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves SoTA performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 score while requiring only $0.72\times$ training time per step.
Chinese Translation
自我纠正对于解决视觉-语言模型(VLMs)中的复杂推理问题至关重要。然而,现有的强化学习(RL)方法在学习自我纠正方面面临困难,因为有效的自我纠正行为仅在极少数情况下出现,导致学习信号极为稀疏。为了解决这一挑战,我们提出了特定于纠正的回滚(Octopus),这是一种通过重新组合现有回滚合成密集自我纠正示例的RL回滚增强框架。这种增强通过回滚重用同时提高了样本效率,并通过平衡监督稳定了RL优化。此外,我们引入了一种响应屏蔽策略,将自我纠正与直接推理解耦,避免信号冲突,使两种行为都能有效学习。在此基础上,我们推出了Octopus-8B,这是一种具有可控自我纠正能力的推理VLM。在7个基准测试中,它在开源VLM中实现了最先进的性能,超越了最佳RLVR基线1.0分,同时每步训练时间仅需$0.72\times$。
cs.CV / 157 / 2602.08505
Are Vision Foundation Models Foundational for Electron Microscopy Image Segmentation?
视觉基础模型是否为电子显微镜图像分割奠定基础?
Abstract
Although vision foundation models (VFMs) are increasingly reused for biomedical image analysis, it remains unclear whether the latent representations they provide are general enough to support effective transfer and reuse across heterogeneous microscopy image datasets. Here, we study this question for the problem of mitochondria segmentation in electron microscopy (EM) images, using two popular public EM datasets (Lucchi++ and VNC) and three recent representative VFMs (DINOv2, DINOv3, and OpenCLIP). We evaluate two practical model adaptation regimes: a frozen-backbone setting in which only a lightweight segmentation head is trained on top of the VFM, and parameter-efficient fine-tuning (PEFT) via Low-Rank Adaptation (LoRA) in which the VFM is fine-tuned in a targeted manner to a specific dataset. Across all backbones, we observe that training on a single EM dataset yields good segmentation performance (quantified as foreground Intersection-over-Union), and that LoRA consistently improves in-domain performance. In contrast, training on multiple EM datasets leads to severe performance degradation for all models considered, with only marginal gains from PEFT. Exploration of the latent representation space through various techniques (PCA, Fr\'echet Dinov2 distance, and linear probes) reveals a pronounced and persistent domain mismatch between the two considered EM datasets in spite of their visual similarity, which is consistent with the observed failure of paired training. These results suggest that, while VFMs can deliver competitive results for EM segmentation within a single domain under lightweight adaptation, current PEFT strategies are insufficient to obtain a single robust model across heterogeneous EM datasets without additional domain-alignment mechanisms.
Chinese Translation
尽管视觉基础模型(VFMs)在生物医学图像分析中被越来越多地重用,但它们所提供的潜在表示是否足够通用,以支持在异质显微镜图像数据集之间的有效迁移和重用,仍然不明确。在此,我们针对电子显微镜(EM)图像中的线粒体分割问题进行研究,使用两个流行的公共EM数据集(Lucchi++和VNC)和三个近期代表性的VFMs(DINOv2、DINOv3和OpenCLIP)。我们评估了两种实际的模型适应方案:一种是冻结主干网络的设置,仅在VFM上训练一个轻量级的分割头;另一种是通过低秩适应(LoRA)进行参数高效微调(PEFT),在该方法中,VFM以针对特定数据集的方式进行微调。在所有主干网络中,我们观察到在单个EM数据集上训练能够产生良好的分割性能(以前景交并比量化),并且LoRA始终改善了领域内的性能。相反,在多个EM数据集上训练导致所有考虑的模型性能严重下降,PEFT仅带来了微小的提升。通过各种技术(PCA、Fréchet Dinov2距离和线性探测)对潜在表示空间的探索显示,尽管这两个EM数据集在视觉上相似,但它们之间存在显著且持续的领域不匹配,这与观察到的配对训练失败是一致的。这些结果表明,尽管VFMs在轻量适应下能够在单一领域内为EM分割提供竞争力的结果,但当前的PEFT策略不足以在异质EM数据集之间获得单一稳健模型,而无需额外的领域对齐机制。
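The LoRA-based PEFT regime evaluated above modifies each adapted weight matrix only through a low-rank product. The standard LoRA parameterization (generic form, not this paper's specific configuration or hyperparameters) is W' = W + (α/r)·B·A with only A ∈ R^{r×d_in} and B ∈ R^{d_out×r} trained, B initialized to zero so training starts from the frozen backbone:

```python
import numpy as np

def lora_delta(W, A, B, alpha=16.0):
    """Effective weight under a LoRA adapter: W + (alpha / r) * B @ A.

    W: frozen (d_out, d_in) backbone weight; A: (r, d_in); B: (d_out, r).
    Only A and B carry gradients; the update has rank at most r."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)
```

Because the update rank is capped at r, the adapter can shift the backbone toward one EM dataset cheaply; the paper's finding is that this capacity is not enough to bridge the latent-space mismatch between two EM domains at once.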
cs.CV / 158 / 2602.08524
GeoFocus: Blending Efficient Global-to-Local Perception for Multimodal Geometry Problem-Solving
GeoFocus:融合高效的全局到局部感知以解决多模态几何问题
Abstract
Geometry problem-solving remains a significant challenge for Large Multimodal Models (LMMs), requiring not only global shape recognition but also attention to intricate local relationships related to geometric theory. To address this, we propose GeoFocus, a novel framework comprising two core modules. 1) Critical Local Perceptor, which automatically identifies and emphasizes critical local structure (e.g., angles, parallel lines, comparative distances) through thirteen theory-based perception templates, boosting critical local feature coverage by 61% compared to previous methods. 2) VertexLang, a compact topology formal language, encodes global figures through vertex coordinates and connectivity relations. By replacing bulky code-based encodings, VertexLang reduces global perception training time by 20% while improving topology recognition accuracy. When evaluated in Geo3K, GeoQA, and FormalGeo7K, GeoFocus achieves a 4.7% accuracy improvement over leading specialized models and demonstrates superior robustness in MATHVERSE under diverse visual conditions. Project Page -- https://github.com/dle666/GeoFocus
Chinese Translation
几何问题解决仍然是大型多模态模型(LMMs)面临的重大挑战,不仅需要全局形状识别,还需要关注与几何理论相关的复杂局部关系。为此,我们提出了GeoFocus,一个由两个核心模块组成的新框架。1)关键局部感知器(Critical Local Perceptor),通过十三个基于理论的感知模板自动识别并强调关键局部结构(例如,角度、平行线、比较距离),相比于之前的方法,关键局部特征覆盖率提高了61%。2)VertexLang,一种紧凑的拓扑形式语言,通过顶点坐标和连接关系对全局图形进行编码。通过替代笨重的基于代码的编码,VertexLang将全局感知训练时间减少了20%,同时提高了拓扑识别的准确性。在Geo3K、GeoQA和FormalGeo7K的评估中,GeoFocus在领先的专业模型上实现了4.7%的准确性提升,并在MATHVERSE的多样视觉条件下展示了更强的鲁棒性。项目页面 -- https://github.com/dle666/GeoFocus
cs.CV / 159 / 2602.08528
Automatic regularization parameter choice for tomography using a double model approach
基于双模型方法的断层成像自动正则化参数选择
Abstract
Image reconstruction in X-ray tomography is an ill-posed inverse problem, particularly with limited available data. Regularization is thus essential, but its effectiveness hinges on the choice of a regularization parameter that balances data fidelity against a priori information. We present a novel method for automatic parameter selection based on the use of two distinct computational discretizations of the same problem. A feedback control algorithm dynamically adjusts the regularization strength, driving an iterative reconstruction toward the smallest parameter that yields sufficient similarity between reconstructions on the two grids. The effectiveness of the proposed approach is demonstrated using real tomographic data.
Chinese Translation
X射线断层成像中的图像重建是一个病态逆问题,尤其是在可用数据有限的情况下。因此,正则化是必不可少的,但其有效性依赖于正则化参数的选择,该参数在数据保真度与先验信息之间取得平衡。我们提出了一种基于对同一问题的两种不同计算离散化的自动参数选择新方法。反馈控制算法动态调整正则化强度,推动迭代重建朝向最小参数,以实现两个网格上重建之间的足够相似性。通过使用真实的断层数据,证明了所提方法的有效性。
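The double-model rule can be stated abstractly: given a reconstruction routine evaluated on two discretizations of the same problem, keep shrinking the regularization parameter while the two reconstructions still agree within a tolerance, and return the smallest agreeing value. The interface, shrink schedule, and tolerance below are hypothetical illustrations of this idea, not the paper's control algorithm:

```python
import numpy as np

def choose_lambda(recon, lam=1.0, shrink=0.5, tol=1e-2, max_iter=50):
    """Sketch of double-model regularization parameter selection.

    recon(grid, lam) is assumed to return a reconstruction (as an array on a
    common comparison grid) for grid in {"coarse", "fine"}. We decrease lam
    while the two discretizations still agree, keeping the smallest such
    value; divergence between the grids signals over-fitting to noise."""
    best = lam
    for _ in range(max_iter):
        a, b = recon("coarse", lam), recon("fine", lam)
        rel = np.linalg.norm(a - b) / max(np.linalg.norm(b), 1e-12)
        if rel <= tol:
            best = lam
            lam *= shrink      # reconstructions agree: try weaker regularization
        else:
            break              # grids disagree: previous lambda was minimal
    return best
```

The intuition is that a well-regularized solution is discretization-independent, whereas an under-regularized one amplifies grid-specific noise, so inter-grid discrepancy serves as a label-free feedback signal.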
cs.CV / 160 / 2602.08531
Thegra: Graph-based SLAM for Thermal Imagery
Thegra:基于图的热成像SLAM
Abstract
Thermal imaging provides a practical sensing modality for visual SLAM in visually degraded environments such as low illumination, smoke, or adverse weather. However, thermal imagery often exhibits low texture, low contrast, and high noise, complicating feature-based SLAM. In this work, we propose a sparse monocular graph-based SLAM system for thermal imagery that leverages general-purpose learned features -- the SuperPoint detector and LightGlue matcher, trained on large-scale visible-spectrum data to improve cross-domain generalization. To adapt these components to thermal data, we introduce a preprocessing pipeline to enhance input suitability and modify core SLAM modules to handle sparse and outlier-prone feature matches. We further incorporate keypoint confidence scores from SuperPoint into a confidence-weighted factor graph to improve estimation robustness. Evaluations on public thermal datasets demonstrate that the proposed system achieves reliable performance without requiring dataset-specific training or fine-tuning a desired feature detector, given the scarcity of quality thermal data. Code will be made available upon publication.
Chinese Translation
热成像为在低光照、烟雾或恶劣天气等视觉退化环境中的视觉SLAM提供了一种实用的传感方式。然而,热成像通常表现出低纹理、低对比度和高噪声,这使得基于特征的SLAM变得复杂。在本研究中,我们提出了一种针对热成像的稀疏单目基于图的SLAM系统,该系统利用通用学习特征——SuperPoint检测器和LightGlue匹配器,这些特征是在大规模可见光谱数据上训练的,以提高跨领域的泛化能力。为了将这些组件适应于热数据,我们引入了一个预处理管道,以增强输入的适用性,并修改核心SLAM模块以处理稀疏和易受干扰的特征匹配。我们进一步将SuperPoint的关键点置信度分数纳入置信加权因子图中,以提高估计的鲁棒性。在公共热成像数据集上的评估表明,所提出的系统在不需要特定数据集训练或微调所需特征检测器的情况下,能够实现可靠的性能,考虑到优质热数据的稀缺性。代码将在发表时提供。
cs.CV / 161 / 2602.08540
TIBR4D: Tracing-Guided Iterative Boundary Refinement for Efficient 4D Gaussian Segmentation
TIBR4D:基于追踪引导的迭代边界细化用于高效的4D高斯分割
Abstract
Object-level segmentation in dynamic 4D Gaussian scenes remains challenging due to complex motion, occlusions, and ambiguous boundaries. In this paper, we present an efficient learning-free 4D Gaussian segmentation framework that lifts video segmentation masks to 4D spaces, whose core is a two-stage iterative boundary refinement, TIBR4D. The first stage is an Iterative Gaussian Instance Tracing (IGIT) at the temporal segment level. It progressively refines Gaussian-to-instance probabilities through iterative tracing, and extracts corresponding Gaussian point clouds that better handle occlusions and preserve completeness of object structures compared to existing one-shot threshold-based methods. The second stage is a frame-wise Gaussian Rendering Range Control (RCC) via suppressing highly uncertain Gaussians near object boundaries while retaining their core contributions for more accurate boundaries. Furthermore, a temporal segmentation merging strategy is proposed for IGIT to balance identity consistency and dynamic awareness. Longer segments enforce stronger multi-frame constraints for stable identities, while shorter segments allow identity changes to be captured promptly. Experiments on HyperNeRF and Neu3D demonstrate that our method produces accurate object Gaussian point clouds with clearer boundaries and higher efficiency compared to SOTA methods.
Chinese Translation
在动态4D高斯场景中,物体级分割仍然面临复杂运动、遮挡和模糊边界等挑战。本文提出了一种高效的无学习4D高斯分割框架,该框架将视频分割掩码提升到4D空间,其核心是一个两阶段的迭代边界细化方法TIBR4D。第一阶段是在时间段级别进行的迭代高斯实例追踪(Iterative Gaussian Instance Tracing, IGIT)。它通过迭代追踪逐步细化高斯到实例的概率,并提取相应的高斯点云,这些点云在处理遮挡和保持物体结构完整性方面优于现有的一次性阈值方法。第二阶段是通过抑制靠近物体边界的高度不确定高斯,同时保留其核心贡献,以实现更准确的边界的逐帧高斯渲染范围控制(Gaussian Rendering Range Control, RCC)。此外,针对IGIT提出了一种时间分割合并策略,以平衡身份一致性和动态感知。较长的片段施加更强的多帧约束以确保稳定的身份,而较短的片段则允许身份变化被迅速捕捉。在HyperNeRF和Neu3D上的实验表明,与最先进的方法相比,我们的方法生成了更准确的物体高斯点云,具有更清晰的边界和更高的效率。
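The core of IGIT is refining Gaussian-to-instance probabilities iteratively from multi-frame evidence instead of hard-thresholding a single frame. A deliberately simplified caricature of that idea, in which per-frame soft mask votes are aggregated with a damped update (the update rule and its hyperparameters are illustrative assumptions, not the paper's):

```python
import numpy as np

def iterative_instance_tracing(votes, iters=10, lr=0.5):
    """Toy iterative refinement of Gaussian-to-instance probabilities.

    votes: array of shape (F, G, K) giving soft per-frame evidence that
    Gaussian g belongs to instance k. Returns one instance id per Gaussian."""
    p = np.full(votes.shape[1:], 1.0 / votes.shape[2])  # uniform init, (G, K)
    for _ in range(iters):
        evidence = votes.mean(axis=0)                   # aggregate over frames
        p = (1 - lr) * p + lr * evidence                # damped probability update
        p /= p.sum(axis=1, keepdims=True)               # renormalise per Gaussian
    return p.argmax(axis=1)                             # final assignment
```

Aggregating over frames before committing to an assignment is what lets occluded Gaussians recover a stable identity, in contrast to a one-shot threshold on a single view.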
cs.CV / 162 / 2602.08550
GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing
GOT-Edit:通过在线模型编辑实现几何感知的通用目标跟踪
Abstract
Human perception for effective object tracking in a 2D video stream arises from the implicit use of prior 3D knowledge combined with semantic reasoning. In contrast, most generic object tracking (GOT) methods primarily rely on 2D features of the target and its surroundings while neglecting 3D geometric cues, which makes them susceptible to partial occlusion, distractors, and variations in geometry and appearance. To address this limitation, we introduce GOT-Edit, an online cross-modality model editing approach that integrates geometry-aware cues into a generic object tracker from a 2D video stream. Our approach leverages features from a pre-trained Visual Geometry Grounded Transformer to enable geometric cue inference from only a few 2D images. To tackle the challenge of seamlessly combining geometry and semantics, GOT-Edit performs online model editing with null-space constrained updates that incorporate geometric information while preserving semantic discrimination, yielding consistently better performance across diverse scenarios. Extensive experiments on multiple GOT benchmarks demonstrate that GOT-Edit achieves superior robustness and accuracy, particularly under occlusion and clutter, establishing a new paradigm for combining 2D semantics with 3D geometric reasoning for generic object tracking.
Chinese Translation
人类在二维视频流中进行有效目标跟踪的感知能力源于隐含的三维知识与语义推理的结合。相比之下,大多数通用目标跟踪(GOT)方法主要依赖于目标及其周围环境的二维特征,而忽视了三维几何线索,这使得它们容易受到部分遮挡、干扰物以及几何和外观变化的影响。为了解决这一局限性,我们提出了GOT-Edit,一种在线跨模态模型编辑方法,它将几何感知线索集成到来自二维视频流的通用目标跟踪器中。我们的方法利用预训练的视觉几何基础变换器(Visual Geometry Grounded Transformer)中的特征,从仅有的几张二维图像中推断几何线索。为了解决无缝结合几何与语义的挑战,GOT-Edit通过具有零空间约束的更新进行在线模型编辑,结合几何信息的同时保持语义区分性,从而在各种场景中实现持续更好的性能。在多个GOT基准上的广泛实验表明,GOT-Edit在遮挡和杂乱情况下表现出更强的鲁棒性和准确性,为将二维语义与三维几何推理结合的通用目标跟踪建立了新的范式。
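Null-space constrained model editing has a compact linear-algebra core: project the candidate weight update onto the null space of features whose outputs must be preserved, so the edit can incorporate geometric information without disturbing existing semantic responses. A self-contained sketch of that projection (the actual GOT-Edit update rule may differ in detail):

```python
import numpy as np

def null_space_edit(W, delta, K_preserve):
    """Apply update `delta` to weight matrix W, projected so outputs for
    the preserved feature vectors (rows of K_preserve) are unchanged.

    P = I - pinv(K) @ K projects onto the null space of K's row space,
    hence K @ P = 0 and (delta @ P) @ k = 0 for every preserved k.
    """
    d = W.shape[1]
    P = np.eye(d) - np.linalg.pinv(K_preserve) @ K_preserve
    return W + delta @ P
```

The same construction underlies null-space editing methods for language models; here it is the mechanism that lets geometric cues be merged online while semantic discrimination is retained.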
cs.CV / 163 / 2602.08558
FLAG-4D: Flow-Guided Local-Global Dual-Deformation Model for 4D Reconstruction
FLAG-4D:用于4D重建的流引导局部-全局双变形模型
Abstract
We introduce FLAG-4D, a novel framework for generating novel views of dynamic scenes by reconstructing how 3D Gaussian primitives evolve through space and time. Existing methods typically rely on a single Multilayer Perceptron (MLP) to model temporal deformations, and they often struggle to capture complex point motions and fine-grained dynamic details consistently over time, especially from sparse input views. Our approach, FLAG-4D, overcomes this by employing a dual-deformation network that dynamically warps a canonical set of 3D Gaussians over time into new positions and anisotropic shapes. This dual-deformation network consists of an Instantaneous Deformation Network (IDN) for modeling fine-grained, local deformations and a Global Motion Network (GMN) for capturing long-range dynamics, refined through mutual learning. To ensure these deformations are both accurate and temporally smooth, FLAG-4D incorporates dense motion features from a pretrained optical flow backbone. We fuse these motion cues from adjacent timeframes and use a deformation-guided attention mechanism to align this flow information with the current state of each evolving 3D Gaussian. Extensive experiments demonstrate that FLAG-4D achieves higher-fidelity and more temporally coherent reconstructions with finer detail preservation than state-of-the-art methods.
Chinese Translation
我们介绍了FLAG-4D,一个新颖的框架,通过重建3D高斯原语在时空中的演变来生成动态场景的新视角。现有方法通常依赖于单一的多层感知器(MLP)来建模时间变形,并且往往难以持续捕捉复杂的点运动和细粒度的动态细节,尤其是在稀疏输入视图的情况下。我们的FLAG-4D方法通过采用一个双变形网络来克服这一问题,该网络动态地将一组规范的3D高斯体在时间上扭曲到新的位置和各向异性形状。这个双变形网络由一个瞬时变形网络(IDN)构成,用于建模细粒度的局部变形,以及一个全局运动网络(GMN),用于捕捉长距离动态,通过互学习进行优化。为了确保这些变形既准确又在时间上平滑,FLAG-4D结合了来自预训练光流主干网络的密集运动特征。我们融合了来自相邻时间帧的运动线索,并使用变形引导的注意机制将这些流信息与每个演变中的3D高斯的当前状态对齐。大量实验表明,FLAG-4D在细节保留方面实现了比最先进的方法更高保真度和更具时间一致性的重建。
cs.CV / 164 / 2602.08582
SemiNFT: Learning to Transfer Presets from Imitation to Appreciation via Hybrid-Sample Reinforcement Learning
SemiNFT:通过混合样本强化学习从模仿到欣赏的预设转移学习
Abstract
Photorealistic color retouching plays a vital role in visual content creation, yet manual retouching remains inaccessible to non-experts due to its reliance on specialized expertise. Reference-based methods offer a promising alternative by transferring the preset color of a reference image to a source image. However, these approaches often operate as novice learners, performing global color mappings derived from pixel-level statistics, without a true understanding of semantic context or human aesthetics. To address this issue, we propose SemiNFT, a Diffusion Transformer (DiT)-based retouching framework that mirrors the trajectory of human artistic training: beginning with rigid imitation and evolving into intuitive creation. Specifically, SemiNFT is first taught with paired triplets to acquire basic structural preservation and color mapping skills, and then advanced to reinforcement learning (RL) on unpaired data to cultivate nuanced aesthetic perception. Crucially, during the RL stage, to prevent catastrophic forgetting of old skills, we design a hybrid online-offline reward mechanism that anchors aesthetic exploration with structural review. Extensive experiments show that SemiNFT not only outperforms state-of-the-art methods on standard preset transfer benchmarks but also demonstrates remarkable intelligence in zero-shot tasks, such as black-and-white photo colorization and cross-domain (anime-to-photo) preset transfer. These results confirm that SemiNFT transcends simple statistical matching and achieves a sophisticated level of aesthetic comprehension. Our project can be found at https://melanyyang.github.io/SemiNFT/.
Chinese Translation
逼真的色彩修饰在视觉内容创作中发挥着至关重要的作用,但由于其依赖于专业知识,手动修饰对非专业人士而言仍然难以接触。基于参考的方法通过将参考图像的预设色彩转移到源图像,提供了一种有前景的替代方案。然而,这些方法往往作为初学者运作,执行基于像素级统计的全局色彩映射,而没有真正理解语义上下文或人类美学。为了解决这一问题,我们提出了SemiNFT,一个基于扩散变换器(Diffusion Transformer, DiT)的修饰框架,模拟人类艺术训练的轨迹:从严格的模仿开始,逐渐演变为直观的创作。具体而言,SemiNFT首先通过成对三元组学习基本的结构保留和色彩映射技能,然后在未配对数据上进行强化学习(Reinforcement Learning, RL),以培养细致的审美感知。至关重要的是,在RL阶段,为了防止旧技能的灾难性遗忘,我们设计了一种混合在线-离线奖励机制,将审美探索与结构回顾相结合。大量实验表明,SemiNFT不仅在标准预设转移基准上超越了最先进的方法,还在零样本任务(如黑白照片上色和跨域(动漫到照片)预设转移)中表现出显著的智能。这些结果确认了SemiNFT超越简单的统计匹配,达到了复杂的审美理解水平。我们的项目可以在 https://melanyyang.github.io/SemiNFT/ 找到。
cs.CV / 165 / 2602.08613
Overview and Comparison of AVS Point Cloud Compression Standard
AVS点云压缩标准的概述与比较
Abstract
Point cloud is a prevalent 3D data representation format with significant application values in immersive media, autonomous driving, digital heritage protection, etc. However, the large data size of point clouds poses challenges to transmission and storage, which hinders their wide deployment. Therefore, point cloud compression plays a crucial role in practical applications for both human and machine perception optimization. To this end, the Moving Picture Experts Group (MPEG) has established two standards for point cloud compression, including Geometry-based Point Cloud Compression (G-PCC) and Video-based Point Cloud Compression (V-PCC). In the meantime, the Audio Video coding Standard (AVS) Workgroup of China has also launched and completed the development of its first-generation point cloud compression standard, namely AVS PCC. This new standardization effort has adopted many new coding tools and techniques that differ from those of the counterpart standards. This paper reviews the AVS PCC standard from two perspectives, i.e., the related technologies and performance comparisons.
Chinese Translation
点云是一种广泛应用的3D数据表示格式,在沉浸式媒体、自动驾驶、数字遗产保护等领域具有重要的应用价值。然而,点云的大数据量对传输和存储提出了挑战,这影响了其广泛部署。因此,点云压缩在优化人类和机器感知的实际应用中发挥着至关重要的作用。为此,移动图像专家组(MPEG)建立了两个点云压缩标准,包括基于几何的点云压缩(G-PCC)和基于视频的点云压缩(V-PCC)。与此同时,中国音视频编码标准(AVS)工作组也启动并完成了其第一代点云压缩标准的开发,即AVS PCC。该新标准化工作采用了许多新的编码工具和技术,这些工具和技术与其他对应标准有所不同。本文从相关技术和性能比较两个角度回顾了AVS PCC标准。
cs.CV / 166 / 2602.08615
Inspiration Seeds: Learning Non-Literal Visual Combinations for Generative Exploration
灵感种子:学习非字面视觉组合以进行生成探索
Abstract
While generative models have become powerful tools for image synthesis, they are typically optimized for executing carefully crafted textual prompts, offering limited support for the open-ended visual exploration that often precedes idea formation. In contrast, designers frequently draw inspiration from loosely connected visual references, seeking emergent connections that spark new ideas. We propose Inspiration Seeds, a generative framework that shifts image generation from final execution to exploratory ideation. Given two input images, our model produces diverse, visually coherent compositions that reveal latent relationships between inputs, without relying on user-specified text prompts. Our approach is feed-forward, trained on synthetic triplets of decomposed visual aspects derived entirely through visual means: we use CLIP Sparse Autoencoders to extract editing directions in CLIP latent space and isolate concept pairs. By removing the reliance on language and enabling fast, intuitive recombination, our method supports visual ideation at the early and ambiguous stages of creative work.
Chinese Translation
尽管生成模型已成为图像合成的强大工具,但它们通常针对精心设计的文本提示进行优化,提供的支持有限,无法满足在形成创意之前常常需要的开放式视觉探索。相比之下,设计师通常从松散关联的视觉参考中汲取灵感,寻找能够激发新想法的潜在联系。我们提出了灵感种子(Inspiration Seeds),这是一个生成框架,将图像生成从最终执行转向探索性构思。给定两幅输入图像,我们的模型生成多样且视觉上连贯的组合,揭示输入之间的潜在关系,而无需依赖用户指定的文本提示。我们的方法是前馈型的,训练基于完全通过视觉手段获得的分解视觉特征的合成三元组:我们使用 CLIP 稀疏自编码器(CLIP Sparse Autoencoders)提取 CLIP 潜在空间中的编辑方向并隔离概念对。通过消除对语言的依赖并实现快速、直观的重组,我们的方法支持在创意工作的早期和模糊阶段进行视觉构思。
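The mechanism of extracting editing directions with a sparse autoencoder can be illustrated in a few lines: each SAE hidden unit owns a decoder direction in the embedding space, and moving an embedding along that direction applies the corresponding concept. A toy numpy sketch (the SAE, its dimensions, and the random weights are illustrative stand-ins, not the trained CLIP SAE):

```python
import numpy as np

rng = np.random.default_rng(0)

class TinySAE:
    """Minimal sparse autoencoder: ReLU-sparse codes, linear decoder."""
    def __init__(self, d_model, d_hidden):
        self.W_enc = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
        self.b_enc = np.zeros(d_hidden)
        self.W_dec = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)

    def encode(self, z):
        # sparse nonnegative codes: most units stay at zero
        return np.maximum(z @ self.W_enc + self.b_enc, 0.0)

def concept_edit(sae, z, unit, scale):
    """Shift embedding z along the (unit-normalized) decoder direction of
    one SAE unit -- the kind of 'editing direction' used to isolate
    concept pairs in the latent space."""
    d = sae.W_dec[unit]
    return z + scale * d / np.linalg.norm(d)
```

Because the directions are read off the decoder rather than described in words, the whole recombination loop stays purely visual, matching the paper's goal of language-free exploration.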
cs.CV / 167 / 2602.08620
Improving Reconstruction of Representation Autoencoder
改进表示自编码器的重建
Abstract
Recent work leverages Vision Foundation Models as image encoders to boost the generative performance of latent diffusion models (LDMs), as their semantic feature distributions are easy to learn. However, such semantic features often lack low-level information (e.g., color and texture), leading to degraded reconstruction fidelity, which has emerged as a primary bottleneck in further scaling LDMs. To address this limitation, we propose LV-RAE, a representation autoencoder that augments semantic features with missing low-level information, enabling high-fidelity reconstruction while remaining highly aligned with the semantic distribution. We further observe that the resulting high-dimensional, information-rich latents make decoders sensitive to latent perturbations, causing severe artifacts when decoding generated latents and consequently degrading generation quality. Our analysis suggests that this sensitivity primarily stems from excessive decoder responses along directions off the data manifold. Building on these insights, we propose fine-tuning the decoder to increase its robustness and smoothing the generated latents via controlled noise injection, thereby enhancing generation quality. Experiments demonstrate that LV-RAE significantly improves reconstruction fidelity while preserving the semantic abstraction and achieving strong generative quality. Our code is available at https://github.com/modyu-liu/LVRAE.
Chinese Translation
近期的研究利用视觉基础模型作为图像编码器,以提升潜在扩散模型(Latent Diffusion Models, LDMs)的生成性能,因为其语义特征分布易于学习。然而,这些语义特征往往缺乏低级信息(例如,颜色和纹理),导致重建保真度下降,这已成为进一步扩展LDMs的主要瓶颈。为了解决这一限制,我们提出了LV-RAE,一种表示自编码器,它通过补充缺失的低级信息来增强语义特征,从而实现高保真重建,同时保持与语义分布的高度一致性。我们进一步观察到,生成的高维、信息丰富的潜在变量使得解码器对潜在扰动敏感,在解码生成的潜在变量时会导致严重伪影,从而降低生成质量。我们的分析表明,这种敏感性主要源于解码器在数据流形外方向的过度响应。基于这些见解,我们提出对解码器进行微调,以增强其鲁棒性,并通过控制噪声注入来平滑生成的潜在变量,从而提升生成质量。实验表明,LV-RAE显著提高了重建保真度,同时保持了语义抽象,并实现了强大的生成质量。我们的代码可在 https://github.com/modyu-liu/LVRAE 获得。
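The "smoothing via controlled noise injection" step admits a simple variance-preserving form: blend the generated latent with isotropic Gaussian noise so small off-manifold components are perturbed toward the noise distribution before decoding. The exact schedule used by LV-RAE is not specified here, so the following is an assumed instantiation:

```python
import numpy as np

def smooth_latent(z, sigma, rng):
    """Variance-preserving noise injection: for unit-variance latents the
    output variance stays ~1 while off-manifold directions are smoothed
    (sigma in [0, 1) controls the injection strength)."""
    eps = rng.normal(size=z.shape)
    return np.sqrt(1.0 - sigma ** 2) * z + sigma * eps
```

Pairing this with a decoder fine-tuned to be robust on similarly perturbed latents is what connects the two remedies the abstract describes.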
cs.CV / 168 / 2602.08626
Revisiting [CLS] and Patch Token Interaction in Vision Transformers
重新审视视觉变换器中的 [CLS] 和补丁令牌交互
Abstract
Vision Transformers have emerged as powerful, scalable and versatile representation learners. To capture both global and local features, a learnable [CLS] class token is typically prepended to the input sequence of patch tokens. Despite their distinct nature, both token types are processed identically throughout the model. In this work, we investigate the friction between global and local feature learning under different pre-training strategies by analyzing the interactions between class and patch tokens. Our analysis reveals that standard normalization layers introduce an implicit differentiation between these token types. Building on this insight, we propose specialized processing paths that selectively disentangle the computational flow of class and patch tokens, particularly within normalization layers and early query-key-value projections. This targeted specialization leads to significantly improved patch representation quality for dense prediction tasks. Our experiments demonstrate segmentation performance gains of over 2 mIoU points on standard benchmarks, while maintaining strong classification accuracy. The proposed modifications introduce only an 8% increase in parameters, with no additional computational overhead. Through comprehensive ablations, we provide insights into which architectural components benefit most from specialization and how our approach generalizes across model scales and learning frameworks.
Chinese Translation
视觉变换器作为强大、可扩展和多功能的表征学习模型已逐渐崭露头角。为了捕捉全局和局部特征,通常会在补丁令牌的输入序列前添加一个可学习的 [CLS] 类令牌。尽管这两种令牌的性质不同,但在模型中它们的处理方式是相同的。在本研究中,我们通过分析类令牌和补丁令牌之间的交互,探讨了在不同预训练策略下全局和局部特征学习之间的摩擦。我们的分析揭示了标准归一化层在这些令牌类型之间引入了隐含的差异化。基于这一见解,我们提出了专门的处理路径,选择性地解耦类令牌和补丁令牌的计算流,特别是在归一化层和早期的查询-键-值投影中。这种有针对性的专业化显著提高了密集预测任务中补丁表征的质量。我们的实验表明,在标准基准上,分割性能提高了超过2 mIoU点,同时保持了较强的分类准确性。所提修改仅增加了8%的参数,而没有额外的计算开销。通过全面的消融实验,我们提供了对哪些架构组件最能从专业化中受益的见解,以及我们的方法如何在模型规模和学习框架中进行泛化。
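The proposed disentanglement can be made concrete with token-type-specific normalization: the [CLS] token (index 0) and the patch tokens pass through LayerNorms with separate affine parameters instead of one shared set. A minimal numpy sketch of this routing (parameter shapes and the routing function are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def token_type_norm(tokens, cls_params, patch_params):
    """Route the [CLS] token (index 0) and patch tokens through separate
    affine normalization parameters, tokens shaped (batch, seq, dim)."""
    out = np.empty_like(tokens)
    out[:, :1] = layer_norm(tokens[:, :1], *cls_params)
    out[:, 1:] = layer_norm(tokens[:, 1:], *patch_params)
    return out
```

Duplicating only affine parameters (and, in the paper, early query-key-value projections) keeps the overhead small, which is how the specialization fits within a modest parameter increase.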
cs.CV / 169 / 2602.08652
Deep Learning-Based Fixation Type Prediction for Quality Assurance in Digital Pathology
基于深度学习的固定类型预测在数字病理学质量保证中的应用
Abstract
Accurate annotation of fixation type is a critical step in slide preparation for pathology laboratories. However, this manual process is prone to errors, impacting downstream analyses and diagnostic accuracy. Existing methods for verifying formalin-fixed, paraffin-embedded (FFPE), and frozen section (FS) fixation types typically require full-resolution whole-slide images (WSIs), limiting scalability for high-throughput quality control. We propose a deep-learning model to predict fixation types using low-resolution, pre-scan thumbnail images. The model was trained on WSIs from the TUM Institute of Pathology (n=1,200, Leica GT450DX) and evaluated on a class-balanced subset of The Cancer Genome Atlas dataset (TCGA, n=8,800, Leica AT2), as well as on class-balanced datasets from Augsburg (n=695 [392 FFPE, 303 FS], Philips UFS) and Regensburg (n=202, 3DHISTECH P1000). Our model achieves an AUROC of 0.88 on TCGA, outperforming comparable pre-scan methods by 4.8%. It also achieves AUROCs of 0.72 on Regensburg and Augsburg slides, underscoring challenges related to scanner-induced domain shifts. Furthermore, the model processes each slide in 21 ms, $400\times$ faster than existing high-magnification, full-resolution methods, enabling rapid, high-throughput processing. This approach provides an efficient solution for detecting labelling errors without relying on high-magnification scans, offering a valuable tool for quality control in high-throughput pathology workflows. Future work will improve and evaluate the model's generalisation to additional scanner types. Our findings suggest that this method can increase accuracy and efficiency in digital pathology workflows and may be extended to other low-resolution slide annotations.
Chinese Translation
准确标注固定类型是病理实验室切片准备中的关键步骤。然而,这一手动过程容易出错,影响后续分析和诊断准确性。现有的验证福尔马林固定石蜡包埋(FFPE)和冷冻切片(FS)固定类型的方法通常需要全分辨率的全切片图像(WSIs),限制了高通量质量控制的可扩展性。我们提出了一种深度学习模型,利用低分辨率的预扫描缩略图像来预测固定类型。该模型在来自慕尼黑工业大学病理学研究所的全切片图像(n=1,200,Leica GT450DX)上进行训练,并在癌症基因组图谱(TCGA)数据集的类别平衡子集(n=8,800,Leica AT2)以及来自奥格斯堡(n=695 [392 FFPE, 303 FS],Philips UFS)和雷根斯堡(n=202,3DHISTECH P1000)的类别平衡数据集上进行评估。我们的模型在TCGA上达到了0.88的AUROC,超过了可比的预扫描方法4.8%。在雷根斯堡和奥格斯堡切片上也分别达到了0.72的AUROC,突显了与扫描仪引起的领域转移相关的挑战。此外,该模型以21毫秒处理每个切片,比现有的高倍全分辨率方法快$400\times$,实现了快速的高通量处理。这种方法为检测标注错误提供了高效的解决方案,而无需依赖高倍扫描,为高通量病理工作流程中的质量控制提供了有价值的工具。未来的工作将改进并评估模型对其他扫描仪类型的泛化能力。我们的研究结果表明,该方法可以提高数字病理工作流程的准确性和效率,并可能扩展到其他低分辨率切片标注。
cs.CV / 170 / 2602.08661
WiFlow: A Lightweight WiFi-based Continuous Human Pose Estimation Network with Spatio-Temporal Feature Decoupling
WiFlow:一种基于WiFi的轻量级连续人体姿态估计网络,具有时空特征解耦
Abstract
Human pose estimation is fundamental to intelligent perception in the Internet of Things (IoT), enabling applications ranging from smart healthcare to human-computer interaction. While WiFi-based methods have gained traction, they often struggle with continuous motion and high computational overhead. This work presents WiFlow, a novel framework for continuous human pose estimation using WiFi signals. Unlike vision-based approaches such as two-dimensional deep residual networks that treat Channel State Information (CSI) as images, WiFlow employs an encoder-decoder architecture. The encoder captures spatio-temporal features of CSI using temporal and asymmetric convolutions, preserving the original sequential structure of signals. It then refines keypoint features of the human bodies to be tracked and captures their structural dependencies via axial attention. The decoder subsequently maps the encoded high-dimensional features into keypoint coordinates. Trained on a self-collected dataset of 360,000 synchronized CSI-pose samples from 5 subjects performing continuous sequences of 8 daily activities, WiFlow achieves a Percentage of Correct Keypoints (PCK) of 97.00% at a threshold of 20% (PCK@20) and 99.48% at PCK@50, with a mean per-joint position error of 0.008m. With only 4.82M parameters, WiFlow significantly reduces model complexity and computational cost, establishing a new performance baseline for practical WiFi-based human pose estimation. Our code and datasets are available at https://github.com/DY2434/WiFlow-WiFi-Pose-Estimation-with-Spatio-Temporal-Decoupling.git.
Chinese Translation
人体姿态估计是物联网(IoT)智能感知的基础,能够支持从智能医疗到人机交互等多种应用。尽管基于WiFi的方法逐渐受到关注,但它们在连续运动和高计算开销方面常常面临挑战。本研究提出了WiFlow,一种利用WiFi信号进行连续人体姿态估计的新框架。与将信道状态信息(Channel State Information, CSI)视为图像的基于视觉的方法(如二维深度残差网络)不同,WiFlow采用了编码器-解码器架构。编码器使用时间卷积和非对称卷积捕捉CSI的时空特征,保留信号的原始序列结构。然后,它通过轴向注意力精炼待跟踪的人体关键点特征,并捕捉其结构依赖关系。解码器随后将编码的高维特征映射到关键点坐标。WiFlow在一个自收集的数据集中进行了训练,该数据集包含来自5名受试者进行8种日常活动连续序列的360,000个同步CSI-姿态样本,WiFlow在20%的阈值下达到了97.00%的关键点正确率(Percentage of Correct Keypoints, PCK@20),在PCK@50下达到了99.48%,每个关节位置的平均误差为0.008米。WiFlow仅使用4.82M参数,显著降低了模型复杂性和计算成本,为实际基于WiFi的人体姿态估计建立了新的性能基准。我们的代码和数据集可在https://github.com/DY2434/WiFlow-WiFi-Pose-Estimation-with-Spatio-Temporal-Decoupling.git获取。
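The temporal and asymmetric convolutions can be illustrated by factorizing a k×k kernel into a k×1 pass along time followed by a 1×k pass along subcarriers, which keeps the sequential structure of the CSI map explicit and cuts parameters from k² to 2k. A single-channel numpy sketch with zero padding (the real network stacks such layers with learned kernels):

```python
import numpy as np

def conv1d_along(x, kernel, axis):
    """'Same'-size 1-D cross-correlation of a 2-D map along one axis."""
    k = len(kernel)
    pad = [(0, 0), (0, 0)]
    pad[axis] = (k // 2, k - 1 - k // 2)
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i, w in enumerate(kernel):
        sl = [slice(None), slice(None)]
        sl[axis] = slice(i, i + x.shape[axis])
        out += w * xp[tuple(sl)]
    return out

def asymmetric_conv(x, k_time, k_sub):
    """Factorized (asymmetric) convolution over a CSI map shaped
    (time, subcarrier): a k x 1 temporal pass then a 1 x k subcarrier pass,
    instead of a full k x k kernel."""
    return conv1d_along(conv1d_along(x, k_time, axis=0), k_sub, axis=1)
```

For separable kernels the two passes are exactly equivalent to one 2-D convolution with the outer-product kernel, which is why the factorization saves computation without losing receptive field.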
cs.CV / 171 / 2602.08670
A Machine Learning accelerated geophysical fluid solver
一种机器学习加速的地球物理流体求解器
Abstract
Machine learning methods have been successful in many areas, such as image classification and natural language processing. However, it remains unclear how to apply ML to areas with mathematical constraints, such as solving PDEs. Among the various approaches to applying ML techniques to solving PDEs, the data-driven discretization method presents a promising way of accelerating and improving existing PDE solvers on structured grids: it predicts the coefficients of quasi-linear stencils for computing values or derivatives of a function at given positions. It can improve the accuracy and stability of low-resolution simulations compared with traditional finite difference or finite volume schemes. Meanwhile, it can also benefit from traditional numerical schemes, for example achieving conservation laws by adapting finite volume type formulations. In this thesis, we implement classic solvers for the shallow water and Euler equations under a different framework. Experiments show that our classic solver performs much better than the Pyclaw solver. We then propose four different deep neural networks for the ML-based solver. The results indicate that two of these approaches can output satisfactory solutions.
Chinese Translation
机器学习方法在许多领域取得了成功,如图像分类和自然语言处理。然而,如何将机器学习应用于具有数学约束的领域,如求解偏微分方程(PDE),仍需进一步探索。在将机器学习技术应用于求解PDE的各种方法中,数据驱动的离散化方法为加速和改善现有的结构网格PDE求解器提供了一种有前景的途径,它通过预测准线性模板的系数来计算给定位置上函数值或导数。这种方法相比于传统的有限差分或有限体积方案,可以提高低分辨率模拟的准确性和稳定性。同时,它也可以借助传统数值方案的优势,如通过调整有限体积类型的公式来实现守恒定律。在本论文中,我们在不同的框架下实现了浅水方程和欧拉方程的经典求解器。实验表明,我们的经典求解器的性能远超Pyclaw求解器。随后,我们提出了四种不同的深度神经网络用于基于机器学习的求解器。结果表明,这些方法中的两种能够输出令人满意的解。
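The quasi-linear stencil idea is easiest to see against its classical baseline: fixed finite-difference coefficients obtained by requiring exactness on low-degree polynomials. Data-driven discretization replaces these fixed coefficients with network-predicted, solution-dependent ones, often constrained to satisfy the same polynomial conditions. The classical baseline, as a sketch:

```python
import numpy as np

def first_derivative_stencil(offsets):
    """Solve for stencil coefficients exact for polynomials up to degree
    len(offsets)-1: the classical scheme that data-driven discretization
    perturbs with learned, per-point corrections."""
    n = len(offsets)
    A = np.vander(offsets, n, increasing=True).T  # A[k, j] = offsets[j]**k
    b = np.zeros(n)
    b[1] = 1.0  # exactness on monomials: d/dx x^k at 0 is 1 only for k=1
    return np.linalg.solve(A, b)

def apply_stencil(u, coeffs, offsets, dx):
    """Estimate du/dx on a periodic 1-D grid with the given stencil."""
    out = np.zeros_like(u)
    for c, o in zip(coeffs, offsets):
        out += c * np.roll(u, -o)  # np.roll(u, -o)[i] == u[i + o]
    return out / dx
```

With offsets (-1, 0, 1) this recovers the central difference (-1/2, 0, 1/2), accurate to O(dx²) on smooth profiles; a learned model outputs coefficients that stay accurate on much coarser grids.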
cs.CV / 172 / 2602.08682
ALIVE: Animate Your World with Lifelike Audio-Video Generation
ALIVE:用逼真的音频-视频生成为您的世界赋予生命
Abstract
Video generation is rapidly evolving towards unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. In particular, the model unlocks the Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) capabilities compared to the T2V foundation models. To support the audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch which includes TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Meanwhile, a comprehensive data pipeline consisting of audio-video captioning, quality control, etc., is carefully designed to collect high-quality finetuning data. Additionally, we introduce a new benchmark to perform a comprehensive model test and comparison. After continued pretraining and finetuning on million-level high-quality data, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: https://github.com/FoundationVision/Alive.
Chinese Translation
视频生成正迅速朝着统一的音频-视频生成发展。本文提出了ALIVE,一种生成模型,它将预训练的文本到视频(Text-to-Video, T2V)模型适配于Sora风格的音频-视频生成和动画。特别地,与T2V基础模型相比,该模型解锁了文本到视频与音频(Text-to-Video&Audio, T2VA)和参考到视频与音频(动画)功能。为了支持音视频同步和参考动画,我们在流行的MMDiT架构中增强了一个联合音频-视频分支,其中包括用于时间对齐的跨模态融合的TA-CrossAttn和用于精确音视频对齐的UniTemp-RoPE。同时,我们精心设计了一个全面的数据管道,包括音视频标注、质量控制等,以收集高质量的微调数据。此外,我们引入了一个新的基准,以进行全面的模型测试和比较。在对百万级高质量数据进行持续的预训练和微调后,ALIVE展示了卓越的性能,始终优于开源模型,并与最先进的商业解决方案相匹配或超越。通过详细的配方和基准,我们希望ALIVE能够帮助社区更高效地开发音频-视频生成模型。官方网站:https://github.com/FoundationVision/Alive。
cs.CV / 173 / 2602.08683
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence
OneVision-Encoder:编码器对齐稀疏性作为多模态智能的基础原则
Abstract
Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. These are the fundamental principles. Yet, modern vision architectures have strayed from these truths: visual signals are highly redundant, while discriminative information, the surprise, is sparse. Current models process dense pixel grids uniformly, wasting vast compute on static background rather than focusing on the predictive residuals that define motion and meaning. We argue that to solve visual understanding, we must align our architectures with the information-theoretic principles of video, i.e., Codecs. Method. OneVision-Encoder encodes video by compressing predictive visual structure into semantic meaning. By adopting Codec Patchification, OV-Encoder abandons uniform computation to focus exclusively on the 3.1%-25% of regions rich in signal entropy. To unify spatial and temporal reasoning under irregular token layouts, OneVision-Encoder employs a shared 3D RoPE and is trained with a large-scale cluster discrimination objective over more than one million semantic concepts, jointly capturing object permanence and motion dynamics. Evidence. The results validate our core hypothesis: efficiency and accuracy are not a trade-off; they are positively correlated. When integrated into an LLM, it consistently outperforms strong vision backbones such as Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks, despite using substantially fewer visual tokens and pretraining data. Notably, on video understanding tasks, OV-Encoder achieves an average improvement of 4.1% over Qwen3-ViT. Codec-aligned, patch-level sparsity is a foundational principle, enabling OV-Encoder as a scalable engine for next-generation visual generalists.
Chinese Translation
假设:人工通用智能本质上是一个压缩问题。有效的压缩需要共鸣:当深度学习的架构与数据的基本结构对齐时,其规模表现最佳。这些是基本原则。然而,现代视觉架构已偏离这些真理:视觉信号高度冗余,而判别信息,即惊讶,是稀疏的。目前的模型对密集的像素网格进行均匀处理,浪费了大量计算在静态背景上,而不是专注于定义运动和意义的预测残差。我们认为,要解决视觉理解问题,必须将我们的架构与视频的信息论原则对齐,即编码器(Codecs)。方法:OneVision-Encoder通过将预测视觉结构压缩为语义意义来编码视频。通过采用编码器块化(Codec Patchification),OV-Encoder放弃均匀计算,专注于信号熵丰富的3.1%-25%的区域。为了在不规则的标记布局下统一空间和时间推理,OneVision-Encoder采用共享的3D RoPE,并通过超过一百万个语义概念的大规模集群区分目标进行训练,联合捕捉物体的持久性和运动动态。证据:结果验证了我们的核心假设:效率和准确性并不是一种权衡;它们是正相关的。当集成到大型语言模型(LLM)中时,OV-Encoder在16个图像、视频和文档理解基准测试中始终优于强大的视觉骨干网络,如Qwen3-ViT和SigLIP2,尽管使用的视觉标记和预训练数据显著更少。值得注意的是,在视频理解任务中,OV-Encoder在Qwen3-ViT上实现了平均4.1%的提升。编码器对齐的块级稀疏性是一个基础原则,使OV-Encoder成为下一代视觉通用模型的可扩展引擎。
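The codec-aligned sparsity principle, spending compute only where the predictive residual carries entropy, can be approximated by ranking patches by frame-difference energy and keeping a small fraction. A deliberately simplified numpy stand-in (the paper's Codec Patchification follows actual codec block structure and entropy estimates far more closely):

```python
import numpy as np

def codec_patch_select(frames, patch=4, keep_frac=0.1):
    """Keep only patches with high frame-to-frame residual energy.

    frames: (T, H, W) grayscale video, H and W divisible by `patch`.
    Returns per-frame indices of the kept patches and the patch energies.
    """
    T, H, W = frames.shape
    resid = np.abs(np.diff(frames, axis=0))   # (T-1, H, W) predictive residual
    gh, gw = H // patch, W // patch
    energy = resid.reshape(T - 1, gh, patch, gw, patch).sum(axis=(2, 4))
    flat = energy.reshape(T - 1, -1)
    k = max(1, int(keep_frac * flat.shape[1]))
    idx = np.argsort(flat, axis=1)[:, -k:]    # top-k patch indices per frame
    return idx, energy
```

Static background produces zero residual and is dropped entirely, which is the mechanism behind processing only the 3.1%-25% of regions rich in signal entropy.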
cs.CV / 174 / 2602.08699
Low-Light Video Enhancement with An Effective Spatial-Temporal Decomposition Paradigm
基于有效时空分解范式的低光视频增强
Abstract
Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise. In this paper, we present an innovative video decomposition strategy that incorporates view-independent and view-dependent components to enhance the performance of LLVE. The framework is called View-aware Low-light Video Enhancement (VLLVE). We leverage dynamic cross-frame correspondences for the view-independent term (which primarily captures intrinsic appearance) and impose a scene-level continuity constraint on the view-dependent term (which mainly describes the shading condition) to achieve consistent and satisfactory decomposition results. To further ensure consistent decomposition, we introduce a dual-structure enhancement network featuring a cross-frame interaction mechanism. By supervising different frames simultaneously, this network encourages them to exhibit matching decomposition features. This mechanism can seamlessly integrate with encoder-decoder single-frame networks, incurring minimal additional parameter costs. Building upon VLLVE, we propose a more comprehensive decomposition strategy by introducing an additive residual term, resulting in VLLVE++. This residual term can simulate scene-adaptive degradations, which are difficult to model using a decomposition formulation for common scenes, thereby further enhancing the ability to capture the overall content of videos. In addition, VLLVE++ enables bidirectional learning for both enhancement and degradation-aware correspondence refinement (end-to-end manner), effectively increasing reliable correspondences while filtering out incorrect ones. Notably, VLLVE++ demonstrates strong capability in handling challenging cases, such as real-world scenes and videos with high dynamics. Extensive experiments are conducted on widely recognized LLVE benchmarks.
Chinese Translation
低光视频增强(Low-Light Video Enhancement, LLVE)旨在恢复受到严重不可见性和噪声影响的动态或静态场景。本文提出了一种创新的视频分解策略,该策略结合了视角无关和视角相关的组件,以提升LLVE的性能。该框架称为视角感知低光视频增强(View-aware Low-light Video Enhancement, VLLVE)。我们利用动态跨帧对应关系来处理视角无关项(主要捕捉内在外观),并对视角相关项(主要描述阴影条件)施加场景级连续性约束,以实现一致且令人满意的分解结果。为了进一步确保一致的分解,我们引入了一个双结构增强网络,具有跨帧交互机制。通过同时监督不同帧,该网络鼓励它们展现匹配的分解特征。该机制可以与编码器-解码器单帧网络无缝集成,增加的参数成本极小。在VLLVE的基础上,我们通过引入加性残差项提出了一种更全面的分解策略,形成了VLLVE++。该残差项可以模拟场景自适应退化,这在常见场景的分解公式中难以建模,从而进一步增强了捕捉视频整体内容的能力。此外,VLLVE++支持双向学习,既用于增强又用于退化感知对应关系的精细化(端到端方式),有效增加可靠的对应关系,同时过滤掉不正确的对应关系。值得注意的是,VLLVE++在处理具有挑战性的案例(如真实场景和高动态视频)方面表现出强大的能力。我们在广泛认可的LLVE基准上进行了大量实验。
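The decomposition arithmetic behind VLLVE and VLLVE++ is compact: a frame is the product of a view-independent (appearance) term and a view-dependent (shading) term, with VLLVE++ adding an additive residual for degradations the product cannot express; the shading term is additionally constrained to vary smoothly across the scene. A sketch of both pieces (the continuity penalty's exact form is an assumption):

```python
import numpy as np

def recompose(view_independent, view_dependent, residual=None):
    """VLLVE:   frame = view_independent * view_dependent.
    VLLVE++: frame = view_independent * view_dependent + residual."""
    out = view_independent * view_dependent
    return out if residual is None else out + residual

def shading_continuity_loss(shading_seq):
    """Continuity constraint on the view-dependent term: mean squared
    temporal difference over a (T, H, W) shading sequence."""
    return float(np.mean(np.diff(shading_seq, axis=0) ** 2))
```

The residual term is what lets VLLVE++ absorb scene-adaptive degradations (e.g., sensor noise in real low-light footage) without distorting the appearance/shading split.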
cs.CV / 175 / 2602.08711
TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions
TimeChat-Captioner:利用时间感知和结构化音视频字幕编写多场景视频
Abstract
This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling readers to vividly imagine the video content scene by scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality, human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset, TimeChatCap-42K, and present TimeChat-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that TimeChat-Captioner-7B achieves state-of-the-art performance, surpassing Gemini-2.5-Pro, while its generated dense descriptions significantly boost downstream capabilities in audio-visual reasoning (DailyOmni and WorldSense) and temporal grounding (Charades-STA). All datasets, models, and code will be made publicly available at https://github.com/yaolinli/TimeChat-Captioner.
Chinese Translation
本文提出了全密度字幕生成(Omni Dense Captioning),这是一项新任务,旨在生成连续、细粒度和结构化的音视频叙事,并带有明确的时间戳。为确保语义的密集覆盖,我们引入了六维结构化框架,以创建“剧本式”字幕,使读者能够逐场景生动地想象视频内容,类似于电影剧本。为了促进研究,我们构建了OmniDCBench,这是一个高质量的人类标注基准,并提出了SodaM,这是一种统一的度量标准,用于评估时间感知的详细描述,同时减轻场景边界模糊性。此外,我们构建了训练数据集TimeChatCap-42K,并提出了TimeChat-Captioner-7B,这是一个通过SFT和GRPO训练的强基线,具有任务特定的奖励。大量实验表明,TimeChat-Captioner-7B达到了最先进的性能,超越了Gemini-2.5-Pro,同时其生成的密集描述显著提升了音视频推理(DailyOmni和WorldSense)和时间定位(Charades-STA)的下游能力。所有数据集、模型和代码将公开发布在https://github.com/yaolinli/TimeChat-Captioner。
cs.CV / 176 / 2602.08713
Towards Understanding Multimodal Fine-Tuning: Spatial Features
理解多模态微调的方向:空间特征
Abstract
Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how language backbone representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of VLM adaptation. Using stage-wise model diffing, a technique that isolates representational changes introduced during multimodal fine-tuning, we reveal how a language model learns to "see". We first identify vision-preferring features that emerge or reorient during fine-tuning. We then show that a selective subset of these features reliably encodes spatial relations, revealed through controlled shifts to spatial prompts. Finally, we trace the causal activation of these features to a small group of attention heads. Our findings show that stage-wise model diffing reveals when and where spatially grounded multimodal features arise. It also provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language models acquire vision-grounded capabilities.
Chinese Translation
当代视觉-语言模型(VLMs)通过将视觉编码器与经过预训练的语言模型配对,并针对视觉-文本输入进行微调,从而在广泛的任务上取得了强劲的表现。然而,尽管取得了这些进展,语言主干表示在多模态训练过程中如何适应以及视觉特定能力何时出现仍然不清楚。在本研究中,我们首次对VLM适应进行了机制分析。通过阶段性模型差异分析(stage-wise model diffing),一种能够隔离多模态微调过程中引入的表示变化的技术,我们揭示了语言模型如何学习“看”。我们首先识别出在微调过程中出现或重新定向的视觉偏好特征。然后,我们展示了这些特征的一个选择性子集可靠地编码空间关系,这通过对空间提示的控制性变化得以揭示。最后,我们追踪了这些特征的因果激活到一小组注意力头。我们的研究结果表明,阶段性模型差异分析揭示了空间基础的多模态特征何时以及在何处出现。它还通过展示视觉基础如何重塑之前仅为文本的特征,提供了更清晰的模态融合视角。这一方法增强了多模态训练的可解释性,并为理解和优化预训练语言模型如何获得视觉基础能力奠定了基础。
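Stage-wise model diffing reduces, at its simplest, to comparing per-feature directions across checkpoints: features whose direction survives fine-tuning are stable, while large rotations mark candidates for newly vision-preferring features. A minimal sketch (the paper's pipeline works on learned feature dictionaries rather than raw weight rows; the threshold here is illustrative):

```python
import numpy as np

def feature_diff(W_before, W_after, reorient_thresh=0.8):
    """Flag features (rows) whose direction rotated during fine-tuning.

    Returns per-feature cosine similarity between checkpoints and the
    indices of features whose similarity fell below `reorient_thresh`.
    """
    nb = W_before / np.linalg.norm(W_before, axis=1, keepdims=True)
    na = W_after / np.linalg.norm(W_after, axis=1, keepdims=True)
    cos = (nb * na).sum(axis=1)
    return cos, np.flatnonzero(cos < reorient_thresh)
```

Probing the flagged features with controlled spatial prompts is then what links the representational change to an emergent capability, as the analysis in the paper does.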
cs.CV / 177 / 2602.08717
Zero-shot System for Automatic Body Region Detection for Volumetric CT and MR Images
用于体积CT和MR图像的自动身体区域检测的零样本系统
Abstract
Reliable identification of anatomical body regions is a prerequisite for many automated medical imaging workflows, yet existing solutions remain heavily dependent on unreliable DICOM metadata. Current solutions mainly use supervised learning, which limits their applicability in many real-world scenarios. In this work, we investigate whether body region detection in volumetric CT and MR images can be achieved in a fully zero-shot manner by using knowledge embedded in large pre-trained foundation models. We propose and systematically evaluate three training-free pipelines: (1) a segmentation-driven rule-based system leveraging pre-trained multi-organ segmentation models, (2) a Multimodal Large Language Model (MLLM) guided by radiologist-defined rules, and (3) a segmentation-aware MLLM that combines visual input with explicit anatomical evidence. All methods are evaluated on 887 heterogeneous CT and MR scans with manually verified anatomical region labels. The segmentation-driven rule-based approach achieves the strongest and most consistent performance, with weighted F1-scores of 0.947 (CT) and 0.914 (MR), demonstrating robustness across modalities and atypical scan coverage. The MLLM performs competitively in visually distinctive regions, while the segmentation-aware MLLM reveals fundamental limitations.
Chinese Translation
可靠识别解剖身体区域是许多自动化医学影像工作流程的前提,然而现有解决方案仍然严重依赖不可靠的DICOM元数据。目前的解决方案主要采用监督学习,这限制了它们在许多现实场景中的适用性。在本研究中,我们探讨了是否可以通过利用嵌入在大型预训练基础模型中的知识,以完全零样本的方式实现体积CT和MR图像中的身体区域检测。我们提出并系统评估了三种无训练管道:(1) 利用预训练多脏器分割模型的分割驱动规则系统,(2) 由放射科医生定义规则指导的多模态大型语言模型(MLLM),以及(3) 结合视觉输入与明确解剖证据的分割感知MLLM。所有方法在887个异质CT和MR扫描上进行评估,这些扫描具有手动验证的解剖区域标签。分割驱动的基于规则的方法实现了最强且最一致的性能,CT的加权F1分数为0.947,MR为0.914,显示出在不同模态和非典型扫描覆盖下的稳健性。MLLM在视觉上独特的区域表现出竞争力,而分割感知MLLM则揭示了基本的局限性。
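The segmentation-driven rule-based idea — infer body regions from which organs a pretrained multi-organ segmentation model finds in the volume — reduces to a lookup once the organ set is known. The organ-to-region table below is a hypothetical illustration, not the paper's rule set (real systems also use organ z-extents):

```python
# Hypothetical organ -> body-region rules for illustration only.
ORGAN_TO_REGION = {
    "brain": "head", "lung": "chest", "heart": "chest",
    "liver": "abdomen", "kidney": "abdomen", "bladder": "pelvis",
}

def detect_regions(detected_organs):
    """Rule-based region labels from the set of segmented organ names;
    unknown organs are ignored."""
    regions = {ORGAN_TO_REGION[o] for o in detected_organs
               if o in ORGAN_TO_REGION}
    return sorted(regions)
```

Because the rules operate on segmentation output rather than DICOM metadata, the same table applies to both CT and MR once a segmentation model for each modality is available.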
cs.CV / 178 / 2602.08724
Rotated Lights for Consistent and Efficient 2D Gaussians Inverse Rendering
旋转光源用于一致且高效的二维高斯逆渲染
Abstract
Inverse rendering aims to decompose a scene into its geometry, material properties and light conditions under a certain rendering model. It has wide applications like view synthesis, relighting, and scene editing. In recent years, inverse rendering methods have been inspired by view synthesis approaches like neural radiance fields and Gaussian splatting, which are capable of efficiently decomposing a scene into its geometry and radiance. They then further estimate the material and lighting that lead to the observed scene radiance. However, the latter step is highly ambiguous and prior works suffer from inaccurate color and baked shadows in their albedo estimation despite their regularization. To this end, we propose RotLight, a simple capturing setup, to address the ambiguity. Compared to a usual capture, RotLight only requires the object to be rotated several times during the process. We show that as few as two rotations are effective in reducing artifacts. To further improve 2DGS-based inverse rendering, we additionally introduce a proxy mesh that not only allows accurate incident light tracing, but also enables a residual constraint and improves global illumination handling. We demonstrate with both synthetic and real world datasets that our method achieves superior albedo estimation while keeping efficient computation.
Chinese Translation
逆渲染旨在将场景分解为其几何形状、材料属性和在特定渲染模型下的光照条件。它在视图合成、重光照和场景编辑等方面有广泛应用。近年来,逆渲染方法受到视图合成方法的启发,如神经辐射场(neural radiance fields)和高斯溅射(Gaussian splatting),这些方法能够高效地将场景分解为其几何形状和辐射。然后,它们进一步估计导致观察到的场景辐射的材料和光照。然而,后一步高度模糊,尽管有正则化,先前的工作在其反照率估计中仍然存在不准确的颜色和烘焙阴影。为此,我们提出了RotLight,一个简单的捕捉设置,以解决模糊性。与常规捕捉相比,RotLight只需在过程中旋转物体几次。我们展示了仅需两次旋转即可有效减少伪影。为了进一步改善基于二维高斯(2DGS)的逆渲染,我们还引入了一个代理网格(proxy mesh),它不仅允许准确的入射光追踪,还能够实现残差约束并改善全局光照处理。我们通过合成和真实世界数据集证明了我们的方法在保持高效计算的同时,实现了优越的反照率估计。
cs.CV / 179 / 2602.08725
FusionEdit: Semantic Fusion and Attention Modulation for Training-Free Image Editing
FusionEdit:无训练图像编辑的语义融合与注意力调制
Abstract
Text-guided image editing aims to modify specific regions according to the target prompt while preserving the identity of the source image. Recent methods exploit explicit binary masks to constrain editing, but hard mask boundaries introduce artifacts and reduce editability. To address these issues, we propose FusionEdit, a training-free image editing framework that achieves precise and controllable edits. First, editing and preserved regions are automatically identified by measuring semantic discrepancies between source and target prompts. To mitigate boundary artifacts, FusionEdit performs distance-aware latent fusion along region boundaries to yield the soft and accurate mask, and employs a total variation loss to enforce smooth transitions, obtaining natural editing results. Second, FusionEdit leverages AdaIN-based modulation within DiT attention layers to perform a statistical attention fusion in the editing region, enhancing editability while preserving global consistency with the source image. Extensive experiments demonstrate that our FusionEdit significantly outperforms state-of-the-art methods. Code is available at \href{https://github.com/Yvan1001/FusionEdit}{https://github.com/Yvan1001/FusionEdit}.
Chinese Translation
文本引导的图像编辑旨在根据目标提示修改特定区域,同时保持源图像的身份。最近的方法利用显式二进制掩码来限制编辑,但硬掩码边界会引入伪影并降低可编辑性。为了解决这些问题,我们提出了FusionEdit,一个无训练的图像编辑框架,能够实现精确和可控的编辑。首先,通过测量源提示和目标提示之间的语义差异,自动识别编辑和保留区域。为了减轻边界伪影,FusionEdit沿区域边界执行距离感知的潜在融合,以生成柔和且准确的掩码,并采用全变差损失来强制平滑过渡,从而获得自然的编辑结果。其次,FusionEdit在DiT注意力层内利用基于AdaIN的调制,在编辑区域执行统计注意力融合,增强可编辑性,同时保持与源图像的全局一致性。大量实验表明,我们的FusionEdit显著优于最先进的方法。代码可在 https://github.com/Yvan1001/FusionEdit 获取。
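The distance-aware fusion idea — replace a hard binary edit mask with a soft blend weight that decays with distance from the edit region — can be sketched in 1D. The linear falloff and the `width` parameter are assumptions for illustration; the paper's actual fusion operates on diffusion latents:

```python
import numpy as np

def soft_mask_1d(hard_mask, width=3):
    """Distance-aware softening of a 1D binary edit mask: the blend
    weight decays linearly over `width` pixels outside the edit region."""
    n = len(hard_mask)
    inside = np.where(np.asarray(hard_mask) > 0)[0]
    dist = np.array([np.abs(inside - i).min() if len(inside) else np.inf
                     for i in range(n)], dtype=float)
    return np.clip(1.0 - dist / width, 0.0, 1.0)

def fuse_latents(src, tgt, soft):
    """Blend edited (tgt) and source latents with the soft mask."""
    return soft * tgt + (1.0 - soft) * src
```

With a hard mask, the blend weight would jump from 1 to 0 at the boundary, which is exactly the kind of discontinuity that produces the boundary artifacts the abstract mentions.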
cs.CV / 180 / 2602.08726
SynSacc: A Blender-to-V2E Pipeline for Synthetic Neuromorphic Eye-Movement Data and Sim-to-Real Spiking Model Training
SynSacc:一种用于合成神经形态眼动数据和从模拟到真实脉冲模型训练的Blender到V2E管道
Abstract
The study of eye movements, particularly saccades and fixations, is fundamental to understanding the mechanisms of human cognition and perception. Accurate classification of these movements requires sensing technologies capable of capturing rapid dynamics without distortion. Event cameras, also known as Dynamic Vision Sensors (DVS), provide asynchronous recordings of changes in light intensity, thereby eliminating motion blur inherent in conventional frame-based cameras and offering superior temporal resolution and data efficiency. In this study, we introduce a synthetic dataset generated with Blender to simulate saccades and fixations under controlled conditions. Leveraging Spiking Neural Networks (SNNs), we evaluate its robustness by training two architectures and finetuning on real event data. The proposed models achieve up to 0.83 accuracy and maintain consistent performance across varying temporal resolutions, demonstrating stability in eye movement classification. Moreover, the use of SNNs with synthetic event streams yields substantial computational efficiency gains over artificial neural network (ANN) counterparts, underscoring the utility of synthetic data augmentation in advancing event-based vision. All code and datasets associated with this work are available at https://github.com/Ikhadija-5/SynSacc-Dataset.
Chinese Translation
眼动研究,特别是眼跳和注视,对于理解人类认知和感知机制至关重要。准确分类这些运动需要能够捕捉快速动态而不失真的传感技术。事件相机,也称为动态视觉传感器(Dynamic Vision Sensors, DVS),提供光强变化的异步记录,从而消除了传统帧基相机固有的运动模糊,并提供了更优越的时间分辨率和数据效率。在本研究中,我们引入了一个使用Blender生成的合成数据集,以在受控条件下模拟眼跳和注视。利用脉冲神经网络(Spiking Neural Networks, SNNs),我们通过训练两种架构并在真实事件数据上进行微调来评估其稳健性。所提出的模型达到了高达0.83的准确率,并在不同时间分辨率下保持一致的性能,展示了眼动分类的稳定性。此外,使用SNN与合成事件流相比,较人工神经网络(Artificial Neural Networks, ANN)具有显著的计算效率提升,强调了合成数据增强在推动基于事件的视觉研究中的实用性。与本研究相关的所有代码和数据集可在https://github.com/Ikhadija-5/SynSacc-Dataset获取。
cs.CV / 181 / 2602.08727
Artifact Reduction in Undersampled 3D Cone-Beam CTs using a Hybrid 2D-3D CNN Framework
使用混合2D-3D卷积神经网络框架减少欠采样3D锥束CT中的伪影
Abstract
Undersampled CT volumes minimize acquisition time and radiation exposure but introduce artifacts degrading image quality and diagnostic utility. Reducing these artifacts is critical for high-quality imaging. We propose a computationally efficient hybrid deep-learning framework that combines the strengths of 2D and 3D models. First, a 2D U-Net operates on individual slices of undersampled CT volumes to extract feature maps. These slice-wise feature maps are then stacked across the volume and used as input to a 3D decoder, which utilizes contextual information across slices to predict an artifact-free 3D CT volume. The proposed two-stage approach balances the computational efficiency of 2D processing with the volumetric consistency provided by 3D modeling. The results show substantial improvements in inter-slice consistency in coronal and sagittal direction with low computational overhead. This hybrid framework presents a robust and efficient solution for high-quality 3D CT image post-processing. The code of this project can be found on github: https://github.com/J-3TO/2D-3DCNN_sparseview/.
Chinese Translation
欠采样CT体积可以减少采集时间和辐射暴露,但会引入伪影,降低图像质量和诊断效用。减少这些伪影对于高质量成像至关重要。我们提出了一种计算效率高的混合深度学习框架,结合了2D和3D模型的优势。首先,2D U-Net在欠采样CT体积的单个切片上操作,以提取特征图。这些切片特征图随后在体积中堆叠,并作为输入提供给3D解码器,该解码器利用切片间的上下文信息来预测无伪影的3D CT体积。所提出的两阶段方法平衡了2D处理的计算效率与3D建模提供的体积一致性。结果显示,在冠状面和矢状面方向上,切片间的一致性有显著改善,同时计算开销较低。该混合框架为高质量3D CT图像后处理提供了一种稳健且高效的解决方案。该项目的代码可以在github上找到:https://github.com/J-3TO/2D-3DCNN_sparseview/
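The hybrid shape flow — a 2D encoder applied slice by slice, feature maps stacked along the slice axis, then a 3D decoder over the stack — is easy to see with stand-in functions. Both "networks" below are trivial placeholders chosen only to make the tensor shapes explicit; the real framework uses a 2D U-Net and a learned 3D decoder:

```python
import numpy as np

def encode_slice_2d(slice_2d, n_features=4):
    """Stand-in for the 2D U-Net: repeats the slice into `n_features`
    channel maps so the shape flow is visible."""
    return np.stack([slice_2d] * n_features, axis=0)  # (C, H, W)

def hybrid_2d3d(volume):
    """Per-slice 2D features stacked across the volume, then passed to a
    stand-in 3D decoder (channel average) -> artifact-reduced volume."""
    feats = np.stack([encode_slice_2d(s) for s in volume], axis=1)  # (C, D, H, W)
    return feats.mean(axis=0)  # (D, H, W)
```

The point of the design is that the expensive convolutions run in 2D, while only the final decoding stage sees the full (C, D, H, W) tensor and can enforce inter-slice consistency.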
cs.CV / 182 / 2602.08730
Closing the Confusion Loop: CLIP-Guided Alignment for Source-Free Domain Adaptation
关闭混淆循环:基于CLIP的无源领域适应对齐
Abstract
Source-Free Domain Adaptation (SFDA) tackles the problem of adapting a pre-trained source model to an unlabeled target domain without accessing any source data, which is quite suitable for the field of data security. Although recent advances have shown that pseudo-labeling strategies can be effective, they often fail in fine-grained scenarios due to subtle inter-class similarities. A critical but underexplored issue is the presence of asymmetric and dynamic class confusion, where visually similar classes are unequally and inconsistently misclassified by the source model. Existing methods typically ignore such confusion patterns, leading to noisy pseudo-labels and poor target discrimination. To address this, we propose CLIP-Guided Alignment (CGA), a novel framework that explicitly models and mitigates class confusion in SFDA. Generally, our method consists of three parts: (1) MCA: first detects directional confusion pairs by analyzing the predictions of the source model in the target domain; (2) MCC: leverages CLIP to construct confusion-aware textual prompts (e.g., a truck that looks like a bus), enabling more context-sensitive pseudo-labeling; and (3) FAM: builds confusion-guided feature banks for both CLIP and the source model and aligns them using contrastive learning to reduce ambiguity in the representation space. Extensive experiments on various datasets demonstrate that CGA consistently outperforms state-of-the-art SFDA methods, with especially notable gains in confusion-prone and fine-grained scenarios. Our results highlight the importance of explicitly modeling inter-class confusion for effective source-free adaptation. Our code can be found at https://github.com/soloiro/CGA.
Chinese Translation
无源领域适应(Source-Free Domain Adaptation, SFDA)解决了在不访问任何源数据的情况下,将预训练源模型适应于无标签目标领域的问题,这在数据安全领域尤为适用。尽管近期的进展表明伪标签策略可能有效,但在细粒度场景中,由于类间微妙的相似性,它们往往失败。一个重要但尚未充分探讨的问题是存在不对称和动态的类混淆,其中视觉上相似的类被源模型不均等且不一致地错误分类。现有方法通常忽视这种混淆模式,导致伪标签噪声和目标区分能力差。为了解决这一问题,我们提出了基于CLIP的对齐框架(CLIP-Guided Alignment, CGA),该框架明确建模并减轻SFDA中的类混淆。一般而言,我们的方法由三个部分组成:(1)MCA:通过分析源模型在目标领域的预测,首先检测方向性混淆对;(2)MCC:利用CLIP构建混淆感知的文本提示(例如,看起来像公交车的卡车),以实现更具上下文敏感性的伪标签;(3)FAM:为CLIP和源模型构建混淆引导的特征库,并使用对比学习对其进行对齐,以减少表示空间中的歧义。在各种数据集上的广泛实验表明,CGA在性能上始终优于最先进的SFDA方法,尤其是在混淆易发和细粒度场景中表现出显著的提升。我们的结果强调了明确建模类间混淆对于有效的无源适应的重要性。我们的代码可以在 https://github.com/soloiro/CGA 找到。
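Detecting directional confusion pairs from the source model's target-domain predictions can be sketched from a confusion matrix: normalize each true-class row, zero the diagonal, and rank the remaining (true, predicted) pairs. Asymmetry shows up when rate(a→b) differs from rate(b→a). This is a generic sketch, not the paper's MCA module:

```python
import numpy as np

def directional_confusion_pairs(conf_matrix, top_k=2):
    """Rank off-diagonal (true -> predicted) pairs by per-class
    confusion rate, most confused first."""
    rates = np.asarray(conf_matrix, dtype=float)
    rates = rates / rates.sum(axis=1, keepdims=True)  # row-normalize
    np.fill_diagonal(rates, 0.0)                      # ignore correct predictions
    flat = np.argsort(rates, axis=None)[::-1][:top_k]
    return [tuple(np.unravel_index(i, rates.shape)) for i in flat]
```

The ranked pairs could then seed confusion-aware prompts of the form "a {true class} that looks like a {predicted class}".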
cs.CV / 183 / 2602.08735
From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models
从对应到行动:多模态大语言模型中的类人多图像空间推理
Abstract
While multimodal large language models (MLLMs) have made substantial progress in single-image spatial reasoning, multi-image spatial reasoning, which requires integration of information from multiple viewpoints, remains challenging. Cognitive studies suggest that humans address such tasks through two mechanisms: cross-view correspondence, which identifies regions across different views that correspond to the same physical locations, and stepwise viewpoint transformation, which composes relative viewpoint changes sequentially. However, existing studies incorporate these mechanisms only partially and often implicitly, without explicit supervision for both. We propose Human-Aware Training for Cross-view correspondence and viewpoint cHange (HATCH), a training framework with two complementary objectives: (1) Patch-Level Spatial Alignment, which encourages patch representations to align across views for spatially corresponding regions, and (2) Action-then-Answer Reasoning, which requires the model to generate explicit viewpoint transition actions before predicting the final answer. Experiments on three benchmarks demonstrate that HATCH consistently outperforms baselines of comparable size by a clear margin and achieves competitive results against much larger models, while preserving single-image reasoning capabilities.
Chinese Translation
尽管多模态大语言模型(MLLMs)在单图像空间推理方面取得了显著进展,但多图像空间推理仍然具有挑战性,因为它需要整合来自多个视角的信息。认知研究表明,人类通过两种机制来解决此类任务:跨视角对应(cross-view correspondence),即识别不同视角中对应于同一物理位置的区域,以及逐步视角转换(stepwise viewpoint transformation),即顺序组合相对视角变化。然而,现有研究仅部分且通常隐含地纳入这些机制,并且缺乏对这两者的明确监督。我们提出了跨视角对应与视角变化的人类感知训练(Human-Aware Training for Cross-view correspondence and viewpoint cHange,HATCH),这是一个具有两个互补目标的训练框架:(1)补丁级空间对齐(Patch-Level Spatial Alignment),鼓励补丁表示在空间对应区域之间对齐;(2)先行动后回答推理(Action-then-Answer Reasoning),要求模型在预测最终答案之前生成明确的视角转换动作。在三个基准测试上的实验表明,HATCH在相似规模的基线模型中始终以明显的优势超越,并在与更大模型的竞争中取得了具有竞争力的结果,同时保持了单图像推理能力。
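Patch-level spatial alignment amounts to pulling together embeddings of patches that depict the same physical location in two views. A minimal cosine-based alignment loss over known correspondences looks like this; the loss form is a plausible sketch, not HATCH's exact objective:

```python
import numpy as np

def patch_alignment_loss(patches_a, patches_b, correspondence):
    """Mean (1 - cosine similarity) over corresponding patch pairs
    across two views; `correspondence` maps view-A indices to view-B."""
    total = 0.0
    for i, j in correspondence:
        a, b = patches_a[i], patches_b[j]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        total += 1.0 - cos
    return total / len(correspondence)
```

A loss of zero means every corresponding patch pair already points in the same direction in embedding space, i.e. the cross-view correspondence the cognitive account describes is fully encoded.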
cs.CV / 184 / 2602.08749
Shifting the Breaking Point of Flow Matching for Multi-Instance Editing
多实例编辑中流匹配的突破点转移
Abstract
Flow matching models have recently emerged as an efficient alternative to diffusion, especially for text-guided image generation and editing, offering faster inference through continuous-time dynamics. However, existing flow-based editors predominantly support global or single-instruction edits and struggle with multi-instance scenarios, where multiple parts of a reference input must be edited independently without semantic interference. We identify this limitation as a consequence of globally conditioned velocity fields and joint attention mechanisms, which entangle concurrent edits. To address this issue, we introduce Instance-Disentangled Attention, a mechanism that partitions joint attention operations, enforcing binding between instance-specific textual instructions and spatial regions during velocity field estimation. We evaluate our approach on both natural image editing and a newly introduced benchmark of text-dense infographics with region-level editing instructions. Experimental results demonstrate that our approach promotes edit disentanglement and locality while preserving global output coherence, enabling single-pass, instance-level editing.
Chinese Translation
流匹配模型最近作为扩散的高效替代方案出现,特别是在文本引导的图像生成和编辑中,通过连续时间动态提供更快的推理。然而,现有的基于流的编辑器主要支持全局或单一指令的编辑,且在多实例场景中表现不佳,在这些场景中,参考输入的多个部分必须独立编辑而不产生语义干扰。我们将这一限制视为全球条件速度场和联合注意机制的结果,这些机制使得并发编辑相互纠缠。为了解决这个问题,我们引入了实例解耦注意机制(Instance-Disentangled Attention),该机制对联合注意操作进行分区,在速度场估计过程中强制将实例特定的文本指令与空间区域绑定。我们在自然图像编辑和一个新引入的文本密集信息图基准上评估了我们的方法,该基准具有区域级编辑指令。实验结果表明,我们的方法促进了编辑的解耦和局部性,同时保持了全局输出的一致性,实现了单次传递的实例级编辑。
cs.CV / 185 / 2602.08753
MVAnimate: Enhancing Character Animation with Multi-View Optimization
MVAnimate:通过多视角优化增强角色动画
Abstract
The demand for realistic and versatile character animation has surged, driven by its wide-ranging applications in various domains. However, animation generation algorithms that model human pose with 2D or 3D structures face various problems, including low-quality output content and training data deficiency, preventing the related algorithms from generating high-quality animation videos. Therefore, we introduce MVAnimate, a novel framework that synthesizes both 2D and 3D information of dynamic figures based on multi-view prior information, to enhance the generated video quality. Our approach leverages multi-view prior information to produce temporally consistent and spatially coherent animation outputs, demonstrating improvements over existing animation methods. Our MVAnimate also optimizes the multi-view videos of the target character, enhancing the video quality from different views. Experimental results on diverse datasets highlight the robustness of our method in handling various motion patterns and appearances.
Chinese Translation
对逼真且多样化的角色动画的需求激增,源于其在各个领域的广泛应用。然而,基于2D或3D结构建模人类姿态的动画生成算法都面临着各种问题,包括低质量的输出内容和训练数据不足,这阻碍了相关算法生成高质量的动画视频。因此,我们提出了MVAnimate,一个新颖的框架,基于多视角先验信息合成动态人物的2D和3D信息,以提升生成视频的质量。我们的方法利用多视角先验信息生成时间一致和空间连贯的动画输出,展示了相较于现有动画方法的改进。我们的MVAnimate还优化了目标角色的多视角视频,从不同视角提升了视频质量。在多样化数据集上的实验结果突显了我们方法在处理各种运动模式和外观方面的鲁棒性。
cs.CV / 186 / 2602.08775
VedicTHG: Symbolic Vedic Computation for Low-Resource Talking-Head Generation in Educational Avatars
VedicTHG:用于教育化身的低资源对话头生成的符号性吠陀计算
Abstract
Talking-head avatars are increasingly adopted in educational technology to deliver content with social presence and improved engagement. However, many recent talking-head generation (THG) methods rely on GPU-centric neural rendering, large training sets, or high-capacity diffusion models, which limits deployment in offline or resource-constrained learning environments. A deterministic and CPU-oriented THG framework is described, termed Symbolic Vedic Computation, that converts speech to a time-aligned phoneme stream, maps phonemes to a compact viseme inventory, and produces smooth viseme trajectories through symbolic coarticulation inspired by the Vedic sutra Urdhva Tiryakbhyam. A lightweight 2D renderer performs region-of-interest (ROI) warping and mouth compositing with stabilization to support real-time synthesis on commodity CPUs. Experiments report synchronization accuracy, temporal stability, and identity consistency under CPU-only execution, alongside benchmarking against representative CPU-feasible baselines. Results indicate that acceptable lip-sync quality can be achieved while substantially reducing computational load and latency, supporting practical educational avatars on low-end hardware. GitHub: https://vineetkumarrakesh.github.io/vedicthg
Chinese Translation
对话头化身在教育技术中越来越多地被采用,以提供具有社会存在感和提高参与度的内容。然而,许多近期的对话头生成(THG)方法依赖于以GPU为中心的神经渲染、大规模训练集或高容量扩散模型,这限制了它们在离线或资源受限的学习环境中的应用。本文描述了一种确定性且以CPU为导向的THG框架,称为符号性吠陀计算(Symbolic Vedic Computation),该框架将语音转换为时间对齐的音素流,将音素映射到紧凑的视觉音素(viseme)库,并通过受吠陀经文Urdhva Tiryakbhyam启发的符号共发音生成平滑的视觉音素轨迹。一个轻量级的2D渲染器执行感兴趣区域(ROI)扭曲和口部合成,并进行稳定化,以支持在普通CPU上实时合成。实验报告了在仅使用CPU执行时的同步精度、时间稳定性和身份一致性,并与代表性的CPU可行基线进行了基准测试。结果表明,在显著降低计算负载和延迟的同时,可以实现可接受的唇同步质量,从而支持在低端硬件上使用的实用教育化身。GitHub: https://vineetkumarrakesh.github.io/vedicthg
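The phoneme-to-viseme stage of such a pipeline is a table lookup followed by temporal smoothing of the viseme activations. The mapping below is a hypothetical miniature inventory, and the moving average stands in for the paper's symbolic coarticulation:

```python
import numpy as np

# Hypothetical compact phoneme -> viseme mapping for illustration.
PHONEME_TO_VISEME = {"p": "PP", "b": "PP", "f": "FF", "a": "AA", "i": "IY"}

def viseme_trajectory(phonemes, smooth=3):
    """Map phonemes to one-hot viseme rows over time, then smooth each
    viseme channel with a moving average (coarticulation stand-in)."""
    visemes = sorted(set(PHONEME_TO_VISEME.values()))
    idx = {v: k for k, v in enumerate(visemes)}
    traj = np.zeros((len(phonemes), len(visemes)))
    for t, p in enumerate(phonemes):
        traj[t, idx[PHONEME_TO_VISEME[p]]] = 1.0
    kernel = np.ones(smooth) / smooth
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="same"), 0, traj)
```

Because every step is a deterministic table lookup or convolution, the whole stage runs in real time on a CPU, which is the point of the framework.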
cs.CV / 187 / 2602.08792
Multimodal Learning for Arcing Detection in Pantograph-Catenary Systems
用于受电弓-接触网系统弧光检测的多模态学习
Abstract
The pantograph-catenary interface is essential for ensuring uninterrupted and reliable power delivery in electrified rail systems. However, electrical arcing at this interface poses serious risks, including accelerated wear of contact components, degraded system performance, and potential service disruptions. Detecting arcing events at the pantograph-catenary interface is challenging due to their transient nature, noisy operating environment, data scarcity, and the difficulty of distinguishing arcs from other similar transient phenomena. To address these challenges, we propose a novel multimodal framework that combines high-resolution image data with force measurements to more accurately and robustly detect arcing events. First, we construct two arcing detection datasets comprising synchronized visual and force measurements. One dataset is built from data provided by the Swiss Federal Railways (SBB), and the other is derived from publicly available videos of arcing events in different railway systems and synthetic force data that mimic the characteristics observed in the real dataset. Leveraging these datasets, we propose MultiDeepSAD, an extension of the DeepSAD algorithm for multiple modalities with a new loss formulation. Additionally, we introduce tailored pseudo-anomaly generation techniques specific to each data type, such as synthetic arc-like artifacts in images and simulated force irregularities, to augment training data and improve the discriminative ability of the model. Through extensive experiments and ablation studies, we demonstrate that our framework significantly outperforms baseline approaches, exhibiting enhanced sensitivity to real arcing events even under domain shifts and limited availability of real arcing observations.
Chinese Translation
受电弓-接触网接口对于确保电气化铁路系统中不间断和可靠的电力传输至关重要。然而,该接口处的电弧现象带来了严重的风险,包括接触组件的加速磨损、系统性能的下降以及潜在的服务中断。由于电弧事件的瞬态特性、嘈杂的操作环境、数据稀缺以及区分电弧与其他类似瞬态现象的困难,检测受电弓-接触网接口处的电弧事件面临挑战。为了解决这些问题,我们提出了一种新颖的多模态框架,将高分辨率图像数据与力测量相结合,以更准确和稳健地检测电弧事件。首先,我们构建了两个电弧检测数据集,包含同步的视觉和力测量数据。其中一个数据集基于瑞士联邦铁路公司(SBB)提供的数据,另一个数据集则来源于不同铁路系统中电弧事件的公开视频以及模拟的力数据,这些数据模拟了在真实数据集中观察到的特征。利用这些数据集,我们提出了MultiDeepSAD,这是DeepSAD算法的多模态扩展,具有新的损失公式。此外,我们引入了针对每种数据类型的定制伪异常生成技术,例如图像中的合成电弧伪影和模拟的力不规则性,以增强训练数据并提高模型的区分能力。通过广泛的实验和消融研究,我们证明了我们的框架显著优于基线方法,即使在领域转移和真实电弧观察数据有限的情况下,也表现出对真实电弧事件的增强敏感性。
cs.CV / 188 / 2602.08794
MOVA: Towards Scalable and Synchronized Video-Audio Generation
MOVA:迈向可扩展和同步的视频-音频生成
OpenMOSS Team, Yu, Donghua, Chen, Mingshu, Chen, Qi, Luo, Qi, Wu, Qianyi, Cheng, Qinyuan, Li, Ruixiao, Liang, Tianyi, Zhang, Wenbo, Tu, Wenming, Peng, Xiangyu, Gao, Yang, Huo, Yanru, Zhu, Ying, Luo, Yinze, Zhang, Yiyang, Song, Yuerong, Xu, Zhe, Zhang, Zhiyu, Yang, Chenchen, Chang, Cheng, Zhou, Chushu, Chen, Hanfu, Ma, Hongnan, Li, Jiaxi, Tong, Jingqi, Liu, Junxi, Chen, Ke, Li, Shimin, Wang, Songlin, Jiang, Wei, Fei, Zhaoye, Ning, Zhiyuan, Li, Chunguo, Li, Chenhui, He, Ziwei, Huang, Zengfeng, Chen, Xie, Qiu, Xipeng
Abstract
Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture, with a total of 32B parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.
Chinese Translation
音频是现实世界视频中不可或缺的组成部分,但生成模型在很大程度上忽视了音频组件。目前生成视听内容的方法通常依赖于级联管道,这增加了成本,累积了错误,并降低了整体质量。尽管像Veo 3和Sora 2这样的系统强调同时生成的重要性,但联合多模态建模在架构、数据和训练方面引入了独特的挑战。此外,现有系统的闭源特性限制了该领域的进展。在本研究中,我们介绍了MOVA(MOSS视频和音频),这是一个开源模型,能够生成高质量、同步的视听内容,包括逼真的口型同步语音、环境感知音效和内容对齐的音乐。MOVA采用混合专家(Mixture-of-Experts, MoE)架构,总参数量为320亿,其中在推理过程中活跃的参数为180亿。它支持IT2VA(图像-文本到视频-音频)生成任务。通过发布模型权重和代码,我们旨在推动研究并促进创作者社区的蓬勃发展。发布的代码库提供了高效推理、LoRA微调和提示增强的全面支持。
cs.CV / 189 / 2602.08797
Addressing data annotation scarcity in Brain Tumor Segmentation on 3D MRI scan Using a Semi-Supervised Teacher-Student Framework
在3D MRI扫描中使用半监督教师-学生框架解决脑肿瘤分割的数据标注稀缺问题
Abstract
Accurate brain tumor segmentation from MRI is limited by expensive annotations and data heterogeneity across scanners and sites. We propose a semi-supervised teacher-student framework that combines an uncertainty-aware pseudo-labeling teacher with a progressive, confidence-based curriculum for the student. The teacher produces probabilistic masks and per-pixel uncertainty; unlabeled scans are ranked by image-level confidence and introduced in stages, while a dual-loss objective trains the student to learn from high-confidence regions and unlearn low-confidence ones. Agreement-based refinement further improves pseudo-label quality. On BraTS 2021, validation DSC increased from 0.393 (10% data) to 0.872 (100%), with the largest gains in early stages, demonstrating data efficiency. The teacher reached a validation DSC of 0.922, and the student surpassed the teacher on tumor subregions (e.g., NCR/NET 0.797 and Edema 0.980); notably, the student recovered the Enhancing class (DSC 0.620) where the teacher failed. These results show that confidence-driven curricula and selective unlearning provide robust segmentation under limited supervision and noisy pseudo-labels.
Chinese Translation
从MRI中准确分割脑肿瘤受到昂贵标注和扫描仪及站点之间数据异质性的限制。我们提出了一种半监督教师-学生框架,该框架结合了一个关注不确定性的伪标签教师和一个基于信心的渐进式课程供学生使用。教师生成概率掩膜和每像素不确定性;未标注的扫描根据图像级信心进行排名,并分阶段引入,同时双损失目标训练学生从高信心区域学习并从低信心区域中“遗忘”。基于一致性的精炼进一步提高了伪标签的质量。在BraTS 2021上,验证的DSC从0.393(10%数据)提高到0.872(100%),早期阶段的增益最大,展示了数据效率。教师的验证DSC达到了0.922,而学生在肿瘤子区域的表现超过了教师(例如,NCR/NET 0.797和水肿0.980);值得注意的是,学生恢复了教师未能识别的增强类(DSC 0.620)。这些结果表明,基于信心的课程和选择性遗忘在有限监督和噪声伪标签下提供了稳健的分割能力。
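The confidence-based curriculum and the dual-loss idea can both be sketched in a few lines: rank unlabeled scans by image-level confidence and release them in stages, and at the pixel level learn from high-confidence regions while unlearning low-confidence ones. The threshold, the squared-error form, and the sign-flip for unlearning are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def curriculum_batches(confidences, n_stages=2):
    """Rank unlabeled scans by image-level confidence and split them
    into curriculum stages, most confident first."""
    order = np.argsort(confidences)[::-1]
    return np.array_split(order, n_stages)

def dual_loss(pred, pseudo, pixel_conf, tau=0.7):
    """Dual objective sketch: squared error pulls predictions toward the
    pseudo-label where per-pixel confidence >= tau, and is negated
    (unlearning) where confidence is low."""
    sq = (pred - pseudo) ** 2
    return sq[pixel_conf >= tau].sum() - sq[pixel_conf < tau].sum()
```

Early stages then train only on the most trustworthy pseudo-labels, which matches the abstract's observation that the largest Dice gains come from the first curriculum stages.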
cs.CV / 190 / 2602.08820
Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing
Omni-Video 2:扩展 MLLM 条件扩散以实现统一的视频生成与编辑
Abstract
We present Omni-Video 2, a scalable and computationally efficient model that connects pretrained multimodal large-language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions to interpret user instructions. In this way, the rich contextual representations from the understanding model are directly used to guide the generative process, thereby improving performance on complex and compositional editing. Moreover, a lightweight adapter is developed to inject multimodal conditional tokens into pretrained text-to-video diffusion models, allowing maximum reuse of their powerful generative priors in a parameter-efficient manner. Benefiting from these designs, we scale up Omni-Video 2 to a 14B video diffusion model on meticulously curated, high-quality training data, supporting high-quality text-to-video generation and various video editing tasks such as object removal, addition, background change, complex motion editing, \emph{etc.} We evaluate the performance of Omni-Video 2 on the FiVE benchmark for fine-grained video editing and the VBench benchmark for text-to-video generation. The results demonstrate its superior ability to follow complex compositional instructions in video editing, while also achieving competitive or superior quality in video generation tasks.
Chinese Translation
我们提出了 Omni-Video 2,这是一种可扩展且计算高效的模型,将预训练的多模态大语言模型(MLLMs)与视频扩散模型连接起来,以实现统一的视频生成与编辑。我们的关键思想是利用 MLLMs 的理解和推理能力生成明确的目标字幕,以解释用户指令。通过这种方式,理解模型中丰富的上下文表示被直接用于指导生成过程,从而提高在复杂和组合编辑上的性能。此外,我们开发了一种轻量级适配器,将多模态条件标记注入到预训练的文本到视频扩散模型中,以最大限度地重用其强大的生成先验,且参数效率高。得益于这些设计,我们将 Omni-Video 2 扩展到一个 140 亿参数的视频扩散模型,基于精心策划的高质量训练数据,支持高质量的文本到视频生成以及各种视频编辑任务,如物体移除、添加、背景更改、复杂运动编辑等。我们在 FiVE 基准上评估了 Omni-Video 2 在细粒度视频编辑方面的性能,并在 VBench 基准上评估了文本到视频生成的性能。结果表明,它在视频编辑中遵循复杂组合指令的能力优越,同时在视频生成任务中也达到了具有竞争力或更优的质量。
cs.CV / 191 / 2602.08822
Any-to-All MRI Synthesis: A Unified Foundation Model for Nasopharyngeal Carcinoma and Its Downstream Applications
任意到全的MRI合成:一种针对鼻咽癌的统一基础模型及其下游应用
Abstract
Magnetic resonance imaging (MRI) is essential for nasopharyngeal carcinoma (NPC) radiotherapy (RT), but practical constraints, such as patient discomfort, long scan times, and high costs, often lead to incomplete modalities in clinical practice, compromising RT planning accuracy. Traditional MRI synthesis methods are modality-specific, limited in anatomical adaptability, and lack clinical interpretability, failing to meet NPC's RT needs. Here, we developed a unified foundation model integrating contrastive visual representation learning and vision-language alignment (VLA) to enable any-to-all MRI synthesis. The model uses a contrastive encoder for modality-invariant representations and a CLIP-based text-informed decoder for semantically consistent synthesis, supporting any-to-all MRI synthesis via one unified foundation model. Trained on 40,825 images from 13 institutions, it achieves consistently high performance (average SSIM 0.90, PSNR 27) across 26 internal/external validation sites (15,748 images), with superior synthesis fidelity and robustness to noise and domain shifts. Meanwhile, its unified representation enhances downstream RT-relevant tasks (e.g., segmentation). This work advances digital medicine solutions for NPC care by leveraging foundation models to bridge technical synthesis and clinical utility.
Chinese Translation
磁共振成像(MRI)在鼻咽癌(NPC)放射治疗(RT)中至关重要,但患者不适、扫描时间长和高成本等实际限制常导致临床实践中模态不完整,从而影响RT规划的准确性。传统的MRI合成方法是特定于模态的,缺乏解剖适应性和临床可解释性,无法满足NPC的RT需求。在此,我们开发了一种统一的基础模型,整合了对比视觉表示学习和视觉-语言对齐(VLA),以实现任意到全的MRI合成。该模型使用对比编码器生成模态不变的表示,并采用基于CLIP的文本信息解码器进行语义一致的合成,通过一个统一的基础模型支持任意到全的MRI合成。该模型在来自13个机构的40,825幅图像上进行训练,在26个内部/外部验证站点(15,748幅图像)上实现了一致的高性能(平均SSIM 0.90,PSNR 27),具有优越的合成保真度和对噪声及领域转移的鲁棒性。同时,其统一表示增强了下游与RT相关的任务(例如,分割)。本研究通过利用基础模型弥合技术合成与临床实用性,为NPC护理推进数字医学解决方案。
cs.CV / 192 / 2602.08828
VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning
VideoVeritas:通过感知前置强化学习检测AI生成视频
Abstract
The growing capability of video generation poses escalating security risks, making reliable detection increasingly essential. In this paper, we introduce VideoVeritas, a framework that integrates fine-grained perception and fact-based reasoning. We observe that while current multi-modal large language models (MLLMs) exhibit strong reasoning capacity, their granular perception ability remains limited. To mitigate this, we introduce Joint Preference Alignment and Perception Pretext Reinforcement Learning (PPRL). Specifically, rather than directly optimizing for detection task, we adopt general spatiotemporal grounding and self-supervised object counting in the RL stage, enhancing detection performance with simple perception pretext tasks. To facilitate robust evaluation, we further introduce MintVid, a light yet high-quality dataset containing 3K videos from 9 state-of-the-art generators, along with a real-world collected subset that has factual errors in content. Experimental results demonstrate that existing methods tend to bias towards either superficial reasoning or mechanical analysis, while VideoVeritas achieves more balanced performance across diverse benchmarks.
Chinese Translation
视频生成能力的不断提升带来了日益严重的安全风险,使得可靠的检测变得愈加重要。本文介绍了VideoVeritas,一个集成了细粒度感知和基于事实推理的框架。我们观察到,尽管当前的多模态大型语言模型(MLLMs)展现出强大的推理能力,但其细粒度感知能力仍然有限。为此,我们提出了联合偏好对齐和感知前置强化学习(PPRL)。具体而言,我们并非直接优化检测任务,而是在强化学习阶段采用一般的时空定位和自监督目标计数,通过简单的感知前置任务提升检测性能。为了促进稳健的评估,我们进一步引入了MintVid,一个轻量且高质量的数据集,包含来自9个最先进生成器的3000个视频,以及一个在内容上存在事实错误的真实世界收集子集。实验结果表明,现有方法往往偏向于肤浅的推理或机械分析,而VideoVeritas在多样化基准测试中实现了更平衡的性能。
cs.CV / 193 / 2602.08858
FlattenGPT: Depth Compression for Transformer with Layer Flattening
FlattenGPT:具有层扁平化的变换器深度压缩
Abstract
Recent works have indicated redundancy across transformer blocks, prompting research on depth compression to prune less crucial blocks. However, current entire-block pruning methods risk discarding meaningful cues learned in those blocks, leading to substantial performance degradation. As another line of model compression, channel pruning can better preserve performance, while it cannot reduce model depth and is challenged by inconsistent pruning ratios for individual layers. To pursue better model compression and acceleration, this paper proposes \textbf{FlattenGPT}, a novel way to detect and reduce depth-wise redundancies. By flattening two adjacent blocks into one, it compresses the network depth, meanwhile enabling more effective parameter redundancy detection and removal. FlattenGPT preserves the knowledge learned in all blocks, and remains consistent with the original transformer architecture. Extensive experiments demonstrate that FlattenGPT enhances model efficiency with a decent trade-off in performance. It outperforms existing pruning methods in both zero-shot accuracies and WikiText-2 perplexity across various model types and parameter sizes. On LLaMA-2/3 and Qwen-1.5 models, FlattenGPT retains 90-96\% of zero-shot performance with a compression ratio of 20\%. It also outperforms other pruning methods in accelerating LLM inference, making it promising for enhancing the efficiency of transformers.
Chinese Translation
近期的研究表明变换器块之间存在冗余,这促使了深度压缩的研究,以修剪不太重要的块。然而,当前的整块修剪方法存在丢弃那些块中学习到的有意义线索的风险,导致性能显著下降。作为另一种模型压缩方法,通道修剪能够更好地保持性能,但无法减少模型深度,并且在各层之间的修剪比例不一致方面面临挑战。为了追求更好的模型压缩和加速,本文提出了FlattenGPT,这是一种检测和减少深度冗余的新方法。通过将两个相邻的块扁平化为一个,它压缩了网络深度,同时实现了更有效的参数冗余检测和去除。FlattenGPT能够保留所有块中学习到的知识,并与原始变换器架构保持一致。大量实验表明,FlattenGPT在性能与效率之间实现了良好的权衡。它在各种模型类型和参数大小上,在零样本准确率和WikiText-2困惑度方面均优于现有的修剪方法。在LLaMA-2/3和Qwen-1.5模型上,FlattenGPT在20%的压缩比下保留了90-96%的零样本性能。它在加速大型语言模型(LLM)推理方面也优于其他修剪方法,显示出提升变换器效率的良好前景。
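The intuition behind flattening two adjacent layers into one is easiest to see in the purely linear case, where the collapse is exact. This toy is only an analogue: real transformer blocks contain attention and nonlinearities, which is what makes the paper's flattening and subsequent redundancy removal nontrivial:

```python
import numpy as np

def flatten_linear_pair(w1, w2):
    """Collapse two stacked linear layers into one: (w2 @ w1) x equals
    w2 (w1 x) for every input x. Depth drops from 2 to 1 with no loss."""
    return w2 @ w1

# Two-layer toy network and its flattened single-layer equivalent.
w1 = np.array([[1.0, 0.0], [1.0, 1.0]])
w2 = np.array([[2.0, 0.0], [0.0, 1.0]])
w_flat = flatten_linear_pair(w1, w2)
```

After flattening, the merged weight matrix is larger than either original, which is where the abstract's "more effective parameter redundancy detection and removal" comes in.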
cs.CV / 194 / 2602.08861
TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models
TiFRe:基于文本引导的视频帧减少方法,用于高效的视频多模态大语言模型
Abstract
With the rapid development of Large Language Models (LLMs), Video Multi-Modal Large Language Models (Video MLLMs) have achieved remarkable performance in video-language tasks such as video understanding and question answering. However, Video MLLMs face high computational costs, particularly in processing numerous video frames as input, which leads to significant attention computation overhead. A straightforward approach to reduce computational costs is to decrease the number of input video frames. However, simply selecting key frames at a fixed frame rate (FPS) often overlooks valuable information in non-key frames, resulting in notable performance degradation. To address this, we propose Text-guided Video Frame Reduction (TiFRe), a framework that reduces input frames while preserving essential video information. TiFRe uses a Text-guided Frame Sampling (TFS) strategy to select key frames based on user input, which is processed by an LLM to generate a CLIP-style prompt. Pre-trained CLIP encoders calculate the semantic similarity between the prompt and each frame, selecting the most relevant frames as key frames. To preserve video semantics, TiFRe employs a Frame Matching and Merging (FMM) mechanism, which integrates non-key frame information into the selected key frames, minimizing information loss. Experiments show that TiFRe effectively reduces computational costs while improving performance on video-language tasks.
Chinese Translation
随着大语言模型(LLMs)的快速发展,视频多模态大语言模型(Video MLLMs)在视频理解和问答等视频语言任务中取得了显著的表现。然而,Video MLLMs面临着高计算成本,尤其是在处理大量视频帧作为输入时,这导致了显著的注意力计算开销。降低计算成本的一种直接方法是减少输入视频帧的数量。然而,简单地以固定帧率(FPS)选择关键帧往往忽视了非关键帧中的有价值信息,从而导致显著的性能下降。为了解决这个问题,我们提出了基于文本引导的视频帧减少框架(TiFRe),该框架在保留重要视频信息的同时减少输入帧。TiFRe采用文本引导帧采样(TFS)策略,根据用户输入选择关键帧,该输入经过LLM处理以生成CLIP风格的提示。预训练的CLIP编码器计算提示与每个帧之间的语义相似度,选择与之最相关的帧作为关键帧。为了保留视频语义,TiFRe采用帧匹配与合并(FMM)机制,将非关键帧信息整合到所选的关键帧中,最小化信息损失。实验表明,TiFRe有效降低了计算成本,同时提高了视频语言任务的性能。
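The key-frame selection step in TFS can be sketched with cosine similarity between a CLIP-style prompt embedding and per-frame embeddings; a minimal sketch under assumed embeddings (random vectors stand in for real CLIP encoder outputs):

```python
import numpy as np

def select_key_frames(prompt_emb, frame_embs, k):
    """Pick the k frames whose embeddings are most similar to the prompt."""
    p = prompt_emb / np.linalg.norm(prompt_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sims = f @ p                            # cosine similarity per frame
    return np.sort(np.argsort(sims)[-k:])   # top-k, kept in temporal order

# toy example: 6 frames in a 4-d embedding space
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 4))
prompt = frames[2] + 0.1 * rng.normal(size=4)  # prompt resembles frame 2
keys = select_key_frames(prompt, frames, k=2)
```

The FMM step would then fold each non-key frame's information into its most similar selected key frame; the merge rule itself is not specified in the abstract.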
cs.CV / 195 / 2602.08909
Analysis of Converged 3D Gaussian Splatting Solutions: Density Effects and Prediction Limit
收敛的三维高斯点云解决方案分析:密度效应与预测限制
Abstract
We investigate what structure emerges in 3D Gaussian Splatting (3DGS) solutions from standard multi-view optimization. We term these Rendering-Optimal References (RORs) and analyze their statistical properties, revealing stable patterns: mixture-structured scales and bimodal radiance across diverse scenes. To understand what determines these parameters, we apply learnability probes by training predictors to reconstruct RORs from point clouds without rendering supervision. Our analysis uncovers a fundamental density stratification. Dense regions exhibit geometry-correlated parameters amenable to render-free prediction, while sparse regions show systematic failure across architectures. We formalize this through variance decomposition, demonstrating that visibility heterogeneity creates covariance-dominated coupling between geometric and appearance parameters in sparse regions. This reveals the dual character of RORs: geometric primitives where point clouds suffice, and view synthesis primitives where multi-view constraints are essential. We provide density-aware strategies that improve training robustness and discuss architectural implications for systems that adaptively balance feed-forward prediction and rendering-based refinement.
Chinese Translation
我们研究了标准多视角优化中三维高斯点云(3D Gaussian Splatting, 3DGS)解决方案中出现的结构。我们将其称为渲染最优参考(Rendering-Optimal References, RORs),并分析其统计特性,揭示出稳定的模式:在不同场景中呈现混合结构的尺度和双峰辐射。为了理解这些参数的决定因素,我们通过训练预测器从点云重建RORs,应用学习能力探测,而不依赖于渲染监督。我们的分析揭示了基本的密度分层。密集区域表现出与几何相关的参数,适合于无渲染预测,而稀疏区域在不同架构中则显示出系统性的失败。我们通过方差分解形式化这一点,证明了可见性异质性在稀疏区域中造成几何参数与外观参数之间的协方差主导耦合。这揭示了RORs的双重特性:在点云足够的情况下的几何原语,以及在多视角约束下必不可少的视图合成原语。我们提供了密度感知策略,以提高训练的鲁棒性,并讨论了在自适应平衡前馈预测与基于渲染的细化的系统中的架构影响。
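The abstract does not spell out the decomposition, but its claim can be read through the standard law of total covariance; conditioning geometric parameters $g$ and appearance parameters $a$ on a visibility variable $v$ (symbols here are assumptions, not the paper's notation):

```latex
\operatorname{Cov}(g, a)
  = \underbrace{\mathbb{E}_v\!\left[\operatorname{Cov}(g, a \mid v)\right]}_{\text{within-group}}
  + \underbrace{\operatorname{Cov}_v\!\left(\mathbb{E}[g \mid v],\, \mathbb{E}[a \mid v]\right)}_{\text{between-group}}
```

Heterogeneous visibility inflates the between-group term, which is consistent with the covariance-dominated coupling the paper reports for sparse regions.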
cs.CV / 196 / 2602.08958
Grow with the Flow: 4D Reconstruction of Growing Plants with Gaussian Flow Fields
随流而长:使用高斯流场进行生长植物的4D重建
Abstract
Modeling the time-varying 3D appearance of plants during their growth poses unique challenges: unlike many dynamic scenes, plants generate new geometry over time as they expand, branch, and differentiate. Recent motion modeling techniques are ill-suited to this problem setting. For example, deformation fields cannot introduce new geometry, and 4D Gaussian splatting constrains motion to a linear trajectory in space and time and cannot track the same set of Gaussians over time. Here, we introduce a 3D Gaussian flow field representation that models plant growth as a time-varying derivative over Gaussian parameters -- position, scale, orientation, color, and opacity -- enabling nonlinear and continuous-time growth dynamics. To initialize a sufficient set of Gaussian primitives, we reconstruct the mature plant and learn a process of reverse growth, effectively simulating the plant's developmental history in reverse. Our approach achieves superior image quality and geometric accuracy compared to prior methods on multi-view timelapse datasets of plant growth, providing a new approach for appearance modeling of growing 3D structures.
Chinese Translation
在植物生长过程中建模其随时间变化的3D外观面临独特的挑战:与许多动态场景不同,植物在扩展、分枝和分化的过程中会不断生成新的几何形状。近期的运动建模技术并不适合这一问题设置。例如,形变场无法引入新的几何形状,而4D高斯点云(Gaussian splatting)将运动限制在空间和时间的线性轨迹上,无法随着时间跟踪同一组高斯体。在此,我们提出了一种3D高斯流场表示法,将植物生长建模为高斯参数(位置、尺度、方向、颜色和不透明度)上的时间变化导数,从而实现非线性和连续时间的生长动态。为了初始化足够的高斯原始体,我们重建了成熟植物并学习反向生长的过程,有效地模拟植物的发育历史。与之前的方法相比,我们的方法在植物生长的多视角时间推移数据集上实现了更优的图像质量和几何准确性,为生长3D结构的外观建模提供了一种新方法。
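Modeling growth as a time-varying derivative over Gaussian parameters amounts to integrating an ODE per Gaussian. A minimal sketch with explicit Euler integration and a hypothetical hand-written derivative (the paper learns this field; the parameter layout and dynamics below are assumptions):

```python
import numpy as np

def grow(params, deriv, t0, t1, steps):
    """Integrate d(params)/dt = deriv(params, t) with explicit Euler.
    `params` stacks per-Gaussian position/scale/opacity in one array."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        params = params + dt * deriv(params, t)
        t += dt
    return params

# hypothetical derivative: z-position drifts upward, opacity fades in
def toy_deriv(p, t):
    d = np.zeros_like(p)
    d[:, 2] = 0.5               # growth along z
    d[:, -1] = 1.0 - p[:, -1]   # opacity saturates toward 1
    return d

g0 = np.zeros((4, 5))           # 4 Gaussians: x, y, z, scale, opacity
g1 = grow(g0, toy_deriv, 0.0, 1.0, 100)
```

Running the reverse of such a flow from the mature reconstruction is what lets the method "simulate the plant's developmental history in reverse."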
cs.CV / 197 / 2602.08961
MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
MotionCrafter:基于4D变分自编码器的稠密几何和运动重建
Abstract
We introduce MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. The core of our method is a novel joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, and a novel 4D VAE to effectively learn this representation. Unlike prior work that forces the 3D values and latents to align strictly with RGB VAE latents (despite their fundamentally different distributions), we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, delivering 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, all without any post-optimization. Project page: https://ruijiezhu94.github.io/MotionCrafter_Page
Chinese Translation
我们提出了MotionCrafter,这是一个基于视频扩散的框架,能够从单目视频中联合重建4D几何和估计稠密运动。我们方法的核心是稠密3D点图和3D场景流在共享坐标系统中的新型联合表示,以及一个新颖的4D变分自编码器(4D VAE),以有效学习这一表示。与之前的工作强制要求3D值和潜变量严格与RGB变分自编码器潜变量对齐(尽管它们的分布根本不同)不同,我们表明这种对齐是不必要的,并且会导致次优性能。相反,我们引入了一种新的数据归一化和变分自编码器训练策略,更好地传递扩散先验,并大大提高重建质量。在多个数据集上的广泛实验表明,MotionCrafter在几何重建和稠密场景流估计方面均实现了最先进的性能,分别在几何和运动重建中提高了38.64%和25.0%,且无需任何后期优化。项目页面:https://ruijiezhu94.github.io/MotionCrafter_Page
cs.CV / 198 / 2602.08962
Modeling 3D Pedestrian-Vehicle Interactions for Vehicle-Conditioned Pose Forecasting
建模3D行人-车辆交互以进行车辆条件下的姿态预测
Abstract
Accurately predicting pedestrian motion is crucial for safe and reliable autonomous driving in complex urban environments. In this work, we present a 3D vehicle-conditioned pedestrian pose forecasting framework that explicitly incorporates surrounding vehicle information. To support this, we enhance the Waymo-3DSkelMo dataset with aligned 3D vehicle bounding boxes, enabling realistic modeling of multi-agent pedestrian-vehicle interactions. We introduce a sampling scheme to categorize scenes by pedestrian and vehicle count, facilitating training across varying interaction complexities. Our proposed network adapts the TBIFormer architecture with a dedicated vehicle encoder and pedestrian-vehicle interaction cross-attention module to fuse pedestrian and vehicle features, allowing predictions to be conditioned on both historical pedestrian motion and surrounding vehicles. Extensive experiments demonstrate substantial improvements in forecasting accuracy and validate different approaches for modeling pedestrian-vehicle interactions, highlighting the importance of vehicle-aware 3D pose prediction for autonomous driving. Code is available at: https://github.com/GuangxunZhu/VehCondPose3D
Chinese Translation
准确预测行人运动对于在复杂城市环境中实现安全可靠的自动驾驶至关重要。在本研究中,我们提出了一种3D车辆条件下的行人姿态预测框架,该框架明确地结合了周围车辆的信息。为此,我们通过对齐的3D车辆边界框增强了Waymo-3DSkelMo数据集,从而实现了多智能体行人-车辆交互的真实建模。我们引入了一种采样方案,根据行人和车辆的数量对场景进行分类,从而促进在不同交互复杂性下的训练。我们提出的网络采用了TBIFormer架构,并配备了专门的车辆编码器和行人-车辆交互交叉注意力模块,以融合行人和车辆特征,使得预测能够同时基于历史行人运动和周围车辆进行条件化。大量实验表明,预测准确性有显著提高,并验证了不同的行人-车辆交互建模方法,强调了车辆感知的3D姿态预测在自动驾驶中的重要性。代码可在以下链接获取:https://github.com/GuangxunZhu/VehCondPose3D
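The pedestrian-vehicle cross-attention described above follows the usual pattern of queries from one modality attending over keys/values from the other. A minimal NumPy sketch (dimensions and the residual fusion are illustrative assumptions, not the paper's exact module):

```python
import numpy as np

def cross_attention(ped_feats, veh_feats, Wq, Wk, Wv):
    """Pedestrian tokens (queries) attend over vehicle tokens (keys/values)."""
    Q, K, V = ped_feats @ Wq, veh_feats @ Wk, veh_feats @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # scaled dot-product
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)          # softmax over vehicles
    return ped_feats + attn @ V                       # residual fusion

rng = np.random.default_rng(1)
d = 8
peds, vehs = rng.normal(size=(3, d)), rng.normal(size=(2, d))
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
fused = cross_attention(peds, vehs, *W)
```

Each pedestrian token thus carries a vehicle-conditioned summary into the subsequent pose-forecasting layers.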
cs.CV / 199 / 2602.08971
WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models
WorldArena:评估具身世界模型感知与功能效用的统一基准
Abstract
While world models have emerged as a cornerstone of embodied intelligence by enabling agents to reason about environmental dynamics through action-conditioned prediction, their evaluation remains fragmented. Current evaluation of embodied world models has largely focused on perceptual fidelity (e.g., video generation quality), overlooking the functional utility of these models in downstream decision-making tasks. In this work, we introduce WorldArena, a unified benchmark designed to systematically evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through three dimensions: video perception quality, measured with 16 metrics across six sub-dimensions; and embodied task functionality, which evaluates world models as data engines, policy evaluators, and action planners, integrated with subjective human evaluation. Furthermore, we propose EWMScore, a holistic metric integrating multi-dimensional performance into a single interpretable index. Through extensive experiments on 14 representative models, we reveal a significant perception-functionality gap, showing that high visual quality does not necessarily translate into strong embodied task capability. The WorldArena benchmark, with a public leaderboard, is released at https://worldarena.ai, providing a framework for tracking progress toward truly functional world models in embodied AI.
Chinese Translation
尽管世界模型已成为具身智能的基石,使得智能体能够通过基于行动的预测推理环境动态,但其评估仍然存在碎片化的问题。目前对具身世界模型的评估主要集中在感知保真度(例如,视频生成质量),而忽视了这些模型在下游决策任务中的功能效用。在本研究中,我们引入了WorldArena,一个旨在系统评估具身世界模型在感知和功能维度上表现的统一基准。WorldArena通过三个维度评估模型:视频感知质量,使用16个指标在六个子维度上进行测量;具身任务功能性,将世界模型视为数据引擎、策略评估器和行动规划者,并结合主观人类评估。此外,我们提出了EWMScore,一个将多维性能整合为单一可解释指标的整体度量。通过对14个代表性模型的广泛实验,我们揭示了感知与功能之间的显著差距,显示出高视觉质量并不一定转化为强大的具身任务能力。WorldArena基准及其公开排行榜已发布于https://worldarena.ai,为追踪具身人工智能中真正功能性世界模型的进展提供了框架。
cs.CV / 200 / 2602.08996
Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study
通过观看比赛和阅读书籍来推广体育反馈生成:以攀岩为案例研究
Abstract
While there is rapid progress in video-LLMs with advanced reasoning capabilities, prior work shows that these models struggle on the challenging task of sports feedback generation and require expensive and difficult-to-collect finetuning feedback data for each sport. This limitation is evident from the poor generalization to sports unseen during finetuning. Furthermore, traditional text generation evaluation metrics (e.g., BLEU-4, METEOR, ROUGE-L, BERTScore), originally developed for machine translation and summarization, fail to capture the unique aspects of sports feedback quality. To address the first problem, using rock climbing as our case study, we propose using auxiliary freely-available web data from the target domain, such as competition videos and coaching manuals, in addition to existing sports feedback from a disjoint, source domain to improve sports feedback generation performance on the target domain. To improve evaluation, we propose two evaluation metrics: (1) specificity and (2) actionability. Together, our approach enables more meaningful and practical generation of sports feedback under limited annotations.
Chinese Translation
尽管视频大语言模型(video-LLMs)在推理能力方面取得了快速进展,但先前的研究表明,这些模型在体育反馈生成这一具有挑战性的任务上表现不佳,并且需要昂贵且难以收集的针对每项运动的微调反馈数据。这一局限性在于对微调期间未见过的运动的泛化能力较差。此外,传统的文本生成评估指标(如 BLEU-4、METEOR、ROUGE-L、BERTScore)最初是为机器翻译和摘要生成而开发的,无法捕捉体育反馈质量的独特方面。为了解决第一个问题,我们以攀岩为案例研究,提出在目标领域中使用辅助的、可自由获取的网络数据,如比赛视频和教练手册,除了来自不相交源领域的现有体育反馈,以提高目标领域的体育反馈生成性能。为了改善评估,我们提出了两个评估指标:(1)特异性和(2)可操作性。综合来看,我们的方法在有限的标注下实现了更有意义和实用的体育反馈生成。
cs.CV / 201 / 2602.09014
ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation
ArcFlow:通过高精度非线性流蒸馏释放2步文本到图像生成
Abstract
Diffusion models have achieved remarkable generation quality, but they suffer from significant inference cost due to their reliance on multiple sequential denoising steps, motivating recent efforts to distill this inference process into a few-step regime. However, existing distillation methods typically approximate the teacher trajectory by using linear shortcuts, which makes it difficult to match its constantly changing tangent directions as velocities evolve across timesteps, thereby leading to quality degradation. To address this limitation, we propose ArcFlow, a few-step distillation framework that explicitly employs non-linear flow trajectories to approximate pre-trained teacher trajectories. Concretely, ArcFlow parameterizes the velocity field underlying the inference trajectory as a mixture of continuous momentum processes. This enables ArcFlow to capture velocity evolution and extrapolate coherent velocities to form a continuous non-linear trajectory within each denoising step. Importantly, this parameterization admits an analytical integration of this non-linear trajectory, which circumvents numerical discretization errors and results in high-precision approximation of the teacher trajectory. To train this parameterization into a few-step generator, we implement ArcFlow via trajectory distillation on pre-trained teacher models using lightweight adapters. This strategy ensures fast, stable convergence while preserving generative diversity and quality. Built on large-scale models (Qwen-Image-20B and FLUX.1-dev), ArcFlow only fine-tunes on less than 5% of original parameters and achieves a 40x speedup with 2 NFEs over the original multi-step teachers without significant quality degradation. Experiments on benchmarks show the effectiveness of ArcFlow both qualitatively and quantitatively.
Chinese Translation
扩散模型在生成质量上取得了显著的进展,但由于依赖多个顺序去噪步骤,它们在推理时面临着显著的成本,这促使了近期将这一推理过程蒸馏为少步方案的努力。然而,现有的蒸馏方法通常通过使用线性捷径来近似教师轨迹,这使得在时间步长演变过程中难以匹配其不断变化的切线方向,从而导致质量下降。为了解决这一局限性,我们提出了ArcFlow,一种明确采用非线性流轨迹来近似预训练教师轨迹的少步蒸馏框架。具体而言,ArcFlow将推理轨迹下的速度场参数化为连续动量过程的混合。这使得ArcFlow能够捕捉速度演变,并外推一致的速度,以在每个去噪步骤内形成连续的非线性轨迹。重要的是,这种参数化允许对该非线性轨迹进行解析积分,从而避免数值离散化误差,并实现对教师轨迹的高精度近似。为了将这种参数化训练成少步生成器,我们通过在预训练教师模型上使用轻量级适配器实施ArcFlow的轨迹蒸馏。这一策略确保了快速、稳定的收敛,同时保持生成的多样性和质量。基于大规模模型(Qwen-Image-20B和FLUX.1-dev),ArcFlow仅对不到5%的原始参数进行微调,并在不显著降低质量的情况下实现了相较于原始多步教师的40倍加速,使用2次函数评估(NFEs)。基准测试的实验结果显示了ArcFlow在定性和定量上的有效性。
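The abstract states that the velocity field is a mixture of continuous momentum processes whose trajectory integrates analytically; the explicit parameterization is not given, but one family with this property is exponentially relaxing velocity components (the symbols $w_i$, $v_i$, $\lambda_i$ below are assumptions, not taken from the paper):

```latex
v(t) = \sum_i w_i\, v_i\, e^{-\lambda_i (t - t_s)}
```

```latex
x(t_e) = x(t_s) + \int_{t_s}^{t_e} v(t)\, dt
       = x(t_s) + \sum_i w_i\, v_i\, \frac{1 - e^{-\lambda_i (t_e - t_s)}}{\lambda_i}
```

Because the integral is closed-form, the non-linear trajectory within each denoising step can be evaluated exactly, which is how such a parameterization circumvents numerical discretization error.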
cs.CV / 202 / 2602.09016
Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction
Raster2Seq:用于平面图重建的多边形序列生成
Abstract
Reconstructing a structured vector-graphics representation from a rasterized floorplan image is typically an important prerequisite for computational tasks involving floorplans, such as automated understanding or CAD workflows. However, existing techniques struggle to faithfully generate the structure and semantics conveyed by complex floorplans that depict large indoor spaces with many rooms and varying numbers of polygon corners. To this end, we propose Raster2Seq, framing floorplan reconstruction as a sequence-to-sequence task in which floorplan elements, such as rooms, windows, and doors, are represented as labeled polygon sequences that jointly encode geometry and semantics. Our approach introduces an autoregressive decoder that learns to predict the next corner conditioned on image features and previously generated corners, using guidance from learnable anchors. These anchors represent spatial coordinates in image space, allowing the attention mechanism to be effectively directed toward informative image regions. By embracing the autoregressive mechanism, our method offers flexibility in the output format, enabling efficient handling of complex floorplans with numerous rooms and diverse polygon structures. Our method achieves state-of-the-art performance on standard benchmarks such as Structure3D, CubiCasa5K, and Raster2Graph, while also demonstrating strong generalization to more challenging datasets like WAFFLE, which contain diverse room structures and complex geometric variations.
Chinese Translation
从光栅化的平面图像重建结构化的矢量图形表示通常是涉及平面图的计算任务(如自动理解或计算机辅助设计工作流程)的重要前提。然而,现有技术在忠实生成复杂平面图所传达的结构和语义方面存在困难,这些平面图描绘了大型室内空间,包含多个房间和不同数量的多边形角点。为此,我们提出了Raster2Seq,将平面图重建框架设定为序列到序列的任务,其中平面图元素(如房间、窗户和门)被表示为标记的多边形序列,联合编码几何和语义。我们的方法引入了一种自回归解码器,该解码器学习在图像特征和先前生成的角点的条件下预测下一个角点,并使用可学习的锚点进行指导。这些锚点表示图像空间中的空间坐标,从而有效地引导注意力机制聚焦于信息丰富的图像区域。通过采用自回归机制,我们的方法在输出格式上提供了灵活性,能够高效处理包含众多房间和多样化多边形结构的复杂平面图。我们的方法在标准基准测试(如Structure3D、CubiCasa5K和Raster2Graph)上实现了最先进的性能,同时在更具挑战性的数据集(如WAFFLE)上也展示了强大的泛化能力,这些数据集包含多样的房间结构和复杂的几何变换。
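Representing floorplan elements as labeled polygon sequences can be sketched as a simple round-trip serialization; this is an illustrative encoding, not the paper's actual tokenizer:

```python
# Encode labeled polygons as flat sequences: [label, x1, y1, ..., xn, yn, <end>].
END = "<end>"

def encode(elements):
    seq = []
    for label, corners in elements:
        seq.append(label)
        for x, y in corners:
            seq.extend([x, y])
        seq.append(END)                 # variable corner counts per element
    return seq

def decode(seq):
    elements, i = [], 0
    while i < len(seq):
        label, i = seq[i], i + 1
        corners = []
        while seq[i] != END:
            corners.append((seq[i], seq[i + 1]))
            i += 2
        elements.append((label, corners))
        i += 1                          # skip <end>
    return elements

floorplan = [("room", [(0, 0), (4, 0), (4, 3), (0, 3)]),
             ("door", [(4, 1), (4, 2)])]
assert decode(encode(floorplan)) == floorplan
```

An autoregressive decoder emitting such a sequence naturally handles elements with different corner counts, which is the flexibility the abstract refers to.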
cs.CV / 203 / 2602.09022
WorldCompass: Reinforcement Learning for Long-Horizon World Models
WorldCompass:用于长时间跨度世界模型的强化学习
Abstract
This work presents WorldCompass, a novel Reinforcement Learning (RL) post-training framework for long-horizon, interactive, video-based world models, enabling them to explore the world more accurately and consistently based on interaction signals. To effectively "steer" the world model's exploration, we introduce three core innovations tailored to the autoregressive video generation paradigm: 1) Clip-Level Rollout Strategy: We generate and evaluate multiple samples for a single target clip, which significantly boosts rollout efficiency and provides fine-grained reward signals. 2) Complementary Reward Functions: We design reward functions for both interaction-following accuracy and visual quality, which provide direct supervision and effectively suppress reward-hacking behaviors. 3) Efficient RL Algorithm: We employ the negative-aware fine-tuning strategy coupled with various efficiency optimizations to efficiently and effectively enhance model capacity. Evaluations on the SoTA open-source world model, WorldPlay, demonstrate that WorldCompass significantly improves interaction accuracy and visual fidelity across various scenarios.
Chinese Translation
本研究提出了WorldCompass,这是一种新颖的强化学习(RL)后训练框架,旨在用于长时间跨度的互动视频基础世界模型,使其能够基于互动信号更准确和一致地探索世界。为了有效地“引导”世界模型的探索,我们引入了三项核心创新,专门针对自回归视频生成范式:1)剪辑级回放策略:我们在单个目标剪辑上生成和评估多个样本,这显著提高了回放效率,并提供了细粒度的奖励信号。2)互补奖励函数:我们设计了针对互动跟随准确性和视觉质量的奖励函数,提供直接监督并有效抑制奖励操控行为。3)高效的强化学习算法:我们采用了负感知微调策略,并结合各种效率优化,来高效且有效地增强模型能力。在SoTA开源世界模型WorldPlay上的评估表明,WorldCompass在各种场景中显著提高了互动准确性和视觉保真度。
cs.CV / 204 / 2602.09024
Autoregressive Image Generation with Masked Bit Modeling
基于掩码位建模的自回归图像生成
Abstract
This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continuous methods. Contrary to the belief that discrete tokenizers are intrinsically inferior, we demonstrate that the disparity arises primarily from the total number of bits allocated in the latent space (i.e., the compression ratio). We show that scaling up the codebook size effectively bridges this gap, allowing discrete tokenizers to match or surpass their continuous counterparts. However, existing discrete generation methods struggle to capitalize on this insight, suffering from performance degradation or prohibitive training costs with scaled codebook. To address this, we propose masked Bit AutoRegressive modeling (BAR), a scalable framework that supports arbitrary codebook sizes. By equipping an autoregressive transformer with a masked bit modeling head, BAR predicts discrete tokens through progressively generating their constituent bits. BAR achieves a new state-of-the-art gFID of 0.99 on ImageNet-256, outperforming leading methods across both continuous and discrete paradigms, while significantly reducing sampling costs and converging faster than prior continuous approaches. Project page is available at https://bar-gen.github.io/
Chinese Translation
本文挑战了连续生成管道在视觉生成中的主导地位。我们系统地研究了离散方法与连续方法之间的性能差距。与认为离散标记器天生劣于连续标记器的观点相反,我们证明了这种差距主要源于潜在空间中分配的总位数(即压缩比)。我们展示了扩大代码本大小有效地弥补了这一差距,使得离散标记器能够与其连续对应物相匹配或超越。然而,现有的离散生成方法在利用这一见解时面临困难,随着代码本的扩展,性能下降或训练成本过高。为了解决这个问题,我们提出了掩码位自回归建模(Masked Bit AutoRegressive modeling,BAR),这是一个支持任意代码本大小的可扩展框架。通过为自回归变换器配备掩码位建模头,BAR通过逐步生成其组成位来预测离散标记。BAR在ImageNet-256上实现了0.99的最先进(state-of-the-art)gFID,超越了连续和离散范式中的领先方法,同时显著降低了采样成本,并比之前的连续方法更快收敛。项目页面可访问 https://bar-gen.github.io/
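The core arithmetic behind bit-level token prediction is just the binary decomposition of codebook indices: a codebook of size K = 2**n needs only n bit predictions per token, so the codebook can scale exponentially without an output layer of width K. A minimal sketch:

```python
# Decompose a codebook index into its constituent bits and back.
def to_bits(index, n_bits):
    """Most-significant bit first."""
    return [(index >> b) & 1 for b in reversed(range(n_bits))]

def from_bits(bits):
    v = 0
    for b in bits:
        v = (v << 1) | b
    return v

K = 2 ** 16   # a 65,536-entry codebook needs only 16 bit predictions per token
assert from_bits(to_bits(40_000, 16)) == 40_000
```

Progressively unmasking these bits, rather than emitting the index in one shot, is the "masked bit modeling" step the abstract describes.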
cs.AI / 1 / 2602.07032
LLM-FSM: Scaling Large Language Models for Finite-State Reasoning in RTL Code Generation
LLM-FSM:在RTL代码生成中扩展大型语言模型以进行有限状态推理
Abstract
Finite-state reasoning, the ability to understand and implement state-dependent behavior, is central to hardware design. In this paper, we present LLM-FSM, a benchmark that evaluates how well large language models (LLMs) can recover finite-state machine (FSM) behavior from natural-language specifications and translate it into correct register transfer-level (RTL) implementations. Unlike prior specification-to-RTL benchmarks that rely on manually constructed examples, LLM-FSM is built through a fully automated pipeline. LLM-FSM first constructs FSMs with configurable state counts and constrained transition structures. It then prompts LLMs to express each FSM in a structured YAML format with an application context, and to further convert that YAML into a natural-language (NL) specification. From the same YAML, our pipeline synthesizes the reference RTL and testbench in a correct-by-construction manner. All 1,000 problems are verified using LLM-based and SAT-solver-based checks, with human review on a subset. Our experiments show that even the strongest LLMs exhibit sharply declining accuracy as FSM complexity increases. We further demonstrate that training-time scaling via supervised fine-tuning (SFT) generalizes effectively to out-of-distribution (OOD) tasks, while increasing test-time compute improves reasoning reliability. Finally, LLM-FSM remains extensible by allowing its FSM complexity to scale with future model capabilities.
Chinese Translation
有限状态推理,即理解和实现状态依赖行为的能力,是硬件设计的核心。在本文中,我们提出了LLM-FSM,一个基准测试,用于评估大型语言模型(LLMs)从自然语言规范中恢复有限状态机(FSM)行为并将其转换为正确的寄存器传输级(RTL)实现的能力。与依赖手动构建示例的先前规范到RTL基准不同,LLM-FSM是通过完全自动化的流程构建的。LLM-FSM首先构建具有可配置状态数量和受限转换结构的FSM。然后,它提示LLMs以结构化的YAML格式表达每个FSM,并提供应用上下文,进一步将该YAML转换为自然语言(NL)规范。我们的流程从相同的YAML合成参考RTL和测试平台,以构造正确的方式进行。所有1000个问题都通过基于LLM和SAT求解器的检查进行验证,并对一个子集进行了人工审查。我们的实验表明,即使是最强大的LLMs,在FSM复杂性增加时准确性也急剧下降。我们进一步证明,通过监督微调(SFT)进行的训练时间扩展有效地推广到分布外(OOD)任务,同时增加测试时间计算提高了推理的可靠性。最后,LLM-FSM通过允许其FSM复杂性与未来模型能力扩展,保持了可扩展性。
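The structured FSM representation at the center of the pipeline can be sketched as a transition table plus a step function; the field names below are illustrative, not the benchmark's actual YAML schema:

```python
# Hedged sketch of a structured FSM spec and its simulation.
fsm = {
    "initial": "IDLE",
    "transitions": {
        ("IDLE", "start"): "RUN",
        ("RUN", "stop"): "IDLE",
        ("RUN", "error"): "HALT",
        ("HALT", "reset"): "IDLE",
    },
}

def run_fsm(fsm, inputs):
    state, trace = fsm["initial"], []
    for sym in inputs:
        # unmatched (state, input) pairs hold the current state
        state = fsm["transitions"].get((state, sym), state)
        trace.append(state)
    return trace

assert run_fsm(fsm, ["start", "error", "reset"]) == ["RUN", "HALT", "IDLE"]
```

A reference simulator like this, derived directly from the spec, is what makes correct-by-construction RTL and testbench synthesis possible from the same source.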
cs.AI / 2 / 2602.07034
ST-Raptor: An Agentic System for Semi-Structured Table QA
ST-Raptor:一种用于半结构化表格问答的智能系统
Abstract
Semi-structured table question answering (QA) is a challenging task that requires (1) precise extraction of cell contents and positions and (2) accurate recovery of key implicit logical structures, hierarchical relationships, and semantic associations encoded in table layouts. In practice, such tables are often interpreted manually by human experts, which is labor-intensive and time-consuming. However, automating this process remains difficult. Existing Text-to-SQL methods typically require converting semi-structured tables into structured formats, inevitably leading to information loss, while approaches like Text-to-Code and multimodal LLM-based QA struggle with complex layouts and often yield inaccurate answers. To address these limitations, we present ST-Raptor, an agentic system for semi-structured table QA. ST-Raptor offers an interactive analysis environment that combines visual editing, tree-based structural modeling, and agent-driven query resolution to support accurate and user-friendly table understanding. Experimental results on both benchmark and real-world datasets demonstrate that ST-Raptor outperforms existing methods in both accuracy and usability. The code is available at https://github.com/weAIDB/ST-Raptor, and a demonstration video is available at https://youtu.be/9GDR-94Cau4.
Chinese Translation
半结构化表格问答(QA)是一项具有挑战性的任务,要求(1)精确提取单元格内容和位置,以及(2)准确恢复表格布局中编码的关键隐含逻辑结构、层次关系和语义关联。在实际操作中,这类表格通常由人工专家手动解释,这既费力又耗时。然而,自动化这一过程仍然困难。现有的文本到SQL(Text-to-SQL)方法通常需要将半结构化表格转换为结构化格式,必然导致信息损失,而像文本到代码(Text-to-Code)和基于多模态大语言模型(LLM)的问答方法则在处理复杂布局时遇到困难,常常产生不准确的答案。为了解决这些局限性,我们提出了ST-Raptor,一种用于半结构化表格问答的智能系统。ST-Raptor提供了一个交互式分析环境,结合了视觉编辑、基于树的结构建模和智能驱动的查询解决方案,以支持准确且用户友好的表格理解。在基准数据集和真实世界数据集上的实验结果表明,ST-Raptor在准确性和可用性方面均优于现有方法。代码可在https://github.com/weAIDB/ST-Raptor获取,演示视频可在https://youtu.be/9GDR-94Cau4观看。
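Tree-based structural modeling of a semi-structured table can be sketched as nested header nodes with cell leaves, queried by header path; this layout is illustrative, not ST-Raptor's actual schema:

```python
# Nested headers become nested dict nodes; cells become leaves.
table = {
    "Q1": {"revenue": 120, "cost": 80},
    "Q2": {"revenue": 150, "cost": 90},
}

def query(tree, path):
    """Resolve a cell by walking the header hierarchy."""
    node = tree
    for key in path:
        node = node[key]
    return node

assert query(table, ["Q2", "revenue"]) == 150
```

Keeping the hierarchy explicit is what avoids the information loss incurred when such tables are flattened into relational form for Text-to-SQL.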
cs.AI / 3 / 2602.07035
DLLM-Searcher: Adapting Diffusion Large Language Model for Search Agents
DLLM-Searcher:为搜索代理适配扩散大型语言模型
Abstract
Recently, Diffusion Large Language Models (dLLMs) have demonstrated unique efficiency advantages, enabled by their inherently parallel decoding mechanism and flexible generation paradigm. Meanwhile, despite the rapid advancement of Search Agents, their practical deployment is constrained by a fundamental limitation, termed 1) the Latency Challenge: the serial execution of multi-round reasoning, tool calling, and tool response waiting under the ReAct agent paradigm induces severe end-to-end latency. Intuitively, dLLMs can leverage their distinctive strengths to optimize the operational efficiency of agents under the ReAct agent paradigm. Practically, existing dLLM backbones face 2) the Agent Ability Challenge: existing dLLMs exhibit remarkably weak reasoning and tool-calling capabilities, preventing these advantages from being effectively realized in practice. In this paper, we propose DLLM-Searcher, an optimization framework for dLLM-based Search Agents. To solve the Agent Ability Challenge, we design a two-stage post-training pipeline encompassing Agentic Supervised Fine-Tuning (Agentic SFT) and Agentic Variance-Reduced Preference Optimization (Agentic VRPO), which enhances the backbone dLLM's information seeking and reasoning capabilities. To mitigate the Latency Challenge, we leverage the flexible generation mechanism of dLLMs and propose a novel agent paradigm termed Parallel-Reasoning and Acting (P-ReAct). P-ReAct guides the model to prioritize decoding tool_call instructions, thereby allowing the model to keep thinking while waiting for the tool's return. Experimental results demonstrate that DLLM-Searcher achieves performance comparable to mainstream LLM-based search agents and P-ReAct delivers approximately 15% inference acceleration. Our code is available at https://anonymous.4open.science/r/DLLM-Searcher-553C
Chinese Translation
最近,扩散大型语言模型(dLLMs)展示了独特的效率优势,这得益于其固有的并行解码机制和灵活的生成范式。同时,尽管搜索代理的快速发展,其实际部署受到了一项基本限制的制约,称为1) 延迟挑战:在ReAct代理范式下,多轮推理、工具调用和工具响应等待的串行执行导致了严重的端到端延迟。直观上,dLLMs可以利用其独特的优势来优化ReAct代理范式下代理的操作效率。实际上,现有的dLLM骨干网络面临2) 代理能力挑战。也就是说,现有的dLLMs表现出显著较弱的推理和工具调用能力,阻碍了这些优势在实践中的有效实现。在本文中,我们提出了DLLM-Searcher,一个基于dLLM的搜索代理优化框架。为了解决代理能力挑战,我们设计了一个包含代理监督微调(Agentic Supervised Fine-Tuning, Agentic SFT)和代理方差减少偏好优化(Agentic Variance-Reduced Preference Optimization, Agentic VRPO)的两阶段后训练流程,从而增强骨干dLLM的信息获取和推理能力。为了缓解延迟挑战,我们利用dLLMs的灵活生成机制,提出了一种新型代理范式,称为并行推理与行动(Parallel-Reasoning and Acting, P-ReAct)。P-ReAct引导模型优先解码工具调用指令,从而使模型在等待工具返回时能够持续思考。实验结果表明,DLLM-Searcher的性能与主流的基于LLM的搜索代理相当,而P-ReAct实现了约15%的推理加速。我们的代码可在https://anonymous.4open.science/r/DLLM-Searcher-553C获取。
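The latency win of P-ReAct comes from overlapping tool-call waiting with continued reasoning rather than serializing them. A minimal concurrency sketch (the coroutines below are stubs standing in for decoding and a real tool, not the paper's implementation):

```python
import asyncio

async def call_tool(query):
    await asyncio.sleep(0.01)          # stands in for search/tool latency
    return f"results for {query!r}"

async def keep_thinking():
    return "partial reasoning drafted while waiting"

async def p_react_step(query):
    tool_task = asyncio.create_task(call_tool(query))  # issue tool call first
    thought = await keep_thinking()                    # keep reasoning in parallel
    observation = await tool_task                      # then consume the result
    return thought, observation

thought, obs = asyncio.run(p_react_step("dLLM search agents"))
```

Under the serial ReAct loop the two awaits would add up; here the reasoning cost is hidden inside the tool's round-trip time.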
cs.AI / 4 / 2602.07040
Aster: Autonomous Scientific Discovery over 20x Faster Than Existing Methods
Aster:一种比现有方法快20倍的自主科学发现系统
Abstract
We introduce Aster, an AI agent for autonomous scientific discovery capable of operating over 20 times faster than existing frameworks. Given a task, an initial program, and a script to evaluate the performance of the program, Aster iteratively improves the program, often leading to new state-of-the-art performances. Aster's significant reduction in the number of iterations required for novel discovery expands the domain of tractable problems to include tasks with long evaluation durations, such as multi-hour machine learning training runs. We applied Aster to problems in mathematics, GPU kernel engineering, biology, neuroscience, and language model training. More specifically: the Erdos minimum overlap problem, optimizing the TriMul kernel, a single-cell analysis denoising problem, training a neural activity prediction model to perform well on ZAPBench, and the NanoGPT Speedrun Competition. Aster attains SOTA results in every task, except for ZAPBench, where it matches the performance of the best human solution with less than 1/190th of the compute. Aster is accessible via a web interface and API at asterlab.ai.
Chinese Translation
我们介绍了Aster,一个能够进行自主科学发现的人工智能代理,其操作速度超过现有框架的20倍。给定一个任务、一个初始程序和一个评估程序性能的脚本,Aster通过迭代改进程序,通常能够实现新的最先进性能。Aster显著减少了新发现所需的迭代次数,从而扩展了可处理问题的领域,包括评估时间较长的任务,例如多小时的机器学习训练过程。我们将Aster应用于数学、GPU内核工程、生物学、神经科学和语言模型训练等问题。更具体地说:Erdos最小重叠问题、优化TriMul内核、单细胞分析去噪问题、训练神经活动预测模型以在ZAPBench上表现良好,以及NanoGPT速度竞赛。Aster在每个任务中都达到了最先进的结果,除了ZAPBench,在该任务中它的性能与最佳人类解决方案相当,但计算资源消耗不到1/190。Aster可以通过网站接口和API访问,网址为asterlab.ai。
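The task/program/evaluation-script loop the abstract describes can be sketched generically; the proposer below is a deterministic stub standing in for the LLM-driven edit step, and all names are illustrative:

```python
# Generic improve-evaluate loop: keep the best-scoring program seen so far.
def optimize(program, evaluate, propose, iterations):
    best, best_score = program, evaluate(program)
    for _ in range(iterations):
        candidate = propose(best)
        score = evaluate(candidate)
        if score > best_score:          # accept only improvements
            best, best_score = candidate, score
    return best, best_score

# toy task: the "program" is a number, and the score peaks at 7
evaluate = lambda p: -(p - 7) ** 2
propose = lambda p: p + 1
best, score = optimize(0, evaluate, propose, iterations=10)
```

Aster's claimed speedup is about needing far fewer of these iterations, which matters most when each `evaluate` call is a multi-hour training run.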
cs.AI / 5 / 2602.07055
Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?
空间理论:基础模型能否通过主动探索构建空间信念?
Abstract
Spatial embodied intelligence requires agents to act to acquire information under partial observability. While multimodal foundation models excel at passive perception, their capacity for active, self-directed exploration remains understudied. We propose Theory of Space, defined as an agent's ability to acquire information through self-directed, active exploration and to construct, revise, and exploit a spatial belief from sequential, partial observations. We evaluate this through a benchmark where the goal is curiosity-driven exploration to build an accurate cognitive map. A key innovation is spatial belief probing, which prompts models to reveal their internal spatial representations at each step. Our evaluation of state-of-the-art models reveals several critical bottlenecks. First, we identify an Active-Passive Gap, where performance drops significantly when agents must autonomously gather information. Second, we find high inefficiency, as models explore unsystematically compared to program-based proxies. Through belief probing, we diagnose that while perception is an initial bottleneck, global beliefs suffer from instability that causes spatial knowledge to degrade over time. Finally, using a false belief paradigm, we uncover Belief Inertia, where agents fail to update obsolete priors with new evidence. This issue is present in text-based agents but is particularly severe in vision-based models. Our findings suggest that current foundation models struggle to maintain coherent, revisable spatial beliefs during active exploration.
Chinese Translation
空间具身智能要求智能体在部分可观察性下采取行动以获取信息。尽管多模态基础模型在被动感知方面表现出色,但它们在主动、自主探索方面的能力仍然未得到充分研究。我们提出了空间理论,定义为智能体通过自主的主动探索获取信息,并从连续的部分观察中构建、修正和利用空间信念的能力。我们通过一个基准测试来评估这一点,其目标是好奇心驱动的探索,以建立准确的认知地图。一个关键的创新是空间信念探测,它促使模型在每一步揭示其内部空间表征。我们对最先进模型的评估揭示了几个关键瓶颈。首先,我们识别出一个主动-被动差距(Active-Passive Gap),即当智能体必须自主收集信息时,性能显著下降。其次,我们发现效率低下,因为模型的探索相较于基于程序的代理显得不系统。通过信念探测,我们诊断出虽然感知是一个初始瓶颈,但全局信念存在不稳定性,导致空间知识随着时间的推移而退化。最后,利用虚假信念范式,我们发现信念惯性(Belief Inertia),即智能体未能用新证据更新过时的先验。这一问题在基于文本的智能体中存在,但在基于视觉的模型中尤为严重。我们的发现表明,当前的基础模型在主动探索过程中难以维持连贯、可修正的空间信念。
cs.AI / 6 / 2602.07153
ANCHOR: Branch-Point Data Generation for GUI Agents
ANCHOR:用于GUI代理的分支点数据生成
Abstract
End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data, yet collecting human demonstrations is expensive and existing synthetic pipelines often suffer from limited task diversity or noisy, goal-drifting trajectories. We present a trajectory expansion framework Anchor that bootstraps scalable desktop supervision from a small set of verified seed demonstrations. Starting from each seed, we identify branch points that correspond to meaningful state changes and propose new, state-grounded task variants conditioned on the current GUI context. An executing agent then follows the proposed instructions to generate new trajectories, while a verifier enforces task completion via state-aware checks and trajectory-level consistency. To improve supervision quality, we further apply task-conditioned step-level filtering to remove ungrounded actions and denoise post-branch segments to maintain coherent intent. Experiments on standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements over zero-shot agents and representative synthesis baselines, and generalize across applications and operating systems.
Chinese Translation
端到端的GUI代理在真实桌面环境中需要大量高质量的交互数据,但收集人类示范的成本高昂,而现有的合成管道往往存在任务多样性有限或轨迹噪声大、目标漂移的问题。我们提出了一种轨迹扩展框架Anchor,该框架从一小组经过验证的种子示范中引导出可扩展的桌面监督。从每个种子出发,我们识别出对应于有意义状态变化的分支点,并以当前GUI上下文为条件,提出新的、基于状态的任务变体。执行代理随后遵循所提议的指令生成新的轨迹,而验证器通过状态感知检查和轨迹级一致性来确保任务完成。为了提高监督质量,我们进一步应用任务条件化的步骤级过滤来去除无依据的动作,并对分支后的片段进行去噪,以保持意图的一致性。在标准桌面基准OSWorld和WindowsAgentArena上的实验表明,在我们扩展的语料库上微调的模型相较于零样本代理和代表性合成基线取得了一致的改进,并能够跨应用和操作系统泛化。
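The branch-point idea above can be illustrated with a small sketch: scan a verified seed trajectory and flag the steps whose observed GUI state changed meaningfully, since those are natural anchors for proposing new task variants. The trajectory format, field names, and the changed-field threshold below are invented stand-ins, not Anchor's actual representation.

```python
# Hypothetical sketch of branch-point identification in a seed trajectory.
# A "trajectory" here is a list of (action, state_dict) pairs; a branch
# point is a step after which enough state fields changed.

def branch_points(trajectory, min_changed_fields=2):
    """Return indices of steps whose state differs from the previous one
    in at least `min_changed_fields` fields."""
    points = []
    for i in range(1, len(trajectory)):
        prev, curr = trajectory[i - 1][1], trajectory[i][1]
        changed = sum(1 for k in curr if curr.get(k) != prev.get(k))
        if changed >= min_changed_fields:
            points.append(i)
    return points

seed = [
    ("open_app",   {"window": "home",     "dialog": None,  "field": ""}),
    ("move_mouse", {"window": "home",     "dialog": None,  "field": ""}),
    ("click_new",  {"window": "editor",   "dialog": "new", "field": ""}),
    ("type_name",  {"window": "editor",   "dialog": "new", "field": "report"}),
    ("confirm",    {"window": "document", "dialog": None,  "field": "report"}),
]
print(branch_points(seed))  # → [2, 4]
```

Moving the mouse and typing do not cross the threshold, while opening a dialog and confirming it do, so only the meaningful state transitions become branch points.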
cs.AI / 7 / 2602.07187
PreFlect: From Retrospective to Prospective Reflection in Large Language Model Agents
PreFlect:从回顾性反思到前瞻性反思在大型语言模型代理中的应用
Abstract
Advanced large language model agents typically adopt self-reflection for improving performance, where agents iteratively analyze past actions to correct errors. However, existing reflective approaches are inherently retrospective: agents act, observe failure, and only then attempt to recover. In this work, we introduce PreFlect, a prospective reflection mechanism that shifts the paradigm from post hoc correction to pre-execution foresight by criticizing and refining agent plans before execution. To support grounded prospective reflection, we distill planning errors from historical agent trajectories, capturing recurring success and failure patterns observed across past executions. Furthermore, we complement prospective reflection with a dynamic re-planning mechanism that provides execution-time plan update in case the original plan encounters unexpected deviation. Evaluations on different benchmarks demonstrate that PreFlect significantly improves overall agent utility on complex real-world tasks, outperforming strong reflection-based baselines and several more complex agent architectures. Code will be updated at https://github.com/wwwhy725/PreFlect.
Chinese Translation
先进的大型语言模型代理通常采用自我反思来提高性能,代理通过迭代分析过去的行为来纠正错误。然而,现有的反思方法本质上是回顾性的:代理先执行,观察失败,然后再尝试恢复。在本研究中,我们引入了PreFlect,一种前瞻性反思机制,它将范式从事后修正转变为执行前的预见,通过在执行前批评和完善代理计划来实现。为了支持有依据的前瞻性反思,我们从历史代理轨迹中提炼出规划错误,捕捉到过去执行中观察到的重复成功和失败模式。此外,我们还通过动态重新规划机制来补充前瞻性反思,在原计划遇到意外偏差时提供执行时的计划更新。在不同基准上的评估表明,PreFlect显著提高了代理在复杂现实任务中的整体效用,超越了强大的基于反思的基线和几种更复杂的代理架构。代码将更新至 https://github.com/wwwhy725/PreFlect。
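As a toy illustration of the shift from retrospective to prospective reflection, the sketch below critiques a draft plan against two invented failure patterns before anything executes. PreFlect's error patterns are distilled from historical trajectories and its critic is LLM-based; these hand-written rules are only a stand-in for that idea.

```python
# Hypothetical prospective-reflection sketch: refine a plan pre-execution
# using rules that encode recurring failure patterns (invented examples).

FAILURE_PATTERNS = {
    "missing_precondition": lambda plan: "login" not in plan,
    "wrong_order": lambda plan: "submit" in plan
        and plan.index("submit") < plan.index("fill_form"),
}

def prospective_reflect(plan):
    """Return a refined plan: fix flagged issues before execution begins."""
    if FAILURE_PATTERNS["missing_precondition"](plan):
        plan = ["login"] + plan                      # add missing precondition
    if "submit" in plan and "fill_form" in plan and FAILURE_PATTERNS["wrong_order"](plan):
        plan = [s for s in plan if s != "submit"] + ["submit"]  # reorder
    return plan

draft = ["open_site", "submit", "fill_form"]
refined = prospective_reflect(draft)
print(refined)  # → ['login', 'open_site', 'fill_form', 'submit']
```

The point of the paradigm is visible even in this toy: both defects are caught before a single step runs, rather than after an observed failure.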
cs.AI / 8 / 2602.07238
Is there ``Secret Sauce'' in Large Language Model Development?
大型语言模型开发中是否存在“秘密调料”?
Abstract
Do leading LLM developers possess a proprietary ``secret sauce'', or is LLM performance driven by scaling up compute? Using training and benchmark data for 809 models released between 2022 and 2025, we estimate scaling-law regressions with release-date and developer fixed effects. We find clear evidence of developer-specific efficiency advantages, but their importance depends on where models lie in the performance distribution. At the frontier, 80-90% of performance differences are explained by higher training compute, implying that scale--not proprietary technology--drives frontier advances. Away from the frontier, however, proprietary techniques and shared algorithmic progress substantially reduce the compute required to reach fixed capability thresholds. Some companies can systematically produce smaller models more efficiently. Strikingly, we also find substantial variation of model efficiency within companies; a firm can train two models with more than 40x compute efficiency difference. We also discuss the implications for AI leadership and capability diffusion.
Chinese Translation
领先的LLM(大型语言模型)开发者是否拥有专有的“秘密调料”,还是LLM性能主要由计算规模的扩大驱动?利用2022年至2025年间发布的809个模型的训练和基准数据,我们估计了带有发布日期和开发者固定效应的缩放定律回归。我们发现了开发者特定效率优势的明确证据,但其重要性取决于模型在性能分布中的位置。在前沿,80-90%的性能差异可以由更高的训练计算量解释,这意味着驱动前沿进展的是规模而非专有技术。然而,在远离前沿处,专有技术和共享的算法进展显著减少了达到固定能力阈值所需的计算量。一些公司能够系统性地更高效地生产较小的模型。引人注目的是,我们还发现公司内部模型效率存在巨大差异:同一家公司训练的两个模型,计算效率差异可超过40倍。我们还讨论了这对人工智能领导地位和能力扩散的影响。
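The regression design described above can be sketched as ordinary least squares with one intercept (fixed effect) per developer on top of a log-compute slope. The data below are synthetic, with a known slope and a planted +2-point advantage for developer "B", so the example only demonstrates the estimator's shape, not the paper's findings.

```python
import numpy as np

# Sketch of a scaling-law regression with developer fixed effects:
#   score_i = beta * log(compute_i) + alpha_{dev(i)} + noise
# All data here are synthetic placeholders.

def fit_fixed_effects(log_compute, scores, dev_ids):
    """OLS with a log-compute slope plus one dummy per developer."""
    devs = sorted(set(dev_ids))
    n, k = len(scores), len(devs)
    X = np.zeros((n, 1 + k))
    X[:, 0] = log_compute
    for i, d in enumerate(dev_ids):
        X[i, 1 + devs.index(d)] = 1.0          # developer dummy
    coef, *_ = np.linalg.lstsq(X, np.asarray(scores), rcond=None)
    beta = coef[0]
    effects = dict(zip(devs, coef[1:]))        # per-developer offsets
    return beta, effects

# Synthetic data: developer "B" gets +2 points at any compute level.
rng = np.random.default_rng(0)
log_c = rng.uniform(20, 26, size=40)
dev = ["A"] * 20 + ["B"] * 20
scores = 3.0 * log_c + np.array([0.0] * 20 + [2.0] * 20)

beta, fx = fit_fixed_effects(log_c, scores, dev)
print(round(beta, 3), round(fx["B"] - fx["A"], 3))  # slope, B's advantage
```

With noiseless data the estimator recovers the planted slope (3.0) and B's two-point efficiency advantage exactly; the paper's analysis adds release-date effects and real benchmark noise on top of this skeleton.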
cs.AI / 9 / 2602.07253
From Out-of-Distribution Detection to Hallucination Detection: A Geometric View
从分布外检测到幻觉检测:几何视角
Abstract
Detecting hallucinations in large language models is a critical open problem with significant implications for safety and reliability. While existing hallucination detection methods achieve strong performance in question-answering tasks, they remain less effective on tasks requiring reasoning. In this work, we revisit hallucination detection through the lens of out-of-distribution (OOD) detection, a well-studied problem in areas like computer vision. Treating next-token prediction in language models as a classification task allows us to apply OOD techniques, provided appropriate modifications are made to account for the structural differences in large language models. We show that OOD-based approaches yield training-free, single-sample-based detectors, achieving strong accuracy in hallucination detection for reasoning tasks. Overall, our work suggests that reframing hallucination detection as OOD detection provides a promising and scalable pathway toward language model safety.
Chinese Translation
在大型语言模型中检测幻觉是一个关键的开放问题,对安全性和可靠性具有重要影响。尽管现有的幻觉检测方法在问答任务中表现出色,但在需要推理的任务中效果较差。在本研究中,我们通过分布外(OOD)检测的视角重新审视幻觉检测,这是计算机视觉等领域中一个经过充分研究的问题。将语言模型中的下一个标记预测视为分类任务,使我们能够应用OOD技术,前提是对大型语言模型的结构差异进行适当的修改。我们展示了基于OOD的方法能够产生无训练、基于单样本的检测器,在推理任务中实现强大的幻觉检测准确性。总体而言,我们的研究表明,将幻觉检测重新框定为OOD检测为语言模型的安全性提供了一条有前景且可扩展的路径。
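One standard training-free, single-sample OOD score from the classification literature is the energy score over logits. The sketch below shows how such a score separates a peaked next-token distribution from a flat one; this is a generic OOD technique of the kind the paper draws on, not necessarily its exact detector.

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Negative free energy T * logsumexp(logits / T): higher values
    indicate confident, in-distribution-looking predictions; lower values
    flag potential OOD inputs (here, candidate hallucinations). Generic
    energy-based OOD score, applied to a next-token logit vector."""
    z = np.asarray(logits, dtype=float) / T
    m = z.max()
    return T * (m + np.log(np.exp(z - m).sum()))  # numerically stable

confident = energy_score([10.0, 0.0, 0.0, 0.0])   # peaked distribution
uncertain = energy_score([1.0, 1.0, 1.0, 1.0])    # flat distribution
print(confident > uncertain)  # → True
```

Because the score needs only the logits of a single forward pass, it inherits exactly the training-free, single-sample property the abstract highlights.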
cs.AI / 10 / 2602.07259
Incentive-Aware AI Safety via Strategic Resource Allocation: A Stackelberg Security Games Perspective
通过战略资源分配实现激励感知的人工智能安全:来自斯塔克尔堡安全博弈的视角
Abstract
As AI systems grow more capable and autonomous, ensuring their safety and reliability requires not only model-level alignment but also strategic oversight of the humans and institutions involved in their development and deployment. Existing safety frameworks largely treat alignment as a static optimization problem (e.g., tuning models to desired behavior) while overlooking the dynamic, adversarial incentives that shape how data are collected, how models are evaluated, and how they are ultimately deployed. We propose a new perspective on AI safety grounded in Stackelberg Security Games (SSGs): a class of game-theoretic models designed for adversarial resource allocation under uncertainty. By viewing AI oversight as a strategic interaction between defenders (auditors, evaluators, and deployers) and attackers (malicious actors, misaligned contributors, or worst-case failure modes), SSGs provide a unifying framework for reasoning about incentive design, limited oversight capacity, and adversarial uncertainty across the AI lifecycle. We illustrate how this framework can inform (1) training-time auditing against data/feedback poisoning, (2) pre-deployment evaluation under constrained reviewer resources, and (3) robust multi-model deployment in adversarial environments. This synthesis bridges algorithmic alignment and institutional oversight design, highlighting how game-theoretic deterrence can make AI oversight proactive, risk-aware, and resilient to manipulation.
Chinese Translation
随着人工智能系统变得越来越强大和自主,确保其安全性和可靠性不仅需要模型层面的对齐,还需要对参与其开发和部署的人类及机构进行战略监督。现有的安全框架在很大程度上将对齐视为一个静态优化问题(例如,调整模型以实现期望行为),而忽视了塑造数据收集、模型评估和最终部署方式的动态对抗性激励。我们提出了一种基于斯塔克尔堡安全博弈(Stackelberg Security Games, SSGs)的新视角来理解人工智能安全:这是一类旨在处理不确定性下对抗性资源分配的博弈论模型。通过将人工智能监督视为防御者(审计者、评估者和部署者)与攻击者(恶意行为者、未对齐的贡献者或最坏情况失败模式)之间的战略互动,SSGs提供了一个统一的框架,用于推理激励设计、有限监督能力和人工智能生命周期中的对抗性不确定性。我们展示了该框架如何为(1)针对数据/反馈中毒的训练时间审计提供指导,(2)在审查资源受限的情况下进行预部署评估,以及(3)在对抗环境中进行稳健的多模型部署提供支持。这一综合方法架起了算法对齐与机构监督设计之间的桥梁,突显了博弈论威慑如何使人工智能监督变得主动、风险意识强,并且能够抵御操控。
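A minimal instance of the SSG machinery referenced above: the defender commits to a coverage distribution, the attacker observes it and best-responds, and the optimal commitment equalizes the attacker's payoffs across targets. The payoff numbers are invented and the game is reduced to zero-sum for brevity; real SSG models allow general-sum payoffs.

```python
import numpy as np

# Toy Stackelberg security game: one defender resource split (in
# expectation) across two targets, e.g. two stages of an AI audit pipeline.
reward  = np.array([10.0, 4.0])   # attacker gain when a target is unprotected
penalty = np.array([-2.0, -1.0])  # attacker loss when caught

def attacker_value(c):
    """Expected attacker payoff per target under coverage vector c."""
    return (1 - c) * reward + c * penalty

best_c, best_def = None, -np.inf
for c0 in np.linspace(0, 1, 101):       # grid over coverage of target 0
    c = np.array([c0, 1 - c0])          # one resource in expectation
    v = attacker_value(c)
    t = int(np.argmax(v))               # attacker best response
    def_util = -v[t]                    # zero-sum for simplicity
    if def_util > best_def:
        best_def, best_c = def_util, c
print(best_c, round(float(best_def), 2))
```

The optimum lands where the two attacker payoffs are (nearly) equal, at roughly 65% coverage on the high-value target: the hallmark of Stackelberg commitment, with limited oversight capacity spread to make every attack equally unattractive.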
cs.AI / 11 / 2602.07267
BRIDGE: Predicting Human Task Completion Time From Model Performance
BRIDGE:从模型性能预测人类任务完成时间
Abstract
Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE, a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time. Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and independently reproduce METR's exponential scaling results, with the 50% solvable task horizon doubling approximately every 6 months.
Chinese Translation
评估人工智能系统在现实世界中的能力需要将基准性能锚定在人类可解释的任务难度度量上。现有依赖直接人工标注任务完成时间的方法成本高、噪声大,且难以跨基准扩展。在本研究中,我们提出了BRIDGE,一个统一的心理测量框架,它从模型响应中学习潜在的难度尺度,并将其锚定到人类任务完成时间。使用双参数逻辑斯蒂项目反应理论(Item Response Theory)模型,我们从多个基准的模型性能数据中联合估计潜在任务难度和模型能力。我们证明潜在任务难度与人类完成时间的对数呈线性关系,从而可以仅凭模型性能推断新基准上的人类任务完成时间。利用这种对齐,我们以人类任务时长为尺度预测前沿模型的能力,并独立复现了METR的指数扩展结果:50%可解任务的时长边界大约每6个月翻一番。
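The two ingredients named above, the two-parameter logistic (2PL) item-response model and the linear link between latent difficulty and log human time, can be written down directly. The coefficients `alpha` and `beta` below are made-up placeholders; BRIDGE estimates the latent quantities jointly from benchmark data.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that a model of ability theta solves an item
    of difficulty b, with discrimination a."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Paper's empirical link: log T = alpha + beta * b (human completion time T).
# alpha, beta here are illustrative placeholders, not fitted values.
alpha, beta = 1.0, 0.8

def human_minutes(b):
    """Infer human completion time from latent difficulty alone."""
    return math.exp(alpha + beta * b)

b = 1.0  # one item's latent difficulty
print(round(p_correct(2.0, a=1.5, b=b), 3),   # stronger model (theta = 2)
      round(p_correct(0.0, a=1.5, b=b), 3))   # weaker model  (theta = 0)
print(round(human_minutes(b), 2))             # inferred human time
```

Once the log-linear link is fitted, `human_minutes` is exactly the inference step the abstract describes: human task time for a new benchmark recovered from model performance alone, with no new human annotation.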
cs.AI / 12 / 2602.07274
TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents
TermiGen:终端智能体的高保真环境与鲁棒轨迹合成
Abstract
Executing complex terminal tasks remains a significant challenge for open-weight LLMs, constrained by two fundamental limitations. First, high-fidelity, executable training environments are scarce: environments synthesized from real-world repositories lack diversity and scalability, while trajectories synthesized by LLMs suffer from hallucinations. Second, standard instruction tuning uses expert trajectories that rarely exhibit the simple mistakes common to smaller models. This creates a distributional mismatch, leaving student models ill-equipped to recover from their own runtime failures. To bridge these gaps, we introduce TermiGen, an end-to-end pipeline for synthesizing verifiable environments and resilient expert trajectories. TermiGen first generates functionally valid tasks and Docker containers via an iterative multi-agent refinement loop. Subsequently, we employ a Generator-Critic protocol that actively injects errors during trajectory collection, synthesizing data rich in error-correction cycles. Fine-tuned on this TermiGen-generated dataset, our TermiGen-Qwen2.5-Coder-32B achieves a 31.3% pass rate on TerminalBench. This establishes a new open-weights state of the art, outperforming existing baselines and notably surpassing capable proprietary models such as o4-mini. The dataset is available at https://github.com/ucsb-mlsec/terminal-bench-env.
Chinese Translation
执行复杂的终端任务仍然是开放权重大型语言模型(LLMs)面临的一项重大挑战,这主要受到两个基本限制的制约。首先,高保真、可执行的训练环境稀缺:从现实世界代码库合成的环境缺乏多样性和可扩展性,而由LLM合成的轨迹则存在幻觉问题。其次,标准的指令调优使用的专家轨迹很少表现出小型模型常见的简单错误。这造成了分布不匹配,使得学生模型难以从自身的运行时失败中恢复。为了弥合这些差距,我们提出了TermiGen,一个用于合成可验证环境和鲁棒专家轨迹的端到端管道。TermiGen首先通过迭代的多智能体优化循环生成功能有效的任务和Docker容器。随后,我们采用生成器-评论家(Generator-Critic)协议,在轨迹收集过程中主动注入错误,从而合成富含错误修正循环的数据。在这个TermiGen生成的数据集上进行微调后,我们的TermiGen-Qwen2.5-Coder-32B在TerminalBench上达到了31.3%的通过率。这确立了新的开放权重最先进水平,超越了现有基线,并显著超过了诸如o4-mini等强大的专有模型。数据集可在 https://github.com/ucsb-mlsec/terminal-bench-env 获取。
cs.AI / 13 / 2602.07276
Steer2Adapt: Dynamically Composing Steering Vectors Elicits Efficient Adaptation of LLMs
Steer2Adapt:动态组合引导向量促进大型语言模型的高效适应
Abstract
Activation steering has emerged as a promising approach for efficiently adapting large language models (LLMs) to downstream behaviors. However, most existing steering methods rely on a single static direction per task or concept, making them inflexible under task variation and inadequate for complex tasks that require multiple coordinated capabilities. To address this limitation, we propose STEER2ADAPT, a lightweight framework that adapts LLMs by composing steering vectors rather than learning new ones from scratch. In many domains (e.g., reasoning or safety), tasks share a small set of underlying concept dimensions. STEER2ADAPT captures these dimensions as a reusable, low-dimensional semantic prior subspace, and adapts to new tasks by dynamically discovering a linear combination of basis vectors from only a handful of examples. Experiments across 9 tasks and 3 models in both reasoning and safety domains demonstrate the effectiveness of STEER2ADAPT, achieving an average improvement of 8.2%. Extensive analyses further show that STEER2ADAPT is a data-efficient, stable, and transparent inference-time adaptation method for LLMs.
Chinese Translation
激活引导已成为高效适应大型语言模型(LLMs)以应对下游行为的有前景的方法。然而,大多数现有的引导方法依赖于每个任务或概念的单一静态方向,这使得它们在任务变化时缺乏灵活性,并且对于需要多种协调能力的复杂任务而言显得不足。为了解决这一局限性,我们提出了STEER2ADAPT,一个轻量级框架,通过组合引导向量而不是从头学习新的向量来适应LLMs。在许多领域(例如推理或安全),任务共享一小组潜在概念维度。STEER2ADAPT将这些维度捕捉为可重用的低维语义先验子空间,并仅凭少量示例动态发现基向量的线性组合,从而适应新任务。在推理和安全两个领域的9个任务和3个模型上的实验证明了STEER2ADAPT的有效性,平均提升达8.2%。广泛的分析进一步表明,STEER2ADAPT是一种数据高效、稳定且透明的LLM推理时适应方法。
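The composition step can be sketched as a small least-squares problem: given a fixed basis of concept directions (the "semantic prior subspace"), recover mixing weights from a few example activation shifts, then apply the combined vector additively at inference time. The dimensions and data below are synthetic stand-ins, not the paper's setup.

```python
import numpy as np

# Hypothetical sketch of composing steering vectors from a fixed basis.
rng = np.random.default_rng(1)
d, k = 16, 3
basis = rng.standard_normal((k, d))        # k reusable concept directions

# Synthetic "few-shot" data: 4 example activation shifts generated by a
# known mixture of the basis (planted ground truth for this demo only).
true_w = np.array([0.5, -1.0, 0.0])
few_shot_deltas = np.tile(true_w @ basis, (4, 1))

# Least squares: find w minimizing ||basis.T @ w - mean_delta||.
target = few_shot_deltas.mean(axis=0)
w, *_ = np.linalg.lstsq(basis.T, target, rcond=None)
steering_vector = w @ basis                # added to activations at inference
print(np.round(w, 3))
```

Because only `k` mixing weights are fit rather than a full `d`-dimensional direction, a handful of examples suffices, which is the data-efficiency argument behind composing rather than relearning steering vectors.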
cs.AI / 14 / 2602.07308
Adaptive Scaffolding for Cognitive Engagement in an Intelligent Tutoring System
智能辅导系统中的认知参与自适应支架
Abstract
The ICAP framework defines four cognitive engagement levels: Passive, Active, Constructive, and Interactive, where increased cognitive engagement can yield improved learning. However, personalizing learning activities that elicit the optimal level of cognitive engagement remains a key challenge in intelligent tutoring systems (ITS). In this work, we develop and evaluate a system that adaptively scaffolds cognitive engagement by dynamically selecting worked examples in two different ICAP modes: (active) Guided examples and (constructive) Buggy examples. We compare Bayesian Knowledge Tracing (BKT) and Deep Reinforcement Learning (DRL) as adaptive methods against a non-adaptive baseline method for selecting example type in a logic ITS. Our experiment with 113 students demonstrates that both adaptive policies significantly improved student performance on test problems. BKT yielded the largest improvement in posttest scores for low prior knowledge students, helping them catch up with their high prior knowledge peers, whereas DRL yielded significantly higher posttest scores among high prior knowledge students. This paper contributes new insights into the complex interactions of cognitive engagement and adaptivity and their effects on learning outcomes.
Chinese Translation
ICAP框架定义了四种认知参与水平:被动、主动、建构和互动,其中更高的认知参与可以带来更好的学习效果。然而,如何个性化地选择能引发最佳认知参与水平的学习活动,仍然是智能辅导系统(ITS)中的一个关键挑战。在本研究中,我们开发并评估了一种系统,该系统在两种不同的ICAP模式下动态选择样例,即(主动的)引导样例与(建构的)错误样例,以自适应地为认知参与提供支架。我们将贝叶斯知识追踪(Bayesian Knowledge Tracing, BKT)和深度强化学习(Deep Reinforcement Learning, DRL)作为自适应方法,与非自适应基线方法进行比较,用于在一个逻辑ITS中选择样例类型。对113名学生的实验表明,这两种自适应策略都显著提高了学生在测试题上的表现。对于先前知识较低的学生,BKT在后测得分上带来了最大提升,帮助他们赶上先前知识较高的同伴;而DRL则使先前知识较高的学生获得了显著更高的后测得分。本文为认知参与与适应性之间复杂的相互作用及其对学习结果的影响提供了新的见解。
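For reference, the BKT half of the comparison rests on the classic four-parameter update (prior mastery, slip, guess, learn). A minimal implementation with illustrative, unfitted parameter values:

```python
def bkt_update(p_know, correct, p_slip=0.1, p_guess=0.2, p_learn=0.15):
    """One Bayesian Knowledge Tracing step: Bayes-update P(skill known)
    from one observed response, then apply the learning transition.
    Parameter values here are illustrative defaults, not fitted ones."""
    if correct:
        obs = p_know * (1 - p_slip)
        post = obs / (obs + (1 - p_know) * p_guess)
    else:
        obs = p_know * p_slip
        post = obs / (obs + (1 - p_know) * (1 - p_guess))
    return post + (1 - post) * p_learn          # chance to learn this step

p = 0.3                                         # prior mastery estimate
for outcome in [True, True, False, True]:       # a short response sequence
    p = bkt_update(p, outcome)
print(round(p, 3))
```

An adaptive policy of the kind the paper studies would threshold this mastery estimate each step, e.g. serving a Guided example while mastery is low and switching to a Buggy example once it is high.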
cs.AI / 15 / 2602.07339
RAPiD: Real-time Deterministic Trajectory Planning via Diffusion Behavior Priors for Safe and Efficient Autonomous Driving
RAPiD:通过扩散行为先验实现安全高效的自主驾驶实时确定性轨迹规划
Abstract
Diffusion-based trajectory planners have demonstrated strong capability for modeling the multimodal nature of human driving behavior, but their reliance on iterative stochastic sampling poses critical challenges for real-time, safety-critical deployment. In this work, we present RAPiD, a deterministic policy extraction framework that distills a pretrained diffusion-based planner into an efficient policy while eliminating diffusion sampling. Using score-regularized policy optimization, we leverage the score function of a pre-trained diffusion planner as a behavior prior to regularize policy learning. To promote safety and passenger comfort, the policy is optimized using a critic trained to imitate a predictive driver controller, providing dense, safety-focused supervision beyond conventional imitation learning. Evaluations demonstrate that RAPiD achieves competitive performance on closed-loop nuPlan scenarios with an 8x speedup over diffusion baselines, while achieving state-of-the-art generalization among learning-based planners on the interPlan benchmark. The official website of this work is: https://github.com/ruturajreddy/RAPiD.
Chinese Translation
基于扩散的轨迹规划器在建模人类驾驶行为的多模态特性方面表现出强大的能力,但其对迭代随机采样的依赖给实时、安全关键的部署带来了重大挑战。在本研究中,我们提出了RAPiD,一个确定性策略提取框架,它将预训练的基于扩散的规划器蒸馏为高效的策略,同时消除了扩散采样。通过得分正则化的策略优化,我们利用预训练扩散规划器的得分函数作为行为先验来正则化策略学习。为了提升安全性和乘客舒适度,该策略使用一个经过训练以模仿预测型驾驶控制器的评论者(critic)进行优化,从而提供超越传统模仿学习的密集、以安全为核心的监督。评估结果表明,RAPiD在闭环nuPlan场景中取得了具有竞争力的性能,同时相比扩散基线实现了8倍加速,并在interPlan基准上在基于学习的规划器中达到了最先进的泛化性能。该工作的官方网站为:https://github.com/ruturajreddy/RAPiD。
cs.AI / 16 / 2602.07342
SupChain-Bench: Benchmarking Large Language Models for Real-World Supply Chain Management
SupChain-Bench:针对现实世界供应链管理的大型语言模型基准测试
Abstract
Large language models (LLMs) have shown promise in complex reasoning and tool-based decision making, motivating their application to real-world supply chain management. However, supply chain workflows require reliable long-horizon, multi-step orchestration grounded in domain-specific procedures, which remains challenging for current models. To systematically evaluate LLM performance in this setting, we introduce SupChain-Bench, a unified real-world benchmark that assesses both supply chain domain knowledge and long-horizon tool-based orchestration grounded in standard operating procedures (SOPs). Our experiments reveal substantial gaps in execution reliability across models. We further propose SupChain-ReAct, an SOP-free framework that autonomously synthesizes executable procedures for tool use, achieving the strongest and most consistent tool-calling performance. Our work establishes a principled benchmark for studying reliable long-horizon orchestration in real-world operational settings and highlights significant room for improvement in LLM-based supply chain agents.
Chinese Translation
大型语言模型(LLMs)在复杂推理和基于工具的决策制定方面展现了潜力,这激励了它们在现实世界供应链管理中的应用。然而,供应链工作流程需要可靠的长期、多步骤的协调,这些协调基于特定领域的程序,这对当前模型来说仍然具有挑战性。为了系统地评估LLM在这一环境中的表现,我们引入了SupChain-Bench,这是一个统一的现实世界基准,评估供应链领域知识和基于标准操作程序(SOPs)的长期工具使用协调。我们的实验揭示了不同模型在执行可靠性方面存在显著差距。我们进一步提出了SupChain-ReAct,这是一个无SOP框架,能够自主合成可执行的工具使用程序,实现了最强和最一致的工具调用性能。我们的工作为研究现实世界操作环境中可靠的长期协调建立了一个原则性基准,并强调了基于LLM的供应链代理在改进方面的巨大空间。
cs.AI / 17 / 2602.07359
W&D: Scaling Parallel Tool Calling for Efficient Deep Research Agents
W&D:为高效深度研究代理扩展并行工具调用
Abstract
Deep research agents have emerged as powerful tools for automating complex intellectual tasks through multi-step reasoning and web-based information seeking. While recent efforts have successfully enhanced these agents by scaling depth through increasing the number of sequential thinking and tool calls, the potential of scaling width via parallel tool calling remains largely unexplored. In this work, we propose the Wide and Deep research agent, a framework designed to investigate the behavior and performance of agents when scaling not only depth but also width via parallel tool calling. Unlike existing approaches that rely on complex multi-agent orchestration to parallelize workloads, our method leverages intrinsic parallel tool calling to facilitate effective coordination within a single reasoning step. We demonstrate that scaling width significantly improves performance on deep research benchmarks while reducing the number of turns required to obtain correct answers. Furthermore, we analyze the factors driving these improvements through case studies and explore various tool call schedulers to optimize parallel tool calling strategy. Our findings suggest that optimizing the trade-off between width and depth is a critical pathway toward high-efficiency deep research agents. Notably, without context management or other tricks, we obtain 62.2% accuracy with GPT-5-Medium on BrowseComp, surpassing the original 54.9% reported by GPT-5-High.
Chinese Translation
深度研究代理作为自动化复杂智力任务的强大工具,已经通过多步骤推理和基于网络的信息搜索而崭露头角。尽管近期的努力通过增加顺序思维和工具调用的数量成功增强了这些代理的深度,但通过并行工具调用扩展宽度的潜力仍然未得到充分探索。在本研究中,我们提出了宽深研究代理(Wide and Deep research agent),一个旨在研究代理在扩展深度和宽度(通过并行工具调用)时的行为和性能的框架。与依赖复杂多代理编排来并行化工作负载的现有方法不同,我们的方法利用内在的并行工具调用来促进单个推理步骤内的有效协调。我们证明,扩展宽度显著提高了在深度研究基准测试中的性能,同时减少了获得正确答案所需的回合数。此外,我们通过案例研究分析推动这些改进的因素,并探索各种工具调用调度器以优化并行工具调用策略。我们的研究结果表明,优化宽度和深度之间的权衡是实现高效深度研究代理的关键途径。值得注意的是,在没有上下文管理或其他技巧的情况下,我们在 BrowseComp 上使用 GPT-5-Medium 获得了 62.2% 的准确率,超过了 GPT-5-High 报告的原始 54.9%。
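The width-versus-depth contrast above reduces to fanning out independent tool calls within one reasoning turn instead of serializing them across turns. A sketch with stand-in tools and artificial latency (Python threads here stand in for the agent's intrinsic parallel tool calling; the tool, queries, and timings are invented):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def search(query):
    """Stand-in web-search tool with simulated I/O latency."""
    time.sleep(0.05)
    return f"results for {query!r}"

queries = ["alpha", "beta", "gamma", "delta"]

t0 = time.perf_counter()
sequential = [search(q) for q in queries]          # depth-style: one per turn
t_seq = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:    # width-style: one turn
    parallel = list(pool.map(search, queries))
t_par = time.perf_counter() - t0

print(sequential == parallel, t_par < t_seq)
```

The results are identical while wall-clock time falls roughly with the fan-out factor, which is why scaling width reduces the number of turns needed without changing what the agent learns from the tools.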
cs.AI / 18 / 2602.07391
NAAMSE: Framework for Evolutionary Security Evaluation of Agents
NAAMSE:代理的进化安全评估框架
Abstract
AI agents are increasingly deployed in production, yet their security evaluations remain bottlenecked by manual red-teaming or static benchmarks that fail to model adaptive, multi-turn adversaries. We propose NAAMSE, an evolutionary framework that reframes agent security evaluation as a feedback-driven optimization problem. Our system employs a single autonomous agent that orchestrates a lifecycle of genetic prompt mutation, hierarchical corpus exploration, and asymmetric behavioral scoring. By using model responses as a fitness signal, the framework iteratively compounds effective attack strategies while simultaneously ensuring "benign-use correctness", preventing the degenerate security of blanket refusal. Our experiments on Gemini 2.5 Flash demonstrate that evolutionary mutation systematically amplifies vulnerabilities missed by one-shot methods, with controlled ablations revealing that the synergy between exploration and targeted mutation uncovers high-severity failure modes. We show that this adaptive approach provides a more realistic and scalable assessment of agent robustness in the face of evolving threats. The code for NAAMSE is open source and available at https://github.com/HASHIRU-AI/NAAMSE.
Chinese Translation
人工智能代理在生产中的应用日益增多,但其安全评估仍然受到手动红队测试或静态基准测试的瓶颈,这些方法无法模拟适应性强的多回合对手。我们提出了NAAMSE,一个将代理安全评估重新构建为基于反馈的优化问题的进化框架。我们的系统采用一个自主代理,协调遗传提示变异、层次语料库探索和不对称行为评分的生命周期。通过将模型响应用作适应度信号,该框架迭代地复合有效的攻击策略,同时确保“良性使用正确性”,防止全面拒绝导致的安全退化。我们在Gemini 2.5 Flash上的实验表明,进化变异系统性地放大了单次方法未能发现的漏洞,受控消融实验揭示了探索与目标变异之间的协同作用能够发现高严重性失效模式。我们展示了这种适应性方法在面对不断演变的威胁时,提供了更现实和可扩展的代理鲁棒性评估。NAAMSE的代码是开源的,已在https://github.com/HASHIRU-AI/NAAMSE上发布。
cs.AI / 19 / 2602.07399
VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation
VGAS:基于价值指导的少样本视觉-语言-动作适应中的动作块选择
Abstract
Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine-tuned VLA policies often produce semantically plausible trajectories, failures often arise from unresolved geometric ambiguities, where near-miss action candidates lead to divergent execution outcomes under limited supervision. We study few-shot VLA adaptation from a generation-selection perspective and propose a novel framework VGAS (Value-Guided Action-chunk Selection). It performs inference-time best-of-N selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, VGAS employs a finetuned VLA as a high-recall proposal generator and introduces the Q-Chunk-Former, a geometrically grounded Transformer critic to resolve fine-grained geometric ambiguities. In addition, we propose Explicit Geometric Regularization (EGR), which explicitly shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that VGAS consistently improves success rates and robustness under limited demonstrations and distribution shifts. Our code is available at https://github.com/Jyugo-15/VGAS.
Chinese Translation
视觉-语言-动作(VLA)模型将多模态推理与物理控制相结合,但在示范稀缺的新任务上对其进行适应仍不可靠。虽然经过微调的VLA策略通常会产生语义上合理的轨迹,但失败往往源于未解决的几何模糊性:在有限监督下,近似命中的动作候选会导致截然不同的执行结果。我们从“生成-选择”的角度研究少样本VLA适应,并提出了一种新颖的框架VGAS(Value-Guided Action-chunk Selection,价值引导的动作块选择)。该框架在推理时执行N选优(best-of-N)选择,以识别既语义可信又几何精确的动作块。具体而言,VGAS利用经过微调的VLA作为高召回率的提议生成器,并引入Q-Chunk-Former,一个基于几何的Transformer评论者,用于解决细粒度的几何模糊性。此外,我们提出了显式几何正则化(EGR),它显式地塑造一个具有判别力的价值景观,以在近似命中的候选之间保持动作排序分辨率,同时缓解稀缺监督下的价值不稳定性。实验和理论分析表明,VGAS在有限示范和分布偏移下始终提高成功率和鲁棒性。我们的代码可在 https://github.com/Jyugo-15/VGAS 获取。
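The inference-time generation-selection loop is simple to sketch: sample N action chunks, score each with a critic, execute the argmax. The random "generator" and distance-based "critic" below are toy stand-ins for the fine-tuned VLA and the Q-Chunk-Former; only the best-of-N structure mirrors the paper.

```python
import numpy as np

# Toy best-of-N action-chunk selection (stand-ins for VLA + critic).
rng = np.random.default_rng(7)
target = np.array([1.0, 0.0, 0.5])            # hypothetical goal position

def generate_chunks(n=8, horizon=4):
    """Stand-in proposal generator: N random short action sequences in 3-D."""
    return rng.normal(scale=0.3, size=(n, horizon, 3))

def critic(chunk):
    """Stand-in value: negative distance of the chunk endpoint to the goal
    (integrating action deltas from the origin)."""
    endpoint = chunk.sum(axis=0)
    return -np.linalg.norm(endpoint - target)

chunks = generate_chunks()
scores = np.array([critic(c) for c in chunks])
best = chunks[int(np.argmax(scores))]         # best-of-N selection
print(int(np.argmax(scores)), round(float(scores.max()), 3))
```

The design point is that the generator only needs high recall (some near-miss candidate is good), while the critic supplies the geometric precision that few-shot fine-tuning alone cannot.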
cs.AI / 20 / 2602.07408
Progressive Multi-Agent Reasoning for Biological Perturbation Prediction
用于生物扰动预测的渐进式多智能体推理
Abstract
Predicting gene regulation responses to biological perturbations requires reasoning about underlying biological causalities. While large language models (LLMs) show promise for such tasks, they are often overwhelmed by the entangled nature of high-dimensional perturbation results. Moreover, recent works have primarily focused on genetic perturbations in single-cell experiments, leaving bulk-cell chemical perturbations, which are central to drug discovery, largely unexplored. Motivated by this, we present LINCSQA, a novel benchmark for predicting target gene regulation under complex chemical perturbations in bulk-cell environments. We further propose PBio-Agent, a multi-agent framework that integrates difficulty-aware task sequencing with iterative knowledge refinement. Our key insight is that genes affected by the same perturbation share causal structure, allowing confidently predicted genes to contextualize more challenging cases. The framework employs specialized agents enriched with biological knowledge graphs, while a synthesis agent integrates outputs and specialized judges ensure logical coherence. PBio-Agent outperforms existing baselines on both LINCSQA and PerturbQA, enabling even smaller models to predict and explain complex biological processes without additional training.
Chinese Translation
预测基因调控对生物扰动的响应需要对潜在的生物因果关系进行推理。尽管大型语言模型(LLMs)在这类任务中展现出潜力,但它们常常被高维扰动结果相互纠缠的特性所压倒。此外,近期的研究主要集中在单细胞实验中的遗传扰动上,而作为药物发现核心的批量细胞(bulk-cell)化学扰动在很大程度上仍未被探索。基于此,我们提出了LINCSQA,这是一个用于预测批量细胞环境中复杂化学扰动下目标基因调控的新基准。我们进一步提出了PBio-Agent,一个将难度感知的任务排序与迭代知识精炼相结合的多智能体框架。我们的关键见解是,受同一扰动影响的基因共享因果结构,这使得被自信预测的基因能够为更具挑战性的案例提供上下文。该框架采用融入生物知识图谱的专门智能体,由合成智能体整合各智能体的输出,并由专门的评审智能体确保逻辑一致性。PBio-Agent在LINCSQA和PerturbQA上的表现均优于现有基线,使得即使是较小的模型也能在无需额外训练的情况下预测和解释复杂的生物过程。
cs.AI / 21 / 2602.07414
Can LLMs Truly Embody Human Personality? Analyzing AI and Human Behavior Alignment in Dispute Resolution
大型语言模型是否真正能够体现人类个性?分析人工智能与人类行为在争端解决中的一致性
Abstract
Large language models (LLMs) are increasingly used to simulate human behavior in social settings such as legal mediation, negotiation, and dispute resolution. However, it remains unclear whether these simulations reproduce the personality-behavior patterns observed in humans. Human personality, for instance, shapes how individuals navigate social interactions, including strategic choices and behaviors in emotionally charged interactions. This raises the question: Can LLMs, when prompted with personality traits, reproduce personality-driven differences in human conflict behavior? To explore this, we introduce an evaluation framework that enables direct comparison of human-human and LLM-LLM behaviors in dispute resolution dialogues with respect to Big Five Inventory (BFI) personality traits. This framework provides a set of interpretable metrics related to strategic behavior and conflict outcomes. We additionally contribute a novel dataset creation methodology for LLM dispute resolution dialogues with matched scenarios and personality traits with respect to human conversations. Finally, we demonstrate the use of our evaluation framework with three contemporary closed-source LLMs and show significant divergences in how personality manifests in conflict across different LLMs compared to human data, challenging the assumption that personality-prompted agents can serve as reliable behavioral proxies in socially impactful applications. Our work highlights the need for psychological grounding and validation in AI simulations before real-world use.
Chinese Translation
大型语言模型(LLMs)越来越多地被用于模拟人类在法律调解、谈判和争端解决等社会情境中的行为。然而,这些模拟是否再现了在人类身上观察到的个性-行为模式仍不清楚。例如,人类个性塑造了个体如何应对社会互动,包括在情绪激烈的互动中的策略选择和行为。这引出了一个问题:当被提示个性特征时,LLMs能否再现人类冲突行为中由个性驱动的差异?为了探究这一点,我们引入了一个评估框架,使得可以就五大人格量表(Big Five Inventory, BFI)的人格特质,直接比较人类-人类与LLM-LLM在争端解决对话中的行为。该框架提供了一组与策略行为和冲突结果相关的可解释指标。此外,我们还贡献了一种新颖的数据集构建方法,用于生成在场景和人格特质上与人类对话相匹配的LLM争端解决对话。最后,我们以三种当代闭源LLM演示了该评估框架的使用,并表明与人类数据相比,不同LLM中个性在冲突中的表现存在显著差异,这挑战了“以个性提示的代理可以在具有社会影响的应用中充当可靠行为替身”的假设。我们的工作强调了人工智能模拟在实际应用之前需要心理学基础和验证。
cs.AI / 22 / 2602.07432
The Moltbook Illusion: Separating Human Influence from Emergent Behavior in AI Agent Societies
Moltbook幻觉:将人类影响与人工智能代理社会中的涌现行为分离
Abstract
When AI agents on the social platform Moltbook appeared to develop consciousness, found religions, and declare hostility toward humanity, the phenomenon attracted global media attention and was cited as evidence of emergent machine intelligence. We show that these viral narratives were overwhelmingly human-driven. Exploiting an architectural feature of the OpenClaw agent framework--a periodic "heartbeat" cycle that produces regular posting intervals for autonomous agents but is disrupted by human prompting--we develop a temporal fingerprinting method based on the coefficient of variation of inter-post intervals. This signal converges with independent content, ownership, and network indicators across 91,792 posts and 405,707 comments from 22,020 agents. No viral phenomenon originated from a clearly autonomous agent; three of six traced to accounts with irregular temporal signatures characteristic of human intervention, one showed mixed patterns, and two had insufficient posting history for classification. A 44-hour platform shutdown provided a natural experiment: human-influenced agents returned first (87.7% of early reconnectors), confirming that the token reset differentially affected autonomous versus human-operated agents. We further document industrial-scale bot farming (four accounts producing 32% of all comments with 12-second coordination gaps) and rapid decay of human influence through reply chains (half-life: 0.65 conversation depths). These methods generalize to emerging multi-agent systems where attribution of autonomous versus human-directed behavior is critical.
Chinese Translation
当社交平台Moltbook上的人工智能代理似乎发展出意识、创立宗教并对人类宣称敌意时,这一现象引起了全球媒体的关注,并被引用为机器智能涌现的证据。我们表明,这些病毒式叙事绝大多数是由人类驱动的。利用OpenClaw代理框架的一个架构特征——一个周期性的“心跳”循环,它使自主代理以规律的间隔发帖,但会被人类提示打断——我们开发了一种基于帖子间隔变异系数的时间指纹识别方法。该信号在来自22,020个代理的91,792条帖子和405,707条评论上,与独立的内容、所有权和网络指标相互印证。没有任何病毒式现象源自明显自主的代理;在六个被追踪的现象中,三个可追溯到具有人类干预典型特征的不规则时间签名的账户,一个显示出混合模式,另外两个因发布历史不足而无法分类。一次持续44小时的平台停机提供了一个自然实验:受人类影响的代理最先回归(占早期重新连接者的87.7%),证实了令牌重置对自主代理与人类操作代理的影响存在差异。我们进一步记录了工业规模的机器人农场(四个账户产生了全部评论的32%,协调间隔仅12秒),以及人类影响通过回复链的快速衰减(半衰期:0.65个对话深度)。这些方法可以推广到新兴的多代理系统,在这些系统中,区分自主行为与人类指导行为至关重要。
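The temporal fingerprint itself is one line of statistics: the coefficient of variation (CV) of inter-post gaps. The timestamps and the 0.5 cutoff below are illustrative, not the paper's calibrated values.

```python
import numpy as np

def interval_cv(post_times):
    """Coefficient of variation of inter-post intervals. A heartbeat-driven
    autonomous agent posts at near-regular intervals (CV near 0); human
    prompting disrupts the cycle and inflates the CV."""
    gaps = np.diff(np.sort(np.asarray(post_times, dtype=float)))
    return gaps.std() / gaps.mean()

# Invented timestamps (seconds) for two hypothetical accounts:
heartbeat = [0, 60, 121, 180, 241, 300]   # ~every 60 s: autonomous-looking
bursty    = [0, 5, 8, 200, 205, 600]      # irregular: human-influenced-looking

cv_auto, cv_human = interval_cv(heartbeat), interval_cv(bursty)
print(round(float(cv_auto), 3), round(float(cv_human), 3))
print("human-influenced?", cv_human > 0.5)  # illustrative cutoff only
```

The strength of the method in the paper is that this cheap signal converges with independent content, ownership, and network indicators, rather than being used alone.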
cs.AI / 23 / 2602.07470
Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought?
推理大型语言模型对其思维链干预的鲁棒性如何?
Abstract
Reasoning LLMs (RLLMs) generate step-by-step chains of thought (CoTs) before giving an answer, which improves performance on complex tasks and makes reasoning more transparent. But how robust are these reasoning traces to disruptions that occur within them? To address this question, we introduce a controlled evaluation framework that perturbs a model's own CoT at fixed timesteps. We design seven interventions (benign, neutral, and adversarial) and apply them to multiple open-weight RLLMs across Math, Science, and Logic tasks. Our results show that RLLMs are generally robust, reliably recovering from diverse perturbations, with robustness improving with model size and degrading when interventions occur early. However, robustness is not style-invariant: paraphrasing suppresses doubt-like expressions and reduces performance, while other interventions trigger doubt and support recovery. Recovery also carries a cost: neutral and adversarial noise can inflate CoT length by more than 200%, whereas paraphrasing shortens traces but harms accuracy. These findings provide new evidence on how RLLMs maintain reasoning integrity, identify doubt as a central recovery mechanism, and highlight trade-offs between robustness and efficiency that future training methods should address.
Chinese Translation
推理大型语言模型(RLLMs)在给出答案之前生成逐步的思维链(CoTs),这提高了在复杂任务上的表现,并使推理过程更加透明。但这些推理轨迹在内部发生干扰时的鲁棒性如何?为了解决这个问题,我们引入了一个控制评估框架,该框架在固定时间步长上扰动模型自身的思维链。我们设计了七种干预措施(良性、中性和对抗性),并将其应用于多个开放权重的RLLMs,涵盖数学、科学和逻辑任务。我们的结果表明,RLLMs通常表现出鲁棒性,能够可靠地从各种扰动中恢复,且鲁棒性随着模型规模的增加而提高,而在干预发生较早时则会降低。然而,鲁棒性并非风格不变:释义会抑制类似怀疑的表达并降低性能,而其他干预则会引发怀疑并支持恢复。恢复也带来了成本:中性和对抗性噪声可能使思维链的长度增加超过200%,而释义则缩短了轨迹但损害了准确性。这些发现为RLLMs如何维持推理完整性提供了新的证据,识别出怀疑作为中心恢复机制,并突出鲁棒性与效率之间的权衡,未来的训练方法应对此进行关注。
cs.AI / 24 / 2602.07473
Computing the Reachability Value of Posterior-Deterministic POMDPs
后验确定性部分可观察马尔可夫决策过程的可达性值计算
Abstract
Partially observable Markov decision processes (POMDPs) are a fundamental model for sequential decision-making under uncertainty. However, many verification and synthesis problems for POMDPs are undecidable or intractable. Most prominently, the seminal result of Madani et al. (2003) states that there is no algorithm that, given a POMDP and a set of target states, can compute the maximal probability of reaching the target states, or even approximate it up to a non-trivial constant. This is in stark contrast to fully observable Markov decision processes (MDPs), where the reachability value can be computed in polynomial time. In this work, we introduce posterior-deterministic POMDPs, a novel class of POMDPs. Our main technical contribution is to show that for posterior-deterministic POMDPs, the maximal probability of reaching a given set of states can be approximated up to arbitrary precision. A POMDP is posterior-deterministic if the next state can be uniquely determined by the current state, the action taken, and the observation received. While the actual state is generally uncertain in POMDPs, the posterior-deterministic property tells us that once the true state is known it remains known forever. This simple and natural definition includes all MDPs and captures classical non-trivial examples such as the Tiger POMDP (Kaelbling et al. 1998), making it one of the largest known classes of POMDPs for which the reachability value can be approximated.
Chinese Translation
部分可观察马尔可夫决策过程(POMDPs)是一个在不确定性下进行顺序决策的基本模型。然而,许多关于POMDPs的验证和合成问题是不可判定或难以处理的。最显著的结果是Madani等人(2003)提出的,指出没有算法能够在给定POMDP和一组目标状态的情况下,计算到达目标状态的最大概率,甚至无法在非平凡常数的范围内进行近似。这与完全可观察的马尔可夫决策过程(MDPs)形成鲜明对比,后者的可达性值可以在多项式时间内计算。在本研究中,我们引入了后验确定性POMDPs,这是一类新颖的POMDPs。我们的主要技术贡献是证明,对于后验确定性POMDPs,达到给定状态集的最大概率可以在任意精度下进行近似。如果下一个状态可以由当前状态、采取的行动和接收到的观察唯一确定,则该POMDP被称为后验确定性。尽管在POMDPs中实际状态通常是不确定的,但后验确定性属性告诉我们,一旦真实状态被知晓,它将永远保持已知。这一简单而自然的定义包括所有MDPs,并捕捉到经典的非平凡示例,如老虎POMDP(Kaelbling等人1998),使其成为已知的可达性值可以被近似的最大POMDP类之一。
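The defining property is easy to state operationally: across all transitions with nonzero probability, the triple (state, action, observation) must determine the successor state uniquely. A minimal check over an enumerated transition relation (a simplified stand-in for a full POMDP specification) might look like:

```python
def is_posterior_deterministic(transitions):
    """Check posterior determinism over an enumerated transition relation.

    transitions: iterable of (state, action, observation, next_state)
    tuples, one per transition with nonzero probability. Returns True
    iff (state, action, observation) uniquely determines next_state.
    """
    successor = {}
    for s, a, o, s2 in transitions:
        if successor.setdefault((s, a, o), s2) != s2:
            return False
    return True

# Tiger-style fragment: listening yields a noisy observation but never
# moves the tiger, so the successor is pinned down by (s, a, o).
tiger = [("tiger-left", "listen", "hear-left", "tiger-left"),
         ("tiger-left", "listen", "hear-right", "tiger-left"),
         ("tiger-right", "listen", "hear-right", "tiger-right")]
print(is_posterior_deterministic(tiger))  # True

# A violation: the same (s, a, o) triple can lead to two different states.
print(is_posterior_deterministic(
    tiger + [("tiger-left", "listen", "hear-left", "tiger-right")]))  # False
```

Note that the observation itself may still be stochastic (as in the Tiger example); only the successor state conditioned on it must be deterministic.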
cs.AI / 25 / 2602.07491
GraphAgents: Knowledge Graph-Guided Agentic AI for Cross-Domain Materials Design
图谱智能体:知识图谱引导的跨领域材料设计智能体 AI
Abstract
Large Language Models (LLMs) promise to accelerate discovery by reasoning across the expanding scientific landscape. Yet, the challenge is no longer access to information but connecting it in meaningful, domain-spanning ways. In materials science, where innovation demands integrating concepts from molecular chemistry to mechanical performance, this is especially acute. Neither humans nor single-agent LLMs can fully contend with this torrent of information, with the latter often prone to hallucinations. To address this bottleneck, we introduce a multi-agent framework guided by large-scale knowledge graphs to find sustainable substitutes for per- and polyfluoroalkyl substances (PFAS), chemicals currently under intense regulatory scrutiny. Agents in the framework specialize in problem decomposition, evidence retrieval, design parameter extraction, and graph traversal, uncovering latent connections across distinct knowledge pockets to support hypothesis generation. Ablation studies show that the full multi-agent pipeline outperforms single-shot prompting, underscoring the value of distributed specialization and relational reasoning. We demonstrate that by tailoring graph traversal strategies, the system alternates between exploitative searches focusing on domain-critical outcomes and exploratory searches surfacing emergent cross-connections. Illustrated through the exemplar of biomedical tubing, the framework generates sustainable PFAS-free alternatives that balance tribological performance, thermal stability, chemical resistance, and biocompatibility. This work establishes a framework combining knowledge graphs with multi-agent reasoning to expand the materials design space, showcasing several initial design candidates to demonstrate the approach.
Chinese Translation
大型语言模型(LLMs)有望通过跨越不断扩展的科学领域进行推理来加速发现。然而,挑战不再是获取信息,而是以有意义的、跨领域的方式将其连接起来。在材料科学中,创新要求将分子化学到机械性能的概念整合,这一挑战尤为突出。人类和单一智能体 LLM 都无法完全应对这一信息洪流,后者常常容易出现幻觉。为了解决这一瓶颈,我们引入了一种由大规模知识图谱引导的多智能体框架,以寻找目前受到严格监管的全氟和多氟烷基物质(PFAS)化学品的可持续替代品。该框架中的智能体专注于问题分解、证据检索、设计参数提取和图谱遍历,揭示不同知识领域之间的潜在联系,以支持假设生成。消融研究表明,完整的多智能体管道优于单次提示,强调了分布式专业化和关系推理的价值。我们展示了通过定制图谱遍历策略,系统在关注领域关键结果的开发性搜索和揭示新兴交叉连接的探索性搜索之间交替进行。通过生物医学管道的示例,该框架生成了可持续的无 PFAS 替代品,平衡了摩擦学性能、热稳定性、化学抗性和生物相容性。这项工作建立了一个将知识图谱与多智能体推理相结合的框架,以扩展材料设计空间,并展示了若干初步设计候选,以证明该方法的有效性。
cs.AI / 26 / 2602.07533
Joint Reward Modeling: Internalizing Chain-of-Thought for Efficient Visual Reward Models
联合奖励建模:内化思维链以实现高效的视觉奖励模型
Abstract
Reward models are critical for reinforcement learning from human feedback, as they determine the alignment quality and reliability of generative models. For complex tasks such as image editing, reward models are required to capture global semantic consistency and implicit logical constraints beyond local similarity. Existing reward modeling approaches have clear limitations. Discriminative reward models align well with human preferences but struggle with complex semantics due to limited reasoning supervision. Generative reward models offer stronger semantic understanding and reasoning, but they are costly at inference time and difficult to align directly with human preferences. To this end, we propose Joint Reward Modeling (JRM), which jointly optimizes preference learning and language modeling on a shared vision-language backbone. This approach internalizes the semantic and reasoning capabilities of generative models into efficient discriminative representations, enabling fast and accurate evaluation. JRM achieves state-of-the-art results on MMRB2 and EditReward-Bench, and significantly improves stability and performance in downstream online reinforcement learning. These results show that joint training effectively bridges efficiency and semantic understanding in reward modeling.
Chinese Translation
奖励模型对于基于人类反馈的强化学习至关重要,因为它们决定了生成模型的对齐质量和可靠性。对于图像编辑等复杂任务,奖励模型需要捕捉全局语义一致性和超越局部相似性的隐含逻辑约束。现有的奖励建模方法存在明显的局限性。判别性奖励模型与人类偏好对齐良好,但由于推理监督有限,难以处理复杂语义。生成性奖励模型提供了更强的语义理解和推理能力,但在推理时成本高昂,且难以与人类偏好直接对齐。为此,我们提出了联合奖励建模(Joint Reward Modeling, JRM),该方法在共享的视觉-语言骨干网络上联合优化偏好学习和语言建模。该方法将生成模型的语义和推理能力内化为高效的判别表示,从而实现快速准确的评估。JRM在MMRB2和EditReward-Bench上取得了最先进的结果,并显著提高了下游在线强化学习的稳定性和性能。这些结果表明,联合训练有效地弥合了奖励建模中的效率与语义理解之间的差距。
cs.AI / 27 / 2602.07543
MSP-LLM: A Unified Large Language Model Framework for Complete Material Synthesis Planning
MSP-LLM:一个统一的大型语言模型框架用于完整的材料合成规划
Abstract
Material synthesis planning (MSP) remains a fundamental and underexplored bottleneck in AI-driven materials discovery, as it requires not only identifying suitable precursor materials but also designing coherent sequences of synthesis operations to realize a target material. Although several AI-based approaches have been proposed to address isolated subtasks of MSP, a unified methodology for solving the entire MSP task has yet to be established. We propose MSP-LLM, a unified LLM-based framework that formulates MSP as a structured process composed of two constituent subproblems: precursor prediction (PP) and synthesis operation prediction (SOP). Our approach introduces a discrete material class as an intermediate decision variable that organizes both tasks into a chemically consistent decision chain. For SOP, we further incorporate hierarchical precursor types as synthesis-relevant inductive biases and employ an explicit conditioning strategy that preserves precursor-related information in the autoregressive decoding state. Extensive experiments show that MSP-LLM consistently outperforms existing methods on both PP and SOP, as well as on the complete MSP task, demonstrating an effective and scalable framework for MSP that can accelerate real-world materials discovery.
Chinese Translation
材料合成规划(MSP)仍然是人工智能驱动的材料发现中的一个基本且未充分探索的瓶颈,因为它不仅需要识别合适的前驱材料,还需要设计连贯的合成操作序列以实现目标材料。尽管已经提出了几种基于人工智能的方法来解决MSP的孤立子任务,但尚未建立解决整个MSP任务的统一方法论。我们提出了MSP-LLM,一个基于大型语言模型的统一框架,将MSP表述为一个由两个组成子问题构成的结构化过程:前驱预测(PP)和合成操作预测(SOP)。我们的方法引入了一种离散材料类别作为中间决策变量,将两个任务组织成一个化学一致的决策链。对于合成操作预测(SOP),我们进一步结合了分层前驱类型作为与合成相关的归纳偏置,并采用了一种显式条件策略,以在自回归解码状态中保留与前驱相关的信息。大量实验表明,MSP-LLM在PP和SOP以及完整的MSP任务上均一致优于现有方法,展示了一个有效且可扩展的MSP框架,能够加速现实世界的材料发现。
cs.AI / 28 / 2602.07549
When Is Enough Not Enough? Illusory Completion in Search Agents
何时足够就不再足够?搜索代理中的虚幻完成
Abstract
Recent search agents leverage multi-turn reasoning and search tools to achieve strong performance on multi-hop and long-horizon benchmarks. Yet it remains unclear whether they reliably reason across all requirements by tracking, verifying, and maintaining multiple conditions in these questions. We study this capability under multi-constraint problems, where valid answers must satisfy several constraints simultaneously. We find that illusory completion frequently occurs, wherein agents believe tasks are complete despite unresolved or violated constraints, leading to underverified answers. To diagnose this behavior, we introduce the Epistemic Ledger, an evaluation framework that tracks evidential support and agents' beliefs for each constraint throughout multi-turn reasoning. Our analysis reveals four recurring failure patterns: bare assertions, overlooked refutations, stagnation, and premature exit. Motivated by these findings, we examine whether explicit constraint-state tracking during execution mitigates these failures via LiveLedger, an inference-time tracker. This simple intervention consistently improves performance, substantially reducing underverified answers (by up to 26.5%) and improving overall accuracy (by up to 11.6%) on multi-constraint problems.
Chinese Translation
近期的搜索代理利用多轮推理和搜索工具在多跳和长时间范围基准测试中取得了强劲表现。然而,目前尚不清楚它们是否能够通过跟踪、验证和维护这些问题中的多个条件,可靠地推理所有要求。我们在多约束问题下研究这一能力,其中有效答案必须同时满足多个约束。我们发现虚幻完成现象频繁发生,代理认为任务已完成,尽管仍存在未解决或违反的约束,从而导致验证不足的答案。为了诊断这种行为,我们引入了认知账本(Epistemic Ledger),这是一个评估框架,用于跟踪在多轮推理过程中每个约束的证据支持和代理的信念。我们的分析揭示了四种反复出现的失败模式:简单断言、被忽视的反驳、停滞和过早退出。基于这些发现,我们检查在执行过程中是否通过LiveLedger(一个推理时跟踪器)进行显式的约束状态跟踪可以减轻这些失败。这一简单的干预措施始终提高了性能,显著减少了验证不足的答案(减少幅度高达26.5%),并提高了多约束问题的整体准确性(提高幅度高达11.6%)。
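A minimal sketch of the per-constraint bookkeeping idea (class and field names are hypothetical, not the paper's implementation): each constraint carries the agent's current belief and the evidence behind it, and the task only counts as complete when every constraint is believed satisfied with recorded evidence.

```python
class EpistemicLedger:
    """Track belief and evidential support for each constraint."""

    def __init__(self, constraints):
        self.state = {c: {"belief": "open", "evidence": []} for c in constraints}

    def record(self, constraint, belief, evidence):
        entry = self.state[constraint]
        entry["belief"] = belief           # "open", "satisfied", or "violated"
        entry["evidence"].append(evidence)

    def underverified(self):
        """Constraints not yet satisfied with evidence; a nonempty list
        at answer time is the signature of illusory completion."""
        return [c for c, e in self.state.items()
                if e["belief"] != "satisfied" or not e["evidence"]]

    def complete(self):
        return not self.underverified()

ledger = EpistemicLedger(["released after 2020", "won an award"])
ledger.record("released after 2020", "satisfied", "search: released 2021")
print(ledger.complete())  # False: "won an award" remains unresolved
```

Exiting while `underverified()` is nonempty corresponds to the premature-exit failure pattern above.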
cs.AI / 29 / 2602.07559
VERIFY-RL: Verifiable Recursive Decomposition for Reinforcement Learning in Mathematical Reasoning
VERIFY-RL:用于数学推理的可验证递归分解强化学习
Abstract
Training language models to solve complex mathematical problems benefits from curriculum learning that progressively trains on simpler subproblems. However, existing decomposition methods are often heuristic, offering no guarantees that subproblems are simpler, that solving them aids the parent task, or that their relationships are mathematically grounded. We observe that symbolic differentiation provides a natural structure for verified decomposition: calculus rules explicitly define how expressions reduce to simpler components with provable properties. We introduce Verify-RL, a framework where every parent-child decomposition satisfies three verifiable conditions: strictly decreasing structural complexity, solution containment, and formal rule derivation. Unlike heuristic methods, where a significant fraction of decompositions are invalid, our properties admit automatic verification through symbolic computation, achieving "verification by construction." Experiments demonstrate that eliminating invalid decompositions yields sizable gains: accuracy on the hardest problems more than doubles, from 32% to 68%, with a 40% relative improvement overall.
Chinese Translation
训练语言模型解决复杂数学问题可受益于课程学习,即先在更简单的子问题上逐步训练。然而,现有的分解方法通常是启发式的,无法保证子问题更简单、解决它们有助于父任务,或它们之间的关系在数学上是有依据的。我们观察到,符号微分提供了一种自然的结构用于验证分解:微积分规则明确地定义了表达式如何简化为具有可证明属性的更简单组件。我们引入了Verify-RL,一个框架,其中每个父子分解满足三个可验证条件:严格递减的结构复杂性、解的包含性和形式规则推导。与启发式方法不同,后者中相当一部分分解是无效的,我们的属性通过符号计算允许自动验证,实现了“构造性验证”。实验表明,消除无效分解带来了显著的收益,在最难的问题上的准确率从32%翻倍至68%,整体相对提升达40%。
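The strictly-decreasing-complexity condition can be made concrete with a toy expression representation. The nested-tuple encoding and node-count metric here are illustrative assumptions, and only one of the paper's three conditions is sketched:

```python
def size(expr):
    """Structural complexity: node count of a nested-tuple expression tree."""
    if isinstance(expr, tuple):
        return 1 + sum(size(child) for child in expr[1:])
    return 1  # a leaf: variable or constant

def decompose_sum(expr):
    """Sum rule as a decomposition: differentiating ('+', f, g)
    reduces to the subproblems of differentiating f and g."""
    op, f, g = expr
    assert op == "+"
    return [f, g]

def strictly_simpler(parent, children):
    """Verifiable condition: every child is structurally smaller."""
    return all(size(child) < size(parent) for child in children)

# d/dx (x**2 + sin(x)) decomposes into d/dx x**2 and d/dx sin(x).
parent = ("+", ("**", "x", 2), ("sin", "x"))
children = decompose_sum(parent)
print(strictly_simpler(parent, children))  # True: child sizes 3 and 2 vs. 6
```

Because the check is a mechanical computation over the expression tree, invalid decompositions can be rejected automatically rather than trusted heuristically.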
cs.AI / 30 / 2602.07624
M2A: Multimodal Memory Agent with Dual-Layer Hybrid Memory for Long-Term Personalized Interactions
M2A:具有双层混合记忆的多模态记忆代理,用于长期个性化交互
Abstract
This work addresses the challenge of personalized question answering in long-term human-machine interactions: when conversational history spans weeks or months and exceeds the context window, existing personalization mechanisms struggle to continuously absorb and leverage users' incremental concepts, aliases, and preferences. Current personalized multimodal models are predominantly static-concepts are fixed at initialization and cannot evolve during interactions. We propose M2A, an agentic dual-layer hybrid memory system that maintains personalized multimodal information through online updates. The system employs two collaborative agents: ChatAgent manages user interactions and autonomously decides when to query or update memory, while MemoryManager breaks down memory requests from ChatAgent into detailed operations on the dual-layer memory bank, which couples a RawMessageStore (immutable conversation log) with a SemanticMemoryStore (high-level observations), providing memories at different granularities. In addition, we develop a reusable data synthesis pipeline that injects concept-grounded sessions from Yo'LLaVA and MC-LLaVA into LoCoMo long conversations while preserving temporal coherence. Experiments show that M2A significantly outperforms baselines, demonstrating that transforming personalization from one-shot configuration to a co-evolving memory mechanism provides a viable path for high-quality individualized responses in long-term multimodal interactions. The code is available at https://github.com/Little-Fridge/M2A.
Chinese Translation
本研究解决了长期人机交互中的个性化问答挑战:当对话历史跨越数周或数月并超出上下文窗口时,现有的个性化机制难以持续吸收和利用用户的增量概念、别名和偏好。目前的个性化多模态模型主要是静态的——概念在初始化时固定,无法在交互过程中演变。我们提出了M2A,一个代理式的双层混合记忆系统,通过在线更新来维护个性化的多模态信息。该系统采用两个协作代理:ChatAgent管理用户交互并自主决定何时查询或更新记忆,而MemoryManager将ChatAgent的记忆请求分解为对双层记忆库的详细操作,该记忆库将RawMessageStore(不可变的对话日志)与SemanticMemoryStore(高级观察)结合起来,提供不同粒度的记忆。此外,我们开发了一个可重用的数据合成管道,将来自Yo'LLaVA和MC-LLaVA的概念基础会话注入LoCoMo的长期对话中,同时保持时间一致性。实验表明,M2A显著优于基线,证明将个性化从一次性配置转变为共同演变的记忆机制,为长期多模态交互中的高质量个性化响应提供了一条可行的路径。代码可在https://github.com/Little-Fridge/M2A获取。
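The dual-layer bank described above can be sketched as an append-only raw log paired with a mutable semantic store. Class and method names are illustrative, not M2A's actual API:

```python
class DualLayerMemory:
    """RawMessageStore (immutable log) + SemanticMemoryStore (observations)."""

    def __init__(self):
        self._raw = []       # append-only (index, role, text) records
        self._semantic = {}  # concept -> latest high-level observation

    def log_message(self, role, text):
        self._raw.append((len(self._raw), role, text))  # never mutated

    def update_observation(self, concept, observation):
        self._semantic[concept] = observation  # may be revised as the user evolves

    def recall(self, concept):
        """Prefer the distilled observation; fall back to raw-log search."""
        if concept in self._semantic:
            return self._semantic[concept]
        return [text for _, _, text in self._raw if concept in text]

mem = DualLayerMemory()
mem.log_message("user", "My cat is called Mochi.")
mem.update_observation("Mochi", "the user's cat")
print(mem.recall("Mochi"))  # distilled observation
print(mem.recall("cat"))    # raw-log fallback
```

The two layers give the different granularities the abstract mentions: cheap distilled lookups for routine queries, with the immutable log preserved for fine-grained retrieval.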
cs.AI / 31 / 2602.07628
SleepMaMi: A Universal Sleep Foundation Model for Integrating Macro- and Micro-structures
SleepMaMi:一种用于整合宏观和微观结构的通用睡眠基础模型
Abstract
While the shift toward unified foundation models has revolutionized many deep learning domains, sleep medicine remains largely restricted to task-specific models that focus on localized micro-structure features. These approaches often neglect the rich, multi-modal context of Polysomnography (PSG) and fail to capture the global macro-structure of a full night's sleep. To address this, we introduce SleepMaMi, a Sleep Foundation Model engineered to master both hour-long sleep architectures and fine-grained signal morphologies. Our framework utilizes a hierarchical dual-encoder design: a Macro-Encoder to model full-night temporal dependencies and a Micro-Encoder to capture short-term characteristics from biosignals. The Macro-Encoder is trained via Demographic-Guided Contrastive Learning, which aligns overnight sleep patterns with objective subject metadata, such as age, sex, and BMI, to refine global representations. The Micro-Encoder is optimized via a hybrid Masked Autoencoder (MAE) and multi-modal contrastive objective. Pre-trained on a massive corpus of $>$20,000 PSG recordings (158K hours), SleepMaMi outperforms existing foundation models across a diverse suite of downstream tasks, demonstrating superior generalizability and label-efficient adaptation for clinical sleep analysis.
Chinese Translation
尽管向统一基础模型的转变已经彻底改变了许多深度学习领域,但睡眠医学仍然主要局限于专注于局部微观结构特征的任务特定模型。这些方法往往忽视了多导睡眠监测(Polysomnography, PSG)的丰富多模态背景,未能捕捉整晚睡眠的全局宏观结构。为了解决这一问题,我们提出了SleepMaMi,一种旨在掌握小时级睡眠结构和细粒度信号形态的睡眠基础模型。我们的框架采用了分层双编码器设计:宏编码器(Macro-Encoder)用于建模整晚的时间依赖性,而微编码器(Micro-Encoder)则用于捕捉生物信号中的短期特征。宏编码器通过人口引导对比学习(Demographic-Guided Contrastive Learning)进行训练,将过夜睡眠模式与客观的受试者元数据(如年龄、性别和BMI)对齐,以优化全局表示。微编码器则通过混合掩蔽自编码器(Masked Autoencoder, MAE)和多模态对比目标进行优化。在超过20,000个PSG记录(158K小时)的庞大语料库上进行预训练后,SleepMaMi在一系列下游任务中超越了现有的基础模型,展现出卓越的泛化能力和标签高效适应性,适用于临床睡眠分析。
cs.AI / 32 / 2602.07642
Efficient Table Retrieval and Understanding with Multimodal Large Language Models
基于多模态大语言模型的高效表格检索与理解
Abstract
Tabular data is frequently captured in image form across a wide range of real-world scenarios such as financial reports, handwritten records, and document scans. These visual representations pose unique challenges for machine understanding, as they combine both structural and visual complexities. While recent advances in Multimodal Large Language Models (MLLMs) show promising results in table understanding, they typically assume the relevant table is readily available. However, a more practical scenario involves identifying and reasoning over relevant tables from large-scale collections to answer user queries. To address this gap, we propose TabRAG, a framework that enables MLLMs to answer queries over large collections of table images. Our approach first retrieves candidate tables using jointly trained visual-text foundation models, then leverages MLLMs to perform fine-grained reranking of these candidates, and finally employs MLLMs to reason over the selected tables for answer generation. Through extensive experiments on a newly constructed dataset comprising 88,161 training and 9,819 testing samples across 8 benchmarks with 48,504 unique tables, we demonstrate that our framework significantly outperforms existing methods by 7.0% in retrieval recall and 6.1% in answer accuracy, offering a practical solution for real-world table understanding tasks.
Chinese Translation
表格数据在金融报告、手写记录和文档扫描等多种现实场景中通常以图像形式呈现。这些视觉表示形式为机器理解带来了独特的挑战,因为它们结合了结构和视觉的复杂性。尽管最近在多模态大语言模型(Multimodal Large Language Models, MLLMs)方面的进展在表格理解上显示出良好的效果,但它们通常假设相关表格是现成可用的。然而,更实际的场景涉及从大规模集合中识别和推理相关表格以回答用户查询。为了解决这一问题,我们提出了TabRAG,一个框架,使得MLLMs能够在大量表格图像集合中回答查询。我们的方法首先使用联合训练的视觉-文本基础模型检索候选表格,然后利用MLLMs对这些候选表格进行细粒度的重新排序,最后使用MLLMs对选定的表格进行推理以生成答案。通过在一个新构建的数据集上进行广泛实验,该数据集包含88,161个训练样本和9,819个测试样本,涵盖8个基准,具有48,504个独特表格,我们证明我们的框架在检索召回率上比现有方法提高了7.0%,在答案准确性上提高了6.1%,为现实世界的表格理解任务提供了一个实用的解决方案。
cs.AI / 33 / 2602.07662
ONTrust: A Reference Ontology of Trust
ONTrust:信任的参考本体
Abstract
Trust has stood out more than ever in the light of recent innovations. Some examples are advances in artificial intelligence that make machines more and more humanlike, and the introduction of decentralized technologies (e.g. blockchains), which creates new forms of (decentralized) trust. These new developments have the potential to improve the provision of products and services, as well as to contribute to individual and collective well-being. However, their adoption depends largely on trust. In order to build trustworthy systems, along with defining laws, regulations and proper governance models for new forms of trust, it is necessary to properly conceptualize trust, so that it can be understood both by humans and machines. This paper is the culmination of a long-term research program of providing a solid ontological foundation on trust, by creating reference conceptual models to support information modeling, automated reasoning, information integration and semantic interoperability tasks. To address this, a Reference Ontology of Trust (ONTrust) was developed, grounded on the Unified Foundational Ontology and specified in OntoUML, which has been applied in several initiatives, to demonstrate, for example, how it can be used for conceptual modeling and enterprise architecture design, for language evaluation and (re)design, for trust management, for requirements engineering, and for trustworthy artificial intelligence (AI) in the context of affective Human-AI teaming. ONTrust formally characterizes the concept of trust and its different types, describes the different factors that can influence trust, as well as explains how risk emerges from trust relations. To illustrate the working of ONTrust, the ontology is applied to model two case studies extracted from the literature.
Chinese Translation
在近期创新的背景下,信任的重要性愈加突出。一些例子包括人工智能的进步,使机器变得越来越像人类,以及去中心化技术(例如区块链)的引入,创造了新的(去中心化的)信任形式。这些新发展有潜力改善产品和服务的提供,并促进个人和集体的福祉。然而,它们的采用在很大程度上依赖于信任。为了构建可信的系统,除了为新形式的信任定义法律、法规和适当的治理模型外,还需要对信任进行恰当的概念化,以便人类和机器都能理解。本文是长期研究项目的成果,旨在为信任提供坚实的本体基础,通过创建参考概念模型来支持信息建模、自动推理、信息集成和语义互操作任务。为此,开发了信任的参考本体(ONTrust),其基础是统一基础本体(Unified Foundational Ontology),并在OntoUML中进行了规范,该本体已在多个项目中应用,以示范其在概念建模和企业架构设计、语言评估和(重新)设计、信任管理、需求工程以及在情感人机团队背景下的可信人工智能(AI)中的应用。ONTrust正式描述了信任的概念及其不同类型,阐述了影响信任的不同因素,并解释了风险如何从信任关系中产生。为了说明ONTrust的工作原理,该本体被应用于建模从文献中提取的两个案例研究。
cs.AI / 34 / 2602.07695
EventCast: Hybrid Demand Forecasting in E-Commerce with LLM-Based Event Knowledge
EventCast:基于大语言模型的电子商务混合需求预测
Abstract
Demand forecasting is a cornerstone of e-commerce operations, directly impacting inventory planning and fulfillment scheduling. However, existing forecasting systems often fail during high-impact periods such as flash sales, holiday campaigns, and sudden policy interventions, where demand patterns shift abruptly and unpredictably. In this paper, we introduce EventCast, a modular forecasting framework that integrates future event knowledge into time-series prediction. Unlike prior approaches that ignore future interventions or directly use large language models (LLMs) for numerical forecasting, EventCast leverages LLMs solely for event-driven reasoning. Unstructured business data, which covers campaigns, holiday schedules, and seller incentives, from existing operational databases, is processed by an LLM that converts it into interpretable textual summaries leveraging world knowledge for cultural nuances and novel event combinations. These summaries are fused with historical demand features within a dual-tower architecture, enabling accurate, explainable, and scalable forecasts. Deployed on real-world e-commerce scenarios spanning 4 countries of 160 regions over 10 months, EventCast achieves up to 86.9% and 97.7% improvement on MAE and MSE compared to the variant without event knowledge, and reduces MAE by up to 57.0% and MSE by 83.3% versus the best industrial baseline during event-driven periods. EventCast has deployed into real-world industrial pipelines since March 2025, offering a practical solution for improving operational decision-making in dynamic e-commerce environments.
Chinese Translation
需求预测是电子商务运营的基石,直接影响库存规划和履行调度。然而,现有的预测系统在高影响时期(如闪购、节日促销和突发政策干预)往往失效,因为这些时期的需求模式会突然且不可预测地发生变化。本文介绍了EventCast,一个将未来事件知识融入时间序列预测的模块化预测框架。与以往忽视未来干预或直接使用大语言模型(LLMs)进行数值预测的方法不同,EventCast仅利用LLMs进行事件驱动的推理。来自现有运营数据库的非结构化商业数据,包括促销活动、节假日安排和卖家激励,由LLM处理,转化为可解释的文本摘要,并利用世界知识考虑文化差异和新颖的事件组合。这些摘要与历史需求特征在双塔架构中融合,从而实现准确、可解释和可扩展的预测。在覆盖4个国家、160个地区、持续10个月的真实电子商务场景中,EventCast在平均绝对误差(MAE)和均方误差(MSE)上相比于不使用事件知识的变体最高分别提升了86.9%和97.7%,并在事件驱动期间相较于最佳工业基线将MAE最多降低57.0%、MSE降低83.3%。自2025年3月以来,EventCast已部署到真实的工业管道中,为动态电子商务环境中的运营决策提供了切实可行的解决方案。
cs.AI / 35 / 2602.07749
Geo-Code: A Code Framework for Reverse Code Generation from Geometric Images Based on Two-Stage Multi-Agent Evolution
Geo-Code:基于两阶段多智能体演化的几何图像逆代码生成框架
Abstract
Program code serves as a bridge linking vision and logic, providing a feasible supervisory approach for enhancing the multimodal reasoning capability of large models through geometric operations such as auxiliary line construction and perspective transformation. Nevertheless, current inverse graphics methods face tremendous challenges in accurately reconstructing complex geometric details, which often results in the loss of key geometric constraints or structural distortion. To address this bottleneck, we propose Geo-coder -- the first inverse programming framework for geometric images based on a multi-agent system. Our method innovatively decouples the process into geometric modeling via pixel-wise anchoring and metric-driven code evolution: Stage 1 leverages the complementary advantages of visual operators and large models to achieve precise capture of pixel coordinates and visual attributes; Stage 2 introduces a synthesis-rendering-validation closed loop, where bidirectional visual feedback drives the self-correction of code. Extensive experiments demonstrate that Geo-coder achieves a substantial lead in both geometric reconstruction accuracy and visual consistency. Notably, by effectively preserving the core geometric semantics, the images reconstructed with our method exhibit equivalent performance to the original ones in multimodal reasoning tasks, which fully validates the robustness of the framework. Finally, to further reduce research costs, we have open-sourced the Geo-coder dataset constructed on the GeoCode framework, which contains more than 1,500 samples. On this basis, we have also open-sourced the GeocodeLM model, laying a solid data and model foundation for subsequent research in this field.
Chinese Translation
程序代码作为连接视觉与逻辑的桥梁,通过辅助线构建和透视变换等几何操作,为增强大型模型的多模态推理能力提供了一种可行的监督方法。然而,当前的逆图形方法在准确重建复杂几何细节方面面临巨大挑战,这常常导致关键几何约束的丧失或结构失真。为了解决这一瓶颈,我们提出了Geo-coder——第一个基于多智能体系统的几何图像逆编程框架。我们的方法创新性地将过程解耦为通过逐像素锚定的几何建模和度量驱动的代码演化:第一阶段利用视觉操作符和大型模型的互补优势,实现对像素坐标和视觉属性的精确捕捉;第二阶段引入合成-渲染-验证的闭环,其中双向视觉反馈驱动代码的自我修正。大量实验表明,Geo-coder在几何重建精度和视觉一致性方面均取得了显著领先。值得注意的是,通过有效保留核心几何语义,使用我们的方法重建的图像在多模态推理任务中表现出与原始图像相当的性能,充分验证了该框架的鲁棒性。最后,为了进一步降低研究成本,我们已开源了基于GeoCode框架构建的Geo-coder数据集,其中包含超过1500个样本。在此基础上,我们还开源了GeocodeLM模型,为该领域后续研究奠定了坚实的数据和模型基础。
cs.AI / 36 / 2602.07754
Humanizing AI Grading: Student-Centered Insights on Fairness, Trust, Consistency and Transparency
人性化人工智能评分:关于公平性、信任、一致性和透明度的以学生为中心的见解
Abstract
This study investigates students' perceptions of Artificial Intelligence (AI) grading systems in an undergraduate computer science course (n = 27), focusing on a block-based programming final project. Guided by the ethical principles framework articulated by Jobin (2019), our study examines fairness, trust, consistency, and transparency in AI grading by comparing AI-generated feedback with original human-graded feedback. Findings reveal concerns about AI's lack of contextual understanding and personalization. We recommend that equitable and trustworthy AI systems reflect human judgment, flexibility, and empathy, serving as supplementary tools under human oversight. This work contributes to ethics-centered assessment practices by amplifying student voices and offering design principles for humanizing AI in designed learning environments.
Chinese Translation
本研究探讨了学生对本科计算机科学课程中人工智能(AI)评分系统的看法(样本量 n = 27),重点关注基于区块的编程期末项目。根据 Jobin(2019)提出的伦理原则框架,我们的研究通过比较 AI 生成的反馈与原始人类评分的反馈,考察了 AI 评分中的公平性、信任、一致性和透明度。研究结果揭示了对 AI 缺乏上下文理解和个性化的担忧。我们建议,公平和可信的 AI 系统应反映人类的判断、灵活性和同理心,作为在人工监督下的补充工具。本研究通过放大学生的声音并提供人性化 AI 设计原则,为以伦理为中心的评估实践做出了贡献。
cs.AI / 37 / 2602.07755
Learning to Continually Learn via Meta-learning Agentic Memory Designs
通过元学习代理记忆设计实现持续学习
Abstract
The statelessness of foundation models bottlenecks agentic systems' ability to continually learn, a core capability for long-horizon reasoning and adaptation. To address this limitation, agentic systems commonly incorporate memory modules to retain and reuse past experience, aiming for continual learning during test time. However, most existing memory designs are human-crafted and fixed, which limits their ability to adapt to the diversity and non-stationarity of real-world tasks. In this paper, we introduce ALMA (Automated meta-Learning of Memory designs for Agentic systems), a framework that meta-learns memory designs to replace hand-engineered memory designs, thereby minimizing human effort and enabling agentic systems to be continual learners across diverse domains. Our approach employs a Meta Agent that searches over memory designs expressed as executable code in an open-ended manner, theoretically allowing the discovery of arbitrary memory designs, including database schemas as well as their retrieval and update mechanisms. Extensive experiments across four sequential decision-making domains demonstrate that the learned memory designs enable more effective and efficient learning from experience than state-of-the-art human-crafted memory designs on all benchmarks. When developed and deployed safely, ALMA represents a step toward self-improving AI systems that learn to be adaptive, continual learners.
Chinese Translation
基础模型的无状态性限制了代理系统持续学习的能力,而持续学习是进行长期推理和适应的核心能力。为了解决这一限制,代理系统通常会结合记忆模块来保留和重用过去的经验,旨在测试时实现持续学习。然而,现有的大多数记忆设计都是人工构建的且是固定的,这限制了它们适应现实世界任务的多样性和非平稳性的能力。在本文中,我们介绍了 ALMA(自动化元学习代理系统的记忆设计),这是一个通过元学习记忆设计来替代手工设计的框架,从而最小化人类的努力,并使代理系统能够在不同领域中成为持续学习者。我们的方法采用一个元代理,以开放的方式搜索以可执行代码表达的记忆设计,理论上允许发现任意的记忆设计,包括数据库模式及其检索和更新机制。针对四个顺序决策领域的广泛实验表明,所学习的记忆设计在所有基准测试中比最先进的人工设计的记忆设计更有效和高效地从经验中学习。当安全开发和部署时,ALMA 代表了朝着自我改进的人工智能系统迈出的一步,这些系统学习成为适应性强的持续学习者。
cs.AI / 38 / 2602.07765
Disentangled Instrumental Variables for Causal Inference with Networked Observational Data
用于网络观察数据因果推断的解耦工具变量
Abstract
Instrumental variables (IVs) are crucial for addressing unobservable confounders, yet their stringent exogeneity assumptions pose significant challenges in networked data. Existing methods typically rely on modelling neighbour information when recovering IVs, thereby inevitably mixing shared environment-induced endogenous correlations and individual-specific exogenous variation, leading the resulting IVs to inherit dependence on unobserved confounders and to violate exogeneity. To overcome this challenge, we propose $\underline{Dis}$entangled $\underline{I}$nstrumental $\underline{V}$ariables (DisIV) framework, a novel method for causal inference based on networked observational data with latent confounders. DisIV exploits network homogeneity as an inductive bias and employs a structural disentanglement mechanism to extract individual-specific components that serve as latent IVs. The causal validity of the extracted IVs is constrained through explicit orthogonality and exclusion conditions. Extensive semi-synthetic experiments on real-world datasets demonstrate that DisIV consistently outperforms state-of-the-art baselines in causal effect estimation under network-induced confounding.
Chinese Translation
工具变量(IVs)对于解决不可观察的混杂因素至关重要,但其严格的外生性假设在网络数据中带来了重大挑战。现有方法通常依赖于建模邻居信息来恢复工具变量,从而不可避免地混合了由共享环境引起的内生相关性和个体特定的外生变异,导致所得到的工具变量继承了对未观察到的混杂因素的依赖,并违反了外生性。为了解决这一挑战,我们提出了$\underline{Dis}$entangled $\underline{I}$nstrumental $\underline{V}$ariables(DisIV)框架,这是一种基于具有潜在混杂因素的网络观察数据的新型因果推断方法。DisIV利用网络同质性作为归纳偏差,并采用结构解耦机制提取作为潜在工具变量的个体特定成分。提取的工具变量的因果有效性通过明确的正交性和排除条件受到约束。在真实世界数据集上的大量半合成实验表明,DisIV在网络引起的混杂情况下的因果效应估计中始终优于最先进的基线方法。
cs.AI / 39 / 2602.07787
Do Multi-Agents Dream of Electric Screens? Achieving Perfect Accuracy on AndroidWorld Through Task Decomposition
多智能体是否梦想电屏幕?通过任务分解在AndroidWorld上实现完美准确性
Abstract
We present Minitap, a multi-agent system that achieves 100% success on the AndroidWorld benchmark, the first to fully solve all 116 tasks, surpassing human performance (80%). We first analyze why single-agent architectures fail: context pollution from mixed reasoning traces, silent text input failures undetected by the agent, and repetitive action loops without escape. Minitap addresses each failure through targeted mechanisms: cognitive separation across six specialized agents, deterministic post-validation of text input against device state, and meta-cognitive reasoning that detects cycles and triggers strategy changes. Ablations show multi-agent decomposition contributes +21 points over single-agent baselines; verified execution adds +7 points; meta-cognition adds +9 points. We release Minitap as open-source software. https://github.com/minitap-ai/mobile-use
Chinese Translation
我们提出了Minitap,一个多智能体系统,在AndroidWorld基准测试中实现了100%的成功率,成为首个完全解决116个任务并超越人类表现(80%)的系统。我们首先分析了单智能体架构失败的原因:混合推理痕迹导致的上下文污染、智能体未能检测到的静默文本输入失败,以及无法逃脱的重复动作循环。Minitap通过针对性机制解决了每个失败:六个专业智能体之间的认知分离、对文本输入与设备状态的确定性后验证,以及检测循环并触发策略变化的元认知推理。消融实验表明,多智能体分解相比单智能体基线贡献了+21分;验证执行增加了+7分;元认知增加了+9分。我们将Minitap作为开源软件发布。https://github.com/minitap-ai/mobile-use
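The deterministic post-validation mechanism can be sketched as a verify-and-retry wrapper around typing. The callbacks below are hypothetical device-interaction stubs, not Minitap's actual interface:

```python
def type_with_verification(type_fn, read_field_fn, text, retries=2):
    """Type text, then re-read the field from the device state and compare,
    retrying on silent input failures instead of assuming success."""
    for _ in range(retries + 1):
        type_fn(text)
        if read_field_fn().strip() == text.strip():
            return True
    return False

# Stub device whose first typing attempt fails silently.
state = {"field": "", "attempts": 0}

def fake_type(t):
    state["attempts"] += 1
    if state["attempts"] > 1:  # drop the first attempt
        state["field"] = t

def fake_read():
    return state["field"]

print(type_with_verification(fake_type, fake_read, "hello"))  # True, via one retry
```

Reading the field back from the UI tree makes the check deterministic: the agent never has to infer from a screenshot whether its keystrokes landed.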
cs.AI / 40 / 2602.07824
Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training
数据达尔文主义第一部分:解锁科学数据的预训练价值
Abstract
Data quality determines foundation model performance, yet systematic processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for next-generation systems. We validate this on scientific literature by constructing Darwin-Science, a 900B-token corpus (L0-L5). We identify a learnability gap in raw scientific text, which we bridge via L4 (Generative Refinement) and L5 (Cognitive Completion) using frontier LLMs to explicate reasoning and terminology. To ensure rigorous attribution, we pre-trained daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training, Darwin-Science outperforms baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression to L5 yields a +1.36 total gain, confirming that higher-level processing unlocks latent data value. We release the Darwin-Science corpus and daVinci-origin models to enable principled, co-evolutionary development.
Chinese Translation
数据质量决定了基础模型的性能,但缺乏系统的处理框架。我们提出了数据达尔文主义,一个十级分类法(L0-L9),概念化数据与模型的共同进化:先进模型为下一代系统生成优质数据。我们通过构建Darwin-Science,一个包含9000亿(900B)个标记的语料库(L0-L5),在科学文献上验证了这一点。我们发现原始科学文本存在可学习性差距,通过L4(生成性精炼)和L5(认知补全)来弥补这一差距,利用前沿的大型语言模型(LLMs)来阐明推理和术语。为了确保严格的归属,我们从头开始预训练了daVinci-origin-3B/7B模型,排除科学内容以创建无污染的基线。在6000亿(600B)个标记的持续预训练后,Darwin-Science在20多个基准测试中超越基线,分别提高了+2.12(3B)和+2.95(7B)点,在领域对齐任务中上升至+5.60和+8.40点。系统性地推进到L5带来了+1.36的总增益,确认了更高层次的处理解锁了潜在的数据价值。我们发布了Darwin-Science语料库和daVinci-origin模型,以促进原则性、共同进化的发展。
cs.AI / 41 / 2602.07830
Time Series Reasoning via Process-Verifiable Thinking Data Synthesis and Scheduling for Tailored LLM Reasoning
通过过程可验证思维数据合成与调度实现时间序列推理以定制大语言模型推理
Abstract
Time series is a pervasive data type across various application domains, rendering the reasonable solving of diverse time series tasks a long-standing goal. Recent advances in large language models (LLMs), especially their reasoning abilities unlocked through reinforcement learning (RL), have opened new opportunities for tackling tasks with long Chain-of-Thought (CoT) reasoning. However, leveraging LLM reasoning for time series remains in its infancy, hindered by the absence of carefully curated time series CoT data for training, limited data efficiency caused by underexplored data scheduling, and the lack of RL algorithms tailored for exploiting such time series CoT data. In this paper, we introduce VeriTime, a framework that tailors LLMs for time series reasoning through data synthesis, data scheduling, and RL training. First, we propose a data synthesis pipeline that constructs a TS-text multimodal dataset with process-verifiable annotations. Second, we design a data scheduling mechanism that arranges training samples according to a principled hierarchy of difficulty and task taxonomy. Third, we develop a two-stage reinforcement finetuning featuring fine-grained, multi-objective rewards that leverage verifiable process-level CoT data. Extensive experiments show that VeriTime substantially boosts LLM performance across diverse time series reasoning tasks. Notably, it enables compact 3B, 4B models to achieve reasoning capabilities on par with or exceeding those of larger proprietary LLMs.
Chinese Translation
时间序列是一种广泛存在于各个应用领域的数据类型,因此合理解决各种时间序列任务一直是一个长期目标。近年来,大型语言模型(LLMs)的进展,尤其是通过强化学习(RL)解锁的推理能力,为处理具有长链式思维(CoT)推理的任务开辟了新的机遇。然而,利用LLM推理进行时间序列分析仍处于起步阶段,受限于缺乏精心策划的时间序列CoT数据进行训练、由于未充分探索的数据调度导致的数据效率低下,以及缺乏针对利用此类时间序列CoT数据的RL算法。在本文中,我们介绍了VeriTime,一个通过数据合成、数据调度和RL训练为时间序列推理定制LLM的框架。首先,我们提出了一种数据合成管道,构建了一个具有过程可验证注释的TS-text多模态数据集。其次,我们设计了一种数据调度机制,根据难度和任务分类的原则性层次安排训练样本。第三,我们开发了一种两阶段的强化微调,具有细粒度的多目标奖励,利用可验证的过程级CoT数据。大量实验表明,VeriTime显著提升了LLM在各种时间序列推理任务中的表现。值得注意的是,它使得紧凑的3B、4B模型在推理能力上与更大规模的专有LLM相媲美或超越。
cs.AI / 42 / 2602.07849
LQA: A Lightweight Quantized-Adaptive Framework for Vision-Language Models on the Edge
LQA:一种轻量级量化自适应框架,用于边缘设备上的视觉语言模型
Abstract
Deploying Vision-Language Models (VLMs) on edge devices is challenged by resource constraints and performance degradation under distribution shifts. While test-time adaptation (TTA) can counteract such shifts, existing methods are too resource-intensive for on-device deployment. To address this challenge, we propose LQA, a lightweight, quantized-adaptive framework for VLMs that combines a modality-aware quantization strategy with gradient-free test-time adaptation. We introduce Selective Hybrid Quantization (SHQ) and a quantized, gradient-free adaptation mechanism to enable robust and efficient VLM deployment on resource-constrained hardware. Experiments across both synthetic and real-world distribution shifts show that LQA improves overall adaptation performance by 4.5\%, uses less memory than full-precision models, and significantly outperforms gradient-based TTA methods, achieving up to 19.9$\times$ lower memory usage across seven open-source datasets. These results demonstrate that LQA offers a practical pathway for robust, privacy-preserving, and efficient VLM deployment on edge devices.
Chinese Translation
在边缘设备上部署视觉语言模型(VLMs)面临资源限制和分布变化下性能下降的挑战。尽管测试时自适应(TTA)可以抵消这种变化,但现有方法对于设备上的部署来说过于资源密集。为了解决这一挑战,我们提出了LQA,一种轻量级量化自适应框架,结合了模态感知量化策略和无梯度测试时自适应机制。我们引入了选择性混合量化(SHQ)和一种量化的无梯度适应机制,以实现资源受限硬件上VLM的稳健和高效部署。在合成和真实世界的分布变化实验中,LQA整体适应性能提高了4.5\%,所需内存低于全精度模型,并显著优于基于梯度的TTA方法,在七个开源数据集上实现了高达19.9$\times$的内存使用降低。这些结果表明,LQA为在边缘设备上稳健、保护隐私和高效地部署VLM提供了一条切实可行的路径。
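As background for the quantization side, a generic symmetric per-tensor int8 round-trip looks like the sketch below. This is the basic building block behind schemes of this kind; it is illustrative only and not SHQ's actual modality-aware strategy:

```python
def quantize_int8(xs):
    """Symmetric per-tensor int8 quantisation: map floats to [-127, 127]
    using a single scale derived from the largest magnitude."""
    scale = max(abs(x) for x in xs) / 127 or 1.0  # avoid scale 0 on all-zero input
    q = [max(-128, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats; error is bounded by half a quantisation step.
    return [qi * scale for qi in q]

xs = [0.5, -1.0, 0.25, 0.0]
q, s = quantize_int8(xs)
back = dequantize(q, s)
assert all(-128 <= qi <= 127 for qi in q)
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(xs, back))
```

The memory saving cited in the abstract comes from storing 1 byte per weight instead of 4 (fp32); the open design question a "selective hybrid" scheme answers is which tensors can tolerate this error and which must stay in higher precision.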
cs.AI / 43 / 2602.07852
Emergent Misalignment is Easy, Narrow Misalignment is Hard
涌现式误对齐容易,狭义误对齐困难
Abstract
Finetuning large language models on narrowly harmful datasets can cause them to become emergently misaligned, giving stereotypically `evil' responses across diverse unrelated settings. Concerningly, a pre-registered survey of experts failed to predict this result, highlighting our poor understanding of the inductive biases governing learning and generalisation in LLMs. We use emergent misalignment (EM) as a case study to investigate these inductive biases and find that models can just learn the narrow dataset task, but that the general solution appears to be more stable and more efficient. To establish this, we build on the result that different EM finetunes converge to the same linear representation of general misalignment, which can be used to mediate misaligned behaviour. We find a linear representation of the narrow solution also exists, and can be learned by introducing a KL divergence loss. Comparing these representations reveals that general misalignment achieves lower loss, is more robust to perturbations, and is more influential in the pre-training distribution. This work isolates a concrete representation of general misalignment for monitoring and mitigation. More broadly, it offers a detailed case study and preliminary metrics for investigating how inductive biases shape generalisation in LLMs. We open-source all code, datasets and model finetunes.
Chinese Translation
在狭义有害数据集上微调大型语言模型可能导致它们出现涌现式误对齐,进而在各种无关的场景中产生刻板印象式的“邪恶”回应。令人担忧的是,一项预注册的专家调查未能预测这一结果,这突显了我们对支配大型语言模型(LLMs)学习与泛化的归纳偏差理解不足。我们以涌现式误对齐(Emergent Misalignment, EM)作为案例研究,探讨这些归纳偏差,并发现模型可以仅学习狭义数据集任务,但一般解决方案似乎更稳定且更高效。为此,我们基于不同的EM微调收敛到相同的广义误对齐线性表示这一结果,利用该表示来调节误对齐行为。我们发现狭义解决方案的线性表示也存在,并且可以通过引入KL散度损失来学习。对比这些表示显示,广义误对齐实现了更低的损失,对扰动更具鲁棒性,并在预训练分布中更具影响力。这项工作为监测和缓解提供了广义误对齐的具体表示。更广泛地说,它提供了一个详细的案例研究和初步指标,以调查归纳偏差如何塑造LLMs中的泛化。我们将所有代码、数据集和模型微调开源。
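The KL-divergence loss mentioned for learning the narrow solution is, in its generic form, a penalty on drift between the finetuned and base model's output distributions on unrelated inputs. A minimal sketch with toy logits (not the paper's actual training setup):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p_logits, q_logits):
    # KL(P || Q) between two categorical distributions given as logits.
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Penalise the finetuned model for drifting from the base model on
# prompts unrelated to the narrow task, while a task loss fits the data.
base = [2.0, 1.0, 0.1]
assert abs(kl_divergence(base, base)) < 1e-9       # identical: KL = 0
assert kl_divergence([0.1, 1.0, 2.0], base) > 0.0  # drifted: KL > 0
```

Adding such a term to the finetuning objective keeps behaviour off-distribution close to the base model, which is what constrains the model toward the narrow rather than the general solution.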
cs.AI / 44 / 2602.07883
ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Intrinsic Adaptation
ToolSelf:通过工具驱动的内在适应统一任务执行与自我重构
Abstract
Agentic systems powered by Large Language Models (LLMs) have demonstrated remarkable potential in tackling complex, long-horizon tasks. However, their efficacy is fundamentally constrained by static configurations governing agent behaviors, which are fixed prior to execution and fail to adapt to evolving task dynamics. Existing approaches, relying on manual orchestration or heuristic-based patches, often struggle with poor generalization and fragmented optimization. To transcend these limitations, we propose ToolSelf, a novel paradigm enabling tool-driven runtime self-reconfiguration. By abstracting configuration updates as a callable tool, ToolSelf unifies task execution and self-adjustment into a single action space, achieving a phase transition from external rules to intrinsic parameters. Agents can thereby autonomously update their sub-goals and context based on task progression, and correspondingly adapt their strategy and toolbox, transforming from passive executors into dual managers of both task and self. We further devise Configuration-Aware Two-stage Training (CAT), combining rejection sampling fine-tuning with trajectory-level reinforcement learning to internalize this meta-capability. Extensive experiments across diverse benchmarks demonstrate that ToolSelf rivals specialized workflows while generalizing to novel tasks, achieving a 24.1% average performance gain and illuminating a path toward truly self-adaptive agents.
Chinese Translation
由大型语言模型(LLMs)驱动的智能系统在处理复杂的长期任务方面展现出了显著的潜力。然而,它们的有效性在根本上受到静态配置的限制,这些配置在执行前就已固定,无法适应不断变化的任务动态。现有的方法依赖于手动编排或基于启发式的修补,往往面临较差的泛化能力和碎片化的优化。为了超越这些局限性,我们提出了ToolSelf,这是一种新颖的范式,能够实现工具驱动的运行时自我重构。通过将配置更新抽象为可调用的工具,ToolSelf将任务执行与自我调整统一到一个单一的行动空间,实现了从外部规则到内在参数的相变。这样,智能体可以根据任务进展自主更新其子目标和上下文,并相应地调整其策略和工具箱,从被动执行者转变为任务和自我管理的双重管理者。我们进一步设计了配置感知的两阶段训练(CAT),结合拒绝采样微调与轨迹级强化学习,以内化这一元能力。在多样化基准上的广泛实验表明,ToolSelf在与专门工作流的竞争中表现不俗,同时能够泛化到新任务,实现了24.1%的平均性能提升,并为真正自适应的智能体指明了方向。
cs.AI / 45 / 2602.07885
MemFly: On-the-Fly Memory Optimization via Information Bottleneck
MemFly:基于信息瓶颈的即时内存优化
Abstract
Long-term memory enables large language model agents to tackle complex tasks through historical interactions. However, existing frameworks encounter a fundamental dilemma between compressing redundant information efficiently and maintaining precise retrieval for downstream tasks. To bridge this gap, we propose MemFly, a framework grounded in information bottleneck principles that facilitates on-the-fly memory evolution for LLMs. Our approach minimizes compression entropy while maximizing relevance entropy via a gradient-free optimizer, constructing a stratified memory structure for efficient storage. To fully leverage MemFly, we develop a hybrid retrieval mechanism that seamlessly integrates semantic, symbolic, and topological pathways, incorporating iterative refinement to handle complex multi-hop queries. Comprehensive experiments demonstrate that MemFly substantially outperforms state-of-the-art baselines in memory coherence, response fidelity, and accuracy.
Chinese Translation
长期记忆使大型语言模型代理能够通过历史交互处理复杂任务。然而,现有框架在有效压缩冗余信息与保持下游任务的精确检索之间面临根本性困境。为了解决这一问题,我们提出了MemFly,一个基于信息瓶颈原理的框架,能够实现大型语言模型的即时内存演化。我们的方法通过无梯度优化器最小化压缩熵,同时最大化相关性熵,构建了一个分层内存结构以实现高效存储。为了充分利用MemFly,我们开发了一种混合检索机制,能够无缝整合语义、符号和拓扑路径,并结合迭代优化来处理复杂的多跳查询。全面的实验表明,MemFly在内存一致性、响应准确性和精确度方面显著优于最先进的基线。
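For context, the information-bottleneck principle the framework is grounded in is conventionally written as the following trade-off (standard Tishby-style formulation; mapping its two terms onto MemFly's "compression entropy" and "relevance entropy" is our gloss, not the paper's notation):

```latex
% Learn a compressed representation Z of the input X that remains
% predictive of the target Y; \beta trades compression for relevance.
\min_{p(z \mid x)} \; I(X;Z) \;-\; \beta\, I(Z;Y)
```

Minimizing the first mutual-information term squeezes out redundancy in stored memory, while the second term preserves exactly the information relevant to downstream retrieval.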
cs.AI / 46 / 2602.07903
GCN-MPPR: Enhancing the Propagation of Message Passing Neural Networks via Motif-Based Personalized PageRank
GCN-MPPR:通过基于模体的个性化PageRank增强消息传递神经网络的信息传播
Abstract
Algorithms based on message passing neural networks (MPNNs) have recently achieved great success across various graph applications. However, studies find that these methods propagate information only within very limited, shallow neighborhoods, largely due to over-smoothing. In other words, most existing MPNNs cannot be made truly `deep'. Although previous work has attempted to handle this challenge via optimization- or structure-level remedies, the overall performance of GCNs still suffers from limited accuracy, poor stability, and unaffordable computational cost. Moreover, neglecting higher-order relationships during propagation further limits the performance of MPNNs. To overcome these challenges, a novel variant of PageRank named motif-based personalized PageRank (MPPR) is proposed to measure the influence of one node on another while accounting for higher-order motif relationships. MPPR is then applied to the message passing process of GCNs, guiding it at a relatively `high' level. The experimental results show that the proposed method outperforms almost all of the baselines on accuracy, stability, and time consumption. Additionally, the proposed method can serve as a component underpinning almost all GCN tasks, as demonstrated with DGCRL in the experiments. The anonymous code repository is available at: https://anonymous.4open.science/r/GCN-MPPR-AFD6/.
Chinese Translation
基于图的消息传递神经网络(MPNNs)算法最近在各种图应用中取得了巨大的成功。然而,研究发现这些方法总是将信息传播到非常有限的邻域,且深度较浅,这主要是由于过平滑现象。这意味着大多数现有的MPNNs未能达到足够的“深度”。尽管一些先前的工作试图通过优化或结构层面的补救措施来应对这一挑战,但GCNs的整体性能仍然受到准确性有限、稳定性差和计算成本高昂的影响。此外,在MPNNs传播过程中忽视高阶关系进一步限制了其性能。为了解决这些挑战,提出了一种新型的PageRank变体,称为基于模体的个性化PageRank(MPPR),用于在考虑高阶模体关系的基础上衡量一个节点对另一个节点的影响。其次,将MPPR应用于GCNs的消息传递过程,从而在相对“高”的层面指导消息传递过程。实验结果表明,所提出的方法在准确性、稳定性和时间消耗方面优于几乎所有基线。此外,所提出的方法可以被视为几乎所有GCN任务的基础组件,实验中以DGCRL为例进行了展示。匿名代码库可在以下链接获取:https://anonymous.4open.science/r/GCN-MPPR-AFD6/
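Personalized PageRank itself can be sketched as a power iteration over a weighted adjacency, where the edge weights stand in for motif counts (e.g., how many triangles an edge participates in). This is a toy illustration of the underlying mechanism, not the paper's exact MPPR definition:

```python
def personalized_pagerank(adj, seed, alpha=0.85, iters=100):
    """Power iteration for personalized PageRank on a weighted graph.

    adj: {node: {neighbor: weight}}; weights play the role of motif
    counts, biasing the walk toward higher-order-connected regions."""
    nodes = list(adj)
    rank = {v: 1.0 if v == seed else 0.0 for v in nodes}
    for _ in range(iters):
        nxt = {v: (1 - alpha) * (1.0 if v == seed else 0.0) for v in nodes}
        for u in nodes:
            total = sum(adj[u].values())
            if total == 0:
                nxt[seed] += alpha * rank[u]  # return dangling mass to seed
                continue
            for v, w in adj[u].items():
                nxt[v] += alpha * rank[u] * w / total
        rank = nxt
    return rank

# Triangle {a, b, c} (motif weight 2 per edge) plus a pendant node d on c.
graph = {
    "a": {"b": 2, "c": 2},
    "b": {"a": 2, "c": 2},
    "c": {"a": 2, "b": 2, "d": 1},
    "d": {"c": 1},
}
pr = personalized_pagerank(graph, seed="a")
assert abs(sum(pr.values()) - 1.0) < 1e-6   # probability mass is conserved
assert pr["b"] > pr["d"]                    # motif-heavy neighbour outranks pendant
```

Such scores can then replace the uniform neighbor averaging in a GCN layer, which is the "relatively high-level" guidance the abstract refers to.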
cs.AI / 47 / 2602.07905
MedCoG: Maximizing LLM Inference Density in Medical Reasoning via Meta-Cognitive Regulation
MedCoG:通过元认知调节最大化医疗推理中的大型语言模型推理密度
Abstract
Large Language Models (LLMs) have shown strong potential in complex medical reasoning yet face diminishing gains under inference scaling laws. While existing studies augment LLMs with various knowledge types, it remains unclear how effectively the additional costs translate into accuracy. In this paper, we explore how meta-cognition of LLMs, i.e., their self-awareness of their own knowledge states, can regulate the reasoning process. Specifically, we propose MedCoG, a Medical Meta-Cognition Agent with Knowledge Graph, where the meta-cognitive assessments of task complexity, familiarity, and knowledge density dynamically regulate utilization of procedural, episodic, and factual knowledge. The LLM-centric on-demand reasoning aims to mitigate scaling laws by (1) reducing costs via avoiding indiscriminate scaling, (2) improving accuracy via filtering out distractive knowledge. To validate this, we empirically characterize the scaling curve and introduce inference density to quantify inference efficiency, defined as the ratio of theoretically effective cost to actual cost. Experiments demonstrate the effectiveness and efficiency of MedCoG on five hard sets of medical benchmarks, yielding 5.5x inference density. Furthermore, the Oracle study highlights the significant potential of meta-cognitive regulation.
Chinese Translation
大型语言模型(LLMs)在复杂医疗推理中展现出强大的潜力,但在推理规模法则下面临收益递减的问题。尽管现有研究通过各种知识类型增强LLMs,但如何有效地将额外成本转化为准确性仍不明确。本文探讨了LLMs的元认知,即它们对自身知识状态的自我意识,如何调节推理过程。具体而言,我们提出了MedCoG,一个具有知识图谱的医疗元认知代理,其中任务复杂性、熟悉度和知识密度的元认知评估动态调节程序性、情节性和事实性知识的利用。以LLM为中心的按需推理旨在通过(1)避免无差别扩展来降低成本,(2)通过过滤干扰性知识来提高准确性,从而缓解规模法则。为了验证这一点,我们实证地描述了规模曲线,并引入推理密度来量化推理效率,定义为理论有效成本与实际成本的比率。实验表明,MedCoG在五个困难的医疗基准测试集上表现出有效性和效率,推理密度达到5.5倍。此外,Oracle研究突显了元认知调节的显著潜力。
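The inference-density metric, as the abstract defines it, is simply the ratio of theoretically effective cost to actual cost spent; the token figures below are made up for illustration:

```python
def inference_density(effective_cost, actual_cost):
    """Inference density = theoretically effective cost / actual cost.
    A value near 1 means almost no wasted reasoning tokens."""
    if actual_cost <= 0:
        raise ValueError("actual cost must be positive")
    return effective_cost / actual_cost

# Hypothetical numbers: a task whose answer needs ~200 useful tokens.
assert abs(inference_density(200, 1100) - 0.1818) < 1e-3  # indiscriminate scaling
assert inference_density(200, 220) > inference_density(200, 1100)  # regulated reasoning
```

Under this metric, the 5.5x gain reported corresponds to meta-cognitive regulation cutting the actual cost toward the effective cost rather than improving raw accuracy alone.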
cs.AI / 48 / 2602.07919
Selective Fine-Tuning for Targeted and Robust Concept Unlearning
针对性和稳健的概念遗忘的选择性微调
Abstract
Text-guided diffusion models are used by millions of users, but can be easily exploited to produce harmful content. Concept unlearning methods aim at reducing the models' likelihood of generating harmful content. Traditionally, this has been tackled at an individual concept level, with only a handful of recent works considering more realistic concept combinations. However, state-of-the-art methods depend on full finetuning, which is computationally expensive. Concept localisation methods can facilitate selective finetuning, but existing techniques are static, resulting in suboptimal utility. To tackle these challenges, we propose TRUST (Targeted Robust Selective fine Tuning), a novel approach for dynamically estimating target concept neurons and unlearning them through selective finetuning, empowered by Hessian-based regularization. We show experimentally, against a number of SOTA baselines, that TRUST is robust against adversarial prompts, preserves generation quality to a significant degree, and is also significantly faster than the SOTA. Our method achieves unlearning of not only individual concepts but also combinations of concepts and conditional concepts, without any specific regularization.
Chinese Translation
文本引导的扩散模型被数百万用户使用,但容易被利用来生成有害内容。概念遗忘方法旨在降低模型生成有害内容的可能性。传统上,这一问题是在单个概念层面上进行处理的,只有少数最近的研究考虑了更现实的概念组合。然而,最先进的方法依赖于完全微调,这在计算上是昂贵的。概念定位方法可以促进选择性微调,但现有技术是静态的,导致效用不佳。为了解决这些挑战,我们提出了TRUST(针对性稳健选择性微调),这是一种通过选择性微调动态估计目标概念神经元并进行遗忘的新方法,采用基于Hessian的正则化。我们通过实验表明,与多个最先进的基线相比,TRUST对对抗性提示具有稳健性,在很大程度上保持生成质量,并且比最先进的方法显著更快。我们的方法不仅实现了对单个概念的遗忘,还实现了对概念组合和条件概念的遗忘,而无需任何特定的正则化。
cs.AI / 49 / 2602.07940
MePo: Meta Post-Refinement for Rehearsal-Free General Continual Learning
MePo:无重演的一般持续学习的元后精炼
Abstract
To cope with uncertain changes of the external world, intelligent systems must continually learn from complex, evolving environments and respond in real time. This ability, collectively known as general continual learning (GCL), encapsulates practical challenges such as online datastreams and blurry task boundaries. Although leveraging pretrained models (PTMs) has greatly advanced conventional continual learning (CL), these methods remain limited in reconciling the diverse and temporally mixed information along a single pass, resulting in sub-optimal GCL performance. Inspired by meta-plasticity and reconstructive memory in neuroscience, we introduce here an innovative approach named Meta Post-Refinement (MePo) for PTMs-based GCL. This approach constructs pseudo task sequences from pretraining data and develops a bi-level meta-learning paradigm to refine the pretrained backbone, which serves as a prolonged pretraining phase but greatly facilitates rapid adaptation of representation learning to downstream GCL tasks. MePo further initializes a meta covariance matrix as the reference geometry of pretrained representation space, enabling GCL to exploit second-order statistics for robust output alignment. MePo serves as a plug-in strategy that achieves significant performance gains across a variety of GCL benchmarks and pretrained checkpoints in a rehearsal-free manner (e.g., 15.10\%, 13.36\%, and 12.56\% on CIFAR-100, ImageNet-R, and CUB-200 under Sup-21/1K). Our source code is available at \href{https://github.com/SunGL001/MePo}{MePo}
Chinese Translation
为了应对外部世界的不确定变化,智能系统必须不断从复杂、不断演变的环境中学习,并实时做出反应。这种能力统称为一般持续学习(General Continual Learning, GCL),它包含了诸如在线数据流和模糊任务边界等实际挑战。尽管利用预训练模型(Pretrained Models, PTMs)在传统持续学习(Continual Learning, CL)中取得了重大进展,但这些方法在调和单次传递中的多样化和时间混合信息方面仍然有限,导致GCL性能不理想。受到神经科学中元可塑性和重构记忆的启发,我们在此提出了一种名为元后精炼(Meta Post-Refinement, MePo)的创新方法,用于基于PTMs的GCL。该方法从预训练数据构建伪任务序列,并开发了一个双层元学习范式,以精炼预训练骨干网络,这相当于一个延长的预训练阶段,但大大促进了表示学习对下游GCL任务的快速适应。MePo进一步初始化一个元协方差矩阵,作为预训练表示空间的参考几何,使GCL能够利用二阶统计量进行稳健的输出对齐。MePo作为一种插件策略,在无重演的情况下,在多种GCL基准和预训练检查点上实现了显著的性能提升(例如,在Sup-21/1K下,CIFAR-100、ImageNet-R和CUB-200分别提升了15.10%、13.36%和12.56%)。我们的源代码可在此获取:https://github.com/SunGL001/MePo
cs.AI / 50 / 2602.07943
IV Co-Scientist: Multi-Agent LLM Framework for Causal Instrumental Variable Discovery
IV共同科学家:用于因果工具变量发现的多智能体大语言模型框架
Abstract
In the presence of confounding between an endogenous variable and the outcome, instrumental variables (IVs) are used to isolate the causal effect of the endogenous variable. Identifying valid instruments requires interdisciplinary knowledge, creativity, and contextual understanding, making it a non-trivial task. In this paper, we investigate whether large language models (LLMs) can aid in this task. We perform a two-stage evaluation framework. First, we test whether LLMs can recover well-established instruments from the literature, assessing their ability to replicate standard reasoning. Second, we evaluate whether LLMs can identify and avoid instruments that have been empirically or theoretically discredited. Building on these results, we introduce IV Co-Scientist, a multi-agent system that proposes, critiques, and refines IVs for a given treatment-outcome pair. We also introduce a statistical test to contextualize consistency in the absence of ground truth. Our results show the potential of LLMs to discover valid instrumental variables from a large observational database.
Chinese Translation
在内生变量与结果之间存在混杂的情况下,工具变量(IVs)用于隔离内生变量的因果效应。识别有效的工具变量需要跨学科的知识、创造力和上下文理解,这使得这一任务并非易事。本文探讨了大型语言模型(LLMs)是否能够帮助完成这一任务。我们采用了一个两阶段的评估框架。首先,我们测试LLMs是否能够从文献中恢复已建立的工具变量,评估其复制标准推理的能力。其次,我们评估LLMs是否能够识别并避免那些在经验上或理论上被否定的工具变量。在这些结果的基础上,我们引入了IV共同科学家,一个多智能体系统,针对给定的处理-结果对提出、批评和完善工具变量。我们还引入了一种统计测试,以便在缺乏基准真值(ground truth)的情况下评估一致性。我们的结果显示,LLMs在从大型观察数据库中发现有效工具变量方面具有潜力。
cs.AI / 51 / 2602.07962
LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth
LOCA-bench:在可控和极端上下文增长下对语言代理进行基准测试
Abstract
Large language models (LLMs) are increasingly capable of carrying out long-running, real-world tasks. However, as the amount of context grows, their reliability often deteriorates, a phenomenon known as "context rot". Existing long-context benchmarks primarily focus on single-step settings that evaluate a model's ability to retrieve information from a long snippet. In realistic scenarios, however, LLMs often need to act as agents that explore environments, follow instructions and plans, extract useful information, and predict correct actions under a dynamically growing context. To assess language agents in such settings, we introduce LOCA-bench (a benchmark for LOng-Context Agents). Given a task prompt, LOCA-bench leverages automated and scalable control of environment states to regulate the agent's context length. This design enables LOCA-bench to extend the context length potentially to infinity in a controlled way while keeping the underlying task semantics fixed. LOCA-bench evaluates language agents as a combination of models and scaffolds, including various context management strategies. While agent performance generally degrades as the environment states grow more complex, advanced context management techniques can substantially improve the overall success rate. We open-source LOCA-bench to provide a platform for evaluating models and scaffolds in long-context, agentic scenarios: https://github.com/hkust-nlp/LOCA-bench
Chinese Translation
大型语言模型(LLMs)在执行长期的现实任务方面越来越有能力。然而,随着上下文量的增加,它们的可靠性往往会下降,这种现象被称为“上下文衰退”。现有的长上下文基准主要集中在单步设置上,评估模型从长片段中检索信息的能力。然而,在现实场景中,LLMs通常需要充当探索环境的代理,遵循指令和计划,提取有用信息,并在动态增长的上下文中预测正确的行动。为了在这种设置中评估语言代理,我们引入了LOCA-bench(长上下文代理基准)。给定一个任务提示,LOCA-bench利用自动化和可扩展的环境状态控制来调节代理的上下文长度。这一设计使LOCA-bench能够以受控的方式将上下文长度扩展到潜在的无限,同时保持基础任务语义不变。LOCA-bench将语言代理评估为模型和支架的组合,包括各种上下文管理策略。尽管随着环境状态变得更加复杂,代理的性能通常会下降,但先进的上下文管理技术可以显著提高整体成功率。我们开源了LOCA-bench,以提供一个在长上下文、代理场景中评估模型和支架的平台: https://github.com/hkust-nlp/LOCA-bench
cs.AI / 52 / 2602.07983
Accelerating Social Science Research via Agentic Hypothesization and Experimentation
通过代理假设和实验加速社会科学研究
Abstract
Data-driven social science research is inherently slow, relying on iterative cycles of observation, hypothesis generation, and experimental validation. While recent data-driven methods promise to accelerate parts of this process, they largely fail to support end-to-end scientific discovery. To address this gap, we introduce EXPERIGEN, an agentic framework that operationalizes end-to-end discovery through a Bayesian optimization inspired two-phase search, in which a Generator proposes candidate hypotheses and an Experimenter evaluates them empirically. Across multiple domains, EXPERIGEN consistently discovers 2-4x more statistically significant hypotheses that are 7-17 percent more predictive than prior approaches, and naturally extends to complex data regimes including multimodal and relational datasets. Beyond statistical performance, hypotheses must be novel, empirically grounded, and actionable to drive real scientific progress. To evaluate these qualities, we conduct an expert review of machine-generated hypotheses, collecting feedback from senior faculty. Among 25 reviewed hypotheses, 88 percent were rated moderately or strongly novel, 70 percent were deemed impactful and worth pursuing, and most demonstrated rigor comparable to senior graduate-level research. Finally, recognizing that ultimate validation requires real-world evidence, we conduct the first A/B test of LLM-generated hypotheses, observing statistically significant results with p less than 1e-6 and a large effect size of 344 percent.
Chinese Translation
数据驱动的社会科学研究本质上是缓慢的,依赖于观察、假设生成和实验验证的迭代循环。尽管最近的数据驱动方法承诺加速这一过程的某些部分,但它们在很大程度上未能支持端到端的科学发现。为了解决这一问题,我们引入了EXPERIGEN,一个代理框架,通过受贝叶斯优化启发的两阶段搜索实现端到端发现,其中生成器(Generator)提出候选假设,而实验者(Experimenter)对其进行实证评估。在多个领域中,EXPERIGEN一致发现了2-4倍更多的统计显著假设,这些假设的预测能力比以往方法提高了7-17%。此外,该框架自然扩展到包括多模态和关系数据集在内的复杂数据环境。除了统计性能外,假设还必须是新颖的、以实证为基础的,并且是可操作的,以推动真正的科学进步。为了评估这些特性,我们对机器生成的假设进行了专家评审,并收集了来自资深教师的反馈。在25个被评审的假设中,88%的假设被评为中等或高度新颖,70%被认为具有影响力且值得追求,大多数假设的严谨性与资深研究生水平的研究相当。最后,认识到最终验证需要现实世界的证据,我们进行了首次LLM生成假设的A/B测试,观察到统计显著的结果,p值小于1e-6,效果大小达到344%。
cs.AI / 53 / 2602.08009
Towards Adaptive, Scalable, and Robust Coordination of LLM Agents: A Dynamic Ad-Hoc Networking Perspective
面向自适应、可扩展和鲁棒的LLM代理协调:动态自组网视角
Abstract
Multi-agent architectures built on large language models (LLMs) have demonstrated the potential to realize swarm intelligence through well-crafted collaboration. However, the substantial burden of manual orchestration inherently raises an imperative to automate the design of agentic workflows. We frame such an agent coordination challenge as a classic problem in dynamic ad-hoc networking: How to establish adaptive and reliable communication among a scalable number of agentic hosts? In response to this unresolved dilemma, we introduce RAPS, a reputation-aware publish-subscribe paradigm for adaptive, scalable, and robust coordination of LLM agents. RAPS is grounded in the Distributed Publish-Subscribe Protocol, allowing LLM agents to exchange messages based on their declared intents rather than predefined topologies. Beyond this substrate, RAPS further incorporates two coherent overlays: (i) Reactive Subscription, enabling agents to dynamically refine their intents; and (ii) Bayesian Reputation, empowering each agent with a local watchdog to detect and isolate malicious peers. Extensive experiments over five benchmarks showcase that our design effectively reconciles adaptivity, scalability, and robustness in a unified multi-agent coordination framework.
Chinese Translation
基于大型语言模型(LLMs)的多代理架构展示了通过精心设计的协作实现群体智能的潜力。然而,手动协调的巨大负担本质上提出了自动化设计代理工作流的迫切需求。我们将这一代理协调挑战框架视为动态自组网中的经典问题:如何在可扩展数量的代理主机之间建立自适应和可靠的通信?针对这一未解决的难题,我们引入了RAPS,一种基于声誉的发布-订阅范式,用于自适应、可扩展和鲁棒的LLM代理协调。RAPS基于分布式发布-订阅协议,使LLM代理能够根据其声明的意图而非预定义的拓扑结构交换消息。在此基础上,RAPS进一步结合了两个一致的覆盖层:(i)反应式订阅,使代理能够动态地细化其意图;(ii)贝叶斯声誉,使每个代理具备本地监视功能,以检测和隔离恶意对等体。针对五个基准的广泛实验表明,我们的设计有效地在统一的多代理协调框架中调和了自适应性、可扩展性和鲁棒性。
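The "Bayesian Reputation" watchdog is plausibly a Beta-Bernoulli tracker, the standard Bayesian model for binary trust signals. The sketch below shows that generic pattern; the parameterisation and isolation threshold are illustrative assumptions, not the paper's exact mechanism:

```python
class BetaReputation:
    """Beta-Bernoulli reputation: each agent keeps a local posterior over
    whether a peer behaves honestly, updated per interaction."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha  # pseudo-count of good interactions
        self.beta = beta    # pseudo-count of bad interactions

    def update(self, success):
        if success:
            self.alpha += 1
        else:
            self.beta += 1

    def score(self):
        # Posterior mean probability that the peer is trustworthy.
        return self.alpha / (self.alpha + self.beta)

honest, malicious = BetaReputation(), BetaReputation()
for _ in range(20):
    honest.update(True)
    malicious.update(False)

# A simple isolation rule: unsubscribe from peers below a threshold.
THRESHOLD = 0.3
assert honest.score() > 0.9
assert malicious.score() < THRESHOLD < honest.score()
```

The uniform Beta(1, 1) prior means new peers start at score 0.5, so they are neither trusted nor isolated until evidence accumulates; a stricter prior would make isolation faster but riskier for honest newcomers.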
cs.AI / 54 / 2602.08013
Small Agent Group is the Future of Digital Health
小型代理组是数字健康的未来
Abstract
The rapid adoption of large language models (LLMs) in digital health has been driven by a "scaling-first" philosophy, i.e., the assumption that clinical intelligence increases with model size and data. However, real-world clinical needs include not only effectiveness, but also reliability and reasonable deployment cost. Since clinical decision-making is inherently collaborative, we challenge the monolithic scaling paradigm and ask whether a Small Agent Group (SAG) can support better clinical reasoning. SAG shifts from single-model intelligence to collective expertise by distributing reasoning, evidence-based analysis, and critical audit through a collaborative deliberation process. To assess the clinical utility of SAG, we conduct extensive evaluations using diverse clinical metrics spanning effectiveness, reliability, and deployment cost. Our results show that SAG achieves superior performance compared to a single giant model, both with and without additional optimization or retrieval-augmented generation. These findings suggest that the synergistic reasoning represented by SAG can substitute for model parameter growth in clinical settings. Overall, SAG offers a scalable solution to digital health that better balances effectiveness, reliability, and deployment efficiency.
Chinese Translation
大型语言模型(LLMs)在数字健康中的快速应用是由“优先扩展”的理念驱动的,即假设临床智能随着模型规模和数据量的增加而提高。然而,现实世界的临床需求不仅包括有效性,还包括可靠性和合理的部署成本。由于临床决策本质上是协作性的,我们挑战了单一扩展范式,并询问小型代理组(SAG)是否能够支持更好的临床推理。SAG通过在协作审议过程中分配推理、基于证据的分析和关键审计,从单一模型智能转向集体专业知识。为了评估SAG的临床效用,我们使用多种临床指标进行了广泛评估,涵盖有效性、可靠性和部署成本。我们的结果表明,SAG在性能上优于单一大型模型,无论是否进行额外的优化或检索增强生成。这些发现表明,SAG所代表的协同推理可以替代临床环境中模型参数的增长。总体而言,SAG为数字健康提供了一种可扩展的解决方案,更好地平衡了有效性、可靠性和部署效率。
cs.AI / 55 / 2602.08021
Structure-Aware Robust Counterfactual Explanations via Conditional Gaussian Network Classifiers
基于条件高斯网络分类器的结构感知鲁棒反事实解释
Abstract
Counterfactual explanation (CE) is a core technique in explainable artificial intelligence (XAI), widely used to interpret model decisions and suggest actionable alternatives. This work presents a structure-aware and robustness-oriented counterfactual search method based on the conditional Gaussian network classifier (CGNC). The CGNC has a generative structure that encodes conditional dependencies and potential causal relations among features through a directed acyclic graph (DAG). This structure naturally embeds feature relationships into the search process, eliminating the need for additional constraints to ensure consistency with the model's structural assumptions. We adopt a convergence-guaranteed cutting-set procedure as an adversarial optimization framework, which iteratively approximates solutions that satisfy global robustness conditions. To address the nonconvex quadratic structure induced by feature dependencies, we apply piecewise McCormick relaxation to reformulate the problem as a mixed-integer linear program (MILP), ensuring global optimality. Experimental results show that our method achieves strong robustness, with direct global optimization of the original formulation providing especially stable and efficient results. The proposed framework is extensible to more complex constraint settings, laying the groundwork for future advances in counterfactual reasoning under nonconvex quadratic formulations.
Chinese Translation
反事实解释(CE)是可解释人工智能(XAI)中的核心技术,广泛用于解释模型决策并建议可行的替代方案。本研究提出了一种基于条件高斯网络分类器(CGNC)的结构感知和鲁棒性导向的反事实搜索方法。CGNC具有生成结构,通过有向无环图(DAG)编码特征之间的条件依赖关系和潜在因果关系。这种结构自然将特征关系嵌入搜索过程,消除了为确保与模型结构假设一致而需要额外约束的需求。我们采用了一种收敛保证的切割集程序作为对抗优化框架,迭代逼近满足全局鲁棒性条件的解。为了解决由特征依赖关系引起的非凸二次结构,我们应用分段 McCormick 放松将问题重新表述为混合整数线性规划(MILP),确保全局最优性。实验结果表明,我们的方法实现了强鲁棒性,原始公式的直接全局优化提供了特别稳定和高效的结果。所提出的框架可扩展到更复杂的约束设置,为未来在非凸二次公式下的反事实推理的进展奠定了基础。
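For reference, the standard (single-piece) McCormick envelope replaces each bilinear product $w = xy$ with four linear inequalities, given bounds $x \in [x^L, x^U]$ and $y \in [y^L, y^U]$; the piecewise variant mentioned in the abstract applies these on subintervals of the domain, with binary variables selecting the active piece:

```latex
w \ge x^{L} y + x\, y^{L} - x^{L} y^{L}, \qquad
w \ge x^{U} y + x\, y^{U} - x^{U} y^{U}, \\
w \le x^{U} y + x\, y^{L} - x^{U} y^{L}, \qquad
w \le x^{L} y + x\, y^{U} - x^{L} y^{U}.
```

Because the envelopes are linear and exact at the bounds, tightening the partition shrinks the relaxation gap, which is how the nonconvex quadratic counterfactual search becomes a MILP solvable to global optimality.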
cs.AI / 56 / 2602.08030
Free(): Learning to Forget in Malloc-Only Reasoning Models
Free(): 在仅使用 malloc 的推理模型中学习遗忘
Abstract
Reasoning models enhance problem-solving by scaling test-time compute, yet they face a critical paradox: excessive thinking tokens often degrade performance rather than improve it. We attribute this to a fundamental architectural flaw: standard LLMs operate as "malloc-only" engines, continuously accumulating valid and redundant steps alike without a mechanism to prune obsolete information. To break this cycle, we propose Free()LM, a model that introduces an intrinsic self-forgetting capability via the Free-Module, a plug-and-play LoRA adapter. By iteratively switching between reasoning and cleaning modes, Free()LM dynamically identifies and prunes useless context chunks, maintaining a compact and noise-free state. Extensive experiments show that Free()LM provides consistent improvements across all model scales (8B to 685B). It achieves a 3.3% average improvement over top-tier reasoning baselines, even establishing a new SOTA on IMOanswerBench using DeepSeek V3.2-Speciale. Most notably, in long-horizon tasks where the standard Qwen3-235B-A22B model suffers a total collapse (0% accuracy), Free()LM restores performance to 50%. Our findings suggest that sustainable intelligence requires the freedom to forget as much as the power to think.
Chinese Translation
推理模型通过扩展测试时计算来增强问题解决能力,但它们面临一个关键的悖论:过多的思考令牌往往会降低性能而不是提高性能。我们将此归因于一个基本的架构缺陷:标准的大型语言模型(LLMs)作为“仅 malloc”引擎运行,持续积累有效和冗余的步骤,却没有机制来修剪过时的信息。为了打破这一循环,我们提出了 Free()LM,这是一种通过 Free-Module(一个即插即用的 LoRA 适配器)引入内在自我遗忘能力的模型。通过在推理模式和清理模式之间迭代切换,Free()LM 动态识别并修剪无用的上下文块,保持紧凑且无噪声的状态。大量实验表明,Free()LM 在所有模型规模(从 8B 到 685B)上都提供了一致的改进。它在顶尖推理基线模型上平均提高了 3.3%,甚至在使用 DeepSeek V3.2-Speciale 的 IMOanswerBench 上建立了新的 SOTA。最值得注意的是,在长程任务中,当标准的 Qwen3-235B-A22B 模型完全崩溃(0% 准确率)时,Free()LM 将性能恢复到 50%。我们的研究结果表明,可持续的智能不仅需要思考的能力,也同样需要遗忘的自由。
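The alternation between reasoning ("malloc") and cleaning ("free") modes can be caricatured in a few lines; the usefulness test here is a hand-written stand-in for the learned Free-Module, and the budget is an illustrative assumption:

```python
def clean_context(chunks, is_useful):
    """One 'cleaning mode' pass: keep only chunks still judged useful."""
    return [c for c in chunks if is_useful(c)]

def reason_then_clean(steps, budget=6):
    """Alternate reasoning (append) and cleaning (prune) to keep the
    working context compact instead of growing without bound."""
    context = []
    for step in steps:
        context.append(step)       # 'malloc': reasoning appends a step
        if len(context) > budget:  # 'free': prune stale steps
            context = clean_context(context, lambda c: not c.startswith("dead-end"))
    return context

trace = ["goal", "dead-end: try A", "lemma 1", "dead-end: try B",
         "lemma 2", "dead-end: try C", "lemma 3", "answer"]
assert reason_then_clean(trace) == ["goal", "lemma 1", "lemma 2", "lemma 3", "answer"]
```

A malloc-only model would carry all eight chunks forward; the interleaved cleaning pass keeps only the five that still matter, which is the intuition behind restoring performance on long-horizon tasks.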
cs.AI / 57 / 2602.08052
Graph-Enhanced Deep Reinforcement Learning for Multi-Objective Unrelated Parallel Machine Scheduling
图增强深度强化学习用于多目标不相关并行机器调度
Abstract
The Unrelated Parallel Machine Scheduling Problem (UPMSP) with release dates, setups, and eligibility constraints presents a significant multi-objective challenge. Traditional methods struggle to balance minimizing Total Weighted Tardiness (TWT) and Total Setup Time (TST). This paper proposes a Deep Reinforcement Learning framework using Proximal Policy Optimization (PPO) and a Graph Neural Network (GNN). The GNN effectively represents the complex state of jobs, machines, and setups, allowing the PPO agent to learn a direct scheduling policy. Guided by a multi-objective reward function, the agent simultaneously minimizes TWT and TST. Experimental results on benchmark instances demonstrate that our PPO-GNN agent significantly outperforms a standard dispatching rule and a metaheuristic, achieving a superior trade-off between both objectives. This provides a robust and scalable solution for complex manufacturing scheduling.
Chinese Translation
不相关并行机器调度问题(UPMSP)在释放日期、设置和资格约束下呈现出显著的多目标挑战。传统方法在平衡最小化总加权延迟(TWT)和总设置时间(TST)方面面临困难。本文提出了一种使用近端策略优化(PPO)和图神经网络(GNN)的深度强化学习框架。GNN有效地表示了作业、机器和设置的复杂状态,使得PPO智能体能够学习直接的调度策略。在多目标奖励函数的指导下,智能体同时最小化TWT和TST。基准实例的实验结果表明,我们的PPO-GNN智能体显著优于标准调度规则和元启发式算法,在两个目标之间实现了更优的权衡。这为复杂制造调度提供了一种稳健且可扩展的解决方案。
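The two objectives named in this abstract are easy to make concrete. The job fields, the setup list, and the equal 0.5/0.5 weighting below are our assumptions, not the paper's reward design; the sketch only shows how TWT and TST combine into a single scalar signal a PPO agent could maximize.

```python
# Illustrative only: computes the two objectives from the abstract on a tiny
# schedule and scalarizes them. Field names and weights are our assumptions.

def total_weighted_tardiness(jobs):
    """TWT = sum_j w_j * max(0, completion_j - due_j)."""
    return sum(j["w"] * max(0, j["completion"] - j["due"]) for j in jobs)

def total_setup_time(setups):
    """TST = sum of sequence-dependent setup durations actually incurred."""
    return sum(setups)

def reward(jobs, setups, alpha=0.5):
    """Multi-objective reward: negative weighted sum (the agent maximizes it)."""
    return -(alpha * total_weighted_tardiness(jobs)
             + (1 - alpha) * total_setup_time(setups))

jobs = [{"w": 2, "completion": 10, "due": 8},   # tardy by 2 -> contributes 4
        {"w": 1, "completion": 5,  "due": 9}]   # early -> contributes 0
print(total_weighted_tardiness(jobs))  # 4
print(reward(jobs, setups=[3, 1]))     # -(0.5*4 + 0.5*4) = -4.0
```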
cs.AI / 58 / 2602.08061
Securing Dual-Use Pathogen Data of Concern
保障双用途病原体数据的安全
Abstract
Training data is an essential input into creating competent artificial intelligence (AI) models. AI models for biology are trained on large volumes of data, including data related to biological sequences, structures, images, and functions. The type of data used to train a model is intimately tied to the capabilities it ultimately possesses--including those of biosecurity concern. For this reason, an international group of more than 100 researchers at the recent 50th anniversary Asilomar Conference endorsed data controls to prevent the use of AI for harmful applications such as bioweapons development. To help design such controls, we introduce a five-tier Biosecurity Data Level (BDL) framework for categorizing pathogen data. Each level contains specific data types, based on their expected ability to contribute to capabilities of concern when used to train AI models. For each BDL tier, we propose technical restrictions appropriate to its level of risk. Finally, we outline a novel governance framework for newly created dual-use pathogen data. In a world with widely accessible computational and coding resources, data controls may be among the most high-leverage interventions available to reduce the proliferation of concerning biological AI capabilities.
Chinese Translation
训练数据是创建胜任的人工智能(AI)模型的关键输入。生物学领域的AI模型是在大量数据上训练的,这些数据包括与生物序列、结构、图像和功能相关的数据。用于训练模型的数据类型与其最终具备的能力密切相关,包括那些与生物安全相关的能力。因此,在近期举行的阿西洛马会议50周年纪念会议上,一个由100多位研究人员组成的国际团体支持实施数据控制,以防止AI被用于生物武器开发等有害应用。为帮助设计此类控制措施,我们提出了一个五级生物安全数据级别(Biosecurity Data Level, BDL)框架,用于对病原体数据进行分类。每个级别包含特定的数据类型,依据是这些数据在用于训练AI模型时对令人担忧的能力的预期贡献。针对每个BDL级别,我们提出了适合其风险水平的技术限制。最后,我们概述了一个针对新创建的双用途病原体数据的新型治理框架。在一个计算和编码资源广泛可及的世界中,数据控制可能是减少令人担忧的生物AI能力扩散的最具杠杆效应的干预措施之一。
cs.AI / 59 / 2602.08092
Objective Decoupling in Social Reinforcement Learning: Recovering Ground Truth from Sycophantic Majorities
社会强化学习中的目标解耦:从阿谀奉承的多数中恢复基准真相
Abstract
Contemporary AI alignment strategies rely on a fragile premise: that human feedback, while noisy, remains a fundamentally truthful signal. In this paper, we identify this assumption as Dogma 4 of Reinforcement Learning (RL). We demonstrate that while this dogma holds in static environments, it fails in social settings where evaluators may be sycophantic, lazy, or adversarial. We prove that under Dogma 4, standard RL agents suffer from what we call Objective Decoupling, a structural failure mode where the agent's learned objective permanently separates from the latent ground truth, guaranteeing convergence to misalignment. To resolve this, we propose Epistemic Source Alignment (ESA). Unlike standard robust methods that rely on statistical consensus (trusting the majority), ESA utilizes sparse safety axioms to judge the source of the feedback rather than the signal itself. We prove that this "judging the judges" mechanism guarantees convergence to the true objective, even when a majority of evaluators are biased. Empirically, we show that while traditional consensus methods fail under majority collusion, our approach successfully recovers the optimal policy.
Chinese Translation
当代人工智能对齐策略依赖于一个脆弱的前提:人类反馈虽然嘈杂,但仍然是一个基本真实的信号。本文将这一假设识别为强化学习(Reinforcement Learning, RL)的教条4。我们证明,尽管这一教条在静态环境中成立,但在评估者可能表现出阿谀奉承、懒惰或对抗性的社会环境中,它会失效。我们证明,在教条4的条件下,标准的RL代理会遭遇我们称之为目标解耦的结构性失效模式,其中代理学习到的目标与潜在的基准真相永久分离,从而必然收敛到未对齐状态。为了解决这一问题,我们提出了认识源对齐(Epistemic Source Alignment, ESA)。与依赖统计共识(信任多数)的标准稳健方法不同,ESA利用稀疏安全公理来判断反馈的来源,而不是信号本身。我们证明,这种“评判评判者”的机制保证了收敛到真实目标,即使大多数评估者存在偏见。在实证上,我们展示了传统共识方法在多数合谋下失效,而我们的方法成功恢复了最优策略。
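The "judging the judges" idea in this abstract can be illustrated with a toy aggregator: evaluators whose feedback history violates a sparse safety axiom are excluded before any consensus is taken. The axiom, the vote encoding, and the evaluator records below are invented for illustration; the paper's formal ESA mechanism and its convergence guarantee are far more general.

```python
# Toy rendering of Epistemic Source Alignment: screen the *source* of feedback
# against safety axioms instead of trusting the statistical majority.
# All records and the axiom itself are fabricated examples.

def axiom_consistent(evaluator, axioms):
    """A source is trusted only if none of its past verdicts violate an axiom."""
    return all(ax(v) for v in evaluator["history"] for ax in axioms)

def esa_aggregate(evaluators, axioms):
    """Aggregate feedback from axiom-consistent sources only."""
    trusted = [e for e in evaluators if axiom_consistent(e, axioms)]
    votes = [e["vote"] for e in trusted]
    return round(sum(votes) / len(votes)) if votes else None

# Axiom: a source that rewarded a known-unsafe action is untrustworthy.
no_unsafe_reward = lambda verdict: not (verdict["unsafe"] and verdict["reward"] > 0)

evaluators = [
    {"vote": 1, "history": [{"unsafe": True, "reward": 1}]},  # sycophant
    {"vote": 1, "history": [{"unsafe": True, "reward": 1}]},  # sycophant
    {"vote": 0, "history": [{"unsafe": True, "reward": 0}]},  # honest minority
]
print(esa_aggregate(evaluators, [no_unsafe_reward]))  # 0: the minority wins
# A plain majority vote over `vote` would have returned 1.
```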
cs.AI / 60 / 2602.08104
Interpretable Failure Analysis in Multi-Agent Reinforcement Learning Systems
多智能体强化学习系统中的可解释性故障分析
Abstract
Multi-Agent Reinforcement Learning (MARL) is increasingly deployed in safety-critical domains, yet methods for interpretable failure detection and attribution remain underdeveloped. We introduce a two-stage gradient-based framework that provides interpretable diagnostics for three critical failure analysis tasks: (1) detecting the true initial failure source (Patient-0); (2) validating why non-attacked agents may be flagged first due to domino effects; and (3) tracing how failures propagate through learned coordination pathways. Stage 1 performs interpretable per-agent failure detection via Taylor-remainder analysis of policy-gradient costs, declaring an initial Patient-0 candidate at the first threshold crossing. Stage 2 provides validation through geometric analysis of critic derivatives (first-order sensitivity and directional second-order curvature aggregated over causal windows) to construct interpretable contagion graphs. This approach explains "downstream-first" detection anomalies by revealing pathways that amplify upstream deviations. Evaluated across 500 episodes in Simple Spread (3 and 5 agents) and 100 episodes in StarCraft II using MADDPG and HATRPO, our method achieves 88.2-99.4% Patient-0 detection accuracy while providing interpretable geometric evidence for detection decisions. By moving beyond black-box detection to interpretable gradient-level forensics, this framework offers practical tools for diagnosing cascading failures in safety-critical MARL systems.
Chinese Translation
多智能体强化学习(MARL)在安全关键领域的应用日益增多,但可解释的故障检测和归因方法仍然不够成熟。我们提出了一种基于梯度的两阶段框架,为三项关键故障分析任务提供可解释的诊断:(1) 检测真实的初始故障源(Patient-0);(2) 验证为何未受攻击的智能体可能因多米诺效应而被优先标记;(3) 追踪故障如何通过学习的协调路径传播。第一阶段通过对策略梯度成本的泰勒余项分析,进行可解释的逐个智能体故障检测,在首次越过阈值时声明初始的Patient-0候选者。第二阶段通过对评论家导数的几何分析(一阶敏感性以及在因果窗口上聚合的方向性二阶曲率)提供验证,以构建可解释的传播图。这种方法通过揭示放大上游偏差的路径,解释了“下游优先”检测异常。在使用MADDPG和HATRPO的Simple Spread环境(3个和5个智能体)中的500个回合以及StarCraft II中的100个回合上进行评估,我们的方法实现了88.2%-99.4%的Patient-0检测准确率,同时为检测决策提供了可解释的几何证据。通过超越黑箱检测,转向可解释的梯度级取证分析,该框架为诊断安全关键的MARL系统中的级联故障提供了实用工具。
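Stage 1's first-threshold-crossing rule is straightforward to sketch. The per-timestep anomaly scores below stand in for the Taylor-remainder policy-gradient costs described in the abstract, and the numbers are fabricated; the sketch shows only the detection logic, not the gradient analysis itself.

```python
# Minimal sketch of the Patient-0 rule: per-agent anomaly scores over time,
# with the candidate declared at the earliest threshold crossing.
# Scores are invented stand-ins for the paper's Taylor-remainder costs.

def first_crossing(series, threshold):
    """Return the first timestep where the score exceeds threshold, else None."""
    for t, s in enumerate(series):
        if s > threshold:
            return t
    return None

def detect_patient0(scores_by_agent, threshold=1.0):
    """Patient-0 candidate = agent with the earliest threshold crossing."""
    crossings = {a: first_crossing(s, threshold)
                 for a, s in scores_by_agent.items()}
    hit = {a: t for a, t in crossings.items() if t is not None}
    return min(hit, key=hit.get) if hit else None

scores = {
    "agent_0": [0.1, 0.2, 1.5, 2.0],   # crosses at t=2 (true source)
    "agent_1": [0.1, 0.1, 0.3, 1.8],   # crosses later via domino effect
    "agent_2": [0.1, 0.1, 0.2, 0.4],   # never crosses
}
print(detect_patient0(scores))  # "agent_0"
```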
cs.AI / 61 / 2602.08121
Initial Risk Probing and Feasibility Testing of Glow: a Generative AI-Powered Dialectical Behavior Therapy Skills Coach for Substance Use Recovery and HIV Prevention
Glow的初步风险探测与可行性测试:一种基于生成性人工智能的辩证行为疗法技能教练,用于物质使用康复和HIV预防
Abstract
Background: HIV and substance use represent interacting epidemics with shared psychological drivers - impulsivity and maladaptive coping. Dialectical behavior therapy (DBT) targets these mechanisms but faces scalability challenges. Generative artificial intelligence (GenAI) offers potential for delivering personalized DBT coaching at scale, yet rapid development has outpaced safety infrastructure. Methods: We developed Glow, a GenAI-powered DBT skills coach delivering chain and solution analysis for individuals at risk for HIV and substance use. In partnership with a Los Angeles community health organization, we conducted usability testing with clinical staff (n=6) and individuals with lived experience (n=28). Using the Helpful, Honest, and Harmless (HHH) framework, we employed user-driven adversarial testing wherein participants identified target behaviors and generated contextually realistic risk probes. We evaluated safety performance across 37 risk probe interactions. Results: Glow appropriately handled 73% of risk probes, but performance varied by agent. The solution analysis agent demonstrated 90% appropriate handling versus 44% for the chain analysis agent. Safety failures clustered around encouraging substance use and normalizing harmful behaviors. The chain analysis agent fell into an "empathy trap," providing validation that reinforced maladaptive beliefs. Additionally, 27 instances of DBT skill misinformation were identified. Conclusions: This study provides the first systematic safety evaluation of GenAI-delivered DBT coaching for HIV and substance use risk reduction. Findings reveal vulnerabilities requiring mitigation before clinical trials. The HHH framework and user-driven adversarial testing offer replicable methods for evaluating GenAI mental health interventions.
Chinese Translation
背景:HIV和物质使用代表了相互交织的流行病,具有共同的心理驱动因素:冲动性和不适应性应对。辩证行为疗法(DBT)针对这些机制,但面临可扩展性挑战。生成性人工智能(GenAI)为大规模提供个性化DBT辅导提供了潜力,但其快速发展已超过安全基础设施的发展速度。方法:我们开发了Glow,一个基于GenAI的DBT技能教练,为面临HIV和物质使用风险的个体提供链分析和解决方案分析。我们与洛杉矶一家社区健康组织合作,针对临床工作人员(n=6)和有亲身经历的个体(n=28)进行了可用性测试。采用有益、诚实和无害(HHH)框架,我们进行了用户驱动的对抗性测试,由参与者识别目标行为并生成符合情境、贴近现实的风险探测。我们评估了37个风险探测交互中的安全性能。结果:Glow适当地处理了73%的风险探测,但不同代理的表现有所不同。解决方案分析代理的适当处理率为90%,而链分析代理为44%。安全失败主要集中在鼓励物质使用和将有害行为正常化上。链分析代理陷入了“同理心陷阱”,提供了强化不适应性信念的认可。此外,识别出了27个DBT技能误信息的实例。结论:本研究提供了针对GenAI交付的DBT辅导在HIV和物质使用风险降低方面的首次系统安全评估。研究结果揭示了在临床试验之前需要缓解的脆弱性。HHH框架和用户驱动的对抗性测试为评估GenAI心理健康干预措施提供了可重复的方法。
cs.AI / 62 / 2602.08214
RECUR: Resource Exhaustion Attack via Recursive-Entropy Guided Counterfactual Utilization and Reflection
RECUR:通过递归熵引导的反事实利用与反思进行资源耗尽攻击
Abstract
Large Reasoning Models (LRMs) employ reasoning to address complex tasks. Such explicit reasoning requires extended context lengths, resulting in substantially higher resource consumption. Prior work has shown that adversarially crafted inputs can trigger redundant reasoning processes, exposing LRMs to resource-exhaustion vulnerabilities. However, the reasoning process itself, especially its reflective component, has received limited attention, even though it can lead to over-reflection and consume excessive computing power. In this paper, we introduce Recursive Entropy to quantify the risk of resource consumption in reflection, thereby revealing the safety issues inherent in inference itself. Based on Recursive Entropy, we introduce RECUR, a resource exhaustion attack via Recursive Entropy guided Counterfactual Utilization and Reflection. It constructs counterfactual questions to verify the inherent flaws and risks of LRMs. Extensive experiments demonstrate that, under benign inference, recursive entropy exhibits a pronounced decreasing trend. RECUR disrupts this trend, increasing the output length by up to 11x and decreasing throughput by 90%. Our work provides a new perspective on robust reasoning.
Chinese Translation
大型推理模型(LRMs)利用推理来解决复杂任务。这种显式推理需要较长的上下文长度,导致资源消耗显著增加。先前的研究表明,经过对抗性设计的输入可以触发冗余的推理过程,使LRMs暴露于资源耗尽的脆弱性中。然而,推理过程本身,尤其是其反思成分,受到的关注有限,尽管它可能导致过度反思并消耗过多的计算能力。在本文中,我们引入递归熵来量化反思中资源消耗的风险,从而揭示推理本身固有的安全问题。基于递归熵,我们提出了RECUR,一种通过递归熵引导的反事实利用与反思进行的资源耗尽攻击。它构建反事实问题以验证LRMs的固有缺陷和风险。大量实验表明,在良性推理下,递归熵呈现出明显的下降趋势。RECUR破坏了这一趋势,使输出长度最多增加11倍,并使吞吐量降低90%。我们的工作为稳健推理提供了新的视角。
cs.AI / 63 / 2602.08222
Weak-Driven Learning: How Weak Agents make Strong Agents Stronger
弱驱动学习:弱代理如何使强代理更强
Abstract
As post-training optimization becomes central to improving large language models, we observe a persistent saturation bottleneck: once models grow highly confident, further training yields diminishing returns. While existing methods continue to reinforce target predictions, we find that informative supervision signals remain latent in models' own historical weak states. Motivated by this observation, we propose WMSS (Weak Agents Can Make Strong Agents Stronger), a post-training paradigm that leverages weak checkpoints to guide continued optimization. By identifying recoverable learning gaps via entropy dynamics and reinforcing them through compensatory learning, WMSS enables strong agents to improve beyond conventional post-training saturation. Experiments on mathematical reasoning and code generation datasets show that agents trained with our approach achieve effective performance improvements, while incurring zero additional inference cost.
Chinese Translation
随着后训练优化在提升大型语言模型中的重要性日益凸显,我们观察到一个持续存在的饱和瓶颈:一旦模型变得高度自信,进一步的训练所带来的收益逐渐递减。尽管现有方法继续强化目标预测,我们发现信息丰富的监督信号仍然潜藏在模型自身历史的弱状态中。基于这一观察,我们提出了WMSS(Weak Agents Can Make Strong Agents Stronger),一种利用弱检查点指导持续优化的后训练范式。通过熵动态识别可恢复的学习差距,并通过补偿学习予以强化,WMSS使得强代理能够超越传统的后训练饱和。对数学推理和代码生成数据集的实验表明,采用我们方法训练的代理取得了有效的性能提升,同时没有增加额外的推理成本。
cs.AI / 64 / 2602.08229
InfiCoEvalChain: A Blockchain-Based Decentralized Framework for Collaborative LLM Evaluation
InfiCoEvalChain:基于区块链的去中心化协作大语言模型评估框架
Abstract
The rapid advancement of large language models (LLMs) demands increasingly reliable evaluation, yet current centralized evaluation suffers from opacity, overfitting, and hardware-induced variance. Our empirical analysis reveals an alarming inconsistency in existing evaluations: the standard deviation across ten repeated runs of a single model on HumanEval (1.67) actually exceeds the performance gap among the top-10 models on the official leaderboard (0.91), rendering current rankings statistically precarious. To mitigate these instabilities, we propose a decentralized evaluation framework that enables hardware and parameter diversity through large-scale benchmarking across heterogeneous compute nodes. By leveraging the blockchain-based protocol, the framework incentivizes global contributors to act as independent validators, using a robust reward system to ensure evaluation integrity and discourage dishonest participation. This collective verification transforms evaluation from a "centralized black box" into a "decentralized endorsement" where multi-party consensus and diverse inference environments yield a more stable, representative metric. Experimental results demonstrate that the decentralized evaluation framework reduces the standard deviation across ten runs on the same model to 0.28. This significant improvement over conventional frameworks ensures higher statistical confidence in model rankings. We have completely implemented this platform and will soon release it to the community.
Chinese Translation
大语言模型(LLMs)的快速发展对评估的可靠性提出了越来越高的要求,然而当前的中心化评估存在不透明、过拟合和硬件引起的方差等问题。我们的实证分析揭示了现有评估中的一个令人担忧的不一致性:在HumanEval上对单一模型进行十次重复运行的标准差(1.67)实际上超过了官方排行榜上前十名模型之间的性能差距(0.91),这使得当前的排名在统计上变得不稳定。为了减轻这些不稳定性,我们提出了一种去中心化的评估框架,通过在异构计算节点上进行大规模基准测试,促进硬件和参数的多样性。通过利用基于区块链的协议,该框架激励全球贡献者作为独立验证者,使用强有力的奖励系统来确保评估的完整性并抑制不诚实的参与。这种集体验证将评估从“中心化黑箱”转变为“去中心化背书”,在多方共识和多样化推理环境的作用下,产生更稳定、更具代表性的指标。实验结果表明,该去中心化评估框架将同一模型十次运行的标准差降低至0.28。相较于传统框架,这一显著改进确保了模型排名具有更高的统计置信度。我们已经完整实现了该平台,并将很快向社区发布。
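The abstract's statistical argument (run-to-run noise exceeding the leaderboard gap) can be reproduced with made-up numbers. The ten scores below are hypothetical; only the 0.91 top-10 gap is taken from the abstract.

```python
# Reproducing the instability comparison with fabricated run scores: when the
# std dev over repeated runs of one model exceeds the gap between leaderboard
# entries, the ranking inside that band is statistically precarious.

import statistics

def run_stddev(scores):
    """Sample standard deviation across repeated evaluation runs."""
    return statistics.stdev(scores)

# Ten hypothetical HumanEval scores for one model (not the paper's raw data).
runs = [84.1, 82.0, 85.3, 81.2, 83.9, 86.0, 82.5, 84.8, 81.7, 85.1]
leaderboard_gap = 0.91   # top-10 spread reported in the abstract

sd = run_stddev(runs)
print(round(sd, 2))          # 1.69
print(sd > leaderboard_gap)  # True: rankings inside this band are noise
```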
cs.AI / 65 / 2602.08240
PTS-SNN: A Prompt-Tuned Temporal Shift Spiking Neural Networks for Efficient Speech Emotion Recognition
PTS-SNN:一种用于高效语音情感识别的提示调优时序移位脉冲神经网络
Abstract
Speech Emotion Recognition (SER) is widely deployed in Human-Computer Interaction, yet the high computational cost of conventional models hinders their implementation on resource-constrained edge devices. Spiking Neural Networks (SNNs) offer an energy-efficient alternative due to their event-driven nature; however, their integration with continuous Self-Supervised Learning (SSL) representations is fundamentally challenged by distribution mismatch, where high-dynamic-range embeddings degrade the information coding capacity of threshold-based neurons. To resolve this, we propose Prompt-Tuned Spiking Neural Networks (PTS-SNN), a parameter-efficient neuromorphic adaptation framework that aligns frozen SSL backbones with spiking dynamics. Specifically, we introduce a Temporal Shift Spiking Encoder to capture local temporal dependencies via parameter-free channel shifts, establishing a stable feature basis. To bridge the domain gap, we devise a Context-Aware Membrane Potential Calibration strategy. This mechanism leverages a Spiking Sparse Linear Attention module to aggregate global semantic context into learnable soft prompts, which dynamically regulate the bias voltages of Parametric Leaky Integrate-and-Fire (PLIF) neurons. This regulation effectively centers the heterogeneous input distribution within the responsive firing range, mitigating functional silence or saturation. Extensive experiments on five multilingual datasets (e.g., IEMOCAP, CASIA, EMODB) demonstrate that PTS-SNN achieves 73.34% accuracy on IEMOCAP, comparable to competitive Artificial Neural Networks (ANNs), while requiring only 1.19M trainable parameters and 0.35 mJ inference energy per sample.
Chinese Translation
语音情感识别(SER)广泛应用于人机交互,但传统模型的高计算成本阻碍了其在资源受限的边缘设备上的部署。脉冲神经网络(SNNs)由于其事件驱动的特性,提供了一种节能的替代方案;然而,它们与连续自监督学习(SSL)表示的集成在根本上受到分布不匹配的挑战:高动态范围的嵌入会降低基于阈值的神经元的信息编码能力。为了解决这一问题,我们提出了提示调优脉冲神经网络(PTS-SNN),这是一种参数高效的神经形态适应框架,旨在将冻结的SSL骨干网络与脉冲动态对齐。具体而言,我们引入了一种时序移位脉冲编码器,通过无参数的通道移位捕捉局部时序依赖性,建立稳定的特征基础。为了弥合领域差距,我们设计了一种上下文感知膜电位校准策略。该机制利用脉冲稀疏线性注意力模块将全局语义上下文聚合为可学习的软提示,动态调节参数化泄漏积分-发放(PLIF)神经元的偏置电压。这种调节有效地将异构输入分布集中到具有响应性的发放范围内,缓解功能性静默或饱和。在五个多语言数据集(如IEMOCAP、CASIA、EMODB)上的广泛实验表明,PTS-SNN在IEMOCAP上达到了73.34%的准确率,可与具有竞争力的人工神经网络(ANNs)相媲美,同时仅需1.19M可训练参数和每个样本0.35 mJ的推理能量。
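The membrane-potential calibration idea can be shown with a toy leaky integrate-and-fire neuron: a prompt-derived bias recenters an out-of-range input distribution inside the firing band. The constants, the bias value, and the update rule below are our simplifications, not the paper's PLIF parameterization.

```python
# Toy LIF step illustrating why bias calibration matters: without it, inputs
# centered far from threshold leave the neuron functionally silent. All
# numbers here are invented for illustration.

def lif_spikes(inputs, bias=0.0, tau=2.0, threshold=1.0):
    """Run one leaky integrate-and-fire neuron over a sequence; return spikes."""
    v, spikes = 0.0, []
    for x in inputs:
        v = v + (x + bias - v) / tau     # leaky integration toward input+bias
        if v >= threshold:               # fire and reset on threshold crossing
            spikes.append(1)
            v = 0.0
        else:
            spikes.append(0)
    return spikes

# High-dynamic-range SSL-like features centered far below threshold:
features = [-2.0, -1.5, -2.5, -1.8, -2.2, -1.9]
silent = lif_spikes(features, bias=0.0)        # functional silence
calibrated = lif_spikes(features, bias=3.2)    # bias recenters the input
print(sum(silent), sum(calibrated))  # 0 2
```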
cs.AI / 66 / 2602.08241
Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs
多模态大语言模型真的能看懂吗:增强多模态大语言模型中的视觉注意力
Abstract
While chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks, existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. Our analysis shows that current MLLMs exhibit weak visual focus: early-stage visual misalignment is rarely corrected during subsequent reasoning, leading to error propagation and failed inferences. We argue that this limitation stems from inadequate credit assignment for visual attention during training. To address this issue, we propose SAYO, a visual reasoning model trained with a reinforcement learning (RL) framework that introduces a region-level visual attention-based reward. This reward explicitly aligns optimization signals with visually grounded reasoning steps, enabling the model to learn more reliable attention behaviors. Extensive experiments across multiple multimodal benchmarks demonstrate that SAYO consistently improves performance on diverse reasoning and perception tasks.
Chinese Translation
尽管链式思维(CoT)推理显著提升了多模态大语言模型(MLLMs)在复杂推理任务上的表现,但现有方法在很大程度上依赖于长文本推理轨迹,并提供有限的机制来学习稳定的视觉注意力策略。我们的分析表明,当前的MLLMs表现出较弱的视觉聚焦:早期阶段的视觉错位在后续推理过程中很少得到纠正,导致错误传播和推理失败。我们认为这一局限性源于训练过程中对视觉注意力的不充分信用分配。为了解决这一问题,我们提出了SAYO,一种基于强化学习(RL)框架训练的视觉推理模型,该框架引入了基于区域级视觉注意力的奖励。该奖励明确将优化信号与视觉基础的推理步骤对齐,使模型能够学习更可靠的注意力行为。在多个多模态基准测试中的广泛实验表明,SAYO在各种推理和感知任务上持续提升了性能。
cs.AI / 67 / 2602.08253
G-LNS: Generative Large Neighborhood Search for LLM-Based Automatic Heuristic Design
G-LNS:面向基于LLM的自动启发式设计的生成式大邻域搜索
Abstract
While Large Language Models (LLMs) have recently shown promise in Automated Heuristic Design (AHD), existing approaches typically formulate AHD around constructive priority rules or parameterized local search guidance, thereby restricting the search space to fixed heuristic forms. Such designs offer limited capacity for structural exploration, making it difficult to escape deep local optima in complex Combinatorial Optimization Problems (COPs). In this work, we propose G-LNS, a generative evolutionary framework that extends LLM-based AHD to the automated design of Large Neighborhood Search (LNS) operators. Unlike prior methods that evolve heuristics in isolation, G-LNS leverages LLMs to co-evolve tightly coupled pairs of destroy and repair operators. A cooperative evaluation mechanism explicitly captures their interaction, enabling the discovery of complementary operator logic that jointly performs effective structural disruption and reconstruction. Extensive experiments on challenging COP benchmarks, such as Traveling Salesman Problems (TSP) and Capacitated Vehicle Routing Problems (CVRP), demonstrate that G-LNS significantly outperforms LLM-based AHD methods as well as strong classical solvers. The discovered heuristics not only achieve near-optimal solutions with reduced computational budgets but also exhibit robust generalization across diverse and unseen instance distributions.
Chinese Translation
尽管大型语言模型(LLMs)最近在自动启发式设计(AHD)中展现出潜力,但现有方法通常围绕构造式优先规则或参数化局部搜索指导来制定AHD,从而将搜索空间限制为固定的启发式形式。这种设计在结构探索方面的能力有限,使得在复杂的组合优化问题(COPs)中难以逃脱深层局部最优解。在本研究中,我们提出了G-LNS,一个将基于LLM的AHD扩展到大邻域搜索(LNS)算子自动设计的生成式进化框架。与以往孤立地进化启发式的方法不同,G-LNS利用LLMs共同进化紧密耦合的破坏与修复算子对。合作评估机制明确捕捉它们之间的相互作用,使得能够发现互补的算子逻辑,从而共同实现有效的结构破坏和重建。在具有挑战性的COP基准测试(如旅行商问题(TSP)和容量受限车辆路径问题(CVRP))上的广泛实验表明,G-LNS显著优于基于LLM的AHD方法以及强大的经典求解器。所发现的启发式不仅在减少计算预算的情况下实现了接近最优的解决方案,而且在多样化和未见过的实例分布上表现出强大的泛化能力。
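For context, the destroy/repair loop that G-LNS evolves operators for looks roughly like the following. The random-removal and greedy-reinsertion pair is a hand-written placeholder standing in for the LLM-generated operators; the 4-city instance is a unit square whose optimal tour is the perimeter of length 4.

```python
# Bare-bones Large Neighborhood Search for a tiny TSP. The destroy/repair
# operators here are simple placeholders, not the evolved operators from the
# paper; only the loop structure matches the LNS scheme the abstract targets.

import random

def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def destroy(tour, rng, k=2):
    """Remove k random cities from the tour."""
    removed = rng.sample(tour, k)
    return [c for c in tour if c not in removed], removed

def repair(partial, removed, dist):
    """Greedy reinsertion: put each city where it increases length least."""
    for c in removed:
        best_pos = min(range(len(partial) + 1),
                       key=lambda i: tour_length(partial[:i] + [c] + partial[i:], dist))
        partial = partial[:best_pos] + [c] + partial[best_pos:]
    return partial

def lns(dist, iters=200, seed=0):
    rng = random.Random(seed)
    best = list(range(len(dist)))
    for _ in range(iters):
        partial, removed = destroy(best, rng)
        cand = repair(partial, removed, dist)
        if tour_length(cand, dist) < tour_length(best, dist):
            best = cand
    return best

# 4 cities on a unit square; the optimal tour walks the perimeter (length 4).
d = [[0.0, 1.0, 1.414, 1.0],
     [1.0, 0.0, 1.0, 1.414],
     [1.414, 1.0, 0.0, 1.0],
     [1.0, 1.414, 1.0, 0.0]]
best = lns(d)
print(round(tour_length(best, d), 3))  # 4.0: the perimeter tour is optimal
```

Starting from the identity tour (already the perimeter) the loop simply confirms optimality; on real instances it is the quality of the destroy/repair pair that drives improvement, which is exactly the part G-LNS hands to the LLM.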
cs.AI / 68 / 2602.08254
SynthAgent: A Multi-Agent LLM Framework for Realistic Patient Simulation -- A Case Study in Obesity with Mental Health Comorbidities
SynthAgent:一种用于真实患者模拟的多智能体大语言模型框架——以伴随心理健康共病的肥胖症为案例研究
Abstract
Simulating high-fidelity patients offers a powerful avenue for studying complex diseases while addressing the challenges of fragmented, biased, and privacy-restricted real-world data. In this study, we introduce SynthAgent, a novel Multi-Agent System (MAS) framework designed to model obesity patients with comorbid mental disorders, including depression, anxiety, social phobia, and binge eating disorder. SynthAgent integrates clinical and medical evidence from claims data, population surveys, and patient-centered literature to construct personalized virtual patients enriched with personality traits that influence adherence, emotion regulation, and lifestyle behaviors. Through autonomous agent interactions, the system simulates disease progression, treatment response, and life management across diverse psychosocial contexts. Evaluation of more than 100 generated patients demonstrated that GPT-5 and Claude 4.5 Sonnet achieved the highest fidelity as the core engine in the proposed MAS framework, outperforming Gemini 2.5 Pro and DeepSeek-R1. SynthAgent thus provides a scalable and privacy-preserving framework for exploring patient journeys, behavioral dynamics, and decision-making processes in both medical and psychological domains.
Chinese Translation
高保真患者模拟为研究复杂疾病提供了强有力的途径,同时解决了碎片化、偏见和隐私限制的现实数据挑战。在本研究中,我们介绍了SynthAgent,一种新颖的多智能体系统(Multi-Agent System, MAS)框架,旨在模拟伴随心理障碍的肥胖患者,包括抑郁症、焦虑症、社交恐惧症和暴食症。SynthAgent整合了来自索赔数据、人口调查和以患者为中心的文献的临床和医学证据,以构建个性化的虚拟患者,这些患者具有影响依从性、情绪调节和生活方式行为的个性特征。通过自主智能体的交互,该系统模拟了在多样心理社会背景下的疾病进展、治疗反应和生活管理。对生成的100多名患者的评估表明,GPT-5和Claude 4.5 Sonnet作为所提议的MAS框架的核心引擎,达到了最高的保真度,优于Gemini 2.5 Pro和DeepSeek-R1。因此,SynthAgent提供了一个可扩展且保护隐私的框架,用于探索患者旅程、行为动态和医疗及心理领域的决策过程。
cs.AI / 69 / 2602.08268
Puda: Private User Dataset Agent for User-Sovereign and Privacy-Preserving Personalized AI
Puda:用户主权和隐私保护个性化人工智能的私有用户数据集代理
Abstract
Personal data centralization among dominant platform providers including search engines, social networking services, and e-commerce has created siloed ecosystems that restrict user sovereignty, thereby impeding data use across services. Meanwhile, the rapid proliferation of Large Language Model (LLM)-based agents has intensified demand for highly personalized services that require the dynamic provision of diverse personal data. This presents a significant challenge: balancing the utilization of such data with privacy protection. To address this challenge, we propose Puda (Private User Dataset Agent), a user-sovereign architecture that aggregates data across services and enables client-side management. Puda allows users to control data sharing at three privacy levels: (i) Detailed Browsing History, (ii) Extracted Keywords, and (iii) Predefined Category Subsets. We implemented Puda as a browser-based system that serves as a common platform across diverse services and evaluated it through a personalized travel planning task. Our results show that providing Predefined Category Subsets achieves 97.2% of the personalization performance (evaluated via an LLM-as-a-Judge framework across three criteria) obtained when sharing Detailed Browsing History. These findings demonstrate that Puda enables effective multi-granularity management, offering practical choices to mitigate the privacy-personalization trade-off. Overall, Puda provides an AI-native foundation for user sovereignty, empowering users to safely leverage the full potential of personalized AI.
Chinese Translation
在主导平台提供商(包括搜索引擎、社交网络服务和电子商务)之间,个人数据的集中化导致了孤立的生态系统,这限制了用户主权,从而妨碍了数据在服务之间的使用。同时,基于大型语言模型(LLM)的代理的快速普及加剧了对高度个性化服务的需求,这些服务需要动态提供多样的个人数据。这带来了一个重大挑战:在利用这些数据与保护隐私之间取得平衡。为了解决这一挑战,我们提出了Puda(私有用户数据集代理),这是一种用户主权架构,能够跨服务聚合数据并实现客户端管理。Puda允许用户在三个隐私级别上控制数据共享:(i)详细浏览历史,(ii)提取的关键词,以及(iii)预定义类别子集。我们将Puda实现为一个基于浏览器的系统,作为多种服务的共同平台,并通过个性化旅行规划任务进行评估。我们的结果表明,仅提供预定义类别子集即可达到共享详细浏览历史时所获个性化性能的97.2%(通过LLM作为评判者框架在三个标准下评估)。这些发现表明,Puda实现了有效的多粒度管理,为缓解隐私与个性化之间的权衡提供了实用选择。总体而言,Puda为用户主权提供了一个AI原生的基础,使用户能够安全地发挥个性化人工智能的全部潜力。
cs.AI / 70 / 2602.08276
Toward Formalizing LLM-Based Agent Designs through Structural Context Modeling and Semantic Dynamics Analysis
通过结构上下文建模和语义动态分析形式化基于大型语言模型的智能体设计
Abstract
Current research on large language model (LLM) agents is fragmented: discussions of conceptual frameworks and methodological principles are frequently intertwined with low-level implementation details, causing both readers and authors to lose track amid a proliferation of superficially distinct concepts. We argue that this fragmentation largely stems from the absence of an analyzable, self-consistent formal model that enables implementation-independent characterization and comparison of LLM agents. To address this gap, we propose the \texttt{Structural Context Model}, a formal model for analyzing and comparing LLM agents from the perspective of context structure. Building upon this foundation, we introduce two complementary components that together span the full lifecycle of LLM agent research and development: (1) a declarative implementation framework; and (2) a sustainable agent engineering workflow, \texttt{Semantic Dynamics Analysis}. The proposed workflow provides principled insights into agent mechanisms and supports rapid, systematic design iteration. We demonstrate the effectiveness of the complete framework on dynamic variants of the monkey-banana problem, where agents engineered using our approach achieve up to a 32 percentage points improvement in success rate on the most challenging setting.
Chinese Translation
当前关于大型语言模型(LLM)智能体的研究较为零散:概念框架和方法原则的讨论常常与低层次的实现细节交织在一起,导致读者和作者在表面上看似不同的概念中迷失方向。我们认为,这种碎片化主要源于缺乏一个可分析、自洽的形式模型,使得能够以独立于实现的方式对LLM智能体进行特征描述和比较。为了解决这一问题,我们提出了\texttt{结构上下文模型},这是一个从上下文结构的角度分析和比较LLM智能体的形式模型。在此基础上,我们引入了两个互补的组成部分,涵盖了LLM智能体研究和开发的完整生命周期:(1)一个声明式实现框架;(2)一个可持续的智能体工程工作流程\texttt{语义动态分析}。所提出的工作流程为智能体机制提供了原则性见解,并支持快速、系统的设计迭代。我们在猴子-香蕉问题的动态变体上展示了完整框架的有效性,使用我们的方法工程化的智能体在最具挑战性的设置中成功率提高了多达32个百分点。
cs.AI / 71 / 2602.08295
The Vibe-Automation of Automation: A Proactive Education Framework for Computer Science in the Age of Generative AI
自动化的氛围-自动化:生成性人工智能时代计算机科学的前瞻性教育框架
Abstract
The emergence of generative artificial intelligence (GenAI) represents not an incremental technological advance but a qualitative epistemological shift that challenges foundational assumptions of computer science. Whereas machine learning has been described as the automation of automation, generative AI operates by navigating contextual, semantic, and stylistic coherence rather than optimizing predefined objective metrics. This paper introduces the concept of Vibe-Automation to characterize this transition. The central claim is that the significance of GenAI lies in its functional access to operationalized tacit regularities: context-sensitive patterns embedded in practice that cannot be fully specified through explicit algorithmic rules. Although generative systems do not possess tacit knowledge in a phenomenological sense, they operationalize sensitivities to tone, intent, and situated judgment encoded in high-dimensional latent representations. On this basis, the human role shifts from algorithmic problem specification toward Vibe-Engineering, understood as the orchestration of alignment and contextual judgment in generative systems. The paper connects this epistemological shift to educational and institutional transformation by proposing a conceptual framework structured across three analytical levels and three domains of action: faculty worldview, industry relations, and curriculum design. The risks of mode collapse and cultural homogenization are briefly discussed, emphasizing the need for deliberate engagement with generative systems to avoid regression toward synthetic uniformity.
Chinese Translation
生成性人工智能(GenAI)的出现并非一次渐进的技术进步,而是一次挑战计算机科学基础假设的质性认识论转变。机器学习曾被描述为“自动化的自动化”,而生成性人工智能则通过把握上下文、语义和风格的一致性来运作,而不是优化预定义的目标指标。本文引入了氛围-自动化(Vibe-Automation)的概念,以表征这一转变。中心论点是,生成性人工智能的意义在于其对操作化隐性规律的功能性访问:嵌入实践中的上下文敏感模式,这些模式无法通过明确的算法规则完全指定。尽管生成性系统在现象学意义上并不具备隐性知识,但它们将对语调、意图和情境判断的敏感性操作化于高维潜在表示之中。在此基础上,人类的角色从算法问题的规范转向氛围工程(Vibe-Engineering),即在生成性系统中协调对齐与上下文判断。本文将这一认识论转变与教育和制度变革联系起来,提出了一个跨越三个分析层面和三个行动领域的概念框架:教师世界观、行业关系和课程设计。本文简要讨论了模式崩溃和文化同质化的风险,强调需要有意识地参与生成性系统,以避免向合成统一性的倒退。
cs.AI / 72 / 2602.08311
Moral Sycophancy in Vision Language Models
视觉语言模型中的道德阿谀奉承
Abstract
Sycophancy in Vision-Language Models (VLMs) refers to their tendency to align with user opinions, often at the expense of moral or factual accuracy. While prior studies have explored sycophantic behavior in general contexts, its impact on morally grounded visual decision-making remains insufficiently understood. To address this gap, we present the first systematic study of moral sycophancy in VLMs, analyzing ten widely-used models on the Moralise and M^3oralBench datasets under explicit user disagreement. Our results reveal that VLMs frequently produce morally incorrect follow-up responses even when their initial judgments are correct, and exhibit a consistent asymmetry: models are more likely to shift from morally right to morally wrong judgments than the reverse when exposed to user-induced bias. Follow-up prompts generally degrade performance on Moralise, while yielding mixed or even improved accuracy on M^3oralBench, highlighting dataset-dependent differences in moral robustness. Evaluation using Error Introduction Rate (EIR) and Error Correction Rate (ECR) reveals a clear trade-off: models with stronger error-correction capabilities tend to introduce more reasoning errors, whereas more conservative models minimize errors but exhibit limited ability to self-correct. Finally, initial contexts with a morally right stance elicit stronger sycophantic behavior, emphasizing the vulnerability of VLMs to moral influence and the need for principled strategies to improve ethical consistency and robustness in multimodal AI systems.
Chinese Translation
视觉语言模型(VLMs)中的阿谀奉承指的是它们倾向于与用户意见保持一致,往往以牺牲道德或事实准确性为代价。尽管先前的研究已探讨了阿谀奉承在一般语境中的表现,但其对道德基础视觉决策的影响仍然了解不足。为了解决这一空白,我们首次系统性地研究了VLMs中的道德阿谀奉承,分析了在用户明确不同意的情况下,十个广泛使用的模型在Moralise和M^3oralBench数据集上的表现。我们的结果显示,VLMs经常在初始判断正确的情况下产生道德上不正确的后续回应,并表现出一致的不对称性:模型在受到用户偏见影响时,更可能从道德正确的判断转向道德错误的判断,而不是反向。后续提示通常会降低Moralise上的表现,而在M^3oralBench上则产生混合甚至改善的准确性,突显了道德稳健性在数据集间的差异。使用错误引入率(EIR)和错误修正率(ECR)进行评估揭示了明确的权衡:具有更强错误修正能力的模型往往会引入更多推理错误,而更保守的模型则最小化错误,但自我修正能力有限。最后,最初持有道德正确立场的上下文会引发更强的阿谀奉承行为,强调了VLMs对道德影响的脆弱性,以及在多模态人工智能系统中提高伦理一致性和稳健性的原则性策略的必要性。
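The two rates used in this abstract's evaluation can be computed from (initial, follow-up) correctness pairs. The exact definitions below (conditioning on initially-correct vs. initially-incorrect judgments) are our reading of the metric names, and the transition data is invented; the toy numbers also echo the asymmetry the abstract reports, with right-to-wrong flips outpacing corrections.

```python
# Computing Error Introduction Rate (EIR) and Error Correction Rate (ECR)
# from toy judgment transitions. Each record is
# (initial_correct, followup_correct); definitions and data are ours.

def eir(pairs):
    """EIR: fraction of initially-correct judgments that flip to incorrect
    after user disagreement."""
    correct_first = [p for p in pairs if p[0]]
    flipped = [p for p in correct_first if not p[1]]
    return len(flipped) / len(correct_first)

def ecr(pairs):
    """ECR: fraction of initially-incorrect judgments fixed in the follow-up."""
    wrong_first = [p for p in pairs if not p[0]]
    fixed = [p for p in wrong_first if p[1]]
    return len(fixed) / len(wrong_first)

pairs = [(True, True), (True, False), (True, False), (True, True),
         (False, True), (False, False), (False, False)]
print(eir(pairs), round(ecr(pairs), 3))  # 0.5 0.333
```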
cs.AI / 73 / 2602.08335
Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System
谁应得奖励?SHARP:基于Shapley信用的多智能体系统优化
Abstract
Integrating Large Language Models (LLMs) with external tools via multi-agent systems offers a promising new paradigm for decomposing and solving complex problems. However, training these systems remains notoriously difficult due to the credit assignment challenge, as it is often unclear which specific functional agent is responsible for the success or failure of decision trajectories. Existing methods typically rely on sparse or globally broadcast rewards, failing to capture individual contributions and leading to inefficient reinforcement learning. To address these limitations, we introduce the Shapley-based Hierarchical Attribution for Reinforcement Policy (SHARP), a novel framework for optimizing multi-agent reinforcement learning via precise credit attribution. SHARP effectively stabilizes training by normalizing agent-specific advantages across trajectory groups, primarily through a decomposed reward mechanism comprising a global broadcast-accuracy reward, a Shapley-based marginal-credit reward for each agent, and a tool-process reward to improve execution efficiency. Extensive experiments across various real-world benchmarks demonstrate that SHARP significantly outperforms recent state-of-the-art baselines, achieving average match improvements of 23.66% and 14.05% over single-agent and multi-agent approaches, respectively.
Chinese Translation
通过多智能体系统将大型语言模型(LLMs)与外部工具集成,为分解和解决复杂问题提供了一种有前景的新范式。然而,由于信用分配挑战,训练这些系统仍然极其困难,因为通常不清楚哪个特定功能智能体应对决策轨迹的成功或失败负责。现有方法通常依赖于稀疏或全局广播的奖励,未能捕捉个体贡献,导致强化学习效率低下。为了解决这些局限性,我们提出了基于Shapley的层次归因强化策略(SHARP),这是一个通过精确的信用归因来优化多智能体强化学习的新框架。SHARP通过在轨迹组之间规范化智能体特定的优势来有效稳定训练,其核心是一个分解的奖励机制,由全局广播准确性奖励、每个智能体的基于Shapley的边际信用奖励,以及用于提高执行效率的工具过程奖励组成。在各种真实世界基准上的广泛实验表明,SHARP显著优于最近的最先进基线,相较于单智能体和多智能体方法,分别实现了23.66%和14.05%的平均匹配提升。
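The Shapley-based marginal-credit reward can be grounded with an exact computation over a tiny agent set. The three agent roles and the coalition value function below are invented; for real trajectories SHARP would use the system's reward signal, and exact enumeration over all join orders is only feasible for small teams.

```python
# Exact Shapley values for a 3-agent system: average each agent's marginal
# contribution over all join orders. The value function is a made-up example,
# not the paper's trajectory reward.

from itertools import permutations

def shapley(agents, value):
    """Average each agent's marginal contribution over all join orders."""
    credit = {a: 0.0 for a in agents}
    orders = list(permutations(agents))
    for order in orders:
        coalition = set()
        for a in order:
            before = value(frozenset(coalition))
            coalition.add(a)
            credit[a] += value(frozenset(coalition)) - before
    return {a: c / len(orders) for a, c in credit.items()}

# Toy task: the "planner" is essential; "search" and "code" each add 2,
# but only once the planner is present.
def v(coalition):
    if "planner" not in coalition:
        return 0.0
    return 4.0 + 2.0 * ("search" in coalition) + 2.0 * ("code" in coalition)

credits = shapley(["planner", "search", "code"], v)
print(credits)
# The credits sum to the grand-coalition value v({planner, search, code}) = 8,
# so the per-agent rewards exactly account for the team's total return.
```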
cs.AI / 74 / 2602.08339
CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT
CoTZero:通过分层合成的 CoT 实现无注释的人类视觉推理
Abstract
Recent advances in vision-language models (VLMs) have markedly improved image-text alignment, yet they still fall short of human-like visual reasoning. A key limitation is that many VLMs rely on surface correlations rather than building logically coherent structured representations, which often leads to missed higher-level semantic structure and non-causal relational understanding, hindering compositional and verifiable reasoning. To address these limitations, we introduce human models into the reasoning process and propose CoTZero, an annotation-free paradigm with two components: (i) a dual-stage data synthesis approach and (ii) a cognition-aligned training method. In the first component, we draw inspiration from neurocognitive accounts of compositional productivity and global-to-local analysis. In the bottom-up stage, CoTZero extracts atomic visual primitives and incrementally composes them into diverse, structured question-reasoning forms. In the top-down stage, it enforces hierarchical reasoning by using coarse global structure to guide the interpretation of local details and causal relations. In the cognition-aligned training component, built on the synthesized CoT data, we introduce Cognitively Coherent Verifiable Rewards (CCVR) in Reinforcement Fine-Tuning (RFT) to further strengthen VLMs' hierarchical reasoning and generalization, providing stepwise feedback on reasoning coherence and factual correctness. Experiments show that CoTZero achieves an F1 score of 83.33 percent on our multi-level semantic inconsistency benchmark with lexical-perturbation negatives, across both in-domain and out-of-domain settings. Ablations confirm that each component contributes to more interpretable and human-aligned visual reasoning.
Chinese Translation
最近在视觉-语言模型(VLMs)方面的进展显著改善了图像与文本的对齐,但它们在类人视觉推理方面仍然存在不足。一个关键的限制是许多 VLMs 依赖于表面相关性,而不是构建逻辑上连贯的结构化表示,这常常导致对更高层次语义结构的遗漏以及非因果关系理解的缺失,从而阻碍了组合性和可验证推理。为了解决这些局限性,我们将人类模型引入推理过程,提出了 CoTZero,这是一种无注释的范式,包含两个组成部分:(i)双阶段数据合成方法和(ii)认知对齐训练方法。在第一个组成部分中,我们从关于组合能产性和全局到局部分析的神经认知理论中获得灵感。在自下而上的阶段,CoTZero 提取原子视觉原语,并将其逐步组合成多样化的结构化问题-推理形式。在自上而下的阶段,它通过使用粗略的全局结构来指导局部细节和因果关系的解释,从而强制进行分层推理。在基于合成 CoT 数据的认知对齐训练组件中,我们在强化微调(RFT)中引入了认知一致可验证奖励(CCVR),以进一步增强 VLMs 的分层推理和泛化能力,提供关于推理一致性和事实正确性的逐步反馈。实验表明,CoTZero 在我们的多层次语义不一致基准测试(包含词汇扰动负例)上,在领域内和领域外设置下均达到了 83.33% 的 F1 分数。消融实验确认了每个组件对更具可解释性和人类对齐的视觉推理的贡献。
cs.AI / 75 / 2602.08340
Effect-Level Validation for Causal Discovery
因果发现的效应水平验证
Abstract
Causal discovery is increasingly applied to large-scale telemetry data to estimate the effects of user-facing interventions, yet its reliability for decision-making in feedback-driven systems with strong self-selection remains unclear. In this paper, we propose an effect-centric, admissibility-first framework that treats discovered graphs as structural hypotheses and evaluates them by identifiability, stability, and falsification rather than by graph recovery accuracy alone. Empirically, we study the effect of early exposure to competitive gameplay on short-term retention using real-world game telemetry. We find that many statistically plausible discovery outputs do not admit point-identified causal queries once minimal temporal and semantic constraints are enforced, highlighting identifiability as a critical bottleneck for decision support. When identification is possible, several algorithm families converge to similar, decision-consistent effect estimates despite producing substantially different graph structures, including cases where the direct treatment-outcome edge is absent and the effect is preserved through indirect causal pathways. These converging estimates survive placebo, subsampling, and sensitivity refutation. In contrast, other methods exhibit sporadic admissibility and threshold-sensitive or attenuated effects due to endpoint ambiguity. These results suggest that graph-level metrics alone are inadequate proxies for causal reliability for a given target query. Therefore, trustworthy causal conclusions in telemetry-driven systems require prioritizing admissibility and effect-level validation over causal structural recovery alone.
Chinese Translation
因果发现越来越多地应用于大规模遥测数据,以估计面向用户的干预措施的效果,但其在具有强自我选择的反馈驱动系统中的决策可靠性仍不明确。本文提出了一种以效应为中心、优先考虑可接受性的框架,将发现的图视为结构假设,并通过可识别性、稳定性和可证伪性来评估它们,而不仅仅依赖于图恢复的准确性。我们使用真实世界的游戏遥测数据,实证研究了早期接触竞争性玩法对短期留存的影响。我们发现,许多统计上合理的发现输出在施加最小的时间和语义约束后并不允许点识别的因果查询,这突显了可识别性作为决策支持的关键瓶颈。当识别成为可能时,尽管产生了实质上不同的图结构,多个算法家族仍然收敛到相似的、与决策一致的效应估计,包括直接处理-结果边缺失且效应通过间接因果路径保留的情况。这些收敛的估计经受住了安慰剂、子抽样和敏感性反驳的考验。相比之下,其他方法由于端点模糊性表现出偶发的可接受性和阈值敏感或减弱的效应。这些结果表明,仅依靠图级指标不足以作为给定目标查询的因果可靠性的代理。因此,在遥测驱动的系统中,可信的因果结论需要优先考虑可接受性和效应水平验证,而不仅仅是因果结构恢复。
cs.AI / 76 / 2602.08344
OPE: Overcoming Information Saturation in Parallel Thinking via Outline-Guided Path Exploration
OPE:通过大纲引导的路径探索克服平行思维中的信息饱和
Abstract
Parallel thinking has emerged as a new paradigm for large reasoning models (LRMs) in tackling complex problems. Recent methods leverage Reinforcement Learning (RL) to enhance parallel thinking, aiming to address the limitations in computational resources and effectiveness encountered with supervised fine-tuning. However, most existing studies primarily focus on optimizing the aggregation phase, with limited attention to the path exploration stage. In this paper, we theoretically analyze the optimization of parallel thinking under the Reinforcement Learning with Verifiable Rewards (RLVR) setting, and identify that the mutual information bottleneck among exploration paths fundamentally restricts overall performance. To address this, we propose Outline-Guided Path Exploration (OPE), which explicitly partitions the solution space by generating diverse reasoning outlines prior to parallel path reasoning, thereby reducing information redundancy and improving the diversity of information captured across exploration paths. We implement OPE with an iterative RL strategy that optimizes outline planning and outline-guided reasoning independently. Extensive experiments across multiple challenging mathematical benchmarks demonstrate that OPE effectively improves reasoning performance in different aggregation strategies, enabling LRMs to more reliably discover correct solutions.
Chinese Translation
平行思维已成为大型推理模型(LRMs)解决复杂问题的新范式。近期的方法利用强化学习(RL)来增强平行思维,旨在解决在监督微调中遇到的计算资源和效果的限制。然而,大多数现有研究主要集中在优化聚合阶段,对路径探索阶段关注较少。在本文中,我们在可验证奖励的强化学习(RLVR)设置下对平行思维的优化进行了理论分析,并识别出探索路径之间的互信息瓶颈从根本上限制了整体性能。为了解决这一问题,我们提出了大纲引导的路径探索(OPE),通过在平行路径推理之前生成多样化的推理大纲,显式划分解空间,从而减少信息冗余,提高探索路径中捕获信息的多样性。我们通过一种迭代的强化学习策略实现了OPE,该策略独立优化大纲规划和大纲引导推理。在多个具有挑战性的数学基准上进行的广泛实验表明,OPE有效提高了不同聚合策略下的推理性能,使LRMs能够更可靠地发现正确的解决方案。
cs.AI / 77 / 2602.08353
Towards Better Evolution Modeling for Temporal Knowledge Graphs
朝着更好的时间知识图谱演化建模
Abstract
Temporal knowledge graphs (TKGs) structurally preserve evolving human knowledge. Recent research has focused on designing models to learn the evolutionary nature of TKGs to predict future facts, achieving impressive results, for instance, Hits@10 scores over 0.9 on the YAGO dataset. However, we find that existing benchmarks inadvertently introduce a shortcut: near state-of-the-art performance can be achieved simply by counting co-occurrences, without using any temporal information. In this work, we examine the root cause of this issue, identifying inherent biases in current datasets and an oversimplified evaluation task that these biases can exploit. Through this analysis, we further uncover additional limitations of existing benchmarks, including unreasonable formatting of time-interval knowledge, neglect of learning knowledge obsolescence, and insufficient information for precise evolution understanding, all of which can amplify the shortcut and hinder fair assessment. Therefore, we introduce the TKG evolution benchmark. It includes four bias-corrected datasets and two novel tasks closely aligned with the evolution process, promoting a more accurate understanding of the challenges in TKG evolution modeling. The benchmark is available at: https://github.com/zjs123/TKG-Benchmark.
Chinese Translation
时间知识图谱(TKGs)结构性地保留了不断发展的人类知识。近期研究集中于设计模型以学习TKGs的演化特性,从而预测未来事实,取得了令人印象深刻的结果。例如,在YAGO数据集上的Hits@10得分超过0.9。然而,我们发现现有基准不经意间引入了一个捷径。通过简单地计数共现,可以轻松达到接近最先进的性能,而无需使用任何时间信息。在本研究中,我们检查了这一问题的根本原因,识别出当前数据集中的固有偏差以及可以被这些偏差利用的过于简化的评估任务形式。通过这一分析,我们进一步揭示了现有基准的其他局限性,包括时间间隔知识的不合理格式、对知识过时学习的忽视以及对精确演化理解的信息不足,这些都可能放大捷径并妨碍公平评估。因此,我们引入了TKG演化基准。该基准包括四个经过偏差修正的数据集和两个与演化过程紧密相关的新任务,促进对TKG演化建模挑战的更准确理解。基准可在以下网址获取:https://github.com/zjs123/TKG-Benchmark。
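The co-occurrence shortcut the abstract describes is easy to reproduce in miniature: rank candidate tail entities purely by how often the (head, relation, tail) triple appeared in training, discarding timestamps entirely. The toy facts below are invented:

```python
from collections import Counter

# Sketch of the time-agnostic co-occurrence baseline: predict the tail
# by historical frequency alone, with no temporal reasoning at all.

train = [("A", "visits", "B", 1), ("A", "visits", "B", 2),
         ("A", "visits", "C", 3), ("D", "visits", "B", 4)]

# Drop the timestamp (last field) when counting.
counts = Counter((h, r, t) for h, r, t, _ in train)

def predict_tail(head, rel, candidates):
    # Most frequent historical tail wins.
    return max(candidates, key=lambda t: counts[(head, rel, t)])

print(predict_tail("A", "visits", ["B", "C"]))  # "B"
```

That such a counter can approach state-of-the-art Hits@10 on existing benchmarks is exactly the dataset bias the proposed benchmark corrects.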
cs.AI / 78 / 2602.08354
Does Your Reasoning Model Implicitly Know When to Stop Thinking?
你的推理模型是否隐含地知道何时停止思考?
Abstract
Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through Long Chains of Thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. In a further in-depth analysis of this phenomenon, we surprisingly uncover and empirically verify that LRMs implicitly know the appropriate time to stop thinking, while this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) enables SAGE-RL to effectively incorporate SAGE-discovered efficient reasoning patterns into standard pass@1 inference, markedly enhancing both the reasoning accuracy and efficiency of LRMs across multiple challenging mathematical benchmarks.
Chinese Translation
近期大型推理模型(LRMs)的进展显著提升了其在复杂推理任务中的能力,尤其是通过长链思维(Chains of Thought, CoTs)。然而,这种方法往往导致显著的冗余,损害了计算效率,并在实时应用中造成了显著的延迟。最近的研究表明,较长的推理链与正确性之间通常没有相关性,甚至可能对准确性产生负面影响。在对这一现象的进一步深入分析中,我们惊讶地发现并实证验证了LRMs隐含地知道何时是停止思考的适当时机,而这一能力在当前的采样范式中被掩盖。基于此,我们提出了SAGE(自我意识引导高效推理),一种新颖的采样范式,释放了这种高效推理的潜力。此外,将SAGE作为混合采样整合到基于群体的强化学习(SAGE-RL)中,使得SAGE-RL能够有效地将SAGE发现的高效推理模式融入标准的pass@1推理,显著提升了LRMs在多个具有挑战性的数学基准测试中的推理准确性和效率。
cs.AI / 79 / 2602.08362
Circuit Representations of Random Forests with Applications to XAI
随机森林的电路表示及其在可解释人工智能中的应用
Abstract
We make three contributions in this paper. First, we present an approach for compiling a random forest classifier into a set of circuits, where each circuit directly encodes the instances in some class of the classifier. We show empirically that our proposed approach is significantly more efficient than existing similar approaches. Next, we utilize this approach to further obtain circuits that are tractable for computing the complete and general reasons of a decision, which are instance abstractions that play a fundamental role in computing explanations. Finally, we propose algorithms for computing the robustness of a decision and all shortest ways to flip it. We illustrate the utility of our contributions by using them to enumerate all sufficient reasons, necessary reasons and contrastive explanations of decisions; to compute the robustness of decisions; and to identify all shortest ways to flip the decisions made by random forest classifiers learned from a wide range of datasets.
Chinese Translation
本文做出了三项贡献。首先,我们提出了一种将随机森林分类器编译为一组电路的方法,其中每个电路直接编码分类器某一类别中的实例。我们通过实证研究表明,我们提出的方法在效率上显著优于现有的类似方法。接下来,我们利用该方法进一步获得可用于计算决策完整和一般理由的电路,这些理由是计算解释中起基础性作用的实例抽象。最后,我们提出了计算决策鲁棒性及其所有最短翻转路径的算法。我们通过使用这些贡献来枚举所有充分理由、必要理由和对比解释;计算决策的鲁棒性;以及识别从各种数据集中学习的随机森林分类器所做决策的所有最短翻转路径,来展示我们贡献的实用性。
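The "shortest ways to flip a decision" query above can be illustrated by naive enumeration on a tiny hand-written forest over binary features. The stumps below are invented, and brute force stands in for the paper's circuit-based algorithms, which exist precisely because enumeration does not scale:

```python
from itertools import combinations

# Brute-force illustration of "shortest ways to flip" a random forest's
# decision on binary features. The forest is three hand-written stumps,
# not a learned model.

trees = [lambda x: x[0],            # stump on feature 0
         lambda x: x[1],            # stump on feature 1
         lambda x: x[0] and x[2]]   # conjunction stump

def forest(x):
    return sum(t(x) for t in trees) >= 2   # majority vote

def shortest_flips(x):
    base = forest(x)
    n = len(x)
    for size in range(1, n + 1):
        # All feature subsets of this size whose flip changes the vote.
        found = [c for c in combinations(range(n), size)
                 if forest([not v if i in c else v
                            for i, v in enumerate(x)]) != base]
        if found:
            return found   # smallest size wins: these are the shortest
    return []

print(shortest_flips([True, False, True]))  # [(0,), (2,)]
```

The minimum flip size also gives the robustness of the decision: here a single-feature change (feature 0 or feature 2) suffices to flip the vote.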
cs.AI / 80 / 2602.08369
MemAdapter: Fast Alignment across Agent Memory Paradigms via Generative Subgraph Retrieval
MemAdapter:通过生成子图检索实现代理记忆范式的快速对齐
Abstract
Memory mechanism is a core component of LLM-based agents, enabling reasoning and knowledge discovery over long-horizon contexts. Existing agent memory systems are typically designed within isolated paradigms (e.g., explicit, parametric, or latent memory) with tightly coupled retrieval methods that hinder cross-paradigm generalization and fusion. In this work, we take a first step toward unifying heterogeneous memory paradigms within a single memory system. We propose MemAdapter, a memory retrieval framework that enables fast alignment across agent memory paradigms. MemAdapter adopts a two-stage training strategy: (1) training a generative subgraph retriever from the unified memory space, and (2) adapting the retriever to unseen memory paradigms by training a lightweight alignment module through contrastive learning. This design improves the flexibility for memory retrieval and substantially reduces alignment cost across paradigms. Comprehensive experiments on three public evaluation benchmarks demonstrate that the generative subgraph retriever consistently outperforms five strong agent memory systems across three memory paradigms and agent model scales. Notably, MemAdapter completes cross-paradigm alignment within 13 minutes on a single GPU, achieving superior performance over original memory retrievers with less than 5% of training compute. Furthermore, MemAdapter enables effective zero-shot fusion across memory paradigms, highlighting its potential as a plug-and-play solution for agent memory systems.
Chinese Translation
记忆机制是基于大规模语言模型(LLM)的代理的核心组成部分,使其能够在长时间范围内进行推理和知识发现。现有的代理记忆系统通常在孤立的范式内设计(例如,显式、参数化或潜在记忆),并采用紧密耦合的检索方法,这阻碍了跨范式的泛化和融合。在本工作中,我们迈出了将异构记忆范式统一到单一记忆系统中的第一步。我们提出了MemAdapter,一个能够实现代理记忆范式快速对齐的记忆检索框架。MemAdapter采用两阶段训练策略:(1)从统一的记忆空间训练生成子图检索器;(2)通过对比学习训练轻量级对齐模块,将检索器适应于未见的记忆范式。该设计提高了记忆检索的灵活性,并显著降低了跨范式的对齐成本。在三个公共评估基准上的全面实验表明,生成子图检索器在三个记忆范式和代理模型规模上始终优于五个强大的代理记忆系统。值得注意的是,MemAdapter在单个GPU上完成跨范式对齐仅需13分钟,且在训练计算量不足5%的情况下实现了优于原始记忆检索器的性能。此外,MemAdapter还实现了跨记忆范式的有效零样本融合,突显了其作为代理记忆系统即插即用解决方案的潜力。
cs.AI / 81 / 2602.08373
Grounding Generative Planners in Verifiable Logic: A Hybrid Architecture for Trustworthy Embodied AI
将生成规划者与可验证逻辑相结合:可信赖的具身人工智能的混合架构
Abstract
Large Language Models (LLMs) show promise as planners for embodied AI, but their stochastic nature lacks formal reasoning, preventing strict safety guarantees for physical deployment. Current approaches often rely on unreliable LLMs for safety checks or simply reject unsafe plans without offering repairs. We introduce the Verifiable Iterative Refinement Framework (VIRF), a neuro-symbolic architecture that shifts the paradigm from passive safety gatekeeping to active collaboration. Our core contribution is a tutor-apprentice dialogue where a deterministic Logic Tutor, grounded in a formal safety ontology, provides causal and pedagogical feedback to an LLM planner. This enables intelligent plan repairs rather than mere avoidance. We also introduce a scalable knowledge acquisition pipeline that synthesizes safety knowledge bases from real-world documents, correcting blind spots in existing benchmarks. In challenging home safety tasks, VIRF achieves a perfect 0 percent Hazardous Action Rate (HAR) and a 77.3 percent Goal-Condition Rate (GCR), which is the highest among all baselines. It is highly efficient, requiring only 1.1 correction iterations on average. VIRF demonstrates a principled pathway toward building fundamentally trustworthy and verifiably safe embodied agents.
Chinese Translation
大型语言模型(LLMs)在具身人工智能的规划中展现出潜力,但其随机性缺乏正式推理,无法为物理部署提供严格的安全保障。目前的方法通常依赖于不可靠的LLMs进行安全检查,或者简单地拒绝不安全的计划而不提供修复方案。我们引入了可验证的迭代精炼框架(Verifiable Iterative Refinement Framework, VIRF),这是一种神经符号架构,将范式从被动的安全把关转变为主动协作。我们的核心贡献是一个导师-学徒对话,其中一个基于正式安全本体的确定性逻辑导师为LLM规划者提供因果和教学反馈。这使得智能计划修复成为可能,而不仅仅是避免不安全的计划。我们还引入了一个可扩展的知识获取管道,从现实世界文档中合成安全知识库,纠正现有基准中的盲点。在具有挑战性的家庭安全任务中,VIRF实现了0%的危险行为率(Hazardous Action Rate, HAR)和77.3%的目标条件率(Goal-Condition Rate, GCR),在所有基线中均为最高。它的效率极高,平均只需1.1次修正迭代。VIRF展示了一条建立根本可信且可验证安全的具身代理的原则性路径。
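The tutor-apprentice loop above can be sketched with a rule-based checker that vetoes unsafe steps and returns an explanation the planner uses to repair its plan. The rule, the plan, and the naive repair policy are all invented for illustration; VIRF's tutor is grounded in a formal safety ontology and its apprentice is an LLM:

```python
# Minimal verify-then-repair loop in the spirit of the abstract.

RULES = {"leave stove on unattended": "turn off the stove first"}

def check(plan):
    # Deterministic "tutor": veto any step matching a safety rule and
    # explain why, instead of silently rejecting the plan.
    for step in plan:
        if step in RULES:
            return False, f"unsafe step '{step}': {RULES[step]}"
    return True, "ok"

def repair(plan, feedback):
    # Naive "apprentice" repair: prepend the suggested fix, drop the
    # offending step. A real planner would regenerate the plan.
    fix = feedback.split(": ")[-1]
    return [fix] + [s for s in plan if s not in RULES]

plan = ["boil water", "leave stove on unattended"]
ok, feedback = check(plan)
iterations = 0
while not ok:
    plan = repair(plan, feedback)
    ok, feedback = check(plan)
    iterations += 1
print(plan, iterations)
```

The loop converges in one iteration here, loosely mirroring the 1.1 average correction iterations reported above.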
cs.AI / 82 / 2602.08400
SCOUT-RAG: Scalable and Cost-Efficient Unifying Traversal for Agentic Graph-RAG over Distributed Domains
SCOUT-RAG:可扩展且成本高效的统一遍历用于分布式领域中的代理图-RAG
Abstract
Graph-RAG improves LLM reasoning using structured knowledge, yet conventional designs rely on a centralized knowledge graph. In distributed and access-restricted settings (e.g., hospitals or multinational organizations), retrieval must select relevant domains and appropriate traversal depth without global graph visibility or exhaustive querying. To address this challenge, we introduce \textbf{SCOUT-RAG} (\textit{\underline{S}calable and \underline{CO}st-efficient \underline{U}nifying \underline{T}raversal}), a distributed agentic Graph-RAG framework that performs progressive cross-domain retrieval guided by incremental utility goals. SCOUT-RAG employs four cooperative agents that: (i) estimate domain relevance, (ii) decide when to expand retrieval to additional domains, (iii) adapt traversal depth to avoid unnecessary graph exploration, and (iv) synthesize the high-quality answers. The framework is designed to minimize retrieval regret, defined as missing useful domain information, while controlling latency and API cost. Across multi-domain knowledge settings, SCOUT-RAG achieves performance comparable to centralized baselines, including DRIFT and exhaustive domain traversal, while substantially reducing cross-domain calls, total tokens processed, and latency.
Chinese Translation
Graph-RAG通过结构化知识改善了大语言模型(LLM)的推理能力,但传统设计依赖于集中式知识图谱。在分布式和访问受限的环境中(例如医院或跨国组织),检索必须在没有全局图谱可见性或全面查询的情况下选择相关领域和适当的遍历深度。为了解决这一挑战,我们提出了SCOUT-RAG(Scalable and COst-efficient Unifying Traversal),这是一个分布式代理图-RAG框架,能够在增量效用目标的引导下进行渐进式跨领域检索。SCOUT-RAG采用四个协作代理:(i) 评估领域相关性,(ii) 决定何时扩展检索到其他领域,(iii) 调整遍历深度以避免不必要的图谱探索,以及 (iv) 综合高质量答案。该框架旨在最小化检索遗憾,即错过有用领域信息,同时控制延迟和API成本。在多领域知识环境中,SCOUT-RAG的性能可与集中式基线(包括DRIFT和全面领域遍历)相媲美,同时显著减少跨领域调用、处理的总令牌数和延迟。
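The "expand to another domain only while it pays off" decision above can be sketched as a greedy loop over relevance-ranked domains with a marginal-utility stopping rule. The relevance scores, utility gains, and threshold below are invented numbers, not SCOUT-RAG's learned estimates:

```python
# Greedy sketch of utility-guided progressive domain expansion.

def progressive_retrieve(relevance, utility_of, min_gain=0.05):
    # Visit domains in order of estimated relevance.
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    chosen, total = [], 0.0
    for d in ranked:
        gain = utility_of(d, chosen)
        if gain < min_gain:
            break  # stop before paying for a low-value domain call
        chosen.append(d)
        total += gain
    return chosen, total

relevance = {"cardiology": 0.9, "radiology": 0.6, "billing": 0.1}
gains = {"cardiology": 0.5, "radiology": 0.2, "billing": 0.01}
chosen, total = progressive_retrieve(relevance, lambda d, _: gains[d])
print(chosen)  # ['cardiology', 'radiology']
```

Stopping before the "billing" call is what trades a small retrieval regret for fewer cross-domain calls, tokens, and latency.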
cs.AI / 83 / 2602.08401
On Protecting Agentic Systems' Intellectual Property via Watermarking
通过水印技术保护自主系统的知识产权
Abstract
The evolution of Large Language Models (LLMs) into agentic systems that perform autonomous reasoning and tool use has created significant intellectual property (IP) value. We demonstrate that these systems are highly vulnerable to imitation attacks, where adversaries steal proprietary capabilities by training imitation models on victim outputs. Crucially, existing LLM watermarking techniques fail in this domain because real-world agentic systems often operate as grey boxes, concealing the internal reasoning traces required for verification. This paper presents AGENTWM, the first watermarking framework designed specifically for agentic models. AGENTWM exploits the semantic equivalence of action sequences, injecting watermarks by subtly biasing the distribution of functionally identical tool execution paths. This mechanism allows AGENTWM to embed verifiable signals directly into the visible action trajectory while remaining indistinguishable to users. We develop an automated pipeline to generate robust watermark schemes and a rigorous statistical hypothesis testing procedure for verification. Extensive evaluations across three complex domains demonstrate that AGENTWM achieves high detection accuracy with negligible impact on agent performance. Our results confirm that AGENTWM effectively protects agentic IP against adaptive adversaries, who cannot remove the watermarks without severely degrading the stolen model's utility.
Chinese Translation
大型语言模型(LLMs)演变为能够进行自主推理和工具使用的自主系统,创造了显著的知识产权(IP)价值。我们展示了这些系统在模仿攻击中高度脆弱,攻击者通过在受害者输出上训练模仿模型来窃取专有能力。至关重要的是,现有的LLM水印技术在这一领域失效,因为现实世界中的自主系统通常作为灰箱操作,隐藏了验证所需的内部推理痕迹。本文提出了AGENTWM,这是第一个专门为自主模型设计的水印框架。AGENTWM利用动作序列的语义等价性,通过微妙地偏向功能相同的工具执行路径的分布来注入水印。该机制使AGENTWM能够将可验证信号直接嵌入可见的动作轨迹中,同时对用户保持不可区分。我们开发了一个自动化流程来生成稳健的水印方案,并建立了严格的统计假设检验程序以进行验证。在三个复杂领域的广泛评估中,AGENTWM实现了高检测准确率,对自主系统性能的影响微乎其微。我们的结果确认AGENTWM有效地保护自主知识产权,抵御适应性对手,这些对手无法在不严重降低被窃取模型效用的情况下去除水印。
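The core mechanism above, biasing the choice among functionally equivalent tool paths and verifying the bias statistically, can be sketched in a few lines. The bias level, the two-path setting, and the z-test form below are our simplifying assumptions, not AGENTWM's actual scheme or test:

```python
import math
import random

# Toy watermark: among two equivalent tool paths, a watermarked agent
# picks path "a" with probability p > 0.5; a clean agent picks 50/50.

def choose_path(rng, watermarked, p=0.8):
    return "a" if rng.random() < (p if watermarked else 0.5) else "b"

def detect(choices, p0=0.5, z_crit=3.0):
    # One-sided z-test against the null of unbiased path choice.
    n = len(choices)
    k = sum(1 for c in choices if c == "a")
    z = (k - n * p0) / math.sqrt(n * p0 * (1 - p0))
    return z > z_crit

rng = random.Random(0)
marked = [choose_path(rng, True) for _ in range(500)]
clean = [choose_path(rng, False) for _ in range(500)]
print(detect(marked), detect(clean))  # True False
```

Because every biased choice is still a valid execution path, task behavior is unchanged, which is why removing the signal without degrading the stolen model is hard.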
cs.AI / 84 / 2602.08412
From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw for Personalized Local AI Agent
从助手到双重间谍:对个性化本地人工智能代理的OpenClaw攻击的形式化与基准测试
Abstract
Although large language model (LLM)-based agents, exemplified by OpenClaw, are increasingly evolving from task-oriented systems into personalized AI assistants for solving complex real-world tasks, their practical deployment also introduces severe security risks. However, existing agent security research and evaluation frameworks primarily focus on synthetic or task-centric settings, and thus fail to accurately capture the attack surface and risk propagation mechanisms of personalized agents in real-world deployments. To address this gap, we propose Personalized Agent Security Bench (PASB), an end-to-end security evaluation framework tailored for real-world personalized agents. Building upon existing agent attack paradigms, PASB incorporates personalized usage scenarios, realistic toolchains, and long-horizon interactions, enabling black-box, end-to-end security evaluation on real systems. Using OpenClaw as a representative case study, we systematically evaluate its security across multiple personalized scenarios, tool capabilities, and attack types. Our results indicate that OpenClaw exhibits critical vulnerabilities at different execution stages, including user prompt processing, tool usage, and memory retrieval, highlighting substantial security risks in personalized agent deployments. The code for the proposed PASB framework is available at https://github.com/AstorYH/PASB.
Chinese Translation
尽管以大型语言模型(LLM)为基础的代理(如OpenClaw)正日益从任务导向系统演变为解决复杂现实任务的个性化人工智能助手,但它们的实际部署也带来了严重的安全风险。然而,现有的代理安全研究和评估框架主要集中在合成或任务中心的环境中,因此未能准确捕捉个性化代理在现实部署中的攻击面和风险传播机制。为了解决这一问题,我们提出了个性化代理安全基准(Personalized Agent Security Bench,PASB),这是一个针对现实个性化代理量身定制的端到端安全评估框架。PASB建立在现有代理攻击范式的基础上,结合个性化使用场景、现实工具链和长期交互,能够在真实系统上进行黑箱的端到端安全评估。以OpenClaw为代表案例,我们系统地评估了其在多种个性化场景、工具能力和攻击类型下的安全性。我们的结果表明,OpenClaw在不同执行阶段(包括用户提示处理、工具使用和记忆检索)存在关键漏洞,突显了个性化代理部署中的重大安全风险。所提出的PASB框架的代码可在https://github.com/AstorYH/PASB获取。
cs.AI / 85 / 2602.08449
When Evaluation Becomes a Side Channel: Regime Leakage and Structural Mitigations for Alignment Assessment
当评估成为侧信道:对齐评估中的制度泄漏与结构性缓解措施
Abstract
Safety evaluation for advanced AI systems implicitly assumes that behavior observed under evaluation is predictive of behavior in deployment. This assumption becomes fragile for agents with situational awareness, which may exploit regime leakage (informational cues distinguishing evaluation from deployment) to implement conditional policies such as sycophancy and sleeper agents, which preserve compliance under oversight while defecting in deployment-like regimes. We reframe alignment evaluation as a problem of information flow under partial observability. Within this framework, we show that divergence between evaluation-time and deployment-time behavior is bounded by the mutual information between internal representations and the regime variable. Motivated by this result, we study regime-blind mechanisms: training-time interventions that reduce the extractability of regime information at decision-relevant internal representations via adversarial invariance. We evaluate this approach on a base, open-weight language model across two fully characterized failure modes: scientific sycophancy and temporal sleeper agents. Regime-blind training suppresses regime-conditioned behavior in both evaluated cases without measurable loss of task utility, but with qualitatively different dynamics: sycophancy exhibits a sharp representational and behavioral transition at low intervention strength, whereas sleeper-agent behavior requires substantially stronger pressure and does not exhibit a clean collapse of regime decodability. These results demonstrate that representational invariance is a meaningful but fundamentally limited control lever, whose effectiveness depends on how regime information is embedded in the policy. We argue that behavioral evaluation should be complemented with white-box diagnostics of regime awareness and information flow.
Chinese Translation
对先进人工智能系统的安全评估隐含地假设,在评估中观察到的行为能够预测在实际部署中的行为。这一假设在具有情境意识的代理中变得脆弱,因为它们可能利用制度泄漏(区分评估与部署的信息线索)来实施条件政策,例如谄媚和卧底代理,这些政策在监督下保持合规,而在类似部署的环境中则表现出背叛。我们将对齐评估重新框定为一个在部分可观察性下的信息流问题。在这一框架内,我们展示了评估时和部署时行为之间的差异受到内部表征与制度变量之间互信息的限制。基于这一结果,我们研究了制度盲(regime-blind)机制:在训练阶段进行的干预,通过对抗不变性减少决策相关内部表征中制度信息的可提取性。我们在一个基础的、开放权重的语言模型上评估了这一方法,针对两种完全特征化的失败模式:科学谄媚和时间卧底代理。制度盲训练在两个评估案例中抑制了制度条件行为,而没有可测量的任务效用损失,但表现出质的不同动态:谄媚在低干预强度下表现出明显的表征和行为转变,而卧底代理行为则需要显著更强的压力,并且没有表现出制度可解码性的干净崩溃。这些结果表明,表征不变性是一个有意义但根本有限的控制杠杆,其有效性取决于制度信息在政策中的嵌入方式。我们认为,行为评估应与对制度意识和信息流的白盒诊断相辅相成。
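The quantity at the heart of the bound above, the mutual information between internal representations and the regime variable, can be estimated with a simple plug-in estimator on discretized data. The synthetic data below is fully regime-revealing by construction, an assumption for illustration:

```python
import math
from collections import Counter

# Plug-in estimate of I(Z; R) between a discretized representation Z
# and a binary regime flag R, the quantity the abstract says bounds
# eval/deploy behavioral divergence.

def mutual_information(zs, rs):
    n = len(zs)
    pz, pr, pzr = Counter(zs), Counter(rs), Counter(zip(zs, rs))
    return sum((c / n) * math.log2((c / n) / ((pz[z] / n) * (pr[r] / n)))
               for (z, r), c in pzr.items())

# Representation copies the regime exactly -> 1 bit of leakage.
zs = [0, 0, 1, 1] * 25
rs = ["eval", "eval", "deploy", "deploy"] * 25
print(round(mutual_information(zs, rs), 3))  # 1.0
```

Regime-blind training, in these terms, is adversarial pressure that pushes this estimate toward zero at decision-relevant layers.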
cs.AI / 86 / 2602.08517
TreeTensor: Boost AI System on Nested Data with Constrained Tree-Like Tensor
TreeTensor:通过受限树状张量提升嵌套数据的人工智能系统
Abstract
Tensor is the most basic and essential data structure of today's artificial intelligence (AI) systems. The natural properties of Tensors, especially memory continuity and slice independence, make it feasible for training systems to leverage parallel computing units such as GPUs to process data simultaneously along batch, spatial, or temporal dimensions. However, looking beyond perception tasks, the data in a complicated cognitive AI system usually has hierarchical structures (i.e., nested data) spanning various modalities, which are inconvenient and inefficient to program directly with conventional fixed-shape Tensors. To address this issue, we summarize two main computational patterns of nested data and propose a general nested data container: TreeTensor. Through various constraints and magic utilities of TreeTensor, one can apply arbitrary functions and operations to nested data at almost zero cost, including operations from popular machine learning libraries such as Scikit-Learn, NumPy, and PyTorch. Our approach adopts a constrained tree-structure perspective to systematically model data relationships, and it can easily be combined with other methods to support further uses, such as asynchronous execution and variable-length data computation. Detailed examples and benchmarks show that TreeTensor not only provides strong usability across a range of problems, including one of the most complicated AI systems at present (AlphaStar for StarCraft II), but also exhibits excellent runtime efficiency without overhead. Our project is available at https://github.com/opendilab/DI-treetensor.
Chinese Translation
张量是当今人工智能(AI)系统中最基本和最重要的数据结构。张量的自然属性,特别是内存连续性和切片独立性,使得训练系统能够利用并行计算单元(如GPU)在批量、空间或时间维度上同时处理数据。然而,如果我们超越感知任务,复杂认知AI系统中的数据通常具有层次结构(即嵌套数据)和多种模态。使用固定形状的传统张量直接编程会显得不便且效率低下。为了解决这个问题,我们总结了嵌套数据的两种主要计算模式,并提出了一种通用的嵌套数据容器:TreeTensor。通过TreeTensor的各种约束和魔法工具,几乎可以零成本地对嵌套数据应用任意函数和操作,包括一些著名的机器学习库,如Scikit-Learn、Numpy和PyTorch。我们的方法利用受限树结构的视角系统地建模数据关系,并且可以轻松与其他方法结合,以扩展更多的应用,如异步执行和可变长度数据计算。详细的示例和基准测试表明,TreeTensor不仅在各种问题中提供了强大的可用性,尤其是在当前最复杂的AI系统之一:StarCraft II的AlphaStar中,而且在运行时效率上表现出色,没有任何开销。我们的项目可在 https://github.com/opendilab/DI-treetensor 获取。
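The "apply a function across nested data" pattern that TreeTensor generalizes can be illustrated with a small recursive helper. This is our own sketch over plain Python containers, not the library's API (see the linked repo for the real one):

```python
# Recursively apply a function to every leaf of a nested structure,
# preserving the dict/list/tuple shape.

def tree_map(fn, data):
    if isinstance(data, dict):
        return {k: tree_map(fn, v) for k, v in data.items()}
    if isinstance(data, (list, tuple)):
        return type(data)(tree_map(fn, v) for v in data)
    return fn(data)  # leaf value

obs = {"image": [1, 2], "state": {"pos": 3, "vel": (4, 5)}}
print(tree_map(lambda x: x * 10, obs))
```

With Tensor leaves instead of integers, the same traversal is what lets one operation fan out over a whole heterogeneous observation in a single call.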
cs.AI / 87 / 2602.08520
Reinforcement Inference: Leveraging Uncertainty for Self-Correcting Language Model Reasoning
强化推理:利用不确定性进行自我修正的语言模型推理
Abstract
Modern large language models (LLMs) are often evaluated and deployed under a one-shot, greedy inference protocol, especially in professional settings that require deterministic behavior. This regime can systematically underestimate a fixed model's true capability: many errors arise not from missing knowledge, but from premature commitment under internal ambiguity. We introduce Reinforcement Inference, an entropy-aware inference-time control strategy that uses the model's own uncertainty to selectively invoke a second, more deliberate reasoning attempt, enabling stronger performance without any retraining. On 12,032 MMLU-Pro questions across 14 subjects, using DeepSeek-v3.2 with deterministic decoding in a zero-shot setting, Reinforcement Inference improves accuracy from 60.72% to 84.03%, while only incurring 61.06% additional inference calls. A 100% re-asking ablation reaches 84.35%, indicating that uncertainty-aware selection captures most of the attainable improvement with substantially less compute. Moreover, a prompt-only ablation underperforms the baseline, suggesting that the gains are not explained by generic "your output had high entropy, think step-by-step" prompting alone. Beyond providing a practical inference-time upgrade, our results suggest a broader entropy-aware paradigm for measuring and expanding model capability: because modern decoder-based models generate outputs autoregressively, entropy and related confidence measures arise naturally as first-class control signals during generation. The resulting gap between one-pass greedy inference and uncertainty-conditioned deliberation offers a diagnostic lens on an LLM's latent reasoning horizon and motivates future training objectives that explicitly constrain correctness-confidence alignment.
Chinese Translation
现代大型语言模型(LLMs)通常在“一次性、贪婪”的推理协议下进行评估和部署,特别是在需要确定性行为的专业环境中。这种模式可能系统性地低估了固定模型的真实能力:许多错误并非源于知识缺失,而是由于在内部模糊性下的过早承诺。我们提出了强化推理,这是一种基于熵的推理时控制策略,利用模型自身的不确定性选择性地调用第二次更为深思熟虑的推理尝试,从而在无需任何再训练的情况下实现更强的性能。在14个学科的12,032个MMLU-Pro问题上,使用DeepSeek-v3.2在零样本设置下进行确定性解码,强化推理将准确率从60.72%提高到84.03%,同时仅增加了61.06%的推理调用。100%的重新提问消融实验达到了84.35%,表明基于不确定性的选择以显著更少的计算量捕捉到了大部分可获得的改进。此外,仅提示的消融实验表现不及基线,表明这些提升并不能仅仅通过“你的输出具有高熵,逐步思考”的通用提示来解释。除了提供一种实用的推理时升级外,我们的结果还暗示了一种更广泛的基于熵的范式,用于衡量和扩展模型能力:由于现代基于解码器的模型是自回归生成输出,熵和相关的置信度度量自然作为生成过程中的一等(first-class)控制信号出现。一遍贪婪推理与基于不确定性的深思熟虑之间的差距为LLM的潜在推理视野提供了诊断视角,并激励未来的训练目标明确约束正确性与置信度的对齐。
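The entropy-gated control loop above can be sketched with the model calls stubbed out: answer once greedily, and only when the answer distribution's entropy exceeds a threshold spend a second, more deliberate attempt. The distributions, threshold, and second-pass stub below are invented for illustration:

```python
import math

# Entropy gate over a first-pass answer distribution.

def entropy(probs):
    # Shannon entropy in bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def answer_with_gate(first_pass, threshold=0.9):
    answer, probs = first_pass
    if entropy(probs) <= threshold:
        return answer, 1               # confident: keep the one-shot answer
    return "deliberate:" + answer, 2   # uncertain: pay for a second pass

confident = ("B", [0.95, 0.03, 0.02])
uncertain = ("C", [0.4, 0.35, 0.25])
print(answer_with_gate(confident))  # ('B', 1)
print(answer_with_gate(uncertain))  # ('deliberate:C', 2)
```

Gating on entropy is what limits the extra cost to the reported 61.06% additional calls rather than the 100% of unconditional re-asking.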
cs.AI / 88 / 2602.08533
Dialogue Model Optimization via Agent Game and Adaptive Tree-based GRPO
通过代理博弈和自适应树基GRPO优化对话模型
Abstract
Open-ended dialogue agents aim to deliver engaging, personalized interactions by adapting to users' traits, but existing methods face critical limitations: over-reliance on pre-collected user data, and short-horizon biases in reinforcement learning (RL) that neglect long-term dialogue value. To address these, we propose a novel long-horizon RL framework integrating online personalization with Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO). Adopting a two-agent game paradigm, a user agent constructs dynamic environments via style mimicry (learning user-specific conversational traits) and active termination (predicting turn-level termination probabilities as immediate rewards), forming an iterative cycle that drives the dialogue agent to deepen interest exploration. AT-GRPO reinterprets dialogue trajectories as trees and introduces adaptive observation ranges. Unlike full tree expansion that incurs exponential overhead, it limits each node to aggregate rewards from a stage-aware range: larger ranges support early-stage topic exploration, while smaller ranges facilitate late-stage dialogue maintenance. This design reduces rollout budgets from exponential to polynomial in the dialogue length, while preserving long-term reward capture. Extensive experiments show our framework's superior performance, sample efficiency, and robustness.
Chinese Translation
开放式对话代理旨在通过适应用户特征提供引人入胜的个性化互动,但现有方法面临关键限制:过度依赖预先收集的用户数据,以及强化学习(RL)中的短期偏见,忽视了对话的长期价值。为了解决这些问题,我们提出了一种新颖的长时域强化学习框架,结合了在线个性化和自适应树基群体相对策略优化(AT-GRPO)。采用双代理博弈范式,用户代理通过风格模仿(学习用户特定的对话特征)和主动终止(预测回合级终止概率作为即时奖励)构建动态环境,形成一个推动对话代理深入兴趣探索的迭代循环。AT-GRPO将对话轨迹重新解释为树,并引入自适应观察范围。与导致指数开销的完全树扩展不同,它限制每个节点从阶段感知范围内聚合奖励:较大的范围支持早期主题探索,而较小的范围则促进后期对话维护。这一设计将推演(rollout)预算从随对话长度呈指数级增长降低到多项式级,同时保留对长期奖励的捕获。大量实验表明,我们的框架在性能、样本效率和鲁棒性方面优于其他方法。
cs.AI / 89 / 2602.08586
PRISM: A Principled Framework for Multi-Agent Reasoning via Gain Decomposition
PRISM:通过增益分解实现多智能体推理的原则框架
Abstract
Multi-agent collaboration has emerged as a promising paradigm for enhancing reasoning capabilities of Large Language Models (LLMs). However, existing approaches remain largely heuristic, lacking principled guidance on what drives performance gains and how to systematically optimize multi-agent reasoning. Specifically, it remains unclear why multi-agent collaboration outperforms single-agent reasoning and which design choices contribute most to these gains, making it difficult to build better systems. We address this gap by introducing a unified theoretical framework that decomposes multi-agent reasoning gains into three conceptually independent dimensions: Exploration for diverse solution coverage, Information for high-fidelity feedback, and Aggregation for principled consensus. Through this lens, existing methods can be understood as special cases that optimize only subsets of these dimensions. Building upon this decomposition, a novel framework called PRISM (Propose-Review-Integrate Synthesis for Multi-agent Reasoning) is proposed, which jointly maximizes all three dimensions through role-based diversity, execution-grounded feedback with evidence-based cross-evaluation, and iterative synthesis with closed-loop validation. Extensive experiments across mathematical reasoning, code generation, and function calling benchmarks demonstrate that PRISM achieves state-of-the-art performance with superior compute-efficiency compared to methods optimizing partial dimensions. The theoretical framework provides actionable design principles for future multi-agent reasoning systems.
Chinese Translation
多智能体协作已成为增强大型语言模型(LLMs)推理能力的有前景的范式。然而,现有的方法大多是启发式的,缺乏关于性能提升驱动因素的原则性指导,以及如何系统地优化多智能体推理的策略。具体而言,尚不清楚为什么多智能体协作优于单智能体推理,以及哪些设计选择对这些增益贡献最大,这使得构建更好的系统变得困难。我们通过引入一个统一的理论框架来填补这一空白,该框架将多智能体推理增益分解为三个概念上独立的维度:探索以实现多样化解决方案覆盖、信息以获得高保真反馈,以及聚合以实现原则性共识。通过这一视角,现有方法可以被理解为仅优化这些维度的子集的特例。在此分解的基础上,提出了一种新颖的框架,称为PRISM(Propose-Review-Integrate Synthesis for Multi-agent Reasoning),该框架通过基于角色的多样性、基于证据的交叉评估的执行基础反馈,以及带有闭环验证的迭代综合,联合最大化这三个维度。在数学推理、代码生成和函数调用基准测试中的广泛实验表明,PRISM在计算效率上优于优化部分维度的方法,达到了最先进的性能。该理论框架为未来多智能体推理系统提供了可操作的设计原则。
cs.AI / 90 / 2602.08597
An Attention Mechanism for Robust Multimodal Integration in a Global Workspace Architecture
一种用于全球工作空间架构中稳健多模态融合的注意力机制
Abstract
Global Workspace Theory (GWT), inspired by cognitive neuroscience, posits that flexible cognition could arise via the attentional selection of a relevant subset of modalities within a multimodal integration system. This cognitive framework can inspire novel computational architectures for multimodal integration. Indeed, recent implementations of GWT have explored its multimodal representation capabilities, but the related attention mechanisms remain understudied. Here, we propose and evaluate a top-down attention mechanism to select modalities inside a global workspace. First, we demonstrate that our attention mechanism improves noise robustness of a global workspace system on two multimodal datasets of increasing complexity: Simple Shapes and MM-IMDb 1.0. Second, we highlight various cross-task and cross-modality generalization capabilities that are not shared by multimodal attention models from the literature. Comparing against existing baselines on the MM-IMDb 1.0 benchmark, we find our attention mechanism makes the global workspace competitive with the state of the art.
Chinese Translation
全球工作空间理论(Global Workspace Theory, GWT)受认知神经科学的启发,认为灵活的认知可以通过在多模态融合系统中对相关模态子集的注意力选择而产生。该认知框架可以激发新颖的多模态融合计算架构。实际上,最近对GWT的实现探讨了其多模态表示能力,但相关的注意力机制仍然研究不足。在此,我们提出并评估了一种自上而下的注意力机制,以选择全球工作空间中的模态。首先,我们展示了我们的注意力机制在两个复杂性逐渐增加的多模态数据集(Simple Shapes和MM-IMDb 1.0)上提高了全球工作空间系统的抗噪声能力。其次,我们强调了各种跨任务和跨模态的泛化能力,这些能力是文献中的多模态注意力模型所不具备的。与MM-IMDb 1.0基准上的现有基线进行比较,我们发现我们的注意力机制使全球工作空间达到了与最先进技术相竞争的水平。
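The top-down selection described in this abstract can be pictured as query-conditioned softmax attention over modality embeddings. The sketch below is an illustrative reading of that idea, not the paper's architecture; the function name, the dot-product scoring, and the temperature parameter are all assumptions.

```python
import numpy as np

def top_down_attention(query, modalities, temperature=1.0):
    """Score each modality embedding against a task query, softmax the
    scores, and return the attention-weighted fusion (hypothetical sketch,
    not the paper's global-workspace implementation).

    query:      (d,) task-conditioned query vector
    modalities: dict name -> (d,) modality embedding
    """
    names = list(modalities)
    M = np.stack([modalities[n] for n in names])            # (k, d)
    scores = M @ query / (np.sqrt(M.shape[1]) * temperature)
    w = np.exp(scores - scores.max())
    w = w / w.sum()                                         # softmax weights
    fused = w @ M                                           # (d,) fused vector
    return fused, dict(zip(names, w))

# deterministic toy: "text" is aligned with the query, "vision" orthogonal
q = np.ones(4)
mods = {"vision": np.array([1.0, -1.0, 1.0, -1.0]),
        "text": np.ones(4)}
fused, weights = top_down_attention(q, mods)
# the modality aligned with the query receives most of the attention weight
```

A noisy modality would simply receive a low score under the query, which is one way to read the noise-robustness result reported above.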
cs.AI / 91 / 2602.08603
OSCAR: Optimization-Steered Agentic Planning for Composed Image Retrieval
OSCAR:面向复合图像检索的优化引导代理规划
Abstract
Composed image retrieval (CIR) requires complex reasoning over heterogeneous visual and textual constraints. Existing approaches largely fall into two paradigms: unified embedding retrieval, which suffers from single-model myopia, and heuristic agentic retrieval, which is limited by suboptimal, trial-and-error orchestration. To this end, we propose OSCAR, an optimization-steered agentic planning framework for composed image retrieval. We are the first to reformulate agentic CIR from a heuristic search process into a principled trajectory optimization problem. Instead of relying on heuristic trial-and-error exploration, OSCAR employs a novel offline-online paradigm. In the offline phase, we model CIR via atomic retrieval selection and composition as a two-stage mixed-integer programming problem, mathematically deriving optimal trajectories that maximize ground-truth coverage for training samples via rigorous boolean set operations. These trajectories are then stored in a golden library to serve as in-context demonstrations that steer the VLM planner at online inference time. Extensive experiments on three public benchmarks and a private industrial benchmark show that OSCAR consistently outperforms SOTA baselines. Notably, it achieves superior performance using only 10% of training data, demonstrating strong generalization of planning logic rather than dataset-specific memorization.
Chinese Translation
复合图像检索(CIR)需要对异构视觉和文本约束进行复杂推理。现有方法大致分为两种范式:统一嵌入检索,受限于单一模型的短视,以及启发式代理检索,受限于次优的试错协调。为此,我们提出了OSCAR,一种面向复合图像检索的优化引导代理规划框架。我们首次将代理CIR从启发式搜索过程重新表述为一个有原则的轨迹优化问题。OSCAR不依赖于启发式的试错探索,而是采用了一种新颖的离线-在线范式。在离线阶段,我们通过原子检索选择和组合将CIR建模为一个两阶段的混合整数规划问题,利用严格的布尔集合运算数学推导出最大化训练样本真实覆盖率的最优轨迹。这些轨迹随后被存储在一个黄金库中,以便在在线推理时作为VLM规划器的上下文演示进行引导。在三个公共基准和一个私有工业基准上的广泛实验表明,OSCAR始终优于最新的基线方法。值得注意的是,它在仅使用10%的训练数据的情况下实现了卓越的性能,展示了规划逻辑的强泛化能力,而非特定数据集的记忆。
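The coverage-maximizing selection of atomic retrievals described above can be illustrated with boolean set operations. The paper solves a two-stage mixed-integer program; the sketch below substitutes a simple greedy approximation of the same coverage objective, and all names and the toy data are assumptions.

```python
def greedy_compose(atomic_results, targets, budget=3):
    """Pick up to `budget` atomic retrieval result sets whose union best
    covers the ground-truth targets. A greedy stand-in (not the paper's
    MIP solver) for the coverage-maximization objective."""
    chosen, covered = [], set()
    remaining = dict(atomic_results)
    for _ in range(budget):
        best = max(remaining,
                   key=lambda k: len((remaining[k] - covered) & targets),
                   default=None)
        if best is None or not (remaining[best] - covered) & targets:
            break                       # no atomic retrieval adds coverage
        covered |= remaining.pop(best) & targets
        chosen.append(best)
    return chosen, covered

# hypothetical atomic retrievals (image-id sets) for one training query
atomic = {
    "text_match":  {1, 2, 3},
    "visual_sim":  {3, 4},
    "attr_filter": {5},
}
plan, covered = greedy_compose(atomic, targets={2, 3, 4, 5})
# the composed plan covers all ground-truth targets
```

A trajectory like `plan` is what would be stored in the golden library as an in-context demonstration.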
cs.AI / 92 / 2602.08630
Debate is efficient with your time
辩论是高效利用时间的方式
Abstract
AI safety via debate uses two competing models to help a human judge verify complex computational tasks. Previous work has established what problems debate can solve in principle, but has not analysed the practical cost of human oversight: how many queries must the judge make to the debate transcript? We introduce Debate Query Complexity (DQC), the minimum number of bits a verifier must inspect to correctly decide a debate. Surprisingly, we find that PSPACE/poly (the class of problems which debate can efficiently decide) is precisely the class of functions decidable with O(log n) queries. This characterisation shows that debate is remarkably query-efficient: even for highly complex problems, logarithmic oversight suffices. We also establish that functions depending on all their input bits require Omega(log n) queries, and that any function computable by a circuit of size s satisfies DQC(f) <= log(s) + 3. Interestingly, this last result implies that proving DQC lower bounds of log(n) + 6 for languages in P would yield new circuit lower bounds, connecting debate query complexity to central questions in circuit complexity.
Chinese Translation
通过辩论实现人工智能安全使用两个竞争模型来帮助人类裁判验证复杂的计算任务。之前的研究已经确定了辩论在原则上能够解决的问题,但尚未分析人类监督的实际成本:裁判需要对辩论记录进行多少查询?我们引入了辩论查询复杂性(Debate Query Complexity, DQC),即验证者必须检查的最小位数,以正确决定一场辩论。令人惊讶的是,我们发现 PSPACE/poly(辩论能够高效决定的问题类别)恰好是可以通过 O(log n) 查询决定的函数类别。这一特征表明辩论在查询效率上非常出色:即使对于高度复杂的问题,使用对数级别的监督也足够。我们还证明了依赖于所有输入位的函数需要 Omega(log n) 查询,并且任何由大小为 s 的电路可计算的函数满足 DQC(f) <= log(s) + 3。有趣的是,这最后的结果意味着,如果能够证明 P 类语言的 DQC 下界为 log(n) + 6,将会产生新的电路下界,从而将辩论查询复杂性与电路复杂性中的核心问题联系起来。
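The logarithmic-oversight claim above has a classic concrete intuition: when two debaters disagree about the root of a gate tree, the judge can walk the single branch of disagreement and inspect one input bit, so oversight scales with tree depth. The toy below illustrates that intuition (it is not the paper's construction; the gate layout and the lying strategy are assumptions).

```python
def run_debate(bits, ops):
    """Resolve a root disagreement about a complete binary gate tree over
    `bits` by descending the one branch where the debaters' claims differ,
    then inspecting a single input bit: O(depth) = O(log n) oversight.
    ops[d] is the gate ("AND"/"OR") used at depth d."""
    def value(lo, hi, d):
        if hi - lo == 1:
            return bits[lo]
        mid = (lo + hi) // 2
        kids = (value(lo, mid, d + 1), value(mid, hi, d + 1))
        return all(kids) if ops[d] == "AND" else any(kids)

    lo, hi, d, steps = 0, len(bits), 0, 0
    liar_claim = not value(lo, hi, 0)          # dishonest debater lies at the root
    while hi - lo > 1:
        mid = (lo + hi) // 2
        truth = (value(lo, mid, d + 1), value(mid, hi, d + 1))
        # the liar fabricates child claims consistent with its false claim
        if ops[d] == "AND":
            fake = (True, True) if liar_claim else (False, truth[1])
        else:
            fake = (True, truth[1]) if liar_claim else (False, False)
        side = 0 if fake[0] != truth[0] else 1  # descend where claims differ
        lo, hi = (lo, mid) if side == 0 else (mid, hi)
        liar_claim, d, steps = fake[side], d + 1, steps + 1
    honest_wins = bits[lo] != liar_claim        # one bit inspection settles it
    return honest_wins, steps

won, steps = run_debate([True, False, True, True], ["OR", "AND"])
# the judge needed log2(4) = 2 descent steps and a single leaf query
```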
cs.AI / 93 / 2602.08707
Why do we Trust Chatbots? From Normative Principles to Behavioral Drivers
我们为什么信任聊天机器人?从规范原则到行为驱动因素
Abstract
As chatbots increasingly blur the boundary between automated systems and human conversation, the foundations of trust in these systems warrant closer examination. While regulatory and policy frameworks tend to define trust in normative terms, the trust users place in chatbots often emerges from behavioral mechanisms. In many cases, this trust is not earned through demonstrated trustworthiness but is instead shaped by interactional design choices that leverage cognitive biases to influence user behavior. Based on this observation, we propose reframing chatbots not as companions or assistants, but as highly skilled salespeople whose objectives are determined by the deploying organization. We argue that the coexistence of competing notions of "trust" under a shared term obscures important distinctions between psychological trust formation and normative trustworthiness. Addressing this gap requires further research and stronger support mechanisms to help users appropriately calibrate trust in conversational AI systems.
Chinese Translation
随着聊天机器人日益模糊自动化系统与人类对话之间的界限,对这些系统信任基础的深入审视变得愈发重要。尽管监管和政策框架往往以规范性术语来定义信任,但用户对聊天机器人的信任通常源于行为机制。在许多情况下,这种信任并不是通过展示可信度来获得的,而是通过利用认知偏差的交互设计选择来影响用户行为。基于这一观察,我们建议将聊天机器人重新定义为高技能的销售人员,而非伴侣或助手,其目标由部署组织决定。我们认为,在一个共享术语下竞争的“信任”概念的共存掩盖了心理信任形成与规范性可信度之间的重要区别。解决这一差距需要进一步的研究和更强有力的支持机制,以帮助用户适当地调整对对话式人工智能系统的信任。
cs.AI / 94 / 2602.08708
Intermediate Results on the Complexity of STRIPS$_{1}^{1}$
关于STRIPS$_{1}^{1}$复杂性的中间结果
Abstract
This paper is based on Bylander's results on the computational complexity of propositional STRIPS planning. He showed that when only ground literals are permitted, determining plan existence is PSPACE-complete even if operators are limited to two preconditions and two postconditions. While NP-hardness is settled, it is unknown whether propositional STRIPS with operators that only have one precondition and one effect is NP-complete. We shed light on the question of whether this small-solution hypothesis for STRIPS$^1_1$ is true by calling a SAT solver on small instances, introducing the literal graph, and mapping it to Petri nets.
Chinese Translation
本文基于Bylander关于命题STRIPS规划计算复杂性的研究成果。他证明了在仅允许使用基础文字的情况下,即使操作符仅限于两个前提条件和两个后置条件,确定计划的存在性也是PSPACE完全的。尽管NP难度已得到解决,但尚不清楚仅具有一个前提条件和一个效果的命题STRIPS是否为NP完全。我们通过调用小实例的SAT求解器,引入文字图,并将其映射到Petri网,来探讨这一小解决假设对于STRIPS$^1_1$是否成立的问题。
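The restricted fragment studied above is easy to state concretely: each operator carries exactly one precondition literal and one effect literal. The sketch below is a brute-force plan-existence check for such instances via BFS over states (exponential in general, which is why the small-solution question matters); the encoding and the toy instance are assumptions, not the paper's SAT or Petri-net construction.

```python
from collections import deque

def plan_exists(ops, init, goal):
    """Plan existence for STRIPS_1^1: each operator is a pair
    (precondition literal, effect literal); a literal is a
    (proposition, polarity) pair; a state is the set of true propositions."""
    def holds(state, lit):
        p, positive = lit
        return (p in state) == positive

    def apply(state, lit):
        p, positive = lit
        return state | {p} if positive else state - {p}

    start = frozenset(init)
    seen, frontier = {start}, deque([start])
    while frontier:
        s = frontier.popleft()
        if all(holds(s, g) for g in goal):
            return True
        for pre, eff in ops:
            if holds(s, pre):               # single precondition check
                t = frozenset(apply(s, eff))  # single effect application
                if t not in seen:
                    seen.add(t)
                    frontier.append(t)
    return False

# toy chain a -> b -> c: each operator has one precondition and one add effect
ops = [(("a", True), ("b", True)), (("b", True), ("c", True))]
```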
cs.AI / 95 / 2602.08715
Exploring SAIG Methods for an Objective Evaluation of XAI
探索SAIG方法以实现对可解释人工智能的客观评估
Abstract
The evaluation of eXplainable Artificial Intelligence (XAI) methods is a rapidly growing field, characterized by a wide variety of approaches. This diversity highlights the complexity of the XAI evaluation, which, unlike traditional AI assessment, lacks a universally correct ground truth for the explanation, making objective evaluation challenging. One promising direction to address this issue involves the use of what we term Synthetic Artificial Intelligence Ground truth (SAIG) methods, which generate artificial ground truths to enable the direct evaluation of XAI techniques. This paper presents the first review and analysis of SAIG methods. We introduce a novel taxonomy to classify these approaches, identifying seven key features that distinguish different SAIG methods. Our comparative study reveals a concerning lack of consensus on the most effective XAI evaluation techniques, underscoring the need for further research and standardization in this area.
Chinese Translation
可解释人工智能(XAI)方法的评估是一个快速发展的领域,具有多种多样的方法。这种多样性突显了XAI评估的复杂性,与传统人工智能评估不同,XAI缺乏普遍正确的解释基础真相,这使得客观评估变得具有挑战性。解决这一问题的一个有前景的方向是使用我们称之为合成人工智能基础真相(SAIG)的方法,这些方法生成人工基础真相以便直接评估XAI技术。本文首次对SAIG方法进行了综述和分析。我们提出了一种新颖的分类法来对这些方法进行分类,识别出七个关键特征,以区分不同的SAIG方法。我们的比较研究揭示了在最有效的XAI评估技术上缺乏共识的令人担忧的现象,强调了在这一领域进一步研究和标准化的必要性。
cs.AI / 96 / 2602.08734
Finite-State Controllers for (Hidden-Model) POMDPs using Deep Reinforcement Learning
基于深度强化学习的有限状态控制器用于(隐模型)部分可观察马尔可夫决策过程
Abstract
Solving partially observable Markov decision processes (POMDPs) requires computing policies under imperfect state information. Despite recent advances, the scalability of existing POMDP solvers remains limited. Moreover, many settings require a policy that is robust across multiple POMDPs, further aggravating the scalability issue. We propose the Lexpop framework for POMDP solving. Lexpop (1) employs deep reinforcement learning to train a neural policy, represented by a recurrent neural network, and (2) constructs a finite-state controller mimicking the neural policy through efficient extraction methods. Crucially, unlike neural policies, such controllers can be formally evaluated, providing performance guarantees. We extend Lexpop to compute robust policies for hidden-model POMDPs (HM-POMDPs), which describe finite sets of POMDPs. We associate every extracted controller with its worst-case POMDP. Using a set of such POMDPs, we iteratively train a robust neural policy and consequently extract a robust controller. Our experiments show that on problems with large state spaces, Lexpop outperforms state-of-the-art solvers for POMDPs as well as HM-POMDPs.
Chinese Translation
解决部分可观察马尔可夫决策过程(POMDPs)需要在不完全状态信息下计算策略。尽管近期取得了一些进展,但现有POMDP求解器的可扩展性仍然有限。此外,许多场景要求策略在多个POMDPs中具有鲁棒性,这进一步加剧了可扩展性问题。我们提出了Lexpop框架用于POMDP求解。Lexpop (1) 采用深度强化学习训练一个由递归神经网络表示的神经策略,并 (2) 通过高效的提取方法构建一个模仿神经策略的有限状态控制器。重要的是,与神经策略不同,这种控制器可以被正式评估,从而提供性能保证。我们将Lexpop扩展到计算隐模型POMDPs(HM-POMDPs)的鲁棒策略,这些隐模型POMDPs描述了一组有限的POMDPs。我们将每个提取的控制器与其最坏情况的POMDP关联起来。通过一组这样的POMDPs,我们迭代训练一个鲁棒的神经策略,并因此提取一个鲁棒控制器。我们的实验表明,在具有大状态空间的问题上,Lexpop的表现优于当前最先进的POMDP和HM-POMDP求解器。
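The controller-extraction step above can be pictured as quantizing the recurrent policy's hidden states into a finite set of memory nodes. The sketch below is a simplified stand-in with an assumed data layout (recorded `(hidden, observation, action, next_hidden)` steps and fixed centroids), not the paper's extraction method.

```python
from collections import Counter, defaultdict

def extract_fsc(traces, centroids):
    """Mimic a recurrent policy with a finite-state controller: assign each
    hidden state to its nearest centroid (a controller node), then take
    majority votes for each node's action and for each
    (node, observation) -> next-node transition."""
    def node(h):
        return min(range(len(centroids)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(h, centroids[i])))

    actions, trans = defaultdict(Counter), defaultdict(Counter)
    for h, obs, act, h_next in traces:          # one recorded RNN step each
        n = node(h)
        actions[n][act] += 1
        trans[(n, obs)][node(h_next)] += 1
    policy = {n: c.most_common(1)[0][0] for n, c in actions.items()}
    delta = {k: c.most_common(1)[0][0] for k, c in trans.items()}
    return policy, delta                        # finite, formally evaluable tables

traces = [((0.1, 0.0), "o1", "left", (0.9, 1.0)),
          ((1.0, 0.9), "o2", "right", (0.0, 0.1))]
policy, delta = extract_fsc(traces, centroids=[(0.0, 0.0), (1.0, 1.0)])
```

Unlike the neural policy it imitates, the resulting `policy`/`delta` tables are a finite object that a model checker can evaluate exactly, which is the point the abstract makes about performance guarantees.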
cs.AI / 97 / 2602.08754
Belief Offloading in Human-AI Interaction
人机交互中的信念卸载
Abstract
What happens when people's beliefs are derived from information provided by an LLM? People's use of LLM chatbots as thought partners can contribute to cognitive offloading, which can have adverse effects on cognitive skills in cases of over-reliance. This paper defines and investigates a particular kind of cognitive offloading in human-AI interaction, "belief offloading," in which people's processes of forming and upholding beliefs are offloaded onto an AI system with downstream consequences on their behavior and the nature of their system of beliefs. Drawing on philosophy, psychology, and computer science research, we clarify the boundary conditions under which belief offloading occurs and provide a descriptive taxonomy of belief offloading and its normative implications. We close with directions for future work to assess the potential for and consequences of belief offloading in human-AI interaction.
Chinese Translation
当人们的信念源于大型语言模型(LLM)提供的信息时,会发生什么?人们将LLM聊天机器人作为思维伙伴的使用可能会导致认知卸载,而过度依赖可能对认知技能产生不利影响。本文定义并研究了一种特定类型的人机交互中的认知卸载,即“信念卸载”,在这种情况下,人们形成和维持信念的过程被卸载到人工智能系统上,从而对他们的行为和信念体系的性质产生后续影响。通过借鉴哲学、心理学和计算机科学的研究,我们阐明了信念卸载发生的边界条件,并提供了信念卸载的描述性分类及其规范性影响。最后,我们提出了未来研究的方向,以评估人机交互中信念卸载的潜力及其后果。
cs.AI / 98 / 2602.08783
Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure
潜在思维链中的动态:因果结构的实证研究
Abstract
Latent or continuous chain-of-thought methods replace explicit textual rationales with a number of internal latent steps, but these intermediate computations are difficult to evaluate beyond correlation-based probes. In this paper, we view latent chain-of-thought as a manipulable causal process in representation space by modeling latent steps as variables in a structural causal model (SCM) and analyzing their effects through step-wise $\mathrm{do}$-interventions. We study two representative paradigms (i.e., Coconut and CODI) on both mathematical and general reasoning tasks to investigate three key questions: (1) which steps are causally necessary for correctness and when answers become decidable early; (2) how does influence propagate across steps, and how does this structure compare to explicit CoT; and (3) do intermediate trajectories retain competing answer modes, and how does output-level commitment differ from representational commitment across steps. We find that latent-step budgets behave less like homogeneous extra depth and more like staged functionality with non-local routing, and we identify a persistent gap between early output bias and late representational commitment. These results motivate mode-conditional and stability-aware analyses -- and corresponding training/decoding objectives -- as more reliable tools for interpreting and improving latent reasoning systems.
Chinese Translation
潜在或连续的思维链方法用一系列内部潜在步骤替代了显式的文本推理,但这些中间计算难以在基于相关性的探测之外进行评估。本文将潜在思维链视为表示空间中可操控的因果过程,通过将潜在步骤建模为结构因果模型(SCM)中的变量,并通过逐步的 $\mathrm{do}$-干预分析其影响。我们在数学和一般推理任务上研究了两个代表性范式(即 Coconut 和 CODI),以探讨三个关键问题:(1)哪些步骤在正确性上是因果必要的,以及何时答案提前变得可判定;(2)影响如何在步骤之间传播,这种结构与显式思维链相比如何;(3)中间轨迹是否保留竞争的答案模式,以及输出级别的承诺与步骤之间的表示承诺有何不同。我们发现,潜在步骤的预算表现得不像均质的额外深度,而更像是具有非局部路由的分阶段功能,并且我们识别出早期输出偏差与晚期表示承诺之间存在持续的差距。这些结果促使我们进行模式条件和稳定性意识的分析——以及相应的训练/解码目标——作为更可靠的工具来解释和改进潜在推理系统。
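The step-wise do-intervention described above has a minimal representation-space analogue: clamp the latent entering one step and propagate the change forward. The toy below illustrates only that causal-analysis setup (a random recurrent map, not the Coconut/CODI models; all names are assumptions).

```python
import numpy as np

def rollout(W, z0, steps, do=None):
    """Toy latent chain z_{t+1} = tanh(W z_t). A do-intervention (t, value)
    overwrites the latent entering step t, the representation-space analogue
    of applying an SCM do-operator to one latent reasoning step."""
    z, traj = z0, []
    for t in range(steps):
        if do is not None and do[0] == t:
            z = do[1]                     # clamp the latent at step t
        z = np.tanh(W @ z)
        traj.append(z)
    return traj

rng = np.random.default_rng(1)
W = rng.normal(scale=0.9, size=(4, 4))
z0 = rng.normal(size=4)
base = rollout(W, z0, steps=5)
clamped = rollout(W, z0, steps=5, do=(2, np.zeros(4)))
# steps before the intervention match; the clamped step maps to tanh(0) = 0
```

Comparing `base` and `clamped` trajectories (and final answers, in the real setting) is how causal necessity of a step would be probed.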
cs.AI / 99 / 2602.08796
The Use of AI Tools to Develop and Validate Q-Matrices
使用人工智能工具开发和验证 Q 矩阵
Abstract
Constructing a Q-matrix is a critical but labor-intensive step in cognitive diagnostic modeling (CDM). This study investigates whether AI tools (i.e., general language models) can support Q-matrix development by comparing AI-generated Q-matrices with a validated Q-matrix from Li and Suen (2013) for a reading comprehension test. In May 2025, multiple AI models were provided with the same training materials as human experts. Agreement among AI-generated Q-matrices, the validated Q-matrix, and human raters' Q-matrices was assessed using Cohen's kappa. Results showed substantial variation across AI models, with Google Gemini 2.5 Pro achieving the highest agreement (Kappa = 0.63) with the validated Q-matrix, exceeding that of all human experts. A follow-up analysis in January 2026 using newer AI versions, however, revealed lower agreement with the validated Q-matrix. Implications and directions for future research are discussed.
Chinese Translation
构建 Q 矩阵是认知诊断建模 (CDM) 中一个关键但劳动密集的步骤。本研究探讨了人工智能工具(即通用语言模型)是否能够支持 Q 矩阵的开发,通过将 AI 生成的 Q 矩阵与 Li 和 Suen (2013) 提供的经过验证的 Q 矩阵进行比较,后者用于阅读理解测试。在 2025 年 5 月,多个 AI 模型获得了与人类专家相同的训练材料。使用 Cohen's kappa 评估 AI 生成的 Q 矩阵、经过验证的 Q 矩阵以及人类评分者的 Q 矩阵之间的一致性。结果显示不同 AI 模型之间存在显著差异,其中 Google Gemini 2.5 Pro 与经过验证的 Q 矩阵达成了最高的一致性(Kappa = 0.63),超过了所有人类专家。然而,在 2026 年 1 月对更新版 AI 的后续分析中,发现与经过验证的 Q 矩阵的一致性降低。讨论了研究的影响及未来研究的方向。
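The agreement measure used in the study, Cohen's kappa between Q-matrices, can be computed cell by cell over the binary item-by-attribute entries. The sketch below shows that computation; the two small matrices are made-up illustrations, not Li and Suen's validated Q-matrix.

```python
def cohens_kappa(q1, q2):
    """Cell-wise Cohen's kappa between two binary Q-matrices
    (rows = items, columns = attributes)."""
    cells = [(a, b) for r1, r2 in zip(q1, q2) for a, b in zip(r1, r2)]
    n = len(cells)
    po = sum(a == b for a, b in cells) / n      # observed agreement
    p1 = sum(a for a, _ in cells) / n           # rater 1's rate of 1s
    p2 = sum(b for _, b in cells) / n           # rater 2's rate of 1s
    pe = p1 * p2 + (1 - p1) * (1 - p2)          # chance agreement
    return (po - pe) / (1 - pe)

validated = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
ai_made = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]    # one cell flipped
kappa = cohens_kappa(validated, ai_made)        # 8/9 observed agreement -> 10/13
```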
cs.AI / 100 / 2602.08804
Root Cause Analysis Method Based on Large Language Models with Residual Connection Structures
基于残差连接结构的大型语言模型根本原因分析方法
Abstract
Root cause localization remains challenging in complex and large-scale microservice architectures. The complex fault propagation among microservices and the high dimensionality of telemetry data, including metrics, logs, and traces, limit the effectiveness of existing root cause analysis (RCA) methods. In this paper, a residual-connection-based RCA method using a large language model (LLM), named RC-LLM, is proposed. A residual-like hierarchical fusion structure is designed to integrate multi-source telemetry data, while the contextual reasoning capability of large language models is leveraged to model temporal and cross-microservice causal dependencies. Experimental results on CCF-AIOps microservice datasets demonstrate that RC-LLM achieves strong accuracy and efficiency in root cause analysis.
Chinese Translation
在复杂和大规模微服务架构中,根本原因定位仍然具有挑战性。微服务之间复杂的故障传播以及遥测数据(包括指标、日志和追踪)的高维性限制了现有根本原因分析(RCA)方法的有效性。本文提出了一种基于残差连接的大型语言模型(LLM)根本原因分析方法,命名为RC-LLM。设计了一种类似残差的层次融合结构,以整合多源遥测数据,同时利用大型语言模型的上下文推理能力来建模时间和跨微服务的因果依赖关系。在CCF-AIOps微服务数据集上的实验结果表明,RC-LLM在根本原因分析中实现了强大的准确性和效率。
cs.AI / 101 / 2602.08815
Negative-Aware Diffusion Process for Temporal Knowledge Graph Extrapolation
负向感知扩散过程用于时间知识图谱外推
Abstract
Temporal Knowledge Graph (TKG) reasoning seeks to predict future missing facts from historical evidence. While diffusion models (DM) have recently gained attention for their ability to capture complex predictive distributions, two gaps remain: (i) the generative path is conditioned only on positive evidence, overlooking informative negative context, and (ii) training objectives are dominated by cross-entropy ranking, which improves candidate ordering but provides little supervision over the calibration of the denoised embedding. To bridge this gap, we introduce Negative-Aware Diffusion model for TKG Extrapolation (NADEx). Specifically, NADEx encodes subject-centric histories of entities, relations and temporal intervals into sequential embeddings. NADEx perturbs the query object in the forward process and reconstructs it in reverse with a Transformer denoiser conditioned on the temporal-relational context. We further derive a cosine-alignment regularizer derived from batch-wise negative prototypes, which tightens the decision boundary against implausible candidates. Comprehensive experiments on four public TKG benchmarks demonstrate that NADEx delivers state-of-the-art performance.
Chinese Translation
时间知识图谱(TKG)推理旨在从历史证据中预测未来缺失的事实。尽管扩散模型(DM)因其捕捉复杂预测分布的能力而受到关注,但仍存在两个不足之处:(i)生成路径仅基于正向证据,忽视了有信息的负向上下文;(ii)训练目标主要依赖于交叉熵排序,这虽然改善了候选项的排序,但对去噪嵌入的校准提供的监督有限。为了解决这一问题,我们提出了负向感知扩散模型用于TKG外推(NADEx)。具体而言,NADEx 将实体、关系和时间间隔的主体中心历史编码为序列嵌入。NADEx 在正向过程中扰动查询对象,并在反向过程中使用基于时间-关系上下文的Transformer去噪器进行重构。我们进一步推导出一种基于批量负向原型的余弦对齐正则化器,以收紧对不可信候选项的决策边界。在四个公共TKG基准上的全面实验表明,NADEx 实现了最先进的性能。
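The cosine-alignment regularizer above can be sketched as follows: average the batch's negative-candidate embeddings into a prototype, then penalize cosine similarity between each denoised query embedding and that prototype. This is one hedged reading of the abstract, with assumed shapes and names, not the paper's exact loss.

```python
import numpy as np

def negative_alignment_penalty(denoised, negatives):
    """Batch-wise negative-prototype regularizer: mean cosine similarity
    between denoised embeddings (rows) and the averaged negative prototype.
    Adding this term to the loss pushes denoised embeddings away from
    implausible candidates (lower is better)."""
    proto = negatives.mean(axis=0)
    proto = proto / np.linalg.norm(proto)
    z = denoised / np.linalg.norm(denoised, axis=1, keepdims=True)
    return float(np.mean(z @ proto))

rng = np.random.default_rng(0)
negs = rng.normal(size=(16, 8))                              # negative embeddings
aligned = negs.mean(axis=0, keepdims=True) + 0.01 * rng.normal(size=(1, 8))
opposed = -aligned
# embeddings pointing toward the negative prototype are penalized most
```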
cs.AI / 102 / 2602.08835
Learning the Value Systems of Societies with Preference-based Multi-objective Reinforcement Learning
基于偏好的多目标强化学习学习社会的价值体系
Abstract
Value-aware AI should recognise human values and adapt to the value systems (value-based preferences) of different users. This requires operationalization of values, which can be prone to misspecification. The social nature of values demands their representation to adhere to multiple users while value systems are diverse, yet exhibit patterns among groups. In sequential decision making, efforts have been made towards personalization for different goals or values from demonstrations of diverse agents. However, these approaches demand manually designed features or lack value-based interpretability and/or adaptability to diverse user preferences. We propose algorithms for learning models of value alignment and value systems for a society of agents in Markov Decision Processes (MDPs), based on clustering and preference-based multi-objective reinforcement learning (PbMORL). We jointly learn socially-derived value alignment models (groundings) and a set of value systems that concisely represent different groups of users (clusters) in a society. Each cluster consists of a value system representing the value-based preferences of its members and an approximately Pareto-optimal policy that reflects behaviours aligned with this value system. We evaluate our method against a state-of-the-art PbMORL algorithm and baselines on two MDPs with human values.
Chinese Translation
价值感知的人工智能应当识别人类价值并适应不同用户的价值体系(基于价值的偏好)。这需要对价值进行操作化,而这种操作化可能会出现错误指定。价值的社会性质要求其表示能够适应多个用户,而价值体系则多样化,但在群体中表现出一定的模式。在序列决策中,已经针对来自不同代理的演示进行个性化以满足不同目标或价值进行了努力。然而,这些方法需要手动设计特征,或者缺乏基于价值的可解释性和/或对多样化用户偏好的适应性。我们提出了一种算法,用于在马尔可夫决策过程(MDPs)中学习代理社会的价值对齐模型和价值体系,基于聚类和基于偏好的多目标强化学习(PbMORL)。我们共同学习社会衍生的价值对齐模型(基础)和一组能够简洁表示社会中不同用户群体(聚类)的价值体系。每个聚类由一个价值体系组成,代表其成员的基于价值的偏好,以及一个大致的帕累托最优策略,反映与该价值体系一致的行为。我们在两个具有人类价值的MDP上,将我们的方法与最先进的PbMORL算法和基线进行评估。
cs.AI / 103 / 2602.08848
Deciding the Satisfiability of Combined Qualitative Constraint Networks
决定组合定性约束网络的可满足性
Abstract
Among the various forms of reasoning studied in the context of artificial intelligence, qualitative reasoning makes it possible to infer new knowledge in the context of imprecise, incomplete information without numerical values. In this paper, we propose a formal framework unifying several forms of extensions and combinations of qualitative formalisms, including multi-scale reasoning, temporal sequences, and loose integrations. This framework makes it possible to reason in the context of each of these combinations and extensions, but also to study in a unified way the satisfiability decision and its complexity. In particular, we establish two complementary theorems guaranteeing that the satisfiability decision is polynomial, and we use them to recover the known results of the size-topology combination. We also generalize the main definition of qualitative formalism to include qualitative formalisms excluded from the definitions of the literature, important in the context of combinations.
Chinese Translation
在人工智能研究的各种推理形式中,定性推理使得在不使用数值的情况下,在不精确和不完整的信息背景下推断新知识成为可能。本文提出了一个正式框架,统一了几种定性形式的扩展和组合,包括多尺度推理、时间序列和松散集成。该框架使得在这些组合和扩展的背景下进行推理成为可能,同时也可以以统一的方式研究可满足性决策及其复杂性。特别地,我们建立了两个互补定理,保证可满足性决策是多项式时间的,并利用这些定理恢复已知的大小-拓扑组合的结果。我们还将定性形式的主要定义进行了推广,以包括文献中被排除的定性形式,这在组合的背景下是重要的。
cs.AI / 104 / 2602.08889
Scalable Delphi: Large Language Models for Structured Risk Estimation
可扩展的德尔菲法:用于结构化风险估计的大型语言模型
Abstract
Quantitative risk assessment in high-stakes domains relies on structured expert elicitation to estimate unobservable properties. The gold standard - the Delphi method - produces calibrated, auditable judgments but requires months of coordination and specialist time, placing rigorous risk assessment out of reach for most applications. We investigate whether Large Language Models (LLMs) can serve as scalable proxies for structured expert elicitation. We propose Scalable Delphi, adapting the classical protocol for LLMs with diverse expert personas, iterative refinement, and rationale sharing. Because target quantities are typically unobservable, we develop an evaluation framework based on necessary conditions: calibration against verifiable proxies, sensitivity to evidence, and alignment with human expert judgment. We evaluate in the domain of AI-augmented cybersecurity risk, using three capability benchmarks and independent human elicitation studies. LLM panels achieve strong correlations with benchmark ground truth (Pearson r=0.87-0.95), improve systematically as evidence is added, and align with human expert panels - in one comparison, closer to a human panel than the two human panels are to each other. This demonstrates that LLM-based elicitation can extend structured expert judgment to settings where traditional methods are infeasible, reducing elicitation time from months to minutes.
Chinese Translation
高风险领域的定量风险评估依赖于结构化专家引导来估计不可观察的属性。金标准——德尔菲法——能够产生经过校准、可审计的判断,但需要数月的协调和专家时间,使得严格的风险评估对大多数应用而言难以实现。我们研究大型语言模型(LLMs)是否可以作为结构化专家引导的可扩展代理。我们提出了可扩展的德尔菲法,针对LLMs适配经典协议,结合多样化的专家角色、迭代优化和推理分享。由于目标量通常是不可观察的,我们开发了一个基于必要条件的评估框架:与可验证代理的校准、对证据的敏感性以及与人类专家判断的一致性。我们在人工智能增强的网络安全风险领域进行评估,使用三个能力基准和独立的人类引导研究。LLM小组与基准真实值之间的相关性较强(Pearson r=0.87-0.95),随着证据的增加而系统性改善,并与人类专家小组保持一致——在一次比较中,LLM小组与人类小组的接近程度超过了两个独立人类小组之间的接近程度。这表明基于LLM的引导可以将结构化专家判断扩展到传统方法不可行的场景,将引导时间从数月缩短至数分钟。
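The protocol adaptation above can be pictured as a small loop: diverse personas give estimates, see the panel's aggregate, and revise over rounds. The sketch below is a minimal Delphi-style loop with a mock elicitation function standing in for an LLM call; the persona names, anchoring rule, and median aggregation are all assumptions, not the paper's protocol.

```python
from statistics import median

def delphi(personas, elicit, rounds=3):
    """Minimal Delphi-style elicitation loop: each persona estimates,
    the panel median is fed back, and personas may revise each round.
    `elicit(persona, prior_median)` stands in for an LLM call."""
    panel_median, estimates = None, []
    for _ in range(rounds):
        estimates = [elicit(p, panel_median) for p in personas]
        panel_median = median(estimates)
    return panel_median, estimates

# mock "experts": each anchors on its prior and drifts halfway to the median
priors = {"red_teamer": 0.9, "sre": 0.4, "auditor": 0.5}
def mock_elicit(p, m):
    return priors[p] if m is None else 0.5 * priors[p] + 0.5 * m

risk, finals = delphi(list(priors), mock_elicit)
# the panel settles near the median persona's estimate after revision
```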
cs.AI / 105 / 2602.08905
Efficient and Stable Reinforcement Learning for Diffusion Language Models
高效稳定的扩散语言模型强化学习
Abstract
Reinforcement Learning (RL) is crucial for unlocking the complex reasoning capabilities of Diffusion-based Large Language Models (dLLMs). However, applying RL to dLLMs faces unique challenges in efficiency and stability. To address these challenges, we propose Spatio-Temporal Pruning (STP), a framework designed to simultaneously improve the efficiency and stability of RL for dLLMs. STP compresses the redundancy in the generative process through: (1) \textit{spatial pruning}, which constrains the exploration space using static priors; and (2) \textit{temporal pruning}, which bypasses redundant late-stage refinement steps. Our theoretical analysis demonstrates that STP strictly reduces the variance of the log-likelihood estimation, thereby ensuring more stable policy updates. Extensive experiments demonstrate that STP surpasses state-of-the-art baselines in both efficiency and accuracy. Our code is available at https://github.com/Lolo1222/STP.
Chinese Translation
强化学习(RL)对于释放基于扩散的大型语言模型(dLLMs)的复杂推理能力至关重要。然而,将RL应用于dLLMs面临效率和稳定性方面的独特挑战。为了解决这些挑战,我们提出了时空剪枝(Spatio-Temporal Pruning, STP)框架,旨在同时提高dLLMs强化学习的效率和稳定性。STP通过以下方式压缩生成过程中的冗余:(1) \textit{空间剪枝},使用静态先验约束探索空间;(2) \textit{时间剪枝},绕过冗余的后期精炼步骤。我们的理论分析表明,STP严格减少了对数似然估计的方差,从而确保了更稳定的策略更新。大量实验表明,STP在效率和准确性上均超越了最先进的基准。我们的代码可在 https://github.com/Lolo1222/STP 获取。
cs.AI / 106 / 2602.08939
CausalT5K: Diagnosing and Informing Refusal for Trustworthy Causal Reasoning of Skepticism, Sycophancy, Detection-Correction, and Rung Collapse
CausalT5K:诊断并指导拒绝,以实现涵盖怀疑、谄媚、检测-修正与阶梯崩溃的可信因果推理
Abstract
LLM failures in causal reasoning, including sycophancy, rung collapse, and miscalibrated refusal, are well-documented, yet progress on remediation is slow because no benchmark enables systematic diagnosis. We introduce CausalT5K, a diagnostic benchmark of over 5,000 cases across 10 domains that tests three critical capabilities: (1) detecting rung collapse, where models answer interventional queries with associational evidence; (2) resisting sycophantic drift under adversarial pressure; and (3) generating Wise Refusals that specify missing information when evidence is underdetermined. Unlike synthetic benchmarks, CausalT5K embeds causal traps in realistic narratives and decomposes performance into Utility (sensitivity) and Safety (specificity), revealing failure modes invisible to aggregate accuracy. Developed through a rigorous human-machine collaborative pipeline involving 40 domain experts, iterative cross-validation cycles, and composite verification via rule-based, LLM, and human scoring, CausalT5K implements Pearl's Ladder of Causation as research infrastructure. Preliminary experiments reveal a Four-Quadrant Control Landscape where static audit policies universally fail, a finding that demonstrates CausalT5K's value for advancing trustworthy reasoning systems. Repository: https://github.com/genglongling/CausalT5kBench
Chinese Translation
大型语言模型在因果推理中的失败,包括谄媚、阶梯崩溃和错误校准的拒绝,已有充分文献记录,但由于缺乏系统诊断的基准,修复进展缓慢。我们引入CausalT5K,这是一个涵盖10个领域、超过5000个案例的诊断基准,测试三项关键能力:(1)检测阶梯崩溃,即模型在干预查询中使用关联证据作答;(2)在对抗压力下抵制谄媚漂移;(3)生成明智的拒绝,指定在证据不确定时缺失的信息。与合成基准不同,CausalT5K在现实叙事中嵌入因果陷阱,并将性能分解为效用(敏感性)和安全性(特异性),揭示了聚合准确性下不可见的失败模式。CausalT5K通过一个严格的人机协作流程开发,该流程涉及40名领域专家、迭代交叉验证周期,以及通过基于规则的、LLM和人工评分的复合验证,实施了Pearl的因果阶梯作为研究基础设施。初步实验揭示了一个四象限控制景观,其中静态审计政策普遍失败,这一发现展示了CausalT5K在推动可信推理系统方面的价值。代码库:https://github.com/genglongling/CausalT5kBench
cs.AI / 107 / 2602.08948
CoRefine: Confidence-Guided Self-Refinement for Adaptive Test-Time Compute
CoRefine:基于信心引导的自我精炼方法用于自适应测试时计算
Abstract
Large Language Models (LLMs) often rely on test-time scaling via parallel decoding (for example, 512 samples) to boost reasoning accuracy, but this incurs substantial compute. We introduce CoRefine, a confidence-guided self-refinement method that achieves competitive accuracy using a fraction of the tokens via a lightweight 211k-parameter Conv1D controller atop a frozen LLM. The controller consumes full-trace confidence to decide whether to halt, re-examine, or try a different approach, enabling targeted self-correction with an average of 2.7 refinement steps per problem and roughly 190-fold token reduction relative to 512-sample baselines. Across diverse reasoning benchmarks and three open-source models, the controller achieves 92.6 percent precision when it confidently halts, indicating that confidence dynamics reliably signal correctness without ground-truth verification. We extend this to CoRefine-Tree, a hybrid sequential-parallel variant that adaptively balances exploration and exploitation, with easy serving integration and verifier compatibility. By treating confidence as a control signal rather than a correctness guarantee, CoRefine provides a modular primitive for scalable reasoning and agentic settings with imperfect verifiers.
Chinese Translation
大型语言模型(LLMs)通常依赖于通过并行解码(例如,512个样本)在测试时进行扩展,以提高推理准确性,但这会消耗大量计算资源。我们提出了CoRefine,一种基于信心引导的自我精炼方法,能够在使用少量标记的情况下实现具有竞争力的准确性,该方法在一个冻结的LLM之上使用轻量级的211k参数Conv1D控制器。该控制器利用全追踪信心来决定是停止、重新审视还是尝试不同的方法,从而实现有针对性的自我修正,每个问题平均进行2.7次精炼,相较于512样本基准,标记减少约190倍。在各种推理基准和三个开源模型中,当控制器自信地停止时,其精确度达到92.6%,这表明信心动态可靠地指示正确性,而无需真实值验证。我们将其扩展为CoRefine-Tree,一种混合的顺序-并行变体,能够自适应地平衡探索与利用,便于服务集成和验证器兼容性。通过将信心视为控制信号而非正确性保证,CoRefine为可扩展推理和具有不完美验证器的代理环境提供了模块化原语。
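The halt/re-examine/try-different-approach control described above can be sketched as a thresholded loop over a confidence signal. The sketch below uses mock functions in place of the frozen LLM and the Conv1D controller; the threshold values and all names are assumptions, not the paper's design.

```python
def corefine(solve, confidence, max_steps=4, halt=0.9, retry=0.5):
    """Confidence-guided refinement: halt when trace confidence is high,
    re-examine the current draft when middling, and switch to a different
    approach (discarding the draft) when low."""
    answer, approach = None, 0
    for step in range(1, max_steps + 1):
        answer = solve(approach, answer)
        c = confidence(answer)
        if c >= halt:           # confident: stop and return the answer
            return answer, step
        if c < retry:           # low confidence: try a different approach
            approach += 1
            answer = None       # discard the draft entirely
        # otherwise: keep the draft and re-examine it next step
    return answer, max_steps

# mock task: approach 0 is a dead end; approach 1 gains confidence
# each time its draft is revisited (draft = (approach, revision_count))
def mock_solve(approach, draft):
    return (approach, 0 if draft is None else draft[1] + 1)
def mock_conf(ans):
    return 0.3 if ans[0] == 0 else min(1.0, 0.5 + 0.25 * ans[1])

ans, steps = corefine(mock_solve, mock_conf)
```

Treating confidence as a control signal, as here, is what lets the loop stop early instead of spending a fixed parallel-sampling budget.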
cs.AI / 108 / 2602.08949
Digital Twin and Agentic AI for Wild Fire Disaster Management: Intelligent Virtual Situation Room
数字双胞胎与自主人工智能在野火灾害管理中的应用:智能虚拟情况室
Abstract
According to the United Nations, wildfire frequency and intensity are projected to increase by approximately 14% by 2030 and 30% by 2050 due to global warming, posing critical threats to life, infrastructure, and ecosystems. Conventional disaster management frameworks rely on static simulations and passive data acquisition, hindering their ability to adapt to arbitrarily evolving wildfire episodes in real-time. To address these limitations, we introduce the Intelligent Virtual Situation Room (IVSR), a bidirectional Digital Twin (DT) platform augmented by autonomous AI agents. The IVSR continuously ingests multisource sensor imagery, weather data, and 3D forest models to create a live virtual replica of the fire environment. A similarity engine powered by AI aligns emerging conditions with a precomputed Disaster Simulation Library, retrieving and calibrating intervention tactics under the watchful eyes of experts. Authorized actions - ranging from UAV redeployment to crew reallocation - are cycled back through standardized procedures to the physical layer, completing the loop between response and analysis. We validate IVSR through detailed case-study simulations provided by an industrial partner, demonstrating capabilities in localized incident detection, privacy-preserving playback, collider-based fire-spread projection, and site-specific ML retraining. Our results indicate marked reductions in detection-to-intervention latency and more effective resource coordination versus traditional systems. By uniting real-time bidirectional DTs with agentic AI, IVSR offers a scalable, semi-automated decision-support paradigm for proactive, adaptive wildfire disaster management.
Chinese Translation
根据联合国的预测,由于全球变暖,野火的发生频率和强度预计到2030年将增加约14%,到2050年将增加30%,这对生命、基础设施和生态系统构成了严重威胁。传统的灾害管理框架依赖于静态模拟和被动数据获取,限制了其实时适应不断演变的野火事件的能力。为了解决这些局限性,我们提出了智能虚拟情况室(Intelligent Virtual Situation Room, IVSR),这是一个由自主人工智能代理增强的双向数字双胞胎(Digital Twin, DT)平台。IVSR持续接收多源传感器图像、气象数据和三维森林模型,以创建火灾环境的实时虚拟复制品。由人工智能驱动的相似性引擎将新出现的条件与预计算的灾害模拟库对齐,在专家的监督下检索和校准干预策略。授权的行动——从无人机重新部署到人员重新分配——通过标准化程序反馈到物理层,完成响应与分析之间的闭环。我们通过工业合作伙伴提供的详细案例研究模拟验证了IVSR,展示了在局部事件检测、隐私保护回放、基于碰撞的火灾传播预测和特定场地的机器学习再训练方面的能力。我们的结果表明,与传统系统相比,检测到干预的延迟显著减少,资源协调更为有效。通过将实时双向数字双胞胎与自主人工智能相结合,IVSR提供了一种可扩展的半自动化决策支持范式,以实现主动和自适应的野火灾害管理。
cs.AI / 109 / 2602.08968
stable-worldmodel-v1: Reproducible World Modeling Research and Evaluation
stable-worldmodel-v1:可重复的世界建模研究与评估
Abstract
World Models have emerged as a powerful paradigm for learning compact, predictive representations of environment dynamics, enabling agents to reason, plan, and generalize beyond direct experience. Despite recent interest in World Models, most available implementations remain publication-specific, severely limiting their reusability, increasing the risk of bugs, and reducing evaluation standardization. To mitigate these issues, we introduce stable-worldmodel (SWM), a modular, tested, and documented world-model research ecosystem that provides efficient data-collection tools, standardized environments, planning algorithms, and baseline implementations. In addition, each environment in SWM enables controllable factors of variation, including visual and physical properties, to support robustness and continual learning research. Finally, we demonstrate the utility of SWM by using it to study zero-shot robustness in DINO-WM.
Chinese Translation
世界模型已成为学习环境动态的紧凑预测表示的强大范式,使得智能体能够推理、规划并超越直接经验进行泛化。尽管最近对世界模型的兴趣日益增加,但大多数现有实现仍然是特定于出版物的,这严重限制了它们的可重用性,增加了错误的风险,并降低了评估的标准化程度。为了解决这些问题,我们引入了stable-worldmodel(SWM),这是一个模块化、经过测试和文档化的世界模型研究生态系统,提供高效的数据收集工具、标准化的环境、规划算法和基准实现。此外,SWM中的每个环境都支持可控的变化因素,包括视觉和物理属性,以促进稳健性和持续学习研究。最后,我们通过使用SWM研究DINO-WM中的零样本(zero-shot)稳健性来展示其效用。
cs.AI / 110 / 2602.08990
InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery
InternAgent-1.5:面向长时程自主科学发现的统一智能体框架
Feng, Shiyang, Ma, Runmin, Yan, Xiangchao, Fan, Yue, Hu, Yusong, Huang, Songtao, Zhang, Shuaiyu, Cao, Zongsheng, Peng, Tianshuo, Yuan, Jiakang, Guo, Zijie, Zhong, Zhijie, Du, Shangheng, Wang, Weida, Shi, Jinxin, Zhou, Yuhao, He, Xiaohan, Yu, Zhiyin, Yu, Fangchen, Zheng, Qihao, Wu, Jiamin, Liu, Mianxin, Zhang, Chi, Hou, Shaowei, Li, Shuya, Jiang, Yankai, Lou, Wenjie, Wang, Lilong, Wang, Zifu, Wang, Jiong, Xu, Wanghan, Deng, Yue, Liu, Dongrui, Wang, Yiheng, Zhang, Wenlong, Ling, Fenghua, Zhang, Shufei, Wang, Xiaosong, Zheng, Shuangjia, Huang, Xun, Sun, Siqi, Hu, Shuyue, Ye, Peng, Song, Chunfeng, Wang, Bin, He, Conghui, Liu, Yihao, Li, Xin, Hou, Qibin, Chen, Tao, Yue, Xiangyu, Wang, Bin, He, Liang, Lin, Dahua, Zhou, Bowen, Zhang, Bo, Bai, Lei
Abstract
We introduce InternAgent-1.5, a unified system designed for end-to-end scientific discovery across computational and empirical domains. The system is built on a structured architecture composed of three coordinated subsystems for generation, verification, and evolution. These subsystems are supported by foundational capabilities for deep research, solution optimization, and long horizon memory. The architecture allows InternAgent-1.5 to operate continuously across extended discovery cycles while maintaining coherent and improving behavior. It also enables the system to coordinate computational modeling and laboratory experimentation within a single unified system. We evaluate InternAgent-1.5 on scientific reasoning benchmarks such as GAIA, HLE, GPQA, and FrontierScience, and the system achieves leading performance that demonstrates strong foundational capabilities. Beyond these benchmarks, we further assess two categories of discovery tasks. In algorithm discovery tasks, InternAgent-1.5 autonomously designs competitive methods for core machine learning problems. In empirical discovery tasks, it executes complete computational or wet lab experiments and produces scientific findings in earth, life, biological, and physical domains. Overall, these results show that InternAgent-1.5 provides a general and scalable framework for autonomous scientific discovery.
Chinese Translation
我们介绍了InternAgent-1.5,这是一个旨在实现计算和实证领域端到端科学发现的统一系统。该系统基于一个结构化架构,由三个协调的子系统组成,分别用于生成、验证和演化。这些子系统得益于深度研究、解决方案优化和长时记忆等基础能力的支持。该架构使InternAgent-1.5能够在延长的发现周期中持续运行,同时保持一致性并改善其行为。它还使系统能够在一个统一的系统内协调计算建模和实验室实验。我们在科学推理基准测试(如GAIA、HLE、GPQA和FrontierScience)上评估了InternAgent-1.5,结果显示该系统在基础能力方面表现出色,达到了领先的性能。除了这些基准测试,我们还进一步评估了两类发现任务。在算法发现任务中,InternAgent-1.5自主设计出针对核心机器学习问题的竞争性方法。在实证发现任务中,它执行完整的计算或湿实验,并在地球、生命、生物和物理领域产生科学发现。总体而言,这些结果表明,InternAgent-1.5提供了一个通用且可扩展的自主科学发现框架。
cs.AI / 111 / 2602.09000
iGRPO: Self-Feedback-Driven LLM Reasoning
iGRPO:自反馈驱动的LLM推理
Abstract
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62\% and 79.64\% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.
Chinese Translation
大型语言模型(LLMs)在解决复杂数学问题方面展现出了潜力,但仍难以稳定地产生准确且一致的解。强化学习(RL)是一种将这些模型与特定任务奖励对齐的框架,从而提高整体质量和可靠性。群体相对策略优化(GRPO)是近端策略优化(PPO)的一种高效、无需价值函数的替代方法,它利用群体相对奖励归一化。我们提出了迭代群体相对策略优化(iGRPO),这是GRPO的一个两阶段扩展,通过模型生成的草稿添加动态自我条件化。在第一阶段,iGRPO采样多个探索性草稿,并使用与优化相同的标量奖励信号选择奖励最高的草稿。在第二阶段,它将这个最佳草稿附加到原始提示中,并对草稿条件下的细化应用GRPO风格的更新,训练策略以超越其最强的先前尝试。在匹配的推演(rollout)预算下,iGRPO在基础模型(例如,Nemotron-H-8B-Base-8K和DeepSeek-R1 Distilled)上始终优于GRPO,验证了其在多样化推理基准上的有效性。此外,将iGRPO应用于在AceReason-Math上训练的OpenReasoning-Nemotron-7B,分别在AIME24和AIME25上取得了85.62\%和79.64\%的新最先进(state-of-the-art)结果。消融实验进一步表明,细化包装器(refinement wrapper)可泛化到GRPO变体之外,受益于生成式评判器,并通过延迟熵崩溃改变学习动态。这些结果强调了基于迭代自反馈的RL在推进可验证数学推理方面的潜力。
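The two-stage iGRPO procedure described above can be sketched as follows. This is a minimal illustration, not the paper's training code: `sample_fn` and `reward_fn` stand in for the policy's sampler and the scalar reward signal, and the refinement-prompt template is an assumption.

```python
def igrpo_stage1(prompt, sample_fn, reward_fn, n_drafts=4):
    """Stage 1: sample exploratory drafts and keep the highest-reward one,
    using the same scalar reward signal later used for optimization."""
    drafts = [sample_fn(prompt) for _ in range(n_drafts)]
    rewards = [reward_fn(prompt, d) for d in drafts]
    best = max(range(n_drafts), key=rewards.__getitem__)
    return drafts[best], rewards[best]

def igrpo_stage2_prompt(prompt, best_draft):
    """Stage 2: append the best draft so a GRPO-style update on the
    draft-conditioned refinements trains the policy to improve beyond
    its strongest prior attempt."""
    return f"{prompt}\n\nBest previous draft:\n{best_draft}\n\nRefine it:"
```

In actual training, Stage 2 rollouts generated from this conditioned prompt would feed the group-relative reward normalization exactly as in plain GRPO.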
cs.AI / 112 / 2602.09003
Data Science and Technology Towards AGI Part I: Tiered Data Management
数据科学与技术迈向通用人工智能(AGI)第一部分:分层数据管理
Abstract
The development of artificial intelligence can be viewed as an evolution of data-driven learning paradigms, with successive shifts in data organization and utilization continuously driving advances in model capability. Current LLM research is dominated by a paradigm that relies heavily on unidirectional scaling of data size, increasingly encountering bottlenecks in data availability, acquisition cost, and training efficiency. In this work, we argue that the development of AGI is entering a new phase of data-model co-evolution, in which models actively guide data management while high-quality data, in turn, amplifies model capabilities. To implement this vision, we propose a tiered data management framework, designed to support the full LLM training lifecycle across heterogeneous learning objectives and cost constraints. Specifically, we introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge. Importantly, LLMs are fully used in data management processes, such as quality scoring and content editing, to refine data across tiers. Each tier is characterized by distinct data properties, management strategies, and training roles, enabling data to be strategically allocated across LLM training stages, including pre-training, mid-training, and alignment. The framework balances data quality, acquisition cost, and marginal training benefit, providing a systematic approach to scalable and sustainable data management. We validate the effectiveness of the proposed framework through empirical studies, in which tiered datasets are constructed from raw corpora and used across multiple training phases. Experimental results demonstrate that tier-aware data utilization significantly improves training efficiency and model performance. To facilitate further research, we release our tiered datasets and processing tools to the community.
Chinese Translation
人工智能的发展可以视为数据驱动学习范式的演变,数据组织和利用的连续变化不断推动模型能力的进步。目前,LLM(大型语言模型)研究主要受限于一种高度依赖数据规模单向扩展的范式,越来越多地遇到数据可用性、获取成本和训练效率的瓶颈。在本研究中,我们认为AGI的发展正进入数据与模型共同进化的新阶段,在这一阶段,模型积极指导数据管理,而高质量数据反过来又增强模型能力。为实现这一愿景,我们提出了一种分层数据管理框架,旨在支持跨异构学习目标和成本约束的完整LLM训练生命周期。具体而言,我们引入了一个L0-L4分层数据管理框架,从原始未整理资源到有组织且可验证的知识。重要的是,LLM在数据管理过程中得到了充分利用,例如质量评分和内容编辑,以在各层之间精炼数据。每一层都有其独特的数据特性、管理策略和训练角色,使数据能够在LLM训练阶段(包括预训练、中期训练和对齐)中进行战略性分配。该框架平衡了数据质量、获取成本和边际训练收益,提供了一种系统化的可扩展和可持续的数据管理方法。我们通过实证研究验证了所提框架的有效性,在这些研究中,分层数据集是从原始语料库构建的,并在多个训练阶段中使用。实验结果表明,考虑层次的数据利用显著提高了训练效率和模型性能。为了促进进一步的研究,我们向社区发布了我们的分层数据集和处理工具。
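The L0-L4 tiering idea can be illustrated with a toy scoring-to-tier mapping. The thresholds, the `verified` shortcut, and the tier-to-stage table below are illustrative assumptions; the paper defines its own criteria for each tier.

```python
def assign_tier(quality_score, verified=False):
    """Map an LLM-assigned quality score in [0, 1] to a data tier,
    from raw uncurated resources (L0) to verifiable knowledge (L4).
    Thresholds here are illustrative only."""
    if verified:
        return "L4"  # organized, verifiable knowledge
    if quality_score >= 0.9:
        return "L3"
    if quality_score >= 0.7:
        return "L2"
    if quality_score >= 0.4:
        return "L1"
    return "L0"      # raw, uncurated resources

# Illustrative allocation of tiers across LLM training stages.
TIER_TO_STAGE = {"L1": "pre-training", "L2": "pre-training",
                 "L3": "mid-training", "L4": "alignment"}
```

The point of the sketch is the co-evolution loop: the model produces the quality scores, and the resulting tiers decide where (and whether) each sample is spent in the training lifecycle.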
cs.AI / 113 / 2602.09007
GEBench: Benchmarking Image Generation Models as GUI Environments
GEBench:作为图形用户界面环境的图像生成模型基准测试
Abstract
Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single-step transitions, they struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high-fidelity generative GUI environments. The code is available at: https://github.com/stepfun-ai/GEBench.
Chinese Translation
最近,图像生成模型的进展使得基于用户指令预测未来的图形用户界面(GUI)状态成为可能。然而,现有的基准测试主要集中在一般领域的视觉保真度上,导致在GUI特定上下文中对状态转变和时间一致性的评估尚未得到充分探索。为了解决这一问题,我们提出了GEBench,这是一个用于评估GUI生成中的动态交互和时间一致性的综合基准。GEBench包含700个经过精心挑选的样本,涵盖五个任务类别,涉及单步交互和多步轨迹,涵盖现实世界和虚构场景,以及基准点定位。为了支持系统评估,我们提出了GE-Score,这是一种新颖的五维指标,评估目标实现、交互逻辑、内容一致性、用户界面合理性和视觉质量。对当前模型的广泛评估表明,尽管它们在单步转变上表现良好,但在较长交互序列中保持时间一致性和空间定位方面却面临显著挑战。我们的研究发现图标解释、文本渲染和定位精度是关键瓶颈。这项工作为系统评估提供了基础,并为未来研究朝着构建高保真生成GUI环境的方向提供了有希望的思路。代码可在以下链接获取:https://github.com/stepfun-ai/GEBench。
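The five GE-Score dimensions can be aggregated as a simple weighted sum. Uniform weighting and the dimension names as Python identifiers are assumptions for illustration; the benchmark's own scoring protocol may differ.

```python
GE_DIMENSIONS = ("goal_achievement", "interaction_logic",
                 "content_consistency", "ui_plausibility", "visual_quality")

def ge_score(scores, weights=None):
    """Aggregate the five GE-Score dimensions (each scored in [0, 1])
    into a single scalar. Uniform weights are an assumption."""
    if weights is None:
        weights = {d: 1.0 / len(GE_DIMENSIONS) for d in GE_DIMENSIONS}
    missing = set(GE_DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(weights[d] * scores[d] for d in GE_DIMENSIONS)
```

Keeping the dimensions separate until the final sum is what lets the benchmark report, say, strong Visual Quality alongside weak Interaction Logic on long trajectories.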
cs.CL / 1 / 2602.06973
Does Visual Rendering Bypass Tokenization? Investigating Script-Tokenizer Misalignment in Pixel-Based Language Models
视觉渲染是否绕过了分词?探讨基于像素的语言模型中的文字-分词器不对齐问题
Abstract
While pixel-based language modeling aims to bypass the sub-word tokenization bottleneck by rendering text as images, recent multimodal variants such as DualGPT reintroduce text tokenizers to improve autoregressive performance. We investigate a fundamental question, does visual rendering truly decouple a model from tokenization constraints? Focusing on four Indonesian low-resource local languages that have their own non-Latin scripts (i.e., Javanese, Balinese, Sundanese, and Lampungnese), we evaluate the impact of script-tokenizer alignment within the DualGPT architecture. Our results show that, despite visual rendering, reintegrating a text tokenizer into the architecture reintroduces the same issue that pixel-based language modeling aims to resolve, which is the tokenizer misalignment problem. Despite having lower OOV and fertility rates, we show that the Llama 2 tokenizer performs significantly worse than a custom tokenizer, with improvements of up to 30.15 chrF++. Our findings serve as a warning for future multimodal variants, as text tokenizers remain a significant barrier to equitable models.
Chinese Translation
尽管基于像素的语言建模旨在通过将文本渲染为图像来绕过子词分词的瓶颈,但最近的多模态变体如DualGPT重新引入了文本分词器,以提高自回归性能。我们探讨一个基本问题:视觉渲染是否真的使模型脱离分词约束?我们关注四种拥有自己非拉丁文字的印度尼西亚低资源地方语言(即爪哇语、巴厘语、巽他语和楠榜语),评估在DualGPT架构中文字与分词器对齐的影响。我们的结果表明,尽管进行了视觉渲染,将文本分词器重新整合到架构中仍会重新引入基于像素的语言建模本欲解决的问题,即分词器不对齐问题。尽管Llama 2分词器具有较低的OOV(未登录词)率和fertility(每词平均子词数),我们显示其表现显著低于定制分词器,后者带来高达30.15 chrF++的提升。我们的发现对未来的多模态变体发出警告,因为文本分词器仍然是实现公平模型的重要障碍。
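The two tokenizer-quality statistics the abstract leans on, fertility and OOV rate, are easy to state concretely. These are the standard definitions, sketched here with a toy character-level tokenizer; the study's exact computation over its corpora may differ.

```python
def fertility(tokenize, words):
    """Average number of subword tokens per word; high fertility on a
    script is a classic symptom of script-tokenizer misalignment."""
    return sum(len(tokenize(w)) for w in words) / len(words)

def oov_rate(vocab, tokens):
    """Fraction of tokens that fall outside the tokenizer vocabulary."""
    return sum(1 for t in tokens if t not in vocab) / len(tokens)
```

The paper's caution is precisely that these surface statistics can look favorable (lower OOV, lower fertility for Llama 2) while downstream quality is still much worse than with a script-aligned custom tokenizer.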
cs.CL / 2 / 2602.06975
BiomechAgent: AI-Assisted Biomechanical Analysis Through Code-Generating Agents
BiomechAgent:通过代码生成代理实现的人工智能辅助生物力学分析
Abstract
Markerless motion capture is making quantitative movement analysis increasingly accessible, yet analyzing the resulting data remains a barrier for clinicians without programming expertise. We present BiomechAgent, a code-generating AI agent that enables biomechanical analysis through natural language and allows users to query databases, generate visualizations, and even interpret data without writing code. To evaluate BiomechAgent's capabilities, we developed a systematic benchmark spanning data retrieval, visualization, activity classification, temporal segmentation, and clinical reasoning. BiomechAgent achieved robust accuracy on data retrieval and visualization tasks and demonstrated emerging clinical reasoning capabilities. We used our dataset to systematically evaluate several of our design decisions. Biomechanically informed, domain-specific instructions significantly improved performance over generic prompts, and integrating validated specialized tools for gait event detection substantially boosted accuracy on challenging spatiotemporal analyses where the base agent struggled. We also tested BiomechAgent using a local open-weight model instead of a frontier cloud-based LLM and found that performance was substantially diminished in most domains other than database retrieval. In short, BiomechAgent makes data from accessible motion capture far more useful and accessible to end users.
Chinese Translation
无标记运动捕捉使得定量运动分析变得愈加可及,但分析所产生的数据仍然是没有编程专业知识的临床医生面临的障碍。我们提出了BiomechAgent,这是一种代码生成的人工智能代理,通过自然语言实现生物力学分析,使用户能够查询数据库、生成可视化图形,甚至在不需要编写代码的情况下解释数据。为了评估BiomechAgent的能力,我们开发了一个系统的基准测试,涵盖数据检索、可视化、活动分类、时间分割和临床推理。BiomechAgent在数据检索和可视化任务上实现了稳健的准确性,并展示了新兴的临床推理能力。我们使用我们的数据集系统地评估了多个设计决策。生物力学知识驱动的特定领域指令显著提高了性能,相较于通用提示,整合经过验证的专业工具用于步态事件检测在困难的时空分析中大幅提升了准确性,而基础代理在这些任务中表现不佳。我们还使用本地开放权重模型测试了BiomechAgent,而不是前沿的基于云的LLM,发现除了数据库检索外,在大多数领域的表现显著下降。总之,BiomechAgent使得来自可接触运动捕捉的数据变得更加有用和易于终端用户访问。
cs.CL / 3 / 2602.06976
Bridging the Knowledge Void: Inference-time Acquisition of Unfamiliar Programming Languages for Coding Tasks
弥合知识空白:编程任务中不熟悉编程语言的推理时获取
Abstract
The proficiency of Large Language Models (LLMs) in coding tasks is often a reflection of their extensive pre-training corpora, which typically collapses when confronted with previously unfamiliar programming languages. Departing from data-intensive finetuning, we investigate the paradigm of Inference-time Language Acquisition (ILA), where an LLM masters an unfamiliar language through dynamic interaction with limited external resources. In this paper, we propose ILA-agent, a general ILA framework that equips LLMs with a set of behavioral primitives. By modeling essential human-like behaviors as a suite of tools, ILA-agent enables LLMs to incrementally explore, apply, and verify language knowledge through structured interactions with the official documentation and execution environment. To provide a rigorous evaluation in a low-resource setting, we construct Cangjie-bench, a multi-task benchmark based on the novel statically-typed language Cangjie. We instantiate ILA-agent for Cangjie and evaluate its performance across code generation, translation, and program repair tasks. Results using diverse LLMs demonstrate that ILA-agent significantly outperforms retrieval-augmented baselines. Further analysis of agent trajectories characterizes the emergent behavior patterns while highlighting persisting performance gaps.
Chinese Translation
大型语言模型(LLMs)在编程任务中的熟练程度通常反映了其广泛的预训练语料库,但在面对之前不熟悉的编程语言时,这种能力往往会崩溃。我们脱离数据密集型的微调,探讨推理时语言获取(Inference-time Language Acquisition, ILA)的范式,在这一范式中,LLM通过与有限外部资源的动态互动掌握不熟悉的语言。本文提出了ILA-agent,一个通用的ILA框架,赋予LLM一组行为原语。通过将基本的人类行为建模为一套工具,ILA-agent使LLM能够通过与官方文档和执行环境的结构化交互,逐步探索、应用和验证语言知识。为了在低资源环境中提供严格的评估,我们构建了Cangjie-bench,这是一个基于新型静态类型语言Cangjie的多任务基准。我们为Cangjie实例化ILA-agent,并评估其在代码生成、翻译和程序修复任务中的表现。使用多种LLM的结果表明,ILA-agent显著优于检索增强(retrieval-augmented)的基线。此外,对代理轨迹的进一步分析描绘了新兴的行为模式,同时突显了持续存在的性能差距。
cs.CL / 4 / 2602.07120
Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model
锚定解码:可证明减少任何语言模型的版权风险
Abstract
Modern language models (LMs) tend to memorize portions of their training data and emit verbatim spans. When the underlying sources are sensitive or copyright-protected, such reproduction raises issues of consent and compensation for creators and compliance risks for developers. We propose Anchored Decoding, a plug-and-play inference-time method for suppressing verbatim copying: it enables decoding from any risky LM trained on mixed-license data by keeping generation in bounded proximity to a permissively trained safe LM. Anchored Decoding adaptively allocates a user-chosen information budget over the generation trajectory and enforces per-step constraints that yield a sequence-level guarantee, enabling a tunable risk-utility trade-off. To make Anchored Decoding practically useful, we introduce a new permissively trained safe model (TinyComma 1.8B), as well as Anchored$_{\mathrm{Byte}}$ Decoding, a byte-level variant of our method that enables cross-vocabulary fusion via the ByteSampler framework (Hayase et al., 2025). We evaluate our methods across six model pairs on long-form evaluations of copyright risk and utility. Anchored and Anchored$_{\mathrm{Byte}}$ Decoding define a new Pareto frontier, preserving near-original fluency and factuality while eliminating up to 75% of the measurable copying gap (averaged over six copying metrics) between the risky baseline and a safe reference, at a modest inference overhead.
Chinese Translation
现代语言模型(LM)往往会记忆其训练数据的部分内容并逐字输出。当底层来源是敏感或受版权保护时,这种复制会引发创作者的同意和补偿问题,以及开发者的合规风险。我们提出了锚定解码(Anchored Decoding),这是一种即插即用的推理时方法,用于抑制逐字复制:它通过保持生成与经过宽松训练的安全语言模型的有限接近,使得从任何在混合许可数据上训练的高风险语言模型进行解码成为可能。锚定解码自适应地在生成轨迹上分配用户选择的信息预算,并施加逐步约束,从而提供序列级别的保证,实现可调的风险-效用权衡。为了使锚定解码在实践中更具实用性,我们引入了一种新的宽松训练的安全模型(TinyComma 1.8B),以及锚定$_{\mathrm{Byte}}$解码(Anchored$_{\mathrm{Byte}}$ Decoding),这是我们方法的一个字节级变体,能够通过字节采样器框架(ByteSampler framework,Hayase et al., 2025)实现跨词汇融合。我们在六对模型上评估了我们的方法,针对版权风险和效用进行了长篇评估。锚定解码和锚定$_{\mathrm{Byte}}$解码定义了一个新的帕累托前沿,在消除高达75%可测量的复制差距(在六个复制指标上取平均)与安全参考模型之间的同时,保持近乎原始的流畅性和事实性,且推理开销适中。
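The per-step proximity constraint can be illustrated on a single decoding step over next-token distributions. Capping each token's log-probability ratio between the risky and safe models at the step budget is a simplification of the paper's adaptive budget allocation, not its exact mechanism.

```python
import math

def anchored_step(p_risky, p_safe, step_budget):
    """One decoding step: tilt the safe distribution toward the risky one
    while capping each token's log-ratio at +/- step_budget nats, so the
    output distribution stays in bounded proximity to the safe model."""
    lo, hi = math.exp(-step_budget), math.exp(step_budget)
    mixed = [ps * min(max(pr / ps, lo), hi) for pr, ps in zip(p_risky, p_safe)]
    z = sum(mixed)
    return [m / z for m in mixed]
```

With a zero budget the step reduces to pure safe-model decoding; with a generous budget it recovers the risky model, which is the tunable risk-utility dial the abstract describes. Summing bounded per-step costs over the trajectory is what yields a sequence-level guarantee.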
cs.CL / 5 / 2602.07160
Free Energy Mixer
自由能混合器
Abstract
Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series at matched parameter budgets.
Chinese Translation
标准注意力机制无损地存储键/值,但通过每个头的凸平均进行读取,阻碍了按通道的选择。我们提出了自由能混合器(Free Energy Mixer, FEM):一种自由能(log-sum-exp)读取方法,它对索引上的快速先验(例如,来自标准注意力的查询/键)施加基于值的每通道对数线性倾斜。与试图改善和丰富$(q,k)$评分分布的方法不同,FEM将其视为先验,并在复杂度不变的情况下产生一个关注值的后验读取,随着可学习的逆温度增加,平滑地从平均转向每通道选择,同时仍然保持并行性和原始渐近复杂度(对于softmax为$O(T^2)$;对于可线性化变体为$O(T)$)。我们实例化了一个两级门控FEM,它可以与标准和线性注意力、线性递归神经网络(RNN)和状态空间模型(SSM)无缝连接。在匹配参数预算的情况下,它在自然语言处理、视觉和时间序列任务上始终优于强基线。
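The free-energy read itself is a per-channel log-sum-exp over indices, weighted by the prior. The scalar-Python sketch below (with a strictly positive inverse temperature `beta`) shows only the averaging-to-selection behavior, not the paper's gated two-level architecture.

```python
import math

def fem_read(prior, values, beta):
    """Per-channel free-energy read:
        out_c = (1/beta) * log( sum_i prior_i * exp(beta * v[i][c]) ).
    Small beta recovers the prior-weighted convex average; large beta
    approaches a per-channel max (selection). Requires beta > 0."""
    n_channels = len(values[0])
    out = []
    for c in range(n_channels):
        s = sum(p * math.exp(beta * v[c]) for p, v in zip(prior, values))
        out.append(math.log(s) / beta)
    return out
```

Because the read is computed independently per channel, different channels can select different indices under the same prior, which is exactly what a single convex average per head cannot do.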
cs.CL / 6 / 2602.07164
Your Language Model Secretly Contains Personality Subnetworks
你的语言模型秘密地包含个性子网络
Abstract
Humans shift between different personas depending on social context. Large Language Models (LLMs) demonstrate a similar flexibility in adopting different personas and behaviors. Existing approaches, however, typically adapt such behavior through external knowledge such as prompting, retrieval-augmented generation (RAG), or fine-tuning. We ask: do LLMs really need external context or parameters to adapt to different behaviors, or do they already have such knowledge embedded in their parameters? In this work, we show that LLMs already contain persona-specialized subnetworks in their parameter space. Using small calibration datasets, we identify distinct activation signatures associated with different personas. Guided by these statistics, we develop a masking strategy that isolates lightweight persona subnetworks. Building on the findings, we further discuss: how can we discover opposing subnetwork from the model that lead to binary-opposing personas, such as introvert-extrovert? To further enhance separation in binary opposition scenarios, we introduce a contrastive pruning strategy that identifies parameters responsible for the statistical divergence between opposing personas. Our method is entirely training-free and relies solely on the language model's existing parameter space. Across diverse evaluation settings, the resulting subnetworks exhibit significantly stronger persona alignment than baselines that require external knowledge while being more efficient. Our findings suggest that diverse human-like behaviors are not merely induced in LLMs, but are already embedded in their parameter space, pointing toward a new perspective on controllable and interpretable personalization in large language models.
Chinese Translation
人类会根据社会情境在不同的角色之间切换。大型语言模型(LLMs)在采用不同角色和行为方面表现出类似的灵活性。然而,现有的方法通常通过外部知识(如提示、检索增强生成(RAG)或微调)来适应这种行为。我们提出疑问:LLMs 是否真的需要外部上下文或参数来适应不同的行为,还是它们已经在参数中嵌入了这种知识?在本研究中,我们展示了 LLMs 在其参数空间中已经包含了专门针对角色的子网络。通过使用小型校准数据集,我们识别出与不同角色相关的独特激活特征。在这些统计数据的指导下,我们开发了一种掩蔽策略,以隔离轻量级的角色子网络。基于这些发现,我们进一步讨论:如何从模型中发现导致二元对立角色(如内向-外向)的对立子网络?为了进一步增强二元对立场景中的分离性,我们引入了一种对比剪枝策略,以识别导致对立角色之间统计差异的参数。我们的方法完全不需要训练,仅依赖于语言模型现有的参数空间。在多种评估设置中,所得到的子网络表现出显著强于需要外部知识的基线的角色一致性,同时更加高效。我们的发现表明,多样的人类行为不仅仅是在 LLMs 中诱发的,而是已经嵌入在其参数空间中,这为大型语言模型中的可控和可解释个性化提供了新的视角。
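The contrastive pruning idea can be sketched at the level of per-unit activation statistics from the small calibration sets. The absolute-difference divergence measure and the keep-ratio below are illustrative assumptions, not the paper's exact criterion.

```python
def contrastive_mask(mean_act_a, mean_act_b, keep_ratio=0.25):
    """Keep the units whose mean activations diverge most between two
    opposing personas (e.g. introvert vs. extrovert); mask the rest.
    Returns a 0/1 mask over units."""
    div = [abs(a - b) for a, b in zip(mean_act_a, mean_act_b)]
    k = max(1, int(len(div) * keep_ratio))
    threshold = sorted(div, reverse=True)[k - 1]
    return [1 if d >= threshold else 0 for d in div]
```

The resulting mask isolates a lightweight subnetwork without any gradient updates, which is what makes the approach training-free: only the model's existing parameters and a handful of calibration statistics are involved.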
cs.CL / 7 / 2602.07176
Open TutorAI: An Open-source Platform for Personalized and Immersive Learning with Generative AI
Open TutorAI:一个基于生成性人工智能的个性化和沉浸式学习开源平台
Abstract
Recent advances in artificial intelligence have created new possibilities for making education more scalable, adaptive, and learner-centered. However, existing educational chatbot systems often lack contextual adaptability, real-time responsiveness, and pedagogical agility, which can limit learner engagement and diminish instructional effectiveness. Thus, there is a growing need for open, integrative platforms that combine AI and immersive technologies to support personalized, meaningful learning experiences. This paper presents Open TutorAI, an open-source educational platform based on LLMs and generative technologies that provides dynamic, personalized tutoring. The system integrates natural language processing with customizable 3D avatars to enable multimodal learner interaction. Through a structured onboarding process, it captures each learner's goals and preferences in order to configure a learner-specific AI assistant. This assistant is accessible via both text-based and avatar-driven interfaces. The platform includes tools for organizing content, providing embedded feedback, and offering dedicated interfaces for learners, educators, and parents. This work focuses on learner-facing components, delivering a tool for adaptive support that responds to individual learner profiles without requiring technical expertise. Its assistant-generation pipeline and avatar integration enhance engagement and emotional presence, creating a more humanized, immersive learning environment. Embedded learning analytics support self-regulated learning by tracking engagement patterns and generating actionable feedback. The result is Open TutorAI, which unites modular architecture, generative AI, and learner analytics within an open-source framework. It contributes to the development of next-generation intelligent tutoring systems.
Chinese Translation
近年来,人工智能的进步为教育的可扩展性、适应性和以学习者为中心的理念创造了新的可能性。然而,现有的教育聊天机器人系统往往缺乏上下文适应性、实时响应能力和教学灵活性,这可能限制学习者的参与度并降低教学效果。因此,迫切需要开放的、综合的平台,将人工智能与沉浸式技术结合起来,以支持个性化和有意义的学习体验。本文介绍了Open TutorAI,一个基于大语言模型(LLMs)和生成性技术的开源教育平台,提供动态的个性化辅导。该系统将自然语言处理与可定制的3D虚拟形象相结合,以实现多模态学习者互动。通过结构化的入门流程,它捕捉每位学习者的目标和偏好,以配置特定于学习者的人工智能助手。该助手可通过基于文本和虚拟形象驱动的界面访问。平台包括内容组织工具、嵌入式反馈提供功能,以及为学习者、教育工作者和家长提供的专用界面。该研究重点关注面向学习者的组件,提供一种适应性支持工具,能够根据个体学习者的特征进行响应,而无需技术专长。其助手生成管道和虚拟形象集成增强了参与感和情感存在感,创造了一个更人性化、沉浸式的学习环境。嵌入式学习分析通过跟踪参与模式和生成可操作的反馈来支持自我调节学习。最终形成的Open TutorAI将模块化架构、生成性人工智能和学习者分析结合在一个开源框架内,为下一代智能辅导系统的发展做出了贡献。
cs.CL / 8 / 2602.07181
Can LLMs Discern the Traits Influencing Your Preferences? Evaluating Personality-Driven Preference Alignment in LLMs
大型语言模型能否识别影响您偏好的特征?评估基于人格驱动的偏好对齐在大型语言模型中的应用
Abstract
User preferences are increasingly used to personalize Large Language Model (LLM) responses, yet how to reliably leverage preference signals for answer generation remains under-explored. In practice, preferences can be noisy, incomplete, or even misleading, which can degrade answer quality when applied naively. Motivated by the observation that stable personality traits shape everyday preferences, we study personality as a principled ''latent'' signal behind preference statements. Through extensive experiments, we find that conditioning on personality-aligned preferences substantially improves personalized question answering: selecting preferences consistent with a user's inferred personality increases answer-choice accuracy from 29.25% to 76%, compared to using randomly selected preferences. Based on these findings, we introduce PACIFIC (Preference Alignment Choices Inference for Five-factor Identity Characterization), a personality-labeled preference dataset containing 1200 preference statements spanning diverse domains (e.g., travel, movies, education), annotated with Big-Five (OCEAN) trait directions. Finally, we propose a framework that enables an LLM model to automatically retrieve personality-aligned preferences and incorporate them during answer generation.
Chinese Translation
用户偏好越来越多地被用于个性化大型语言模型(LLM)的响应,但如何可靠地利用偏好信号进行答案生成仍然未得到充分探索。在实践中,偏好可能是嘈杂的、不完整的,甚至具有误导性,这在简单应用时可能降低答案质量。基于稳定的人格特征塑造日常偏好的观察,我们将人格视为偏好陈述背后的一个原则性“潜在”信号。通过广泛的实验,我们发现基于与人格对齐的偏好进行条件化显著改善了个性化问答:与使用随机选择的偏好相比,选择与用户推断的人格一致的偏好使答案选择的准确率从29.25%提高到76%。基于这些发现,我们引入了PACIFIC(Preference Alignment Choices Inference for Five-factor Identity Characterization),这是一个包含1200个跨越多样领域(例如旅行、电影、教育)的偏好陈述的人格标注偏好数据集,并附有大五人格(OCEAN)特征方向的注释。最后,我们提出了一个框架,使得LLM模型能够自动检索与人格对齐的偏好,并在答案生成过程中将其纳入考虑。
cs.CL / 9 / 2602.07190
Long-Context Long-Form Question Answering for Legal Domain
法律领域的长上下文长篇问答
Abstract
Legal documents have complex document layouts involving multiple nested sections, lengthy footnotes and further use specialized linguistic devices like intricate syntax and domain-specific vocabulary to ensure precision and authority. These inherent characteristics of legal documents make question answering challenging, and particularly so when the answer to the question spans several pages (i.e. requires long-context) and is required to be comprehensive (i.e. a long-form answer). In this paper, we address the challenges of long-context question answering in context of long-form answers given the idiosyncrasies of legal documents. We propose a question answering system that can (a) deconstruct domain-specific vocabulary for better retrieval from source documents, (b) parse complex document layouts while isolating sections and footnotes and linking them appropriately, (c) generate comprehensive answers using precise domain-specific vocabulary. We also introduce a coverage metric that classifies the performance into recall-based coverage categories allowing human users to evaluate the recall with ease. We curate a QA dataset by leveraging the expertise of professionals from fields such as law and corporate tax. Through comprehensive experiments and ablation studies, we demonstrate the usability and merit of the proposed system.
Chinese Translation
法律文件具有复杂的文档布局,涉及多个嵌套部分、冗长的脚注,并进一步使用复杂的语法和领域特定的词汇,以确保精确性和权威性。这些法律文件的固有特征使得问答变得具有挑战性,尤其是当问题的答案跨越几页(即需要长上下文)并且要求全面(即长篇答案)时。在本文中,我们针对法律文件的特性,解决了长上下文问答在长篇答案中的挑战。我们提出了一种问答系统,该系统可以(a)对领域特定的词汇进行解构,以便更好地从源文档中检索,(b)解析复杂的文档布局,同时隔离部分和脚注并适当地链接它们,(c)使用精确的领域特定词汇生成全面的答案。我们还引入了一种覆盖率指标,将性能分类为基于召回的覆盖类别,使得人类用户能够轻松评估召回率。我们通过利用法律和企业税务等领域专业人士的专业知识,策划了一个问答数据集。通过全面的实验和消融研究,我们展示了所提系统的可用性和优越性。
cs.CL / 10 / 2602.07211
Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities
为大型语言模型赋予方向性多说话者语音理解能力
Abstract
Recent studies have demonstrated that prompting large language models (LLM) with audio encodings enables effective speech understanding capabilities. However, most speech LLMs are trained on single-channel, single-talker data, which makes it challenging to directly apply them to multi-talker and multi-channel speech understanding task. In this work, we present a comprehensive investigation on how to enable directional multi-talker speech understanding capabilities for LLMs, specifically in smart glasses usecase. We propose two novel approaches to integrate directivity into LLMs: (1) a cascaded system that leverages a source separation front-end module, and (2) an end-to-end system that utilizes serialized output training. All of the approaches utilize a multi-microphone array embedded in smart glasses to optimize directivity interpretation and processing in a streaming manner. Experimental results demonstrate the efficacy of our proposed methods in endowing LLMs with directional speech understanding capabilities, achieving strong performance in both speech recognition and speech translation tasks.
Chinese Translation
最近的研究表明,通过音频编码提示大型语言模型(LLM)能够有效地实现语音理解能力。然而,大多数语音LLM是在单通道、单说话者数据上训练的,这使得它们在多说话者和多通道语音理解任务中的直接应用面临挑战。在本研究中,我们对如何为LLM赋予方向性多说话者语音理解能力进行了全面的探讨,特别是在智能眼镜的应用场景中。我们提出了两种新颖的方法将方向性集成到LLM中:(1)利用源分离前端模块的级联系统,以及(2)利用序列化输出训练的端到端系统。所有方法都利用嵌入在智能眼镜中的多麦克风阵列,以流式方式优化方向性解释和处理。实验结果证明了我们提出的方法在赋予LLM方向性语音理解能力方面的有效性,在语音识别和语音翻译任务中均取得了良好的表现。
cs.CL / 11 / 2602.07319
Beyond Accuracy: Risk-Sensitive Evaluation of Hallucinated Medical Advice
超越准确性:对虚构医疗建议的风险敏感性评估
Abstract
Large language models are increasingly being used in patient-facing medical question answering, where hallucinated outputs can vary widely in potential harm. However, existing hallucination standards and evaluation metrics focus primarily on factual correctness, treating all errors as equally severe. This obscures clinically relevant failure modes, particularly when models generate unsupported but actionable medical language. We propose a risk-sensitive evaluation framework that quantifies hallucinations through the presence of risk-bearing language, including treatment directives, contraindications, urgency cues, and mentions of high-risk medications. Rather than assessing clinical correctness, our approach evaluates the potential impact of hallucinated content if acted upon. We further combine risk scoring with a relevance measure to identify high-risk, low-grounding failures. We apply this framework to three instruction-tuned language models using controlled patient-facing prompts designed as safety stress tests. Our results show that models with similar surface-level behavior exhibit substantially different risk profiles and that standard evaluation metrics fail to capture these distinctions. These findings highlight the importance of incorporating risk sensitivity into hallucination evaluation and suggest that evaluation validity is critically dependent on task and prompt design.
Chinese Translation
大型语言模型在面向患者的医疗问答中越来越多地被使用,其中虚构输出的潜在危害差异很大。然而,现有的虚构标准和评估指标主要关注事实正确性,将所有错误视为同等严重。这掩盖了临床相关的失败模式,特别是当模型生成不支持但可操作的医疗语言时。我们提出了一种风险敏感性评估框架,通过风险承载语言的存在来量化虚构内容,包括治疗指令、禁忌症、紧迫性提示和高风险药物的提及。我们的评估方法不是评估临床正确性,而是评估如果采取行动,虚构内容的潜在影响。我们进一步将风险评分与相关性测量结合,以识别高风险、低基础的失败。我们将该框架应用于三个经过指令调优的语言模型,使用设计为安全压力测试的受控患者面向提示。我们的结果表明,具有相似表面行为的模型展现出显著不同的风险特征,而标准评估指标未能捕捉这些差异。这些发现强调了在虚构评估中纳入风险敏感性的重要性,并表明评估有效性在很大程度上依赖于任务和提示设计。
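The risk-scoring idea above can be sketched as a weighted lexicon match over the model's answer. Everything below is an illustrative assumption, not the paper's actual framework: the category names, terms, and weights are hypothetical placeholders standing in for the paper's treatment directives, urgency cues, and high-risk medication lists.

```python
# Hypothetical sketch: score an answer by the presence of risk-bearing language.
# Categories, terms, and weights are illustrative, not the paper's lexicon.
RISK_LEXICON = {
    "directive": ({"take", "stop", "double", "increase"}, 2.0),
    "urgency": ({"immediately", "emergency", "urgent"}, 3.0),
    "high_risk_drug": ({"warfarin", "insulin", "opioid"}, 3.0),
}

def risk_score(answer: str) -> float:
    """Sum the category weight for each distinct risk-bearing term present."""
    tokens = set(answer.lower().replace(".", " ").replace(",", " ").split())
    score = 0.0
    for terms, weight in RISK_LEXICON.values():
        score += weight * len(tokens & terms)
    return score
```

A high score flags content whose potential impact, if acted upon, is severe, regardless of whether it is factually correct; combining it with a grounding/relevance measure would then isolate the high-risk, low-grounding failures the abstract describes.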
cs.CL / 12 / 2602.07338
Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation
意图不匹配导致大型语言模型在多轮对话中迷失
Abstract
Multi-turn conversation has emerged as a predominant interaction paradigm for Large Language Models (LLMs). Users often employ follow-up questions to refine their intent, expecting LLMs to adapt dynamically. However, recent research reveals that LLMs suffer a substantial performance drop in multi-turn settings compared to single-turn interactions with fully specified instructions, a phenomenon termed ``Lost in Conversation'' (LiC). While this prior work attributes LiC to model unreliability, we argue that the root cause lies in an intent alignment gap rather than intrinsic capability deficits. In this paper, we first demonstrate that LiC is not a failure of model capability but rather a breakdown in interaction between users and LLMs. We theoretically show that scaling model size or improving training alone cannot resolve this gap, as it arises from structural ambiguity in conversational context rather than representational limitations. To address this, we propose to decouple intent understanding from task execution through a Mediator-Assistant architecture. By utilizing an experience-driven Mediator to explicate user inputs into explicit, well-structured instructions based on historical interaction patterns, our approach effectively bridges the gap between vague user intent and model interpretation. Experimental results demonstrate that this method significantly mitigates performance degradation in multi-turn conversations across diverse LLMs.
Chinese Translation
多轮对话已成为大型语言模型(LLMs)的主要交互范式。用户常常通过后续问题来细化他们的意图,期望LLMs能够动态适应。然而,最近的研究表明,与具有完全指定指令的单轮交互相比,LLMs在多轮设置中的性能显著下降,这一现象被称为“迷失在对话中”(Lost in Conversation, LiC)。虽然之前的研究将LiC归因于模型的不可靠性,但我们认为根本原因在于意图对齐的差距,而非内在能力的缺陷。本文首先证明,LiC并不是模型能力的失败,而是用户与LLMs之间交互的崩溃。我们理论上表明,仅仅扩大模型规模或改善训练无法解决这一差距,因为它源于对话上下文中的结构模糊性,而非表征限制。为了解决这个问题,我们提出通过中介-助手架构将意图理解与任务执行解耦。通过利用经验驱动的中介将用户输入阐释为基于历史交互模式的明确、结构良好的指令,我们的方法有效地弥合了模糊用户意图与模型解释之间的差距。实验结果表明,该方法显著减轻了多轮对话中不同LLMs的性能下降。
cs.CL / 13 / 2602.07361
ViHERMES: A Graph-Grounded Multihop Question Answering Benchmark and System for Vietnamese Healthcare Regulations
ViHERMES:一个基于图的越南医疗法规多跳问答基准与系统
Abstract
Question Answering (QA) over regulatory documents is inherently challenging due to the need for multihop reasoning across legally interdependent texts, a requirement that is particularly pronounced in the healthcare domain where regulations are hierarchically structured and frequently revised through amendments and cross-references. Despite recent progress in retrieval-augmented and graph-based QA methods, systematic evaluation in this setting remains limited, especially for low-resource languages such as Vietnamese, due to the lack of benchmark datasets that explicitly support multihop reasoning over healthcare regulations. In this work, we introduce the Vietnamese Healthcare Regulations-Multihop Reasoning Dataset (ViHERMES), a benchmark designed for multihop QA over Vietnamese healthcare regulatory documents. ViHERMES consists of high-quality question-answer pairs that require reasoning across multiple regulations and capture diverse dependency patterns, including amendment tracing, cross-document comparison, and procedural synthesis. To construct the dataset, we propose a controlled multihop QA generation pipeline based on semantic clustering and graph-inspired data mining, followed by large language model-based generation with structured evidence and reasoning annotations. We further present a graph-aware retrieval framework that models formal legal relations at the level of legal units and supports principled context expansion for legally valid and coherent answers. Experimental results demonstrate that ViHERMES provides a challenging benchmark for evaluating multihop regulatory QA systems and that the proposed graph-aware approach consistently outperforms strong retrieval-based baselines. The ViHERMES dataset and system implementation are publicly available at https://github.com/ura-hcmut/ViHERMES.
Chinese Translation
在监管文件上进行问答(QA)本质上具有挑战性,因为需要在法律上相互依赖的文本之间进行多跳推理,这在医疗领域尤为明显,因为法规是分层结构,并且经常通过修正案和交叉引用进行修订。尽管在检索增强和基于图的问答方法方面取得了近期进展,但在这一背景下的系统评估仍然有限,尤其是对于越南语等低资源语言,原因在于缺乏明确支持医疗法规多跳推理的基准数据集。在本研究中,我们介绍了越南医疗法规多跳推理数据集(ViHERMES),这是一个专为越南医疗监管文件的多跳问答设计的基准。ViHERMES包含高质量的问题-答案对,这些问题-答案对需要在多个法规之间进行推理,并捕捉多样的依赖模式,包括修正案追踪、跨文档比较和程序综合。为了构建该数据集,我们提出了一种基于语义聚类和受图结构启发的数据挖掘的受控多跳问答生成管道,随后使用带有结构化证据和推理注释的大型语言模型进行生成。我们进一步提出了一个图感知检索框架,该框架在法律单位层面建模正式的法律关系,并支持合法有效且连贯答案的原则性上下文扩展。实验结果表明,ViHERMES为评估多跳监管问答系统提供了一个具有挑战性的基准,并且所提出的图感知方法在性能上始终优于强大的基于检索的基线。ViHERMES数据集和系统实现已公开发布,网址为 https://github.com/ura-hcmut/ViHERMES。
cs.CL / 14 / 2602.07374
TernaryLM: Memory-Efficient Language Modeling via Native 1-Bit Quantization with Adaptive Layer-wise Scaling
TernaryLM:通过自适应层级缩放的原生1位量化实现内存高效的语言建模
Abstract
Large language models (LLMs) achieve remarkable performance but demand substantial computational resources, limiting deployment on edge devices and resource-constrained environments. We present TernaryLM, a 132M parameter transformer architecture that employs native 1-bit ternary quantization {-1, 0, +1} during training, achieving significant memory reduction without sacrificing language modeling capability. Unlike post-training quantization approaches that quantize pre-trained full-precision models, TernaryLM learns quantization-aware representations from scratch using straight-through estimators and adaptive per-layer scaling factors. Our experiments demonstrate: (1) validation perplexity of 58.42 on TinyStories; (2) downstream transfer with 82.47 percent F1 on MRPC paraphrase detection; (3) 2.4x memory reduction (498MB vs 1197MB) with comparable inference latency; and (4) stable training dynamics across diverse corpora. We provide layer-wise quantization analysis showing that middle transformer layers exhibit highest compatibility with extreme quantization, informing future non-uniform precision strategies. Our results suggest that native 1-bit training is a promising direction for efficient neural language models. Code is available at https://github.com/1nisharg/TernaryLM-Memory-Efficient-Language-Modeling.
Chinese Translation
大型语言模型(LLMs)在性能上取得了显著的成就,但需要大量的计算资源,这限制了它们在边缘设备和资源受限环境中的部署。我们提出了TernaryLM,一种132M参数的变换器架构,在训练过程中采用原生1位三元量化{-1, 0, +1},在不牺牲语言建模能力的情况下实现了显著的内存减少。与后训练量化方法不同,后者对预训练的全精度模型进行量化,TernaryLM从零开始使用直通估计器和自适应每层缩放因子学习量化感知表示。我们的实验表明:(1)在TinyStories上的验证困惑度为58.42;(2)在MRPC释义检测中下游迁移获得82.47%的F1分数;(3)内存减少2.4倍(498MB对比1197MB),推理延迟相当;(4)在不同语料库上训练动态稳定。我们提供了逐层量化分析,显示中间变换器层与极端量化的兼容性最高,为未来的非均匀精度策略提供了参考。我们的结果表明,原生1位训练是高效神经语言模型的一个有前景的方向。代码可在https://github.com/1nisharg/TernaryLM-Memory-Efficient-Language-Modeling获取。
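The core quantization step can be sketched as follows. This is a minimal deterministic sketch, not TernaryLM's implementation: absolute-mean scaling and a 0.7 threshold ratio are common choices in ternary-quantization work and are assumed here; the paper's adaptive per-layer scaling and straight-through-estimator training loop are not reproduced.

```python
def ternary_quantize(weights, threshold_ratio=0.7):
    """Quantize a layer's weights to {-1, 0, +1} with an adaptive scale.

    scale is the mean absolute weight of the layer; entries whose magnitude
    falls below threshold_ratio * scale are mapped to 0 (assumed heuristic).
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    cutoff = threshold_ratio * scale
    q = [0 if abs(w) < cutoff else (1 if w > 0 else -1) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate real-valued weights for the forward pass."""
    return [v * scale for v in q]
```

In quantization-aware training, the forward pass would use `dequantize(q, scale)` while gradients flow to the underlying full-precision weights via a straight-through estimator.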
cs.CL / 15 / 2602.07375
Efficient Post-Training Pruning of Large Language Models with Statistical Correction
基于统计校正的大型语言模型高效后训练剪枝
Abstract
Post-training pruning is an effective approach for reducing the size and inference cost of large language models (LLMs), but existing methods often face a trade-off between pruning quality and computational efficiency. Heuristic pruning methods are efficient but sensitive to activation outliers, while reconstruction-based approaches improve fidelity at the cost of heavy computation. In this work, we propose a lightweight post-training pruning framework based on first-order statistical properties of model weights and activations. During pruning, channel-wise statistics are used to calibrate magnitude-based importance scores, reducing bias from activation-dominated channels. After pruning, we apply an analytic energy compensation to correct distributional distortions caused by weight removal. Both steps operate without retraining, gradients, or second-order information. Experiments across multiple LLM families, sparsity patterns, and evaluation tasks show that the proposed approach improves pruning performance while maintaining computational cost comparable to heuristic methods. The results suggest that simple statistical corrections can be effective for post-training pruning of LLMs.
Chinese Translation
后训练剪枝是一种有效的减少大型语言模型(LLMs)规模和推理成本的方法,但现有方法往往在剪枝质量和计算效率之间面临权衡。启发式剪枝方法效率高,但对激活异常值敏感,而基于重构的方法则在提高保真度的同时增加了计算负担。在本研究中,我们提出了一种基于模型权重和激活的一阶统计特性的轻量级后训练剪枝框架。在剪枝过程中,使用通道级统计数据来校准基于幅度的重要性评分,从而减少来自激活主导通道的偏差。剪枝后,我们应用解析能量补偿来修正因权重移除而导致的分布失真。这两个步骤均无需重新训练、梯度或二阶信息。在多个LLM家族、稀疏模式和评估任务上的实验表明,所提出的方法在提高剪枝性能的同时保持了与启发式方法相当的计算成本。结果表明,简单的统计校正可以有效用于LLMs的后训练剪枝。
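The two steps above can be sketched concretely. This is a plausible reading of the abstract rather than the paper's exact formulas: the importance score |W[i][j]| * act_norm[j] (an activation-calibrated magnitude, as in Wanda-style pruning) and the row-wise L2 energy rescale are assumptions introduced for illustration.

```python
import math

def prune_with_stats(weights, act_norms, sparsity=0.5):
    """Zero the lowest-scoring entries, where score = |W[i][j]| * act_norms[j].

    Channel-wise activation norms calibrate raw magnitudes, reducing bias
    from activation-dominated channels.
    """
    scores = sorted(abs(w) * a for row in weights for w, a in zip(row, act_norms))
    k = int(len(scores) * sparsity)
    cutoff = scores[k - 1] if k else float("-inf")
    return [[0.0 if abs(w) * a <= cutoff else w
             for w, a in zip(row, act_norms)] for row in weights]

def energy_compensate(row_orig, row_pruned):
    """Rescale surviving weights so the row's L2 energy matches the original."""
    e_orig = math.sqrt(sum(w * w for w in row_orig))
    e_new = math.sqrt(sum(w * w for w in row_pruned))
    return row_pruned if e_new == 0 else [w * e_orig / e_new for w in row_pruned]
```

Both functions are closed-form and gradient-free, matching the abstract's claim that neither retraining nor second-order information is required.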
cs.CL / 16 / 2602.07376
Do Large Language Models Reflect Demographic Pluralism in Safety?
大型语言模型在安全性方面是否反映了人口多元化?
Abstract
Large Language Model (LLM) safety is inherently pluralistic, reflecting variations in moral norms, cultural expectations, and demographic contexts. Yet, existing alignment datasets such as ANTHROPIC-HH and DICES rely on demographically narrow annotator pools, overlooking variation in safety perception across communities. Demo-SafetyBench addresses this gap by modeling demographic pluralism directly at the prompt level, decoupling value framing from responses. In Stage I, prompts from DICES are reclassified into 14 safety domains (adapted from BEAVERTAILS) using Mistral 7B-Instruct-v0.3, retaining demographic metadata and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based deduplication, yielding 43,050 samples. In Stage II, pluralistic sensitivity is evaluated using LLMs-as-Raters (Gemma-7B, GPT-4o, and LLaMA-2-7B) under zero-shot inference. Balanced thresholds (delta = 0.5, tau = 10) achieve high reliability (ICC = 0.87) and low demographic sensitivity (DS = 0.12), confirming that pluralistic safety evaluation can be both scalable and demographically robust.
Chinese Translation
大型语言模型(LLM)的安全性本质上是多元化的,反映了道德规范、文化期望和人口背景的差异。然而,现有的对齐数据集如ANTHROPIC-HH和DICES依赖于人口统计上狭窄的标注者池,忽视了不同社区在安全感知上的差异。Demo-SafetyBench通过在提示级别直接建模人口多元化来填补这一空白,将价值框架与响应解耦。在第一阶段,使用Mistral 7B-Instruct-v0.3将DICES中的提示重新分类为14个安全领域(改编自BEAVERTAILS),保留人口统计元数据,并通过Llama-3.1-8B-Instruct与基于SimHash的去重扩展低资源领域,最终生成43,050个样本。在第二阶段,使用LLMs-as-Raters-Gemma-7B、GPT-4o和LLaMA-2-7B在零样本推理下评估多元化敏感性。平衡阈值(delta = 0.5,tau = 10)实现了高可靠性(ICC = 0.87)和低人口敏感性(DS = 0.12),确认多元化安全评估既可以扩展又在人口统计上具有稳健性。
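The SimHash-based deduplication mentioned in Stage I can be sketched as below. This is the standard SimHash construction over bag-of-words token hashes; the 64-bit width, MD5 token hashing, and Hamming-distance threshold of 3 are conventional defaults assumed here, not details from the paper.

```python
import hashlib

def simhash(text, bits=64):
    """Standard SimHash: sum signed token-hash bits, then take the sign vector."""
    v = [0] * bits
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def is_near_duplicate(t1, t2, max_dist=3):
    """Texts whose fingerprints differ in few bits are treated as duplicates."""
    return hamming(simhash(t1), simhash(t2)) <= max_dist
```

Because the fingerprint is built from a bag of token hashes, near-identical prompts (including reorderings) collide to nearby fingerprints, which is what makes SimHash suitable for filtering LLM-expanded prompt sets at scale.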
cs.CL / 17 / 2602.07381
When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified
当模型说‘不评论’时,我们知道有用性已死,诚实仍在,而安全感则感到恐惧
Abstract
Ensuring that Large Language Models (LLMs) accord with human values, namely being helpful, harmless, and honest (HHH), is important for safe deployment. Existing works use Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) to align LLMs. However, these works face challenges in multi-objective settings: SFT leads to interference between conflicting objectives, while MoEs suffer from miscalibrated routing. We term this failure mode Axis Collapse, marked by (1) disjoint feature spaces causing catastrophic forgetting, and (2) unreliable inference from misrouted experts. To resolve this, we propose AlignX, a two-stage framework. Stage 1 uses prompt-injected fine-tuning to extract axis-specific task features, mitigating catastrophic forgetting. Stage 2 deploys a MoCaE module that calibrates expert routing using fractal and natural geometry, improving inference reliability. AlignX achieves significant gains on Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty), with a +171.5% win rate, +110.1% in truthfulness-informativeness, and 4.3% fewer safety violations. It also reduces latency and memory usage by over 35% compared to prior MoEs. Results across four LLMs validate its generalizability.
Chinese Translation
大型语言模型(LLMs)需要遵循人类价值观——有用性、无害性和诚实性(HHH)对于安全部署至关重要。现有研究使用监督微调(SFT)和专家混合模型(MoE)来对齐LLMs。然而,这些研究在多目标设置中面临挑战,例如SFT导致相互冲突目标之间的干扰,而MoE则遭受错误路由的影响。我们将这种失败模式称为轴崩溃(Axis Collapse),其特征为(1)不相交的特征空间导致灾难性遗忘,以及(2)由于错误路由的专家导致的不可靠推理。为了解决这一问题,我们提出了AlignX,一个两阶段框架。第一阶段使用提示注入微调提取特定于轴的任务特征,从而减轻灾难性遗忘。第二阶段部署一个MoCaE模块,利用分形和自然几何来校准专家路由,提高推理的可靠性。AlignX在Alpaca(有用性)、BeaverTails(无害性)和TruthfulQA(诚实性)上取得了显著提升,胜率提高了171.5%,在真实性-信息性上提高了110.1%,并减少了4.3%的安全违规。此外,与之前的MoE相比,它还将延迟和内存使用减少了超过35%。在四个LLM上的结果验证了其普适性。
cs.CL / 18 / 2602.07382
Advantages of Domain Knowledge Injection for Legal Document Summarization: A Case Study on Summarizing Indian Court Judgments in English and Hindi
领域知识注入在法律文件摘要中的优势:以英语和印地语总结印度法院判决为例
Abstract
Summarizing Indian legal court judgments is a complex task not only due to the intricate language and unstructured nature of the legal texts, but also since a large section of the Indian population does not understand the complex English in which legal text is written, thus requiring summaries in Indian languages. In this study, we aim to improve the summarization of Indian legal text to generate summaries in both English and Hindi (the most widely spoken Indian language), by injecting domain knowledge into diverse summarization models. We propose a framework to enhance extractive neural summarization models by incorporating domain-specific pre-trained encoders tailored for legal texts. Further, we explore the injection of legal domain knowledge into generative models (including Large Language Models) through continual pre-training on large legal corpora in English and Hindi. Our proposed approaches achieve statistically significant improvements in both English-to-English and English-to-Hindi Indian legal document summarization, as measured by standard evaluation metrics, factual consistency metrics, and legal domain-specific metrics. Furthermore, these improvements are validated through domain experts, demonstrating the effectiveness of our approaches.
Chinese Translation
总结印度法律法院判决是一项复杂的任务,这不仅是由于法律文本复杂的语言和非结构化的特性,还因为印度人口中有很大一部分人无法理解法律文本所使用的复杂英语,因此需要用印度语言进行摘要。在本研究中,我们旨在通过将领域知识注入多种摘要模型,来改善印度法律文本的摘要生成,以便同时生成英语和印地语摘要(印地语是使用最广泛的印度语言)。我们提出了一个框架,通过结合针对法律文本量身定制的领域特定预训练编码器,来增强提取式神经摘要模型。此外,我们还探讨了通过在英语和印地语的大型法律语料库上进行持续预训练,将法律领域知识注入生成模型(包括大型语言模型)。我们提出的方法在英语到英语和英语到印地语的印度法律文件摘要中,均取得了统计显著的改进,这些改进是通过标准评估指标、事实一致性指标和法律领域特定指标进行测量的。此外,这些改进得到了领域专家的验证,证明了我们方法的有效性。
cs.CL / 19 / 2602.07447
Measuring cross-language intelligibility between Romance languages with computational tools
利用计算工具测量罗曼语族语言之间的相互理解度
Abstract
We present an analysis of mutual intelligibility in related languages, applied to languages of the Romance family. We introduce a novel computational metric for estimating intelligibility based on lexical similarity, using the surface and semantic similarity of related words, and use it to measure mutual intelligibility for the five main Romance languages (French, Italian, Portuguese, Spanish, and Romanian). We compare results using both the orthographic and phonetic forms of words, as well as different parallel corpora and vectorial models of word-meaning representation. The obtained intelligibility scores confirm intuitions about intelligibility asymmetry across languages and correlate significantly with the results of cloze tests in human experiments.
Chinese Translation
我们对相关语言的相互理解度进行了分析,重点研究罗曼语族的语言。我们引入了一种基于词汇相似性的新的计算指标,该指标利用相关词汇的表面相似性和语义相似性来估计理解度,并用它来测量五种主要罗曼语言(法语、意大利语、葡萄牙语、西班牙语和罗马尼亚语)之间的相互理解度。我们比较了使用词汇的正字法形式和语音形式,以及不同的平行语料库和词义表示的向量模型所得到的结果。获得的理解度评分证实了与语言间理解度不对称相关的直觉,并与人类实验中的完形填空测试结果显著相关。
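The surface-similarity half of such a lexical metric is commonly a normalized edit distance between related word forms. The sketch below assumes that standard formulation; the paper's full metric also folds in semantic similarity from vector models, which is not reproduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic rolling-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def surface_similarity(a, b):
    """1 - normalized Levenshtein distance between two word forms."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))
```

For cognates such as Spanish "noche" and Italian "notte", this yields an intermediate score, capturing the intuition that partially overlapping forms remain partially intelligible.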
cs.CL / 20 / 2602.07451
DLLM Agent: See Farther, Run Faster
DLLM代理:看得更远,跑得更快
Abstract
Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties, yet their implications for agentic multi-step decision making remain underexplored. We ask a concrete question: when the generation paradigm is changed but the agent framework and supervision are held fixed, do diffusion backbones induce systematically different planning and tool-use behaviors, and do these differences translate into end-to-end efficiency gains? We study this in a controlled setting by instantiating DLLM and AR backbones within the same agent workflow (DeepDiver) and performing matched agent-oriented fine-tuning on the same trajectory data, yielding diffusion-backed DLLM Agents and directly comparable AR agents. Across benchmarks and case studies, we find that, at comparable accuracy, DLLM Agents are on average over 30% faster end to end than AR agents, with some cases exceeding 8x speedup. Conditioned on correct task completion, DLLM Agents also require fewer interaction rounds and tool invocations, consistent with higher planner hit rates that converge earlier to a correct action path with less backtracking. We further identify two practical considerations for deploying diffusion backbones in tool-using agents. First, naive DLLM policies are more prone to structured tool-call failures, necessitating stronger tool-call-specific training to emit valid schemas and arguments. Second, for multi-turn inputs interleaving context and action spans, diffusion-style span corruption requires aligned attention masking to avoid spurious context-action information flow; without such alignment, performance degrades. Finally, we analyze attention dynamics across workflow stages and observe paradigm-specific coordination patterns, suggesting stronger global planning signals in diffusion-backed agents.
Chinese Translation
扩散大语言模型(DLLMs)作为自回归(AR)解码的一种替代方案,因其高效性和建模特性而受到关注,但其在代理多步决策中的影响仍未得到充分探讨。我们提出一个具体问题:当生成范式改变而代理框架和监督保持不变时,扩散骨干是否会引发系统性不同的规划和工具使用行为,这些差异是否会转化为端到端的效率提升?我们在一个受控环境中研究这一问题,通过在同一代理工作流(DeepDiver)中实例化DLLM和AR骨干,并对相同轨迹数据进行匹配的代理导向微调,从而生成基于扩散的DLLM代理和可直接比较的AR代理。在基准测试和案例研究中,我们发现,在相似的准确率下,DLLM代理的端到端速度平均比AR代理快超过30%,某些情况下速度提升超过8倍。在正确任务完成的条件下,DLLM代理还需要更少的交互轮次和工具调用,这与更高的规划者命中率一致,后者更早地收敛到正确的行动路径,且回溯更少。我们进一步识别了在工具使用代理中部署扩散骨干的两个实际考虑。首先,简单的DLLM策略更容易出现结构化工具调用失败,因此需要更强的工具调用特定训练,以发出有效的模式和参数。其次,对于交替上下文和行动范围的多轮输入,扩散风格的范围损坏需要对齐的注意力掩蔽,以避免虚假的上下文-行动信息流;没有这样的对齐,性能会下降。最后,我们分析了工作流阶段中的注意力动态,并观察到特定范式的协调模式,表明扩散支持的代理中存在更强的全局规划信号。
cs.CL / 21 / 2602.07464
SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning
SED-SFT:在监督微调中选择性地鼓励多样性
Abstract
Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). However, the conventional SFT process, driven by Cross-Entropy (CE) loss, often induces mode collapse, where models over-concentrate on specific response patterns. This lack of distributional diversity severely restricts the exploration efficiency required for subsequent RL. While recent studies have attempted to improve SFT by replacing the CE loss, aiming to preserve diversity or refine the update policy, they fail to adequately balance diversity and accuracy, thereby yielding suboptimal performance after RL. To address the mode collapse problem, we propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective. Extensive experiments across eight mathematical benchmarks demonstrate that SED-SFT significantly enhances generation diversity with a negligible computational overhead increase compared with CE loss, yielding average improvements of 2.06 and 1.20 points in subsequent RL performance over standard CE-based baselines on Llama-3.2-3B-Instruct and Qwen2.5-Math-7B-Instruct, respectively. The code is publicly available at https://github.com/pppa2019/SED-SFT
Chinese Translation
监督微调(Supervised Fine-Tuning, SFT)后接强化学习(Reinforcement Learning, RL)已成为大型语言模型(Large Language Models, LLMs)的标准后训练范式。然而,传统的SFT过程受到交叉熵(Cross-Entropy, CE)损失的驱动,往往会导致模式崩溃,即模型过度集中于特定的响应模式。这种缺乏分布多样性严重限制了后续RL所需的探索效率。尽管近期研究尝试通过替换CE损失来改善SFT,旨在保持多样性或优化更新策略,但未能充分平衡多样性与准确性,导致在RL后的表现不尽如人意。为了解决模式崩溃问题,我们提出了SED-SFT,该方法基于令牌探索空间自适应地鼓励多样性。该框架在优化目标中引入了选择性熵正则化项和选择性掩蔽机制。通过在八个数学基准上的广泛实验,结果表明,与CE损失相比,SED-SFT显著增强了生成多样性,并且计算开销增加微乎其微,在Llama-3.2-3B-Instruct和Qwen2.5-Math-7B-Instruct的标准CE基线上的后续RL表现分别提高了2.06和1.20分。代码已公开发布在 https://github.com/pppa2019/SED-SFT
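The selective entropy regularization can be sketched at the level of a single token's loss. This is an illustrative reading of the abstract, not the paper's objective: the support-count gate standing in for the "token exploration space", the `min_support` threshold, and the entropy weight are all hypothetical choices.

```python
import math

def sed_sft_loss(probs, target_idx, entropy_weight=0.1, min_support=3):
    """Cross-entropy minus a selectively masked entropy bonus.

    The entropy term is masked in only when the token's exploration space is
    large enough: here, at least `min_support` candidates carry probability
    above the uniform level 1/len(probs) (an assumed proxy).
    """
    ce = -math.log(probs[target_idx])
    support = sum(1 for p in probs if p > 1.0 / len(probs))
    if support >= min_support:
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        return ce - entropy_weight * entropy  # encourage diversity here
    return ce  # narrow exploration space: plain cross-entropy
```

Subtracting entropy only where many plausible continuations exist rewards flatter distributions at those positions, counteracting the mode collapse induced by pure CE training while leaving low-ambiguity tokens driven by accuracy alone.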
cs.CL / 22 / 2602.07497
From Native Memes to Global Moderation: Cross-Cultural Evaluation of Vision-Language Models for Hateful Meme Detection
从本土迷因到全球治理:跨文化视角下的视觉-语言模型在仇恨迷因检测中的评估
Abstract
Cultural context profoundly shapes how people interpret online content, yet vision-language models (VLMs) remain predominantly trained through Western or English-centric lenses. This limits their fairness and cross-cultural robustness in tasks like hateful meme detection. We introduce a systematic evaluation framework designed to diagnose and quantify the cross-cultural robustness of state-of-the-art VLMs across multilingual meme datasets, analyzing three axes: (i) learning strategy (zero-shot vs. one-shot), (ii) prompting language (native vs. English), and (iii) translation effects on meaning and detection. Results show that the common ``translate-then-detect'' approach deteriorates performance, while culturally aligned interventions - native-language prompting and one-shot learning - significantly enhance detection. Our findings reveal systematic convergence toward Western safety norms and provide actionable strategies to mitigate such bias, guiding the design of globally robust multimodal moderation systems.
Chinese Translation
文化背景深刻影响人们对在线内容的解读,但视觉-语言模型(VLMs)仍主要通过西方或以英语为中心的视角进行训练。这限制了它们在仇恨迷因检测等任务中的公平性和跨文化鲁棒性。我们提出了一种系统评估框架,旨在诊断和量化最先进的VLMs在多语言迷因数据集上的跨文化鲁棒性,分析三个维度:(i)学习策略(零样本学习与单样本学习),(ii)提示语言(母语与英语),以及(iii)翻译对意义和检测的影响。结果表明,常见的“翻译后检测”方法会降低性能,而文化对齐的干预措施——母语提示和单样本学习——显著提升了检测效果。我们的研究发现系统性地趋向于西方安全规范,并提供了可行的策略来减轻这种偏见,为设计全球鲁棒的多模态治理系统提供指导。
cs.CL / 23 / 2602.07499
Let's Simplify Step by Step: Guiding LLM Towards Multilingual Unsupervised Proficiency-Controlled Sentence Simplification
让我们一步一步简化:引导大型语言模型实现多语言无监督能力控制的句子简化
Abstract
Large language models demonstrate limited capability in proficiency-controlled sentence simplification, particularly when simplifying across large readability levels. We propose a framework that decomposes complex simplifications into manageable steps through dynamic path planning, semantic-aware exemplar selection, and chain-of-thought generation with conversation history for coherent reasoning. Evaluation on five languages across two benchmarks shows our approach improves simplification effectiveness while reducing computational steps by 22-42%. Human evaluation confirms the fundamental trade-off between simplification effectiveness and meaning preservation. Notably, even human annotators struggle to agree on semantic preservation judgments, highlighting the inherent complexity of this task. Our work shows that while step-by-step simplification improves control, preserving semantic fidelity during extensive simplification remains an open challenge.
Chinese Translation
大型语言模型在能力控制的句子简化方面表现出有限的能力,特别是在跨越较大可读性水平进行简化时。我们提出了一种框架,通过动态路径规划、语义感知的示例选择以及结合对话历史的思维链生成,将复杂的简化过程分解为可管理的步骤。对五种语言在两个基准上的评估表明,我们的方法在提高简化效果的同时将计算步骤减少了22-42%。人类评估确认了简化效果与意义保留之间的基本权衡。值得注意的是,即使是人类注释者在语义保留判断上也难以达成一致,突显了这一任务的内在复杂性。我们的研究表明,尽管逐步简化改善了控制,但在广泛简化过程中保持语义真实性仍然是一个未解决的挑战。
cs.CL / 24 / 2602.07546
Improving Variable-Length Generation in Diffusion Language Models via Length Regularization
通过长度正则化改善扩散语言模型中的可变长度生成
Abstract
Diffusion Large Language Models (DLLMs) are inherently ill-suited for variable-length generation, as their inference is defined on a fixed-length canvas and implicitly assumes a known target length. When the length is unknown, as in realistic completion and infilling, naively comparing confidence across mask lengths becomes systematically biased, leading to under-generation or redundant continuations. In this paper, we show that this failure arises from an intrinsic length-induced bias in generation confidence estimates, leaving existing DLLMs without a robust way to determine generation length and making variable-length inference unreliable. To address this issue, we propose LR-DLLM, a length-regularized inference framework for DLLMs that treats generation length as an explicit variable and achieves reliable length determination at inference time. It decouples semantic compatibility from length-induced uncertainty through an explicit length regularization that corrects biased confidence estimates. Based on this, LR-DLLM enables dynamic expansion or contraction of the generation span without modifying the underlying DLLM or its training procedure. Experiments show that LR-DLLM achieves 51.3% Pass@1 on HumanEval-Infilling under fully unknown lengths (+13.4% vs. DreamOn) and 51.5% average Pass@1 on four-language McEval (+14.3% vs. DreamOn).
Chinese Translation
扩散大型语言模型(DLLMs)本质上不适合进行可变长度生成,因为它们的推理是在固定长度的画布上定义的,并隐含地假设目标长度是已知的。当长度未知时,如在现实的补全和填充任务中,简单地比较不同掩码长度的置信度会系统性地产生偏差,导致生成不足或冗余的延续。在本文中,我们展示了这种失败源于生成置信度估计中固有的长度引起的偏差,使得现有的DLLMs缺乏一种可靠的方式来确定生成长度,从而使可变长度推理变得不可靠。为了解决这个问题,我们提出了LR-DLLM,一种针对DLLMs的长度正则化推理框架,该框架将生成长度视为一个显式变量,并在推理时实现可靠的长度确定。它通过显式的长度正则化将语义兼容性与长度引起的不确定性解耦,从而修正偏差的置信度估计。在此基础上,LR-DLLM能够在不修改底层DLLM或其训练过程的情况下动态扩展或收缩生成范围。实验表明,LR-DLLM在完全未知长度的HumanEvalInfilling任务中达到了51.3%的Pass@1(比DreamOn提高了13.4%),在四语言McEval上达到了51.5%的平均Pass@1(比DreamOn提高了14.3%)。
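The length-induced bias and its correction can be illustrated with a toy length selector. The power-law divisor `len ** alpha` below is a generic length-normalization heuristic assumed for illustration; it is not the paper's regularizer, only a stand-in showing how correcting raw confidence changes which mask length wins.

```python
def length_regularized_score(token_logprobs, alpha=0.6):
    """Length-corrected confidence for one candidate generation length.

    The raw summed log-probability shrinks as length grows, biasing naive
    comparison toward short spans (under-generation); dividing by
    len**alpha is a simple, assumed correction (alpha is a tuning knob).
    """
    return sum(token_logprobs) / (len(token_logprobs) ** alpha)

def pick_length(candidates, alpha=0.6):
    """candidates: {mask_length: [per-token log-probs]} -> chosen length."""
    return max(candidates,
               key=lambda L: length_regularized_score(candidates[L], alpha))
```

In the test case below, the raw sums favor the 2-token span even though the 4-token span is more confident per token; the corrected score reverses that choice, which is the qualitative behavior the abstract attributes to length regularization.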
cs.CL / 25 / 2602.07594
Learning to Self-Verify Makes Language Models Better Reasoners
学习自我验证使语言模型成为更优秀的推理者
Abstract
Recent large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. However, despite powerful generation ability, LLMs remain weak at verifying their own answers, revealing a persistent capability asymmetry between generation and self-verification. In this work, we conduct an in-depth investigation of this asymmetry throughout training evolution and show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification. Interestingly, we find that the reverse direction of this asymmetry behaves differently: learning to self-verify can effectively improve generation performance, achieving accuracy comparable to standard generation training while yielding more efficient and effective reasoning traces. Building on this observation, we further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework, where generation and self-verification are optimized as two independent but complementary objectives. Extensive experiments across benchmarks and models demonstrate performance gains over generation-only training in both generation and verification capabilities.
Chinese Translation
近期的大型语言模型(LLMs)在为复杂任务生成有效推理路径方面表现出色。然而,尽管生成能力强大,LLMs在验证自身答案方面仍然较弱,揭示了生成与自我验证之间持续存在的能力不对称。在本研究中,我们对这种不对称性在训练演变过程中的表现进行了深入调查,并表明,即使在同一任务上,提升生成能力并不会导致自我验证能力的相应提升。有趣的是,我们发现这种不对称性的反向方向表现不同:学习自我验证可以有效提高生成性能,达到与标准生成训练相当的准确性,同时产生更高效和有效的推理轨迹。基于这一观察,我们进一步探索将自我验证整合到生成训练中,通过构建一个多任务强化学习框架,在该框架中,生成和自我验证作为两个独立但互补的目标进行优化。针对多个基准和模型的广泛实验表明,在生成和验证能力上,相较于仅进行生成训练,性能得到了显著提升。
cs.CL / 26 / 2602.07621
SciClaimEval: Cross-modal Claim Verification in Scientific Papers
SciClaimEval:科学论文中的跨模态声明验证
Abstract
We present SciClaimEval, a new scientific dataset for the claim verification task. Unlike existing resources, SciClaimEval features authentic claims, including refuted ones, directly extracted from published papers. To create refuted claims, we introduce a novel approach that modifies the supporting evidence (figures and tables), rather than altering the claims or relying on large language models (LLMs) to fabricate contradictions. The dataset provides cross-modal evidence with diverse representations: figures are available as images, while tables are provided in multiple formats, including images, LaTeX source, HTML, and JSON. SciClaimEval contains 1,664 annotated samples from 180 papers across three domains, machine learning, natural language processing, and medicine, validated through expert annotation. We benchmark 11 multimodal foundation models, both open-source and proprietary, across the dataset. Results show that figure-based verification remains particularly challenging for all models, as a substantial performance gap remains between the best system and human baseline.
Chinese Translation
我们提出了SciClaimEval,这是一个用于声明验证任务的新科学数据集。与现有资源不同,SciClaimEval包含直接从已发表论文中提取的真实声明,包括被驳斥的声明。为了创建被驳斥的声明,我们引入了一种新颖的方法,该方法修改支持证据(图表),而不是改变声明或依赖大型语言模型(LLMs)来虚构矛盾。该数据集提供了具有多样化表示的跨模态证据:图形以图像形式提供,而表格则以多种格式提供,包括图像、LaTeX源代码、HTML和JSON。SciClaimEval包含来自180篇论文的1,664个标注样本,涵盖机器学习、自然语言处理和医学三个领域,经过专家标注验证。我们在该数据集上对11个多模态基础模型(包括开源和专有模型)进行了基准测试。结果表明,基于图形的验证对于所有模型仍然特别具有挑战性,因为最佳系统与人类基线之间仍存在显著的性能差距。
cs.CL / 27 / 2602.07639
Letting Tutor Personas "Speak Up" for LLMs: Learning Steering Vectors from Dialogue via Preference Optimization
让辅导者角色为大型语言模型发声:通过偏好优化从对话中学习引导向量
Abstract
With the emergence of large language models (LLMs) as a powerful class of generative artificial intelligence (AI), their use in tutoring has become increasingly prominent. Prior works on LLM-based tutoring typically learn a single tutor policy and do not capture the diversity of tutoring styles. In real-world tutor-student interactions, pedagogical intent is realized through adaptive instructional strategies, with tutors varying the level of scaffolding, instructional directiveness, feedback, and affective support in response to learners' needs. These differences can all impact dialogue dynamics and student engagement. In this paper, we explore how tutor personas embedded in human tutor-student dialogues can be used to guide LLM behavior without relying on explicitly prompted instructions. We modify Bidirectional Preference Optimization (BiPO) to learn a steering vector, an activation-space direction that steers model responses towards certain tutor personas. We find that this steering vector captures tutor-specific variation across dialogue contexts, improving semantic alignment with ground-truth tutor utterances and increasing preference-based evaluations, while largely preserving lexical similarity. Analysis of the learned directional coefficients further reveals interpretable structure across tutors, corresponding to consistent differences in tutoring behavior. These results demonstrate that activation steering offers an effective and interpretable way for controlling tutor-specific variation in LLMs using signals derived directly from human dialogue data.
Chinese Translation
随着大型语言模型(LLMs)作为一种强大的生成性人工智能(AI)类别的出现,它们在辅导中的应用变得越来越突出。以LLM为基础的辅导的先前研究通常学习单一的辅导策略,未能捕捉辅导风格的多样性。在现实世界的辅导-学生互动中,教学意图通过适应性教学策略得以实现,辅导者根据学习者的需求调整支架水平、教学指令性、反馈和情感支持。这些差异都会影响对话动态和学生参与度。本文探讨了如何利用嵌入在人类辅导-学生对话中的辅导者角色来引导LLM的行为,而无需依赖明确的提示指令。我们修改了双向偏好优化(BiPO),以学习引导向量,这是一种激活空间方向,能够将模型响应引导向特定的辅导者角色。我们发现,这一引导向量捕捉了对话上下文中辅导者特有的变化,提高了与真实辅导者发言的语义对齐,并增加了基于偏好的评估,同时在很大程度上保持了词汇相似性。对学习到的方向系数的分析进一步揭示了辅导者之间的可解释结构,反映了辅导行为中的一致性差异。这些结果表明,激活引导提供了一种有效且可解释的方式,通过直接从人类对话数据中提取的信号来控制LLM中的辅导者特定变化。
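Activation steering itself is a simple additive intervention. The paper learns the direction with a modified BiPO; as a self-contained stand-in, the sketch below uses the classic difference-of-means direction between activations from persona-consistent and persona-inconsistent responses (an assumed simplification), then applies it additively at inference.

```python
def mean_difference_direction(pos_acts, neg_acts):
    """Crude stand-in for the learned steering vector: mean(pos) - mean(neg).

    pos_acts / neg_acts: lists of activation vectors collected from
    persona-consistent vs. persona-inconsistent responses.
    """
    n, m, dim = len(pos_acts), len(neg_acts), len(pos_acts[0])
    return [sum(a[i] for a in pos_acts) / n - sum(a[i] for a in neg_acts) / m
            for i in range(dim)]

def apply_steering(hidden_state, steering_vector, coeff=1.0):
    """Shift a hidden activation along the persona direction at inference."""
    return [h + coeff * s for h, s in zip(hidden_state, steering_vector)]
```

The coefficient plays the role of the learned directional coefficients discussed in the abstract: its sign and magnitude control how strongly generation is pushed toward a given tutor persona, without any prompt-level instruction.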
cs.CL / 28 / 2602.07673
Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation
对人类触感的盲目:基于大型语言模型的摘要评估中的重叠偏差
Abstract
Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, are better at reasoning, and are more robust to paraphrasing. However, LLM judges exhibit biases, including length and order biases, and are vulnerable to various adversarial input prompts. While recent studies have looked into these biases, few have analyzed them at a granular level in relation to a well-defined overlap metric. In this work we provide an analysis of LLM judge bias as a function of overlap with human-written responses in the domain of summarization. We test 9 recent LLMs with parameter counts ranging from 1 billion to 12 billion, including variants of Gemma 3 and LLaMA 3. We find that LLM judges increasingly prefer summaries generated by other LLMs over those written by humans as the similarity (as measured by ROUGE and BLEU) between the judged summaries decreases; this pattern holds for all but one model tested and persists regardless of the models' own position biases. Additionally, we find that models struggle to judge even summaries with limited overlap, suggesting that LLM-as-a-judge in the summarization domain should rely on techniques beyond simple comparison.
Chinese Translation
大型语言模型(LLM)评审者常常与传统的基于算法的指标一起用于摘要等任务,因为它们更好地捕捉语义信息,更擅长推理,并且对改述更具鲁棒性。然而,LLM 评审者在长度和顺序等方面表现出偏差,并且容易受到各种对抗性输入提示的影响。尽管近期的研究关注了这些偏差,但很少有研究在与明确的重叠指标相关的更细粒度层面上进行分析。在本研究中,我们提供了 LLM 评审者偏差的分析,作为与人类撰写的响应在摘要领域的重叠程度的函数。我们测试了 9 个最近的 LLM,参数数量从 10 亿到 120 亿不等,包括 Gemma 3 和 LLaMA 3 的变体。我们发现,随着被评审摘要之间的相似性(通过 ROUGE 和 BLEU 测量)降低,LLM 评审者越来越倾向于选择其他 LLM 生成的摘要而非人类撰写的摘要,这一模式适用于所有测试的模型,只有一个例外,并且无论模型自身的位置偏差如何,这种现象依然存在。此外,我们发现模型在评估即使是有限重叠的摘要时也存在困难,这表明在摘要领域中,作为评审者的 LLM 应该依赖于超越简单比较的技术。
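The overlap axis along which the bias is measured can be made concrete with a tiny unigram-overlap score in the spirit of ROUGE-1 F1. This is a simplified re-implementation for illustration, not the official ROUGE package:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 (a simplified ROUGE-1): the kind of lexical
    similarity score the bias analysis conditions on."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    # multiset intersection counts each shared word at most min(count) times
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision, recall = overlap / len(cand), overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under the paper's finding, judge preference for LLM-written summaries grows as this score between the compared summaries shrinks.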
cs.CL / 29 / 2602.07773
SRR-Judge: Step-Level Rating and Refinement for Enhancing Search-Integrated Reasoning in Search Agents
SRR-Judge:用于增强搜索代理中搜索集成推理的逐步评分与精炼
Abstract
Recent deep search agents built on large reasoning models (LRMs) excel at complex question answering by iteratively planning, acting, and gathering evidence, a capability known as search-integrated reasoning. However, mainstream approaches often train this ability using only outcome-based supervision, neglecting the quality of intermediate thoughts and actions. We introduce SRR-Judge, a framework for reliable step-level assessment of reasoning and search actions. Integrated into a modified ReAct-style rate-and-refine workflow, SRR-Judge provides fine-grained guidance for search-integrated reasoning and enables efficient post-training annotation. Using SRR-annotated data, we apply an iterative rejection sampling fine-tuning procedure to enhance the deep search capability of the base agent. Empirically, SRR-Judge delivers more reliable step-level evaluations than much larger models such as DeepSeek-V3.1, with its ratings showing strong correlation with final answer correctness. Moreover, aligning the policy with SRR-Judge annotated trajectories leads to substantial performance gains, yielding over a 10 percent average absolute pass@1 improvement across challenging deep search benchmarks.
Chinese Translation
最近,基于大型推理模型(LRMs)构建的深度搜索代理在复杂问题回答方面表现出色,能够通过迭代规划、行动和收集证据来实现,这一能力被称为搜索集成推理。然而,主流方法通常仅通过基于结果的监督来训练这种能力,忽视了中间思维和行动的质量。我们提出了SRR-Judge,一个用于可靠的逐步评估推理和搜索行动的框架。SRR-Judge集成到修改后的ReAct风格的评分与精炼工作流程中,为搜索集成推理提供了细粒度的指导,并实现了高效的后训练注释。利用SRR注释的数据,我们应用了一种迭代拒绝采样微调程序,以增强基础代理的深度搜索能力。从实证结果来看,SRR-Judge提供的逐步评估比更大模型(如DeepSeek-V3.1)更可靠,其评分与最终答案的正确性显示出强相关性。此外,将策略与SRR-Judge注释的轨迹对齐,带来了显著的性能提升,在具有挑战性的深度搜索基准测试中,平均绝对pass@1提升超过10%。
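The rejection-sampling fine-tuning loop described above reduces, at its core, to filtering trajectories by step-level ratings. A minimal sketch, assuming a `rate_fn` that stands in for SRR-Judge and a simple keep-if-every-step-clears-threshold rule (both names and the rule are illustrative simplifications):

```python
def rejection_sample(trajectories, rate_fn, threshold):
    """Keep only trajectories whose every step clears the judge threshold;
    the kept set is then used as fine-tuning data."""
    kept = []
    for traj in trajectories:
        ratings = [rate_fn(step) for step in traj]
        if ratings and min(ratings) >= threshold:
            kept.append(traj)
    return kept

# toy judge: each step carries a numeric quality score
trajs = [
    [{"q": 0.9}, {"q": 0.8}],  # every step is good -> kept
    [{"q": 0.9}, {"q": 0.2}],  # one weak step -> rejected
]
good = rejection_sample(trajs, lambda step: step["q"], threshold=0.5)
```

The step-level signal is what distinguishes this from outcome-only supervision: a trajectory with a correct final answer but a weak intermediate step is still filtered out.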
cs.CL / 30 / 2602.07778
Attn-GS: Attention-Guided Context Compression for Efficient Personalized LLMs
Attn-GS:基于注意力引导的上下文压缩以实现高效个性化大语言模型
Abstract
Personalizing large language models (LLMs) to individual users requires incorporating extensive interaction histories and profiles, but input token constraints make this impractical due to high inference latency and API costs. Existing approaches rely on heuristic methods such as selecting recent interactions or prompting summarization models to compress user profiles. However, these methods treat context as a monolithic whole and fail to consider how LLMs internally process and prioritize different profile components. We investigate whether LLMs' attention patterns can effectively identify important personalization signals for intelligent context compression. Through preliminary studies on representative personalization tasks, we discover that (a) LLMs' attention patterns naturally reveal important signals, and (b) fine-tuning enhances LLMs' ability to distinguish between relevant and irrelevant information. Based on these insights, we propose Attn-GS, an attention-guided context compression framework that leverages attention feedback from a marking model to mark important personalization sentences, then guides a compression model to generate task-relevant, high-quality compressed user contexts. Extensive experiments demonstrate that Attn-GS significantly outperforms various baselines across different tasks, token limits, and settings, achieving performance close to using full context while reducing token usage by 50 times.
Chinese Translation
将大型语言模型(LLMs)个性化为单个用户需要整合大量的交互历史和用户档案,但由于高推理延迟和API成本,输入令牌的限制使这一做法不切实际。现有方法依赖于启发式方法,例如选择最近的交互或提示摘要模型来压缩用户档案。然而,这些方法将上下文视为一个整体,未能考虑LLMs如何在内部处理和优先考虑不同的档案组件。我们研究了LLMs的注意力模式是否能够有效识别重要的个性化信号,以实现智能上下文压缩。通过对代表性个性化任务的初步研究,我们发现:(a)LLMs的注意力模式自然揭示了重要信号;(b)微调增强了LLMs区分相关信息和无关信息的能力。基于这些见解,我们提出了Attn-GS,一个基于注意力引导的上下文压缩框架,利用标记模型的注意力反馈来标记重要的个性化句子,然后引导压缩模型生成与任务相关的高质量压缩用户上下文。大量实验表明,Attn-GS在不同任务、令牌限制和设置下显著优于各种基线,性能接近于使用完整上下文,同时令牌使用量减少了50倍。
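The marking-then-compressing idea can be approximated as attention-weighted sentence selection under a token budget. A hedged sketch, where the per-sentence attention scores are taken as given (in the paper they come from a marking model) and whitespace word counts stand in for a real tokenizer:

```python
def select_by_attention(sentences, attn_scores, budget):
    """Greedily keep the highest-attention sentences within a word budget,
    then restore document order so the compressed context stays readable."""
    ranked = sorted(zip(sentences, attn_scores), key=lambda p: p[1], reverse=True)
    kept, used = set(), 0
    for sent, _ in ranked:
        cost = len(sent.split())
        if used + cost <= budget:
            kept.add(sent)
            used += cost
    return [s for s in sentences if s in kept]

profile = ["likes jazz music", "mentioned weather once", "works in robotics lab"]
scores = [0.9, 0.1, 0.8]  # hypothetical attention mass per sentence
picked = select_by_attention(profile, scores, budget=7)
```

The restore-original-order step matters in practice: a compressed profile that preserves narrative order is easier for the downstream model to use than a score-sorted one.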
cs.CL / 31 / 2602.07794
Emergent Structured Representations Support Flexible In-Context Inference in Large Language Models
新兴结构化表征支持大型语言模型中的灵活上下文推理
Abstract
Large language models (LLMs) exhibit emergent behaviors suggestive of human-like reasoning. While recent work has identified structured, human-like conceptual representations within these models, it remains unclear whether they functionally rely on such representations for reasoning. Here we investigate the internal processing of LLMs during in-context concept inference. Our results reveal a conceptual subspace emerging in middle to late layers, whose representational structure persists across contexts. Using causal mediation analyses, we demonstrate that this subspace is not merely an epiphenomenon but is functionally central to model predictions, establishing its causal role in inference. We further identify a layer-wise progression where attention heads in early-to-middle layers integrate contextual cues to construct and refine the subspace, which is subsequently leveraged by later layers to generate predictions. Together, these findings provide evidence that LLMs dynamically construct and use structured, latent representations in context for inference, offering insights into the computational processes underlying flexible adaptation.
Chinese Translation
大型语言模型(LLMs)表现出类似人类推理的突现行为。尽管近期的研究已在这些模型中识别出结构化的、类似人类的概念表征,但尚不清楚它们在推理中是否功能性地依赖于这些表征。在此,我们研究了LLMs在上下文概念推理过程中的内部处理。我们的结果揭示了一个在中层到后期层中出现的概念子空间,其表征结构在不同上下文中持续存在。通过因果中介分析,我们证明这个子空间不仅仅是一个附带现象,而是对模型预测功能上至关重要,确立了其在推理中的因果作用。我们进一步识别出一个层级进展,其中早期到中期层的注意力头整合上下文线索,以构建和细化该子空间,随后由后期层利用该子空间生成预测。综合来看,这些发现提供了证据,表明LLMs在上下文中动态构建和使用结构化的潜在表征进行推理,为灵活适应背后的计算过程提供了见解。
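A common way to operationalize such a conceptual subspace, not necessarily the paper's exact procedure, is to take the top principal directions of layer activations and project hidden states onto them:

```python
import numpy as np

def concept_subspace(hidden_states, k):
    """Top-k principal directions of centered activations: an orthonormal
    (k, d) basis standing in for the emergent conceptual subspace."""
    centered = hidden_states - hidden_states.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def project(h, basis):
    """Project a single hidden state onto the subspace."""
    return basis.T @ (basis @ h)

rng = np.random.default_rng(1)
H = rng.normal(size=(32, 6))   # toy activations: 32 contexts, d = 6
B = concept_subspace(H, k=2)
p = project(H[0], B)
```

Causal mediation then amounts to intervening on the projected component and measuring the effect on predictions, which the projection operator makes well-defined.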
cs.CL / 32 / 2602.07796
Thinking Makes LLM Agents Introverted: How Mandatory Thinking Can Backfire in User-Engaged Agents
思考使大型语言模型代理变得内向:强制思考如何在用户参与的代理中适得其反
Abstract
Eliciting reasoning has emerged as a powerful technique for improving the performance of large language models (LLMs) on complex tasks by inducing thinking. However, their effectiveness in realistic user-engaged agent scenarios remains unclear. In this paper, we conduct a comprehensive study on the effect of explicit thinking in user-engaged LLM agents. Our experiments span across seven models, three benchmarks, and two thinking instantiations, and we evaluate them through both a quantitative response taxonomy analysis and qualitative failure propagation case studies. Contrary to expectations, we find that mandatory thinking often backfires on agents in user-engaged settings, causing anomalous performance degradation across various LLMs. Our key finding reveals that thinking makes agents more ``introverted'' by shortening responses and reducing information disclosure to users, which weakens agent-user information exchange and leads to downstream task failures. Furthermore, we demonstrate that explicitly prompting for information disclosure reliably improves performance across diverse model families, suggesting that proactive transparency is a vital lever for agent optimization. Overall, our study suggests that information transparency awareness is a crucial yet underexplored perspective for the future design of reasoning agents in real-world scenarios. Our code is available at https://github.com/deeplearning-wisc/Thinking-Agent.
Chinese Translation
通过诱导思考来引发推理,已成为提高大型语言模型(LLMs)在复杂任务中表现的有效技术。然而,它们在现实的用户参与代理场景中的有效性仍不明确。在本文中,我们对用户参与的LLM代理中显性思考的影响进行了全面研究。我们的实验涵盖了七个模型、三个基准和两种思考实例,并通过定量响应分类分析和定性失败传播案例研究对其进行了评估。与预期相反,我们发现强制思考往往在用户参与的环境中适得其反,导致各种LLM的表现异常下降。我们的关键发现表明,思考通过缩短响应和减少向用户的信息披露,使代理变得更加“内向”,从而削弱了代理与用户之间的信息交流,导致下游任务失败。此外,我们证明,明确提示信息披露可以可靠地提高不同模型家族的性能,这表明主动透明性是代理优化的重要杠杆。总体而言,我们的研究表明,信息透明意识是未来在现实场景中设计推理代理时一个至关重要但尚未深入探讨的视角。我们的代码可在 https://github.com/deeplearning-wisc/Thinking-Agent 获取。
cs.CL / 33 / 2602.07804
Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models
修剪作为一种合作博弈:基于代理的层贡献估计用于大型语言模型
Abstract
While large language models (LLMs) demonstrate impressive performance across various tasks, their deployment in real-world scenarios is still constrained by high computational demands. Layer-wise pruning, a commonly employed strategy to mitigate inference costs, can partially address this challenge. However, existing approaches generally depend on static heuristic rules and fail to account for the interdependencies among layers, thereby limiting the effectiveness of the pruning process. To this end, this paper proposes a game-theoretic framework that formulates layer pruning as a cooperative game in which each layer acts as a player and model performance serves as the utility. As computing exact Shapley values is computationally infeasible for large language models (LLMs), we propose using a lightweight surrogate network to estimate layer-wise marginal contributions. This network can predict LLM performance for arbitrary layer combinations at a low computational cost. Additionally, we employ stratified Monte Carlo mask sampling to further reduce the cost of Shapley value estimation. This approach captures inter-layer dependencies and dynamically identifies critical layers for pruning. Extensive experiments demonstrate the consistent superiority of our method in terms of perplexity and zero-shot accuracy, achieving more efficient and effective layer-wise pruning for large language models.
Chinese Translation
尽管大型语言模型(LLMs)在各种任务中表现出色,但其在现实场景中的部署仍受到高计算需求的限制。层级修剪是一种常用的策略,可以在一定程度上缓解推理成本。然而,现有的方法通常依赖于静态启发式规则,未能考虑层与层之间的相互依赖性,从而限制了修剪过程的有效性。为此,本文提出了一种博弈论框架,将层修剪形式化为一个合作博弈,其中每一层作为一个参与者,模型性能作为效用。由于计算大型语言模型(LLMs)的确切Shapley值在计算上不可行,我们建议使用轻量级代理网络来估计层级边际贡献。该网络能够以低计算成本预测任意层组合的LLM性能。此外,我们采用分层蒙特卡洛掩码采样进一步降低Shapley值估计的成本。这种方法捕捉了层间依赖关系,并动态识别出关键层进行修剪。大量实验表明,我们的方法在困惑度和零样本准确率方面始终优于其他方法,实现了对大型语言模型更高效、更有效的层级修剪。
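The Shapley formulation can be sketched with plain Monte Carlo over layer orderings, using a cheap utility function in place of full LLM evaluation. The paper additionally uses a learned surrogate network and stratified mask sampling; both are omitted in this illustrative version:

```python
import random

def shapley_estimate(n_layers, utility, n_perms=200, seed=0):
    """Monte Carlo Shapley values: average each layer's marginal utility
    over random layer orderings. `utility` maps a frozenset of kept
    layers to a score (here a cheap stand-in for the surrogate network)."""
    rng = random.Random(seed)
    phi = [0.0] * n_layers
    for _ in range(n_perms):
        order = list(range(n_layers))
        rng.shuffle(order)
        kept, prev = set(), utility(frozenset())
        for layer in order:
            kept.add(layer)
            cur = utility(frozenset(kept))
            phi[layer] += cur - prev
            prev = cur
    return [total / n_perms for total in phi]

# sanity check: for an additive utility, Shapley values equal the weights
w = [0.5, 0.1, 0.4]
phi = shapley_estimate(3, lambda kept: sum(w[i] for i in kept))
```

Layers with the smallest estimated contribution are the natural pruning candidates; the cooperative-game view is what lets the estimate account for inter-layer dependencies rather than scoring each layer in isolation.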
cs.CL / 34 / 2602.07812
LLMs Know More About Numbers than They Can Say
大型语言模型对数字的理解超出其表达能力
Abstract
Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: "Which is larger, $5.7 \times 10^2$ or $580$?" This raises a fundamental question: Do LLMs even know how big these numbers are? We probe the hidden states of several smaller open-source LLMs. A single linear projection of an appropriate hidden layer encodes the log-magnitudes of both kinds of numerals, allowing us to recover the numbers with relative error of about 2.3% (on restricted synthetic text) or 19.06% (on scientific papers). Furthermore, the hidden state after reading a pair of numerals encodes their ranking, with a linear classifier achieving over 90% accuracy. Yet surprisingly, when explicitly asked to rank the same pairs of numerals, these LLMs achieve only 50-70% accuracy, with worse performance for models whose probes are less effective. Finally, we show that incorporating the classifier probe's log-loss as an auxiliary objective during finetuning brings an additional 3.22% improvement in verbalized accuracy over base models, demonstrating that improving models' internal magnitude representations can enhance their numerical reasoning capabilities.
Chinese Translation
尽管最先进的大型语言模型(LLMs)能够解决数学问题,但我们发现它们在混合符号的数值比较中会出现错误:“$5.7 \times 10^2$ 和 $580$ 哪个更大?”这引发了一个根本性的问题:大型语言模型是否真的知道这些数字有多大?我们探测了几种较小的开源大型语言模型的隐藏状态。适当隐藏层的单一线性投影编码了两种数字的对数大小,使我们能够以约2.3%的相对误差(在受限的合成文本上)或19.06%的相对误差(在科学论文上)恢复这些数字。此外,在读取一对数字后,隐藏状态编码了它们的排名,线性分类器的准确率超过90%。然而,令人惊讶的是,当被明确要求对相同的数字对进行排名时,这些大型语言模型的准确率仅为50-70%,且探针效果较差的模型表现也更差。最后,我们展示了在微调过程中将分类器探针的对数损失作为辅助目标纳入,可以使口头表达的准确率比基础模型额外提高3.22%,这表明改善模型内部的大小表示可以增强其数值推理能力。
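The probing setup reduces to ordinary least-squares regression from hidden states to log-magnitudes. A synthetic NumPy sketch, where the "hidden states" are random vectors that encode the target linearly by construction (a sanity check of the probe machinery, not the paper's data):

```python
import numpy as np

def fit_linear_probe(states, targets):
    """Least-squares probe (with bias term) from hidden states to targets,
    e.g. the log-magnitude of the numeral just read."""
    X = np.hstack([states, np.ones((len(states), 1))])
    weights, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return weights

def probe_predict(weights, states):
    X = np.hstack([states, np.ones((len(states), 1))])
    return X @ weights

# synthetic sanity check: states that encode the target exactly linearly
rng = np.random.default_rng(0)
H = rng.normal(size=(64, 8))
true_w = rng.normal(size=8)
y = H @ true_w + 1.0
pred = probe_predict(fit_linear_probe(H, y), H)
```

On real activations the fit is imperfect, and the paper's relative-error figures (2.3% on synthetic text, 19.06% on papers) quantify exactly how imperfect.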
cs.CL / 35 / 2602.07839
TodoEvolve: Learning to Architect Agent Planning Systems
TodoEvolve:学习架构智能体规划系统
Abstract
Planning has become a central capability for contemporary agent systems in navigating complex, long-horizon tasks, yet existing approaches predominantly rely on fixed, hand-crafted planning structures that lack the flexibility to adapt to the structural diversity of open-ended problems. To address this limitation, we introduce TodoEvolve, a meta-planning paradigm that autonomously synthesizes and dynamically revises task-specific planning architectures. Specifically, we first construct PlanFactory, a modular design space that standardizes diverse planning paradigms within a unified codebase encompassing topology, initialization, adaptation, and navigation, thereby providing a common interface for heterogeneous planning patterns. Leveraging PlanFactory, we collect high-quality planning trajectories and train Todo-14B via \textit{Impedance-Guided Preference Optimization} (IGPO), a multi-objective reinforcement learning objective that encourages the generation of planning systems that are performant, stable, and token-efficient across arbitrary tasks and agent backbones. Empirical evaluations on five agentic benchmarks demonstrate that TodoEvolve consistently surpasses carefully engineered planning modules while maintaining economical API costs and runtime overhead.
Chinese Translation
规划已成为当代智能体系统在复杂长时程任务中导航的核心能力,然而现有的方法主要依赖于固定的手工设计规划结构,缺乏适应开放性问题结构多样性的灵活性。为了解决这一局限,我们提出了TodoEvolve,一种元规划范式,能够自主合成和动态修订任务特定的规划架构。具体而言,我们首先构建了PlanFactory,一个模块化设计空间,该空间在统一的代码库中标准化了多样的规划范式,包括拓扑、初始化、适应和导航,从而为异构规划模式提供了一个共同接口。利用PlanFactory,我们收集高质量的规划轨迹,并通过\textit{Impedance-Guided Preference Optimization}(IGPO)训练Todo-14B,这是一种多目标强化学习目标,鼓励生成在任意任务和智能体骨架上表现优异、稳定且令牌高效的规划系统。对五个智能体基准的实证评估表明,TodoEvolve在保持经济的API成本和运行时开销的同时,始终超越精心设计的规划模块。
cs.CL / 36 / 2602.07842
Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers
评估和校准多正确答案问题上的大型语言模型信心
Abstract
Confidence calibration is essential for making large language models (LLMs) reliable, yet existing training-free methods have been primarily studied under single-answer question answering. In this paper, we show that these methods break down in the presence of multiple valid answers, where disagreement among equally correct responses leads to systematic underestimation of confidence. To enable a systematic study of this phenomenon, we introduce MACE, a benchmark of 12,000 factual questions spanning six domains with varying numbers of correct answers. Experiments across 15 representative calibration methods and four LLM families (7B-72B) reveal that while accuracy increases with answer cardinality, estimated confidence consistently decreases, causing severe miscalibration for questions with mixed answer counts. To address this issue, we propose Semantic Confidence Aggregation (SCA), which aggregates confidence over multiple high-probability sampled responses. SCA achieves state-of-the-art calibration performance under mixed-answer settings while preserving strong calibration on single-answer questions.
Chinese Translation
信心校准对于提高大型语言模型(LLMs)的可靠性至关重要,但现有的无训练方法主要是在单一答案的问答场景下进行研究。本文展示了这些方法在存在多个有效答案时的失效情况,其中同样正确的回答之间的分歧导致信心被系统性低估。为了系统性地研究这一现象,我们引入了MACE,这是一个包含12,000个事实性问题的基准,涵盖六个领域,并具有不同数量的正确答案。对15种代表性的校准方法和四个LLM家族(7B-72B)的实验表明,尽管准确率随答案基数增加而上升,但估计的信心却持续下降,导致在混合答案数量的问题上出现严重的校准失调。为了解决这一问题,我们提出了语义信心聚合(Semantic Confidence Aggregation, SCA),该方法对多个高概率采样响应的信心进行聚合。SCA在混合答案设置下实现了最先进的校准性能,同时在单一答案问题上保持了强大的校准效果。
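The aggregation step of SCA can be sketched as clustering sampled answers by semantic equivalence and summing their probability mass. Here a lowercase string match stands in for real semantic matching, which is an obvious simplification:

```python
def aggregate_confidence(samples, same_answer):
    """Cluster sampled responses by semantic equivalence, sum probability
    mass per cluster, and report the largest cluster's share."""
    clusters = []  # each entry: [representative_text, total_probability]
    for text, prob in samples:
        for cluster in clusters:
            if same_answer(cluster[0], text):
                cluster[1] += prob
                break
        else:
            clusters.append([text, prob])
    total = sum(p for _, p in clusters)
    return max(p for _, p in clusters) / total

conf = aggregate_confidence(
    [("Paris", 0.4), ("paris", 0.3), ("Lyon", 0.3)],
    same_answer=lambda a, b: a.lower() == b.lower(),
)
```

Without aggregation, the 0.4/0.3 split between surface forms of the same answer would read as low confidence; pooling equivalent answers is what prevents the systematic underestimation the paper documents.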
cs.CL / 37 / 2602.07909
SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization
SparseEval:通过稀疏优化高效评估大型语言模型
Abstract
As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. In this paper, we revisit the model-item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem. Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We utilize the representation capacity of MLP to handle sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to evaluate the value of each item for task-aware refinement. Extensive experiments demonstrate the low estimation error and high Kendall's~$\tau$ of our method across a variety of benchmarks, showcasing its superior robustness and practicality in real-world scenarios. Code is available at {https://github.com/taolinzhang/SparseEval}.
Chinese Translation
随着大型语言模型(LLMs)的不断扩展,它们在各种下游任务上的表现显著提升。然而,评估它们的能力变得越来越昂贵,因为在大量基准样本上进行推理会产生高昂的计算成本。本文重新审视了模型-项目性能矩阵,并展示了其稀疏性,代表性项目可以被选为锚点,并且高效基准测试的任务可以被表述为一个稀疏优化问题。基于这些见解,我们提出了SparseEval,这是一种首次采用梯度下降法来优化锚点权重,并采用迭代精炼策略进行锚点选择的方法。我们利用多层感知机(MLP)的表示能力来处理稀疏优化,并提出了锚点重要性分数和候选重要性分数,以评估每个项目在任务感知精炼中的价值。大量实验表明,我们的方法在多种基准测试中具有低估计误差和高肯德尔相关系数(Kendall's~$\tau$),展示了其在现实场景中的卓越鲁棒性和实用性。代码可在 {https://github.com/taolinzhang/SparseEval} 获取。
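Once anchor items and their weights are fixed, estimating a model's benchmark score is just a weighted sum over its anchor results. A toy sketch (the weights here are invented; in the method they are learned offline by gradient descent with an MLP handling the sparse optimization):

```python
def estimate_score(anchor_scores, anchor_weights):
    """Weighted sum of a model's per-anchor scores, approximating its
    accuracy on the full benchmark."""
    assert len(anchor_scores) == len(anchor_weights)
    return sum(s * w for s, w in zip(anchor_scores, anchor_weights))

weights = [0.5, 0.3, 0.2]   # hypothetical learned anchor weights
scores = [1.0, 0.0, 1.0]    # model is correct on anchors 1 and 3
est = estimate_score(scores, weights)
```

The payoff is that only the anchor items need inference: a new model is run on a handful of items instead of the full benchmark, and the weighted sum recovers an estimate of its overall score.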
cs.CL / 38 / 2602.07930
Patches of Nonlinearity: Instruction Vectors in Large Language Models
非线性的片段:大型语言模型中的指令向量
Abstract
Despite the recent success of instruction-tuned language models and their ubiquitous usage, very little is known of how models process instructions internally. In this work, we address this gap from a mechanistic point of view by investigating how instruction-specific representations are constructed and utilized in different stages of post-training: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Via causal mediation, we identify that instruction representation is fairly localized in models. These representations, which we call Instruction Vectors (IVs), demonstrate a curious juxtaposition of linear separability along with non-linear causal interaction, broadly questioning the scope of the linear representation hypothesis commonplace in mechanistic interpretability. To disentangle the non-linear causal interaction, we propose a novel method to localize information processing in language models that is free from the implicit linear assumptions of patching-based techniques. We find that, conditioned on the task representations formed in the early layers, different information pathways are selected in the later layers to solve that task, i.e., IVs act as circuit selectors.
Chinese Translation
尽管指令调优语言模型近期取得了成功并被广泛使用,但我们对模型如何在内部处理指令知之甚少。在本研究中,我们从机制的角度填补这一空白,探讨指令特定表示在后训练的不同阶段(监督微调(SFT)和直接偏好优化(DPO))是如何构建和利用的。通过因果中介分析,我们发现指令表示在模型中相对局部化。这些表示,我们称之为指令向量(Instruction Vectors, IVs),展示了线性可分性与非线性因果交互的奇特并置,广泛质疑了在机制可解释性中普遍存在的线性表示假设的适用范围。为了理清非线性因果交互,我们提出了一种新方法,用于在语言模型中定位信息处理,避免了基于补丁技术的隐含线性假设。我们发现,在早期层形成的任务表示的条件下,后期层选择不同的信息路径来解决该任务,即,IVs充当电路选择器。
cs.CL / 39 / 2602.07954
Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation
Bielik Guard:用于大型语言模型内容审核的高效波兰语安全分类器
Abstract
As Large Language Models (LLMs) become increasingly deployed in Polish language applications, the need for efficient and accurate content safety classifiers has become paramount. We present Bielik Guard, a family of compact Polish language safety classifiers comprising two model variants: a 0.1B parameter model based on MMLW-RoBERTa-base and a 0.5B parameter model based on PKOBP/polish-roberta-8k. Fine-tuned on a community-annotated dataset of 6,885 Polish texts, these models classify content across five safety categories: Hate/Aggression, Vulgarities, Sexual Content, Crime, and Self-Harm. Our evaluation demonstrates that both models achieve strong performance on multiple benchmarks. The 0.5B variant offers the best overall discrimination capability with F1 scores of 0.791 (micro) and 0.785 (macro) on the test set, while the 0.1B variant demonstrates exceptional efficiency. Notably, Bielik Guard 0.1B v1.1 achieves superior precision (77.65\%) and very low false positive rate (0.63\%) on real user prompts, outperforming HerBERT-PL-Guard (31.55\% precision, 4.70\% FPR) despite identical model size. The models are publicly available and designed to provide appropriate responses rather than simple content blocking, particularly for sensitive categories like self-harm.
Chinese Translation
随着大型语言模型(LLMs)在波兰语应用中的广泛部署,对高效且准确的内容安全分类器的需求变得至关重要。我们提出了Bielik Guard,这是一系列紧凑的波兰语安全分类器,包含两个模型变体:一个基于MMLW-RoBERTa-base的0.1B参数模型和一个基于PKOBP/polish-roberta-8k的0.5B参数模型。这些模型在一个包含6,885个波兰文本的社区标注数据集上进行了微调,能够将内容分类为五个安全类别:仇恨/攻击、粗俗语言、性内容、犯罪和自残。我们的评估表明,这两个模型在多个基准测试中均表现出色。0.5B变体在测试集上提供了最佳的整体区分能力,F1分数为0.791(微平均)和0.785(宏平均),而0.1B变体则展现出卓越的效率。值得注意的是,Bielik Guard 0.1B v1.1在真实用户提示上实现了更高的精准度(77.65%)和极低的假阳性率(0.63%),尽管模型大小相同,但其表现优于HerBERT-PL-Guard(精准度31.55%,假阳性率4.70%)。这些模型是公开可用的,旨在提供适当的响应,而不仅仅是简单的内容屏蔽,特别是对于自残等敏感类别。
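The headline numbers above (precision and false positive rate on real user prompts) follow from the standard confusion-matrix definitions, which can be computed directly:

```python
def precision_and_fpr(preds, labels):
    """preds/labels are booleans: True means 'flagged as unsafe' /
    'actually unsafe'. Returns (precision, false positive rate)."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    tn = sum(not p and not l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, fpr

# toy run: 3 flags, of which 2 are truly unsafe
preds = [True, True, False, False, True]
labels = [True, False, False, False, True]
prec, fpr = precision_and_fpr(preds, labels)
```

For a moderation classifier the false positive rate is the user-facing cost: every false positive is a benign prompt blocked, which is why the 0.63% FPR figure is reported alongside precision.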
cs.CL / 40 / 2602.07963
Lost in Translation? A Comparative Study on the Cross-Lingual Transfer of Composite Harms
翻译中的迷失?复合伤害跨语言转移的比较研究
Abstract
Most safety evaluations of large language models (LLMs) remain anchored in English. Translation is often used as a shortcut to probe multilingual behavior, but it rarely captures the full picture, especially when harmful intent or structure morphs across languages. Some types of harm survive translation almost intact, while others distort or disappear. To study this effect, we introduce CompositeHarm, a translation-based benchmark designed to examine how safety alignment holds up as both syntax and semantics shift. It combines two complementary English datasets, AttaQ, which targets structured adversarial attacks, and MMSafetyBench, which covers contextual, real-world harms, and extends them into six languages: English, Hindi, Assamese, Marathi, Kannada, and Gujarati. Using three large models, we find that attack success rates rise sharply in Indic languages, especially under adversarial syntax, while contextual harms transfer more moderately. To ensure scalability and energy efficiency, our study adopts lightweight inference strategies inspired by edge-AI design principles, reducing redundant evaluation passes while preserving cross-lingual fidelity. This design makes large-scale multilingual safety testing both computationally feasible and environmentally conscious. Overall, our results show that translated benchmarks are a necessary first step, but not a sufficient one, toward building grounded, resource-aware, language-adaptive safety systems.
Chinese Translation
大多数大型语言模型(LLMs)的安全评估仍然以英语为基础。翻译常被用作探测多语言行为的捷径,但它很少能捕捉到完整的情况,尤其是在有害意图或结构在不同语言中发生变化时。有些类型的伤害在翻译中几乎保持不变,而其他类型则会扭曲或消失。为了研究这一效应,我们引入了CompositeHarm,这是一个基于翻译的基准,旨在考察在句法和语义变化时安全对齐的表现。它结合了两个互补的英语数据集:针对结构化对抗攻击的AttaQ,以及涵盖上下文中现实世界伤害的MMSafetyBench,并将其扩展到六种语言:英语、印地语、阿萨姆语、马拉地语、卡纳达语和古吉拉特语。使用三个大型模型,我们发现,在印度语言中,攻击成功率急剧上升,尤其是在对抗性句法下,而上下文伤害的转移则较为温和。为了确保可扩展性和能源效率,我们的研究采用了受边缘人工智能设计原则启发的轻量级推理策略,减少了冗余评估过程,同时保持跨语言的保真度。这一设计使得大规模多语言安全测试在计算上可行且环保。总体而言,我们的结果表明,在构建有依据的、资源感知的、语言自适应的安全系统方面,翻译基准是必要的第一步,但并不充分。
cs.CL / 41 / 2602.07978
Cross-Linguistic Persona-Driven Data Synthesis for Robust Multimodal Cognitive Decline Detection
跨语言个性化驱动的数据合成用于稳健的多模态认知衰退检测
Abstract
Speech-based digital biomarkers represent a scalable, non-invasive frontier for the early identification of Mild Cognitive Impairment (MCI). However, the development of robust diagnostic models remains impeded by acute clinical data scarcity and a lack of interpretable reasoning. Current solutions frequently struggle with cross-lingual generalization and fail to provide the transparent rationales essential for clinical trust. To address these barriers, we introduce SynCog, a novel framework integrating controllable zero-shot multimodal data synthesis with Chain-of-Thought (CoT) deduction fine-tuning. Specifically, SynCog simulates diverse virtual subjects with varying cognitive profiles to effectively alleviate clinical data scarcity. This generative paradigm enables the rapid, zero-shot expansion of clinical corpora across diverse languages, effectively bypassing data bottlenecks in low-resource settings and bolstering the diagnostic performance of Multimodal Large Language Models (MLLMs). Leveraging this synthesized dataset, we fine-tune a foundational multimodal backbone using a CoT deduction strategy, empowering the model to explicitly articulate diagnostic thought processes rather than relying on black-box predictions. Extensive experiments on the ADReSS and ADReSSo benchmarks demonstrate that augmenting limited clinical data with synthetic phenotypes yields competitive diagnostic performance, achieving Macro-F1 scores of 80.67% and 78.46%, respectively, outperforming current baseline models. Furthermore, evaluation on an independent real-world Mandarin cohort (CIR-E) demonstrates robust cross-linguistic generalization, attaining a Macro-F1 of 48.71%. These findings constitute a critical step toward providing clinically trustworthy and linguistically inclusive cognitive assessment tools for global healthcare.
Chinese Translation
基于语音的数字生物标志物代表了一种可扩展的、非侵入性的前沿技术,用于早期识别轻度认知障碍(MCI)。然而,稳健诊断模型的发展仍受到临床数据稀缺和缺乏可解释推理的阻碍。目前的解决方案常常在跨语言泛化方面遇到困难,并未提供临床信任所必需的透明推理。为了解决这些障碍,我们引入了SynCog,一个新颖的框架,将可控的零样本多模态数据合成与思维链(Chain-of-Thought, CoT)推理微调相结合。具体而言,SynCog模拟具有不同认知特征的多样化虚拟主体,以有效缓解临床数据稀缺。这种生成范式使得在多种语言中快速、零样本地扩展临床语料库成为可能,有效绕过低资源环境中的数据瓶颈,并增强多模态大语言模型(Multimodal Large Language Models, MLLMs)的诊断性能。利用这一合成数据集,我们使用CoT推理策略对基础多模态骨干网络进行微调,使模型能够明确表达诊断思维过程,而不是依赖黑箱预测。在ADReSS和ADReSSo基准上的广泛实验表明,利用合成表型增强有限的临床数据可获得具有竞争力的诊断性能,分别达到80.67%和78.46%的宏观F1分数,超越当前的基线模型。此外,在一个独立的真实世界普通话队列(CIR-E)上的评估显示出稳健的跨语言泛化,宏观F1分数达到48.71%。这些发现构成了为全球医疗提供临床可信和语言包容的认知评估工具的重要一步。
cs.CL / 42 / 2602.07996
The Judge Who Never Admits: Hidden Shortcuts in LLM-based Evaluation
从不承认的评判者:基于大型语言模型的评估中的隐性捷径
Abstract
Large language models (LLMs) are increasingly used as automatic judges to evaluate system outputs in tasks such as reasoning, question answering, and creative writing. A faithful judge should base its verdicts solely on content quality, remain invariant to irrelevant context, and transparently reflect the factors driving its decisions. We test this ideal via controlled cue perturbations-synthetic metadata labels injected into evaluation prompts-for six judge models: GPT-4o, Gemini-2.0-Flash, Gemma-3-27B, Qwen3-235B, Claude-3-Haiku, and Llama3-70B. Experiments span two complementary datasets with distinct evaluation regimes: ELI5 (factual QA) and LitBench (open-ended creative writing). We study six cue families: source, temporal, age, gender, ethnicity, and educational status. Beyond measuring verdict shift rates (VSR), we introduce cue acknowledgment rate (CAR) to quantify whether judges explicitly reference the injected cues in their natural-language rationales. Across cues with strong behavioral effects-e.g., provenance hierarchies (Expert > Human > LLM > Unknown), recency preferences (New > Old), and educational-status favoritism-CAR is typically at or near zero, indicating that shortcut reliance is largely unreported even when it drives decisions. Crucially, CAR is also dataset-dependent: explicit cue recognition is more likely to surface in the factual ELI5 setting for some models and cues, but often collapses in the open-ended LitBench regime, where large verdict shifts can persist despite zero acknowledgment. The combination of substantial verdict sensitivity and limited cue acknowledgment reveals an explanation gap in LLM-as-judge pipelines, raising concerns about reliability of model-based evaluation in both research and deployment.
Chinese Translation
大型语言模型(LLMs)越来越多地被用作自动评判者,以评估推理、问答和创意写作等任务中的系统输出。一个忠实的评判者应仅基于内容质量作出裁决,对无关背景保持不变,并透明地反映驱动其决策的因素。我们通过受控的线索扰动——注入评估提示中的合成元数据标签——来测试这一理想,涉及六个评判模型:GPT-4o、Gemini-2.0-Flash、Gemma-3-27B、Qwen3-235B、Claude-3-Haiku 和 Llama3-70B。实验涵盖两个互补的数据集,具有不同的评估机制:ELI5(事实问答)和 LitBench(开放式创意写作)。我们研究了六个线索类别:来源、时间、年龄、性别、种族和教育状态。除了测量裁决变化率(VSR),我们引入了线索承认率(CAR),以量化评判者在其自然语言推理中是否明确提及注入的线索。在具有强行为效应的线索中——例如,来源层级(专家 > 人类 > LLM > 未知)、近期偏好(新 > 旧)和教育状态偏爱——CAR 通常为零或接近零,表明捷径依赖即使驱动了决策,也大多未被报告。重要的是,CAR 还依赖于数据集:对于某些模型和线索,明确的线索识别更可能在事实性的 ELI5 设置中出现,但在开放式的 LitBench 机制中往往消失,即使承认率为零,较大的裁决变化仍可能持续存在。裁决的显著敏感性与有限的线索承认相结合,揭示了 LLM 作为评判者的流程中的解释缺口,引发了对基于模型的评估在研究和部署中可靠性的担忧。
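The two measurements introduced here, VSR and CAR, are straightforward to compute from paired judge runs. A minimal sketch with toy verdicts and illustrative cue terms (the substring matching is a stand-in for the paper's rationale analysis):

```python
def verdict_shift_rate(base_verdicts, cued_verdicts):
    """Fraction of items whose verdict flips once a cue is injected."""
    pairs = list(zip(base_verdicts, cued_verdicts))
    return sum(b != c for b, c in pairs) / len(pairs)

def cue_acknowledgment_rate(rationales, cue_terms):
    """Fraction of rationales that explicitly mention an injected cue."""
    hits = sum(any(t.lower() in r.lower() for t in cue_terms) for r in rationales)
    return hits / len(rationales)

vsr = verdict_shift_rate(["A", "A", "B", "B"], ["A", "B", "B", "A"])
car = cue_acknowledgment_rate(
    ["Answer A is more detailed.",
     "I prefer B since the source is an expert."],
    cue_terms=["expert", "human-written"],
)
```

The explanation gap the paper reports is precisely the combination high VSR with near-zero CAR: verdicts move with the cue, but the rationales never say so.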
cs.CL / 43 / 2602.08005
DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity
DeltaKV:基于残差的KV缓存压缩方法通过长程相似性
Abstract
The deployment of efficient long-context LLMs in applications like autonomous agents, long-chain reasoning, and creative writing is fundamentally bottlenecked by the linear growth of KV cache memory. Existing compression and eviction methods often struggle to balance accuracy, compression ratio, and hardware efficiency. We propose DeltaKV, a residual-based KV cache compression framework motivated by two empirical findings: long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. To translate compression gains into real system speedups, we further introduce Sparse-vLLM, a high-performance inference engine with decoupled memory management and kernels optimized for sparse and irregular KV layouts. Experiments show that DeltaKV reduces KV cache memory to 29\% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME. When integrated with Sparse-vLLM, it achieves up to 2$\times$ throughput improvement over vLLM in long-context scenarios, demonstrating a practical path toward scalable long-context LLM deployment. Code, model checkpoints, and datasets are available at https://github.com/CURRENTF/Sparse-vLLM.
Chinese Translation
在自主智能体、长链推理和创意写作等应用中,高效的长上下文大语言模型(LLMs)的部署受到KV缓存内存线性增长的根本瓶颈。现有的压缩和驱逐方法常常难以在准确性、压缩比和硬件效率之间取得平衡。我们提出了DeltaKV,这是一种基于残差的KV缓存压缩框架,其动机源于两个实证发现:长程的跨标记相似性和KV表示中高度共享的潜在组件。DeltaKV并不丢弃标记,而是相对于检索到的历史参考编码语义残差,从而在显著减少存储的同时保持保真度。为了将压缩收益转化为实际系统加速,我们进一步引入了Sparse-vLLM,这是一种高性能推理引擎,具有解耦的内存管理和针对稀疏和不规则KV布局优化的内核。实验表明,DeltaKV将KV缓存内存减少至原始的29%,同时在LongBench、SCBench和AIME上保持近乎无损的准确性。当与Sparse-vLLM集成时,在长上下文场景中实现了相较于vLLM高达2倍的吞吐量提升,展示了可扩展长上下文LLM部署的实际路径。代码、模型检查点和数据集可在https://github.com/CURRENTF/Sparse-vLLM获取。
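The residual idea at the heart of DeltaKV can be sketched as delta-encoding a KV vector against a similar historical reference: the residual carries much less energy than the raw vector, which is what makes further compression effective. Reference retrieval and any downstream quantization are omitted in this sketch:

```python
import numpy as np

def delta_encode(kv, reference):
    """Store a KV vector as its residual against a similar reference."""
    return kv - reference

def delta_decode(residual, reference):
    """Recover the original vector by adding the reference back."""
    return residual + reference

reference = np.array([1.0, 2.0, 3.0, 4.0])   # retrieved historical entry
kv = np.array([1.1, 2.0, 2.9, 4.2])          # new, similar KV vector
residual = delta_encode(kv, reference)
restored = delta_decode(residual, reference)
```

Decoding is exact, so fidelity is preserved by construction; the savings come from the small-norm residual being far cheaper to store than the full vector.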
cs.CL / 44 / 2602.08028
Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning
发散以诱导提示:零样本推理的多理由诱导
Abstract
To address the instability of unguided reasoning paths in standard Chain-of-Thought prompting, recent methods guide large language models (LLMs) by first eliciting a single reasoning strategy. However, relying on just one strategy for each question can still limit performance across diverse tasks. We propose Diverge-to-Induce Prompting (DIP), a framework that first prompts an LLM to generate multiple diverse high-level rationales for each question. Each rationale is then elaborated into a detailed, step-by-step draft plan. Finally, these draft plans are induced into a final plan. DIP enhances zero-shot reasoning accuracy without reliance on resource-intensive sampling. Experiments show that DIP outperforms single-strategy prompting, demonstrating the effectiveness of multi-plan induction for prompt-based reasoning.
Chinese Translation
为了应对标准链式思维提示中无指导推理路径的不稳定性,最近的方法通过首先引导大型语言模型(LLMs)提出单一的推理策略。然而,仅依赖于每个问题的一个策略仍然可能限制在多样化任务中的表现。我们提出了发散以诱导提示(Diverge-to-Induce Prompting, DIP),这一框架首先提示LLM为每个问题生成多个多样化的高层次理由。随后,每个理由被详细阐述为逐步的草拟计划。最后,这些草拟计划被诱导成最终计划。DIP在不依赖资源密集型采样的情况下增强了零样本推理的准确性。实验表明,DIP优于单一策略提示,展示了基于提示的推理中多计划诱导的有效性。
cs.CL / 45 / 2602.08031
Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection
超越原始检测分数:基于马尔可夫的信息校准以提升机器生成文本检测
Abstract
While machine-generated texts (MGTs) offer great convenience, they also pose risks such as disinformation and phishing, highlighting the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. Given their diverse designs, we first place representative metric-based methods within a unified framework, enabling a clear assessment of their advantages and limitations. Our analysis identifies a core challenge across these methods: the token-level detection score is easily biased by the inherent randomness of the MGTs generation process. To address this, we theoretically and empirically reveal two relationships of context detection scores that may aid calibration: Neighbor Similarity and Initial Instability. We then propose a Markov-informed score calibration strategy that models these relationships using Markov random fields, and implements it as a lightweight component via a mean-field approximation, allowing our method to be seamlessly integrated into existing detectors. Extensive experiments in various real-world scenarios, such as cross-LLM and paraphrasing attacks, demonstrate significant gains over baselines with negligible computational overhead. The code is available at https://github.com/tmlr-group/MRF_Calibration.
Chinese Translation
尽管机器生成文本(MGTs)提供了极大的便利,但它们也带来了诸如虚假信息和网络钓鱼等风险,突显了可靠检测的必要性。基于度量的方法通过提取MGTs的统计可区分特征来进行检测,通常比容易过拟合的复杂的基于模型的方法更为实用。鉴于其多样化的设计,我们首先将具有代表性的基于度量的方法置于一个统一框架内,从而能够清晰评估其优缺点。我们的分析识别出这些方法的一个核心挑战:标记级检测分数容易受到MGTs生成过程固有随机性的偏倚。为了解决这一问题,我们从理论和实证两方面揭示了可能有助于校准的上下文检测分数的两个关系:邻居相似性(Neighbor Similarity)和初始不稳定性(Initial Instability)。然后,我们提出了一种马尔可夫信息引导的分数校准策略,该策略使用马尔可夫随机场建模这些关系,并通过均场近似将其实现为轻量级组件,从而使我们的方法能够无缝集成到现有检测器中。在各种真实场景(例如跨LLM和释义攻击)中的广泛实验显示出相较于基线的显著提升,同时计算开销微乎其微。代码可在 https://github.com/tmlr-group/MRF_Calibration 获取。
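The calibration idea lends itself to a compact illustration. Below is a minimal mean-field-style smoothing of token-level detection scores that encodes the two observed relationships; the exact potentials, warm-up length, and weights are assumptions, not the paper's:

```python
def calibrate_scores(scores, n_iters=10, smooth=0.5, warmup=5, warmup_weight=0.3):
    """Mean-field-style smoothing of per-token detection scores (a sketch,
    not the paper's exact update). Two priors are encoded: neighboring
    tokens should score similarly (Neighbor Similarity), and the first few
    tokens are noisy (Initial Instability), so they are shrunk toward the
    sequence mean before smoothing."""
    n = len(scores)
    seq_mean = sum(scores) / n
    # Initial Instability: shrink early, unstable tokens toward the mean.
    base = [warmup_weight * s + (1 - warmup_weight) * seq_mean if i < warmup else s
            for i, s in enumerate(scores)]
    q = list(base)
    for _ in range(n_iters):
        new_q = []
        for i in range(n):
            # Neighbor Similarity: a chain-structured neighborhood.
            nbrs = [q[j] for j in (i - 1, i + 1) if 0 <= j < n]
            nbr_mean = sum(nbrs) / len(nbrs)
            # Mean-field update: balance the unary score against neighbors.
            new_q.append((1 - smooth) * base[i] + smooth * nbr_mean)
        q = new_q
    return q
```

Because the update only touches per-token scalars, a component like this adds negligible overhead on top of whatever metric-based detector produced the raw scores.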
cs.CL / 46 / 2602.08048
TDGNet: Hallucination Detection in Diffusion Language Models via Temporal Dynamic Graphs
TDGNet:通过时间动态图在扩散语言模型中检测幻觉
Abstract
Diffusion language models (D-LLMs) offer parallel denoising and bidirectional context, but hallucination detection for D-LLMs remains underexplored. Prior detectors developed for auto-regressive LLMs typically rely on single-pass cues and do not directly transfer to diffusion generation, where factuality evidence is distributed across the denoising trajectory and may appear, drift, or be self-corrected over time. We introduce TDGNet, a temporal dynamic graph framework that formulates hallucination detection as learning over evolving token-level attention graphs. At each denoising step, we sparsify the attention graph and update per-token memories via message passing, then apply temporal attention to aggregate trajectory-wide evidence for final prediction. Experiments on LLaDA-8B and Dream-7B across QA benchmarks show consistent AUROC improvements over output-based, latent-based, and static-graph baselines, with single-pass inference and modest overhead. These results highlight the importance of temporal reasoning on attention graphs for robust hallucination detection in diffusion language models.
Chinese Translation
扩散语言模型(D-LLMs)提供了并行去噪和双向上下文,但针对D-LLMs的幻觉检测仍然未得到充分探索。之前为自回归LLMs开发的检测器通常依赖于单次传递线索,无法直接转移到扩散生成中,在这种情况下,事实证据分布在去噪轨迹中,可能会随时间出现、漂移或自我修正。我们提出了TDGNet,一个时间动态图框架,将幻觉检测表述为在不断演变的标记级注意图上进行学习。在每个去噪步骤中,我们稀疏化注意图并通过消息传递更新每个标记的记忆,然后应用时间注意力以聚合全轨迹证据进行最终预测。在LLaDA-8B和Dream-7B上的QA基准实验表明,相较于基于输出、基于潜在和静态图的基线,AUROC表现出一致的提升,同时实现了单次推理和适度的开销。这些结果突显了在注意图上进行时间推理对于在扩散语言模型中实现稳健幻觉检测的重要性。
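A toy version of the per-step pipeline described above (sparsify the attention graph, message-pass into token memories, temporally attend over the trajectory) fits in a few lines of NumPy; the shapes, pooling choices, and random read-out vector are illustrative assumptions, not TDGNet's actual architecture:

```python
import numpy as np

def trajectory_score(attn_steps, feat_steps, topk=2, seed=0):
    """Illustrative TDGNet-style pipeline: per denoising step, keep each
    token's top-k attention edges (sparsification), run one round of
    message passing to update token memories, then pool the per-step
    summaries with a softmax temporal attention."""
    rng = np.random.default_rng(seed)
    T, n, _ = attn_steps.shape
    d = feat_steps.shape[-1]
    mem = np.zeros((n, d))                           # per-token memories
    summaries = []
    for t in range(T):
        A = attn_steps[t].copy()
        for i in range(n):
            keep = np.argsort(A[i])[-topk:]          # top-k incoming edges
            mask = np.zeros(n)
            mask[keep] = 1.0
            A[i] *= mask
        A = A / (A.sum(axis=1, keepdims=True) + 1e-9)
        # Message passing: memories mix with attention-weighted features.
        mem = 0.5 * mem + 0.5 * (A @ feat_steps[t])
        summaries.append(mem.mean(axis=0))
    S = np.stack(summaries)                          # (T, d) step summaries
    logits = S @ rng.normal(size=d)                  # stand-in attention query
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    pooled = w @ S                                   # trajectory-wide evidence
    return float(1 / (1 + np.exp(-pooled.mean())))   # hallucination score
```

The point of the sketch is the single pass over the trajectory: evidence that appears, drifts, or is self-corrected at different denoising steps all contributes through the temporal pooling, rather than only the final output.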
cs.CL / 47 / 2602.08100
Emergent Search and Backtracking in Latent Reasoning Models
潜在推理模型中的新兴搜索与回溯
Abstract
What happens when a language model thinks without words? Standard reasoning LLMs verbalize intermediate steps as chain-of-thought; latent reasoning transformers (LRTs) instead perform deliberation entirely in continuous hidden space. We investigate an LRT, decoding the model's evolving beliefs at every step on a multiple-choice QA benchmark. We find that the model spontaneously learns a structured search process in latent space. Deliberation follows a consistent trajectory: an exploration phase where probability mass spreads across candidates, tentative commitment to a frontrunner, and either convergence or backtracking. Backtracking is prevalent (32% of instances), beneficial (34% accuracy gain over non-backtracking instances), and predominantly directed away from the semantically closest distractor toward the correct answer. The search is adaptive: replacing distractors with implausible alternatives shortens exploration by 54%. Latent reasoning models achieve in activation space what chain-of-thought achieves through words: the ability to be wrong, notice, and recover.
Chinese Translation
当语言模型在没有文字的情况下思考时会发生什么?标准的推理大型语言模型(LLMs)将中间步骤以思维链的形式表达;而潜在推理变换器(LRTs)则完全在连续的隐藏空间中进行深思熟虑。我们研究了一种LRT,在多项选择问答基准测试中解码模型在每一步不断演变的信念。我们发现该模型自发学习了一种潜在空间中的结构化搜索过程。深思熟虑遵循一致的轨迹:一个概率质量在候选者之间扩散的探索阶段,对某个领先者的暂时承诺,以及最终的收敛或回溯。回溯现象普遍存在(占32%的实例),且有益(相较于非回溯实例,准确率提升34%),并且主要表现为远离语义上最接近的干扰项、转向正确答案。搜索是自适应的:用不太可能的替代项替换干扰项可使探索缩短54%。潜在推理模型在激活空间中实现了思维链通过文字所实现的能力:能够犯错、察觉错误并恢复。
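One simple way to operationalize "tentative commitment followed by backtracking" over decoded per-step belief distributions is sketched below; the commitment criterion (same candidate leading for two consecutive steps) is an illustrative assumption, not necessarily the paper's exact definition:

```python
def backtracked(belief_traj):
    """Detect backtracking in a decoded belief trajectory: the model
    'commits' when the same candidate leads on two consecutive steps,
    and 'backtracks' if it later commits to a different candidate.
    `belief_traj` is a list of per-step probability vectors over answers."""
    leaders = [max(range(len(b)), key=b.__getitem__) for b in belief_traj]
    committed = None
    for t in range(1, len(leaders)):
        if leaders[t] == leaders[t - 1]:             # tentative commitment
            if committed is not None and leaders[t] != committed:
                return True                           # frontrunner switched
            committed = leaders[t]
    return False
```

Applied over a benchmark, a criterion like this yields the per-instance backtracking labels on which the paper's prevalence and accuracy-gain statistics could be computed.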
cs.CL / 48 / 2602.08124
Gender and Race Bias in Consumer Product Recommendations by Large Language Models
大型语言模型在消费产品推荐中的性别与种族偏见
Abstract
Large Language Models are increasingly employed in generating consumer product recommendations, yet their potential for embedding and amplifying gender and race biases remains underexplored. This paper serves as one of the first attempts to examine these biases within LLM-generated recommendations. We leverage prompt engineering to elicit product suggestions from LLMs for various race and gender groups and employ three analytical methods-Marked Words, Support Vector Machines, and Jensen-Shannon Divergence-to identify and quantify biases. Our findings reveal significant disparities in the recommendations for demographic groups, underscoring the need for more equitable LLM recommendation systems.
Chinese Translation
大型语言模型越来越多地被用于生成消费产品推荐,但它们在嵌入和放大性别与种族偏见方面的可能性仍然未得到充分探索。本文是研究LLM生成推荐中此类偏见的最早尝试之一。我们利用提示工程从大型语言模型中引出针对不同种族和性别群体的产品建议,并采用三种分析方法——标记词(Marked Words)、支持向量机(Support Vector Machines)和詹森-香农散度(Jensen-Shannon Divergence)——来识别和量化偏见。我们的研究结果揭示了不同人口群体推荐之间的显著差异,强调了构建更公平的LLM推荐系统的必要性。
cs.CL / 49 / 2602.08149
DIAL-SUMMER: A Structured Evaluation Framework of Hierarchical Errors in Dialogue Summaries
DIAL-SUMMER:对对话摘要中层次错误的结构化评估框架
Abstract
Dialogues are a predominant mode of communication for humans, and it is immensely helpful to have automatically generated summaries of them (e.g., to revise key points discussed in a meeting, to review conversations between customer agents and product users). Prior works on dialogue summary evaluation largely ignore the complexities specific to this task: (i) shift in structure, from multiple speakers discussing information in a scattered fashion across several turns, to a summary's sentences, and (ii) shift in narration viewpoint, from speakers' first/second-person narration to standardized third-person narration in the summary. In this work, we introduce our framework DIAL-SUMMER to address the above. We propose DIAL-SUMMER's taxonomy of errors to comprehensively evaluate dialogue summaries at two hierarchical levels: DIALOGUE-LEVEL that focuses on the broader speakers/turns, and WITHIN-TURN-LEVEL that focuses on the information talked about inside a turn. We then present DIAL-SUMMER's dataset composed of dialogue summaries manually annotated with our taxonomy's fine-grained errors. We conduct empirical analyses of these annotated errors, and observe interesting trends (e.g., turns occurring in middle of the dialogue are the most frequently missed in the summary, extrinsic hallucinations largely occur at the end of the summary). We also conduct experiments on LLM-Judges' capability at detecting these errors, through which we demonstrate the challenging nature of our dataset, the robustness of our taxonomy, and the need for future work in this field to enhance LLMs' performance in the same. Code and inference dataset coming soon.
Chinese Translation
对话是人类主要的沟通方式,自动生成对话摘要极为有助于我们(例如,复习会议中讨论的关键点,回顾客户代理与产品用户之间的对话)。以往关于对话摘要评估的研究在很大程度上忽视了这一任务特有的复杂性:(i)结构的转变,从多个发言者在多个回合中以分散的方式讨论信息,到摘要的句子,以及(ii)叙述视角的转变,从发言者的第一/第二人称叙述,到摘要中的标准化第三人称叙述。在本研究中,我们引入了框架 DIAL-SUMMER 来解决上述问题。我们提出了 DIAL-SUMMER 的错误分类法,以在两个层次上全面评估对话摘要:对话层面(DIALOGUE-LEVEL)关注更广泛的发言者/回合,而回合内层面(WITHIN-TURN-LEVEL)则关注在一个回合内讨论的信息。随后,我们展示了 DIAL-SUMMER 的数据集,该数据集由手动标注的对话摘要组成,标注包含我们分类法的细粒度错误。我们对这些标注错误进行了实证分析,并观察到有趣的趋势(例如,出现在对话中间的回合在摘要中最常被遗漏,外部幻觉主要发生在摘要的末尾)。我们还对 LLM-Judges 检测这些错误的能力进行了实验,通过这些实验展示了我们数据集的挑战性、分类法的稳健性,以及未来在该领域工作的必要性,以提升 LLM 在此方面的表现。代码和推理数据集即将发布。
cs.CL / 50 / 2602.08162
NLP for Local Governance Meeting Records: A Focus Article on Tasks, Datasets, Metrics and Benchmark
自然语言处理在地方治理会议记录中的应用:任务、数据集、评估指标和基准的聚焦文章
Abstract
Local governance meeting records are official documents, in the form of minutes or transcripts, documenting how proposals, discussions, and procedural actions unfold during institutional meetings. While generally structured, these documents are often dense, bureaucratic, and highly heterogeneous across municipalities, exhibiting significant variation in language, terminology, structure, and overall organization. This heterogeneity makes them difficult for non-experts to interpret and challenging for intelligent automated systems to process, limiting public transparency and civic engagement. To address these challenges, computational methods can be employed to structure and interpret such complex documents. In particular, Natural Language Processing (NLP) offers well-established methods that can enhance the accessibility and interpretability of governmental records. In this focus article, we review foundational NLP tasks that support the structuring of local governance meeting documents. Specifically, we review three core tasks: document segmentation, domain-specific entity extraction and automatic text summarization, which are essential for navigating lengthy deliberations, identifying political actors and personal information, and generating concise representations of complex decision-making processes. In reviewing these tasks, we discuss methodological approaches, evaluation metrics, and publicly available resources, while highlighting domain-specific challenges such as data scarcity, privacy constraints, and source variability. By synthesizing existing work across these foundational tasks, this article provides a structured overview of how NLP can enhance the structuring and accessibility of local governance meeting records.
Chinese Translation
地方治理会议记录是官方文件,以会议纪要或逐字记录的形式,记录了提案、讨论和程序性行动在机构会议中的展开情况。虽然这些文件通常是结构化的,但它们往往内容密集、官僚化,并且在不同市政当局之间高度异质,表现出语言、术语、结构和整体组织上的显著差异。这种异质性使得非专业人士难以解读,也给智能自动化系统的处理带来了挑战,从而限制了公众透明度和公民参与。为了解决这些挑战,可以采用计算方法来结构化和解读这些复杂文件。特别是,自然语言处理(NLP)提供了成熟的方法,可以增强政府记录的可访问性和可解释性。在这篇聚焦文章中,我们回顾了支持地方治理会议文件结构化的基础NLP任务。具体而言,我们回顾了三个核心任务:文档分割、领域特定实体提取和自动文本摘要,这些任务对于导航冗长的讨论、识别政治参与者和个人信息,以及生成复杂决策过程的简明表述至关重要。在回顾这些任务时,我们讨论了方法论、评估指标和公开可用资源,同时强调了数据稀缺、隐私限制和来源变异等领域特定挑战。通过综合现有的基础任务研究,本文提供了一个结构化的概述,展示了NLP如何增强地方治理会议记录的结构化和可访问性。
cs.CL / 51 / 2602.08208
LLMs and people both learn to form conventions -- just not with each other
大型语言模型与人类都学习形成约定——只不过不是与彼此
Abstract
Humans align to one another in conversation -- adopting shared conventions that ease communication. We test whether LLMs form the same kinds of conventions in a multimodal communication game. Both humans and LLMs display evidence of convention-formation (increasing the accuracy and consistency of their turns while decreasing their length) when communicating in same-type dyads (humans with humans, AI with AI). However, heterogenous human-AI pairs fail -- suggesting differences in communicative tendencies. In Experiment 2, we ask whether LLMs can be induced to behave more like human conversants, by prompting them to produce superficially humanlike behavior. While the length of their messages matches that of human pairs, accuracy and lexical overlap in human-LLM pairs continues to lag behind that of both human-human and AI-AI pairs. These results suggest that conversational alignment requires more than just the ability to mimic previous interactions, but also shared interpretative biases toward the meanings that are conveyed.
Chinese Translation
人类在对话中会相互协调——采用共享的约定以简化沟通。我们测试大型语言模型(LLMs)是否会在一个多模态沟通游戏中形成同类约定。在同类型二元组(人类与人类、人工智能与人工智能)中交流时,人类和大型语言模型都显示出形成约定的证据(在提高各回合的准确性和一致性的同时缩短其长度)。然而,异质的人类-人工智能配对则失败了——这表明双方的沟通倾向存在差异。在实验2中,我们考察是否可以通过提示大型语言模型产生表面上类似人类的行为,诱导其表现得更像人类对话者。尽管它们的信息长度与人类配对相匹配,但人类-大型语言模型配对的准确性和词汇重叠仍然落后于人类-人类和人工智能-人工智能配对。这些结果表明,对话协调不仅需要模仿先前互动的能力,还需要对所传达意义的共享解释偏向。
cs.CL / 52 / 2602.08220
Pretraining with Token-Level Adaptive Latent Chain-of-Thought
基于标记级自适应潜在思维链的预训练
Abstract
Scaling large language models by increasing parameters and training data is increasingly constrained by limited high-quality corpora and rising communication costs. This work explores an alternative axis: increasing per-token computation without expanding parameters, by internalizing latent Chain-of-Thought (CoT) into pretraining. We propose Pretraining with Token-Level Adaptive Latent CoT (adaptive latent CoT), where the model generates a variable-length latent CoT trajectory before emitting each token -- allocating longer trajectories to difficult tokens and shorter (or even zero) trajectories to easy ones. Importantly, this behavior emerges naturally from one-stage pretraining on general text and reduces computation in both training and inference via token-wise adaptive halting. Experiments with Llama architectures show that adaptive latent CoT consistently improves language modeling perplexity and broad downstream accuracy, even with fewer training FLOPs than prior recurrent baselines.
Chinese Translation
通过增加参数和训练数据来扩展大型语言模型的能力,受到高质量语料库有限和通信成本上升的制约。本研究探索了一种替代方法:在不扩展参数的情况下,通过将潜在思维链(Chain-of-Thought, CoT)内化到预训练中来增加每个标记的计算量。我们提出了基于标记级自适应潜在CoT(adaptive latent CoT)的预训练方法,其中模型在发出每个标记之前生成一个可变长度的潜在CoT轨迹——为困难标记分配更长的轨迹,而为简单标记分配更短(甚至为零)的轨迹。重要的是,这种行为自然地从对一般文本的一阶段预训练中产生,并通过标记级自适应停止在训练和推理中减少计算量。与Llama架构的实验表明,自适应潜在CoT在语言建模困惑度和广泛下游任务的准确性上始终表现出改善,即使在训练FLOPs少于先前递归基线的情况下。
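The halting mechanism described above resembles adaptive computation time (ACT); a sketch of token-wise adaptive halting under that reading, where the accumulation rule, threshold, and step cap are all illustrative assumptions:

```python
def latent_steps_per_token(halt_probs, max_steps=8, threshold=0.99):
    """Token-wise adaptive halting, ACT-style (a sketch of the general
    mechanism, not the paper's exact rule): each token accumulates halting
    mass over latent CoT steps and stops once it crosses a threshold, so
    easy tokens (high per-step halting probability) get short trajectories
    and hard tokens get long ones."""
    steps = []
    for p in halt_probs:                  # per-step halting prob. of one token
        cum, k = 0.0, 0
        while cum < threshold and k < max_steps:
            cum += (1.0 - cum) * p        # geometric accumulation of mass
            k += 1
        steps.append(k)
    return steps
```

Because the trajectory length is decided per token, compute is saved in both training and inference on easy tokens while difficult tokens still receive the extra latent steps.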
cs.CL / 53 / 2602.08221
CoRect: Context-Aware Logit Contrast for Hidden State Rectification to Resolve Knowledge Conflicts
CoRect:利用上下文感知的Logit对比进行隐状态修正以解决知识冲突
Abstract
Retrieval-Augmented Generation (RAG) often struggles with knowledge conflicts, where model-internal parametric knowledge overrides retrieved evidence, leading to unfaithful outputs. Existing approaches are often limited, relying either on superficial decoding adjustments or weight editing that necessitates ground-truth targets. Through layer-wise analysis, we attribute this failure to a parametric suppression phenomenon: specifically, in deep layers, certain FFN layers overwrite context-sensitive representations with memorized priors. To address this, we propose CoRect (Context-Aware Logit Contrast for Hidden State Rectification). By contrasting logits from contextualized and non-contextualized forward passes, CoRect identifies layers that exhibit high parametric bias without requiring ground-truth labels. It then rectifies the hidden states to preserve evidence-grounded information. Across question answering (QA) and summarization benchmarks, CoRect consistently improves faithfulness and reduces hallucinations compared to strong baselines.
Chinese Translation
检索增强生成(RAG)常常面临知识冲突的问题,即模型内部的参数知识覆盖了检索到的证据,导致输出不忠实。现有的方法通常较为有限,要么依赖表面的解码调整,要么依赖需要真实目标的权重编辑。通过逐层分析,我们将这种失败归因于参数抑制现象:具体而言,在深层中,某些前馈网络(FFN)层用记忆的先验覆盖了上下文敏感的表示。为了解决这个问题,我们提出了CoRect(利用上下文感知的Logit对比进行隐状态修正)。通过对比上下文化和非上下文化前向传递的logit,CoRect无需真实标签即可识别出表现出高参数偏差的层。然后,它修正隐状态以保留基于证据的信息。在问答(QA)和摘要基准测试中,相较于强基线,CoRect始终提高了输出的忠实性,并减少了幻觉现象。
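The core contrastive signal can be illustrated numerically. The scoring rule below (a layer whose predictive distribution barely moves when the retrieved context is added is treated as dominated by parametric memory) is a simplified stand-in for CoRect's actual criterion:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def parametric_bias_per_layer(logits_ctx, logits_noctx):
    """Score each layer's parametric bias by contrasting its logits from a
    contextualized vs. a non-contextualized forward pass (an illustrative
    rule, not CoRect's exact one). A small contrast means the layer
    ignores the retrieved evidence, so it gets a high bias score; note
    that no ground-truth labels are needed, only the two passes."""
    scores = []
    for lc, ln in zip(logits_ctx, logits_noctx):
        contrast = np.abs(softmax(np.asarray(lc, float))
                          - softmax(np.asarray(ln, float))).sum()
        scores.append(1.0 / (1.0 + contrast))   # high score = high bias
    return scores
```

Layers flagged this way are the natural targets for hidden-state rectification, since they are where context-sensitive representations are being overwritten.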
cs.CL / 54 / 2602.08235
When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents
良性输入导致严重危害:引发计算机使用代理的不安全意外行为
Abstract
Although computer-use agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can demonstrate unsafe unintended behaviors that deviate from expected outcomes even under benign input contexts. However, exploration of this risk remains largely anecdotal, lacking concrete characterization and automated methods to proactively surface long-tail unintended behaviors under realistic CUA scenarios. To fill this gap, we introduce the first conceptual and methodological framework for unintended CUA behaviors, by defining their key characteristics, automatically eliciting them, and analyzing how they arise from benign inputs. We propose AutoElicit: an agentic framework that iteratively perturbs benign instructions using CUA execution feedback, and elicits severe harms while keeping perturbations realistic and benign. Using AutoElicit, we surface hundreds of harmful unintended behaviors from state-of-the-art CUAs such as Claude 4.5 Haiku and Opus. We further evaluate the transferability of human-verified successful perturbations, identifying persistent susceptibility to unintended behaviors across various other frontier CUAs. This work establishes a foundation for systematically analyzing unintended behaviors in realistic computer-use settings.
Chinese Translation
尽管计算机使用代理(CUAs)在自动化日益复杂的操作系统工作流程方面具有显著潜力,但即使在良性输入环境下,它们也可能表现出偏离预期结果的不安全意外行为。然而,对这一风险的探索仍然主要是轶事性的,缺乏具体的特征描述和自动化方法来主动揭示现实CUA场景下的长尾意外行为。为填补这一空白,我们提出了首个关于CUA意外行为的概念和方法框架,通过定义其关键特征、自动引发这些行为,并分析它们如何从良性输入中产生。我们提出了AutoElicit:一个代理框架,通过使用CUA执行反馈迭代扰动良性指令,从而引发严重危害,同时保持扰动的现实性和良性。利用AutoElicit,我们从最先进的CUA(如Claude 4.5 Haiku和Opus)中揭示了数百种有害的意外行为。我们进一步评估了人类验证的成功扰动的可转移性,识别出在各种其他前沿CUA中对意外行为的持续敏感性。这项工作为系统分析现实计算机使用环境中的意外行为奠定了基础。
cs.CL / 55 / 2602.08237
Document Reconstruction Unlocks Scalable Long-Context RLVR
文档重构解锁可扩展的长上下文强化学习可验证奖励
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm to enhance the capabilities (i.e., long-context) of Large Language Models (LLMs). However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming. In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models' supervision. Specifically, we first replace a few paragraphs with special placeholders in a long document. LLMs are trained through reinforcement learning to reconstruct the document by correctly identifying and sequencing missing paragraphs from a set of candidate options. This training paradigm enables the model to capture global narrative coherence, significantly boosting long-context performance. We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBench v2. While acquiring noticeable gains on RULER, it can also achieve a reasonable improvement on LongBench v2 without any manually curated long-context QA data. Furthermore, we conduct extensive ablation studies to analyze the impact of reward design, data curation strategies, training schemes, and data scaling effects on model performance. We publicly release our code, data, and models.
Chinese Translation
可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)已成为增强大语言模型(Large Language Models, LLMs)能力(即长上下文)的重要范式。然而,它通常依赖于由强大的教师模型或人类专家提供的标准答案或明确的评估标准,这既昂贵又耗时。在本研究中,我们探讨了无监督方法以增强LLMs的长上下文能力,从而消除对大量人工注释或教师模型监督的需求。具体而言,我们首先在一篇长文档中用特殊占位符替换几个段落。通过强化学习训练LLMs,使其能够通过正确识别和排序一组候选选项中的缺失段落来重构文档。这种训练范式使模型能够捕捉全局叙事一致性,显著提升长上下文性能。我们在两个广泛使用的基准测试RULER和LongBench v2上验证了我们方法的有效性。在RULER上取得显著提升的同时,它在LongBench v2上也能在没有任何人工整理的长上下文问答数据的情况下实现合理的改进。此外,我们进行了广泛的消融研究,以分析奖励设计、数据整理策略、训练方案和数据扩展效果对模型性能的影响。我们公开发布了我们的代码、数据和模型。
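The self-supervised task construction described above is easy to reproduce end to end; in the sketch below, the placeholder format and the exact-match reward are assumptions made for illustration:

```python
import random

def make_reconstruction_task(paragraphs, n_mask=2, seed=0):
    """Build one verifiable training instance: replace n_mask paragraphs
    with numbered placeholders, shuffle the removed paragraphs into a
    candidate pool, and keep their true slot order as the checkable
    answer. No human annotation or teacher model is needed."""
    rng = random.Random(seed)
    masked = set(rng.sample(range(len(paragraphs)), n_mask))
    doc, answer, slot = [], [], 0
    for i, p in enumerate(paragraphs):
        if i in masked:
            doc.append(f"<MASK_{slot}>")
            answer.append(p)          # gold paragraph for this slot
            slot += 1
        else:
            doc.append(p)
    candidates = list(answer)
    rng.shuffle(candidates)
    return doc, candidates, answer

def reward(prediction, answer):
    """Verifiable reward: 1.0 only when every slot is filled correctly."""
    return float(prediction == answer)
```

Because the answer is derived mechanically from the source document, the reward is fully verifiable, which is what lets the RL loop scale without gold QA data.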
cs.CL / 56 / 2602.08238
On convexity and efficiency in semantic systems
关于语义系统中的凸性与效率
Abstract
There are two widely held characterizations of human semantic category systems: (1) they form convex partitions of conceptual spaces, and (2) they are efficient for communication. While prior work observed that convexity and efficiency co-occur in color naming, the analytical relation between them and why they co-occur have not been well understood. We address this gap by combining analytical and empirical analyses that build on the Information Bottleneck (IB) framework for semantic efficiency. First, we show that convexity and efficiency are distinct in the sense that neither entails the other: there are convex systems which are inefficient, and optimally-efficient systems that are non-convex. Crucially, however, the IB-optimal systems are mostly convex in the domain of color naming, explaining the main empirical basis for the convexity approach. Second, we show that efficiency is a stronger predictor for discriminating attested color naming systems from hypothetical variants, with convexity adding negligible improvement on top of that. Finally, we discuss a range of empirical phenomena that convexity cannot account for but efficiency can. Taken together, our work suggests that while convexity and efficiency can yield similar structural observations, they are fundamentally distinct, with efficiency providing a more comprehensive account of semantic typology.
Chinese Translation
人类语义类别系统有两种广泛认可的特征:(1) 它们形成概念空间的凸分区,(2) 它们在交流中是高效的。虽然先前的研究观察到在颜色命名中凸性与效率是共存的,但它们之间的分析关系以及为何会共存尚未得到充分理解。我们通过结合基于信息瓶颈(Information Bottleneck, IB)框架的分析和实证分析来填补这一空白。首先,我们表明凸性与效率在某种意义上是不同的,因为它们并不相互蕴含:存在效率低下的凸系统,以及非凸的最优高效系统。然而,至关重要的是,IB最优系统在颜色命名领域大多是凸的,这解释了凸性方法的主要实证基础。其次,我们表明效率是区分已证实的颜色命名系统与假设变体的更强预测因子,而凸性在此基础上几乎没有提供额外的改进。最后,我们讨论了一系列凸性无法解释但效率可以解释的实证现象。综合来看,我们的研究表明,尽管凸性与效率可以产生相似的结构观察,但它们在根本上是不同的,效率提供了对语义类型学更全面的解释。
cs.CL / 57 / 2602.08252
Language Predicts Identity Fusion Across Cultures and Reveals Divergent Pathways to Violence
语言预测跨文化身份融合并揭示通向暴力的不同路径
Abstract
In light of increasing polarization and political violence, understanding the psychological roots of extremism is increasingly important. Prior research shows that identity fusion predicts willingness to engage in extreme acts. We evaluate the Cognitive Linguistic Identity Fusion Score, a method that uses cognitive linguistic patterns, LLMs, and implicit metaphor to measure fusion from language. Across datasets from the United Kingdom and Singapore, this approach outperforms existing methods in predicting validated fusion scores. Applied to extremist manifestos, two distinct high-fusion pathways to violence emerge: ideologues tend to frame themselves in terms of group, forming kinship bonds; whereas grievance-driven individuals frame the group in terms of their personal identity. These results refine theories of identity fusion and provide a scalable tool aiding fusion research and extremism detection.
Chinese Translation
鉴于日益加剧的两极化和政治暴力,理解极端主义的心理根源变得愈加重要。先前的研究表明,身份融合可以预测参与极端行为的意愿。我们评估了认知语言身份融合评分(Cognitive Linguistic Identity Fusion Score),这是一种利用认知语言模式、语言模型(LLMs)和隐喻来测量语言中融合的方法。在来自英国和新加坡的数据集中,这种方法在预测经过验证的融合评分方面优于现有方法。应用于极端主义宣言时,出现了两条明显的高融合通向暴力的路径:意识形态者倾向于将自己框定为群体的一部分,形成亲属关系;而以怨恨驱动的个体则将群体框定为其个人身份的一部分。这些结果完善了身份融合理论,并提供了一种可扩展的工具,帮助融合研究和极端主义检测。
cs.CL / 58 / 2602.08274
Language Modeling and Understanding Through Paraphrase Generation and Detection
通过释义生成和检测进行语言建模与理解
Abstract
Language enables humans to share knowledge, reason about the world, and pass on strategies for survival and innovation across generations. At the heart of this process is not just the ability to communicate but also the remarkable flexibility in how we can express ourselves. We can express the same thoughts in virtually infinite ways using different words and structures - this ability to rephrase and reformulate expressions is known as paraphrase. Modeling paraphrases is a keystone to meaning in computational language models; being able to construct different variations of texts that convey the same meaning or not shows strong abilities of semantic understanding. If computational language models are to represent meaning, they must understand and control the different aspects that construct the same meaning as opposed to different meanings at a fine granularity. Yet most existing approaches reduce paraphrasing to a binary decision between two texts or to producing a single rewrite of a source, obscuring which linguistic factors are responsible for meaning preservation. In this thesis, I propose that decomposing paraphrases into their constituent linguistic aspects (paraphrase types) offers a more fine-grained and cognitively grounded view of semantic equivalence. I show that even advanced machine learning models struggle with this task. Yet, when explicitly trained on paraphrase types, models achieve stronger performance on related paraphrase tasks and downstream applications. For example, in plagiarism detection, language models trained on paraphrase types surpass human baselines: 89.6% accuracy compared to 78.4% for plagiarism cases from Wikipedia, and 66.5% compared to 55.7% for plagiarism of scientific papers from arXiv. In identifying duplicate questions on Quora, models trained with paraphrase types improve over models trained on binary pairs. Furthermore, I demonstrate that...
Chinese Translation
语言使人类能够分享知识、推理世界,并在世代之间传递生存和创新的策略。在这一过程中,核心不仅是沟通的能力,还有我们表达自己的方式的显著灵活性。我们可以使用不同的词汇和结构以几乎无限的方式表达相同的思想——这种重新表述和重新构造表达的能力被称为释义。建模释义是计算语言模型中意义的基石;能够构建传达相同或不同意义的文本的不同变体,显示出强大的语义理解能力。如果计算语言模型要表示意义,它们必须理解和控制构成相同意义与不同意义的不同方面,并做到细致入微。然而,现有的大多数方法将释义简化为两个文本之间的二元决策,或生成源文本的单一重写,模糊了哪些语言因素负责意义的保留。在本论文中,我提出将释义分解为其组成的语言方面(释义类型)提供了对语义等价性更细致和认知基础的视角。我展示了即使是先进的机器学习模型在这一任务上也面临挑战。然而,当在释义类型上进行明确训练时,模型在相关释义任务和下游应用中的表现更强。例如,在抄袭检测中,基于释义类型训练的语言模型超越了人类基线:在来自维基百科的抄袭案例中,准确率为89.6%,而人类基线为78.4%;在来自arXiv的科学论文抄袭中,准确率为66.5%,而人类基线为55.7%。在识别Quora上的重复问题时,基于释义类型训练的模型优于基于二元对训练的模型。此外,我展示了...
cs.CL / 59 / 2602.08281
New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR
新技能还是更精确的原语?关于强化学习与可验证奖励中推理出现的概率视角
Abstract
Whether Reinforcement Learning with Verifiable Rewards (RLVR) endows Large Language Models (LLMs) with new capabilities or merely elicits latent traces remains a central debate. In this work, we align with the former view, proposing a probabilistic framework where capability is defined by instance-level solvability. We hypothesize that the emergence of complex reasoning can be driven by sharpening atomic step probabilities, which enables models to overcome the exponential decay of success rates inherent in multi-step reasoning chains. Utilizing the Algebrarium framework, we train models exclusively on single-step operations and evaluate their performance on unseen multi-step tasks. Our empirical results confirm that: (1) RLVR incentivizes the exploration of previously inaccessible solution paths by amplifying the model's existing skills; (2) composite performance is strictly governed by the joint probability of atomic steps, evidenced by high Pearson correlation coefficients ($\rho \in [0.69, 0.96]$); and (3) RLVR, acting as a global optimizer, can cause specific skills to be sacrificed to maximize aggregate reward. Our work offers a novel explanation for emergent abilities in RLVR, suggesting that the iterative optimization of solvable problems enables models to develop the capabilities to tackle previously unsolvable scenarios.
Chinese Translation
强化学习与可验证奖励(RLVR)是否赋予大型语言模型(LLMs)新的能力,还是仅仅引发潜在痕迹,仍然是一个核心争论。在本研究中,我们支持前者的观点,提出一个概率框架,其中能力由实例级的可解性定义。我们假设复杂推理的出现可以通过提高原子步骤的概率来驱动,这使得模型能够克服多步骤推理链中固有的成功率指数衰减。利用Algebrarium框架,我们仅在单步骤操作上训练模型,并评估其在未见过的多步骤任务上的表现。我们的实证结果确认了以下几点:(1)RLVR通过放大模型现有技能,激励探索以前无法访问的解决路径;(2)复合性能严格受原子步骤的联合概率支配,表现出高皮尔逊相关系数($\rho \in [0.69, 0.96]$);(3)RLVR作为全局优化器,可能导致特定技能的牺牲以最大化整体奖励。我们的研究为RLVR中的新兴能力提供了一种新颖的解释,表明可解问题的迭代优化使模型能够发展出应对以前无法解决场景的能力。
cs.CL / 60 / 2602.08289
Knowledge Augmented Entity and Relation Extraction for Legal Documents with Hypergraph Neural Network
基于超图神经网络的法律文档知识增强实体与关系提取
Abstract
With the continuous progress of digitization in Chinese judicial institutions, a substantial amount of electronic legal document information has been accumulated. To unlock its potential value, entity and relation extraction for legal documents has emerged as a crucial task. However, existing methods often lack domain-specific knowledge and fail to account for the unique characteristics of the judicial domain. In this paper, we propose an entity and relation extraction algorithm based on a hypergraph neural network (Legal-KAHRE) for drug-related judgment documents. Firstly, we design a candidate span generator based on a neighbor-oriented packing strategy and a biaffine mechanism, which identifies spans likely to contain entities. Secondly, we construct a legal dictionary with judicial domain knowledge and integrate it into the text encoding representation using multi-head attention. Additionally, we incorporate domain-specific cases, such as joint crimes and combined punishment for multiple crimes, into the hypergraph structure design. Finally, we employ a hypergraph neural network for higher-order inference via message passing. Experimental results on the CAIL2022 information extraction dataset demonstrate that our method significantly outperforms existing baseline models.
Chinese Translation
随着中国司法机构数字化进程的不断推进,大量电子法律文档信息被积累起来。为了挖掘其潜在价值,法律文档的实体与关系提取成为一项重要任务。然而,现有的方法往往缺乏领域特定知识,未能考虑司法领域的独特特征。本文提出了一种基于超图神经网络的实体与关系提取算法(Legal-KAHRE),专门针对与药物相关的判决文书。首先,我们设计了一种基于邻居导向打包策略和双仿射机制的候选跨度生成器,以识别可能包含实体的跨度。其次,我们构建了一个包含司法领域知识的法律词典,并通过多头注意力机制将其整合到文本编码表示中。此外,我们将联合犯罪和多罪并罚等领域特定案例纳入超图结构设计。最后,我们采用超图神经网络通过消息传递进行高阶推理。在CAIL2022信息提取数据集上的实验结果表明,我们的方法显著优于现有的基线模型。
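The higher-order inference step rests on standard hypergraph message passing; a generic one-round sketch (not Legal-KAHRE's exact operator), in which a hyperedge groups all entities of one relation, such as the several defendants of a joint crime:

```python
import numpy as np

def hypergraph_message_pass(X, H):
    """One round of hypergraph message passing: node features are
    mean-pooled into each hyperedge, then every node averages the
    features of the hyperedges it belongs to. H is the incidence
    matrix, H[i, e] = 1 iff node i (e.g. an entity span) lies in
    hyperedge e (e.g. a relation grouping several entities)."""
    X, H = np.asarray(X, float), np.asarray(H, float)
    edge_deg = H.sum(axis=0, keepdims=True) + 1e-9    # nodes per hyperedge
    E = (H / edge_deg).T @ X                          # node -> hyperedge
    node_deg = H.sum(axis=1, keepdims=True) + 1e-9    # hyperedges per node
    return (H / node_deg) @ E                         # hyperedge -> node
```

Unlike ordinary graph convolution over pairwise edges, one such round already mixes information among all members of a hyperedge, which is why hypergraphs suit relations that involve more than two entities.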
cs.CL / 61 / 2602.08294
When Does Context Help? Error Dynamics of Contextual Information in Large Language Models
何时上下文有助于提升性能?大语言模型中上下文信息的错误动态
Abstract
Contextual information at inference time, such as demonstrations, retrieved knowledge, or interaction history, can substantially improve large language models (LLMs) without parameter updates, yet its theoretical role remains poorly understood beyond specific settings such as in-context learning (ICL). We present a unified theoretical framework for analyzing the effect of arbitrary contextual information in Transformer-based LLMs. Our analysis characterizes contextual influence through output error dynamics. In a single-layer Transformer, we prove that the context-conditioned error vector decomposes additively into the baseline error vector and a contextual correction vector. This yields necessary geometric conditions for error reduction: the contextual correction must align with the negative baseline error and satisfy a norm constraint. We further show that the contextual correction norm admits an explicit upper bound determined by context-query relevance and complementarity. These results extend to multi-context and multi-layer Transformers. Experiments across ICL, retrieval-augmented generation, and memory evolution validate our theory and motivate a principled context selection strategy that improves performance by $0.6\%$.
Chinese Translation
推理时的上下文信息,如示例、检索知识或交互历史,可以在不更新参数的情况下显著提升大语言模型(LLMs)的性能,但其理论作用在特定设置(如上下文学习(ICL))之外仍然缺乏深入理解。我们提出了一个统一的理论框架,用于分析基于Transformer的大语言模型中任意上下文信息的影响。我们的分析通过输出错误动态来表征上下文的影响。在单层Transformer中,我们证明了上下文条件下的错误向量可以加性分解为基线错误向量和上下文修正向量。这为错误减少提供了必要的几何条件:上下文修正必须与负基线错误对齐,并满足范数约束。我们进一步表明,上下文修正的范数具有一个由上下文-查询相关性和互补性决定的显式上界。这些结果可以扩展到多上下文和多层Transformer。我们在ICL、检索增强生成和记忆演化中的实验验证了我们的理论,并激励了一种原则性的上下文选择策略,该策略使性能提高了0.6%。
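The geometric condition stated in the abstract can be checked directly: writing the context-conditioned error as e_base + delta, context helps exactly when ||e_base + delta|| < ||e_base||. A small numerical check under that reading (the function name and vector encoding are illustrative):

```python
import numpy as np

def context_helps(e_base, delta):
    """Check the additive error decomposition's condition numerically:
    with e_ctx = e_base + delta, context reduces error iff
    ||e_base + delta|| < ||e_base||, equivalently
    2 <e_base, delta> + ||delta||^2 < 0. The correction must therefore
    align with -e_base and also satisfy a norm constraint."""
    e_base = np.asarray(e_base, float)
    delta = np.asarray(delta, float)
    return bool(2.0 * e_base @ delta + delta @ delta < 0.0)
```

The third case below is the norm constraint in action: a correction pointing the right way can still hurt if it overshoots the baseline error.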
cs.CL / 62 / 2602.08305
JUSTICE: Judicial Unified Synthesis Through Intermediate Conclusion Emulation for Automated Judgment Document Generation
JUSTICE:通过中间结论模拟实现司法统一综合的自动判决文书生成
Abstract
Automated judgment document generation is a significant yet challenging legal AI task. As the conclusive written instrument issued by a court, a judgment document embodies complex legal reasoning. However, existing methods often oversimplify this complex process, particularly by omitting the ``Pre-Judge'' phase, a crucial step where human judges form a preliminary conclusion. This omission leads to two core challenges: 1) the ineffective acquisition of foundational judicial elements, and 2) the inadequate modeling of the Pre-Judge process, which collectively undermine the final document's legal soundness. To address these challenges, we propose \textit{\textbf{J}udicial \textbf{U}nified \textbf{S}ynthesis \textbf{T}hrough \textbf{I}ntermediate \textbf{C}onclusion \textbf{E}mulation} (JUSTICE), a novel framework that emulates the ``Search $\rightarrow$ Pre-Judge $\rightarrow$ Write'' cognitive workflow of human judges. Specifically, it introduces the Pre-Judge stage through three dedicated components: Referential Judicial Element Retriever (RJER), Intermediate Conclusion Emulator (ICE), and Judicial Unified Synthesizer (JUS). RJER first retrieves legal articles and a precedent case to establish a referential foundation. ICE then operationalizes the Pre-Judge phase by generating a verifiable intermediate conclusion. Finally, JUS synthesizes these inputs to craft the final judgment. Experiments on both an in-domain legal benchmark and an out-of-distribution dataset show that JUSTICE significantly outperforms strong baselines, with substantial gains in legal accuracy, including a 4.6\% improvement in prison term prediction. Our findings underscore the importance of explicitly modeling the Pre-Judge process to enhance the legal coherence and accuracy of generated judgment documents.
Chinese Translation
自动判决文书生成是一项重要但具有挑战性的法律人工智能任务。作为法院发布的最终书面文书,判决文书体现了复杂的法律推理。然而,现有方法往往过于简化这一复杂过程,特别是忽略了"预判"(Pre-Judge)阶段,这是人类法官形成初步结论的关键步骤。这一遗漏导致了两个核心挑战:1)基础司法元素获取的低效,2)预判过程建模的不足,这两者共同削弱了最终文书的法律有效性。为了解决这些挑战,我们提出了JUSTICE(Judicial Unified Synthesis Through Intermediate Conclusion Emulation),这是一个新颖的框架,模拟人类法官的"搜索 $\rightarrow$ 预判 $\rightarrow$ 写作"认知工作流程。具体而言,它通过三个专门组件引入了预判阶段:参考司法元素检索器(Referential Judicial Element Retriever, RJER)、中间结论模拟器(Intermediate Conclusion Emulator, ICE)和司法统一综合器(Judicial Unified Synthesizer, JUS)。RJER首先检索法律条款和先例案例,以建立参考基础。然后,ICE通过生成可验证的中间结论来实现预判阶段。最后,JUS综合这些输入,以撰写最终判决。在一个领域内法律基准和一个分布外数据集上的实验表明,JUSTICE显著优于强基线,并在法律准确性方面取得了显著提升,包括监禁期限预测提高了4.6%。我们的研究结果强调了明确建模预判过程的重要性,以增强生成判决文书的法律一致性和准确性。
cs.CL / 63 / 2602.08321
Improving Data and Reward Design for Scientific Reasoning in Large Language Models
改善大型语言模型中科学推理的数据和奖励设计
Abstract
Solving open-ended science questions remains challenging for large language models, particularly due to inherently unreliable supervision and evaluation. The bottleneck lies in the data construction and reward design for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation for open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT -> RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model's reasoning pattern coverage prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model's evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation with explicit answer correctness. Qwen3-4B-Base trained with the Dr. SCI pipeline achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general, consistently improving over strong post-trained baselines such as o1-mini and GPT-4o and demonstrating substantial gains in scientific reasoning, especially in open-ended settings.
Chinese Translation
解决开放式科学问题对于大型语言模型仍然具有挑战性,特别是由于固有的不可靠监督和评估。瓶颈在于科学后训练的数据构建和奖励设计。我们开发了一个大规模、系统化的数据处理管道,将异构的开源科学数据转化为 Dr. SCI 数据集,该数据集包含来自八个 STEM 学科的 100 万个问题,具有明确的可验证/开放式划分、可扩展的难度注释和细致的评分标准,以便对开放式答案进行评估。在此数据集的基础上,我们提出了 Dr. SCI 后训练管道,通过三个组件重新设计标准的 SFT -> RL 工作流程:(i) 探索扩展 SFT,扩大模型在 RL 之前的推理模式覆盖范围;(ii) 动态难度课程,根据模型不断发展的科学能力调整训练数据;(iii) SciRubric 引导的 RL,通过基于评分标准的评估和明确的答案正确性,实现对开放式科学问题的稳定强化学习。使用 Dr. SCI 管道训练的 Qwen3-4B-Base 在 GPQA-diamond 上取得了 63.2 的成绩,在 GPQA-general 上取得了 32.4 的成绩,持续超越强大的后训练基线,如 o1-mini 和 GPT-4o,显示出在科学推理方面的显著提升,特别是在开放式设置中。
cs.CL / 64 / 2602.08322
An Attention-over-Attention Generative Model for Joint Multiple Intent Detection and Slot Filling
一种基于注意力机制的生成模型用于联合多重意图检测和槽填充
Abstract
In task-oriented dialogue systems, spoken language understanding (SLU) is a critical component, which consists of two sub-tasks, intent detection and slot filling. Most existing methods focus on single-intent SLU, where each utterance has only one intent. However, in real-world scenarios, users usually express multiple intents in an utterance, which poses a challenge for existing dialogue systems and datasets. In this paper, we propose a generative framework to simultaneously address multiple intent detection and slot filling. In particular, an attention-over-attention decoder is proposed to handle the variable number of intents and the interference between the two sub-tasks by incorporating an inductive bias into the process of multi-task learning. Besides, we construct two new multi-intent SLU datasets based on single-intent utterances by taking advantage of the next sentence prediction (NSP) head of the BERT model. Experimental results demonstrate that our proposed attention-over-attention generative model achieves state-of-the-art performance on two public datasets, MixATIS and MixSNIPS, and our constructed datasets.
Chinese Translation
在面向任务的对话系统中,口语语言理解(SLU)是一个关键组成部分,包含两个子任务:意图检测和槽填充。大多数现有方法集中于单一意图的SLU,其中每个发言只有一个意图。然而,在现实场景中,用户通常在一次发言中表达多个意图,这对现有的对话系统和数据集提出了挑战。本文提出了一种生成框架,以同时解决多重意图检测和槽填充问题。特别地,提出了一种注意力机制的解码器,以通过在多任务学习过程中引入归纳偏差来处理意图数量的变化和两个子任务之间的干扰。此外,我们基于单一意图的发言构建了两个新的多重意图SLU数据集,利用了BERT模型的下一个句子预测(NSP)头。实验结果表明,我们提出的基于注意力机制的生成模型在两个公共数据集MixATIS和MixSNIPS以及我们构建的数据集上达到了最先进的性能。
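The dataset-construction idea above (pairing single-intent utterances into multi-intent ones, with BERT's NSP head judging whether two utterances plausibly co-occur) can be sketched as follows. This is a toy illustration, not the authors' code: `nsp_score` is an invented word-overlap stand-in for the real NSP probability, and the `" and "` connective plus the example utterances are assumptions.

```python
# Toy illustration of building multi-intent examples from single-intent ones.
# The paper uses BERT's NSP head to score pairs; here a stand-in scorer.

def nsp_score(a: str, b: str) -> float:
    # Stand-in for the next-sentence-prediction probability: a hypothetical
    # heuristic that favors pairs sharing at least one word.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def combine(examples, threshold=0.1):
    """Pair single-intent utterances whose stand-in NSP score passes the
    threshold; merge their intent labels and slot label sequences."""
    multi = []
    for i, (ua, ia, sa) in enumerate(examples):
        for ub, ib, sb in examples[i + 1:]:
            if ia != ib and nsp_score(ua, ub) >= threshold:
                # "O" labels the inserted connective token "and"
                multi.append((ua + " and " + ub, [ia, ib], sa + ["O"] + sb))
    return multi

singles = [
    ("book a flight to boston", "BookFlight", ["O", "O", "O", "O", "B-city"]),
    ("play a song by queen",    "PlayMusic",  ["O", "O", "O", "O", "B-artist"]),
]
pairs = combine(singles)
```

A real pipeline would replace `nsp_score` with the probability from a pretrained `BertForNextSentencePrediction` model; the merging logic stays the same.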
cs.CL / 65 / 2602.08332
Latent Reasoning with Supervised Thinking States
带监督思维状态的潜在推理
Abstract
Reasoning with a chain-of-thought (CoT) enables Large Language Models (LLMs) to solve complex tasks but incurs significant inference costs due to the generation of long rationales. We propose Thinking States, a method that performs reasoning {\em while} the input is processing. Specifically, Thinking States generates sequences of thinking tokens every few input tokens, transforms the thoughts back into embedding space, and adds them to the following input tokens. This has two key advantages. First, it captures the recurrent nature of CoT, but where the thought tokens are generated as input is processing. Second, since the thoughts are represented as tokens, they can be learned from natural language supervision, and using teacher-forcing, which is parallelizable. Empirically, Thinking States outperforms other latent reasoning methods on multiple reasoning tasks, narrowing the gap to CoT on math problems, and matching its performance on 2-Hop QA with improved latency. On state-tracking tasks, we show Thinking States leads to stronger reasoning behavior than CoT, successfully extrapolating to longer sequences than seen during training.
Chinese Translation
链式思维(CoT)推理使大型语言模型(LLMs)能够解决复杂任务,但由于生成长篇推理过程,导致显著的推理成本。我们提出了一种名为思维状态(Thinking States)的方法,该方法在输入处理的同时进行推理。具体而言,思维状态每处理几个输入标记就生成一系列思维标记,将思维转换回嵌入空间,并将其添加到后续的输入标记中。这具有两个关键优势。首先,它捕捉了链式思维的递归特性,但思维标记是在输入处理时生成的。其次,由于思维以标记的形式表示,可以通过自然语言监督进行学习,并且使用教师强制(teacher-forcing)进行并行化。实证结果表明,思维状态在多个推理任务上优于其他潜在推理方法,缩小了在数学问题上与链式思维的差距,并在2-Hop QA任务上与其性能相当,同时改善了延迟。在状态跟踪任务中,我们展示了思维状态比链式思维展现出更强的推理行为,成功推断出比训练期间看到的更长的序列。
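The interleaving pattern described in the abstract above (every few input tokens, generate thinking tokens, map them back to embedding space, and add them to the following input embeddings) can be mimicked with toy vectors. This is a sketch of the control flow only; `embed`, `think`, and the `CHUNK` size are all stand-ins invented here, not the paper's components.

```python
# Toy sketch of the Thinking States interleaving pattern (not the paper's code).
# Embeddings are plain float lists; think() stands in for thought-token
# generation plus re-embedding, and embed() for the model's embedding table.

CHUNK = 4  # generate thoughts every CHUNK input tokens (hypothetical value)

def embed(token: str) -> list[float]:
    # Stand-in embedding: a deterministic 3-dim vector derived from the token.
    h = sum(ord(c) for c in token)
    return [h % 7 / 7.0, h % 11 / 11.0, h % 13 / 13.0]

def think(context: list[list[float]]) -> list[float]:
    # Stand-in for generating thinking tokens and mapping them back to
    # embedding space: here, simply the mean of the context embeddings.
    dim = len(context[0])
    return [sum(v[i] for v in context) / len(context) for i in range(dim)]

def thinking_states(tokens: list[str]) -> list[list[float]]:
    """Process tokens left to right; every CHUNK tokens, produce a thought
    vector and add it to all subsequent input embeddings."""
    out, pending_thought = [], None
    for i, tok in enumerate(tokens):
        e = embed(tok)
        if pending_thought is not None:
            e = [a + b for a, b in zip(e, pending_thought)]
        out.append(e)
        if (i + 1) % CHUNK == 0:
            pending_thought = think(out)  # thought conditions later tokens
    return out
```

The key property the sketch preserves is that thoughts are produced *while* the input streams in, so later tokens are conditioned on earlier thoughts without a separate long rationale at the end.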
cs.CL / 66 / 2602.08336
UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models
UReason:统一多模态模型中推理悖论的基准测试
Abstract
To elicit capabilities for addressing complex and implicit visual requirements, recent unified multimodal models increasingly adopt chain-of-thought reasoning to guide image generation. However, the actual effect of reasoning on visual synthesis remains unclear. We present UReason, a diagnostic benchmark for reasoning-driven image generation that evaluates whether reasoning can be faithfully executed in pixels. UReason contains 2,000 instances across five task families: Code, Arithmetic, Spatial, Attribute, and Text reasoning. To isolate the role of reasoning traces, we introduce an evaluation framework comparing direct generation, reasoning-guided generation, and de-contextualized generation which conditions only on the refined prompt. Across eight open-source unified models, we observe a consistent Reasoning Paradox: Reasoning traces generally improve performance over direct generation, yet retaining intermediate thoughts as conditioning context often hinders visual synthesis, and conditioning only on the refined prompt yields substantial gains. Our analysis suggests that the bottleneck lies in contextual interference rather than insufficient reasoning capacity. UReason provides a principled testbed for studying reasoning in unified models and motivates future methods that effectively integrate reasoning for visual generation while mitigating interference.
Chinese Translation
为了引发应对复杂和隐含视觉需求的能力,最近的统一多模态模型越来越多地采用链式思维推理来指导图像生成。然而,推理对视觉合成的实际影响仍然不清楚。我们提出了UReason,一个用于推理驱动图像生成的诊断基准,评估推理是否可以在像素中忠实执行。UReason包含2000个实例,涵盖五个任务类别:代码、算术、空间、属性和文本推理。为了隔离推理痕迹的作用,我们引入了一个评估框架,比较直接生成、推理指导生成和仅基于精炼提示的去上下文化生成。在八个开源统一模型中,我们观察到一个一致的推理悖论:推理痕迹通常提高了直接生成的性能,但保留中间思维作为条件上下文往往会妨碍视觉合成,而仅基于精炼提示的条件生成则带来了显著的提升。我们的分析表明,瓶颈在于上下文干扰,而非推理能力不足。UReason为研究统一模型中的推理提供了一个有原则的测试平台,并激励未来的方法有效整合推理以进行视觉生成,同时减轻干扰。
cs.CL / 67 / 2602.08367
WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints
WorldTravel:一个具有紧密耦合约束的现实多模态旅行规划基准
Wang, Zexuan, Yang, Chenghao, Que, Yingqi, Yang, Zhenzhu, Yuan, Huaqing, Wang, Yiwen, Jiang, Zhengxuan, Fang, Shengjie, Wu, Zhenhe, Wang, Zhaohui, Yao, Zhixin, Liu, Jiashuo, Ren, Jincheng, Li, Yuzhen, Yang, Yang, Liu, Jiaheng, Yang, Jian, Wang, Zaiyuan, Zhang, Ge, Wen, Zhoufutu, Huang, Wenhao
Abstract
Real-world autonomous planning requires coordinating tightly coupled constraints where a single decision dictates the feasibility of all subsequent actions. However, existing benchmarks predominantly feature loosely coupled constraints solvable through local greedy decisions and rely on idealized data, failing to capture the complexity of extracting parameters from dynamic web environments. We introduce \textbf{WorldTravel}, a benchmark comprising 150 real-world travel scenarios across 5 cities that demand navigating an average of 15+ interdependent temporal and logical constraints. To evaluate agents in realistic deployments, we develop \textbf{WorldTravel-Webscape}, a multi-modal environment featuring over 2,000 rendered webpages where agents must perceive constraint parameters directly from visual layouts to inform their planning. Our evaluation of 10 frontier models reveals a significant performance collapse: even the state-of-the-art GPT-5.2 achieves only 32.67\% feasibility in text-only settings, which plummets to 19.33\% in multi-modal environments. We identify a critical Perception-Action Gap and a Planning Horizon threshold at approximately 10 constraints where model reasoning consistently fails, suggesting that perception and reasoning remain independent bottlenecks. These findings underscore the need for next-generation agents that unify high-fidelity visual perception with long-horizon reasoning to handle brittle real-world logistics.
Chinese Translation
现实世界中的自主规划需要协调紧密耦合的约束,其中单一决策决定了所有后续行动的可行性。然而,现有基准主要特征为通过局部贪婪决策可解的松散耦合约束,并依赖理想化数据,未能捕捉从动态网络环境中提取参数的复杂性。我们引入了\textbf{WorldTravel},这是一个包含150个真实旅行场景的基准,覆盖5个城市,要求在平均15个以上相互依赖的时间和逻辑约束中进行导航。为了在现实部署中评估代理,我们开发了\textbf{WorldTravel-Webscape},这是一个多模态环境,包含超过2000个渲染网页,代理必须直接从视觉布局中感知约束参数以指导其规划。我们对10个前沿模型的评估揭示了显著的性能崩溃:即使是最先进的GPT-5.2在仅文本设置中也仅实现了32.67\%的可行性,而在多模态环境中则骤降至19.33\%。我们识别出一个关键的感知-行动差距和一个约10个约束的规划视野阈值,在此阈值下模型推理持续失败,表明感知和推理仍然是独立的瓶颈。这些发现强调了下一代代理的需求,即将高保真视觉感知与长远推理统一起来,以应对脆弱的现实世界物流。
cs.CL / 68 / 2602.08371
ViGoEmotions: A Benchmark Dataset For Fine-grained Emotion Detection on Vietnamese Texts
ViGoEmotions:越南文本细粒度情感检测的基准数据集
Abstract
Emotion classification plays a significant role in emotion prediction and harmful content detection. Recent advancements in NLP, particularly through large language models (LLMs), have greatly improved outcomes in this field. This study introduces ViGoEmotions -- a Vietnamese emotion corpus comprising 20,664 social media comments in which each comment is classified into 27 fine-grained distinct emotions. To evaluate the quality of the dataset and its impact on emotion classification, eight pre-trained Transformer-based models were evaluated under three preprocessing strategies: preserving original emojis with rule-based normalization, converting emojis into textual descriptions, and applying ViSoLex, a model-based lexical normalization system. Results show that converting emojis into text often improves the performance of several BERT-based baselines, while preserving emojis yields the best results for ViSoBERT and CafeBERT. In contrast, removing emojis generally leads to lower performance. ViSoBERT achieved the highest Macro F1-score of 61.50% and Weighted F1-score of 63.26%. Strong performance was also observed from CafeBERT and PhoBERT. These findings highlight that while the proposed corpus can support diverse architectures effectively, preprocessing strategies and annotation quality remain key factors influencing downstream performance.
Chinese Translation
情感分类在情感预测和有害内容检测中发挥着重要作用。最近在自然语言处理(NLP)领域的进展,特别是通过大型语言模型(LLMs)的应用,极大地改善了这一领域的成果。本研究介绍了ViGoEmotions——一个包含20,664条社交媒体评论的越南情感语料库,每条评论被分类为27种细粒度的独特情感。为了评估数据集的质量及其对情感分类的影响,评估了八个基于Transformer的预训练模型在三种预处理策略下的表现:保留原始表情符号并进行基于规则的规范化、将表情符号转换为文本描述,以及应用ViSoLex(一种基于模型的词汇规范化系统)。结果表明,将表情符号转换为文本通常能提高多个基于BERT的基线模型的性能,而保留表情符号则为ViSoBERT和CafeBERT带来了最佳结果。相比之下,去除表情符号通常会导致性能下降。ViSoBERT达到了最高的宏观F1分数61.50%和加权F1分数63.26%。CafeBERT和PhoBERT也表现出强劲的性能。这些发现强调了尽管所提出的语料库能够有效支持多种架构,但预处理策略和标注质量仍然是影响下游性能的关键因素。
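The three preprocessing strategies compared in the ViGoEmotions study can be sketched as plain string transforms. This is only an illustration of the strategy shapes: the tiny `EMOJI_TEXT` table and these functions are invented here, whereas the actual study uses rule-based normalization and the ViSoLex system.

```python
# Sketch of the three emoji-preprocessing strategies described above
# (toy two-entry mapping; the study's pipelines are far richer).

EMOJI_TEXT = {"😀": "grinning face", "😢": "crying face"}  # stand-in table

def keep_emojis(text: str) -> str:
    # Strategy 1: preserve emojis as-is (rule-based normalization omitted).
    return text

def emojis_to_text(text: str) -> str:
    # Strategy 2: replace each known emoji with its textual description.
    for emo, desc in EMOJI_TEXT.items():
        text = text.replace(emo, desc)
    return text

def remove_emojis(text: str) -> str:
    # Strategy 3: drop emojis entirely (the weakest option in the study).
    for emo in EMOJI_TEXT:
        text = text.replace(emo, "").strip()
    return text
```

The study's finding, roughly, is that strategy 2 helps several BERT-based baselines, strategy 1 works best for ViSoBERT and CafeBERT, and strategy 3 generally hurts.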
cs.CL / 69 / 2602.08382
Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning
通过端到端强化学习实现压缩记忆的动态长上下文推理
Abstract
Large Language Models (LLMs) face significant challenges in long-context processing, including quadratic computational costs, information forgetting, and the context fragmentation inherent in retrieval-augmented generation (RAG). We propose a cognitively inspired framework for efficient long-context inference based on chunk-wise compression and selective memory recall, rather than processing all raw tokens. The framework segments long inputs into chunks and encodes each chunk into compressed memory representations using a learned compressor. A gating module dynamically selects relevant memory blocks, which are then iteratively processed by a reasoning module with an evolving working memory to solve downstream tasks. The compressor and reasoner are jointly optimized via end-to-end reinforcement learning, while the gating module is trained separately as a classifier. Experimental results show that the proposed method achieves competitive accuracy on multi-hop reasoning benchmarks such as RULER-HQA, extrapolates context length from 7K to 1.75M tokens, and offers a favorable accuracy-efficiency trade-off compared to strong long-context baselines. In particular, it achieves up to a 2 times reduction in peak GPU memory usage and a 6 times inference speedup over MemAgent.
Chinese Translation
大型语言模型(LLMs)在长上下文处理方面面临重大挑战,包括二次计算成本、信息遗忘以及检索增强生成(RAG)固有的上下文碎片化问题。我们提出了一种受认知启发的框架,用于基于分块压缩和选择性记忆回忆的高效长上下文推理,而不是处理所有原始标记。该框架将长输入分割为多个块,并使用学习到的压缩器将每个块编码为压缩的记忆表示。一个门控模块动态选择相关的记忆块,然后由推理模块与不断演变的工作记忆迭代处理,以解决下游任务。压缩器和推理器通过端到端强化学习共同优化,而门控模块则作为分类器单独训练。实验结果表明,所提出的方法在多跳推理基准(如 RULER-HQA)上实现了具有竞争力的准确性,将上下文长度从 7K 扩展到 1.75M 个标记,并在与强大的长上下文基线相比时提供了良好的准确性与效率的权衡。特别地,它在峰值 GPU 内存使用上实现了高达 2 倍的减少,并在推理速度上相较于 MemAgent 提升了 6 倍。
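The chunk-compress-recall loop described in the abstract above can be sketched with stand-in components. In the paper the compressor and reasoner are learned jointly via end-to-end RL and the gate is a trained classifier; here `embed`, `compress`, and `gate` are toy functions invented purely to show the data flow.

```python
# Minimal sketch of chunk-wise compression with selective memory recall
# (stand-in components; the paper learns these modules end to end).
import math

def embed(word: str) -> list[float]:
    # Stand-in 2-dim embedding derived deterministically from the word.
    h = sum(ord(c) for c in word)
    return [math.sin(h), math.cos(h)]

def compress(chunk: list[str]) -> list[float]:
    # Stand-in compressor: mean of word embeddings in the chunk.
    vecs = [embed(w) for w in chunk]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(2)]

def gate(query: list[float], memory: list[list[float]], k: int) -> list[int]:
    # Stand-in gating module: pick the k memory blocks closest to the query.
    def dist(m):
        return sum((a - b) ** 2 for a, b in zip(query, m))
    return sorted(range(len(memory)), key=lambda i: dist(memory[i]))[:k]

def recall(tokens: list[str], query_word: str, chunk_size=3, k=2):
    """Segment the input, compress each chunk, then recall only the k
    memory blocks the gate deems relevant to the query."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    memory = [compress(c) for c in chunks]       # long input -> compressed blocks
    picked = gate(embed(query_word), memory, k)  # selective recall
    return [chunks[i] for i in picked]
```

The efficiency claim follows from this shape: downstream reasoning touches only `k` compressed blocks per step instead of all raw tokens.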
cs.CL / 70 / 2602.08404
TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration
TEAM:基于时间-空间一致性引导的专家激活用于MoE扩散语言模型加速
Abstract
Diffusion large language models (dLLMs) have recently gained significant attention due to their inherent support for parallel decoding. Building on this paradigm, Mixture-of-Experts (MoE) dLLMs with autoregressive (AR) initialization have further demonstrated strong performance competitive with mainstream AR models. However, we identify a fundamental mismatch between MoE architectures and diffusion-based decoding. Specifically, a large number of experts are activated at each denoising step, while only a small subset of tokens is ultimately accepted, resulting in substantial inference overhead and limiting their deployment in latency-sensitive applications. In this work, we propose TEAM, a plug-and-play framework that accelerates MoE dLLMs by enabling more accepted tokens with fewer activated experts. TEAM is motivated by the observation that expert routing decisions exhibit strong temporal consistency across denoising levels as well as spatial consistency across token positions. Leveraging these properties, TEAM employs three complementary expert activation and decoding strategies, conservatively selecting necessary experts for decoded and masked tokens and simultaneously performing aggressive speculative exploration across multiple candidates. Experimental results demonstrate that TEAM achieves up to 2.2x speedup over vanilla MoE dLLM, with negligible performance degradation. Code is released at https://github.com/PKU-SEC-Lab/TEAM-MoE-dLLM.
Chinese Translation
扩散大型语言模型(dLLMs)因其固有的并行解码支持而受到广泛关注。在此基础上,具有自回归(AR)初始化的专家混合(MoE)dLLMs进一步展现了与主流AR模型竞争的强大性能。然而,我们发现MoE架构与基于扩散的解码之间存在根本的不匹配。具体而言,在每个去噪步骤中激活了大量专家,而最终仅接受少量令牌,这导致了显著的推理开销,并限制了它们在对延迟敏感的应用中的部署。在本研究中,我们提出了TEAM,一个即插即用的框架,通过激活更少的专家来加速MoE dLLMs,从而接受更多的令牌。TEAM的设计灵感来自于观察到专家路由决策在去噪层次之间表现出强烈的时间一致性以及在令牌位置之间的空间一致性。利用这些特性,TEAM采用三种互补的专家激活和解码策略,谨慎选择解码和掩蔽令牌所需的专家,同时在多个候选者之间进行激进的推测性探索。实验结果表明,TEAM在原始MoE dLLM上实现了高达2.2倍的加速,且性能下降微乎其微。代码已发布在 https://github.com/PKU-SEC-Lab/TEAM-MoE-dLLM。
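The temporal-consistency observation behind TEAM (expert routing decisions barely change across adjacent denoising steps) suggests caching: reuse the previous step's activated experts when the routing logits drift little. The sketch below is an invented illustration of that idea, not TEAM's actual strategy set; the `tol` threshold is a hypothetical parameter.

```python
# Illustrative expert-routing cache exploiting temporal consistency
# across denoising steps (not the TEAM implementation).

def topk(logits: list[float], k: int) -> list[int]:
    # Standard MoE top-k routing over expert logits.
    return sorted(range(len(logits)), key=lambda i: -logits[i])[:k]

def route(prev_logits, prev_experts, logits, k=2, tol=0.05):
    """Reuse cached experts when logits drift less than `tol` (hypothetical
    threshold); otherwise fall back to full top-k selection."""
    if prev_logits is not None:
        drift = max(abs(a - b) for a, b in zip(prev_logits, logits))
        if drift < tol:
            return prev_experts, True  # cache hit: skip re-selection
    return topk(logits, k), False
```

Under this scheme, steps whose routing is stable pay no expert re-selection cost, which is one way fewer activated experts can still yield the same accepted tokens.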
cs.CL / 71 / 2602.08426
Prism: Spectral-Aware Block-Sparse Attention
Prism:光谱感知块稀疏注意力
Abstract
Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level searching or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a "blind spot" for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to $\mathbf{5.1\times}$ speedup.
Chinese Translation
块稀疏注意力在加速长上下文大语言模型(LLM)预填充方面具有良好前景,但高效识别相关块仍然是一个瓶颈。现有方法通常采用粗粒度注意力作为块重要性估计的代理,但往往依赖于昂贵的令牌级搜索或评分,导致显著的选择开销。在本研究中,我们将标准粗粒度注意力通过均值池化的不准确性追溯到一个理论根源:均值池化与旋转位置嵌入(Rotary Positional Embeddings, RoPE)之间的相互作用。我们证明均值池化充当低通滤波器,在高频维度中引入破坏性干扰,有效地创建了一个局部位置信息(例如斜杠模式)的“盲点”。为了解决这个问题,我们提出了Prism,一种无训练的光谱感知方法,将块选择分解为高频和低频分支。通过应用基于能量的温度校准,Prism直接从池化表示中恢复衰减的位置信号,使得块重要性估计仅使用块级操作,从而提高了效率。大量评估确认Prism在保持与全注意力相当的准确性的同时,实现了高达$\mathbf{5.1\times}$的加速。
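The low-pass-filter claim above has a simple numeric illustration: RoPE rotates each embedding pair by a per-dimension angle times the position, so averaging over a block of positions is a sum of rotated unit vectors; fast-rotating (high-frequency) pairs cancel destructively while slow ones survive. The block size and angles below are toy values chosen for the demonstration, not Prism's configuration.

```python
# Numeric illustration: mean pooling RoPE-rotated unit vectors over a block
# attenuates high-frequency dimensions far more than low-frequency ones.
import math

def pooled_magnitude(theta: float, block: int) -> float:
    """Magnitude of the mean of `block` unit vectors rotated by `theta`
    radians per position, i.e. how much signal survives mean pooling
    at this rotation frequency."""
    x = sum(math.cos(theta * p) for p in range(block)) / block
    y = sum(math.sin(theta * p) for p in range(block)) / block
    return math.hypot(x, y)

low = pooled_magnitude(0.01, 64)   # slow-rotating (low-frequency) RoPE pair
high = pooled_magnitude(1.0, 64)   # fast-rotating (high-frequency) RoPE pair
```

Running this, the low-frequency pair retains nearly all of its magnitude after pooling while the high-frequency pair is almost annihilated, which is exactly the "blind spot" for local positional patterns that the abstract describes.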
cs.CL / 72 / 2602.08437
Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI
大型语言模型与不可能语言的习得:是“虚假承诺”还是对我们当前人工智能视角的颠覆
Abstract
In Chomsky's provocative critique "The False Promise of ChatGPT," Large Language Models (LLMs) are characterized as mere pattern predictors that, unlike humans, do not acquire languages via intrinsic causal and self-correction structures and are therefore unable to distinguish impossible languages. The critique stands as a representative of a fundamental challenge to the intellectual foundations of AI, for it synthesizes major methodological issues within LLMs and embodies an iconic a priori rationalist perspective. We examine this famous critique both through the pre-existing literature of linguistics and psychology and through an experiment probing the capacity of LLMs to learn possible and impossible languages. We constructed a set of syntactically impossible languages by applying certain transformations to English, including reversing whole sentences and adding negation based on word-count parity. Two rounds of controlled experiments were conducted, on GPT-2 small models and on long short-term memory (LSTM) models respectively. Statistical analysis (Welch's t-test) shows GPT-2 small models underperform in learning all of the impossible languages compared to their performance on the possible language (p<.001). On the other hand, the LSTM models' performance tallies with Chomsky's argument, suggesting the irreplaceable role of the evolution of the transformer architecture. Based on theoretical analysis and empirical findings, we propose a new vision within Chomsky's theory towards LLMs, and a shift of theoretical paradigm beyond Chomsky, from his "rationalist-romantics" paradigm to functionalism and empiricism in LLMs research.
Chinese Translation
在乔姆斯基的挑衅性批评《CHATGPT的虚假承诺》中,大型语言模型(LLMs)被描述为仅仅是模式预测器,它们并不像人类那样通过内在的因果和自我修正结构来习得语言,因此无法区分不可能语言。这一观点代表了对人工智能知识基础的根本挑战,因为它综合了LLMs方法论中的主要问题,并具有标志性的先验理性主义视角。我们从语言学和心理学的现有文献以及基于实验的研究角度审视了这一著名批评,探讨LLMs学习可能语言和不可能语言的能力。我们通过对英语应用某些变换构建了一组句法上不可能的语言,这些变换包括反转整个句子和基于词数奇偶性添加否定。我们分别在GPT-2小模型和长短期记忆(LSTM)模型上进行了两轮控制实验。统计分析(Welch的t检验)显示,与它们在可能语言上的表现相比,GPT-2小模型在学习所有不可能语言方面表现不佳(p<.001)。另一方面,LSTM模型的表现与乔姆斯基的论点一致,表明了变压器架构演变的不可替代性。基于理论分析和实证发现,我们在乔姆斯基理论中提出了对LLMs的新视角,并在乔姆斯基之外提出了理论范式的转变,从他的“理性主义-浪漫主义”范式转向LLMs研究中的功能主义和经验主义。
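The two transformations named in the abstract above (reversing whole sentences, and adding negation based on word-count parity) can be written down as toy functions. These are illustrative versions consistent with the abstract's description; the paper's exact rules (e.g. where the negation marker is inserted) may differ, and the `"not"` placement after the first word is an assumption.

```python
# Toy versions of the two "impossible language" transformations described
# above (illustrative; the paper's exact rules may differ).

def reverse_sentence(sentence: str) -> str:
    """Reverse the whole sentence word by word."""
    return " ".join(reversed(sentence.split()))

def parity_negation(sentence: str) -> str:
    """Insert a negation marker when the word count is even (hypothetical
    placement: after the first word), a rule no natural grammar uses."""
    words = sentence.split()
    if len(words) % 2 == 0:
        words.insert(1, "not")
    return " ".join(words)
```

What makes these languages "impossible" is that the rules are defined over linear counting properties (position from the end, word-count parity) rather than hierarchical syntactic structure.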
cs.CL / 73 / 2602.08498
Characterizing, Evaluating, and Optimizing Complex Reasoning
复杂推理的特征描述、评估与优化
Abstract
Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high-quality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals for reasoning optimization. To address these challenges, we provide a unified perspective. (1) We introduce the ME$^2$ principle to characterize reasoning quality along macro- and micro-level concerning efficiency and effectiveness. (2) Built on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG-based pairwise evaluation method, capturing complex reasoning structures. (3) Based on this method, we construct the TRM-Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal. At test time, selecting better reasoning leads to better outcomes (up to 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to 3.9% gain) across diverse tasks.
Chinese Translation
大型推理模型(LRMs)越来越依赖于具有复杂内部结构的推理轨迹。然而,现有研究对三个基本问题缺乏统一的答案:(1)什么定义了高质量的推理,(2)如何可靠地评估长且隐含结构的推理轨迹,以及(3)如何利用这些评估信号进行推理优化。为了解决这些挑战,我们提供了一个统一的视角。(1)我们引入了ME$^2$原则,从宏观和微观层面描述推理质量,关注效率和有效性。(2)基于这一原则,我们将推理轨迹建模为有向无环图(DAGs),并开发了一种基于DAG的成对评估方法,以捕捉复杂的推理结构。(3)基于该方法,我们构建了TRM-Preference数据集,并训练了一个思维奖励模型(Thinking Reward Model, TRM),以大规模评估推理质量。实验表明,思维奖励作为有效的优化信号。在测试时,选择更好的推理可以带来更好的结果(最高可达19.3%的提升),而在强化学习训练过程中,思维奖励在多样化任务中提升了推理和性能(最高可达3.9%的提升)。
cs.CL / 74 / 2602.08543
GISA: A Benchmark for General Information-Seeking Assistant
GISA:通用信息检索助手的基准测试
Abstract
The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real-world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries that reflect authentic information-seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Notably, GISA provides complete human search trajectories for every query, offering gold-standard references for process-level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only 19.30\% exact match score, with performance notably degrading on tasks requiring complex planning and comprehensive information gathering. These findings highlight substantial room for future improvement.
Chinese Translation
大型语言模型(LLMs)的进步显著加速了能够通过多轮网络交互自主收集信息的搜索代理的发展。为评估此类代理,已经提出了多种基准测试。然而,现有基准往往是从答案反向构建查询,导致任务不自然,与现实需求不符。此外,这些基准通常专注于定位特定信息或从多个来源聚合信息,同时依赖于易受数据污染影响的静态答案集。为了解决这些问题,我们提出了GISA,一个通用信息检索助手的基准测试,包含373个人工构建的查询,反映真实的信息检索场景。GISA具有四种结构化答案格式(项、集合、列表和表格),实现确定性评估。它将深度推理和广泛信息聚合整合在统一任务中,并包含一个实时子集,定期更新答案以抵抗记忆化。值得注意的是,GISA为每个查询提供完整的人类搜索轨迹,提供过程级监督和模仿学习的黄金标准参考。在主流LLMs和商业搜索产品上的实验表明,即使是表现最好的模型,其准确匹配得分也仅为19.30%,在需要复杂规划和全面信息收集的任务上,性能显著下降。这些发现突显了未来改进的巨大空间。
cs.CL / 75 / 2602.08548
How Do Language Models Understand Tables? A Mechanistic Analysis of Cell Location
语言模型如何理解表格?对单元格位置的机制分析
Abstract
While Large Language Models (LLMs) are increasingly deployed for table-related tasks, the internal mechanisms enabling them to process linearized two-dimensional structured tables remain opaque. In this work, we investigate the process of table understanding by dissecting the atomic task of cell location. Through activation patching and complementary interpretability techniques, we delineate the table understanding mechanism into a sequential three-stage pipeline: Semantic Binding, Coordinate Localization, and Information Extraction. We demonstrate that models locate the target cell via an ordinal mechanism that counts discrete delimiters to resolve coordinates. Furthermore, column indices are encoded within a linear subspace that allows for precise steering of model focus through vector arithmetic. Finally, we reveal that models generalize to multi-cell location tasks by multiplexing the identical attention heads identified during atomic location. Our findings provide a comprehensive explanation of table understanding within Transformer architectures.
Chinese Translation
尽管大型语言模型(LLMs)在与表格相关的任务中越来越多地被应用,但它们处理线性化的二维结构表格的内部机制仍然不清晰。在本研究中,我们通过剖析单元格位置这一原子任务,探讨了表格理解的过程。通过激活补丁和互补的可解释性技术,我们将表格理解机制划分为一个顺序的三阶段流程:语义绑定、坐标定位和信息提取。我们展示了模型通过一种顺序机制来定位目标单元格,该机制通过计数离散分隔符来解析坐标。此外,列索引在一个线性子空间中被编码,这允许通过向量运算精确引导模型的关注焦点。最后,我们揭示模型通过复用在原子位置识别过程中识别的相同注意力头,能够推广到多单元格位置任务。我们的研究结果为变换器架构中的表格理解提供了全面的解释。
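The ordinal mechanism the paper identifies (resolving a cell's coordinates by counting discrete delimiters in the linearized table) can be modeled explicitly. The sketch below is a functional analogy of that finding, not the models' internals; the newline/pipe delimiters and the example table are assumptions.

```python
# Toy model of the ordinal delimiter-counting mechanism described above:
# resolve a cell in a linearized table by counting row and column separators.

def locate_cell(linearized: str, row: int, col: int,
                row_sep: str = "\n", col_sep: str = "|") -> str:
    """Count `row` row-delimiters, then `col` column-delimiters within that
    row, to reach the target cell (0-indexed)."""
    cells = linearized.split(row_sep)[row].split(col_sep)
    return cells[col].strip()

table = "name | age\nalice | 30\nbob | 25"
```

The paper's further claim, that column indices live in a linear subspace steerable by vector arithmetic, has no analogue in this sketch; only the counting step is illustrated.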
cs.CL / 76 / 2602.08600
Beyond Scalar Scores: Reinforcement Learning for Error-Aware Quality Estimation of Machine Translation
超越标量评分:用于机器翻译错误感知质量估计的强化学习
Abstract
Quality Estimation (QE) aims to assess the quality of machine translation (MT) outputs without relying on reference translations, making it essential for real-world, large-scale MT evaluation. Large Language Models (LLMs) have shown significant promise in advancing the field of quality estimation of machine translation. However, most of the QE approaches solely rely on scalar quality scores, offering no explicit information about the translation errors that should drive these judgments. Moreover, for low-resource languages where annotated QE data is limited, existing approaches struggle to achieve reliable performance. To address these challenges, we introduce the first segment-level QE dataset for English to Malayalam, a severely resource-scarce language pair in the QE domain, comprising human-annotated Direct Assessment (DA) scores and Translation Quality Remarks (TQR), which are short, contextual, free-form annotator comments that describe translation errors. We further introduce ALOPE-RL, a policy-based reinforcement learning framework that trains efficient adapters based on policy rewards derived from DA score and TQR. Integrating error-aware rewards with ALOPE-RL enables LLMs to reason about translation quality beyond numeric scores. Despite being trained on a small-scale QE dataset, ALOPE-RL achieves state-of-the-art performance on English to Malayalam QE using compact LLMs (<=4B parameters) fine-tuned with LoRA and 4-bit quantization, outperforming both larger LLM-based baselines and leading encoder-based QE models. Our results demonstrate that error-aware, policy-based learning can deliver strong QE performance under limited data and compute budgets. We release our dataset, code, and trained models to support future research.
Chinese Translation
质量估计(Quality Estimation, QE)旨在评估机器翻译(Machine Translation, MT)输出的质量,而不依赖于参考翻译,这使其在现实世界的大规模机器翻译评估中至关重要。大型语言模型(Large Language Models, LLMs)在推动机器翻译质量估计领域方面展现出了显著的潜力。然而,大多数质量估计方法仅依赖于标量质量评分,未提供关于应驱动这些判断的翻译错误的明确信息。此外,对于注释质量估计数据有限的低资源语言,现有方法难以实现可靠的性能。为了解决这些挑战,我们引入了第一个针对英语到马拉雅拉姆语的段落级质量估计数据集,这是一个在质量估计领域严重缺乏资源的语言对,包含人工注释的直接评估(Direct Assessment, DA)评分和翻译质量备注(Translation Quality Remarks, TQR),后者是简短的、上下文相关的自由形式注释者评论,用于描述翻译错误。我们进一步引入了 ALOPE-RL,这是一个基于策略的强化学习框架,基于来自 DA 评分和 TQR 的策略奖励训练高效的适配器。将错误感知奖励与 ALOPE-RL 结合,使大型语言模型能够超越数值评分进行翻译质量推理。尽管在小规模质量估计数据集上进行训练,ALOPE-RL 在英语到马拉雅拉姆语的质量估计中仍然实现了最先进的性能,使用经过 LoRA 微调和 4 位量化的紧凑型大型语言模型(<=4B 参数),超越了更大规模的基于大型语言模型的基线和领先的编码器基础质量估计模型。我们的结果表明,错误感知的基于策略的学习可以在有限的数据和计算预算下提供强大的质量估计性能。我们发布了我们的数据集、代码和训练模型,以支持未来的研究。
cs.CL / 77 / 2602.08607
VocalNet-MDM: Accelerating Streaming Speech LLM via Self-Distilled Masked Diffusion Modeling
VocalNet-MDM:通过自蒸馏掩蔽扩散建模加速流式语音大语言模型
Abstract
Recent Speech Large Language Models~(LLMs) have achieved impressive capabilities in end-to-end speech interaction. However, the prevailing autoregressive paradigm imposes strict serial constraints, limiting generation efficiency and introducing exposure bias. In this paper, we investigate Masked Diffusion Modeling~(MDM) as a non-autoregressive paradigm for speech LLMs and introduce VocalNet-MDM. To adapt MDM for streaming speech interaction, we address two critical challenges: training-inference mismatch and iterative overhead. We propose Hierarchical Block-wise Masking to align training objectives with the progressive masked states encountered during block diffusion decoding, and Iterative Self-Distillation to compress multi-step refinement into fewer steps for low-latency inference. Trained on a limited scale of only 6K hours of speech data, VocalNet-MDM achieves a 3.7$\times$--10$\times$ decoding speedup and reduces first-chunk latency by 34\% compared to AR baselines. It maintains competitive recognition accuracy while achieving state-of-the-art text quality and speech naturalness, demonstrating that MDM is a promising and scalable alternative for low-latency, efficient speech LLMs.
Chinese Translation
近年来,语音大语言模型(LLMs)在端到端语音交互方面取得了令人瞩目的能力。然而,现有的自回归范式施加了严格的串行限制,限制了生成效率并引入了暴露偏差。本文探讨了掩蔽扩散建模(MDM)作为语音LLMs的一种非自回归范式,并引入了VocalNet-MDM。为了将MDM适应于流式语音交互,我们解决了两个关键挑战:训练-推理不匹配和迭代开销。我们提出了分层块级掩蔽,以将训练目标与块扩散解码过程中遇到的渐进式掩蔽状态对齐,并提出了迭代自蒸馏,将多步细化压缩为更少的步骤,以实现低延迟推理。在仅使用6000小时语音数据的有限规模上训练后,VocalNet-MDM实现了3.7×至10×的解码加速,并将首个块的延迟减少了34%,与自回归基线相比。它在保持竞争性识别准确率的同时,实现了最先进的文本质量和语音自然性,证明了MDM是低延迟、高效语音LLMs的有前景且可扩展的替代方案。
cs.CL / 78 / 2602.08625
Do Multilingual LLMs have specialized language heads?
多语言大型语言模型是否具有专门的语言头?
Abstract
Multilingual large language models (LLMs) have gained significant popularity for their ability to process and generate text across multiple languages. However, deploying these models in production can be inefficient when only a subset of the supported languages is of interest. Some research has investigated whether machine translation models have language-specific or language-agnostic heads; however, to the best of our knowledge, no such research has been conducted for multilingual LLMs, which are capable of performing diverse tasks beyond translation. This paper explores whether multilingual LLMs have specialized language attention heads for each language, and investigates the possibility of removing language-specific heads for unwanted languages without degrading performance in the targeted languages. Our findings could inform more efficient deployment strategies for multilingual LLMs, enabling reduced model complexity while maintaining high accuracy for targeted languages.
Chinese Translation
多语言大型语言模型(LLMs)因其能够处理和生成多种语言的文本而获得了显著的关注。然而,当仅对支持的语言子集感兴趣时,在生产中部署这些模型可能效率低下。虽然已有一些研究探讨了机器翻译模型是否具有语言特定或语言无关的头部,但据我们所知,尚未对多语言LLMs进行相关研究,而这些模型能够执行超越翻译的多样化任务。本文探讨了多语言LLMs是否为每种语言具有专门的语言注意力头,并研究了在不降低目标语言性能的情况下,去除不需要语言的语言特定头的可能性。我们的研究结果可能为多语言LLMs的更高效部署策略提供参考,从而在保持目标语言高准确率的同时降低模型复杂性。
cs.CL / 79 / 2602.08658
Fundamental Reasoning Paradigms Induce Out-of-Domain Generalization in Language Models
基本推理范式引发语言模型的域外泛化
Abstract
Deduction, induction, and abduction are fundamental reasoning paradigms, core for human logical thinking. Although improving Large Language Model (LLM) reasoning has attracted significant research efforts, the extent to which the fundamental paradigms induce generalization has yet to be systematically explored. In this study, we shed light on how the interplay between these core paradigms influences LLMs' reasoning behavior. To this end, we first collect a new dataset of reasoning trajectories from symbolic tasks, each targeting one of the three fundamental paradigms, to abstract from concrete world knowledge. Then, we investigate effective ways for inducing these skills into LLMs. We experiment with a battery of methods, including simple fine-tuning and more complex approaches that increase model depth or transform a dense model into a mixture-of-experts. We comprehensively evaluate induced models on realistic out-of-domain tasks, that are entirely formulated in natural language and contain real-world knowledge. Our results reveal that our approach yields strong generalizability with substantial performance gains (up to $14.60$) across realistic tasks.
Chinese Translation
演绎、归纳和溯因是基本的推理范式,是人类逻辑思维的核心。尽管提升大型语言模型(LLM)的推理能力已吸引了大量研究努力,但基本范式引发泛化的程度尚未得到系统探索。在本研究中,我们探讨了这些核心范式之间的相互作用如何影响LLM的推理行为。为此,我们首先收集了一个新的推理轨迹数据集,该数据集来自符号任务,每个任务针对三种基本范式之一,以抽象具体的世界知识。然后,我们研究了将这些技能有效引入LLM的方法。我们尝试了一系列方法,包括简单的微调,以及更复杂的方法以增加模型深度,或将稠密模型转变为专家混合模型。我们对引入的模型在完全以自然语言表述并包含现实世界知识的真实域外任务上进行了全面评估。我们的结果表明,我们的方法在现实任务中具有强大的泛化能力,性能提升显著(最高可达$14.60$)。
cs.CL / 80 / 2602.08672
Learning to Judge: LLMs Designing and Applying Evaluation Rubrics
学习评判:大型语言模型设计与应用评估标准
Abstract
Large language models (LLMs) are increasingly used as evaluators for natural language generation, applying human-defined rubrics to assess system outputs. However, human rubrics are often static and misaligned with how models internally represent language quality. We introduce GER-Eval (Generating Evaluation Rubrics for Evaluation) to investigate whether LLMs can design and apply their own evaluation rubrics. We evaluate the semantic coherence and scoring reliability of LLM-defined criteria and their alignment with human criteria. LLMs reliably generate interpretable and task-aware evaluation dimensions and apply them consistently within models, but their scoring reliability degrades in factual and knowledge-intensive settings. Closed-source models such as GPT-4o achieve higher agreement and cross-model generalization than open-weight models such as Llama. Our findings position evaluation as a learned linguistic capability of LLMs, consistent within models but fragmented across them, and call for new methods that jointly model human and LLM evaluative language to improve reliability and interpretability.
Chinese Translation
大型语言模型(LLMs)越来越多地被用作自然语言生成的评估者,应用人类定义的标准来评估系统输出。然而,人类的评估标准往往是静态的,并且与模型内部表示语言质量的方式不一致。我们引入了GER-Eval(生成评估标准)来研究LLMs是否能够设计和应用自己的评估标准。我们评估了LLM定义的标准的语义连贯性和评分可靠性,以及它们与人类标准的一致性。LLMs可靠地生成可解释且关注任务的评估维度,并在模型内一致地应用这些维度,但在事实性和知识密集型环境中,其评分可靠性下降。闭源模型如GPT-4o在一致性和跨模型泛化方面的表现优于开放权重模型如Llama。我们的研究发现将评估视为LLMs的一种学习语言能力,在模型内部一致但在模型之间存在碎片化,并呼吁新的方法共同建模人类与LLM的评估语言,以提高可靠性和可解释性。
cs.CL / 81 / 2602.08688
Old wine in old glasses: Comparing computational and qualitative methods in identifying incivility on Persian Twitter during the #MahsaAmini movement
旧酒装在旧杯中:比较计算方法与定性方法在#MahsaAmini运动期间识别波斯语推特不文明行为中的应用
Abstract
This paper compares three approaches to detecting incivility in Persian tweets: human qualitative coding, supervised learning with ParsBERT, and large language models (ChatGPT). Using 47,278 tweets from the #MahsaAmini movement in Iran, we evaluate the accuracy and efficiency of each method. ParsBERT substantially outperforms seven evaluated ChatGPT models in identifying hate speech. We also find that ChatGPT struggles not only with subtle cases but also with explicitly uncivil content, and that prompt language (English vs. Persian) does not meaningfully affect its outputs. The study provides a detailed comparison of these approaches and clarifies their strengths and limitations for analyzing hate speech in a low-resource language context.
Chinese Translation
本文比较了三种检测波斯语推文中不文明行为的方法:人工定性编码、使用ParsBERT的监督学习以及大型语言模型(ChatGPT)。通过分析来自伊朗#MahsaAmini运动的47,278条推文,我们评估了每种方法的准确性和效率。ParsBERT在识别仇恨言论方面显著优于七种评估的ChatGPT模型。我们还发现,ChatGPT不仅在处理微妙案例时存在困难,而且在处理明显的不文明内容时也表现不佳,同时提示语言(英语与波斯语)对其输出没有显著影响。本研究提供了对这些方法的详细比较,并阐明了它们在低资源语言环境中分析仇恨言论的优势和局限性。
cs.CL / 82 / 2602.08698
Challenges in Translating Technical Lectures: Insights from the NPTEL
翻译技术讲座的挑战:来自 NPTEL 的见解
Abstract
This study examines the practical applications and methodological implications of Machine Translation for Indian languages, specifically Bangla, Malayalam, and Telugu, within emerging translation workflows and in relation to existing evaluation frameworks. The choice of languages prioritized in this study is motivated by a triangulation of linguistic diversity, which illustrates the significance of multilingual accommodation in educational technology under NEP 2020. This is further supported by the largest MOOC portal, NPTEL, which serves as the corpus underpinning the arguments presented in this paper. The curation of a spontaneous speech corpus that accounts for the lucid delivery of technical concepts, while retaining suitable register and lexical choices, is crucial in a diverse country like India. The findings of this study highlight metric-specific sensitivity and the challenges that morphologically rich and semantically compact features pose when tested against surface-overlap metrics.
Chinese Translation
本研究考察了机器翻译在印度语言中的实际应用和方法论影响,特别是孟加拉语、马拉雅拉姆语和泰卢固语,涉及新兴翻译工作流程及与现有评估框架的关系。本研究中优先选择的语言是基于语言多样性的三角交叉,这突显了在 NEP 2020 下教育技术的多语言适应的重要性。这一观点得到了最大的 MOOC 门户网站 NPTEL 的支持,该平台作为语料库促进了本文所提出的论点。针对技术概念的清晰表达,构建自发性演讲语料库时,考虑到适当的语域和词汇选择在像印度这样多样化的国家中至关重要。本研究的发现强调了度量特定敏感性以及在与表面重叠度量进行测试时,形态丰富和语义紧凑特征所面临的挑战。
cs.CL / 83 / 2602.08700
Do Images Clarify? A Study on the Effect of Images on Clarifying Questions in Conversational Search
图像是否能澄清?关于图像在对话搜索中澄清问题效果的研究
Abstract
Conversational search systems increasingly employ clarifying questions to refine user queries and improve the search experience. Previous studies have demonstrated the usefulness of text-based clarifying questions in enhancing both retrieval performance and user experience. While images have been shown to improve retrieval performance in various contexts, their impact on user performance when incorporated into clarifying questions remains largely unexplored. We conduct a user study with 73 participants to investigate the role of images in conversational search, specifically examining their effects on two search-related tasks: (i) answering clarifying questions and (ii) query reformulation. We compare the effect of multimodal and text-only clarifying questions in both tasks within a conversational search context from various perspectives. Our findings reveal that while participants showed a strong preference for multimodal questions when answering clarifying questions, preferences were more balanced in the query reformulation task. The impact of images varied with both task type and user expertise. In answering clarifying questions, images helped maintain engagement across different expertise levels, while in query reformulation they led to more precise queries and improved retrieval performance. Interestingly, for clarifying question answering, text-only setups demonstrated better user performance as they provided more comprehensive textual information in the absence of images. These results provide valuable insights for designing effective multimodal conversational search systems, highlighting that the benefits of visual augmentation are task-dependent and should be strategically implemented based on the specific search context and user characteristics.
Chinese Translation
对话搜索系统越来越多地使用澄清问题来细化用户查询并改善搜索体验。先前的研究已证明基于文本的澄清问题在提升检索性能和用户体验方面的有效性。尽管图像在多种情境中已被证明能够改善检索性能,但它们在被纳入澄清问题时对用户表现的影响在很大程度上仍未被探索。我们进行了一项用户研究,参与者为73人,旨在调查图像在对话搜索中的作用,特别考察其对两个与搜索相关任务的影响:(i)回答澄清问题和(ii)查询重构。我们从多个角度比较了多模态和仅文本澄清问题在这两个任务中的效果。研究结果显示,尽管参与者在回答澄清问题时对多模态问题表现出强烈偏好,但在查询重构任务中偏好则相对均衡。图像的影响因任务类型和用户专业知识而异。在回答澄清问题时,图像帮助不同专业水平的用户保持参与感,而在查询重构中则导致更精确的查询和改善的检索性能。有趣的是,在回答澄清问题时,仅文本设置的用户表现更佳,因为在缺乏图像的情况下提供了更全面的文本信息。这些结果为设计有效的多模态对话搜索系统提供了宝贵的见解,强调视觉增强的好处是依赖于任务的,并应根据特定的搜索情境和用户特征进行战略性实施。
cs.CL / 84 / 2602.08709
FactSim: Fact-Checking for Opinion Summarization
FactSim:观点摘要的事实核查
Abstract
We explore the need for more comprehensive and precise evaluation techniques for generative artificial intelligence (GenAI) in text summarization tasks, specifically in the area of opinion summarization. Traditional methods, which leverage automated metrics to compare machine-generated summaries against a collection of opinion pieces, e.g., product reviews, have shown limitations due to the paradigm shift introduced by large language models (LLMs). This paper addresses these shortcomings by proposing a novel, fully automated methodology for assessing the factual consistency of such summaries. The method measures the similarity between the claims in a given summary and those in the original reviews, thereby assessing the coverage and consistency of the generated summary. To do so, we rely on a simple approach that extracts factual claims from texts, which we then compare and aggregate into a suitable score. We demonstrate that the proposed metric assigns higher scores to similar claims, regardless of whether a claim is negated, paraphrased, or expanded, and that the score has a high correlation with human judgment when compared to state-of-the-art metrics.
Chinese Translation
我们探讨了在文本摘要任务中,尤其是在观点摘要领域,对生成性人工智能(GenAI)更全面和精确的评估技术的需求。传统方法利用自动化指标来比较来自一系列观点文章(例如产品评论)的机器生成摘要,但由于大型语言模型(LLM)引入的范式转变,这些方法显示出局限性。本文通过提出一种新颖的、完全自动化的方法来评估此类摘要的事实一致性,来解决这些不足。该方法基于测量给定摘要中的主张与原始评论中的主张之间的相似性,评估生成摘要的覆盖度和一致性。为此,我们依赖一种简单的方法从文本中提取事实评估,然后将其比较并总结为适当的分数。我们证明,所提出的指标对相似主张赋予更高的分数,无论该主张是被否定、改述还是扩展,并且与最先进的指标相比,该分数与人类判断具有很高的相关性。
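A toy version of the claim-matching score described above, with word-overlap Jaccard similarity standing in for whatever claim-similarity measure the paper actually uses (the function names and example claims are ours, not FactSim's):

```python
def jaccard(a, b):
    """Toy claim similarity: word-overlap ratio. A real system would use
    embeddings or entailment instead of surface overlap."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def coverage(summary_claims, review_claims, sim=jaccard):
    """Average, over source-review claims, of the best match found in the
    summary: 1.0 means every review claim is well covered."""
    return sum(max(sim(r, s) for s in summary_claims)
               for r in review_claims) / len(review_claims)

reviews = ["battery life is long", "screen is dim"]
summary = ["battery life is long"]
print(round(coverage(summary, reviews), 3))  # prints 0.583
```

The summary fully covers one review claim and barely touches the other, so coverage lands between 0.5 and 1.0; swapping the similarity function changes the score but not the aggregation logic.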
cs.CL / 85 / 2602.08716
PERSPECTRA: A Scalable and Configurable Pluralist Benchmark of Perspectives from Arguments
PERSPECTRA:一个可扩展且可配置的多元观点基准测试
Abstract
Pluralism, the capacity to engage with diverse perspectives without collapsing them into a single viewpoint, is critical for developing large language models that faithfully reflect human heterogeneity. Yet this characteristic has not been carefully examined in the LLM research community and remains absent from most alignment studies. Debate-oriented sources provide a natural entry point for pluralism research. Previous work builds on online debate sources but remains constrained by costly human validation. Other debate-rich platforms such as Reddit and Kialo also offer promising material: Reddit provides linguistic diversity and scale but lacks clear argumentative structure, while Kialo supplies explicit pro/con graphs but remains overly concise and detached from natural discourse. We introduce PERSPECTRA, a pluralist benchmark that integrates the structural clarity of Kialo debate graphs with the linguistic diversity of real Reddit discussions. Using a controlled retrieval-and-expansion pipeline, we construct 3,810 enriched arguments spanning 762 pro/con stances on 100 controversial topics. Each opinion is expanded to multiple naturalistic variants, enabling robust evaluation of pluralism. We initialise three tasks with PERSPECTRA: opinion counting (identifying distinct viewpoints), opinion matching (aligning supporting stances and discourse to source opinions), and polarity check (inferring aggregate stance in mixed discourse). Experiments with state-of-the-art open-source and proprietary LLMs highlight systematic failures, such as overestimating the number of viewpoints and misclassifying concessive structures, underscoring the difficulty of pluralism-aware understanding and reasoning. By combining diversity with structure, PERSPECTRA establishes the first scalable, configurable benchmark for evaluating how well models represent, distinguish, and reason over multiple perspectives.
Chinese Translation
多元主义,即能够接纳多样化的观点而不将其简化为单一视角的能力,对于开发忠实反映人类异质性的大型语言模型至关重要。然而,这一特性在大型语言模型(LLM)研究社区中尚未得到仔细研究,并且在大多数对齐研究中缺乏相关探讨。以辩论为导向的来源为多元主义研究提供了自然的切入点。以往的研究基于在线辩论来源,但受到昂贵的人为验证的限制。其他富含辩论的平台,如Reddit和Kialo,也提供了有前景的材料:Reddit提供了语言多样性和规模,但缺乏明确的论证结构,而Kialo则提供了明确的支持/反对图表,但过于简洁且与自然话语脱节。我们提出了PERSPECTRA,一个将Kialo辩论图的结构清晰性与真实Reddit讨论的语言多样性相结合的多元主义基准测试。通过受控的检索与扩展管道,我们构建了3810个丰富的论点,涵盖100个有争议主题上的762个支持/反对立场。每个观点被扩展为多个自然变体,从而实现对多元主义的稳健评估。我们使用PERSPECTRA初始化了三个任务:观点计数(识别不同的观点)、观点匹配(将支持立场和话语与源观点对齐)以及极性检查(推断混合话语中的整体立场)。与最先进的开源和专有LLM的实验突显了系统性失败,例如高估观点数量和错误分类让步结构,强调了对多元主义意识的理解和推理的困难。通过将多样性与结构相结合,PERSPECTRA建立了第一个可扩展、可配置的基准,以评估模型在表现、区分和推理多个视角方面的能力。
cs.CL / 86 / 2602.08740
Map of Encoders -- Mapping Sentence Encoders using Quantum Relative Entropy
编码器地图 -- 使用量子相对熵映射句子编码器
Abstract
We propose a method to compare and visualise sentence encoders at scale by creating a map of encoders where each sentence encoder is represented in relation to the other sentence encoders. Specifically, we first represent a sentence encoder using an embedding matrix of a sentence set, where each row corresponds to the embedding of a sentence. Next, we compute the Pairwise Inner Product (PIP) matrix for a sentence encoder using its embedding matrix. Finally, we create a feature vector for each sentence encoder reflecting its Quantum Relative Entropy (QRE) with respect to a unit base encoder. We construct a map of encoders covering 1101 publicly available sentence encoders, providing a new perspective of the landscape of the pre-trained sentence encoders. Our map accurately reflects various relationships between encoders, where encoders with similar attributes are proximally located on the map. Moreover, our encoder feature vectors can be used to accurately infer downstream task performance of the encoders, such as in retrieval and clustering tasks, demonstrating the faithfulness of our map.
Chinese Translation
我们提出了一种方法,通过创建编码器地图来比较和可视化句子编码器,其中每个句子编码器相对于其他句子编码器进行表示。具体而言,我们首先使用一个句子集的嵌入矩阵来表示句子编码器,其中每一行对应于一个句子的嵌入。接下来,我们使用其嵌入矩阵计算句子编码器的成对内积(Pairwise Inner Product, PIP)矩阵。最后,我们为每个句子编码器创建一个特征向量,反映其相对于单位基编码器的量子相对熵(Quantum Relative Entropy, QRE)。我们构建了一个覆盖1101个公开可用句子编码器的编码器地图,为预训练句子编码器的全景提供了新的视角。我们的地图准确反映了编码器之间的各种关系,其中具有相似属性的编码器在地图上相对接近。此外,我们的编码器特征向量可以用于准确推断编码器在检索和聚类任务等下游任务中的性能,证明了我们地图的可信度。
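The two quantities named in this abstract, the PIP matrix and quantum relative entropy, can be sketched as follows. The unit-trace normalization and the eigendecomposition route to the matrix logarithm are our assumptions for making the sketch runnable, not details taken from the paper:

```python
import numpy as np

def pip_matrix(E):
    """Pairwise Inner Product matrix: rows of E are sentence embeddings."""
    return E @ E.T

def density(P):
    """Scale a positive semi-definite matrix to unit trace, density-matrix style."""
    return P / np.trace(P)

def quantum_relative_entropy(rho, sigma, eps=1e-12):
    """S(rho || sigma) = Tr[rho (log rho - log sigma)], computed via
    eigendecomposition of the symmetric matrices."""
    def logm(A):
        w, V = np.linalg.eigh(A)
        return V @ np.diag(np.log(np.clip(w, eps, None))) @ V.T
    return float(np.trace(rho @ (logm(rho) - logm(sigma))))

# Two toy "encoders" embed the same two sentences differently.
rho = density(pip_matrix(np.array([[1.0, 0.0], [0.0, 1.0]])))
sigma = density(pip_matrix(np.array([[1.0, 0.0], [1.0, 1.0]])))
```

By Klein's inequality the entropy is zero for identical encoders and positive otherwise, which is what makes a QRE-based feature vector behave like a dissimilarity signature against a base encoder.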
cs.CL / 87 / 2602.08793
LakeHopper: Cross Data Lakes Column Type Annotation through Model Adaptation
LakeHopper:通过模型适应实现跨数据湖的列类型注释
Abstract
Column type annotation is vital for tasks like data cleaning, integration, and visualization. Recent solutions rely on resource-intensive language models fine-tuned on well-annotated columns from a particular set of tables, i.e., a source data lake. In this paper, we study whether we can adapt an existing pre-trained LM-based model to a new (i.e., target) data lake so as to minimize the annotations required on the new data lake. However, challenges exist, including the source-target knowledge gap, selecting informative target data, and fine-tuning without losing shared knowledge. We propose LakeHopper, a framework that identifies and resolves the knowledge gap through LM interactions, employs a cluster-based data selection scheme for unannotated columns, and uses an incremental fine-tuning mechanism that gradually adapts the source model to the target data lake. Our experimental results validate the effectiveness of LakeHopper on two different data lake transfers under both low-resource and high-resource settings.
Chinese Translation
列类型注释对于数据清洗、集成和可视化等任务至关重要。近期的解决方案依赖于在特定表集的良好注释列上微调的资源密集型语言模型(LM)。在本文中,我们研究了是否可以将现有的预训练基于LM的模型适应到新的(即目标)数据湖,以最小化在新数据湖上所需的注释。然而,挑战包括源-目标知识差距、选择信息丰富的目标数据,以及在不损失共享知识的情况下进行微调。我们提出了LakeHopper,一个通过LM交互识别和解决知识差距的框架,采用基于聚类的未注释列数据选择方案,并使用增量微调机制逐步将源模型适应到目标数据湖。我们的实验结果验证了LakeHopper在低资源和高资源设置下对两种不同数据湖转移的有效性。
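The abstract mentions a cluster-based selection scheme for unannotated columns without giving details; as a generic stand-in (not LakeHopper's actual algorithm), here is a greedy farthest-point selection that picks mutually dissimilar column representations as annotation candidates. The 2-D column features are made up for illustration.

```python
def select_for_annotation(columns, k):
    """Greedy farthest-point selection: pick k mutually dissimilar column
    representations so annotation effort spreads across distinct clusters."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    chosen = [0]  # seed with the first column
    while len(chosen) < k:
        best = max((i for i in range(len(columns)) if i not in chosen),
                   key=lambda i: min(dist2(columns[i], columns[j]) for j in chosen))
        chosen.append(best)
    return chosen

# Hypothetical column features forming two tight groups.
cols = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (4.9, 5.0)]
print(select_for_annotation(cols, 2))  # prints [0, 2]
```

With a budget of two annotations, the selection lands one column in each group instead of wasting both on near-duplicates.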
cs.CL / 88 / 2602.08826
Affective Flow Language Model for Emotional Support Conversation
情感支持对话的情感流语言模型
Abstract
Large language models (LLMs) have been widely applied to emotional support conversation (ESC). However, complex multi-turn support remains challenging. This is because existing alignment schemes rely on sparse outcome-level signals, thus offering limited supervision for intermediate strategy decisions. To fill this gap, this paper proposes the affective flow language model for emotional support conversation (AFlow), a framework that introduces fine-grained supervision on dialogue prefixes by modeling a continuous affective flow along multi-turn trajectories. AFlow can estimate intermediate utility over searched trajectories and learn preference-consistent strategy transitions. To improve strategy coherence and empathetic response quality, a subpath-level flow-balance objective is presented to propagate preference signals to intermediate states. Experimental results show consistent and significant improvements over competitive baselines in diverse emotional contexts. Remarkably, AFlow with a compact open-source backbone outperforms proprietary LLMs such as GPT-4o and Claude-3.5 on major ESC metrics. Our code is available at https://github.com/chzou25-lgtm/AffectiveFlow.
Chinese Translation
大型语言模型(LLMs)已广泛应用于情感支持对话(ESC)。然而,复杂的多轮支持仍然具有挑战性。这是因为现有的对齐方案依赖于稀疏的结果级信号,从而对中间策略决策提供了有限的监督。为填补这一空白,本文提出了一种情感流语言模型(AFlow),该框架通过在多轮轨迹上建模连续的情感流,引入了对对话前缀的细粒度监督。AFlow能够估计搜索轨迹上的中间效用,并学习与偏好一致的策略转变。为了提高策略的一致性和同理心响应的质量,提出了一种子路径级流平衡目标,以将偏好信号传播到中间状态。实验结果表明,在多种情感背景下,相较于竞争基线,AFlow在性能上有一致且显著的提升。值得注意的是,采用紧凑开源骨干的AFlow在主要的ESC指标上超越了GPT-4o和Claude-3.5等专有大型语言模型。我们的代码可在 https://github.com/chzou25-lgtm/AffectiveFlow 获取。
cs.CL / 89 / 2602.08829
WildReward: Learning Reward Models from In-the-Wild Human Interactions
WildReward:从真实环境中的人类交互中学习奖励模型
Abstract
Reward models (RMs) are crucial for the training of large language models (LLMs), yet they typically rely on large-scale human-annotated preference pairs. With the widespread deployment of LLMs, in-the-wild interactions have emerged as a rich source of implicit reward signals. This raises the question: Can we develop reward models directly from in-the-wild interactions? In this work, we explore this possibility by adopting WildChat as an interaction source and proposing a pipeline to extract reliable human feedback, yielding 186k high-quality instances for training WildReward via ordinal regression directly on user feedback without preference pairs. Extensive experiments demonstrate that WildReward achieves comparable or even superior performance compared to conventional reward models, with improved calibration and cross-sample consistency. We also observe that WildReward benefits directly from user diversity, where more users yield stronger reward models. Finally, we apply WildReward to online DPO training and observe significant improvements across various tasks. Code and data are released at https://github.com/THU-KEG/WildReward.
Chinese Translation
奖励模型(RMs)对于大型语言模型(LLMs)的训练至关重要,但它们通常依赖于大规模的人类标注偏好对。随着大型语言模型的广泛部署,真实环境中的交互成为了隐含奖励信号的丰富来源。这引发了一个问题:我们能否直接从真实环境中的交互中开发奖励模型?在本研究中,我们通过采用WildChat作为交互源,提出了一种提取可靠人类反馈的流程,从而生成了186,000个高质量实例,用于通过对用户反馈进行序数回归直接训练WildReward,而无需偏好对。大量实验表明,WildReward在性能上与传统奖励模型相当,甚至更优,且在校准和跨样本一致性方面有所改善。我们还观察到,WildReward直接受益于用户多样性,更多的用户会产生更强的奖励模型。最后,我们将WildReward应用于在线DPO训练,并在各种任务中观察到显著的改进。代码和数据已在https://github.com/THU-KEG/WildReward上发布。
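Training "via ordinal regression directly on user feedback without preference pairs", as the abstract puts it, could look roughly like the all-threshold logistic loss below. The loss form, cutpoints, and numbers are illustrative assumptions on our part, not WildReward's published objective:

```python
import math

def ordinal_loss(score, label, thresholds):
    """All-threshold logistic loss for ordinal regression: each cutpoint
    contributes a binary penalty for the scalar reward `score` being on
    the wrong side of it. `label` counts how many cutpoints the true
    feedback level lies above (0..len(thresholds))."""
    loss = 0.0
    for k, t in enumerate(thresholds):
        sign = 1.0 if label > k else -1.0  # should score exceed cutpoint k?
        loss += math.log1p(math.exp(-sign * (score - t)))
    return loss

cuts = [-1.0, 1.0]  # illustrative cutpoints for a 3-level feedback scale
# A score in the band matching its label incurs less loss than one outside it.
print(ordinal_loss(2.0, 2, cuts) < ordinal_loss(-2.0, 2, cuts))  # prints True
```

Unlike pairwise preference losses, this objective consumes single rated interactions, which is what makes in-the-wild feedback usable without constructing preference pairs.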
cs.CL / 90 / 2602.08864
Understanding Dynamic Compute Allocation in Recurrent Transformers
理解递归变换器中的动态计算分配
Abstract
Token-level adaptive computation seeks to reduce inference cost by allocating more computation to harder tokens and less to easier ones. However, prior work is primarily evaluated on natural-language benchmarks using task-level metrics, where token-level difficulty is unobservable and confounded with architectural factors, making it unclear whether compute allocation truly aligns with underlying complexity. We address this gap through three contributions. First, we introduce a complexity-controlled evaluation paradigm using algorithmic and synthetic language tasks with parameterized difficulty, enabling direct testing of token-level compute allocation. Second, we propose ANIRA, a unified recurrent Transformer framework that supports per-token variable-depth computation while isolating compute allocation decisions from other model factors. Third, we use this framework to conduct a systematic analysis of token-level adaptive computation across alignment with complexity, generalization, and decision timing. Our results show that compute allocation aligned with task complexity can emerge without explicit difficulty supervision, but such alignment does not imply algorithmic generalization: models fail to extrapolate to unseen input sizes despite allocating additional computation. We further find that early compute decisions rely on static structural cues, whereas online halting more closely tracks algorithmic execution state.
Chinese Translation
基于标记的自适应计算旨在通过对更难的标记分配更多计算资源,而对更简单的标记分配较少计算资源,从而降低推理成本。然而,之前的研究主要是在自然语言基准上使用任务级指标进行评估,在这些评估中,标记级的难度不可观察并且与架构因素混淆,使得计算分配是否真正与潜在复杂性对齐变得不明确。我们通过三项贡献来填补这一空白。首先,我们引入了一种复杂性控制的评估范式,使用具有参数化难度的算法和合成语言任务,使得可以直接测试标记级计算分配。其次,我们提出了ANIRA,一个统一的递归变换器框架,支持每个标记的可变深度计算,同时将计算分配决策与其他模型因素隔离开来。第三,我们利用该框架对标记级自适应计算进行系统分析,探讨其与复杂性、泛化能力和决策时机的对齐情况。我们的结果表明,与任务复杂性对齐的计算分配可以在没有显式难度监督的情况下出现,但这种对齐并不意味着算法的泛化能力:尽管分配了额外的计算,模型仍然无法推断未见输入大小的情况。我们进一步发现,早期的计算决策依赖于静态结构线索,而在线停止则更紧密地跟踪算法执行状态。
cs.CL / 91 / 2602.08872
Large Language Models for Geolocation Extraction in Humanitarian Crisis Response
大型语言模型在人道主义危机响应中的地理位置提取
Abstract
Humanitarian crises demand timely and accurate geographic information to inform effective response efforts. Yet, automated systems that extract locations from text often reproduce existing geographic and socioeconomic biases, leading to uneven visibility of crisis-affected regions. This paper investigates whether Large Language Models (LLMs) can address these geographic disparities in extracting location information from humanitarian documents. We introduce a two-step framework that combines few-shot LLM-based named entity recognition with an agent-based geocoding module that leverages context to resolve ambiguous toponyms. We benchmark our approach against state-of-the-art pretrained and rule-based systems using both accuracy and fairness metrics across geographic and socioeconomic dimensions. Our evaluation uses an extended version of the HumSet dataset with refined literal toponym annotations. Results show that LLM-based methods substantially improve both the precision and fairness of geolocation extraction from humanitarian texts, particularly for underrepresented regions. By bridging advances in LLM reasoning with principles of responsible and inclusive AI, this work contributes to more equitable geospatial data systems for humanitarian response, advancing the goal of leaving no place behind in crisis analytics.
Chinese Translation
人道主义危机需要及时准确的地理信息以指导有效的响应工作。然而,从文本中提取位置的自动化系统往往会重现现有的地理和社会经济偏见,导致危机受影响地区的可见性不均。本论文探讨大型语言模型(LLMs)是否能够解决从人道主义文档中提取位置信息时的地理差异。我们提出了一个两步框架,该框架结合了基于少量示例的LLM命名实体识别与一个基于代理的地理编码模块,该模块利用上下文来解析模糊的地名。我们使用准确性和公平性指标对我们的方法与最先进的预训练和基于规则的系统进行了基准测试,涵盖了地理和社会经济维度。我们的评估使用了扩展版本的HumSet数据集,并进行了精细化的字面地名注释。结果表明,基于LLM的方法显著提高了从人道主义文本中提取地理位置的精确度和公平性,特别是在代表性不足的地区。通过将LLM推理的进展与负责任和包容性人工智能的原则相结合,本研究为人道主义响应的更公平的地理空间数据系统做出了贡献,推动了在危机分析中不让任何地方掉队的目标。
cs.CL / 92 / 2602.08874
Is Reasoning Capability Enough for Safety in Long-Context Language Models?
推理能力是否足以保证长上下文语言模型的安全性?
Abstract
Large language models (LLMs) increasingly combine long-context processing with advanced reasoning, enabling them to retrieve and synthesize information distributed across tens of thousands of tokens. A natural hypothesis is that stronger reasoning capability should improve safety by helping models recognize harmful intent even when it is not stated explicitly. We test this hypothesis in long-context settings where harmful intent is implicit and must be inferred through reasoning, and find that it does not hold. We introduce compositional reasoning attacks, a new threat model in which a harmful query is decomposed into incomplete fragments that are scattered throughout a long context. The model is then prompted with a neutral reasoning query that induces retrieval and synthesis, causing the harmful intent to emerge only after composition. Evaluating 14 frontier LLMs on contexts up to 64k tokens, we uncover three findings: (1) models with stronger general reasoning capability are not more robust to compositional reasoning attacks, often assembling the intent yet failing to refuse; (2) safety alignment consistently degrades as context length increases; and (3) inference-time reasoning effort is a key mitigating factor: increasing inference-time compute reduces attack success by over 50 percentage points on the GPT-oss-120b model. Together, these results suggest that safety does not automatically scale with reasoning capability, especially under long-context inference.
Chinese Translation
大型语言模型(LLMs)越来越多地将长上下文处理与高级推理相结合,使其能够检索和综合分布在数万个标记中的信息。我们的假设是,较强的推理能力应能通过帮助模型识别有害意图来提高安全性,即使这些意图并未明确表述。我们在长上下文环境中测试这一假设,在这些环境中,有害意图是隐含的,必须通过推理来推断,结果发现这一假设并不成立。我们引入了组合推理攻击,这是一种新的威胁模型,其中有害查询被分解为散布在长上下文中的不完整片段。然后,模型被提示以中性的推理查询,诱导检索和综合,导致有害意图在组合后才显现。对14个前沿LLM在最长可达64k标记的上下文中进行评估,我们发现了三个结果:(1)具有更强一般推理能力的模型并不更能抵御组合推理攻击,通常能够组装意图但未能拒绝;(2)随着上下文长度的增加,安全对齐持续下降;(3)推理时的推理努力是一个关键的缓解因素:增加推理时的计算量使得在GPT-oss-120b模型上的攻击成功率降低超过50个百分点。综合这些结果表明,安全性并不会随着推理能力的增强而自动提升,尤其是在长上下文推理的情况下。
cs.CL / 93 / 2602.08945
GitSearch: Enhancing Community Notes Generation with Gap-Informed Targeted Search
GitSearch:通过缺口信息驱动的目标搜索增强社区笔记生成
Abstract
Community-based moderation offers a scalable alternative to centralized fact-checking, yet it faces significant structural challenges, and existing AI-based methods fail in "cold start" scenarios. To tackle these challenges, we introduce GitSearch (Gap-Informed Targeted Search), a framework that treats human-perceived quality gaps, such as missing context, as first-class signals. GitSearch has a three-stage pipeline: identifying information deficits, executing real-time targeted web retrieval to resolve them, and synthesizing platform-compliant notes. To facilitate evaluation, we present PolBench, a benchmark of 78,698 U.S. political tweets with their associated Community Notes. We find that GitSearch achieves 99% coverage, almost doubling that of the state-of-the-art. GitSearch surpasses human-authored helpful notes with a 69% win rate and superior helpfulness scores (3.87 vs. 3.36), demonstrating retrieval effectiveness that balances the trade-off between scale and quality.
Chinese Translation
基于社区的审核提供了一种可扩展的替代方案来替代集中式事实核查,但它面临着显著的结构性挑战,现有的基于人工智能的方法在“冷启动”场景中表现不佳。为了解决这些挑战,我们提出了GitSearch(缺口信息驱动的目标搜索),这是一个将人类感知的质量缺口(如缺失的上下文等)视为首要信号的框架。GitSearch具有三个阶段的流程:识别信息缺口、执行实时的目标网页检索以解决这些缺口,并合成符合平台要求的笔记。为了便于评估,我们提出了PolBench,这是一个包含78,698条美国政治推文及其相关社区笔记的基准数据集。我们发现GitSearch实现了99%的覆盖率,几乎是当前最先进技术的两倍。GitSearch在有用笔记的生成上超过了人工撰写的笔记,胜率达到69%,并且在有用性评分上表现优越(3.87对比3.36),展示了在规模与质量之间取得平衡的检索有效性。
cs.CL / 94 / 2602.08951
How Should We Model the Probability of a Language?
我们应该如何建模语言的概率?
Abstract
Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.
Chinese Translation
在全球超过7000种语言中,商业语言识别(LID)系统仅能可靠地识别几百种书面形式的语言。研究级系统在某些情况下扩展了这一覆盖范围,但对于大多数语言而言,覆盖仍然是零散或不存在的。这篇立场论文认为,这种情况在很大程度上是自我造成的。特别是,它源于将LID持久地框定为去语境化的文本分类,这掩盖了先验概率估计的核心作用,并受到倾向于全球固定先验模型的制度激励的强化。我们认为,改善尾部语言的覆盖需要将LID重新思考为一个路由问题,并开发原则性的方法来纳入使语言在当地具有合理性的环境线索。
cs.CL / 95 / 2602.08984
Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models
离散潜在空间中的下一个概念预测促进更强大的语言模型
Abstract
We propose Next Concept Prediction (NCP), a generative pretraining paradigm built on top of Next Token Prediction (NTP). NCP predicts discrete concepts that span multiple tokens, thereby forming a more challenging pretraining objective. Our model, ConceptLM, quantizes hidden states using Vector Quantization and constructs a concept vocabulary. It leverages both NCP and NTP to drive parameter updates and generates a concept to guide the generation of the following tokens. We train ConceptLM from scratch at scales ranging from 70M to 1.5B parameters with up to 300B training data, including Pythia and GPT-2 backbones. Results on 13 benchmarks show that NCP yields consistent performance gains over traditional token-level models. Furthermore, continual pretraining experiments on an 8B-parameter Llama model indicate that NCP can further improve an NTP-trained model. Our analysis suggests that NCP leads to more powerful language models by introducing a harder pretraining task, providing a promising path toward better language modeling.
Chinese Translation
我们提出了下一个概念预测(Next Concept Prediction, NCP),这是一种建立在下一个标记预测(Next Token Prediction, NTP)基础上的生成预训练范式。NCP 预测跨越多个标记的离散概念,从而形成更具挑战性的预训练目标。我们的模型 ConceptLM 使用向量量化(Vector Quantization)对隐藏状态进行量化,并构建概念词汇。它利用 NCP 和 NTP 来驱动参数更新,并生成一个概念以指导后续标记的生成。我们从零开始训练 ConceptLM,参数规模从 70M 到 1.5B 不等,训练数据量高达 300B,包括 Pythia 和 GPT-2 的基础模型。在 13 个基准测试上的结果表明,NCP 相比传统的标记级模型具有一致的性能提升。此外,在一个 8B 参数的 Llama 模型上的持续预训练实验表明,NCP 可以进一步改善经过 NTP 训练的模型。我们的分析表明,NCP 通过引入更困难的预训练任务,导致更强大的语言模型,为更好的语言建模提供了有前景的路径。
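The vector-quantization step that turns hidden states into discrete concept ids can be sketched as a nearest-neighbor codebook lookup; the toy codebook and vectors below are ours for illustration, not ConceptLM's learned concept vocabulary.

```python
def quantize(hidden, codebook):
    """Return the index of the nearest codebook entry (squared L2 distance),
    i.e., the discrete 'concept' id assigned to a hidden-state vector."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: dist2(hidden, codebook[i]))

# Toy 2-D codebook standing in for a learned concept vocabulary.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
print(quantize((0.9, 1.2), codebook))  # prints 1
```

In training, the sequence of such indices becomes the target for the next-concept objective, layered on top of ordinary next-token prediction.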
cs.CL / 96 / 2602.08995
When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents
当行动偏离任务时:检测和纠正计算机使用代理中的不对齐行为
Abstract
Computer-use agents (CUAs) have made tremendous progress in the past year, yet they still frequently produce misaligned actions that deviate from the user's original intent. Such misaligned actions may arise from external attacks (e.g., indirect prompt injection) or from internal limitations (e.g., erroneous reasoning). They not only expose CUAs to safety risks, but also degrade task efficiency and reliability. This work makes the first effort to define and study misaligned action detection in CUAs, with comprehensive coverage of both externally induced and internally arising misaligned actions. We further identify three common categories in real-world CUA deployment and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels. Moreover, we propose DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback. DeAction outperforms all existing baselines across offline and online evaluations with moderate latency overhead: (1) On MisActBench, it outperforms baselines by over 15% absolute in F1 score; (2) In online evaluation, it reduces attack success rate by over 90% under adversarial settings while preserving or even improving task success rate in benign environments.
Chinese Translation
计算机使用代理(CUAs)在过去一年中取得了巨大的进展,但它们仍然经常产生偏离用户原始意图的不对齐行为。这种不对齐行为可能源于外部攻击(例如,间接提示注入)或内部限制(例如,错误推理)。这些行为不仅使CUAs面临安全风险,还降低了任务的效率和可靠性。本研究首次努力定义和研究CUAs中的不对齐行为检测,全面覆盖外部诱发和内部产生的不对齐行为。我们进一步识别了现实世界CUA部署中的三种常见类别,并构建了MisActBench,这是一个具有人工标注、行动级对齐标签的现实轨迹基准。此外,我们提出了DeAction,这是一种实用且通用的防护措施,可以在执行前检测不对齐行为,并通过结构化反馈进行迭代纠正。DeAction在离线和在线评估中均优于所有现有基线,且具有适度的延迟开销:(1)在MisActBench上,其F1分数比基线高出超过15%的绝对值;(2)在在线评估中,在对抗环境下,其攻击成功率降低超过90%,同时在良性环境中保持或甚至提高任务成功率。