Daily Research Digest

arXiv Papers

2026-01-19
144 Papers · 4 Categories · 144 Translated
Robotics (机器人学): 25 papers
cs.RO / 1

Verified Design of Robotic Autonomous Systems using Probabilistic Model Checking

使用概率模型检查的机器人自主系统的验证设计
Azaiez, Atef, Anisi, Alireza David
Abstract
Safety and reliability play a crucial role when designing Robotic Autonomous Systems (RAS). Early consideration of hazards, risks and mitigation actions -- already in the concept study phase -- is an important step in building a solid foundation for the subsequent steps in the system engineering life cycle. The complex nature of RAS, together with the uncertain and dynamic environments the robots operate in, not only affects fault management and operational robustness but also makes system design concept selection a hard problem to address. Approaches to tackling these challenges, and their implications for system design, range from ad-hoc concept development and design practices to the systematic, statistical and analytical techniques of Model-Based Systems Engineering. In this paper, we propose a methodology that applies a formal method, namely Probabilistic Model Checking (PMC), to enable systematic evaluation and analysis of a given set of system design concepts, ultimately leading to a set of Verified Designs (VD). We illustrate the application of the suggested methodology -- using PRISM as the probabilistic model checker -- on a practical RAS concept selection use-case from agricultural robotics. Along the way, we also develop and present domain-specific Design Evaluation Criteria for agri-RAS.
Chinese Translation
安全性和可靠性在设计机器人自主系统(RAS)时起着至关重要的作用。在概念研究阶段,尽早考虑危险、风险和缓解措施是为后续系统工程生命周期的步骤奠定坚实基础的重要步骤。RAS的复杂性以及机器人所处的不确定和动态环境,不仅影响故障管理和操作的稳健性,还使得系统设计概念选择的任务变得困难。应对上述挑战及其对系统设计影响的方法,从临时概念开发和设计实践到基于模型的系统工程的系统性、统计性和分析性技术不一而足。本文提出了一种方法论,应用一种形式化方法,即概率模型检查(PMC),以实现对给定系统设计概念集的系统评估和分析,最终得出一组经过验证的设计(VD)。我们使用PRISM作为概率模型检查器,展示了所建议方法论在农业机器人领域的实际RAS概念选择用例中的应用。同时,我们还开发并提出了一套特定领域的农业RAS设计评估标准。
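As a rough illustration of the kind of property PMC evaluates, the sketch below computes a reachability probability (the chance a hypothetical design concept eventually completes its mission) on a toy discrete-time Markov chain via fixed-point iteration. The states and transition probabilities are invented for illustration; the paper builds its models in PRISM, which answers such reachability queries (e.g. `P=? [F done]`) on far richer models.

```python
# Toy DTMC for a hypothetical robot design concept (NOT the paper's model):
# states 0=operating, 1=degraded, 2=mission_done (absorbing), 3=failed (absorbing).
P = [
    [0.90, 0.05, 0.04, 0.01],
    [0.20, 0.60, 0.10, 0.10],
    [0.00, 0.00, 1.00, 0.00],
    [0.00, 0.00, 0.00, 1.00],
]

def reach_probability(P, target, iters=10000, tol=1e-12):
    """Probability of eventually reaching `target` from each state,
    computed by iterating x <- P x with x[target] pinned to 1."""
    n = len(P)
    x = [0.0] * n
    x[target] = 1.0
    for _ in range(iters):
        new = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
        new[target] = 1.0  # target is absorbing with probability 1
        if max(abs(a - b) for a, b in zip(new, x)) < tol:
            return new
        x = new
    return x
```

A design with a higher mission-completion probability would rank better under a reachability-based evaluation criterion; here the toy chain yields a 0.7 completion probability from the nominal operating state.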
cs.RO / 2

Collaborative Continuum Robots: A Survey

协作连续机器人:综述
Li, Xinyu, Tang, Qian, Yin, Guoxin, Zheng, Gang, Burgner-Kahrs, Jessica, Stefanini, Cesare, Wu, Ke
Abstract
Continuum robots (CRs), owing to their compact structure, inherent compliance, and flexible deformation, have been widely applied in various fields. By coordinating multiple CRs to form collaborative continuum robots (CCRs), task adaptability, workspace, flexibility, load capacity, and operational stability can be further improved, thus offering significant advantages. In recent years, interest in this emerging field has grown steadily within the continuum-robotics community, accompanied by a consistent rise in related publications. By presenting a comprehensive overview of recent progress from different system-architecture levels, this survey provides a clear framework for research on CCRs. First, CCRs are classified into the three collaboration modes of separated collaboration, assistance collaboration, and parallel collaboration, with definitions provided. Next, advances in structural design, modeling, motion planning, and control for each mode are systematically summarized. Finally, current challenges and future opportunities for CCRs are discussed.
Chinese Translation
连续机器人(CRs)由于其紧凑的结构、固有的柔顺性和灵活的变形能力,已广泛应用于各个领域。通过协调多个连续机器人形成协作连续机器人(CCRs),可以进一步提高任务适应性、工作空间、灵活性、负载能力和操作稳定性,从而提供显著的优势。近年来,连续机器人领域对这一新兴领域的兴趣稳步增长,相关出版物也持续增加。通过呈现不同系统架构层面的最新进展的全面概述,本综述为CCRs的研究提供了清晰的框架。首先,将CCRs分为三种协作模式:分离协作、辅助协作和并行协作,并提供了相应的定义。接下来,系统总结了每种模式在结构设计、建模、运动规划和控制方面的进展。最后,讨论了CCRs当前面临的挑战和未来的机遇。
cs.RO / 3

A Survey of Real-Time Support, Analysis, and Advancements in ROS 2

ROS 2 实时支持、分析与进展的综述
Casini, Daniel, Chen, Jian-Jia, Li, Jing, Reghenzani, Federico, Teper, Harun
Abstract
The Robot Operating System 2 (ROS 2) has emerged as a relevant middleware framework for robotic applications, offering modularity, distributed execution, and communication. In the last six years, ROS 2 has drawn increasing attention from the real-time systems community and industry. This survey presents a comprehensive overview of research efforts that analyze, enhance, and extend ROS 2 to support real-time execution. We first provide a detailed description of the internal scheduling mechanisms of ROS 2 and its layered architecture, including the interaction with DDS-based communication and other communication middleware. We then review key contributions from the literature, covering timing analysis for both single- and multi-threaded executors, metrics such as response time, reaction time, and data age, and different communication modes. The survey also discusses community-driven enhancements to the ROS 2 runtime, including new executor algorithm designs, real-time GPU management, and microcontroller support via micro-ROS. Furthermore, we summarize techniques for bounding DDS communication delays, message filters, and profiling tools that have been developed to support analysis and experimentation. To help systematize this growing body of work, we introduce taxonomies that classify the surveyed contributions based on different criteria. This survey aims to guide both researchers and practitioners in understanding and improving the real-time capabilities of ROS 2.
Chinese Translation
机器人操作系统 2 (ROS 2) 已成为机器人应用中一个重要的中间件框架,提供模块化、分布式执行和通信。在过去六年中,ROS 2 受到实时系统社区和工业界的越来越多关注。本综述提供了对分析、增强和扩展 ROS 2 以支持实时执行的研究工作的全面概述。我们首先详细描述了 ROS 2 的内部调度机制及其分层架构,包括与基于 DDS 的通信和其他通信中间件的交互。然后,我们回顾了文献中的关键贡献,涵盖了单线程和多线程执行器的时序分析、响应时间、反应时间和数据年龄等指标,以及不同的通信模式。该综述还讨论了社区驱动的 ROS 2 运行时增强,包括新的执行器算法设计、实时 GPU 管理以及通过 micro-ROS 对微控制器的支持。此外,我们总结了用于界定 DDS 通信延迟、消息过滤器和支持分析与实验的分析工具的技术。为了帮助系统化这一日益增长的研究成果,我们引入了基于不同标准对调查贡献进行分类的分类法。本综述旨在指导研究人员和从业者理解和改善 ROS 2 的实时能力。
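The timing-analysis work the survey covers builds on classic response-time reasoning. The sketch below is the textbook fixed-point recurrence R = C + Σ ⌈R/T_i⌉·C_i for a task under preemptive fixed-priority scheduling; note that ROS 2 executors are non-preemptive, so the surveyed analyses adapt this style of recurrence to executor semantics rather than use it verbatim. The task parameters in the example are hypothetical.

```python
import math

def response_time(hp_wcets, hp_periods, own_wcet):
    """Classic fixed-point response-time analysis: worst-case response time
    of a task with WCET `own_wcet`, preempted by higher-priority tasks with
    WCETs hp_wcets[i] and periods hp_periods[i]. Assumes the set converges
    (i.e., is schedulable); otherwise this loop would not terminate."""
    R = own_wcet
    while True:
        interference = sum(
            math.ceil(R / hp_periods[i]) * hp_wcets[i]
            for i in range(len(hp_wcets))
        )
        new_R = own_wcet + interference
        if new_R == R:
            return R
        R = new_R
```

For example, a 3-unit task under higher-priority tasks (C=1, T=4) and (C=2, T=10) settles at a response time of 7 time units.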
cs.RO / 4

Energy-Efficient Omnidirectional Locomotion for Wheeled Quadrupeds via Predictive Energy-Aware Nominal Gait Selection

基于预测能量感知的标准步态选择实现轮式四足机器人能量高效全向运动
Yang, Xu, Yang, Wei, He, Kaibo, Yang, Bo, Sui, Yanan, Mo, Yilin
Abstract
Wheeled-legged robots combine the efficiency of wheels with the versatility of legs, but face significant energy optimization challenges when navigating diverse environments. In this work, we present a hierarchical control framework that integrates predictive power modeling with residual reinforcement learning to optimize omnidirectional locomotion efficiency for wheeled quadrupedal robots. Our approach employs a novel power prediction network that forecasts energy consumption across different gait patterns over a 1-second horizon, enabling intelligent selection of the most energy-efficient nominal gait. A reinforcement learning policy then generates residual adjustments to this nominal gait, fine-tuning the robot's actions to balance energy efficiency with performance objectives. Comparative analysis shows our method reduces energy consumption by up to 35% compared to fixed-gait approaches while maintaining comparable velocity tracking performance. We validate our framework through extensive simulations and real-world experiments on a modified Unitree Go1 platform, demonstrating robust performance even under external disturbances. Videos and implementation details are available at https://sites.google.com/view/switching-wpg.
Chinese Translation
轮腿机器人结合了轮子的高效性和腿部的灵活性,但在多样化环境中导航时面临显著的能量优化挑战。在本研究中,我们提出了一种层次控制框架,将预测功率建模与残差强化学习相结合,以优化轮式四足机器人全向运动的能量效率。我们的方法采用了一种新颖的功率预测网络,能够在1秒的时间范围内预测不同步态模式下的能量消耗,从而智能选择最具能量效率的标准步态。随后,强化学习策略对该标准步态生成残差调整,微调机器人的动作,以在能量效率与性能目标之间取得平衡。比较分析表明,相较于固定步态方法,我们的方法在保持相似速度跟踪性能的同时,能量消耗降低了多达35%。我们通过在修改后的Unitree Go1平台上进行广泛的仿真和实际实验验证了我们的框架,即使在外部干扰下也表现出强大的性能。视频和实现细节可在 https://sites.google.com/view/switching-wpg 获取。
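A minimal sketch of the nominal-gait selection step, assuming the power prediction network has already produced an average-power forecast per candidate gait over the 1-second horizon. The gait names and wattages below are hypothetical, not taken from the paper.

```python
def select_nominal_gait(predicted_power, horizon_s=1.0):
    """Pick the gait with the lowest predicted energy over the horizon.
    predicted_power: dict gait_name -> forecast average power (W).
    Returns (gait_name, predicted_energy_J)."""
    energies = {g: p * horizon_s for g, p in predicted_power.items()}
    best = min(energies, key=energies.get)
    return best, energies[best]
```

In the full framework the RL policy would then add residual corrections on top of the selected nominal gait.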
cs.RO / 5

Adaptive Sliding Mode Control for Vehicle Platoons with State-Dependent Friction Uncertainty

具有状态依赖摩擦不确定性的车辆编队自适应滑模控制
Yadav, Rishabh Dev
Abstract
Multi-robot formation control has various applications in domains such as vehicle troops, platoons, payload transportation, and surveillance. Maintaining formation in a vehicle platoon requires designing a suitable control scheme that can tackle external disturbances and uncertain system parameters while maintaining a predefined safe distance between the robots. A crucial challenge in this context is dealing with the unknown/uncertain friction forces between wheels and the ground, which vary with changes in road surface, wear in tires, and speed of the vehicle. Although state-of-the-art adaptive controllers can handle a priori bounded uncertainties, they struggle with accurately modeling and identifying frictional forces, which are often state-dependent and cannot be a priori bounded. This thesis proposes a new adaptive sliding mode controller for wheeled mobile robot-based vehicle platoons that can handle the unknown and complex behavior of frictional forces without prior knowledge of their parameters or structure. The controller uses adaptive sliding mode control techniques to regulate the platoon's speed and maintain a predefined inter-robot distance, even in the presence of external disturbances and uncertain system parameters. This approach involves a two-stage process: first, the kinematic controller calculates the desired velocities based on the desired trajectory; and second, the dynamics model generates the commands to achieve the desired motion. By separating the kinematics and dynamics of the robot, this approach can simplify the control problem and allow for more efficient and robust control of the wheeled mobile robot.
Chinese Translation
多机器人编队控制在车辆部队、编队、货物运输和监视等领域具有广泛的应用。在车辆编队中保持编队需要设计合适的控制方案,以应对外部干扰和不确定的系统参数,同时保持机器人之间的预定义安全距离。在这个背景下,一个关键的挑战是处理轮子与地面之间的未知/不确定摩擦力,这些摩擦力会随着路面变化、轮胎磨损和车辆速度的变化而变化。尽管最先进的自适应控制器能够处理先验有界的不确定性,但它们在准确建模和识别摩擦力方面存在困难,因为摩擦力通常依赖于状态,并且无法先验有界。本文提出了一种新的自适应滑模控制器,适用于基于轮式移动机器人(wheeled mobile robot)的车辆编队,能够在没有先验知识的情况下处理摩擦力的未知和复杂行为。该控制器利用自适应滑模控制技术来调节编队的速度,并在存在外部干扰和不确定系统参数的情况下保持预定义的机器人间距。该方法包括两个阶段:首先,运动学控制器根据期望轨迹计算期望速度;其次,动力学模型生成实现期望运动的指令。通过将机器人的运动学和动力学分离,该方法可以简化控制问题,并实现对轮式移动机器人的更高效和更稳健的控制。
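As a hedged sketch of the core mechanism (not the thesis's exact control law), here is one scalar adaptive sliding-mode step: sliding surface s = e_dot + lam*e, switching control u = -k*sign(s), and gain adaptation k_dot = gamma*|s|, which requires no a-priori bound on the friction disturbance. All gains below are illustrative.

```python
def asmc_step(e, e_dot, k, lam=2.0, gamma=5.0, dt=0.01):
    """One step of a scalar adaptive sliding-mode law.
    e, e_dot: tracking error and its derivative (e.g. inter-robot distance error)
    k: current switching gain, grown online so no disturbance bound is needed.
    Returns (control input u, updated gain k_new)."""
    s = e_dot + lam * e              # sliding surface
    sign = (s > 0) - (s < 0)         # sign(s) without importing numpy
    u = -k * sign                    # switching control
    k_new = k + gamma * abs(s) * dt  # adapt gain toward the unknown friction level
    return u, k_new
```

In the platoon setting this dynamics-level law would track the velocities commanded by the kinematic controller of the two-stage scheme.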
cs.RO / 6

Multi-Agent Formation Navigation Using Diffusion-Based Trajectory Generation

基于扩散的多智能体编队导航
Quang, Hieu Do, Truong-Quoc, Chien, Van Tran, Quoc
Abstract
This paper introduces a diffusion-based planner for leader--follower formation control in cluttered environments. The diffusion policy is used to generate the trajectory of the midpoint of two leaders modeled as a rigid bar in the plane, thereby defining their desired motion paths in a planar formation. The followers track the leaders and form the desired formation geometry using a distance-constrained formation controller based only on the relative positions in the followers' local coordinates. The proposed approach produces smooth motions and low tracking errors, with most failures occurring in narrow obstacle-free space or in obstacle configurations that are not in the training data set. Simulation results demonstrate the potential of diffusion models for reliable multi-agent formation planning.
Chinese Translation
本文介绍了一种基于扩散的规划器,用于在复杂环境中进行领导者-跟随者编队控制。扩散策略用于生成两个领导者中点的轨迹,将其视为平面中的刚性杆,从而定义它们在平面编队中的期望运动路径。跟随者通过仅基于跟随者局部坐标系中的相对位置的距离约束编队控制器来跟踪领导者并形成期望的编队几何形状。所提出的方法产生平滑的运动和低跟踪误差,大多数失败发生在狭窄的无障碍空间或不在训练数据集中的障碍配置中。仿真结果展示了扩散模型在可靠的多智能体编队规划中的潜力。
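The rigid-bar abstraction of the two leaders reduces to simple geometry: given the midpoint trajectory produced by the diffusion policy and the bar's orientation, the leaders' desired positions are the bar's two endpoints. A minimal sketch (the function interface is assumed for illustration, not taken from the paper):

```python
import math

def leader_positions(mid, theta, bar_length):
    """Leaders sit at the two ends of a rigid bar of length `bar_length`,
    centered at midpoint `mid` with planar orientation `theta` (radians)."""
    dx = 0.5 * bar_length * math.cos(theta)
    dy = 0.5 * bar_length * math.sin(theta)
    return (mid[0] + dx, mid[1] + dy), (mid[0] - dx, mid[1] - dy)
```

Each waypoint of the diffused midpoint trajectory thus maps to a pair of leader waypoints, which the followers then track via the distance-constrained controller.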
cs.RO / 7

Bidirectional Human-Robot Communication for Physical Human-Robot Interaction

双向人机沟通在物理人机交互中的应用
Wang, Junxiang, Wang, Cindy, Zarrin, Rana Soltani, Erickson, Zackory
Abstract
Effective physical human-robot interaction requires systems that are not only adaptable to user preferences but also transparent about their actions. This paper introduces BRIDGE, a system for bidirectional human-robot communication in physical assistance. Our method allows users to modify a robot's planned trajectory -- position, velocity, and force -- in real time using natural language. We utilize a large language model (LLM) to interpret any trajectory modifications implied by user commands in the context of the planned motion and conversation history. Importantly, our system provides verbal feedback in response to the user, either confirming any resulting changes or posing a clarifying question. We evaluated our method in a user study with 18 older adults across three assistive tasks, comparing BRIDGE to an ablation without verbal feedback and a baseline. Results show that participants successfully used the system to modify trajectories in real time. Moreover, the bidirectional feedback led to significantly higher ratings of interactivity and transparency, demonstrating that the robot's verbal response is critical for a more intuitive user experience. Videos and code can be found on our project website: https://bidir-comm.github.io/
Chinese Translation
有效的物理人机交互需要系统不仅能够适应用户偏好,还能对其行为保持透明。本文介绍了BRIDGE,一个用于物理辅助的双向人机沟通系统。我们的方法允许用户通过自然语言实时修改机器人的规划轨迹——位置、速度和力量。我们利用大型语言模型(LLM)来解释用户命令中隐含的任何轨迹修改,结合规划运动和对话历史的上下文。重要的是,我们的系统在响应用户时提供口头反馈,既可以确认任何由此产生的变化,也可以提出澄清问题。我们在一项用户研究中对18位老年人进行了评估,比较了BRIDGE与没有口头反馈的消融版本和基线。结果表明,参与者成功地使用该系统实时修改轨迹。此外,双向反馈显著提高了互动性和透明度的评分,证明机器人的口头响应对于提供更直观的用户体验至关重要。视频和代码可以在我们的项目网站上找到:https://bidir-comm.github.io/
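A toy stand-in for the modification step: assume the LLM has parsed a command such as "slow down a bit" into per-channel scale factors; applying them to a planned trajectory of (position, velocity, force) samples is then straightforward. The dict-based interface below is invented for illustration and is not BRIDGE's actual API.

```python
def apply_modification(traj, mod):
    """Apply a parsed trajectory modification to a planned trajectory.
    traj: list of (position, velocity, force) samples (scalars for simplicity).
    mod: dict of per-channel scale factors, e.g. {"velocity": 0.5};
         channels not mentioned are left unchanged."""
    return [
        (p * mod.get("position", 1.0),
         v * mod.get("velocity", 1.0),
         f * mod.get("force", 1.0))
        for p, v, f in traj
    ]
```

In the full system the robot would then verbalize the change ("Okay, moving at half speed") or ask a clarifying question before executing.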
cs.RO / 8

SurfSLAM: Sim-to-Real Underwater Stereo Reconstruction For Real-Time SLAM

SurfSLAM:用于实时SLAM的仿真到现实水下立体重建
Bagoren, Onur, Isaacson, Seth, Sundar, Sacchin, Sun, Yung-Ching, Sheppard, Anja, Ma, Haoyu, Shariff, Abrar, Vasudevan, Ram, Skinner, Katherine A.
Abstract
Localization and mapping are core perceptual capabilities for underwater robots. Stereo cameras provide a low-cost means of directly estimating metric depth to support these tasks. However, despite recent advances in stereo depth estimation on land, computing depth from image pairs in underwater scenes remains challenging. In underwater environments, images are degraded by light attenuation, visual artifacts, and dynamic lighting conditions. Furthermore, real-world underwater scenes frequently lack rich texture useful for stereo depth estimation and 3D reconstruction. As a result, stereo estimation networks trained on in-air data cannot transfer directly to the underwater domain. In addition, there is a lack of real-world underwater stereo datasets for supervised training of neural networks. Poor underwater depth estimation is compounded in stereo-based Simultaneous Localization and Mapping (SLAM) algorithms, making it a fundamental challenge for underwater robot perception. To address these challenges, we propose a novel framework that enables sim-to-real training of underwater stereo disparity estimation networks using simulated data and self-supervised finetuning. We leverage our learned depth predictions to develop SurfSLAM, a novel framework for real-time underwater SLAM that fuses stereo cameras with IMU, barometric, and Doppler Velocity Log (DVL) measurements. Lastly, we collect a challenging real-world dataset of shipwreck surveys using an underwater robot. Our dataset features over 24,000 stereo pairs, along with high-quality, dense photogrammetry models and reference trajectories for evaluation. Through extensive experiments, we demonstrate the advantages of the proposed training approach on real-world data for improving stereo estimation in the underwater domain and for enabling accurate trajectory estimation and 3D reconstruction of complex shipwreck sites.
Chinese Translation
定位和地图构建是水下机器人核心的感知能力。立体相机提供了一种低成本的方式来直接估计度量深度,以支持这些任务。然而,尽管近期在陆地上的立体深度估计取得了进展,从水下场景中的图像对计算深度仍然具有挑战性。在水下环境中,图像受到光衰减、视觉伪影和动态光照条件的影响。此外,现实世界中的水下场景通常缺乏丰富的纹理,这些纹理对于立体深度估计和三维重建是有用的。因此,基于空气数据训练的立体估计网络无法直接迁移到水下领域。此外,缺乏用于神经网络监督训练的现实世界水下立体数据集。水下深度估计不佳在基于立体的同时定位与地图构建(SLAM)算法中加剧,使其成为水下机器人感知的一个基本挑战。为了解决这些挑战,我们提出了一种新颖的框架,利用仿真数据和自监督微调实现水下立体视差估计网络的仿真到现实训练。我们利用学习到的深度预测开发了 SurfSLAM,这是一个将立体相机与惯性测量单元(IMU)、气压计和多普勒速度记录仪(DVL)测量融合的实时水下SLAM新框架。最后,我们使用水下机器人收集了一个具有挑战性的现实世界船舶失事调查数据集。我们的数据集包含超过24,000对立体图像,以及高质量、密集的摄影测量模型和用于评估的参考轨迹。通过广泛的实验,我们展示了所提训练方法在现实世界数据上改善水下领域立体估计的优势,并实现复杂船舶失事现场的准确轨迹估计和三维重建。
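The metric-depth capability that makes stereo attractive underwater rests on the standard pinhole-stereo relation Z = f·B/d (depth from focal length, baseline, and disparity). A minimal sketch with hypothetical camera parameters:

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Metric depth from stereo disparity: Z = f * B / d.
    disparity_px: horizontal pixel offset between the rectified image pair.
    focal_px: focal length in pixels; baseline_m: camera separation in meters."""
    if disparity_px <= 0:
        return float("inf")  # zero/negative disparity: point at infinity or invalid
    return focal_px * baseline_m / disparity_px
```

The paper's contribution is making the disparity estimate itself reliable underwater (via sim-to-real training and self-supervised finetuning); this conversion step is standard.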
cs.RO / 9

Approximately Optimal Global Planning for Contact-Rich SE(2) Manipulation on a Graph of Reachable Sets

基于可达集图的接触丰富 SE(2) 操作的近似最优全局规划
Liu, Simin, Zhao, Tong, Graesdal, Bernhard Paus, Werner, Peter, Wang, Jiuguang, Dolan, John, Liu, Changliu, Pang, Tao
Abstract
If we consider human manipulation, it is clear that contact-rich manipulation (CRM)-the ability to use any surface of the manipulator to make contact with objects-can be far more efficient and natural than relying solely on end-effectors (i.e., fingertips). However, state-of-the-art model-based planners for CRM are still focused on feasibility rather than optimality, limiting their ability to fully exploit CRM's advantages. We introduce a new paradigm that computes approximately optimal manipulator plans. This approach has two phases. Offline, we construct a graph of mutual reachable sets, where each set contains all object orientations reachable from a starting object orientation and grasp. Online, we plan over this graph, effectively computing and sequencing local plans for globally optimized motion. On a challenging, representative contact-rich task, our approach outperforms a leading planner, reducing task cost by 61%. It also achieves a 91% success rate across 250 queries and maintains sub-minute query times, ultimately demonstrating that globally optimized contact-rich manipulation is now practical for real-world tasks.
Chinese Translation
考虑到人类的操作能力,接触丰富操作(CRM)——利用操作器的任何表面与物体接触的能力——显然比仅依赖末端执行器(即手指)要高效和自然得多。然而,现有的基于模型的 CRM 规划器仍然侧重于可行性而非最优性,这限制了它们充分利用 CRM 优势的能力。我们提出了一种新的范式,用于计算近似最优的操作器规划。该方法分为两个阶段。离线阶段,我们构建了一个互可达集的图,其中每个集合包含从起始物体朝向和抓取位置可达的所有物体朝向。在线阶段,我们在该图上进行规划,有效地计算和排序局部计划,以实现全局优化的运动。在一个具有挑战性的代表性接触丰富任务中,我们的方法优于领先的规划器,任务成本降低了 61%。它在 250 次查询中也达到了 91% 的成功率,并保持了不到一分钟的查询时间,最终证明全局优化的接触丰富操作在现实任务中是可行的。
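The online phase plans over the offline graph of reachable sets; once edge weights encode local plan costs, any shortest-path search yields the globally optimized sequence of local plans. A minimal Dijkstra sketch over a hypothetical graph (the paper's node definitions, which pair grasps with object-orientation sets, and its cost model are far richer):

```python
import heapq

def plan_over_graph(edges, start, goal):
    """Dijkstra over a graph whose nodes stand in for reachable sets and
    whose edge weights stand in for local plan costs.
    edges: dict node -> list of (neighbor, cost). Returns (path, total_cost)."""
    dist = {start: 0.0}
    prev = {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in edges.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return list(reversed(path)), dist[goal]
```

Sequencing the local plans along the returned path is what turns locally feasible motions into an approximately globally optimal manipulation plan.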
cs.RO / 10

IMU-based Real-Time Crutch Gait Phase and Step Detections in Lower-Limb Exoskeletons

基于惯性测量单元的下肢外骨骼实时拐杖步态相位和步伐检测
Shakkour, Anis R., Hexner, David, Bitton, Yehuda, Sintov, Avishai
Abstract
Lower-limb exoskeletons and prostheses require precise, real-time gait phase and step detection to ensure synchronized motion and user safety. Conventional methods often rely on complex force-sensing hardware that introduces control latency. This paper presents a minimalist framework utilizing a single, low-cost Inertial Measurement Unit (IMU) integrated into the crutch hand grip, eliminating the need for mechanical modifications. We propose a five-phase classification system, including standard gait phases and a non-locomotor auxiliary state, to prevent undesired motion. Three deep-learning architectures were benchmarked on both a PC and an embedded system. To improve performance under data-constrained conditions, the models were augmented with a Finite State Machine (FSM) to enforce biomechanical consistency. The Temporal Convolutional Network (TCN) emerged as the superior architecture, yielding the highest success rates and lowest latency. Notably, the model generalized to a paralyzed user despite being trained exclusively on healthy participants. Achieving a 94% success rate in detecting crutch steps, this system provides a high-performance, cost-effective solution for real-time exoskeleton control.
Chinese Translation
下肢外骨骼和假肢需要精确的实时步态相位和步伐检测,以确保运动的同步性和用户的安全。传统方法通常依赖复杂的力传感硬件,这会引入控制延迟。本文提出了一种简约框架,利用集成在拐杖手柄中的单个低成本惯性测量单元(IMU),消除了对机械改造的需求。我们提出了一个五相分类系统,包括标准步态相位和一个非运动辅助状态,以防止不必要的运动。我们在PC和嵌入式系统上对三种深度学习架构进行了基准测试。为了在数据受限的条件下提高性能,模型与有限状态机(FSM)相结合,以强制执行生物力学一致性。时间卷积网络(TCN)被证明是最优架构,具有最高的成功率和最低的延迟。值得注意的是,该模型在仅以健康参与者训练的情况下,成功推广到一名瘫痪用户。该系统在检测拐杖步伐时实现了94%的成功率,提供了一种高性能、成本效益高的实时外骨骼控制解决方案。
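A sketch of how an FSM can enforce biomechanical consistency on top of a per-window classifier: predicted phases that would skip a step in the gait cycle are rejected in favor of the current state. The five phase names below are hypothetical placeholders, not the paper's labels.

```python
# Allowed transitions in a hypothetical five-phase crutch-gait cycle;
# "aux" is the non-locomotor auxiliary state, reachable from any phase.
ALLOWED = {
    "stance": {"stance", "pre_swing", "aux"},
    "pre_swing": {"pre_swing", "swing", "aux"},
    "swing": {"swing", "landing", "aux"},
    "landing": {"landing", "stance", "aux"},
    "aux": {"aux", "stance"},
}

def fsm_filter(current, predicted):
    """Reject classifier outputs that violate the phase order: keep the
    current phase unless the prediction is a legal successor."""
    return predicted if predicted in ALLOWED[current] else current
```

This kind of filter is cheap enough for an embedded target and is one way a small model trained on limited data can be kept from emitting physically implausible phase sequences.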
cs.RO / 11

Is open robotics innovation a threat to international peace and security?

开放机器人创新是否对国际和平与安全构成威胁?
Righetti, Ludovic, Boulanin, Vincent
Abstract
Open access to publications, software and hardware is central to robotics: it lowers barriers to entry, supports reproducible science and accelerates reliable system development. However, openness also exacerbates the inherent dual-use risks associated with research and innovation in robotics. It lowers barriers for states and non-state actors to develop and deploy robotic systems for military use and harmful purposes. Compared to other fields of engineering where dual-use risks are present -- e.g., those that underlie the development of weapons of mass destruction (chemical, biological, radiological, and nuclear weapons), and even the field of AI -- robotics offers no specific regulation and little guidance as to how research and innovation may be conducted and disseminated responsibly. While other fields can provide guidance, robotics has its own needs and specificities which have to be taken into account. The robotics community should therefore work toward its own set of sector-specific guidance and possibly regulation. To that end, we propose a roadmap focusing on four practices: a) education in responsible robotics; b) incentivizing risk assessment; c) moderating the diffusion of high-risk material; and d) developing red lines.
Chinese Translation
开放获取出版物、软件和硬件是机器人技术的核心:它降低了进入壁垒,支持可重复的科学研究,并加速可靠系统的发展。然而,开放性也加剧了与机器人研究和创新相关的固有双重用途风险。它降低了国家和非国家行为者开发和部署军事用途及有害目的的机器人系统的门槛。与其他存在双重用途风险的工程领域相比——例如,涉及大规模杀伤性武器(化学、生物、放射性和核武器)开发的领域,甚至是人工智能领域,机器人技术缺乏具体的监管和关于如何负责任地进行和传播研究与创新的指导。虽然其他领域可以提供指导,但机器人技术有其自身的需求和特性,必须加以考虑。因此,机器人社区应致力于制定一套特定于该领域的指导方针和可能的监管。为此,我们提出了一条关注四个实践的路线图:a) 负责任的机器人教育;b) 激励风险评估;c) 调节高风险材料的传播;d) 制定红线。
cs.RO / 12

Where to Touch, How to Contact: Hierarchical RL-MPC Framework for Geometry-Aware Long-Horizon Dexterous Manipulation

触摸何处,如何接触:基于层次化强化学习-模型预测控制的几何感知长时间灵巧操作框架
Xie, Zhixian, Xiang, Yu, Posa, Michael, Jin, Wanxin
Abstract
A key challenge in contact-rich dexterous manipulation is the need to jointly reason over geometry, kinematic constraints, and intricate, nonsmooth contact dynamics. End-to-end visuomotor policies bypass this structure, but often require large amounts of data, transfer poorly from simulation to reality, and generalize weakly across tasks/embodiments. We address those limitations by leveraging a simple insight: dexterous manipulation is inherently hierarchical - at a high level, a robot decides where to touch (geometry) and move the object (kinematics); at a low level it determines how to realize that plan through contact dynamics. Building on this insight, we propose a hierarchical RL--MPC framework in which a high-level reinforcement learning (RL) policy predicts a contact intention, a novel object-centric interface that specifies (i) an object-surface contact location and (ii) a post-contact object-level subgoal pose. Conditioned on this contact intention, a low-level contact-implicit model predictive control (MPC) optimizes local contact modes and replans with contact dynamics to generate robot actions that robustly drive the object toward each subgoal. We evaluate the framework on non-prehensile tasks, including geometry-generalized pushing and object 3D reorientation. It achieves near-100% success with substantially reduced data (10x less than end-to-end baselines), highly robust performance, and zero-shot sim-to-real transfer.
Chinese Translation
在接触丰富的灵巧操作中,一个关键挑战是需要共同考虑几何形状、运动学约束以及复杂的非光滑接触动态。端到端的视觉运动策略绕过了这一结构,但通常需要大量数据,从仿真到现实的迁移效果较差,并且在任务/实现之间的泛化能力较弱。我们通过利用一个简单的见解来解决这些局限性:灵巧操作本质上是层次化的——在高层次上,机器人决定触摸的位置(几何)和移动物体的方式(运动学);在低层次上,它确定如何通过接触动态实现该计划。基于这一见解,我们提出了一个层次化的强化学习-模型预测控制(RL-MPC)框架,其中高层次的强化学习(RL)策略预测接触意图,一个新的面向对象的接口指定(i)物体-表面接触位置和(ii)接触后物体级子目标姿态。基于这一接触意图,低层次的隐式接触模型预测控制(MPC)优化局部接触模式,并通过接触动态重新规划,以生成能够稳健地将物体驱动向每个子目标的机器人动作。我们在非抓取任务上评估了该框架,包括几何泛化推送和物体三维重定向。它在大幅减少数据(比端到端基线少10倍)的情况下,达到了近100%的成功率,表现出高度稳健的性能,并实现了零样本的仿真到现实迁移。
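The contact-intention interface between the high-level RL policy and the low-level MPC can be pictured as a small record type. The field layout below is an assumption inferred from the abstract's description (an object-surface contact location plus a post-contact object-level subgoal pose); it is not the authors' actual API.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ContactIntention:
    """Hypothetical high-level RL output consumed by the contact-implicit MPC:
    where to touch the object, and where the object should go next."""
    contact_point: Tuple[float, float, float]  # contact location on the object surface
    subgoal_pose: Tuple[float, ...]            # post-contact object pose, e.g. xyz + quaternion
```

Conditioning the MPC on such an object-centric record (rather than on raw robot actions) is what lets the low level re-plan contact modes locally while the high level reasons only about geometry and kinematics.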
cs.RO / 13

Crane Lowering Guidance Using an Attachable Camera Module for Driver Vision Support

使用可附加摄像头模块的起重机降落引导系统以支持驾驶员视野
Kang, HyoJae, Ahn, SunWoo, Choi, InGyu, Go, GeonYeong, Son, KunWoo, Kang, Min-Sung
Abstract
Cranes have long been essential equipment for lifting and placing heavy loads in construction projects. This study focuses on the lowering phase of crane operation, the stage in which the load is moved to the desired location. During this phase, a constant challenge exists: the load obstructs the operator's view of the landing point. As a result, operators traditionally have to rely on verbal or gestural instructions from ground personnel, which significantly impacts site safety. To alleviate this constraint, the proposed system incorporates an attachable camera module designed to be attached directly to the load via a suction cup. This module houses a single-board computer, battery, and compact camera. After installation, it streams and processes images of the ground directly below the load in real time to generate installation guidance. Simultaneously, this guidance is transmitted to and monitored by a host computer. Preliminary experiments were conducted by attaching this module to a test object, confirming the feasibility of real-time image acquisition and transmission. This approach has the potential to significantly improve safety on construction sites by providing crane operators with an instant visual reference of hidden landing zones.
Chinese Translation
起重机在建筑项目中一直是提升和放置重物的重要设备。本研究聚焦于起重机操作的降落阶段,即将负载移动到所需位置的阶段。在此阶段,存在一个持续的挑战:负载遮挡了操作员对着陆点的视线。因此,操作员传统上必须依赖地面人员的口头或手势指示,这对现场安全产生了显著影响。为了解决这一限制,所提出的系统结合了一个可附加的摄像头模块,该模块设计为通过吸盘直接附加到负载上。该模块内置单板计算机、电池和紧凑型摄像头。安装后,它实时流传输并处理负载正下方地面的图像,以生成安装指导。同时,这些指导信息被传输到主计算机进行监控。初步实验通过将该模块附加到测试物体上进行,确认了实时图像采集和传输的可行性。这种方法有潜力通过为起重机操作员提供隐藏着陆区的即时视觉参考,显著提高建筑工地的安全性。
cs.RO / 14

H-AIM: Orchestrating LLMs, PDDL, and Behavior Trees for Hierarchical Multi-Robot Planning

H-AIM:协调大规模语言模型、规划领域定义语言和行为树以实现分层多机器人规划
Zeng, Haishan, Li, Peng
Abstract
In embodied artificial intelligence, enabling heterogeneous robot teams to execute long-horizon tasks from high-level instructions remains a critical challenge. While large language models (LLMs) show promise in instruction parsing and preliminary planning, they exhibit limitations in long-term reasoning and dynamic multi-robot coordination. We propose Hierarchical Autonomous Intelligent Multi-Robot Planning (H-AIM), a novel embodied multi-robot task planning framework that addresses these issues through a three-stage cascaded architecture: 1) It leverages an LLM to parse instructions and generate Planning Domain Definition Language (PDDL) problem descriptions, thereby transforming commands into formal planning problems; 2) It combines the semantic reasoning of LLMs with the search capabilities of a classical planner to produce optimized action sequences; 3) It compiles the resulting plan into behavior trees for reactive control. The framework supports dynamically sized heterogeneous robot teams via a shared blackboard mechanism for communication and state synchronization. To validate our approach, we introduce the MACE-THOR benchmark dataset, comprising 42 complex tasks across 8 distinct household layouts. Experimental results demonstrate that H-AIM achieves a remarkable performance improvement, elevating the task success rate from 12% to 55% and boosting the goal condition recall from 32% to 72% against the strongest baseline, LaMMA-P.
Chinese Translation
在具身人工智能领域,使异构机器人团队能够根据高层指令执行长期任务仍然是一个关键挑战。尽管大规模语言模型(LLMs)在指令解析和初步规划方面显示出潜力,但它们在长期推理和动态多机器人协调方面存在局限性。我们提出了一种新的具身多机器人任务规划框架——分层自主智能多机器人规划(H-AIM),通过三阶段级联架构解决这些问题:1)利用LLM解析指令并生成规划领域定义语言(PDDL)问题描述,从而将命令转化为正式的规划问题;2)将LLM的语义推理与经典规划器的搜索能力相结合,生成优化的行动序列;3)将生成的计划编译为行为树以实现反应控制。该框架通过共享黑板机制支持动态大小的异构机器人团队进行通信和状态同步。为了验证我们的方法,我们引入了MACE-THOR基准数据集,其中包含8种不同家庭布局下的42个复杂任务。实验结果表明,H-AIM在任务成功率方面实现了显著的性能提升,从12%提高到55%,并将目标条件召回率从32%提升至72%,相较于最强基线LaMMA-P。
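Stage 1 turns parsed instructions into a formal planning problem. A toy sketch of assembling a PDDL problem string from structured facts; the predicates, object names, and domain name are invented for illustration (in H-AIM an LLM produces the description from natural language, and a classical planner consumes it):

```python
def make_pddl_problem(objects, init, goal, domain="household"):
    """Assemble a minimal PDDL problem string from parsed facts.
    Fact tuples like ("at", "robot1", "kitchen") become (at robot1 kitchen)."""
    fmt = lambda facts: "\n    ".join("(" + " ".join(f) + ")" for f in facts)
    objs = " ".join(objects)
    return (
        f"(define (problem task1) (:domain {domain})\n"
        f"  (:objects {objs})\n"
        f"  (:init\n    {fmt(init)})\n"
        f"  (:goal (and\n    {fmt(goal)})))"
    )
```

The payoff of this formalization step is that goal achievement becomes checkable and searchable by an off-the-shelf planner, rather than left to the LLM's free-form reasoning.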
cs.RO / 15

A3D: Adaptive Affordance Assembly with Dual-Arm Manipulation

A3D:基于双臂操作的自适应可用性组装
Liang, Jiaqi, Chen, Yue, Yu, Qize, Shen, Yan, Zhang, Haipeng, Dong, Hao, Wu, Ruihai
Abstract
Furniture assembly is a crucial yet challenging task for robots, requiring precise dual-arm coordination where one arm manipulates parts while the other provides collaborative support and stabilization. To accomplish this task more effectively, robots need to actively adapt support strategies throughout the long-horizon assembly process, while also generalizing across diverse part geometries. We propose A3D, a framework which learns adaptive affordances to identify optimal support and stabilization locations on furniture parts. The method employs dense point-level geometric representations to model part interaction patterns, enabling generalization across varied geometries. To handle evolving assembly states, we introduce an adaptive module that uses interaction feedback to dynamically adjust support strategies during assembly based on previous interactions. We establish a simulation environment featuring 50 diverse parts across 8 furniture types, designed for dual-arm collaboration evaluation. Experiments demonstrate that our framework generalizes effectively to diverse part geometries and furniture categories in both simulation and real-world settings.
Chinese Translation
家具组装是机器人面临的一项关键而具有挑战性的任务,要求精确的双臂协调,其中一只手臂负责操作部件,而另一只手臂提供协作支持和稳定性。为了更有效地完成这一任务,机器人需要在长时间的组装过程中主动调整支持策略,同时能够在不同的部件几何形状之间进行泛化。我们提出了A3D框架,该框架学习自适应可用性,以识别家具部件上的最佳支持和稳定位置。该方法采用密集的点级几何表示来建模部件交互模式,从而实现对多样几何形状的泛化。为了处理不断变化的组装状态,我们引入了一个自适应模块,该模块利用交互反馈在组装过程中动态调整支持策略,基于之前的交互结果。我们建立了一个模拟环境,包含50个不同的部件,涵盖8种家具类型,旨在评估双臂协作。实验表明,我们的框架在模拟和现实环境中对多样的部件几何形状和家具类别具有良好的泛化能力。
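At its simplest, point-level affordance prediction reduces support-site selection to an argmax over per-point scores. A minimal sketch with hypothetical points and scores; the paper's adaptive module additionally updates such scores from interaction feedback as assembly progresses.

```python
def best_support_point(points, scores):
    """Pick the support/stabilization location as the candidate surface point
    with the highest predicted affordance score (one score per point)."""
    best_i = max(range(len(points)), key=lambda i: scores[i])
    return points[best_i], scores[best_i]
```

Dense per-point scoring is what lets the approach generalize across part geometries: any new part is just a new point set to score.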
cs.RO / 16

Visual Marker Search for Autonomous Drone Landing in Diverse Urban Environments

多样化城市环境中自主无人机着陆的视觉标记搜索
Yao, Jiaohong, Liang, Linfeng, Deng, Yao, Zheng, Xi, Han, Richard, Qi, Yuankai
Abstract
Marker-based landing is widely used in drone delivery and return-to-base systems for its simplicity and reliability. However, most approaches assume idealized landing site visibility and sensor performance, limiting robustness in complex urban settings. We present a simulation-based evaluation suite on the AirSim platform with systematically varied urban layouts, lighting, and weather to replicate realistic operational diversity. Using onboard camera sensors (RGB for marker detection and depth for obstacle avoidance), we benchmark two heuristic coverage patterns and a reinforcement learning-based agent, analyzing how exploration strategy and scene complexity affect success rate, path efficiency, and robustness. Results underscore the need to evaluate marker-based autonomous landing under diverse, sensor-relevant conditions to guide the development of reliable aerial navigation systems.
Chinese Translation
基于标记的着陆在无人机配送和返航系统中因其简单性和可靠性而被广泛应用。然而,大多数方法假设理想化的着陆点可见性和传感器性能,这在复杂的城市环境中限制了其鲁棒性。我们在AirSim平台上提出了一种基于模拟的评估套件,系统性地改变城市布局、光照和天气,以复制现实的操作多样性。通过使用机载摄像头传感器(RGB用于标记检测,深度用于避障),我们基准测试了两种启发式覆盖模式和一种基于强化学习的智能体,分析了探索策略和场景复杂性如何影响成功率、路径效率和鲁棒性。结果强调了在多样化的、与传感器相关的条件下评估基于标记的自主着陆的必要性,以指导可靠的空中导航系统的发展。
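A common heuristic coverage pattern for marker search is a boustrophedon ("lawnmower") sweep; whether this matches either of the paper's two benchmarked patterns is an assumption. A minimal waypoint generator over a rectangular search area, with hypothetical dimensions:

```python
def lawnmower_waypoints(width, height, spacing):
    """Boustrophedon sweep over a width x height area: parallel passes
    `spacing` apart, alternating direction each row. Returns 2D waypoints."""
    wps = []
    y, row = 0.0, 0
    while y <= height:
        xs = (0.0, width) if row % 2 == 0 else (width, 0.0)
        wps.append((xs[0], y))
        wps.append((xs[1], y))
        y += spacing
        row += 1
    return wps
```

The drone would fly these waypoints at a fixed altitude while the RGB detector scans for the marker, with spacing chosen so consecutive passes' camera footprints overlap.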
cs.RO / 17

Learning Quadrupedal Locomotion for a Heavy Hydraulic Robot Using an Actuator Model

基于执行器模型的重型液压机器人四足运动学习
Lee, Minho, Kim, Hyeonseok, Kim, Jin Tak, Park, Sangshin, Lee, Jeong Hyun, Cho, Jungsan, Hwangbo, Jemin
Abstract
The simulation-to-reality (sim-to-real) transfer of large-scale hydraulic robots presents a significant challenge in robotics because of their inherently slow control response and complex fluid dynamics. The complex dynamics result from the structure of multiple interconnected cylinders and the differences in the cylinders' fluid rates. These characteristics complicate detailed simulation of all joints, making it unsuitable for reinforcement learning (RL) applications. In this work, we propose an analytical actuator model driven by hydraulic dynamics to represent the complicated actuators. The model predicts joint torques for all 12 actuators in under 1 microsecond, allowing rapid processing in RL environments. We compare our model with neural-network-based actuator models and demonstrate its advantages in data-limited scenarios. The locomotion policy trained in RL with our model is deployed on a hydraulic quadruped robot weighing over 300 kg. This work is the first demonstration of a successful transfer of stable and robust command-tracking locomotion with RL on a heavy hydraulic quadruped robot, demonstrating advanced sim-to-real transferability.
Chinese Translation
大型液压机器人的仿真到现实(sim-to-real)转移在机器人技术中面临重大挑战,因为其固有的控制响应缓慢和复杂的流体动力学。复杂的动力学源于多个相互连接的气缸结构以及气缸流体速率的差异。这些特性使得对所有关节进行详细仿真变得复杂,从而不适合强化学习(RL)应用。在本研究中,我们提出了一种由液压动力学驱动的分析性执行器模型,以表示复杂的执行器。该模型能够在不到1微秒的时间内预测所有12个执行器的关节扭矩,从而在RL环境中实现快速处理。我们将我们的模型与基于神经网络的执行器模型进行了比较,并展示了在数据有限场景下我们模型的优势。使用我们的模型在RL中训练的运动策略被部署在一台重达300公斤以上的液压四足机器人上。本研究首次成功展示了在重型液压四足机器人上通过RL实现稳定且鲁棒的命令跟踪运动的转移,证明了先进的仿真到现实的可转移性。
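An analytical hydraulic actuator model ultimately maps chamber pressures to joint torque through piston areas and a moment arm. A minimal single-cylinder sketch (geometry and pressures hypothetical; the paper's model additionally captures the flow dynamics across interconnected cylinders that make the problem hard):

```python
import math

def cylinder_joint_torque(p_head, p_rod, bore_d, rod_d, moment_arm):
    """Joint torque from one hydraulic cylinder: net force is chamber pressure
    times piston area (the rod-side area is the annulus left by the rod),
    multiplied by the joint moment arm. Pressures in Pa, lengths in m."""
    a_head = math.pi * (bore_d / 2) ** 2              # full piston area
    a_rod = a_head - math.pi * (rod_d / 2) ** 2       # annular rod-side area
    force = p_head * a_head - p_rod * a_rod           # net extension force (N)
    return force * moment_arm                          # torque (N·m)
```

Because this is a handful of arithmetic operations per actuator, evaluating all 12 joints comfortably fits the sub-microsecond budget the paper reports, which is what makes it usable inside an RL simulation loop.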
cs.RO / 18

Adaptive Monitoring of Stochastic Fire Front Processes via Information-seeking Predictive Control

通过信息寻求预测控制自适应监测随机火焰前沿过程
Papaioannou, Savvas, Kolios, Panayiotis, Panayiotou, Christos G., Polycarpou, Marios M.
Abstract
We consider the problem of adaptively monitoring a wildfire front using a mobile agent (e.g., a drone), whose trajectory determines where sensor data is collected and thus influences the accuracy of fire propagation estimation. This is a challenging problem, as the stochastic nature of wildfire evolution requires the seamless integration of sensing, estimation, and control, often treated separately in existing methods. State-of-the-art methods either impose linear-Gaussian assumptions to establish optimality or rely on approximations and heuristics, often without providing explicit performance guarantees. To address these limitations, we formulate the fire front monitoring task as a stochastic optimal control problem that integrates sensing, estimation, and control. We derive an optimal recursive Bayesian estimator for a class of stochastic nonlinear elliptical-growth fire front models. Subsequently, we transform the resulting nonlinear stochastic control problem into a finite-horizon Markov decision process and design an information-seeking predictive control law obtained via a lower confidence bound-based adaptive search algorithm with asymptotic convergence to the optimal policy.
Chinese Translation
我们考虑使用移动代理(例如,无人机)自适应监测野火前沿的问题,其轨迹决定了传感器数据的收集位置,从而影响火灾传播估计的准确性。这是一个具有挑战性的问题,因为野火演变的随机性要求将感知、估计和控制无缝集成,而现有方法通常将其分开处理。最先进的方法要么施加线性-高斯假设以建立最优性,要么依赖于近似和启发式方法,通常没有提供明确的性能保证。为了解决这些局限性,我们将火焰前沿监测任务表述为一个集成感知、估计和控制的随机最优控制问题。我们为一类随机非线性椭圆生长火焰前沿模型推导出最优递归贝叶斯估计器。随后,我们将得到的非线性随机控制问题转化为有限时间范围的马尔可夫决策过程,并设计了一种通过基于下置信界的自适应搜索算法获得的信息寻求预测控制律,该算法具有渐近收敛于最优策略的特性。
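The lower-confidence-bound adaptive search the abstract mentions can be illustrated with a toy version over a finite set of candidate actions. This sketch assumes a noisy scalar cost and is not the paper's MDP formulation; the function names and constants are invented.

```python
import math
import random

def lcb_search(cost_fn, actions, budget=500, c=1.0, noise=0.05, seed=0):
    """Toy lower-confidence-bound adaptive search over a finite action
    set: repeatedly evaluate the action whose LCB on expected cost is
    lowest, then return the action with the best empirical mean.
    Illustrative only; not the paper's control law."""
    rng = random.Random(seed)
    n = {a: 0 for a in actions}
    mean = {a: 0.0 for a in actions}
    for t in range(1, budget + 1):
        def lcb(a):
            if n[a] == 0:
                return float("-inf")  # try every action at least once
            return mean[a] - c * math.sqrt(math.log(t) / n[a])
        a = min(actions, key=lcb)
        cost = cost_fn(a) + rng.gauss(0.0, noise)  # noisy evaluation
        n[a] += 1
        mean[a] += (cost - mean[a]) / n[a]         # running average
    return min(actions, key=lambda a: mean[a])

# true cost is minimized at 0.3; the search should land near it
best = lcb_search(lambda a: (a - 0.3) ** 2,
                  [0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
```

The optimism-style bonus shrinks as an action is re-evaluated, which is what drives the asymptotic convergence property claimed for such schemes.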
cs.RO / 19

VLAgents: A Policy Server for Efficient VLA Inference

VLAgents:高效VLA推理的策略服务器
Jülg, Tobias, Gamal, Khaled, Nilavadi, Nisarga, Krack, Pierre, Bien, Seongjin, Krawez, Michael, Walter, Florian, Burgard, Wolfram
Abstract
The rapid emergence of Vision-Language-Action models (VLAs) has a significant impact on robotics. However, their deployment remains complex due to the fragmented interfaces and the inherent communication latency in distributed setups. To address this, we introduce VLAgents, a modular policy server that abstracts VLA inference behind a unified Gymnasium-style protocol. Crucially, its communication layer transparently adapts to the context by supporting both zero-copy shared memory for high-speed simulation and compressed streaming for remote hardware. In this work, we present the architecture of VLAgents and validate it by integrating seven policies -- including OpenVLA and Pi Zero. In a benchmark with both local and remote communication, we further demonstrate how it outperforms the default policy servers provided by OpenVLA, OpenPi, and LeRobot. VLAgents is available at https://github.com/RobotControlStack/vlagents
Chinese Translation
视觉-语言-动作模型(VLA)的快速出现对机器人技术产生了重大影响。然而,由于接口碎片化和分布式设置中固有的通信延迟,其部署仍然复杂。为了解决这个问题,我们提出了VLAgents,一个模块化的策略服务器,通过统一的Gymnasium风格协议抽象了VLA推理。关键是,它的通信层能够透明地适应不同的上下文,支持高速度仿真的零拷贝共享内存和远程硬件的压缩流。在本研究中,我们展示了VLAgents的架构,并通过集成七个策略(包括OpenVLA和Pi Zero)进行了验证。在本地和远程通信的基准测试中,我们进一步展示了它如何超越OpenVLA、OpenPi和LeRobot提供的默认策略服务器。VLAgents可在https://github.com/RobotControlStack/vlagents获取。
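A Gymnasium-style policy-serving protocol essentially reduces to a reset/step interface behind which any policy can sit. The sketch below is a hypothetical minimal version of that idea; the class names and registry are invented for illustration and are not the actual VLAgents API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class PolicyServer:
    """Minimal sketch of a Gymnasium-style policy-serving protocol:
    reset() starts an episode, step(obs) returns an action.  The
    registry and the echo policy below are hypothetical stand-ins,
    not the VLAgents implementation."""
    policies: Dict[str, Any] = field(default_factory=dict)

    def register(self, name, policy):
        self.policies[name] = policy

    def reset(self, name, task):
        return self.policies[name].reset(task)

    def step(self, name, obs):
        return self.policies[name].step(obs)

class EchoPolicy:
    def reset(self, task):
        self.task = task
        return {"status": "ready", "task": task}

    def step(self, obs):
        # a real VLA would run inference on images + language here
        return {"action": [0.0] * 7, "obs_keys": sorted(obs)}

server = PolicyServer()
server.register("echo", EchoPolicy())
info = server.reset("echo", "pick up the cube")
act = server.step("echo", {"rgb": None, "state": [0.0]})
```

Keeping the protocol this small is what lets the transport layer (shared memory locally, compressed streams remotely) vary without touching policy code.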
cs.RO / 20

Skill-Aware Diffusion for Generalizable Robotic Manipulation

技能感知扩散用于可泛化的机器人操作
Huang, Aoshen, Chen, Jiaming, Cheng, Jiyu, Song, Ran, Pan, Wei, Zhang, Wei
Abstract
Robust generalization in robotic manipulation is crucial for robots to adapt flexibly to diverse environments. Existing methods usually improve generalization by scaling data and networks, but model tasks independently and overlook skill-level information. Observing that tasks within the same skill share similar motion patterns, we propose Skill-Aware Diffusion (SADiff), which explicitly incorporates skill-level information to improve generalization. SADiff learns skill-specific representations through a skill-aware encoding module with learnable skill tokens, and conditions a skill-constrained diffusion model to generate object-centric motion flow. A skill-retrieval transformation strategy further exploits skill-specific trajectory priors to refine the mapping from 2D motion flow to executable 3D actions. Furthermore, we introduce IsaacSkill, a high-fidelity dataset containing fundamental robotic skills for comprehensive evaluation and sim-to-real transfer. Experiments in simulation and real-world settings show that SADiff achieves good performance and generalization across various manipulation tasks. Code, data, and videos are available at https://sites.google.com/view/sa-diff.
Chinese Translation
机器人操作中的稳健泛化对于机器人灵活适应多样化环境至关重要。现有方法通常通过扩展数据和网络来提高泛化能力,但独立地建模任务并忽视技能水平信息。我们观察到同一技能下的任务共享相似的运动模式,因此提出了技能感知扩散(Skill-Aware Diffusion, SADiff),该方法明确地结合了技能水平信息以提高泛化能力。SADiff通过具有可学习技能标记的技能感知编码模块学习特定技能的表示,并对技能约束的扩散模型进行条件设置,以生成以物体为中心的运动流。技能检索转换策略进一步利用特定技能的轨迹先验,优化从二维运动流到可执行三维动作的映射。此外,我们引入了IsaacSkill,这是一个包含基本机器人技能的高保真数据集,用于全面评估和仿真到现实的转移。在仿真和现实环境中的实验表明,SADiff在各种操作任务中实现了良好的性能和泛化能力。代码、数据和视频可在 https://sites.google.com/view/sa-diff 获取。
cs.RO / 21

Distributed Control Barrier Functions for Safe Multi-Vehicle Navigation in Heterogeneous USV Fleets

异构无人船队安全多车辆导航的分布式控制障碍函数
Paine, Tyler, Long, Brendan, Wenger, Jeremy, DeFilippo, Michael, Usevitch, James, Benjamin, Michael
Abstract
Collision avoidance in heterogeneous fleets of uncrewed vessels is challenging because the decision-making processes and controllers often differ between platforms, and it is further complicated by the limitations on sharing trajectories and control values in real-time. This paper presents a pragmatic approach that addresses these issues by adding a control filter on each autonomous vehicle that assumes worst-case behavior from other contacts, including crewed vessels. This distributed safety control filter is developed using control barrier function (CBF) theory and the application is clearly described to ensure explainability of these safety-critical methods. This work compares the worst-case CBF approach with a Collision Regulations (COLREGS) behavior-based approach in simulated encounters. Real-world experiments with three different uncrewed vessels and a human operated vessel were performed to confirm the approach is effective across a range of platforms and is robust to uncooperative behavior from human operators. Results show that combining both CBF methods and COLREGS behaviors achieves the best safety and efficiency.
Chinese Translation
在异构无人船队中,避免碰撞是一项挑战,因为不同平台之间的决策过程和控制器通常存在差异,并且实时共享轨迹和控制值的限制使问题更加复杂。本文提出了一种务实的方法,通过在每个自主车辆上添加控制滤波器来解决这些问题,该滤波器假设其他接触物体(包括有人船)的最坏情况行为。该分布式安全控制滤波器基于控制障碍函数(Control Barrier Function, CBF)理论进行开发,并清晰描述了其应用,以确保这些安全关键方法的可解释性。本文将最坏情况CBF方法与基于碰撞规则(Collision Regulations, COLREGS)行为的方法进行了模拟对比。通过与三种不同的无人船和一艘有人操作船的实际实验,验证了该方法在多种平台上的有效性,并且对人类操作员的不合作行为具有鲁棒性。结果表明,结合CBF方法和COLREGS行为能够实现最佳的安全性和效率。
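The control-filter idea can be shown in one dimension: a barrier function h(x) >= 0 encodes safety, and the filter clamps the nominal command so that dh/dt + alpha*h >= 0 holds. This toy single-integrator example is illustrative only, not the paper's multi-vessel, worst-case formulation.

```python
def cbf_filter(u_nom, x, x_obs, alpha=1.0, safe_dist=2.0):
    """Toy 1-D control barrier function filter for a single integrator
    x' = u driving toward an obstacle at x_obs.  Barrier:
        h(x) = x_obs - x - safe_dist >= 0
    The safety condition dh/dt + alpha*h >= 0 reduces to u <= alpha*h,
    so the filter clamps the nominal command.  Illustrative sketch,
    not the paper's formulation."""
    h = x_obs - x - safe_dist
    return min(u_nom, alpha * h)

far = cbf_filter(u_nom=0.5, x=0.0, x_obs=10.0)   # far away: passes through
near = cbf_filter(u_nom=0.5, x=7.8, x_obs=10.0)  # close: clamped to ~0.2
```

In higher dimensions the clamp becomes a small quadratic program, but the structure, nominal command filtered through a barrier constraint, is the same.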
cs.RO / 22

The Mini Wheelbot Dataset: High-Fidelity Data for Robot Learning

迷你轮机器人数据集:用于机器人学习的高保真数据
Hose, Henrik, Brunzema, Paul, Subhasish, Devdutt, Trimpe, Sebastian
Abstract
The development of robust learning-based control algorithms for unstable systems requires high-quality, real-world data, yet access to specialized robotic hardware remains a significant barrier for many researchers. This paper introduces a comprehensive dynamics dataset for the Mini Wheelbot, an open-source, quasi-symmetric balancing reaction wheel unicycle. The dataset provides 1 kHz synchronized data encompassing all onboard sensor readings, state estimates, ground-truth poses from a motion capture system, and third-person video logs. To ensure data diversity, we include experiments across multiple hardware instances and surfaces using various control paradigms, including pseudo-random binary excitation, nonlinear model predictive control, and reinforcement learning agents. We include several example applications in dynamics model learning, state estimation, and time-series classification to illustrate common robotics algorithms that can be benchmarked on our dataset.
Chinese Translation
开发针对不稳定系统的稳健学习控制算法需要高质量的真实世界数据,但对专业机器人硬件的获取仍然是许多研究者面临的重大障碍。本文介绍了一个全面的动力学数据集,针对迷你轮机器人(Mini Wheelbot),这是一种开源的准对称平衡反应轮独轮车。该数据集提供了1 kHz同步数据,涵盖所有机载传感器读数、状态估计、来自运动捕捉系统的真实位姿以及第三人称视角视频日志。为了确保数据的多样性,我们在多个硬件实例和不同表面上进行实验,使用包括伪随机二进制激励、非线性模型预测控制和强化学习代理等多种控制范式。我们还包括了几个动力学模型学习、状态估计和时间序列分类的示例应用,以说明可以在我们的数据集上进行基准测试的常见机器人算法。
cs.RO / 23

ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

ACoT-VLA:面向视觉-语言-行动模型的行动思维链
Zhong, Linqing, Liu, Yi, Wei, Yifei, Xiong, Ziyu, Yao, Maoqing, Liu, Si, Ren, Guanghui
Abstract
Vision-Language-Action (VLA) models have emerged as essential generalist robot policies for diverse manipulation tasks, conventionally relying on directly translating multimodal inputs into actions via Vision-Language Model (VLM) embeddings. Recent advancements have introduced explicit intermediary reasoning, such as sub-task prediction (language) or goal image synthesis (vision), to guide action generation. However, these intermediate reasoning steps are often indirect and inherently limited in their capacity to convey the full, granular information required for precise action execution. Instead, we posit that the most effective form of reasoning is one that deliberates directly in the action space. We introduce Action Chain-of-Thought (ACoT), a paradigm where the reasoning process itself is formulated as a structured sequence of coarse action intents that guide the final policy. In this paper, we propose ACoT-VLA, a novel architecture that materializes the ACoT paradigm. Specifically, we introduce two complementary components: an Explicit Action Reasoner (EAR) and Implicit Action Reasoner (IAR). The former proposes coarse reference trajectories as explicit action-level reasoning steps, while the latter extracts latent action priors from internal representations of multimodal input, co-forming an ACoT that conditions the downstream action head to enable grounded policy learning. Extensive experiments in real-world and simulation environments demonstrate the superiority of our proposed method, which achieves 98.5%, 84.1%, and 47.4% on LIBERO, LIBERO-Plus and VLABench, respectively.
Chinese Translation
视觉-语言-行动(VLA)模型已成为多样化操作任务的重要通用机器人策略,传统上依赖于通过视觉-语言模型(VLM)嵌入将多模态输入直接转换为行动。近期的进展引入了明确的中介推理,例如子任务预测(语言)或目标图像合成(视觉),以指导行动生成。然而,这些中介推理往往是间接的,并且在传达精确行动执行所需的全面、细致信息方面固有局限。相反,我们认为最有效的推理形式是直接在行动空间中进行深思熟虑。我们提出了行动思维链(ACoT),一种将推理过程本身构建为一系列结构化的粗略行动意图,从而指导最终策略的范式。在本文中,我们提出了ACoT-VLA,这是一种实现ACoT范式的新颖架构。具体而言,我们引入了两个互补组件:显式行动推理器(EAR)和隐式行动推理器(IAR)。前者提出粗略参考轨迹作为显式行动级推理步骤,而后者则从多模态输入的内部表示中提取潜在行动先验,共同形成一个ACoT,以调节下游行动头,从而实现基于真实环境的策略学习。我们在真实世界和仿真环境中的广泛实验表明,我们提出的方法具有优越性,在LIBERO、LIBERO-Plus和VLABench上分别达到了98.5%、84.1%和47.4%的成绩。
cs.RO / 24

The Great March 100: 100 Detail-oriented Tasks for Evaluating Embodied AI Agents

伟大进军100:评估具身人工智能代理的100个细节导向任务
Wang, Ziyu, Liu, Chenyuan, Xiang, Yushun, Zhang, Runhao, Hao, Qingbo, Lu, Hongliang, Chen, Houyu, Feng, Zhizhong, Zheng, Kaiyue, Ye, Dehao, Zeng, Xianchao, Zhou, Xinyu, Wen, Boran, Li, Jiaxin, Zhang, Mingyu, Zheng, Kecheng, Zhu, Qian, Cheng, Ran, Li, Yong-Lu
Abstract
Recently, with the rapid development of robot learning and imitation learning, numerous datasets and methods have emerged. However, these datasets and their task designs often lack systematic consideration and principles. This raises important questions: Do the current datasets and task designs truly advance the capabilities of robotic agents? Do evaluations on a few common tasks accurately reflect the differentiated performance of various methods proposed by different teams and evaluated on different tasks? To address these issues, we introduce the Great March 100 (GM-100) as the first step towards a robot learning Olympics. GM-100 consists of 100 carefully designed tasks that cover a wide range of interactions and long-tail behaviors, aiming to provide a diverse and challenging set of tasks to comprehensively evaluate the capabilities of robotic agents and promote diversity and complexity in robot dataset task designs. These tasks are developed through systematic analysis and expansion of existing task designs, combined with insights from human-object interaction primitives and object affordances. We collect a large amount of trajectory data on different robotic platforms and evaluate several baseline models. Experimental results demonstrate that the GM-100 tasks are 1) feasible to execute and 2) sufficiently challenging to effectively differentiate the performance of current VLA models. Our data and code are available at https://rhos.ai/research/gm-100.
Chinese Translation
近年来,随着机器人学习和模仿学习的快速发展,出现了众多数据集和方法。然而,这些数据集及其任务设计往往缺乏系统性的考虑和原则。这引发了重要问题:当前的数据集和任务设计是否真正推动了机器人代理的能力?在少数常见任务上的评估是否准确反映了不同团队提出的各种方法在不同任务上的差异化表现?为了解决这些问题,我们提出了伟大进军100(Great March 100,简称 GM-100),作为机器人学习奥林匹克的第一步。GM-100由100个精心设计的任务组成,涵盖广泛的交互和长尾行为,旨在提供多样且具有挑战性的任务集,以全面评估机器人代理的能力,并促进机器人数据集任务设计的多样性和复杂性。这些任务通过对现有任务设计的系统分析和扩展,以及对人类-物体交互原语和物体可供性(affordances)的洞察进行开发。我们在不同的机器人平台上收集了大量轨迹数据,并评估了几种基线模型。实验结果表明,GM-100任务 1) 可执行,2) 足够具有挑战性,能够有效区分当前VLA模型的表现。我们的数据和代码可在 https://rhos.ai/research/gm-100 获取。
cs.RO / 25

Learning Semantic-Geometric Task Graph-Representations from Human Demonstrations

从人类示范中学习语义-几何任务图表示
Herbert, Franziska, Prasad, Vignesh, Liu, Han, Koert, Dorothea, Chalvatzaki, Georgia
Abstract
Learning structured task representations from human demonstrations is essential for understanding long-horizon manipulation behaviors, particularly in bimanual settings where action ordering, object involvement, and interaction geometry can vary significantly. A key challenge lies in jointly capturing the discrete semantic structure of tasks and the temporal evolution of object-centric geometric relations in a form that supports reasoning over task progression. In this work, we introduce a semantic-geometric task graph-representation that encodes object identities, inter-object relations, and their temporal geometric evolution from human demonstrations. Building on this formulation, we propose a learning framework that combines a Message Passing Neural Network (MPNN) encoder with a Transformer-based decoder, decoupling scene representation learning from action-conditioned reasoning about task progression. The encoder operates solely on temporal scene graphs to learn structured representations, while the decoder conditions on action-context to predict future action sequences, associated objects, and object motions over extended time horizons. Through extensive evaluation on human demonstration datasets, we show that semantic-geometric task graph-representations are particularly beneficial for tasks with high action and object variability, where simpler sequence-based models struggle to capture task progression. Finally, we demonstrate that task graph representations can be transferred to a physical bimanual robot and used for online action selection, highlighting their potential as reusable task abstractions for downstream decision-making in manipulation systems.
Chinese Translation
从人类示范中学习结构化任务表示对于理解长时间范围的操作行为至关重要,特别是在双手操作环境中,行动顺序、物体参与和交互几何关系可能会显著变化。一个关键挑战在于如何以支持任务进展推理的形式,联合捕捉任务的离散语义结构和以物体为中心的几何关系的时间演变。在本研究中,我们引入了一种语义-几何任务图表示,它编码了物体身份、物体间关系及其来自人类示范的时间几何演变。基于这一表述,我们提出了一个学习框架,该框架结合了消息传递神经网络(Message Passing Neural Network, MPNN)编码器与基于变换器(Transformer)的解码器,将场景表示学习与基于行动的任务进展推理解耦。编码器仅在时间场景图上操作,以学习结构化表示,而解码器则基于行动上下文来预测未来的行动序列、相关物体及其在较长时间范围内的运动。通过对人类示范数据集的广泛评估,我们表明语义-几何任务图表示对于具有高行动和物体变异性的任务特别有益,而简单的基于序列的模型在捕捉任务进展方面则显得力不从心。最后,我们展示了任务图表示可以转移到物理双手机器人上,并用于在线行动选择,突显了它们作为可重用任务抽象在操作系统下游决策中的潜力。
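The MPNN encoder described above operates on scene graphs by exchanging information along edges. A minimal, learning-free sketch of one sum-aggregation message-passing round follows; the toy bimanual graph and features are invented, and a real MPNN would use learned message and update functions.

```python
def message_passing_round(node_feats, edges):
    """One round of sum-aggregation message passing: each node's new
    feature is its own feature plus the sum of features arriving over
    incoming edges.  Toy stand-in for a learned MPNN layer."""
    updated = {}
    for v, feat in node_feats.items():
        msg = [0.0] * len(feat)
        for (src, dst) in edges:
            if dst == v:
                for i, x in enumerate(node_feats[src]):
                    msg[i] += x
        updated[v] = [f + m for f, m in zip(feat, msg)]
    return updated

# a tiny scene graph: both hands connected to one object
feats = {"left": [1.0, 0.0], "right": [0.0, 1.0], "cup": [0.5, 0.5]}
edges = [("left", "cup"), ("right", "cup"),
         ("cup", "left"), ("cup", "right")]
out = message_passing_round(feats, edges)
```

After one round the object node has absorbed information from both hands, which is the mechanism that lets graph encoders capture inter-object relations.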
计算机视觉 (Computer Vision)
49
cs.CV / 1

Future Optical Flow Prediction Improves Robot Control & Video Generation

未来光流预测提升机器人控制与视频生成
Ranasinghe, Kanchana, Zhou, Honglu, Fang, Yu, Yang, Luyu, Xue, Le, Xu, Ran, Xiong, Caiming, Savarese, Silvio, Ryoo, Michael S, Niebles, Juan Carlos
Abstract
Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. This unique combination enables strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signals from this noisy video-caption data, we employ crucial data preprocessing techniques and our unified architecture with strong image pretraining. The resulting trained model is then extended to tackle two distinct downstream tasks in control and generation. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and scalable learning from diverse web data for future optical flow prediction.
Chinese Translation
未来运动表示,如光流,对于控制和生成任务具有巨大价值。然而,预测可泛化的空间密集运动表示仍然是一个关键挑战,而从噪声较大的现实世界数据中学习这种预测仍然相对未被探索。我们提出了FOFPred,一种新颖的语言条件光流预测模型,具有统一的视觉-语言模型(Vision-Language Model, VLM)和扩散架构(Diffusion architecture)。这种独特的组合使得在像素级生成保真度下进行强大的多模态推理成为可能,以实现未来运动预测。我们的模型是在网络规模的人类活动数据上进行训练的,这是一种高度可扩展但未结构化的来源。为了从这些噪声视频-字幕数据中提取有意义的信号,我们采用了关键的数据预处理技术以及我们具有强大图像预训练的统一架构。最终训练出的模型被扩展以应对控制和生成中的两个不同下游任务。在语言驱动的设置下,对机器人操作和视频生成的评估验证了FOFPred的跨领域多功能性,确认了统一的VLM-扩散架构和从多样化网络数据中可扩展学习在未来光流预测中的价值。
cs.CV / 2

ICONIC-444: A 3.1-Million-Image Dataset for OOD Detection Research

ICONIC-444:用于OOD检测研究的310万图像数据集
Krumpl, Gerhard, Avenhaus, Henning, Possegger, Horst
Abstract
Current progress in out-of-distribution (OOD) detection is limited by the lack of large, high-quality datasets with clearly defined OOD categories across varying difficulty levels (near- to far-OOD) that support both fine- and coarse-grained computer vision tasks. To address this limitation, we introduce ICONIC-444 (Image Classification and OOD Detection with Numerous Intricate Complexities), a specialized large-scale industrial image dataset containing over 3.1 million RGB images spanning 444 classes tailored for OOD detection research. Captured with a prototype industrial sorting machine, ICONIC-444 closely mimics real-world tasks. It complements existing datasets by offering structured, diverse data suited for rigorous OOD evaluation across a spectrum of task complexities. We define four reference tasks within ICONIC-444 to benchmark and advance OOD detection research and provide baseline results for 22 state-of-the-art post-hoc OOD detection methods.
Chinese Translation
当前在分布外(OOD)检测方面的进展受到缺乏大型高质量数据集的限制,这些数据集具有明确的OOD类别,并涵盖不同难度级别(近OOD到远OOD),以支持细粒度和粗粒度的计算机视觉任务。为了解决这一限制,我们推出了ICONIC-444(图像分类与具有众多复杂性的OOD检测),这是一个专门的大规模工业图像数据集,包含超过310万张RGB图像,涵盖444个类别,旨在支持OOD检测研究。ICONIC-444通过原型工业分拣机捕获,紧密模拟现实世界任务。它通过提供结构化、多样化的数据,补充了现有数据集,适用于在各种任务复杂性下进行严格的OOD评估。我们在ICONIC-444中定义了四个参考任务,以基准测试和推动OOD检测研究,并为22种最先进的后处理OOD检测方法提供基线结果。
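Post-hoc OOD detectors of the kind such a benchmark compares score a trained classifier's outputs without any retraining. The classic maximum-softmax-probability baseline is sketched here for illustration; it is one of many detectors a benchmark like this would include, not the paper's own method.

```python
import math

def msp_score(logits):
    """Maximum softmax probability, a classic post-hoc OOD score:
    a low maximum class probability suggests the input is out of
    distribution.  Uses the max-shift trick for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)

confident = msp_score([8.0, 0.1, 0.2])   # peaked logits -> near 1.0
uncertain = msp_score([1.0, 1.0, 1.0])   # flat logits   -> 1/3
```

Thresholding this score separates ID from OOD inputs; benchmark metrics such as AUROC sweep that threshold over a labeled ID/OOD test set.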
cs.CV / 3

A Unified 3D Object Perception Framework for Real-Time Outside-In Multi-Camera Systems

统一的三维物体感知框架用于实时外向内多摄像头系统
Wang, Yizhou, Pusegaonkar, Sameer, Wang, Yuxing, Li, Anqi, Kumar, Vishal, Sethi, Chetan, Aiyer, Ganapathy, He, Yun, Thakkar, Kartikay, Rathi, Swapnil, Rupde, Bhushan, Tang, Zheng, Biswas, Sujit
Abstract
Accurate 3D object perception and multi-target multi-camera (MTMC) tracking are fundamental for the digital transformation of industrial infrastructure. However, transitioning "inside-out" autonomous driving models to "outside-in" static camera networks presents significant challenges due to heterogeneous camera placements and extreme occlusion. In this paper, we present an adapted Sparse4D framework specifically optimized for large-scale infrastructure environments. Our system leverages absolute world-coordinate geometric priors and introduces an occlusion-aware ReID embedding module to maintain identity stability across distributed sensor networks. To bridge the Sim2Real domain gap without manual labeling, we employ a generative data augmentation strategy using the NVIDIA COSMOS framework, creating diverse environmental styles that enhance the model's appearance-invariance. Evaluated on the AI City Challenge 2025 benchmark, our camera-only framework achieves a state-of-the-art HOTA of $45.22$. Furthermore, we address real-time deployment constraints by developing an optimized TensorRT plugin for Multi-Scale Deformable Aggregation (MSDA). Our hardware-accelerated implementation achieves a $2.15\times$ speedup on modern GPU architectures, enabling a single Blackwell-class GPU to support over 64 concurrent camera streams.
Chinese Translation
准确的三维物体感知和多目标多摄像头(MTMC)跟踪对于工业基础设施的数字化转型至关重要。然而,将“内向外”自主驾驶模型转变为“外向内”静态摄像头网络面临着由于摄像头布局异构和极端遮挡带来的重大挑战。本文提出了一种专门针对大规模基础设施环境优化的改进型Sparse4D框架。我们的系统利用绝对世界坐标几何先验,并引入了一个考虑遮挡的ReID嵌入模块,以保持分布式传感器网络中的身份稳定性。为了在没有人工标注的情况下弥合Sim2Real领域差距,我们采用了基于NVIDIA COSMOS框架的生成数据增强策略,创造多样的环境风格,以增强模型的外观不变性。在AI City Challenge 2025基准测试中,我们的仅摄像头框架达到了45.22的最先进HOTA值。此外,我们通过开发一个优化的TensorRT插件用于多尺度可变形聚合(MSDA)来解决实时部署的限制。我们的硬件加速实现在现代GPU架构上实现了2.15倍的加速,使单个Blackwell级GPU能够支持超过64个并发摄像头流。
cs.CV / 4

Can Vision-Language Models Understand Construction Workers? An Exploratory Study

视觉-语言模型能理解建筑工人吗?一项探索性研究
Bui, Hieu, Chodosh, Nathaniel E., Tavakoli, Arash
Abstract
As robotics become increasingly integrated into construction workflows, their ability to interpret and respond to human behavior will be essential for enabling safe and effective collaboration. Vision-Language Models (VLMs) have emerged as a promising tool for visual understanding tasks and offer the potential to recognize human behaviors without extensive domain-specific training. This capability makes them particularly appealing in the construction domain, where labeled data is scarce and monitoring worker actions and emotional states is critical for safety and productivity. In this study, we evaluate the performance of three leading VLMs, GPT-4o, Florence 2, and LLaVa-1.5, in detecting construction worker actions and emotions from static site images. Using a curated dataset of 1,000 images annotated across ten action and ten emotion categories, we assess each model's outputs through standardized inference pipelines and multiple evaluation metrics. GPT-4o consistently achieved the highest scores across both tasks, with an average F1-score of 0.756 and accuracy of 0.799 in action recognition, and an F1-score of 0.712 and accuracy of 0.773 in emotion recognition. Florence 2 performed moderately, with F1-scores of 0.497 for action and 0.414 for emotion, while LLaVa-1.5 showed the lowest overall performance, with F1-scores of 0.466 for action and 0.461 for emotion. Confusion matrix analyses revealed that all models struggled to distinguish semantically close categories, such as collaborating in teams versus communicating with supervisors. While the results indicate that general-purpose VLMs can offer a baseline capability for human behavior recognition in construction environments, further improvements, such as domain adaptation, temporal modeling, or multimodal sensing, may be needed for real-world reliability.
Chinese Translation
随着机器人技术越来越多地融入建筑工作流程,它们解读和响应人类行为的能力将对实现安全有效的协作至关重要。视觉-语言模型(Vision-Language Models, VLMs)作为一种有前景的视觉理解工具,能够在没有广泛领域特定训练的情况下识别人类行为。这一能力使其在建筑领域尤为吸引人,因为该领域标注数据稀缺,而监测工人的行为和情感状态对安全和生产力至关重要。在本研究中,我们评估了三种领先的视觉-语言模型,GPT-4o、Florence 2 和 LLaVa-1.5,在从静态工地图像中检测建筑工人动作和情感的表现。我们使用了一个包含1000张图像的精心策划的数据集,这些图像在十个动作和十个情感类别上进行了标注,通过标准化推理流程和多种评估指标评估每个模型的输出。GPT-4o在两个任务中始终获得最高分,动作识别的平均F1分数为0.756,准确率为0.799,情感识别的F1分数为0.712,准确率为0.773。Florence 2表现中等,动作的F1分数为0.497,情感的F1分数为0.414,而LLaVa-1.5的整体表现最低,动作的F1分数为0.466,情感的F1分数为0.461。混淆矩阵分析显示,所有模型在区分语义相近的类别(如团队协作与与主管沟通)时都存在困难。尽管结果表明通用视觉-语言模型在建筑环境中可以提供人类行为识别的基线能力,但可能需要进一步改进,例如领域适应、时间建模或多模态感知,以确保在实际应用中的可靠性。
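The F1-scores and accuracies reported above follow from standard precision/recall arithmetic. As a quick reference, per-class scores can be computed from raw counts as below; the counts are made up for illustration, and the macro scores the study reports would average these over the ten classes.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from raw counts of true
    positives, false positives, and false negatives.  Guards against
    division by zero for classes the model never predicts or misses."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# hypothetical counts for one action class
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```

Since F1 is the harmonic mean of precision and recall, it penalizes a model that trades one heavily for the other, which matters for the semantically close classes the confusion matrices flagged.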
cs.CV / 5

One Model, Many Behaviors: Training-Induced Effects on Out-of-Distribution Detection

一个模型,多种行为:训练引发的分布外检测效果
Krumpl, Gerhard, Avenhaus, Henning, Possegger, Horst
Abstract
Out-of-distribution (OOD) detection is crucial for deploying robust and reliable machine-learning systems in open-world settings. Despite steady advances in OOD detectors, their interplay with modern training pipelines that maximize in-distribution (ID) accuracy and generalization remains under-explored. We investigate this link through a comprehensive empirical study. Fixing the architecture to the widely adopted ResNet-50, we benchmark 21 post-hoc, state-of-the-art OOD detection methods across 56 ImageNet-trained models obtained via diverse training strategies and evaluate them on eight OOD test sets. Contrary to the common assumption that higher ID accuracy implies better OOD detection performance, we uncover a non-monotonic relationship: OOD performance initially improves with accuracy but declines once advanced training recipes push accuracy beyond the baseline. Moreover, we observe a strong interdependence between training strategy, detector choice, and resulting OOD performance, indicating that no single method is universally optimal.
Chinese Translation
分布外(OOD)检测对于在开放世界环境中部署稳健可靠的机器学习系统至关重要。尽管OOD检测器在不断进步,但它们与现代训练流程之间的相互作用,尤其是那些最大化分布内(ID)准确性和泛化能力的流程,仍然未得到充分探索。我们通过一项全面的实证研究来探讨这一联系。将架构固定为广泛采用的ResNet-50,我们对21种后处理的最先进OOD检测方法进行了基准测试,这些方法在56个通过多种训练策略获得的ImageNet训练模型上进行评估,并在八个OOD测试集上进行测试。与普遍假设的更高ID准确性意味着更好OOD检测性能的观点相反,我们发现了一种非单调关系:OOD性能最初随着准确性的提高而改善,但一旦先进的训练方案将准确性推高至基线之上,性能则会下降。此外,我们观察到训练策略、检测器选择与最终OOD性能之间存在强烈的相互依赖性,这表明没有任何单一方法是普遍最优的。
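Another post-hoc detector such a study benchmarks is the energy score, computed directly from a classifier's logits, where lower energy indicates a more in-distribution input. A minimal sketch, for illustration rather than the paper's implementation:

```python
import math

def energy_score(logits, temperature=1.0):
    """Energy-based OOD score: -T * logsumexp(logits / T), computed
    with the max-shift trick for numerical stability.  Lower energy
    suggests a more in-distribution input."""
    m = max(z / temperature for z in logits)
    s = sum(math.exp(z / temperature - m) for z in logits)
    return -temperature * (m + math.log(s))

id_like = energy_score([10.0, 0.0, 0.0])   # confident logits
ood_like = energy_score([0.1, 0.0, 0.2])   # flat logits
```

Because the score is a fixed function of the logits, swapping the training recipe changes the logit distribution and hence detector behavior, which is exactly the interdependence the study measures.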
cs.CV / 6

Effects of Different Attention Mechanisms Applied on 3D Models in Video Classification

不同注意力机制在视频分类中的3D模型应用效果
Rasras, Mohammad, Marin, Iuliana, Radu, Serban, Mocanu, Irina
Abstract
Human action recognition has become an important research focus in computer vision due to its wide range of applications. 3D ResNet-based CNN models, particularly MC3, R3D, and R(2+1)D, use different convolutional filters to extract spatiotemporal features. This paper investigates the impact of reducing the knowledge captured from temporal data while increasing the resolution of the frames. To set up this experiment, we first created designs similar to the three originals, but with a dropout layer added before the final classifier. We then developed ten new versions of each of these three designs. The variants include special attention blocks within their architecture, such as the convolutional block attention module (CBAM) and temporal convolution networks (TCN), in addition to multi-headed and channel attention mechanisms. The purpose is to observe how much each of these blocks influences the performance of the restricted-temporal models. Testing all the models on UCF101 showed an accuracy of 88.98% for the variant with multi-headed attention added to the modified R(2+1)D. The paper concludes by highlighting the significance of the missing temporal features for the performance of the newly created increased-resolution models. The variants behaved differently in class-level accuracy, despite contributing similar enhancements to overall performance.
Chinese Translation
人类动作识别已成为计算机视觉领域的重要研究焦点,因其在广泛应用中的重要性。基于3D Resnet的卷积神经网络(CNN)模型,特别是MC3、R3D和R(2+1)D,采用不同的卷积滤波器来提取时空特征。本文研究了在提高帧分辨率的同时,减少从时间数据中捕获的知识的影响。为了建立这一实验,我们创建了与三种原始设计相似的模型,但在最终分类器之前添加了一个dropout层。其次,我们为这三种设计分别开发了十个新版本。这些变体在其架构中包含特殊的注意力模块,如卷积块注意力模块(CBAM)、时间卷积网络(TCN),以及多头和通道注意力机制。这样做的目的是观察这些模块对限制时间模型性能的影响程度。在UCF101上测试所有模型的结果显示,添加多头注意力的修改版R(2+1)D的准确率达到了88.98%。本文总结了缺失时间特征对新创建的高分辨率模型性能的重要性。尽管这些变体在整体性能的增强上相似,但在类别级准确性上表现出不同的行为。
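The channel-attention idea behind blocks like CBAM can be sketched without a deep-learning framework: pool each channel to a scalar, squash it into a (0,1) gate, and rescale the channel. Real CBAM uses a learned shared MLP plus parallel max-pooling (and a spatial branch); this toy version omits all of that.

```python
import math

def channel_attention(feature_maps):
    """Toy squeeze-and-excite-style channel attention: global average
    pool per channel, sigmoid gate, channel-wise rescaling.  A learned
    MLP between pooling and gating is omitted for brevity."""
    gated = []
    for channel in feature_maps:                 # channel: 2-D list
        flat = [v for row in channel for v in row]
        pooled = sum(flat) / len(flat)           # global average pool
        gate = 1.0 / (1.0 + math.exp(-pooled))   # sigmoid gate in (0,1)
        gated.append([[v * gate for v in row] for row in channel])
    return gated

maps = [[[1.0, 1.0], [1.0, 1.0]],        # active channel, gate > 0.5
        [[-4.0, -4.0], [-4.0, -4.0]]]    # suppressed channel, gate < 0.1
out = channel_attention(maps)
```

The gate lets the network amplify informative channels and damp uninformative ones, which is the effect the attention variants in the study are probing.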
cs.CV / 7

Medical SAM3: A Foundation Model for Universal Prompt-Driven Medical Image Segmentation

Medical SAM3:一种通用提示驱动的医学图像分割基础模型
Jiang, Chongcong, Ding, Tianxingjian, Song, Chuhan, Tu, Jiachen, Yan, Ziyang, Shao, Yihua, Wang, Zhenyi, Shang, Yuzhang, Han, Tianyu, Tian, Yu
Abstract
Promptable segmentation foundation models such as SAM3 have demonstrated strong generalization capabilities through interactive and concept-based prompting. However, their direct applicability to medical image segmentation remains limited by severe domain shifts, the absence of privileged spatial prompts, and the need to reason over complex anatomical and volumetric structures. Here we present Medical SAM3, a foundation model for universal prompt-driven medical image segmentation, obtained by fully fine-tuning SAM3 on large-scale, heterogeneous 2D and 3D medical imaging datasets with paired segmentation masks and text prompts. Through a systematic analysis of vanilla SAM3, we observe that its performance degrades substantially on medical data, with its apparent competitiveness largely relying on strong geometric priors such as ground-truth-derived bounding boxes. These findings motivate full model adaptation beyond prompt engineering alone. By fine-tuning SAM3's model parameters on 33 datasets spanning 10 medical imaging modalities, Medical SAM3 acquires robust domain-specific representations while preserving prompt-driven flexibility. Extensive experiments across organs, imaging modalities, and dimensionalities demonstrate consistent and significant performance gains, particularly in challenging scenarios characterized by semantic ambiguity, complex morphology, and long-range 3D context. Our results establish Medical SAM3 as a universal, text-guided segmentation foundation model for medical imaging and highlight the importance of holistic model adaptation for achieving robust prompt-driven segmentation under severe domain shift. Code and model will be made available at https://github.com/AIM-Research-Lab/Medical-SAM3.
Chinese Translation
可提示的分割基础模型,如SAM3,已通过交互式和基于概念的提示展示出强大的泛化能力。然而,由于严重的领域转移、缺乏特权空间提示以及需要对复杂的解剖和体积结构进行推理,它们在医学图像分割中的直接适用性仍然有限。在此,我们提出了Medical SAM3,这是一种通用的提示驱动医学图像分割基础模型,通过在大规模异构的2D和3D医学影像数据集上,结合配对的分割掩膜和文本提示,对SAM3进行全面微调而获得。通过对原始SAM3的系统分析,我们观察到其在医学数据上的性能显著下降,其明显的竞争力在很大程度上依赖于强几何先验,例如基于真实值的边界框。这些发现促使我们在模型适应上超越仅仅是提示工程。通过在涵盖10种医学影像模态的33个数据集上微调SAM3的模型参数,Medical SAM3获得了稳健的领域特定表示,同时保持了提示驱动的灵活性。针对器官、影像模态和维度的广泛实验表明,在语义模糊、复杂形态和长距离3D上下文等具有挑战性的场景中,性能持续显著提升。我们的结果确立了Medical SAM3作为医学影像的通用文本引导分割基础模型,并强调了在严重领域转移下实现稳健提示驱动分割的整体模型适应的重要性。代码和模型将在 https://github.com/AIM-Research-Lab/Medical-SAM3 上发布。
cs.CV / 8

FrankenMotion: Part-level Human Motion Generation and Composition

FrankenMotion:基于部位的人体运动生成与组合
Li, Chuqiao, Xie, Xianghui, Cao, Yong, Geiger, Andreas, Pons-Moll, Gerard
Abstract
Human motion generation from text prompts has made remarkable progress in recent years. However, existing methods primarily rely on either sequence-level or action-level descriptions due to the absence of fine-grained, part-level motion annotations. This limits their controllability over individual body parts. In this work, we construct a high-quality motion dataset with atomic, temporally-aware part-level text annotations, leveraging the reasoning capabilities of large language models (LLMs). Unlike prior datasets that either provide synchronized part captions with fixed time segments or rely solely on global sequence labels, our dataset captures asynchronous and semantically distinct part movements at fine temporal resolution. Based on this dataset, we introduce a diffusion-based part-aware motion generation framework, namely FrankenMotion, where each body part is guided by its own temporally-structured textual prompt. This is, to our knowledge, the first work to provide atomic, temporally-aware part-level motion annotations and have a model that allows motion generation with both spatial (body part) and temporal (atomic action) control. Experiments demonstrate that FrankenMotion outperforms all previous baseline models adapted and retrained for our setting, and our model can compose motions unseen during training. Our code and dataset will be publicly available upon publication.
Chinese Translation
近年来,从文本提示生成人体运动取得了显著进展。然而,现有方法主要依赖于序列级或动作级描述,原因在于缺乏细粒度的部位级运动注释。这限制了对个别身体部位的可控性。在本研究中,我们构建了一个高质量的运动数据集,包含原子级、时间感知的部位级文本注释,利用大型语言模型(LLMs)的推理能力。与之前的数据集不同,后者要么提供与固定时间段同步的部位标题,要么仅依赖于全局序列标签,我们的数据集捕捉了在细时间分辨率下异步且语义上独特的部位运动。基于该数据集,我们提出了一种基于扩散的部位感知运动生成框架,即FrankenMotion,其中每个身体部位由其自身的时间结构文本提示引导。据我们所知,这是首个提供原子级、时间感知的部位级运动注释的工作,并且拥有一个允许在空间(身体部位)和时间(原子动作)上进行控制的运动生成模型。实验表明,FrankenMotion在我们设定下优于所有先前的基线模型,并且我们的模型能够组合训练期间未见的运动。我们的代码和数据集将在发表后公开。
cs.CV / 9

Classification of Chest XRay Diseases through image processing and analysis techniques

通过图像处理和分析技术对胸部X光疾病进行分类
Novoa, Santiago Martínez, Ibáñez, María Catalina, Mesa, Lina Gómez, Kramer, Jeremias
Abstract
Chest X-ray imaging is one of the most prevalent forms of radiological examination used for diagnosing thoracic diseases. In this study, we offer a concise overview of several methods for multi-class classification of chest X-ray images, including DenseNet121. In addition, we deploy an open-source web-based application. We conduct experiments to compare the different methods and assess how well they perform, examine the weaknesses of the proposed approaches, and suggest directions for future improvement. Our code is available at: https://github.com/AML4206-MINE20242/Proyecto_AML
Chinese Translation
多分类胸部X光图像是用于诊断胸部疾病的最常见放射学检查形式之一。在本研究中,我们简要概述了几种用于解决此任务的方法,包括DenseNet121。此外,我们还部署了一个开源的基于网络的应用程序。在我们的研究中,我们进行了测试,以比较不同方法的效果,并观察它们的表现。我们还仔细分析了我们提出的方法的弱点,并提出了未来改进的建议。我们的代码可在以下链接获取:https://github.com/AML4206-MINE20242/Proyecto_AML
cs.CV / 10

Self-learned representation-guided latent diffusion model for breast cancer classification in deep ultraviolet whole surface images

自学习表示引导的潜在扩散模型用于深紫外全表面图像中的乳腺癌分类
Afshin, Pouya, Helminiak, David, Niu, Tianling, Jorns, Julie M., Yen, Tina, Yu, Bing, Ye, Dong Hye
Abstract
Breast-Conserving Surgery (BCS) requires precise intraoperative margin assessment to preserve healthy tissue. Deep Ultraviolet Fluorescence Scanning Microscopy (DUV-FSM) offers rapid, high-resolution surface imaging for this purpose; however, the scarcity of annotated DUV data hinders the training of robust deep learning models. To address this, we propose a Self-Supervised Learning (SSL)-guided Latent Diffusion Model (LDM) to generate high-quality synthetic training patches. By guiding the LDM with embeddings from a fine-tuned DINO teacher, we inject rich semantic details of cellular structures into the synthetic data. We combine real and synthetic patches to fine-tune a Vision Transformer (ViT), utilizing patch prediction aggregation for WSI-level classification. Experiments using 5-fold cross-validation demonstrate that our method achieves 96.47% accuracy and reduces the FID score to 45.72, significantly outperforming class-conditioned baselines.
Chinese Translation
乳腺保留手术(BCS)需要精确的术中边缘评估以保护健康组织。深紫外荧光扫描显微镜(DUV-FSM)为此提供了快速、高分辨率的表面成像;然而,标注的DUV数据稀缺限制了稳健深度学习模型的训练。为了解决这一问题,我们提出了一种自监督学习(SSL)引导的潜在扩散模型(LDM),用于生成高质量的合成训练补丁。通过使用经过微调的DINO教师的嵌入来引导LDM,我们将细胞结构的丰富语义细节注入合成数据中。我们结合真实和合成补丁来微调视觉变换器(ViT),利用补丁预测聚合进行WSI级别的分类。使用5折交叉验证的实验表明,我们的方法达到了96.47%的准确率,并将FID分数降低至45.72,显著优于基于类别的基线。
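The WSI-level classification via patch prediction aggregation can be sketched as follows; averaging softmax probabilities is one plausible aggregation rule, since the abstract does not specify the exact scheme:

```python
import numpy as np

def aggregate_patch_predictions(patch_logits):
    """Slide-level prediction from per-patch logits by averaging softmax
    probabilities. This is one plausible aggregation rule; the abstract
    does not specify the scheme actually used."""
    logits = np.asarray(patch_logits, dtype=float)
    z = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    slide_probs = probs.mean(axis=0)                      # average over patches
    return slide_probs, int(slide_probs.argmax())

# Three patches, two classes (0 = benign, 1 = malignant; labels illustrative).
probs, label = aggregate_patch_predictions([[0.2, 2.1], [1.5, 0.3], [0.1, 3.0]])
```

Max-voting or attention-weighted pooling are common alternatives; mean pooling keeps the sketch minimal.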
cs.CV / 11

RobuMTL: Enhancing Multi-Task Learning Robustness Against Weather Conditions

RobuMTL:增强多任务学习在天气条件下的鲁棒性
Shaffee, Tasneem, Reda, Sherief
Abstract
Robust Multi-Task Learning (MTL) is crucial for autonomous systems operating in real-world environments, where adverse weather conditions can severely degrade model performance and reliability. In this paper, we introduce RobuMTL, a novel architecture designed to adaptively address visual degradation by dynamically selecting task-specific hierarchical Low-Rank Adaptation (LoRA) modules and a LoRA expert squad based on input perturbations, in a mixture-of-experts fashion. Our framework enables adaptive specialization based on input characteristics, improving robustness across diverse real-world conditions. To validate our approach, we evaluated it on the PASCAL and NYUD-v2 datasets and compared it against single-task models, standard MTL baselines, and state-of-the-art methods. On the PASCAL benchmark, RobuMTL delivers a +2.8% average relative improvement under single perturbations and up to +44.4% under mixed weather conditions compared to the MTL baseline. On NYUD-v2, RobuMTL achieves a +9.7% average relative improvement across tasks. The code is available on GitHub.
Chinese Translation
鲁棒的多任务学习(MTL)对于在真实环境中运行的自主系统至关重要,因为恶劣的天气条件可能严重降低模型的性能和可靠性。本文介绍了RobuMTL,一种新颖的架构,旨在通过动态选择特定任务的分层低秩适应(LoRA)模块和基于输入扰动的LoRA专家小组,以混合专家的方式自适应地应对视觉退化。我们的框架使得根据输入特征进行自适应专业化成为可能,从而提高了在多种真实世界条件下的鲁棒性。为了验证我们的方法,我们在PASCAL和NYUD-v2数据集上进行了评估,并与单任务模型、标准MTL基线和最先进的方法进行了比较。在PASCAL基准上,RobuMTL在单一扰动下提供了+2.8%的平均相对提升,在混合天气条件下则提高了高达+44.4%相较于MTL基线。在NYUD-v2上,RobuMTL在各任务间实现了+9.7%的平均相对提升。代码可在GitHub上获取。
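The LoRA expert selection can be pictured with a toy sketch; the adapter shapes and the gating rule below are our assumptions, standing in for RobuMTL's perturbation-aware mixture-of-experts routing:

```python
import numpy as np

class LoRAExpert:
    """Rank-r low-rank adapter: the effective weight is W + (alpha/r) * B @ A.
    Shapes and initialization follow common LoRA practice; they are not
    taken from RobuMTL's actual design."""
    def __init__(self, d_in, d_out, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(scale=0.01, size=(r, d_in))   # down-projection
        self.B = np.zeros((d_out, r))                     # up-projection, zero init
        self.scale = alpha / r

    def delta(self, x):
        """Adapter contribution to the layer output for input x."""
        return self.scale * (x @ self.A.T @ self.B.T)

def route(x, experts):
    """Toy gate: pick one expert from the squad based on an input statistic,
    a stand-in for the perturbation-conditioned selection in the paper."""
    idx = int(np.linalg.norm(x) * 10) % len(experts)
    return experts[idx], idx

experts = [LoRAExpert(16, 16, seed=s) for s in range(3)]
x = np.ones(16)
expert, idx = route(x, experts)
```

Zero-initializing B means each adapter starts as a no-op, so adding an expert never disturbs the pretrained backbone before fine-tuning.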
cs.CV / 12

Sparse Data Tree Canopy Segmentation: Fine-Tuning Leading Pretrained Models on Only 150 Images

稀疏数据树冠分割:仅基于150张图像微调领先的预训练模型
Szczecina, David, Sun, Hudson, Bertnyk, Anthony, Azad, Niloofar, Gao, Kyle, Xu, Lincoln Linlin
Abstract
Tree canopy detection from aerial imagery is an important task for environmental monitoring, urban planning, and ecosystem analysis. Simulating real-life data annotation scarcity, the Solafune Tree Canopy Detection competition provides a small and imbalanced dataset of only 150 annotated images, posing significant challenges for training deep models without severe overfitting. In this work, we evaluate five representative architectures, YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2, to assess their suitability for canopy segmentation under extreme data scarcity. Our experiments show that pretrained convolution-based models, particularly YOLOv11 and Mask R-CNN, generalize significantly better than pretrained transformer-based models. DeepLabv3, Swin-UNet and DINOv2 underperform, likely due to differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases. These findings confirm that transformer-based architectures struggle in low-data regimes without substantial pretraining or augmentation and that differences between semantic and instance segmentation further affect model performance. We provide a detailed analysis of training strategies, augmentation policies, and model behavior under the small-data constraint and demonstrate that lightweight CNN-based methods remain the most reliable for canopy detection on limited imagery.
Chinese Translation
从航空影像中检测树冠是环境监测、城市规划和生态系统分析的重要任务。为了模拟现实生活中数据标注稀缺的情况,Solafune树冠检测竞赛提供了一个仅包含150张标注图像的小型不平衡数据集,这对在不严重过拟合的情况下训练深度模型提出了重大挑战。在本研究中,我们评估了五种代表性架构:YOLOv11、Mask R-CNN、DeepLabv3、Swin-UNet和DINOv2,以评估它们在极端数据稀缺情况下进行树冠分割的适用性。我们的实验表明,基于卷积的预训练模型,特别是YOLOv11和Mask R-CNN,相较于基于变换器的预训练模型具有显著更好的泛化能力。DeeplabV3、Swin-UNet和DINOv2的表现较差,可能是由于语义分割和实例分割任务之间的差异、视觉变换器对数据的高需求以及缺乏强的归纳偏差。这些发现证实了基于变换器的架构在低数据环境下缺乏实质性预训练或数据增强时的困难,并且语义分割与实例分割之间的差异进一步影响了模型性能。我们提供了在小数据约束下的训练策略、增强策略和模型行为的详细分析,并证明轻量级的基于CNN的方法在有限影像上的树冠检测中仍然是最可靠的选择。
cs.CV / 13

PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis

PatientVLM与DocVLM的结合:基于视觉-语言模型的高效诊断前咨询对话
Lokesh, K, Penamakuri, Abhirama Subramanyam, Agarwal, Uday, Challa, Apoorva, Gowda, Shreya K, Gupta, Somesh, Mishra, Anand
Abstract
Traditionally, AI research in medical diagnosis has largely centered on image analysis. While this has led to notable advancements, the absence of patient-reported symptoms continues to hinder diagnostic accuracy. To address this, we propose a Pre-Consultation Dialogue Framework (PCDF) that mimics real-world diagnostic procedures, where doctors iteratively query patients before reaching a conclusion. Specifically, we simulate diagnostic dialogues between two vision-language models (VLMs): a DocVLM, which generates follow-up questions based on the image and dialogue history, and a PatientVLM, which responds using a symptom profile derived from the ground-truth diagnosis. We additionally conducted a small-scale clinical validation of the synthetic symptoms generated by our framework, with licensed clinicians confirming their clinical relevance, symptom coverage, and overall realism. These findings indicate that the resulting DocVLM-PatientVLM interactions form coherent, multi-turn consultations paired with images and diagnoses, which we then use to fine-tune the DocVLM. This dialogue-based supervision leads to substantial gains over image-only training, highlighting the value of realistic symptom elicitation for diagnosis.
Chinese Translation
传统上,人工智能在医学诊断中的研究主要集中在图像分析上。尽管这带来了显著的进展,但缺乏患者报告的症状仍然阻碍了诊断的准确性。为了解决这一问题,我们提出了一种前咨询对话框架(Pre-Consultation Dialogue Framework, PCDF),模拟现实世界的诊断程序,在此过程中,医生在得出结论之前会反复询问患者。具体而言,我们模拟了两个视觉-语言模型(Vision-Language Models, VLMs)之间的诊断对话:DocVLM,它根据图像和对话历史生成后续问题;以及PatientVLM,它使用基于真实诊断的症状概况进行回应。此外,我们还对我们框架生成的合成症状进行了小规模的临床验证,获得了持证临床医生对其临床相关性、症状覆盖率和整体真实性的确认。这些发现表明,生成的DocVLM-PatientVLM交互形成了连贯的多轮咨询,并与图像和诊断相结合,我们随后利用这些对话来微调DocVLM。这种基于对话的监督相较于仅基于图像的训练带来了显著的提升,突显了现实症状引导在诊断中的价值。
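The iterative DocVLM-PatientVLM exchange can be sketched as a simple closed loop; the two callables below are toy stand-ins for the actual vision-language models:

```python
def pre_consultation_dialogue(doc_ask, patient_answer, image, max_turns=3):
    """Closed-loop consultation sketch: the doctor model asks follow-up
    questions conditioned on the image and dialogue history until it is
    ready to conclude (signalled here by returning None). The loop shape
    is our reading of the framework, not its exact implementation."""
    history = []
    for _ in range(max_turns):
        question = doc_ask(image, history)
        if question is None:          # doctor is ready to diagnose
            break
        history.append((question, patient_answer(question)))
    return history

# Toy stand-ins for the two VLMs.
questions = iter(["Any fever?", "Duration of cough?"])
doc = lambda img, h: next(questions, None)
patient = lambda q: {"Any fever?": "Yes, 38.5C",
                     "Duration of cough?": "Two weeks"}[q]
dialogue = pre_consultation_dialogue(doc, patient, image="chest_xray.png")
```

The resulting (question, answer) history, paired with the image and ground-truth diagnosis, is the kind of supervision the paper uses to fine-tune the DocVLM.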
cs.CV / 14

MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement

MMedExpert-R1:通过领域特定适应和临床指南强化增强多模态医学推理
Ding, Meidan, Zhang, Jipeng, Wang, Wenxuan, Zhong, Haiqin, Luo, Xiaoling, Chen, Wenting, Shen, Linlin
Abstract
Medical Vision-Language Models (MedVLMs) excel at perception tasks but struggle with complex clinical reasoning required in real-world scenarios. While reinforcement learning (RL) has been explored to enhance reasoning capabilities, existing approaches face critical mismatches: deep reasoning data are scarce, cold-start initialization limits multi-specialty alignment, and standard RL algorithms fail to model the diversity of clinical reasoning. We propose MMedExpert-R1, a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and clinical guideline reinforcement. We construct MMedExpert, a high-quality dataset of 10K samples across four specialties with step-by-step reasoning traces. Our Domain-Specific Adaptation (DSA) creates specialty-specific LoRA modules to provide diverse initialization, while Guideline-Based Advantages (GBA) explicitly models different clinical reasoning perspectives to align with real-world diagnostic strategies. Conflict-Aware Capability Integration then merges these specialized experts into a unified agent, ensuring robust multi-specialty alignment. Comprehensive experiments demonstrate state-of-the-art performance, with our 7B model achieving 27.50 on MedXpert-MM and 83.03 on OmniMedVQA, establishing a robust foundation for reliable multimodal medical reasoning systems.
Chinese Translation
医学视觉语言模型(MedVLMs)在感知任务中表现出色,但在现实场景中所需的复杂临床推理方面存在困难。尽管强化学习(RL)已被探索以增强推理能力,但现有方法面临关键不匹配:深度推理数据稀缺、冷启动限制多专业对齐,且标准RL算法无法建模临床推理的多样性。我们提出了MMedExpert-R1,这是一种新颖的推理型MedVLM,旨在通过领域特定适应和临床指南强化来应对这些挑战。我们构建了MMedExpert,这是一个涵盖四个专业的高质量数据集,包含10K个样本及其逐步推理轨迹。我们的领域特定适应(DSA)创建了专业特定的LoRA模块,以提供多样化的初始化,而基于指南的优势(GBA)则明确建模不同的临床推理视角,以与现实世界的诊断策略对齐。冲突感知能力整合随后将这些专业专家合并为一个统一的代理,确保稳健的多专业对齐。全面的实验表明我们的模型在性能上处于最前沿,我们的7B模型在MedXpert-MM上达到了27.50,在OmniMedVQA上达到了83.03,为可靠的多模态医学推理系统奠定了坚实的基础。
cs.CV / 15

IDDR-NGP: Incorporating Detectors for Distractor Removal with Instant Neural Radiance Field

IDDR-NGP:结合探测器进行干扰物去除的即时神经辐射场
Huang, Xianliang, Gou, Jiajie, Chen, Shuhang, Zhong, Zhizhou, Guan, Jihong, Zhou, Shuigeng
Abstract
This paper presents the first unified distractor removal method, named IDDR-NGP, which operates directly on Instant-NGP. The method is able to remove a wide range of distractors in 3D scenes, such as snowflakes, confetti, defoliation and petals, whereas existing methods usually focus on a specific type of distractor. By incorporating implicit 3D representations with 2D detectors, we demonstrate that it is possible to efficiently restore 3D scenes from multiple corrupted images. We design the learned perceptual image patch similarity (LPIPS) loss and the multi-view compensation loss (MVCL) to jointly optimize the rendering results of IDDR-NGP, which can aggregate information from multi-view corrupted images. The whole pipeline can be trained in an end-to-end manner to synthesize high-quality 3D scenes. To support research on distractor removal in implicit 3D representations, we build a new benchmark dataset that consists of both synthetic and real-world distractors. To validate the effectiveness and robustness of IDDR-NGP, we provide a wide range of distractors with corresponding annotated labels added to both realistic and synthetic scenes. Extensive experimental results demonstrate the effectiveness and robustness of IDDR-NGP in removing multiple types of distractors. In addition, our approach achieves results comparable with existing SOTA desnowing methods and is capable of accurately removing both realistic and synthetic distractors.
Chinese Translation
本文提出了一种首个统一的干扰物去除方法,命名为IDDR-NGP,该方法直接作用于Instant-NPG。该方法能够去除3D场景中各种干扰物,如雪花、五彩纸屑、落叶和花瓣,而现有方法通常只关注特定类型的干扰物。通过将隐式3D表示与2D探测器结合,我们展示了从多张受损图像中高效恢复3D场景的可能性。我们设计了学习感知图像块相似性(LPIPS)损失和多视图补偿损失(MVCL),以共同优化IDDR-NGP的渲染结果,这可以从多视图受损图像中聚合信息。所有这些都可以以端到端的方式进行训练,以合成高质量的3D场景。为了支持隐式3D表示中干扰物去除的研究,我们构建了一个新的基准数据集,该数据集包含合成和真实世界的干扰物。为了验证IDDR-NGP的有效性和鲁棒性,我们提供了广泛的干扰物及其对应的标注标签,添加到现实和合成场景中。大量实验结果证明了IDDR-NGP在去除多种类型干扰物方面的有效性和鲁棒性。此外,我们的方法在去雪效果上达到了与现有最先进(SOTA)去雪方法相当的结果,并能够准确去除真实和合成的干扰物。
cs.CV / 16

Your One-Stop Solution for AI-Generated Video Detection

您的AI生成视频检测一站式解决方案
Ma, Long, Xue, Zihao, Wang, Yan, Yan, Zhiyuan, Xu, Jin, Jiang, Xiaorui, Yu, Haiyang, Liao, Yong, Bi, Zhen
Abstract
Recent advances in generative modeling can create remarkably realistic synthetic videos, making it increasingly difficult for humans to distinguish them from real ones and necessitating reliable detection methods. However, two key limitations hinder the development of this field. From the dataset perspective, existing datasets are often limited in scale and constructed using outdated or narrowly scoped generative models, making it difficult to capture the diversity and rapid evolution of modern generative techniques. Moreover, the dataset construction process frequently prioritizes quantity over quality, neglecting essential aspects such as semantic diversity, scenario coverage, and technological representativeness. From the benchmark perspective, current benchmarks largely remain at the stage of dataset creation, leaving many fundamental issues and in-depth analyses yet to be systematically explored. Addressing this gap, we propose AIGVDBench, a benchmark designed to be comprehensive and representative, covering 31 state-of-the-art generation models and over 440,000 videos. By executing more than 1,500 evaluations on 33 existing detectors belonging to four distinct categories, this work presents 8 in-depth analyses from multiple perspectives and identifies 4 novel findings that offer valuable insights for future research. We hope this work provides a solid foundation for advancing the field of AI-generated video detection. Our benchmark is open-sourced at https://github.com/LongMa-2025/AIGVDBench.
Chinese Translation
最近在生成建模方面的进展使得生成的合成视频变得极为逼真,这使得人类越来越难以将其与真实视频区分开来,因此迫切需要可靠的检测方法。然而,两个关键限制因素阻碍了该领域的发展。从数据集的角度来看,现有的数据集通常规模有限,并且是基于过时或狭窄范围的生成模型构建的,这使得难以捕捉现代生成技术的多样性和快速演变。此外,数据集构建过程往往优先考虑数量而非质量,忽视了语义多样性、场景覆盖和技术代表性等重要方面。从基准的角度来看,当前的基准大多停留在数据集创建阶段,许多基础问题和深入分析尚未得到系统性探索。为了解决这一空白,我们提出了AIGVDBench,这是一个旨在全面且具有代表性的基准,涵盖了31种最先进的生成模型和超过440,000个视频。通过对33个属于四个不同类别的现有检测器进行超过1,500次评估,本研究从多个角度呈现了8个深入分析,并识别出4个新发现,为未来的研究提供了宝贵的见解。我们希望这项工作为推动AI生成视频检测领域的发展奠定坚实的基础。我们的基准已在https://github.com/LongMa-2025/AIGVDBench上开源。
cs.CV / 17

M3DDM+: An improved video outpainting by a modified masking strategy

M3DDM+: 一种通过改进的掩膜策略提升视频外绘的算法
Murakawa, Takuya, Fukuzawa, Takumi, Ding, Ning, Tamaki, Toru
Abstract
M3DDM provides a computationally efficient framework for video outpainting via latent diffusion modeling. However, it exhibits significant quality degradation -- manifested as spatial blur and temporal inconsistency -- under challenging scenarios characterized by limited camera motion or large outpainting regions, where inter-frame information is limited. We identify the cause as a training-inference mismatch in the masking strategy: M3DDM's training applies random mask directions and widths across frames, whereas inference requires consistent directional outpainting throughout the video. To address this, we propose M3DDM+, which applies uniform mask direction and width across all frames during training, followed by fine-tuning of the pretrained M3DDM model. Experiments demonstrate that M3DDM+ substantially improves visual fidelity and temporal coherence in information-limited scenarios while maintaining computational efficiency. The code is available at https://github.com/tamaki-lab/M3DDM-Plus.
Chinese Translation
M3DDM 提供了一种通过潜在扩散建模实现视频外绘的计算高效框架。然而,在受限相机运动或大范围外绘等具有挑战性的场景下,它表现出显著的质量下降——主要表现为空间模糊和时间不一致。我们识别出这一问题的根源在于掩膜策略中的训练-推理不匹配:M3DDM 在训练时对每帧应用随机的掩膜方向和宽度,而推理时则需要在整个视频中保持一致的方向外绘。为了解决这个问题,我们提出了 M3DDM+,该方法在训练期间对所有帧应用统一的掩膜方向和宽度,随后对预训练的 M3DDM 模型进行微调。实验表明,M3DDM+ 在信息受限的场景中显著提高了视觉保真度和时间一致性,同时保持了计算效率。代码可在 https://github.com/tamaki-lab/M3DDM-Plus 获取。
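The training-inference mismatch comes down to how the outpainting masks are drawn per frame; a minimal sketch contrasting the two strategies (1-D masks and sizes are illustrative stand-ins for real video frames):

```python
import numpy as np

def make_outpaint_masks(num_frames, width, direction, mask_frac,
                        uniform=True, seed=0):
    """Build binary masks (1 = region to outpaint) for each frame.
    uniform=True mimics the M3DDM+ fix: one direction and width shared by
    all frames. uniform=False mimics the per-frame randomness blamed for
    the train/inference mismatch. 1-D masks keep the sketch short."""
    rng = np.random.default_rng(seed)
    masks = np.zeros((num_frames, width), dtype=int)
    for t in range(num_frames):
        frac = mask_frac if uniform else rng.uniform(0.1, 0.5)
        side = direction if uniform else rng.choice(["left", "right"])
        k = max(1, int(frac * width))
        if side == "left":
            masks[t, :k] = 1
        else:
            masks[t, -k:] = 1
    return masks

consistent = make_outpaint_masks(4, width=10, direction="right", mask_frac=0.3)
varied = make_outpaint_masks(4, width=10, direction="right", mask_frac=0.3,
                             uniform=False)
```

Inference always looks like `consistent` (one direction throughout the clip), so training on `varied` masks is what M3DDM+ removes.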
cs.CV / 18

PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models

PhysRVG:物理感知统一强化学习用于视频生成模型
Zhang, Qiyuan, Gong, Biao, Tan, Shuai, Zhang, Zheng, Shen, Yujun, Zhu, Xing, Li, Yuyuan, Yao, Kelu, Shen, Chunhua, Zou, Changqing
Abstract
Physical principles are fundamental to realistic visual simulation, but remain a significant oversight in transformer-based video generation. This gap highlights a critical limitation in rendering rigid body motion, a core tenet of classical mechanics. While computer graphics and physics-based simulators can easily model such collisions using Newton formulas, modern pretrain-finetune paradigms discard the concept of object rigidity during pixel-level global denoising. Even perfectly correct mathematical constraints are treated as suboptimal solutions (i.e., conditions) during model optimization in post-training, fundamentally limiting the physical realism of generated videos. Motivated by these considerations, we introduce, for the first time, a physics-aware reinforcement learning paradigm for video generation models that enforces physical collision rules directly in high-dimensional spaces, ensuring the physics knowledge is strictly applied rather than treated as conditions. Subsequently, we extend this paradigm to a unified framework, termed Mimicry-Discovery Cycle (MDcycle), which allows substantial fine-tuning while fully preserving the model's ability to leverage physics-grounded feedback. To validate our approach, we construct a new benchmark, PhysRVGBench, and perform extensive qualitative and quantitative experiments to thoroughly assess its effectiveness.
Chinese Translation
物理原理是现实视觉仿真的基础,但在基于变换器的视频生成中却被显著忽视。这一差距突显了在刚体运动渲染方面的关键限制,这是经典力学的核心原则。尽管计算机图形学和基于物理的模拟器可以轻松使用牛顿公式建模此类碰撞,但现代的预训练-微调范式在像素级全局去噪过程中却忽略了物体刚性的概念。即使是完全正确的数学约束在后期训练中的模型优化中也被视为次优解(即条件),从根本上限制了生成视频的物理真实感。受到这些考虑的启发,我们首次引入了一种物理感知的强化学习范式,用于视频生成模型,该范式在高维空间中直接强制执行物理碰撞规则,确保物理知识被严格应用,而不是作为条件对待。随后,我们将这一范式扩展为一个统一框架,称为模仿发现循环(Mimicry-Discovery Cycle,MDcycle),该框架允许在充分保留模型利用基于物理反馈能力的同时进行大量微调。为了验证我们的方法,我们构建了新的基准PhysRVGBench,并进行了广泛的定性和定量实验,以全面评估其有效性。
cs.CV / 19

CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation

CoDance:一种用于鲁棒多主体动画的解绑定-重绑定范式
Tan, Shuai, Gong, Biao, Ma, Ke, Feng, Yutong, Zhang, Qiyuan, Wang, Yan, Shen, Yujun, Zhao, Hengshuang
Abstract
Character image animation is gaining significant importance across various domains, driven by the demand for robust and flexible multi-subject rendering. While existing methods excel in single-person animation, they struggle to handle arbitrary subject counts, diverse character types, and spatial misalignment between the reference image and the driving poses. We attribute these limitations to an overly rigid spatial binding that forces strict pixel-wise alignment between the pose and reference, and an inability to consistently rebind motion to intended subjects. To address these challenges, we propose CoDance, a novel Unbind-Rebind framework that enables the animation of arbitrary subject counts, types, and spatial configurations conditioned on a single, potentially misaligned pose sequence. Specifically, the Unbind module employs a novel pose shift encoder to break the rigid spatial binding between the pose and the reference by introducing stochastic perturbations to both poses and their latent features, thereby compelling the model to learn a location-agnostic motion representation. To ensure precise control and subject association, we then devise a Rebind module, leveraging semantic guidance from text prompts and spatial guidance from subject masks to direct the learned motion to intended characters. Furthermore, to facilitate comprehensive evaluation, we introduce a new multi-subject CoDanceBench. Extensive experiments on CoDanceBench and existing datasets show that CoDance achieves SOTA performance, exhibiting remarkable generalization across diverse subjects and spatial layouts. The code and weights will be open-sourced.
Chinese Translation
角色图像动画在多个领域中变得越来越重要,这得益于对鲁棒且灵活的多主体渲染的需求。尽管现有方法在单人动画方面表现出色,但在处理任意主体数量、多样化角色类型以及参考图像与驱动姿势之间的空间错位时却显得力不从心。我们将这些局限归因于过于严格的空间绑定,这迫使姿势与参考之间进行严格的像素级对齐,以及无法始终如一地将运动重新绑定到预期主体。为了解决这些挑战,我们提出了CoDance,这是一种新颖的解绑定-重绑定框架,能够在单一、可能错位的姿势序列的条件下,实现任意主体数量、类型和空间配置的动画。具体而言,解绑定模块采用了一种新颖的姿势偏移编码器,通过对姿势及其潜在特征引入随机扰动,打破姿势与参考之间的严格空间绑定,从而迫使模型学习位置无关的运动表示。为了确保精确控制和主体关联,我们设计了重绑定模块,利用文本提示的语义指导和主体掩码的空间指导,将学习到的运动引导到预期角色。此外,为了便于全面评估,我们引入了一个新的多主体CoDanceBench。在CoDanceBench和现有数据集上的大量实验表明,CoDance达到了SOTA性能,在多样化主体和空间布局中表现出显著的泛化能力。代码和权重将开源。
cs.CV / 20

Graph Smoothing for Enhanced Local Geometry Learning in Point Cloud Analysis

图平滑用于增强点云分析中的局部几何学习
Yuan, Shangbo, Xu, Jie, Hu, Ping, Zhu, Xiaofeng, Zhao, Na
Abstract
Graph-based methods have proven to be effective in capturing relationships among points for 3D point cloud analysis. However, these methods often suffer from suboptimal graph structures, particularly due to sparse connections at boundary points and noisy connections in junction areas. To address these challenges, we propose a novel method that integrates a graph smoothing module with an enhanced local geometry learning module. Specifically, we identify the limitations of conventional graph structures, particularly in handling boundary points and junction areas. In response, we introduce a graph smoothing module designed to optimize the graph structure and minimize the negative impact of unreliable sparse and noisy connections. Based on the optimized graph structure, we improve the feature extraction function with local geometry information, including shape features derived from adaptive geometric descriptors based on eigenvectors and distribution features obtained through cylindrical coordinate transformation. Experimental results on real-world datasets validate the effectiveness of our method in various point cloud learning tasks, i.e., classification, part segmentation, and semantic segmentation.
Chinese Translation
基于图的方法已被证明在捕捉3D点云分析中点与点之间的关系方面有效。然而,这些方法往往受限于次优的图结构,特别是在边界点的稀疏连接和交界区域的噪声连接方面。为了解决这些挑战,我们提出了一种新颖的方法,将图平滑模块与增强的局部几何学习模块相结合。具体而言,我们识别了传统图结构的局限性,尤其是在处理边界点和交界区域时。对此,我们引入了一个图平滑模块,旨在优化图结构并最小化不可靠的稀疏和噪声连接的负面影响。在优化的图结构基础上,我们改进了特征提取函数,结合了局部几何信息。这些信息包括基于特征向量的自适应几何描述符提取的形状特征,以及通过圆柱坐标变换获得的分布特征。在真实世界数据集上的实验结果验证了我们的方法在各种点云学习任务中的有效性,即分类、部分分割和语义分割。
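The cylindrical coordinate transformation behind the distribution features can be sketched directly; the choice of vertical axis and local center below is an assumption, since the abstract does not fix them:

```python
import math

def to_cylindrical(points, center=(0.0, 0.0, 0.0)):
    """Map XYZ points to cylindrical coordinates (rho, theta, z) about a
    local center. The z-axis is assumed to be the cylinder axis; a real
    pipeline would pick the axis per neighborhood."""
    cx, cy, cz = center
    out = []
    for x, y, z in points:
        dx, dy = x - cx, y - cy
        rho = math.hypot(dx, dy)        # radial distance in the xy-plane
        theta = math.atan2(dy, dx)      # azimuth angle
        out.append((rho, theta, z - cz))
    return out

cyl = to_cylindrical([(1.0, 0.0, 2.0), (0.0, 1.0, -1.0)])
```

Histogramming neighbors over (rho, theta, z) bins is one way such coordinates yield distribution features for a point's neighborhood.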
cs.CV / 21

Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

通过交错多模态推理的视觉逆图形代理
Yin, Shaofeng, Ge, Jiaxin, Wang, Zora Zhiruo, Li, Xiuyu, Black, Michael J., Darrell, Trevor, Kanazawa, Angjoo, Feng, Haiwen
Abstract
Vision-as-inverse-graphics, the concept of reconstructing an image as an editable graphics program, is a long-standing goal of computer vision. Yet even strong VLMs are unable to achieve this in one shot, as they lack fine-grained spatial and physical grounding capability. Our key insight is that closing this gap requires interleaved multimodal reasoning through iterative execution and verification. Stemming from this, we present VIGA (Vision-as-Inverse-Graphics Agent), which starts from an empty world and reconstructs or edits scenes through a closed-loop write-run-render-compare-revise procedure. To support long-horizon reasoning, VIGA combines (i) a skill library that alternates generator and verifier roles and (ii) an evolving context memory that contains plans, code diffs, and render history. VIGA is task-agnostic as it doesn't require auxiliary modules, covering a wide range of tasks such as 3D reconstruction, multi-step scene editing, 4D physical interaction, and 2D document editing. Empirically, we found VIGA substantially improves one-shot baselines on BlenderGym (35.32%) and SlideBench (117.17%). Moreover, VIGA is also model-agnostic as it doesn't require finetuning, enabling a unified protocol to evaluate heterogeneous foundation VLMs. To better support this protocol, we introduce BlenderBench, a challenging benchmark that stress-tests interleaved multimodal reasoning with a graphics engine, where VIGA improves by 124.70%.
Chinese Translation
视觉逆图形,即将图像重建为可编辑的图形程序,是计算机视觉的一个长期目标。然而,即使是强大的视觉语言模型(VLMs)也无法一次性实现这一目标,因为它们缺乏细粒度的空间和物理基础能力。我们的关键见解是,缩小这一差距需要通过迭代执行和验证的交错多模态推理。基于此,我们提出了VIGA(视觉逆图形代理),该代理从一个空的世界开始,通过闭环的写-运行-渲染-比较-修订程序重建或编辑场景。为了支持长时间的推理,VIGA结合了(i) 交替生成器和验证器角色的技能库,以及(ii) 包含计划、代码差异和渲染历史的演变上下文记忆。VIGA是任务无关的,因为它不需要辅助模块,覆盖了广泛的任务,如3D重建、多步骤场景编辑、4D物理交互和2D文档编辑等。实证研究表明,VIGA在BlenderGym(提高35.32%)和SlideBench(提高117.17%)上的一次性基线表现显著提升。此外,VIGA也是模型无关的,因为它不需要微调,从而实现了评估异构基础视觉语言模型的统一协议。为了更好地支持这一协议,我们引入了BlenderBench,这是一个具有挑战性的基准,旨在通过图形引擎对交错多模态推理进行压力测试,其中VIGA的表现提高了124.70%。
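The write-run-render-compare-revise loop can be sketched abstractly; the callables stand in for the VLM generator and verifier roles, and the stopping threshold is our choice:

```python
def viga_loop(write_code, run_render, compare, revise, target, max_iters=5):
    """Closed-loop inverse-graphics sketch: generate a program, render it,
    score the render against the target, and revise until good enough.
    The callables are stand-ins for the agent's generator/verifier skills;
    the 0.99 threshold is illustrative."""
    program = write_code(target)
    for _ in range(max_iters):
        render = run_render(program)
        score = compare(render, target)
        if score >= 0.99:             # render matches the target closely enough
            break
        program = revise(program, render, target)
    return program, score

# Toy world: a "program" is an integer, rendering is identity, similarity is
# 1 - normalized absolute error, and revision steps toward the target.
prog, score = viga_loop(
    write_code=lambda t: 0,
    run_render=lambda p: p,
    compare=lambda r, t: 1 - abs(r - t) / 10,
    revise=lambda p, r, t: p + (1 if t > p else -1),
    target=3,
)
```

Swapping the toy callables for a code-writing VLM, a Blender render call, and an image-similarity verifier recovers the shape of the actual agent.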
cs.CV / 22

SoLA-Vision: Fine-grained Layer-wise Linear Softmax Hybrid Attention

SoLA-Vision:细粒度层级线性Softmax混合注意力
Li, Ruibang, Luo, Guan, Zhang, Yiwei, Gao, Jin, Li, Bing, Hu, Weiming
Abstract
Standard softmax self-attention excels in vision tasks but incurs quadratic complexity O(N^2), limiting high-resolution deployment. Linear attention reduces the cost to O(N), yet its compressed state representations can impair modeling capacity and accuracy. We present an analytical study that contrasts linear and softmax attention for visual representation learning from a layer-stacking perspective. We further conduct systematic experiments on layer-wise hybridization patterns of linear and softmax attention. Our results show that, compared with rigid intra-block hybrid designs, fine-grained layer-wise hybridization can match or surpass performance while requiring fewer softmax layers. Building on these findings, we propose SoLA-Vision (Softmax-Linear Attention Vision), a flexible layer-wise hybrid attention backbone that enables fine-grained control over how linear and softmax attention are integrated. By strategically inserting a small number of global softmax layers, SoLA-Vision achieves a strong trade-off between accuracy and computational cost. On ImageNet-1K, SoLA-Vision outperforms purely linear and other hybrid attention models. On dense prediction tasks, it consistently surpasses strong baselines by a considerable margin. Code will be released.
Chinese Translation
标准的softmax自注意力在视觉任务中表现出色,但其复杂度为O(N^2),限制了高分辨率的应用。线性注意力将成本降低至O(N),但其压缩状态表示可能会损害建模能力和准确性。我们进行了一项分析研究,从层堆叠的角度对线性和softmax注意力在视觉表征学习中的表现进行了对比。我们进一步对线性和softmax注意力的层级混合模式进行了系统实验。结果表明,与僵化的块内混合设计相比,细粒度的层级混合能够在需要更少的softmax层的情况下匹配或超越性能。基于这些发现,我们提出了SoLA-Vision(Softmax-Linear Attention Vision),一种灵活的层级混合注意力骨干网络,能够细致控制线性和softmax注意力的集成方式。通过战略性地插入少量全局softmax层,SoLA-Vision在准确性和计算成本之间实现了良好的权衡。在ImageNet-1K上,SoLA-Vision超越了纯线性和其他混合注意力模型。在密集预测任务中,它始终以相当大的优势超越强基线。代码将会发布。
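The O(N^2) vs. O(N) contrast at the heart of the hybrid design can be made concrete; the feature map `phi` below is one common positive choice for kernelized linear attention, not necessarily the one used in SoLA-Vision:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: materializes the N x N score matrix, hence O(N^2)."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Kernelized linear attention: reordering to phi(Q) (phi(K)^T V) replaces
    the N x N matrix with a d x d summary, hence O(N) in sequence length.
    phi is one common positive feature map; the paper's hybrid schedule
    decides per layer which of the two forms to run."""
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                      # d x d summary, independent of N
    z = Qf @ Kf.sum(axis=0)            # per-query normalizer
    return (Qf @ kv) / z[:, None]

X = np.random.default_rng(0).normal(size=(8, 4))
out_soft = softmax_attention(X, X, X)
out_lin = linear_attention(X, X, X)
```

The two outputs differ (the kernel only approximates softmax weighting), which is exactly the capacity gap the layer-wise hybridization trades against cost.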
cs.CV / 23

Democratizing planetary-scale analysis: An ultra-lightweight Earth embedding database for accurate and flexible global land monitoring

民主化行星尺度分析:一种超轻量级地球嵌入数据库用于准确灵活的全球土地监测
Chen, Shuang, Wang, Jie, Yuan, Shuai, Li, Jiayang, Xia, Yu, Liao, Yuanhong, Wei, Junbo, Yuan, Jincheng, Xu, Xiaoqing, Zhu, Xiaolin, Zhu, Peng, Zhang, Hongsheng, Zhou, Yuyu, Fu, Haohuan, Huang, Huabing, Chen, Bin, Dai, Fan, Gong, Peng
Abstract
The rapid evolution of satellite-borne Earth Observation (EO) systems has revolutionized terrestrial monitoring, yielding petabyte-scale archives. However, the immense computational and storage requirements for global-scale analysis often preclude widespread use, hindering planetary-scale studies. To address these barriers, we present Embedded Seamless Data (ESD), an ultra-lightweight, 30-m global Earth embedding database spanning the 25-year period from 2000 to 2024. By transforming high-dimensional, multi-sensor observations from the Landsat series (5, 7, 8, and 9) and MODIS Terra into information-dense, quantized latent vectors, ESD distills essential geophysical and semantic features into a unified latent space. Utilizing the ESDNet architecture and Finite Scalar Quantization (FSQ), the dataset achieves a transformative ~340-fold reduction in data volume compared to raw archives. This compression allows the entire global land surface for a single year to be encapsulated within approximately 2.4 TB, enabling decadal-scale global analysis on standard local workstations. Rigorous validation demonstrates high reconstructive fidelity (MAE: 0.0130; RMSE: 0.0179; CC: 0.8543). By condensing the annual phenological cycle into 12 temporal steps, the embeddings provide inherent denoising and a semantically organized space that outperforms raw reflectance in land-cover classification, achieving 79.74% accuracy (vs. 76.92% for raw fusion). With robust few-shot learning capabilities and longitudinal consistency, ESD provides a versatile foundation for democratizing planetary-scale research and advancing next-generation geospatial artificial intelligence.
Chinese Translation
卫星载荷地球观测(EO)系统的快速发展彻底改变了陆地监测,产生了PB级的档案。然而,全球尺度分析所需的巨大计算和存储需求常常阻碍了其广泛应用,妨碍了行星尺度研究。为了解决这些障碍,我们提出了嵌入式无缝数据(Embedded Seamless Data,ESD),这是一个超轻量级的30米全球地球嵌入数据库,覆盖2000年至2024年25年的时间段。通过将来自Landsat系列(5、7、8和9)和MODIS Terra的高维多传感器观测转化为信息密集的量化潜在向量,ESD将基本的地球物理和语义特征提炼到一个统一的潜在空间。利用ESDNet架构和有限标量量化(Finite Scalar Quantization,FSQ),该数据集实现了与原始档案相比约340倍的数据量减少。这种压缩使得单一年份的全球陆地表面能够被封装在约2.4 TB内,从而在标准本地工作站上实现十年尺度的全球分析。严格的验证表明高重建保真度(MAE: 0.0130; RMSE: 0.0179; CC: 0.8543)。通过将年度物候周期浓缩为12个时间步骤,这些嵌入提供了固有的去噪和一个语义组织的空间,在土地覆盖分类中优于原始反射率,准确率达到79.74%(相比之下,原始融合为76.92%)。凭借强大的少样本学习能力和纵向一致性,ESD为民主化行星尺度研究和推动下一代地理空间人工智能提供了多功能的基础。
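The ~340-fold compression rests on Finite Scalar Quantization; a minimal sketch of the FSQ rounding step, following the published FSQ recipe in spirit (the per-dimension codebook sizes are illustrative, not ESD's actual configuration):

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite Scalar Quantization sketch: squash each latent dimension into
    (-1, 1) and round it onto a fixed grid of `levels[i]` values. The
    implicit codebook size is the product of the per-dimension levels."""
    z = np.tanh(np.asarray(z, dtype=float))   # bound each dimension
    half = (np.asarray(levels, dtype=float) - 1) / 2
    return np.round(z * half) / half          # snap to the per-dim grid

q = fsq_quantize([0.9, -2.0, 0.05], levels=[5, 5, 5])
```

With 5 levels per dimension a 3-D latent needs only 5^3 = 125 codes, which is how quantized latents shrink petabyte archives to terabytes.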
cs.CV / 24

ATATA: One Algorithm to Align Them All

ATATA:一个统一对齐的算法
Pang, Boyi, Ignatyev, Savva, Ippolitov, Vladimir, Khafizov, Ramil, Melnik, Yurii, Voynov, Oleg, Nakhodnov, Maksim, Alanov, Aibek, Fan, Xiaopeng, Wonka, Peter, Burnaev, Evgeny
Abstract
We suggest a new multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. While some existing methods propose a codependent generation process, they do not view the problem of joint generation from a structural alignment perspective. Recent work uses Score Distillation Sampling to generate aligned 3D models, but SDS is known to be time-consuming, prone to mode collapse, and liable to produce cartoonish results. By contrast, our suggested approach relies on the joint transport of a segment in the sample space, yielding faster computation at inference time. Our approach can be built on top of an arbitrary Rectified Flow model operating on the structured latent space. We show the applicability of our method to the domains of image, video, and 3D shape generation using state-of-the-art baselines and evaluate it against both editing-based and joint inference-based competing approaches. We demonstrate a high degree of structural alignment for the sample pairs obtained with our method and a high visual quality of the samples. Our method improves the state-of-the-art for image and video generation pipelines. For 3D generation, it achieves comparable quality while working orders of magnitude faster.
Chinese Translation
我们提出了一种新的多模态算法,用于与修正流模型共同推断成对的结构对齐样本。虽然一些现有方法提出了相互依赖的生成过程,但它们并未从结构对齐的角度看待联合生成的问题。最近的研究使用评分蒸馏采样(Score Distillation Sampling)生成对齐的3D模型,但已知SDS耗时较长,容易出现模式崩溃,并且通常提供卡通化的结果。相比之下,我们建议的方法依赖于样本空间中一个片段的联合传输,从而在推断时实现更快的计算。我们的方法可以基于任意在结构潜在空间上运行的修正流模型构建。我们展示了我们方法在图像、视频和3D形状生成领域的适用性,使用最先进的基准进行评估,并与基于编辑和基于联合推断的竞争方法进行比较。我们证明了通过我们的方法获得的样本对具有高度的结构对齐性,并且样本的视觉质量也很高。我们的方法提高了图像和视频生成管道的最先进水平。在3D生成方面,它能够在更快的速度下展示出可比的质量。
cs.CV / 25

Bio-inspired fine-tuning for selective transfer learning in image classification

基于生物启发的图像分类选择性迁移学习的微调
Davila, Ana, Colan, Jacinto, Hasegawa, Yasuhisa
Abstract
Deep learning has significantly advanced image analysis across diverse domains but often depends on large, annotated datasets for success. Transfer learning addresses this challenge by utilizing pre-trained models to tackle new tasks with limited labeled data. However, discrepancies between source and target domains can hinder effective transfer learning. We introduce BioTune, a novel adaptive fine-tuning technique utilizing evolutionary optimization. BioTune enhances transfer learning by optimally choosing which layers to freeze and adjusting learning rates for unfrozen layers. Through extensive evaluation on nine image classification datasets, spanning natural and specialized domains such as medical imaging, BioTune demonstrates superior accuracy and efficiency over state-of-the-art fine-tuning methods, including AutoRGN and LoRA, highlighting its adaptability to various data characteristics and distribution changes. Additionally, BioTune consistently achieves top performance across four different CNN architectures, underscoring its flexibility. Ablation studies provide valuable insights into the impact of BioTune's key components on overall performance. The source code is available at https://github.com/davilac/BioTune.
Chinese Translation
深度学习在多个领域的图像分析中取得了显著进展,但通常依赖于大量标注数据集以获得成功。迁移学习通过利用预训练模型来应对这一挑战,以处理有限标注数据的新任务。然而,源领域与目标领域之间的差异可能会阻碍有效的迁移学习。我们提出了BioTune,这是一种利用进化优化的创新自适应微调技术。BioTune通过最优选择冻结哪些层以及调整未冻结层的学习率来增强迁移学习。通过在九个图像分类数据集上的广泛评估,这些数据集涵盖自然和专业领域(如医学成像),BioTune展示了比最先进的微调方法(包括AutoRGN和LoRA)更优越的准确性和效率,突显了其对各种数据特征和分布变化的适应性。此外,BioTune在四种不同的卷积神经网络(CNN)架构中始终实现了最佳性能,强调了其灵活性。消融研究提供了关于BioTune关键组件对整体性能影响的宝贵见解。源代码可在https://github.com/davilac/BioTune获取。
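The layer-freezing search can be pictured as optimizing a binary mask over layers. The sketch below substitutes a deterministic bit-flip hill-climb and a toy fitness for BioTune's evolutionary optimizer and real validation accuracy:

```python
def search_freeze_mask(fitness, n_layers, sweeps=3):
    """Search a binary freeze mask (1 = freeze layer, 0 = fine-tune).
    This deterministic bit-flip hill-climb is a simplified stand-in for
    BioTune's evolutionary optimizer, which is not specified here."""
    mask = [0] * n_layers
    for _ in range(sweeps):
        for i in range(n_layers):
            flipped = mask[:i] + [1 - mask[i]] + mask[i + 1:]
            if fitness(flipped) > fitness(mask):
                mask = flipped   # keep the flip only if it helps
    return mask

# Toy fitness: pretend freezing the first half of the network (generic
# early features) and fine-tuning the rest is optimal.
ideal = [1] * 4 + [0] * 4
toy_fitness = lambda m: -sum(a != b for a, b in zip(m, ideal))
```

In the real method, evaluating a candidate means a short fine-tuning run, which is why an efficient search strategy matters.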
cs.CV / 26

Image-Text Knowledge Modeling for Unsupervised Multi-Scenario Person Re-Identification

用于无监督多场景人物重识别的图像-文本知识建模
Pang, Zhiqi, Zhao, Lingling, Liu, Yang, Wang, Chunyu, Sharma, Gaurav
Abstract
We propose unsupervised multi-scenario (UMS) person re-identification (ReID) as a new task that expands ReID across diverse scenarios (cross-resolution, clothing change, etc.) within a single coherent framework. To tackle UMS-ReID, we introduce image-text knowledge modeling (ITKM) -- a three-stage framework that effectively exploits the representational power of vision-language models. We start with a pre-trained CLIP model with an image encoder and a text encoder. In Stage I, we introduce a scenario embedding in the image encoder and fine-tune the encoder to adaptively leverage knowledge from multiple scenarios. In Stage II, we optimize a set of learned text embeddings to associate with pseudo-labels from Stage I and introduce a multi-scenario separation loss to increase the divergence between inter-scenario text representations. In Stage III, we first introduce cluster-level and instance-level heterogeneous matching modules to obtain reliable heterogeneous positive pairs (e.g., a visible image and an infrared image of the same person) within each scenario. Next, we propose a dynamic text representation update strategy to maintain consistency between text and image supervision signals. Experimental results across multiple scenarios demonstrate the superiority and generalizability of ITKM; it not only outperforms existing scenario-specific methods but also enhances overall performance by integrating knowledge from multiple scenarios.
Chinese Translation
我们提出了无监督多场景(UMS)人物重识别(ReID)作为一种新任务,旨在在单一一致框架内扩展ReID至多样化场景(跨分辨率、服装变化等)。为了解决UMS-ReID问题,我们引入了图像-文本知识建模(ITKM)——一个三阶段框架,有效利用视觉-语言模型的表征能力。我们首先使用预训练的CLIP模型,该模型包含图像编码器和文本编码器。在第一阶段,我们在图像编码器中引入场景嵌入,并微调编码器以自适应地利用来自多个场景的知识。在第二阶段,我们优化一组学习到的文本嵌入,以与第一阶段的伪标签关联,并引入多场景分离损失,以增加场景间文本表征的差异性。在第三阶段,我们首先引入集群级和实例级异构匹配模块,以在每个场景内获得可靠的异构正样本对(例如,同一人的可见图像和红外图像)。接下来,我们提出了一种动态文本表征更新策略,以保持文本和图像监督信号之间的一致性。在多个场景下的实验结果表明,ITKM的优越性和泛化能力;它不仅优于现有的场景特定方法,还通过整合来自多个场景的知识提升了整体性能。
cs.CV / 27

Language-Agnostic Visual Embeddings for Cross-Script Handwriting Retrieval

跨脚本手写检索的语言无关视觉嵌入
Chen, Fangke, Dong, Tianhao, Chen, Sirry, Zhang, Guobin, Zhang, Yishu, Chen, Yining
Abstract
Handwritten word retrieval is vital for digital archives but remains challenging due to large handwriting variability and cross-lingual semantic gaps. While large vision-language models offer potential solutions, their prohibitive computational costs hinder practical edge deployment. To address this, we propose a lightweight asymmetric dual-encoder framework that learns unified, style-invariant visual embeddings. By jointly optimizing instance-level alignment and class-level semantic consistency, our approach anchors visual embeddings to language-agnostic semantic prototypes, enforcing invariance across scripts and writing styles. Experiments show that our method outperforms 28 baselines and achieves state-of-the-art accuracy on within-language retrieval benchmarks. We further conduct explicit cross-lingual retrieval, where the query language differs from the target language, to validate the effectiveness of the learned cross-lingual representations. Achieving strong performance with only a fraction of the parameters required by existing models, our framework enables accurate and resource-efficient cross-script handwriting retrieval.
Chinese Translation
手写词检索对于数字档案至关重要,但由于手写变异性大和跨语言语义差距,仍然面临挑战。虽然大型视觉-语言模型提供了潜在的解决方案,但其高昂的计算成本阻碍了实际边缘部署。为了解决这个问题,我们提出了一种轻量级的不对称双编码器框架,该框架学习统一的、风格不变的视觉嵌入。通过联合优化实例级对齐和类别级语义一致性,我们的方法将视觉嵌入锚定到语言无关的语义原型上,从而在不同脚本和书写风格之间强制保持不变性。实验表明,我们的方法在28个基线模型上表现优越,并在同语言检索基准上达到了最先进的准确率。我们进一步进行明确的跨语言检索,其中查询语言与目标语言不同,以验证所学习的跨语言表示的有效性。我们的框架在仅需现有模型所需参数的一小部分的情况下,取得了强劲的性能,使得准确且资源高效的跨脚本手写检索成为可能。
cs.CV / 28

FTDMamba: Frequency-Assisted Temporal Dilation Mamba for Unmanned Aerial Vehicle Video Anomaly Detection

FTDMamba:用于无人机视频异常检测的频率辅助时间膨胀Mamba
Liu, Cheng-Zhuang, Chen, Si-Bao, Shu, Qing-Ling, Ding, Chris, Tang, Jin, Luo, Bin
Abstract
Recent advances in video anomaly detection (VAD) mainly focus on ground-based surveillance or unmanned aerial vehicle (UAV) videos with static backgrounds, whereas research on UAV videos with dynamic backgrounds remains limited. Unlike static scenarios, dynamically captured UAV videos exhibit multi-source motion coupling, where the motion of objects and UAV-induced global motion are intricately intertwined. Consequently, existing methods may misclassify normal UAV movements as anomalies or fail to capture true anomalies concealed within dynamic backgrounds. Moreover, many approaches do not adequately address the joint modeling of inter-frame continuity and local spatial correlations across diverse temporal scales. To overcome these limitations, we propose the Frequency-Assisted Temporal Dilation Mamba (FTDMamba) network for UAV VAD, including two core components: (1) a Frequency Decoupled Spatiotemporal Correlation Module, which disentangles coupled motion patterns and models global spatiotemporal dependencies through frequency analysis; and (2) a Temporal Dilation Mamba Module, which leverages Mamba's sequence modeling capability to jointly learn fine-grained temporal dynamics and local spatial structures across multiple temporal receptive fields. Additionally, unlike existing UAV VAD datasets which focus on static backgrounds, we construct a large-scale Moving UAV VAD dataset (MUVAD), comprising 222,736 frames with 240 anomaly events across 12 anomaly types. Extensive experiments demonstrate that FTDMamba achieves state-of-the-art (SOTA) performance on two public static benchmarks and the new MUVAD dataset. The code and MUVAD dataset will be available at: https://github.com/uavano/FTDMamba.
Chinese Translation
近年来,视频异常检测(VAD)的研究主要集中在地面监控或具有静态背景的无人机(UAV)视频上,而对具有动态背景的无人机视频的研究仍然有限。与静态场景不同,动态捕获的无人机视频表现出多源运动耦合,其中物体运动与无人机引起的全局运动错综复杂地交织在一起。因此,现有方法可能将正常的无人机运动误分类为异常,或未能捕捉到隐藏在动态背景中的真实异常。此外,许多方法未能充分解决在不同时间尺度下的帧间连续性和局部空间相关性的联合建模。为克服这些局限性,我们提出了频率辅助时间膨胀Mamba(FTDMamba)网络用于无人机VAD,包括两个核心组件:(1)频率解耦时空相关模块,通过频率分析解耦耦合运动模式并建模全局时空依赖关系;(2)时间膨胀Mamba模块,利用Mamba的序列建模能力共同学习多个时间感受野下的细粒度时间动态和局部空间结构。此外,与现有的专注于静态背景的无人机VAD数据集不同,我们构建了一个大规模的移动无人机VAD数据集(MUVAD),包含222,736帧和240个异常事件,涵盖12种异常类型。大量实验表明,FTDMamba在两个公共静态基准和新的MUVAD数据集上实现了最先进的(SOTA)性能。代码和MUVAD数据集将可在:https://github.com/uavano/FTDMamba获取。
cs.CV / 29

X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning

X-Distill:用于视觉运动学习的跨架构视觉蒸馏
Shao, Maanping, Zhang, Feihong, Zhang, Gu, Cheng, Baiye, Xue, Zhengrong, Xu, Huazhe
Abstract
Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce context of most robotic learning settings, where compact CNNs with strong inductive biases can be more easily optimized. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that synergizes the strengths of both architectures. Our approach involves an offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly fine-tuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on 34 simulated benchmarks and 5 challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or fine-tuned DINOv2 encoders. Notably, X-Distill also surpasses 3D encoders that utilize privileged point cloud observations or much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.
Chinese Translation
视觉运动策略通常利用大型预训练的视觉变换器(Vision Transformers, ViTs)来发挥其强大的泛化能力。然而,它们对数据的显著需求在大多数机器人学习环境中构成了一个主要挑战,在这些环境中,具有强烈归纳偏置的紧凑卷积神经网络(CNNs)更容易被优化。为了解决这一权衡,我们提出了X-Distill,这是一种简单但极为有效的方法,能够协同利用两种架构的优势。我们的方法涉及离线的跨架构知识蒸馏,将大型、冻结的DINOv2教师模型的丰富视觉表征转移到紧凑的ResNet-18学生模型上,使用的是通用的ImageNet数据集。这个经过蒸馏的编码器现在具备了强大的视觉先验,然后与扩散策略头共同在目标操作任务上进行微调。在34个模拟基准和5个具有挑战性的真实世界任务上的广泛实验表明,我们的方法始终优于从头训练的ResNet或微调的DINOv2编码器。此外,X-Distill还超越了利用特权点云观测或更大规模视觉-语言模型的3D编码器。我们的工作突显了简单且基础扎实的蒸馏策略在实现数据高效的机器人操作中的前沿性能的有效性。
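The offline distillation step amounts to regressing projected student features onto frozen teacher features. In the sketch below, the linear projection head, the MSE objective, and the feature dimensions are assumptions, not the paper's exact recipe:

```python
import numpy as np

def distill_loss(f_student, f_teacher, W):
    """Offline cross-architecture feature distillation in the spirit of
    X-Distill: a linear head W projects student features (e.g. ResNet-18,
    512-d) into the teacher space (e.g. DINOv2, 768-d); the loss is the
    mean squared error between projected student and teacher features."""
    proj = f_student @ W                      # (N, 512) @ (512, 768)
    return float(np.mean((proj - f_teacher) ** 2))

rng = np.random.default_rng(0)
f_s = rng.normal(size=(4, 512))               # a batch of student features
W = rng.normal(size=(512, 768)) / np.sqrt(512)
f_t = f_s @ W                                 # synthetic teacher: exact match
```

In practice the teacher stays frozen and only the student (and projection head) receive gradients, so the distillation can be run once, offline, before policy learning.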
cs.CV / 30

Efficient On-Board Processing of Oblique UAV Video for Rapid Flood Extent Mapping

高效的倾斜无人机视频机载处理用于快速洪水范围映射
Sharma, Vishisht, Leroux, Sam, Landuyt, Lisa, Witvrouwen, Nick, Simoens, Pieter
Abstract
Effective disaster response relies on rapid initial scouting, for which oblique aerial video is the primary modality due to its ability to maximize spatial coverage and situational awareness in limited flight time. However, the on-board processing of high-resolution oblique streams is severely bottlenecked by the strict Size, Weight, and Power (SWaP) constraints of Unmanned Aerial Vehicles (UAVs). The computational density required to process these wide-field-of-view streams precludes low-latency inference on standard edge hardware. To address this, we propose Temporal Token Reuse (TTR), an adaptive inference framework capable of accelerating video segmentation on embedded devices. TTR exploits the intrinsic spatiotemporal redundancy of aerial video by formulating image patches as tokens; it utilizes a lightweight similarity metric to dynamically identify static regions and propagate their precomputed deep features, thereby bypassing redundant backbone computations. We validate the framework on standard benchmarks and a newly curated Oblique Floodwater Dataset designed for hydrological monitoring. Experimental results on edge-grade hardware demonstrate that TTR achieves a 30% reduction in inference latency with negligible degradation in segmentation accuracy (< 0.5% mIoU). These findings confirm that TTR effectively shifts the operational Pareto frontier, enabling high-fidelity, real-time oblique video understanding for time-critical remote sensing missions.
Chinese Translation
有效的灾害响应依赖于快速的灾害应对,其中倾斜航拍视频由于其在有限飞行时间内最大化空间覆盖和情境意识的能力,成为初步侦查的主要方式。然而,高分辨率倾斜视频流的机载处理受到无人机(UAV)严格的尺寸、重量和功耗(SWaP)限制的严重瓶颈。处理这些广视场视频流所需的计算密度使得在标准边缘硬件上进行低延迟推理变得不可行。为了解决这个问题,我们提出了时间令牌重用(Temporal Token Reuse, TTR),这是一种能够加速嵌入式设备上视频分割的自适应推理框架。TTR通过将图像块形式化为令牌,利用航拍视频的内在时空冗余;它使用轻量级相似性度量动态识别静态区域并传播其预计算的深度特征,从而绕过冗余的主干计算。我们在标准基准测试和新创建的倾斜洪水数据集上验证了该框架,该数据集旨在进行水文监测。在边缘级硬件上的实验结果表明,TTR实现了推理延迟减少30%,而分割精度几乎没有下降(< 0.5% mIoU)。这些发现确认了TTR有效地推动了操作的帕累托前沿,使得在时间敏感的遥感任务中实现高保真、实时的倾斜视频理解成为可能。
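The token-reuse idea can be sketched as a per-patch similarity gate in front of the backbone. The mechanics below (cosine similarity, a fixed threshold, zero-cost cache hits) are assumptions for illustration; the paper's actual metric and threshold may differ:

```python
import numpy as np

def reuse_tokens(patches, prev_patches, cached_feats, backbone, tau=0.99):
    """Sketch of temporal token reuse: a patch whose cosine similarity to
    the co-located patch of the previous frame exceeds `tau` inherits its
    cached deep feature; only changed patches are re-run through the
    backbone. Returns the feature map and the number of recomputations."""
    feats = cached_feats.copy()
    recomputed = 0
    for i, (p, q) in enumerate(zip(patches, prev_patches)):
        sim = p @ q / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-8)
        if sim < tau:                       # patch changed: recompute
            feats[i] = backbone(p)
            recomputed += 1
    return feats, recomputed

rng = np.random.default_rng(1)
prev = rng.normal(size=(10, 16))            # 10 patch tokens, 16-d each
cur = prev.copy()
cur[3], cur[7] = rng.normal(size=16), rng.normal(size=16)  # 2 patches change
toy_backbone = lambda p: p * 2.0            # stand-in for a deep encoder
feats, n = reuse_tokens(cur, prev, prev * 2.0, toy_backbone)
```

With mostly static aerial backgrounds, the fraction of recomputed patches stays small, which is where the latency reduction comes from.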
cs.CV / 31

SAMannot: A Memory-Efficient, Local, Open-source Framework for Interactive Video Instance Segmentation based on SAM2

SAMannot:基于SAM2的内存高效、本地开源交互视频实例分割框架
Dinya, Gergely, Gelencsér, András, Kupán, Krisztina, Küpper, Clemens, Karacs, Kristóf, Gelencsér-Horváth, Anna
Abstract
Current research workflows for precise video segmentation are often forced into a compromise between labor-intensive manual curation, costly commercial platforms, and/or privacy-compromising cloud-based services. The demand for high-fidelity video instance segmentation in research is often hindered by the bottleneck of manual annotation and the privacy concerns of cloud-based tools. We present SAMannot, an open-source, local framework that integrates the Segment Anything Model 2 (SAM2) into a human-in-the-loop workflow. To address the high resource requirements of foundation models, we modified the SAM2 dependency and implemented a processing layer that minimizes computational overhead and maximizes throughput, ensuring a highly responsive user interface. Key features include persistent instance identity management, an automated "lock-and-refine" workflow with barrier frames, and a mask-skeletonization-based auto-prompting mechanism. SAMannot facilitates the generation of research-ready datasets in YOLO and PNG formats alongside structured interaction logs. Verified through animal behavior tracking use-cases and subsets of the LVOS and DAVIS benchmark datasets, the tool provides a scalable, private, and cost-effective alternative to commercial platforms for complex video annotation tasks.
Chinese Translation
当前精确视频分割的研究工作流程常常在劳动密集型手动标注、昂贵的商业平台和/或侵犯隐私的云服务之间进行妥协。研究中对高保真视频实例分割的需求常常受到手动标注瓶颈和云工具隐私问题的制约。我们提出了SAMannot,一个开源的本地框架,将Segment Anything Model 2 (SAM2)集成到人机协作的工作流程中。为了应对基础模型的高资源需求,我们修改了SAM2的依赖关系,并实现了一个处理层,以最小化计算开销并最大化吞吐量,确保高度响应的用户界面。主要特性包括持久的实例身份管理、带有障碍帧的自动“锁定与精炼”工作流程,以及基于掩膜骨架化的自动提示机制。SAMannot促进了YOLO和PNG格式的研究准备数据集的生成,并附带结构化交互日志。通过动物行为追踪用例和LVOS与DAVIS基准数据集的子集验证,该工具为复杂视频标注任务提供了可扩展、私密且具有成本效益的替代方案。
cs.CV / 32

Context-Aware Semantic Segmentation via Stage-Wise Attention

基于阶段性注意力的上下文感知语义分割
Carreaud, Antoine, Naha, Elias, Chansel, Arthur, Lahellec, Nina, Skaloud, Jan, Gressin, Adrien
Abstract
Semantic segmentation of ultra-high-resolution (UHR) images is essential in remote sensing applications such as aerial mapping and environmental monitoring. Transformer-based models struggle in this setting because memory grows quadratically with token count, constraining either the contextual scope or the spatial resolution. We introduce CASWiT (Context-Aware Stage-Wise Transformer), a dual-branch, Swin-based architecture that injects global cues into fine-grained UHR features. A context encoder processes a downsampled neighborhood to capture long-range dependencies, while a high-resolution encoder extracts detailed features from UHR patches. A cross-scale fusion module, combining cross-attention and gated feature injection, enriches high-resolution tokens with context. Beyond architecture, we propose a SimMIM-style pretraining. We mask 75% of the high-resolution image tokens and the low-resolution center region that spatially corresponds to the UHR patch, then train the shared dual encoder with a small decoder to reconstruct the initial UHR image. Extensive experiments on the large-scale IGN FLAIR-HUB aerial dataset demonstrate the effectiveness of CASWiT. Our method achieves 65.83% mIoU, outperforming RGB baselines by 1.78 points. On URUR, CASWiT achieves 49.1% mIoU, surpassing the current SoTA by +0.9% under the official evaluation protocol. All code is provided at: https://huggingface.co/collections/heig-vd-geo/caswit.
Chinese Translation
语义超高分辨率图像(UHR)分割在遥感应用中至关重要,例如航空制图和环境监测。然而,基于Transformer的模型在这一场景中表现不佳,因为内存随着标记数量的平方增长,这限制了上下文范围或空间分辨率。我们提出了CASWiT(上下文感知阶段性Transformer),这是一种双分支的基于Swin的架构,将全局线索注入细粒度的UHR特征。上下文编码器处理下采样的邻域以捕捉长距离依赖关系,而高分辨率编码器从UHR补丁中提取详细特征。一个跨尺度融合模块结合了交叉注意力和门控特征注入,丰富了高分辨率标记的上下文。除了架构外,我们还提出了一种类似SimMIM的预训练方法。我们对75%的高分辨率图像标记和与UHR补丁空间对应的低分辨率中心区域进行遮蔽,然后训练共享的双编码器与小解码器重构UHR初始图像。在大规模IGN FLAIR-HUB航空数据集上的大量实验表明了CASWiT的有效性。我们的方法实现了65.83%的mIoU,超越了RGB基线1.78个百分点。在URUR上,CASWiT实现了49.1%的mIoU,超过当前最先进技术(SoTA)0.9%(根据官方评估协议)。所有代码可在以下链接获取:https://huggingface.co/collections/heig-vd-geo/caswit。
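The SimMIM-style pretraining reduces to masking a fixed fraction of tokens and supervising reconstruction only at the masked positions. A minimal sketch of the masking step, where zeroing stands in for a learned mask token:

```python
import numpy as np

def mask_tokens(tokens, ratio=0.75, seed=0):
    """SimMIM-style random masking: hide `ratio` of the tokens (zeroing
    is a simplification of a learned mask token) and return the boolean
    mask so a decoder can be supervised on the hidden positions only."""
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[: int(n * ratio)]] = True
    masked = tokens.copy()
    masked[mask] = 0.0
    return masked, mask

tokens = np.arange(1.0, 101.0).reshape(100, 1)   # 100 dummy 1-d tokens
masked, mask = mask_tokens(tokens)
```

In CASWiT the same principle is applied twice, to the UHR tokens and to the spatially corresponding low-resolution center region, so both branches learn from the reconstruction signal.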
cs.CV / 33

Enhancing Vision Language Models with Logic Reasoning for Situational Awareness

通过逻辑推理增强视觉语言模型以提高情境意识
Pradeep, Pavana, Kant, Krishna, Yu, Suya
Abstract
Vision-Language Models (VLMs) offer the ability to generate high-level, interpretable descriptions of complex activities from images and videos, making them valuable for situational awareness (SA) applications. In such settings, the focus is on identifying infrequent but significant events with high reliability and accuracy, while also extracting fine-grained details and assessing recognition quality. In this paper, we propose an approach that integrates VLMs with traditional computer vision methods through explicit logic reasoning to enhance SA in three key ways: (a) extracting fine-grained event details, (b) employing an intelligent fine-tuning (FT) strategy that achieves substantially higher accuracy than uninformed selection, and (c) generating justifications for VLM outputs during inference. We demonstrate that our intelligent FT mechanism improves the accuracy and provides a valuable means, during inferencing, to either confirm the validity of the VLM output or indicate why it may be questionable.
Chinese Translation
视觉语言模型(VLMs)能够从图像和视频中生成复杂活动的高层次、可解释的描述,使其在情境意识(SA)应用中具有重要价值。在这种环境下,重点是以高可靠性和准确性识别不频繁但重要的事件,同时提取细粒度细节并评估识别质量。本文提出了一种方法,通过显式逻辑推理将VLM与传统计算机视觉方法相结合,以在三个关键方面增强SA:(a)提取细粒度事件细节,(b)采用一种智能微调(FT)策略,其准确性显著高于无信息选择,以及(c)在推理过程中为VLM输出生成解释。我们证明了我们的智能FT机制提高了准确性,并在推理过程中提供了一种有价值的手段,以确认VLM输出的有效性或指示其可能存在的问题。
cs.CV / 34

Beer-Lambert Autoencoder for Unsupervised Stain Representation Learning and Deconvolution in Multi-immunohistochemical Brightfield Histology Images

用于多重免疫组化明场组织学图像的无监督染色表示学习与去卷积的Beer-Lambert自编码器
Eastwood, Mark, McKee, Thomas, Hu, Zedong, Tejpar, Sabine, Minhas, Fayyaz
Abstract
Separating the contributions of individual chromogenic stains in RGB histology whole slide images (WSIs) is essential for stain normalization, quantitative assessment of marker expression, and cell-level readouts in immunohistochemistry (IHC). Classical Beer-Lambert (BL) color deconvolution is well-established for two- or three-stain settings, but becomes under-determined and unstable for multiplex IHC (mIHC) with K>3 chromogens. We present a simple, data-driven encoder-decoder architecture that learns cohort-specific stain characteristics for mIHC RGB WSIs and yields crisp, well-separated per-stain concentration maps. The encoder is a compact U-Net that predicts K nonnegative concentration channels; the decoder is a differentiable BL forward model with a learnable stain matrix initialized from typical chromogen hues. Training is unsupervised with a perceptual reconstruction objective augmented by loss terms that discourage unnecessary stain mixing. On a colorectal mIHC panel comprising 5 stains (H, CDX2, MUC2, MUC5, CD8) we show excellent RGB reconstruction, and significantly reduced inter-channel bleed-through compared with matrix-based deconvolution. Code and model are available at https://github.com/measty/StainQuant.git.
Chinese Translation
在RGB组织学全幻灯片图像(WSIs)中分离各个染色剂的贡献对于染色标准化、标记表达的定量评估以及免疫组化(IHC)中的细胞级读数至关重要。经典的Beer-Lambert(BL)颜色去卷积在两种或三种染色剂的情况下已得到充分验证,但在具有K>3种染色剂的多重免疫组化(mIHC)中则变得不确定且不稳定。我们提出了一种简单的数据驱动编码器-解码器架构,能够学习mIHC RGB WSIs的特定群体染色特征,并生成清晰、良好分离的每种染色剂浓度图。编码器是一个紧凑的U-Net,预测K个非负浓度通道;解码器是一个可微分的BL前向模型,具有从典型染色剂色调初始化的可学习染色矩阵。训练是无监督的,采用感知重建目标,并通过损失项来抑制不必要的染色混合。在包含5种染色剂(H、CDX2、MUC2、MUC5、CD8)的结直肠mIHC面板上,我们展示了优异的RGB重建效果,并显著减少了与基于矩阵的去卷积相比的通道间串扰。代码和模型可在https://github.com/measty/StainQuant.git获取。
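For context, the classical K ≤ 3 case that the paper generalizes fits in a few lines: optical density is linear in the stain concentrations, so a full-rank stain matrix can be inverted by least squares. The two-stain matrix below is hypothetical, and unit incident light is assumed:

```python
import numpy as np

def bl_forward(conc, stain_matrix):
    """Beer-Lambert forward model: optical density is linear in stain
    concentrations, OD = S @ c, and transmitted RGB intensity is
    I = exp(-OD) (unit incident light assumed)."""
    od = stain_matrix @ conc          # (3, K) @ (K,) -> (3,)
    return np.exp(-od)

def bl_deconvolve(rgb, stain_matrix):
    """Classical matrix-based deconvolution: recover concentrations from
    RGB by least squares. Well-posed only for K <= 3 stains, which is
    exactly why the paper replaces this step with a learned encoder."""
    od = -np.log(np.clip(rgb, 1e-6, 1.0))
    c, *_ = np.linalg.lstsq(stain_matrix, od, rcond=None)
    return c

S = np.array([[0.65, 0.07],           # hypothetical 2-stain color vectors
              [0.70, 0.99],
              [0.29, 0.11]])
c_true = np.array([0.8, 0.3])
rgb = bl_forward(c_true, S)
```

With K > 3 chromogens the 3×K system is under-determined, so the paper's U-Net encoder plus differentiable BL decoder replaces `bl_deconvolve` while keeping `bl_forward` as the reconstruction model.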
cs.CV / 35

Assessing Building Heat Resilience Using UAV and Street-View Imagery with Coupled Global Context Vision Transformer

利用无人机和街景图像评估建筑热适应性:结合全球上下文视觉变换器
Knoblauch, Steffen, Muthusamy, Ram Kumar, Li, Hao, Chazua, Iddy, Adamu, Benedcto, Maholi, Innocent, Zipf, Alexander
Abstract
Climate change is intensifying human heat exposure, particularly in densely built urban centers of the Global South. Low-cost construction materials and high thermal-mass surfaces further exacerbate this risk. Yet scalable methods for assessing such heat-relevant building attributes remain scarce. We propose a machine learning framework that fuses openly available unmanned aerial vehicle (UAV) and street-view (SV) imagery via a coupled global context vision transformer (CGCViT) to learn heat-relevant representations of urban structures. Thermal infrared (TIR) measurements from HotSat-1 are used to quantify the relationship between building attributes and heat-associated health risks. Our dual-modality cross-view learning approach outperforms the best single-modality models by up to 9.3%, demonstrating that UAV and SV imagery provide valuable complementary perspectives on urban structures. The presence of vegetation surrounding buildings (versus no vegetation), brighter roofing (versus darker roofing), and roofing made of concrete, clay, or wood (versus metal or tarpaulin) are all significantly associated with lower HotSat-1 TIR values. Deployed across the city of Dar es Salaam, Tanzania, the proposed framework illustrates how household-level inequalities in heat exposure - often linked to socio-economic disadvantage and reflected in building materials - can be identified and addressed using machine learning. Our results point to the critical role of localized, data-driven risk assessment in shaping climate adaptation strategies that deliver equitable outcomes.
Chinese Translation
气候变化正在加剧人类的热暴露,尤其是在全球南方的密集城市中心。低成本的建筑材料和高热质量表面进一步加剧了这一风险。然而,评估这些与热相关的建筑属性的可扩展方法仍然稀缺。我们提出了一种机器学习框架,通过结合全球上下文视觉变换器(CGCViT),融合公开可用的无人机(UAV)和街景(SV)图像,以学习城市结构的与热相关的表示。利用HotSat-1的热红外(TIR)测量来量化建筑属性与热相关健康风险之间的关系。我们的双模态交叉视图学习方法比最佳单模态模型的性能提高了多达9.3%,证明了UAV和SV图像为城市结构提供了有价值的互补视角。建筑周围的植被存在(与无植被相比)、更明亮的屋顶(与较暗的屋顶相比)以及由混凝土、粘土或木材制成的屋顶(与金属或防水布相比)均显著与较低的HotSat-1 TIR值相关。该框架在坦桑尼亚达累斯萨拉姆市部署,展示了如何利用机器学习识别和解决家庭层面的热暴露不平等——这通常与社会经济劣势相关,并反映在建筑材料上。我们的结果指出了本地化、数据驱动的风险评估在塑造气候适应策略中发挥的关键作用,以实现公平的结果。
cs.CV / 36

Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding

思考-剪辑-采样:视频理解中的慢-快帧选择
Tan, Wenhui, Song, Ruihua, Li, Jiaze, Ju, Jianzhong, Luo, Zhenbo
Abstract
Recent progress in multi-modal large language models (MLLMs) has significantly advanced video understanding. However, their performance on long-form videos remains limited by computational constraints and suboptimal frame selection. We present Think-Clip-Sample (TCS), a training-free framework that enhances long video understanding through two key components: (i) Multi-Query Reasoning, which generates multiple queries to capture complementary aspects of the question and video; and (ii) Clip-level Slow-Fast Sampling, which adaptively balances dense local details and sparse global context. Extensive experiments on MLVU, LongVideoBench, and VideoMME demonstrate that TCS consistently improves performance across different MLLMs, boosting accuracy by up to 6.9%, and can achieve comparable accuracy at 50% lower inference-time cost, highlighting both the efficiency and efficacy of TCS on long video understanding.
Chinese Translation
近年来,多模态大语言模型(MLLMs)的进展显著推动了视频理解的发展。然而,它们在长视频上的表现仍受到计算限制和次优帧选择的制约。我们提出了思考-剪辑-采样(Think-Clip-Sample, TCS),这是一个无训练框架,通过两个关键组件增强长视频理解:(i)多查询推理(Multi-Query Reasoning),生成多个查询以捕捉问题和视频的互补方面;(ii)剪辑级慢-快采样(Clip-level Slow-Fast Sampling),自适应平衡密集的局部细节和稀疏的全局上下文。在MLVU、LongVideoBench和VideoMME上的大量实验表明,TCS在不同的MLLMs中始终提高了性能,准确率提升高达6.9%,并且能够以减少50%的推理时间成本实现可比的准确率,突显了TCS在长视频理解中的效率和有效性。
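Clip-level slow-fast sampling can be sketched as splitting a fixed frame budget between a dense pass over the question-relevant clip and a sparse global pass. The 50/50 split below is an assumption for illustration:

```python
def slow_fast_indices(n_frames, clip, budget=16, dense_ratio=0.5):
    """Sketch of clip-level slow-fast frame selection: spend
    `dense_ratio` of the frame budget densely inside the relevant clip
    [start, end) and the remainder uniformly over the whole video for
    global context. Duplicate indices are merged."""
    start, end = clip
    n_dense = max(1, int(budget * dense_ratio))
    n_sparse = budget - n_dense
    dense = [start + i * (end - start) // n_dense for i in range(n_dense)]
    sparse = [i * n_frames // n_sparse for i in range(n_sparse)]
    return sorted(set(dense + sparse))

# 1000-frame video; the reasoning stage located frames 400-500 as relevant.
idx = slow_fast_indices(1000, (400, 500))
```

In TCS the relevant clip itself comes from the multi-query reasoning stage; here it is simply given.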
cs.CV / 37

Heterogeneous Uncertainty-Guided Composed Image Retrieval with Fine-Grained Probabilistic Learning

异质不确定性引导的细粒度概率学习组合图像检索
Tang, Haomiao, Wang, Jinpeng, Zhao, Minyi, Meng, Guanghao, Luo, Ruisheng, Chen, Long, Xia, Shu-Tao
Abstract
Composed Image Retrieval (CIR) enables image search by combining a reference image with modification text. Intrinsic noise in CIR triplets incurs intrinsic uncertainty and threatens the model's robustness. Probabilistic learning approaches have shown promise in addressing such issues; however, they fall short for CIR due to their instance-level holistic modeling and homogeneous treatment of queries and targets. This paper introduces a Heterogeneous Uncertainty-Guided (HUG) paradigm to overcome these limitations. HUG utilizes a fine-grained probabilistic learning framework, where queries and targets are represented by Gaussian embeddings that capture detailed concepts and uncertainties. We customize heterogeneous uncertainty estimations for multi-modal queries and uni-modal targets. Given a query, we capture uncertainties not only regarding uni-modal content quality but also multi-modal coordination, followed by a provable dynamic weighting mechanism to derive comprehensive query uncertainty. We further design uncertainty-guided objectives, including query-target holistic contrast and fine-grained contrasts with comprehensive negative sampling strategies, which effectively enhance discriminative learning. Experiments on benchmarks demonstrate HUG's effectiveness beyond state-of-the-art baselines, with faithful analysis justifying the technical contributions.
Chinese Translation
组合图像检索(CIR)通过将参考图像与修改文本结合,实现图像搜索。CIR 三元组中的内在噪声导致内在不确定性,并威胁模型的鲁棒性。概率学习方法在解决此类问题上显示出潜力;然而,由于它们在实例级别的整体建模和对查询与目标的同质处理,导致在 CIR 中效果不佳。本文提出了一种异质不确定性引导(HUG)范式,以克服这些局限性。HUG 利用细粒度概率学习框架,其中查询和目标通过高斯嵌入表示,捕捉详细的概念和不确定性。我们为多模态查询和单模态目标定制异质不确定性估计。在给定查询时,我们不仅捕捉单模态内容质量的不确定性,还捕捉多模态协调的不确定性,随后通过可证明的动态加权机制推导出综合查询不确定性。我们进一步设计了不确定性引导的目标,包括查询-目标整体对比和细粒度对比,并结合全面的负采样策略,有效增强了区分学习。基准实验表明,HUG 的有效性超越了最先进的基线,并通过可信的分析证明了其技术贡献。
cs.CV / 38

SUG-Occ: An Explicit Semantics and Uncertainty Guided Sparse Learning Framework for Real-Time 3D Occupancy Prediction

SUG-Occ:一种显式语义和不确定性引导的稀疏学习框架用于实时3D占用预测
Wu, Hanlin, Lin, Pengfei, Javanmardi, Ehsan, Bao, Nanren, Qian, Bo, Si, Hao, Tsukada, Manabu
Abstract
As autonomous driving moves toward full scene understanding, 3D semantic occupancy prediction has emerged as a crucial perception task, offering voxel-level semantics beyond traditional detection and segmentation paradigms. However, such a refined representation for scene understanding incurs prohibitive computation and memory overhead, posing a major barrier to practical real-time deployment. To address this, we propose SUG-Occ, an explicit Semantics and Uncertainty Guided Sparse Learning Enabled 3D Occupancy Prediction Framework, which exploits the inherent sparsity of 3D scenes to reduce redundant computation while maintaining geometric and semantic completeness. Specifically, we first utilize semantic and uncertainty priors to suppress projections from free space during view transformation while employing an explicit unsigned distance encoding to enhance geometric consistency, producing a structurally consistent sparse 3D representation. Secondly, we design a cascade sparse completion module via hyper cross sparse convolution and generative upsampling to enable efficient coarse-to-fine reasoning. Finally, we devise an object contextual representation (OCR) based mask decoder that aggregates global semantic context from sparse features and refines voxel-wise predictions via lightweight query-context interactions, avoiding expensive attention operations over volumetric features. Extensive experiments on the SemanticKITTI benchmark demonstrate that the proposed approach outperforms the baselines, achieving a 7.34% improvement in accuracy and a 57.8% gain in efficiency.
Chinese Translation
随着自动驾驶技术向全面场景理解发展,3D语义占用预测已成为一项关键的感知任务,提供了超越传统检测和分割范式的体素级语义。然而,这种精细的场景理解表示需要巨大的计算和内存开销,成为实际实时部署的主要障碍。为此,我们提出了SUG-Occ,一个显式语义和不确定性引导的稀疏学习启用的3D占用预测框架,利用3D场景的固有稀疏性来减少冗余计算,同时保持几何和语义的完整性。具体而言,我们首先利用语义和不确定性先验在视图变换过程中抑制自由空间的投影,同时采用显式无符号距离编码来增强几何一致性,从而生成结构一致的稀疏3D表示。其次,我们设计了一个级联稀疏补全模块,通过超交叉稀疏卷积和生成上采样实现高效的粗到细推理。最后,我们构建了一个基于对象上下文表示(OCR)的掩码解码器,该解码器从稀疏特征中聚合全局语义上下文,并通过轻量级的查询-上下文交互来细化体素级预测,避免对体积特征进行昂贵的注意力操作。在SemanticKITTI基准上的大量实验表明,所提出的方法优于基线,准确率提高了7.34%,效率提升了57.8%。
cs.CV / 39

Wetland mapping from sparse annotations with satellite image time series and temporal-aware segment anything model

基于稀疏标注的湿地映射:卫星影像时间序列与时间感知的任意分割模型
Yuan, Shuai, Lin, Tianwu, Chen, Shuang, Xia, Yu, Qin, Peng, Liu, Xiangyu, Xu, Xiaoqing, Xu, Nan, Zhang, Hongsheng, Wang, Jie, Gong, Peng
Abstract
Accurate wetland mapping is essential for ecosystem monitoring, yet dense pixel-level annotation is prohibitively expensive, so practical applications usually rely on sparse point labels, under which existing deep learning models perform poorly. Strong seasonal and inter-annual wetland dynamics further render single-date imagery inadequate and lead to significant mapping errors. Although foundation models such as SAM show promising generalization from point prompts, they are inherently designed for static images and fail to model temporal information, resulting in fragmented masks in heterogeneous wetlands. To overcome these limitations, we propose WetSAM, a SAM-based framework that integrates satellite image time series for wetland mapping from sparse point supervision through a dual-branch design: a temporally prompted branch extends SAM with hierarchical adapters and dynamic temporal aggregation to disentangle wetland characteristics from phenological variability, a spatial branch employs a temporally constrained region-growing strategy to generate reliable dense pseudo-labels, and a bidirectional consistency regularization jointly optimizes both branches. Extensive experiments across eight global regions of approximately 5,000 km² each demonstrate that WetSAM substantially outperforms state-of-the-art methods, achieving an average F1-score of 85.58% and delivering accurate and structurally consistent wetland segmentation with minimal labeling effort, highlighting its strong generalization capability and potential for scalable, low-cost, high-resolution wetland mapping.
Chinese Translation
准确的湿地映射对于生态系统监测至关重要,但密集的像素级标注成本高昂,实际应用通常依赖于稀疏的点标签。在这种情况下,现有的深度学习模型表现不佳,而强烈的季节性和年际湿地动态进一步使单日期影像不足,导致显著的映射误差。尽管基础模型如SAM(Segment Anything Model)在点提示下显示出良好的泛化能力,但它们本质上是为静态图像设计的,无法建模时间信息,导致在异质湿地中生成碎片化的掩膜。为克服这些局限性,我们提出了WetSAM,一个基于SAM的框架,通过双分支设计整合卫星影像时间序列,从稀疏点监督中进行湿地映射。其中,一个时间提示分支通过层次适配器和动态时间聚合扩展SAM,以从物候变异中解开湿地特征,另一个空间分支采用时间约束的区域生长策略生成可靠的密集伪标签,同时双向一致性正则化共同优化两个分支。在八个全球区域(每个区域约5000平方公里)进行的大量实验表明,WetSAM显著优于最先进的方法,平均F1-score达到85.58%,并以最小的标注努力提供准确且结构一致的湿地分割,突显了其强大的泛化能力和可扩展、低成本、高分辨率湿地映射的潜力。
cs.CV / 40

SME-YOLO: A Real-Time Detector for Tiny Defect Detection on PCB Surfaces

SME-YOLO:一种用于PCB表面微小缺陷检测的实时检测器
Han, Meng
Abstract
Surface defects on Printed Circuit Boards (PCBs) directly compromise product reliability and safety. However, achieving high-precision detection is challenging because PCB defects are typically characterized by tiny sizes, high texture similarity, and uneven scale distributions. To address these challenges, this paper proposes a novel framework based on YOLOv11n, named SME-YOLO (Small-target Multi-scale Enhanced YOLO). First, we employ the Normalized Wasserstein Distance Loss (NWDLoss). This metric effectively mitigates the sensitivity of Intersection over Union (IoU) to positional deviations in tiny objects. Second, the original upsampling module is replaced by the Efficient Upsampling Convolution Block (EUCB). By utilizing multi-scale convolutions, the EUCB gradually recovers spatial resolution and enhances the preservation of edge and texture details for tiny defects. Finally, this paper proposes the Multi-Scale Focused Attention (MSFA) module. Tailored to the specific spatial distribution of PCB defects, this module adaptively strengthens perception within key scale intervals, achieving efficient fusion of local fine-grained features and global context information. Experimental results on the PKU-PCB dataset demonstrate that SME-YOLO achieves state-of-the-art performance. Specifically, compared to the baseline YOLOv11n, SME-YOLO improves mAP by 2.2% and Precision by 4%, validating the effectiveness of the proposed method.
Chinese Translation
印刷电路板(PCB)表面的缺陷直接影响产品的可靠性和安全性。然而,由于PCB缺陷通常具有微小尺寸、高纹理相似性和不均匀的尺度分布,实现高精度检测具有挑战性。为了解决这些问题,本文提出了一种基于YOLOv11n的新框架,命名为SME-YOLO(小目标多尺度增强YOLO)。首先,我们采用归一化Wasserstein距离损失(NWDLoss)。该度量有效减轻了交并比(IoU)对微小物体位置偏差的敏感性。其次,原始的上采样模块被高效上采样卷积块(EUCB)所替代。通过利用多尺度卷积,EUCB逐步恢复空间分辨率,并增强微小缺陷的边缘和纹理细节的保留。最后,本文提出了多尺度聚焦注意力(MSFA)模块。该模块针对PCB缺陷的特定空间分布,自适应地增强关键尺度区间内的感知,实现局部细粒度特征与全局上下文信息的高效融合。在PKU-PCB数据集上的实验结果表明,SME-YOLO达到了最先进的性能。具体而言,与基线YOLOv11n相比,SME-YOLO的mAP提高了2.2%,精确度提高了4%,验证了所提方法的有效性。
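NWDLoss builds on the Normalized Wasserstein Distance commonly used in the tiny-object detection literature: each box is modeled as a 2-D Gaussian and the closed-form Wasserstein distance between the Gaussians is exponentiated. The constant C below is a placeholder, not the paper's value:

```python
import math

def nwd(box_a, box_b, C=12.8):
    """Normalized Wasserstein Distance between axis-aligned boxes
    (cx, cy, w, h), each modeled as the 2-D Gaussian
    N((cx, cy), diag(w^2/4, h^2/4)). C is a dataset-dependent scale;
    12.8 is an assumed placeholder. NWDLoss is then 1 - nwd."""
    (cxa, cya, wa, ha), (cxb, cyb, wb, hb) = box_a, box_b
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / C)
```

Unlike IoU, which drops to zero as soon as two tiny boxes stop overlapping, this measure decays smoothly with center offset, which is the property the abstract's first contribution exploits: a one-pixel shift of a 4-pixel defect barely changes the score.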
cs.CV / 41

Topology-Guaranteed Image Segmentation: Enforcing Connectivity, Genus, and Width Constraints

拓扑保证的图像分割:强制连接性、属和宽度约束
Li, Wenxiao, Tai, Xue-Cheng, Liu, Jun
Abstract
Existing research highlights the crucial role of topological priors in image segmentation, particularly in preserving essential structures such as connectivity and genus. Accurately capturing these topological features often requires incorporating width-related information, including the thickness and length inherent to the image structures. However, traditional mathematical definitions of topological structures lack this dimensional width information, preventing methods like persistent homology from fully addressing practical segmentation needs. To overcome this limitation, we propose a novel mathematical framework that explicitly integrates width information into the characterization of topological structures. This method leverages persistent homology, complemented by smoothing concepts from partial differential equations (PDEs), to modify local extrema of upper-level sets. This approach enables the resulting topological structures to inherently capture width properties. We incorporate this enhanced topological description into variational image segmentation models. Using suitable loss functions, we are also able to design neural networks that can segment images with the required topological and width properties. Through variational constraints on the relevant topological energies, our approach successfully preserves essential topological invariants such as connectivity and genus counts, while simultaneously ensuring that segmented structures retain critical width attributes, including line thickness and length. Numerical experiments demonstrate the effectiveness of our method, showcasing its capability to maintain topological fidelity while explicitly embedding width characteristics into segmented image structures.
Chinese Translation
现有研究强调了拓扑先验在图像分割中的关键作用,特别是在保持连接性和属等基本结构方面。准确捕捉这些拓扑特征通常需要结合与宽度相关的信息,包括图像结构固有的厚度和长度。然而,传统的拓扑结构数学定义缺乏这种维度宽度信息,限制了持久同调(persistent homology)等方法无法完全满足实际分割需求。为克服这一限制,我们提出了一种新颖的数学框架,明确将宽度信息整合到拓扑结构的表征中。该方法利用持久同调,并结合偏微分方程(PDEs)中的平滑概念,修改上层集的局部极值。这种方法使得生成的拓扑结构能够固有地捕捉宽度属性。我们将这种增强的拓扑描述纳入变分图像分割模型中。通过一些适当的损失函数,我们还能够设计出能够分割具有所需拓扑和宽度属性的图像的神经网络。通过对相关拓扑能量的变分约束,我们的方法成功地保持了连接性和属计数等基本拓扑不变量,同时确保分割结构保留关键的宽度属性,包括线条厚度和长度。数值实验验证了我们方法的有效性,展示了其在保持拓扑保真度的同时,明确嵌入宽度特征到分割图像结构中的能力。
cs.CV / 42

PubMed-OCR: PMC Open Access OCR Annotations

PubMed-OCR:PMC开放获取OCR注释
Heidenreich, Hunter, Getachew, Yosheb, Dinica, Olivia, Elliott, Ben
Abstract
PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.
Chinese Translation
PubMed-OCR是一个以OCR为中心的科学文章语料库,来源于PubMed Central开放获取PDF文档。每个页面图像都经过Google Cloud Vision的注释,并以紧凑的JSON格式发布,包含单词、行和段落级的边界框。该语料库涵盖了209.5K篇文章(1.5M页面;约13亿个单词),支持布局感知建模、坐标基础的问答(QA)以及OCR依赖管道的评估。我们分析了语料库的特征(例如,期刊覆盖范围和检测到的布局特征),并讨论了局限性,包括对单一OCR引擎的依赖和启发式行重建。我们发布了数据和模式,以促进下游研究,并邀请扩展。
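The word-, line-, and paragraph-level bounding boxes make simple layout queries straightforward. A minimal Python sketch, with hypothetical field names (`words`, `bbox` as `[x0, y0, x1, y1]`) standing in for the released schema:

```python
import json

# Hypothetical record shaped like the word-/line-level schema the abstract
# describes; the field names here are assumptions, not the released
# PubMed-OCR schema.
page = json.loads("""
{
  "page": 1,
  "words": [
    {"text": "Results", "bbox": [72, 100, 140, 118]},
    {"text": "were", "bbox": [146, 100, 186, 118]}
  ],
  "lines": [
    {"text": "Results were", "bbox": [72, 100, 186, 118], "word_ids": [0, 1]}
  ]
}
""")

def words_in_box(page, box):
    """Return the text of words whose box lies fully inside (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [w["text"] for w in page["words"]
            if w["bbox"][0] >= x0 and w["bbox"][1] >= y0
            and w["bbox"][2] <= x1 and w["bbox"][3] <= y1]

print(words_in_box(page, (0, 0, 200, 200)))  # both words fall inside
```

Coordinate-grounded QA over such a corpus reduces to queries of exactly this shape: retrieve the words inside a region, then answer from them.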
cs.CV / 43

Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps

Map2Thought:通过度量认知地图进行显式的三维空间推理
Gao, Xiangjun, Zhang, Zhensong, Chen, Dave Zhenyu, Xu, Songcen, Quan, Long, Pérez-Pellitero, Eduardo, Jang, Youngkyoon
Abstract
We propose Map2Thought, a framework that enables explicit and interpretable spatial reasoning for 3D VLMs. The framework is grounded in two key components: Metric Cognitive Map (Metric-CogMap) and Cognitive Chain-of-Thought (Cog-CoT). Metric-CogMap provides a unified spatial representation by integrating a discrete grid for relational reasoning with a continuous, metric-scale representation for precise geometric understanding. Building upon the Metric-CogMap, Cog-CoT performs explicit geometric reasoning through deterministic operations, including vector operations, bounding-box distances, and occlusion-aware appearance order cues, producing interpretable inference traces grounded in 3D structure. Experimental results show that Map2Thought enables explainable 3D understanding, achieving 59.9% accuracy using only half the supervision, closely matching the 60.9% baseline trained with the full dataset. It consistently outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% under 10%, 25%, and 50% training subsets, respectively, on the VSI-Bench.
Chinese Translation
我们提出了Map2Thought,一个能够实现显式和可解释的三维视觉语言模型(3D VLMs)空间推理的框架。该框架基于两个关键组件:度量认知地图(Metric Cognitive Map, Metric-CogMap)和认知思维链(Cognitive Chain-of-Thought, Cog-CoT)。Metric-CogMap通过将离散网格与连续的度量尺度表示相结合,提供了一种统一的空间表示,以便进行关系推理和精确的几何理解。在Metric-CogMap的基础上,Cog-CoT通过确定性操作(包括向量运算、边界框距离和考虑遮挡的外观顺序线索)执行显式几何推理,生成基于三维结构的可解释推理轨迹。实验结果表明,Map2Thought实现了可解释的三维理解,在仅使用一半的监督下达到了59.9%的准确率,接近使用完整数据集训练的60.9%的基线。它在VSI-Bench的10%、25%和50%的训练子集下,分别比最先进的方法提高了5.3%、4.8%和4.0%的性能。
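The deterministic operations Cog-CoT builds on (vector operations, bounding-box distances) are ordinary computational geometry. A self-contained sketch of two such primitives, using assumed 2D axis-aligned boxes rather than the paper's 3D representation:

```python
import math

def bbox_distance(a, b):
    """Minimum Euclidean distance between two axis-aligned 2D boxes
    (x0, y0, x1, y1); 0.0 if they overlap."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    dx = max(bx0 - ax1, ax0 - bx1, 0.0)
    dy = max(by0 - ay1, ay0 - by1, 0.0)
    return math.hypot(dx, dy)

def direction(p, q):
    """Unit vector from point p to point q: the kind of deterministic
    vector operation a Cog-CoT-style trace could cite."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    n = math.hypot(dx, dy)
    return (dx / n, dy / n)

print(bbox_distance((0, 0, 1, 1), (4, 0, 5, 1)))  # 3.0
```

Because every step is a closed-form computation over map coordinates, the resulting reasoning trace is reproducible and auditable, which is the interpretability claim the abstract makes.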
cs.CV / 44

PRISM-CAFO: Prior-conditioned Remote-sensing Infrastructure Segmentation and Mapping for CAFOs

PRISM-CAFO:基于先验条件的集中动物饲养设施的遥感分割与制图
Hoque, Oishee Bintey, Mandal, Nibir Chandra, Luong, Kyle, Wilson, Amanda, Swarup, Samarth, Marathe, Madhav, Adiga, Abhijin
Abstract
Large-scale livestock operations pose significant risks to human health and the environment, while also being vulnerable to threats such as infectious diseases and extreme weather events. As the number of such operations continues to grow, accurate and scalable mapping has become increasingly important. In this work, we present an infrastructure-first, explainable pipeline for identifying and characterizing Concentrated Animal Feeding Operations (CAFOs) from aerial and satellite imagery. Our method (1) detects candidate infrastructure (e.g., barns, feedlots, manure lagoons, silos) with a domain-tuned YOLOv8 detector, then derives SAM2 masks from these boxes and filters them by component-specific criteria, (2) extracts structured descriptors (e.g., counts, areas, orientations, and spatial relations) and fuses them with deep visual features using a lightweight spatial cross-attention classifier, and (3) outputs both CAFO type predictions and mask-level attributions that link decisions to visible infrastructure. Through comprehensive evaluation, we show that our approach achieves state-of-the-art performance, with Swin-B+PRISM-CAFO surpassing the best-performing baseline by up to 15%. Beyond strong predictive performance across diverse U.S. regions, we run systematic gradient-activation analyses that quantify the impact of domain priors and show ho
Chinese Translation
大规模的畜牧业运营对人类健康和环境构成了重大风险,同时也容易受到传染病和极端天气事件等威胁。随着此类运营数量的不断增加,准确且可扩展的制图变得愈发重要。在本研究中,我们提出了一种以基础设施为先、可解释的流程,用于从航空和卫星影像中识别和表征集中动物饲养设施(CAFOs)。我们的方法包括:(1) 使用领域调优的YOLOv8检测器检测候选基础设施(如谷仓、饲料场、粪便池、筒仓),然后从这些框中推导出SAM2掩膜并过滤特定组件标准;(2) 提取结构化描述符(如计数、面积、方向和空间关系),并使用轻量级空间交叉注意力分类器将其与深度视觉特征融合;(3) 输出CAFO类型预测和掩膜级归因,将决策与可见基础设施关联起来。通过全面评估,我们展示了我们的方法达到了最先进的性能,Swin-B+PRISM-CAFO在表现最佳的基线之上提升了多达15%。除了在美国不同地区的强大预测性能外,我们还进行了系统的梯度激活分析,以量化领域先验的影响。
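The structured-descriptor step (counts and areas per infrastructure class) can be illustrated with toy masks. The class names below follow the abstract; the mask format and aggregation are illustrative, not PRISM-CAFO's implementation:

```python
# Tiny binary grids stand in for SAM2 mask outputs.
def mask_area(mask):
    return sum(sum(row) for row in mask)

def descriptors(detections):
    """detections: list of (class_name, binary_mask) pairs.
    Returns per-class counts and total pixel areas."""
    out = {}
    for cls, mask in detections:
        d = out.setdefault(cls, {"count": 0, "total_area": 0})
        d["count"] += 1
        d["total_area"] += mask_area(mask)
    return out

dets = [
    ("barn", [[1, 1], [1, 1]]),
    ("barn", [[1, 0], [0, 0]]),
    ("manure_lagoon", [[1, 1, 1]]),
]
print(descriptors(dets))
```

Descriptors of this form are what gets fused with deep visual features downstream, and they are also what makes the mask-level attributions legible to a human reviewer.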
cs.CV / 45

MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models

MHA2MLA-VLM:实现DeepSeek在视觉-语言模型中的经济多头潜在注意力
Fan, Xiaoran, Sun, Zhichao, Ji, Tao, Shen, Lixing, Gui, Tao
Abstract
As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce parameter-efficient fine-tuning to minimize adaptation cost and demonstrate that minimizing output activation error, rather than parameter distance, substantially reduces performance loss. Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.
Chinese Translation
随着视觉-语言模型(VLMs)处理越来越复杂和多模态的任务,快速增长的键值(KV)缓存在推理过程中造成了显著的内存和计算瓶颈。虽然多头潜在注意力(MLA)提供了一种有效的方法来压缩KV缓存并加速推理,但在没有昂贵的预训练情况下,将现有的VLM适配到MLA架构仍然很大程度上未被探索。在本研究中,我们提出了MHA2MLA-VLM,这是一个参数高效且具备多模态意识的框架,用于将现成的VLM转换为MLA。我们的方法具有两个核心技术:(1)一种模态自适应的部分-RoPE策略,通过选择性地屏蔽非必要维度,支持传统和多模态设置;(2)一种模态解耦的低秩近似方法,独立压缩视觉和文本的KV空间。此外,我们引入了参数高效的微调以最小化适配成本,并证明最小化输出激活误差而非参数距离显著减少了性能损失。在三个代表性VLM上的广泛实验表明,MHA2MLA-VLM以最少的监督数据恢复了原始模型性能,显著减少了KV缓存占用,并与KV量化无缝集成。
cs.CV / 46

Generative Scenario Rollouts for End-to-End Autonomous Driving

用于端到端自主驾驶的生成场景回放
Yasarla, Rajeev, Hegde, Deepti, Han, Shizhong, Cheng, Hsin-Pai, Shi, Yunxiao, Sadeghigooghari, Meysam, Mahajan, Shweta, Bhattacharyya, Apratim, Liu, Litian, Garrepalli, Risheek, Svantesson, Thomas, Porikli, Fatih, Cai, Hong
Abstract
Vision-Language-Action (VLA) models are emerging as highly effective planning models for end-to-end autonomous driving systems. However, current works mostly rely on imitation learning from sparse trajectory annotations and under-utilize their potential as generative models. We propose Generative Scenario Rollouts (GeRo), a plug-and-play framework for VLA models that jointly performs planning and generation of language-grounded future traffic scenes through an autoregressive rollout strategy. First, a VLA model is trained to encode ego vehicle and agent dynamics into latent tokens under supervision from planning, motion, and language tasks, facilitating text-aligned generation. Next, GeRo performs language-conditioned autoregressive generation. Given multi-view images, a scenario description, and ego-action questions, it generates future latent tokens and textual responses to guide long-horizon rollouts. A rollout-consistency loss stabilizes predictions using ground truth or pseudo-labels, mitigating drift and preserving text-action alignment. This design enables GeRo to perform temporally consistent, language-grounded rollouts that support long-horizon reasoning and multi-agent planning. On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively. By integrating reinforcement learning with generative rollouts, GeRo achieves state-of-the-art closed-loop and open-loop performance, demonstrating strong zero-shot robustness. These results highlight the promise of generative, language-conditioned reasoning as a foundation for safer and more interpretable end-to-end autonomous driving.
Chinese Translation
视觉-语言-动作(VLA)模型作为端到端自主驾驶系统的高效规划模型正在逐渐崭露头角。然而,当前的研究大多依赖于稀疏轨迹注释的模仿学习,未充分发挥其作为生成模型的潜力。我们提出了生成场景回放(Generative Scenario Rollouts,GeRo),这是一个可插拔的VLA模型框架,通过自回归回放策略联合执行规划和生成基于语言的未来交通场景。首先,训练一个VLA模型,在规划、运动和语言任务的监督下,将自车和代理的动态编码为潜在标记,从而促进文本对齐的生成。接下来,GeRo执行基于语言的自回归生成。给定多视角图像、场景描述和自车动作问题,它生成未来的潜在标记和文本响应,以指导长时间范围的回放。回放一致性损失使用真实值或伪标签来稳定预测,减轻漂移并保持文本-动作对齐。该设计使GeRo能够执行时间一致的、基于语言的回放,支持长时间范围的推理和多代理规划。在Bench2Drive上,GeRo分别提高了驾驶评分和成功率15.7和26.2。通过将强化学习与生成回放相结合,GeRo实现了最先进的闭环和开环性能,展现出强大的零样本鲁棒性。这些结果突显了生成的、基于语言的推理作为更安全、更可解释的端到端自主驾驶基础的潜力。
cs.CV / 47

ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes

ReScene4D:演变室内3D场景的时间一致性语义实例分割
Steiner, Emily, Zheng, Jianhao, Howard-Jenkins, Henry, Xie, Chris, Armeni, Iro
Abstract
Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining temporally consistent instance identities across intermittently captured 3D scans, even when changes are unobserved. We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (SIS), which jointly segments, identifies, and temporally associates object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and for 4D LiDAR approaches, which perform poorly due to their reliance on high-frequency temporal measurements that are uncommon in the longer-horizon evolution of indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. It explores strategies to share information across observations, demonstrating that this shared context not only enables consistent instance tracking but also improves standard 3DSIS quality. To evaluate this task, we define a new metric, t-mAP, that extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.
Chinese Translation
室内环境随着物体的移动、出现或消失而不断演变。捕捉这些动态需要在间歇性捕获的3D扫描中保持时间一致的实例身份,即使在未观察到变化的情况下。我们引入并形式化了时间稀疏的4D室内语义实例分割(SIS)任务,该任务共同进行物体实例的分割、识别和时间关联。这一设置对现有的3D SIS方法构成了挑战,因为它们由于缺乏时间推理而需要一个离散的匹配步骤;同时,对于4D LiDAR方法而言,由于依赖于在室内环境较长时间演变中不常见的高频时间测量,表现也不佳。我们提出了ReScene4D,这是一种新颖的方法,旨在在不需要密集观测的情况下,将3D SIS架构适配为4D SIS。它探索了跨观测共享信息的策略,证明这种共享上下文不仅能够实现一致的实例跟踪,还能提高标准3D SIS的质量。为了评估这一任务,我们定义了一种新的指标t-mAP,该指标扩展了mAP,以奖励时间身份一致性。ReScene4D在3RScan数据集上实现了最先进的性能,为理解演变中的室内场景建立了新的基准。
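The temporal-identity idea behind t-mAP can be illustrated with a toy consistency check: an object's predicted instance id must not switch between scans. This scoring rule is a simplification for illustration, not the paper's metric:

```python
def consistent_objects(scans):
    """scans: one dict per 3D scan mapping object_key -> predicted instance id.
    Returns the object keys whose predicted id never switches across scans."""
    seen, broken = {}, set()
    for scan in scans:
        for obj, pid in scan.items():
            if obj in seen and seen[obj] != pid:
                broken.add(obj)
            seen[obj] = pid
    return set(seen) - broken

scans = [{"chair": 1, "table": 2}, {"chair": 1, "table": 7}]
print(consistent_objects(scans))  # the table's identity switched between scans
```

A per-scan 3DSIS method with a separate matching step can segment each scan perfectly yet still fail a check like this, which is exactly the gap t-mAP is designed to expose.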
cs.CV / 48

ShapeR: Robust Conditional 3D Shape Generation from Casual Captures

ShapeR:从随意捕获中生成鲁棒的条件3D形状
Siddiqui, Yawar, Frost, Duncan, Aroudj, Samir, Avetisyan, Armen, Howard-Jenkins, Henry, DeTone, Daniel, Moulon, Pierre, Wu, Qirui, Li, Zhengqin, Straub, Julian, Newcombe, Richard, Engel, Jakob
Abstract
Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to state of the art.
Chinese Translation
近期在3D形状生成方面的进展取得了令人瞩目的成果,但大多数现有方法依赖于干净、无遮挡且良好分割的输入。这些条件在现实世界场景中很少满足。我们提出了ShapeR,这是一种从随意捕获的序列中生成条件3D物体形状的新方法。给定一组图像序列,我们利用现成的视觉惯性SLAM、3D检测算法和视觉-语言模型,为每个物体提取一组稀疏的SLAM点、姿态多视图图像和机器生成的标题。经过训练的流形变换器能够有效地对这些模态进行条件生成,从而生成高保真的度量3D形状。为了确保对随意捕获数据挑战的鲁棒性,我们采用了一系列技术,包括实时组合增强、涵盖物体和场景级数据集的课程训练方案,以及处理背景杂乱的策略。此外,我们引入了一个新的评估基准,包含178个在野外捕获的物体,跨越7个真实场景,并附有几何注释。实验表明,ShapeR在这一具有挑战性的环境中显著优于现有方法,与最先进的技术相比,在Chamfer距离上实现了2.7倍的改进。
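The reported 2.7x gain is measured in Chamfer distance; a plain-Python version of the symmetric, squared-distance variant for reference (brute force for clarity, since practical pipelines use KD-trees or GPU batching):

```python
def chamfer(a, b):
    """Symmetric Chamfer distance between two 3D point sets: the mean squared
    nearest-neighbor distance from a to b plus the same from b to a."""
    def nearest_sq(p, pts):
        return min(sum((pi - qi) ** 2 for pi, qi in zip(p, q)) for q in pts)
    return (sum(nearest_sq(p, b) for p in a) / len(a)
            + sum(nearest_sq(q, a) for q in b) / len(b))

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer(a, b))  # 1.0: one point in each set is off by unit distance
```

Note that Chamfer distance conventions vary (squared vs. unsquared, summed vs. averaged, one-sided vs. symmetric); the paper's exact variant is not stated in the abstract.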
cs.CV / 49

UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation

UniX:统一自回归与扩散用于胸部X光理解与生成
Zhang, Ruiheng, Yao, Jingfeng, Zhao, Huangxuan, Yan, Hao, He, Xiao, Chen, Lei, Wei, Zhou, Luo, Yong, Wang, Zengmao, Zhang, Lefei, Tao, Dacheng, Du, Bo
Abstract
Despite recent progress, medical foundation models still struggle to unify visual understanding and generation, as these tasks have inherently conflicting goals: semantic abstraction versus pixel-level reconstruction. Existing approaches, typically based on parameter-shared autoregressive architectures, frequently lead to compromised performance in one or both tasks. To address this, we present UniX, a next-generation unified medical foundation model for chest X-ray understanding and generation. UniX decouples the two tasks into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. Crucially, a cross-modal self-attention mechanism is introduced to dynamically guide the generation process with understanding features. Coupled with a rigorous data cleaning pipeline and a multi-stage training strategy, this architecture enables synergistic collaboration between tasks while leveraging the strengths of diffusion models for superior generation. On two representative benchmarks, UniX achieves a 46.1% improvement in understanding performance (Micro-F1) and a 24.2% gain in generation quality (FD-RadDino), using only a quarter of the parameters of LLM-CXR. By achieving performance on par with task-specific models, our work establishes a scalable paradigm for synergistic medical image understanding and generation. Codes and models are available at https://github.com/ZrH42/UniX.
Chinese Translation
尽管近期取得了一些进展,医疗基础模型在统一视觉理解与生成方面仍然面临挑战,因为这两项任务本质上存在相互冲突的目标:语义抽象与像素级重建。现有的方法通常基于参数共享的自回归架构,往往导致在一项或两项任务上的性能妥协。为了解决这一问题,我们提出了UniX,一种下一代统一医疗基础模型,用于胸部X光的理解与生成。UniX将这两项任务解耦为理解的自回归分支和高保真生成的扩散分支。关键的是,引入了一种跨模态自注意力机制,利用理解特征动态引导生成过程。结合严格的数据清洗流程和多阶段训练策略,该架构实现了任务之间的协同合作,同时利用扩散模型的优势以获得更优的生成效果。在两个代表性基准上,UniX在理解性能(Micro-F1)上提高了46.1%,在生成质量(FD-RadDino)上提升了24.2%,且仅使用了LLM-CXR四分之一的参数。通过实现与任务特定模型相当的性能,我们的工作建立了一种可扩展的协同医疗图像理解与生成范式。代码和模型可在 https://github.com/ZrH42/UniX 获取。
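The cross-modal attention that steers generation with understanding features follows the standard attention formula, with understanding-branch features acting as keys and values for generation-branch queries. A single-head, pure-Python sketch (illustrative, not UniX's implementation):

```python
import math

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys and
    returns the weight-averaged value vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)                      # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# One generation query attends over two understanding tokens.
out = cross_attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]])
print(out)
```

The query here pulls most of its weight from the first (more similar) understanding token, which is the mechanism by which diagnostic semantics can condition pixel-level generation.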
人工智能 (Artificial Intelligence)
25
cs.AI / 1

Japanese AI Agent System on Human Papillomavirus Vaccination: System Design

日本人乳头瘤病毒疫苗的人工智能代理系统:系统设计
Liu, Junyu, Yang, Siwen, Ma, Dexiu, Niu, Qian, Zhang, Zequn, Nagai-Tanima, Momoko, Aoyama, Tomoki
Abstract
Human papillomavirus (HPV) vaccine hesitancy poses significant public health challenges, particularly in Japan where proactive vaccination recommendations were suspended from 2013 to 2021. The resulting information gap is exacerbated by misinformation on social media, and traditional approaches cannot simultaneously address individual queries while monitoring population-level discourse. This study aimed to develop a dual-purpose AI agent system that provides verified HPV vaccine information through a conversational interface while generating analytical reports for medical institutions based on user interactions and social media. We implemented a system comprising: a vector database integrating academic papers, government sources, news media, and social media; a Retrieval-Augmented Generation chatbot using ReAct agent architecture with multi-tool orchestration across five knowledge sources; and an automated report generation system with modules for news analysis, research synthesis, social media sentiment analysis, and user interaction pattern identification. Performance was assessed using a 0-5 scoring scale. For single-turn evaluation, the chatbot achieved mean scores of 4.83 for relevance, 4.89 for routing, 4.50 for reference quality, 4.90 for correctness, and 4.88 for professional identity (overall 4.80). Multi-turn evaluation yielded higher scores: context retention 4.94, topic coherence 5.00, and overall 4.98. The report generation system achieved completeness 4.00-5.00, correctness 4.00-5.00, and helpfulness 3.67-5.00, with reference validity 5.00 across all periods. This study demonstrates the feasibility of an integrated AI agent system for bidirectional HPV vaccine communication. The architecture enables verified information delivery with source attribution while providing systematic public discourse analysis, with a transferable framework for adaptation to other medical contexts.
Chinese Translation
人乳头瘤病毒(HPV)疫苗犹豫对公共健康构成了重大挑战,尤其是在日本,自2013年至2021年,主动疫苗接种建议被暂停。由此产生的信息缺口因社交媒体上的错误信息而加剧,传统方式无法同时解决个体查询和监测人口层面的讨论。本研究旨在开发一个双重目的的人工智能代理系统,通过对话界面提供经过验证的HPV疫苗信息,同时基于用户互动和社交媒体生成分析报告。我们实施了一个系统,包括:一个整合学术论文、政府来源、新闻媒体和社交媒体的向量数据库;一个使用ReAct代理架构的检索增强生成聊天机器人,能够跨五个知识来源进行多工具协调;以及一个自动报告生成系统,包含新闻分析、研究综合、社交媒体情感分析和用户互动模式识别模块。性能评估采用0-5评分标准。在单轮评估中,聊天机器人在相关性、路由、参考质量、正确性和专业身份方面的平均得分分别为4.83、4.89、4.50、4.90和4.88(总体4.80)。多轮评估得分更高:上下文保持4.94,主题连贯性5.00,总体4.98。报告生成系统在完整性上得分4.00-5.00,正确性4.00-5.00,帮助性3.67-5.00,参考有效性在所有期间均为5.00。本研究展示了一个集成的人工智能代理系统在双向HPV疫苗沟通中的可行性。该架构能够提供经过验证的信息传递并附有来源,同时提供系统化的公共话语分析,具备可转移的框架以适应其他医疗背景。
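The multi-tool orchestration at the core of the chatbot can be sketched as a keyword router over attributed sources; the routing rule, source names, and answer strings below are illustrative stand-ins for the real retrieval tools:

```python
# Illustrative knowledge sources; the answer text is grounded only in the
# abstract's stated timeline (recommendations suspended 2013-2021).
SOURCES = {
    "schedule": ("government", "Proactive recommendations resumed after the 2013-2021 suspension."),
    "efficacy": ("academic", "See peer-reviewed trial evidence on vaccine efficacy."),
}

def route(query):
    """Route a query to the first matching source and answer with attribution."""
    for keyword, (source, answer) in SOURCES.items():
        if keyword in query.lower():
            return {"answer": answer, "source": source}
    return {"answer": "No verified source found.", "source": None}

print(route("When did the vaccination schedule resume?"))
```

The real system layers an LLM over this routing decision (the ReAct pattern: reason about which tool to call, call it, then answer), but the source-attribution contract is the same.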
cs.AI / 2

Do You Trust Me? Cognitive-Affective Signatures of Trustworthiness in Large Language Models

你信任我吗?大型语言模型中可信度的认知-情感特征
Yeo, Gerard, Churina, Svetlana, Jaidka, Kokil
Abstract
Perceived trustworthiness underpins how users navigate online information, yet it remains unclear whether large language models (LLMs), increasingly embedded in search, recommendation, and conversational systems, represent this construct in psychologically coherent ways. We analyze how instruction-tuned LLMs (Llama 3.1 8B, Qwen 2.5 7B, Mistral 7B) encode perceived trustworthiness in web-like narratives using the PEACE-Reviews dataset annotated for cognitive appraisals, emotions, and behavioral intentions. Across models, systematic layer- and head-level activation differences distinguish high- from low-trust texts, revealing that trust cues are implicitly encoded during pretraining. Probing analyses show linearly decodable trust signals and fine-tuning effects that refine rather than restructure these representations. Strongest associations emerge with appraisals of fairness, certainty, and accountability-self -- dimensions central to human trust formation online. These findings demonstrate that modern LLMs internalize psychologically grounded trust signals without explicit supervision, offering a representational foundation for designing credible, transparent, and trustworthy AI systems in the web ecosystem. Code and appendix are available at: https://github.com/GerardYeo/TrustworthinessLLM.
Chinese Translation
感知的可信度是用户在在线信息中导航的基础,但尚不清楚大型语言模型(LLMs)在搜索、推荐和对话系统中日益嵌入的情况下,是否以心理上连贯的方式表现这一构念。我们分析了经过指令调优的LLMs(Llama 3.1 8B、Qwen 2.5 7B、Mistral 7B)如何在类似网络的叙事中编码感知的可信度,使用PEACE-Reviews数据集,该数据集标注了认知评估、情感和行为意图。在各模型中,系统的层级和头部激活差异区分了高可信文本与低可信文本,揭示了在预训练过程中信任线索是隐式编码的。探测分析显示线性可解码的信任信号和微调效应,这些效应是对这些表征的精细化而非重构。与公平性、确定性和问责自我等维度的评估之间出现了最强的关联,这些维度是人类在线信任形成的核心。这些发现表明,现代LLMs在没有明确监督的情况下内化了心理基础的信任信号,为在网络生态系统中设计可信、透明和值得信赖的人工智能系统提供了表征基础。代码和附录可在以下链接获取:https://github.com/GerardYeo/TrustworthinessLLM。
cs.AI / 3

Building AI Agents to Improve Job Referral Requests to Strangers

构建人工智能代理以改善对陌生人的职位推荐请求
Chu, Ross, Huang, Yuting
Abstract
This paper develops AI agents that help job seekers write effective requests for job referrals in a professional online community. The basic workflow consists of an improver agent that rewrites the referral request and an evaluator agent that measures the quality of revisions using a model trained to predict the probability of receiving referrals from other users. Revisions suggested by the LLM (large language model) increase predicted success rates for weaker requests while reducing them for stronger requests. Enhancing the LLM with Retrieval-Augmented Generation (RAG) prevents edits that worsen stronger requests while it amplifies improvements for weaker requests. Overall, using LLM revisions with RAG increases the predicted success rate for weaker requests by 14% without degrading performance on stronger requests. Although improvements in model-predicted success do not guarantee more referrals in the real world, they provide low-cost signals for promising features before running higher-stakes experiments on real users.
Chinese Translation
本文开发了人工智能代理,帮助求职者在专业在线社区中撰写有效的职位推荐请求。基本工作流程包括一个改进代理(improver agent),负责重写推荐请求,以及一个评估代理(evaluator agent),使用训练模型来测量修订的质量,该模型用于预测从其他用户那里获得推荐的概率。大型语言模型(LLM)建议的修订提高了较弱请求的预测成功率,同时降低了较强请求的成功率。通过检索增强生成(Retrieval-Augmented Generation, RAG)增强LLM,能够防止对较强请求的编辑恶化,同时放大对较弱请求的改进。总体而言,使用结合RAG的LLM修订将较弱请求的预测成功率提高了14%,而不会降低较强请求的表现。尽管模型预测成功的改进并不保证在现实世界中获得更多推荐,但它们为在真实用户上进行更高风险实验之前提供了低成本的有前景特征信号。
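The improver/evaluator workflow reduces to: rewrite, score both versions, and keep the rewrite only when the predicted success rate improves. In this sketch the lambdas are toy stand-ins for the LLM improver and the trained referral-probability model:

```python
def revise(request, improver, evaluator):
    """Return the improver's rewrite only if the evaluator scores it higher;
    otherwise keep the original request."""
    candidate = improver(request)
    if evaluator(candidate) > evaluator(request):
        return candidate
    return request      # never ship an edit that lowers the predicted rate

# Toy stand-ins: this evaluator rewards naming a specific role.
evaluator = lambda text: 0.8 if "backend engineer" in text else 0.3
improver = lambda text: text.replace("a job", "a backend engineer role")

print(revise("Could you refer me for a job?", improver, evaluator))
```

The accept-if-better gate is what the paper's RAG enhancement effectively strengthens: edits that would worsen already-strong requests get filtered out instead of shipped.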
cs.AI / 4

ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration

ORBITFLOW:基于SLO的长上下文LLM服务与细粒度KV缓存重配置
Ma, Xinyue, Hong, Heelim, Um, Taegeon, Lee, Jongseop, Choy, Seoyeong, Lee, Woo-Yeon, Jeon, Myeongjae
Abstract
Serving long-context LLMs is challenging because request lengths and batch composition vary during token generation, causing the memory footprint to fluctuate significantly at runtime. Offloading KV caches to host memory limits effective memory usage, but existing static and predetermined offloading strategies cannot adapt to the rapidly shifting memory demands of long-context serving. This often leads to excessive CPU-to-GPU KV transfers that translate into latency spikes and frequent SLO violations. To address these challenges, we introduce ORBITFLOW, a fine-grained and adaptive KV cache management system that meets latency SLOs in long-context LLM serving. ORBITFLOW employs a lightweight ILP solver to decide which layers' KV caches to retain on the GPU for each request, within memory capacity constraints. It continuously refines KV placements based on runtime feedback when the active plan becomes suboptimal during token generation. Under heavy load, ORBITFLOW invokes a fallback mechanism to temporarily defer in-flight requests with large memory footprints, preserving overall SLO attainment. Our experiments demonstrate that ORBITFLOW improves SLO attainment for TPOT and TBT by up to 66% and 48%, respectively, while reducing the 95th percentile latency by 38% and achieving up to 3.3x higher throughput compared to existing offloading methods.
Chinese Translation
服务长上下文LLM面临挑战,因为在生成token的过程中,请求长度和批次组成会有所不同,导致运行时内存占用显著波动。将KV缓存卸载到主存限制了有效的内存使用,但现有的静态和预定卸载策略无法适应长上下文服务快速变化的内存需求。这通常会导致过多的CPU到GPU的KV传输,从而引发延迟峰值和频繁的SLO违规。为了解决这些挑战,我们提出了ORBITFLOW,一个细粒度和自适应的KV缓存管理系统,能够满足长上下文LLM服务中的延迟SLO。ORBITFLOW采用轻量级的ILP求解器,根据内存容量约束,决定在每个请求中保留哪些层的KV缓存在GPU上。它会根据运行时反馈不断优化KV放置,当活动计划在token生成过程中变得次优时。面对高负载,ORBITFLOW会调用回退机制,暂时延迟内存占用大的在途请求,从而保持整体SLO的达成。我们的实验表明,ORBITFLOW在TPOT和TBT的SLO达成率分别提高了66%和48%,同时将第95百分位延迟降低了38%,并且相比现有的卸载方法实现了高达3.3倍的吞吐量提升。
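The per-request placement decision ORBITFLOW solves with an ILP can be approximated, for illustration, by a greedy knapsack over per-layer transfer costs; the real system solves the placement jointly per request and refines it at runtime:

```python
def place_kv(layers, budget):
    """layers: list of (layer_id, size_bytes, transfer_cost).
    Greedily keep on the GPU the layers with the highest transfer cost per
    byte, subject to the memory budget. Returns the kept layer_ids."""
    kept, used = set(), 0
    for lid, size, cost in sorted(layers, key=lambda l: l[2] / l[1], reverse=True):
        if used + size <= budget:
            kept.add(lid)
            used += size
    return kept

layers = [(0, 4, 8.0), (1, 4, 2.0), (2, 2, 6.0)]
print(place_kv(layers, budget=6))  # {0, 2}: best cost-per-byte within budget
```

Greedy density ordering is a standard knapsack heuristic and can be suboptimal, which is one reason an ILP (plus runtime refinement when the plan goes stale) pays off in the actual system.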
cs.AI / 5

CTHA: Constrained Temporal Hierarchical Architecture for Stable Multi-Agent LLM Systems

CTHA:用于稳定多智能体大语言模型系统的受限时间层次架构
Jardine, Percy
Abstract
Recently, multi-time-scale agent architectures have extended the ubiquitous single-loop paradigm by introducing temporal hierarchies with distinct cognitive layers. While yielding substantial performance gains, this diversification fundamentally compromises the coordination stability intrinsic to unified agent systems, which causes severe inter-layer conflicts, unbounded error propagation, and restricted scalability. To address these challenges, we propose Constrained Temporal Hierarchical Architecture (CTHA), a general framework that projects the inter-layer communication space onto structured manifolds to restore coordination stability, while incorporating principled arbitration mechanisms to ensure coherent decision-making. Specifically, CTHA enforces three key constraints: (1) Message Contract Constraints that formalize information flow between layers via typed summary, plan, and policy packets; (2) Authority Manifold Constraints that bound each layer's decision space according to its temporal scope; and (3) Arbiter Resolution Constraints that guarantee conflict-free composition of multi-layer decisions. Empirical experiments demonstrate that CTHA is effective for complex task execution at scale, offering 47% reduction in failure cascades, 2.3x improvement in sample efficiency, and superior scalability compared to unconstrained hierarchical baselines. We anticipate that CTHA, as a principled extension of temporal hierarchies, will contribute to a deeper understanding of multi-agent coordination and suggest promising directions for the evolution of robust autonomous systems.
Chinese Translation
近年来,多时间尺度的智能体架构通过引入具有不同认知层次的时间层次结构,扩展了普遍存在的单循环范式。尽管这种多样化带来了显著的性能提升,但从根本上削弱了统一智能体系统所固有的协调稳定性,导致严重的层间冲突、无界的错误传播和受限的可扩展性。为了解决这些挑战,我们提出了受限时间层次架构(CTHA),这是一个通用框架,它将层间通信空间投影到结构化流形上,以恢复协调稳定性,同时结合原则性仲裁机制以确保一致的决策。具体而言,CTHA 强制执行三个关键约束:(1)消息合同约束,通过类型化的摘要、计划和策略数据包形式化层间信息流;(2)权威流形约束,根据时间范围限制每个层的决策空间;(3)仲裁者解决约束,确保多层决策的无冲突组合。实证实验表明,CTHA 在大规模复杂任务执行中有效,减少了 47% 的失败级联,样本效率提高了 2.3 倍,并且相比于无约束的层次基线具有更优的可扩展性。我们预期,作为时间层次的原则性扩展,CTHA 将有助于更深入地理解多智能体协调,并为稳健自主系统的发展提供有前景的方向。
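The Message Contract idea, restricting inter-layer traffic to typed packets so a fast layer cannot smuggle arbitrary state into a slow one, can be sketched with dataclasses; the field names and direction rule are illustrative, not CTHA's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SummaryPacket:      # upward: fast layer -> slow layer
    window: int
    outcome: str

@dataclass(frozen=True)
class PolicyPacket:       # downward: slow layer -> fast layer
    horizon: int
    directive: str

def deliver(packet, direction):
    """Enforce that only the contracted packet type flows in each direction."""
    allowed = {"up": SummaryPacket, "down": PolicyPacket}
    if not isinstance(packet, allowed[direction]):
        raise TypeError(f"{type(packet).__name__} not allowed {direction}ward")
    return packet

deliver(SummaryPacket(window=10, outcome="ok"), "up")          # accepted
try:
    deliver(SummaryPacket(window=10, outcome="ok"), "down")    # rejected
except TypeError as e:
    print(e)
```

Typing the channel is the cheapest of the three constraints to enforce; the authority-manifold and arbiter constraints additionally bound what each packet is allowed to say and how conflicting packets compose.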
cs.AI / 6

Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration

利用长期记忆进行探索:基于多模态大语言模型的强化学习框架与基准测试用于具身探索
Wang, Sen, Liu, Bangwei, Gao, Zhenkun, Ma, Lizhuang, Wang, Xuhong, Xie, Yuan, Tan, Xin
Abstract
An ideal embodied agent should possess lifelong learning capabilities to handle long-horizon and complex tasks, enabling continuous operation in general environments. This not only requires the agent to accurately accomplish given tasks but also to leverage long-term episodic memory to optimize decision-making. However, existing mainstream one-shot embodied tasks primarily focus on task completion results, neglecting the crucial process of exploration and memory utilization. To address this, we propose Long-term Memory Embodied Exploration (LMEE), which aims to unify the agent's exploratory cognition and decision-making behaviors to promote lifelong learning. We further construct a corresponding dataset and benchmark, LMEE-Bench, incorporating multi-goal navigation and memory-based question answering to comprehensively evaluate both the process and outcome of embodied exploration. To enhance the agent's memory recall and proactive exploration capabilities, we propose MemoryExplorer, a novel method that fine-tunes a multimodal large language model through reinforcement learning to encourage active memory querying. By incorporating a multi-task reward function that includes action prediction, frontier selection, and question answering, our model achieves proactive exploration. Extensive experiments against state-of-the-art embodied exploration models demonstrate that our approach achieves significant advantages in long-horizon embodied tasks.
Chinese Translation
理想的具身智能体应具备终身学习能力,以应对长期和复杂的任务,从而在一般环境中持续运作。这不仅要求智能体准确完成给定任务,还需利用长期情节记忆来优化决策。然而,现有主流的一次性具身任务主要关注任务完成结果,忽视了探索和记忆利用这一关键过程。为了解决这一问题,我们提出了长期记忆具身探索(Long-term Memory Embodied Exploration,LMEE),旨在统一智能体的探索认知和决策行为,以促进终身学习。我们进一步构建了相应的数据集和基准测试LMEE-Bench,结合多目标导航和基于记忆的问题回答,以全面评估具身探索的过程和结果。为了增强智能体的记忆召回和主动探索能力,我们提出了MemoryExplorer,这是一种通过强化学习微调多模态大语言模型的新方法,以鼓励主动的记忆查询。通过结合包括动作预测、边界选择和问题回答在内的多任务奖励函数,我们的模型实现了主动探索。与最先进的具身探索模型进行的大量实验表明,我们的方法在长期具身任务中取得了显著优势。
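The multi-task reward combining action prediction, frontier selection, and question answering can be sketched as a weighted sum; the weights and term definitions below are assumptions for illustration, not MemoryExplorer's exact reward:

```python
def multi_task_reward(action_ok, frontier_gain, answer_ok,
                      w_act=1.0, w_frontier=0.5, w_qa=2.0):
    """action_ok / answer_ok: whether the predicted action and answer were
    correct; frontier_gain: normalized information gain of the chosen
    frontier in [0, 1]. Weights are hypothetical."""
    return (w_act * (1.0 if action_ok else 0.0)
            + w_frontier * frontier_gain
            + w_qa * (1.0 if answer_ok else 0.0))

print(multi_task_reward(action_ok=True, frontier_gain=0.4, answer_ok=True))
```

Weighting the QA term highest reflects the benchmark's design: exploration is only valuable insofar as the memory it builds can later answer questions.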
cs.AI / 7

Optimisation of complex product innovation processes based on trend models with three-valued logic

基于三值逻辑的复杂产品创新过程优化研究
Bočková, Nina, Volná, Barbora, Dohnal, Mirko
Abstract
This paper investigates complex product-innovation processes using models grounded in a set of heuristics. Each heuristic is expressed through simple trends -- increasing, decreasing, or constant -- which serve as minimally information-intensive quantifiers, avoiding reliance on numerical values or rough sets. A solution to a trend model is defined as a set of scenarios with possible transitions between them, represented by a transition graph. Any possible future or past behaviour of the system under study can thus be depicted by a path within this graph.
Chinese Translation
本文探讨了基于一组启发式模型的复杂产品创新过程。每个启发式通过简单的趋势表达——增加、减少或保持不变——这些趋势作为信息密度最低的量化指标,避免依赖数值或粗糙集。趋势模型的解决方案被定义为一组可能的场景及其之间的转变,这些转变通过转移图表示。因此,研究系统的任何可能的未来或过去行为都可以通过该图中的一条路径来描绘。
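A trend model in this three-valued sense is easy to make concrete: a scenario assigns each variable a trend in {-, 0, +}, and a transition is admissible when no trend jumps directly between increasing and decreasing. The adjacency rule below is the usual continuity heuristic, simplified for illustration:

```python
from itertools import product

# Ordering of the three trend values: decreasing < constant < increasing.
ORDER = {"-": 0, "0": 1, "+": 2}

def scenarios(n_vars):
    """All assignments of a trend to each of n_vars variables."""
    return list(product("-0+", repeat=n_vars))

def can_transition(s, t):
    """A trend may move by at most one step per transition (continuity)."""
    return all(abs(ORDER[a] - ORDER[b]) <= 1 for a, b in zip(s, t))

def transition_graph(n_vars):
    S = scenarios(n_vars)
    return {s: [t for t in S if t != s and can_transition(s, t)] for s in S}

g = transition_graph(1)
print(g[("-",)])  # a falling trend can only level off, not jump to rising
```

Any possible future or past behaviour of the modelled system is then a path in this graph, exactly as the abstract describes.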
cs.AI / 8

ARC Prize 2025: Technical Report

ARC奖2025:技术报告
Chollet, François, Knoop, Mike, Kamradt, Gregory, Landers, Bryan
Abstract
The ARC-AGI benchmark series serves as a critical measure of few-shot generalization on novel tasks, a core aspect of intelligence. The ARC Prize 2025 global competition targeted the newly released ARC-AGI-2 dataset, which features greater task complexity compared to its predecessor. The Kaggle competition attracted 1,455 teams and 15,154 entries, with the top score reaching 24% on the ARC-AGI-2 private evaluation set. Paper submissions nearly doubled year-over-year to 90 entries, reflecting the growing research interest in fluid intelligence and abstract reasoning. The defining theme of 2025 is the emergence of the refinement loop -- a per-task iterative program optimization loop guided by a feedback signal. Refinement loops come in a variety of forms, in particular evolutionary program synthesis approaches and application-layer refinements to commercial AI systems. Such refinement loops are also possible in weight space, as evidenced by zero-pretraining deep learning methods which are now achieving competitive performance with remarkably small networks (7M parameters). In parallel, four frontier AI labs (Anthropic, Google DeepMind, OpenAI, and xAI) reported ARC-AGI performance in public model cards in 2025, establishing ARC-AGI as an industry standard benchmark for AI reasoning. However, our analysis indicates that current frontier AI reasoning performance remains fundamentally constrained by knowledge coverage, giving rise to new forms of benchmark contamination. In this paper, we survey the top-performing methods, examine the role of refinement loops in AGI progress, discuss knowledge-dependent overfitting, and preview ARC-AGI-3, which introduces interactive reasoning challenges that require exploration, planning, memory, goal acquisition, and alignment capabilities.
Chinese Translation
ARC-AGI基准系列作为对新任务的少量样本泛化能力的关键测量,是智能的核心方面。ARC奖2025全球竞赛针对新发布的ARC-AGI-2数据集,该数据集相比其前身具有更高的任务复杂性。Kaggle竞赛吸引了1455个团队和15154个参赛作品,最高得分在ARC-AGI-2私有评估集上达到了24%。论文提交数量几乎是去年的两倍,达到了90篇,反映出对流体智能和抽象推理的研究兴趣日益增长。2025年的主题是精炼循环的出现——一个由反馈信号引导的每个任务的迭代程序优化循环。精炼循环有多种形式,特别是进化程序合成方法和对商业AI系统的应用层精炼。这种精炼循环在权重空间中也是可能的,零预训练深度学习方法的出现证明了这一点,这些方法现在在参数极少(700万参数)的情况下也能达到竞争性表现。同时,四个前沿AI实验室(Anthropic、Google DeepMind、OpenAI和xAI)在2025年公开模型卡中报告了ARC-AGI的表现,确立了ARC-AGI作为AI推理的行业标准基准。然而,我们的分析表明,目前前沿AI推理的表现仍然受到知识覆盖的根本限制,导致新的基准污染形式的出现。本文调查了表现最佳的方法,考察了精炼循环在AGI进展中的作用,讨论了依赖知识的过拟合,并预览了ARC-AGI-3,该版本引入了需要探索、规划、记忆、目标获取和对齐能力的互动推理挑战。
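The refinement loop the report identifies (propose a candidate, score it against the training pairs, refine under that feedback signal) can be shown generically; the "programs" here are toy integer offsets rather than DSL programs or LLM-generated code:

```python
def refine(train_pairs, candidates, steps=10):
    """Feedback-guided local search: keep the candidate with the least total
    error on the training pairs, then refine it neighbor by neighbor."""
    def score(offset):
        # Feedback signal: negative total error on the training pairs.
        return -sum(abs(x + offset - y) for x, y in train_pairs)
    best = max(candidates, key=score)
    for _ in range(steps):
        best = max([best - 1, best, best + 1], key=score)
        if score(best) == 0:        # fits all training pairs: stop refining
            break
    return best

pairs = [(1, 4), (2, 5), (10, 13)]  # hidden rule: add 3
print(refine(pairs, candidates=[0]))  # 3
```

Evolutionary program synthesis and application-layer refinement of commercial models instantiate this same propose-score-refine skeleton, just with far richer candidate spaces and feedback signals.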
cs.AI / 9

What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge

多模态推理中的数据管理:来自DCVLR挑战的见解
Shin, Yosub, Buriek, Michael, Sobolev, Boris, Bushuyeu, Pavel, Kumar, Vikas, Xu, Haoyang, Watson, Samuel, Molybog, Igor
Abstract
We study data curation for multimodal reasoning through the NeurIPS 2025 Data Curation for Vision-Language Reasoning (DCVLR) challenge, which isolates dataset selection by fixing the model and training protocol. Using a compact curated dataset derived primarily from Walton Multimodal Cold Start, our submission placed first in the challenge. Through post-competition ablations, we show that difficulty-based example selection on an aligned base dataset is the dominant driver of performance gains. Increasing dataset size does not reliably improve mean accuracy under the fixed training recipe, but mainly reduces run-to-run variance, while commonly used diversity and synthetic augmentation heuristics provide no additional benefit and often degrade performance. These results characterize DCVLR as a saturation-regime evaluation and highlight the central role of alignment and difficulty in data-efficient multimodal reasoning.
Chinese Translation
我们通过NeurIPS 2025视觉-语言推理数据管理挑战(DCVLR)研究多模态推理中的数据管理,该挑战通过固定模型和训练协议来隔离数据集选择。我们使用主要来自Walton多模态冷启动的紧凑整理数据集进行提交,并在挑战中获得第一名。通过赛后消融实验,我们表明,基于难度的示例选择在对齐的基础数据集上是性能提升的主要驱动因素。增加数据集规模在固定训练方案下并不能可靠地提高平均准确率,而主要是减少了运行间的方差,而常用的多样性和合成增强启发式方法并未提供额外的收益,反而常常降低性能。这些结果将DCVLR特征化为饱和状态评估,并强调了对齐和难度在数据高效多模态推理中的核心作用。
cs.AI / 10

AdaMARP: An Adaptive Multi-Agent Interaction Framework for General Immersive Role-Playing

AdaMARP:一种用于通用沉浸式角色扮演的自适应多智能体交互框架
Xu, Zhenhua, Chen, Dongsheng, Wang, Shuo, Li, Jian, Wang, Chengjie, Han, Meng, Wang, Yabiao
Abstract
LLM role-playing aims to portray arbitrary characters in interactive narratives, yet existing systems often suffer from limited immersion and adaptability. They typically under-model dynamic environmental information and assume largely static scenes and casts, offering insufficient support for multi-character orchestration, scene transitions, and on-the-fly character introduction. We propose an adaptive multi-agent role-playing framework, AdaMARP, featuring an immersive message format that interleaves [Thought], (Action), <Environment>, and Speech, together with an explicit Scene Manager that governs role-playing through discrete actions (init_scene, pick_speaker, switch_scene, add_role, end) accompanied by rationales. To train these capabilities, we construct AdaRPSet for the Actor Model and AdaSMSet for supervising orchestration decisions, and introduce AdaptiveBench for trajectory-level evaluation. Experiments across multiple backbones and model scales demonstrate consistent improvements: AdaRPSet enhances character consistency, environment grounding, and narrative coherence, with an 8B actor outperforming several commercial LLMs, while AdaSMSet enables smoother scene transitions and more natural role introductions, surpassing Claude Sonnet 4.5 while using only a 14B LLM.
Chinese Translation
大型语言模型(LLM)角色扮演旨在在互动叙事中描绘任意角色,然而现有系统往往面临沉浸感和适应性不足的问题。它们通常对动态环境信息建模不足,并假设场景和角色大致静态,无法充分支持多角色的协调、场景转换和即时角色引入。我们提出了一种自适应多智能体角色扮演框架AdaMARP,采用一种沉浸式消息格式,交织[Thought](思考)、(Action)(动作)、(环境)和Speech(语言),并配备一个明确的场景管理器,通过离散动作(init_scene、pick_speaker、switch_scene、add_role、end)及其理由来管理角色扮演。为了训练这些能力,我们构建了用于演员模型的AdaRPSet和用于监督协调决策的AdaSMSet,并引入了AdaptiveBench进行轨迹级评估。在多个基础模型和模型规模上的实验表明,AdaRPSet在角色一致性、环境基础和叙事连贯性方面持续改善,其中一个8B的演员模型在表现上超越了几款商业LLM,而AdaSMSet则实现了更流畅的场景转换和更自然的角色引入,仅使用一个14B的LLM便超越了Claude Sonnet 4.5。
cs.AI / 11

Efficient Protein Optimization via Structure-aware Hamiltonian Dynamics

基于结构感知的哈密顿动力学高效蛋白质优化
Wang, Jiahao, Zheng, Shuangjia
Abstract
The ability to engineer optimized protein variants has transformative potential for biotechnology and medicine. Prior sequence-based optimization methods struggle with the high-dimensional complexities due to the epistasis effect and the disregard for structural constraints. To address this, we propose HADES, a Bayesian optimization method utilizing Hamiltonian dynamics to efficiently sample from a structure-aware approximated posterior. Leveraging momentum and uncertainty in the simulated physical movements, HADES enables rapid transition of proposals toward promising areas. A position discretization procedure is introduced to propose discrete protein sequences from such a continuous state system. The posterior surrogate is powered by a two-stage encoder-decoder framework to determine the structure and function relationships between mutant neighbors, consequently learning a smoothed landscape to sample from. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines in in-silico evaluations across most metrics. Remarkably, our approach offers a unique advantage by leveraging the mutual constraints between protein structure and sequence, facilitating the design of protein sequences with similar structures and optimized properties. The code and data are publicly available at https://github.com/GENTEL-lab/HADES.
Chinese Translation
工程化优化蛋白质变体的能力对生物技术和医学具有变革性潜力。先前的基于序列的优化方法由于上位效应(epistasis)和对结构约束的忽视,难以应对高维复杂性。为了解决这一问题,我们提出了HADES,一种利用哈密顿动力学的贝叶斯优化方法,能够高效地从结构感知的近似后验中进行采样。HADES利用模拟物理运动中的动量和不确定性,快速引导提案向有前景的区域过渡。我们引入了一种位置离散化程序,从这种连续状态系统中提出离散的蛋白质序列。后验代理由一个两阶段的编码器-解码器框架驱动,以确定突变邻居之间的结构和功能关系,从而学习一个平滑的景观进行采样。大量实验表明,我们的方法在大多数指标的计算机模拟评估中优于最先进的基准。值得注意的是,我们的方法通过利用蛋白质结构与序列之间的相互约束,提供了独特的优势,促进了具有相似结构和优化属性的蛋白质序列的设计。代码和数据可在 https://github.com/GENTEL-lab/HADES 上公开获取。
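As a point of reference for the Hamiltonian-dynamics machinery, the sketch below shows the standard leapfrog integrator used in HMC-style samplers. It is a generic building block under assumed interfaces (`grad_logp` is a placeholder), not HADES's structure-aware posterior or its position-discretization step:

```python
import numpy as np

def leapfrog(x, p, grad_logp, step=0.1, n_steps=10):
    """Simulate one Hamiltonian trajectory with the leapfrog integrator:
    momentum (p) carries the sample (x) across the landscape while the
    Hamiltonian (potential + kinetic energy) stays nearly conserved."""
    p = p + 0.5 * step * grad_logp(x)      # opening half-step for momentum
    for _ in range(n_steps - 1):
        x = x + step * p                    # full position step
        p = p + step * grad_logp(x)         # full momentum step
    x = x + step * p
    p = p + 0.5 * step * grad_logp(x)      # closing half-step
    return x, p

# Standard-normal target: grad log p(x) = -x; check energy conservation.
grad_logp = lambda x: -x
x0, p0 = np.array([1.0]), np.array([0.5])
x1, p1 = leapfrog(x0, p0, grad_logp)
H0 = float(0.5 * (p0[0] ** 2 + x0[0] ** 2))
H1 = float(0.5 * (p1[0] ** 2 + x1[0] ** 2))
```

The near-conservation of the Hamiltonian is what lets momentum-driven proposals travel far while still being accepted, which is the intuition behind "rapid transition of proposals toward promising areas."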
cs.AI / 12

BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search

BAPO:边界感知策略优化用于可靠的自主搜索
Liu, Shiyu, Yin, Yongjing, Yan, Jianhao, Tang, Yunbo, Zhang, Qinggang, Li, Bei, Chen, Xin, Wang, Jingang, Cai, Xunliang, Su, Jinsong
Abstract
RL-based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While this approach significantly enhances accuracy with agent policies optimized via large-scale reinforcement learning, we identify a critical gap in reliability: these agents fail to recognize their reasoning boundaries and rarely admit ``I DON'T KNOW'' (IDK) even when evidence is insufficient or reasoning reaches its limit. The lack of reliability often leads to plausible but unreliable answers, introducing significant risks in many real-world scenarios. To this end, we propose Boundary-Aware Policy Optimization (BAPO), a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy. BAPO introduces two key components: (i) a group-based boundary-aware reward that encourages an IDK response only when the reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting IDK as a shortcut. Extensive experiments on four benchmarks demonstrate that BAPO substantially enhances the overall reliability of agentic search.
Chinese Translation
基于强化学习的自主搜索使得大型语言模型(LLMs)能够通过动态规划和外部搜索解决复杂问题。尽管这种方法通过大规模强化学习优化的代理策略显著提高了准确性,但我们识别出一个关键的可靠性缺口:这些代理无法识别其推理边界,并且在证据不足或推理达到极限时很少承认“我不知道”(IDK)。缺乏可靠性往往导致看似合理但不可靠的答案,在许多现实场景中引入了重大风险。为此,我们提出了边界感知策略优化(BAPO),这是一种新颖的强化学习框架,旨在培养可靠的边界意识而不妨碍准确性。BAPO引入了两个关键组件:(i)基于群体的边界感知奖励,仅在推理达到极限时鼓励IDK响应,以及(ii)一种自适应奖励调节器,在早期探索阶段战略性地暂停该奖励,防止模型利用IDK作为捷径。在四个基准测试上的广泛实验表明,BAPO显著增强了自主搜索的整体可靠性。
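The two BAPO components can be illustrated with a toy reward function. Everything below (the signature, the reward magnitudes, the warmup rule) is an assumed simplification for illustration, not the paper's exact formulation:

```python
def boundary_aware_reward(answers, gold, step, warmup=100):
    """Toy group-based boundary-aware reward: for one question's group of
    rollouts, reward 'IDK' only when no rollout in the group solved it
    (the question likely lies beyond the reasoning boundary), and suspend
    the IDK reward entirely during early exploration (the modulator)."""
    any_solved = any(a == gold for a in answers)
    rewards = []
    for a in answers:
        if a == gold:
            rewards.append(1.0)
        elif a == "IDK":
            if step < warmup:
                rewards.append(0.0)   # modulator: no IDK shortcut early on
            else:
                rewards.append(0.5 if not any_solved else -0.5)
        else:
            rewards.append(-1.0)      # confident but wrong
    return rewards

# After warmup, on a question no rollout solved, IDK beats a wrong guess:
r = boundary_aware_reward(["A", "IDK", "B", "IDK"], gold="C", step=500)
```

Note the ordering this induces after warmup: correct > IDK-when-unsolvable > IDK-when-solvable > wrong, so abstention is only attractive at the model's actual reasoning boundary.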
cs.AI / 13

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

AgencyBench:在100万标记的真实世界背景下评估自主智能体的前沿
Li, Keyu, Shi, Junhao, Xiao, Yang, Jiang, Mohan, Sun, Jie, Wu, Yunze, Xia, Shijie, Cai, Xiaojie, Xu, Tianze, Si, Weiye, Li, Wenjie, Wang, Dequan, Liu, Pengfei
Abstract
Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences. Finally, we investigate the impact of agentic scaffolds, observing that proprietary models demonstrate superior performance within their native ecosystems (e.g., Claude-4.5-Opus via Claude-Agent-SDK), while open-source models exhibit distinct performance peaks, suggesting potential optimization for specific execution frameworks. AgencyBench serves as a critical testbed for next-generation agents, highlighting the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR-NLP/AgencyBench.
Chinese Translation
基于大型语言模型(LLMs)的自主智能体展示了多方面的能力,能够显著促进经济生产。然而,现有的基准测试仍然集中于单一的智能能力,未能捕捉长时间跨度的真实世界场景。此外,现实任务依赖人类参与(human-in-the-loop)反馈,造成了可扩展性瓶颈,阻碍了自动化轨迹收集和评估。为了解决这一问题,我们引入了AgencyBench,这是一个基于日常AI使用情况的综合基准,评估32个真实世界场景中的6种核心智能能力,共包含138个任务,每个任务都有具体的查询、交付物和评分标准。这些场景平均需要90次工具调用、100万标记和数小时的执行时间才能解决。为了实现自动化评估,我们使用用户模拟智能体提供迭代反馈,并利用Docker沙箱进行基于视觉和功能的评分评估。实验结果表明,闭源模型的表现显著优于开源模型(48.4%对32.1%)。进一步分析显示,各模型在资源效率、基于反馈的自我修正以及特定工具使用偏好方面存在显著差异。最后,我们研究了智能支架的影响,观察到专有模型在其本土生态系统中表现优越(例如,通过Claude-Agent-SDK的Claude-4.5-Opus),而开源模型则表现出明显的性能峰值,暗示在特定执行框架中可能进行优化。AgencyBench作为下一代智能体的关键测试平台,强调了将模型架构与智能框架共同优化的必要性。我们相信这项工作为自主智能体的未来方向提供了启示,并在https://github.com/GAIR-NLP/AgencyBench发布了完整的基准和评估工具包。
cs.AI / 14

MiCA: A Mobility-Informed Causal Adapter for Lightweight Epidemic Forecasting

MiCA:一种基于移动性信息的轻量级因果适配器用于流行病预测
Guo, Suhan, Deng, Jiahong, Shen, Furao
Abstract
Accurate forecasting of infectious disease dynamics is critical for public health planning and intervention. Human mobility plays a central role in shaping the spatial spread of epidemics, but mobility data are noisy, indirect, and difficult to integrate reliably with disease records. Meanwhile, epidemic case time series are typically short and reported at coarse temporal resolution. These conditions limit the effectiveness of parameter-heavy mobility-aware forecasters that rely on clean and abundant data. In this work, we propose the Mobility-Informed Causal Adapter (MiCA), a lightweight and architecture-agnostic module for epidemic forecasting. MiCA infers mobility relations through causal discovery and integrates them into temporal forecasting models via gated residual mixing. This design allows lightweight forecasters to selectively exploit mobility-derived spatial structure while remaining robust under noisy and data-limited conditions, without introducing heavy relational components such as graph neural networks or full attention. Extensive experiments on four real-world epidemic datasets, including COVID-19 incidence, COVID-19 mortality, influenza, and dengue, show that MiCA consistently improves lightweight temporal backbones, achieving an average relative error reduction of 7.5\% across forecasting horizons. Moreover, MiCA attains performance competitive with SOTA spatio-temporal models while remaining lightweight.
Chinese Translation
准确预测传染病动态对公共卫生规划和干预至关重要。人类移动性在塑造流行病的空间传播中起着核心作用,但移动性数据通常噪声较大、间接且难以与疾病记录可靠整合。同时,流行病病例时间序列通常较短且报告的时间分辨率较粗。这些条件限制了依赖于干净且丰富数据的参数密集型移动性感知预测模型的有效性。在本研究中,我们提出了移动性信息因果适配器(MiCA),这是一个轻量级且与架构无关的流行病预测模块。MiCA通过因果发现推断移动性关系,并通过门控残差混合将其整合到时间预测模型中。这种设计使得轻量级预测模型能够在噪声和数据有限的条件下选择性地利用移动性衍生的空间结构,而无需引入图神经网络或全注意力等重型关系组件。在四个真实世界流行病数据集上的广泛实验,包括COVID-19发病率、COVID-19死亡率、流感和登革热,表明MiCA持续改善轻量级时间骨干网络,在各个预测时间段内实现了平均相对误差降低7.5%。此外,MiCA在保持轻量级的同时,其性能与最先进的时空模型相当。
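The gated residual mixing step can be sketched in a few lines. Shapes, names, and the plain linear parameterization below are assumptions for illustration, not MiCA's exact architecture:

```python
import numpy as np

def gated_residual_mix(temporal_pred, mobility_feat, W_g, b_g, W_m):
    """Add a mobility-derived correction to a temporal backbone's forecast
    through a learned sigmoid gate, so a noisy mobility signal can be
    softly suppressed (gate near 0 recovers the backbone unchanged)."""
    gate = 1.0 / (1.0 + np.exp(-(mobility_feat @ W_g + b_g)))  # in (0, 1)
    correction = mobility_feat @ W_m
    return temporal_pred + gate * correction                    # residual mix

rng = np.random.default_rng(0)
pred = rng.normal(size=(8, 1))      # backbone forecasts for 8 regions
mob = rng.normal(size=(8, 4))       # mobility-derived features per region
W_g, W_m = rng.normal(size=(4, 1)), rng.normal(size=(4, 1))
out = gated_residual_mix(pred, mob, W_g, 0.0, W_m)
closed_gate = gated_residual_mix(pred, mob, np.zeros((4, 1)), -50.0, W_m)
```

The residual form is what keeps the module robust in data-limited settings: if mobility carries no signal, the gate can close and the lightweight backbone's forecast passes through untouched.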
cs.AI / 15

ReCreate: Reasoning and Creating Domain Agents Driven by Experience

ReCreate:基于经验驱动的领域智能体推理与创造
Hao, Zhezheng, Wang, Hong, Luo, Jian, Zhang, Jianqing, Zhou, Yuyan, Lin, Qiang, Wang, Can, Dong, Hande, Chen, Jiawei
Abstract
Large Language Model agents are reshaping the industrial landscape. However, most practical agents remain human-designed because tasks differ widely, making them labor-intensive to build. This situation poses a central question: can we automatically create and adapt domain agents in the wild? While several recent approaches have sought to automate agent creation, they typically treat agent generation as a black-box procedure and rely solely on final performance metrics to guide the process. Such strategies overlook critical evidence explaining why an agent succeeds or fails, and often require high computational costs. To address these limitations, we propose ReCreate, an experience-driven framework for the automatic creation of domain agents. ReCreate systematically leverages agent interaction histories, which provide rich concrete signals on both the causes of success or failure and the avenues for improvement. Specifically, we introduce an agent-as-optimizer paradigm that effectively learns from experience via three key components: (i) an experience storage and retrieval mechanism for on-demand inspection; (ii) a reasoning-creating synergy pipeline that maps execution experience into scaffold edits; and (iii) hierarchical updates that abstract instance-level details into reusable domain patterns. In experiments across diverse domains, ReCreate consistently outperforms human-designed agents and existing automated agent generation methods, even when starting from minimal seed scaffolds.
Chinese Translation
大型语言模型智能体正在重塑工业格局。然而,大多数实际智能体仍然是人类设计的,因为任务差异很大,导致构建过程劳动密集。这种情况提出了一个核心问题:我们能否在实际环境中自动创建和适应领域智能体?虽然最近有几种方法试图自动化智能体创建,但它们通常将智能体生成视为一个黑箱过程,仅依赖最终性能指标来指导该过程。这些策略忽视了解释智能体成功或失败的关键证据,并且通常需要高昂的计算成本。为了解决这些局限性,我们提出了ReCreate,一个基于经验驱动的领域智能体自动创建框架。ReCreate系统性地利用智能体交互历史,这些历史提供了关于成功或失败原因以及改进途径的丰富具体信号。具体而言,我们引入了一种智能体作为优化器的范式,通过三个关键组件有效地从经验中学习:(i) 用于按需检查的经验存储和检索机制;(ii) 将执行经验映射到脚手架编辑的推理-创造协同管道;以及(iii) 将实例级细节抽象为可重用领域模式的分层更新。在多个领域的实验中,ReCreate始终优于人类设计的智能体和现有的自动化智能体生成方法,即使在从最小种子脚手架开始时。
cs.AI / 16

Do We Always Need Query-Level Workflows? Rethinking Agentic Workflow Generation for Multi-Agent Systems

我们总是需要查询级工作流吗?重新思考多智能体系统的自主工作流生成
Wang, Zixu, Xu, Bingbing, Yuan, Yige, Shen, Huawei, Cheng, Xueqi
Abstract
Multi-Agent Systems (MAS) built on large language models typically solve complex tasks by coordinating multiple agents through workflows. Existing approaches generate workflows either at the task level or the query level, but their relative costs and benefits remain unclear. After rethinking and empirical analyses, we show that query-level workflow generation is not always necessary, since a small set of top-K best task-level workflows together already covers equivalent or even more queries. We further find that exhaustive execution-based task-level evaluation is both extremely token-costly and frequently unreliable. Inspired by the idea of self-evolution and generative reward modeling, we propose a low-cost task-level generation framework SCALE (Self prediction of the optimizer with few-shot CALibration for Evaluation), which replaces full validation execution. Extensive experiments demonstrate that SCALE maintains competitive performance, with an average degradation of just 0.61% compared to the existing approach across multiple datasets, while cutting overall token usage by up to 83%.
Chinese Translation
基于大型语言模型的多智能体系统(MAS)通常通过工作流协调多个智能体来解决复杂任务。现有方法在任务级或查询级生成工作流,但它们的相对成本和收益仍不明确。经过重新思考和实证分析,我们发现查询级工作流生成并非总是必要,因为一小组前K个最佳任务级工作流已经能够覆盖等效或更多的查询。我们进一步发现,基于穷举执行的任务级评估既极其耗费令牌,又常常不可靠。受到自我演化和生成奖励建模思想的启发,我们提出了一种低成本的任务级生成框架SCALE(Self prediction of the optimizer with few-shot CALibration for Evaluation),以取代完全验证执行。大量实验表明,SCALE在多个数据集上保持了竞争力的性能,与现有方法相比,平均性能下降仅为0.61%,同时整体令牌使用量减少了多达83%。
cs.AI / 17

TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech

TANDEM:面向时间的神经检测用于多模态仇恨言论
Koushik, Girish A., Treharne, Helen, Kanojia, Diptesh
Abstract
Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a complex interplay of audio, visual, and textual cues. While automated systems can flag hate speech with high accuracy, they often function as "black boxes" that fail to provide the granular, interpretable evidence, such as precise timestamps and target identities, required for effective human-in-the-loop moderation. In this work, we introduce TANDEM, a unified framework that transforms audio-visual hate detection from a binary classification task into a structured reasoning problem. Our approach employs a novel tandem reinforcement learning strategy where vision-language and audio-language models optimize each other through self-constrained cross-modal context, stabilizing reasoning over extended temporal sequences without requiring dense frame-level supervision. Experiments across three benchmark datasets demonstrate that TANDEM significantly outperforms zero-shot and context-augmented baselines, achieving 0.73 F1 in target identification on HateMM (a 30% improvement over state-of-the-art) while maintaining precise temporal grounding. We further observe that while binary detection is robust, differentiating between offensive and hateful content remains challenging in multi-class settings due to inherent label ambiguity and dataset imbalance. More broadly, our findings suggest that structured, interpretable alignment is achievable even in complex multimodal settings, offering a blueprint for the next generation of transparent and actionable online safety moderation tools.
Chinese Translation
社交媒体平台越来越多地被长格式多模态内容所主导,其中有害叙事通过音频、视觉和文本线索的复杂交互构建而成。尽管自动化系统可以高精度地标记仇恨言论,但它们往往作为“黑箱”运作,无法提供有效的人机协作所需的细致、可解释的证据,如精确的时间戳和目标身份。在本研究中,我们引入了TANDEM,一个统一框架,将音视频仇恨检测从二元分类任务转变为结构化推理问题。我们的方法采用了一种新颖的串联强化学习策略,其中视觉-语言和音频-语言模型通过自我约束的跨模态上下文相互优化,从而在不需要密集帧级监督的情况下稳定地推理扩展的时间序列。针对三个基准数据集的实验表明,TANDEM显著优于零样本和上下文增强基线,在HateMM数据集上的目标识别中达到了0.73的F1分数(比现有技术提高了30%),同时保持了精确的时间定位。我们进一步观察到,尽管二元检测是稳健的,但在多类设置中区分冒犯性和仇恨内容仍然具有挑战性,这主要由于固有的标签模糊性和数据集不平衡。更广泛地说,我们的发现表明,即使在复杂的多模态环境中,结构化、可解释的对齐也是可以实现的,为下一代透明且可操作的在线安全审核工具提供了蓝图。
cs.AI / 18

Policy-Based Deep Reinforcement Learning Hyperheuristics for Job-Shop Scheduling Problems

基于策略的深度强化学习超启发式算法在作业车间调度问题中的应用
Lassoued, Sofiene, Gobachew, Asrat, Lier, Stefan, Schwung, Andreas
Abstract
This paper proposes a policy-based deep reinforcement learning hyper-heuristic framework for solving the Job Shop Scheduling Problem. The hyper-heuristic agent learns to switch scheduling rules dynamically based on the system state. We extend the hyper-heuristic framework with two key mechanisms. First, action prefiltering restricts decision-making to feasible low-level actions, enabling low-level heuristics to be evaluated independently of environmental constraints and providing an unbiased assessment. Second, a commitment mechanism regulates the frequency of heuristic switching. We investigate the impact of different commitment strategies, from step-wise switching to full-episode commitment, on both training behavior and makespan. Additionally, we compare two action selection strategies at the policy level: deterministic greedy selection and stochastic sampling. Computational experiments on standard JSSP benchmarks demonstrate that the proposed approach outperforms traditional heuristics, metaheuristics, and recent neural network-based scheduling methods.
Chinese Translation
本文提出了一种基于策略的深度强化学习超启发式框架,用于解决作业车间调度问题(Job Shop Scheduling Problem)。该超启发式代理能够根据系统状态动态切换调度规则。我们通过两个关键机制扩展了该超启发式框架。首先,动作预过滤限制了决策过程仅限于可行的低级动作,使得低级启发式算法能够在不受环境约束的情况下独立评估,从而提供无偏的评估。其次,承诺机制调节启发式切换的频率。我们研究了从逐步切换到全回合承诺等不同承诺策略对训练行为和完工时间(makespan)的影响。此外,我们在策略层面比较了两种动作选择策略:确定性贪婪选择和随机采样。在标准作业车间调度问题基准测试中的计算实验表明,所提出的方法优于传统启发式算法、元启发式算法以及近期基于神经网络的调度方法。
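The commitment mechanism is easy to sketch with a toy single-machine environment standing in for a JSSP instance. All interfaces below (the environment, the dispatching rules, the trivial policy) are illustrative assumptions, not the paper's setup:

```python
class ToySchedulingEnv:
    """Minimal single-machine stand-in for a scheduling environment:
    state is (remaining job times, total completion time); an action is
    the index of the job to run next."""
    def __init__(self, jobs):
        self.init_jobs = list(jobs)
    def reset(self):
        self.jobs, self.t, self.total = list(self.init_jobs), 0, 0
        return (tuple(self.jobs), self.total)
    def step(self, job_idx):
        self.t += self.jobs.pop(job_idx)
        self.total += self.t                 # completion time of finished job
        return (tuple(self.jobs), self.total), not self.jobs

def run_with_commitment(policy, heuristics, env, commit=1):
    """Apply the selected low-level rule for `commit` consecutive decisions
    before the hyper-heuristic policy may switch again: commit=1 is
    step-wise switching; a large commit approximates full-episode commitment."""
    state, done, steps_left, rule = env.reset(), False, 0, None
    while not done:
        if steps_left == 0:
            rule = heuristics[policy(state)]  # policy picks a dispatching rule
            steps_left = commit
        state, done = env.step(rule(state))
        steps_left -= 1
    return state

# Two classic dispatching rules: shortest / longest processing time first.
spt = lambda state: min(range(len(state[0])), key=lambda i: state[0][i])
lpt = lambda state: max(range(len(state[0])), key=lambda i: state[0][i])
final = run_with_commitment(lambda s: 0, [spt, lpt],
                            env=ToySchedulingEnv([4, 1, 3, 2]), commit=2)
```

Here a constant policy with SPT yields a total completion time of 20 on jobs [4, 1, 3, 2]; action prefiltering would correspond to restricting `heuristics` to rules that are feasible in the current state.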
cs.AI / 19

Beyond Model Scaling: Test-Time Intervention for Efficient Deep Reasoning

超越模型扩展:高效深度推理的测试时干预
Wang, Qianyue, Hu, Jinwu, Wang, Yufeng, Lin, Huanxiang, Chen, Bolin, Wen, Zhiquan, Chen, Yaofo, Tan, Mingkui
Abstract
Large Reasoning Models (LRMs) excel at multi-step reasoning but often suffer from inefficient reasoning processes like overthinking and overshoot, where excessive or misdirected reasoning increases computational cost and degrades performance. Existing efficient reasoning methods operate in a closed-loop manner, lacking mechanisms for external intervention to guide the reasoning process. To address this, we propose Think-with-Me, a novel test-time interactive reasoning paradigm that introduces external feedback intervention into the reasoning process. Our key insights are that transitional conjunctions serve as natural intervention points, signaling phases of self-validation or exploration, and that using transitional words appropriately to prolong reasoning enhances performance, while excessive use degrades it. Building on these insights, Think-with-Me pauses reasoning at these points for external feedback, adaptively extending or terminating reasoning to reduce redundancy while preserving accuracy. The feedback is generated via a multi-criteria evaluation (rationality and completeness) and comes from either human or LLM proxies. We train the target model using Group Relative Policy Optimization (GRPO) to adapt to this interactive mode. Experiments show that Think-with-Me achieves a superior balance between accuracy and reasoning length under limited context windows. On AIME24, Think-with-Me outperforms QwQ-32B by 7.19% in accuracy while reducing average reasoning length by 81% under an 8K window. The paradigm also benefits security and creative tasks.
Chinese Translation
大型推理模型(LRMs)在多步骤推理方面表现出色,但常常面临低效推理过程的问题,如过度思考和推理过度,这些问题导致计算成本增加和性能下降。现有的高效推理方法以闭环方式运作,缺乏外部干预机制来指导推理过程。为了解决这一问题,我们提出了Think-with-Me,一种新颖的测试时互动推理范式,它将外部反馈干预引入推理过程中。我们的关键见解是,过渡性连接词作为干预的自然切入点,标志着自我验证或探索的阶段,并适当地使用过渡词以延长推理可以提升性能,而过度使用则会影响性能。在这些见解的基础上,Think-with-Me在这些点暂停推理以获取外部反馈,适应性地延长或终止推理,以减少冗余,同时保持准确性。反馈通过多标准评估(理性和完整性)生成,来源于人类或大型语言模型(LLM)代理。我们使用群体相对策略优化(Group Relative Policy Optimization, GRPO)训练目标模型,以适应这种互动模式。实验表明,Think-with-Me在有限上下文窗口下实现了准确性与推理长度之间的优越平衡。在AIME24上,Think-with-Me在准确性上比QwQ-32B提高了7.19%,同时在8K窗口下将平均推理长度减少了81%。该范式还对安全性和创造性任务具有积极影响。
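The pause-and-judge control flow can be sketched independently of any model. `generate` and `judge` below are placeholder interfaces (not the paper's GRPO-trained policy or its rationality/completeness evaluator), and the scripted segments are invented for the demo:

```python
def think_with_feedback(generate, judge, prompt, max_segments=8):
    """Pause generation at transitional conjunctions ('wait', 'however',
    ...) and let an external judge extend or terminate the reasoning."""
    trace = []
    for _ in range(max_segments):
        segment, eos = generate(prompt, trace)   # stops at a transition word
        trace.append(segment)
        if eos or judge(trace) == "terminate":
            break
    return " ".join(trace)

# Toy run: scripted segments stand in for the model; the judge stops once
# an answer statement appears, cutting off redundant re-verification.
script = iter([
    "First, compute 2 + 3 = 5.",
    "Wait, verify: 2 + 3 is indeed 5.",
    "The answer is 5.",
    "However, consider re-checking once more...",   # never reached
])
generate = lambda prompt, trace: (next(script), False)
judge = lambda trace: "terminate" if "answer" in trace[-1].lower() else "extend"
out = think_with_feedback(generate, judge, "What is 2 + 3?")
```

The redundant fourth segment is pruned by the judge, which is the mechanism behind the reported reduction in average reasoning length.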
cs.AI / 20

XChoice: Explainable Evaluation of AI-Human Alignment in LLM-based Constrained Choice Decision Making

XChoice:基于LLM的受限选择决策中AI与人类对齐的可解释评估
Qi, Weihong, Huang, Fan, Muralidharan, Rasika, An, Jisun, Kwak, Haewoon
Abstract
We present XChoice, an explainable framework for evaluating AI-human alignment in constrained decision making. Moving beyond outcome agreement such as accuracy and F1 score, XChoice fits a mechanism-based decision model to human data and LLM-generated decisions, recovering interpretable parameters that capture the relative importance of decision factors, constraint sensitivity, and implied trade-offs. Alignment is assessed by comparing these parameter vectors across models, options, and subgroups. We demonstrate XChoice on Americans' daily time allocation using the American Time Use Survey (ATUS) as human ground truth, revealing heterogeneous alignment across models and activities and salient misalignment concentrated in Black and married groups. We further validate robustness of XChoice via an invariance analysis and evaluate targeted mitigation with a retrieval augmented generation (RAG) intervention. Overall, XChoice provides mechanism-based metrics that diagnose misalignment and support informed improvements beyond surface outcome matching.
Chinese Translation
我们提出了XChoice,这是一个用于评估受限决策中AI与人类对齐的可解释框架。XChoice超越了结果一致性(如准确率和F1分数),将基于机制的决策模型拟合到人类数据和LLM生成的决策中,恢复可解释的参数,以捕捉决策因素的相对重要性、约束敏感性和隐含的权衡。我们通过在不同模型、选项和子群体之间比较这些参数向量来评估对齐情况。我们在美国人的日常时间分配上演示了XChoice,使用美国时间使用调查(ATUS)作为人类的真实数据,揭示了模型和活动之间的异质性对齐,以及在黑人和已婚群体中集中出现的显著不对齐。我们进一步通过不变性分析验证了XChoice的稳健性,并评估了使用检索增强生成(RAG)干预的针对性缓解。总体而言,XChoice提供了基于机制的指标,以诊断不对齐并支持超越表面结果匹配的知情改进。
cs.AI / 21

AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems

AstroReason-Bench:评估异构空间规划问题中的统一智能规划
Wang, Weiyi, Chen, Xinchi, Gong, Jingjing, Huang, Xuanjing, Qiu, Xipeng
Abstract
Recent advances in agentic Large Language Models (LLMs) have positioned them as generalist planners capable of reasoning and acting across diverse tasks. However, existing agent benchmarks largely focus on symbolic or weakly grounded environments, leaving their performance in physics-constrained real-world domains underexplored. We introduce AstroReason-Bench, a comprehensive benchmark for evaluating agentic planning in Space Planning Problems (SPP), a family of high-stakes problems with heterogeneous objectives, strict physical constraints, and long-horizon decision-making. AstroReason-Bench integrates multiple scheduling regimes, including ground station communication and agile Earth observation, and provides a unified agent-oriented interaction protocol. Evaluating on a range of state-of-the-art open- and closed-source agentic LLM systems, we find that current agents substantially underperform specialized solvers, highlighting key limitations of generalist planning under realistic constraints. AstroReason-Bench offers a challenging and diagnostic testbed for future agentic research.
Chinese Translation
近期在智能大型语言模型(LLMs)方面的进展使其成为能够在多样化任务中进行推理和行动的通用规划者。然而,现有的智能基准主要集中在符号或弱基础环境中,导致它们在物理约束的现实世界领域中的表现尚未得到充分探讨。我们提出了AstroReason-Bench,这是一个全面的基准,用于评估空间规划问题(SPP)中的智能规划,这是一类具有异构目标、严格物理约束和长期决策的高风险问题。AstroReason-Bench整合了多种调度机制,包括地面站通信和灵活的地球观测,并提供了统一的面向智能体的交互协议。在对一系列最先进的开源和闭源智能LLM系统进行评估时,我们发现当前的智能体在专门求解器面前表现显著不足,突显了在现实约束下通用规划的关键局限性。AstroReason-Bench为未来的智能研究提供了一个具有挑战性和诊断性的测试平台。
cs.AI / 22

Hyperparameter Optimization of Constraint Programming Solvers

约束编程求解器的超参数优化
Haddad, Hedieh, Falque, Thibault, Talbot, Pierre, Bouvry, Pascal
Abstract
The performance of constraint programming solvers is highly sensitive to the choice of their hyperparameters. Manually finding the best solver configuration is a difficult, time-consuming task that typically requires expert knowledge. In this paper, we introduce the probe-and-solve algorithm, a novel two-phase framework for automated hyperparameter optimization integrated into the CPMpy library. This approach partitions the available time budget into two phases: a probing phase that explores different sets of hyperparameters using configurable hyperparameter optimization methods, followed by a solving phase where the best configuration found is used to tackle the problem within the remaining time. We implement and compare two hyperparameter optimization methods within the probe-and-solve algorithm: Bayesian optimization and Hamming distance search. We evaluate the algorithm on two different constraint programming solvers, ACE and Choco, across 114 combinatorial problem instances, comparing their performance against the solvers' default configurations. Results show that using Bayesian optimization, the algorithm outperforms the solvers' default configurations, improving solution quality for ACE in 25.4% of instances and matching the default performance in 57.9%, and for Choco, achieving superior results in 38.6% of instances. It also consistently surpasses Hamming distance search within the same framework, confirming the advantage of model-based exploration over simple local search. Overall, the probe-and-solve algorithm offers a practical, resource-aware approach for tuning constraint solvers that yields robust improvements across diverse problem types.
Chinese Translation
约束编程求解器的性能对其超参数的选择高度敏感。手动寻找最佳求解器配置是一项困难且耗时的任务,通常需要专业知识。在本文中,我们介绍了一种探测与求解算法(probe and solve algorithm),这是一种新颖的两阶段框架,用于自动化超参数优化,并集成到CPMpy库中。该方法将可用的时间预算划分为两个阶段:探测阶段,通过可配置的超参数优化方法探索不同的超参数集合,随后是求解阶段,在剩余时间内使用找到的最佳配置来解决问题。我们在探测与求解算法中实现并比较了两种超参数优化方法:贝叶斯优化(Bayesian optimization)和汉明距离搜索(Hamming distance search)。我们在两个不同的约束编程求解器ACE和Choco上评估该算法,涵盖114个组合问题实例,并将其性能与求解器的默认配置进行比较。结果表明,使用贝叶斯优化时,该算法的性能优于求解器的默认配置,在25.4%的实例中提高了ACE的解质量,并在57.9%的实例中与默认性能持平,而对于Choco,在38.6%的实例中取得了更优的结果。它在同一框架内也始终超过汉明距离搜索,证实了基于模型的探索相较于简单局部搜索的优势。总体而言,探测与求解算法提供了一种实用的、资源感知的方法来调优约束求解器,在多种问题类型中都能带来稳健的改进。
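The two-phase budget split can be sketched as follows. The sketch simplifies the wall-clock budget to an effort count and probes configurations exhaustively rather than with Bayesian optimization; the toy solver and all names are assumptions for illustration, not the CPMpy integration:

```python
import random

def probe_and_solve(solve, configs, instance, budget=200, probe_frac=0.3):
    """Two-phase scheme: probe each candidate hyperparameter configuration
    with a short run, then spend the remaining budget solving the instance
    with the best configuration found (lower objective = better)."""
    probe_budget = int(probe_frac * budget)
    per_probe = max(1, probe_budget // len(configs))
    scored = [(solve(instance, cfg, effort=per_probe), cfg) for cfg in configs]
    best_cfg = min(scored, key=lambda t: t[0])[1]
    return best_cfg, solve(instance, best_cfg, effort=budget - probe_budget)

def toy_solve(instance, cfg, effort):
    """Toy 'solver': random local search for x near the target `instance`,
    with the step size as the lone hyperparameter; returns the final error."""
    rng, x = random.Random(0), 0.0
    for _ in range(effort):
        cand = x + rng.uniform(-cfg["step"], cfg["step"])
        if abs(cand - instance) < abs(x - instance):
            x = cand
    return abs(x - instance)

cfg, err = probe_and_solve(toy_solve, [{"step": 0.1}, {"step": 2.0}],
                           instance=10.0)
```

Short probes suffice here to reveal that the tiny step size cannot reach the target in time, so the full solving budget goes to the better configuration.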
cs.AI / 23

Exploring LLM Features in Predictive Process Monitoring for Small-Scale Event-Logs

探索小规模事件日志中预测过程监控的LLM特征
Padella, Alessandro, de Leoni, Massimiliano, Dumas, Marlon
Abstract
Predictive Process Monitoring is a branch of process mining that aims to predict the outcome of an ongoing process. Recently, it has leveraged machine- and deep-learning architectures. In this paper, we extend our prior LLM-based Predictive Process Monitoring framework, which was initially focused on total time prediction via prompting. The extension consists of comprehensively evaluating its generality, semantic leverage, and reasoning mechanisms, also across multiple Key Performance Indicators. Empirical evaluations conducted on three distinct event logs and across the Key Performance Indicators of Total Time and Activity Occurrence prediction indicate that, in data-scarce settings with only 100 traces, the LLM surpasses the benchmark methods. Furthermore, the experiments also show that the LLM exploits both its embodied prior knowledge and the internal correlations among training traces. Finally, we examine the reasoning strategies employed by the model, demonstrating that the LLM does not merely replicate existing predictive methods but performs higher-order reasoning to generate the predictions.
Chinese Translation
预测过程监控是过程挖掘的一个分支,旨在预测正在进行的过程的结果。最近,该领域利用了机器学习和深度学习架构。在本文中,我们扩展了之前基于LLM的预测过程监控框架,该框架最初专注于通过提示进行总时间预测。此次扩展全面评估了其通用性、语义利用和推理机制,同时涵盖多个关键绩效指标。对三个不同事件日志及总时间和活动发生预测的关键绩效指标进行的实证评估表明,在仅有100条轨迹的数据稀缺环境中,LLM超越了基准方法。此外,实验还表明,LLM利用了其内在的先验知识和训练轨迹之间的内部关联。最后,我们考察了模型采用的推理策略,证明LLM不仅仅是复制现有的预测方法,而是进行更高阶的推理以生成预测结果。
cs.AI / 24

Health Facility Location in Ethiopia: Leveraging LLMs to Integrate Expert Knowledge into Algorithmic Planning

埃塞俄比亚的卫生设施选址:利用大型语言模型将专家知识融入算法规划
Trabelsi, Yohai, Xiong, Guojun, Getnet, Fentabil, Verguet, Stéphane, Tambe, Milind
Abstract
Ethiopia's Ministry of Health is upgrading health posts to improve access to essential services, particularly in rural areas. Limited resources, however, require careful prioritization of which facilities to upgrade to maximize population coverage while accounting for diverse expert and stakeholder preferences. In collaboration with the Ethiopian Public Health Institute and Ministry of Health, we propose a hybrid framework that systematically integrates expert knowledge with optimization techniques. Classical optimization methods provide theoretical guarantees but require explicit, quantitative objectives, whereas stakeholder criteria are often articulated in natural language and difficult to formalize. To bridge these domains, we develop the Large language model and Extended Greedy (LEG) framework. Our framework combines a provable approximation algorithm for population coverage optimization with LLM-driven iterative refinement that incorporates human-AI alignment to ensure solutions reflect expert qualitative guidance while preserving coverage guarantees. Experiments on real-world data from three Ethiopian regions demonstrate the framework's effectiveness and its potential to inform equitable, data-driven health system planning.
Chinese Translation
埃塞俄比亚卫生部正在升级卫生站,以改善对基本服务的获取,特别是在农村地区。然而,有限的资源要求我们仔细优先考虑哪些设施需要升级,以最大化人口覆盖,同时考虑不同专家和利益相关者的偏好。我们与埃塞俄比亚公共卫生研究所和卫生部合作,提出了一种混合框架,系统地将专家知识与优化技术相结合。经典的优化方法提供理论保证,但需要明确的定量目标,而利益相关者的标准通常以自然语言表达,难以形式化。为了弥合这两个领域,我们开发了大型语言模型与扩展贪婪算法(LEG)框架。我们的框架结合了可证明的覆盖优化近似算法与基于大型语言模型的迭代优化,融入人机对齐,以确保解决方案反映专家的定性指导,同时保持覆盖保证。在来自埃塞俄比亚三个地区的真实数据上的实验表明,该框架的有效性及其在公平、数据驱动的卫生系统规划中的潜力。
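The coverage-optimization core of such a framework is the classic greedy algorithm for maximum coverage, which carries a (1 - 1/e) approximation guarantee for this monotone submodular objective. The sketch below shows only that core on an invented toy instance; the LLM-driven refinement of expert preferences is not modeled here:

```python
def greedy_upgrade(candidates, k):
    """Greedily pick k health posts to upgrade, each time choosing the
    facility that covers the most not-yet-covered people."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(candidates, key=lambda f: len(candidates[f] - covered))
        if not candidates[best] - covered:
            break                                # nothing new to cover
        chosen.append(best)
        covered |= candidates[best]
    return chosen, covered

# Toy instance: health post -> set of village IDs it would newly cover.
candidates = {
    "HP1": {1, 2, 3},
    "HP2": {3, 4},
    "HP3": {4, 5, 6, 7},
    "HP4": {1, 7},
}
chosen, covered = greedy_upgrade(candidates, k=2)
```

In an LEG-style loop, qualitative expert feedback would then adjust the candidate pool or add constraints before the greedy step reruns, preserving the coverage guarantee at each iteration.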
cs.AI / 25

BoxMind: Closed-loop AI strategy optimization for elite boxing validated in the 2024 Olympics

BoxMind:在2024年奥运会上验证的精英拳击闭环AI策略优化
Wang, Kaiwen, Zheng, Kaili, Deng, Rongrong, Fan, Qingmin, Zhang, Milin, Li, Zongrui, Zhou, Xuesi, Han, Bo, Chen, Liren, Guo, Chenyi, Wu, Ji
Abstract
Competitive sports require sophisticated tactical analysis, yet combat disciplines like boxing remain underdeveloped in AI-driven analytics due to the complexity of action dynamics and the lack of structured tactical representations. To address this, we present BoxMind, a closed-loop AI expert system validated in elite boxing competition. By defining atomic punch events with precise temporal boundaries and spatial and technical attributes, we parse match footage into 18 hierarchical technical-tactical indicators. We then propose a graph-based predictive model that fuses these explicit technical-tactical profiles with learnable, time-variant latent embeddings to capture the dynamics of boxer matchups. Modeling match outcome as a differentiable function of technical-tactical indicators, we turn winning probability gradients into executable tactical adjustments. Experiments show that the outcome prediction model achieves state-of-the-art performance, with 69.8% accuracy on BoxerGraph test set and 87.5% on Olympic matches. Using this predictive model as a foundation, the system generates strategic recommendations that demonstrate proficiency comparable to human experts. BoxMind is validated through a closed-loop deployment during the 2024 Paris Olympics, directly contributing to the Chinese National Team's historic achievement of three gold and two silver medals. BoxMind establishes a replicable paradigm for transforming unstructured video data into strategic intelligence, bridging the gap between computer vision and decision support in competitive sports.
Chinese Translation
竞技体育需要复杂的战术分析,但由于动作动态的复杂性和缺乏结构化的战术表示,拳击等格斗项目在基于AI的分析方面仍然相对滞后。为了解决这一问题,我们提出了BoxMind,一个在精英拳击比赛中验证的闭环AI专家系统。通过定义具有精确时间边界和空间及技术属性的原子拳击事件,我们将比赛录像解析为18个层次的技术-战术指标。接着,我们提出了一种基于图的预测模型,将这些明确的技术-战术特征与可学习的、时间变化的潜在嵌入相融合,以捕捉拳手对决的动态。将比赛结果建模为技术-战术指标的可微分函数,我们将获胜概率梯度转化为可执行的战术调整。实验表明,结果预测模型在BoxerGraph测试集上的准确率达到69.8%,在奥运比赛中达到87.5%,表现出色。以此预测模型为基础,该系统生成的战略建议显示出与人类专家相当的专业水平。BoxMind通过在2024年巴黎奥运会的闭环部署得到了验证,直接助力中国国家队历史性地获得三枚金牌和两枚银牌。BoxMind建立了一个可复制的范式,将非结构化视频数据转化为战略情报,弥合了计算机视觉与竞技体育决策支持之间的鸿沟。
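The step from "winning probability gradients" to "executable tactical adjustments" can be sketched with a stand-in differentiable outcome model. This is not BoxMind's actual model: the logistic form, the two indicators, and the weights are all assumptions for illustration.

```python
import math

def win_probability(indicators, weights, bias=0.0):
    """Toy outcome model: sigmoid of a weighted sum of tactical indicators."""
    z = sum(w * x for w, x in zip(weights, indicators)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def tactical_adjustment(indicators, weights, step=0.1):
    """One gradient-ascent step on the indicators: dp/dx_i = p(1-p) * w_i."""
    p = win_probability(indicators, weights)
    grad = [p * (1.0 - p) * w for w in weights]
    return [x + step * g for x, g in zip(indicators, grad)]

# Two hypothetical indicators, e.g. jab rate and counter-punch frequency.
weights = [0.8, -0.3]          # assumed learned weights
indicators = [0.5, 0.5]
adjusted = tactical_adjustment(indicators, weights)
```

The sign and magnitude of each gradient component indicate which indicator to push up or down, which is the kind of adjustment a coach could act on between rounds.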
计算语言学 (Computation and Language)
45
cs.CL / 1

LLMs for Game Theory: Entropy-Guided In-Context Learning and Adaptive CoT Reasoning

用于博弈论的LLMs:基于熵的上下文学习与自适应链式推理
Banfi, Tommaso Felice, Gamage, Sashenka
Abstract
We propose a novel LLM-based framework for reasoning in discrete, game-theoretic tasks, illustrated with \emph{Tic-Tac-Toe}. The method integrates in-context learning with entropy-guided chain-of-thought (CoT) reasoning and adaptive context retrieval. The model dynamically adjusts both the number of retrieved examples and reasoning paths according to token-level uncertainty: concise reasoning with minimal context is used when uncertainty is low, whereas higher uncertainty triggers expanded multi-path CoT exploration. Experimental evaluation against a sub-optimal algorithmic opponent shows that entropy-aware adaptive reasoning substantially improves decision quality, increasing the average game outcome from \(-11.6\%\) with the baseline LLM to \(+9.5\%\) with entropy-guided adaptive reasoning over 100 games (win = +1, tie = 0, loss = -1), while maintaining a relatively low number of LLM queries per game. Statistical validation confirms that the improvement is significant, and correlation analysis reveals a negative association between token-level entropy and move optimality. These findings demonstrate that uncertainty-guided adaptive reasoning effectively enhances LLM performance in sequential decision-making environments.
Chinese Translation
我们提出了一种基于LLM的新框架,用于离散博弈任务中的推理,以井字棋(Tic-Tac-Toe)为例。该方法将上下文学习与基于熵的链式推理(CoT)和自适应上下文检索相结合。模型根据标记级的不确定性动态调整检索示例的数量和推理路径:在不确定性较低时使用简洁的推理和最小的上下文,而在不确定性较高时则触发扩展的多路径CoT探索。与一个次优算法对手的实验评估表明,基于熵的自适应推理显著提高了决策质量,在100局游戏中,平均游戏结果从基线LLM的-11.6%提升至+9.5%(胜 = +1,平 = 0,负 = -1),同时保持每局游戏的LLM查询数量相对较低。统计验证确认了这一改善是显著的,相关性分析揭示了标记级熵与移动最优性之间的负相关关系。这些发现表明,基于不确定性的自适应推理有效提升了LLM在顺序决策环境中的表现。
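The entropy-guided control loop described above can be sketched as follows. The entropy thresholds and the (paths, examples) budgets are invented for the sketch; the paper presumably tunes its own.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def reasoning_budget(probs, low=0.5, high=1.5):
    """Map token-level uncertainty to (num CoT paths, num retrieved examples)."""
    h = token_entropy(probs)
    if h < low:
        return 1, 2    # confident: one concise path, minimal context
    if h < high:
        return 3, 4    # moderate uncertainty: a few paths, more examples
    return 5, 8        # very uncertain: widen multi-path CoT exploration

confident = [0.95, 0.03, 0.02]        # model nearly sure of its next move
uncertain = [0.25, 0.25, 0.25, 0.25]  # maximal uncertainty over four moves
```

Low-entropy positions get the cheap path, so the per-game query count stays low; only the genuinely uncertain positions trigger the expensive multi-path exploration.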
cs.CL / 2

BYOL: Bring Your Own Language Into LLMs

BYOL:将您的语言带入大型语言模型
Zamir, Syed Waqas, Hamidouche, Wassim, Amor, Boulbaba Ben, Marotti, Luana, Becker-Reshef, Inbal, Ferres, Juan Lavista
Abstract
Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain fundamentally constrained by the severe imbalance in global language resources. While over 7,000 languages are spoken worldwide, only a small subset (fewer than 100) has sufficient digital presence to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low-resource and extreme-low-resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework for scalable, language-aware LLM development tailored to each language's digital footprint. BYOL begins with a language resource classification that maps languages into four tiers (Extreme-Low, Low, Mid, High) using curated web-scale corpora, and uses this classification to select the appropriate integration pathway. For low-resource languages, we propose a full-stack data refinement and expansion pipeline that combines corpus cleaning, synthetic text generation, continual pretraining, and supervised finetuning. Applied to Chichewa and Maori, this pipeline yields language-specific LLMs that achieve approximately 12 percent average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight-space model merging. For extreme-low-resource languages, we introduce a translation-mediated inclusion pathway, and show on Inuktitut that a tailored machine translation system improves over a commercial baseline by 4 BLEU, enabling high-accuracy LLM access when direct language modeling is infeasible. Finally, we release human-translated versions of the Global MMLU-Lite benchmark in Chichewa, Maori, and Inuktitut, and make our codebase and models publicly available at https://github.com/microsoft/byol .
Chinese Translation
大型语言模型(LLMs)展现出强大的多语言能力,但仍然受到全球语言资源严重不平衡的根本限制。尽管全球有超过7000种语言,但只有少数(不到100种)具有足够的数字存在感,能够对现代LLM训练产生实质性影响。这种差异导致了系统性的表现不足、文化不对齐以及对低资源和极低资源语言使用者的有限可及性。为了解决这一问题,我们提出了“将您的语言带入”(BYOL),这是一个统一的框架,用于可扩展的、语言感知的LLM开发,旨在根据每种语言的数字足迹进行定制。BYOL首先进行语言资源分类,将语言映射为四个层级(极低资源、低资源、中等资源、高资源),并利用经过策划的网络规模语料库进行分类,以选择合适的整合路径。对于低资源语言,我们提出了一个全栈数据精炼和扩展管道,结合了语料清理、合成文本生成、持续预训练和监督微调。应用于奇切瓦语和毛利语,该管道生成的语言特定LLM在12个基准测试中实现了约12%的平均提升,相较于强大的多语言基线,同时通过权重空间模型合并保留了英语和多语言能力。对于极低资源语言,我们引入了一种翻译介导的纳入路径,并在因纽克提特语上展示了一个定制的机器翻译系统比商业基线提高了4 BLEU,从而在直接语言建模不可行时实现高准确度的LLM访问。最后,我们发布了奇切瓦语、毛利语和因纽克提特语的全球MMLU-Lite基准的人类翻译版本,并在 https://github.com/microsoft/byol 上公开了我们的代码库和模型。
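BYOL's tier-then-pathway logic can be sketched as a simple classifier. The token-count thresholds below are invented for illustration; the paper derives its four tiers from curated web-scale corpora, not from fixed cutoffs.

```python
TIERS = ["Extreme-Low", "Low", "Mid", "High"]

def classify_language(corpus_tokens,
                      thresholds=(10_000_000, 1_000_000_000, 100_000_000_000)):
    """Map a language's digital footprint (token count) to a resource tier."""
    for tier, bound in zip(TIERS, thresholds):
        if corpus_tokens < bound:
            return tier
    return TIERS[-1]

def integration_pathway(tier):
    """Choose the BYOL pathway the abstract suggests for each tier."""
    if tier == "Extreme-Low":
        return "translation-mediated inclusion"
    if tier == "Low":
        return "refinement pipeline: cleaning + synthesis + CPT + SFT"
    return "standard multilingual training"
```

Under this sketch, a language like Inuktitut would land in the Extreme-Low tier and route to the translation-mediated pathway, while Chichewa and Maori would take the full data refinement and expansion pipeline.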
cs.CL / 3

A Concise Agent is Less Expert: Revealing Side Effects of Using Style Features on Conversational Agents

简洁的智能体不那么专业:揭示使用风格特征对对话智能体的副作用
Cho, Young-Min, Yuan, Yuan, Guntuku, Sharath Chandra, Ungar, Lyle
Abstract
Style features such as friendly, helpful, or concise are widely used in prompts to steer the behavior of Large Language Model (LLM) conversational agents, yet their unintended side effects remain poorly understood. In this work, we present the first systematic study of cross-feature stylistic side effects. We conduct a comprehensive survey of 127 conversational agent papers from ACL Anthology and identify 12 frequently used style features. Using controlled, synthetic dialogues across task-oriented and open domain settings, we quantify how prompting for one style feature causally affects others via a pairwise LLM-as-a-Judge evaluation framework. Our results reveal consistent and structured side effects; for example, prompting for conciseness significantly reduces perceived expertise. They demonstrate that style features are deeply entangled rather than orthogonal. To support future research, we introduce CASSE (Conversational Agent Stylistic Side Effects), a dataset capturing these complex interactions. We further evaluate prompt based and activation steering based mitigation strategies and find that while they can partially restore suppressed traits, they often degrade the primary intended style. These findings challenge the assumption of faithful style control in LLMs and highlight the need for multi-objective and more principled approaches to safe, targeted stylistic steering in conversational agents.
Chinese Translation
风格特征如友好、乐于助人或简洁,广泛应用于提示中以引导大型语言模型(LLM)对话智能体的行为,但它们的意外副作用仍然不够明确。在本研究中,我们首次系统性地研究了跨特征的风格副作用。我们对来自ACL文献库的127篇对话智能体论文进行了全面调查,识别出12种常用的风格特征。通过在任务导向和开放域设置中使用受控的合成对话,我们量化了对一种风格特征的提示如何因果影响其他特征,采用成对LLM作为评估框架。我们的结果揭示了一致且结构化的副作用,例如,提示简洁性显著降低了感知的专业性。它们表明风格特征是紧密交织的,而非正交的。为了支持未来的研究,我们引入了CASSE(对话智能体风格副作用),这是一个捕捉这些复杂交互的数据集。我们进一步评估了基于提示和激活引导的缓解策略,发现尽管它们可以部分恢复被抑制的特征,但往往会降低主要预期风格。这些发现挑战了对LLM中忠实风格控制的假设,并强调了在对话智能体中进行安全、针对性的风格引导需要多目标和更原则性的方法。
cs.CL / 4

Reasoning Models Generate Societies of Thought

推理模型生成思想社群
Kim, Junsol, Lai, Shiyang, Scherrer, Nino, Arcas, Blaise Agüera y, Evans, James
Abstract
Large language models have achieved remarkable capabilities across domains, yet mechanisms underlying sophisticated reasoning remain elusive. Recent reasoning models outperform comparable instruction-tuned models on complex cognitive tasks, attributed to extended computation through longer chains of thought. Here we show that enhanced reasoning emerges not from extended computation alone, but from simulating multi-agent-like interactions -- a society of thought -- which enables diversification and debate among internal cognitive perspectives characterized by distinct personality traits and domain expertise. Through quantitative analysis and mechanistic interpretability methods applied to reasoning traces, we find that reasoning models like DeepSeek-R1 and QwQ-32B exhibit much greater perspective diversity than instruction-tuned models, activating broader conflict between heterogeneous personality- and expertise-related features during reasoning. This multi-agent structure manifests in conversational behaviors, including question-answering, perspective shifts, and the reconciliation of conflicting views, and in socio-emotional roles that characterize sharp back-and-forth conversations, together accounting for the accuracy advantage in reasoning tasks. Controlled reinforcement learning experiments reveal that base models increase conversational behaviors when rewarded solely for reasoning accuracy, and fine-tuning models with conversational scaffolding accelerates reasoning improvement over base models. These findings indicate that the social organization of thought enables effective exploration of solution spaces. We suggest that reasoning models establish a computational parallel to collective intelligence in human groups, where diversity enables superior problem-solving when systematically structured, which suggests new opportunities for agent organization to harness the wisdom of crowds.
Chinese Translation
大型语言模型在多个领域取得了显著的能力,然而其背后的复杂推理机制仍然难以捉摸。近期的推理模型在复杂认知任务上超越了可比的指令调优模型,这归因于通过更长的思维链进行的扩展计算。在此,我们展示了增强的推理并非仅源于扩展计算,而是来自于模拟多智能体式的互动——一种思想社群——这使得不同个性特征和领域专长的内部认知视角之间能够实现多样化和辩论。通过对推理轨迹进行定量分析和机制可解释性方法,我们发现像 DeepSeek-R1 和 QwQ-32B 这样的推理模型展现出比指令调优模型更大的视角多样性,在推理过程中激活了个性和专长相关特征之间更广泛的冲突。这种多智能体结构在对话行为中表现出来,包括问答、视角转变和冲突观点的调和,以及在特征鲜明的激烈对话中所体现的社会情感角色,共同解释了推理任务中的准确性优势。受控的强化学习实验表明,当仅因推理准确性获得奖励时,基础模型会增加对话行为,而用对话支架进行微调的模型则加速了相较于基础模型的推理改进。这些发现表明,思想的社会组织能够有效探索解决方案空间。我们建议推理模型建立了与人类群体中的集体智慧相对应的计算平行关系,其中多样性在系统性结构下能够实现更优的问题解决,这为智能体组织利用集体智慧提供了新的机会。
cs.CL / 5

EncodeRec: An Embedding Backbone for Recommendation Systems

EncodeRec:推荐系统的嵌入骨干网络
Hadad, Guy, Rabaev, Neomi, Shapira, Bracha
Abstract
Recent recommender systems increasingly leverage embeddings from large pre-trained language models (PLMs). However, such embeddings exhibit two key limitations: (1) PLMs are not explicitly optimized to produce structured and discriminative embedding spaces, and (2) their representations remain overly generic, often failing to capture the domain-specific semantics crucial for recommendation tasks. We present EncodeRec, an approach designed to align textual representations with recommendation objectives while learning compact, informative embeddings directly from item descriptions. EncodeRec keeps the language model parameters frozen during recommender system training, making it computationally efficient without sacrificing semantic fidelity. Experiments across core recommendation benchmarks demonstrate its effectiveness both as a backbone for sequential recommendation models and for semantic ID tokenization, showing substantial gains over PLM-based and embedding model baselines. These results underscore the pivotal role of embedding adaptation in bridging the gap between general-purpose language models and practical recommender systems.
Chinese Translation
近年来,推荐系统越来越多地利用来自大型预训练语言模型(PLMs)的嵌入。然而,这些嵌入存在两个主要限制:(1)PLMs并未明确优化以生成结构化和具有区分性的嵌入空间;(2)它们的表示过于通用,往往无法捕捉对推荐任务至关重要的领域特定语义。我们提出了EncodeRec,这是一种旨在将文本表示与推荐目标对齐的方案,同时直接从项目描述中学习紧凑且信息丰富的嵌入。EncodeRec在推荐系统训练过程中保持语言模型参数不变,从而在不牺牲语义保真度的情况下提高计算效率。针对核心推荐基准的实验表明,EncodeRec在作为序列推荐模型的骨干网络和语义ID标记化方面都表现出有效性,相较于基于PLM的和嵌入模型的基线显示出显著的提升。这些结果强调了嵌入适应在弥合通用语言模型与实际推荐系统之间差距中的关键作用。
cs.CL / 6

DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference

DialDefer:检测和缓解大型语言模型对话性依赖的框架
Rabbani, Parisa, Sahoo, Priyam, Mathew, Ruben, Mondal, Aishee, Ketharaman, Harshita, Bozdag, Nimet Beyza, Hakkani-Tür, Dilek
Abstract
LLMs are increasingly used as third-party judges, yet their reliability when evaluating speakers in dialogue remains poorly understood. We show that LLMs judge identical claims differently depending on framing: the same content elicits different verdicts when presented as a statement to verify ("Is this statement correct?") versus attributed to a speaker ("Is this speaker correct?"). We call this dialogic deference and introduce DialDefer, a framework for detecting and mitigating these framing-induced judgment shifts. Our Dialogic Deference Score (DDS) captures directional shifts that aggregate accuracy obscures. Across nine domains, 3k+ instances, and four models, conversational framing induces large shifts (|DDS| up to 87pp, p < .0001) while accuracy remains stable (<2pp), with effects amplifying 2-4x on naturalistic Reddit conversations. Models can shift toward agreement (deference) or disagreement (skepticism) depending on domain -- the same model ranges from DDS = -53 on graduate-level science to +58 on social judgment. Ablations reveal that human-vs-LLM attribution drives the largest shifts (17.7pp swing), suggesting models treat disagreement with humans as more costly than with AI. Mitigation attempts reduce deference but can over-correct into skepticism, framing this as a calibration problem beyond accuracy optimization.
Chinese Translation
大型语言模型(LLMs)越来越多地被用作第三方评判者,但它们在对话中评估发言者的可靠性仍然不甚了解。我们展示了LLMs如何根据框架对相同的主张做出不同的判断:当内容以验证声明的形式呈现(“这个声明正确吗?”)与归因于发言者(“这个发言者正确吗?”)时,结果截然不同。我们称之为对话性依赖,并引入DialDefer,一个用于检测和缓解这些框架引起的判断偏移的框架。我们的对话性依赖评分(DDS)捕捉到聚合准确性所掩盖的方向性偏移。在九个领域、3000多个实例和四个模型中,交谈框架引起了显著的偏移(|DDS|高达87个百分点,p < .0001),而准确性保持稳定(<2个百分点),在自然主义的Reddit对话中效果放大了2-4倍。模型可以根据领域向同意(依赖)或不同意(怀疑)转变——同一模型在研究生级科学中表现为DDS = -53,而在社会判断中则为+58。消融实验表明,人类与LLM的归因驱动了最大的偏移(17.7个百分点的波动),这表明模型将与人类的不同意见视为比与AI的不同意见更具成本。缓解尝试减少了依赖,但可能会过度校正为怀疑,将其视为超越准确性优化的校准问题。
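A minimal sketch of a directional shift score in the spirit of DDS. The paper's exact definition may differ; here DDS is read as the signed percentage-point change in "correct" verdicts when the same claims move from statement framing to speaker framing, which aggregate accuracy alone would hide.

```python
def dialogic_deference_score(statement_verdicts, speaker_verdicts):
    """Verdicts are booleans: True = judged correct. Returns a pp shift.

    Positive values indicate deference (shift toward agreement under
    speaker framing); negative values indicate skepticism.
    """
    assert len(statement_verdicts) == len(speaker_verdicts)
    n = len(statement_verdicts)
    rate_stmt = sum(statement_verdicts) / n
    rate_spkr = sum(speaker_verdicts) / n
    return 100.0 * (rate_spkr - rate_stmt)

# Toy example: the judge flips 3 of 10 verdicts toward agreement once the
# claim is attributed to a speaker -- accuracy-style aggregates could still
# look unchanged, but DDS exposes the directional shift.
stmt = [True] * 5 + [False] * 5
spkr = [True] * 8 + [False] * 2
dds = dialogic_deference_score(stmt, spkr)
```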
cs.CL / 7

Neural Induction of Finite-State Transducers

神经诱导有限状态转换器
Ginn, Michael, Palmer, Alexis, Hulden, Mans
Abstract
Finite-State Transducers (FSTs) are effective models for string-to-string rewriting tasks, often providing the efficiency necessary for high-performance applications, but constructing transducers by hand is difficult. In this work, we propose a novel method for automatically constructing unweighted FSTs following the hidden state geometry learned by a recurrent neural network. We evaluate our methods on real-world datasets for morphological inflection, grapheme-to-phoneme prediction, and historical normalization, showing that the constructed FSTs are highly accurate and robust for many datasets, substantially outperforming classical transducer learning algorithms by up to 87% accuracy on held-out test sets.
Chinese Translation
有限状态转换器(FSTs)是有效的字符串到字符串重写任务模型,通常提供高性能应用所需的效率,但手动构建转换器是困难的。在本研究中,我们提出了一种新颖的方法,通过递归神经网络学习的隐藏状态几何结构,自动构建无权重的FSTs。我们在真实世界的数据集上评估了我们的方法,包括形态变化、字素到音素预测和历史规范化,结果表明构建的FSTs在许多数据集上具有高度的准确性和鲁棒性,在保留的测试集上相比经典的转换器学习算法,准确率提高了多达87%。
cs.CL / 8

Massively Multilingual Joint Segmentation and Glossing

大规模多语言联合分段与注释
Ginn, Michael, Tjuatja, Lindia, Rice, Enora, Marashian, Ali, Valentini, Maria, Xu, Jasmine, Neubig, Graham, Palmer, Alexis
Abstract
Automated interlinear gloss prediction with neural networks is a promising approach to accelerate language documentation efforts. However, while state-of-the-art models like GlossLM achieve high scores on glossing benchmarks, user studies with linguists have found critical barriers to the usefulness of such models in real-world scenarios. In particular, existing models typically generate morpheme-level glosses but assign them to whole words without predicting the actual morpheme boundaries, making the predictions less interpretable and thus untrustworthy to human annotators. We conduct the first study on neural models that jointly predict interlinear glosses and the corresponding morphological segmentation from raw text. We run experiments to determine the optimal way to train models that balance segmentation and glossing accuracy, as well as the alignment between the two tasks. We extend the training corpus of GlossLM and pretrain PolyGloss, a family of seq2seq multilingual models for joint segmentation and glossing that outperforms GlossLM on glossing and beats various open-source LLMs on segmentation, glossing, and alignment. In addition, we demonstrate that PolyGloss can be quickly adapted to a new dataset via low-rank adaptation.
Chinese Translation
基于神经网络的自动逐行注释预测是一种有前景的方法,可以加速语言文献的整理工作。然而,尽管像GlossLM这样的最先进模型在注释基准测试中取得了高分,但与语言学家的用户研究发现了这些模型在现实场景中的实用性面临的重大障碍。特别是,现有模型通常生成形态素级别的注释,但将其分配给整个单词,而不预测实际的形态素边界,这使得预测结果不够可解释,从而不被人类注释者信任。我们首次研究了神经模型,这些模型可以从原始文本中联合预测逐行注释及相应的形态分段。我们进行实验以确定训练模型的最佳方式,以平衡分段和注释的准确性,以及这两个任务之间的对齐。我们扩展了GlossLM的训练语料库,并预训练了PolyGloss,这是一系列用于联合分段和注释的seq2seq多语言模型,其在注释方面优于GlossLM,并在分段、注释和对齐方面超越了各种开源大型语言模型。此外,我们还展示了PolyGloss可以通过低秩适应快速适应新的数据集。
cs.CL / 9

Selecting Language Models for Social Science: Start Small, Start Open, and Validate

为社会科学选择语言模型:从小开始,从开放开始,并进行验证
Stoltz, Dustin S., Taylor, Marshall A., Kumar, Sanuj
Abstract
Currently, there are thousands of large pretrained language models (LLMs) available to social scientists. How do we select among them? Using validity, reliability, reproducibility, and replicability as guides, we explore the significance of: (1) model openness, (2) model footprint, (3) training data, and (4) model architectures and fine-tuning. While ex-ante tests of validity (i.e., benchmarks) are often privileged in these discussions, we argue that social scientists cannot altogether avoid validating computational measures (ex-post). Replicability, in particular, is a more pressing guide for selecting language models. Being able to reliably replicate a particular finding that entails the use of a language model necessitates reliably reproducing a task. To this end, we propose starting with smaller, open models, and constructing delimited benchmarks to demonstrate the validity of the entire computational pipeline.
Chinese Translation
目前,有数千种大型预训练语言模型(LLMs)可供社会科学家使用。我们该如何在这些模型中进行选择?以有效性、可靠性、可重复性和可复制性为指导,我们探讨了以下几个方面的重要性:(1) 模型开放性,(2) 模型足迹,(3) 训练数据,以及 (4) 模型架构和微调。尽管在这些讨论中,事前有效性测试(即基准测试)往往受到重视,但我们认为社会科学家无法完全避免对计算度量进行事后验证。特别是,可复制性是选择语言模型时更为紧迫的指导。能够可靠地复制一个涉及使用语言模型的特定发现,需要可靠地再现一个任务。为此,我们建议从较小的开放模型开始,并构建有限的基准测试,以证明整个计算流程的有效性。
cs.CL / 10

Multi-Stage Patient Role-Playing Framework for Realistic Clinical Interactions

多阶段患者角色扮演框架用于真实临床互动
Jiang, Shijie, Zhang, Zefan, Zhu, Kehua, Bai, Tian, Zhao, Ruihong
Abstract
The simulation of realistic clinical interactions plays a pivotal role in advancing clinical Large Language Models (LLMs) and supporting medical diagnostic education. Existing approaches and benchmarks rely on generic or LLM-generated dialogue data, which limits the authenticity and diversity of doctor-patient interactions. In this work, we propose the first Chinese patient simulation dataset (Ch-PatientSim), constructed from realistic clinical interaction scenarios to comprehensively evaluate the performance of models in emulating patient behavior. Patients are simulated based on a five-dimensional persona structure. To address issues of the persona class imbalance, a portion of the dataset is augmented using few-shot generation, followed by manual verification. We evaluate various state-of-the-art LLMs and find that most produce overly formal responses that lack individual personality. To address this limitation, we propose a training-free Multi-Stage Patient Role-Playing (MSPRP) framework, which decomposes interactions into three stages to ensure both personalization and realism in model responses. Experimental results demonstrate that our approach significantly improves model performance across multiple dimensions of patient simulation.
Chinese Translation
真实临床互动的模拟在推动临床大型语言模型(LLMs)和支持医学诊断教育方面发挥着关键作用。现有的方法和基准依赖于通用或LLM生成的对话数据,这限制了医生与患者互动的真实性和多样性。在本研究中,我们提出了首个中文患者模拟数据集(Ch-PatientSim),该数据集基于真实的临床互动场景构建,以全面评估模型在模拟患者行为方面的表现。患者的模拟基于五维人格结构。为了解决人格类别不平衡的问题,我们对数据集的一部分进行了少量样本生成的增强,并进行了人工验证。我们评估了多种最先进的LLMs,发现大多数模型生成的回应过于正式,缺乏个性。为了解决这一局限性,我们提出了一种无训练的多阶段患者角色扮演(MSPRP)框架,该框架将互动分解为三个阶段,以确保模型回应的个性化和真实感。实验结果表明,我们的方法在多个患者模拟维度上显著提高了模型性能。
cs.CL / 11

Steering Language Models Before They Speak: Logit-Level Interventions

在语言模型发声之前进行引导:Logit级干预
An, Hyeseon, Park, Shinwoo, Jin, Hyundong, Han, Yo-Sub
Abstract
Steering LLMs is essential for specialized applications such as style-sensitive text rewriting, user-adaptive communication, and toxicity mitigation. Current steering methods, such as prompting-based and activation-based approaches, are widely used to guide model behavior. However, activation-based techniques require deep access to internal layers, while prompting-based steering often fails to provide consistent or fine-grained control. In order to address these limitations, we propose a training-free inference-time logit intervention for controllable generation. Our approach utilizes a statistical token score table derived from z-normalized log-odds of labeled corpora to shift the decoding distribution. Empirical evaluations across three diverse datasets focusing on writing complexity, formality, and toxicity demonstrate that our method effectively steers output characteristics, confirming its broad applicability and task-agnostic nature. Our results show that statistically grounded logit steering can achieve large, consistent, and multi-task control gains: up to +47 percentage points in accuracy and a 50x F1 improvement.
Chinese Translation
引导大型语言模型(LLMs)对于风格敏感的文本重写、用户自适应交流和毒性缓解等专业应用至关重要。目前的引导方法,如基于提示和基于激活的方法,广泛用于指导模型行为。然而,基于激活的技术需要对内部层进行深度访问,而基于提示的引导往往无法提供一致或细粒度的控制。为了解决这些局限性,我们提出了一种无训练的推理时Logit干预方法,以实现可控生成。我们的方法利用从标记语料库的z标准化对数几率中导出的统计令牌得分表来调整解码分布。在关注写作复杂性、正式性和毒性的三个多样化数据集上的实证评估表明,我们的方法有效地引导输出特征,确认其广泛适用性和任务无关性。我们的结果显示,基于统计的Logit引导可以实现显著、一致和多任务的控制增益:准确率提高高达47个百分点,F1值提升50倍。
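The token-score-table idea above can be sketched end to end: smoothed log-odds of each token's frequency in a target-style corpus versus a contrast corpus, z-normalized, then added to decoding logits. The toy corpora, smoothing constant, and strength value are assumptions; only the overall recipe follows the abstract.

```python
import math
from statistics import mean, pstdev

def log_odds(count_pos, count_neg, alpha=0.5):
    """Smoothed log-odds of a token appearing in the target-style corpus."""
    return math.log((count_pos + alpha) / (count_neg + alpha))

def build_score_table(pos_counts, neg_counts):
    """z-normalize raw log-odds over the vocabulary into a score table."""
    vocab = set(pos_counts) | set(neg_counts)
    raw = {t: log_odds(pos_counts.get(t, 0), neg_counts.get(t, 0)) for t in vocab}
    mu, sigma = mean(raw.values()), pstdev(raw.values())
    return {t: (s - mu) / sigma for t, s in raw.items()}

def steer_logits(logits, table, strength=1.0):
    """Shift the decoding distribution toward the target style at inference."""
    return {t: z + strength * table.get(t, 0.0) for t, z in logits.items()}

pos = {"hence": 9, "thus": 7, "gonna": 0}   # formal corpus counts (toy)
neg = {"hence": 1, "thus": 1, "gonna": 8}   # informal corpus counts (toy)
table = build_score_table(pos, neg)
steered = steer_logits({"hence": 0.0, "gonna": 0.0}, table)
```

Because the intervention touches only output logits, it needs no access to internal layers (unlike activation steering) and no retraining, matching the training-free, inference-time claim.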
cs.CL / 12

ZPD Detector: Data Selection via Capability-Difficulty Alignment for Large Language Models

ZPD 检测器:基于能力-难度对齐的数据选择方法用于大型语言模型
Yang, Bo, Chen, Yunkui, Feng, Lanfei, Zhang, Yu, Li, Shijian
Abstract
As the cost of training large language models continues to increase and high-quality training data become increasingly scarce, selecting high-value samples or synthesizing effective training data under limited data budgets has emerged as a critical research problem. Most existing data selection methods rely on static criteria, such as difficulty, uncertainty, or heuristics, and fail to model the evolving relationship between the model and the data. Inspired by the educational theory of the Zone of Proximal Development (ZPD), we propose ZPD Detector, a data selection framework that adopts a bidirectional perspective between models and data by explicitly modeling the alignment between sample difficulty and the model's current capability. ZPD Detector integrates difficulty calibration, model capability estimation based on Item Response Theory (IRT), and a capability-difficulty matching score to dynamically identify the most informative samples at each learning stage, improving data utilization efficiency; moreover, this dynamic matching strategy provides new insights into training strategy design. All code and data will be released after our work is accepted, to support reproducible research.
Chinese Translation
随着训练大型语言模型的成本不断增加,优质训练数据日益稀缺,在有限的数据预算下选择高价值样本或合成有效训练数据已成为一个关键的研究问题。现有的大多数数据选择方法依赖于静态标准,如难度、不确定性或启发式方法,未能建模模型与数据之间不断演变的关系。受到最近发展区(Zone of Proximal Development, ZPD)教育理论的启发,我们提出了 ZPD 检测器(ZPD Detector),这是一个数据选择框架,通过明确建模样本难度与模型当前能力之间的对齐关系,采用模型与数据之间的双向视角。ZPD 检测器整合了难度校准、基于项目反应理论(Item Response Theory, IRT)的模型能力估计,以及能力-难度匹配分数,以动态识别每个学习阶段中最具信息量的样本,从而提高数据利用效率;此外,这一动态匹配策略为训练策略设计提供了新的见解。所有代码和数据将在我们的工作被接受后发布,以支持可重复的研究。
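The capability-difficulty matching idea can be sketched with the 1-parameter (Rasch) IRT model. The matching rule below, preferring samples the model solves with probability near 0.5, is one reading of the ZPD intuition, not the paper's exact formula.

```python
import math

def rasch_correct_prob(ability, difficulty):
    """P(model answers correctly) under the Rasch / 1PL IRT model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def matching_score(ability, difficulty):
    """Peaks when the item sits at the edge of the model's capability."""
    p = rasch_correct_prob(ability, difficulty)
    return 1.0 - abs(p - 0.5) * 2.0   # 1.0 at p=0.5, 0.0 at p in {0, 1}

def select_batch(ability, difficulties, k):
    """Pick the k most informative samples for the current model state."""
    ranked = sorted(difficulties, key=lambda d: -matching_score(ability, d))
    return ranked[:k]

# A model of ability 1.0 should prefer items with difficulty near 1.0,
# skipping both trivial (-2.0) and hopeless (5.0) samples.
batch = select_batch(1.0, difficulties=[-2.0, 0.9, 1.1, 5.0], k=2)
```

Re-estimating `ability` as training proceeds is what makes the selection dynamic: the same pool yields different "most informative" samples at each learning stage.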
cs.CL / 13

When Personalization Misleads: Understanding and Mitigating Hallucinations in Personalized LLMs

个性化误导:理解和缓解个性化大型语言模型中的幻觉
Sun, Zhongxiang, Zhan, Yi, Shen, Chenglei, Yu, Weijie, Zhang, Xiao, He, Ming, Xu, Jun
Abstract
Personalized large language models (LLMs) adapt model behavior to individual users to enhance user satisfaction, yet personalization can inadvertently distort factual reasoning. We show that when personalized LLMs face factual queries, there exists a phenomenon where the model generates answers aligned with a user's prior history rather than the objective truth, resulting in personalization-induced hallucinations that degrade factual reliability and may propagate incorrect beliefs, due to representational entanglement between personalization and factual representations. To address this issue, we propose Factuality-Preserving Personalized Steering (FPPS), a lightweight inference-time approach that mitigates personalization-induced factual distortions while preserving personalized behavior. We further introduce PFQABench, the first benchmark designed to jointly evaluate factual and personalized question answering under personalization. Experiments across multiple LLM backbones and personalization methods show that FPPS substantially improves factual accuracy while maintaining personalized performance.
Chinese Translation
个性化大型语言模型(LLMs)根据个体用户的需求调整模型行为,以提高用户满意度,但个性化可能无意中扭曲事实推理。我们展示了当个性化LLMs面临事实查询时,模型生成的答案往往与用户的历史记录一致,而非客观真相,这导致了因个性化引发的幻觉,降低了事实的可靠性,并可能传播错误信念,这源于个性化与事实表征之间的表征纠缠。为了解决这一问题,我们提出了事实保持个性化引导(Factuality-Preserving Personalized Steering, FPPS),这是一种轻量级的推理时方法,可以在保持个性化行为的同时,缓解因个性化引起的事实扭曲。我们进一步引入了PFQABench,这是第一个旨在共同评估个性化下的事实和个性化问答的基准。针对多个LLM基础模型和个性化方法的实验表明,FPPS显著提高了事实准确性,同时保持了个性化性能。
cs.CL / 14

Redefining Machine Simultaneous Interpretation: From Incremental Translation to Human-Like Strategies

重新定义机器同声传译:从增量翻译到类人策略
Zhang, Qianen, Yang, Zeyu, Nakamura, Satoshi
Abstract
Simultaneous Machine Translation (SiMT) requires high-quality translations under strict real-time constraints, which traditional policies with only READ/WRITE actions cannot fully address. We extend the action space of SiMT with four adaptive actions: Sentence_Cut, Drop, Partial_Summarization and Pronominalization, which enable real-time restructuring, omission, and simplification while preserving semantic fidelity. We adapt these actions in a large language model (LLM) framework and construct training references through action-aware prompting. To evaluate both quality and word-level monotonicity, we further develop a latency-aware TTS pipeline that maps textual outputs to speech with realistic timing. Experiments on the ACL60/60 English-Chinese, English-German and English-Japanese benchmarks show that our framework consistently improves semantic metrics and achieves lower delay compared to reference translations and salami-based baselines. Notably, combining Drop and Sentence_Cut leads to consistent improvements in the balance between fluency and latency. These results demonstrate that enriching the action space of LLM-based SiMT provides a promising direction for bridging the gap between human and machine interpretation.
Chinese Translation
同声机器翻译(SiMT)在严格的实时约束下要求高质量的翻译,而仅依赖于读取/写入操作的传统策略无法完全满足这一需求。我们通过引入四种自适应操作扩展了SiMT的操作空间:句子切分(Sentence_Cut)、丢弃(Drop)、部分摘要(Partial_Summarization)和代词化(Pronominalization),这些操作能够在保持语义忠实的同时实现实时重组、遗漏和简化。我们在大型语言模型(LLM)框架中适配这些操作,并通过基于操作的提示构建训练参考。为了评估翻译质量和词级单调性,我们进一步开发了一种延迟感知的文本到语音(TTS)管道,将文本输出映射到具有真实时序的语音。对ACL60/60英汉、英德和英日基准的实验表明,我们的框架在语义指标上始终有所提升,并且相较于参考翻译和基于切片的基线实现了更低的延迟。值得注意的是,结合丢弃和句子切分操作在流畅性和延迟之间实现了一致的改进。这些结果表明,丰富基于LLM的SiMT的操作空间为弥合人类与机器翻译之间的差距提供了一个有前景的方向。
cs.CL / 15

NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems

NAACL:在检索增强生成系统中针对大型语言模型的噪声感知口头置信度校准
Liu, Jiayu, Wang, Rui, Zong, Qing, Zeng, Qingcheng, Zheng, Tianshi, Shi, Haochen, Guo, Dadi, Xu, Baixuan, Li, Chunyang, Song, Yangqiu
Abstract
Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance due to noisy retrieved contexts. Specifically, contradictory or irrelevant evidence tends to inflate the model's false certainty, leading to severe overconfidence. To address this, we propose NAACL Rules (Noise-AwAre Confidence CaLibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NAACL, a noise-aware calibration framework that synthesizes supervision from about 2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NAACL equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NAACL yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NAACL paves the way for both accurate and epistemically reliable LLMs.
Chinese Translation
准确评估模型置信度对于在关键任务的事实领域中部署大型语言模型(LLMs)至关重要。虽然检索增强生成(RAG)被广泛采用以改善基础知识的获取,但在RAG环境中的置信度校准仍然不够明晰。我们在四个基准测试中进行系统研究,揭示LLMs由于噪声干扰的检索上下文而表现出较差的校准性能。具体而言,矛盾或无关的证据往往会夸大模型的虚假确定性,导致严重的过度自信。为了解决这一问题,我们提出了NAACL规则(Noise-AwAre Confidence CaLibration Rules),为在噪声下解决过度自信提供了原则性基础。我们进一步设计了NAACL,一个噪声感知校准框架,该框架基于这些规则合成约2000个HotpotQA示例的监督信息。通过使用这些数据进行监督微调(SFT),NAACL使模型具备内在的噪声感知能力,而无需依赖更强的教师模型。实证结果表明,NAACL带来了显著的提升,在领域内和领域外的ECE分数分别提高了10.9%和8.0%。通过弥合检索噪声与口头校准之间的差距,NAACL为准确且在认识论上可靠的LLMs铺平了道路。
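The ECE metric on which NAACL reports its gains is standard and worth pinning down. A sketch with equal-width confidence binning; the bin count and toy data are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf=1.0 into last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

# Perfectly calibrated toy predictions: 80% confidence, 4/5 correct.
perfect = expected_calibration_error([0.8] * 5, [True, True, True, True, False])
# Overconfident predictions (high confidence, half wrong) inflate ECE --
# the failure mode the abstract attributes to noisy retrieved contexts.
overconf = expected_calibration_error([0.95, 0.95], [True, False])
```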
cs.CL / 16

Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs

寻找翻译开关:发现并利用大型语言模型中的任务启动特征
Wu, Xinwei, Liu, Heng, Zhao, Xiaohu, Ren, Yuqi, Xu, Linlong, Wang, Longyue, Xiong, Deyi, Luo, Weihua, Zhang, Kaifu
Abstract
Large Language Models (LLMs) frequently exhibit strong translation abilities, even without task-specific fine-tuning. However, the internal mechanisms governing this innate capability remain largely opaque. To demystify this process, we leverage Sparse Autoencoders (SAEs) and introduce a novel framework for identifying task-specific features. Our method first recalls features that are frequently co-activated on translation inputs and then filters them for functional coherence using a PCA-based consistency metric. This framework successfully isolates a small set of **translation initiation** features. Causal interventions demonstrate that amplifying these features steers the model towards correct translation, while ablating them induces hallucinations and off-task outputs, confirming they represent a core component of the model's innate translation competency. Moving from analysis to application, we leverage this mechanistic insight to propose a new data selection strategy for efficient fine-tuning. Specifically, we prioritize training on **mechanistically hard** samples-those that fail to naturally activate the translation initiation features. Experiments show this approach significantly improves data efficiency and suppresses hallucinations. Furthermore, we find these mechanisms are transferable to larger models of the same family. Our work not only decodes a core component of the translation mechanism in LLMs but also provides a blueprint for using internal model mechanism to create more robust and efficient models. The codes are available at https://github.com/flamewei123/AAAI26-translation-Initiation-Features.
Chinese Translation
大型语言模型(LLMs)通常表现出强大的翻译能力,即使在没有特定任务微调的情况下。然而,支配这种内在能力的内部机制在很大程度上仍不透明。为了揭示这一过程,我们利用稀疏自编码器(Sparse Autoencoders, SAEs),并引入一种新颖的框架来识别特定任务的特征。我们的方法首先召回在翻译输入上经常共同激活的特征,然后使用基于主成分分析(PCA)的一致性度量对其进行功能连贯性过滤。该框架成功分离出一小组**翻译启动**特征。因果干预表明,放大这些特征可以引导模型产生正确的翻译,而消融它们则会导致幻觉和偏离任务的输出,证实它们是模型内在翻译能力的核心组成部分。从分析走向应用,我们利用这一机制层面的洞察提出了一种新的数据选择策略,以实现高效微调。具体而言,我们优先在**机制上困难**的样本上训练——即那些未能自然激活翻译启动特征的样本。实验表明,这种方法显著提高了数据效率并抑制了幻觉。此外,我们发现这些机制可以迁移到同一系列的更大模型上。我们的工作不仅解码了LLMs翻译机制的一个核心组成部分,还为利用模型内部机制构建更稳健、更高效的模型提供了蓝图。代码可在 https://github.com/flamewei123/AAAI26-translation-Initiation-Features 获取。
cs.CL / 17

From Interpretability to Performance: Optimizing Retrieval Heads for Long-Context Language Models

从可解释性到性能:优化长上下文语言模型的检索头
Ma, Youmi, Okazaki, Naoaki
Abstract
Advances in mechanistic interpretability have identified special attention heads, known as retrieval heads, that are responsible for retrieving information from the context. However, the role of these retrieval heads in improving model performance remains unexplored. This work investigates whether retrieval heads can be leveraged to enhance the long-context capabilities of LLMs. Specifically, we propose RetMask, a method that generates training signals by contrasting normal model outputs with those from an ablated variant in which the retrieval heads are masked. This mechanism-based approach achieves substantial improvements: +2.28 points on HELMET at 128K for Llama-3.1, with +70% gains on generation with citation and +32% on passage re-ranking, while preserving performance on general tasks. Experiments across three model families reveal that the effectiveness depends on retrieval head organization: models with concentrated patterns of retrieval heads respond strongly, while those with distributed patterns show limited gains. This mechanistic relationship validates the function of retrieval heads and demonstrates that mechanistic insights can be transformed into performance enhancements.
Chinese Translation
机制可解释性的进展揭示了被称为检索头的特殊注意力头,它们负责从上下文中检索信息。然而,这些检索头在提升模型性能方面的作用仍未被探索。本研究探讨了能否利用检索头来增强大语言模型(LLMs)的长上下文能力。具体而言,我们提出了RetMask,该方法通过对比正常模型输出与屏蔽了检索头的消融变体的输出来生成训练信号。这种基于机制的方法带来了显著改进:在Llama-3.1上,128K长度下的HELMET得分提高了2.28分,其中带引用的生成任务增益达70%,段落重排序增益达32%,同时保持了在一般任务上的性能。对三个模型家族的实验表明,效果取决于检索头的组织方式:检索头呈集中模式的模型反应强烈,而呈分散模式的模型增益有限。这种机制层面的关系验证了检索头的功能,并展示了机制洞察可以转化为性能提升。
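The core of RetMask is contrasting a model's normal output with an ablated variant whose retrieval heads are masked. A toy sketch of that contrast, where the "model" is just a set of per-head contribution vectors (the head identities and vectors are illustrative, not the paper's setup):

```python
# Toy sketch of the RetMask contrast: combine per-head contributions,
# once normally and once with a hypothetical "retrieval head" masked.

def combine_heads(head_outputs, masked_heads=frozenset()):
    """Sum per-head contribution vectors, zeroing the masked heads."""
    dim = len(next(iter(head_outputs.values())))
    out = [0.0] * dim
    for head_id, vec in head_outputs.items():
        if head_id in masked_heads:
            continue
        out = [a + b for a, b in zip(out, vec)]
    return out

heads = {
    "h0": [0.2, 0.1],   # generic head
    "h5": [0.7, -0.3],  # hypothetical retrieval head
}
normal = combine_heads(heads)
ablated = combine_heads(heads, masked_heads={"h5"})
# The difference isolates what the masked head contributed; RetMask
# derives its training signal from this kind of contrast.
contrast = [a - b for a, b in zip(normal, ablated)]
print(contrast)
```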
cs.CL / 18

Budget-Aware Anytime Reasoning with LLM-Synthesized Preference Data

预算感知的随时推理与LLM合成偏好数据
Zhang, Xuanming, Ashrafi, Shwan, Mirsaidova, Aziza, Rezaeian, Amir, Ballesteros, Miguel, Chilton, Lydia B., Yu, Zhou, Roth, Dan
Abstract
We study the reasoning behavior of large language models (LLMs) under limited computation budgets. In such settings, producing useful partial solutions quickly is often more practical than exhaustive reasoning, which incurs high inference costs. Many real-world tasks, such as trip planning, require models to deliver the best possible output within a fixed reasoning budget. We introduce an anytime reasoning framework and the Anytime Index, a metric that quantifies how effectively solution quality improves as reasoning tokens increase. To further enhance efficiency, we propose an inference-time self-improvement method using LLM-synthesized preference data, where models learn from their own reasoning comparisons to produce better intermediate solutions. Experiments on NaturalPlan (Trip), AIME, and GPQA datasets show consistent gains across Grok-3, GPT-oss, GPT-4.1/4o, and LLaMA models, improving both reasoning quality and efficiency under budget constraints.
Chinese Translation
我们研究了在有限计算预算下大语言模型(LLMs)的推理行为。在这种情况下,快速产生有用的部分解决方案通常比耗尽性推理更为实用,因为后者会产生高昂的推理成本。许多现实世界任务,如旅行规划,需要模型在固定的推理预算内提供最佳可能的输出。我们引入了一种随时推理框架和随时索引(Anytime Index),该指标量化了解决方案质量随着推理令牌增加而有效改善的程度。为了进一步提高效率,我们提出了一种基于推理时自我改进的方法,利用LLM合成的偏好数据,使模型通过自身的推理比较学习,从而产生更好的中间解决方案。在NaturalPlan(旅行)、AIME和GPQA数据集上的实验显示,Grok-3、GPT-oss、GPT-4.1/4o和LLaMA模型在预算约束下的推理质量和效率均有一致提升。
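The Anytime Index rewards models that reach good partial solutions early. One natural way to operationalize "quality improves as reasoning tokens increase" is the normalized area under the quality-vs-budget curve; the sketch below illustrates that idea, though the paper's exact definition of the Anytime Index may differ:

```python
# Hedged sketch of an "anytime" profile: normalized trapezoidal area
# under solution quality as a function of tokens spent.

def anytime_area(checkpoints):
    """checkpoints: list of (tokens_used, quality in [0, 1]), sorted by tokens.
    Returns area under quality(tokens), normalized by the token span."""
    area = 0.0
    t_prev, q_prev = checkpoints[0]
    for t, q in checkpoints[1:]:
        area += 0.5 * (q + q_prev) * (t - t_prev)
        t_prev, q_prev = t, q
    span = checkpoints[-1][0] - checkpoints[0][0]
    return area / span if span else checkpoints[0][1]

fast = [(0, 0.0), (100, 0.7), (200, 0.8)]   # useful partial answers early
slow = [(0, 0.0), (100, 0.1), (200, 0.8)]   # quality arrives late
print(anytime_area(fast), anytime_area(slow))
```

Both runs end at the same final quality, but the early-improving run scores higher, which is exactly the behavior a budget-aware metric should prefer.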
cs.CL / 19

Spectral Characterization and Mitigation of Sequential Knowledge Editing Collapse

序列知识编辑崩溃的谱特征化与缓解
Zhang, Chi, Zhang, Mengqi, Ye, Xiaotian, Cheng, Runxi, Zhou, Zisheng, Zhou, Ying, Ren, Pengjie, Chen, Zhumin
Abstract
Sequential knowledge editing in large language models often causes catastrophic collapse of the model's general abilities, especially for parameter-modifying methods. Existing approaches mitigate this issue through heuristic constraints on parameter updates, yet the mechanisms underlying such degradation remain insufficiently understood. In this work, we present a spectral analysis of sequential knowledge editing and show that a model's general abilities are closely associated with dominant singular directions of pretrained weight matrices. These directions are highly sensitive to perturbations and are progressively disrupted by repeated edits, closely tracking the collapse in both editing efficacy and general performance. Building on this insight, we propose REVIVE, a plug-and-play framework that stabilizes sequential editing by explicitly preserving the dominant singular subspace. REVIVE represents parameter updates in the spectral basis of the original weights and filters components that would interfere with the protected region. Extensive experiments across multiple models and benchmarks show that REVIVE consistently improves editing efficacy while substantially preserving general abilities under long-horizon sequential editing, including extreme settings with up to 20,000 edits.
Chinese Translation
大型语言模型中的序列知识编辑常常导致模型一般能力的灾难性崩溃,尤其是在参数修改方法中。现有的方法通过对参数更新施加启发式约束来缓解这一问题,但这种退化背后的机制仍然理解不足。在本研究中,我们对序列知识编辑进行了谱分析,并展示了模型的一般能力与预训练权重矩阵的主导奇异方向密切相关。这些方向对扰动高度敏感,并且随着重复编辑的进行而逐渐受到破坏,紧密跟踪编辑效果和一般性能的崩溃。基于这一见解,我们提出了REVIVE,一个即插即用的框架,通过明确保留主导奇异子空间来稳定序列编辑。REVIVE在原始权重的谱基中表示参数更新,并过滤那些会干扰保护区域的成分。针对多个模型和基准的广泛实验表明,REVIVE在长时间序列编辑下持续提高了编辑效果,同时在极端设置下(包括多达20,000次编辑)显著保持了一般能力。
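REVIVE's central operation is filtering the components of a weight update that would disturb the dominant singular subspace of the pretrained matrix. A small numpy sketch of that projection, assuming a simple rank-k protected subspace (the paper's filtering rule in the spectral basis may be more refined):

```python
import numpy as np

# Sketch: remove the part of an edit's update that lies in the top-k
# singular subspace of the pretrained weights W. Rank choice and the
# exact filtering rule are assumptions for illustration.

def protect_dominant_subspace(W, delta, k=1):
    """Project `delta` off the span of W's top-k singular directions."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    Uk, Vk = U[:, :k], Vt[:k, :]
    # Subtract the component of delta living in span(Uk) x span(Vk).
    return delta - Uk @ (Uk.T @ delta @ Vk.T) @ Vk

W = np.diag([10.0, 1.0, 0.1])   # dominant singular direction = e1
delta = np.full((3, 3), 0.5)    # a hypothetical edit update
safe = protect_dominant_subspace(W, delta, k=1)
print(np.round(safe, 3))
```

After filtering, the update no longer touches the protected direction (the (0, 0) entry is zeroed here) while the rest of the edit passes through, which is how repeated edits can proceed without eroding the directions tied to general abilities.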
cs.CL / 20

CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs

CoG:通过关系蓝图和失败感知精炼在知识图谱上进行可控图推理
Liu, Yuanxiang, Li, Songze, Guo, Xiaoke, Gong, Zhaoyan, Zhang, Qifei, Chen, Huajun, Zhang, Wen
Abstract
Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities but often grapple with reliability challenges like hallucinations. While Knowledge Graphs (KGs) offer explicit grounding, existing paradigms of KG-augmented LLMs typically exhibit cognitive rigidity--applying homogeneous search strategies that render them vulnerable to instability under neighborhood noise and structural misalignment leading to reasoning stagnation. To address these challenges, we propose CoG, a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation. First, functioning as the fast, intuitive process, the Relational Blueprint Guidance module leverages relational blueprints as interpretable soft structural constraints to rapidly stabilize the search direction against noise. Second, functioning as the prudent, analytical process, the Failure-Aware Refinement module intervenes upon encountering reasoning impasses. It triggers evidence-conditioned reflection and executes controlled backtracking to overcome reasoning stagnation. Experimental results on three benchmarks demonstrate that CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.
Chinese Translation
大型语言模型(LLMs)展示了显著的推理能力,但常常面临诸如幻觉等可靠性挑战。虽然知识图谱(KGs)提供了明确的基础,但现有的KG增强LLMs范式通常表现出认知僵化——应用同质的搜索策略,使其在邻域噪声和结构不对齐下容易出现不稳定,导致推理停滞。为了解决这些挑战,我们提出了CoG,一个不需要训练的框架,灵感来自双过程理论,模拟直觉与深思之间的相互作用。首先,作为快速、直观的过程,关系蓝图指导模块利用关系蓝图作为可解释的软结构约束,以快速稳定搜索方向,抵御噪声。其次,作为谨慎、分析的过程,失败感知精炼模块在遇到推理障碍时进行干预。它触发基于证据的反思,并执行受控回溯,以克服推理停滞。在三个基准测试上的实验结果表明,CoG在准确性和效率上显著优于最先进的方法。
cs.CL / 21

Efficient Multilingual Name Type Classification Using Convolutional Networks

基于卷积网络的高效多语言名称类型分类
Lauc, Davor
Abstract
We present a convolutional neural network approach for classifying proper names by language and entity type. Our model, Onomas-CNN X, combines parallel convolution branches with depthwise-separable operations and hierarchical classification to process names efficiently on CPU hardware. We evaluate the architecture on a large multilingual dataset covering 104 languages and four entity types (person, organization, location, other). Onomas-CNN X achieves 92.1% accuracy while processing 2,813 names per second on a single CPU core - 46 times faster than fine-tuned XLM-RoBERTa with comparable accuracy. The model reduces energy consumption by a factor of 46 compared to transformer baselines. Our experiments demonstrate that specialized CNN architectures remain competitive with large pre-trained models for focused NLP tasks when sufficient training data exists.
Chinese Translation
我们提出了一种按语言和实体类型对专有名称进行分类的卷积神经网络方法。我们的模型 Onomas-CNN X 结合并行卷积分支、深度可分离卷积操作和层次化分类,以便在 CPU 硬件上高效处理名称。我们在一个覆盖 104 种语言和四种实体类型(人名、组织、地点、其他)的大型多语言数据集上评估了该架构。Onomas-CNN X 实现了 92.1% 的准确率,同时在单个 CPU 核心上每秒处理 2,813 个名称,比准确率相当的微调版 XLM-RoBERTa 快 46 倍。与 Transformer 基线相比,该模型的能耗降至原来的 1/46。我们的实验表明,当训练数据充足时,专门的 CNN 架构在聚焦的 NLP 任务上仍可与大型预训练模型相竞争。
cs.CL / 22

Integrity Shield: A System for Ethical AI Use & Authorship Transparency in Assessments

完整性保护:用于评估中的伦理人工智能使用与作者透明度的系统
Shekhar, Ashish Raj, Agarwal, Shiven, Bordoloi, Priyanuj, Shah, Yash, Anvekar, Tejas, Gupta, Vivek
Abstract
Large Language Models (LLMs) can now solve entire exams directly from uploaded PDF assessments, raising urgent concerns about academic integrity and the reliability of grades and credentials. Existing watermarking techniques either operate at the token level or assume control over the model's decoding process, making them ineffective when students query proprietary black-box systems with instructor-provided documents. We present Integrity Shield, a document-layer watermarking system that embeds schema-aware, item-level watermarks into assessment PDFs while keeping their human-visible appearance unchanged. These watermarks consistently prevent MLLMs from answering shielded exam PDFs and encode stable, item-level signatures that can be reliably recovered from model or student responses. Across 30 exams spanning STEM, humanities, and medical reasoning, Integrity Shield achieves exceptionally high prevention (91-94% exam-level blocking) and strong detection reliability (89-93% signature retrieval) across four commercial MLLMs. Our demo showcases an interactive interface where instructors upload an exam, preview watermark behavior, and inspect pre/post AI performance & authorship evidence.
Chinese Translation
大型语言模型(LLMs)现在可以直接从上传的PDF试卷中解答整场考试,这引发了关于学术诚信以及成绩和证书可靠性的紧迫担忧。现有的水印技术要么在词元(token)级别操作,要么假设能够控制模型的解码过程,因此当学生将教师提供的文档提交给专有黑箱系统时,这些技术便会失效。我们提出了完整性保护(Integrity Shield),这是一种文档层水印系统,在评估PDF中嵌入模式感知的条目级水印,同时保持其人眼可见的外观不变。这些水印能够稳定地阻止多模态大型语言模型(MLLMs)解答受保护的考试PDF,并编码稳定的条目级签名,这些签名可以从模型或学生的作答中可靠地恢复。在涵盖STEM、人文学科和医学推理的30场考试中,完整性保护在四个商业MLLM上实现了极高的阻止率(91-94%的考试级阻止)和强大的检测可靠性(89-93%的签名恢复)。我们的演示展示了一个交互界面,教师可以上传考试、预览水印行为,并检查使用AI前后的表现及作者身份证据。
cs.CL / 23

The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora

迭代网络语料库爬取的增长收益与痛点:来自南斯拉夫CLASSLA-web 2.0语料库的见解
Pungeršek, Taja Kuzman, Rupnik, Peter, Suchomel, Vít, Ljubešić, Nikola
Abstract
Crawling national top-level domains has proven to be highly effective for collecting texts in less-resourced languages. This approach has been recently used for South Slavic languages and resulted in the largest general corpora for this language group: the CLASSLA-web 1.0 corpora. Building on this success, we established a continuous crawling infrastructure for iterative national top-level domain crawling across South Slavic and related webs. We present the first outcome of this crawling infrastructure - the CLASSLA-web 2.0 corpus collection, with substantially larger web corpora containing 17.0 billion words in 38.1 million texts in seven languages: Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian. In addition to genre categories, the new version is also automatically annotated with topic labels. Comparing CLASSLA-web 2.0 with its predecessor reveals that only one-fifth of the texts overlap, showing that re-crawling after just two years yields largely new content. However, while the new web crawls bring growing gains, we also notice growing pains - a manual inspection of top domains reveals a visible degradation of web content, as machine-generated sites now contribute a significant portion of texts.
Chinese Translation
爬取国家顶级域名已被证明在收集低资源语言文本方面非常有效。最近,这一方法被应用于南斯拉夫语言,并产生了该语言组迄今最大的通用语料库:CLASSLA-web 1.0语料库。在这一成功的基础上,我们建立了一套持续爬取基础设施,用于对南斯拉夫语及相关网络的国家顶级域名进行迭代爬取。我们展示了这一基础设施的第一个成果——CLASSLA-web 2.0语料库集合,其规模显著更大,包含七种语言的3810万篇文本、共计170亿词:波斯尼亚语、保加利亚语、克罗地亚语、马其顿语、黑山语、塞尔维亚语和斯洛文尼亚语。除体裁类别外,新版本还自动标注了主题标签。将CLASSLA-web 2.0与其前身比较后发现,只有五分之一的文本重叠,表明仅隔两年重新爬取即可获得大量新内容。然而,在新一轮网络爬取带来持续收益的同时,我们也注意到了相伴的痛点——对头部域名的人工检查显示网络内容出现了明显退化,机器生成的网站如今贡献了相当比例的文本。
cs.CL / 24

DOREMI: Optimizing Long Tail Predictions in Document-Level Relation Extraction

DOREMI:优化文档级关系抽取中的长尾预测
Menotti, Laura, Marchesin, Stefano, Silvello, Gianmaria
Abstract
Document-Level Relation Extraction (DocRE) presents significant challenges due to its reliance on cross-sentence context and the long-tail distribution of relation types, where many relations have scarce training examples. In this work, we introduce DOcument-level Relation Extraction optiMizing the long taIl (DOREMI), an iterative framework that enhances underrepresented relations through minimal yet targeted manual annotations. Unlike previous approaches that rely on large-scale noisy data or heuristic denoising, DOREMI actively selects the most informative examples to improve training efficiency and robustness. DOREMI can be applied to any existing DocRE model and is effective at mitigating long-tail biases, offering a scalable solution to improve generalization on rare relations.
Chinese Translation
文档级关系抽取(DocRE)面临重大挑战,因为它依赖跨句上下文,且关系类型呈长尾分布,许多关系的训练样本稀缺。在本研究中,我们提出了面向长尾优化的文档级关系抽取框架DOREMI(DOcument-level Relation Extraction optiMizing the long taIl),这是一个迭代框架,通过少量但有针对性的人工标注来增强代表性不足的关系。与依赖大规模噪声数据或启发式去噪的先前方法不同,DOREMI主动选择最具信息量的示例,以提高训练效率和鲁棒性。DOREMI可以应用于任何现有的DocRE模型,并能有效缓解长尾偏差,为改善稀有关系上的泛化能力提供了一种可扩展的解决方案。
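The abstract describes actively selecting the most informative examples for annotation under a long-tail distribution. A hypothetical sketch of such a selection step, prioritizing rare relation types where the current model is uncertain (the scoring formula is an assumption, not DOREMI's actual criterion):

```python
# Hedged sketch of long-tail-aware annotation selection: score each
# candidate by label rarity times model uncertainty, annotate the top-k.

from collections import Counter

def select_for_annotation(candidates, label_counts, budget=2):
    """candidates: list of (example_id, relation_type, model_confidence)."""
    def priority(item):
        _, relation, conf = item
        rarity = 1.0 / (1 + label_counts.get(relation, 0))
        uncertainty = 1.0 - conf
        return rarity * uncertainty
    return [c[0] for c in sorted(candidates, key=priority, reverse=True)[:budget]]

counts = Counter({"born_in": 500, "successor_of": 3})  # long-tail counts
pool = [
    ("e1", "born_in", 0.95),
    ("e2", "successor_of", 0.40),
    ("e3", "successor_of", 0.90),
    ("e4", "born_in", 0.30),
]
print(select_for_annotation(pool, counts))
```

Under this scoring, the rare, uncertain `successor_of` examples win the annotation budget over the abundant `born_in` ones, which is the qualitative behavior the paper aims for.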
cs.CL / 25

T$^\star$: Progressive Block Scaling for MDM Through Trajectory Aware RL

T$^\star$: 通过轨迹感知强化学习实现MDM的渐进块规模扩展
Xia, Hanchen, Chen, Baoyou, Ge, Yutang, Zhao, Guojiang, Zhu, Siyu
Abstract
We present T$^\star$, a simple \textsc{TraceRL}-based training curriculum for progressive block-size scaling in masked diffusion language models (MDMs). Starting from an AR-initialized small-block MDM, T$^\star$~transitions smoothly to larger blocks, enabling higher-parallelism decoding with minimal performance degradation on math reasoning benchmarks. Moreover, further analysis suggests that T$^\star$~can converge to an alternative decoding schedule $\hat{\rm S}$ that achieves comparable performance.
Chinese Translation
我们提出了T$^\star$,这是一种基于\textsc{TraceRL}的简单训练课程,用于在掩蔽扩散语言模型(MDMs)中实现渐进的块大小扩展。从一个以自回归(AR)方式初始化的小块MDM开始,T$^\star$能够平滑过渡到更大的块,从而在数学推理基准上实现更高并行度的解码,同时性能下降最小。此外,进一步的分析表明,T$^\star$可以收敛到一种替代的解码调度$\hat{\rm S}$,其性能与之相当。
cs.CL / 26

MultiCaption: Detecting disinformation using multilingual visual claims

MultiCaption:使用多语言视觉声明检测虚假信息
Frade, Rafael Martins, Panchendrarajan, Rrubaa, Zubiaga, Arkaitz
Abstract
Online disinformation poses an escalating threat to society, driven increasingly by the rapid spread of misleading content across both multimedia and multilingual platforms. While automated fact-checking methods have advanced in recent years, their effectiveness remains constrained by the scarcity of datasets that reflect these real-world complexities. To address this gap, we first present MultiCaption, a new dataset specifically designed for detecting contradictions in visual claims. Pairs of claims referring to the same image or video were labeled through multiple strategies to determine whether they contradict each other. The resulting dataset comprises 11,088 visual claims in 64 languages, offering a unique resource for building and evaluating misinformation-detection systems in truly multimodal and multilingual environments. We then provide comprehensive experiments using transformer-based architectures, natural language inference models, and large language models, establishing strong baselines for future research. The results show that MultiCaption is more challenging than standard NLI tasks, requiring task-specific finetuning for strong performance. Moreover, the gains from multilingual training and testing highlight the dataset's potential for building effective multilingual fact-checking pipelines without relying on machine translation.
Chinese Translation
在线虚假信息对社会构成日益严重的威胁,其背后是误导性内容在多媒体和多语言平台上的快速传播。尽管自动化事实核查方法近年来有所进展,但其有效性仍受限于能够反映这些现实世界复杂性的数据集的稀缺。为弥补这一空白,我们首先提出了MultiCaption,这是一个专门设计用于检测视觉声明中矛盾的新数据集。我们通过多种策略对指向同一图像或视频的声明对进行标注,以确定它们是否相互矛盾。最终的数据集包含64种语言的11,088条视觉声明,为在真正多模态、多语言的环境中构建和评估虚假信息检测系统提供了独特的资源。随后,我们使用基于Transformer的架构、自然语言推理模型和大型语言模型进行了全面的实验,为未来的研究建立了强有力的基线。结果表明,MultiCaption比标准的自然语言推理任务更具挑战性,需要针对特定任务的微调才能取得强劲性能。此外,多语言训练和测试带来的收益突显了该数据集在构建有效的多语言事实核查流程方面的潜力,而无需依赖机器翻译。
cs.CL / 27

Language of Thought Shapes Output Diversity in Large Language Models

思维语言塑造大型语言模型的输出多样性
Xu, Shaoyang, Zhang, Wenxuan
Abstract
Output diversity is crucial for Large Language Models as it underpins pluralism and creativity. In this work, we reveal that controlling the language used during model thinking-the language of thought-provides a novel and structural source of output diversity. Our preliminary study shows that different thinking languages occupy distinct regions in a model's thinking space. Based on this observation, we study two repeated sampling strategies under multilingual thinking-Single-Language Sampling and Mixed-Language Sampling-and conduct diversity evaluation on outputs that are controlled to be in English, regardless of the thinking language used. Across extensive experiments, we demonstrate that switching the thinking language from English to non-English languages consistently increases output diversity, with a clear and consistent positive correlation such that languages farther from English in the thinking space yield larger gains. We further show that aggregating samples across multiple thinking languages yields additional improvements through compositional effects, and that scaling sampling with linguistic heterogeneity expands the model's diversity ceiling. Finally, we show that these findings translate into practical benefits in pluralistic alignment scenarios, leading to broader coverage of cultural knowledge and value orientations in LLM outputs. Our code is publicly available at https://github.com/iNLP-Lab/Multilingual-LoT-Diversity.
Chinese Translation
输出多样性对大型语言模型至关重要,因为它支撑着多元化和创造力。在本研究中,我们揭示了控制模型思考时所用的语言——思维语言——提供了一种新颖且结构性的输出多样性来源。我们的初步研究表明,不同的思维语言占据模型思维空间中的不同区域。基于这一观察,我们研究了多语言思维下的两种重复采样策略——单语言采样(Single-Language Sampling)和混合语言采样(Mixed-Language Sampling),并对统一控制为英语的输出进行多样性评估,无论思考时使用何种语言。大量实验表明,将思维语言从英语切换到非英语语言会一致地提升输出多样性,且存在明显而一致的正相关关系:在思维空间中距英语越远的语言,带来的增益越大。我们进一步展示了跨多种思维语言聚合样本可通过组合效应带来额外改进,并且借助语言异质性扩展采样规模能够提升模型的多样性上限。最后,我们表明这些发现可转化为多元对齐场景中的实际收益,使大型语言模型的输出覆盖更广泛的文化知识和价值取向。我们的代码已公开:https://github.com/iNLP-Lab/Multilingual-LoT-Diversity。
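Comparing sampling strategies like those above requires a diversity measure over a set of sampled outputs. A minimal probe is distinct-n, the fraction of unique n-grams across samples; the paper's own diversity metrics may be more elaborate, but the sketch conveys what "higher output diversity" means operationally:

```python
# distinct-n: unique n-grams / total n-grams across a set of outputs.
# Higher means the samples are more lexically diverse.

def distinct_n(outputs, n=2):
    ngrams = []
    for text in outputs:
        toks = text.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

same = ["the cat sat", "the cat sat"]        # repeated-sampling collapse
varied = ["the cat sat", "a dog ran home"]   # diverse samples
print(distinct_n(same), distinct_n(varied))
```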
cs.CL / 28

FactCorrector: A Graph-Inspired Approach to Long-Form Factuality Correction of Large Language Models

FactCorrector:一种基于图的长文本事实修正方法用于大型语言模型
Carnerero-Cano, Javier, Pronesti, Massimiliano, Marinescu, Radu, Tchrakian, Tigran, Barry, James, Gajcin, Jasmina, Hou, Yufang, Pascale, Alessandra, Daly, Elizabeth
Abstract
Large language models (LLMs) are widely used in knowledge-intensive applications but often generate factually incorrect responses. A promising approach to rectify these flaws is correcting LLMs using feedback. Therefore, in this paper, we introduce FactCorrector, a new post-hoc correction method that adapts across domains without retraining and leverages structured feedback about the factuality of the original response to generate a correction. To support rigorous evaluations of factuality correction methods, we also develop the VELI5 benchmark, a novel dataset containing systematically injected factual errors and ground-truth corrections. Experiments on VELI5 and several popular long-form factuality datasets show that the FactCorrector approach significantly improves factual precision while preserving relevance, outperforming strong baselines. We release our code at https://ibm.biz/factcorrector.
Chinese Translation
大型语言模型(LLMs)广泛应用于知识密集型的应用中,但常常生成事实不准确的回答。纠正这些缺陷的一个有前景的方法是利用反馈来修正LLMs。因此,在本文中,我们介绍了FactCorrector,一种新的后期修正方法,它能够在不重新训练的情况下跨领域适应,并利用关于原始回答事实性的结构化反馈来生成修正。为了支持对事实修正方法的严格评估,我们还开发了VELI5基准,这是一个包含系统性注入的事实错误和真实修正的新数据集。在VELI5和几个流行的长文本事实性数据集上的实验表明,FactCorrector方法显著提高了事实准确性,同时保持了相关性,超越了强基线。我们在https://ibm.biz/factcorrector发布了我们的代码。
cs.CL / 29

How DDAIR you? Disambiguated Data Augmentation for Intent Recognition

你敢 DDAIR 吗?用于意图识别的消歧义数据增强
Castillo-López, Galo, Lombard, Alexis, Semmar, Nasredine, de Chalendar, Gaël
Abstract
Large Language Models (LLMs) are effective for data augmentation in classification tasks like intent detection. In some cases, they inadvertently produce examples that are ambiguous with regard to untargeted classes. We present DDAIR (Disambiguated Data Augmentation for Intent Recognition) to mitigate this problem. We use Sentence Transformers to detect ambiguous class-guided augmented examples generated by LLMs for intent recognition in low-resource scenarios. We identify synthetic examples that are semantically more similar to another intent than to their target one. We also provide an iterative re-generation method to mitigate such ambiguities. Our findings show that sentence embeddings effectively help to (re)generate less ambiguous examples, and suggest promising potential to improve classification performance in scenarios where intents are loosely or broadly defined.
Chinese Translation
大型语言模型(LLMs)在意图检测等分类任务中的数据增强方面表现出色。然而,在某些情况下,它们无意中生成了与非目标类别模糊的示例。为了解决这个问题,我们提出了 DDAIR(用于意图识别的消歧义数据增强)。我们使用句子变换器(Sentence Transformers)来检测由 LLMs 生成的、在低资源场景下用于意图识别的模糊类别引导的增强示例。我们识别出在语义上与另一个意图比与目标意图更相似的合成示例。我们还提供了一种迭代再生成方法,以减轻这种模糊性。我们的研究结果表明,句子嵌入有效地帮助(再)生成更不模糊的示例,并显示出在意图定义模糊或广泛的场景中改善分类性能的良好潜力。
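DDAIR's filtering step flags a synthetic example as ambiguous when its embedding sits closer to a non-target intent than to its own. The sketch below uses toy 2-d vectors in place of Sentence Transformer embeddings, and per-intent centroids as an assumed comparison scheme:

```python
import math

# Sketch of embedding-based ambiguity detection: an augmented example is
# ambiguous if it is more similar to another intent's centroid than to
# its target intent's. Vectors here are toy stand-ins for sentence
# embeddings.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def is_ambiguous(example_vec, target_intent, centroids):
    target_sim = cosine(example_vec, centroids[target_intent])
    return any(
        cosine(example_vec, vec) > target_sim
        for intent, vec in centroids.items()
        if intent != target_intent
    )

centroids = {"book_flight": [1.0, 0.0], "cancel_flight": [0.0, 1.0]}
print(is_ambiguous([0.9, 0.1], "book_flight", centroids))  # on-target
print(is_ambiguous([0.2, 0.9], "book_flight", centroids))  # drifted
```

Flagged examples would then be fed back to the LLM for the iterative re-generation step the abstract describes.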
cs.CL / 30

Reasoning in Trees: Improving Retrieval-Augmented Generation for Multi-Hop Question Answering

树中的推理:提升检索增强生成在多跳问答中的表现
Shi, Yuling, Sun, Maolin, Liu, Zijun, Yang, Mo, Fang, Yixiong, Sun, Tianran, Gu, Xiaodong
Abstract
Retrieval-Augmented Generation (RAG) has demonstrated significant effectiveness in enhancing large language models (LLMs) for complex multi-hop question answering (QA). For multi-hop QA tasks, current iterative approaches predominantly rely on LLMs to self-guide and plan multi-step exploration paths during retrieval, leading to substantial challenges in maintaining reasoning coherence across steps from inaccurate query decomposition and error propagation. To address these issues, we introduce Reasoning Tree Guided RAG (RT-RAG), a novel hierarchical framework for complex multi-hop QA. RT-RAG systematically decomposes multi-hop questions into explicit reasoning trees, minimizing inaccurate decomposition through structured entity analysis and consensus-based tree selection that clearly separates core queries, known entities, and unknown entities. Subsequently, a bottom-up traversal strategy employs iterative query rewriting and refinement to collect high-quality evidence, thereby mitigating error propagation. Comprehensive experiments show that RT-RAG substantially outperforms state-of-the-art methods by 7.0% F1 and 6.0% EM, demonstrating the effectiveness of RT-RAG in complex multi-hop QA.
Chinese Translation
检索增强生成(RAG)在提升大型语言模型(LLMs)应对复杂多跳问答(QA)方面展现了显著的有效性。在多跳问答任务中,当前的迭代方法主要依赖LLMs自我引导和规划多步骤的检索探索路径,这导致在查询分解不准确和错误传播的情况下,维持推理一致性面临重大挑战。为了解决这些问题,我们提出了推理树引导的检索增强生成(RT-RAG),这是一个针对复杂多跳问答的新型分层框架。RT-RAG系统性地将多跳问题分解为明确的推理树,通过结构化实体分析和基于共识的树选择,最小化不准确的分解,清晰区分核心查询、已知实体和未知实体。随后,采用自下而上的遍历策略,通过迭代查询重写和细化来收集高质量证据,从而减轻错误传播。综合实验表明,RT-RAG在F1指标上比最先进的方法提高了7.0%,在EM指标上提高了6.0%,证明了RT-RAG在复杂多跳问答中的有效性。
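The bottom-up traversal described above resolves unknown entities at the leaves first, then rewrites parent queries with the resolved values. A toy sketch of that control flow, where the node schema and the `kb` lookup table are illustrative stand-ins for retrieval:

```python
# Toy reasoning-tree resolution: leaves answer sub-questions, parents
# substitute those answers into their query before their own lookup.
# The dict-based node format and `kb` are assumptions for illustration.

def resolve(node, lookup):
    """node = {"query": str, "slot": str?, "children": [...]}.
    Children bind {slot} variables in the parent's query."""
    bindings = {}
    for child in node.get("children", []):
        slot, answer = resolve(child, lookup)
        bindings[slot] = answer
    query = node["query"].format(**bindings)
    return node.get("slot", "root"), lookup[query]

kb = {
    "Who directed Inception?": "Christopher Nolan",
    "When was Christopher Nolan born?": "1970",
}
tree = {
    "query": "When was {director} born?",
    "children": [
        {"slot": "director", "query": "Who directed Inception?"}
    ],
}
print(resolve(tree, kb))
```

Making the unknown entity an explicit slot is what lets the parent query be rewritten precisely, rather than asking the retriever a vague multi-hop question in one shot.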
cs.CL / 31

One LLM to Train Them All: Multi-Task Learning Framework for Fact-Checking

一个模型训练所有任务:用于事实核查的多任务学习框架
Larsson, Malin Astrid, Grunnaleite, Harald Fosen, Setty, Vinay
Abstract
Large language models (LLMs) are reshaping automated fact-checking (AFC) by enabling unified, end-to-end verification pipelines rather than isolated components. While large proprietary models achieve strong performance, their closed weights, complexity, and high costs limit sustainability. Fine-tuning smaller open weight models for individual AFC tasks can help but requires multiple specialized models resulting in high costs. We propose \textbf{multi-task learning (MTL)} as a more efficient alternative that fine-tunes a single model to perform claim detection, evidence ranking, and stance detection jointly. Using small decoder-only LLMs (e.g., Qwen3-4b), we explore three MTL strategies: classification heads, causal language modeling heads, and instruction-tuning, and evaluate them across model sizes, task orders, and standard non-LLM baselines. While multitask models do not universally surpass single-task baselines, they yield substantial improvements, achieving up to \textbf{44\%}, \textbf{54\%}, and \textbf{31\%} relative gains for claim detection, evidence re-ranking, and stance detection, respectively, over zero-/few-shot settings. Finally, we also provide practical, empirically grounded guidelines to help practitioners apply MTL with LLMs for automated fact-checking.
Chinese Translation
大型语言模型(LLMs)正在重塑自动化事实核查(AFC),使其从孤立的组件走向统一的端到端验证流程。尽管大型专有模型性能强劲,但其封闭的权重、复杂性和高成本限制了可持续性。针对单个AFC任务微调较小的开放权重模型虽有帮助,但需要维护多个专用模型,成本高昂。我们提出以\textbf{多任务学习(MTL)}作为更高效的替代方案:微调单一模型,使其同时执行声明检测、证据排序和立场检测。我们使用小型仅解码器(decoder-only)LLM(例如Qwen3-4b),探索了三种MTL策略:分类头、因果语言建模头和指令微调,并在不同模型规模、任务顺序以及标准非LLM基线上进行评估。尽管多任务模型并非在所有情况下都超越单任务基线,但它们带来了显著提升:相对于零样本/少样本设置,在声明检测、证据重排序和立场检测上分别实现了高达\textbf{44\%}、\textbf{54\%}和\textbf{31\%}的相对增益。最后,我们还提供了实用的、基于实证的指导方针,帮助从业者将MTL与LLM结合应用于自动化事实核查。
cs.CL / 32

Membership Inference on LLMs in the Wild

野外大型语言模型的成员推断
Yi, Jiatong, Li, Yanyang
Abstract
Membership Inference Attacks (MIAs) act as a crucial auditing tool for the opaque training data of Large Language Models (LLMs). However, existing techniques predominantly rely on inaccessible model internals (e.g., logits) or suffer from poor generalization across domains in strict black-box settings where only generated text is available. In this work, we propose SimMIA, a robust MIA framework tailored for this text-only regime by leveraging an advanced sampling strategy and scoring mechanism. Furthermore, we present WikiMIA-25, a new benchmark curated to evaluate MIA performance on modern proprietary LLMs. Experiments demonstrate that SimMIA achieves state-of-the-art results in the black-box setting, rivaling baselines that exploit internal model information.
Chinese Translation
成员推断攻击(Membership Inference Attacks, MIAs)作为一种关键的审计工具,用于评估大型语言模型(Large Language Models, LLMs)不透明的训练数据。然而,现有技术主要依赖于不可访问的模型内部信息(例如,logits),或者在严格的黑箱设置中表现出较差的跨领域泛化能力,此时仅能获得生成的文本。在本研究中,我们提出了SimMIA,一个针对这种仅文本的环境量身定制的强健MIA框架,通过利用先进的采样策略和评分机制。此外,我们还提出了WikiMIA-25,一个新的基准,用于评估现代专有LLM上的MIA性能。实验结果表明,SimMIA在黑箱设置中实现了最先进的结果,与利用内部模型信息的基线相媲美。
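In the text-only regime the abstract targets, a membership signal must come from generated text alone. One generic recipe is to prompt with a prefix of the candidate document, sample continuations, and score similarity to the true suffix; the sketch below follows that recipe with a stand-in `sample_model`, and SimMIA's actual sampling strategy and scoring mechanism differ in detail:

```python
# Generic black-box membership-inference sketch: memorized text tends to
# be regenerated near-verbatim from its own prefix. `sample_model` is a
# hypothetical stand-in for querying a deployed LLM.

def overlap_score(candidate, continuations, n=3):
    def ngrams(s):
        toks = s.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    ref = ngrams(candidate)
    if not ref:
        return 0.0
    return max(len(ref & ngrams(c)) / len(ref) for c in continuations)

def membership_score(text, sample_model, prefix_frac=0.5, k=4):
    toks = text.split()
    cut = int(len(toks) * prefix_frac)
    prefix, suffix = " ".join(toks[:cut]), " ".join(toks[cut:])
    samples = [sample_model(prefix) for _ in range(k)]
    return overlap_score(suffix, samples)

# A model that memorized the text reproduces the suffix from the prefix.
memorizing = lambda p: "quick brown fox jumps over the lazy dog"
score = membership_score("the very quick brown fox jumps over the lazy dog",
                         memorizing)
print(score)
```

High suffix overlap suggests membership; for held-out text the sampled continuations share few n-grams with the true suffix and the score stays low.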
cs.CL / 33

F-Actor: Controllable Conversational Behaviour in Full-Duplex Models

F-Actor:全双工模型中的可控对话行为
Züfle, Maike, Klejch, Ondrej, Sanders, Nicholas, Niehues, Jan, Birch, Alexandra, Lam, Tsz Kin
Abstract
Spoken conversational systems require more than accurate speech generation to have human-like conversations: to feel natural and engaging, they must produce conversational behaviour that adapts dynamically to the context. Current spoken conversational systems, however, rarely allow such customization, limiting their naturalness and usability. In this work, we present the first open, instruction-following full-duplex conversational speech model that can be trained efficiently under typical academic resource constraints. By keeping the audio encoder frozen and finetuning only the language model, our model requires just 2,000 hours of data, without relying on large-scale pretraining or multi-stage optimization. The model can follow explicit instructions to control speaker voice, conversation topic, conversational behaviour (e.g., backchanneling and interruptions), and dialogue initiation. We propose a single-stage training protocol and systematically analyze design choices. Both the model and training code will be released to enable reproducible research on controllable full-duplex speech systems.
Chinese Translation
要实现类人对话,口语对话系统需要的不仅仅是准确的语音生成:为了让对话显得自然且引人入胜,系统必须能够根据上下文动态调整对话行为。然而,目前的口语对话系统很少允许这种定制,限制了它们的自然性和可用性。在本研究中,我们提出了第一个开放的、遵循指令的全双工对话语音模型,该模型可以在典型的学术资源限制下高效训练。通过冻结音频编码器、仅微调语言模型,我们的模型只需2,000小时的数据,而不依赖大规模预训练或多阶段优化。该模型可以遵循明确的指令来控制说话人音色、对话主题、对话行为(例如反馈语和打断)以及对话的发起。我们提出了一种单阶段训练协议,并系统地分析了设计选择。模型和训练代码都将发布,以支持可控全双工语音系统的可复现研究。
cs.CL / 34

Idea First, Code Later: Disentangling Problem Solving from Code Generation in Evaluating LLMs for Competitive Programming

先有思路,后有代码:在评估大型语言模型在竞赛编程中的表现时,将问题解决与代码生成分开
Hadhoud, Sama, Elsetohy, Alaa, Hudi, Frederikus, Cruz, Jan Christian Blaise, Halim, Steven, Aji, Alham Fikri
Abstract
Large Language Models (LLMs) increasingly succeed on competitive programming problems, yet existing evaluations conflate algorithmic reasoning with code-level implementation. We argue that competitive programming is fundamentally a problem-solving task and propose centering natural-language editorials in both solution generation and evaluation. Generating an editorial prior to code improves solve rates for some LLMs, with substantially larger gains when using expertly written gold editorials. However, even with gold editorials, models continue to struggle with implementation, while the gap between generated and gold editorials reveals a persistent problem-solving bottleneck in specifying correct and complete algorithms. Beyond pass/fail metrics, we diagnose reasoning errors by comparing model-generated editorials to gold standards using expert annotations and validate an LLM-as-a-judge protocol for scalable evaluation. We introduce a dataset of 83 ICPC-style problems with gold editorials and full test suites, and evaluate 19 LLMs, arguing that future benchmarks should explicitly separate problem solving from implementation.
Chinese Translation
大型语言模型(LLMs)在竞赛编程问题上日益成功,但现有评估将算法推理与代码层面的实现混为一谈。我们认为竞赛编程本质上是一项问题求解任务,并建议在解答生成和评估中均以自然语言题解(editorial)为中心。对部分LLM而言,先生成题解再写代码可以提高解题率,而使用专家撰写的标准题解(gold editorials)时提升幅度更为显著。然而,即使提供标准题解,模型在实现层面仍然举步维艰;同时,模型生成的题解与标准题解之间的差距揭示了一个持续存在的问题求解瓶颈:难以给出正确且完整的算法描述。在通过/不通过指标之外,我们借助专家标注将模型生成的题解与标准题解进行比较以诊断推理错误,并验证了一种LLM作为评审(LLM-as-a-judge)的协议以实现可扩展评估。我们引入了一个包含83道ICPC风格题目的数据集,配有标准题解和完整测试套件,并评估了19个LLM,主张未来的基准测试应明确区分问题求解与实现。
cs.CL / 35

Neural Chain-of-Thought Search: Searching the Optimal Reasoning Path to Enhance Large Language Models

神经链式思维搜索:寻找最佳推理路径以增强大型语言模型
Ling, Guoming, Huang, Zhongzhan, Lin, Yupei, Li, Junxin, Zhong, Shanshan, Wu, Hefeng, Lin, Liang
Abstract
Chain-of-Thought reasoning has significantly enhanced the problem-solving capabilities of Large Language Models. Unfortunately, current models generate reasoning steps sequentially without foresight, often becoming trapped in suboptimal reasoning paths with redundant steps. In contrast, we introduce Neural Chain-of-Thought Search (NCoTS), a framework that reformulates reasoning as a dynamic search for the optimal thinking strategy. By quantitatively characterizing the solution space, we reveal the existence of sparse superior reasoning paths that are simultaneously more accurate and concise than standard outputs. Our method actively navigates towards these paths by evaluating candidate reasoning operators using a dual-factor heuristic that optimizes for both correctness and computational cost. Consequently, NCoTS achieves a Pareto improvement across diverse reasoning benchmarks, boosting accuracy by over 3.5% while reducing generation length by over 22%. Our code and data are available at https://github.com/MilkThink-Lab/Neural-CoT-Search.
Chinese Translation
链式思维推理显著提升了大型语言模型的问题解决能力。然而,目前的模型在生成推理步骤时缺乏前瞻性,通常会陷入次优推理路径,导致冗余步骤的出现。与此不同,我们提出了神经链式思维搜索(Neural Chain-of-Thought Search, NCoTS)框架,将推理重新表述为对最佳思维策略的动态搜索。通过定量表征解空间,我们揭示了稀疏的优越推理路径的存在,这些路径在准确性和简洁性上均优于标准输出。我们的方法通过使用一种双因素启发式评估候选推理操作符,积极导航至这些路径,优化正确性和计算成本。因此,NCoTS在多种推理基准测试中实现了帕累托改进,准确性提高超过3.5%,同时生成长度减少超过22%。我们的代码和数据可在 https://github.com/MilkThink-Lab/Neural-CoT-Search 获取。
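NCoTS scores candidate reasoning operators with a dual-factor heuristic balancing correctness and computational cost. A sketch of that trade-off as a simple linear score; the linear form, the weight, and the operator names are assumptions, not the paper's exact heuristic:

```python
# Dual-factor operator selection sketch: pick the reasoning operator
# maximizing estimated correctness minus a token-cost penalty.

def pick_operator(candidates, cost_weight=0.001):
    """candidates: list of (name, p_correct_estimate, expected_tokens)."""
    def score(c):
        _, p_correct, tokens = c
        return p_correct - cost_weight * tokens
    return max(candidates, key=score)[0]

ops = [
    ("expand_step", 0.80, 400),   # thorough but long
    ("shortcut",    0.78, 60),    # nearly as accurate, much cheaper
    ("verify",      0.50, 30),
]
print(pick_operator(ops))
```

With the cost term active, the search prefers the concise operator that loses almost no accuracy; with `cost_weight=0` it degenerates to pure accuracy-seeking, the behavior that produces the redundant steps the abstract criticizes.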
cs.CL / 36

How Much Would a Clinician Edit This Draft? Evaluating LLM Alignment for Patient Message Response Drafting

临床医生会对这份草稿做多少编辑?评估大型语言模型在患者消息回复草拟中的对齐性
Seegmiller, Parker, Gatto, Joseph, Greer, Sarah E., Isingizwe, Ganza Belise, Ray, Rohan, Burdick, Timothy E., Preum, Sarah Masud
Abstract
Large language models (LLMs) show promise in drafting responses to patient portal messages, yet their integration into clinical workflows raises various concerns, including whether they would actually save clinicians time and effort in their portal workload. We investigate LLM alignment with individual clinicians through a comprehensive evaluation of the patient message response drafting task. We develop a novel taxonomy of thematic elements in clinician responses and propose a novel evaluation framework for assessing clinician editing load of LLM-drafted responses at both content and theme levels. We release an expert-annotated dataset and conduct large-scale evaluations of local and commercial LLMs using various adaptation techniques including thematic prompting, retrieval-augmented generation, supervised fine-tuning, and direct preference optimization. Our results reveal substantial epistemic uncertainty in aligning LLM drafts with clinician responses. While LLMs demonstrate capability in drafting certain thematic elements, they struggle with clinician-aligned generation in other themes, particularly question asking to elicit further information from patients. Theme-driven adaptation strategies yield improvements across most themes. Our findings underscore the necessity of adapting LLMs to individual clinician preferences to enable reliable and responsible use in patient-clinician communication workflows.
Chinese Translation
大型语言模型(LLMs)在草拟患者门户消息回复方面展现出潜力,但其在临床工作流程中的整合引发了各种担忧,包括它们是否真的能够节省临床医生处理门户消息的时间和精力。我们通过对患者消息回复草拟任务的全面评估,研究LLM与个别临床医生的对齐性。我们开发了一种新的临床医生回复主题元素分类法,并提出了一种新的评估框架,用于在内容和主题两个层面上评估临床医生对LLM草拟回复的编辑负担。我们发布了一个专家标注的数据集,并使用多种适应技术(包括主题提示、检索增强生成、监督微调和直接偏好优化)对本地和商业LLM进行大规模评估。我们的结果揭示了在将LLM草稿与临床医生回复对齐时存在显著的认知不确定性。尽管LLM在草拟某些主题元素方面表现出能力,但在其他主题上,特别是通过提问以从患者处获取进一步信息方面,仍难以生成与临床医生对齐的内容。基于主题的适应策略在大多数主题上都带来了改进。我们的研究结果强调了根据个别临床医生的偏好调整LLM的必要性,以便在患者与临床医生的沟通工作流程中实现可靠且负责任的使用。
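As a rough illustration of the content-level "editing load" idea, one can measure how much of an LLM draft survives into the clinician's final message via a similarity ratio. The paper's framework is theme-aware and expert-annotated; this `difflib`-based proxy is only a hedged sketch, and the two messages are invented:

```python
import difflib

# Illustrative proxy for clinician editing load: 0.0 means the draft was
# sent verbatim, 1.0 means it was fully rewritten. This is a content-level
# simplification of the paper's content- and theme-level evaluation.

def editing_load(draft, final):
    """One minus the character-level similarity ratio of draft vs. final."""
    return 1.0 - difflib.SequenceMatcher(None, draft, final).ratio()

draft = "Please continue your current medication and follow up in two weeks."
final = "Please continue your current medication and call us if symptoms worsen."
load = editing_load(draft, final)
```

A per-theme version would compute this ratio separately for each thematic element (e.g. instructions vs. questions), which is where the paper finds the largest gaps.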
cs.CL / 37

Reward Modeling for Scientific Writing Evaluation

科学写作评估的奖励建模
Şahinuç, Furkan, Dutta, Subhabrata, Gurevych, Iryna
Abstract
Scientific writing is an expert-domain task that demands deep domain knowledge, task-specific requirements and reasoning capabilities that leverage the domain knowledge to satisfy the task specifications. While scientific text generation has been widely studied, its evaluation remains a challenging and open problem. It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks while adhering to their distinct requirements. However, existing LLM-based judges and reward models are primarily optimized for general-purpose benchmarks with fixed scoring rubrics and evaluation criteria. Consequently, they often fail to reason over sparse knowledge of scientific domains when interpreting task-dependent and multi-faceted criteria. Moreover, fine-tuning for each individual task is costly and impractical for low-resource settings. To bridge these gaps, we propose cost-efficient, open-source reward models tailored for scientific writing evaluation. We introduce a two-stage training framework that initially optimizes scientific evaluation preferences and then refines reasoning capabilities. Our multi-aspect evaluation design and joint training across diverse tasks enable fine-grained assessment and robustness to dynamic criteria and scoring rubrics. Experimental analysis shows that our training regime strongly improves LLM-based scientific writing evaluation. Our models generalize effectively across tasks and to previously unseen scientific writing evaluation settings, allowing a single trained evaluator to be reused without task-specific retraining.
Chinese Translation
科学写作是一项专家领域任务,需要深厚的领域知识、特定任务要求,以及利用领域知识来满足任务规范的推理能力。尽管科学文本生成已被广泛研究,但其评估仍然是一个具有挑战性的开放问题。开发能够可靠部署、用于评估多样化开放式科学写作任务并遵循各自独特要求的模型至关重要。然而,现有的基于大型语言模型(LLM)的评审模型和奖励模型主要针对具有固定评分细则和评估标准的通用基准进行优化。因此,在解释依赖于任务的多方面标准时,它们往往无法对科学领域的稀疏知识进行推理。此外,针对每个单独任务进行微调成本高昂,在资源有限的场景下并不现实。为弥补这些差距,我们提出了针对科学写作评估的高性价比开源奖励模型。我们引入了一个两阶段训练框架:首先优化科学评估偏好,然后细化推理能力。我们的多方面评估设计和跨多样任务的联合训练实现了细粒度评估,并增强了对动态标准和评分细则的鲁棒性。实验分析表明,我们的训练方案显著改善了基于LLM的科学写作评估。我们的模型可在任务之间有效泛化,并能推广到以前未见过的科学写作评估场景,从而允许单个训练好的评估器无需任务特定再训练即可重复使用。
cs.CL / 38

Evaluating LLM Behavior in Hiring: Implicit Weights, Fairness Across Groups, and Alignment with Human Preferences

招聘中大型语言模型行为的评估:隐含权重、群体公平性与人类偏好的对齐
Hoffmann, Morgane, Jouffroy, Emma, Jouanneau, Warren, Palyart, Marc, Pebereau, Charles
Abstract
General-purpose Large Language Models (LLMs) show significant potential in recruitment applications, where decisions require reasoning over unstructured text, balancing multiple criteria, and inferring fit and competence from indirect productivity signals. Yet, it is still uncertain how LLMs assign importance to each attribute and whether such assignments are in line with economic principles, recruiter preferences or broader societal norms. We propose a framework to evaluate an LLM's decision logic in recruitment, by drawing on established economic methodologies for analyzing human hiring behavior. We build synthetic datasets from real freelancer profiles and project descriptions from a major European online freelance marketplace and apply a full factorial design to estimate how a LLM weighs different match-relevant criteria when evaluating freelancer-project fit. We identify which attributes the LLM prioritizes and analyze how these weights vary across project contexts and demographic subgroups. Finally, we explain how a comparable experimental setup could be implemented with human recruiters to assess alignment between model and human decisions. Our findings reveal that the LLM weighs core productivity signals, such as skills and experience, but interprets certain features beyond their explicit matching value. While showing minimal average discrimination against minority groups, intersectional effects reveal that productivity signals carry different weights between demographic groups.
Chinese Translation
通用大型语言模型(LLMs)在招聘应用中展现出显著潜力,这类决策需要对非结构化文本进行推理,权衡多个标准,并从间接的生产力信号中推断适合度和能力。然而,目前尚不确定LLMs如何为每个属性分配重要性,以及这种分配是否符合经济学原则、招聘者偏好或更广泛的社会规范。我们借鉴分析人类招聘行为的既定经济学方法,提出了一个评估LLM招聘决策逻辑的框架。我们基于一个主要欧洲在线自由职业市场的真实自由职业者档案和项目描述构建合成数据集,并应用全因子设计来估计LLM在评估自由职业者与项目的匹配度时如何权衡不同的匹配相关标准。我们识别出LLM优先考虑的属性,并分析这些权重在不同项目背景和人口子群体之间的变化。最后,我们说明了如何对人类招聘者实施可比的实验设置,以评估模型决策与人类决策之间的一致性。我们的研究结果表明,LLM重视技能和经验等核心生产力信号,但对某些特征的解读超出了其显式匹配价值。尽管对少数群体的平均歧视极小,但交叉效应显示,生产力信号在不同人口群体之间的权重存在差异。
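The full-factorial estimation of implicit weights can be sketched as follows: score every combination of binary attributes, then fit a linear model to recover each attribute's weight. The attribute names, the stand-in scoring rule, and the hidden weights are all invented for illustration; in the paper the scores would come from querying the LLM on constructed profiles:

```python
import numpy as np
from itertools import product

# Sketch: recover implicit attribute weights from a 2^k full factorial design.
attrs = ["skills_match", "experience", "rating"]
X = np.array(list(product([0, 1], repeat=len(attrs))), dtype=float)

def llm_score(row):                        # stand-in for querying the LLM
    true_w = np.array([0.5, 0.3, 0.2])     # hidden weights to recover
    return float(row @ true_w)

y = np.array([llm_score(r) for r in X])
design = np.hstack([np.ones((len(X), 1)), X])   # intercept + attributes
weights, *_ = np.linalg.lstsq(design, y, rcond=None)
```

Because the toy scores are exactly linear, least squares recovers the hidden weights; with real LLM scores the fitted coefficients are the estimated implicit weights, which can then be compared across demographic subgroups.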
cs.CL / 39

Relational Linearity is a Predictor of Hallucinations

关系线性是幻觉的预测因子
Lu, Yuetian, Liu, Yihong, Schütze, Hinrich
Abstract
Hallucination is a central failure mode in large language models (LLMs). We focus on hallucinations of answers to questions like: "Which instrument did Glenn Gould play?", but we ask these questions for synthetic entities that are unknown to the model. Surprisingly, we find that medium-size models like Gemma-7B-IT frequently hallucinate, i.e., they have difficulty recognizing that the hallucinated fact is not part of their knowledge. We hypothesize that an important factor in causing these hallucinations is the linearity of the relation: linear relations tend to be stored more abstractly, making it difficult for the LLM to assess its knowledge; the facts of nonlinear relations tend to be stored more directly, making knowledge assessment easier. To investigate this hypothesis, we create SyntHal, a dataset of 6000 synthetic entities for six relations. In our experiments with four models, we determine, for each relation, the hallucination rate on SyntHal and also measure its linearity, using $\Delta\cos$. We find a strong correlation ($r \in [.78,.82]$) between relational linearity and hallucination rate, providing evidence for our hypothesis that the underlying storage of triples of a relation is a factor in how well a model can self-assess its knowledge. This finding has implications for how to manage hallucination behavior and suggests new research directions for improving the representation of factual knowledge in LLMs.
Chinese Translation
幻觉是大型语言模型(LLMs)中的一种主要失效模式。我们关注对诸如“格伦·古尔德演奏了哪种乐器?”等问题的幻觉,但我们针对模型未知的合成实体提出这些问题。令人惊讶的是,我们发现像Gemma-7B-IT这样的中型模型经常出现幻觉,即它们难以识别幻觉事实并不属于其知识的一部分。我们假设,导致这些幻觉的重要因素是关系的线性:线性关系往往以更抽象的方式存储,使得LLM难以评估其知识;而非线性关系的事实则往往以更直接的方式存储,使知识评估变得更容易。为了验证这一假设,我们创建了SyntHal,一个涵盖六种关系、包含6000个合成实体的数据集。在对四个模型的实验中,我们确定了每种关系在SyntHal上的幻觉率,并使用$\Delta\cos$测量其线性。我们发现关系线性与幻觉率之间存在强相关性($r \in [.78,.82]$),为我们的假设提供了证据,即关系三元组的底层存储方式是影响模型自我评估其知识能力的一个因素。该发现对如何管理幻觉行为具有重要意义,并为改善LLMs中事实知识的表示提出了新的研究方向。
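The headline result is a rank correlation between per-relation linearity ($\Delta\cos$) and hallucination rate. A minimal sketch of that analysis, with entirely made-up per-relation numbers (the paper reports $r \in [.78,.82]$ over six relations):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors.
    Assumes no ties, which argsort-of-argsort ranking requires."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Illustrative per-relation values (six relations, as in SyntHal):
linearity   = np.array([0.15, 0.30, 0.45, 0.55, 0.70, 0.85])  # Δcos
halluc_rate = np.array([0.10, 0.22, 0.40, 0.35, 0.62, 0.80])

rho = spearman(linearity, halluc_rate)
```

A high `rho` on real measurements is what supports the hypothesis that more linearly stored relations hallucinate more; `scipy.stats.spearmanr` would give the same number with tie handling.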
cs.CL / 40

The unreasonable effectiveness of pattern matching

模式匹配的不合理有效性
Lupyan, Gary, Arcas, Blaise Agüera y
Abstract
We report on an astonishing ability of large language models (LLMs) to make sense of "Jabberwocky" language in which most or all content words have been randomly replaced by nonsense strings, e.g., translating "He dwushed a ghanc zawk" to "He dragged a spare chair". This result addresses ongoing controversies regarding how to best think of what LLMs are doing: are they a language mimic, a database, a blurry version of the Web? The ability of LLMs to recover meaning from structural patterns speaks to the unreasonable effectiveness of pattern-matching. Pattern-matching is not an alternative to "real" intelligence, but rather a key ingredient.
Chinese Translation
我们报告了大型语言模型(LLMs)在理解“Jabberwocky”语言方面的惊人能力:其中大多数或所有内容词被随机替换为无意义的字符串,例如,将“He dwushed a ghanc zawk”翻译为“He dragged a spare chair”。这一结果回应了关于如何最好地理解LLMs所做之事的持续争议:它们是语言模仿者、数据库,还是网络的一个模糊版本?LLMs从结构模式中恢复意义的能力体现了模式匹配的不合理有效性。模式匹配并不是“真实”智能的替代品,而是其关键成分之一。
cs.CL / 41

Hierarchical Orthogonal Residual Spread for Precise Massive Editing in Large Language Models

层次正交残差传播用于大型语言模型的精确大规模编辑
Gu, Xiaojie, Chen, Guangxu, Yang, Yuheng, Han, Jingxin, Zhang, Andi
Abstract
Large language models (LLMs) exhibit exceptional performance across various domains, yet they face critical safety concerns. Model editing has emerged as an effective approach to mitigate these issues. Existing model editing methods often focus on optimizing an information matrix that blends new and old knowledge. While effective, these approaches can be computationally expensive and may cause conflicts. In contrast, we shift our attention to Hierarchical Orthogonal Residual SprEad of the information matrix, which reduces noisy gradients and enables more stable edits from a different perspective. We demonstrate the effectiveness of our method HORSE through a clear theoretical comparison with several popular methods and extensive experiments conducted on two datasets across multiple LLMs. The results show that HORSE maintains precise massive editing across diverse scenarios. The code is available at https://github.com/XiaojieGu/HORSE
Chinese Translation
大型语言模型(LLMs)在多个领域表现出色,但面临着严重的安全隐患。模型编辑已成为缓解这些问题的有效方法。现有的模型编辑方法通常侧重于优化一个融合新旧知识的信息矩阵。尽管有效,这些方法可能计算成本高且可能导致冲突。相较之下,我们将注意力转向信息矩阵的层次正交残差传播(Hierarchical Orthogonal Residual Spread),该方法减少了噪声梯度,并从不同的角度实现了更稳定的编辑。我们通过与几种流行方法的明确理论比较以及在多个大型语言模型上的两个数据集上进行的广泛实验,展示了我们的方法HORSE的有效性。结果表明,HORSE在多种场景下保持了精确的大规模编辑。代码可在 https://github.com/XiaojieGu/HORSE 获取。
cs.CL / 42

Predict the Retrieval! Test time adaptation for Retrieval Augmented Generation

预测检索!检索增强生成的测试时适应
Sun, Xin, Chen, Zhongqi, Liu, Qiang, Wu, Shu, Song, Bowen, Wang, Weiqiang, Wang, Zilei, Wang, Liang
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for enhancing large language models' question-answering capabilities through the integration of external knowledge. However, when adapting RAG systems to specialized domains, challenges arise from distribution shifts, resulting in suboptimal generalization performance. In this work, we propose TTARAG, a test-time adaptation method that dynamically updates the language model's parameters during inference to improve RAG system performance in specialized domains. Our method introduces a simple yet effective approach where the model learns to predict retrieved content, enabling automatic parameter adjustment to the target domain. Through extensive experiments across six specialized domains, we demonstrate that TTARAG achieves substantial performance improvements over baseline RAG systems. Code available at https://github.com/sunxin000/TTARAG.
Chinese Translation
检索增强生成(Retrieval-Augmented Generation, RAG)是一种通过整合外部知识来增强大型语言模型问答能力的强大方法。然而,在将 RAG 系统适应于专业领域时,分布偏移带来了挑战,导致泛化性能欠佳。在本研究中,我们提出了 TTARAG,一种测试时适应方法,在推理过程中动态更新语言模型的参数,以提高 RAG 系统在专业领域的性能。我们的方法引入了一种简单而有效的思路:让模型学习预测检索到的内容,从而实现面向目标领域的自动参数调整。通过在六个专业领域的广泛实验,我们证明 TTARAG 相对基线 RAG 系统实现了显著的性能提升。代码可在 https://github.com/sunxin000/TTARAG 获取。
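The test-time-adaptation loop can be illustrated with a toy model: before answering, take a few gradient steps on an auxiliary "predict the retrieved content" loss. Here a single scalar parameter stands in for the LLM, so everything below is a conceptual sketch under that assumption, not the paper's implementation:

```python
# Toy test-time adaptation in the spirit of TTARAG: adapt parameters at
# inference by minimizing an auxiliary loss on the retrieved content.

def aux_loss(theta, retrieved_signal):
    """Squared error between the model's prediction and the retrieval signal."""
    return (theta - retrieved_signal) ** 2

def adapt(theta, retrieved_signal, lr=0.1, steps=20):
    """A few plain gradient-descent steps on the auxiliary loss."""
    for _ in range(steps):
        grad = 2.0 * (theta - retrieved_signal)  # d(aux_loss)/d(theta)
        theta -= lr * grad
    return theta

theta0 = 0.0                                  # "pretrained" parameter
theta_adapted = adapt(theta0, retrieved_signal=1.0)
```

The adapted parameter moves toward the retrieval signal, which is the mechanism by which the model shifts toward the target domain's distribution before generating its answer.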
cs.CL / 43

CTest-Metric: A Unified Framework to Assess Clinical Validity of Metrics for CT Report Generation

CTest-Metric:评估CT报告生成指标临床有效性的统一框架
Sharma, Vanshali, Bejar, Andrea Mia, Durak, Gorkem, Bagci, Ulas
Abstract
In the generative AI era, where even critical medical tasks are increasingly automated, radiology report generation (RRG) continues to rely on suboptimal metrics for quality assessment. Developing domain-specific metrics has therefore been an active area of research, yet it remains challenging due to the lack of a unified, well-defined framework to assess their robustness and applicability in clinical contexts. To address this, we present CTest-Metric, a first unified metric assessment framework with three modules determining the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph, RaTEScore, GREEN Score, CRG) are studied across seven LLMs built on a CT-CLIP encoder. Using our novel framework, we found that lexical NLG metrics are highly sensitive to stylistic variations; GREEN Score aligns best with expert judgments (Spearman~0.70), while CRG shows negative correlation; and BERTScore-F1 is least sensitive to factual error injection. We will release the framework, code, and allowable portion of the anonymized evaluation data (rephrased/error-injected CT reports), to facilitate reproducible benchmarking and future metric development.
Chinese Translation
在生成式人工智能时代,即使是关键的医疗任务也日益自动化,放射学报告生成(RRG)仍然依赖于次优的质量评估指标。因此,开发特定领域的指标一直是一个活跃的研究领域,但由于缺乏统一且明确定义的框架来评估其在临床环境中的稳健性和适用性,这一过程仍然充满挑战。为了解决这一问题,我们提出了CTest-Metric,这是第一个统一的指标评估框架,包含三个模块,用于确定CT RRG指标的临床可行性。这些模块测试:(i)通过基于大型语言模型(LLM)的重述评估写作风格的可推广性(WSG);(ii)在不同严重程度下进行合成错误注入(SEI);以及(iii)使用临床医生对175个“分歧”案例的评分来评估指标与专家的相关性(MvE)。我们研究了八个广泛使用的指标(BLEU、ROUGE、METEOR、BERTScore-F1、F1-RadGraph、RaTEScore、GREEN Score、CRG),并在基于CT-CLIP编码器构建的七个LLM上进行了测试。使用我们的新框架,我们发现词汇类自然语言生成(NLG)指标对风格变化高度敏感;GREEN Score与专家判断的对齐度最佳(Spearman~0.70),而CRG则显示出负相关;BERTScore-F1对事实错误注入的敏感性最低。我们将发布该框架、代码以及允许公开的匿名评估数据部分(重述/错误注入的CT报告),以促进可重复的基准测试和未来指标的开发。
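The Synthetic Error Injection (SEI) module, graded by severity, can be sketched as flipping a controlled number of clinical terms to contradictory ones. The term pairs, the sample report, and the word-level flipping rule are illustrative assumptions, not the paper's actual perturbation set:

```python
import random

# Hypothetical graded error injection: `severity` controls how many clinical
# terms are flipped to a contradictory finding. A robust metric's score
# should degrade as severity increases.
FLIPS = {"no": "severe", "unremarkable": "abnormal", "stable": "enlarged"}

def inject_errors(report, severity, seed=0):
    rng = random.Random(seed)                 # deterministic for a fixed seed
    words = report.split()
    positions = [i for i, w in enumerate(words) if w in FLIPS]
    for i in rng.sample(positions, min(severity, len(positions))):
        words[i] = FLIPS[words[i]]
    return " ".join(words)

report = "Lungs unremarkable with no effusion and stable cardiac size"
mild = inject_errors(report, severity=1)
```

Running a candidate metric on (original, severity-1, severity-2, ...) pairs and checking that its score falls monotonically is the kind of sensitivity test the SEI module formalizes.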
cs.CL / 44

Do explanations generalize across large reasoning models?

解释在大型推理模型中是否具有普遍性?
Pal, Koyena, Bau, David, Singh, Chandan
Abstract
Large reasoning models (LRMs) produce a textual chain of thought (CoT) in the process of solving a problem, which serves as a potentially powerful tool to understand the problem by surfacing a human-readable, natural-language explanation. However, it is unclear whether these explanations generalize, i.e. whether they capture general patterns about the underlying problem rather than patterns which are esoteric to the LRM. This is a crucial question in understanding or discovering new concepts, e.g. in AI for science. We study this generalization question by evaluating a specific notion of generalizability: whether explanations produced by one LRM induce the same behavior when given to other LRMs. We find that CoT explanations often exhibit this form of generalization (i.e. they increase consistency between LRMs) and that this increased generalization is correlated with human preference rankings and post-training with reinforcement learning. We further analyze the conditions under which explanations yield consistent answers and propose a straightforward, sentence-level ensembling strategy that improves consistency. Taken together, these results prescribe caution when using LRM explanations to yield new insights and outline a framework for characterizing LRM explanation generalization.
Chinese Translation
大型推理模型(LRMs)在解决问题的过程中会生成文本形式的思维链(CoT),通过呈现人类可读的自然语言解释,成为理解问题的潜在强大工具。然而,目前尚不清楚这些解释是否具有普遍性,即它们捕捉到的是关于基础问题的一般模式,还是仅为该LRM所特有的模式。这是理解或发现新概念(例如用于科学研究的人工智能)中的一个关键问题。我们通过评估一种特定的普遍性概念来研究这一问题:一个LRM生成的解释在提供给其他LRM时,是否会诱导出相同的行为。我们发现,CoT解释通常表现出这种形式的普遍性(即它们增加了LRM之间的一致性),且这种普遍性的提升与人类偏好排名以及基于强化学习的后训练相关。我们进一步分析了解释产生一致答案的条件,并提出了一种简单的句子级集成策略来提高一致性。综合来看,这些结果提示在利用LRM解释获取新见解时应保持谨慎,并勾勒出一个用于表征LRM解释普遍性的框架。
cs.CL / 45

How Long Is a Piece of String? A Brief Empirical Analysis of Tokenizers

一根绳子的长度是多少?对分词器的简要实证分析
Roberts, Jonathan, Han, Kai, Albanie, Samuel
Abstract
Frontier LLMs are increasingly utilised across academia, society and industry. A commonly used unit for comparing models, their inputs and outputs, and estimating inference pricing is the token. In general, tokens are used as a stable currency, assumed to be broadly consistent across tokenizers and contexts, enabling direct comparisons. However, tokenization varies significantly across models and domains of text, making naive interpretation of token counts problematic. We quantify this variation by providing a comprehensive empirical analysis of tokenization, exploring the compression of sequences to tokens across different distributions of textual data. Our analysis challenges commonly held heuristics about token lengths, finding them to be overly simplistic. We hope the insights of our study add clarity and intuition toward tokenization in contemporary LLMs.
Chinese Translation
前沿的大型语言模型(LLMs)在学术界、社会和工业中越来越多地被应用。用于比较模型、其输入和输出,以及估算推理定价的一个常用单位是“token”(标记)。一般而言,token被视为一种稳定的货币,假设在不同的分词器和上下文中大致一致,从而实现直接比较。然而,token化在不同模型和文本领域之间存在显著差异,使得对token计数的简单解释变得问题重重。我们通过提供全面的实证分析来量化这种变化,探讨在不同文本数据分布下序列到token的压缩。我们的分析挑战了关于token长度的普遍认知,发现这些认知过于简单化。我们希望本研究的见解能够为当代大型语言模型中的token化提供更清晰的理解和直觉。
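The "compression" the abstract measures is essentially characters per token across text distributions. A minimal sketch, using whitespace splitting as a stand-in tokenizer (an assumption; real BPE tokenizers such as those behind frontier LLMs vary far more across domains):

```python
# Minimal sketch: characters-per-token as a compression measure across
# text domains. Swap `tokenize` for a real tokenizer to reproduce the
# kind of cross-domain variation the paper quantifies.

def chars_per_token(text, tokenize=str.split):
    tokens = tokenize(text)
    return len(text) / max(len(tokens), 1)

samples = {
    "prose": "the quick brown fox jumps over the lazy dog",
    "code":  "def f(x): return x * x  # square",
}
ratios = {name: chars_per_token(t) for name, t in samples.items()}
```

Even this toy shows the ratio differing by domain, which is why treating token counts as a stable cross-model, cross-domain currency is unreliable.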